BeautifulSoup parser does not split correctly on tags

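I'm scraping a website and then trying to split the text into paragraphs. Looking at the scraped text, I can clearly see that some paragraph breaks are not being split correctly. See the code below to reproduce the problem!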

from bs4 import BeautifulSoup
import requests

link = "http://www.presidency.ucsb.edu/ws/index.php?pid=111395"
response = requests.get(link)
soup = BeautifulSoup(response.content, 'html.parser')
paras = soup.find_all('p')
# Note that in printing the below, there are still a lot of "<p>" in that paragraph :(
print(paras[614])

I have tried other parsers, with similar results.


2016-07-23 Craig

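This is by design. It happens because the page opens new paragraphs without closing the previous ones, so the parser treats them as nested paragraphs, for example: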

<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey

I would use this small hack to work around it:

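# Close each paragraph before the next one opens, so no <p> ends up nested in the soup.
# response.text is a str; on Python 3, response.content is bytes and would need b'<p>' arguments.
html = response.text.replace('<p>', '</p><p>')

# then your code, parsing html instead of response.content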
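
To make the effect of the hack concrete, here is a minimal sketch that applies it to the two-paragraph fragment quoted above rather than to the live page. The helper names `close_open_paragraphs` and `paragraph_texts` are made up for this illustration, and the sketch assumes the page writes its tags exactly as lowercase `<p>` with no attributes, as in the fragment.

```python
SAMPLE = "<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey"


def close_open_paragraphs(html: str) -> str:
    """Insert a closing </p> before every <p> so paragraphs cannot nest.

    Hypothetical helper: it only matches the literal, lowercase "<p>".
    """
    return html.replace("<p>", "</p><p>")


def paragraph_texts(html: str, parser: str = "html.parser") -> list[str]:
    """Return the stripped text of every <p> element found by BeautifulSoup."""
    # Imported here so the string helper above works without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, parser)
    return [p.get_text().strip() for p in soup.find_all("p")]


if __name__ == "__main__":
    # Without the hack, the first <p> also swallows the text of the second one.
    print(paragraph_texts(SAMPLE))
    # With the hack, each paragraph holds only its own text.
    print(paragraph_texts(close_open_paragraphs(SAMPLE)))
```

The stray `</p>` that ends up at the very start of the string has no open paragraph to close, and `html.parser` ignores it. Other parsers may treat it differently.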


2016-07-24 01:57:05 Bob

Have you tried the lxml parser? I had a similar problem and lxml solved it for me.

import lxml 
... 
soup = BeautifulSoup(response.text, "lxml")

Also, instead of response.content, try response.text to get a Unicode object.
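
One way to check whether a given parser helps is to count how many paragraphs still contain another paragraph after parsing. The following is only a sketch: `nested_paragraph_count` is a hypothetical helper, and the fragment is the one quoted in the other answer, not the full page, so results on the real site may differ.

```python
SAMPLE = "<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey"


def nested_paragraph_count(html, parser):
    """Count <p> elements that contain another <p> when parsed with `parser`.

    Returns None if the requested parser is not installed.
    """
    # Imported lazily so the module can be loaded without bs4 present.
    from bs4 import BeautifulSoup, FeatureNotFound

    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        return None
    return sum(1 for p in soup.find_all("p") if p.find("p") is not None)


if __name__ == "__main__":
    # "lxml" and "html5lib" are optional extras and may not be installed.
    for name in ("html.parser", "lxml", "html5lib"):
        print(name, nested_paragraph_count(SAMPLE, name))
```

A count of zero for a parser means it closed the open paragraph itself. A non-zero count means the paragraphs are still nested and need to be separated some other way.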


2016-07-24 01:50:28

Unfortunately that didn't work (neither lxml nor using response.text). Thanks for the suggestion! – Craig
