Question... BeautifulSoup parsing

<h2 class="sectionTitle">BACKGROUND</h2> 
Mr. Paul J. Fribourg has bla bla</span> 
<div style="margin-top:8px;"> 
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a> 
</div>

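I want to extract the text about Mr. Paul (the "bla bla" part). Some pages have a <p> tag in front of "Mr. Paul", so I can use findNext('p'). However, some pages have no <p> tag, as in the example above.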

This is what I have so far:

background = bs2.find(text=re.compile("BACKGROUND")) 
bb= background.findNext('p').contents

But how can I extract the text when there is no <p> tag?

Source

2011-08-27 Willy

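It is hard to tell from the example you have given us, but it looks to me as though you can simply take the next node after the h2. In this example, Lewis Carroll has paragraph tags and your friend Paul only has a closing span tag: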

>>> from BeautifulSoup import BeautifulSoup 
>>> 
>>> html = ''' 
... <h2 class="sectionTitle">BACKGROUND</h2> 
... <p>Mr. Lewis Carroll has bla bla</p> 
... <div style="margin-top:8px;"> 
...  <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a> 
... </div> 
... <h2 class="sectionTitle">BACKGROUND</h2> 
... Mr. Paul J. Fribourg has bla bla</span> 
... <div style="margin-top:8px;"> 
...  <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a> 
... </div> 
... ''' 
>>> 
>>> soup = BeautifulSoup(html) 
>>> headings = soup.findAll('h2', text='BACKGROUND') 
>>> for section in headings: 
...  p = section.findNext('p') 
...  if p: 
...   print '> ', p.string 
...  else: 
...   print '> ', section.parent.next.next.strip() 
... 
> Mr. Lewis Carroll has bla bla 
> Mr. Paul J. Fribourg has bla bla

Following up on the comments:

>>> from BeautifulSoup import BeautifulSoup 
>>> from urllib2 import urlopen 
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP') 
>>> soup = BeautifulSoup(html) 
>>> headings = soup.findAll('h2', text='BACKGROUND') 
>>> for section in headings: 
...  paragraph = section.findNext('p') 
...  if paragraph and paragraph.string: 
...   print '> ', paragraph.string 
...  else: 
...   print '> ', section.parent.next.next.strip() 
... 
> Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]
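
For readers on Python 3, the same idea can be sketched with the newer bs4 package, which the answer above does not use. This is a minimal sketch under assumptions: the function name background_texts and the sample HTML constant are made up for illustration, and the code assumes the text of interest is a sibling of the h2 heading, either wrapped in a tag or as bare text. Instead of testing for a 'p' tag, it takes the first sibling after each heading that has any visible text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup taken from the question: one section with <p>, one without.
HTML = """
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;">
  <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
  <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
"""


def background_texts(html):
    """Return the first non-empty text following each BACKGROUND heading.

    Hypothetical helper: assumes the wanted text is a sibling of the <h2>.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for heading in soup.find_all("h2"):
        if heading.get_text(strip=True) != "BACKGROUND":
            continue
        for sibling in heading.next_siblings:
            # A sibling is either a tag (<p>, <div>, ...) or a bare string.
            text = sibling.get_text() if isinstance(sibling, Tag) else str(sibling)
            text = text.strip()
            if text:  # skip whitespace-only nodes between elements
                results.append(text)
                break
    return results


if __name__ == "__main__":
    for text in background_texts(HTML):
        print(">", text)
```

Because the loop stops at the first sibling with text, it never reaches the "Read Full Background" link when the biography comes first. If a real page nests the heading and the text differently, the sibling walk would need adjusting.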

Source

2011-08-28 00:37:30 Johnsyweb

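Thanks for the kind answer! Actually, there is no <p> before Mr. Paul.. so if I run your code, it shows Read Full Background.... Would you mind letting me know how to solve this? – Willy

@Willy: My original answer was based on an edit to your question in which someone had apparently added '<p>' tags. I have edited my answer accordingly. – Johnsyweb

Oh, thank you so much! It works well.. but on my original website it doesn't work.. :(( I want to cry.. – Willy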

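"Some pages have a <p> tag in front of 'Mr. Paul', so I can use findNext('p'). However, some pages have no <p> tag, as in the example above."

You have not given enough information to tell how your string can be identified:

A fixed node structure, e.g. getChildren()[1].getChildren()[0].text
Preceded by the magic string 'BACKGROUND', as in your code. In that case your approach of finding the next node looks fine; just don't build in the assumption that the tag's name is 'p'
A regular expression (e.g. "(Mr|Ms) ...")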
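
The regular-expression option could look roughly like the sketch below. It uses only the standard re module and is an illustration, not a tested solution for the real site: the function name find_bio, the list of honorifics, and the rule "match up to the next '<'" are all assumptions.

```python
import re

# Hypothetical pattern: an honorific, then everything up to the next tag.
HONORIFIC = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s[^<]+")


def find_bio(html):
    """Return the first honorific-led text after 'BACKGROUND', or None."""
    # Only look at what comes after the magic string.
    _, found, rest = html.partition("BACKGROUND")
    if not found:
        return None
    match = HONORIFIC.search(rest)
    return match.group(0).strip() if match else None


if __name__ == "__main__":
    sample = (
        '<h2 class="sectionTitle">BACKGROUND</h2>\n'
        "Mr. Paul J. Fribourg has bla bla</span>\n"
        '<div style="margin-top:8px;">...</div>'
    )
    print(find_bio(sample))
```

This works the same whether or not the text is wrapped in a tag, but it stops at the first "<", so a biography containing inline markup would be cut short.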

Can you show us an example of the HTML when there is no <p> tag in front of the name?

Source

2011-08-28 00:09:09 smci

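Thanks for the good comment! I think your second point is right.. the string BACKGROUND may be the magic string.. but I have been trying to work out a way to extract the text after that word.. I don't know.. it's not working.. – Willy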
