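It is hard to tell from the example you gave us, but it looks to me like you can simply take the next node after the h2. In this example, Lewis Carroll has paragraph tags, while your friend Paul has only a closing span tag: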
>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... <h2 class="sectionTitle">BACKGROUND</h2>
... <p>Mr. Lewis Carroll has bla bla</p>
... <div style="margin-top:8px;">
... <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... <h2 class="sectionTitle">BACKGROUND</h2>
... Mr. Paul J. Fribourg has bla bla</span>
... <div style="margin-top:8px;">
... <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ', p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
> Mr. Lewis Carroll has bla bla
> Mr. Paul J. Fribourg has bla bla
Following the comments, here is a version that runs against the actual page:
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
> Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]
You may, of course, want to check the site's copyright notices, et cetera...
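For anyone on Python 3, the same idea can be sketched with the newer `bs4` package. This is only a rough equivalent, not a drop-in replacement: it assumes `bs4` is installed, uses the built-in `html.parser`, and the helper names `text_after_heading` and `backgrounds` are made up for illustration. It walks the siblings after each heading and takes either the bare text or the contents of a `<p>`, whichever comes first.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Same sample markup as above: one heading followed by a <p>,
# one followed by bare text and a stray </span>.
HTML = """
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;">
<a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
<a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
"""


def text_after_heading(heading):
    """Return the first non-empty text following `heading`, or None.

    Hypothetical helper: accepts either a bare text node or a <p> tag,
    and gives up at any other tag (such as the "Read Full Background" div).
    """
    for node in heading.next_siblings:
        if isinstance(node, NavigableString):
            text = node.strip()
        elif isinstance(node, Tag) and node.name == "p":
            # get_text() also copes with paragraphs that contain nested tags
            text = node.get_text(strip=True)
        else:
            return None
        if text:
            return text
    return None


def backgrounds(html):
    """Collect the text following every <h2> whose string is BACKGROUND."""
    soup = BeautifulSoup(html, "html.parser")
    headings = soup.find_all("h2", string="BACKGROUND")
    return [text_after_heading(h) for h in headings]


if __name__ == "__main__":
    for text in backgrounds(HTML):
        print(">", text)
```

Stopping at the first unrelated tag is a deliberate choice here: it avoids picking up the "Read Full Background" link text when a heading has no background text of its own. Whether that matches the real page's markup is something you would have to check.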
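Thanks for the kind answer! Actually, there is no opening tag before Mr. Paul.. so if I run your code, it shows "Read Full Background".... Would you mind letting me know how to solve this? – Willy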
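@Willy: My original answer was based on what was evidently an edit to your question, in which someone had added tags. I have edited my answer accordingly. – Johnsyweb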
Oh, thank you so much! It works great.. but on my original site it doesn't work.. :(( I want to cry.. – Willy