2015-11-03 624 views

I'm confused about how to use a BeautifulSoup ResultSet object (bs4.element.ResultSet) to extract the strings inside HTML tags.

After calling find_all(), how do I extract the text?

Example:

In the bs4 documentation, the HTML document html_doc looks like:

<p class="story"> 
    Once upon a time there were three little sisters; and their names were 
    <a class="sister" href="http://example.com/elsie" id="link1"> 
    Elsie 
    </a> 
    , 
    <a class="sister" href="http://example.com/lacie" id="link2"> 
    Lacie 
    </a> 
    and 
    <a class="sister" href="http://example.com/tillie" id="link3"> 
    Tillie 
    </a> 
    ; and they lived at the bottom of a well. 
    </p> 

One starts by creating the soup and finding all the <a> tags:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(html_doc, 'html.parser') 
soup.find_all('a') 

which outputs

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] 

We can also do

for link in soup.find_all('a'): 
    print(link.get('href')) 

which outputs

http://example.com/elsie 
http://example.com/lacie 
http://example.com/tillie 

I want to get the text from the tags with class_="sister", i.e.

Elsie 
Lacie 
Tillie 

One could try

for link in soup.find_all('a'): 
    print(link.get_text()) 

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text' 

Answer


Filter on class_='sister' in find_all().

Note: notice the underscore after class. This is a special case because class is a reserved word in Python.

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_ :

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
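As a rough illustration of the filtering described above, the sketch below compares the class_ keyword with two other ways of matching a CSS class that bs4 provides (the attrs dictionary and a CSS selector). The shortened html_doc and the variable names are made up for this example and are not part of the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical document: one matching tag and one non-matching tag.
html_doc = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    ' and '
    '<a class="other" href="http://example.com/other" id="link2">Other</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# Keyword argument with the trailing underscore (class alone is a syntax error).
by_keyword = soup.find_all('a', class_='sister')

# Same filter expressed through the attrs dictionary.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# Same filter expressed as a CSS selector.
by_selector = soup.select('a.sister')

for result in (by_keyword, by_attrs, by_selector):
    print([tag['id'] for tag in result])
```

All three calls should select only the tag whose class is sister; which form to use is mostly a matter of taste.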

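Once you have all the tags with class sister, call .text on each tag (not on the ResultSet itself) to get its text. Be sure to strip the surrounding whitespace.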

For example:

from bs4 import BeautifulSoup 

html_doc = '''<p class="story"> 
    Once upon a time there were three little sisters; and their names were 
    <a class="sister" href="http://example.com/elsie" id="link1"> 
    Elsie 
    </a> 
    , 
    <a class="sister" href="http://example.com/lacie" id="link2"> 
    Lacie 
    </a> 
    and 
    <a class="sister" href="http://example.com/tillie" id="link3"> 
    Tillie 
    </a> 
    ; and they lived at the bottom of a well. 
    </p>''' 

soup = BeautifulSoup(html_doc, 'html.parser') 
sistertags = soup.find_all(class_='sister') 
for tag in sistertags: 
    print(tag.text.strip()) 

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py 
Elsie 
Lacie 
Tillie 
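
For what it's worth, the AttributeError in the question appears when get_text() (or .text) is called on the ResultSet returned by find_all() rather than on the individual tags inside it. A minimal sketch of the difference, using a shortened copy of the document; the names sister_tags, raised, and names are chosen only for this illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of the document, keeping the whitespace around the names.
html_doc = '''<p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
    </a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
sister_tags = soup.find_all(class_='sister')

# A ResultSet behaves like a list of tags; it has no get_text() of its own.
try:
    sister_tags.get_text()
    raised = False
except AttributeError:
    raised = True

# Calling get_text() on each tag works; strip=True trims the whitespace.
names = [tag.get_text(strip=True) for tag in sister_tags]
print(raised, names)
```

Here get_text(strip=True) is used in place of .text.strip(); for tags that contain a single string, as in this document, the two give the same result.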

Works perfectly, thank you. I was confused because "sistertags.text" was throwing an error. – ShanZhengYang