BeautifulSoup, extracting strings within HTML tags, ResultSet objects

I am confused about how to use a BeautifulSoup ResultSet object, i.e. bs4.element.ResultSet.

After using find_all(), how does one extract the text?

Example:

In the bs4 documentation, the HTML document html_doc looks like:

<p class="story"> 
    Once upon a time there were three little sisters; and their names were 
    <a class="sister" href="http://example.com/elsie" id="link1"> 
    Elsie 
    </a> 
    , 
    <a class="sister" href="http://example.com/lacie" id="link2"> 
    Lacie 
    </a> 
    and 
    <a class="sister" href="http://example.com/tillie" id="link3"> 
    Tillie 
    </a> 
    ; and they lived at the bottom of a well. 
    </p>

One starts by creating the soup and finding all the <a> tags:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(html_doc, 'html.parser') 
soup.find_all('a')

which outputs

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can also do

for link in soup.find_all('a'): 
    print(link.get('href'))

which outputs

http://example.com/elsie 
http://example.com/lacie 
http://example.com/tillie

I want to get only the text of the class="sister" tags, i.e.

Elsie 
Lacie 
Tillie

One might try

for link in soup.find_all('a'): 
    print(link.get_text())

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text'
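
The loop shown before the error calls get_text() on each Tag, which is valid. The quoted AttributeError is what one gets when get_text() (or .text) is called on the value returned by find_all() itself. A minimal sketch of the difference, assuming a shortened stand-in for html_doc (the names links, raised and names are illustrative, not from the original post):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the html_doc in the question (an assumption for brevity).
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a>,
<a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and
<a class="sister" href="http://example.com/tillie" id="link3"> Tillie </a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all() returns a ResultSet, which behaves like a list of Tag objects.
links = soup.find_all('a')

# Calling a Tag method on the whole ResultSet raises AttributeError.
try:
    links.get_text()
    raised = False
except AttributeError as exc:
    raised = True
    print('AttributeError:', exc)

# Calling the same method on each element of the ResultSet works.
names = [link.get_text().strip() for link in links]
print(names)
```

In other words, a ResultSet has to be iterated or indexed before any Tag method or attribute is used.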

Asked 2015-11-03 by ShanZhengYang

Filter on class_='sister' in find_all().

Note: notice the underscore after class. This is a special case because class is a reserved word in Python.

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_ :

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
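
As a side note, the class_ keyword is not the only way to filter by CSS class. The sketch below, which assumes a small hypothetical document with one extra non-sister link added for contrast, shows three filters that should select the same tags: the class_ keyword, an attrs dictionary, and a CSS selector via select().

```python
from bs4 import BeautifulSoup

# Hypothetical document: one "sister" link plus one unrelated link for contrast.
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="other" href="http://example.com/bob" id="link4">Bob</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# 1. Keyword argument with the trailing underscore.
by_keyword = soup.find_all('a', class_='sister')

# 2. Attribute dictionary; "class" is an ordinary string key here, so no underscore is needed.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# 3. CSS selector syntax.
by_selector = soup.select('a.sister')

for result in (by_keyword, by_attrs, by_selector):
    print([tag['id'] for tag in result])
```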

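Once you have all the tags with class sister, call .text on each of them to get the text. Be sure to strip the surrounding whitespace from the text.

For example: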

from bs4 import BeautifulSoup 

html_doc = '''<p class="story"> 
    Once upon a time there were three little sisters; and their names were 
    <a class="sister" href="http://example.com/elsie" id="link1"> 
    Elsie 
    </a> 
    , 
    <a class="sister" href="http://example.com/lacie" id="link2"> 
    Lacie 
    </a> 
    and 
    <a class="sister" href="http://example.com/tillie" id="link3"> 
    Tillie 
    </a> 
    ; and they lived at the bottom of a well. 
    </p>''' 

soup = BeautifulSoup(html_doc, 'html.parser') 
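# find_all() returns a ResultSet: a list-like collection of Tag objects
sistertags = soup.find_all(class_='sister') 
for tag in sistertags: 
    # .text belongs to each Tag, not to the ResultSet; strip() trims the surrounding whitespace
    print(tag.text.strip())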

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py 
Elsie 
Lacie 
Tillie
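
A possible variation, offered as a sketch rather than as part of the original answer: get_text() accepts a strip argument, so the stripping can be folded into the call, and a list comprehension collects the names in one step. The nested example at the end is hypothetical and only illustrates that strip=True trims each text fragment before joining, so a separator may be wanted when a tag contains child tags.

```python
from bs4 import BeautifulSoup

html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Collect the stripped text of every tag with class "sister".
names = [tag.get_text(strip=True) for tag in soup.find_all(class_='sister')]
print(names)

# Hypothetical tag with a nested child, to show the effect of the separator.
nested = BeautifulSoup('<a class="sister">Big <b>Sister</b></a>', 'html.parser').a
joined = nested.get_text(strip=True)       # fragments are stripped, then joined with ''
spaced = nested.get_text(' ', strip=True)  # fragments are stripped, then joined with ' '
print(joined)
print(spaced)
```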

Answered 2015-11-03 23:55:11

Works perfectly, thanks. I was confused because "sistertags.text" was throwing an error – ShanZhengYang
