Python3 BeautifulSoup returns concatenated string

I am trying to pull a cast list from this website. This is how I locate it:

import re

actors_anchor = soup.find('a', href=re.compile('Actor&p'))
parent_tag = actors_anchor.parent
next_td_tag = parent_tag.findNext('td')  # was actors_anchor_parent, which is undefined

next_td_tag 

<font size="2">Wes Bentley<br><a href="/people/chart/ 
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a 
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert   
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font>

The problem is that when I pull the text, it returns a single string with nothing separating the names:

print(next_td_tag.get_text()) 
'''this returns''' 
'Wes BentleyBryce Dallas HowardRobert RedfordKarl Urban'

I need the names as a list with one entry per name, like ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban'].

Any suggestions would be very helpful.

Source

2016-12-31 Chace Mcguyer

Can't you use `find_all('a', ...)` and a for-loop, without `parent` and `findNext`? – furas

Please elaborate. Thanks for the formatting edit; this is my first post. –

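So the problem is that not all of the actor names are wrapped in an <a> tag. Many of the names in the HTML appear between <br> tags, and when I use that method it does not let me get 'Wes Bentley'. –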

Find all the a elements inside the td:

[a.get_text() for a in next_td_tag.find_all('a')]

This, however, will not cover the "Wes Bentley" text, which sits in the td without an a element around it.
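To make the gap concrete, here is a minimal sketch. The `html` string is a simplified stand-in for the cell in the question (the `href` values are shortened placeholders, not the real URLs), and `linked_names` is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the cell in the question (href values are placeholders).
html = (
    '<td><font size="2">Wes Bentley<br>'
    '<a href="?id=brycedallashoward.htm">Bryce Dallas Howard</a><br>'
    '<a href="?id=robertredford.htm">Robert Redford</a><br>'
    '<a href="?id=karlurban.htm">Karl Urban</a></font></td>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# Only names wrapped in <a> come back; the bare text node is skipped.
linked_names = [a.get_text() for a in td.find_all("a")]
print(linked_names)
# ['Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']
```

'Wes Bentley' is a plain text node inside the font element, so no tag-based search for a elements can reach it.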

Let's approach it differently and find all the text nodes instead:

next_td_tag.find_all(text=True)

You may need to clean the result up and remove the "empty" items:

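# collapse runs of whitespace (including newlines inside a name) into single spaces
texts = [" ".join(text.split()) for text in next_td_tag.find_all(text=True)]
texts = [text for text in texts if text]  # drop items that were whitespace only
print(texts)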

This prints:

['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']
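As a self-contained sketch of the same idea, the snippet below uses a made-up two-name cell that contains whitespace-only text nodes and a name broken across lines. It spells the filter as `string=True`, which is the name Beautiful Soup 4.4.0 and later use for the older `text=True` argument. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up cell: newlines between tags create whitespace-only text nodes,
# and one name is split across two lines.
html = (
    '<td><font size="2">Wes Bentley<br>\n'
    '<a href="#">Robert   \nRedford</a><br>\n'
    '</font></td>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

raw = td.find_all(string=True)  # every text node, including the "\n" ones
names = [" ".join(s.split()) for s in raw]  # collapse internal whitespace
names = [n for n in names if n]  # drop the nodes that became empty
print(names)
# ['Wes Bentley', 'Robert Redford']
```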

Source

2016-12-31 03:25:07 alecxe

This solved my problem. It looks simple now that I understand it; your help is much appreciated. –

You can use stripped_strings to get all the strings as a list:

html = '''<td><font size="2">Wes Bentley<br><a href="/people/chart/ 
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a 
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font></td>''' 

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html, 'html.parser') 

next_td_tag = soup.find('td') 

print(list(next_td_tag.stripped_strings))

Result:

['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']

stripped_strings is a generator, so you can use it in a for-loop or collect all of its elements with list().
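A short sketch of the for-loop form follows, using a made-up cell. It also shows one limit worth knowing: stripped_strings trims whitespace at the start and end of each string and skips whitespace-only strings, but it leaves whitespace inside a string alone. The names `names` and `normalized` are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up cell with padding around the text and a newline inside one name.
html = '<td>  Wes Bentley <br><a href="#">Robert   \nRedford</a><br>  </td>'

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

names = []
for s in td.stripped_strings:  # consumed lazily, one string at a time
    names.append(s)

# Leading and trailing whitespace is gone, internal whitespace is not.
print(names)
# ['Wes Bentley', 'Robert   \nRedford']

# Collapse internal whitespace separately if the markup wraps names.
normalized = [" ".join(s.split()) for s in names]
print(normalized)
# ['Wes Bentley', 'Robert Redford']
```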

Source

2016-12-31 03:37:31 furas

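Ah, a perfect fit for this problem! – alecxe

@alecxe The whitespace here is not at the head or tail of the strings, so take care with stripped_strings. Also, the HTML in this answer was modified: the '\n' was removed. –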

import bs4 

html = '''<font size="2">Wes Bentley<br><a href="/people/chart/ 
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a 
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert   
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font>''' 

soup = bs4.BeautifulSoup(html, 'lxml') 

text = soup.get_text(separator='|')  # join the strings with a separator
# e.g. 'Wes Bentley|Bryce Dallas Howard|Robert   \nRedford|Karl Urban'
# then split on the separator and collapse the whitespace inside each name
# (replacing '  \n' with '' would glue the name together as 'RobertRedford')
split_text = [' '.join(part.split()) for part in text.split('|')]
# ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']

# do it in one line
list_text = [' '.join(part.split()) for part in soup.get_text(separator='|').split('|')]

Or use the strings generator to avoid manually splitting the string into a list:

[' '.join(i.split()) for i in soup.strings]  # collapse whitespace inside each string
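A variation on the separator approach, sketched with a shortened made-up snippet: get_text() also accepts strip=True, which strips each string and drops the empty ones before joining. The '|' separator is an assumption that no name contains that character; the iterator-based approaches avoid that risk.

```python
from bs4 import BeautifulSoup

# Shortened, made-up version of the markup (href values are placeholders).
html = (
    '<font size="2">Wes Bentley<br>'
    '<a href="#">Bryce Dallas Howard</a><br>'
    '<a href="#">Robert   \nRedford</a></font>'
)

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each string and skips whitespace-only ones before joining.
joined = soup.get_text(separator="|", strip=True)

# Split on the separator, then collapse whitespace inside each name.
names = [" ".join(part.split()) for part in joined.split("|")]
print(names)
# ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford']
```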

Source

2016-12-31 04:00:58
