How to extract text with "alt" using Beautiful Soup

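I have just discovered Beautiful Soup, which seems very powerful. I am wondering whether there is an easy way to extract the "alt" field together with the text. A simple example: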

from bs4 import BeautifulSoup 

html_doc =""" 
<body> 
<p>Among the different sections of the orchestra you will find:</p> 
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p> 
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p> 
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p> 
</body> 
""" 
soup = BeautifulSoup(html_doc, 'html.parser') 
print(soup.get_text())

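This results in

Among the different sections of the orchestra you will find:
A  in the strings
A  in the brass
A  in the woodwinds

But I would like the alt field to be included in the extracted text, which would give

Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds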

Thanks

Source

2017-04-24 Portland

Take a look: http://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup – JacobIRR

Please consider this approach.

from bs4 import BeautifulSoup 

html_doc =""" 
<body> 
<p>Among the different sections of the orchestra you will find:</p> 
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p> 
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p> 
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p> 
</body> 
""" 
soup = BeautifulSoup(html_doc, 'html.parser') 
ptag = soup.find_all('p') # get all tags of type <p> 

for tag in ptag: 
    instrument = tag.find('img') # search for <img> 
    if instrument: # if we found an <img> tag... 
        # ...build a new string with the content of 'alt' inserted into 'tag.text'
        # (offset 2 works here because every such paragraph starts with "A ")
        temp = tag.text[:2] + instrument['alt'] + tag.text[2:] 
        print(temp)
    else: # no <img> tag found, so just print 'tag.text' 
        print(tag.text)

The output is

Among the different sections of the orchestra you will find: 
A violin in the strings 
A trumpet in the brass 
A clarinet and saxophone in the woodwinds

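The strategy is:

Find all <p> tags
Search for an <img> tag inside each of these <p> tags
If we find an <img> tag, insert the content of its alt attribute into tag.text and print the result
If we don't find an <img> tag, just print tag.text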
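
The same strategy can be sketched without the hard-coded offset of 2. This is only a minimal sketch and not part of the original answer: text_with_alt is a hypothetical helper name. It walks the children of each <p> in document order and substitutes the alt text wherever an <img> appears, so it does not depend on where the image sits in the paragraph.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = """
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""


def text_with_alt(tag):
    """Hypothetical helper: text of `tag`, with each <img> replaced by its alt text."""
    parts = []
    for child in tag.children:
        if isinstance(child, Tag):
            if child.name == 'img':
                # An <img> contributes its alt text (empty string if it has none).
                parts.append(child.get('alt', ''))
            else:
                # Any other nested tag is handled recursively.
                parts.append(text_with_alt(child))
        else:
            # Plain text nodes are kept as they are.
            parts.append(str(child))
    return ''.join(parts)


soup = BeautifulSoup(html_doc, 'html.parser')
lines = [text_with_alt(p) for p in soup.find_all('p')]
for line in lines:
    print(line)
```

Because the helper visits every child in order, a paragraph containing several images should come out with each alt text in its own position.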

Source

2017-04-24 14:41:00 datell

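Thanks a lot @datell. It works fine. One more question. If I have two images in the same paragraph, for example

Among the different sections of the orchestra you will find:

A violin in the strings. In the brass

A clarinet and saxophone A trumpet in the woodwinds

then it cannot extract the second one. Any ideas for 2 or more "img" in the same paragraph? – Portland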
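
For two or more images in one paragraph, one possible approach is to replace every <img> with its alt text before extracting the text. The sketch below assumes a paragraph like the hypothetical html_doc shown (the markup from the comment was not preserved). Note that replace_with modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two images in the same paragraph.
html_doc = """
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings and a <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Swap each <img> that has an alt attribute for a plain string holding that alt text.
for img in soup.find_all('img', alt=True):
    img.replace_with(img['alt'])

# With the images replaced, the ordinary text extraction includes the alt text.
result = soup.p.get_text()
print(result)
```

After the replacement, soup.get_text() on the whole document would work the same way, since the alt strings are now ordinary text nodes.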

a = soup.findAll('img') 

for every in a: 
    print(every['alt'])

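This will do the job.

Line 1 finds all the <img> tags (we use .findAll).

Or, for the text:

# a is a list of tags (a ResultSet), so it has no .text of its own; loop over it instead
for eachline in a: 
    print(eachline.text) # empty for <img> tags, which contain no text

A simple for loop goes through each result, or you can index manually with soup.findAll('img')[0], then soup.findAll('img')[1], and so on.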
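
One caveat with the first loop in this answer: every['alt'] raises a KeyError when an <img> has no alt attribute. A small sketch of a more tolerant variant follows, using a hypothetical snippet in which one image lacks alt.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second image has no alt attribute.
html_doc = '<p><img src="a.jpg" alt="violin" /><img src="b.jpg" /></p>'

soup = BeautifulSoup(html_doc, 'html.parser')

# img['alt'] would raise KeyError for the second image; .get() supplies a default.
alts = [img.get('alt', '') for img in soup.find_all('img')]
print(alts)

# Alternatively, select only the images that carry an alt attribute.
with_alt = [img['alt'] for img in soup.find_all('img', alt=True)]
print(with_alt)
```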

Source

2017-04-24 04:12:38

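Thanks, but your code returns violin trumpet clarinet and saxophone. That is not what I asked. As in my original post, I want these values placed in the "right place" within the text. – Portland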
