2016-07-27 82 views
0

Python BeautifulSoup: removing self-closing tags

I am trying to remove br tags from HTML code using BeautifulSoup.

The HTML looks like this:

<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;"> 
Doctor of Philosophy (Software Engineering), Universiti Teknologi Petronas 
<br> 
Master of Science (Computer Science), Government College University Lahore 
<br> 
Master of Science (Computer Science), University of Agriculture Faisalabad 
<br> 
Bachelor of Science (Hons) (Agriculture),University of Agriculture Faisalabad 
<br></span> 

My Python code:

for link2 in soup.find_all('br'): 
     link2.extract() 
for link2 in soup.findAll('span',{'class':'qualification'}): 
     print(link2.string) 

The problem is that this code only retrieves the first qualification.

Answers

1

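Because none of these <br> tags has a closing counterpart, Beautiful Soup adds the closing tags automatically, which produces the following HTML: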

In [23]: soup = BeautifulSoup(html) 

In [24]: soup.br 
Out[24]: 
<br> 
Master of Science (Computer Science), Government College University Lahore 
<br> 
Master of Science (Computer Science), University of Agriculture Faisalabad 
<br> 
Bachelor of Science (Hons) (Agriculture),University of Agriculture Faisalabad 
<br/></br></br></br> 

When you call Tag.extract on the first <br> tag, you remove all of its descendants along with the strings those descendants contain:

In [27]: soup 
Out[27]: 
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;"> 
Doctor of Philosophy (Software Engineering), Universiti Teknologi Petronas 
</span> 
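
The same effect can be reproduced without depending on how a given parser treats <br>. The following is a minimal sketch that uses made-up markup with an explicitly closed, nested tag (an assumption for illustration, not the markup from the question) and the built-in html.parser. It shows that extract detaches a tag together with everything nested inside it:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <b> is explicitly closed, so the nesting is unambiguous.
html = "<p>first <b>second <i>third</i></b> fourth</p>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches <b> from the tree together with all of its descendants
# and returns the detached tag.
removed = soup.b.extract()

remaining_text = soup.p.get_text()  # text still attached to <p>
removed_text = removed.get_text()   # text that left the tree with <b>

print(repr(remaining_text))
print(repr(removed_text))
```

If the <br> tags end up nested as in the output above, extracting the first one takes the later qualifications with it in the same way.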

It looks like you just need to extract all the text from the span element. If so, don't bother removing anything:

In [28]: soup.span.text 
Out[28]: '\nDoctor of Philosophy (Software Engineering), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science (Computer Science), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) (Agriculture),University of Agriculture Faisalabad\n' 

The Tag.text attribute extracts all the strings contained in the given tag.
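
If the blank lines in that output are unwanted, Tag.get_text (the method behind Tag.text) accepts separator and strip arguments. A small sketch, assuming the markup from the question (style attribute omitted for brevity) and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = """<span class="qualification">
Doctor of Philosophy (Software Engineering), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science (Computer Science), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) (Agriculture),University of Agriculture Faisalabad
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims every string and skips the whitespace-only ones;
# separator is inserted between the strings that remain.
text = soup.span.get_text(separator="\n", strip=True)
qualifications = text.split("\n")

for q in qualifications:
    print(q)
```

This gives one qualification per line without touching the <br> tags at all.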

+0

So, if BeautifulSoup automatically adds the '</br>' closing tag, could this problem be avoided by using the XHTML-compatible '<br/>'? – HolyDanna

+0

@HolyDanna: Yes. Even so, the OP would still need to use 'Tag.text' or 'Tag.stripped_strings' to get the contents of the 'span'. – vaultah

0

Using unwrap should work:

soup = BeautifulSoup(html) 
for match in soup.findAll('br'): 
    match.unwrap() 
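
One caveat worth checking: unwrapping the <br> tags keeps the text, but the span then holds several separate strings, so .string (as used in the question) still returns None. A sketch with shortened, hypothetical markup and the built-in html.parser, where get_text is used to read the result instead:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup in the question.
html = "<span class='qualification'>PhD<br>MSc<br>BSc<br></span>"
soup = BeautifulSoup(html, "html.parser")

for match in soup.find_all("br"):
    match.unwrap()  # replace each <br> with whatever it contains

span = soup.find("span", {"class": "qualification"})
single = span.string          # None, because the span has more than one child string
joined = span.get_text(", ")  # joins all strings with the given separator

print(single)
print(joined)
```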

0

Here is one way to do it:

for link2 in soup.findAll('span',{'class':'qualification'}): 
    for s in link2.stripped_strings: 
        print(s) 

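There is no need to remove the <br> tags unless you need them gone for later processing. Here link2.stripped_strings is a generator that yields each string inside the tag with leading and trailing whitespace removed. The print loop can be written more concisely as: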

for link2 in soup.findAll('span',{'class':'qualification'}): 
    print(*link2.stripped_strings, sep='\n') 
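
If the qualifications are needed for further processing rather than printing, the same generator can be collected into a list. A minimal sketch, assuming a shortened version of the question's markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question.
html = """<span class="qualification">
Doctor of Philosophy (Software Engineering), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

# One flat list with every stripped string from every matching span.
qualifications = [
    s
    for span in soup.find_all("span", {"class": "qualification"})
    for s in span.stripped_strings
]

print(qualifications)
```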

+0

Thanks, it works. – Aaron