Splitting an element with BeautifulSoup

I have some HTML that I parse with BeautifulSoup. One of the requirements is that tags must not be nested inside paragraphs or other text tags.

For example, if I have markup like this:

<p> 
    first text 
    <a href="..."> 
     <img .../> 
    </a> 
    second text 
</p>

I need to transform it into this:

<p>first text</p> 
<img .../> 
<p>second text</p>

I wrote some code that extracts the images and adds them after the paragraph, like this:

for match in soup.body.find_all(True, recursive=False):
    # find_all() returns a list, so removing images inside the loop
    # does not disturb the iteration (unlike walking .descendants).
    for desc in match.find_all('img'):
        # has_attr() checks HTML attributes; hasattr() does not
        if desc.has_attr('src'):
            # add image as an independent tag
            tag = soup.new_tag("img")
            tag['src'] = desc['src']
            tag['alt'] = desc.get('alt', '')  # empty alt if missing
            match.insert_after(tag)

        # remove image from its container
        desc.extract()

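I wrote another piece of code that removes empty elements (such as tags left empty after their image is removed), but I don't know how to split an element into two different elements.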

Source

2012-09-27 alex.ac

# split() is a built-in str method; no import is required
parts = the_string.split(the_separator)  # or .split(the_separator, the_limit)

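This produces a list, so you can go through it with a for loop or pick out the elements manually.

the_limit is optional.

In your case I think the_separator needs to be "\n", but that depends on the situation. Parsing is a lot of fun, but sometimes it is a tricky business.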
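As a small illustration of the call above, here is a minimal sketch of how the optional limit changes the result. The sample string is made up for the example and is not taken from the question's real data.

```python
# Hypothetical sample text: three lines separated by newlines
the_string = "first text\n<img src='a.png'/>\nsecond text"
the_separator = "\n"

# Without a limit, every separator produces a split
all_parts = the_string.split(the_separator)

# With a limit of 1, only the first separator is used
limited = the_string.split(the_separator, 1)

for part in all_parts:
    print(repr(part))
```

Note that this works on plain text only; it knows nothing about tags, so a separator that falls inside an element can leave unclosed markup behind.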

Source

2012-09-27 08:23:20 Develoger

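I'm trying to stay away from string parsing, because I might end up with unclosed tags. I was hoping BeautifulSoup would know how to repair the HTML and keep it valid. Either way, I'll give it a try and see what happens :)

Beautiful Soup has a prettify option, so call soup.prettify() to test it; it returns well-formatted HTML. – Develoger

@DušanRadojević Beautiful Soup always washes the HTML (: – Rubens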
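To illustrate the point made in these comments, here is a minimal sketch of Beautiful Soup closing tags that were left open. The broken fragment is invented for the example, and the built-in "html.parser" is an assumption; other parsers may repair the markup slightly differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two unclosed tags
broken = "<p>first text <b>bold"

soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree writes out the closing tags the parser added
fixed = str(soup)
print(fixed)

# prettify() returns the same tree, indented one node per line
print(soup.prettify())
```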

-1

from bs4 import BeautifulSoup as bs 
from bs4 import NavigableString 
import re 

html = """ 
<div> 
<p> <i>begin </i><b>foo1</b><i>bar1</i>SEPATATOR<b>foo2</b>some text<i>bar2 </i><b>end </b> </p> 
</div> 
""" 
def insert_tags(parent,tag_list): 
    for tag in tag_list: 
     if isinstance(tag, NavigableString): 
      insert_tag = s.new_string(tag.string) 
      parent.append(insert_tag) 
     else: 
      insert_tag = s.new_tag(tag.name) 
      insert_tag.string = tag.string 
      parent.append(insert_tag) 

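s = bs(html, "html.parser")  # name the parser explicitly
p = s.find('p')
print(s.div)
# split the serialized paragraph at the separator
m = re.match(r"^<p>(.*?)(SEPATATOR.*)</p>$", str(p))
part1 = m.group(1).strip()
part2 = m.group(2).strip()

# re-parse each half and copy its nodes into a fresh <p>
part1_p = s.new_tag("p")
insert_tags(part1_p, bs(part1, "html.parser").contents)

part2_p = s.new_tag("p")
insert_tags(part2_p, bs(part2, "html.parser").contents)

# swap the original <p> for the second half, then put the first half before it
s.div.p.replace_with(part2_p)
s.div.p.insert_before(part1_p)
print(s.div)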

Since I don't use nested HTML for this purpose, this works for me. Admittedly, it still looks awkward. For my example it produces:

<div> 
<p><i>begin </i><b>foo1</b><i>bar1</i></p> 
<p>SEPATATOR<b>foo2</b>some text<i>bar2 </i><b>end </b></p> 
</div>
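For the case in the original question (an image nested inside a paragraph), the split can also be sketched directly on the parse tree, without running a regex over the serialized markup. The helper name hoist_images, the sample src value, and the choice of "html.parser" are assumptions made for this example; it leaves surrounding whitespace in the text untouched and has only been thought through for the simple structure shown.

```python
from bs4 import BeautifulSoup


def hoist_images(soup):
    """Move each <img> nested in a <p> up beside it, splitting the <p> in two."""
    for img in soup.find_all("img"):
        p = img.find_parent("p")
        if p is None:
            continue  # image is not inside a paragraph

        # Find the direct child of <p> that holds the image
        # (the image itself, or a wrapper such as <a>).
        child = img
        while child.parent is not p:
            child = child.parent

        # Move everything before that child into a new leading paragraph.
        before = soup.new_tag("p")
        for node in reversed(list(child.previous_siblings)):
            before.append(node.extract())
        p.insert_before(before)

        # Place the image between the two paragraphs.
        p.insert_before(img.extract())

        # Drop the wrapper if it is now empty, then any empty paragraph.
        for tag in (child, before, p):
            if tag is not img and not tag.get_text(strip=True) and tag.find(True) is None:
                tag.decompose()


# Hypothetical input modeled on the question's example
html = '<p>\n first text\n <a href="#"><img src="a.png"/></a>\n second text\n</p>'
soup = BeautifulSoup(html, "html.parser")
hoist_images(soup)

top_level = [tag.name for tag in soup.find_all(True, recursive=False)]
print(top_level)
print(soup)
```

Because the nodes are moved rather than re-parsed, nested markup inside the paragraph is carried over unchanged, which is the part the regex approach cannot guarantee.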

Source

2013-02-01 17:48:21 user1491229
