Wrapping multiple tags with BeautifulSoup

I'm writing a Python script that converts an HTML document into a reveal.js slideshow. To do this, I need to wrap multiple tags inside a <section> tag.

Wrapping a single tag inside another tag is easy with the wrap() method. However, I can't figure out how to wrap multiple tags.
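For reference, the single-tag case mentioned above can be sketched as follows. This is a minimal illustration on a made-up two-tag document (not the document from this question), assuming the built-in 'html.parser' backend:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, used only to illustrate wrap().
soup = BeautifulSoup("<body><h1>Title</h1><p>Text</p></body>", "html.parser")

h1 = soup.find("h1")
section = soup.new_tag("section")

# wrap() puts the new <section> where <h1> was and moves <h1> inside it.
h1.wrap(section)

wrapped = str(soup)
print(wrapped)
```

Only the <h1> ends up inside the <section>; the following <p> stays outside, which is exactly the limitation described here.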

To clarify with an example, here is the original HTML:

html_doc = """ 
<html> 

<head> 
    <title>The Dormouse's story</title> 
</head> 

<body> 

    <h1 id="first-paragraph">First paragraph</h1> 
    <p>Some text...</p> 
    <p>Another text...</p> 
    <div> 
    <a href="http://link.com">Here's a link</a> 
    </div> 

    <h1 id="second-paragraph">Second paragraph</h1> 
    <p>Some text...</p> 
    <p>Another text...</p> 

    <script src="lib/.js"></script> 
</body> 

</html> 
""" 



I want to wrap each <h1> and the tags that follow it in a <section> tag, like this:

<html> 
<head> 
    <title>The Dormouse's story</title> 
</head> 
<body> 

    <section> 
    <h1 id="first-paragraph">First paragraph</h1> 
    <p>Some text...</p> 
    <p>Another text...</p> 
    <div> 
     <a href="http://link.com">Here's a link</a> 
    </div> 
    </section> 

    <section> 
    <h1 id="second-paragraph">Second paragraph</h1> 
    <p>Some text...</p> 
    <p>Another text...</p> 
    </section> 

    <script src="lib/.js"></script> 
</body> 

</html>

Here is how I make the selection:

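from bs4 import BeautifulSoup
import itertools

# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(html_doc, 'html.parser')
h1s = soup.find_all('h1')
for el in h1s:
    # Collect everything after this <h1> until the next <h1> or <script>.
    els = [i for i in itertools.takewhile(lambda x: x.name not in [el.name, 'script'], el.next_elements)]
    els.insert(0, el)
    print(els)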

Output:

[<h1 id="first-paragraph">First paragraph</h1>, 'First paragraph', '\n ', <p>Some text...</p>, 'Some text...', '\n ', <p>Another text...</p>, 'Another text...', '\n ', <div><a href="http://link.com">Here's a link</a> </div>, '\n ', <a href="http://link.com">Here's a link</a>, "Here's a link", '\n ', '\n\n '] 

[<h1 id="second-paragraph">Second paragraph</h1>, 'Second paragraph', '\n ', <p>Some text...</p>, 'Some text...', '\n ', <p>Another text...</p>, 'Another text...', '\n\n ']

The selection is correct, but I can't see how to wrap each selection in a <section> tag.


2015-08-28 Ben

Can you edit your post and show the expected output? – styvane

Please post the expected output. –

I added the expected output. – Ben

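I finally figured out how to use the wrap method in this case. What I needed to understand is that every change to the soup object is made in place.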

from bs4 import BeautifulSoup
import itertools

# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(html_doc, 'html.parser')

# wrap each h1 and its following siblings in a section
h1s = soup.find_all('h1')
for el in h1s:
    # Build the list before touching the tree: the changes below are made in place.
    els = [i for i in itertools.takewhile(
        lambda x: x.name not in [el.name, 'script'],
        el.next_siblings)]
    section = soup.new_tag('section')
    el.wrap(section)
    for tag in els:
        section.append(tag)

print(soup.prettify())

This gives me the output I wanted. Hope this helps.
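One detail worth spelling out: the question's selection iterated over next_elements, while the code in this answer iterates over next_siblings. The sketch below, on a made-up miniature document and with a hypothetical helper tag_names, shows how the two differ: next_elements follows parse order and descends into nested tags, whereas next_siblings stays on the same level as the <h1>.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature document, not the one from the question.
soup = BeautifulSoup(
    "<h1>T</h1><div><a>link</a></div><p>end</p>", "html.parser"
)
h1 = soup.find("h1")


def tag_names(nodes):
    """Return the names of the Tag objects in nodes, skipping text nodes."""
    return [n.name for n in nodes if isinstance(n, Tag)]


# next_elements follows parse order, so it descends into <div> and yields <a>.
in_parse_order = tag_names(h1.next_elements)

# next_siblings stays on the same level as <h1>.
on_same_level = tag_names(h1.next_siblings)

print(in_parse_order)
print(on_same_level)
```

Appending every item from the next_elements list would presumably also pull nested tags such as the <a> out of their parent <div>, so the sibling-level iteration is the better fit for wrapping.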


2015-08-29 19:43:13 Ben

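Thanks. I want to point out a few things I learned that may not be obvious. 1) Attaching a tag elsewhere (e.g. via append) removes it from its previous location. 2) Because of (1), and because .next_siblings is a generator rather than a list, it needs to be converted to a list before the loop that calls section.append(tag). Your complicated 'els = [...]' does that. I didn't need the filtering, so I tried 'els = el.next_siblings'. That failed, because moving the first sibling broke the sibling chain. 'els = list(el.next_siblings)' works. – wojtow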
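The two points in this comment can be sketched on a made-up miniature document (an assumption for illustration, not the document from the question). The siblings are snapshotted with list() before anything is moved, and append() then relocates each node rather than copying it:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, used only for illustration.
soup = BeautifulSoup(
    "<body><h1>T</h1><p>a</p><p>b</p><p>c</p></body>", "html.parser"
)
h1 = soup.find("h1")

# Snapshot the siblings first: next_siblings is a generator that walks the
# live tree, and the loop below changes that tree.
siblings = list(h1.next_siblings)

# wrap() returns the wrapper, so section is the new <section> tag.
section = h1.wrap(soup.new_tag("section"))
for node in siblings:
    # append() moves the node: it is removed from <body> and placed
    # at the end of <section>.
    section.append(node)

result = str(soup)
print(result)
```

After the loop, no <p> remains as a direct child of <body>; all three now live inside the <section>.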
