2017-05-26 109 views

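Searching for HTML elements based on a specific word in the element's string

I'm trying to write a program that uses the Beautiful Soup module to find certain elements and replace their tags. However, I can't figure out how to "find" those elements by searching for a specific word in the element's string. Assuming I can get my code to find the elements by that word, I would then "unwrap" each element's "p" tag and "wrap" it in a new "h1" tag.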

Here is some example HTML as input:

<p> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </p> 
<p> Example#2 this element ignored </p> 
<p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </p>

Here is my code so far (searching for "ExampleStringWord#1"):

for h1_tag in soup.find_all(string="ExampleStringWord#1"): 
      soup.p.wrap(soup.h1_tag("h1")) 

Given the example HTML above as input, I would like the code to output:

<h1> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </h1> 
<p> Example#2 this element ignored </p> 
<h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </h1>

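However, my code only finds an element whose string is exactly "ExampleStringWord#1", and it skips any element whose string has more text after that word. I believe I need a regular expression to find the elements that start with my specified word, whatever text follows it. I'm not very familiar with regular expressions, though, so I'm not sure how to combine them with the BeautifulSoup module.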

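I have also gone through the Beautiful Soup documentation on passing a regular expression as a filter (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression), but I couldn't get it to work in my case. I also reviewed some other posts about passing regular expressions to Beautiful Soup, but none of them fully addressed my problem. Any help is appreciated!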

Answer


You can find the p elements whose string matches the specified text (note the re.compile() part), and then replace each element's name with h1:

import re 

from bs4 import BeautifulSoup 

data = """ 
<body> 
    <p> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </p> 
    <p> Example#2 this element ignored </p> 
    <p> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </p> 
</body> 
""" 

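soup = BeautifulSoup(data, "html.parser")
# A compiled pattern matches <p> elements whose string contains the word,
# instead of requiring the whole string to equal it.
for p in soup.find_all("p", string=re.compile("ExampleStringWord#1")):
    p.name = 'h1'  # renaming the tag in place turns <p> into <h1>
print(soup)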

Prints:

<body> 
    <h1> ExampleStringWord#1 needs to “find” this entire element based on the "finding" of the first word </h1> 
    <p> Example#2 this element ignored </p> 
    <h1> ExampleStringWord#1 needs to find this entire element as well because the first word of this string is what I’m “searching” for, even though the wording after the first word in the string is different </h1> 
</body>
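
A possible variation, in case the word must be the first word of the element and not just appear somewhere in it. As far as I understand it, a compiled pattern passed as string= is applied as a search, so it matches the word anywhere in the text, and string= is compared against the tag's .string, which is None when the element contains nested tags. The sketch below sidesteps both by passing a function as the filter. The sample HTML, the variable word, and the helper p_starting_with_word are made up for illustration; they are not part of the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample input: the third <p> mentions the word later in the
# text, and the fourth <p> contains a nested tag.
html = """
<body>
<p> ExampleStringWord#1 first element </p>
<p> Example#2 this element ignored </p>
<p> Mentions ExampleStringWord#1 later in the text </p>
<p> ExampleStringWord#1 with <b>nested</b> markup </p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumed search word. re.escape() keeps characters such as "#" literal,
# and (?!\S) requires whitespace or the end of the text right after the word.
word = "ExampleStringWord#1"
starts_with_word = re.compile(r"\s*" + re.escape(word) + r"(?!\S)")


def p_starting_with_word(tag):
    # get_text() joins all descendant strings, so nested tags do not hide
    # the text; match() only succeeds at the start of that text.
    return tag.name == "p" and starts_with_word.match(tag.get_text()) is not None


for p in soup.find_all(p_starting_with_word):
    p.name = "h1"

# Collapse whitespace so the renamed elements are easy to compare.
renamed = [" ".join(h1.get_text().split()) for h1 in soup.find_all("h1")]
print(renamed)
```

With this filter only the first and fourth paragraphs should be renamed; the paragraph that merely mentions the word later stays a p element.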