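Getting specific text like "Something new" using BeautifulSoup in Python

I am building a focused crawler and am having trouble finding a key phrase in a document.

Suppose the key phrase I want to search for in the document is "Something new". Using BeautifulSoup with Python,

I do the following: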

# Prints True if any text node matches the phrase, ignoring case
if soup.find_all(text=re.compile("Something new", re.IGNORECASE)):
    print(True)

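I want it to print True only for the following cases:

"something new" -> true

"$#something new,." -> true

and not for these cases:

"thisSomething news" -> false

"Somethingnew" -> false

Assume that special characters are ignored.

Has anyone done something like this before?

Thanks for your help.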

2014-09-26 patz

In that case, search for the lowercase phrase something new and do not apply re.IGNORECASE:

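import re

from bs4 import BeautifulSoup

# Sample document covering the four cases from the question
data = """
<div>
    <span>something new</span>
    <span>$#something new,.</span>
    <span>thisSomething news</span>
    <span>Somethingnew</span>
</div>
"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")
# Case-sensitive search: the two unwanted samples use a capital "S"
for item in soup.find_all(text=re.compile("something new")):
    print(item)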

Prints:

something new 
$#something new,.

You can also take a non-regex approach and pass a function instead of a compiled regular expression:

for item in soup.find_all(text=lambda x: 'something new' in x):
    print(item)

For the sample HTML used above, it also prints:

something new 
$#something new,.
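
One caveat that may be worth illustrating: both searches above are case-sensitive substring checks, so they reject the two unwanted samples only because those samples happen to contain a capital "S". The sketch below uses just the standard re module (no BeautifulSoup), and the lowercase string "thissomething news" is a made-up input for illustration, not one taken from the question.

```python
import re

# Same case-sensitive pattern as in the answer above
case_sensitive = re.compile("something new")

from_question = "thisSomething news"      # sample from the question
lowercase_variant = "thissomething news"  # hypothetical input, not from the question

# Rejected only because of the capital "S" in the middle of the word
rejected = case_sensitive.search(from_question) is None
# The all-lowercase variant of the same unwanted text is accepted
slipped_through = case_sensitive.search(lowercase_variant) is not None
# The plain substring test used with the lambda behaves the same way
substring_hit = "something new" in lowercase_variant

print(rejected, slipped_through, substring_hit)
```

If inputs like the lowercase variant can occur, a pattern with word boundaries is probably the safer choice.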

2014-09-26 00:28:52 alecxe

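Did pratikgala post his question wrong? He only wants case and symbols to be ignored, but he also wants "Somethingnew" -> false. That makes your answer perfect :-) Then I will have to remove my vote on the question :p – 2014-09-26 03:00:40

@Md.Mohsin Yes, that is what made me wonder whether to post the answer or not. The code works for the input the OP provided; we'll see if anyone else weighs in here. Thanks. – alecxe 2014-09-26 03:02:28

Thanks for the answer. This worked for me: soup.find_all(text=re.compile("\\bSomething new\\b", re.IGNORECASE)) – patz 2014-09-26 03:22:55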

This is one of the alternative approaches I used:

soup.find_all(text = re.compile("\\bSomething new\\b",re.IGNORECASE))
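
As a quick sanity check of the word-boundary pattern, the sketch below runs it against the four sample strings from the question using only the standard re module. The names pattern, samples and results are illustrative, and the pattern is written as a raw string, which is equivalent to the doubled backslashes above.

```python
import re

# \b matches between a word character and a non-word character (or the
# string edge), so "$#" and ",." around the phrase do not block a match
pattern = re.compile(r"\bSomething new\b", re.IGNORECASE)

# The four cases listed in the question
samples = [
    "something new",
    "$#something new,.",
    "thisSomething news",
    "Somethingnew",
]

# Map each sample to whether the pattern is found anywhere in it
results = {s: bool(pattern.search(s)) for s in samples}
for s, matched in results.items():
    print(f"{s!r} -> {matched}")
```

The same compiled pattern can be passed to soup.find_all(text=pattern), as in the line above.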

Thanks, everyone.

2014-09-26 06:55:37 patz
