Python regex to extract the content inside a tag in HTML files

-2

I have many HTML files in a folder. I need to check whether they contain this tag:

<strong>QQ</strong>

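and I only need to extract "QQ" and its content. I started by reading one of the files as a test, but my regex does not seem to match. If I replace fo_read with the tag itself as a string

<strong>QQ</strong>

it does match, though.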

import re

# Read one file to test the pattern
with open('4251-fu.html', 'r') as fo:
    fo_read = fo.read()

m = re.search('<strong>(QQ)</strong>', fo_read)
if m:
    print('Match found: ' + m.group(1))
else:
    print('No match')

Source

2017-05-28 Michael Lin

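Have you considered using an HTML parser instead? [Using regex to parse HTML is horrible](https://stackoverflow.com/a/1732454/5067311).

I have BeautifulSoup, but there are several strong tags in the HTML. How would that work?

Having multiple tags is one more reason to use an HTML parser instead. I'm not familiar with the topic, but the BS4 docs or the [standard html module](https://docs.python.org/3/library/html.parser.html) docs (oops: [python2 for you](https://docs.python.org/2/library/htmlparser.html)) and some targeted googling should help.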
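For what it's worth, here is a minimal sketch of the standard-library route mentioned in the comment above, written for Python 3. The class name `StrongTextCollector` and the sample markup are made up for illustration; they are not from the question. It collects the stripped text of every `<strong>` element, so several strong tags in one file are not a problem.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collect the text of every <strong> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <strong> tags currently open
        self.parts = []   # text fragments of the <strong> being read
        self.texts = []   # finished, whitespace-stripped texts

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.parts).strip())
                self.parts = []


# Made-up sample; a real run would feed the contents of each file instead.
sample = "<p><strong>\nQuestion-and-Answer Session\n</strong> Operator: hello</p>"
collector = StrongTextCollector()
collector.feed(sample)
collector.close()
print("Question-and-Answer Session" in collector.texts)
```

Checking membership in `collector.texts` only matches the exact heading, so other strong tags in the same file are ignored.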

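import re
from bs4 import BeautifulSoup

# Parse the file from the question with the parser bundled with Python
with open('4251-fu.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# First <strong> whose text contains the heading
result = soup.find("strong", string=re.compile("Question-and-Answer Session"))
if result:
    print("Question-and-Answer Session")
    # for the rest of the text in the parent
    rest = result.parent.text.split("Question-and-Answer Session")[-1].strip()
    print(rest)
else:
    print("no match")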

Source

2017-05-28 01:03:54 Serge

It returns [u'\nQuestion-and-Answer Session\n']. How can I get just Question-and-Answer Session?

You can add a `.strip()` at the end of `result.parent.text.split(...)[-1]`
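The newlines in the output quoted above may also explain why the regex in the question found nothing: if the heading is wrapped in whitespace inside the tag, a pattern that expects the text to touch the tags cannot match. This is only a guess about the file, and the sample string below is made up, but the effect is easy to reproduce:

```python
import re

# Made-up sample: the heading is wrapped in newlines inside the tag.
html = "<p><strong>\nQuestion-and-Answer Session\n</strong></p>"

# Pattern in the style of the question: no whitespace allowed around the text.
strict = re.search(r"<strong>(Question-and-Answer Session)</strong>", html)

# Same pattern, but tolerating whitespace (including newlines) around the text.
tolerant = re.search(
    r"<strong>\s*(Question-and-Answer Session)\s*</strong>", html, re.IGNORECASE
)

print(strict)             # None
print(tolerant.group(1))  # Question-and-Answer Session
```

This still breaks on attributes or nested markup inside the tag, which is why the parser-based approaches are the safer choice.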

Splitting is a bit hacky; for any serious project you could try `next_sibling` instead: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways – Serge
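A rough sketch of the `next_sibling` idea, assuming the text you want directly follows the `<strong>` tag inside the same parent. The sample markup is invented, and real files may put another tag or only whitespace there, so treat this as a starting point:

```python
import re
from bs4 import BeautifulSoup, Tag

# Made-up sample markup for illustration.
html = "<p><strong>Question-and-Answer Session</strong> Operator: first question.</p>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("strong", string=re.compile("Question-and-Answer Session"))
rest = None
if heading is not None and heading.next_sibling is not None:
    sibling = heading.next_sibling
    # The sibling can be another tag or a plain text node.
    rest = sibling.get_text() if isinstance(sibling, Tag) else str(sibling)
    rest = rest.strip()

print(rest)
```

Unlike splitting the parent's text, this does not depend on the heading string appearing only once in the parent.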

You can try it with BeautifulSoup:

from bs4 import BeautifulSoup

# 'lxml' requires the lxml package to be installed
with open('4251-fu.html', mode='r') as f:
    soup = BeautifulSoup(f, 'lxml')

# Serialize every <strong> tag so it can be compared as a string
search_result = [str(e) for e in soup.find_all('strong')]
print(search_result)
if '<strong>Question-and-Answer Session</strong>' in search_result:
    print('Match found')
else:
    print('No match')

Output:

['<strong>Question-and-Answer Session1</strong>', '<strong>Question-and-Answer Session</strong>', '<strong>Question-and-Answer Session3</strong>'] 
Match found

Source

2017-05-28 00:52:03

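There are several strong tags, but I only want the Question-and-Answer Session one.

But the strong tag is in different places, not always at the beginning.

It will find all strong tags in the HTML file, wherever they are.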
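To pick out only the one heading among all the strong tags, and to run the check over a whole folder as the question asks, something along these lines might work. The helper names `has_target_strong` and `files_with_target`, the `*.html` glob pattern, and the UTF-8 encoding are all assumptions, not anything from the posts above:

```python
import glob
from bs4 import BeautifulSoup

TARGET = "Question-and-Answer Session"


def has_target_strong(html, target=TARGET):
    """Return True if some <strong> has exactly `target` as its stripped text."""
    soup = BeautifulSoup(html, "html.parser")
    return any(
        tag.get_text(strip=True) == target for tag in soup.find_all("strong")
    )


def files_with_target(pattern="*.html"):
    """Yield paths matching `pattern` whose HTML contains the target tag."""
    for path in sorted(glob.glob(pattern)):
        # Encoding is assumed to be UTF-8; undecodable bytes are replaced.
        with open(path, encoding="utf-8", errors="replace") as f:
            if has_target_strong(f.read()):
                yield path
```

Calling `list(files_with_target("*.html"))` in the folder would then give the files that contain the heading. Comparing the stripped text rather than the serialized tag means surrounding whitespace and the tag's position in the file do not matter.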
