Find a string inside a paragraph, and find that paragraph, with a regular expression

I have an HTML page in which some lines look like this:

<div> 
    <p class="match"> this sentence should match </p> 
    some text 
    <a class="a"> some text </a> 
</div> 
<div> 
    <p class="match"> this sentence shouldnt match</p> 
    some text 
    <a class ="b"> some text </a> 
</div>

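I want to extract the line inside <p class="match">, but only when the same div also contains <a class="a">.

What I have done so far is below (I first find the divs that contain <a class="a">, then iterate over the results to find the sentence inside <p class="match">):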

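import re

# Intended to match a <div>...</div> block containing "a"; DOTALL lets . span newlines.
regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)

# Capture the sentence inside <p class="match">.
regex_match = re.compile(r'<p class="match">(.+)</p>')

with open("a") as file_to_r:  # the HTML file (named "a" here)
    for m in regex_div.findall(file_to_r.read()):
        print(regex_match.findall(m))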

But I wonder whether there is another (still efficient) way to do this in a single pass?

Source

2014-08-28 Simon

Try Beautiful Soup 4 to parse the HTML file. – 2014-08-28 17:04:48

http://stackoverflow.com/a/1732454 – carloabelli 2014-08-28 17:04:54

You should use an HTML parser, but if you still want a regular expression, you could use something like this:

<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>
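As a rough sketch of how this pattern could be applied from Python, the snippet below runs it against the sample markup from the question. The variable names (`html_text`, `pattern`, `matches`) are illustrative, and the `re.DOTALL` flag is an assumption added here: without it, `.*?` cannot cross the line break before `</div>`.

```python
import re

# Sample markup from the question.
html_text = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# The pattern from the answer; DOTALL (an assumption) lets .*? reach </div>
# across newlines.
pattern = re.compile(
    r'<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>',
    re.DOTALL,
)

# findall returns the single capture group; strip the surrounding whitespace.
matches = [text.strip() for text in pattern.findall(html_text)]
print(matches)
```

Note that `[\w\s]+` only accepts word characters and whitespace, so a sentence containing punctuation (an apostrophe, for example) would not be captured by this pattern.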


Source

2014-08-28 17:20:38

Not really reliable ... (http://regex101.com/r/pV9qY8/2). – Jerry 2014-08-28 17:22:29

@Jerry, as I said in my answer, I would not use a regular expression to parse HTML. I posted it only as an option that answers the question with a regex. – 2014-08-28 17:30:30

Use an HTML parser, such as BeautifulSoup.

Find the a tag with class a, then find its previous sibling, the p tag with class match:

from bs4 import BeautifulSoup 

data = """ 
<div> 
    <p class="match"> this sentence should match </p> 
    some text 
    <a class="a"> some text </a> 
</div> 
<div> 
    <p class="match"> this sentence shouldn't match</p> 
    some text 
    <a class ="b"> some text </a> 
</div> 
""" 

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
a = soup.find('a', class_='a')  # first <a> tag with class "a"
print(a.find_previous_sibling('p', class_='match').text)

Prints:

this sentence should match
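If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's `html.parser`. This is only a minimal illustration, not part of the answer above: the class name `MatchExtractor` and its attributes are hypothetical, and it assumes the divs are not nested.

```python
from html.parser import HTMLParser


class MatchExtractor(HTMLParser):
    """Collect <p class="match"> text from each <div> that also holds <a class="a">.

    Hypothetical helper; assumes <div> elements are not nested.
    """

    def __init__(self):
        super().__init__()
        self.results = []     # text from divs that qualified
        self._pending = []    # <p class="match"> text seen in the current div
        self._has_a = False   # current div contains <a class="a">
        self._in_p = False    # currently inside a <p class="match">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            self._pending, self._has_a = [], False
        elif tag == "p" and "match" in classes:
            self._in_p = True
            self._pending.append("")
        elif tag == "a" and "a" in classes:
            self._has_a = True

    def handle_data(self, data):
        if self._in_p:
            self._pending[-1] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
        elif tag == "div":
            # Keep the collected text only if the div contained <a class="a">.
            if self._has_a:
                self.results.extend(text.strip() for text in self._pending)
            self._pending, self._has_a = [], False


html_text = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

parser = MatchExtractor()
parser.feed(html_text)
parser.close()
print(parser.results)
```

Unlike the single `find` call, this collects the sentence from every qualifying div, and it does not depend on the order of the p and a tags inside the div.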

Also see why you should avoid using regular expressions to parse HTML here:

RegEx match open tags except XHTML self-contained tags

Source

2014-08-28 17:06:14 alecxe

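@user3683807 please read the linked topic carefully: HTML parsers are built specifically for parsing HTML, a specific tool for a specific task. I would suggest using `BeautifulSoup` here; it makes HTML parsing easy and reliable. – alecxe 2014-08-28 17:19:14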

<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))

You can use this.

See the demo:

http://regex101.com/r/lK9iD2/7
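A quick way to try this pattern from Python is sketched below, using the sample markup from the question. The names `html_text`, `pattern`, and `sentences` are illustrative. Because the pattern has two capture groups (the sentence and the lookahead's `<a class="a">`), the sketch uses `finditer` and reads only group 1.

```python
import re

# Sample markup from the question.
html_text = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# The pattern from the answer, copied verbatim. No DOTALL here: each .*? stays
# on one line, and the explicit \n parts step from line to line.
pattern = re.compile(
    r'<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))'
)

# Group 1 is the sentence; group 2 is only the lookahead's <a class="a">.
sentences = [m.group(1).strip() for m in pattern.finditer(html_text)]
print(sentences)
```

The pattern depends on the exact line layout (p, one line of text, then the a tag) and accepts any class on the p tag, so it is best treated as tied to this particular markup.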

Source

2014-08-28 17:34:45 vks
