2017-02-15 45 views

Regex findall does not return the expected multiple results

I have the following two snippets in Python (short_sentence is a part of long_sentence here):

short_sentence = '<p data-reactid="389">THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.</p>' 

long_sentence = '<description>&lt;img src=&quot;http://cdn.static-economist.com/sites/default/files/images/print-edition/20170211_LDC811.png&quot; alt=&quot;&quot; title=&quot;&quot; height=&quot;376&quot; width=&quot;458&quot; class=&quot; blog-post-article-image blog-post-article-image__slim&quot; data-reactid=&quot;388&quot;/&gt;&lt;p data-reactid=&quot;389&quot;&gt;THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.&lt;/p&gt;&lt;p data-reactid=&quot;390&quot;&gt;To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.&lt;/p&gt;' 

I want to extract every (shortest) substring that lies between "&lt;p" + anything + "&gt;" and "&lt;/p&gt;". I know that short_sentence contains one such occurrence:

THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d. 

In long_sentence there is the one above plus one more:

To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d. 

As far as I know, Python's re.findall() returns all occurrences of the matching substring in a text. When I run the following:

re.findall("&lt;p.*&gt;(.*?)&lt;/p&gt;", short_sentence) 

I get the result I expect:

['THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.'] 

However, when I try to extract the two strings from long_sentence with:

re.findall("&lt;p.*&gt;(.*?)&lt;/p&gt;", long_sentence) 

I still get only one occurrence (the second one):

['To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.'] 

My question is: what goes wrong in the second case? Why doesn't it return both occurrences?

Use 're.findall("&lt;p.*?&gt;(.*?)&lt;/p&gt;", long_sentence)' –

If you are trying to parse HTML or XML, consider using an HTML or XML parsing library instead of regular expressions. – Kevin
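As a rough illustration of that suggestion, here is a minimal sketch that uses only the Python 3 standard library (html.unescape and html.parser.HTMLParser). The sample string is a shortened, made-up stand-in for long_sentence, and ParagraphCollector is a hypothetical helper name, not something from the question:

```python
from html import unescape
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current paragraph.
        if self._inside_p:
            self.paragraphs[-1] += data


# Made-up stand-in for long_sentence: two entity-encoded paragraphs.
sample = (
    "&lt;p data-reactid=&quot;389&quot;&gt;First paragraph.&lt;/p&gt;"
    "&lt;p data-reactid=&quot;390&quot;&gt;Second paragraph.&lt;/p&gt;"
)

collector = ParagraphCollector()
collector.feed(unescape(sample))  # decode the entities into real markup first
collector.close()
print(collector.paragraphs)  # ['First paragraph.', 'Second paragraph.']
```

The entities are decoded before parsing because the parser only recognizes real angle brackets as tags. A third-party parser would work along the same lines.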

Answers


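p.* is greedy, so it matches as much as it can: after the first &lt;p it runs on to the last &gt; that still lets the rest of the pattern match, which swallows the first paragraph. If you use p.*? instead, you will get the expected result.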
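A minimal sketch of the difference, assuming Python 3 and a shortened, made-up stand-in for long_sentence (the sample string and variable names are for illustration only, not the original data):

```python
import re

# Made-up stand-in for long_sentence: two entity-encoded paragraphs.
sample = (
    "&lt;p data-reactid=&quot;389&quot;&gt;First paragraph.&lt;/p&gt;"
    "&lt;p data-reactid=&quot;390&quot;&gt;Second paragraph.&lt;/p&gt;"
)

# Greedy: ".*" after "&lt;p" runs to the end of the string and backtracks only
# as far as the last "&gt;" that still allows a match, which is the one closing
# the second opening tag. The first paragraph is consumed by ".*".
greedy = re.findall(r"&lt;p.*&gt;(.*?)&lt;/p&gt;", sample)

# Lazy: ".*?" stops at the first "&gt;", so each opening tag is matched on its own.
lazy = re.findall(r"&lt;p.*?&gt;(.*?)&lt;/p&gt;", sample)

print(greedy)  # ['Second paragraph.']
print(lazy)    # ['First paragraph.', 'Second paragraph.']
```

Both patterns find the first "&lt;p" at the same position. They differ only in how far the opening-tag part extends before the capture group starts.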

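There is more information on the topic here, if you need it: http://www.regular-expressions.info/repeat.html

Excerpt:

Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of angle brackets. If it sits between angle brackets, it is an HTML tag.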

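Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like This is a <EM>first</EM> test. You might expect the regex to match <EM> and, when continuing after that match,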