Regex findall does not return the expected multiple results
I have the following two snippets in Python (short_sentence is a part of long_sentence):
short_sentence = '<p data-reactid="389">THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.</p>'
long_sentence = '<description><img src="http://cdn.static-economist.com/sites/default/files/images/print-edition/20170211_LDC811.png" alt="" title="" height="376" width="458" class=" blog-post-article-image blog-post-article-image__slim" data-reactid="388"/><p data-reactid="389">THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.</p><p data-reactid="390">To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.</p>'
I want to extract each (shortest) substring that lies between <p + anything + > and </p>. I know that short_sentence contains one such occurrence:
THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.
long_sentence contains the one above plus another:
To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.
As far as I know, Python's re.findall() returns all occurrences of the matched substring in a text. When I run the following command:
re.findall("<p.*>(.*?)</p>", short_sentence)
I get the correct, expected result:
['THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.']
However, when I try to extract the two strings from long_sentence with the following:
re.findall("<p.*>(.*?)</p>", long_sentence)
I still get only one occurrence (the second one):
['To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.']
My question is: what goes wrong in the second case? Why does it not return both occurrences?
Use 're.findall("<p.*?>(.*?)</p>", long_sentence)' –
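To see why the lazy quantifier suggested in this comment changes the result, here is a minimal sketch. The `sample` string is a shortened, hypothetical stand-in for long_sentence, not the original text. The `[^>]*` variant is an additional alternative that was not proposed in the thread.

```python
import re

# Hypothetical, shortened stand-in for long_sentence (same tag layout).
sample = (
    '<description><img src="x.png" data-reactid="388"/>'
    '<p data-reactid="389">First paragraph.</p>'
    '<p data-reactid="390">Second paragraph.</p>'
)

# Greedy "<p.*>": ".*" first runs to the end of the string, then backtracks
# to the LAST ">" that still lets "(.*?)</p>" match. The first paragraph and
# the second opening tag are swallowed, so only one match remains.
greedy = re.findall(r"<p.*>(.*?)</p>", sample)

# Lazy "<p.*?>": ".*?" stops at the FIRST ">" that allows a match,
# so each opening tag is matched separately.
lazy = re.findall(r"<p.*?>(.*?)</p>", sample)

# Stricter alternative (an assumption, not from the thread):
# "[^>]*" cannot cross a tag boundary at all.
strict = re.findall(r"<p[^>]*>(.*?)</p>", sample)

print(greedy)
print(lazy)
print(strict)
```

The greedy pattern consumes everything up to the last opening `<p ...>` tag, which is why only the second paragraph comes back for long_sentence.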
If you are trying to parse HTML or XML, consider using an HTML or XML parsing library instead of regular expressions. – Kevin
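As one possible illustration of the parser-based approach, the sketch below uses `html.parser.HTMLParser` from the Python 3 standard library. The question's byte-string literals suggest Python 2, so treat this as an assumption about the environment. The class name `ParagraphCollector` and the `sample` string are hypothetical, and nested markup inside a paragraph is reduced to its text content.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text content of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        # Keep text only while inside a <p> element.
        if self._buffer is not None:
            self._buffer.append(data)


# Hypothetical, shortened stand-in for long_sentence.
sample = (
    '<description><img src="x.png" data-reactid="388"/>'
    '<p data-reactid="389">First paragraph.</p>'
    '<p data-reactid="390">Second paragraph.</p>'
)

collector = ParagraphCollector()
collector.feed(sample)
collector.close()
paragraphs = collector.paragraphs
print(paragraphs)
```

A third-party library such as BeautifulSoup or lxml would likely be shorter, but the standard-library parser avoids an extra dependency.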