How do I use RegEx to extract values from HTML?

Consider the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq: <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

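I'd like to get the values inside the <span> elements. I'd also like to get the value of the class attribute on each <span> element.

Ideally, I could run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing described above).

The code above is a snippet from a larger source HTML file, which fails to parse with an XML parser. So I'm looking for a regular expression that can help extract the information of interest.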

Source

2011-03-16 Paul Fryer

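What programming language are you using? There are libraries that will take HTML that isn't valid XML and still let you query it with XPath expressions and the like. – 2011-03-16 15:26:37

Programming language = .NET – 2011-03-16 15:32:40

Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

Use this expression:

"<span[^>]*>(.*?)</span>"

The value in group 1 (of each match) will be the text you need.

In C# it would look like: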

    using System.Text.RegularExpressions;

    // Group 1 captures the text between the opening and closing span tags.
    Regex regex = new Regex("<span[^>]*>(.*?)</span>");
    string toMatch = "<span class=\"ajjsjs\">Some text</span>";
    // Matches returns an empty collection when nothing matches,
    // so no separate IsMatch check is needed.
    foreach (Match m in regex.Matches(toMatch))
    {
        string val = m.Groups[1].Value;
        // Do something with the value
    }

Amended to answer the comment:

    using System.Text.RegularExpressions;

    // Group 1 captures the class attribute, group 2 the text inside the span.
    Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
    string toMatch = "<span class=\"ajjsjs\">Some text</span>";
    foreach (Match m in regex.Matches(toMatch))
    {
        // "class" is a reserved word in C#, so the variable is named className.
        string className = m.Groups[1].Value;
        string val = m.Groups[2].Value;
        // Do something with the class and value
    }
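The question asks for a function that returns a dictionary of extracted entities. A minimal sketch of that idea follows, reusing the pattern from the amended example. The names SpanExtractor and Extract are hypothetical, and grouping the values by class name is an assumption about the desired dictionary shape, not something the thread specifies.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SpanExtractor
{
    // Group 1 is the class attribute, group 2 the text inside the span.
    // Like the example above, this assumes class is the only attribute
    // and that spans are not nested.
    private static readonly Regex SpanPattern =
        new Regex("<span class=\"(.*?)\">(.*?)</span>");

    // Maps each class name to the values found under it, in document order.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            string className = m.Groups[1].Value;
            string value = m.Groups[2].Value;

            // Create the list the first time a class name is seen.
            if (!result.TryGetValue(className, out List<string> values))
            {
                values = new List<string>();
                result[className] = values;
            }
            values.Add(value);
        }
        return result;
    }
}
```

Run against the sample paragraph, the "xn-money" key should collect the dollar amounts and "xn-chron" the two dates. A list per key is used because the same class appears several times in the sample.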

Source

2011-03-16 15:53:22

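My sample code doesn't work for nested spans, but then again there aren't any in the sample HTML you provided. – 2011-03-16 16:03:57

This works for getting the values, thanks. Do you have any idea how I could get the value of the "class" attribute as well? – 2011-03-16 16:09:36

This is exactly what I was looking for - you rock! Thanks – 2011-03-16 16:21:41

Assuming you have no nested span tags, the following should work: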

/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

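I have only done some basic testing on it, but it will match the class of the span tag (if there is one) along with the data up to the point where the tag is closed.

Source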
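For .NET, the same pattern can be written without the surrounding / delimiters, which are not part of .NET regex syntax. The sketch below is only an illustration under that assumption. The names OptionalClassSpans and Find are hypothetical, and returning null for a missing class is a design choice made here, not part of the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OptionalClassSpans
{
    // The pattern from the answer, in a C# verbatim string ("" is an escaped quote).
    private static readonly Regex Pattern =
        new Regex(@"<span(?:[^>]+class=""(.*?)""[^>]*)?>(.*?)</span>");

    // Returns (class, text) pairs; the key is null when the span has no class attribute.
    public static List<KeyValuePair<string, string>> Find(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(html))
        {
            // Group 1 sits inside an optional group, so check Success
            // to tell "no class attribute" apart from an empty class.
            string className = m.Groups[1].Success ? m.Groups[1].Value : null;
            pairs.Add(new KeyValuePair<string, string>(className, m.Groups[2].Value));
        }
        return pairs;
    }
}
```

As with the original pattern, this still assumes that spans are not nested.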

2011-03-16 15:39:35

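Cool, do you have any idea how I could use this in C# to return a dictionary of extracted values? Thanks. – 2011-03-16 15:50:40

I strongly recommend that you use a real HTML or XML parser instead. You cannot reliably parse HTML or XML with regular expressions - the most you can do is get close, and the closer you get, the more complex and time-consuming your regex becomes. If you have a large HTML file to parse, it is very likely to break any simple regex pattern.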

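A regex like <span[^>]*>(.*?)</span> will work on your example, but there is a lot of valid XML that is difficult or even impossible to parse with regular expressions (for example, nested spans such as <span>foo <span>bar</span></span> will break the pattern above). If you want something that will work on other HTML samples, regex isn't the way to go here.

Since your HTML isn't valid XML, consider the HTML Agility Pack, which I've heard is very good.

Source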
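As a rough illustration of that suggestion, the sketch below collects span text by class name using the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced. The names AgilitySpanExtractor and Extract are hypothetical, and the XPath query and dictionary shape are choices made for this example only.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class AgilitySpanExtractor
{
    // Maps each class name to the text of the spans carrying it, in document order.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of HTML that is not well-formed XML

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//span[@class]");
        if (spans == null)
        {
            return result;
        }

        foreach (HtmlNode span in spans)
        {
            string className = span.GetAttributeValue("class", "");
            // InnerText drops child tags; DeEntitize decodes entities such as &#160;.
            string text = HtmlEntity.DeEntitize(span.InnerText);

            if (!result.TryGetValue(className, out List<string> values))
            {
                values = new List<string>();
                result[className] = values;
            }
            values.Add(text);
        }
        return result;
    }
}
```

Unlike the regex versions, nested spans do not break this: the outer and inner span are both selected, and the outer span's text includes the inner span's text.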

2011-03-16 15:53:18
