regex
  • html-content-extraction
  • text-extraction
  • 2011-03-16 72 views 2 likes 
    2

    How can I extract values from HTML using a regex? Consider the following HTML:

    <p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq: <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p> 
    

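    I want to get the values inside the <span> elements. I also want to get the value of the class attribute on each <span> element.

    Ideally, I could run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing described above).

    The code above is a snippet from a larger source HTML file that cannot be parsed with an XML parser, so I am looking for a regular expression that can help extract the information of interest.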

    +0

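    What programming language are you using? There are libraries that accept HTML that is not valid XML and still let you query it with XPath expressions and the like. – 2011-03-16 15:26:37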

    +0

    Programming language = .NET – 2011-03-16 15:32:40

    Answers

    6

    Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

    Use this expression:

    "<span[^>]*>(.*?)</span>" 
    

    The value of group 1 (in each match) is the text you need.

    In C# it would look like this:

    // Requires: using System.Text.RegularExpressions;
    Regex regex = new Regex("<span[^>]*>(.*?)</span>");
    string toMatch = "<span class=\"ajjsjs\">Some text</span>";
    if (regex.IsMatch(toMatch))
    {
        MatchCollection collection = regex.Matches(toMatch);
        foreach (Match m in collection)
        {
            // Group 1 holds the text between the opening and closing tags
            string val = m.Groups[1].Value;
            // Do something with the value
        }
    }
    

    Amended in response to the comment:

    // Requires: using System.Text.RegularExpressions;
    Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
    string toMatch = "<span class=\"ajjsjs\">Some text</span>";
    if (regex.IsMatch(toMatch))
    {
        MatchCollection collection = regex.Matches(toMatch);
        foreach (Match m in collection)
        {
            // "class" is a reserved word in C#, so the variable needs another name
            string className = m.Groups[1].Value;
            string val = m.Groups[2].Value;
            // Do something with the class and value
        }
    }
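
    The amended pattern only matches when class is the sole attribute, written with double quotes and exactly one space after <span. The question's sample fits that shape, but other markup in the larger file may not. As a rough sketch only, a slightly more tolerant variant could look like the following. The class name TolerantSpanMatcher, the method name Extract and the pattern itself are illustrative assumptions, not part of the original answer, and nested spans are still not handled.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
public static class TolerantSpanMatcher
{
    // Group 1: the quote character that opens the class value (" or ').
    // Group 2: the class attribute value.
    // Group 3: the text up to the next </span>.
    private static readonly Regex Pattern = new Regex(
        @"<span\b[^>]*?\bclass\s*=\s*([""'])(.*?)\1[^>]*>(.*?)</span>",
        RegexOptions.IgnoreCase);

    // Returns (class, text) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(html))
        {
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups[2].Value, m.Groups[3].Value));
        }
        return pairs;
    }
}
```

    The backreference \1 makes the closing quote match the opening one, so both class="x" and class='x' are accepted, and [^>]*? lets other attributes appear before class.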
    
    +0

    My sample code does not work for nested spans, but then again there are none in the sample HTML you provided. – 2011-03-16 16:03:57

    +0

    This works for getting the values, thanks. Do you have any idea how I could get the value of the "class" attribute as well? – 2011-03-16 16:09:36

    +0

    This is exactly what I was looking for - you rock! Thanks – 2011-03-16 16:21:41

    2

    Assuming you have no nested span tags, the following should work:

    /<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/

    I have only done some basic testing on it, but it matches the class of the span tag (if present) as well as the data up to the closing tag.
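
    To get the dictionary the question asks for, the pattern above could be wrapped in a small C# function. This is a minimal sketch under a few assumptions: the names SpanExtractor and ExtractByClass are made up for illustration, the /.../ delimiters are dropped because .NET does not use them, and values are grouped in a list per class because the sample contains several spans with the same class. Spans without a class attribute end up under the empty-string key, and a span that has other attributes but no class is not matched by this pattern at all.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
public static class SpanExtractor
{
    // Group 1: class attribute value (empty when the span has no class).
    // Group 2: text between the opening tag and the next </span>.
    private static readonly Regex SpanPattern = new Regex(
        "<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns class name -> every value found for that class, in document order.
    public static Dictionary<string, List<string>> ExtractByClass(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            string className = m.Groups[1].Value;
            string value = m.Groups[2].Value.Trim();

            List<string> values;
            if (!result.TryGetValue(className, out values))
            {
                values = new List<string>();
                result[className] = values;
            }
            values.Add(value);
        }
        return result;
    }
}
```

    Calling SpanExtractor.ExtractByClass(html) on the snippet from the question should give the keys xn-location, xn-chron and xn-money, each mapped to the list of texts found for that class.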

    +0

    Cool. Do you have any idea how I could use this in C# to return a dictionary of the extracted values? Thanks. – 2011-03-16 15:50:40

    1

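    I strongly recommend using a real HTML or XML parser instead. You cannot reliably parse HTML or XML with regular expressions - the most you can do is come close, and the closer you get, the more complex and time-consuming your regex becomes. If you have a large HTML file to parse, it will very likely break any simple regex pattern.

    A regex like <span[^>]*>(.*?)</span> will work on your example, but there is plenty of valid XML that is difficult or even impossible to parse with a regex (for example, <span>foo <span>bar</span></span> will break the pattern above). If you want something that also works on other HTML samples, a regex is not the right approach here.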
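
    The nested-span failure is easy to reproduce. The sketch below (the names NestedSpanDemo and FirstSpanText are illustrative only) runs the simple pattern on the nested example and returns what group 1 captures. The lazy (.*?) stops at the first </span> it meets, which belongs to the inner span, so the captured text contains a stray opening tag.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original answer.
public static class NestedSpanDemo
{
    // Returns group 1 of the first match of the simple pattern,
    // or null when the input contains no matching span.
    public static string FirstSpanText(string html)
    {
        Match m = Regex.Match(html, "<span[^>]*>(.*?)</span>");
        return m.Success ? m.Groups[1].Value : null;
    }
}

// NestedSpanDemo.FirstSpanText("<span>foo <span>bar</span></span>")
// returns "foo <span>bar" rather than the full content of the outer span.
```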

    Since your HTML is not valid XML, consider the HTML Agility Pack, which I have heard is very good.
