2014-09-23 70 views

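Parsing elements with jsoup in Java

I'm using jsoup in my Java application to parse HTML, but now I need to parse table data. For each <tr>, I want to read the value of the first <td>. If that value contains the word "Outdated", the row should be skipped. If it is not outdated, I want to go on to the third <td> and get the values that contain ".rpm". I can't get this to work. I've tried many approaches without success, so I'm trying my luck here in case someone has experience with this.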

public class rpms { 

    public static void getTdSibling(String sourceTd) throws FileNotFoundException, UnsupportedEncodingException { 
     String fragment = sourceTd; 
     Document doc = Jsoup.parseBodyFragment(fragment); 
     Elements myElements = doc.getElementsByClass("confluenceTable tablesorter").first().getElementsByTag("tr"); 
     for (Element element : myElements) { 
      if (element.select("td").contains("Outdated")) { 
       String rpms = element.ownText(); 
       System.out.println(rpms); 
      } 
     } 
    } 

    public static void main(String[] args) { 
     URLget rpms = new URLget(); 
     try { 
      getTdSibling(sendGetRequest(URL).toString()); 

     } catch (MalformedURLException e) { 
      e.printStackTrace(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 
    } 
} 

Here is the HTML of the table whose elements I need to parse:

<table class="confluenceTable tablesorter"> 
    <tbody class=""> 
     <tr> 
      <td colspan="1" class="confluenceTd">RHSA-2014:1172</td> 
      <td colspan="1" class="confluenceTd"> 
       <p>The procmail program is used for local mail delivery. In addition to just 
        <br>delivering mail, procmail can be used for automatic filtering, presorting, 
        <br>and other mail handling jobs.</p> 
       <p>A heap-based buffer overflow flaw was found in procmail's formail utility. 
        <br>A remote attacker could send an email with specially crafted headers that, 
        <br>when processed by formail, could cause procmail to crash or, possibly, 
        <br>execute arbitrary code as the user running formail. (CVE-2014-3618) 
       </p> 
      </td> 
      <td colspan="1" class="confluenceTd">procmail-3.22-17.1.2.x86_64.rpm</td> 
      <td colspan="1" class="confluenceTd"> 
       <img class="emoticon emoticon-tick" src="/s/en_GB-1988229788/4733/f235dd088df5682b0560ab6fc66ed22c9124c0be.57/_/images/icons/emoticons/check.png" data-emoticon-name="tick" alt="(tick)"> 
      </td> 
     </tr> 

     <tr> 
      <td colspan="1" class="confluenceTd">Outdated RHSA-2014:1166</td> 
      <td colspan="1" class="confluenceTd"> 
       <p>Jakarta Commons HTTPClient implements the client side of HTTP standards.</p> 
       <p>It was discovered that the HTTPClient incorrectly extracted host name from 
        <br>an X.509 certificate subject's Common Name (CN) field. A man-in-the-middle 
        <br>attacker could use this flaw to spoof an SSL server using a specially 
        <br>crafted X.509 certificate. (CVE-2014-3577)</p> 
      </td> 
      <td colspan="1" class="confluenceTd"> 
       <p>jakarta-commons-httpclient-3.0-7jpp.4.el5_10.x86_64.rpm</p> 
       <p>jakarta-commons-httpclient-demo-3.0-7jpp.4.el5_10.x86_64.rpm</p> 
       <p>jakarta-commons-httpclient-javadoc-3.0-7jpp.4.el5_10.x86_64.rpm</p> 
       <p>jakarta-commons-httpclient-manual-3.0-7jpp.4.el5_10.x86_64.rpm</p> 
      </td> 
     </tr> 

     <tr> 
      <td colspan="1" class="confluenceTd">RHSA-2014:1148-1</td> 
      <td colspan="1" class="confluenceTd"> 
       <p>A flaw was found in the way Squid handled malformed HTTP Range headers. 
        <br>A remote attacker able to send HTTP requests to the Squid proxy could use 
        <br>this flaw to crash Squid. (CVE-2014-3609) 
       </p> 
       <p>A buffer overflow flaw was found in Squid's DNS lookup module. A remote 
        <br>attacker able to send HTTP requests to the Squid proxy could use this flaw 
        <br>to crash Squid. (CVE-2013-4115)</p> 
      </td> 
      <td colspan="1" class="confluenceTd"><span>squid-2.6.STABLE21-7.el5_10.x86_64.rpm</span> 
      </td> 
      <td colspan="1" class="confluenceTd"></td> 
     </tr> 
    </tbody> 
</table> 

I need your help. I've tried many times and read the articles here, but I couldn't get it to work. Thanks.

Answers


Be careful with how you access your elements (see the jsoup documentation):

You can only pass a single class name to getElementsByClass.
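
If both class names should be required, a CSS selector can express that instead. The following is a minimal sketch, assuming jsoup is on the classpath; the class `ClassSelectorDemo` and its two helper methods are hypothetical names used only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ClassSelectorDemo {

    // Counts <table> elements that carry BOTH classes, in any order.
    public static int countTablesWithBothClasses(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("table.confluenceTable.tablesorter").size();
    }

    // Counts elements that carry the single class "confluenceTable".
    public static int countBySingleClass(String html) {
        Document doc = Jsoup.parse(html);
        return doc.getElementsByClass("confluenceTable").size();
    }

    public static void main(String[] args) {
        String html =
              "<table class=\"tablesorter confluenceTable\"><tr><td>x</td></tr></table>"
            + "<table class=\"confluenceTable\"><tr><td>y</td></tr></table>";
        System.out.println(countTablesWithBothClasses(html));
        System.out.println(countBySingleClass(html));
    }
}
```

Chaining `.confluenceTable.tablesorter` in the selector matches each class separately, so the order of the names in the `class` attribute does not matter.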

public static void getTdSibling(String sourceTd) throws FileNotFoundException, UnsupportedEncodingException { 
    Document doc = Jsoup.parseBodyFragment(sourceTd); 
    // getElementsByClass takes a single class name 
    Elements rows = doc.getElementsByClass("confluenceTable").first().getElementsByTag("tr"); 
    for (Element row : rows) { 
     // select the TDs of this row 
     Elements tds = row.getElementsByTag("td"); 
     // skip rows without a third td and rows marked as outdated 
     if (tds.size() < 3 || tds.first().text().contains("Outdated")) { 
      continue; 
     } 
     // the package names are in the children (<p>, <span>) of the 3rd td, 
     // or directly in the td's own text when it has no children 
     Element third = tds.get(2); 
     Elements rpms = third.children(); 
     if (rpms.isEmpty()) { 
      rpms.add(third); 
     } 
     for (Element rpm : rpms) { 
      if (rpm.text().contains(".rpm")) { 
       System.out.println(rpm.text()); 
      } 
     } 
    } 
} 

Edit: the code now goes into the third TD of each row.
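
For comparison, the same requirement can be sketched with `select` and a returned list instead of printing. This assumes jsoup is on the classpath and that package names contain no whitespace; `RpmExtractor` and `extractRpms` are hypothetical names, not part of the original code. Note that `Elements.contains(...)` is the list-membership method inherited from `ArrayList`, so calling it with a String never matches; the sketch compares the cell's `text()` instead.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RpmExtractor {

    // Returns the ".rpm" names found in the third cell of every row
    // whose first cell does not contain "Outdated".
    public static List<String> extractRpms(String html) {
        List<String> rpms = new ArrayList<>();
        Document doc = Jsoup.parse(html);
        for (Element row : doc.select("table.confluenceTable tr")) {
            Elements cells = row.children(); // direct <td>/<th> children of the row
            if (cells.size() < 3) {
                continue; // no third cell to read
            }
            if (cells.get(0).text().contains("Outdated")) {
                continue; // skip outdated advisories
            }
            // text() flattens <p>/<span> wrappers into whitespace-separated text
            for (String token : cells.get(2).text().split("\\s+")) {
                if (token.endsWith(".rpm")) {
                    rpms.add(token);
                }
            }
        }
        return rpms;
    }

    public static void main(String[] args) {
        String html =
              "<table class=\"confluenceTable tablesorter\">"
            + "<tr><td>RHSA-2014:1172</td><td>desc</td><td>procmail-3.22-17.1.2.x86_64.rpm</td></tr>"
            + "<tr><td>Outdated RHSA-2014:1166</td><td>desc</td>"
            + "<td><p>jakarta-commons-httpclient-3.0-7jpp.4.el5_10.x86_64.rpm</p></td></tr>"
            + "</table>";
        for (String rpm : extractRpms(html)) {
            System.out.println(rpm);
        }
    }
}
```

Returning a list keeps the parsing separate from the output, so the caller can decide whether to print the names or process them further.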


Could you fix this line: 'tds:element.getElementsByTag("td");'? It is wrong. – user3278908 2014-09-24 03:40:37


My typo, sorry. There was also a missing ';'. – yunandtidus 2014-09-24 07:37:19