2012-08-08 82 views
2

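Looping through nodes created by HtmlAgilityPack

I need to parse this HTML using HtmlAgilityPack and C#. I can get the div class="patent_bibdata" node, but I don't know how to loop through its child nodes.

There are six hrefs in this sample, but I need to split them into two groups: inventors and classifications. I'm not interested in the last two. The div can contain any number of hrefs.

As you can see, each of the two groups is preceded by a piece of text that says what its hrefs are.

Code snippet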

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.google.com/patents/US3748943");
string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);

So, how would you do this?

<div class="patent_bibdata"> 
    <b>Inventors</b>:&nbsp; 
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22"> 
    Ronald T. Lashley 
    </a>, 
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22"> 
    Ronald T. Lashley 
    </a><br> 
    <b>Current U.S. Classification</b>:&nbsp; 
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>; 
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a><br> 
    <br> 
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://patft.uspto.gov/netacgi/nph-Parser%3FSect2%3DPTO1%26Sect2%3DHITOFF%26p%3D1%26u%3D/netahtml/PTO/search-bool.html%26r%3D1%26f%3DG%26l%3D50%26d%3DPALL%26RefSrch%3Dyes%26Query%3DPN/3748943&usg=AFQjCNGKUic_9BaMHWdCZtCghtG5SYog-A"> 
    View patent at USPTO</a><br> 
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://assignments.uspto.gov/assignments/q%3Fdb%3Dpat%26pat%3D3748943&usg=AFQjCNGbD7fvsJjOib3GgdU1gCXKiVjQsw"> 
    Search USPTO Assignment Database 
    </a><br> 
</div> 

Desired result: InventorGroup =

<a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22"> 
    Ronald T. Lashley 
    </a> 
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22"> 
    Thomas R. Lashley 
    </a> 

ClassificationGroup

<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>; 
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a> 

I am trying to scrape this page: http://www.google.com/patents/US3748943

// Anders

PS! I know the inventors' names are the same on this page, but on most pages they are different!

Answers

4

XPath is your friend! Something like this will give you the inventors' names:

HtmlWeb w = new HtmlWeb(); 
HtmlDocument doc = w.Load("http://www.google.com/patents/US3748943"); 
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='patent_bibdata']/br[1]/preceding-sibling::a")) 
{ 
    Console.WriteLine(node.InnerHtml); 
} 
+0

Nice! But how do I get the hrefs in the classification group? – Andis59 2012-08-08 17:01:15
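One possible way to extend the XPath above to the classification group is to count the `<br>` siblings that come before each link: in the sample markup the inventor links have none and the classification links have exactly one. The sketch below is only an illustration under that assumption. The `BibData` class, the `AnchorsAfterBreaks` helper and the shortened `SampleHtml` string are made-up names for this example, and the positional rule will break if the page layout changes.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BibData
{
    // Shortened stand-in for the real page (hypothetical hrefs, same tag layout).
    public const string SampleHtml =
        "<div class=\"patent_bibdata\">" +
        "<b>Inventors</b>: <a href=\"#i1\"> Ronald T. Lashley </a>, " +
        "<a href=\"#i2\"> Thomas R. Lashley </a><br>" +
        "<b>Current U.S. Classification</b>: <a href=\"#c1\">84/312.00P</a>; " +
        "<a href=\"#c2\">84/312.00R</a><br><br>" +
        "<a href=\"#u1\">View patent at USPTO</a><br>" +
        "<a href=\"#u2\">Search USPTO Assignment Database</a><br>" +
        "</div>";

    // Returns the trimmed text of every <a> child of div.patent_bibdata that has
    // exactly `precedingBreaks` <br> siblings before it (0 = inventors and
    // 1 = classifications in the sample layout).
    public static List<string> AnchorsAfterBreaks(HtmlDocument doc, int precedingBreaks)
    {
        string xpath = "//div[@class='patent_bibdata']/a[count(preceding-sibling::br)="
                       + precedingBreaks + "]";
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(a => a.InnerText.Trim()).ToList();
    }
}
```

To try it, load the markup with `doc.LoadHtml(BibData.SampleHtml)` (or `HtmlWeb.Load` for the live page) and call `BibData.AnchorsAfterBreaks(doc, 0)` for the inventors and `BibData.AnchorsAfterBreaks(doc, 1)` for the classifications. Use `a.GetAttributeValue("href", "")` instead of `InnerText` if you want the link targets rather than the link text.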

2

So obviously I don't understand XPath (yet), so I came up with this solution. It may not be the smartest solution, but it works!

// Anders

List<string> inventorList = new List<string>(); 
List<string> classificationList = new List<string>(); 

string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']"; 
HtmlNode nodes = m_doc.DocumentNode.SelectSingleNode(xpath); 
bool bInventors = false; 
bool bClassification = false; 
for (int i = 0; i < nodes.ChildNodes.Count; i++)
{
    HtmlNode node = nodes.ChildNodes[i];
    string txt = node.InnerText;
    // A label such as <b>Inventors</b> switches the group that is being collected.
    if (txt.IndexOf("Inventor") > -1)
    {
        bClassification = false;
        bInventors = true;
    }
    if (txt.IndexOf("Classification") > -1)
    {
        bClassification = true;
        bInventors = false;
    }
    // The trailing USPTO links belong to neither group.
    if (txt.IndexOf("USPTO") > -1)
    {
        bClassification = false;
        bInventors = false;
    }
    // Only <a> elements carry values. Compare the name exactly so that
    // other tags whose name contains the letter "a" are not picked up.
    if (node.Name == "a")
    {
        if (bInventors)
        {
            inventorList.Add(node.InnerText.Trim());
        }
        if (bClassification)
        {
            classificationList.Add(node.InnerText.Trim());
        }
    }