2017-04-20

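XPath to retrieve <a> href, text, and <span>

I am currently scraping some websites and retrieving information from them to store in a database for later use. I am using HtmlAgilityPack, and I have done this successfully for several websites, but for some reason this one is giving me trouble. I'm still fairly new to XPath syntax, so I may well be getting something wrong there.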

Here's what the markup on the website I want to retrieve from looks like:

<form ... id="_subcat_ids_"> 
    <input ....> 
    <ul ...> 
    <li ....> 
     <input .....> 
     <a class="facet-seleection multiselect-facets " 
     .... href="INeedThisHref#1"> 
     Text I Need       //need to retrieve this text between the <a></a>
     <span class="subtle-note">(2)</span> //I need that number from inside the span
     </a> 
    </li> 
    <li ....> 
     <input .....> 
     <a class="facet-seleection multiselect-facets " 
     .... href="INeedThisHref#2"> 
     Text I Need #2      //need to retrieve this text between the <a></a>
     <span class="subtle-note">(6)</span> //I need that number from inside the span
     </a> 
    </li> 

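Each of those <li> elements represents one item on the page, but I'm only interested in what is inside each <a></a>. I want to retrieve the href value from the <a>, then the text between its opening and closing tags, and then the text inside the <span>. I left out the contents of the other tags because they don't uniquely identify each item; the class on the <a> is the only thing the items share, and they all sit inside the form with id="_subcat_ids_".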

Here's my code:

try 
{ 
    string fullUrl = "..."; 
    HtmlWeb web = new HtmlWeb(); 
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12; 
    HtmlDocument html = web.Load(fullUrl); 

    foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']")) //this gets me into the form
    {
        foreach (HtmlNode node2 in node.SelectNodes(".//a[@class='facet-selection multiselect-facets ']")) //this should get me into the <a> tags, but it is throwing a fit with 'object reference not set to an instance of an object'
        {
            //get the href
            string tempHref = node2.GetAttributeValue("href", string.Empty);
            //get the text between <a>
            string tempCat = node2.InnerText.Trim();
            //get the text between <span>
            string tempNum = node2.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
        }
    }
} 
catch (Exception ex) 
{ 
    Console.Write("\nError: " + ex.ToString()); 
} 

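The first foreach loop runs without error, but the second one throws `object reference not set to an instance of an object` on the line where the second foreach loop begins. As I mentioned, I'm still new to this syntax. I have used this approach on another website with great success, but I'm having trouble with this one. Any tips would be greatly appreciated.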


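Check the details you provided for correctness: there are several typos and inaccuracies in your `XPath` expression and `HTML` sample, such as `seleection` vs. `selection`, the number of spaces in the class name... – Andersson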

Answer


OK, I figured it out. Here's the code:

foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']"))
{
    // get the categories, store in list
    foreach (HtmlNode node2 in node.SelectNodes("..//a[@class='facet-selection multiselect-facets ']//text()[normalize-space() and not(ancestor::span)]"))
    {
        string tempCat = node2.InnerText.Trim();
        categoryList.Add(tempCat);
        Console.Write("\nCategory: " + tempCat);
    }
    foreach (HtmlNode node3 in node.SelectNodes("..//a[@class='facet-selection multiselect-facets ']"))
    {
        // get href for each category, store in list
        string tempHref = node3.GetAttributeValue("href", string.Empty);
        LinkCatList.Add(tempHref);
        Console.Write("\nhref: " + tempHref);
        // get the number of items from categories, stripping the parentheses
        string tempNum = node3.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
        tempNum = tempNum.Replace("(", "").Replace(")", "");
        Console.Write("\nNumber of items: " + tempNum + "\n\n");
    }
}

Works like a charm.
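
For anyone hitting the same exception: `SelectNodes` returns null rather than an empty collection when nothing matches, so a `foreach` over its result throws `object reference not set to an instance of an object`. One likely reason `.//a` matched nothing while `..//a` did is that older HtmlAgilityPack releases register `form` in `HtmlNode.ElementsFlags` as an empty element, so the form's children end up parsed as its siblings. That explanation is an assumption about the library version in use here, not something confirmed in this thread. Below is a minimal sketch of a single-loop variant under that assumption. The inline HTML, the `/cat/1` style hrefs, the category names, and the `FacetScraper` class name are all made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class FacetScraper
{
    static void Main()
    {
        // Assumption: this HtmlAgilityPack version flags <form> as an empty
        // element. Removing the flag lets the form's children nest normally.
        // Remove() simply returns false if the key is not present.
        HtmlNode.ElementsFlags.Remove("form");

        // Hypothetical inline sample standing in for the real page.
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<form id='_subcat_ids_'><ul>
            <li><a class='facet-selection multiselect-facets ' href='/cat/1'>
                Shoes <span class='subtle-note'>(2)</span></a></li>
            <li><a class='facet-selection multiselect-facets ' href='/cat/2'>
                Hats <span class='subtle-note'>(6)</span></a></li>
        </ul></form>");

        // Match the class as a whitespace-separated token, so trailing or
        // extra spaces in the attribute value no longer matter.
        const string anchorXPath =
            "//form[@id='_subcat_ids_']" +
            "//a[contains(concat(' ', normalize-space(@class), ' '), ' facet-selection ')]";

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(anchorXPath);
        if (anchors == null)
        {
            Console.WriteLine("No matching <a> elements found.");
            return;
        }

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", string.Empty);

            // Use only the anchor's own text nodes, so the <span> text is excluded.
            string category = string.Concat(
                a.ChildNodes
                 .Where(n => n.NodeType == HtmlNodeType.Text)
                 .Select(n => n.InnerText)).Trim();

            // The span may be missing; ?. avoids another null reference.
            string count = a.SelectSingleNode("./span[@class='subtle-note']")
                            ?.InnerText.Trim('(', ')', ' ') ?? string.Empty;

            Console.WriteLine($"{category} | {href} | {count}");
        }
    }
}
```

Reading all three values from the same `<a>` node keeps each category, href, and count together, which avoids relying on two separate lists staying in the same order. If the form flag is not the cause on your version, the null check will at least report the empty match instead of throwing.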