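XPath to retrieve <a> href, text, and <span>

I'm currently scraping a few websites and retrieving information from them to store in a database for later use. I'm using HtmlAgilityPack, and I've done this successfully for several sites, but for some reason this one is giving me trouble. I'm still new to XPath syntax, so I may well have messed something up there.
Here's what the markup on the site I want to retrieve from looks like: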
<form ... id="_subcat_ids_">
<input ....>
<ul ...>
<li ....>
<input .....>
<a class="facet-seleection multiselect-facets "
.... href="INeedThisHref#1">
Text I Need //need to retrieve this text between the <a></a>
<span class="subtle-note">(2)</span> //I Need that number from inside the span
</a>
</li>
<li ....>
<input .....>
<a class="facet-seleection multiselect-facets "
.... href="INeedThisHref#2">
Text I Need #2 //need to retrieve this text between the <a></a>
<span class="subtle-note">(6)</span> //I Need that number from inside the span
</a>
</li>
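Each of those <li> elements represents one item on the page, but I'm only interested in what is inside each <a></a>. I want to retrieve the href value from the <a>, then the text between its opening and closing tags, and then the text inside the <span>. I left out the contents of the other tags because they don't uniquely identify each item; the class on the <a> is the only thing the items share, and they all sit inside the form with id="_subcat_ids_".
Here's my code: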
try
{
string fullUrl = "...";
HtmlWeb web = new HtmlWeb();
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
HtmlDocument html = web.Load(fullUrl);
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']")) //this gets me into the form
{
foreach (HtmlNode node2 in node.SelectNodes(".//a[@class='facet-selection multiselect-facets ']")) //this should get me into the <a> tags, but it is throwing a fit with 'object reference not set to an instance of an object'
{
//get the href
string tempHref = node2.GetAttributeValue("href", string.Empty);
//get the text between <a>
string tempCat = node2.InnerText.Trim();
//get the text between <span>
string tempNum = node2.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
}
}
}
catch (Exception ex)
{
Console.Write("\nError: " + ex.ToString());
}
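The first foreach loop runs without errors, but the second one gives me "object reference not set to an instance of an object" on the line where the second foreach loop starts. As I mentioned, I'm still new to this syntax. I've used this same approach on another website with great success, but I'm having trouble with this one. Any tips would be greatly appreciated.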
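Check the correctness of the details you've provided, as there are several typos/inaccuracies in your XPath expression and HTML sample, like 'seleection'/'selection', the number of spaces in the class name... – Andersson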
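
To illustrate the point in the comment, here is a minimal sketch rather than a definitive fix. It assumes the real page matches the posted sample, including the 'facet-seleection' spelling and the trailing space in the class attribute; the inline sample string and the variable names are made up for illustration. An exact @class='...' comparison only matches when the whole attribute string is identical, so the sketch uses contains() instead. It also guards against null, because HtmlAgilityPack's SelectNodes and SelectSingleNode return null when nothing matches. The ElementsFlags line is a further assumption: some HtmlAgilityPack versions are reported to parse <form> as an empty element, which would leave the <a> tags outside the form node.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the page, modelled on the posted sample
        // (class spelling and trailing space copied as-is).
        const string sample = @"
<form id='_subcat_ids_'>
  <ul>
    <li><input type='checkbox'>
      <a class='facet-seleection multiselect-facets ' href='INeedThisHref#1'>
        Text I Need
        <span class='subtle-note'>(2)</span>
      </a>
    </li>
    <li><input type='checkbox'>
      <a class='facet-seleection multiselect-facets ' href='INeedThisHref#2'>
        Text I Need #2
        <span class='subtle-note'>(6)</span>
      </a>
    </li>
  </ul>
</form>";

        // Assumption: some HtmlAgilityPack versions flag <form> as an empty
        // element, so its children end up as siblings. Removing the flag is
        // harmless if it is not present.
        HtmlNode.ElementsFlags.Remove("form");

        var html = new HtmlDocument();
        html.LoadHtml(sample);

        // contains() tolerates extra spaces and extra class names,
        // unlike an exact @class='...' comparison.
        HtmlNodeCollection links = html.DocumentNode.SelectNodes(
            "//form[@id='_subcat_ids_']//a[contains(@class, 'multiselect-facets')]");

        // SelectNodes returns null, not an empty collection, on no match.
        if (links == null)
        {
            Console.WriteLine("No matching <a> elements found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);

            // Use only the <a>'s direct text nodes so the span's "(2)"
            // is not mixed into the link text.
            string text = string.Concat(
                link.ChildNodes
                    .Where(n => n.NodeType == HtmlNodeType.Text)
                    .Select(n => n.InnerText)).Trim();

            // SelectSingleNode also returns null on no match, hence ?. and ??.
            string count = link.SelectSingleNode(".//span[@class='subtle-note']")
                               ?.InnerText.Trim('(', ')', ' ') ?? string.Empty;

            Console.WriteLine($"{href} | {text} | {count}");
        }
    }
}
```

Against the real site, LoadHtml(sample) would be replaced by the web.Load(fullUrl) call from the question. If the null branch is still hit, saving html.DocumentNode.OuterHtml and comparing it with what the browser shows is a reasonable next step, since the server may return different markup to a non-browser client.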