2012-03-08 41 views
0

Compare a list of XPaths to find the one closest to another node? I have the following node:

"/html[1]/body[1]/div[1]/div[1]/div[3]/div[1]/div[7]/p[1]/#text[1]" 

How can I figure out that, of these, the last one is the closest?

"/html[1]/body[1]/div[1]/div[1]/div[3]/div[1]/div[4]/div[1]/img[1]" 
"/html[1]/body[1]/div[1]/div[1]/div[3]/div[1]/div[4]/div[3]/a[1]/img[1]" 
"/html[1]/body[1]/div[1]/div[1]/div[3]/div[1]/div[4]/div[3]/a[2]/img[1]" 
"/html[1]/body[1]/div[1]/div[1]/div[3]/div[1]/div[4]/div[5]/img[1]" 
"/html[1]/body[1]/div[1]/div[1]/div[3]/div[1]/div[5]/div[1]/img[1]" 

It won't necessarily be the last one.

Here is how I got there:

protected string GuessThumbnail(HtmlDocument document) 
{ 
    HtmlNode root = document.DocumentNode; 
    IEnumerable<string> result = new List<string>(); 

    HtmlNode description = root.SelectSingleNode(DescriptionPredictiveXPath); 
    if (description != null) // in this case, we predict relevant images are the ones closest to the description text node. 
    { 
        HtmlNode node = description.ParentNode;
        while (node != null)
        {
            string path = string.Concat(node.XPath, ImageXPath);
            node = node.ParentNode;
            IEnumerable<HtmlNode> nodes = root.SelectNodesOrEmpty(path);

            // find the image tag that's closest to the text node.
            if (nodes.Any())
            {
                // materialize the candidate XPaths so they can be compared.
                List<string> xpaths = nodes.Select(n => n.XPath).ToList();

                // return closest
            }
        }
    } 
    // figure some other way to do it 

    throw new NotImplementedException(); 
} 

By closest, do you mean how close it is to the target element within the document structure? – JamieSee 2012-03-08 17:56:01


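Yes, that's it. I want to know that 'div[5]' is closer to 'div[7]' than 'div[4]' is, and if there is more than one 'div[5]', then check the next level down, and so on until the closest element is found. – bevacqua 2012-03-08 17:59:12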


Is your code using the Html Agility Pack from CodePlex? – JamieSee 2012-03-08 18:27:57

Answers

0

I did it like this:

protected string GuessThumbnail(HtmlDocument document)
{
    HtmlNode root = document.DocumentNode;
    HtmlNode description = root.SelectSingleNode(DescriptionPredictiveXPath);

    if (description != null)
    {
        // in this case, we predict relevant images are the ones closest to the description text node.
        HtmlNode parent = description.ParentNode;
        while (parent != null)
        {
            string path = string.Concat(parent.XPath, ImageXPath);
            IList<HtmlNode> images = root.SelectNodesOrEmpty(path).ToList();

            // find the image tag that's closest to the text node.
            if (images.Any())
            {
                HtmlNode descriptionOutermost = description.ParentNodeUntil(parent); // the child of parent that contains the description.
                int descriptionIndex = descriptionOutermost.GetIndex(); // index of that child among parent's children.

                HtmlNode closestToDescription = null;
                int distanceToDescription = int.MaxValue;

                foreach (HtmlNode image in images)
                {
                    int index = image.ParentNodeUntil(parent).GetIndex(); // index of the child of parent that contains the image.
                    int distance = Math.Abs(descriptionIndex - index); // sibling distance, whichever side the image is on.
                    if (distance < distanceToDescription)
                    {
                        closestToDescription = image;
                        distanceToDescription = distance;
                    }
                }
                if (closestToDescription != null)
                {
                    string source = closestToDescription.Attributes["src"].Value;
                    return source;
                }
            }

            parent = parent.ParentNode;
        }
    }
    // figure some other way to do it

    throw new NotImplementedException();
}


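// Extension methods must live in a static class; the class name here is illustrative.
public static class HtmlNodeExtensions
{
    // Walks up from node and returns the ancestor (or node itself) that is a direct child of parent.
    // Assumes parent really is an ancestor of node.
    public static HtmlNode ParentNodeUntil(this HtmlNode node, HtmlNode parent)
    {
        while (node.ParentNode != parent)
        {
            node = node.ParentNode;
        }
        return node;
    }

    // Returns the position of node among its parent's child nodes.
    public static int GetIndex(this HtmlNode node)
    {
        return node.ParentNode.ChildNodes.IndexOf(node);
    }
}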
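
For comparison, the level-by-level check described in the comments can also be sketched directly on the XPath strings, without touching the DOM. This is only a rough sketch under a few assumptions: the class and method names (`XPathProximity`, `Closest`) are made up for illustration, every path is absolute and uses explicit `[n]` positions as in the question, and "closest" means "shares the longest prefix with the target, then has the smallest position gap at the first differing step". Note that XPath positions count only same-named siblings, so this is an approximation of the child-index distance used above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Html Agility Pack.
public static class XPathProximity
{
    // Splits an absolute path such as "/html[1]/body[1]/div[3]" into its steps.
    private static string[] Steps(string xpath)
    {
        return xpath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Returns the element name of a step, e.g. "div" for "div[7]".
    private static string Name(string step)
    {
        int open = step.LastIndexOf('[');
        return open < 0 ? step : step.Substring(0, open);
    }

    // Returns the numeric position of a step, e.g. 7 for "div[7]"; 1 if no index is present.
    private static int Position(string step)
    {
        int open = step.LastIndexOf('[');
        if (open < 0 || !step.EndsWith("]")) return 1;
        return int.Parse(step.Substring(open + 1, step.Length - open - 2));
    }

    // Picks the candidate sharing the longest prefix with target; ties are broken
    // by the smallest position gap at the first step where the paths differ.
    public static string Closest(string target, IEnumerable<string> candidates)
    {
        string[] targetSteps = Steps(target);
        string best = null;
        int bestShared = -1;
        int bestGap = int.MaxValue;

        foreach (string candidate in candidates)
        {
            string[] steps = Steps(candidate);
            int limit = Math.Min(steps.Length, targetSteps.Length);

            // Count the leading steps that are identical in both paths.
            int shared = 0;
            while (shared < limit && steps[shared] == targetSteps[shared]) shared++;

            // Positions are only comparable when both steps name the same element.
            int gap = int.MaxValue;
            if (shared < limit && Name(steps[shared]) == Name(targetSteps[shared]))
            {
                gap = Math.Abs(Position(steps[shared]) - Position(targetSteps[shared]));
            }

            if (shared > bestShared || (shared == bestShared && gap < bestGap))
            {
                best = candidate;
                bestShared = shared;
                bestGap = gap;
            }
        }
        return best;
    }
}
```

With the paths from the question, all five candidates share the first six steps with the description node, and `Closest` returns the last one, because `div[5]` is two positions away from `div[7]` while `div[4]` is three.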
0

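Consider assigning each node its "position in the whole tree in depth-first order". Comparing two nodes then becomes very simple.

If you can attach arbitrary data to your nodes, add the position directly. Otherwise, keep a dictionary that maps every node to its position.

Note that, depending on how many times you need to make this comparison, this approach may slow you down, but it should be easy to implement and to measure against your requirements.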
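
A minimal sketch of the dictionary variant follows. It is written generically so that it does not depend on any particular DOM library; the names `DocumentOrder`, `Index` and `Closest` are illustrative, and "closest" here is assumed to mean the smallest difference in depth-first position, which is not necessarily the same notion of closeness as the sibling-based one in the other answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the "depth-first position" idea.
public static class DocumentOrder
{
    // Walks the tree depth-first (pre-order) and records each node's visit number.
    public static Dictionary<T, int> Index<T>(T root, Func<T, IEnumerable<T>> children)
    {
        var positions = new Dictionary<T, int>();
        var pending = new Stack<T>();
        pending.Push(root);
        int next = 0;

        while (pending.Count > 0)
        {
            T node = pending.Pop();
            positions[node] = next++;

            // Push children in reverse so the first child is visited first.
            var kids = new List<T>(children(node));
            for (int i = kids.Count - 1; i >= 0; i--)
            {
                pending.Push(kids[i]);
            }
        }
        return positions;
    }

    // Returns the candidate whose depth-first position is nearest to the target's.
    public static T Closest<T>(T target, IEnumerable<T> candidates, Dictionary<T, int> positions)
    {
        T best = default(T);
        int bestDistance = int.MaxValue;
        int targetPosition = positions[target];

        foreach (T candidate in candidates)
        {
            int distance = Math.Abs(positions[candidate] - targetPosition);
            if (distance < bestDistance)
            {
                best = candidate;
                bestDistance = distance;
            }
        }
        return best;
    }
}
```

With Html Agility Pack, the index could presumably be built once per document with something like `DocumentOrder.Index(document.DocumentNode, n => n.ChildNodes)` and then reused for every comparison, which keeps the cost of the traversal from being paid on each lookup.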