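Getting the text of anchors inside a text node

I'm parsing product reviews on Amazon, and I want the complete text of each review, including the text inside its links.

I'm currently using jSoup and, as good as it is, it ignores the anchors. Of course I could use a selector to get all the text from the anchors, but then I would lose the information about where that text sits in the surrounding sentence.

I think an example is the best way to explain myself.

Sample structure: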

<div class="container">
    <div style="a">Something...</div>
    <div style="b">...Nested spans and divs... </div>
    <div class="tiny">_____ </div>
    " From the makers of the incredible <a href="SOMELINK">SOMEPRODUCT</a> we have this other product that blablabla.... Amazing specs, but <a href="SOME_OTHER_LINK">this other product</a> is somehow better".
</div>

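What I get: "From the makers of the incredible we have this other product that blablabla.... Amazing specs, but is somehow better".

What I want: "From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better".

My code using jSoup: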

Elements allContainers = doc.select(".container");
for (Element container : allContainers) {
    String reviewText = container.ownText(); // THIS EXCLUDES TEXT FROM LINKS
    StdOut.println(reviewText);
}

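I can't find a way to do this, because it doesn't look like jSoup treats text nodes as actual nodes, so those anchors don't seem to be considered children of the text node.

I'm also open to other ideas, such as trying to get them with the :not selector, but I can't believe jSoup doesn't allow keeping the link text. It's too common a need to believe they overlooked it.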

2012-10-24 Tex

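"it doesn't look like jSoup treats text nodes as actual nodes"

No. JSoup text nodes are actual nodes, just as elements are.

The way you describe the problem, you have a very specific requirement, and I agree that nothing built in does exactly what you want in a single call. However, with a simple helper method the problem is solvable.

First let's review the problem. The parent div has the following children:

div div div #text a #text a #text

Of course, each div and a element has children of its own, including text nodes. Based on your example, it sounds like you want to walk through all the children, ignoring everything until you reach a text node. Once the first text node is found, collect its text and the text of every node after it.

Definitely doable, but I'm not surprised there's no built-in method for it.

Here is an implementation that solves the problem: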
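To make the child list above concrete, here is a minimal sketch that prints the node name of each direct child of a container. The class name ChildNodeNames, the helper childNames and the shortened HTML string are made up for illustration; only childNodes() and nodeName() come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

import java.util.ArrayList;
import java.util.List;

public class ChildNodeNames {

    // Hypothetical helper: returns nodeName() for every direct child
    // of the first element with class "container".
    static List<String> childNames(String html) {
        Element container = Jsoup.parse(html).selectFirst(".container");
        List<String> names = new ArrayList<>();
        for (Node child : container.childNodes()) {
            // Text nodes report "#text"; elements report their tag name.
            names.add(child.nodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the sample structure in the question.
        String html = "<div class=\"container\"><div>x</div>"
                + "Intro <a href=\"#\">link</a> tail</div>";
        System.out.println(childNames(html));
    }
}
```

The text nodes and the anchor show up side by side as siblings, which is what makes a sibling walk possible. As far as I know, jsoup also keeps whitespace-only text nodes, so pretty-printed HTML like the sample may show extra #text entries between the divs.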

import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public static String textPlus(Element elem)
{
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty())
        return "";

    StringBuilder result = new StringBuilder();
    // start at the first text node
    Node currentNode = textNodes.get(0);
    while (currentNode != null)
    {
        // append the text of this node and of every sibling after it
        if (currentNode instanceof TextNode)
        {
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
        }
        else if (currentNode instanceof Element)
        {
            // text() on an element includes the text of its descendants
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString();
}

To call it, use:

Elements allContainers = doc.select(".container"); 
for (Element container : allContainers) { 
    String reviewText = textPlus(container); 
    StdOut.println(reviewText); 
}

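Given your sample HTML, this code returns:

"From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better."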
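One caveat with textPlus: if the HTML is indented, the first text node of the container may be nothing but whitespace, and the walk would then start before the leading divs. The following is a sketch of a variant that skips blank text nodes, not a tested drop-in replacement. The class and method names are invented for the example, and it assumes TextNode.isBlank() is available in your jsoup version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class TextFromFirstText {

    // Hypothetical variant of textPlus: starts collecting at the first
    // text node that contains something other than whitespace.
    static String textFromFirstRealText(String html) {
        Element container = Jsoup.parse(html).selectFirst(".container");
        StringBuilder result = new StringBuilder();
        boolean started = false;
        for (Node child : container.childNodes()) {
            if (child instanceof TextNode) {
                TextNode text = (TextNode) child;
                if (!started && text.isBlank()) {
                    continue; // indentation between the leading divs
                }
                started = true;
                result.append(text.text());
            } else if (started && child instanceof Element) {
                // After the start point, include the deep text of elements.
                result.append(((Element) child).text());
            }
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<div class=\"container\">\n"
                + "  <div>Skip me</div>\n"
                + "  <div class=\"tiny\">___</div>\n"
                + "  From the makers of <a href=\"#\">SOMEPRODUCT</a> comes this.\n"
                + "</div>";
        System.out.println(textFromFirstRealText(html));
    }
}
```

The leading divs are ignored because nothing is collected until a non-blank text node has been seen.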

Hope this helps.

2012-10-24 03:27:49

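Unfortunately not! If I use container.text(), I get EVERYTHING contained in the div. Going back to the example, the result would be: "Something... (text included in) Nested spans and divs... ____ \"From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better\"" – Tex

Got it. I've updated the answer. –

Very close, so accepted :-) – Tex

I accepted Guido's answer because, even though it didn't work for me as written, it definitely put me on the right track.

Guido's code takes the text from the first text node and then iterates through its siblings. Unfortunately, my real-world example had two further complications:

1 - Nothing yet restricted the extra text to anchors specifically, as opposed to any other element. I wanted something more robust, so I added that selection inside Guido's structure.

2 - It would still pick up text from unwanted links, such as the "Comment" and "Permalink" links at the end of every Amazon review. The other selectors are there to filter those out.

I'm posting the code that did work for me, for future reference. Hope it helps :-)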

import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public static String textPlus(Element elem)
{
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty())
        return "";

    StringBuilder result = new StringBuilder();

    // start at the first text node
    Node currentNode = textNodes.get(0);

    while (currentNode != null)
    {
        if (currentNode instanceof TextNode)
        {
            // plain review text: always keep it
            TextNode currentText = (TextNode) currentNode;
            result.append("\n\n" + currentText.text());
        }
        else if (currentNode instanceof Element)
        {
            // from elements keep only link text, skipping the "Comment" and "Permalink" links
            Element currentElement = (Element) currentNode;
            Elements anchorElements = currentElement.select("a[href]").select(":not(:contains(Comment))").select(":not(:contains(Permalink))");
            for (Element anchorElement : anchorElements)
                result.append("\n\n" + anchorElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString().trim();
}
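The three chained select calls can probably be folded into one selector. This is a small sketch of that idea; the class name AnchorFilter, the helper reviewLinkTexts and the sample links are invented for the example. Note that jsoup's :contains matches case-insensitively, so a link reading "Comments (3)" would be dropped as well.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class AnchorFilter {

    // Hypothetical helper: text of every link that has an href and whose
    // text mentions neither "Comment" nor "Permalink".
    static List<String> reviewLinkTexts(String html) {
        Elements anchors = Jsoup.parse(html)
                .select("a[href]:not(:contains(Comment)):not(:contains(Permalink))");
        List<String> texts = new ArrayList<>();
        for (Element anchor : anchors) {
            texts.add(anchor.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/p1\">SOMEPRODUCT</a> "
                + "<a href=\"/c\">Comment</a> "
                + "<a href=\"/l\">Permalink</a> "
                + "<a name=\"x\">no href</a></div>";
        System.out.println(reviewLinkTexts(html));
    }
}
```

One difference worth knowing: a chained select on an Elements object also searches the descendants of the elements already matched, so the chained form can return elements nested inside an anchor, while the single selector only ever returns anchors.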

2012-10-24 22:05:26 Tex

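I haven't tested it, but according to the jsoup API documentation for the Element class, you should use the text method rather than ownText.

text

public String text()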

Gets the combined text of this element and all its children. 

For example, given HTML <p>Hello <b>there</b> now!</p>, p.text() returns "Hello there now!" 

Returns: 
    unencoded text, or empty string if none. 
See Also: 
    ownText(), textNodes()

ownText

public String ownText()

Gets the text owned by this element only; does not get the combined text of all children. 

For example, given HTML <p>Hello <b>there</b> now!</p>, p.ownText() returns "Hello now!", whereas p.text() returns "Hello there now!". Note that the text within the b element is not returned, as it is not a direct child of the p element. 

Returns: 
    unencoded text, or empty string if none. 
See Also: 
    text(), textNodes()
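The documentation's own example is easy to try out. A minimal sketch follows; the class name TextVsOwnText and the helper textAndOwnText are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TextVsOwnText {

    // Hypothetical helper: returns { text(), ownText() } for the first <p>.
    static String[] textAndOwnText(String html) {
        Element p = Jsoup.parse(html).selectFirst("p");
        return new String[] { p.text(), p.ownText() };
    }

    public static void main(String[] args) {
        String[] both = textAndOwnText("<p>Hello <b>there</b> now!</p>");
        System.out.println("text():    " + both[0]); // Hello there now!
        System.out.println("ownText(): " + both[1]); // Hello now!
    }
}
```

text() is all-or-nothing for descendants, while ownText() drops them entirely, so neither alone keeps link text while leaving out sibling divs.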

2012-11-05 00:06:56 mirek

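Yes, but unfortunately the inner DIVs are children of the outer DIV that holds the text, so using text() alone won't work :-) In the end I did use text(), but together with a filter that discards all non-link nodes (element.select("a[href]")) – Tex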
