With Jsoup

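I want to integrate a stemmer, KrovetzStemmer, for the pages I download. My biggest problem is that I cannot simply call body().text() on a given document and then stem all the words. The reason is that the href links must not be stemmed at all. So I thought that if I could get the body together with the href links, I could split it at each href and then use a LinkedHashMap with an Element as the key and a Boolean (or an enum type) as the value, specifying whether the Element is text or a link.

So the question is, given this HTML: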

<!DOCTYPE html> 
<html> 
<body> 
<h1> This is the heading part. This is for testing purposes only.</h1> 
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a> 
<p>This is the first paragraph to be considered.</p> 
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a> 
<p>This is the second paragraph to be considered.</p> 
<img border="0" src="/images/pulpit.jpg" alt="Pulpit rock" width="304" height="228"> 
<a href="http://www.thirdsite.com">Third Link</a> 
</body> 
</html>

I want to be able to get only this:

This is the heading part. This is for testing purposes only. 
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a> 
This is the first paragraph to be considered. 
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a> 
This is the second paragraph to be considered. 
<a href="http://www.thirdsite.com">Third Link</a>

Then I want to split it and insert the pieces into a LinkedHashMap, so that if I do this:

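int i = 1;
// entrySet() iterates a LinkedHashMap in insertion order
for (Entry<Element, Boolean> entry : splitedList.entrySet()) {
    // the Boolean is true for links, so only the text entries are printed
    if (!entry.getValue()) { System.out.println(i + ": " + entry.getKey()); }
    i++;
}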

it would print:

1: This is the heading part. This is for testing purposes only. 
3: This is the first paragraph to be considered. 
5: This is the second paragraph to be considered.

That way I can apply the stemmer and still keep the iteration order.
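The ordered text/link bookkeeping described above can be sketched in plain Java, without Jsoup. This is only an illustration under assumptions: the `Segment` class, the `SegmentDemo` class and the `sample()` data are hypothetical names made up for this example, and a `List` is used in place of the `LinkedHashMap` from the question, because two identical pieces of text would collide as map keys while a list keeps both.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Jsoup or KrovetzStemmer.
class SegmentDemo {

    /** One piece of the page body, stored in document order. */
    static final class Segment {
        final String content;
        final boolean isLink;

        Segment(String content, boolean isLink) {
            this.content = content;
            this.isLink = isLink;
        }
    }

    /** Hand-built segments mirroring the sample HTML from the question. */
    static List<Segment> sample() {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("This is the heading part. This is for testing purposes only.", false));
        segments.add(new Segment("<a href=\"http://www.firstsite.com/this is a sub directory/\">First Link</a>", true));
        segments.add(new Segment("This is the first paragraph to be considered.", false));
        segments.add(new Segment("<a href=\"http://www.secondsite.com/it is the correct page/\">Second Link</a>", true));
        segments.add(new Segment("This is the second paragraph to be considered.", false));
        segments.add(new Segment("<a href=\"http://www.thirdsite.com\">Third Link</a>", true));
        return segments;
    }

    /** Numbers every segment by position but returns only the text ones. */
    static List<String> numberedText(List<Segment> segments) {
        List<String> lines = new ArrayList<>();
        int i = 1;
        for (Segment s : segments) {
            if (!s.isLink) {
                // A stemmer would be applied to s.content at this point.
                lines.add(i + ": " + s.content);
            }
            i++;
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : numberedText(sample())) {
            System.out.println(line);
        }
    }
}
```

Because the links stay in the list, the positions 2, 4 and 6 remain reserved for them, so the stemmed text can later be merged back with the untouched links in the original order.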

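Right now I don't know how to implement this, because I don't know how to:

A) Get the body with only the href links

B) Split the body (I know we can always use String split(), but I am talking about splitting the elements of the page body)

How can I accomplish these two things?

Also, I am not sure whether my solution is a good one. Is there a better/simpler way to do this?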

Source

2014-03-30 Sarp Kaya

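For better help, try adding an input example and the expected output/result, along with some explanation of why it should be that way. – Pshemo

@Pshemo I have added an example now. –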

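Now that I understand what you are asking, I have updated this post with a new answer:

Assume doc is your HTML document, obtained by parsing the given HTML.

You can select all the a tags and wrap each of them in an <xmp> tag (see here)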

for (Element element : doc.body().select("a")) 
    element.wrap("<xmp></xmp>");

Now the new HTML has to be loaded into doc again, so that Jsoup does not parse what is inside the <xmp> tags

doc = Jsoup.parse(doc.html()); 
System.out.println(doc.body().text());

The output will be:

This is the heading part. This is for testing purposes only. 
<a href="http://www.firstsite.com/this is a sub directory/">First Link</a> 
This is the first paragraph to be considered. 
<a href="http://www.secondsite.com/it is the correct page/">Second Link</a> 
This is the second paragraph to be considered. 
<a href="http://www.thirdsite.com">Third Link</a>

Now you can go on and do whatever you want with the output.

Update: code for splitting, based on the comments

// Wrap every link in <xmp> and put a marker before and after it
for (Element element : doc.body().select("a"))
    element.wrap("<xmp>split-me-here</xmp>split-me-here");

// Re-parse so that the content of <xmp> is kept as literal text
doc = Jsoup.parse(doc.html());

int cnt = 0;
List<String> splitText = Arrays.asList(doc.body().text().split("split-me-here"));
for (String text : splitText) {
    cnt++;
    // Skip the segments that hold a link
    if (!text.contains("</a>"))
        System.out.println(cnt + "." + text.trim());
}

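The code above prints the following:

1.This is the heading part. This is for testing purposes only.
3.This is the first paragraph to be considered.
5.This is the second paragraph to be considered.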
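If the link segments have to be kept rather than skipped, one possible variation is to split the literal text with a regular expression instead of a marker string. The sketch below is an assumption, not the method used in this answer: `AnchorSplitter`, `split` and `isLink` are hypothetical names, the input is assumed to be text shaped like the `doc.body().text()` output shown earlier, and the pattern assumes that the anchors are not nested and that no attribute value contains `>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; uses only the Java standard library.
class AnchorSplitter {

    // One literal <a ...>...</a> run. Assumes anchors are not nested.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>.*?</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Splits text into plain-text parts and anchor parts, in their original order. */
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = ANCHOR.matcher(text);
        int last = 0;
        while (m.find()) {
            addIfNotBlank(parts, text.substring(last, m.start())); // text before the link
            parts.add(m.group());                                  // the link itself, untouched
            last = m.end();
        }
        addIfNotBlank(parts, text.substring(last));                // text after the last link
        return parts;
    }

    /** True when the whole part is a single anchor. */
    static boolean isLink(String part) {
        return ANCHOR.matcher(part).matches();
    }

    private static void addIfNotBlank(List<String> parts, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String text = "This is the first paragraph to be considered. "
                + "<a href=\"http://www.thirdsite.com\">Third Link</a>";
        int i = 1;
        for (String part : split(text)) {
            System.out.println(i++ + (isLink(part) ? " [link] " : " [text] ") + part);
        }
    }
}
```

With this split, only the parts for which `isLink` returns false would be passed to the stemmer, and the link parts keep their place in the list.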

Source

2014-03-30 09:01:21 AKS

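I don't think you understood it correctly. I don't want to remove anything from the elements. As I mentioned, I could stem all the words simply by calling '.body().text()', which already returns the stripped text. –

So do you need the body text or the element text? – AKS

I need both the text and the 'href' elements in the document body. –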
