Extract content using pure Java

I want to extract content from an HTML document using XPath in Java. In Ruby, I can do this with Nokogiri, as shown below.

require 'nokogiri'

xpath = '/html/body/div/div[2]/div[2]/div/div[2]/div[3]/p'
doc = Nokogiri::HTML(open('test_001_html64.html'))
# Pass the expression to xpath() and print the text of each matching node.
doc.xpath(xpath).each do |link|
  puts link.content
end

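I would like to do this in pure Java. I looked at Jsoup, but could not find any documentation or examples for doing this with XPath. Can anyone suggest a way?

Thanks

Source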

2012-03-19 Mir

Many related/duplicate questions - see http://stackoverflow.com/questions/9022140/using-xpath-contains-against-html-in-java http://stackoverflow.com/questions/3352594/querying-an-html-page-with-xpath-in-java http://stackoverflow.com/questions/3361263/library-to-query-html-with-xpath-in-java – 2013-01-07 00:43:59

You can use HtmlUnit for this task.

Here is a simple example:

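// Requires HtmlUnit on the classpath (WebClient, HtmlPage and DomNode are HtmlUnit classes).
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://www.google.com/");
// getByXPath returns every node matched by the expression.
List<DomNode> nodes = page.getByXPath("/html/body/div/div[2]/div[2]/div/div[2]/div[3]/p");
for (DomNode node : nodes) {
    // Prints the element name; use node.getTextContent() for the text, as in the Ruby snippet.
    System.out.println(node.getNodeName());
}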

Source

2012-03-19 08:13:30 bezmax

Here is how you can do it with JAXP (bundled with Java): JAXP Manual
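As a rough illustration of the JAXP route, the sketch below evaluates an XPath expression against a DOM built with the standard `javax.xml.parsers` and `javax.xml.xpath` packages. It assumes the input is well-formed XML (for example XHTML); JAXP's parser will reject typical tag-soup HTML, so real pages may need cleaning first. The class name `JaxpXPathExample`, the helper `extractText`, and the sample markup are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JaxpXPathExample {

    // Returns the text content of every node matched by the XPath expression.
    // Assumption: the markup is well-formed XML; malformed HTML makes parse() throw.
    static List<String> extractText(String markup, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document standing in for test_001_html64.html.
        String markup = "<html><body><div><p>first</p><p>second</p></div></body></html>";
        for (String text : extractText(markup, "/html/body/div/p")) {
            System.out.println(text);
        }
    }
}
```

Running `main` prints `first` and `second` on separate lines, mirroring what `puts link.content` does in the Ruby snippet from the question.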

Source

2012-03-19 08:17:36 bezmax


You can do this easily with jsoup.

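// Jsoup.connect() expects an HTTP(S) URL; a local file is parsed with Jsoup.parse() instead.
Document doc = Jsoup.parse(new File("test_001_html64.html"), "UTF-8");
Elements info = doc.getElementsByTag("html");
// Iterate recursively to the desired location in the DOM tree.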

For faster parsing, you can select by specific tags or ids.

Documentation for jsoup (jsoup.org/apidocs) is also available.

Source

2012-03-19 08:18:42

This is not XPath. – bezmax 2012-03-19 08:59:47

jsoup does not provide an XPath mechanism, but it offers a more convenient way: https://norrisshelton.wordpress.com/2011/01/27/jsoup-java-html-parser – 2012-03-19 11:02:28

The question is tagged 'xpath'. – bezmax 2012-03-19 13:49:52
