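Jsoup parsing HTML when writing to a file

I seem to be hitting a bug where the text is written to the file twice: the first time incorrectly formatted, the second time correctly formatted. The method below takes in this URL after it's been converted properly. The method should print a newline between the text of every child of the divs that are children of the div "ffaq", which is where all the body text lives. Any help would be appreciated. I'm fairly new to jsoup, so an explanation would be nice too.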
import java.io.File;
import java.io.PrintWriter;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Method to deal with HTML 5 Gamefaq entries.
 * @param url The location of the HTML 5 entry to read.
 */
public static void htmlDocReader(URL url) {
    try {
        Document doc = Jsoup.parse(url.openStream(), "UTF-8", url.toString());
        //parse pagination label
        String[] num = doc.select("div.span12")
                .select("ul.paginate")
                .select("li")
                .first()
                .text()
                .split("\\s+");
        //get the max page number
        final int max_pagenum = Integer.parseInt(num[num.length - 1]);
        //create a new file based on the url path
        //(urlFile is a helper of mine, not shown here)
        File file = urlFile(url);
        PrintWriter outFile = new PrintWriter(file, "UTF-8");
        //Add every page to the text file
        for (int i = 0; i < max_pagenum; i++) {
            //if not the first page then change the url
            if (i != 0) {
                String new_url = url.toString() + "?page=" + i;
                doc = Jsoup.parse(new URL(new_url).openStream(), "UTF-8",
                        new_url);
            }
            Elements walkthroughs = doc.select("div.ffaq");
            //visit every div matched under div.ffaq and print each child's text
            for (Element elem : walkthroughs.select("div")) {
                for (Element inner : elem.children()) {
                    outFile.println(inner.text());
                }
            }
        }
        outFile.close();
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(1);
    }
}
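
One possible explanation for the doubled output lies in the nested loops of the method above. In Jsoup, `select("div")` can match the starting element itself as well as every `div` nested inside it, and `text()` returns the combined text of an element and all of its descendants. Together, these mean nested content can be printed once run together (as part of an outer element's text) and again on its own. The sketch below models that with a tiny stand-in tree; `Node`, `DuplicateTextDemo`, and the sample markup are hypothetical and are not Jsoup's API or the real page structure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an HTML element; this is NOT Jsoup's API.
class Node {
    final String tag;
    final String ownText;
    final List<Node> children = new ArrayList<>();

    Node(String tag, String ownText) {
        this.tag = tag;
        this.ownText = ownText;
    }

    Node add(Node child) {
        children.add(child);
        return this;
    }

    // Modeled on Element.text(): own text plus the text of all descendants.
    String text() {
        StringBuilder sb = new StringBuilder(ownText);
        for (Node c : children) {
            String t = c.text();
            if (!t.isEmpty()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(t);
            }
        }
        return sb.toString();
    }

    // Modeled on select("div"): this node if it matches, then every matching descendant.
    void collect(String wanted, List<Node> out) {
        if (tag.equals(wanted)) out.add(this);
        for (Node c : children) c.collect(wanted, out);
    }
}

class DuplicateTextDemo {
    // Mirrors the question's loops: every matched div, then each child's text.
    static List<String> printAllDivs(Node root) {
        List<Node> divs = new ArrayList<>();
        root.collect("div", divs);
        List<String> lines = new ArrayList<>();
        for (Node d : divs) {
            for (Node c : d.children) lines.add(c.text());
        }
        return lines;
    }

    // Alternative: visit only the div elements that are direct children of the root.
    static List<String> directDivChildren(Node root) {
        List<String> lines = new ArrayList<>();
        for (Node d : root.children) {
            if (!d.tag.equals("div")) continue;
            for (Node c : d.children) lines.add(c.text());
        }
        return lines;
    }

    // Assumed markup:
    // <div class="ffaq"><div><p>Intro</p><div><p>Step 1</p></div></div></div>
    static Node sample() {
        Node inner = new Node("div", "").add(new Node("p", "Step 1"));
        Node section = new Node("div", "").add(new Node("p", "Intro")).add(inner);
        return new Node("div", "").add(section);
    }

    public static void main(String[] args) {
        System.out.println(printAllDivs(sample()));      // [Intro Step 1, Intro, Step 1, Step 1]
        System.out.println(directDivChildren(sample())); // [Intro, Step 1]
    }
}
```

If the real page nests its divs in a similar way, narrowing the selection to direct children, for example with a child-combinator selector such as `div.ffaq > div`, may avoid the repeats. That is an assumption about the page layout and has not been checked against the actual markup.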