2014-09-27

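Jsoup duplicates parsed HTML when writing to a file

I seem to have run into a bug where the text is written to the file twice: the first time badly formatted, the second time formatted correctly. The method below takes in this URL after it's been converted properly. It should print a newline between the text of each child node of the divs nested under the "ffaq" div, which holds all of the main body text. Any help would be appreciated. I'm fairly new to jsoup, so an explanation would be nice too.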

/** 
* Method to deal with HTML 5 Gamefaq entries. 
* @param url The location of the HTML 5 entry to read. 
**/ 
public static void htmlDocReader(URL url) { 
    try { 
     Document doc = Jsoup.parse(url.openStream(), "UTF-8", url.toString()); 
     //parse pagination label 
     String[] num = doc.select("div.span12"). 
           select("ul.paginate"). 
           select("li"). 
           first(). 
           text(). 
           split("\\s+"); 
     //get the max page number 
     final int max_pagenum = Integer.parseInt(num[num.length - 1]); 

     //create a new file based on the url path 
     File file = urlFile(url); 
     PrintWriter outFile = new PrintWriter(file, "UTF-8"); 

     //Add every page to the text file 
     for(int i = 0; i < max_pagenum; i++) { 
      //if not the first page then change the url 
      if(i != 0) { 
       String new_url = url.toString() + "?page=" + i; 
       doc = Jsoup.parse(new URL(new_url).openStream(), "UTF-8", 
            new_url.toString()); 
      } 
      Elements walkthroughs = doc.select("div.ffaq"); 
       for(Element elem : walkthroughs.select("div")) { 
        for(Element inner : elem.children()) { 
         outFile.println(inner.text()); 
        } 
       } 
     } 
     outFile.close(); 
    } catch(Exception e) { 
     e.printStackTrace(); 
     System.exit(1); 
    } 
} 

Answer

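For every element you call text() on, you print all the text of its whole structure: the element's own text plus the text of all of its descendants. Consider the following example: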

<div> 
text of div 
<span>text of span</span> 
</div> 

If you call text() on the div element, you will get

text of div text of span

Then, if you call text() on the span, you will get

text of span

To avoid the duplicates, use ownText() instead. It returns only the element's direct text, not the text of its children.
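Here is a minimal sketch of the difference, run on the same markup as the example above. It assumes jsoup is on the classpath; the class name TextVsOwnText and the helper firstDiv() are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVsOwnText {
    // Same markup as the example above, written on one line.
    static final String HTML = "<div>text of div <span>text of span</span></div>";

    // Hypothetical helper: parse the snippet and return its only div.
    static Element firstDiv() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("div").first();
    }

    public static void main(String[] args) {
        Element div = firstDiv();
        Element span = div.select("span").first();

        System.out.println(div.text());    // the div's text plus the span's text
        System.out.println(span.text());   // the span's text only
        System.out.println(div.ownText()); // the div's direct text, without the span
    }
}
```

Printing text() for both the div and the span therefore emits the span's text twice, which matches the symptom in the question.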

Long story short, change this

for(Element elem : walkthroughs.select("div")) { 
    for(Element inner : elem.children()) { 
     outFile.println(inner.text()); 
    } 
} 

to this

for(Element elem : walkthroughs.select("div")) { 
    for(Element inner : elem.children()) { 
     String line = inner.ownText().trim(); 
     if(!line.equals("")) //Skip empty lines 
      outFile.println(line); 
    } 
}
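
To see why the original loop wrote everything twice, the sketch below runs both versions of the loop on a tiny stand-in document and collects the lines instead of writing them to a file. The markup, the class name DuplicateTextDemo and the method collectLines are assumptions for illustration, not the real page structure. The key point is that select("div") also matches div.ffaq itself, so the nested div is visited once as a child (printing all of its text) and once as a parent (printing its children's text again).

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DuplicateTextDemo {
    // Hypothetical stand-in for a walkthrough page: div.ffaq wrapping one nested div.
    static final String HTML =
        "<div class=\"ffaq\"><div>text of div <span>text of span</span></div></div>";

    // Runs the loop from the question (useOwnText = false) or from the answer (true).
    static List<String> collectLines(boolean useOwnText) {
        Document doc = Jsoup.parse(HTML);
        Elements walkthroughs = doc.select("div.ffaq");
        List<String> lines = new ArrayList<>();
        // select("div") matches div.ffaq itself and every div nested inside it.
        for (Element elem : walkthroughs.select("div")) {
            for (Element inner : elem.children()) {
                String line = useOwnText ? inner.ownText().trim() : inner.text();
                if (!line.isEmpty()) { // skip empty lines
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(collectLines(false)); // the span's text shows up in two lines
        System.out.println(collectLines(true));  // each piece of text shows up once
    }
}
```

With text(), the first line is the merged text of the nested div and the second repeats the span on its own; with ownText(), each element contributes only its direct text.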