2014-09-19 68 views
-1

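Parsing a text file in Java to get a HashMap of fields

I am trying to parse multiple files and split each one into a set of fields stored in a HashMap. Here is a sample file.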

COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS 

    ROTTERDAM, March 18 - Contract terms for trade in coconut 
oil are to be changed from long tons to tonnes with effect from 
the Aug/Sep contract onwards, Dutch vegetable oil traders said. 
    Operators have already started to take account of the 
expected change and reported at least one trade in tonnes for 
Aug/Sept shipment yesterday. 

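I need the program to parse this document into a custom Document class with the fields key, file name, title, place, date, author, content, and category.

Here is what I have tried.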

public static Document parse(String filename) throws IOException {

    File f = new File(filename);

    if (f.isFile()) {

        // Default to the full name so fileId is assigned even without an extension
        String fileId = filename;
        if (filename.indexOf(".") > 0) {
            fileId = filename.substring(0, filename.lastIndexOf("."));
        }
        // The parent directory is used as the category
        String category = f.getParent();

        InputStream in = new FileInputStream(f);

        byte[] buf = new byte[1024];
        int len = in.read(buf);
        while (len > 0) {
            ..........
        }
        in.close();
    }

    return null;
}
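
The attempt above takes fileId from the full path and category from f.getParent(), which returns the whole parent path. A minimal sketch of one alternative, assuming fileId should be the bare file name without its extension and category should be only the name of the containing directory (both are assumptions, and PathFields is a hypothetical helper name):

```java
import java.io.File;

// Hypothetical helper: derives fileId and category from a file's path.
class PathFields {

    // File name without directory and without the last extension.
    static String fileId(File f) {
        String name = f.getName();
        int dot = name.lastIndexOf('.');
        // dot > 0 keeps names such as ".hidden" intact
        return dot > 0 ? name.substring(0, dot) : name;
    }

    // Name of the directory that directly contains the file, or "" if there is none.
    static String category(File f) {
        File parent = f.getParentFile();
        return parent == null ? "" : parent.getName();
    }
}
```

For a path such as data/oil/0001.txt this would give the fileId 0001 and the category oil.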
+0

I'm sorry, what are you trying to accomplish here? :O – 2014-09-19 19:18:44

+0

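Well, this is a start, but it will be hard to continue in the same way. If I were you, I would stop writing code now and first work out the high-level steps that need to be taken. Write those steps down on a piece of paper: '1. Read the file completely into a string. 2. Extract the title from the file...' and so on. Then you can start coding step by step, testing the result after each step. – biziclop 2014-09-19 19:20:17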
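
The first two steps named in this comment could be sketched as below. This is only an illustration: the class name StepwiseParser, the UTF-8 encoding, and the rule that the title is the first non-blank line are all assumptions, not something the thread specifies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper showing one small, separately testable method per step.
class StepwiseParser {

    // Step 1: read the file completely into a string (assumes UTF-8 text).
    static String readAll(String filename) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Step 2: extract the title, assumed here to be the first non-blank line.
    static String extractTitle(String text) {
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                return trimmed;
            }
        }
        return ""; // no non-blank line found
    }
}
```

Because extractTitle works on a plain string, it can be checked against the sample text before any file handling is involved.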

Answers

0

The following code may help you:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// try-with-resources closes the reader automatically
try (BufferedReader br = new BufferedReader(new FileReader("myFile.txt"))) {
    StringBuilder contentBuffer = new StringBuilder();
    String line;
    boolean foundTitle = false;
    boolean foundPlaceAndDate = false;
    String date = "";
    while ((line = br.readLine()) != null) {
        String trimmed = line.trim();
        if (line.matches("^[a-zA-Z0-9].*") && !foundTitle) {
            // The first line that starts with a letter or digit is the title
            System.out.println("Title: " + trimmed);
            foundTitle = true;
        } else if (line.matches("^[ \\t].*") && !foundPlaceAndDate) {
            // The first indented line opens the first paragraph: "PLACE, date - text"
            int comma = trimmed.indexOf(",");
            int dash = trimmed.indexOf("-", comma + 1);
            if (comma >= 0 && dash > comma) {
                System.out.println("Place: " + trimmed.substring(0, comma));
                date = trimmed.substring(comma + 1, dash).trim();
                System.out.println("Date: " + date);
                foundPlaceAndDate = true;
            }
        }
        // Join lines with a single space so words at line breaks do not run together
        if (!trimmed.isEmpty()) {
            contentBuffer.append(trimmed).append(' ');
        }
    }

    // Everything after "<date> - " is the body text
    String joined = contentBuffer.toString();
    int start = date.isEmpty() ? 0 : joined.indexOf(date) + date.length() + 2;
    String content = joined.substring(start).trim();
    System.out.println("Content: " + content);
} catch (IOException e) {
    System.err.println("Oh no! I got the following error: " + e.getMessage());
}

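The output will be:

Title: COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS

Place: ROTTERDAM

Date: March 18

Content: Contract terms for trade in coconut oil are to be changed from long tons to tonnes with effect from the Aug/Sep contract onwards, Dutch vegetable oil traders said. Operators have already started to take account of the expected change and reported at least one trade in tonnes for Aug/Sept shipment yesterday.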

+0

This did get me started, but I need to parse the file into the Document class, which looks like this: public class Document { public Document() { map = new HashMap(); } public void setField(FieldNames fn, String... o) { map.put(fn, o); } public String[] getField(FieldNames fn) { return map.get(fn); } } – 2014-09-19 19:52:27

+0

Now you just need to fill in the fields of the Document class. For example: 'Document document = new Document(); document.setField("title", title);' – shimatai 2014-09-22 18:10:59
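
Since setField in the quoted Document class takes a FieldNames value rather than a string, filling the fields might look like the sketch below. FieldNames is never shown in the thread, so the enum constants used here (TITLE, PLACE, DATE, CONTENT) are assumptions, as are the helper class DocumentFiller and the map's generic types. The parsing rules follow the sample file: first non-blank line is the title, and the next non-blank line starts with "PLACE, date - ".

```java
import java.util.HashMap;
import java.util.Map;

// Assumed constants; the real FieldNames type is not shown in the thread.
enum FieldNames { TITLE, PLACE, DATE, CONTENT }

// Follows the Document class quoted in the comments (field name -> values).
class Document {
    private final Map<FieldNames, String[]> map = new HashMap<>();

    public void setField(FieldNames fn, String... o) { map.put(fn, o); }

    public String[] getField(FieldNames fn) { return map.get(fn); }
}

// Hypothetical helper that turns the text of one file into a Document.
class DocumentFiller {

    static Document parse(String text) {
        Document doc = new Document();
        StringBuilder body = new StringBuilder();
        boolean foundTitle = false;
        boolean seenFirstParagraphLine = false;

        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines only separate paragraphs
            }
            if (!foundTitle) {
                doc.setField(FieldNames.TITLE, line);
                foundTitle = true;
            } else if (!seenFirstParagraphLine) {
                seenFirstParagraphLine = true;
                int comma = line.indexOf(',');
                int dash = line.indexOf(" - ", comma + 1);
                if (comma >= 0 && dash > comma) {
                    doc.setField(FieldNames.PLACE, line.substring(0, comma).trim());
                    doc.setField(FieldNames.DATE, line.substring(comma + 1, dash).trim());
                    body.append(line.substring(dash + 3).trim());
                } else {
                    body.append(line); // no dateline: keep the line as content
                }
            } else {
                body.append(' ').append(line);
            }
        }
        doc.setField(FieldNames.CONTENT, body.toString().trim());
        return doc;
    }
}
```

A caller would then read values back with, for example, doc.getField(FieldNames.PLACE)[0]. Fields that were never set come back as null, so real code would need to check for that.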