Reading Nutch content data with Java

I want to read the content data stored in a Nutch segment folder. I think the content data file is in a custom format.

I tried using Nutch's Content class, but it does not recognize the format.

2011-09-21 surajz

org.apache.nutch.segment.SegmentReader

has a MapReduce implementation that reads the content data in a segment directory.


2011-09-22 03:39:46 surajz

import java.io.IOException;

import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader {
    public static void main(String[] args) throws IOException {
        // Let Hadoop consume its generic options (-D, -conf, ...) first
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        if (remainingArgs.length < 1) {
            System.err.println("Usage: ContentReader <segment_dir>");
            System.exit(1);
        }
        FileSystem fs = FileSystem.get(conf);
        String segment = remainingArgs[0];
        // Only the first part file of the segment's content directory is read
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        Text key = new Text();
        Content content = new Content();
        try {
            // Each record is a (URL, Content) pair; write the raw bytes to stdout
            while (reader.next(key, content)) {
                byte[] bytes = content.getContent();
                System.out.write(bytes, 0, bytes.length);
            }
            System.out.flush();
        } finally {
            // Release the file handle even if a record fails to deserialize
            reader.close();
        }
    }
}


2013-04-02 12:21:07 kitwalker

Thanks for the above! Is there any way to retrieve only a given file type (docx, pdf, etc.)? – change

String contentType = content.getContentType(); if (!contentType.equalsIgnoreCase("application/pdf")) { – kitwalker
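To expand on the content-type check in the comment above, here is a minimal standalone sketch. ContentTypeFilter, baseType and matches are hypothetical names, not Nutch API; the only Nutch call assumed is content.getContentType(), taken from the comment. The helper also tolerates a null value and parameters such as "; charset=...", which a plain equalsIgnoreCase would not.

```java
import java.util.Locale;

// Hypothetical helper (not part of Nutch): decides whether a Content-Type
// string names the wanted media type.
final class ContentTypeFilter {
    private ContentTypeFilter() {}

    // Strips parameters such as "; charset=UTF-8", trims, and lowercases.
    // A null input yields the empty string.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True when both strings name the same media type, ignoring case and parameters.
    static boolean matches(String contentType, String wanted) {
        return baseType(contentType).equals(baseType(wanted));
    }

    public static void main(String[] args) {
        String[] samples = {"application/pdf", "Application/PDF; charset=binary", "text/html", null};
        for (String sample : samples) {
            System.out.println(sample + " -> " + matches(sample, "application/pdf"));
        }
    }
}
```

Inside the read loop of ContentReader one could then skip every record for which ContentTypeFilter.matches(content.getContentType(), "application/pdf") is false.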

Awesome! Thanks! What arguments does argv stand for, and in what order? – change
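On the argv question: as far as the ContentReader code shows, GenericOptionsParser first consumes Hadoop's generic options (for example -D key=value or -conf file), and the first argument left over is taken as the segment directory. Nothing else is read. The sketch below mirrors that path construction with plain strings so it runs without Hadoop or Nutch. SegmentArgs and contentDataFile are hypothetical names, the segment path in main is made up, and the value "content" for Content.DIR_NAME is an assumption about Nutch 1.x.

```java
// Sketch only: mirrors how ContentReader turns its first remaining argument
// into the file it opens. No Hadoop or Nutch classes are used here.
final class SegmentArgs {
    // Assumption: Content.DIR_NAME is "content" in Nutch 1.x.
    static final String CONTENT_DIR = "content";

    private SegmentArgs() {}

    // remainingArgs is what is left after Hadoop's generic options are removed.
    static String contentDataFile(String[] remainingArgs) {
        if (remainingArgs.length < 1) {
            throw new IllegalArgumentException("Usage: ContentReader <segment_dir>");
        }
        String segment = remainingArgs[0];
        // Drop trailing slashes so the joined path has no empty component.
        while (segment.length() > 1 && segment.endsWith("/")) {
            segment = segment.substring(0, segment.length() - 1);
        }
        return segment + "/" + CONTENT_DIR + "/part-00000/data";
    }

    public static void main(String[] args) {
        // Hypothetical segment directory, for illustration only.
        String[] remaining = {"crawl/segments/20110921123456"};
        System.out.println(contentDataFile(remaining));
    }
}
```

So a typical invocation passes a single segment directory, optionally preceded by generic Hadoop options.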
