2011-09-21 53 views
1

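I want to read the content data stored in a Nutch segment folder from Java. I think the content data file is in a custom format. How can I read it?

I tried using Nutch's Content class, but it does not recognize the format.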

Answers

0
org.apache.nutch.segment.SegmentReader 

This class has a MapReduce implementation that reads the content data in a segment directory.

5
import java.io.IOException;

import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader {
    public static void main(String[] args) throws IOException {
        // Let Hadoop consume its generic options first; what is left are our own arguments
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        if (remainingArgs.length < 1) {
            System.err.println("Usage: ContentReader <segment_dir>");
            System.exit(1);
        }
        FileSystem fs = FileSystem.get(conf);
        // The only positional argument is the path to the segment directory
        String segment = remainingArgs[0];
        // Only the first part file (part-00000) of the content directory is read
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
            Text key = new Text();
            Content content = new Content();
            // Loop through the records of the sequence file
            while (reader.next(key, content)) {
                // Write the raw fetched bytes of each record to stdout
                byte[] bytes = content.getContent();
                System.out.write(bytes, 0, bytes.length);
            }
            System.out.flush();
        } finally {
            // Release the file handle even if reading fails
            reader.close();
        }
    }
}
+0

Thanks for the above! Is there any way to retrieve only a given file type (docx, pdf, etc.)? – change

+0

String contentType = content.getContentType(); if (!contentType.equalsIgnoreCase("application/pdf")) { – kitwalker
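The content-type check in the comment above can be pulled into a small helper. What follows is only a sketch under assumptions: `ContentTypeFilter`, `baseType` and `matches` are hypothetical names, not part of Nutch, and the helper works on a plain string so it runs without Nutch on the classpath. Inside the reader loop the string would come from `content.getContentType()`. Stripping parameters such as `; charset=...` is a precaution in case the stored value carries them; whether it does for a given crawl is not confirmed here.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides whether a content-type
// string names the wanted media type.
final class ContentTypeFilter {
    private ContentTypeFilter() {}

    // Drops any parameters after ';', trims whitespace and lower-cases the rest.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True when both strings name the same base media type, ignoring case.
    static boolean matches(String contentType, String wanted) {
        return baseType(contentType).equals(baseType(wanted));
    }

    public static void main(String[] args) {
        // Sample values only; real values would come from content.getContentType().
        String[] samples = {"application/pdf", "Application/PDF; charset=binary", "text/html"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + matches(sample, "application/pdf"));
        }
    }
}
```

In the loop of the answer above, a record would be skipped when `ContentTypeFilter.matches(content.getContentType(), "application/pdf")` is false. For docx the wanted type would be `application/vnd.openxmlformats-officedocument.wordprocessingml.document`.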

+0

Awesome! Thanks! What do the argv arguments stand for, and in what order? – change