
This is a Gridmix code snippet from Hadoop MapReduce V1, but I have the following question: how do I create SequenceFiles with `org.apache.hadoop.io.Text`?

It sets org.apache.hadoop.mapred.SequenceFileInputFormat and org.apache.hadoop.mapred.SequenceFileOutputFormat as inFormat and outFormat respectively, and it uses org.apache.hadoop.io.Text for both outKey and outValue. To me it looks as though this example accepts plain text files as sequence files. How can I create SequenceFiles with org.apache.hadoop.io.Text?

WEBDATASCAN("webdataScan") {
  public void addJob(int numReducers, boolean mapoutputCompressed,
      boolean outputCompressed, Size size, JobControl gridmix) {
    final String prop = String.format("webdataScan.%sJobs.inputFiles", size);
    final String indir = getInputDirsFor(prop, size.defaultPath(VARCOMPSEQ));
    final String outdir = addTSSuffix("perf-out/webdata-scan-out-dir-" + size);

    // Build the argument list for the generic MR load job
    StringBuffer sb = new StringBuffer();
    sb.append("-keepmap 0.2 ");
    sb.append("-keepred 5 ");
    sb.append("-inFormat org.apache.hadoop.mapred.SequenceFileInputFormat ");
    sb.append("-outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat ");
    sb.append("-outKey org.apache.hadoop.io.Text ");
    sb.append("-outValue org.apache.hadoop.io.Text ");
    sb.append("-indir ").append(indir).append(" ");
    sb.append("-outdir ").append(outdir).append(" ");
    sb.append("-r ").append(numReducers);

    String[] args = sb.toString().split(" ");
    clearDir(outdir);
    try {
      JobConf jobconf = GenericMRLoadJobCreator.createJob(
          args, mapoutputCompressed, outputCompressed);
      jobconf.setJobName("GridmixWebdatascan." + size);
      Job job = new Job(jobconf);
      gridmix.addJob(job);
    } catch (Exception ex) {
      // printStackTrace() prints the full trace; println(ex.getStackTrace())
      // would only print the array reference
      ex.printStackTrace();
    }
  }
}

Answer


You are mixing up file formats and key/value types. To read plain text data there is TextInputFormat. Key and value types apply at the level of individual records. SequenceFileOutputFormat takes keys and values such as Text and serializes the data internally into a binary format before storing it in HDFS; it also keeps metadata about the key and value classes inside the file.
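To create a SequenceFile with Text keys and values directly, you can use SequenceFile.Writer from the old (Hadoop 1.x) API. A minimal sketch; the class name TextSequenceFileWriter and the path /tmp/webdata.seq are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TextSequenceFileWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/webdata.seq"); // hypothetical output path

    SequenceFile.Writer writer = null;
    try {
      // Declare Text for both key and value, matching -outKey/-outValue above
      writer = SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
      writer.append(new Text("key1"), new Text("value1"));
      writer.append(new Text("key2"), new Text("value2"));
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

A file written this way can then be consumed by a job configured with SequenceFileInputFormat, as in the Gridmix snippet in the question.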

The old MapReduce API has the input and output formats in the org.apache.hadoop.mapred package, and the key and value types in the org.apache.hadoop.io package. Key and value types include Text, IntWritable, FloatWritable, and so on.
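For completeness, reading the records back shows that the key/value types live at the record level and that the file header stores the key and value class names. A minimal sketch under the same assumptions as the writer above (hypothetical class name and path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TextSequenceFileReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/webdata.seq"); // hypothetical, matches the writer sketch

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      // The key/value class names come from the file's own metadata
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());

      Text key = new Text();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}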