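Using a third-party library in my Map-Reduce job (using the distributed cache)

In my mapper code I use a third-party library, JTS.jar. I need to put it in Hadoop's distributed cache so that all nodes can access it. I found at this link that -libjars can be used to do this.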

I now run my code with:

hadoop jar -libjars JTS.jar my_jar.jar classname inputFiles outputFiles

But this does not work. Any suggestions on how to fix this?

Source

2012-07-13 reza

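Details? Are you getting a NoClassDefFoundError? Where? – 2012-07-13 02:46:23

Running my jar file on Hadoop 0.20.2 with the command above fails with 'Exception in thread "main" java.io.IOException: Error opening job jar: -libjars'. It looks like -libjars is not supported at all. – reza 2012-07-13 02:58:32

In a different attempt, I tried to follow this link.

1) I copied the library JAR to HDFS: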

hadoop fs -copyFromLocal JTS.jar /someHadoopFolder/JTS.jar

2) Then I modified my configuration as follows:

Configuration conf = new Configuration();

Job job = new Job(conf);
job.setJobName("TEST JOB");

List<String> other_args = parseArguments(args, job);

DistributedCache.addFileToClassPath(new Path("/someHadoopFolder/JTS.jar"), conf);

job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(myMapper.class);
//job.setCombinerClass(myReducer.class);
//job.setReducerClass(myReducer.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

String inPath = other_args.get(0);
String outPath = other_args.get(1);
TextInputFormat.setInputPaths(job, inPath);
TextOutputFormat.setOutputPath(job, new Path(outPath));

TextInputFormat.setMinInputSplitSize(job, 32 * MEGABYTES);
TextInputFormat.setMaxInputSplitSize(job, 32 * MEGABYTES);

job.setJarByClass(myFile.class);

job.waitForCompletion(true);

3) The tutorial then says "use the cached files in the mapper", so my mapper looks like this:

public static class myMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private Path[] localArchives;
    private Path[] localFiles;

    public void configure(Configuration conf) throws IOException {
        localArchives = DistributedCache.getLocalCacheArchives(conf);
        localFiles = DistributedCache.getLocalCacheFiles(conf);
    }

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Envelope is from the JTS.jar library
        Envelope e1 = new Envelope(-180, 85, 180, -85);
        context.write(key, value);
    }
}

Despite doing all of this, the code still fails with a "class not found" error. Any help?

Source

2012-07-13 03:31:41 reza

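Using the DistributedCache in the way described is not the same as linking against classes in the jar (for example, referring to 'Envelope' directly). Incredibly, the javadoc does not mention that the only way to "use" the archives and files is through Java I/O (i.e. 'FileInputStream'). I'm sure you could use some fancy class-loading technique, but I suggest sticking with -libjars for what you are trying to do (it does much the same thing behind the scenes). – 2012-07-13 03:51:38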
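The distinction in this comment (a jar being shipped as a file versus its classes being visible to the class loader) can be checked directly. The following is a minimal diagnostic sketch in plain Java, not part of the original thread. The class ClasspathCheck and its method isLoadable are hypothetical names, and the default class name com.vividsolutions.jts.geom.Envelope is an assumption about the JTS version in use.

```java
/** Hypothetical diagnostic: reports whether a class is visible to the current class loader. */
final class ClasspathCheck {

    /** Returns true if the named class can be found, without running its static initializers. */
    static boolean isLoadable(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the loader that loaded this class.
            loader = ClasspathCheck.class.getClassLoader();
        }
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed package name for Envelope; pass another class name as the first argument if needed.
        String name = args.length > 0 ? args[0] : "com.vividsolutions.jts.geom.Envelope";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Calling isLoadable from both the driver and the mapper would show on which side the class is actually missing.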

Good to know that this will not work. I spent 4-5 hours trying to get it working. Hopefully I can figure out -libjars soon. – reza 2012-07-13 06:20:29

Try using the correct order of command-line arguments. I think the error message is quite instructive.

hadoop jar my_jar.jar classname -libjars JTS.jar inputFiles outputFiles
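To illustrate why the order matters, here is a minimal sketch of how the arguments after "hadoop jar" are split. It is a simplified model, not Hadoop's actual code: the class HadoopJarArgs and its parse method are hypothetical, and the sketch assumes the jar's manifest declares no Main-Class, so the class name has to be given on the command line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of the argument split done by "hadoop jar". */
final class HadoopJarArgs {
    final String jarFile;            // first argument: always treated as the job jar
    final String mainClass;          // second argument: the class to run (assumes no Main-Class in the manifest)
    final List<String> programArgs;  // everything else is handed to the main method of that class

    private HadoopJarArgs(String jarFile, String mainClass, List<String> programArgs) {
        this.jarFile = jarFile;
        this.mainClass = mainClass;
        this.programArgs = programArgs;
    }

    static HadoopJarArgs parse(String... args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("usage: <jar> <mainClass> [args...]");
        }
        List<String> rest = new ArrayList<>(Arrays.asList(args).subList(2, args.length));
        return new HadoopJarArgs(args[0], args[1], rest);
    }

    public static void main(String[] unused) {
        // Order from the question: "-libjars" lands in the jar-file slot.
        HadoopJarArgs wrong = parse("-libjars", "JTS.jar", "my_jar.jar", "classname", "inputFiles", "outputFiles");
        System.out.println("job jar: " + wrong.jarFile);

        // Order from this answer: "-libjars JTS.jar" reaches the program's own arguments.
        HadoopJarArgs right = parse("my_jar.jar", "classname", "-libjars", "JTS.jar", "inputFiles", "outputFiles");
        System.out.println("job jar: " + right.jarFile + ", program args: " + right.programArgs);
    }
}
```

With the original order, the string "-libjars" is taken as the jar to open, which matches the "Error opening job jar: -libjars" message quoted in the question's comments.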

Source

2012-07-13 03:04:05

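The job starts, but it throws "class not found", which means the library is not being accessed. I think this is because everything after "classname" is passed to my code as arguments, so -libjars is not taken into account at all. – reza 2012-07-13 03:21:59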

http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/ – 2012-07-13 03:41:37

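Check whether the jar is being distributed, or print something to standard output in the driver code to confirm that the -libjars argument is actually being parsed correctly. Also, are you using JTS anywhere other than in the Mapper/Reducer? – 2012-07-13 03:43:29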
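The check suggested in this comment can be sketched as follows. In Hadoop, generic options such as -libjars are normally consumed by GenericOptionsParser, which ToolRunner invokes for drivers that implement Tool. The code below is not that parser. It is a simplified, hypothetical stand-in (DriverArgs.split) that handles only a leading -libjars option, so that a driver can print what it received and what would be left for the application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for generic-option parsing; handles only leading "-libjars <value>". */
final class DriverArgs {
    final String libJars;          // value given to -libjars, or null if it was not seen
    final List<String> remaining;  // arguments left for the application (input and output paths)

    private DriverArgs(String libJars, List<String> remaining) {
        this.libJars = libJars;
        this.remaining = remaining;
    }

    static DriverArgs split(String... args) {
        String libJars = null;
        int i = 0;
        // Consume "-libjars <value>" pairs only while they appear before the application arguments.
        while (i + 1 < args.length && "-libjars".equals(args[i])) {
            libJars = args[i + 1];
            i += 2;
        }
        List<String> remaining = new ArrayList<>(Arrays.asList(args).subList(i, args.length));
        return new DriverArgs(libJars, remaining);
    }

    public static void main(String[] args) {
        DriverArgs parsed = split(args);
        System.out.println("raw args:  " + Arrays.toString(args));
        System.out.println("-libjars:  " + parsed.libJars);
        System.out.println("remaining: " + parsed.remaining);
    }
}
```

If a driver reads its input and output paths straight from the raw arguments, the first "path" it sees would be the string "-libjars", which is one way to notice that the option was never parsed.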

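I guess I'm a bit late. One way to do this is to copy the jar files into Hadoop's installation folder. In my case I put XXX.jars (the third-party jars) in /usr/local/hadoop/share/hadoop/common and then added these files as external jars.

That solved my problem. If you do not want to do it this way, the other way is to include the directory/file path of the external jar in export HADOOP_CLASSPATH=/XXX/example.jar:...

Source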

2015-01-21 19:06:25 letsBeePolite
