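How can I share a HashMap between mappers in Hadoop?

Can I share a HashMap with the same values across different mappers, like a static variable? I am running a job on a Hadoop cluster, and I am trying to share a variable's value between all mappers, which run on different datanodes.

INPUT ==> file path written to FileID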

InputFormat => KeyValueTextInputFormat

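import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Demo {

    // Static state lives in a single JVM; mappers on other nodes each get their own copy.
    static int termID = 0;

    public static class DemoMapper extends Mapper<Object, Text, IntWritable, Text> {

        // Meant to be shared by all mappers, but only visible inside this task's JVM.
        static HashMap<String, Integer> termMapping = new HashMap<String, Integer>();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            // Term frequencies for the file currently being read.
            HashMap<String, Integer> termMap = new HashMap<String, Integer>();

            // value holds a file path; FileReader takes a String and reads from the local file system.
            try (BufferedReader reader = new BufferedReader(new FileReader(value.toString()))) {
                String line;
                String currentTerm;

                while ((line = reader.readLine()) != null) {
                    StringTokenizer tokenizer = new StringTokenizer(line, " ");
                    while (tokenizer.hasMoreTokens()) {
                        currentTerm = tokenizer.nextToken();
                        if (!termMap.containsKey(currentTerm)) {
                            // First time this term is seen in the file: give it an ID if it has none yet.
                            if (!termMapping.containsKey(currentTerm)) {
                                termMapping.put(currentTerm, termID++);
                            }
                            termMap.put(currentTerm, 1);
                        } else {
                            termMap.put(currentTerm, termMap.get(currentTerm) + 1);
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) {

    }

}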

Source

2017-06-13 D. Jagatiya

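I know you can broadcast a map between tasks in Spark. I have never tried it with MapReduce.

Thanks, but I don't want to use Spark.

OK, then show the MapReduce code where you tried to add the Map. What error did you get?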

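I don't think you really need to share anything.

All you are doing here is a variation on a simple word count (over the paths).

Just output (currentTerm, 1) and let the reducer handle the aggregation. You can also use a Combiner to improve performance.

You don't have to worry about duplicates - just review the WordCount example.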
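As a rough illustration of that flow, here is a minimal sketch in plain Java with no Hadoop dependency. The class `TermCountSketch` and its methods `map` and `sum` are hypothetical stand-ins for the map, combine, and reduce phases, not Hadoop API; the point is only that each mapper emits (term, 1) independently and the summing happens afterwards, so no shared state is needed.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java stand-in for the map -> combine -> reduce flow (hypothetical names, not Hadoop classes).
class TermCountSketch {

    // "Map" phase: emit one (term, 1) pair per token, using no shared state.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // "Combine" and "reduce" phases: sum the values for each key.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two independent "mappers", each pre-aggregating its own split like a combiner would.
        Map<String, Integer> partialA = sum(map("a b a"));
        Map<String, Integer> partialB = sum(map("b c"));

        // "Shuffle": the partial sums from both mappers reach the reducer together.
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>(partialA.entrySet());
        shuffled.addAll(partialB.entrySet());

        System.out.println(sum(shuffled)); // prints {a=2, b=2, c=1}
    }
}
```

Because summing is associative, the same `sum` step can serve as both combiner and reducer here, which is also why the WordCount example can reuse its reducer as a combiner.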

Also, I think your types should be changed to extends Mapper<LongWritable, Text, Text, IntWritable> if you are reading files and outputting (String, int) data.

There is also a MapWritable class, but that seems like overkill.

Source

2017-06-13 17:26:15

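Suppose I have 4 input splits containing 10 file paths, so 4 mappers will be executed. **I want to generate a unique ID for each word**, **not a WordCount**, which is why every mapper needs some counter shared across each unique word.

OK, then explain that in the question. You can use the Reducer's 'setup' method to initialize a counter. https://stackoverflow.com/questions/11737750/how-to-handle-id-generation-on-a-hadoop-cluster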
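One possible reading of the reducer-side counter, sketched in plain Java with no Hadoop dependency. Everything here is an assumption for illustration: the class `TermIdReducerSketch` is hypothetical, and the interleaving scheme (start at the task ID, step by the number of reducers) is not stated in this thread. It relies on the shuffle sending each distinct term to exactly one reducer, so a per-reducer counter is enough and nothing has to be shared between tasks.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java stand-in for one reducer task (hypothetical names, not Hadoop API).
class TermIdReducerSketch {
    private final int numReducers;
    private long nextId; // initialised once per task, which is what setup() would do

    TermIdReducerSketch(int taskId, int numReducers) {
        this.numReducers = numReducers;
        this.nextId = taskId; // reducer k hands out k, k + n, k + 2n, ...
    }

    // Same idea as hash partitioning: a given term always lands on the same reducer.
    static int partition(String term, int numReducers) {
        return (term.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Called once per distinct term, mirroring one reduce() call per key.
    long reduce(String term) {
        long id = nextId;
        nextId += numReducers; // stepping by n keeps IDs disjoint across reducers
        return id;
    }

    public static void main(String[] args) {
        int numReducers = 3;
        TermIdReducerSketch[] reducers = new TermIdReducerSketch[numReducers];
        for (int i = 0; i < numReducers; i++) {
            reducers[i] = new TermIdReducerSketch(i, numReducers);
        }

        // After the shuffle, duplicates emitted by different mappers collapse into one key each.
        Map<String, Long> ids = new TreeMap<>();
        for (String term : new TreeSet<>(Arrays.asList("hadoop", "map", "reduce", "hadoop", "split"))) {
            ids.put(term, reducers[partition(term, numReducers)].reduce(term));
        }
        System.out.println(ids);
    }
}
```

The IDs produced this way are unique but not contiguous, since reducers rarely receive the same number of terms. If dense IDs are required, a second pass or a different scheme would be needed.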

Also, I'm not sure what your file paths look like, but 'FileReader(value)' would need to read from a network file path, not the local disk.
