2016-06-01 80 views
1

Spark Java Accumulator does not increment

Just taking baby steps with Spark in Java. Below is a word-count program with a stop-word list used to skip words in the input. I have two accumulators to count the skipped and unskipped words.

However, the Sysout at the end of the program always shows both accumulator values as 0.

Please point out where I am going wrong.

public static void main(String[] args) throws FileNotFoundException {

    SparkConf conf = new SparkConf();
    conf.setAppName("Third App - Word Count WITH BroadCast and Accumulator");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    JavaRDD<String> fileRDD = jsc.textFile("hello.txt");
    JavaRDD<String> words = fileRDD.flatMap(new FlatMapFunction<String, String>() {

        public Iterable<String> call(String aLine) throws Exception {
            return Arrays.asList(aLine.split(" "));
        }
    });

    String[] stopWordArray = getStopWordArray();

    final Accumulator<Integer> skipAccumulator = jsc.accumulator(0);
    final Accumulator<Integer> unSkipAccumulator = jsc.accumulator(0);

    final Broadcast<String[]> stopWordBroadCast = jsc.broadcast(stopWordArray);

    JavaRDD<String> filteredWords = words.filter(new Function<String, Boolean>() {

        public Boolean call(String inString) throws Exception {
            boolean filterCondition = !Arrays.asList(stopWordBroadCast.getValue()).contains(inString);
            if (!filterCondition) {
                System.out.println("Filtered a stop word ");
                skipAccumulator.add(1);
            } else {
                unSkipAccumulator.add(1);
            }
            return filterCondition;
        }
    });

    System.out.println("$$$$$$$$$$$$$$$Filtered Count " + skipAccumulator.value());
    System.out.println("$$$$$$$$$$$$$$$ UN Filtered Count " + unSkipAccumulator.value());

    /* rest of code - works fine */
    jsc.stop();
    jsc.close();
}

I build a JAR and run it with:

spark-submit jarname 

------- EDIT: submitting the job on Hortonworks Sandbox 2.4 -------

Here is the rest of the code that was summarized as a comment above:

JavaPairRDD<String, Integer> wordOccurrence = filteredWords.mapToPair(new PairFunction<String, String, Integer>() {

    public Tuple2<String, Integer> call(String inWord) throws Exception {
        return new Tuple2<String, Integer>(inWord, 1);
    }
});

JavaPairRDD<String, Integer> summed = wordOccurrence.reduceByKey(new Function2<Integer, Integer, Integer>() {

    public Integer call(Integer a, Integer b) throws Exception {
        return a + b;
    }
});

summed.saveAsTextFile("hello-out");
+1

Both accumulators are 0, and since the stop word occurs 5 times, "Filtered a stop word" is printed 5 times. –

Answer

1

You omitted the important part of the code: /* rest of code - works fine */. I can guarantee that you call some action in that code, which triggers the DAG to execute the code containing the accumulators. Try adding filteredWords.collect() before the println and you should see the output. Remember that Spark is lazy on transformations and only executes on actions.
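The laziness described here is not unique to Spark, and you can demonstrate the same effect without a cluster using plain Java streams. The sketch below (stdlib only; class and variable names are my own, not from the question) increments a counter inside a filter predicate: the counter stays at 0 until a terminal operation runs the pipeline, just as the Spark accumulators stay at 0 until an action triggers the DAG.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyFilterDemo {

    // Returns {counter before terminal op, counter after, number of kept words}
    static long[] run() {
        AtomicInteger skipped = new AtomicInteger(0);

        // Building the pipeline does NOT execute the predicate yet.
        Stream<String> filtered = Arrays.asList("the", "hello", "a", "world").stream()
                .filter(w -> {
                    if (Arrays.asList("the", "a").contains(w)) {
                        skipped.incrementAndGet();  // side effect inside the "transformation"
                        return false;
                    }
                    return true;
                });

        long before = skipped.get();   // still 0 -- nothing has run
        long kept = filtered.count();  // terminal operation: pipeline executes now
        long after = skipped.get();    // both stop words have been seen
        return new long[] { before, after, kept };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // prints [0, 2, 2]
    }
}
```

The analogy is direct: intermediate stream operations play the role of Spark transformations, and the terminal operation plays the role of an action like collect().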

+0

Edited the question. :) –

+1

Correct answer - Spark is lazy on transformations and only executes on actions. I forced a first() to make it work. –
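One caveat with the first() workaround: first() is indeed an action and will trigger execution, but it is short-circuiting, so Spark may compute only as many partitions as it needs, and the accumulators can end up with partial counts rather than totals for the whole dataset. Plain Java streams show the same effect (stdlib-only sketch, names are my own): findFirst() stops the pipeline after the first element, so a counter inside the filter sees only one item.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class ShortCircuitDemo {

    // Counts how many elements the filter predicate actually examined.
    static int run() {
        AtomicInteger seen = new AtomicInteger(0);
        Arrays.asList("a", "b", "c", "d").stream()
                .filter(w -> {
                    seen.incrementAndGet();  // side effect, like an accumulator
                    return true;
                })
                .findFirst();                // short-circuits after the first match
        return seen.get();                   // 1, not 4
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 1
    }
}
```

If you need accurate accumulator totals, prefer a full action such as collect() or count() over first().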