In my recent big data project I need to use Spark. The first requirement is to match data between RDDs using the Spark Java API,
as follows.
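We have two data sets coming from different data sources, say one from a flat file and the other from HDFS.
The data sets may or may not have columns in common, but we have mapping rules in hand, for example:
function1(data1.columnA) == function2(data2.columnB)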
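To make the rule concrete, here is a minimal sketch in plain Java, with ordinary lists standing in for the two data sets (no Spark involved). The class name `MatchSketch`, the method `match`, the sample rows, and the two key functions are all hypothetical and only illustrate the shape of the rule: each side derives a key, and rows match when the derived keys are equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration: plain Java lists stand in for the two data sets.
class MatchSketch {

    // Pairs every left row with every right row whose derived keys are equal,
    // i.e. leftKey(left) equals rightKey(right).
    static <L, R, K> List<String> match(List<L> left, Function<L, K> leftKey,
                                        List<R> right, Function<R, K> rightKey) {
        // Index the right side by its derived key.
        Map<K, List<R>> index = new HashMap<>();
        for (R r : right) {
            index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        // Probe the index with the derived key of each left row.
        List<String> matches = new ArrayList<>();
        for (L l : left) {
            for (R r : index.getOrDefault(leftKey.apply(l), Collections.emptyList())) {
                matches.add(l + " <-> " + r);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample values for data1.columnA and data2.columnB.
        List<String> data1 = Arrays.asList("ID-001", "ID-002", "ID-007");
        List<String> data2 = Arrays.asList("1", "7", "9");
        // function1 strips the "ID-" prefix and parses the rest; function2 parses directly.
        List<String> result = match(data1, a -> Integer.valueOf(a.substring(3)),
                                    data2, b -> Integer.valueOf(b));
        result.forEach(System.out::println);
    }
}
```

Seen this way, the rule is an equi-join on a derived key. In the Spark Java API that shape would presumably correspond to calling `keyBy` on each `JavaRDD` and then `join` on the resulting `JavaPairRDD`s, though I have not confirmed that this fits every mapping rule we have.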
I tried to achieve this by running a foreach over one RDD inside a foreach over the other, but Spark does not allow this:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
    at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
    at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:332)
    at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:46)
    at com.pramod.engine.DataMatchingEngine.lambda$execute$4e658232$1(DataMatchingEngine.java:44)
    at com.pramod.engine.DataMatchingEngine$$Lambda$9/1172080526.call(Unknown Source)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:332)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.apply(JavaRDDLike.scala:332)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
Please help me find the best way to achieve this.
I think you need to provide more details (at least I don't understand): what exactly do you need to do?