I tried to improve your solution by using a broadcast variable in filter():
val data1 = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)))
val data2 = sc.parallelize(Seq(("a", 3), ("b", 5)))
// broadcast data2 key list to use in filter method, which runs in executor nodes
val bcast = sc.broadcast(data2.map(_._1).collect())
val result = data1.filter(r => bcast.value.contains(r._1))
println(result.collect().toList)
//Output
List((a,1), (a,2), (b,2), (b,3))
EDIT1: (as per the comments, to address scalability without using collect())
val data1 = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)))
val data2 = sc.parallelize(Seq(("a", 3), ("b", 5)))
import org.apache.spark.rdd.RDD // needed for the explicit type annotation below
val cogroupRdd: RDD[(String, (Iterable[Int], Iterable[Int]))] = data1.cogroup(data2)
/* List(
(a, (CompactBuffer(1, 2), CompactBuffer(3))),
(b, (CompactBuffer(2, 3), CompactBuffer(5))),
(c, (CompactBuffer(1), CompactBuffer()))
) */
// Now keep only the keys whose two CompactBuffers are both non-empty. The same can be done with
// filter(row => row._2._1.nonEmpty && row._2._2.nonEmpty).
val filterRdd = cogroupRdd.filter { case (k, (v1, v2)) => v1.nonEmpty && v2.nonEmpty }
/* List(
(a, (CompactBuffer(1, 2), CompactBuffer(3))),
(b, (CompactBuffer(2, 3), CompactBuffer(5)))
) */
// As we only care about the values from data1, let's keep just the first CompactBuffer
// by doing v1.map(val1 => (k, val1))
val result = filterRdd.flatMap { case (k, (v1, v2)) => v1.map(val1 => (k, val1)) }
//List((a, 1), (a, 2), (b, 2), (b, 3))
EDIT2:
val resultRdd = data1.join(data2).map(r => (r._1, r._2._1)).distinct()
//List((b,2), (b,3), (a,2), (a,1))
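Here data1.join(data2) holds the pairs with common keys (inner join). The listing below corresponds to a data2 that also contains ("b", 1); repeated keys like that duplicate rows after the map, which is why distinct() is applied.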
//List((a,(1,3)), (a,(2,3)), (b,(2,5)), (b,(2,1)), (b,(3,5)), (b,(3,1)))
I don't understand 'cogroup' very well. What if I want to use a function operation, like 'val result = data1.filter(r => bcast.value.contains(myFuncOper(r._1)))', in 'cogroup'?
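One possible way to approach the question in the comment above, as a rough sketch rather than a definitive answer: re-key data1 by the transformed key before the cogroup, and carry the original record along as the value so it can be recovered afterwards. The body of myFuncOper here (lower-casing the key) is purely hypothetical, since the comment does not say what it does, and the mixed-case sample keys and the local[*] master are likewise assumptions made only so the example is self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CogroupWithKeyFunction {
  // Hypothetical stand-in for the myFuncOper mentioned in the comment.
  def myFuncOper(key: String): String = key.toLowerCase

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for illustration only.
    val conf = new SparkConf().setAppName("cogroup-key-fn").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Sample data with mixed-case keys so the key function has an effect.
    val data1 = sc.parallelize(Seq(("A", 1), ("a", 2), ("B", 2), ("b", 3), ("c", 1)))
    val data2 = sc.parallelize(Seq(("a", 3), ("b", 5)))

    // Re-key data1 by the transformed key; keep the original record as the value.
    val keyed = data1.map { case (k, v) => (myFuncOper(k), (k, v)) }

    val result = keyed.cogroup(data2)
      // Keep only transformed keys that appear on both sides.
      .filter { case (_, (left, right)) => left.nonEmpty && right.nonEmpty }
      // Emit the original data1 records, dropping the data2 side.
      .flatMap { case (_, (left, _)) => left }

    result.collect().foreach(println)
    sc.stop()
  }
}
```

This should keep the same records as filtering with bcast.value.contains(myFuncOper(r._1)), but without collecting the keys of data2 to the driver.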