How can I use a NOT IN clause in a filter condition in Spark? I want to filter on a column of a source RDD:
val source = sql("SELECT * from sample.source").rdd.map(_.mkString(","))
val destination = sql("select * from sample.destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
// primary keys present in source but absent from destination
val src = source_primary_key.subtract(destination_primary_key)
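I want to use an IN clause in a filter condition to select from source only the rows whose values exist in src, something like the following (edited):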
val source = spark.read.csv(inputPath + "/source").rdd.map(_.mkString(","))
val destination = spark.read.csv(inputPath + "/destination").rdd.map(_.mkString(","))
val source_primary_key = source.map(rec => (rec.split(",")(0)))
val destination_primary_key = destination.map(rec => (rec.split(",")(0)))
val extra_in_source = source_primary_key.filter(rec._1 != destination_primary_key._1)
This would be equivalent to the SQL:
SELECT * FROM SOURCE WHERE ID IN (select ID from src)
Thanks.
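What type are your values? – eliasah
The data type can vary: sometimes INT, sometimes String. – Vignesh
That is not what I asked. What is the type of 'src' or 'source'? Are you using an RDD or a DataFrame? – eliasah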
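For the RDD case shown in the question, one possible approach is to key each full row by its primary key and subtract by key, so that the whole row is kept and not only the key. The following is a minimal sketch, not a definitive answer: it assumes the primary key is the first comma-separated field, and the object name `NotInFilterSketch`, the helper `extraInSource`, and the sample rows are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object NotInFilterSketch {
  // Returns the rows of `source` whose primary key (assumed to be the first
  // comma-separated field) does not occur in `destination`.
  def extraInSource(source: RDD[String], destination: RDD[String]): RDD[String] = {
    // Pair each full row with its key so the row survives the subtraction.
    val sourceByKey = source.keyBy(_.split(",")(0))
    val destinationByKey = destination.keyBy(_.split(",")(0))
    // subtractByKey keeps the pairs whose key is absent from the other RDD.
    sourceByKey.subtractByKey(destinationByKey).values
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("not-in-filter-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample data standing in for sample.source / sample.destination.
    val source = sc.parallelize(Seq("1,a", "2,b", "3,c"))
    val destination = sc.parallelize(Seq("2,b"))

    // Rows with keys 1 and 3 remain, because key 2 exists in destination.
    extraInSource(source, destination).collect().sorted.foreach(println)

    spark.stop()
  }
}
```

If the data can stay in DataFrames instead of being converted to RDDs of strings, a join with the `left_anti` join type on the key column in Spark 2.x should express the same NOT IN logic; the column name would need to be adapted to the actual schema.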