What you tried would most probably have got you to the solution.
Your data looks like this:
val df = sc.parallelize(Array(
(1, "Shan", 101),
(2, "Shan", 101),
(3, "John", 102),
(4, "Michel", 103)
)).toDF("id","name","number")
You then worked out the grouping and counting yourself. If you do it like this
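import org.apache.spark.sql.functions.col
// names that occur more than once; renamed so the join column below is unambiguous
val repeatedNames = df.groupBy("name").count().where(col("count") > 1).withColumnRenamed("name", "repeated").drop("count")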
, then you can do something like the following afterwards to get all the rows:
val repeated = df.join(repeatedNames, repeatedNames("repeated")===df("name")).drop("repeated")
val distinct = df.except(repeated)
repeated.show()
+---+----+------+
| id|name|number|
+---+----+------+
|  1|Shan|   101|
|  2|Shan|   101|
+---+----+------+
distinct.show()
+---+------+------+
| id|  name|number|
+---+------+------+
|  4|Michel|   103|
|  3|  John|   102|
+---+------+------+
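As an alternative to the join, a window count should also keep every column while flagging duplicates in a single pass. This is only a sketch under assumptions not in the question: Spark 2.x or later with a local `SparkSession`, and a helper column called `cnt` (a hypothetical name). If duplicates are defined by `number` rather than `name`, partition by that column instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

object RepeatedRows {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("repeated-rows")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "Shan", 101),
      (2, "Shan", 101),
      (3, "John", 102),
      (4, "Michel", 103)
    ).toDF("id", "name", "number")

    // Attach the per-name row count to every row instead of collapsing the groups.
    // "cnt" is a hypothetical helper column, dropped again below.
    val withCount = df.withColumn("cnt", count(lit(1)).over(Window.partitionBy("name")))

    // Rows whose name occurs more than once, with all original columns.
    val repeated = withCount.where(col("cnt") > 1).drop("cnt")
    // Rows whose name occurs exactly once.
    val distinct = withCount.where(col("cnt") === 1).drop("cnt")

    repeated.show()
    distinct.show()

    spark.stop()
  }
}
```

Compared with the join plus `except` approach, this avoids building a second DataFrame and joining back, at the cost of a shuffle for the window partitioning.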
Hope it helps.
I tried that, but it only returns the number with its count: df.groupBy("number").count().select("*").where("count > 1"). I need all the duplicate rows with all columns.