2017-10-05 163 views
0

我需要更新delta数据的数据框的row_number列。为Delta数据更新Spark Dataframe的窗口函数row_number列

我已经实现如下的基本负载的ROW_NUMBER:

输入数据:

val base = List(List("001", "a", "abc"), List("001", "a", "123"),List("003", "c", "456") ,List("002", "b", "dfr"), List("003", "c", "ytr")) 
    .map(row => (row(0), row(1), row(2))) 

    val DS1 = base.toDF("KEY1", "KEY2" ,"VAL") 

    DS1.show() 
+----+----+---+ 
|KEY1|KEY2|VAL| 
+----+----+---+ 
| 001| a|abc| 
| 001| a|123| 
| 003| c|456| 
| 002| b|dfr| 
| 003| c|ytr| 
+----+----+---+ 

现在我已经添加使用如下面的窗口函数的ROW_NUMBER:

val baseDF = DS1.select(col("KEY1"), col("KEY2"), col("VAL") ,row_number().over(Window.partitionBy(col("KEY1"), col("KEY2")).orderBy(col("KEY1"), col("KEY2").asc)).alias("Row_Num")) 

baseDF.show() 

|KEY1|KEY2|VAL|Row_Num| 
+----+----+---+-------+ 
|001 |a |abc|1  | 
|001 |a |123|2  | 
|002 |b |dfr|1  | 
|003 |c |456|1  | 
|003 |c |ytr|2  | 
+----+----+---+-------+ 

现在三角洲负荷降至:

val delta = List(List("001", "a", "y45") ,List("002", "b", "444")) 
    .map(row => (row(0), row(1), row(2))) 

val DS2 = delta.toDF("KEY1", "KEY2" ,"VAL") 

DS2.show() 

+----+----+---+ 
|KEY1|KEY2|VAL| 
+----+----+---+ 
| 001| a|y45| 
| 002| b|444| 
+----+----+---+ 

所以预计更新的结果应该是

baseDF.show() 

|KEY1|KEY2|VAL|Row_Num| 
+----+----+---+-------+ 
|001 |a |abc|1  | 
|001 |a |123|2  | 
| 001| a|y45|3  | -----> Delta record 
|002 |b |dfr|1  | 
| 002| b|444|2  | -----> Delta record 
|003 |c |456|1  | 
|003 |c |ytr|2  | 
+----+----+---+-------+ 

任何建议,以实现使用dataframes /数据集,这个解决方案? 我们可以通过spark rdd的zipWithIndex实现上述解决方案吗?与更新的行号码加增量

回答

2

的一种方式是:1)与大量添加列Row_NumDS2,2)联合baseDF有了它,和3)计算新的行号,如下所示:

import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.expressions.Window 

val combinedDF = baseDF.union(
    DS2.withColumn("Row_Num", lit(Long.MaxValue)) 
) 

val resultDF = combinedDF.select(
    col("KEY1"), col("KEY2"), col("VAL"), row_number().over(
    Window.partitionBy(col("KEY1"), col("KEY2")).orderBy(col("Row_Num")) 
).alias("New_Row_Num") 
) 

resultDF.show 
+----+----+---+-----------+ 
|KEY1|KEY2|VAL|New_Row_Num| 
+----+----+---+-----------+ 
| 003| c|456|   1| 
| 003| c|ytr|   2| 
| 002| b|dfr|   1| 
| 002| b|444|   2| 
| 001| a|abc|   1| 
| 001| a|123|   2| 
| 001| a|y45|   3| 
+----+----+---+-----------+