2015-11-05

How do I merge rows from different DataFrames together? For example, I have this DataFrame in Scala:

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+

It has rows for the years 2012, 1997, and 2015, and we have another DataFrame like this:

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012| BMW |    3|          No comment|     |
|1997| VW  | GTI |                 get|     |
|2015| MB  | C200|                good| null|
+----+-----+-----+--------------------+-----+

It also has rows for 2012, 1997, and 2015. How can we merge the rows with the same year together? Thanks.

The output should look like this:

+----+-----+-----+--------------------+-----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     | BMW |    3|          No comment|     |
|1997| Ford| E350|Go get one now th...|     | VW  | GTI |                 get|     |
|2015|Chevy| Volt|                null| null| MB  | C200|                good| null|
+----+-----+-----+--------------------+-----+-----+-----+--------------------+-----+

Answer


You can get the table you want with a simple join. Something like:

val joined = df1.join(df2, df1("year") === df2("year")) 
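What this join does can be sketched with plain Scala collections (no Spark needed), with the rows simplified to tuples; the names and data below mirror the question's tables:

```scala
// A minimal sketch of an inner join on "year" using plain Scala
// collections instead of Spark DataFrames (rows simplified to tuples).
val df1 = Seq(
  ("2012", "Tesla", "S",    "No comment"),
  ("1997", "Ford",  "E350", "Go get one now"),
  ("2015", "Chevy", "Volt", null: String)
)
val df2 = Seq(
  ("2012", "BMW", "3",    "No comment"),
  ("1997", "VW",  "GTI",  "get"),
  ("2015", "MB",  "C200", "good")
)

// For each row of df1, pair it with every row of df2 that has the
// same year -- the equivalent of df1("year") === df2("year").
val joined = for {
  left  <- df1
  right <- df2
  if left._1 == right._1
} yield (left, right)
```

Each element of `joined` holds the matched row from both sides, just as the joined DataFrame carries both sets of columns.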

I loaded your input as an example, and I see the following:

scala> df1.show
...
year make  model comment
2012 Tesla S     No comment
1997 Ford  E350  Go get one now
2015 Chevy Volt  null

scala> df2.show
...
year make model comment
2012 BMW  3     No comment
1997 VW   GTI   get
2015 MB   C200  good

When I run the join, I get:

scala> val joined = df1.join(df2, df1("year") === df2("year")) 
joined: org.apache.spark.sql.DataFrame = [year: string, make: string, model: string, comment: string, year: string, make: string, model: string, comment: string] 

scala> joined.show
...
year make  model comment        year make model comment
2012 Tesla S     No comment     2012 BMW  3    No comment
2015 Chevy Volt  null           2015 MB   C200 good
1997 Ford  E350  Go get one now 1997 VW   GTI  get

One thing to note is that your column references may be ambiguous, since the columns are named the same in both DataFrames (so you may want to rename them to make operations on the resulting DataFrame easier to write).


Does Spark have inner join, left join, right join, or full join? Thanks –
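For reference, Spark's DataFrame `join` has an overload that takes a join-type string as a third argument, e.g. `df1.join(df2, df1("year") === df2("year"), "left_outer")`, with types such as `"inner"`, `"outer"`, `"left_outer"`, `"right_outer"`, and `"leftsemi"`. The difference between an inner and a left outer join can be sketched on plain Scala collections (keys only, no Spark):

```scala
// Sketch of inner vs. left-outer join semantics on plain collections
// (years only, to keep it short). This is not the Spark API, just the idea.
val left  = Seq("2012", "1997", "2015")
val right = Seq("2012", "2015", "2020")

// Inner join: only keys present on both sides survive.
val inner = left.filter(right.contains)

// Left outer join: every left key survives; unmatched keys get None,
// where Spark would put nulls in the right-hand columns.
val leftOuter = left.map(y => (y, if (right.contains(y)) Some(y) else None))
```

Here `inner` keeps only 2012 and 2015, while `leftOuter` also keeps 1997 paired with `None`.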