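I want to join two tables in Spark 2 on an OR condition. I am using Spark version 2.1, and using OR in the join condition results in a cross join.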
SELECT *
FROM Tb1
INNER JOIN Tb2
ON Tb1.key1=Tb2.key1
OR Tb1.key2=Tb2.Key2
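But it results in a cross join. How can I join the two tables and get only the matching records?
I also tried a left outer join, but it also forces me to switch to a cross join instead.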
By joining twice:
select *
from Tb1
inner join Tb2
on Tb1.key1=Tb2.key1
inner join Tb2 as Tb22
on Tb1.key2=Tb22.Key2
Or by left joining twice:
select *
from Tb1
left join Tb2
on Tb1.key1=Tb2.key1
left join Tb2 as Tb22
on Tb1.key2=Tb22.Key2
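One thing worth checking before relying on the double join: an inner join applied twice keeps only the Tb1 rows that find a match on key1 and also a match on key2, which is not the same as the OR condition in the question. A rewrite that does keep the OR semantics is a UNION of two single-key joins, and each branch is then a plain equi-join. The sketch below only checks the semantics, using Python's built-in sqlite3 rather than Spark; the sample rows and the val column are made up for illustration, and whether the UNION form avoids the cross join on Spark 2.1 is an assumption you would need to confirm with the query plan.

```python
import sqlite3

# Hypothetical sample rows (key1, key2, val); not taken from the question.
tb1_rows = [(1, 2, "a"), (4, 5, "b"), (7, 8, "c"), (13, 14, "d")]
tb2_rows = [(1, 4, "p"), (4, 5, "q"), (56, 14, "r"), (34, 23, "s")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tb1 (key1 INTEGER, key2 INTEGER, val TEXT)")
conn.execute("CREATE TABLE Tb2 (key1 INTEGER, key2 INTEGER, val TEXT)")
conn.executemany("INSERT INTO Tb1 VALUES (?, ?, ?)", tb1_rows)
conn.executemany("INSERT INTO Tb2 VALUES (?, ?, ?)", tb2_rows)

# The join from the question: a single join with an OR condition.
or_join = """
SELECT Tb1.val, Tb2.val FROM Tb1
INNER JOIN Tb2 ON Tb1.key1 = Tb2.key1 OR Tb1.key2 = Tb2.key2
"""

# One equi-join per key, combined with UNION. UNION removes duplicate
# output rows, so a pair matching on both keys appears once. It would also
# collapse genuinely duplicated rows, which the OR join would keep.
union_join = """
SELECT Tb1.val, Tb2.val FROM Tb1
INNER JOIN Tb2 ON Tb1.key1 = Tb2.key1
UNION
SELECT Tb1.val, Tb2.val FROM Tb1
INNER JOIN Tb2 ON Tb1.key2 = Tb2.key2
"""

# The "join twice" form from the answer above.
join_twice = """
SELECT Tb1.val, Tb2.val, Tb22.val FROM Tb1
INNER JOIN Tb2 ON Tb1.key1 = Tb2.key1
INNER JOIN Tb2 AS Tb22 ON Tb1.key2 = Tb22.key2
"""

or_rows = sorted(conn.execute(or_join).fetchall())
union_rows = sorted(conn.execute(union_join).fetchall())
twice_rows = sorted(conn.execute(join_twice).fetchall())

print("OR join:   ", or_rows)
print("UNION join:", union_rows)
print("Join twice:", twice_rows)
```

On this sample data the OR join and the UNION form return the same three pairs, while the double inner join returns only the one Tb1 row that matches on both keys.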
Try this approach:
from pyspark.sql import SQLContext as SQC

# `sc` is the SparkContext that the pyspark shell provides
sqc = SQC(sc)
x = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15)]
y = [(1, 4, 5), (4, 5, 6), (10, 11, 16), (34, 23, 31), (56, 14, 89)]
x_df = sqc.createDataFrame(x, ["x", "y", "z"])
y_df = sqc.createDataFrame(y, ["x", "y", "z"])
# Match rows where either the x columns or the y columns are equal
cond = [(x_df.x == y_df.x) | (x_df.y == y_df.y)]
x_df.join(y_df, cond, "inner").show()
Output:
+---+---+---+---+---+---+
|  x|  y|  z|  x|  y|  z|
+---+---+---+---+---+---+
|  1|  2|  3|  1|  4|  5|
|  4|  5|  6|  4|  5|  6|
| 10| 11| 12| 10| 11| 16|
| 13| 14| 15| 56| 14| 89|
+---+---+---+---+---+---+
You are right, but not in my case; your answer results in a cross join.