PySpark left join of dataframes matches on wrong join key values



I have 2 Spark dataframes. df1 has the columns customerid and salary; df2 has the columns customerid2 and education.

Example df1:

| customerid | salary |
|------------|--------|
| c1         | 120    |
| c2         | 90     |
| c3         | 90     |
| c4         | 100    |

df2:

| customerid2 | education  |
|-------------|------------|
| c1          | BA         |
| c2          | BS         |
| c5          | PhD        |
| c4          | BS Physics |

I want to create a new dataframe, df_new, by joining the two dataframes above with the code below: a left join of df1 with df2 on the keys customerid and customerid2.

df_new = df1.join(df2, on=[df1.customerid == df2.customerid2], how='left')

Expected output (df_new):

| customerid | salary | customerid2 | education  |
|------------|--------|-------------|------------|
| c1         | 120    | c1          | BA         |
| c2         | 90     | c2          | BS         |
| c3         | 90     | null        | null       |
| c4         | 100    | c4          | BS Physics |

Current output (df_new):

| customerid | salary | customerid2 | education  |
|------------|--------|-------------|------------|
| c1         | 120    | c1          | BA         |
| c2         | 90     | c5          | PhD        | <-- problem in this row
| c3         | 90     | null        | null       |
| c4         | 100    | c4          | BS Physics |

The problem is that when I perform the join on the Spark dataframes, some records get joined even though their customer ID values differ.

Thanks to this great community for any response to this very unusual problem.

Taking your data as an example, the join produces the expected output you posted:
>>> columns2 = ["customerid2","education"]
>>> data2=[("c1","BA"),("c2","BS"),("c5","phD"),("c4","BS Physics")]
>>> rdd2=sc.parallelize(data2)
>>> df2=rdd2.toDF(columns2)
>>> columns = ["customerid","salary"]
>>> data=[("c1","120"),("c2","90"),("c3","90"),("c4","100")]
>>> rdd=sc.parallelize(data)
>>> df1=rdd.toDF(columns)
>>> df_new = df1.join(df2,df1.customerid == df2.customerid2,"leftouter")
>>> df_new.show()

+----------+------+-----------+----------+
|customerid|salary|customerid2| education|
+----------+------+-----------+----------+
|        c1|   120|         c1|        BA|
|        c4|   100|         c4|BS Physics|
|        c3|    90|       null|      null|
|        c2|    90|         c2|        BS|
+----------+------+-----------+----------+

You can check whether the data contains leading or trailing whitespace in the key columns.
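The whitespace pitfall is easy to demonstrate without a Spark session. The plain-Python sketch below uses a dict lookup to stand in for the equality join condition; the dataframe values are the ones from the question, except for a hypothetical trailing space added to one key. In PySpark the equivalent fix would be trimming the key column before the join, e.g. `df2.withColumn("customerid2", F.trim("customerid2"))` with `from pyspark.sql import functions as F`:

```python
# Illustration of how hidden whitespace breaks an equality join.
df1 = [("c1", 120), ("c2", 90), ("c3", 90), ("c4", 100)]
df2 = [("c1", "BA"), ("c2 ", "BS"),          # note the trailing space in "c2 "
       ("c5", "PhD"), ("c4", "BS Physics")]

# Join without cleaning: "c2" != "c2 ", so c2 finds no match.
raw = {k: v for k, v in df2}
print([(cid, raw.get(cid)) for cid, _ in df1])

# After stripping the keys, every intended match is found.
trimmed = {k.strip(): v for k, v in df2}
print([(cid, trimmed.get(cid)) for cid, _ in df1])
```

If trimming changes the result, the source data (not the join logic) is the culprit, and the cleanup belongs upstream of the join.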
