PySpark: this may be a duplicate, but I couldn't find a similar question.
I have a table A:
a | b | c
---------
1 | 3 | p
2 | 4 | q
3 | 4 | r
4 | 7 | s
And a table B:
p | q
---------
1 | Yes
2 | No
3 | Yes
I want the result table joined on the rows where the value of column a equals the value of column p. I tried an inner join, but it returned a copy of the entire table A for every value of q. The table I want is:
a | b | c | q
--------------
1 | 3 | p | Yes
2 | 4 | q | No
3 | 4 | r | Yes
How can I achieve this in PySpark? Also, what should I do if I want this table instead:
a | b | c | q
--------------
1 | 3 | p | Yes
2 | 4 | q | No
3 | 4 | r | Yes
4 | 7 | s | null
You can do this easily with a join statement between the two DataFrames.
More on joins can be found here - Spark Joins
Data Preparation
import pandas as pd
from io import StringIO

from pyspark.sql import SparkSession

# Assumption: the answer uses `sql` as the SparkSession without showing
# how it was created; this is one way to obtain it.
sql = SparkSession.builder.getOrCreate()

df1 = pd.read_csv(StringIO("""
a|b|c
1|3|p
2|4|q
3|4|r
4|7|s
"""), delimiter='|')

df2 = pd.read_csv(StringIO("""
p|q
1|Yes
2|No
3|Yes
"""), delimiter='|')

sparkDF1 = sql.createDataFrame(df1)
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| p|
| 2| 4| q|
| 3| 4| r|
| 4| 7| s|
+---+---+---+
sparkDF2.show()
+---+---+
| p| q|
+---+---+
| 1|Yes|
| 2| No|
| 3|Yes|
+---+---+
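As a side note, if you would rather skip the pandas round trip, the same DataFrames can be built directly from tuples. A minimal sketch, assuming the same SparkSession `sql` as above:

# Minimal sketch: build the same DataFrames without pandas,
# passing rows as tuples and column names as a list.
sparkDF1 = sql.createDataFrame(
    [(1, 3, 'p'), (2, 4, 'q'), (3, 4, 'r'), (4, 7, 's')],
    ['a', 'b', 'c']
)
sparkDF2 = sql.createDataFrame(
    [(1, 'Yes'), (2, 'No'), (3, 'Yes')],
    ['p', 'q']
)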
Join - Inner
finalDF = sparkDF1.join(
    sparkDF2,
    sparkDF1['a'] == sparkDF2['p'],  ### Joining Key
    'inner'                          ### Join Type
).select(sparkDF1['*'], sparkDF2['q'])
finalDF.orderBy('a').show()
+---+---+---+---+
| a| b| c| q|
+---+---+---+---+
| 1| 3| p|Yes|
| 2| 4| q| No|
| 3| 4| r|Yes|
+---+---+---+---+
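If you prefer SQL syntax, the same inner join can be written against temp views. A minimal sketch, again assuming the SparkSession is available as `sql`:

# Register the DataFrames as temp views so they can be queried with SQL.
sparkDF1.createOrReplaceTempView("t1")
sparkDF2.createOrReplaceTempView("t2")

sql.sql("""
    SELECT t1.*, t2.q
    FROM t1
    INNER JOIN t2 ON t1.a = t2.p
    ORDER BY t1.a
""").show()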
Join - Left
finalDF = sparkDF1.join(
    sparkDF2,
    sparkDF1['a'] == sparkDF2['p'],  ### Joining Key
    'left'                           ### Join Type
).select(sparkDF1['*'], sparkDF2['q'])
finalDF.orderBy('a').show()
+---+---+---+----+
| a| b| c| q|
+---+---+---+----+
| 1| 3| p| Yes|
| 2| 4| q| No|
| 3| 4| r| Yes|
| 4| 7| s|null|
+---+---+---+----+
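If you would rather not show null for the unmatched row, you can substitute a default after the left join. A minimal sketch using fillna; the placeholder value 'n/a' is just an illustration:

# Replace nulls in q (rows from table A with no match in table B)
# with a placeholder; 'n/a' is an arbitrary choice.
finalDF.fillna({'q': 'n/a'}).orderBy('a').show()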