Given two DataFrames:
A
+---+---+---+
|id1|id2|id3|
+---+---+---+
|11 |22 |aaa|
|12 |23 |bbb|
|13 |34 |L12|
|14 |32 |L22|
+---+---+---+
B
+---+---+----+
|id1|id2|type|
+---+---+----+
| 22| 11| red|
| 23| 12| red|
| 34|L12|blue|
| 32|L22|blue|
+---+---+----+
I want to join them as follows:
if B.type == 'red': A.id1 == B.id2
else if B.type == 'blue': (A.id2 == B.id1) & (A.id3 == B.id2)
so that in the end I get:
+---+---+---+---+---+----+
|id1|id2|id3|id1|id2|type|
+---+---+---+---+---+----+
| 11| 22|aaa| 22| 11| red|
| 12| 23|bbb| 23| 12| red|
| 13| 34|L12| 34|L12|blue|
| 14| 32|L22| 32|L22|blue|
+---+---+---+---+---+----+
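However, the result above was obtained by extracting the condition, e.g.: join_condition = (when(B.type == 'red', A.id == B.id2) ...
I would like to approach the problem like this instead: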
reds = B.filter(B.type == 'red')
blues = B.filter(B.type == 'blue')
and then join them one after the other:
a_reds = A.join(reds, A.id1 == reds.id2, 'left')
a_blues = A.join(blues, (A.id2 == blues.id1) & (A.id3 == blues.id2), 'left')
Now, to get a single unified table, I want to union them, but without the null rows that show up after calling union.
For example:
+---+---+---+----+----+----+
|id1|id2|id3| id1| id2|type|
+---+---+---+----+----+----+
| 14| 32|L22|null|null|null|
| 11| 22|aaa| 22| 11| red|
| 12| 23|bbb| 23| 12| red|
| 13| 34|L12|null|null|null|
| 12| 23|bbb|null|null|null|
| 14| 32|L22| 32| L22|blue|
| 13| 34|L12| 34| L12|blue|
| 11| 22|aaa|null|null|null|
+---+---+---+----+----+----+
Can this be done? If so, how?
Thanks.
You can avoid the null records by not doing a left join.
Alternatively, you can filter out the rows where type is null after performing the union.
You can use a single conditional join instead of 2 joins + a union.
# Assuming A and B are the DataFrames from the question.
from pyspark.sql import functions as F
join_cond = (F.when(F.col('type') == 'red', A.id1 == B.id2)
              .when(F.col('type') == 'blue', (A.id2 == B.id1) & (A.id3 == B.id2)))
df = A.join(B, join_cond)
Result:
+---+---+---+---+---+----+
|id1|id2|id3|id1|id2|type|
+---+---+---+---+---+----+
| 11| 22|aaa| 22| 11| red|
| 12| 23|bbb| 23| 12| red|
| 13| 34|L12| 34|L12|blue|
| 14| 32|L22| 32|L22|blue|
+---+---+---+---+---+----+