I'm fairly new to Spark, and I'd like to know whether I can track which DataFrame each key came from when doing an outer join. Suppose I have 3 DataFrames:
df1
+-----+----+
|item1| key|
+-----+----+
|Item1|key1|
|Item2|key2|
|Item3|key3|
|Item4|key4|
|Item5|key5|
+-----+----+
df2
+-----+----+
|item2| key|
+-----+----+
| t1|key1|
| t2|key2|
| t3|key6|
| t4|key7|
| t5|key8|
+-----+----+
df3
+-----+-----+
|item3| key|
+-----+-----+
| t1| key1|
| t2| key2|
| t3| key8|
| t4| key9|
| t5|key10|
+-----+-----+
I want to do a full outer join on these 3 DataFrames and include a new column that indicates which DataFrame each key came from. For example:
+-----+-----+-----+-----+------+
| key|item1|item2|item3|source|
+-----+-----+-----+-----+------+
| key8| null| t5| t3| DF2|
| key5|Item5| null| null| DF1|
| key7| null| t4| null| DF2|
| key3|Item3| null| null| DF1|
| key6| null| t3| null| DF2|
| key1|Item1| t1| t1| DF1|
| key4|Item4| null| null| DF1|
| key2|Item2| t2| t2| DF1|
| key9| null| null| t4| DF3|
|key10| null| null| t5| DF3|
+-----+-----+-----+-----+------+
Is there any way to achieve this?
I would do something like this:
from pyspark.sql.functions import col, coalesce, when

# Assumes an active SparkSession named `spark`.
df1 = spark.createDataFrame(
    [("Item1", "key1"), ("Item2", "key2"), ("Item3", "key3"),
     ("Item4", "key4"), ("Item5", "key5")],
    ["item1", "key"])
df2 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key6"),
     ("t4", "key7"), ("t5", "key8")],
    ["item2", "key"])
df3 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key8"),
     ("t4", "key9"), ("t5", "key10")],
    ["item3", "key"])

# Full outer join on the shared key, then derive `source`:
# each when() is null unless its column is non-null, and coalesce
# returns the first non-null label, so keys present in several
# DataFrames resolve in df1, df2, df3 priority order.
result = (df1
    .join(df2, ["key"], "outer")
    .join(df3, ["key"], "outer")
    .withColumn(
        "source",
        coalesce(
            when(col("item1").isNotNull(), "df1"),
            when(col("item2").isNotNull(), "df2"),
            when(col("item3").isNotNull(), "df3"))))

result.show()
The result is:
## +-----+-----+-----+-----+------+
## | key|item1|item2|item3|source|
## +-----+-----+-----+-----+------+
## | key8| null| t5| t3| df2|
## | key5|Item5| null| null| df1|
## | key7| null| t4| null| df2|
## | key3|Item3| null| null| df1|
## | key6| null| t3| null| df2|
## | key1|Item1| t1| t1| df1|
## | key4|Item4| null| null| df1|
## | key2|Item2| t2| t2| df1|
## | key9| null| null| t4| df3|
## |key10| null| null| t5| df3|
## +-----+-----+-----+-----+------+
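The labels come out lowercase (df1) rather than DF1 as in your expected output; change the string literals if the case matters. If you later need the same pattern over more than three DataFrames, both the join and the coalesce generalize. Below is a minimal sketch, assuming every DataFrame shares a key column named key; the dfs list and its labels are illustrative, not part of the answer above:

from functools import reduce
from pyspark.sql.functions import col, coalesce, when

# Illustrative: (label, DataFrame, value-column) triples, in priority order.
dfs = [("df1", df1, "item1"), ("df2", df2, "item2"), ("df3", df3, "item3")]

# Fold the full outer joins left to right over the shared key.
joined = reduce(lambda left, right: left.join(right, ["key"], "outer"),
                [df for _, df, _ in dfs])

# The first non-null when() wins, preserving the priority order of `dfs`.
source = coalesce(*[when(col(c).isNotNull(), label) for label, _, c in dfs])
joined.withColumn("source", source).show()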