火花外部与源连接



我对火花是相对较新的,我想知道我是否可以获取用于外部Join

的列的来源

假设我有3 df

df 1

+-----+----+
|item1| key|
+-----+----+
|Item1|key1|
|Item2|key2|
|Item3|key3|
|Item4|key4|
|Item5|key5|
+-----+----+

df2

+-----+----+
|item2| key|
+-----+----+
|   t1|key1|
|   t2|key2|
|   t3|key6|
|   t4|key7|
|   t5|key8|
+-----+----+

df3

+-----+-----+
|item3|  key|
+-----+-----+
|   t1| key1|
|   t2| key2|
|   t3| key8|
|   t4| key9|
|   t5|key10|
+-----+-----+

我想在这3个数据范围内进行完整的外部加入,并包含一个新列以指示该键的来源。

例如

+-----+-----+-----+-----+------+
|  key|item1|item2|item3|source|
+-----+-----+-----+-----+------+
| key8| null|   t5|   t3|   DF2|
| key5|Item5| null| null|   DF1|
| key7| null|   t4| null|   DF2|
| key3|Item3| null| null|   DF1|
| key6| null|   t3| null|   DF2|
| key1|Item1|   t1|   t1|   DF1|
| key4|Item4| null| null|   DF1|
| key2|Item2|   t2|   t2|   DF1|
| key9| null| null|   t4|   DF3|
|key10| null| null|   t5|   DF3|
+-----+-----+-----+-----+------+

有什么方法可以实现这一目标?

我会做这样的事情:

from pyspark.sql.functions import col, lit, coalesce, when
df1 = spark.createDataFrame(
    [("Item1", "key1"), ("Item2", "key2"), ("Item3", "key3"), 
    ("Item4", "key4"), ("Item5", "key5")],
    ["item1", "key"])
df2 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key6"),
    ("t4", "key7"), ("t5", "key8")],
    ["item2", "key"])
df3 = spark.createDataFrame([
    ("t1", "key1"), ("t2", "key2"), ("t3", "key8"),
    ("t4", "key9"), ("t5", "key10")],
    ["item3", "key"])
df1.join(df2, ["key"], "outer").join(df3, ["key"], "outer").withColumn(
    "source",
    coalesce(
        when(col("item1").isNotNull(), "df1"), 
        when(col("item2").isNotNull(), "df2"),  
        when(col("item3").isNotNull(), "df3")))

结果是:

## +-----+-----+-----+-----+------+                      
## |  key|item1|item2|item3|source|
## +-----+-----+-----+-----+------+
## | key8| null|   t5|   t3|   df2|
## | key5|Item5| null| null|   df1|
## | key7| null|   t4| null|   df2|
## | key3|Item3| null| null|   df1|
## | key6| null|   t3| null|   df2|
## | key1|Item1|   t1|   t1|   df1|
## | key4|Item4| null| null|   df1|
## | key2|Item2|   t2|   t2|   df1|
## | key9| null| null|   t4|   df3|
## |key10| null| null|   t5|   df3|
## +-----+-----+-----+-----+------+

相关内容

  • 没有找到相关文章

最新更新