将两个具有单独键的数据帧组合为一个数据帧,以便可以根据键选择两列



我希望新列DATE1等于KEY1上数据帧1(DF1(中的列START,并与DF2中基于KEY2的数据帧2(DF2(组合,以便在键可能位于联接中时显示DATE1。我可以显示列的开头,但它显示所有内容。

我希望DATE2等于KEY1上数据帧1(DF1(中的列START,但基于DF2中名为KEY3的不同键与DF2组合,以便在键在联接中匹配时显示DATE2。我可以显示列开始,但不知道如何只显示列开始时,结合两个键。

DF1的示例输入为:

+---------+--------+------+------+
|START    |KEY1    |Color  OTHER |
+---------+--------+------+------+
| 10/05/21|  1     | White|  3000|
| 10/06/21|  2     |  Blue|  4100|
| 10/07/21|  3     | Green|  6200|
+---------+--------+------+------+

DF2输入为:

+---------+--------+----+
|KEY2     |KEY3 |NUMBER|
+---------+--------+----+
|  1     |     2|  3000 |
|  2     |     3|  4100 |
|  3     |     1|  6200 |
+---------+--------+----+

输出如下:

+---------+--------+
|DATE1    | DATE2 |
+---------+--------+
| 10/05/21|10/06/21|       
| 10/06/21|10/07/21|       
| 10/07/21|10/05/21|      
+---------+--------+

以下是代码

def transform_df_data(df: DataFrame):
return 
df 
.withColumn("DATE1", col("START")) 
.withColumn("DATE2", col("START")) 
.withColumn("KEY1", col("KEY1")) 
.select("KEY1","DATE1","DATE2")

def build_final_df(df:DataFrame, otherdf:Dataframe)
df_transform = transform_df_data(d_period)
return final_transform.join(df_transform , final_transform.KEY1 == df_transform(KEY2, 'inner').withColumn("DATE1", col("START")).select("DATE1","DATE2")

请注意,我确实正确理解了这个问题,但我认为您希望在KEY1 = KEY2上加入df1df2,然后在KEY1 = KEY3上再次加入df1的结果:

import pyspark.sql.functions as F
data1 = [("10/05/21", 1, "White", 3000), ("10/06/21", 2, "Blue", 4100), ("10/07/21", 3, "Green", 6200)]
df1 = spark.createDataFrame(data1, ["START", "KEY1", "Color", "OTHER"])
data2 = [(1, 2, 3000), (2, 3, 4100), (3, 1, 6200)]
df2 = spark.createDataFrame(data2, ["KEY2", "KEY3", "NUMBER"])
df_result = df1.withColumnRenamed("START", "DATE1").join(
df2,
F.col("KEY1") == F.col("KEY2")
).select("DATE1", "KEY3").join(
df1.withColumnRenamed("START", "DATE2"),
F.col("KEY1") == F.col("KEY3")
).select("DATE1", "DATE2")
df_result.show()
#+--------+--------+
#|   DATE1|   DATE2|
#+--------+--------+
#|10/07/21|10/05/21|
#|10/05/21|10/06/21|
#|10/06/21|10/07/21|
#+--------+--------+

最新更新