I want a new column DATE1 equal to the START column of DataFrame 1 (DF1), joined with DF2 on KEY1 = KEY2, so that DATE1 is only shown when the key matches in the join. I can display the START column, but it shows everything.
Similarly, I want DATE2 equal to the START column of DF1, but joined with DF2 on a different key, KEY3, so that DATE2 is only shown when the keys match in that join. I can display the START column, but I don't know how to display it only for the rows where the two keys combine.
Example input for DF1:
+---------+----+------+-----+
|START    |KEY1|Color |OTHER|
+---------+----+------+-----+
| 10/05/21|   1| White| 3000|
| 10/06/21|   2|  Blue| 4100|
| 10/07/21|   3| Green| 6200|
+---------+----+------+-----+
Input for DF2:
+----+----+------+
|KEY2|KEY3|NUMBER|
+----+----+------+
|   1|   2|  3000|
|   2|   3|  4100|
|   3|   1|  6200|
+----+----+------+
The expected output is:
+---------+--------+
|DATE1 | DATE2 |
+---------+--------+
| 10/05/21|10/06/21|
| 10/06/21|10/07/21|
| 10/07/21|10/05/21|
+---------+--------+
Here is my code:
def transform_df_data(df: DataFrame):
    return (
        df
        .withColumn("DATE1", col("START"))
        .withColumn("DATE2", col("START"))
        .select("KEY1", "DATE1", "DATE2")
    )

def build_final_df(df: DataFrame, otherdf: DataFrame):
    df_transform = transform_df_data(df)
    return (
        otherdf
        .join(df_transform, otherdf.KEY2 == df_transform.KEY1, 'inner')
        .select("DATE1", "DATE2")
    )
I hope I've understood the question correctly: I think you want to join df1 and df2 on KEY1 = KEY2, then join the result with df1 again on KEY1 = KEY3:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data1 = [("10/05/21", 1, "White", 3000), ("10/06/21", 2, "Blue", 4100), ("10/07/21", 3, "Green", 6200)]
df1 = spark.createDataFrame(data1, ["START", "KEY1", "Color", "OTHER"])
data2 = [(1, 2, 3000), (2, 3, 4100), (3, 1, 6200)]
df2 = spark.createDataFrame(data2, ["KEY2", "KEY3", "NUMBER"])
df_result = df1.withColumnRenamed("START", "DATE1").join(
    df2,
    F.col("KEY1") == F.col("KEY2")
).select("DATE1", "KEY3").join(
    df1.withColumnRenamed("START", "DATE2"),
    F.col("KEY1") == F.col("KEY3")
).select("DATE1", "DATE2")
df_result.show()
#+--------+--------+
#| DATE1| DATE2|
#+--------+--------+
#|10/07/21|10/05/21|
#|10/05/21|10/06/21|
#|10/06/21|10/07/21|
#+--------+--------+
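Note that row order after a join is not deterministic, which is why the rows above appear in a different order than the expected output; the contents match. To sanity-check the same two-join logic outside Spark, here is a small sketch of the identical steps in pandas (assuming pandas is available; the names df1, df2, step1, and result are illustrative, mirroring the Spark example):

```python
import pandas as pd

# Rebuild the example data from the question.
df1 = pd.DataFrame({
    "START": ["10/05/21", "10/06/21", "10/07/21"],
    "KEY1": [1, 2, 3],
    "Color": ["White", "Blue", "Green"],
    "OTHER": [3000, 4100, 6200],
})
df2 = pd.DataFrame({
    "KEY2": [1, 2, 3],
    "KEY3": [2, 3, 1],
    "NUMBER": [3000, 4100, 6200],
})

# First join: DF1.KEY1 == DF2.KEY2 produces DATE1 and carries KEY3 along.
step1 = (df1.rename(columns={"START": "DATE1"})
            .merge(df2, left_on="KEY1", right_on="KEY2")
            [["DATE1", "KEY3"]])

# Second join: KEY3 back onto DF1.KEY1 produces DATE2.
result = (step1.merge(df1.rename(columns={"START": "DATE2"}),
                      left_on="KEY3", right_on="KEY1")
               [["DATE1", "DATE2"]])

print(result.sort_values("DATE1").to_string(index=False))
```

Sorting by DATE1 reproduces the three expected (DATE1, DATE2) pairs from the question.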