Pyspark在DataBricks上进行两个DataFrame映射



我是databricks(pyspark(的新手,我想根据条件映射到数据帧。DF1有id、date和其他一些列,DF2也有id、date和其他一些栏。例如DF1

名称其他列XZ18610XZ25277[/td>XZ25277[/td>XZU6S19XZ54866[/td>XZ82434[/td>XZ78433[/td>
日期ID
21/06/2022 ABC
22/05/2022 ABC XZ18610
22/04/2022 ABC XZ18610
05/05/2022 DEF
04/02/2022 DEF
28/06/2022 GHI
18/07/2022 JKL
27/07/2022 MNO
20/06/2022 PQR

您可以使用窗口函数:

from pyspark.sql import functions as F
from pyspark.sql import Window
df1 = spark.createDataFrame(
[
('21/06/2022','ABC','XZ18610',''),
('22/05/2022','ABC','XZ18610',''),
('22/04/2022','ABC','XZ18610',''),
('05/05/2022','DEF','XZ25277',''),
('04/02/2022','DEF','XZ25277',''),
('28/06/2022','GHI','XZU6S19',''),
('18/07/2022','JKL','XZ54866',''),
('27/07/2022','MNO','XZ82434',''),
('20/06/2022','PQR','XZ78433','')
],
['Date','name','ID','Other columns']
)
.withColumn('Date', F.to_date('Date', 'd/M/y'))
df2 = spark.createDataFrame(
[
('30/05/2022','XZ18610','B',''),
('21/06/2021','XZ18610','A',''),
('05/01/2021','XZ25277','B',''),
('28/07/2022','XZU6S19','E',''),
('18/05/2022','XZ54866','D',''),
('27/07/2022','XZ82434','F',''),
('20/06/2022','XZ78433','I','')
],
['Date1','ID1','Value','Other columns1']
)
.withColumn('Date1', F.to_date('Date1', 'd/M/y'))

df3 = df1
.join(df2, df1.ID == df2.ID1, 'left')
.filter(df2.Date1 <= df1.Date)
.withColumn('datediff', F.datediff('Date', 'Date1'))
.withColumn("min", F.min('datediff').over(Window.partitionBy("Date", "ID")))
.filter(F.col('datediff')==F.col('min'))
result = df1.join(df3, ['Date', 'name', 'ID', 'Other columns'], 'left')

result
.select('Date','name','ID','Other columns','Date1','Value','Other columns1')
.show()

# +----------+----+-------+-------------+----------+-----+--------------+
# |      Date|name|     ID|Other columns|     Date1|Value|Other columns1|
# +----------+----+-------+-------------+----------+-----+--------------+
# |2022-06-21| ABC|XZ18610|             |2022-05-30|    B|              |
# |2022-05-22| ABC|XZ18610|             |2021-06-21|    A|              |
# |2022-04-22| ABC|XZ18610|             |2021-06-21|    A|              |
# |2022-05-05| DEF|XZ25277|             |2021-01-05|    B|              |
# |2022-02-04| DEF|XZ25277|             |2021-01-05|    B|              |
# |2022-06-28| GHI|XZU6S19|             |      null| null|          null|
# |2022-07-18| JKL|XZ54866|             |2022-05-18|    D|              |
# |2022-07-27| MNO|XZ82434|             |2022-07-27|    F|              |
# |2022-06-20| PQR|XZ78433|             |2022-06-20|    I|              |
# +----------+----+-------+-------------+----------+-----+--------------+

最新更新