我是databricks(pyspark(的新手,我想根据条件映射到数据帧。DF1有id、date和其他一些列,DF2也有id、date和其他一些栏。例如DF1
日期 | 名称ID | 其他列|
---|---|---|
21/06/2022 |
ABC | XZ18610|
22/05/2022 |
ABC | XZ18610 |
22/04/2022 |
ABC | XZ18610 |
05/05/2022 |
DEF | XZ25277[/td>|
04/02/2022 |
DEF | XZ25277[/td>|
28/06/2022 |
GHI | XZU6S19|
18/07/2022 |
JKL | XZ54866[/td>|
27/07/2022 |
MNO | XZ82434[/td>|
20/06/2022 |
PQR | XZ78433[/td>
您可以使用窗口函数:
from pyspark.sql import functions as F
from pyspark.sql import Window
df1 = spark.createDataFrame(
[
('21/06/2022','ABC','XZ18610',''),
('22/05/2022','ABC','XZ18610',''),
('22/04/2022','ABC','XZ18610',''),
('05/05/2022','DEF','XZ25277',''),
('04/02/2022','DEF','XZ25277',''),
('28/06/2022','GHI','XZU6S19',''),
('18/07/2022','JKL','XZ54866',''),
('27/07/2022','MNO','XZ82434',''),
('20/06/2022','PQR','XZ78433','')
],
['Date','name','ID','Other columns']
)
.withColumn('Date', F.to_date('Date', 'd/M/y'))
df2 = spark.createDataFrame(
[
('30/05/2022','XZ18610','B',''),
('21/06/2021','XZ18610','A',''),
('05/01/2021','XZ25277','B',''),
('28/07/2022','XZU6S19','E',''),
('18/05/2022','XZ54866','D',''),
('27/07/2022','XZ82434','F',''),
('20/06/2022','XZ78433','I','')
],
['Date1','ID1','Value','Other columns1']
)
.withColumn('Date1', F.to_date('Date1', 'd/M/y'))
df3 = df1
.join(df2, df1.ID == df2.ID1, 'left')
.filter(df2.Date1 <= df1.Date)
.withColumn('datediff', F.datediff('Date', 'Date1'))
.withColumn("min", F.min('datediff').over(Window.partitionBy("Date", "ID")))
.filter(F.col('datediff')==F.col('min'))
result = df1.join(df3, ['Date', 'name', 'ID', 'Other columns'], 'left')
result
.select('Date','name','ID','Other columns','Date1','Value','Other columns1')
.show()
# +----------+----+-------+-------------+----------+-----+--------------+
# | Date|name| ID|Other columns| Date1|Value|Other columns1|
# +----------+----+-------+-------------+----------+-----+--------------+
# |2022-06-21| ABC|XZ18610| |2022-05-30| B| |
# |2022-05-22| ABC|XZ18610| |2021-06-21| A| |
# |2022-04-22| ABC|XZ18610| |2021-06-21| A| |
# |2022-05-05| DEF|XZ25277| |2021-01-05| B| |
# |2022-02-04| DEF|XZ25277| |2021-01-05| B| |
# |2022-06-28| GHI|XZU6S19| | null| null| null|
# |2022-07-18| JKL|XZ54866| |2022-05-18| D| |
# |2022-07-27| MNO|XZ82434| |2022-07-27| F| |
# |2022-06-20| PQR|XZ78433| |2022-06-20| I| |
# +----------+----+-------+-------------+----------+-----+--------------+