我的第一个 Pandas 数据帧看起来(示例)如下 -Date
值为datetime
,Temp in F
值为float
,Signal
值为string
。
Date Temp in F Signal
1 1990-10-01 2.23337 freeze
2 1991-07-31 2.99860 defrost
3 1991-10-01 3.12221 freeze
4 1992-07-31 3.2328 defrost
5 1992-10-01 4.21327 freeze
6 1993-07-29 2.23222 defrost
7 1993-10-01 1.53200 freeze
8 1994-07-29 2.15030 defrost
9 1994-10-03 1.22299 freeze
.. ... ... ...
57 2018-10-01 8.95045 freeze
58 2019-07-31 9.32463 defrist
59 2019-10-01 9.34722 freeze
60 2020-07-31 10.53200 defrost
61 2020-10-01 10.34000 freeze
我的第二个 Pandas 数据帧看起来(示例)如下 -Date
值为datetime
,Moisture
值为浮点数。
Date Moisture
1994-05-27 4.21232
1995-05-19 3.30000
1996-05-24 3.43227
1997-05-23 3.63333
1998-05-15 4.60000
1999-05-28 2.43240
2000-05-26 1.34237
2001-05-21 1.23430
2002-05-29 2.34343
2003-05-02 1.83433
2004-04-29 2.34341
2005-06-28 3.15373
2006-05-05 1.78565
2007-05-04 0.34533
2008-08-02 0.42267
2009-05-07 0.40000
2010-08-07 0.30000
2011-05-06 2.30000
2012-05-04 3.12300
2013-05-06 4.10200
2014-05-02 2.42000
2015-05-08 2.53300
2016-06-09 1.20000
2017-05-11 1.45000
2018-05-10 1.30000
2019-05-15 1.67230
2020-05-29 2.34000
现在,我想根据Signal
所在的日期值将Moisture
的值添加到第一个数据帧freeze
。举个例子
- 在第二个数据帧中:值 4.21232 记录于 1994-05-27。此值应添加到日期为 1994-10-03 的数据帧 1 行,因为它在湿度日期 (=1994-05-27) 之后,并且信号已冻结。
- 在第二个数据帧中:值 1.67230 记录于 2019-05-15。此值应添加到日期为 2019-10-01 的数据帧 1 行,因为它在湿度日期 (=2019-05-15) 之后,并且信号已冻结。
- 如果找到的中间日期没有值,则它应保持为空。
此过程应使用第二个数据帧的"水分"列的所有值完成,以便修改后的第一个数据帧应如下所示:
Date Temp in F Signal Moisture
1 1990-10-01 2.23337 freeze
2 1991-07-31 2.99860 defrost
3 1991-10-01 3.12221 freeze
4 1992-07-31 3.2328 defrost
5 1992-10-01 4.21327 freeze
6 1993-07-29 2.23222 defrost
7 1993-10-01 1.53200 freeze
8 1994-07-29 2.15030 defrost
9 1994-10-03 1.22299 freeze 4.21232
.. ... ... ...
57 2018-10-01 8.95045 freeze 1.30000
58 2019-07-31 9.32463 defrist
59 2019-10-01 9.34722 freeze 1.67230
60 2020-07-31 10.53200 defrost
61 2020-10-01 10.34000 freeze 2.34000
有人知道如何解决这个问题吗?
将merge_asof()
与连接所需数据的子集一起使用,然后重新添加数据。
df1 = pd.read_csv(io.StringIO(""" Date Temp_in_F Signal
1 1990-10-01 2.23337 freeze
2 1991-07-31 2.99860 defrost
3 1991-10-01 3.12221 freeze
4 1992-07-31 3.2328 defrost
5 1992-10-01 4.21327 freeze
6 1993-07-29 2.23222 defrost
7 1993-10-01 1.53200 freeze
8 1994-07-29 2.15030 defrost
9 1994-10-03 1.22299 freeze
.. ... ... ...
57 2018-10-01 8.95045 freeze
58 2019-07-31 9.32463 defrist
59 2019-10-01 9.34722 freeze
60 2020-07-31 10.53200 defrost
61 2020-10-01 10.34000 freeze"""), sep="s+")
df1["Date"] = pd.to_datetime(df1["Date"],errors="coerce")
df1 = df1.dropna().reset_index(drop=True)
df2 = pd.read_csv(io.StringIO(""" Date Moisture
1994-05-27 4.21232
1995-05-19 3.30000
1996-05-24 3.43227
1997-05-23 3.63333
1998-05-15 4.60000
1999-05-28 2.43240
2000-05-26 1.34237
2001-05-21 1.23430
2002-05-29 2.34343
2003-05-02 1.83433
2004-04-29 2.34341
2005-06-28 3.15373
2006-05-05 1.78565
2007-05-04 0.34533
2008-08-02 0.42267
2009-05-07 0.40000
2010-08-07 0.30000
2011-05-06 2.30000
2012-05-04 3.12300
2013-05-06 4.10200
2014-05-02 2.42000
2015-05-08 2.53300
2016-06-09 1.20000
2017-05-11 1.45000
2018-05-10 1.30000
2019-05-15 1.67230
2020-05-29 2.34000"""), sep="s+")
df2["Date"] = pd.to_datetime(df2["Date"])
# useful ...
mask = df1["Signal"]=="freeze"
# exclude defrost from merge_asof()
pd.concat([
pd.merge_asof(df1[mask].sort_values("Date"), df2.sort_values("Date"), on="Date")
, df1[~mask] # put defrost back in
]).sort_values("Date").reset_index(drop=True)
输出
Date Temp_in_F Signal Moisture
1990-10-01 2.23337 freeze NaN
1991-07-31 2.99860 defrost NaN
1991-10-01 3.12221 freeze NaN
1992-07-31 3.2328 defrost NaN
1992-10-01 4.21327 freeze NaN
1993-07-29 2.23222 defrost NaN
1993-10-01 1.53200 freeze NaN
1994-07-29 2.15030 defrost NaN
1994-10-03 1.22299 freeze 4.21232
2018-10-01 8.95045 freeze 1.30000
2019-07-31 9.32463 defrist NaN
2019-10-01 9.34722 freeze 1.67230
2020-07-31 10.53200 defrost NaN
2020-10-01 10.34000 freeze 2.34000