我有两个数据帧df1和df2。Df1是一个统一的日期范围,另一列中的默认值为0。所以它看起来像:
df1:
date Val
0 2020-02-01 0
1 2020-02-02 0
2 2020-02-03 0
.
.
.
同时df2只记录发生的事件
df2:
date Val
0 2020-02-01-10:00:00 98
1 2020-02-01-13:54:09 55
2 2020-02-02-11:33:17 32
.
.
.
我希望df1.val[i]
是df2中日期df1.date[i] and df1.date[i+1]
之间所有值的和
我写了下面的lambda函数,但它不起作用:
df1['val'] = df1.apply( lambda row: df2[ (df2.date < df1.date[row.index])].sum() )
帮助使用pd.cut
:
>>> df2.groupby(pd.cut(df2['date'], bins=df1['date'], labels=df1['date'][:-1]))
['Val'].sum().reset_index()
date Val
0 2020-02-01 153
1 2020-02-02 32
一些带有解释的代码。我知道有一些更快更简单的方法,但是这个方法很容易理解。只需将虚拟数据替换为您的数据。
import pandas as pd
import numpy as np
if __name__ == "__main__" :
# create pandas dataframe with custom series input with column names date and Val
df1 = pd.DataFrame({'date':
pd.date_range(start='1/1/2000', end='31/12/2000',periods=8, normalize=True),
'Val': np.zeros(8)})
df2 = pd.DataFrame({'date':
pd.date_range(start='1/1/2000', end='31/12/2000', periods=8, normalize=True),
'Val': np.random.randint(0, 100, 8)})
# I want the df1.val[i] to be the sum of all the values in df2 between the dates df1.date[i] and df1.date[i+1] .
# I want the output to be in a new dataframe with column names date and sumVal.
# The output dataframe should only have the rows for which there is a match between the dates in df1[i] and df1[i+1].
# The output dataframe should be sorted by date.
# The output dataframe should have the same number of rows as df1.
# The output dataframe should have the same number of columns as df1.
# The output dataframe should have the same index as df1.
print('-' * 50)
print('df1')
print(df1)
print('-' * 50)
print('df2')
print(df2)
for i in range(0, len(df1) - 1):
# add sum of all the values in df2 bfore the end date of df1 (i.e. i + 1)
df1.loc[i, 'Val'] += df2.loc[df2['date']
<= df1['date'][i+1], 'Val'].sum()
# remove all sums from below date threshold (i.e. values under i)
df1.loc[i, 'Val'] += df2.loc[df2['date']
< df1['date'][i], 'Val'].sum()
print('-' * 50)
print('df1')
print(df1)
我可以推荐duckdb来完成这些任务:
import pandas as pd
import duckdb
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df1['date'] = ['2020-02-01', '2020-02-02', '2020-02-03']
df1['Val'] = [0,0,0]
df1['date'] = pd.to_datetime(df1['date'])
df1.loc[0:len(df1)-2, 'date_2'] = df1.loc[1:len(df1)-1, 'date'].values
df2['date'] = ['2020-02-01-10:00:00', '2020-02-01-13:54:09', '2020-02-02-11:33:17 ']
df2['Val'] = [98,55,32]
df2['date'] = pd.to_datetime(df2['date'])
test_df = duckdb.query(
"""select t.date, sum(t.Val) as Val
from (
select t1.date, t2.Val, t2.date as date_validation
from df1 as t1
left join df2 as t2 on t1.date < t2.date and t1.date_2 >= t2.date
) t
group by t.date
order by t.date"""
).to_df()