Dataframe值是使用lambda函数的日期之间的和



我有两个数据帧df1和df2。Df1是一个统一的日期范围,另一列中的默认值为0。所以它看起来像:

df1: 
date          Val
0    2020-02-01     0
1    2020-02-02     0
2    2020-02-03     0
.
.
.

同时df2只记录发生的事件

df2: 
date                   Val
0    2020-02-01-10:00:00     98
1    2020-02-01-13:54:09     55
2    2020-02-02-11:33:17     32         
.
.
.

我希望df1.val[i]是df2中日期df1.date[i] and df1.date[i+1]之间所有值的和

我写了下面的lambda函数,但它不起作用:

df1['val'] = df1.apply( lambda row: df2[ (df2.date < df1.date[row.index])].sum() )

帮助

使用pd.cut:

>>> df2.groupby(pd.cut(df2['date'], bins=df1['date'], labels=df1['date'][:-1])) 
['Val'].sum().reset_index()
date  Val
0 2020-02-01  153
1 2020-02-02   32

一些带有解释的代码。我知道有一些更快更简单的方法,但是这个方法很容易理解。只需将虚拟数据替换为您的数据。

import pandas as pd
import numpy as np
if __name__ ==  "__main__" :
# create pandas dataframe with custom series input with column names date and Val
df1 = pd.DataFrame({'date': 
pd.date_range(start='1/1/2000', end='31/12/2000',periods=8, normalize=True), 
'Val': np.zeros(8)})
df2 = pd.DataFrame({'date': 
pd.date_range(start='1/1/2000', end='31/12/2000', periods=8, normalize=True), 
'Val': np.random.randint(0, 100, 8)})
# I want the df1.val[i] to be the sum of all the values in df2 between the dates df1.date[i] and df1.date[i+1] .
# I want the output to be in a new dataframe with column names date and sumVal.
# The output dataframe should only have the rows for which there is a match between the dates in df1[i] and df1[i+1].
# The output dataframe should be sorted by date.
# The output dataframe should have the same number of rows as df1.
# The output dataframe should have the same number of columns as df1.
# The output dataframe should have the same index as df1.
print('-' * 50)
print('df1')
print(df1)
print('-' * 50)
print('df2')
print(df2)
for i in range(0, len(df1) - 1):
# add sum of all the values in df2 bfore the end date of df1 (i.e. i + 1)
df1.loc[i, 'Val'] += df2.loc[df2['date']
<= df1['date'][i+1], 'Val'].sum()
# remove all sums from below date threshold (i.e. values under i)
df1.loc[i, 'Val'] += df2.loc[df2['date']
< df1['date'][i], 'Val'].sum()

print('-' * 50)
print('df1')
print(df1)

我可以推荐duckdb来完成这些任务:

import pandas as pd
import duckdb
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df1['date'] = ['2020-02-01', '2020-02-02', '2020-02-03']
df1['Val'] = [0,0,0]
df1['date'] = pd.to_datetime(df1['date'])
df1.loc[0:len(df1)-2, 'date_2'] = df1.loc[1:len(df1)-1, 'date'].values

df2['date'] = ['2020-02-01-10:00:00', '2020-02-01-13:54:09', '2020-02-02-11:33:17 ']
df2['Val'] = [98,55,32]
df2['date'] = pd.to_datetime(df2['date'])
test_df = duckdb.query(
"""select t.date, sum(t.Val) as Val
from (
select t1.date, t2.Val, t2.date as date_validation 
from df1 as t1 
left join df2 as t2 on t1.date < t2.date and t1.date_2 >= t2.date
) t 
group by t.date
order by t.date"""
).to_df()

最新更新