Speeding up groupby().apply() in Pandas



I have a large dataset containing records of events that different individuals attended on particular dates. I'm trying to aggregate it by ID and year and return a new summary DataFrame.

import pandas as pd

d = pd.DataFrame(
    [["1", 2013, "a"],
     ["1", 2014, "a"],
     ["1", 2015, "b"],
     ["1", 2017, "c"],
     ["1", 2018, "c"],
     ["2", 2013, "b"],
     ["2", 2014, "a"],
     ["2", 2015, "a"],
     ["2", 2017, "b"],
     ["2", 2018, "b"]],
    columns=["ID", "Year", "Event"])
  ID  Year Event
0  1  2013     a
1  1  2014     a
2  1  2015     b
3  1  2017     c
4  1  2018     c
5  2  2013     b
6  2  2014     a
7  2  2015     a
8  2  2017     b
9  2  2018     b

I want to aggregate several measures over a fixed window of years, returning the cumulative count of events up to and including each year, along with indicators for whether each event occurred within that year.

  ID  Year total_a total_b total_c a_this_year b_this_year c_this_year
0  1  2016       2       1       0           0           0           0
1  1  2017       2       1       1           0           0           1
2  1  2018       2       1       2           0           0           1
3  2  2016       2       1       0           0           0           0
4  2  2017       2       2       0           0           1           0
5  2  2018       2       3       0           0           1           0

I wrote the following code to do this:

def get_agg(df):
    # Collect one row per year for this ID.
    out = {
        "Year": [],
        "total_a": [],
        "total_b": [],
        "total_c": [],
        "a_this_year": [],
        "b_this_year": [],
        "c_this_year": [],
    }

    for yr in range(2016, 2019):
        # All events up to and including this year.
        df_filt = df[df["Year"] <= yr]

        out["Year"].append(yr)
        for event in ["a", "b", "c"]:
            # Number of distinct years in which the event has occurred so far.
            out["total_" + event].append(df_filt[df_filt.Event == event].Year.nunique())
            # Whether the event occurred in this exact year (0/1).
            out[event + "_this_year"].append(int(any(df_filt[df_filt.Event == event].Year == yr)))

    return pd.DataFrame.from_dict(out)

d.groupby("ID").apply(get_agg)

However, on my full dataset (350,000 rows, ~150k IDs) I'm running into serious performance problems, and I'm looking for ways to optimize this process.

Perhaps you can do it with pandas methods, for example as follows, though I'm not sure it's exactly the desired output:

import pandas as pd
d = pd.DataFrame(
    [["1", 2013, "a"],
     ["1", 2014, "a"],
     ["1", 2015, "b"],
     ["1", 2017, "c"],
     ["1", 2018, "c"],
     ["2", 2013, "b"],
     ["2", 2014, "a"],
     ["2", 2015, "a"],
     ["2", 2017, "b"],
     ["2", 2018, "b"]],
    columns=["ID", "Year", "Event"])
# Add a dummy variable to count each event
d['Dummy'] = 1
# Aggregate observations by event and year
d_agg = d.groupby(['ID', 'Year', 'Event'])['Dummy'].sum()
d_agg = d_agg.reset_index()
d_agg['YearID'] = d_agg['Year'].astype(str) + "_" + d_agg['ID'].astype(str)
d_agg = d_agg.pivot(index='YearID', columns='Event', values='Dummy')
d_agg = d_agg.fillna(0)
# Get cumulative events for a rolling window
d_agg_cum = d_agg.rolling(window=3, min_periods=1).sum()
d_agg_cum.columns = [col + "_total" for col in d_agg_cum.columns]
d_agg.columns = [col + "_this_year" for col in d_agg.columns]
d_agg = d_agg.merge(d_agg_cum, left_index=True, right_index=True)
# If you want the year and id columns back
d_agg['Year'] = d_agg.index.str.split("_").str[0]
d_agg['ID'] = d_agg.index.str.split("_").str[1]
d_agg = d_agg.reset_index(drop=True)
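
One caveat about the sketch above: pivot sorts the YearID index lexicographically ("2013_1", "2013_2", "2014_1", ...), so the rolling window can sum rows belonging to different IDs at the group boundaries. If that matters, a per-ID variant of the d_agg_cum line (run in place of it, before the column renames) might look like this:

# Group by the ID embedded in the "Year_ID" index so the rolling
# window never crosses from one ID into the next.
ids = d_agg.index.str.split("_").str[1]
d_agg_cum = (d_agg.groupby(ids)
                  .rolling(window=3, min_periods=1).sum()
                  .droplevel(0))  # drop the group level, restoring the YearID index

Note also that years with no events produce no rows here, so a 3-row window is not necessarily 3 calendar years.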

Rather than filtering the data twice for each year/event combination (once for less-than-or-equal, once for equal), you could fill your dictionary from the exact year's data only.

Updating each year's cumulative figures from the previous year's values is then straightforward, as sketched below.
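
A minimal pandas sketch of that idea, assuming integer Year values and the 2016-2018 window from the question (the reindex gives years with no events, such as 2016 here, a row so the running totals still appear):

import pandas as pd

# Tally each (ID, Year, Event) combination once -- no repeated filtering.
occ = (d.groupby(["ID", "Year", "Event"]).size()
         .unstack("Event", fill_value=0)
         .gt(0)          # did the event occur at all in that exact year?
         .astype(int))

# Give every ID a row for every year up to the end of the window,
# so years without any events still contribute to the running totals.
all_years = range(d["Year"].min(), 2019)
full_idx = pd.MultiIndex.from_product([d["ID"].unique(), all_years],
                                      names=["ID", "Year"])
occ = occ.reindex(full_idx, fill_value=0).sort_index()

# Running count of distinct years in which each event has occurred,
# i.e. last year's totals updated with this year's 0/1 occurrences.
totals = occ.groupby(level="ID").cumsum().add_prefix("total_")

out = totals.join(occ.add_suffix("_this_year"))
out = out.loc[(slice(None), [2016, 2017, 2018]), :].reset_index()

This visits the data once instead of once per (ID, year) pair, which is where the original groupby("ID").apply(get_agg) spends its time.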

But at 350,000 rows, all of this, or at least the exact-year aggregation part, may work better as a SQL job than in Python/pandas.
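
If you go that route, here is a rough, purely illustrative sketch of pushing the exact-year aggregation down to the database (an in-memory SQLite database stands in for wherever the full table actually lives):

import sqlite3

import pandas as pd

# Hypothetical stand-in: load the events table into an in-memory SQLite DB.
conn = sqlite3.connect(":memory:")
d.to_sql("events", conn, index=False)

# Let the database do the exact-year aggregation; only the small
# aggregated result comes back into pandas for the cumulative step.
exact_year = pd.read_sql_query(
    """
    SELECT ID, Year, Event, COUNT(*) AS n
    FROM events
    GROUP BY ID, Year, Event
    """,
    conn,
)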
