我有一个大型数据集,其中包含不同个人在特定日期参加的事件的记录。我正试图通过ID和年份来聚合这个,并返回一个新的汇总数据框。
d = pd.DataFrame([["1", 2013, "a"]
, ["1", 2014, "a"]
, ["1", 2015, "b"]
, ["1", 2017, "c"]
, ["1", 2018, "c"]
, ["2", 2013, "b"]
, ["2", 2014, "a"]
, ["2", 2015, "a"]
, ["2", 2017, "b"]
, ["2", 2018, "b"]]
, columns = ["ID","Year","Event"])
ID Year Event
0 1 2013 a
1 1 2014 a
2 1 2015 b
3 1 2017 c
4 1 2018 c
5 2 2013 b
6 2 2014 a
7 2 2015 a
8 2 2017 b
9 2 2018 b
我想在一个固定的时间窗口内汇总不同的度量,返回在这一年之前发生的事件的累积计数,以及年内的特征。
ID Year total_a total_b total_c a_this_year b_this_year c_this_year
0 1 2016 2 1 0 0 0 0
1 1 2017 2 1 1 0 0 1
2 1 2018 2 1 2 0 0 1
3 2 2016 2 1 0 0 0 0
4 2 2017 2 2 0 0 1 0
5 2 2018 2 3 0 0 1 0
我写了下面的代码来做这个
def get_agg(df):
out = {
"Year" : []
, "total_a" : []
, "total_b" : []
, "total_c" : []
, "a_this_year" : []
, "b_this_year" : []
, "c_this_year" : []
}
for yr in range(2016,2019):
df_filt = df[df["Year"] <= yr]
out["YearId"].append(yr)
for event in ["a", "b", "c":
out["total_"+event].append(df_filt[df_filt.Event == event].Year.nunique())
out[event+"_this_year].append(any(df_filt[df_filt.Event == event].Year == yr))
return(pd.DataFrame.from_dict(out))
d.groupby(ID).apply(get_agg)
然而,我在我的完整数据集(350,000行,150k id)上发现了巨大的性能问题。我正在寻找方法来帮助优化这个过程。
也许您可以使用pandas方法,例如,虽然我不确定它是否完全是所需的输出?
import pandas as pd
d = pd.DataFrame([["1", 2013, "a"]
, ["1", 2014, "a"]
, ["1", 2015, "b"]
, ["1", 2017, "c"]
, ["1", 2018, "c"]
, ["2", 2013, "b"]
, ["2", 2014, "a"]
, ["2", 2015, "a"]
, ["2", 2017, "b"]
, ["2", 2018, "b"]]
, columns = ["ID","Year","Event"])
# Add a dummy variable to count each event
d['Dummy'] = 1
# Aggregate observationns by event and year
d_agg = d.groupby(['ID', 'Year', 'Event'])['Dummy'].sum()
d_agg = d_agg.reset_index()
d_agg['YearID'] = d_agg['Year'].astype(str) + "_" + d_agg['ID'].astype(str)
d_agg = d_agg.pivot(index='YearID', columns='Event', values='Dummy')
d_agg = d_agg.fillna(0)
# Get cumulative events for a rolling window
d_agg_cum = d_agg.rolling(window=3, min_periods=1).sum()
d_agg_cum.columns = [col + "_total" for col in d_agg_cum.columns]
d_agg.columns = [col + "_this_year" for col in d_agg.columns]
d_agg = d_agg.merge(d_agg_cum, left_index=True, right_index=True)
# If you want the year and id columns back
d_agg['Year'] = d_agg.index.str.split("_").str[0]
d_agg['ID'] = d_agg.index.str.split("_").str[1]
d_agg = d_agg.reset_index(drop=True)
对于任何年份/事件组合,不需要对数据过滤两次(一次表示小于或等于,一次表示等于),您可以只使用准确年份的数据填充您的字典。
那么用前一年的值来更新每年的累积数据就很简单了。
但是对于350,000行,也许所有这些,或者至少是确切的年份聚合部分,作为SQL工作比Python/Pandas工作更好