I am trying to get the MAX of the previous trend group. I used a custom function with apply, but the performance of this code seems very slow. Is there a faster way?
Here is my code:

def Price_MAX_prev_TREND(row):
    prev_trend_count = row['trend_count'] - 1
    return df_filtered.loc[df_filtered['trend_count'] == prev_trend_count, 'Price'].max()

df_filtered['Price_MAX_prev_TREND'] = df_filtered.apply(Price_MAX_prev_TREND, axis=1)
Compute the max of each group with groupby.max, then use map to map the shifted group maxima back. Here is a simple example:
import pandas as pd

df = pd.DataFrame({'group': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'value': [1, 4, 2, 3, 2, 0, 5, 3, 3]})
# only use sort=False if you want to keep the original order
s = df.groupby('group', sort=False)['value'].max()
df['max_previous'] = df['group'].map(s.shift())
Output:
group value max_previous
0 0 1 NaN
1 0 4 NaN
2 0 2 NaN
3 1 3 4.0
4 1 2 4.0
5 1 0 4.0
6 2 5 3.0
7 2 3 3.0
8 2 3 3.0
If you have non-consecutive integer/year/etc. groups and want to make sure you map the immediately preceding (i.e. n-1) group, rather change the index with set_axis:
df['max_previous'] = df['group'].map(s.set_axis(s.index+1))
Example to see the difference:
group value max_previous_shift max_previous_discontinuous
0 0 1 NaN NaN
1 0 4 NaN NaN
2 0 2 NaN NaN
3 1 3 4.0 4.0
4 1 2 4.0 4.0
5 1 0 4.0 4.0
6 2 5 3.0 3.0
7 2 3 3.0 3.0
8 2 3 3.0 3.0
9 4 7 5.0 NaN
10 4 3 5.0 NaN
11 4 1 5.0 NaN
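The difference table above can be reproduced by extending the earlier example with a non-consecutive group 4 (a sketch; the extra group/value rows are read off the table):

```python
import pandas as pd

# same data as before, plus a non-consecutive group 4
df = pd.DataFrame({'group': [0, 0, 0, 1, 1, 1, 2, 2, 2, 4, 4, 4],
                   'value': [1, 4, 2, 3, 2, 0, 5, 3, 3, 7, 3, 1]})

s = df.groupby('group', sort=False)['value'].max()

# shift() moves the maxima by position, so group 4 silently
# picks up group 2's max even though group 3 is missing
df['max_previous_shift'] = df['group'].map(s.shift())

# set_axis relabels the index, so group 4 looks up a max for
# group 3, finds none, and correctly gets NaN
df['max_previous_discontinuous'] = df['group'].map(s.set_axis(s.index + 1))
```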
First aggregate the maxima with GroupBy.max, then create the new column by mapping with Series.map against the previous trend - add 1 to the index (built from the trend_count column); sort=False is used for better performance:
s = df.groupby('trend_count', sort=False)['Price'].max()
df['Price_MAX_prev_TREND'] = df['trend_count'].map(s.rename(lambda x: x+1))
Another idea is to add 1 to trend_count before aggregating the max:
s = df.assign(trend_count = df['trend_count'].add(1)).groupby('trend_count', sort=False)['Price'].max()
df['Price_MAX_prev_TREND'] = df['trend_count'].map(s)
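Both variants produce the same mapping; a small self-contained check (with made-up sample data, not the asker's df_filtered):

```python
import pandas as pd

df = pd.DataFrame({'trend_count': [1, 1, 2, 2, 3, 3],
                   'Price': [10, 40, 20, 30, 50, 25]})

# variant 1: aggregate first, then shift the index labels with rename
s1 = df.groupby('trend_count', sort=False)['Price'].max()
out1 = df['trend_count'].map(s1.rename(lambda x: x + 1))

# variant 2: add 1 to trend_count before aggregating
s2 = (df.assign(trend_count=df['trend_count'].add(1))
        .groupby('trend_count', sort=False)['Price'].max())
out2 = df['trend_count'].map(s2)
```

The first trend group has no predecessor, so both variants return NaN for its rows.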
Performance depends on the data; it is best to test on your real data:
import numpy as np
import pandas as pd

np.random.seed(2023)
N = 10000
df = pd.DataFrame({'trend_count': np.random.randint(1000, size=N),
                   'Price': np.random.randint(1000, size=N)})
#original solution
In [192]: %%timeit
     ...: df['Price_MAX_prev_TREND1'] = df.apply(Price_MAX_prev_TREND, axis=1)
4.02 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [193]: %%timeit
     ...: s = df.groupby('trend_count', sort=False)['Price'].max()
     ...: df['Price_MAX_prev_TREND'] = df['trend_count'].map(s.rename(lambda x: x+1))
2.38 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [194]: %%timeit
     ...: s = df.assign(trend_count = df['trend_count'].add(1)).groupby('trend_count', sort=False)['Price'].max()
     ...: df['Price_MAX_prev_TREND'] = df['trend_count'].map(s)
2.26 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#mozway solution
In [195]: %%timeit
     ...: s = df.groupby('trend_count', sort=False)['Price'].max()
     ...: df['max_previous'] = df['trend_count'].map(s.set_axis(s.index+1))
3.01 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)