How to get the max of the previous group in pandas?



I am trying to get, for each row, the max of the previous group. I defined a function and used apply, but the performance of this code seems slow. Is there a faster way?

Here is my code:

def Price_MAX_prev_TREND(row):
    prev_trend_count = row['trend_count'] - 1
    x = df_filtered.loc[df_filtered['trend_count'] == prev_trend_count, 'Price'].max()
    return x

df_filtered['Price_MAX_prev_TREND'] = df_filtered.apply(Price_MAX_prev_TREND, axis=1)

Compute each group's max with groupby.max, then map the shifted series of maxima back onto the groups to get the previous group's max.

Here is a simple example:

import pandas as pd

df = pd.DataFrame({'group': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'value': [1, 4, 2, 3, 2, 0, 5, 3, 3]})
# only use sort=False if you want to keep the original order
s = df.groupby('group', sort=False)['value'].max()
df['max_previous'] = df['group'].map(s.shift())

Output:

group  value  max_previous
0      0      1           NaN
1      0      4           NaN
2      0      2           NaN
3      1      3           4.0
4      1      2           4.0
5      1      0           4.0
6      2      5           3.0
7      2      3           3.0
8      2      3           3.0
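To see why mapping the shifted series works, it can help to inspect the intermediate values: s holds one max per group in order of appearance, and s.shift() moves each max down one position, so each group label picks up the max of the group before it (a minimal sketch using the example data above):

```python
import pandas as pd

df = pd.DataFrame({'group': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'value': [1, 4, 2, 3, 2, 0, 5, 3, 3]})

s = df.groupby('group', sort=False)['value'].max()
print(s.tolist())          # [4, 3, 5] -> max of groups 0, 1, 2
print(s.shift().tolist())  # [nan, 4.0, 3.0] -> each group's predecessor max

# mapping group labels against the shifted series gives the previous group's max
df['max_previous'] = df['group'].map(s.shift())
```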

If you have non-consecutive groups (integers, years, etc.) and want to make sure you map the immediately preceding (i.e. n-1) group, change the index with set_axis instead:

df['max_previous'] = df['group'].map(s.set_axis(s.index+1))

Example to see the difference:

group  value  max_previous_shift  max_previous_discontinuous
0       0      1                 NaN                         NaN
1       0      4                 NaN                         NaN
2       0      2                 NaN                         NaN
3       1      3                 4.0                         4.0
4       1      2                 4.0                         4.0
5       1      0                 4.0                         4.0
6       2      5                 3.0                         3.0
7       2      3                 3.0                         3.0
8       2      3                 3.0                         3.0
9       4      7                 5.0                         NaN
10      4      3                 5.0                         NaN
11      4      1                 5.0                         NaN
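As a minimal sketch of the discontinuous case (reconstructing the group-4 rows from the table above): the positional shift maps group 4 to group 2's max because group 2 comes right before it in s, while set_axis(s.index + 1) looks up group 3, which does not exist, and yields NaN:

```python
import pandas as pd

df = pd.DataFrame({'group': [0, 0, 0, 1, 1, 1, 2, 2, 2, 4, 4, 4],
                   'value': [1, 4, 2, 3, 2, 0, 5, 3, 3, 7, 3, 1]})

s = df.groupby('group', sort=False)['value'].max()

# positional shift: group 4 inherits group 2's max (5.0)
df['max_previous_shift'] = df['group'].map(s.shift())

# index-based shift: group 4 looks for group 3, which is missing -> NaN
df['max_previous_discontinuous'] = df['group'].map(s.set_axis(s.index + 1))
```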

First aggregate the per-group maxima with GroupBy.max, then create the new column by mapping with Series.map against the previous trend: add 1 to the index (built from the trend_count column); sort=False is used for better performance:

s = df.groupby('trend_count', sort=False)['Price'].max()
df['Price_MAX_prev_TREND'] = df['trend_count'].map(s.rename(lambda x: x+1))

Another idea is to add 1 to trend_count before aggregating the max:

s = df.assign(trend_count = df['trend_count'].add(1)).groupby('trend_count', sort=False)['Price'].max()
df['Price_MAX_prev_TREND'] = df['trend_count'].map(s)

Performance depends on the data, so it is best to test on your real data:

import numpy as np
import pandas as pd

np.random.seed(2023)
N = 10000
df = pd.DataFrame({'trend_count': np.random.randint(1000, size=N),
                   'Price': np.random.randint(1000, size=N)})

# original solution
In [192]: %%timeit
     ...: df['Price_MAX_prev_TREND1'] = df.apply(Price_MAX_prev_TREND, axis=1)
4.02 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [193]: %%timeit
     ...: s = df.groupby('trend_count', sort=False)['Price'].max()
     ...: df['Price_MAX_prev_TREND'] = df['trend_count'].map(s.rename(lambda x: x+1))
2.38 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [194]: %%timeit
     ...: s = df.assign(trend_count = df['trend_count'].add(1)).groupby('trend_count', sort=False)['Price'].max()
     ...: df['Price_MAX_prev_TREND'] = df['trend_count'].map(s)
2.26 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# mozway's solution
In [195]: %%timeit
     ...: s = df.groupby('trend_count', sort=False)['Price'].max()
     ...: df['max_previous'] = df['trend_count'].map(s.set_axis(s.index+1))
3.01 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
