数据帧分类和排序优化问题



我想取数据帧列'category'中的两个字母和另一列'data1'中的最大值4,并根据某些规则对它们进行排序。我使用了根据类别中的字母进行切片和排序的方法,然后使用concat进行合并。代码如下。有比这更简单的方法吗?

import numpy as np
from time import time
import pandas as pd
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.randint(1, 10000, len(df))
df['data2'] = np.random.randint(1, 10000, len(df))
a = df[df['category'] == 'A'].sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B'].sort_values(by='data1', ascending=False).head(4)
df = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df)
category  data1  data2
0        A   9963   7174
1        A   9921   1787
2        A   9870   6105
3        A   9779   7189
4        B   9938   4259
5        B   9714   9362
6        B   9622   7033
7        B   9600   5718

我再次尝试lambda,代码如下:

import numpy as np
from time import time
import pandas as pd

df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.randint(1, 10000, len(df))
df['data2'] = np.random.randint(1, 10000, len(df))
df = df.groupby('category').apply(lambda x: x[x['data1'].isin(x['data1'].nlargest(4))]).reset_index(drop=True)
df = df.sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df)

使用:

#for same random data for compare
np.random.seed(2021)

df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.randint(1, 10000, len(df))
df['data2'] = np.random.randint(1, 10000, len(df))
a = df[df['category'] == 'A'].sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B'].sort_values(by='data1', ascending=False).head(4)
df1 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df1)
category  data1  data2
0        A   9882   9868
1        A   9855   6701
2        A   9798   1058
3        A   9669   7334
4        B   9973   3668
5        B   9900   4340
6        B   9846   7885
7        B   9659   4933

先通过两列使用DataFrame.sort_values,然后添加GroupBy.head:

df1 = (df.sort_values(by=['category', 'data1'], ascending=[True, False])
.groupby('category')
.head(4)
.reset_index(drop=True))
print (df1)
category  data1  data2
0        A   9882   9868
1        A   9855   6701
2        A   9798   1058
3        A   9669   7334
4        B   9973   3668
5        B   9900   4340
6        B   9846   7885
7        B   9659   4933

最新更新