我想取数据帧列'category'
中的两个字母和另一列'data1'
中的最大值4,并根据某些规则对它们进行排序。我使用了根据类别中的字母进行切片和排序的方法,然后使用concat
进行合并。代码如下。有比这更简单的方法吗?
import numpy as np
from time import time
import pandas as pd
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.randint(1, 10000, len(df))
df['data2'] = np.random.randint(1, 10000, len(df))
a = df[df['category'] == 'A'].sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B'].sort_values(by='data1', ascending=False).head(4)
df = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df)
category data1 data2
0 A 9963 7174
1 A 9921 1787
2 A 9870 6105
3 A 9779 7189
4 B 9938 4259
5 B 9714 9362
6 B 9622 7033
7 B 9600 5718
我再次尝试lambda
,代码如下:
import numpy as np
from time import time
import pandas as pd
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.randint(1, 10000, len(df))
df['data2'] = np.random.randint(1, 10000, len(df))
df = df.groupby('category').apply(lambda x: x[x['data1'].isin(x['data1'].nlargest(4))]).reset_index(drop=True)
df = df.sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df)
使用:
#for same random data for compare
np.random.seed(2021)
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.randint(1, 10000, len(df))
df['data2'] = np.random.randint(1, 10000, len(df))
a = df[df['category'] == 'A'].sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B'].sort_values(by='data1', ascending=False).head(4)
df1 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df1)
category data1 data2
0 A 9882 9868
1 A 9855 6701
2 A 9798 1058
3 A 9669 7334
4 B 9973 3668
5 B 9900 4340
6 B 9846 7885
7 B 9659 4933
先通过两列使用DataFrame.sort_values
,然后添加GroupBy.head
:
df1 = (df.sort_values(by=['category', 'data1'], ascending=[True, False])
.groupby('category')
.head(4)
.reset_index(drop=True))
print (df1)
category data1 data2
0 A 9882 9868
1 A 9855 6701
2 A 9798 1058
3 A 9669 7334
4 B 9973 3668
5 B 9900 4340
6 B 9846 7885
7 B 9659 4933