pandas: groupby with a categorical sort order to drop duplicates



I have a DataFrame like the one below:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Tim", "Tim", "Tim", "Tim", "Tim", "Jack", "Jack", "Jack"],
    "Status": ["A1", "E1", "B3", "D4", "C90", "A1", "C90", "B3"]
})

The actual order of my Status variable is B3

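Put the existing categories in a list in the desired order, convert Status to an ordered categorical, then sort and drop duplicates on the Name column: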

df["Status"] = pd.Categorical(df["Status"],
                              categories=["B3", "A1", "D4", "C90", "E90", "E1"],
                              ordered=True)
# Sort by the categorical order, then keep the last (highest-ranked) row per Name
df_cleaned = (df.sort_values(['Status'])
                .drop_duplicates(['Name'], keep='last'))
print(df_cleaned)
   Name Status
6  Jack    C90
1   Tim     E1
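
To see why the ordered categorical matters here, the following minimal sketch compares a plain string sort with a sort on the categorical. It reuses the category list from the answer above; the variable names are only illustrative.

```python
import pandas as pd

raw = ["A1", "E1", "B3", "D4", "C90"]

# Plain strings sort lexicographically.
plain_sorted = sorted(raw)  # ['A1', 'B3', 'C90', 'D4', 'E1']

# An ordered categorical sorts by the position of each value in `categories`.
status = pd.Series(pd.Categorical(
    raw,
    categories=["B3", "A1", "D4", "C90", "E90", "E1"],
    ordered=True,
))
sorted_status = status.sort_values().tolist()  # ['B3', 'A1', 'D4', 'C90', 'E1']

print(plain_sorted)
print(sorted_status)
print(status.max())  # 'E1', the highest-ranked value actually present
```

Unused categories such as "E90" do no harm; they simply never appear in the sorted result.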

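If some values may not be in the category list, they become missing values (NaN) after the conversion, so drop them as well: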

df_cleaned = (df.dropna(subset=['Status'])
                .sort_values(['Status'])
                .drop_duplicates(['Name'], keep='last'))
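
A small sketch of why the dropna step matters: a value outside the category list turns into NaN, NaN sorts last by default, and keep='last' would then pick the NaN row. The data below is made up for illustration; "Z9" is a hypothetical status that is not in the category list.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Tim", "Tim", "Jack", "Jack"],
    "Status": ["A1", "Z9", "B3", "C90"],  # "Z9" is a hypothetical unknown status
})
df_demo["Status"] = pd.Categorical(
    df_demo["Status"],
    categories=["B3", "A1", "D4", "C90", "E1"],
    ordered=True,
)  # "Z9" is not a category, so it becomes NaN

# Without dropna: NaN is sorted to the end, so Tim's kept row is the NaN one.
without_dropna = (df_demo.sort_values("Status")
                         .drop_duplicates("Name", keep="last"))

# With dropna: the NaN row is removed first, so Tim keeps "A1".
with_dropna = (df_demo.dropna(subset=["Status"])
                      .sort_values("Status")
                      .drop_duplicates("Name", keep="last"))

print(without_dropna)
print(with_dropna)
```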

You can use encode_categorical from pyjanitor to abstract away the creation of the categorical column, and then use either drop_duplicates or groupby:

# pip install pyjanitor
import pandas as pd
import janitor
(df
 .encode_categorical(Status=['B3', 'A1', 'D4', 'C90', 'E1'])
 .sort_values(['Name', 'Status'])
 # alternatively, replace the groupby lines below with drop_duplicates:
 # .drop_duplicates(subset='Name', keep='last')
 .groupby('Name', as_index=False)
 .Status
 .last()
)
   Name Status
0  Jack    C90
1   Tim     E1
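
If installing pyjanitor is not an option, the same pipeline can be sketched with pandas alone. This is an assumed equivalent, not part of the answer above: it swaps encode_categorical for pd.Categorical inside assign, and the name status_order is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tim", "Tim", "Tim", "Tim", "Tim", "Jack", "Jack", "Jack"],
    "Status": ["A1", "E1", "B3", "D4", "C90", "A1", "C90", "B3"],
})

status_order = ["B3", "A1", "D4", "C90", "E1"]  # lowest to highest

result = (
    df.assign(Status=pd.Categorical(df["Status"],
                                    categories=status_order,
                                    ordered=True))
      .sort_values(["Name", "Status"])
      .groupby("Name", as_index=False)["Status"]
      .last()  # after sorting, the last row per Name has the highest-ranked Status
)
print(result)
```

Using assign keeps the original df untouched, which is convenient when the raw string column is still needed elsewhere.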
