I have a DataFrame like the one below:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tim", "Tim", "Tim", "Tim", "Tim", "Jack", "Jack", "Jack"],
    "Status": ["A1", "E1", "B3", "D4", "C90", "A1", "C90", "B3"]
})
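The actual order of my Status variable starts with B3; the full order is the categories list passed to pd.Categorical below.
Add the existing categories to a list in that order, sort by Status, and drop duplicates on the Name column. If some values may not be in the category list, also drop the missing values first, as the second snippet does:
# Values are ranked by their position in `categories` (lowest first).
df["Status"] = pd.Categorical(df["Status"],
                              categories=["B3", "A1", "D4", "C90", "E90", "E1"],
                              ordered=True)
# After sorting, keep='last' retains the highest-ranked Status per Name.
df_cleaned = (df.sort_values(['Status'])
                .drop_duplicates(['Name'], keep='last'))
print(df_cleaned)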
   Name Status
6  Jack    C90
1   Tim     E1
# Values missing from `categories` become NaN; drop them before sorting.
df_cleaned = (df.dropna(subset=['Status'])
                .sort_values(['Status'])
                .drop_duplicates(['Name'], keep='last'))
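To see why the dropna step matters, here is a minimal sketch. It assumes a made-up value "Z9" that is not in the category list; the sample rows are hypothetical and not from the question. pd.Categorical turns such values into NaN, and sort_values places NaN last by default, so without dropna the keep='last' rule would retain the NaN rows.

```python
import pandas as pd

# Hypothetical data: "Z9" is an invented status that is absent from the category list.
df = pd.DataFrame({
    "Name": ["Tim", "Tim", "Tim", "Jack", "Jack"],
    "Status": ["A1", "E1", "Z9", "C90", "Z9"],
})

order = ["B3", "A1", "D4", "C90", "E90", "E1"]
# Values not listed in `order` are converted to NaN.
df["Status"] = pd.Categorical(df["Status"], categories=order, ordered=True)

df_cleaned = (
    df.dropna(subset=["Status"])              # remove rows whose status was not a known category
      .sort_values("Status")                  # sort by category rank, lowest first
      .drop_duplicates("Name", keep="last")   # keep the highest-ranked row per Name
)
print(df_cleaned)
```

With these rows, Jack keeps C90 and Tim keeps E1.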
You can abstract the creation of the categorical column with encode_categorical from pyjanitor, then use drop_duplicates or groupby:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .encode_categorical(Status=['B3', 'A1', 'D4', 'C90', 'E1'])
 .sort_values(['Name', 'Status'])
 # the groupby lines below can be replaced with a single drop_duplicates call:
 # .drop_duplicates(subset='Name', keep='last')
 .groupby('Name', as_index=False)
 .Status
 .last()
)
   Name Status
0  Jack    C90
1   Tim     E1
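For comparison, the same result can be reached in plain pandas without sorting. This is only a sketch, not part of the answers above: it assumes the five-value order used in the pyjanitor call and picks, per Name, the row whose category code is largest.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tim", "Tim", "Tim", "Tim", "Tim", "Jack", "Jack", "Jack"],
    "Status": ["A1", "E1", "B3", "D4", "C90", "A1", "C90", "B3"],
})

# Assumed order, copied from the encode_categorical call above.
order = ["B3", "A1", "D4", "C90", "E1"]
df["Status"] = pd.Categorical(df["Status"], categories=order, ordered=True)

# cat.codes gives each value's position in `order` (NaN would be -1).
codes = df["Status"].cat.codes
# For each Name, find the index label of the row with the largest code.
best_rows = codes.groupby(df["Name"]).idxmax()
result = df.loc[best_rows]
print(result)
```

Because unknown values get code -1, they lose to any known category, so no separate dropna is needed unless every row for a Name is unknown.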