Remove duplicate rows based on a target-class condition

I have a dataset with 3 target classes: 'Yes', 'Maybe' and 'No'.

Unique_id       target
111              Yes
111             Maybe
111              No
112              No
112             Maybe
113              No

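I want to remove duplicate rows based on Unique_id. However, drop_duplicates normally keeps either the first or the last row, and I want to keep rows according to the following conditions: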

1) If a Unique_id has all 3 classes (Yes, Maybe and No), keep only the 'Yes' row.
2) If a Unique_id has 2 classes (Maybe and No), keep only the 'Maybe' row.
3) Keep the 'No' row when 'No' is the only class present.

I tried using sort_values on the target class (Yes=1, Maybe=2, No=3) and then dropping the duplicates.
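For reference, the attempt described above might look roughly like the following. This is only a sketch of one way to write it: the sample DataFrame is rebuilt from the table in the question, and the helper column name `order` is an illustrative choice, not something from the original post.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Unique_id": [111, 111, 111, 112, 112, 113],
    "target": ["Yes", "Maybe", "No", "No", "Maybe", "No"],
})

# Priority of each class: a lower number means the row is preferred.
priority = {"Yes": 1, "Maybe": 2, "No": 3}

result = (
    df.assign(order=df["target"].map(priority))  # temporary helper column
      .sort_values(["Unique_id", "order"])       # preferred class first per id
      .drop_duplicates("Unique_id", keep="first")
      .drop(columns="order")
      .reset_index(drop=True)
)
print(result)
```

Sorting puts the preferred class first within each Unique_id, so keep="first" retains exactly that row.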

Desired output:

Unique_id       target
111               Yes
112              Maybe
113               No

I am wondering whether there is a better approach.

Any suggestions would be appreciated. Thanks!

Set the target column to a Categorical data type via pd.CategoricalDtype, ordered as ['Yes' < 'Maybe' < 'No'], as follows:

import pandas as pd
# Ordered categories: 'Yes' < 'Maybe' < 'No'
t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)

Then group by Unique_id with .groupby() and use GroupBy.min() to take the minimum of target within each Unique_id group:

df.groupby('Unique_id', as_index=False)['target'].min()

Result:

   Unique_id target
0        111    Yes
1        112  Maybe
2        113     No
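The steps above can be put together into a self-contained sketch. The sample DataFrame is rebuilt from the question; the `notna()` check and the `smallest` variable are additions for illustration, based on the fact that `astype()` converts any label not listed in `categories` to NaN.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Unique_id": [111, 111, 111, 112, 112, 113],
    "target": ["Yes", "Maybe", "No", "No", "Maybe", "No"],
})

t = pd.CategoricalDtype(categories=["Yes", "Maybe", "No"], ordered=True)
df["target"] = df["target"].astype(t)

# astype() turns any label that is not listed in `categories` into NaN,
# so check for that before aggregating.
assert df["target"].notna().all(), "unexpected label in target"

# With ordered=True, min() follows the category order, not the alphabet
# (alphabetically, 'Maybe' would sort before 'Yes').
smallest = df["target"].min()

result = df.groupby("Unique_id", as_index=False)["target"].min()
print(result)
```

Because the preference order lives in the dtype, no helper column is needed and the same dtype can be reused for other columns.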

Edit

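Case 1: If you have 2 or more similar columns (e.g. target and target2) that follow the same ordering, simply apply the code to both columns. For example, given the following dataframe:

   Unique_id target target2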
0        111    Yes      No
1        111  Maybe   Maybe
2        111     No     Yes
3        112     No      No
4        112  Maybe   Maybe
5        113     No   Maybe

you can get the minimum of both columns at once, as follows:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df[['target', 'target2']] = df[['target', 'target2']].astype(t)
df.groupby('Unique_id', as_index=False)[['target', 'target2']].min()

Result:

   Unique_id target target2
0        111    Yes     Yes
1        112  Maybe   Maybe
2        113     No   Maybe

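Case 2: If you want to show all columns of the dataframe rather than only the Unique_id and target columns, you can use a simpler syntax, as follows:

Another example dataframe:

   Unique_id target  Amount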
0        111    Yes     123
1        111  Maybe     456
2        111     No     789
3        112     No    1234
4        112  Maybe    5678
5        113     No      25

Then, to show all columns of the row holding the minimum target within each Unique_id, you can use:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)
df.loc[df.groupby('Unique_id')['target'].idxmin()]

Result:

   Unique_id target  Amount
0        111    Yes     123
4        112  Maybe    5678
5        113     No      25

Using map and idxmin:

t = {'Yes':0, 'Maybe':1, 'No':2}
df.loc[df.assign(tar=df.target.map(t)).groupby('Unique_id')['tar'].idxmin()]
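A self-contained version of the map and idxmin approach might look like the sketch below. It assumes the example dataframe with the Amount column shown earlier; the variable names `rank` and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

# Example data with an extra column, copied from the Amount example.
df = pd.DataFrame({
    "Unique_id": [111, 111, 111, 112, 112, 113],
    "target": ["Yes", "Maybe", "No", "No", "Maybe", "No"],
    "Amount": [123, 456, 789, 1234, 5678, 25],
})

# Lower number means the class is preferred.
t = {"Yes": 0, "Maybe": 1, "No": 2}

# map() returns NaN for labels missing from the dict, so verify the mapping
# covers every value before relying on idxmin().
rank = df["target"].map(t)
assert rank.notna().all(), "label missing from mapping"

# idxmin() returns the index label of the lowest rank within each group;
# df.loc[] then selects those rows from the original frame, so the
# temporary 'tar' column never appears in the result.
result = df.loc[df.assign(tar=rank).groupby("Unique_id")["tar"].idxmin()]
print(result)
```

Unlike the categorical version, this leaves the target column as plain strings, which may be preferable if the dtype should not change.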
