DataFrame apply(set) is not removing duplicate values

My dataset sometimes contains duplicates within a single concatenated column, like this:

Total
0                 Thriller,Satire,Thriller
1                 Horror,Thriller,Horror
2                   Mystery,Horror,Mystery
3                 Adventure,Horror,Horror

When I do this:

df['Total'].str.split(",").apply(set)

I get:

Total
0                 {Thriller,Satire}
1                 {Horror,Thriller}
2                 {Mystery,Horror,Crime}
3                 {Adventure,Horror}

After encoding with

df['Total'].str.get_dummies(sep=",")

I get column headers like this:

{'Horror    {'Mystery   {'Thriller ... Horror Thriller'}

instead of

Horror Mystery Thriller

How can I get rid of the curly braces when working with a pandas DataFrame?

The method Series.str.get_dummies already handles duplicates correctly.

So omit the code that reduces each row to its unique values:

df['Total'] = df['Total'].str.split(",").apply(set)

and use only:

df1 = df['Total'].str.get_dummies(sep=",")
print(df1)
   Adventure  Horror  Mystery  Satire  Thriller
0          0       0        0       1         1
1          0       1        0       0         1
2          0       1        1       0         0
3          1       1        0       0         0
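To check this without the original dataset, here is a minimal sketch that rebuilds the four sample rows from the question by hand. Building the DataFrame this way is an assumption about how the data is held, and the name `dummies` is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; building them by hand is an assumption.
df = pd.DataFrame({
    "Total": [
        "Thriller,Satire,Thriller",
        "Horror,Thriller,Horror",
        "Mystery,Horror,Mystery",
        "Adventure,Horror,Horror",
    ]
})

# get_dummies splits on the separator and emits one indicator column per label.
# A label repeated within a row still gives a single 1, not a count.
dummies = df["Total"].str.get_dummies(sep=",")
print(dummies)
```

Every value in the result is 0 or 1, so the repeated labels in the raw strings need no separate deduplication step before encoding.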

But if you do need to remove the duplicates, add Series.str.join:

df1 = df['Total'].str.split(",").apply(set).str.join(',').str.get_dummies(sep=",")
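The curly braces in the question most likely come from get_dummies working on the text form of each cell: once a cell holds a Python set, its string representation includes the braces and quotes, and those end up in the column names. The sketch below illustrates this on two of the sample rows. Sorting each set before joining is an extra step not in the original answer, added only to make the joined string deterministic, and the names `as_sets`, `deduped` and `result` are illustrative.

```python
import pandas as pd

# Two sample rows from the question, built by hand for illustration.
df = pd.DataFrame({"Total": ["Horror,Thriller,Horror", "Mystery,Horror,Mystery"]})

# Each cell becomes a Python set, which removes the duplicates.
as_sets = df["Total"].str.split(",").apply(set)

# The text form of a set carries braces and quotes (element order may vary).
print(as_sets.astype(str).iloc[0])

# Assumption: sort each set so the joined string has a stable order,
# then join back into a plain comma-separated string before encoding.
deduped = as_sets.apply(sorted).str.join(",")
result = deduped.str.get_dummies(sep=",")
print(result)
```

Joining the sets back into plain strings means get_dummies only ever sees comma-separated labels, so the column names come out clean.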
