Find the mode of one pandas column based on filtering by another pandas column



I have a DataFrame that looks similar to this:

import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003, 1004, 1005, 1006],
                   'resolution_modified': ['It is recommended to replace scanner',
                                           'It is recommended to replace scanner',
                                           'It is recommended to replace laptop',
                                           'It is recommended to replace laptop',
                                           'It is recommended to replace printer',
                                           'It is recommended to replace printer'],
                   'cluster': [1, 1, 2, 2, 3, 3]})

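For each unique cluster, I want to find the string that occurs most often in resolution_modified, so that I end up with a mapping where the keys are the clusters and the values are the mode strings from the resolution_modified column.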

This is what I tried:

# Get the string that occurs the most for each unique cluster
mode_string = {}
for cluster in hardware['cluster'].unique():
    if hardware[hardware['cluster']==cluster]:
        mode_string[cluster] = hardware['resolution_modified'].mode()[0]
mode_string

This does not work and raises an error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You can use pandas.DataFrame.groupby together with pandas.Series.mode:

mode_string = df.groupby("cluster")["resolution_modified"].agg(pd.Series.mode)
#cluster
#1       It is recommended to replace scanner
#2       It is recommended to replace laptop
#3       It is recommended to replace printer

You can also convert it to a dict:

mode_string = mode_string.to_dict()
#{1: 'It is recommended to replace scanner', 2: 'It is recommended to replace laptop', 3: 'It is recommended to replace printer'}

In both cases (Series or dict), you can look up a cluster like this:

mode_string[1]
#'It is recommended to replace scanner'
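
For reference, the loop in the question fails because `if hardware[hardware['cluster']==cluster]:` asks pandas for the truth value of a whole DataFrame, which is what raises the ValueError. Even without that `if`, the mode would be taken over the entire column rather than over the rows of one cluster. The following is a minimal sketch of a corrected loop, assuming the frame is named `df` as in the sample data (the question's own code calls it `hardware`):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1001, 1002, 1003, 1004, 1005, 1006],
    'resolution_modified': ['It is recommended to replace scanner',
                            'It is recommended to replace scanner',
                            'It is recommended to replace laptop',
                            'It is recommended to replace laptop',
                            'It is recommended to replace printer',
                            'It is recommended to replace printer'],
    'cluster': [1, 1, 2, 2, 3, 3],
})

mode_string = {}
for cluster in df['cluster'].unique():
    # Use the boolean mask to select rows instead of testing it with `if`
    subset = df.loc[df['cluster'] == cluster, 'resolution_modified']
    # mode() returns a Series; take its first entry as the representative value
    mode_string[int(cluster)] = subset.mode().iat[0]

print(mode_string)
```

The groupby version above does the same filtering per cluster without an explicit loop, so it is usually the simpler choice.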

The pandas way is to group by cluster and find the mode of resolution_modified:

res = df.groupby('cluster')['resolution_modified'].agg(pd.Series.mode)
mode_string = res.to_dict()
print(mode_string)

Output

{1: 'It is recommended to replace scanner', 2: 'It is recommended to replace laptop', 3: 'It is recommended to replace printer'}

See the documentation for agg and mode for more details.

As an alternative, you can use statistics.mode:

from statistics import mode
res = df.groupby('cluster')['resolution_modified'].agg(mode)
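
One difference between the two approaches shows up when a cluster has a tie. `Series.mode` returns every tied value, so a group with a tie may not reduce to a single string, whereas `statistics.mode` returns one value (the first one encountered, on Python 3.8 and later). The sketch below uses made-up data (the `tied` frame and its strings are hypothetical, not from the question) and takes the first entry of `Series.mode()` to force a single value per cluster:

```python
import pandas as pd
from statistics import mode

# Hypothetical data: cluster 1 has a two-way tie, cluster 2 has a clear winner
tied = pd.DataFrame({
    'cluster': [1, 1, 2, 2, 2],
    'resolution_modified': ['replace scanner', 'replace laptop',
                            'replace printer', 'replace printer',
                            'replace laptop'],
})
grouped = tied.groupby('cluster')['resolution_modified']

# Series.mode() lists the tied values in sorted order; keep only the first
first_sorted = grouped.agg(lambda s: s.mode().iat[0]).to_dict()

# statistics.mode keeps the first value it encounters among the tied ones
first_seen = grouped.agg(mode).to_dict()

print(first_sorted)
print(first_seen)
```

Which tie-breaking rule is appropriate depends on the data; if ties matter, it may be better to keep all modes and inspect them rather than pick one silently.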
