I have a DataFrame that looks something like this:
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003, 1004, 1005, 1006],
                   'resolution_modified': ['It is recommended to replace scanner',
                                           'It is recommended to replace scanner',
                                           'It is recommended to replace laptop',
                                           'It is recommended to replace laptop',
                                           'It is recommended to replace printer',
                                           'It is recommended to replace printer'],
                   'cluster': [1, 1, 2, 2, 3, 3]})
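For each unique cluster, I want to find the string that occurs most often in resolution_modified, so that I end up with a mapping where the keys are the clusters and the values are the mode strings from the resolution_modified column.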
This is what I tried:
# Get the string that occurs the most for each unique cluster
mode_string = {}
for cluster in hardware['cluster'].unique():
    if hardware[hardware['cluster']==cluster]:
        mode_string[cluster] = hardware['resolution_modified'].mode()[0]
mode_string
This does not work and throws an error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
You can use pandas.DataFrame.groupby together with pandas.Series.mode:
mode_string = df.groupby("cluster")["resolution_modified"].agg(pd.Series.mode)
#cluster
#1 It is recommended to replace scanner
#2 It is recommended to replace laptop
#3 It is recommended to replace printer
You can also convert it to a dict:
mode_string = mode_string.to_dict()
#{1: 'It is recommended to replace scanner', 2: 'It is recommended to replace laptop', 3: 'It is recommended to replace printer'}
In both cases, you can do:
mode_string[1]
#'It is recommended to replace scanner'
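As for why the original loop fails: `if hardware[hardware['cluster']==cluster]:` asks Python for the truth value of a whole DataFrame, which pandas refuses to guess, hence the ValueError. The loop also takes the mode of the entire column rather than of the rows for the current cluster. If you prefer to keep an explicit loop, a minimal sketch could look like this (assuming `hardware` in the question is the same frame as `df`; the name `subset` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'id': [1001, 1002, 1003, 1004, 1005, 1006],
                   'resolution_modified': ['It is recommended to replace scanner',
                                           'It is recommended to replace scanner',
                                           'It is recommended to replace laptop',
                                           'It is recommended to replace laptop',
                                           'It is recommended to replace printer',
                                           'It is recommended to replace printer'],
                   'cluster': [1, 1, 2, 2, 3, 3]})

mode_string = {}
for cluster in df['cluster'].unique():
    # Select only the rows that belong to the current cluster
    subset = df.loc[df['cluster'] == cluster, 'resolution_modified']
    # Test emptiness explicitly instead of using the frame as a boolean
    if not subset.empty:
        # mode() returns a Series; take its first entry
        mode_string[cluster] = subset.mode()[0]

print(mode_string)
```

The groupby version above does the same thing in one pass and is usually the better choice.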
The pandas way is to group by cluster and find the mode of resolution_modified:
res = df.groupby('cluster')['resolution_modified'].agg(pd.Series.mode)
mode_string = res.to_dict()
print(mode_string)
Output
{1: 'It is recommended to replace scanner', 2: 'It is recommended to replace laptop', 3: 'It is recommended to replace printer'}
See the documentation on agg and mode for more details.
As an alternative, you can use statistics.mode:
from statistics import mode
res = df.groupby('cluster')['resolution_modified'].agg(mode)
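One thing to watch for is ties. When two strings are equally frequent within a cluster, pd.Series.mode returns all of them, so the aggregated value for that cluster may be an array rather than a single string. If you always want exactly one string per cluster, a possible sketch is to take the first entry of the mode explicitly. The data below is made up purely to show a tie and is not from the question:

```python
import pandas as pd

# Hypothetical data: cluster 1 has a tie between two strings
df = pd.DataFrame({
    'cluster': [1, 1, 1, 1, 2, 2],
    'resolution_modified': ['replace scanner', 'replace scanner',
                            'replace laptop', 'replace laptop',
                            'replace printer', 'replace printer'],
})

# Series.mode() returns the modes in sorted order; iloc[0] keeps the first one
res = df.groupby('cluster')['resolution_modified'].agg(lambda s: s.mode().iloc[0])
mode_string = res.to_dict()
print(mode_string)
```

With this approach a tie is broken by sort order, so cluster 1 maps to 'replace laptop' here. Whether that is the tie-breaking rule you want depends on your data.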