How do I rank the elements of each row by frequency?



I have a DataFrame named a. I want to get the two most frequent elements in each row.

Input:

import pandas as pd
a=pd.DataFrame({'A1':['food','movie','sport'],'A2':['game','traffic','health'],
'A3':['food','health','education'],'A4':['game','travel','other'],
'A5':['social','other','sport']})

Output:

      A1       A2         A3      A4      A5
0   food     game       food    game  social
1  movie  traffic     health  travel   other
2  sport   health  education   other   sport

Expected:

     top1       top2
0    food       game
1  health      movie
2   sport  education

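As you can see, some elements in a row may occur with the same frequency. For such ties I only need one of the tied elements per rank. For example, every element in row 1 appears exactly once, so I simply pick any two of them at random.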

Any help is appreciated. Thanks!

Counter

from collections import Counter
tops = [
[*zip(*Counter(r).most_common(2))][0]
for r in zip(*map(a.get, a))
]
pd.DataFrame(tops, a.index, ['top1', 'top2'])

Output:

    top1     top2
0   food     game
1  movie  traffic
2  sport   health
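
The one-liner above packs several steps into nested unpacking. As a rough sketch of the same idea written out step by step (the helper name `top_two` and the use of `itertuples` are illustrative choices, not part of the original answer):

```python
from collections import Counter

import pandas as pd

a = pd.DataFrame({
    'A1': ['food', 'movie', 'sport'],
    'A2': ['game', 'traffic', 'health'],
    'A3': ['food', 'health', 'education'],
    'A4': ['game', 'travel', 'other'],
    'A5': ['social', 'other', 'sport'],
})


def top_two(row):
    """Hypothetical helper: the two most frequent values in one row."""
    # most_common(2) yields (value, count) pairs, highest count first;
    # values with equal counts keep the order in which they were first seen.
    return [value for value, _count in Counter(row).most_common(2)]


# itertuples(index=False) walks the frame row by row as plain tuples
tops = [top_two(row) for row in a.itertuples(index=False)]
result = pd.DataFrame(tops, index=a.index, columns=['top1', 'top2'])
print(result)
```

Because ties are resolved by first appearance, this should reproduce the table above. It assumes every row has at least two distinct values.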

You can try value_counts and assign the most frequent word to top1, the next one to top2, and so on:

pd.DataFrame({'top1':a.apply(lambda x: x.value_counts().index[0],1 ).values,
'top2':a.apply(lambda x: x.value_counts().index[1],1 ).values})

Output:

      top1   top2
0     game   food
1  traffic  movie
2    sport  other

Use:

a.apply(lambda x: pd.Series(x.value_counts().nlargest(2).index.tolist(), 
index=['top1','top2']), 
axis=1)

Output:

      top1   top2
0     game   food
1  traffic  other
2    sport  other
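
The two value_counts answers pick different winners among tied values (compare row 1 of their outputs), so the order among equal counts is probably best not relied on. Both also assume each row has at least two distinct values. A minimal sketch of a variant that pads short rows instead of failing, where the helper name `top_n` and the sample frame `b` are made up for illustration:

```python
import pandas as pd


def top_n(row, n=2):
    """Hypothetical helper: the n most frequent values of a row, padded with None."""
    # value_counts() sorts by count, descending; keep at most n labels
    ranked = row.value_counts().index.tolist()[:n]
    # pad so every row yields exactly n entries
    ranked += [None] * (n - len(ranked))
    return pd.Series(ranked, index=[f'top{i + 1}' for i in range(n)])


# Row 0 has a single distinct value, row 1 has two
b = pd.DataFrame({'A1': ['x', 'p'], 'A2': ['x', 'q'], 'A3': ['x', 'p']})
out = b.apply(top_n, axis=1)
print(out)
```

Here row 0 gets a missing value in top2 rather than raising an error, while row 1 yields p and q.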

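You need Counter and apply:

from collections import Counter
# most_common(2) ranks by count; list(Counter(...)) alone would keep insertion order
top2 = a.apply(lambda r: [k for k, _ in Counter(r).most_common(2)], axis=1)
out_df = pd.DataFrame(top2.values.tolist(), columns=['top1', 'top2'])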
