我有一个数据框架,每个类有3列和5行学生。这些学生中有些是复制品。我想列出所有班级中最常见的学生名字,并按降序列出,他们出现的次数,以及他们出现在哪些班级。
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
biology statistics ecology
0 ryan sarah austin
1 sarah ed ryan
2 tom jacob tom
3 ed ryan sam
4 jackson de sarah
我希望输出看起来像这样:
ryan, 3 classes, (biology, statistics, ecology)
sarah, 3 classes, (biology, statistics, ecology)
tom, 2 classes, (biology, ecology)
ed, 2 classes, (biology, statistics)
jackson, 1 class, (biology)
jacob, 1 class, (statistics)
de, 1 class, (statistics)
austin, 1 class, (ecology)
…等等
任何帮助将不胜感激,我是一个初学者,所以我已经在这几个小时。大脑被杀死了。谢谢!
我们可以melt
的DataFrame得到长形式,然后groupby aggregate
命名聚合得到类的数量和类的名称,最后我们可以sort_values
首先得到频率最高的学生:
output_df = (
df.melt(var_name='class name', value_name='student name')
.groupby('student name', as_index=False)
.agg(class_count=('class name', 'count'),
classes=('class name', tuple))
.sort_values('class_count', ascending=False, ignore_index=True)
)
output_df
:
student name class_count classes
0 ryan 3 (biology, statistics, ecology)
1 sarah 3 (biology, statistics, ecology)
2 ed 2 (biology, statistics)
3 tom 2 (biology, ecology)
4 austin 1 (ecology,)
5 de 1 (statistics,)
6 jackson 1 (biology,)
7 jacob 1 (statistics,)
8 sam 1 (ecology,)
我们可以进一步有条件地添加类/类到class_count
,并写入to_csv
:
# Conditionally Add Classes/Class
output_df['class_count'] = output_df['class_count'].astype(str) + np.where(
output_df['class_count'].eq(1),
' class',
' classes'
)
# Write to CSV
output_df.to_csv('output.csv', index=False, header=None)
output.csv
:
ryan,3 classes,"('biology', 'statistics', 'ecology')"
sarah,3 classes,"('biology', 'statistics', 'ecology')"
ed,2 classes,"('biology', 'statistics')"
tom,2 classes,"('biology', 'ecology')"
austin,1 class,"('ecology',)"
de,1 class,"('statistics',)"
jackson,1 class,"('biology',)"
jacob,1 class,"('statistics',)"
sam,1 class,"('ecology',)"
Setup and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
results = {}
for h in df:
for k,v in df[h].value_counts().items():
print(k,v)
if k in results:
results[k]['value'] += v
results[k]['class'].append(h)
else:
results[k] = {'value':v,'class':[h]}
results = {h:results[h] for h in sorted(results, key=lambda x:results[x]['value'],reverse=True)}