如何使用pandas从完整的数据帧中找到重复项?



我有一个数据框架,每个类有3列和5行学生。这些学生中有些是复制品。我想列出所有班级中最常见的学生名字,并按降序列出,他们出现的次数,以及他们出现在哪些班级。

df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
biology statistics ecology
0     ryan      sarah  austin
1    sarah         ed    ryan
2      tom      jacob     tom
3       ed       ryan     sam
4  jackson         de   sarah

我希望输出看起来像这样:

ryan, 3 classes, (biology, statistics, ecology)
sarah, 3 classes, (biology, statistics, ecology)
tom, 2 classes, (biology, ecology)
ed, 2 classes, (biology, statistics)
jackson, 1 class, (biology)
jacob, 1 class, (statistics)
de, 1 class, (statistics)
austin, 1 class, (ecology)

…等等

任何帮助将不胜感激,我是一个初学者,所以我已经在这几个小时。大脑被杀死了。谢谢!

我们可以melt的DataFrame得到长形式,然后groupby aggregate命名聚合得到类的数量和类的名称,最后我们可以sort_values首先得到频率最高的学生:

output_df = (
df.melt(var_name='class name', value_name='student name')
.groupby('student name', as_index=False)
.agg(class_count=('class name', 'count'),
classes=('class name', tuple))
.sort_values('class_count', ascending=False, ignore_index=True)
)

output_df:

student name  class_count                         classes
0         ryan            3  (biology, statistics, ecology)
1        sarah            3  (biology, statistics, ecology)
2           ed            2           (biology, statistics)
3          tom            2              (biology, ecology)
4       austin            1                      (ecology,)
5           de            1                   (statistics,)
6      jackson            1                      (biology,)
7        jacob            1                   (statistics,)
8          sam            1                      (ecology,)

我们可以进一步有条件地添加类/类到class_count,并写入to_csv:

# Conditionally Add Classes/Class
output_df['class_count'] = output_df['class_count'].astype(str) + np.where(
output_df['class_count'].eq(1),
' class',
' classes'
)
# Write to CSV
output_df.to_csv('output.csv', index=False, header=None)

output.csv:

ryan,3 classes,"('biology', 'statistics', 'ecology')"
sarah,3 classes,"('biology', 'statistics', 'ecology')"
ed,2 classes,"('biology', 'statistics')"
tom,2 classes,"('biology', 'ecology')"
austin,1 class,"('ecology',)"
de,1 class,"('statistics',)"
jackson,1 class,"('biology',)"
jacob,1 class,"('statistics',)"
sam,1 class,"('ecology',)"

Setup and imports:

import numpy as np
import pandas as pd
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
results = {}
for h in df:
for k,v in df[h].value_counts().items():
print(k,v)
if k in results:
results[k]['value'] += v
results[k]['class'].append(h)
else:
results[k] = {'value':v,'class':[h]}
results = {h:results[h] for h in sorted(results, key=lambda x:results[x]['value'],reverse=True)}

相关内容

  • 没有找到相关文章

最新更新