我正在寻找一种方法来生成数据框架的汇总统计信息。考虑下面的例子:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
结果将最后n=3行的值计数相加,删除它们,然后将它们作为一行添加到原始值计数中。此外,它会很好有所有的百分比。有什么想法如何实现这一点吗?干杯!
对于带有百分比的DataFrame,使用Series.iloc
进行索引,使用Series.to_frame
创建DataFrame,添加新行和新列填充百分比:
s = df['category'].value_counts()
n= 3
out = s.iloc[:-n].to_frame('count')
out.loc['Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print (out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
我将使用tail(-3)
来获取除前3个外的最后一个值:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
输出:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
代码段
df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
categegory_counts = pd.DataFrame([df['category'].value_counts(),df['category'].value_counts(normalize=True)],index=["Count","Percent"]).T.reset_index()
other_rows = categegory_counts[cutoff_index:].set_index("index")
categegory_counts = categegory_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([categegory_counts,pd.DataFrame(other_rows.sum(),columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)
会给你你所需要的。格式也不错;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%