Python pandas DataFrame: compute the mean of selected columns for each category and return the result as a DataFrame?

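The Doc_type column contains several categories. I want to compute the mean of selected columns for each category and keep the category in the resulting DataFrame. Rather than all the records of the original DataFrame, I want just one row per category holding the means of the selected columns.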

# One Series per category for the selected columns (note: this computes the median)
for name in sample.Doc_type.unique():
    df_mean = sample[sample.Doc_type == name][
        [
            "total_tokens_per_doc",
            "valid_token_percentage",
            "special_chars_percentage",
            "numeric_values_percentage",
        ]
    ].median()
    print(df_mean)

Results:
total_tokens_per_doc         64.000000
valid_token_percentage        0.590551
special_chars_percentage      0.122449
numeric_values_percentage     0.340000
dtype: float64
total_tokens_per_doc         69.000000
valid_token_percentage        0.595376
special_chars_percentage      0.107143
numeric_values_percentage     0.316327
dtype: float64
total_tokens_per_doc         48.000000
valid_token_percentage        0.656250
special_chars_percentage      0.133333
numeric_values_percentage     0.250000
dtype: float64
total_tokens_per_doc         37.000000
valid_token_percentage        0.651685
special_chars_percentage      0.142857
numeric_values_percentage     0.242424
dtype: float64
total_tokens_per_doc         2.0
valid_token_percentage       0.5
special_chars_percentage     0.0
numeric_values_percentage    0.0

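You can use groupby for this (the snippet keeps .median() from the question; use .mean() instead if you want the mean):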

columns = ['Doc_type',
           "total_tokens_per_doc",
           "valid_token_percentage",
           "special_chars_percentage",
           "numeric_values_percentage"]
df_mean = sample[columns].groupby('Doc_type').median()
# to get the groupby variable as a column rather than an index:
df_mean.reset_index(inplace=True)
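
To make the groupby approach concrete, here is a minimal runnable sketch that uses the mean rather than the median. The data values, the category names "invoice" and "receipt", and the extra column other_column are made up for illustration; only Doc_type and the two metric column names come from the question.

```python
import pandas as pd

# Hypothetical data: only the column names Doc_type, total_tokens_per_doc and
# valid_token_percentage are taken from the question.
sample = pd.DataFrame({
    "Doc_type": ["invoice", "invoice", "receipt", "receipt"],
    "total_tokens_per_doc": [60, 68, 40, 56],
    "valid_token_percentage": [0.5, 0.7, 0.6, 0.8],
    "other_column": ["x", "y", "z", "w"],  # not selected, so it is left out
})

selected = ["total_tokens_per_doc", "valid_token_percentage"]

# as_index=False keeps Doc_type as a regular column,
# so no separate reset_index() call is needed.
df_mean = sample.groupby("Doc_type", as_index=False)[selected].mean()
print(df_mean)
```

The result has one row per Doc_type, with the category as an ordinary column followed by the mean of each selected column.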

Suppose this is your sample data:

import pandas as pd
sample = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,6,7], 'c':[8,9,7,6]})

Then take the mean of the selected columns:

sample[['a','b']].mean()

Sample output:

a    2.5
b    5.5
dtype: float64
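
The snippet above averages over the whole frame, not per category. As a rough sketch of how it might be extended, the example below adds a hypothetical grouping column named group (not part of the original sample) and uses named aggregation to pick a statistic per column. The output column names a_mean and b_median are likewise chosen only for illustration.

```python
import pandas as pd

# Same sample data as above, plus a hypothetical "group" column
# that plays the role of Doc_type.
sample = pd.DataFrame({
    "group": ["x", "x", "x", "y"],
    "a": [1, 2, 3, 4],
    "b": [4, 5, 6, 7],
    "c": [8, 9, 7, 6],
})

# Named aggregation: each keyword is an output column name,
# each value is a (source column, aggregation function) pair.
per_group = sample.groupby("group", as_index=False).agg(
    a_mean=("a", "mean"),
    b_median=("b", "median"),
)
print(per_group)
```

This form is convenient when the mean is wanted for some columns and the median for others, since each output column states its own statistic.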
