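The Doc_type column contains several categories. For each category I want to compute the average of selected columns and add the category back into the DataFrame. Rather than keeping every record from the original DataFrame, I want a single row per category holding the averages of the selected columns.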
for name in sample.Doc_type.unique():
    df_mean = sample[sample.Doc_type == name][
        [
            "total_tokens_per_doc",
            "valid_token_percentage",
            "special_chars_percentage",
            "numeric_values_percentage",
        ]
    ].median()
    print(df_mean)
Results:
total_tokens_per_doc 64.000000
valid_token_percentage 0.590551
special_chars_percentage 0.122449
numeric_values_percentage 0.340000
dtype: float64
total_tokens_per_doc 69.000000
valid_token_percentage 0.595376
special_chars_percentage 0.107143
numeric_values_percentage 0.316327
dtype: float64
total_tokens_per_doc 48.000000
valid_token_percentage 0.656250
special_chars_percentage 0.133333
numeric_values_percentage 0.250000
dtype: float64
total_tokens_per_doc 37.000000
valid_token_percentage 0.651685
special_chars_percentage 0.142857
numeric_values_percentage 0.242424
dtype: float64
total_tokens_per_doc 2.0
valid_token_percentage 0.5
special_chars_percentage 0.0
numeric_values_percentage 0.0
You can use groupby for this:
columns = [
    "Doc_type",
    "total_tokens_per_doc",
    "valid_token_percentage",
    "special_chars_percentage",
    "numeric_values_percentage",
]
df_mean = sample[columns].groupby("Doc_type").median()
# to get the groupby variable as a column rather than an index:
df_mean = df_mean.reset_index()
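As a self-contained illustration of the same idea, here is a minimal sketch on made-up data. The Doc_type labels ("invoice", "receipt") and the numbers are hypothetical and only stand in for the real `sample`. Passing `as_index=False` to groupby should keep Doc_type as a regular column, so no separate reset_index step is needed.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `sample` DataFrame
sample = pd.DataFrame({
    "Doc_type": ["invoice", "invoice", "receipt", "receipt", "receipt"],
    "total_tokens_per_doc": [60, 68, 40, 48, 50],
    "valid_token_percentage": [0.5, 0.7, 0.6, 0.8, 0.7],
})

value_cols = ["total_tokens_per_doc", "valid_token_percentage"]

# One row per Doc_type; as_index=False keeps Doc_type as a column
per_type = sample.groupby("Doc_type", as_index=False)[value_cols].median()
print(per_type)
```

Swapping `.median()` for `.mean()` gives the arithmetic mean instead, if that is what "average" is meant to be.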
Suppose this is your sample data:
import pandas as pd

sample = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,6,7], 'c':[8,9,7,6]})
Then take the mean of the selected columns:
sample[['a','b']].mean()
Sample output:
a 2.5
b 5.5
dtype: float64
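The snippet above averages over the whole frame. To get one row per category, as the question asks, the same toy frame can be grouped first. This is only a sketch: the grouping column `g` and its labels are hypothetical additions, and the output names `a_mean` and `b_median` are chosen here for illustration. Named aggregation lets each column use its own statistic.

```python
import pandas as pd

sample = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 5, 6, 7], 'c': [8, 9, 7, 6]})
# Hypothetical category column, not part of the original sample data
sample['g'] = ['x', 'x', 'y', 'y']

# One row per category, with a different statistic per column
summary = (
    sample.groupby('g')
    .agg(a_mean=('a', 'mean'), b_median=('b', 'median'))
    .reset_index()
)
print(summary)
```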