I have a table that looks like this:
c_id  soap_spend  towel_spend  year_spend  cluster
c1    1           2            3           1
c2    2           4            6           2
c3    1           2            3           2
c4    3           2            5           1
I want to return two tables.
Table 1:
                 cluster_1           cluster_2
% soap_spend     a = (1+3)/(3+5)     c = (2+1)/(6+3)     -- soap_spend.sum/year_spend.sum
% towel_spend    b = (2+2)/(3+5)     d = (2+4)/(6+3)     -- towel_spend.sum/year_spend.sum
Table 2:
use results from table 1
                 cluster_1              cluster_2
% soap_spend     a/mean(soap_spend)     c/mean(towel_spend)
% towel_spend    b/mean(soap_spend)     d/mean(towel_spend)
My code:
cols = ['soap_spend', 'towel_spend']
df.groupby('cluster').apply(df[col].sum()/df['year_spend'].sum()
Any suggestions on how to fix the code?
You don't need to group by cluster. Selecting each cluster's rows with a boolean mask in loc and summing them is enough:
numerator = df['soap_spend'].loc[df['cluster'] == val].sum()
denominator = df['year_spend'].loc[df['cluster'] == val].sum()
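As a quick check, the sketch below rebuilds the sample table from the question and evaluates the ratio for a single cluster. The `df` construction and the choice `val = 1` are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'c_id': ['c1', 'c2', 'c3', 'c4'],
    'soap_spend': [1, 2, 1, 3],
    'towel_spend': [2, 4, 2, 2],
    'year_spend': [3, 6, 3, 5],
    'cluster': [1, 2, 2, 1],
})

val = 1  # illustrative choice of cluster
mask = df['cluster'] == val  # boolean Series, True for rows in this cluster

numerator = df['soap_spend'].loc[mask].sum()    # 1 + 3
denominator = df['year_spend'].loc[mask].sum()  # 3 + 5
ratio = numerator / denominator                 # value "a" in table 1
print(ratio)
```

For cluster 1 this is (1+3)/(3+5), the value labelled a in table 1.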
The complete code should look like this:
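import pandas as pd

# One row per spend share; a column is added for each cluster below
df_results = pd.DataFrame(index=['% soap_spend', '% towel_spend'])
for val in df.cluster.unique():
    in_cluster = df['cluster'] == val  # boolean mask for this cluster
    soap_numerator = df['soap_spend'].loc[in_cluster].sum()
    denominator = df['year_spend'].loc[in_cluster].sum()
    towel_numerator = df['towel_spend'].loc[in_cluster].sum()
    soap_spend = soap_numerator / denominator
    towel_spend = towel_numerator / denominator
    col = [soap_spend, towel_spend]
    # Assign by name: insert() at position int(val) is out of bounds on an empty frame
    df_results['cluster_{}'.format(int(val))] = col
# Put the columns in a fixed order
df_results = df_results[['cluster_1', 'cluster_2']]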
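Since the question started from `groupby`, here is a minimal sketch of a grouped alternative that should give the same table 1 without an explicit loop. It is one possible approach, not the method of the answer above; the `df` construction and the names `sums`, `shares` and `table_1` are assumptions for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'c_id': ['c1', 'c2', 'c3', 'c4'],
    'soap_spend': [1, 2, 1, 3],
    'towel_spend': [2, 4, 2, 2],
    'year_spend': [3, 6, 3, 5],
    'cluster': [1, 2, 2, 1],
})
cols = ['soap_spend', 'towel_spend']

# Sum every spend column within each cluster (index: cluster)
sums = df.groupby('cluster')[cols + ['year_spend']].sum()

# Divide each spend total by the cluster's year_spend total, row by row
shares = sums[cols].div(sums['year_spend'], axis=0)

# Transpose so clusters become columns, as in table 1
table_1 = shares.T
table_1.columns = ['cluster_{}'.format(int(c)) for c in table_1.columns]
table_1.index = ['% ' + name for name in table_1.index]
print(table_1)
```

The grouped sums are computed once for all columns, so adding another spend column only means extending `cols`.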
For table 2, divide each value of table 1 by the mean of its row:
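table_1 = df_results  # table 1 as built in the loop above
table_2 = table_1.copy()
for row in range(table_2.shape[0]):
    # Take the mean from table_1, since the table_2 row changes as it is filled in
    row_mean = table_1.iloc[row].mean()
    for col in table_2.columns:
        # Single .loc assignment avoids chained indexing such as table_2[col].iloc[row] = ...
        table_2.loc[table_2.index[row], col] = table_1[col].iloc[row] / row_mean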
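The same division can be sketched without loops using `DataFrame.div`. This assumes, as the loop above does, that "mean" refers to the mean of each table 1 row across clusters; the notation in the question is ambiguous on that point. The `table_1` values below are filled in by hand from the question's sample data for illustration.

```python
import pandas as pd

# Table 1 values computed from the question's sample data
table_1 = pd.DataFrame(
    {'cluster_1': [4 / 8, 4 / 8], 'cluster_2': [3 / 9, 6 / 9]},
    index=['% soap_spend', '% towel_spend'],
)

# Mean of each row across the cluster columns
row_means = table_1.mean(axis=1)

# Divide every row by its own mean; axis=0 aligns row_means on the row index
table_2 = table_1.div(row_means, axis=0)
print(table_2)
```

Because every row is divided by its own mean, each row of `table_2` averages to 1, which makes a convenient sanity check.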