使用分组结果创建字典



我正试图总结一些数据,然后在过滤后,我会根据最终结果输出一个字典。

请看看下面我已经能够弄清楚到目前为止,我正在寻找帮助的步骤的解释。提前谢谢。

import pandas as pd
import numpy as np
data = [['New York','Female','Green',523.5,40.614256], ['New York','Female','Green',176.5,46.168092], 
['New York','Male','Green',39.5,58.970444], ['New York','Female','Purple',40.5,40.739437], 
['New York','Male','Purple',12.5,46.854553], ['New York','Male','Green',187.5,59.477795], 
['New York','Male','Green',21.5,59.283682], ['New York','Female','Green',0.5,46.772469],
['New York','Male','Green',48.5,58.847725], ['New York','Female','Purple',17.5,40.642568],
['New York','Male','Green',27.5,46.911063]]
df = pd.DataFrame(data, columns = ['city', 'gender', 'color', 'cm ', 'temp'])
#create `cuts` for `temp`
cuts = np.arange(0, 100.25, 0.25) # range for 'cuts'
df['temp_range'] = pd.cut(df.temp, cuts, include_lowest=True)
print(df)
city  gender   color    cm        temp     temp_range
0   New York  Female   Green  523.5  40.614256  (40.5, 40.75]
1   New York  Female   Green  176.5  46.168092  (46.0, 46.25]
2   New York    Male   Green   39.5  58.970444  (58.75, 59.0]
3   New York  Female  Purple   40.5  40.739437  (40.5, 40.75]
4   New York    Male  Purple   12.5  46.854553  (46.75, 47.0]
5   New York    Male   Green  187.5  59.477795  (59.25, 59.5]
6   New York    Male   Green   21.5  59.283682  (59.25, 59.5]
7   New York  Female   Green    0.5  46.772469  (46.75, 47.0]
8   New York    Male   Green   48.5  58.847725  (58.75, 59.0]
9   New York  Female  Purple   17.5  40.642568  (40.5, 40.75]
10  New York    Male   Green   27.5  46.911063  (46.75, 47.0]

在这一点上,我被卡住了。

卡住-第一部分

我接下来要做的是生成一个输出如下内容的摘要:

city  gender      temp_range    Green_count  Purple_count  Green_pct   Green_sum    Purple_sum     Green_max  Purple_max                                    
0   New York  Male     (46.75, 47.0]              1             1       0.50        27.5          12.5          27.5        12.5                                                                                      
1   New York  Male     (58.75, 59.0]              2           NaN       1.00        88.0           NaN          48.5         NaN                                                                                                                  
2   New York  Male     (59.25, 59.5]              2           NaN       1.00       209.0           NaN         187.5         NaN
3   New York  Female   (40.5, 40.75]              1             2       0.33       523.5          58.0         523.5        40.5                                                
4   New York  Female   (46.0, 46.25]              1           NaN       1.00       176.5           NaN         176.5         NaN                                 
5   New York  Female   (46.75, 47.0]              1           NaN       1.00         0.5           NaN           0.5         NaN

…之后,我将应用如下过滤器,例如:

df = df.loc[df['Green_pct'] > 0.5]
df = df.reset_index()

. .这将在我的过滤器后产生如下所示的df:

city  gender      temp_range    Green_count  Purple_count  Green_pct   Green_sum    Purple_sum     Green_max  Purple_max                                                                                                                      
0   New York  Male     (58.75, 59.0]              2           NaN       1.00        88.0           NaN          48.5         NaN                                                                                                                  
1   New York  Male     (59.25, 59.5]              2           NaN       1.00       209.0           NaN         187.5         NaN                                            
2   New York  Female   (46.0, 46.25]              1           NaN       1.00       176.5           NaN         176.5         NaN                                 
3   New York  Female   (46.75, 47.0]              1           NaN       1.00         0.5           NaN           0.5         NaN 

卡住-第2部分

…最后,使用过滤后的df的数据,我想输出一个稍后可以使用的字典

字典的格式应该如下:

temp_dict = {('New York', Male):(58.75,59.0,59.25,59.5),
('New York', Female):(46.0, 46.25,46.75,47.0)}  

步骤1

x = df.pivot_table(
index=["city", "gender", "temp_range"],
columns="color",
values="cm",
aggfunc={"cm": [lambda x: len(x), lambda x: x.sum(), lambda x: x.max()]},
)
x.columns = x.columns.set_levels(["count", "sum", "max"], level=0)
x = pd.concat(
[
x,
pd.concat(
{
"pct": x.xs("count", axis=1).div(
x.xs("count", axis=1).sum(1), axis=0
)
},
axis=1,
),
],
axis=1,
)
x.columns = x.columns.map(lambda v: "_".join(v[::-1]))
print(x)

打印:

Green_count  Purple_count  Green_sum  Purple_sum  Green_max  Purple_max  Green_pct  Purple_pct
city     gender temp_range                                                                                                   
New York Female (40.5, 40.75]          1.0           2.0      523.5        58.0      523.5        40.5   0.333333    0.666667
(46.0, 46.25]          1.0           NaN      176.5         NaN      176.5         NaN   1.000000         NaN
(46.75, 47.0]          1.0           NaN        0.5         NaN        0.5         NaN   1.000000         NaN
Male   (46.75, 47.0]          1.0           1.0       27.5        12.5       27.5        12.5   0.500000    0.500000
(58.75, 59.0]          2.0           NaN       88.0         NaN       48.5         NaN   1.000000         NaN
(59.25, 59.5]          2.0           NaN      209.0         NaN      187.5         NaN   1.000000         NaN
步骤2

x = (
x.loc[x["Green_pct"] > 0.5]
.reset_index()
.groupby(["city", "gender"])["temp_range"]
.agg(lambda x: [a for v in x for a in [v.left, v.right]])
.to_dict()
)
print(x)

打印:

{
("New York", "Female"): [46.0, 46.25, 46.75, 47.0],
("New York", "Male"): [58.75, 59.0, 59.25, 59.5],
}

相关内容

  • 没有找到相关文章

最新更新