我正试图总结一些数据,然后在过滤后,我会根据最终结果输出一个字典。
请看看下面我已经能够弄清楚到目前为止,我正在寻找帮助的步骤的解释。提前谢谢。
import pandas as pd
import numpy as np
data = [['New York','Female','Green',523.5,40.614256], ['New York','Female','Green',176.5,46.168092],
['New York','Male','Green',39.5,58.970444], ['New York','Female','Purple',40.5,40.739437],
['New York','Male','Purple',12.5,46.854553], ['New York','Male','Green',187.5,59.477795],
['New York','Male','Green',21.5,59.283682], ['New York','Female','Green',0.5,46.772469],
['New York','Male','Green',48.5,58.847725], ['New York','Female','Purple',17.5,40.642568],
['New York','Male','Green',27.5,46.911063]]
df = pd.DataFrame(data, columns = ['city', 'gender', 'color', 'cm ', 'temp'])
#create `cuts` for `temp`
cuts = np.arange(0, 100.25, 0.25) # range for 'cuts'
df['temp_range'] = pd.cut(df.temp, cuts, include_lowest=True)
print(df)
city gender color cm temp temp_range
0 New York Female Green 523.5 40.614256 (40.5, 40.75]
1 New York Female Green 176.5 46.168092 (46.0, 46.25]
2 New York Male Green 39.5 58.970444 (58.75, 59.0]
3 New York Female Purple 40.5 40.739437 (40.5, 40.75]
4 New York Male Purple 12.5 46.854553 (46.75, 47.0]
5 New York Male Green 187.5 59.477795 (59.25, 59.5]
6 New York Male Green 21.5 59.283682 (59.25, 59.5]
7 New York Female Green 0.5 46.772469 (46.75, 47.0]
8 New York Male Green 48.5 58.847725 (58.75, 59.0]
9 New York Female Purple 17.5 40.642568 (40.5, 40.75]
10 New York Male Green 27.5 46.911063 (46.75, 47.0]
在这一点上,我被卡住了。
卡住-第一部分
我接下来要做的是生成一个输出如下内容的摘要:
city gender temp_range Green_count Purple_count Green_pct Green_sum Purple_sum Green_max Purple_max
0 New York Male (46.75, 47.0] 1 1 0.50 27.5 12.5 27.5 12.5
1 New York Male (58.75, 59.0] 2 NaN 1.00 88.0 NaN 48.5 NaN
2 New York Male (59.25, 59.5] 2 NaN 1.00 209.0 NaN 187.5 NaN
3 New York Female (40.5, 40.75] 1 2 0.33 523.5 58.0 523.5 40.5
4 New York Female (46.0, 46.25] 1 NaN 1.00 176.5 NaN 176.5 NaN
5 New York Female (46.75, 47.0] 1 NaN 1.00 0.5 NaN 0.5 NaN
…之后,我将应用如下过滤器,例如:
df = df.loc[df['Green_pct'] > 0.5]
df = df.reset_index()
. .这将在我的过滤器后产生如下所示的df
:
city gender temp_range Green_count Purple_count Green_pct Green_sum Purple_sum Green_max Purple_max
0 New York Male (58.75, 59.0] 2 NaN 1.00 88.0 NaN 48.5 NaN
1 New York Male (59.25, 59.5] 2 NaN 1.00 209.0 NaN 187.5 NaN
2 New York Female (46.0, 46.25] 1 NaN 1.00 176.5 NaN 176.5 NaN
3 New York Female (46.75, 47.0] 1 NaN 1.00 0.5 NaN 0.5 NaN
卡住-第2部分
…最后,使用过滤后的df
的数据,我想输出一个稍后可以使用的字典
字典的格式应该如下:
temp_dict = {('New York', Male):(58.75,59.0,59.25,59.5),
('New York', Female):(46.0, 46.25,46.75,47.0)}
步骤1
x = df.pivot_table(
index=["city", "gender", "temp_range"],
columns="color",
values="cm",
aggfunc={"cm": [lambda x: len(x), lambda x: x.sum(), lambda x: x.max()]},
)
x.columns = x.columns.set_levels(["count", "sum", "max"], level=0)
x = pd.concat(
[
x,
pd.concat(
{
"pct": x.xs("count", axis=1).div(
x.xs("count", axis=1).sum(1), axis=0
)
},
axis=1,
),
],
axis=1,
)
x.columns = x.columns.map(lambda v: "_".join(v[::-1]))
print(x)
打印:
Green_count Purple_count Green_sum Purple_sum Green_max Purple_max Green_pct Purple_pct
city gender temp_range
New York Female (40.5, 40.75] 1.0 2.0 523.5 58.0 523.5 40.5 0.333333 0.666667
(46.0, 46.25] 1.0 NaN 176.5 NaN 176.5 NaN 1.000000 NaN
(46.75, 47.0] 1.0 NaN 0.5 NaN 0.5 NaN 1.000000 NaN
Male (46.75, 47.0] 1.0 1.0 27.5 12.5 27.5 12.5 0.500000 0.500000
(58.75, 59.0] 2.0 NaN 88.0 NaN 48.5 NaN 1.000000 NaN
(59.25, 59.5] 2.0 NaN 209.0 NaN 187.5 NaN 1.000000 NaN
步骤2x = (
x.loc[x["Green_pct"] > 0.5]
.reset_index()
.groupby(["city", "gender"])["temp_range"]
.agg(lambda x: [a for v in x for a in [v.left, v.right]])
.to_dict()
)
print(x)
打印:
{
("New York", "Female"): [46.0, 46.25, 46.75, 47.0],
("New York", "Male"): [58.75, 59.0, 59.25, 59.5],
}