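I have a dataset with 2 columns: age_group and target (0, 1). I want to create a third column, "count" (the value count of age_group). It must check whether the target is good or bad and fill in the corresponding count.
The age bins (seven edges, so six intervals):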
df['age_group'] = pd.cut(df['age'], [17,22,26,32,45,50,60])
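As a small illustration of how the `pd.cut` call above assigns bins, here is a sketch using a handful of hypothetical ages (the `ages` values are invented for the example and are not from the dataset). By default the intervals are closed on the right, so 22 falls in (17, 22] while 17 itself falls outside every bin and becomes NaN.

```python
import pandas as pd

# Hypothetical ages, chosen only to show how the edges map to bins.
ages = pd.Series([18, 22, 23, 30, 45, 46, 60])

# Same edges as in the question: seven edges give six right-closed intervals.
edges = [17, 22, 26, 32, 45, 50, 60]
age_group = pd.cut(ages, edges)

# The categories are pd.Interval objects such as (17, 22] and (22, 26].
print(age_group.cat.categories)
print(age_group.tolist())

# A value equal to the lowest edge is not included and becomes NaN.
outside = pd.cut(pd.Series([17]), edges)
print(outside.isna().tolist())
```

Because the column holds `pd.Interval` values rather than strings, comparisons such as `age_group == pd.Interval(32, 45)` work element by element.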
40 rows (first 11 shown):
age_group target
0 (45, 50] bad
1 (45, 50] bad
2 (32, 45] good
3 (32, 45] good
4 (50, 60] bad
5 (32, 45] bad
6 (26, 32] good
7 (50, 60] good
8 (32, 45] bad
9 (17, 22] good
10 (32, 45] good
I can group by target:
df.groupby('target').age_group.value_counts().to_frame()
age_group
target age_group
bad (32, 45] 7
(26, 32] 3
(45, 50] 3
(50, 60] 3
(17, 22] 2
good (32, 45] 8
(17, 22] 4
(50, 60] 4
(45, 50] 3
(26, 32] 2
(22, 26] 1
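But in this DataFrame, age_group is the only regular column I can access. I can't access the "target" column or the specific values for the good and bad targets.
I want to look up each age_group together with its target and put the corresponding value in the "count" column.
So I'm writing this ugly workaround function.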
def get_value_count_for_age_group_category(age_group, target):
    bad_vals = df[df['bad']==1]['age_group'].value_counts().sort_index()
    good_vals = df[df['good']==1]['age_group'].value_counts().sort_index()
    values = age_freq.values.tolist()
    keys = age_freq.keys()
    if target == 'bad':
        for k in keys:
            if age_group == pd.Interval(32,45):
                return bad_vals[0]
            elif age_group == pd.Interval(50, 60):
                return bad_vals[1]
            elif age_group == pd.Interval(45, 50):
                return bad_vals[2]
            elif age_group == pd.Interval(26, 32):
                return bad_vals[3]
            elif age_group == pd.Interval(22, 26):
                return bad_vals[4]
            elif age_group == pd.Interval(17,22):
                return bad_vals[5]
    else:
        for k in keys:
            if age_group == pd.Interval(32,45):
                return good_vals[0]
            elif age_group == pd.Interval(50, 60):
                return good_vals[1]
            elif age_group == pd.Interval(45, 50):
                return good_vals[2]
            elif age_group == pd.Interval(26, 32):
                return good_vals[3]
            elif age_group == pd.Interval(22, 26):
                return good_vals[4]
            elif age_group == pd.Interval(17,22):
                return good_vals[5]
This doesn't work when I pass the 2 values, age_group and its target, to a lambda function:
n['count'] = n[['age_group', 'target']].apply(lambda num:get_value_count_for_age_group_category(num, target) )
<lambda>() missing 1 required positional argument:
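A note on the error: `DataFrame.apply` calls the function with a single argument, which is a whole column by default and a whole row when `axis=1` is passed, so a lambda declared with two parameters would fail with a message like the one above. The following is only a sketch of a row-wise lookup. It assumes the 11 sample rows shown earlier, with `age_group` stored as plain strings for brevity (the real column holds `pd.Interval` values); `counts` and `lookup_count` are illustrative names.

```python
import pandas as pd

# The 11 sample rows from the question; age_group is kept as strings here.
df = pd.DataFrame({
    "age_group": ["(45, 50]", "(45, 50]", "(32, 45]", "(32, 45]",
                  "(50, 60]", "(32, 45]", "(26, 32]", "(50, 60]",
                  "(32, 45]", "(17, 22]", "(32, 45]"],
    "target": ["bad", "bad", "good", "good", "bad", "bad",
               "good", "good", "bad", "good", "good"],
})

# Series indexed by (target, age_group) pairs, holding each group's size.
counts = df.groupby(["target", "age_group"]).size()

def lookup_count(row):
    # With axis=1, each call receives one row as a Series,
    # so both values are read from that single argument.
    return counts.loc[(row["target"], row["age_group"])]

df["count"] = df.apply(lookup_count, axis=1)
print(df)
```

This keeps the shape of the original approach (one function call per row) while replacing the chain of `if`/`elif` branches with a single lookup.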
Does this work for you?
df.groupby('target').age_group.value_counts().reset_index(name='count')
Input
age_group target
0 (45, 50) bad
1 (45, 50) bad
2 (32, 45) good
3 (32, 45) good
4 (50, 60) bad
5 (32, 45) bad
6 (26, 32) good
7 (50, 60) good
8 (32, 45) bad
9 (17, 22) good
10 (32, 45) good
Output
target age_group count
0 bad (32, 45) 2
1 bad (45, 50) 2
2 bad (50, 60) 1
3 good (32, 45) 3
4 good (17, 22) 1
5 good (26, 32) 1
6 good (50, 60) 1
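To make clear why `reset_index` helps: the grouped `value_counts()` result is a Series whose index has two levels, target and age_group, and `reset_index(name='count')` turns both levels into ordinary columns. A minimal sketch, assuming the sample input above with `age_group` as plain strings; `grouped`, `flat` and `bad_only` are illustrative names.

```python
import pandas as pd

# Sample input from the answer, with age_group kept as strings.
df = pd.DataFrame({
    "age_group": ["(45, 50]", "(45, 50]", "(32, 45]", "(32, 45]",
                  "(50, 60]", "(32, 45]", "(26, 32]", "(50, 60]",
                  "(32, 45]", "(17, 22]", "(32, 45]"],
    "target": ["bad", "bad", "good", "good", "bad", "bad",
               "good", "good", "bad", "good", "good"],
})

grouped = df.groupby("target")["age_group"].value_counts()

# The counts can be read straight from the two-level index with a tuple key.
n_bad_32_45 = grouped.loc[("bad", "(32, 45]")]
print(n_bad_32_45)

# After reset_index, target and age_group are regular columns again.
flat = grouped.reset_index(name="count")
bad_only = flat[flat["target"] == "bad"]
print(bad_only)
```

Either form works for lookups; the flat table is usually easier to filter or merge.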
If you need the zero counts as well, use the following:
df1=df.groupby('target').age_group.value_counts().reset_index(name='count')
df1.set_index(['target','age_group']).unstack(fill_value=0).stack().reset_index()
Output
target age_group count
0 bad (17, 22) 0
1 bad (26, 32) 0
2 bad (32, 45) 2
3 bad (45, 50) 2
4 bad (50, 60) 1
5 good (17, 22) 1
6 good (26, 32) 1
7 good (32, 45) 3
8 good (45, 50) 0
9 good (50, 60) 1
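If the goal is a "count" column on the original rows rather than a summary table, one possible approach is `groupby(...).transform('size')`, which returns a result aligned with the original index. This is a sketch under the same assumptions as the sample input above (`age_group` as plain strings). With a categorical `age_group` produced by `pd.cut`, passing `observed=True` to `groupby` is probably advisable so that empty bins are not treated as groups.

```python
import pandas as pd

# Sample input from the answer, with age_group kept as strings.
df = pd.DataFrame({
    "age_group": ["(45, 50]", "(45, 50]", "(32, 45]", "(32, 45]",
                  "(50, 60]", "(32, 45]", "(26, 32]", "(50, 60]",
                  "(32, 45]", "(17, 22]", "(32, 45]"],
    "target": ["bad", "bad", "good", "good", "bad", "bad",
               "good", "good", "bad", "good", "good"],
})

# For every row, the number of rows sharing its (target, age_group) pair.
df["count"] = df.groupby(["target", "age_group"])["target"].transform("size")
print(df)
```

Each row then carries the size of its own group, so rows 0 and 1 (both `(45, 50]` and bad) get 2, matching the summary table.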