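I have a dataframe with IDs and the choices made by those IDs. Each choice is related to a city. The set of choices is a list of integers, [10, 20, 30, 40, 50, 60], and the set of cities is a list of strings, ['XX', 'YY', 'ZZ']. Note: more than one choice may be related to the same city. For example, choices 20 and 30 are both related to city 'YY'.
This is the dataframe: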
ID choice city
1 10 XX
1 10 XX
1 20 YY
1 10 XX
1 30 YY
1 40 ZZ
2 20 YY
2 50 ZZ
2 50 ZZ
2 50 ZZ
2 10 XX
3 30 YY
3 30 YY
3 60 ZZ
3 60 ZZ
3 60 ZZ
3 10 XX
This is the choice-to-city dataframe:
choice city
10 XX
20 YY
30 YY
40 ZZ
50 ZZ
60 ZZ
Another dataframe tells us how many choices are related to each city:
city count
XX 1
YY 2
ZZ 3
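The count table can be derived from the choice-to-city table rather than typed by hand. A minimal sketch, assuming the mapping is held in a dataframe named `choice_city` (a name introduced here for illustration):

```python
import pandas as pd

# Choice-to-city mapping, copied from the table above.
choice_city = pd.DataFrame({
    "choice": [10, 20, 30, 40, 50, 60],
    "city": ["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"],
})

# Number of choices related to each city.
city_count = (
    choice_city.groupby("city")["choice"]
    .count()
    .rename("count")
    .reset_index()
)
print(city_count)
```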
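I want to create a variable for each choice: '10_Var', '20_Var', '30_Var', '40_Var', '50_Var', '60_Var'. In the first row of each ID, the values depend on the city of the first choice. For example, if the first choice is related to city 'XX', the variable '10_Var' gets the value 0.8 / (# of choices related to that city), where 0.8 is a parameter, and the variables of choices not related to that city get the value (1 - 0.8) / (# of choices - # of choices related to city 'XX').
After the step above, the data should look like this: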
ID choice city 10_Var 20_Var 30_Var 40_Var 50_Var 60_Var
1 10 XX 0.80 0.04 0.04 0.04 0.04 0.04
1 10 XX NaN NaN NaN NaN NaN NaN
1 20 YY NaN NaN NaN NaN NaN NaN
1 10 XX NaN NaN NaN NaN NaN NaN
1 30 YY NaN NaN NaN NaN NaN NaN
1 40 ZZ NaN NaN NaN NaN NaN NaN
2 20 YY 0.05 0.40 0.40 0.05 0.05 0.05
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 50 ZZ NaN NaN NaN NaN NaN NaN
2 10 XX NaN NaN NaN NaN NaN NaN
3 30 YY 0.05 0.40 0.40 0.05 0.05 0.05
3 30 YY NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 60 ZZ NaN NaN NaN NaN NaN NaN
3 10 XX NaN NaN NaN NaN NaN NaN
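The first-row rule can be sketched in plain Python as follows. The names `initial_values`, `choice_city`, and `city_count` are hypothetical helpers introduced for illustration; the mapping and counts are copied from the tables above.

```python
P = 0.8  # smoothing parameter from the question

# Choice -> city, and city -> number of related choices (from the tables above).
choice_city = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
city_count = {"XX": 1, "YY": 2, "ZZ": 3}


def initial_values(first_choice):
    """Return {choice: value} for the first row of an ID."""
    city = choice_city[first_choice]
    # Share P among the choices of the first choice's city ...
    same = P / city_count[city]
    # ... and 1 - P among all remaining choices.
    other = (1 - P) / (len(choice_city) - city_count[city])
    return {c: (same if choice_city[c] == city else other) for c in choice_city}


print(initial_values(10))  # first row of ID 1
print(initial_values(20))  # first row of ID 2
```

With a first choice of 20, both '20_Var' and '30_Var' receive 0.8 / 2, because both choices belong to city 'YY'.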
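From the second row onward (for each ID), the variable '10_Var', for example, gets the value:
(0.8 * previous value) + (1 - 0.8) * {1 if the **last** choice is related to the city 'XX', 0 otherwise} / (# of choices related to the city 'XX')
and likewise for every other variable, using the city of that variable's choice. Here the last choice is the choice in the previous row.
Note: this should be done separately for each ID.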
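To make the recurrence concrete, here is a small plain-Python sketch that replays the choices of ID 1. The function `update` and the dictionaries are hypothetical names introduced for illustration, and the starting row is the first-row result for ID 1 shown earlier.

```python
P = 0.8  # smoothing parameter from the question

choice_city = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
city_count = {"XX": 1, "YY": 2, "ZZ": 3}


def update(prev, last_choice):
    """One step of the recurrence, given the previous row's values and choice."""
    last_city = choice_city[last_choice]
    new = {}
    for c, value in prev.items():
        # Indicator: is this variable's city the city of the previous row's choice?
        hit = 1 if choice_city[c] == last_city else 0
        new[c] = P * value + (1 - P) * hit / city_count[choice_city[c]]
    return new


# First row of ID 1 (first choice 10, city 'XX').
row = {10: 0.80, 20: 0.04, 30: 0.04, 40: 0.04, 50: 0.04, 60: 0.04}
choices_id1 = [10, 10, 20, 10, 30, 40]

# Row k is computed from the choice in row k - 1, so the final choice is never used.
for last_choice in choices_id1[:-1]:
    row = update(row, last_choice)
    print({c: round(v, 6) for c, v in row.items()})
```

The update for a row uses the choice of the row before it, which is why '20_Var' only rises in the fourth row of ID 1, one row after choice 20 appears.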
Expected result:
ID choice city 10_Var 20_Var 30_Var 40_Var 50_Var 60_Var
1 10 XX 0.800000 0.040000 0.040000 0.040000 0.040000 0.040000
1 10 XX 0.840000 0.032000 0.032000 0.032000 0.032000 0.032000
1 20 YY 0.872000 0.025600 0.025600 0.025600 0.025600 0.025600
1 10 XX 0.697600 0.120480 0.120480 0.020480 0.020480 0.020480
1 30 YY 0.758080 0.096384 0.096384 0.016384 0.016384 0.016384
1 40 ZZ 0.606464 0.177107 0.177107 0.013107 0.013107 0.013107
2 20 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
2 50 ZZ 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
2 50 ZZ 0.032000 0.336000 0.336000 0.098667 0.098667 0.098667
2 50 ZZ 0.025600 0.268800 0.268800 0.145600 0.145600 0.145600
2 10 XX 0.020480 0.215040 0.215040 0.183147 0.183147 0.183147
3 30 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
3 30 YY 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
3 60 ZZ 0.032000 0.436000 0.436000 0.032000 0.032000 0.032000
3 60 ZZ 0.025600 0.348800 0.348800 0.092267 0.092267 0.092267
3 60 ZZ 0.020480 0.279040 0.279040 0.140480 0.140480 0.140480
3 10 XX 0.016384 0.223232 0.223232 0.179051 0.179051 0.179051
This question may help: Create "exponential smoothing" variable - Pandas
Here is a possible solution:
import numpy as np
import pandas as pd

# Smoothing parameter
P = 0.8


def exp_smooth(g):
    # First row: P is shared among the choices of the first choice's city,
    # and 1 - P among all remaining choices.
    city = g.iloc[0].City
    rows = [np.where(cities == city,
                     P / cic[city],
                     (1 - P) / (len(choices) - cic[city]))]
    # Every later row is updated from the city of the previous row's choice.
    for i in range(len(g) - 1):
        city = g.iloc[i].City
        rows.append(rows[-1] * P + (1 - P) * np.where(cities == city, 1, 0) / cic[city])
    return np.array(rows)


df = pd.DataFrame([[1, 10, "XX"], [1, 10, "XX"], [1, 20, "YY"], [1, 10, "XX"],
                   [1, 30, "YY"], [1, 40, "ZZ"], [2, 20, "YY"], [2, 50, "ZZ"],
                   [2, 50, "ZZ"], [2, 50, "ZZ"], [2, 10, "XX"], [3, 30, "YY"],
                   [3, 30, "YY"], [3, 60, "ZZ"], [3, 60, "ZZ"], [3, 60, "ZZ"],
                   [3, 10, "XX"]],
                  columns=("ID", "Choice", "City"))
# Choice -> city, and city -> number of related choices
chc = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
cic = {"XX": 1, "YY": 2, "ZZ": 3}
choices = np.unique(df.Choice)
cities = np.vectorize(lambda ch: chc[ch])(choices)
# Groups are returned sorted by ID, which matches the row order of df here
var_arr = np.concatenate([exp_smooth(g) for _, g in df.groupby("ID")], axis=0)
var_df = pd.DataFrame(var_arr, columns=[f"var_{c}" for c in choices])
df = pd.concat([df, var_df], axis=1)
df
This gives the expected result:
ID Choice City var_10 var_20 var_30 var_40 var_50 var_60
0 1 10 XX 0.800000 0.040000 0.040000 0.040000 0.040000 0.040000
1 1 10 XX 0.840000 0.032000 0.032000 0.032000 0.032000 0.032000
2 1 20 YY 0.872000 0.025600 0.025600 0.025600 0.025600 0.025600
3 1 10 XX 0.697600 0.120480 0.120480 0.020480 0.020480 0.020480
4 1 30 YY 0.758080 0.096384 0.096384 0.016384 0.016384 0.016384
5 1 40 ZZ 0.606464 0.177107 0.177107 0.013107 0.013107 0.013107
6 2 20 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
7 2 50 ZZ 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
8 2 50 ZZ 0.032000 0.336000 0.336000 0.098667 0.098667 0.098667
9 2 50 ZZ 0.025600 0.268800 0.268800 0.145600 0.145600 0.145600
10 2 10 XX 0.020480 0.215040 0.215040 0.183147 0.183147 0.183147
11 3 30 YY 0.050000 0.400000 0.400000 0.050000 0.050000 0.050000
12 3 30 YY 0.040000 0.420000 0.420000 0.040000 0.040000 0.040000
13 3 60 ZZ 0.032000 0.436000 0.436000 0.032000 0.032000 0.032000
14 3 60 ZZ 0.025600 0.348800 0.348800 0.092267 0.092267 0.092267
15 3 60 ZZ 0.020480 0.279040 0.279040 0.140480 0.140480 0.140480
16 3 10 XX 0.016384 0.223232 0.223232 0.179051 0.179051 0.179051
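One caveat about the solution above: it concatenates the per-group arrays in group order and then attaches them by position, which works here because the rows are already sorted by ID and carry the default index. A minimal sketch of a variant that keeps each group's own index instead, assuming the same `ID`, `Choice`, `City` column names. The names `smooth_group`, `CHC`, and `CIC` are hypothetical, and the small unsorted dataframe is illustrative data, not taken from the question.

```python
import numpy as np
import pandas as pd

P = 0.8  # smoothing parameter from the question
CHC = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}  # choice -> city
CIC = {"XX": 1, "YY": 2, "ZZ": 3}  # city -> number of related choices
choices = np.array(sorted(CHC))
cities = np.array([CHC[c] for c in choices])


def smooth_group(g):
    """Return the smoothed variables for one ID, indexed like the group itself."""
    seq = g["City"].to_numpy()
    first = seq[0]
    rows = [np.where(cities == first,
                     P / CIC[first],
                     (1 - P) / (len(choices) - CIC[first]))]
    for last in seq[:-1]:  # each row is updated from the previous row's city
        rows.append(P * rows[-1] + (1 - P) * (cities == last) / CIC[last])
    return pd.DataFrame(np.array(rows), index=g.index,
                        columns=[f"var_{c}" for c in choices])


# Deliberately unsorted IDs and a non-default index (illustrative data).
df = pd.DataFrame(
    {"ID": [2, 1, 2, 1], "Choice": [20, 10, 50, 20], "City": ["YY", "XX", "ZZ", "YY"]},
    index=[7, 3, 9, 5],
)
var_df = pd.concat([smooth_group(g) for _, g in df.groupby("ID")])
out = df.join(var_df)  # aligned on the index, not on position
print(out)
```

Joining on the index means the original row order is preserved even when IDs are interleaved.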