How to create an "exponential smoothing" variable (hard)



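I have a dataframe with IDs and the choices those IDs made. Each choice is related to a city. The set of choices is a list of integers: [10, 20, 30, 40, 50, 60], and the set of cities is a list of strings: ['XX', 'YY', 'ZZ']. Note: one or more choices may be related to the same city. For example, choices 20 and 30 are both related to city 'YY'.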

This is the dataframe:

ID  choice city
1      10   XX
1      10   XX
1      20   YY
1      10   XX
1      30   YY
1      40   ZZ
2      20   YY
2      50   ZZ
2      50   ZZ
2      50   ZZ
2      10   XX
3      30   YY
3      30   YY
3      60   ZZ
3      60   ZZ
3      60   ZZ
3      10   XX

This is the choice-to-city dataframe:

choice city
10   XX
20   YY
30   YY
40   ZZ
50   ZZ
60   ZZ

Another dataframe tells us how many choices are related to each city:

city  count
XX      1
YY      2
ZZ      3
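
The count table does not have to be maintained by hand, since it follows from the choice-to-city table. A minimal sketch, assuming the mapping lives in a dataframe named `choice_city` (a name introduced here for illustration):

```python
import pandas as pd

# The choice-to-city table from above
choice_city = pd.DataFrame({
    "choice": [10, 20, 30, 40, 50, 60],
    "city": ["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"],
})

# Count how many choices map to each city
city_count = (
    choice_city.groupby("city")["choice"]
    .count()
    .rename("count")
    .reset_index()
)
print(city_count)
```

This should give one row per city with counts 1, 2 and 3 for 'XX', 'YY' and 'ZZ'.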

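I want to create a variable for each choice: '10_Var', '20_Var', '30_Var', '40_Var', '50_Var', '60_Var'. In the first row of each ID, if the first choice is related to city 'XX', then the variable '10_Var' (and any other variable whose choice is related to that city) gets the value 0.8 / # of choices related to this city (0.8 is a parameter), while the other variables, which are not related to that city, get the value (1 - 0.8) / (# of choices - # of choices related to the city 'XX').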
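
The first-row rule can be sketched with NumPy as follows. The names `P`, `cities`, `count` and `first_row` are assumptions made for this illustration, not part of the original data:

```python
import numpy as np

P = 0.8  # the parameter from the question
choices = np.array([10, 20, 30, 40, 50, 60])
cities = np.array(["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"])  # city of each choice
count = {"XX": 1, "YY": 2, "ZZ": 3}  # number of choices per city


def first_row(city):
    """Initial values when an ID's first choice is related to `city`."""
    same = P / count[city]                           # choices in that city
    other = (1 - P) / (len(choices) - count[city])   # all remaining choices
    return np.where(cities == city, same, other)


print(first_row("XX"))  # 0.8 for choice 10, about 0.04 elsewhere
print(first_row("YY"))  # 0.4 for choices 20 and 30, about 0.05 elsewhere
```

In both cases the six values sum to 1, because P is split among the choices of the chosen city and 1 - P among the rest.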

This is what the data should look like after the step above:

ID  choice city  10_Var  20_Var  30_Var  40_Var  50_Var  60_Var
1      10   XX    0.80    0.04    0.04    0.04    0.04    0.04
1      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
1      20   YY     NaN     NaN     NaN     NaN     NaN     NaN
1      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
1      30   YY     NaN     NaN     NaN     NaN     NaN     NaN
1      40   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
2      20   YY    0.05    0.40    0.40    0.05    0.05    0.05
2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
2      50   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
2      10   XX     NaN     NaN     NaN     NaN     NaN     NaN
3      30   YY    0.05    0.40    0.40    0.05    0.05    0.05
3      30   YY     NaN     NaN     NaN     NaN     NaN     NaN
3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
3      60   ZZ     NaN     NaN     NaN     NaN     NaN     NaN
3      10   XX     NaN     NaN     NaN     NaN     NaN     NaN

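From the second row onward (for each ID), the variable '10_Var', for example, gets the value: (0.8 * previous value) + (1 - 0.8) * {1 if the **last** choice (the choice in the previous row) is related to the city 'XX', 0 otherwise} / # of choices related to the city 'XX', and so on for each variable.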

Note: this should be done separately for each ID.
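
One step of this recursion can be sketched as below. The helper `update` and the variable names are assumptions for illustration; the starting row is the first row of ID 1 from the table above:

```python
import numpy as np

P = 0.8  # the parameter from the question
cities = np.array(["XX", "YY", "YY", "ZZ", "ZZ", "ZZ"])  # city of each choice
count = {"XX": 1, "YY": 2, "ZZ": 3}  # number of choices per city


def update(prev_values, last_city):
    """Apply one smoothing step, given the city chosen in the previous row."""
    # 1 / count for choices in the last chosen city, 0 for all others
    bump = (cities == last_city) / count[last_city]
    return P * prev_values + (1 - P) * bump


row1 = np.array([0.8, 0.04, 0.04, 0.04, 0.04, 0.04])  # first row of ID 1
row2 = update(row1, "XX")  # previous choice was 10 (XX)
row3 = update(row2, "XX")  # previous choice was 10 (XX)
row4 = update(row3, "YY")  # previous choice was 20 (YY)
print(row2, row3, row4, sep="\n")
```

Up to floating-point rounding, `row2`, `row3` and `row4` should match rows 2 to 4 of ID 1 in the expected result. Each row keeps summing to 1, since a step scales the previous row by P and adds a total of 1 - P.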

Expected result:

ID  choice city    10_Var    20_Var    30_Var    40_Var    50_Var    60_Var
1      10   XX  0.800000  0.040000  0.040000  0.040000  0.040000  0.040000
1      10   XX  0.840000  0.032000  0.032000  0.032000  0.032000  0.032000
1      20   YY  0.872000  0.025600  0.025600  0.025600  0.025600  0.025600
1      10   XX  0.697600  0.120480  0.120480  0.020480  0.020480  0.020480
1      30   YY  0.758080  0.096384  0.096384  0.016384  0.016384  0.016384
1      40   ZZ  0.606464  0.177107  0.177107  0.013107  0.013107  0.013107
2      20   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
2      50   ZZ  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
2      50   ZZ  0.032000  0.336000  0.336000  0.098667  0.098667  0.098667
2      50   ZZ  0.025600  0.268800  0.268800  0.145600  0.145600  0.145600
2      10   XX  0.020480  0.215040  0.215040  0.183147  0.183147  0.183147
3      30   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
3      30   YY  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
3      60   ZZ  0.032000  0.436000  0.436000  0.032000  0.032000  0.032000
3      60   ZZ  0.025600  0.348800  0.348800  0.092267  0.092267  0.092267
3      60   ZZ  0.020480  0.279040  0.279040  0.140480  0.140480  0.140480
3      10   XX  0.016384  0.223232  0.223232  0.179051  0.179051  0.179051

This question may help: Creating an "exponential smoothing" variable - Pandas

Here is a possible solution:

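import numpy as np
import pandas as pd

# Smoothing parameter
P = 0.8


def exp_smooth(g):
    """Return one row of smoothed values per row of group g (a single ID)."""
    # First row: choices in the city of the first choice share P,
    # all other choices share the remaining 1 - P.
    city = g.iloc[0].City
    rows = [np.where(cities == city,
                     P / cic[city],
                     (1 - P) / (len(choices) - cic[city]))]
    # Later rows: decay the previous row by P and add (1 - P) / count to
    # the choices in the city of the previous row's choice.
    for i in range(len(g) - 1):
        city = g.iloc[i].City
        rows.append(rows[-1] * P + (1 - P) * np.where(cities == city, 1, 0) / cic[city])
    return np.array(rows)


df = pd.DataFrame([[1, 10, "XX"], [1, 10, "XX"], [1, 20, "YY"], [1, 10, "XX"],
                   [1, 30, "YY"], [1, 40, "ZZ"], [2, 20, "YY"], [2, 50, "ZZ"],
                   [2, 50, "ZZ"], [2, 50, "ZZ"], [2, 10, "XX"], [3, 30, "YY"],
                   [3, 30, "YY"], [3, 60, "ZZ"], [3, 60, "ZZ"], [3, 60, "ZZ"],
                   [3, 10, "XX"]],
                  columns=("ID", "Choice", "City"))

# Choice -> city, and number of choices per city
chc = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
cic = {"XX": 1, "YY": 2, "ZZ": 3}

choices = np.unique(df.Choice)
cities = np.vectorize(lambda ch: chc[ch])(choices)  # city of each choice

# Stack the per-ID results; this relies on df being sorted by ID
# and having the default 0..n-1 index.
var_arr = np.concatenate([exp_smooth(g) for _, g in df.groupby("ID")], axis=0)
var_df = pd.DataFrame(var_arr, columns=[f"var_{c}" for c in choices])
df = pd.concat([df, var_df], axis=1)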

df now contains the expected result:

ID  Choice City    var_10    var_20    var_30    var_40    var_50    var_60
0    1      10   XX  0.800000  0.040000  0.040000  0.040000  0.040000  0.040000
1    1      10   XX  0.840000  0.032000  0.032000  0.032000  0.032000  0.032000
2    1      20   YY  0.872000  0.025600  0.025600  0.025600  0.025600  0.025600
3    1      10   XX  0.697600  0.120480  0.120480  0.020480  0.020480  0.020480
4    1      30   YY  0.758080  0.096384  0.096384  0.016384  0.016384  0.016384
5    1      40   ZZ  0.606464  0.177107  0.177107  0.013107  0.013107  0.013107
6    2      20   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
7    2      50   ZZ  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
8    2      50   ZZ  0.032000  0.336000  0.336000  0.098667  0.098667  0.098667
9    2      50   ZZ  0.025600  0.268800  0.268800  0.145600  0.145600  0.145600
10   2      10   XX  0.020480  0.215040  0.215040  0.183147  0.183147  0.183147
11   3      30   YY  0.050000  0.400000  0.400000  0.050000  0.050000  0.050000
12   3      30   YY  0.040000  0.420000  0.420000  0.040000  0.040000  0.040000
13   3      60   ZZ  0.032000  0.436000  0.436000  0.032000  0.032000  0.032000
14   3      60   ZZ  0.025600  0.348800  0.348800  0.092267  0.092267  0.092267
15   3      60   ZZ  0.020480  0.279040  0.279040  0.140480  0.140480  0.140480
16   3      10   XX  0.016384  0.223232  0.223232  0.179051  0.179051  0.179051
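
The solution above attaches the new columns by position, so it depends on the rows being sorted by ID with a default index, and it names the columns var_10 rather than 10_Var as in the question. The following is a sketch of one possible variation, not a definitive implementation. It keeps each group's own row labels so that interleaved IDs still line up. The helper `smooth_group` and the small interleaved input are hypothetical, introduced only for this example:

```python
import numpy as np
import pandas as pd

P = 0.8  # smoothing parameter from the question
chc = {10: "XX", 20: "YY", 30: "YY", 40: "ZZ", 50: "ZZ", 60: "ZZ"}
choices = np.array(sorted(chc))
cities = np.array([chc[c] for c in choices])  # city of each choice
cic = pd.Series(cities).value_counts()        # number of choices per city


def smooth_group(g):
    """Smoothed values for one ID, indexed like the group's own rows."""
    first = g["City"].iloc[0]
    row = np.where(cities == first,
                   P / cic[first],
                   (1 - P) / (len(choices) - cic[first]))
    rows = [row]
    # Each later row is driven by the city chosen in the row before it
    for prev in g["City"].iloc[:-1]:
        row = P * row + (1 - P) * (cities == prev) / cic[prev]
        rows.append(row)
    return pd.DataFrame(np.vstack(rows), index=g.index,
                        columns=[f"{c}_Var" for c in choices])


# Hypothetical input with the two IDs interleaved on purpose
df = pd.DataFrame({"ID": [1, 2, 1, 2],
                   "Choice": [10, 20, 10, 50],
                   "City": ["XX", "YY", "XX", "ZZ"]})
smoothed = pd.concat([smooth_group(g) for _, g in df.groupby("ID")])
result = df.join(smoothed)  # join aligns on the original row labels
print(result.round(4))
```

Because `join` matches on the index, each smoothed row is attached to the row it was computed from, regardless of the order in which `groupby` returns the groups.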
