Performing multiple linear regression per group, based on a column's unique values



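I need to run a multiple linear regression for 4 different groups, taken from the df['status'] column; df['status'].unique() gives (1, 4, 7, 9). After the regression, I need to save the results in a new column, df['reg_results'].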

Sample data:

Out[71]: 
    ID  status    y_Values     a   b      c      d
0    1       1  150.510000  0.26  23  0.151  1.215
1    2       1  153.110000  0.86  14  0.156  1.651
2    3       1  189.320000  0.46  51  0.151  2.154
3    4       9  145.650000  0.46  62  0.157  3.145
4    5       4  189.650000  0.91  11  0.123  2.104
5    6       4  144.230000  0.69  16  0.178  3.515
6    7       4  198.020000  0.62  18  0.891  1.561
7    8       9  178.090000  0.91  22  0.156  9.155

The columns needed for the regression are X = ['a', 'b', 'c', 'd'] and y = ['y_Values'].

I have found several solutions that run the regression on an entire column or set of columns, such as:

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv(r'E:...data.csv')
lm = smf.ols(formula='y_Values ~ a + b + c + d', data=data).fit()
print(lm.params)

which gives:

Intercept   -403.803691
a              0.170452
b             40.866943
c             14.839920
d              1.618234
dtype: float64

However, I want to do the same thing separately for the rows of each df['status'] value (1, 4, 7, 9), and store the results in a new column.

I know how to do this in R, but I cannot figure out how to bring the df['status'] condition into the analysis in Python:

lapply(c(1,4,7,9), function(k){
  data <- shape[status == k, c("ID", "a", "b", "c", "d", "y_Values")]
  reg <- lm(y_Values ~ a + 0 + b + c + d, data = data)
  reg2 <- step(reg, direction = "backward")
})

Here is how. If you want to run the regression on the whole data frame:

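import statsmodels.api as sm

X = df[['a', 'b', 'c', 'd']]
Y = df['y_Values']

# no constant column is added, so the model is fitted without an intercept
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)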

which returns:

OLS Regression Results                                
=======================================================================================
Dep. Variable:               y_Values   R-squared (uncentered):                   0.973
Model:                            OLS   Adj. R-squared (uncentered):              0.946
Method:                 Least Squares   F-statistic:                              35.97
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                     0.00216
Time:                        13:12:10   Log-Likelihood:                         -37.992
No. Observations:                   8   AIC:                                      83.98
Df Residuals:                       4   BIC:                                      84.30
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a            167.3835     45.459      3.682      0.021      41.170     293.597
b              1.6286      0.621      2.622      0.059      -0.096       3.353
c             83.8313     55.572      1.509      0.206     -70.461     238.123
d             -2.7363      6.841     -0.400      0.710     -21.729      16.256
==============================================================================
Omnibus:                        1.673   Durbin-Watson:                   2.460
Prob(Omnibus):                  0.433   Jarque-Bera (JB):                0.446
Skew:                           0.574   Prob(JB):                        0.800
Kurtosis:                       2.860   Cond. No.                         146.
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From this you can pick the values you want to extract.

To run it for each individual status:

statuses = list(set(df['status']))
for status in statuses:
    print(status)
    df_redux = df[df['status'] == status]
    print(df_redux)
    # four predictors for the multiple regression; add or remove column names inside the brackets as needed
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    print_model = model.summary()
    print(print_model)

which gives:

1
   ID  status  y_Values     a   b      c      d
0   1       1    150.51  0.26  23  0.151  1.215
1   2       1    153.11  0.86  14  0.156  1.651
2   3       1    189.32  0.46  51  0.151  2.154
OLS Regression Results                            
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 77.832
No. Observations:                   3   AIC:                            -149.7
Df Residuals:                       0   BIC:                            -152.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a           -248.8778        inf         -0        nan         nan         nan
b             -4.9837        inf         -0        nan         nan         nan
c            229.5383        inf          0        nan         nan         nan
d            242.9489        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.443
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.281
Skew:                           0.016   Prob(JB):                        0.869
Kurtosis:                       1.500   Cond. No.                         554.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
4
   ID  status  y_Values     a   b      c      d
4   5       4    189.65  0.91  11  0.123  2.104
5   6       4    144.23  0.69  16  0.178  3.515
6   7       4    198.02  0.62  18  0.891  1.561
OLS Regression Results                            
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 82.381
No. Observations:                   3   AIC:                            -158.8
Df Residuals:                       0   BIC:                            -161.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a            183.1273        inf          0        nan         nan         nan
b              8.9478        inf          0        nan         nan         nan
c            -25.7862        inf         -0        nan         nan         nan
d            -34.3392        inf         -0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.154
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.284
Skew:                           0.072   Prob(JB):                        0.868
Kurtosis:                       1.500   Cond. No.                         67.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
9
   ID  status  y_Values     a   b      c      d
3   4       9    145.65  0.46  62  0.157  3.145
7   8       9    178.09  0.91  22  0.156  9.155
OLS Regression Results                            
==============================================================================
Dep. Variable:               y_Values   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Wed, 27 Oct 2021   Prob (F-statistic):                nan
Time:                        13:12:15   Log-Likelihood:                 58.629
No. Observations:                   2   AIC:                            -113.3
Df Residuals:                       0   BIC:                            -115.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
a              1.4521        inf          0        nan         nan         nan
b              1.5473        inf          0        nan         nan         nan
c              0.1974        inf          0        nan         nan         nan
d             15.5869        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.800
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         8.72
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.

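Of course, given the size of the subsets, these regression results are not meaningful: each group has fewer rows than predictors, so no residual degrees of freedom are left. I assume your real data frame is larger.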

To extract a specific value such as R², just add print(model.rsquared).

Update:

A more complete way to extract the information is to add:

stats_1 = pd.read_html(model.summary().tables[0].as_html(),header=0,index_col=0)[0]
stats_2 = pd.read_html(model.summary().tables[1].as_html(),header=0,index_col=0)[0]

which returns two data frames:

stats_1
Dep. Variable:          y_Values       R-squared (uncentered):     0.973
0             Model:               OLS  Adj. R-squared (uncentered):   0.94600
1            Method:     Least Squares                  F-statistic:  35.97000
2              Date:  Wed, 27 Oct 2021           Prob (F-statistic):   0.00216
3              Time:          13:52:04               Log-Likelihood: -37.99200
4  No. Observations:                 8                          AIC:  83.98000
5      Df Residuals:                 4                          BIC:  84.30000
6          Df Model:                 4                           NaN       NaN
7   Covariance Type:         nonrobust                           NaN       NaN

stats_2
index      coef  std err      t  P>|t|  [0.025   0.975]
0     a  167.3835   45.459  3.682  0.021  41.170  293.597
1     b    1.6286    0.621  2.622  0.059  -0.096    3.353
2     c   83.8313   55.572  1.509  0.206 -70.461  238.123
3     d   -2.7363    6.841 -0.400  0.710 -21.729   16.256

You can now select the columns you want, for example:

stats_2['coef']
index      coef
0     a  167.3835
1     b    1.6286
2     c   83.8313
3     d   -2.7363

So your loop should look like this:

df_coef = []
statuses = list(set(df['status']))
for status in statuses:
    df_redux = df[df['status'] == status]
    print(df_redux)
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_1 = pd.read_html(model.summary().tables[0].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    # keep only non-empty coefficient tables, tagged with their status
    if len(stats_2) != 0:
        stats_2['status'] = status
        df_coef.append(stats_2)

all_coef = pd.concat(df_coef)
# use a new name so the original df is not overwritten
coef_by_status = all_coef[['status', 'coef']]
print(coef_by_status)

which gives:

   status      coef
a       1 -248.8778
b       1   -4.9837
c       1  229.5383
d       1  242.9489
a       4  183.1273
b       4    8.9478
c       4  -25.7862
d       4  -34.3392
a       9    1.4521
b       9    1.5473
c       9    0.1974
d       9   15.5869

Then attach this to the original df by merging on status.

Update 2

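Thanks for the solution, I now have all the coefficients. What I meant by merging/joining the predicted values is this: when I print predictions, I get four tables of row IDs and predicted values. What I need is to combine these four tables (each stored in turn in the variable predictions) into a DataFrame with the columns ID and results.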

After that, I can merge the new data frame into the original one on the 'ID' column.

....
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(predictions)
0       401.094849
1       420.949054
2       407.918627
4       363.367876
8       255.865852
...    
1556    430.050556
1558    292.949037
1559    306.011285
1560    412.041196
1561    360.829533
Length: 958, dtype: float64
5       366.159418
12      204.606629
18      400.767161
20      401.544449
21      267.192577
...    
1530    384.151730
1533    275.356699
1539    376.165539
1543    334.024327
1547    272.197374
Length: 205, dtype: float64

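I tried converting the predictions variable to a list or a dictionary, but I could not figure out how to concatenate all four tables. It is probably easy to solve, but I cannot find how.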

Update 3

Does this work for you?

df = pd.read_csv("df.csv", sep=";")
df_coef = []
statuses = list(set(df['status']))
for status in statuses:
    df_redux = df[df['status'] == status]
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']
    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    # turn the predicted Series into a one-column frame and place it next to its rows
    predictions = pd.DataFrame(predictions, columns=['predictions'])
    gf = pd.concat([predictions, df_redux], axis=1)
    df_coef.append(gf)
all_coef = pd.concat(df_coef)

which produces:

   predictions  ID  status  y_Values     a   b      c      d
0       150.51   1       1    150.51  0.26  23  0.151  1.215
1       153.11   2       1    153.11  0.86  14  0.156  1.651
2       189.32   3       1    189.32  0.46  51  0.151  2.154
4       189.65   5       4    189.65  0.91  11  0.123  2.104
5       144.23   6       4    144.23  0.69  16  0.178  3.515
6       198.02   7       4    198.02  0.62  18  0.891  1.561
3       145.65   4       9    145.65  0.46  62  0.157  3.145
7       178.09   8       9    178.09  0.91  22  0.156  9.155

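Note that in this example y_Values and predictions coincide because each group has too few rows for the number of predictors, so the fit is exact.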
