I need to run a multiple linear regression on 4 different groups taken from the df['status']
column, where df['status'].unique()
gives (1, 4, 7, 9). After the regression, I need to save the results in a new column df['reg_results'].
Sample data:
Out[71]:
ID status y_Values a b c d
0 1 1 150.510000 0.26 23 0.151 1.215
1 2 1 153.110000 0.86 14 0.156 1.651
2 3 1 189.320000 0.46 51 0.151 2.154
3 4 9 145.650000 0.46 62 0.157 3.145
4 5 4 189.650000 0.91 11 0.123 2.104
5 6 4 144.230000 0.69 16 0.178 3.515
6 7 4 198.020000 0.62 18 0.891 1.561
7 8 9 178.090000 0.91 22 0.156 9.155
The columns needed for the regression are X = ['a', 'b', 'c', 'd']
and y = ['y_Values']
.
I have found several solutions that run the regression on the whole columns at once, such as:
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv(r'E:...data.csv')
lm = smf.ols(formula='y_Values ~ a + b + c + d', data=data).fit()
print(lm.params)
which gives:
Intercept -403.803691
a 0.170452
b 40.866943
c 14.839920
d 1.618234
dtype: float64
However, I want to do the same for each group of rows where df['status'] == (1, 4, 7, 9)
, and store the results in a new column. I know how to do this in R, but cannot work out how to bring the df['status']
grouping into the analysis:
lapply(c(1, 4, 7, 9), function(k){
  data <- shape[status == k, c("ID", "a", "b", "c", "d", "y_Values")]
  reg <- lm(y_Values ~ a + 0 + b + c + d, data = data)
  reg2 <- step(reg, direction = "backward")
})
Here is one approach. If you want to run the regression on the whole DataFrame:
import pandas as pd
import statsmodels.api as sm

X = df[['a', 'b', 'c', 'd']]
Y = df['y_Values']
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
this returns:
OLS Regression Results
=======================================================================================
Dep. Variable: y_Values R-squared (uncentered): 0.973
Model: OLS Adj. R-squared (uncentered): 0.946
Method: Least Squares F-statistic: 35.97
Date: Wed, 27 Oct 2021 Prob (F-statistic): 0.00216
Time: 13:12:10 Log-Likelihood: -37.992
No. Observations: 8 AIC: 83.98
Df Residuals: 4 BIC: 84.30
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
a 167.3835 45.459 3.682 0.021 41.170 293.597
b 1.6286 0.621 2.622 0.059 -0.096 3.353
c 83.8313 55.572 1.509 0.206 -70.461 238.123
d -2.7363 6.841 -0.400 0.710 -21.729 16.256
==============================================================================
Omnibus: 1.673 Durbin-Watson: 2.460
Prob(Omnibus): 0.433 Jarque-Bera (JB): 0.446
Skew: 0.574 Prob(JB): 0.800
Kurtosis: 2.860 Cond. No. 146.
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
You can then pick whichever values you want to extract.
To run it per individual status:
statuses = list(set(df['status']))
for status in statuses:
    print(status)
    df_redux = df[df['status'] == status]
    print(df_redux)
    X = df_redux[['a', 'b', 'c', 'd']]  # predictors for the multiple regression; add or drop columns here as needed
    Y = df_redux['y_Values']
    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    print_model = model.summary()
    print(print_model)
This gives:
1
ID status y_Values a b c d
0 1 1 150.51 0.26 23 0.151 1.215
1 2 1 153.11 0.86 14 0.156 1.651
2 3 1 189.32 0.46 51 0.151 2.154
OLS Regression Results
==============================================================================
Dep. Variable: y_Values R-squared: 1.000
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Wed, 27 Oct 2021 Prob (F-statistic): nan
Time: 13:12:15 Log-Likelihood: 77.832
No. Observations: 3 AIC: -149.7
Df Residuals: 0 BIC: -152.4
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
a -248.8778 inf -0 nan nan nan
b -4.9837 inf -0 nan nan nan
c 229.5383 inf 0 nan nan nan
d 242.9489 inf 0 nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: 0.443
Prob(Omnibus): nan Jarque-Bera (JB): 0.281
Skew: 0.016 Prob(JB): 0.869
Kurtosis: 1.500 Cond. No. 554.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
4
ID status y_Values a b c d
4 5 4 189.65 0.91 11 0.123 2.104
5 6 4 144.23 0.69 16 0.178 3.515
6 7 4 198.02 0.62 18 0.891 1.561
OLS Regression Results
==============================================================================
Dep. Variable: y_Values R-squared: 1.000
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Wed, 27 Oct 2021 Prob (F-statistic): nan
Time: 13:12:15 Log-Likelihood: 82.381
No. Observations: 3 AIC: -158.8
Df Residuals: 0 BIC: -161.5
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
a 183.1273 inf 0 nan nan nan
b 8.9478 inf 0 nan nan nan
c -25.7862 inf -0 nan nan nan
d -34.3392 inf -0 nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: 1.154
Prob(Omnibus): nan Jarque-Bera (JB): 0.284
Skew: 0.072 Prob(JB): 0.868
Kurtosis: 1.500 Cond. No. 67.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
9
ID status y_Values a b c d
3 4 9 145.65 0.46 62 0.157 3.145
7 8 9 178.09 0.91 22 0.156 9.155
OLS Regression Results
==============================================================================
Dep. Variable: y_Values R-squared: 1.000
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Wed, 27 Oct 2021 Prob (F-statistic): nan
Time: 13:12:15 Log-Likelihood: 58.629
No. Observations: 2 AIC: -113.3
Df Residuals: 0 BIC: -115.9
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
a 1.4521 inf 0 nan nan nan
b 1.5473 inf 0 nan nan nan
c 0.1974 inf 0 nan nan nan
d 15.5869 inf 0 nan nan nan
==============================================================================
Omnibus: nan Durbin-Watson: 1.800
Prob(Omnibus): nan Jarque-Bera (JB): 0.333
Skew: 0.000 Prob(JB): 0.846
Kurtosis: 1.000 Cond. No. 8.72
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
Of course, given the size of the subsets, the regression results are not great; I assume your real DataFrame is larger.
To extract a specific value (such as R²), just add print(model.rsquared)
.
Update:
A more complete way to extract the information is to add:
stats_1 = pd.read_html(model.summary().tables[0].as_html(),header=0,index_col=0)[0]
stats_2 = pd.read_html(model.summary().tables[1].as_html(),header=0,index_col=0)[0]
which returns two DataFrames:
stats_1
Dep. Variable: y_Values R-squared (uncentered): 0.973
0 Model: OLS Adj. R-squared (uncentered): 0.94600
1 Method: Least Squares F-statistic: 35.97000
2 Date: Wed, 27 Oct 2021 Prob (F-statistic): 0.00216
3 Time: 13:52:04 Log-Likelihood: -37.99200
4 No. Observations: 8 AIC: 83.98000
5 Df Residuals: 4 BIC: 84.30000
6 Df Model: 4 NaN NaN
7 Covariance Type: nonrobust NaN NaN
and
stats_2
index coef std err t P>|t| [0.025 0.975]
0 a 167.3835 45.459 3.682 0.021 41.170 293.597
1 b 1.6286 0.621 2.622 0.059 -0.096 3.353
2 c 83.8313 55.572 1.509 0.206 -70.461 238.123
3 d -2.7363 6.841 -0.400 0.710 -21.729 16.256
You can now select the columns you want, for example:
stats_2['coef']
index coef
0 a 167.3835
1 b 1.6286
2 c 83.8313
3 d -2.7363
So your loop would look like this:
df_coef = []
statuses = list(set(df['status']))
for status in statuses:
    df_redux = df[df['status'] == status]
    print(df_redux)
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']
    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_1 = pd.read_html(model.summary().tables[0].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    if len(stats_2) != 0:
        stats_2['status'] = status
        df_coef.append(stats_2)

all_coef = pd.concat(df_coef)
coef_table = all_coef[['status', 'coef']]
print(coef_table)
which gives:
status coef
a 1 -248.8778
b 1 -4.9837
c 1 229.5383
d 1 242.9489
a 4 183.1273
b 4 8.9478
c 4 -25.7862
d 4 -34.3392
a 9 1.4521
b 9 1.5473
c 9 0.1974
d 9 15.5869
Then attach it to the original df by merging on status
.
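For that merge, one hedged sketch: pivot the coefficient table to one row per status and merge on 'status'. The all_coef frame below is a hand-built stand-in for the table produced by the loop above (regressor names in the index, one row per variable/status pair):

```python
import pandas as pd

# Stand-in for the `all_coef` table: regressor names in the index,
# one coefficient row per (variable, status) pair.
all_coef = pd.DataFrame({
    'coef':   [-248.8778, -4.9837, 229.5383, 242.9489,
               183.1273, 8.9478, -25.7862, -34.3392],
    'status': [1, 1, 1, 1, 4, 4, 4, 4],
}, index=['a', 'b', 'c', 'd'] * 2)

# Pivot to one row per status, one column per regressor coefficient...
coef_wide = (all_coef
             .rename_axis('variable')
             .reset_index()
             .pivot(index='status', columns='variable', values='coef')
             .add_prefix('coef_')
             .reset_index())

# ...then merge back onto the original rows on 'status'.
df = pd.DataFrame({'ID': [1, 5], 'status': [1, 4]})
out = df.merge(coef_wide, on='status', how='left')
print(out)
```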
Update 2
Thanks to the solution I got all the coefficients, but what I meant by merging/joining the predicted values is: when I print the predictions, I get the row IDs and predicted values of these four tables. What I need is to merge the four tables (stored in a single variable predictions
) into one DataFrame with columns ID
and results
.
After that I can merge the new DataFrame back to the original on column 'ID'.
....
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(predictions)
0 401.094849
1 420.949054
2 407.918627
4 363.367876
8 255.865852
...
1556 430.050556
1558 292.949037
1559 306.011285
1560 412.041196
1561 360.829533
Length: 958, dtype: float64
5 366.159418
12 204.606629
18 400.767161
20 401.544449
21 267.192577
...
1530 384.151730
1533 275.356699
1539 376.165539
1543 334.024327
1547 272.197374
Length: 205, dtype: float64
I tried converting the predictions
variable into a list or dictionary, but could not work out how to concatenate all four tables. It is probably easy to solve, but I cannot find it.
Update 3
Does this work for you?
df = pd.read_csv("df.csv", sep=";")
df_coef = []
statuses = list(set(df['status']))
for status in statuses:
    df_redux = df[df['status'] == status]
    X = df_redux[['a', 'b', 'c', 'd']]
    Y = df_redux['y_Values']
    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    stats_2 = pd.read_html(model.summary().tables[1].as_html(), header=0, index_col=0)[0]
    predictions = pd.DataFrame(predictions, columns=['predictions'])
    gf = pd.concat([predictions, df_redux], axis=1)
    df_coef.append(gf)

all_coef = pd.concat(df_coef)
producing:
predictions ID status y_Values a b c d
0 150.51 1 1 150.51 0.26 23 0.151 1.215
1 153.11 2 1 153.11 0.86 14 0.156 1.651
2 189.32 3 1 189.32 0.46 51 0.151 2.154
4 189.65 5 4 189.65 0.91 11 0.123 2.104
5 144.23 6 4 144.23 0.69 16 0.178 3.515
6 198.02 7 4 198.02 0.62 18 0.891 1.561
3 145.65 4 9 145.65 0.46 62 0.157 3.145
7 178.09 8 9 178.09 0.91 22 0.156 9.155
Note that in this example y_Values
and predictions
coincide: with so few observations per group, each model fits the data exactly.