如何使用statsmodels.api构建三次回归



几天来,我一直在尝试进行三次回归,但我遇到了同样的问题:我的结果与我在R中编写的代码不一致。数据库完全相同,所以这不是问题所在。我现在的代码是这样的:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
import numpy as np
df = pd.read_csv("http://web.stanford.edu/~oleg2/hse/boston/boston_house_prices.csv")
df = df.dropna()
x, y = np.array(df.lstat), np.array(df.crim)
polynomial_features= PolynomialFeatures(degree=3)
xp = polynomial_features.fit_transform(x.reshape(-1,1))
model = sm.OLS(y, xp).fit()
print(model.summary())

我还做了这样的东西:

import pandas as pd
import statsmodels.api as sm
import numpy as np
import statsmodels.formula.api as smf
df = pd.read_csv("http://web.stanford.edu/~oleg2/hse/boston/boston_house_prices.csv")
df = df.dropna()
ft1 = smf.ols(formula=f"crim ~ lstat + I(np.power(lstat,2)) + I(np.power(lstat,3))", data=df).fit()
print(ft1.summary())

这两个给出了完全相同的结果:

OLS Regression Results                            
==============================================================================
Dep. Variable:                   crim   R-squared:                       0.218
Model:                            OLS   Adj. R-squared:                  0.213
Method:                 Least Squares   F-statistic:                     46.63
Date:                Sat, 03 Oct 2020   Prob (F-statistic):           1.35e-26
Time:                        10:26:13   Log-Likelihood:                -1744.2
No. Observations:                 506   AIC:                             3496.
Df Residuals:                     502   BIC:                             3513.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=========================================================================================
coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 1.2010      2.029      0.592      0.554      -2.785       5.187
lstat                    -0.4491      0.465     -0.966      0.335      -1.362       0.464
I(np.power(lstat, 2))     0.0558      0.030      1.852      0.065      -0.003       0.115
I(np.power(lstat, 3))    -0.0009      0.001     -1.517      0.130      -0.002       0.000
==============================================================================
Omnibus:                      607.734   Durbin-Watson:                   1.239
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            53621.219
Skew:                           5.726   Prob(JB):                         0.00
Kurtosis:                      52.114   Cond. No.                     5.20e+04
==============================================================================

这是R上的程序:

fit.lstat2 <- lm(crim ~ poly(lstat, 3))
summary(fit.lstat2)

它给出了以下结果:

## Call:
## lm(formula = crim ~ poly(lstat, 3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.234  -2.151  -0.486   0.066  83.353 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.6135     0.3392  10.654   <2e-16 ***
## poly(lstat, 3)1  88.0697     7.6294  11.543   <2e-16 ***
## poly(lstat, 3)2  15.8882     7.6294   2.082   0.0378 *  
## poly(lstat, 3)3 -11.5740     7.6294  -1.517   0.1299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2133 
## F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16

这个结果是正确的,但我不知道为什么python给出了错误的答案。也许我应该使用一些不同的方法?

附言:我已经确保两个代码都能工作,包括数据库链接。因此,请随意运行它们,亲自查看结果。

我不是R方面的专家,但我想你不使用正交多项式,所以你必须设置raw=TRUE

当我对R进程使用raw=TRUE时,我得到了与python statsmodels.api相同的结果:

fit.lstat2 <- lm(crim ~ poly(lstat, 3, raw=TRUE))
summary(fit.lstat2)

结果:

Call:
lm(formula = crim ~ poly(lstat, 3, raw = TRUE))
Residuals:
Min      1Q  Median      3Q     Max 
-15.234  -2.151  -0.486   0.066  83.353 
Coefficients:
Estimate Std. Error t value Pr(>|t|)  
(Intercept)                  1.2009656  2.0286452   0.592   0.5541  
poly(lstat, 3, raw = TRUE)1 -0.4490656  0.4648911  -0.966   0.3345  
poly(lstat, 3, raw = TRUE)2  0.0557794  0.0301156   1.852   0.0646 .
poly(lstat, 3, raw = TRUE)3 -0.0008574  0.0005652  -1.517   0.1299  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.629 on 502 degrees of freedom
Multiple R-squared:  0.2179,    Adjusted R-squared:  0.2133 
F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16

最新更新