Logistic regression in Python (beginner)

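I am teaching myself logistic regression using Python. I am trying to apply the lessons from a walkthrough to the small dataset in the Wikipedia entry.

Something doesn't seem quite right. Wikipedia and Excel Solver (verified using the method in this video) give an intercept of -4.0777 and a coefficient of 1.5046, but the code I built from a GitHub example outputs -0.924200 and 0.756024, respectively.

The code I am trying to use is below. Is there an obvious mistake?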

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
y, X = dmatrices('p_or_f ~ study_hrs',
                  df, return_type="dataframe")
y = np.ravel(y)
model = LogisticRegression()
model = model.fit(X,y)
print(pd.DataFrame(np.transpose(model.coef_),X.columns))
>>>
                  0
Intercept -0.924200
study_hrs  0.756024

Solution

Simply change the model creation line to

model = LogisticRegression(C=100000, fit_intercept=False)

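Analysis

By default, sklearn fits a regularized LogisticRegression with C=1, where C is the inverse of the regularization strength (small C means strong regularization, large C means weak regularization).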
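To see the effect of C in isolation, here is a minimal sketch that fits the same data at several values of C. It assumes a plain NumPy feature array instead of the patsy design matrix, so only one intercept is involved; the helper name `fit` and `max_iter=10000` are choices made for this illustration. The default solver differs between scikit-learn versions, so the numbers at small C may not match the ones in the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
         2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

X = np.array(hours).reshape(-1, 1)  # one feature column, no column of ones
y = np.array(passed)

def fit(C):
    """Fit at the given C and return (intercept, coefficient)."""
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    return model.intercept_[0], model.coef_[0, 0]

# Small C = strong regularization, large C = weak regularization.
results = {C: fit(C) for C in (1.0, 100.0, 1e6)}
for C, (b, w) in results.items():
    print(f"C={C:g}: intercept={b:.4f}, study_hrs={w:.4f}")
```

As C grows, the coefficient should move toward the unregularized values reported by Wikipedia.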

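This class implements regularized logistic regression using the liblinear library and the newton-cg and lbfgs solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).

So, to obtain their model, you should fit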

model = LogisticRegression(C=1000000)

which gives

Intercept -2.038853 # this is actually half the intercept
study_hrs  1.504643 # this is correct

In addition, part of the problem lies in the way you handle the data with patsy. See this simplified, correct example:

import numpy as np
from sklearn.linear_model import LogisticRegression
X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
X = np.array([[x] for x in X])
y = np.ravel(y)
model = LogisticRegression(C=1000000.)
model = model.fit(X,y)
print('coef', model.coef_)
print('intercept', model.intercept_)

which gives

coef [[ 1.50464059]]
intercept [-4.07769916]
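As a cross-check that does not depend on scikit-learn, the unregularized maximum-likelihood fit can be computed directly. The sketch below uses plain Newton's method on the two parameters; the function name `fit_logistic_newton` and the fixed iteration count are assumptions made for illustration, not part of any library.

```python
import math

hours = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
         2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def fit_logistic_newton(xs, ys, iterations=25):
    """Maximize the logistic log-likelihood for p = sigmoid(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0              # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0      # Hessian of the negative log-likelihood
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            r = y - p              # residual
            w = p * (1.0 - p)      # per-sample weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system H * step = g in closed form.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

intercept, slope = fit_logistic_newton(hours, passed)
print('intercept', intercept)
print('study_hrs', slope)
```

With no regularization at all, this should agree with the Wikipedia values to the digits quoted in the question, which is what sklearn approaches as C becomes very large.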

What exactly is the problem? When you call dmatrices, it by default augments your input data with a column of ones (a bias column):

X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
y, X = dmatrices('p_or_f ~ study_hrs',
                  df, return_type="dataframe")
print(X)

This results in

    Intercept  study_hrs
0           1       0.50
1           1       0.75
2           1       1.00
3           1       1.25
4           1       1.50
5           1       1.75
6           1       1.75
7           1       2.00
8           1       2.25
9           1       2.50
10          1       2.75
11          1       3.00
12          1       3.25
13          1       3.50
14          1       4.00
15          1       4.25
16          1       4.50
17          1       4.75
18          1       5.00
19          1       5.50

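This is why the resulting bias is only half of the true one: scikit-learn also fits its own intercept, so you now have two bias terms, and the optimal solution is to give each of them half of the weight. The printed table shows only model.coef_, that is, the half carried by patsy's Intercept column; the other half ends up in model.intercept_.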
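The two-bias effect can be checked without patsy. This is a minimal sketch that builds the column of ones by hand with NumPy as a stand-in for patsy's Intercept column, then adds the weight on that column to sklearn's own intercept. How the total is split between the two may depend on the solver and scikit-learn version; only their sum is pinned down by the data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
                  2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

# Explicit column of ones, mimicking what dmatrices adds by default.
X_with_ones = np.column_stack([np.ones_like(hours), hours])

# fit_intercept is left at its default (True), so sklearn adds a second bias.
model = LogisticRegression(C=1e6, max_iter=10000).fit(X_with_ones, passed)

weight_on_ones = model.coef_[0, 0]   # bias carried by the column of ones
own_intercept = model.intercept_[0]  # bias fitted by sklearn itself
slope = model.coef_[0, 1]
total_bias = weight_on_ones + own_intercept

print('weight on ones column', weight_on_ones)
print('sklearn intercept    ', own_intercept)
print('sum of both          ', total_bias)
print('study_hrs            ', slope)
```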

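So what can you do?

Do not use patsy in this way
Forbid patsy from adding a bias column
Tell sklearn not to add a bias

The code below takes the third option: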

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
y, X = dmatrices('p_or_f ~ study_hrs',
                  df, return_type="dataframe")
y = np.ravel(y)
model = LogisticRegression(C=100000, fit_intercept=False)
model = model.fit(X,y)
print(pd.DataFrame(np.transpose(model.coef_),X.columns))

which gives

Intercept -4.077571
study_hrs  1.504597

as desired.
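The second option in the list above, stopping patsy from adding its bias column, might look like the sketch below. It relies on the patsy formula convention that `- 1` removes the implicit intercept term and leaves sklearn to fit the only intercept; `max_iter=10000` is an assumption added so the solver has room to converge.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

hours = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5,
         2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
df = pd.DataFrame({'study_hrs': hours, 'p_or_f': passed})

# "- 1" drops patsy's Intercept column, so X has only study_hrs.
y, X = dmatrices('p_or_f ~ study_hrs - 1', df, return_type="dataframe")
y = np.ravel(y)

# sklearn fits the single intercept itself (fit_intercept defaults to True).
model = LogisticRegression(C=1e6, max_iter=10000).fit(X, y)
intercept = model.intercept_[0]
slope = model.coef_[0, 0]

print(list(X.columns))
print('intercept', intercept)
print('study_hrs', slope)
```

Here the intercept has to be read from model.intercept_, since model.coef_ now holds only the study_hrs weight.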
