sklearn LogisticRegression predict_proba() gives incorrect predictions when using the sample_weight parameter

I'm trying out scikit-learn. I thought I'd try a weighted logistic regression, but I get nonsensical predictions from sklearn's LogisticRegression object when I fit it with the sample_weight parameter.

Here is a toy example that demonstrates the problem. I've set up a very simple dataset with one feature and a binary target output.

feat  target  weight
A       0       1
A       0       1
A       1       1
A       1       1
B       0       1
B       0       1
B       0       1
B       1       W

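So any sensible logistic regression should predict that, when feat=A, the probability of success is 0.5. The probability when feat=B depends on the weight W:

If W=1, there appears to be a 0.25 chance of success
If W=3, this balances out the three 0s, so there appears to be a 0.5 chance of success
If W=9, there are now effectively nine 1s and three 0s, so the chance of success is 0.75.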
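The three cases above are just weighted proportions. With a single categorical feature and no regularization, the maximum-likelihood fit for each group is the weighted fraction of successes in that group, so the expected values can be checked by hand. A minimal sketch (the helper name `weighted_success_rate` is made up for illustration and is not part of the original question):

```python
def weighted_success_rate(targets, weights):
    """Weighted fraction of successes: sum(w * y) / sum(w)."""
    total = float(sum(weights))
    return sum(w * y for y, w in zip(targets, weights)) / total

# The four feat=B rows from the table; only the last weight (W) varies.
b_targets = [0, 0, 0, 1]
for W in (1, 3, 9):
    rate = weighted_success_rate(b_targets, [1, 1, 1, W])
    print("W=%d -> P(target=1 | feat=B) = %.2f" % (W, rate))
```

This gives 0.25, 0.50 and 0.75 for W = 1, 3 and 9, which are the values a correctly converged fit should reproduce.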

Weighted logistic regression in R gives the correct predictions:

test <- function(final_weight) {
  feat   <- c('A','A','A','A','B','B','B','B')
  target <- c(0, 0, 1, 1, 0, 0, 0, 1)
  weight <- c(1, 1, 1, 1, 1, 1, 1, final_weight)
  df = data.frame(feat, target, weight)
  m = glm(target ~ feat, data=df, family='binomial', weights=weight)
  predict(m, type='response')
}
test(1)
#   1    2    3    4    5    6    7    8 
#0.50 0.50 0.50 0.50 0.25 0.25 0.25 0.25 
test(3)
#  1   2   3   4   5   6   7   8 
#0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 
test(9)
#   1    2    3    4    5    6    7    8 
#0.50 0.50 0.50 0.50 0.75 0.75 0.75 0.75

Great. But in scikit-learn, using the LogisticRegression object, I keep getting nonsensical predictions when I use W=9. Here is my Python code:

from __future__ import print_function  # print() behaves the same on Python 2 and 3

import pandas as pd
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices

def test(final_weight):
    d = {
        'feat'   : ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'target' : [0, 0, 1, 1, 0, 0, 0, 1],
        'weight' : [1, 1, 1, 1, 1, 1, 1, final_weight],
    }
    df = pd.DataFrame(d)
    print(df, '\n')
    # Build the design matrix (intercept plus an indicator column for feat=B)
    y, X = dmatrices('target ~ feat', df, return_type="dataframe")
    features = X.columns
    C = 1e10 # high value to prevent regularization
    solver = 'sag' # so we can use sample_weight
    lr = LogisticRegression(C=C, solver=solver)
    lr.fit(X, df.target, sample_weight=df.weight)
    print('Predictions:', '\n', lr.predict_proba(X), '\n', '====')

test(1)
test(3)
test(9)

This gives the following output (I've removed some of it to make it less verbose):

  feat  target  weight
...
4    B       0       1
5    B       0       1
6    B       0       1
7    B       1       1
Predictions:
[[ 0.50000091  0.49999909]
...
 [ 0.74997935  0.25002065]]
====
  feat  target  weight
...
4    B       0       1
5    B       0       1
6    B       0       1
7    B       1       3
/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/sag.py:267: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
Predictions:
[[ 0.49939191  0.50060809]
...
 [ 0.49967407  0.50032593]]
====
  feat  target  weight
...
4    B       0       1
5    B       0       1
6    B       0       1
7    B       1       9
Predictions:
[[ 0.00002912  0.99997088]   # Nonsense predictions for A!
...
 [ 0.00000034  0.99999966]]  # And for B too...
====

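You can see that when I set the final weight to 9 (which doesn't seem like an unreasonably high weight), the predictions are ruined! Not only are the predictions for feat=B absurd, but the predictions for feat=A are now absurd too.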

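My questions are:

Why are these predictions so wrong when the final weight is 9?

Am I doing something wrong, or misunderstanding something?

More generally, I'd be very interested to hear from anyone who has successfully used weighted logistic regression in scikit-learn and obtained predictions similar to those given by R's glm(..., family='binomial') function.

Many thanks for any help with this.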

The problem seems to be in the solver:

solver = 'sag'

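Stochastic solvers such as 'sag' are typically used for large datasets whose training examples are assumed to be i.i.d. They do not work well with high sample weights.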

After changing the solver to lbfgs, the results match what you see in R.

solver = 'lbfgs'
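For reference, here is a self-contained sketch of the toy example with the solver switched to lbfgs. It is an illustration under a few assumptions rather than the asker's exact script: the patsy design matrix is replaced by a hand-coded 0/1 indicator for feat=B, the helper name `fit_and_predict` is made up, and a scikit-learn version whose lbfgs solver accepts sample_weight is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_and_predict(final_weight, solver="lbfgs"):
    """Fit the toy dataset and return P(target=1) for feat=A and feat=B."""
    # 0 encodes feat=A, 1 encodes feat=B (hand-coded stand-in for dmatrices).
    X = np.array([[0], [0], [0], [0], [1], [1], [1], [1]], dtype=float)
    y = np.array([0, 0, 1, 1, 0, 0, 0, 1])
    w = np.array([1, 1, 1, 1, 1, 1, 1, final_weight], dtype=float)

    # Very large C makes the L2 penalty negligible, as in the question.
    lr = LogisticRegression(C=1e10, solver=solver)
    lr.fit(X, y, sample_weight=w)

    # Column 1 of predict_proba holds the probability of class 1.
    return lr.predict_proba(np.array([[0.0], [1.0]]))[:, 1]

for W in (1, 3, 9):
    p_a, p_b = fit_and_predict(W)
    print("W=%d  P(1|A)=%.3f  P(1|B)=%.3f" % (W, p_a, p_b))
```

If the fit converges, P(1|A) should stay close to 0.5 and P(1|B) close to 0.25, 0.5 and 0.75 for W = 1, 3 and 9, in line with the R output in the question.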
