有没有一种方法可以为scikit逻辑回归模型提供类似于statsmodels的良好输出?所有p值、标准错误等都在一个表中?
正如您和其他人所指出的,这是scikit学习的局限性。在下面讨论针对您的问题的scikit方法之前,"最佳"选项是使用如下统计模型:
import statsmodels.api as sm
smlog = sm.Logit(y,sm.add_constant(X)).fit()
smlog.summary()
X表示输入特征/预测因子矩阵,y表示结果变量。如果X缺乏高度相关的特征,缺乏低方差特征,特征不产生"完美/准完美分离",并且任何分类特征都被减少到"n-1"级,即伪编码(而不是"n"级,如本文所述的一个热编码:伪变量陷阱),则统计模型可以很好地工作。
然而,如果上述方法不可行/不实用,下面将对一种scikit方法进行编码,以获得相当等效的结果——就特征系数/与标准误差的比值和95%CI估计而言。从本质上讲,代码从不同的逻辑回归scikit模型中生成这些结果,这些模型是针对数据的不同测试序列分割进行训练的。同样,确保分类特征被伪编码为n-1级(否则分类特征的scikit系数将不正确)。
#Instantiate logistic regression model with regularization turned OFF
log_nr = LogisticRegression(fit_intercept = True, penalty
= "none")
##Generate 5 distinct random numbers - as random seeds for 5 test-train splits
import random
randomlist = random.sample(range(1, 10000), 5)
##Create features column
coeff_table = pd.DataFrame(X.columns, columns=["features"])
##Assemble coefficients over logistic regression models on 5 random data splits
#iterate over random states while keeping track of `i`
from sklearn.model_selection import train_test_split
for i, state in enumerate(randomlist):
train_x, test_x, train_y, test_y = train_test_split(X, y, stratify=y,
test_size=0.3, random_state=state) #5 test-train splits
log_nr.fit(train_x, train_y) #fit logistic model
coeff_table[f"coefficients_{i+1}"] = np.transpose(log_nr.coef_)
##Calculate mean and std error for model coefficients (from 5 models above)
coeff_table["mean_coeff"] = coeff_table.mean(axis=1)
coeff_table["se_coeff"] = coeff_table.iloc[:, 1:6].sem(axis=1)
#Calculate 95% CI intervals for feature coefficients
coeff_table["95ci_se_coeff"] = 1.96*coeff_table["se_coeff"]
coeff_table["coeff_95ci_LL"] = coeff_table["mean_coeff"] -
coeff_table["95ci_se_coeff"]
coeff_table["coeff_95ci_UL"] = coeff_table["mean_coeff"] +
coeff_table["95ci_se_coeff"]
最后,(可选)通过如下幂运算将系数转换为赔率。比值比是我最喜欢的逻辑回归输出,它们使用下面的代码附加到您的数据帧中。
#Calculate odds ratios and 95% CI (LL = lower limit, UL = upper limit) intervals for each feature
coeff_table["odds_mean"] = np.exp(coeff_table["mean_coeff"])
coeff_table["95ci_odds_LL"] = np.exp(coeff_table["coeff_95ci_LL"])
coeff_table["95ci_odds_UL"] = np.exp(coeff_table["coeff_95ci_UL"])
这个答案建立在@pciunkiewicz的一个有点相关的回复之上:从sklearn 中整理多个测试列车分割的模型系数