Is there a way to set the largest subset as the base level in a statsmodels GLM, and how do I get the base level from the parameters?



I have two questions about statsmodels GLM:

  1. Is there a way to tell the GLM to automatically set the level with the most observations as the base level (i.e. the level whose parameter is 0) for each factor? If not, is there a reason why?
  2. Is there a way to display or extract the name of the base level (i.e. the level with param = 0) from the GLM? I know the predict function handles this just fine, but I am extracting the GLM output to use it elsewhere and would like to automate that.

I know I can work around this with Treatment in the formula, e.g. instead of formula='y~C(x)' I can write formula='y~C(x, Treatment("abc"))'. I am currently using this for question 2, and I guess I could extend it to question 1 by running the data and the formula through a function that enhances the formula, but I am wondering whether there is a cleaner way to do this, or functionality in the pipeline that will be able to do it?

Cheers

For anyone who might be interested, I implemented the workaround mentioned above with the function below. It works fine, but it has the drawback that you run into trouble if your levels contain brackets ()[]. You can avoid that by handing over the Treatment yourself, so it is not perfect.

import logging
import pandas as pd

log = logging.getLogger(__name__)


def add_treatment_to_formula(formula: str, df: pd.DataFrame, exposure: str):
    """Little helper to add the Treatment field (this is the statsmodels terminology for the base level,
    i.e. the level that gets the parameter=0 (factor=1)) to GLM formulas. It sets it to the level with the
    largest exposure found in the df handed over.
    :param formula: Example: 'claimamount ~ age + C(postcode) + C(familystatus, Treatment(2))' will get turned into
        'claimamount ~ age + C(postcode, Treatment("12435")) + C(familystatus, Treatment(2))'. The familystatus already had
        a treatment, and age is not categorical in this example, so only the postcode gets transformed.
    :type formula: str
    :param df: DataFrame with at least the columns that show up on the right side of the formula (age, postcode,
        familystatus in the example) and the column containing the exposure as named in the exposure argument
    :type df: pd.DataFrame
    :param exposure: Name of the column in the df containing the exposure
    :type exposure: str
    :return: Formula with treatments added
    :rtype: str
    """
    l_yx = formula.split('~')
    if len(l_yx) > 2:
        log.error("This does not look like a formula, more than 1 ~ found.")
    l_xs = l_yx[1].split('+')
    l_xs_enh = []
    for x in l_xs:
        # if the treatment field is already set up, don't change it. Also, if the field is not
        # categorical, don't change it
        if ('Treatment' in x) or ('C(' not in x):
            l_xs_enh.append(x)
        else:  # get the level with the largest exposure and set it as the treatment (base) level
            field = x[x.find('(') + 1:x.find(')')].strip()
            df_exposure = df.groupby(field)[exposure].sum().reset_index()
            treatment = df_exposure.loc[df_exposure[exposure] == max(df_exposure[exposure]), field].values[0]
            if isinstance(treatment, str):
                quotes = '"'
            else:
                quotes = ''
            x_enh = f'C({field}, Treatment({quotes}{treatment}{quotes}))'
            l_xs_enh.append(x_enh)
    formula_enhanced = l_yx[0] + '~ ' + ' + '.join(l_xs_enh)
    return formula_enhanced
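
A minimal usage sketch of this helper; the column names claimamount, postcode and exposure are made up here purely for illustration:

import numpy as np

df_demo = pd.DataFrame({'claimamount': np.random.uniform(0, 1, 6),
                        'postcode': ['12435', '12435', '12435', '99999', '99999', '55555'],
                        'exposure': [1.0, 1.0, 1.0, 0.5, 0.5, 0.2]})

print(add_treatment_to_formula('claimamount ~ C(postcode)', df_demo, exposure='exposure'))
# claimamount ~ C(postcode, Treatment("12435"))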

It is best to make the variable categorical before feeding it into the GLM. This can be done with pd.Categorical, for example using a simulated dataset:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(123)
df = pd.DataFrame({'y': np.random.uniform(0, 1, 100),
                   'x': np.random.choice(['a', 'b', 'c', 'd'], 100)})

Here d is the reference level, because it has the most observations:

df.x.value_counts()

d    28
b    27
c    26
a    19

If the order of the levels after the reference does not matter, you can simply do:

df['x'] = pd.Categorical(df['x'], df.x.value_counts().index)
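
If the order of the remaining levels does matter, one option (a sketch, not from the original answer) is to build the category list explicitly, putting the most frequent level first:

top = df.x.value_counts().idxmax()                      # level with the most observations
cats = [top] + sorted(c for c in df.x.unique() if c != top)
df['x'] = pd.Categorical(df['x'], categories=cats)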

The reference level is then simply:

df.x.cat.categories[0]
'd'

Running the regression on this:

model = smf.glm(formula='y ~ x', data=df).fit()
print(model.summary())

you can see that the reference is d:

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                            GLM   Df Residuals:                       96
Model Family:                Gaussian   Df Model:                            3
Link Function:               identity   Scale:                        0.059173
Method:                          IRLS   Log-Likelihood:                 1.5121
Date:                Tue, 23 Feb 2021   Deviance:                       5.6806
Time:                        09:16:31   Pearson chi2:                     5.68
No. Iterations:                     3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5108      0.046     11.111      0.000       0.421       0.601
x[T.b]        -0.0953      0.066     -1.452      0.146      -0.224       0.033
x[T.c]         0.0633      0.066      0.956      0.339      -0.067       0.193
x[T.a]        -0.0005      0.072     -0.007      0.994      -0.142       0.141
==============================================================================
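
To automate question 2, the base level can also be recovered from the fitted model itself through the patsy design info. A sketch, assuming the default Treatment coding, where the omitted (base) level is the first category:

design_info = model.model.data.design_info
for factor, info in design_info.factor_infos.items():
    if info.type == 'categorical':
        # with default Treatment coding, the first category is the dropped one, i.e. the base level
        print(factor.name(), '-> base level:', info.categories[0])
# x -> base level: d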

Another option is to use the Treatment you pointed out, in which case the first task is to get the top level:

np.random.seed(123)
df = pd.DataFrame({'y': np.random.uniform(0, 1, 100),
                   'x': np.random.choice(['a', 'b', 'c', 'd'], 100)})

ref = df.x.describe().top                    # most frequent level, here 'd'
from patsy.contrasts import Treatment
contrast = Treatment(reference=ref)
mod = smf.glm("y ~ C(x, contrast)", data=df).fit()
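
A quick way to check (a sketch): the base level is the one that does not show up among the coefficient names:

print(ref)               # 'd'
print(mod.params.index)  # only the [T.a], [T.b] and [T.c] terms appear, so 'd' is the base level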
