Machine learning: how to predict new values using statsmodels.formula.api (Python)

I trained a logistic regression model as follows, using only one feature, 'mean_area', from the breast cancer data:

from statsmodels.formula.api import logit
logistic_model = logit('target ~ mean_area',breast)
result = logistic_model.fit()

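The fitted model has a built-in predict method. However, called without arguments it returns the predicted values for all of the training samples, as follows: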

predictions = result.predict()

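Suppose I want a prediction for a new value, say 30. How can I use the trained model to output it (rather than reading off the coefficients and computing it by hand)?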

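You can pass new values to the model's .predict() method, as shown in output #11 of this notebook from the documentation for a single observation. You can supply multiple observations as a 2d array, for example a DataFrame (see the docs).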

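Since you are using the formula API, your input needs to be a pd.DataFrame so that the column references in the formula can be resolved. In your case, you could use .predict(pd.DataFrame({'mean_area': [1,2,3]})).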
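As a rough illustration of the point above, here is a minimal sketch that fits the same formula and asks for a prediction at mean_area = 30. The data is synthetic (an assumption, standing in for the real breast cancer data, which is not shown in the question), and `disp=0` is only there to silence the optimizer output.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the breast cancer data (assumption: not the real dataset).
rng = np.random.default_rng(0)
mean_area = rng.uniform(10, 50, size=200)
prob = 1 / (1 + np.exp(-(mean_area - 30) / 5))
breast = pd.DataFrame({"mean_area": mean_area, "target": rng.binomial(1, prob)})

result = logit("target ~ mean_area", breast).fit(disp=0)

# New observations go in a DataFrame whose column name matches the formula.
new_data = pd.DataFrame({"mean_area": [30]})
new_pred = result.predict(new_data)  # predicted probability that target == 1
print(new_pred)
```

Note that predict is called on the fitted result, not on the unfitted model object, and that the value returned for a logit model is a probability rather than a class label.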

statsmodels .predict() falls back to the observations used for fitting only when no other data is provided.

import statsmodels.formula.api as smf

model = smf.ols('y ~ x', data=df).fit()
# Predict for a list of observations; the list can hold one value or many
prediction = model.get_prediction(exog=dict(x=[5,10,25]))
prediction.summary_frame(alpha=0.05)
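The snippet above assumes a DataFrame df with columns x and y already exists. A self-contained sketch of the same idea, using made-up data (an assumption) and a DataFrame for the new values, might look like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): y is roughly linear in x with some noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 30, size=100)
df = pd.DataFrame({"x": x, "y": 2.0 + 0.5 * x + rng.normal(0, 1, size=100)})

model = smf.ols("y ~ x", data=df).fit()

# get_prediction carries uncertainty information as well as the point estimate.
new_data = pd.DataFrame({"x": [5, 10, 25]})
prediction = model.get_prediction(exog=new_data)
frame = prediction.summary_frame(alpha=0.05)  # alpha=0.05 gives 95% intervals
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```

Here the mean_ci columns bound the fitted mean, while the wider obs_ci columns bound a single new observation.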

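I had a hard time predicting values from a new pandas DataFrame, so after fitting I appended the rows I wanted to predict to the original dataset: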

   import pandas as pd
   import statsmodels.api as sm

   y = data['price']
   x1 = data[['size', 'year']]
   data.columns
   #Index(['price', 'size', 'year'], dtype='object')
   x = sm.add_constant(x1)
   results = sm.OLS(y, x).fit()
   results.summary()
   ## predict on unknown data
   # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
   new_rows = pd.DataFrame({'size': [853.0, 777], 'year': [2012.0, 2013], 'price': [None, None]})
   data = pd.concat([data, new_rows])
   data.tail()
   new_x = data.loc[data.price.isnull(), ['size', 'year']]
   results.predict(sm.add_constant(new_x))

This question already has an answer, but I hope this helps.

According to the documentation, the first argument is "exog":

exog: array_like, optional. The values for which you want to predict.

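Further,

"If a formula was used, then exog is processed in the same way as the original data. This transformation needs to have key access to the same variable names, and can be a pandas DataFrame or a dict like object that contains numpy arrays.

If no formula was used, then the provided exog needs to have the same number of columns as the original exog in the model. No transformation of the data is performed except converting it to a numpy array.

Row indices as in pandas data frames are supported, and added to the returned prediction."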
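To make the quoted behaviour concrete, here is a small sketch with a formula model on synthetic data (an assumption). It passes the new values once as a dict of numpy arrays and once as a DataFrame with a custom row index:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption) for a simple linear model.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=80)
df = pd.DataFrame({"x": x, "y": 1.0 + 3.0 * x + rng.normal(0, 0.5, size=80)})
res = smf.ols("y ~ x", data=df).fit()

# A dict-like object keyed by the variable names used in the formula.
pred_dict = res.predict(exog={"x": np.array([2.0, 4.0])})

# A DataFrame; its row index is carried over to the returned predictions.
pred_df = res.predict(exog=pd.DataFrame({"x": [2.0, 4.0]}, index=["a", "b"]))
print(pred_df)
```

Both calls should give the same numbers; the DataFrame version additionally labels each prediction with the row it came from.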

from statsmodels.formula.api import logit
logistic_model = logit('target ~ mean_area',breast)
result = logistic_model.fit()

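So you can pass a pandas DataFrame (e.g. df) as the exog argument, and that DataFrame should contain mean_area as a column, because 'mean_area' is the predictor (independent variable). Call predict on the fitted result rather than on the unfitted model, whose predict method expects the parameters as its first argument.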

predictions = result.predict(exog=df)
