Is it possible to fit one specific estimator of an ensemble VotingClassifier?



This is my first question here, so please let me know if I am doing something wrong!

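I built an ensemble voting classifier with 3 different estimators using sklearn. I first fit all 3 estimators on the same data by calling: est.fit()
This first dataset is small, because fitting 2 of the 3 estimators is very time-consuming.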

Now I want to fit the third estimator again with different data. Is there a way to achieve this?

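I tried accessing the estimator like this: ens.estimators_[2].fit(X_largedata, y_largedata)
This does not raise an error, but I am not sure whether it fits a copy of the estimator or the one that is actually part of the ensemble.
Calling ens.predict(X_test) afterwards results in the following error (if I do not try to fit the 3rd estimator, predict works fine):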

ValueError                                Traceback (most recent call last)
<ipython-input-438-65c955f40b01> in <module>
----> 1 pred_ens2 = ens.predict(X_test_ens2)
2 print(ens.score(X_test_ens2, y_test_ens2))
3 confusion_matrix(pred_ens2, y_test_ens2).ravel()
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in predict(self, X)
280         check_is_fitted(self)
281         if self.voting == 'soft':
--> 282             maj = np.argmax(self.predict_proba(X), axis=1)
283 
284         else:  # 'hard' voting
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in _predict_proba(self, X)
300         """Predict class probabilities for X in 'soft' voting."""
301         check_is_fitted(self)
--> 302         avg = np.average(self._collect_probas(X), axis=0,
303                          weights=self._weights_not_none)
304         return avg
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in _collect_probas(self, X)
295     def _collect_probas(self, X):
296         """Collect results from clf.predict calls."""
--> 297         return np.asarray([clf.predict_proba(X) for clf in self.estimators_])
298 
299     def _predict_proba(self, X):
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/ensemble/_voting.py in <listcomp>(.0)
295     def _collect_probas(self, X):
296         """Collect results from clf.predict calls."""
--> 297         return np.asarray([clf.predict_proba(X) for clf in self.estimators_])
298 
299     def _predict_proba(self, X):
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
117 
118         # lambda, but not partial, allows help() to work with update_wrapper
--> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
120         # update the docstring of the returned function
121         update_wrapper(out, self.fn)
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/pipeline.py in predict_proba(self, X)
461         Xt = X
462         for _, name, transform in self._iter(with_final=False):
--> 463             Xt = transform.transform(Xt)
464         return self.steps[-1][-1].predict_proba(Xt)
465 
~/jupyter/lexical/lexical_env/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in transform(self, X)
596             if (n_cols_transform >= n_cols_fit and
597                     any(X.columns[:n_cols_fit] != self._df_columns)):
--> 598                 raise ValueError('Column ordering must be equal for fit '
599                                  'and for transform when using the '
600                                  'remainder keyword')
ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword


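EDIT: I fixed the error! It was caused by the small dataset having more columns than the large one. That is probably a problem because, when the transformers were first fitted on the small dataset, they were told that those columns would be present (?). Once both datasets had the same columns (and the same column order), it worked. This seems to be the right way to train only one specific estimator, but please let me know if there is a better way or if you think I am wrong.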
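The column fix described in the edit can be sketched roughly as follows. This is only an illustration with pandas: `X_small` and `X_large` are hypothetical stand-ins for the two training frames, and the assumption is that the whole ensemble is fitted on the aligned small frame before the third estimator is refitted on the aligned large frame.

```python
import pandas as pd

# Hypothetical stand-ins for the small and the large training frames.
X_small = pd.DataFrame({"a": [1, 2], "b": [3, 4], "extra": [5, 6]})
X_large = pd.DataFrame({"b": [7, 8, 9], "a": [1, 2, 3]})

# Keep only the columns both frames share, in the order used by X_small,
# so that every fit and every transform sees identical columns.
shared = [c for c in X_small.columns if c in X_large.columns]
X_small_aligned = X_small[shared]
X_large_aligned = X_large[shared]

print(list(X_small_aligned.columns))
print(list(X_large_aligned.columns))
```

With both frames reduced to the same ordered column list, the transformers inside the pipeline no longer see a different layout between the first fit and later calls.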

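So it seems that the individual classifiers are stored in a list that can be accessed with .estimators_. The individual entries of this list are classifiers that have a .fit method. Here is an example with logistic regression: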

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
X1, y1 = make_classification(random_state=1)
X2, y2 = make_classification(random_state=2)

clf1 = LogisticRegression(random_state=1)
clf2 = LogisticRegression(random_state=2)
clf3 = LogisticRegression(random_state=3)

voting = VotingClassifier(estimators=[
    ('a', clf1),
    ('b', clf2),
    ('c', clf3),
])
# fit all estimators on the first dataset
voting = voting.fit(X1, y1)
# refit only the last fitted estimator on the second dataset
voting.estimators_[-1].fit(X2, y2)
voting.predict(X2)
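To check that the refit really touches the estimator inside the ensemble and not a copy, one can compare its coefficients before and after. The following is a minimal sketch under the same setup as above. It also encodes the labels with `voting.le_` before refitting, because, as far as I can tell, VotingClassifier fits its members on label-encoded targets; with 0/1 labels this makes no difference, but it matters for string labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

X1, y1 = make_classification(random_state=1)
X2, y2 = make_classification(random_state=2)

voting = VotingClassifier(estimators=[
    ('a', LogisticRegression(random_state=1)),
    ('b', LogisticRegression(random_state=2)),
    ('c', LogisticRegression(random_state=3)),
]).fit(X1, y1)

last = voting.estimators_[-1]
coef_before = last.coef_.copy()

# Apply the ensemble's own label encoding before the manual refit.
last.fit(X2, voting.le_.transform(y2))

# The list entry is the very object that was refitted, and its
# learned coefficients now differ from those of the first fit.
same_object = voting.estimators_[-1] is last
coef_changed = not np.allclose(coef_before, last.coef_)
print(same_object, coef_changed)
```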

EDIT: the difference between estimators and estimators_

.estimators

This is a list of tuples of the form (name, estimator):

for e in voting.estimators:
    print(e)
('a', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=1, solver='warn', tol=0.0001, verbose=0,
warm_start=False))
('b', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=2, solver='warn', tol=0.0001, verbose=0,
warm_start=False))
('c', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=3, solver='warn', tol=0.0001, verbose=0,
warm_start=False))

.estimators_

This is just a list of the estimators, without the names.

for e in voting.estimators_:
    print(e)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=1, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=2, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=3, solver='warn', tol=0.0001, verbose=0,
warm_start=False)

Interestingly, though, voting.estimators[0][1] == voting.estimators_[0] evaluates to False, so the entries do not seem to be the same objects.
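A likely explanation for the False result is that fit() clones each estimator passed to the constructor and stores the fitted clones in estimators_, while == on two estimator objects falls back to an identity check. The sketch below illustrates this under that assumption; it also uses the named_estimators_ attribute, which gives access to the fitted clones by name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

X1, y1 = make_classification(random_state=1)

voting = VotingClassifier(estimators=[
    ('a', LogisticRegression(random_state=1)),
    ('b', LogisticRegression(random_state=2)),
    ('c', LogisticRegression(random_state=3)),
]).fit(X1, y1)

original = voting.estimators[0][1]  # object passed to the constructor
fitted = voting.estimators_[0]      # clone that fit() created and fitted

is_same_object = original is fitted
original_is_fitted = hasattr(original, "coef_")
fitted_is_fitted = hasattr(fitted, "coef_")
same_params = original.get_params() == fitted.get_params()
named_is_fitted_clone = voting.named_estimators_["a"] is fitted

print(is_same_object, original_is_fitted, fitted_is_fitted)
print(same_params, named_is_fitted_clone)
```

So the two lists hold different objects with identical hyperparameters, and only the ones in estimators_ carry fitted state.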

The predict method of the voting classifier uses the .estimators_ list.

See lines 295 - 323 of the source.
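The claim that predict relies on estimators_ can be checked without reading the source. The sketch below assumes the default hard voting with no weights: it refits the last member on other data, rebuilds the majority vote by hand from estimators_, and compares the result with voting.predict. It is an illustration of the idea, not a copy of the library code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

X1, y1 = make_classification(random_state=1)
X2, y2 = make_classification(random_state=2)

voting = VotingClassifier(estimators=[
    ('a', LogisticRegression(random_state=1)),
    ('b', LogisticRegression(random_state=2)),
    ('c', LogisticRegression(random_state=3)),
]).fit(X1, y1)

# Refit only the last member, using the ensemble's label encoding.
voting.estimators_[-1].fit(X2, voting.le_.transform(y2))

# One column of encoded votes per fitted member: (n_samples, n_estimators).
votes = np.asarray([est.predict(X1) for est in voting.estimators_]).T

# Unweighted majority vote per sample, then decode back to the labels.
majority = np.apply_along_axis(
    lambda row: np.argmax(np.bincount(row)), axis=1, arr=votes
)
manual = voting.le_.inverse_transform(majority)

matches = np.array_equal(manual, voting.predict(X1))
print(matches)
```

If the manual vote and voting.predict agree after the refit, the ensemble is using the refitted object from estimators_.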
