scikit-learn: learning curves without information leakage?

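I would like to generate a learning curve for a LinearSVC estimator whose features are extracted with CountVectorizer. The CountVectorizer step also applies some feature selection.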

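I could do the following:

1. Fit the vectorizer on all of the data, including the selection of the top N features
2. Use those features when fitting the LinearSVC
3. Pass the LinearSVC as the estimator to sklearn.model_selection.learning_curve()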

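But I think this causes information leakage: information derived from all of the data would be used to select features for the smaller training sets used in the learning curve.

Is that correct? Is there a way to use the built-in sklearn.model_selection.learning_curve() together with CountVectorizer without leaking information?

Thanks!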

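You need to combine a pipeline with learning_curve. When the pipeline is fitted, it calls fit_transform on its transformers using only the training data; at test time it calls only transform. learning_curve also applies cross-validation, which you can control through the cv parameter.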
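The fit/transform split can be checked directly on a pipeline. The following is a minimal sketch, assuming a tiny made-up corpus (the documents and labels are purely illustrative and not from the question): a word that appears only in the test document never enters the vocabulary learned during fit.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative toy data (an assumption, not part of the original question)
train_docs = [
    "cheap pills online",
    "meeting agenda attached",
    "cheap watches online",
    "agenda for project meeting",
]
train_labels = [1, 0, 1, 0]
test_docs = ["lottery winner cheap pills"]

pipeline = make_pipeline(CountVectorizer(), LinearSVC())

# fit() runs fit_transform on the vectorizer using the training documents only
pipeline.fit(train_docs, train_labels)

# make_pipeline names each step after its lowercased class name
vocabulary = pipeline.named_steps["countvectorizer"].vocabulary_
print("cheap" in vocabulary)    # seen during training
print("lottery" in vocabulary)  # only in the test document, so never learned

# predict() runs transform only; unseen words are silently ignored
prediction = pipeline.predict(test_docs)
```

learning_curve clones and refits the whole pipeline for every fold and every training-set size, so the same isolation holds for each point on the curve.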

With this pipeline, no information is leaked. Below is an example using the 20 newsgroups dataset, which scikit-learn downloads on first use.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import learning_curve

categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None
data = fetch_20newsgroups(subset='train', categories=categories)

# Vectorizer, TF-IDF weighting and classifier are refit together on each training subset
pipeline = make_pipeline(
    CountVectorizer(), TfidfTransformer(), LinearSVC()
)

# Returns the training-set sizes and the per-fold train and test scores
train_sizes, train_scores, test_scores = learning_curve(
    pipeline, data.data, data.target, cv=5
)
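The question also mentions selecting the top N features. The asker's actual selection method is not shown, so the sketch below assumes SelectKBest with the chi2 score as a stand-in, placed inside the pipeline so that the selection is recomputed on each training subset. The corpus is synthetic and exists only to keep the example self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic two-class corpus (an assumption for illustration only).
# Classes alternate so every training subset contains both labels.
sports_words = ["goal", "match", "team", "score", "coach"]
tech_words = ["code", "server", "bug", "python", "compile"]
docs, labels = [], []
for i in range(20):
    docs.append(" ".join(sports_words[(i + j) % 5] for j in range(3)) + " today")
    labels.append(0)
    docs.append(" ".join(tech_words[(i + j) % 5] for j in range(3)) + " today")
    labels.append(1)
labels = np.array(labels)

# Feature selection is a pipeline step, so it only ever sees training data
pipeline = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=3),  # k=3 is an arbitrary choice for this toy corpus
    LinearSVC(),
)

# Integer train_sizes are interpreted as absolute numbers of samples
train_sizes, train_scores, test_scores = learning_curve(
    pipeline, docs, labels, cv=4, train_sizes=[10, 20, 30]
)

# Average over the cross-validation folds for each training-set size
for size, train, test in zip(
    train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)
):
    print(f"n={size}: train={train:.2f}, test={test:.2f}")
```

If the "top N" selection is simply the most frequent terms, CountVectorizer's own max_features parameter does the same job and is equally safe inside the pipeline.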
