In binary text classification with scikit-learn, learning a linear model with an SGD classifier on a TF-IDF bag-of-words representation, I want to obtain per-class feature importances from the model coefficients. For this setting I have heard conflicting opinions on whether the columns (features) should be scaled with StandardScaler(with_mean=False).
With sparse data, the data cannot be centered before scaling anyway (hence the with_mean=False part). By default, TfidfVectorizer also already applies L2 row normalization to each instance. Based on empirical results, such as the self-contained example below, the top features per class seem intuitively more meaningful without StandardScaler. For example, "nasa" and "space" are top tokens for sci.space, while "god" and "christians" are top tokens for talk.religion.misc, and so on.
Am I missing something? Should StandardScaler(with_mean=False) still be used in this NLP setting when deriving feature importances from the linear model coefficients?
Are these feature importances without StandardScaler(with_mean=False) still unreliable from a theoretical point of view in this case?
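To make the row-normalization point concrete, here is a minimal sketch (the toy documents are made up for illustration) showing that TfidfVectorizer with its default norm='l2' already scales every document row to unit Euclidean length:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["nasa launched a rocket into space",
        "god and religion in society",
        "space nasa orbit launch"]

# default norm='l2': each document (row) is scaled to unit Euclidean length
X = TfidfVectorizer().fit_transform(docs)
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)  # all rows have norm 1.0
```

So any subsequent column scaling operates on top of an already row-normalized matrix.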
# load text from web
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),
                                      categories=['sci.space', 'talk.religion.misc'])
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                                     categories=['sci.space', 'talk.religion.misc'])
# setup grid search, optionally use scaling
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
text_clf = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.8)),
    # uncomment the line below to use scaling
    #('scaler', StandardScaler(with_mean=False)),
    ('clf', SGDClassifier(random_state=0, max_iter=1000))
])
from sklearn.model_selection import GridSearchCV
parameters = {
    'clf__alpha': (0.0001, 0.001, 0.01, 0.1, 1.0, 10.0)
}
# find best model
gs_clf = GridSearchCV(text_clf, parameters, cv=8, n_jobs=-1, verbose=2)
gs_clf.fit(newsgroups_train.data, newsgroups_train.target)
# model performance, very similar with and without scaling
y_predicted = gs_clf.predict(newsgroups_test.data)
from sklearn import metrics
print(metrics.classification_report(newsgroups_test.target, y_predicted))
# use eli5 to get feature importances (these correspond to the model's coef_);
# only the top 10 highest and lowest are shown for brevity
from eli5 import show_weights
show_weights(gs_clf.best_estimator_.named_steps['clf'], vec=gs_clf.best_estimator_.named_steps['vect'], top=(10, 10))
# Outputs:
No scaling:
Weight? Feature
+1.872 god
+1.235 objective
+1.194 christians
+1.164 koresh
+1.149 such
+1.147 jesus
+1.131 christian
+1.111 that
+1.065 religion
+1.060 kent
… 10616 more positive …
… 12664 more negative …
-0.922 on
-0.939 it
-0.976 get
-0.977 launch
-0.994 edu
-1.071 at
-1.098 thanks
-1.117 orbit
-1.210 nasa
-2.627 space
StandardScaler:
Weight? Feature
+0.040 such
+0.023 compuserve
+0.021 cockroaches
+0.017 how about
+0.016 com
+0.014 figures
+0.014 inquisition
+0.013 time no
+0.012 long time
+0.010 fellowship
… 11244 more positive …
… 14299 more negative …
-0.011 sherzer
-0.011 sherzer methodology
-0.011 methodology
-0.012 update
-0.012 most of
-0.012 message
-0.013 thanks for
-0.013 thanks
-0.028 ironic
-0.032 <BIAS>
I don't have a theoretical basis for this, but scaling features after TfidfVectorizer() makes me a bit nervous, since it seems like it would corrupt the idf part. My understanding of TfidfVectorizer() is that, in a sense, it already scales across both documents and features. If your penalized estimation method works well without scaling, I can't think of a reason to scale.