Feature importance in text classification with linear models: StandardScaler(with_mean=False), yes or no?



In a binary text classification with scikit-learn, learning with an SGDClassifier linear model on a TF-IDF bag-of-words representation, I want to obtain the feature importances per class through the model coefficients. For this case I have heard diverging opinions on whether the columns (features) should be scaled with StandardScaler(with_mean=False) or not.

For sparse data the matrix cannot be centered before scaling anyway (the with_mean=False part). By default, TfidfVectorizer also already L2 row-normalizes each instance. Based on empirical results, e.g. the self-contained example below, the top features per class seem intuitively more meaningful when NOT using StandardScaler. For example, "nasa" and "space" are top tokens for sci.space, and "god" and "christians" for talk.religion.misc, etc.
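
For reference, the row normalization is easy to check: TfidfVectorizer defaults to norm='l2', so every row comes out with unit length, and centering is refused outright for sparse input. A minimal sketch (the toy documents are made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# two toy documents, for illustration only
docs = ["nasa launches a rocket into space", "god and religion in the news"]
X = TfidfVectorizer().fit_transform(docs)  # norm='l2' is the default
print(np.asarray(X.power(2).sum(axis=1)).ravel())  # squared row norms, ~[1. 1.]
# StandardScaler(with_mean=True).fit(X) would raise a ValueError on this sparse matrix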

Am I missing something? Should StandardScaler(with_mean=False) still be used to obtain feature importances from the linear model coefficients in such an NLP setting?

Are these feature importances without StandardScaler(with_mean=False) still unreliable from a theoretical point of view in this case?

# load the text from the web
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),
                                      categories=['sci.space', 'talk.religion.misc'])
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                                     categories=['sci.space', 'talk.religion.misc'])
# set up the grid search; optionally use scaling
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
text_clf = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.8)),
    # uncomment the line below to use the scaler
    #('scaler', StandardScaler(with_mean=False)),
    ('clf', SGDClassifier(random_state=0, max_iter=1000))
])
from sklearn.model_selection import GridSearchCV
parameters = {
    'clf__alpha': (0.0001, 0.001, 0.01, 0.1, 1.0, 10.0)
}
# find the best model
gs_clf = GridSearchCV(text_clf, parameters, cv=8, n_jobs=-1, verbose=0)
gs_clf.fit(newsgroups_train.data, newsgroups_train.target)
# model performance, very similar with and without scaling
y_predicted = gs_clf.predict(newsgroups_test.data)
from sklearn import metrics
print(metrics.classification_report(newsgroups_test.target, y_predicted))
# use eli5 to get the feature importances, which correspond to the coef_ of the model;
# only the top 10 highest and lowest weights are shown for brevity of this posting
from eli5 import show_weights
show_weights(gs_clf.best_estimator_.named_steps['clf'], vec=gs_clf.best_estimator_.named_steps['vect'], top=(10, 10))    

# Outputs:
No scaling:
Weight?     Feature
+1.872  god
+1.235  objective
+1.194  christians
+1.164  koresh
+1.149  such
+1.147  jesus
+1.131  christian
+1.111  that
+1.065  religion
+1.060  kent
… 10616 more positive …
… 12664 more negative …
-0.922  on
-0.939  it
-0.976  get
-0.977  launch
-0.994  edu
-1.071  at
-1.098  thanks
-1.117  orbit
-1.210  nasa
-2.627  space 
StandardScaler:
Weight?     Feature
+0.040  such
+0.023  compuserve
+0.021  cockroaches
+0.017  how about
+0.016  com
+0.014  figures
+0.014  inquisition
+0.013  time no
+0.012  long time
+0.010  fellowship
… 11244 more positive …
… 14299 more negative …
-0.011  sherzer
-0.011  sherzer methodology
-0.011  methodology
-0.012  update
-0.012  most of
-0.012  message
-0.013  thanks for
-0.013  thanks
-0.028  ironic
-0.032  <BIAS> 
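
For what it's worth, the eli5 table can also be reproduced without eli5, straight from the fitted pipeline's coef_ and vocabulary. A minimal sketch (get_feature_names_out is the scikit-learn >= 1.0 spelling, older versions use get_feature_names; the <BIAS> row shown by eli5 is the intercept, skipped here):

import numpy as np
vect = gs_clf.best_estimator_.named_steps['vect']
clf = gs_clf.best_estimator_.named_steps['clf']
names = np.asarray(vect.get_feature_names_out())
coefs = clf.coef_.ravel()        # binary case: one weight per feature
order = np.argsort(coefs)
for i in order[:-11:-1]:         # 10 most positive weights
    print(f"{coefs[i]:+.3f}  {names[i]}")
for i in order[9::-1]:           # 10 most negative weights, in descending order
    print(f"{coefs[i]:+.3f}  {names[i]}")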

I don't have a theoretical basis for this, but scaling the features after TfidfVectorizer() makes me a little nervous, since that seems like it would damage the idf part. My understanding of TfidfVectorizer() is that, in a sense, it already scales across documents and features. If your penalized estimation method works well without scaling, I can't think of any reason to scale.
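
To make that concern concrete: with_mean=False leaves the values uncentered and simply divides each column by its standard deviation, so every feature's idf weighting gets rescaled by its own data-dependent factor. A minimal sketch of the effect on a single column (toy documents, for illustration only):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
docs = ["space nasa orbit", "space launch", "god religion", "space orbit nasa launch"]
vect = TfidfVectorizer()
X = vect.fit_transform(docs)
scaler = StandardScaler(with_mean=False).fit(X)
Xs = scaler.transform(X)
col = vect.vocabulary_['space']
print(X[:, col].toarray().ravel())   # tf-idf values with the idf factor intact
print(Xs[:, col].toarray().ravel())  # same column divided by its std, scaler.scale_[col]

Since each token gets its own divisor, a rare high-idf token and a common one can end up on comparable scales, which would explain why the top features look so different with the scaler enabled.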
