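I'm using scikit-learn for feature selection, and I want to get the score for every unigram in my text. I can get the scores, but how do I map them back to the actual feature names?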
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
Texts = ["should schools have uniform", "schools discipline", "legalize marriage", "marriage culture"]
labels = ["3", "3", "7", "7"]
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(Texts)
ch2 = SelectKBest(chi2, k="all")  # k is keyword-only in current scikit-learn
X_train = ch2.fit_transform(term_doc, labels)
print(ch2.scores_)
This prints the scores, but how do I know which feature name corresponds to which score?
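It's right there in the documentation. The vectorizer exposes the feature names, in the same column order as ch2.scores_:
vectorizer.get_feature_names()  # renamed to get_feature_names_out() in scikit-learn 1.0 and later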
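A minimal sketch of that mapping, reusing the example from the question. The names `selector` and `scores_by_feature` are mine, not from the original post, and the `hasattr` check is only there because the method name depends on the installed scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["should schools have uniform", "schools discipline",
         "legalize marriage", "marriage culture"]
labels = ["3", "3", "7", "7"]

vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(texts)

# k="all" keeps every feature, so scores_ has one entry per vocabulary term.
selector = SelectKBest(chi2, k="all")
selector.fit(term_doc, labels)

# Newer scikit-learn releases use get_feature_names_out();
# older ones only provide get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = vectorizer.get_feature_names_out()
else:
    feature_names = vectorizer.get_feature_names()
feature_names = [str(name) for name in feature_names]

# Feature names and scores share the column order of term_doc.
scores_by_feature = dict(zip(feature_names, selector.scores_))

# Print the unigrams from highest to lowest chi-square score.
for name, score in sorted(scores_by_feature.items(),
                          key=lambda item: item[1], reverse=True):
    print(name, score)
```

Because the scores line up with the columns of the term-document matrix, a plain `zip` is enough. No lookup through `vocabulary_` is needed.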
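Run the chi-square test on all of your features first and print the results labelled with the feature names, so each value lines up with one of your columns. You can then drop features based on their p-values.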
import pandas as pd
from sklearn.feature_selection import chi2
# df is assumed to be a pandas DataFrame of non-negative features plus an "outcome" column
X = df.drop("outcome", axis=1)
y = df["outcome"]
chi_scores = chi2(X, y)  # returns a tuple: (chi-square statistics, p-values)
print(chi_scores)
# index the p-values by column name so each one maps back to its feature
p_values = pd.Series(chi_scores[1], index=X.columns)
p_values.sort_values(ascending=False, inplace=True)
p_values.plot.bar(figsize=(20, 10))
print(p_values >= 0.5)
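To make the "drop the feature" step concrete, here is a rough sketch on a made-up DataFrame. The column names `f_signal` and `f_noise`, the toy values, and the 0.05 cutoff are all assumptions for illustration. Pick a threshold that suits your own data.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical toy data; chi2 requires non-negative feature values.
df = pd.DataFrame({
    "f_signal": [5, 6, 5, 0, 0, 1],   # clearly differs between the classes
    "f_noise":  [2, 1, 2, 1, 2, 2],   # same total in both classes
    "outcome":  [1, 1, 1, 0, 0, 0],
})

X = df.drop("outcome", axis=1)
y = df["outcome"]

# chi2 returns (statistics, p-values), one entry per column of X.
statistics, p = chi2(X, y)
p_values = pd.Series(p, index=X.columns)

# Assumed cutoff: keep features whose p-value is below alpha.
alpha = 0.05
kept = p_values[p_values < alpha].index.tolist()
dropped = [col for col in X.columns if col not in kept]

X_reduced = X[kept]
print("kept:", kept)
print("dropped:", dropped)
```

Indexing the p-values by `X.columns` is what ties each value back to a feature name, so the filter can be written directly in terms of column labels.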