I want to remove German stop words from my dataset before fitting the model and measuring prediction accuracy. I am not sure why the code below does not work. NLTK and all related libraries are installed.
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

stemmer = SnowballStemmer('german', ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_count_vect = StemmedCountVectorizer(stop_words='german')

text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior=False))])

text_mnb_stemmed = text_mnb_stemmed.fit(X, y)
predicted_mnb_stemmed = text_mnb_stemmed.predict(X)
np.mean(predicted_mnb_stemmed == y)
If you just want to remove German stop words from the documents, you can pass a list of stop words to CountVectorizer. Note that the only built-in string value its stop_words parameter accepts is 'english'; stop_words='german' is not recognized, which is why your code fails. Pass the list from NLTK instead:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

german_stop_words = stopwords.words('german')
vect = CountVectorizer(stop_words=german_stop_words)  # now use this in your pipeline
I am not sure whether your concern is removing the German words from the respective columns of the dataset, or excluding German stop words at vectorization time.
CountVectorizer is not meant for removing stop words from a column; it is meant for vectorizing your corpus.
If you just want to strip the stop words out of a DataFrame column, you can simply do this...
import pandas as pd
from nltk.corpus import stopwords

german_stop_words = stopwords.words('german')

df = pd.DataFrame(['how are you. vom und viel', 'hope this help aber', 'alle'], columns=['x'])

def stop_word_removal(x):
    tokens = x.split()
    return ' '.join(w for w in tokens if w not in german_stop_words)

df['removed_stop_word'] = df['x'].apply(stop_word_removal)
                           x removed_stop_word
0  how are you. vom und viel      how are you.
1        hope this help aber    hope this help
2                       alle
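The same stop_word_removal logic can be checked without pandas; again, the word set below is a small hand-picked stand-in for stopwords.words('german'):

```python
# Small stand-in for stopwords.words('german') (assumption, for illustration)
german_stop_words = {'und', 'vom', 'viel', 'aber', 'alle'}

def stop_word_removal(x):
    # Keep only the whitespace-separated tokens that are not stop words
    return ' '.join(w for w in x.split() if w not in german_stop_words)

print(stop_word_removal('how are you. vom und viel'))  # how are you.
print(stop_word_removal('alle'))                       # (empty string)
```

One caveat of this approach: split() keeps punctuation attached to tokens (e.g. 'you.'), so a stop word fused to a comma or period would not be matched; tokenize first (e.g. with nltk.word_tokenize) if that matters for your data.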