I want to remove all German stop words from a dataset

I want to remove German stop words from my dataset before fitting the model and measuring prediction accuracy. I'm not sure why the code below doesn't help. NLTK and all related libraries are installed.

import nltk
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nltk.download()
# ignore_stopwords=True only keeps the stemmer from stemming stop words;
# it does not remove them from the text
stemmer = SnowballStemmer('german', ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_count_vect = StemmedCountVectorizer(stop_words='german')
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                             ('tfidf', TfidfTransformer()),
                             ('mnb', MultinomialNB(fit_prior=False))])
text_mnb_stemmed = text_mnb_stemmed.fit(X, y)
predicted_mnb_stemmed = text_mnb_stemmed.predict(X)
np.mean(predicted_mnb_stemmed == y)

If you just want to remove German stop words from the documents, you can pass a list of stop words to CountVectorizer:

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')  # the NLTK stop word corpus must be downloaded once
german_stop_words = stopwords.words('german')
vect = CountVectorizer(stop_words=german_stop_words)  # now use this in your pipeline
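
For completeness, here is a minimal sketch of the same pipeline as in the question with the stop word list plugged in, assuming X and y hold your documents and labels:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_mnb = Pipeline([
    ('vect', CountVectorizer(stop_words=german_stop_words)),  # drops the German stop words
    ('tfidf', TfidfTransformer()),
    ('mnb', MultinomialNB(fit_prior=False)),
])
# text_mnb.fit(X, y) then trains on vectors that no longer contain stop words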

I'm not sure whether your concern is to remove the German stop words from the respective columns themselves, or to exclude them when vectorizing.

CountVectorizer is not for removing stop words from the respective columns; it is used to vectorize your corpus.
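
To make that distinction concrete, here is a small illustration (the sample sentence is made up): passing the stop word list only affects the learned vocabulary, while the source text stays untouched.

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

german_stop_words = stopwords.words('german')
docs = ['das ist ein kleines Beispiel']  # hypothetical document

vect = CountVectorizer(stop_words=german_stop_words)
vect.fit_transform(docs)
print(vect.get_feature_names_out())  # ['beispiel', 'kleines'] -- stop words dropped
print(docs)                          # the original text is unchanged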

If you just want to remove stop words from a column of your dataframe, you can simply do this:

import pandas as pd
from nltk.corpus import stopwords

german_stop_words = set(stopwords.words('german'))  # a set makes the lookup fast
df = pd.DataFrame(['how are you. vom und viel', 'hope this help aber', 'alle'],
                  columns=['x'])

def stop_word_removal(x):
    tokens = x.split()
    return ' '.join(w for w in tokens if w not in german_stop_words)

df['removed_stop_word'] = df['x'].apply(stop_word_removal)
   x                           removed_stop_word
0  how are you. vom und viel   how are you.
1  hope this help aber         hope this help
2  alle
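
One caveat with this split-based lookup: it only matches exact tokens, so a stop word with trailing punctuation ("und,") or a leading capital ("Und") slips through. A small variant, sketched here with only Python's standard library, normalizes each token before the lookup:

import string

def stop_word_removal_strict(x):
    tokens = x.split()
    return ' '.join(
        w for w in tokens
        if w.lower().strip(string.punctuation) not in german_stop_words
    )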
