Adding stemming support to CountVectorizer (sklearn)



I am trying to add stemming to my NLP pipeline with sklearn.

from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

stop = stopwords.words('french')
stemmer = FrenchStemmer()

class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, stemmer):
        super(StemmedCountVectorizer, self).__init__()
        self.stemmer = stemmer
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        # stem every token produced by the default analyzer
        return lambda doc:(self.stemmer.stem(w) for w in analyzer(doc))
stem_vectorizer = StemmedCountVectorizer(stemmer)
text_clf = Pipeline([('vect', stem_vectorizer), ('tfidf', TfidfTransformer()), ('clf', SVC(kernel='linear', C=1)) ])

The pipeline works when I use sklearn's plain CountVectorizer. It also works if I build the features manually, like this:

vectorizer = StemmedCountVectorizer(stemmer)
X_counts = vectorizer.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

Edit

If I run this pipeline in my IPython notebook, the cell shows [*] and nothing happens. When I look at my terminal, it gives the following error:

Process PoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda2\lib\multiprocessing\pool.py", line 102, in worker
    task = get()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 360, in get
    return recv()
AttributeError: 'module' object has no attribute 'StemmedCountVectorizer'

Example

Here is a complete example:

from sklearn.pipeline import Pipeline
from sklearn import grid_search
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.stem.snowball import FrenchStemmer
stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()
def stemming(doc):
    return (stemmer.stem(w) for w in analyzer(doc))
X = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils', 'Paris est magique', 'Marseille est tragique', 'JCVD est fou']
Y = [1,0,1,1,0,0]
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SVC())])
parameters = { 'vect__analyzer': ['word', stemming]}
gs_clf = grid_search.GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf.fit(X, Y)

If you remove stemming from the parameters it works; otherwise it does not.

Update

The problem seems to lie in the parallelization, because it disappears when n_jobs=-1 is removed.
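A likely explanation, offered as an assumption rather than something confirmed in this thread: with n_jobs=-1 the estimator is pickled and sent to worker processes, and pickle stores functions and classes by module and name only, not by their code. A worker that cannot import that name from its own `__main__` fails with exactly this kind of AttributeError. The sketch below uses only the standard library (`json.dumps` is just a convenient importable function) to show the by-reference behaviour and why a lambda cannot be shipped at all.

```python
import json
import pickle

# A module-level function is pickled as a reference: module name + function name.
payload = pickle.dumps(json.dumps)
stored_by_reference = b"json" in payload and b"dumps" in payload

# Unpickling re-imports the module and looks the name up again,
# so the receiving process must be able to import it.
restored = pickle.loads(payload)

# A lambda has no importable name, so the standard pickler rejects it.
try:
    pickle.dumps(lambda doc: doc.split())
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(stored_by_reference, restored is json.dumps, lambda_pickled)
```

If this is indeed the cause, moving the custom analyzer or vectorizer class out of the notebook into a separate .py file that the workers can import may resolve the error; dropping n_jobs=-1, as noted above, avoids the pickling step entirely.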

You can pass a callable as analyzer to the CountVectorizer constructor to provide a custom analyzer. This seems to work for me.

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer
stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))
stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))
print(stem_vectorizer.get_feature_names())

Output:

  (0, 4)    1
  (0, 2)    1
  (0, 0)    1
  (0, 1)    1
  (0, 3)    1
[u'dan', u'la', u'march', u'ru', u'tu']

I know I am a bit late posting an answer, but here it is in case someone still needs help.

The following is the cleanest way I know to add language-specific stemming to the count vectorizer, by overriding build_analyzer():

from sklearn.feature_extraction.text import CountVectorizer
import nltk.stem
from nltk.corpus import stopwords
french_stemmer = nltk.stem.SnowballStemmer('french')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([french_stemmer.stem(w) for w in analyzer(doc)])
# sklearn only ships a built-in 'english' stop list, so the French words are
# passed explicitly (this needs the NLTK stopwords corpus to be downloaded)
vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word", stop_words=stopwords.words('french'))

You can then call the usual CountVectorizer fit and transform methods on the vectorizer_s object.
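As a rough illustration of that usage, here is a minimal sketch. To keep it runnable without NLTK data, `toy_stem` is a hypothetical stand-in for the Snowball stemmer (it only strips a trailing "s"), and the tiny corpus is made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the default tokenizer/preprocessor, then stem each token.
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(w) for w in analyzer(doc)]


train_docs = ["les chats dorment", "le chat dort", "un chat noir"]

# min_df=3 keeps only stems that occur in at least three documents.
vectorizer_s = StemmedCountVectorizer(min_df=3)
train_counts = vectorizer_s.fit_transform(train_docs)

# New documents are stemmed with the same analyzer before counting.
new_counts = vectorizer_s.transform(["deux chats et un chien"])

print(list(vectorizer_s.vocabulary_))
print(train_counts.toarray().tolist(), new_counts.toarray().tolist())
```

Because "chats" and "chat" collapse to the same stem, that stem is the only term reaching the min_df=3 threshold here; without stemming, no term would.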

You can try:

def build_analyzer(self):
    # call the parent implementation through the subclass itself
    analyzer = super(StemmedCountVectorizer, self).build_analyzer()
    return lambda doc:(stemmer.stem(w) for w in analyzer(doc))

and remove the __init__ method.
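One possible reason this helps (my reading, not stated by the answerer): scikit-learn discovers an estimator's parameters from the signature of its `__init__`, so a constructor that only accepts `stemmer` hides every CountVectorizer parameter from get_params and set_params, which grid search relies on. The sketch below compares the two variants; the class names `WithInit` and `WithoutInit` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer


class WithInit(CountVectorizer):
    # Mirrors the constructor from the question.
    def __init__(self, stemmer):
        super().__init__()
        self.stemmer = stemmer


class WithoutInit(CountVectorizer):
    # Inherits CountVectorizer.__init__ unchanged.
    pass


# Only the custom argument is visible as a parameter.
with_init_params = sorted(WithInit(stemmer=None).get_params())

# All of CountVectorizer's parameters remain visible.
without_init_params = WithoutInit().get_params()

# Setting an inherited parameter fails on the variant with the custom __init__.
try:
    WithInit(stemmer=None).set_params(min_df=2)
    set_params_ok = True
except ValueError:
    set_params_ok = False

print(with_init_params, "min_df" in without_init_params, set_params_ok)
```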
