Using spaCy as the tokenizer in an sklearn pipeline



I am trying to use spaCy as the tokenizer inside a larger scikit-learn pipeline, but I keep running into the problem that the task cannot be pickled to be sent to the workers.

Minimal example:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import fetch_20newsgroups
from functools import partial
import spacy

def spacy_tokenize(text, nlp):
    return [x.orth_ for x in nlp(text)]

nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])
tok = partial(spacy_tokenize, nlp=nlp)

pipeline = Pipeline([('vectorize', CountVectorizer(tokenizer=tok)),
                     ('clf', SGDClassifier())])
params = {'vectorize__ngram_range': [(1, 2), (1, 3)]}
CV = RandomizedSearchCV(pipeline,
                        param_distributions=params,
                        n_iter=2, cv=2, n_jobs=2,
                        scoring='accuracy')
categories = ['alt.atheism', 'comp.graphics']
news = fetch_20newsgroups(subset='train',
                          categories=categories,
                          shuffle=True,
                          random_state=42)
CV.fit(news.data, news.target)

Running this code, I get the error:

PicklingError: Could not pickle the task to send it to the workers.

What puzzles me is that:

import pickle
pickle.dump(tok, open('test.pkl', 'wb'))

works without any problem.

Does anyone know whether it is possible to use spaCy together with sklearn cross-validation? Thanks!

This is not a solution but a workaround. There appears to be an open issue between spaCy and joblib:

  • https://github.com/explosion/spaCy/issues/1669
  • https://github.com/joblib/joblib/issues/767

If you keep the tokenizer as a function in a separate file inside your project directory and then import it into your current file, you can avoid this error. Something like:

  • custom_file.py

    import spacy
    nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])

    def spacy_tokenizer(doc):
        return [x.orth_ for x in nlp(doc)]
    
  • main.py

    # Other code
    ...
    from custom_file import spacy_tokenizer

    pipeline = Pipeline([('vectorize', CountVectorizer(tokenizer=spacy_tokenizer)),
                         ('clf', SGDClassifier())])
    ...
    
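If moving the tokenizer into its own module is inconvenient, another pattern that is sometimes used to side-step this kind of pickling problem is a small callable class that loads the model lazily. The sketch below is a hypothetical alternative, not part of the original answer: `SpacyTokenizer` pickles only its configuration (model name and disabled pipes), and each worker reloads the spaCy model on first use instead of receiving it over the wire.

```python
import pickle

class SpacyTokenizer:
    """Callable tokenizer that loads the spaCy model lazily, so the
    instance itself stays cheap to pickle."""

    def __init__(self, model='en', disable=('ner', 'parser', 'tagger')):
        self.model = model
        self.disable = list(disable)
        self._nlp = None  # loaded on first call, never pickled

    def __call__(self, doc):
        if self._nlp is None:
            import spacy
            self._nlp = spacy.load(self.model, disable=self.disable)
        return [t.orth_ for t in self._nlp(doc)]

    def __getstate__(self):
        # Pickle only the configuration; drop the loaded model so the
        # payload sent to each worker stays small and picklable.
        state = self.__dict__.copy()
        state['_nlp'] = None
        return state

tok = SpacyTokenizer()
tok_copy = pickle.loads(pickle.dumps(tok))  # round trip succeeds
```

You would then pass `tokenizer=SpacyTokenizer()` to `CountVectorizer`; whether this fully avoids the joblib issue above depends on your spaCy/joblib versions, so treat it as a sketch to test, not a guaranteed fix.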
