Error when using CountVectorizer and TfidfTransformer



I wrote the following code to transform some data:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
def transform(data):
    vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None)
    clean = vectorizer.fit_transform(data)
    clean_tfidf_transformer = TfidfTransformer()
    clean_tfidf = clean_tfidf_transformer.fit_transform(clean)
    return clean_tfidf, clean_tfidf.shape[1]

However, when I run it on some data, it produces this error:

Traceback (most recent call last):
  File "...", line 169, in <module>
    X, y = process(directory, filename)
  File "...", line 132, in process
    tr_abstract, abstractN = transform(pre_abstract)
  File "...", line 77, in transform
    clean = vectorizer.fit_transform(data)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.

What does this mean?

Your data contains missing values (NaN). The following code reproduces the error:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import numpy as np
vectorizer = CountVectorizer(analyzer = "word", tokenizer=None, preprocessor=None, stop_words=None)
clean = vectorizer.fit_transform([u'i am shane', np.nan])
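
If it helps, here is a minimal sketch of one way to find and drop the offending entries before vectorizing. It assumes the documents are in a plain Python list; the sample list `docs` and the names `bad_positions` and `cleaned` are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data; the np.nan stands in for a missing document.
docs = ["i am shane", np.nan, "shane was here"]

# Record where the non-string entries are, so they can be inspected.
bad_positions = [i for i, d in enumerate(docs) if not isinstance(d, str)]
print("non-string documents at positions:", bad_positions)

# Keep only real strings; CountVectorizer expects str (or bytes) documents.
cleaned = [d for d in docs if isinstance(d, str)]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(cleaned)
print(counts.shape)
```

Note that dropping documents changes the number of rows, so any labels that go with the data would need to be filtered the same way.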

I ran into the same error when using tfidf.fit_transform. None of the other answers here worked for me, so I ran:

df['data'] = df['data'].astype(str) 

and then it worked! Give this a try.
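
One caveat: depending on the pandas version, casting with astype(str) may turn a missing value into literal text such as 'nan', which the vectorizer would then count as a word. A possible alternative, sketched below under the assumption that the text lives in a DataFrame column named `data` as in the line above, is to replace missing values with an empty string instead. The sample DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical DataFrame with one missing document.
df = pd.DataFrame({"data": ["i am shane", np.nan]})

# Replace missing values with an empty string, so the row is kept
# but contributes no tokens to the vocabulary.
filled = df["data"].fillna("")

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(filled)

# Sorted list of the learned vocabulary terms.
vocab = sorted(vectorizer.vocabulary_)
print(vocab, counts.shape)
```

This keeps the row count unchanged, which matters if the rows have to stay aligned with labels.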
