I wrote the following code to transform some data:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
def transform(data):
    vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None)
    clean = vectorizer.fit_transform(data)
    clean_tfidf_transformer = TfidfTransformer()
    clean_tfidf = clean_tfidf_transformer.fit_transform(clean)
    return clean_tfidf, clean_tfidf.shape[1]
However, when I run it on some data, it raises this error:
Traceback (most recent call last):
  File "...", line 169, in <module>
    X, y = process(directory, filename)
  File "...", line 132, in process
    tr_abstract, abstractN = transform(pre_abstract)
  File "...", line 77, in transform
    clean = vectorizer.fit_transform(data)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File ".../anaconda/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.
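What does this mean?
Your data contains missing values. CountVectorizer expects every document to be a byte or unicode string, and np.nan is a float, so it is rejected. The following code reproduces the error: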
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import numpy as np
vectorizer = CountVectorizer(analyzer = "word", tokenizer=None, preprocessor=None, stop_words=None)
clean = vectorizer.fit_transform([u'i am shane', np.nan])
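One way to avoid the error is to replace the missing entries before vectorizing. The sketch below is only an illustration using the standard library: `replace_missing` is a hypothetical helper name, and substituting an empty string for a missing document is an assumption about what you want (you may prefer to drop those entries instead).

```python
import math


def replace_missing(docs, fill=""):
    """Return a list of documents with None/NaN entries replaced by `fill`.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    cleaned = []
    for doc in docs:
        # np.nan is an ordinary Python float, so this check catches it too.
        is_nan = isinstance(doc, float) and math.isnan(doc)
        cleaned.append(fill if doc is None or is_nan else doc)
    return cleaned


docs = replace_missing(["i am shane", float("nan"), None])
print(docs)
```

The cleaned list can then be passed to vectorizer.fit_transform in place of the raw data.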
I ran into the same error when using tfidf and tfidf.fit_transform. None of the other answers here worked for me, so I ran:
df['data'] = df['data'].astype(str)
After that, it worked. Give it a try.
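One caveat with this workaround: depending on the pandas version, astype(str) may turn a missing value into the literal text 'nan', which the vectorizer would then treat as a word. A possible alternative, sketched below, is to fill or drop the missing values explicitly. The sample rows are made up for illustration; only the column name 'data' comes from the line above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name 'data' mirrors the snippet above.
df = pd.DataFrame({"data": ["i am shane", np.nan, "hello world"]})

# Option 1: keep every row, replacing missing text with an empty string.
filled = df["data"].fillna("")

# Option 2: drop the rows whose text is missing.
dropped = df.dropna(subset=["data"])

print(filled.tolist())
print(len(dropped))
```

Which option fits depends on whether the rows must stay aligned with labels or other columns: filling keeps the row count unchanged, while dropping removes the affected rows from the whole DataFrame.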