ValueError: np.nan is an invalid document



Here is the code I wrote:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

X2 = df['title']
y2 = df['news_type']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=42)
pp = Pipeline([
    ('bow', CountVectorizer(analyzer=final)),  # `final` is a custom analyzer defined elsewhere
    ('tfidf', TfidfTransformer()),
    ('classifier', RandomForestClassifier())
])
pp.fit(X2_train.astype("U"), y2_train.astype("U"))
predictions7 = pp.predict(X2_test)

The error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-300-2bed28a1314e> in <module>
----> 1 predictions7=pp.predict(X2_test)
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
117 
118         # lambda, but not partial, allows help() to work with update_wrapper
--> 119         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
120         # update the docstring of the returned function
121         update_wrapper(out, self.fn)
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
405         Xt = X
406         for _, name, transform in self._iter(with_final=False):
--> 407             Xt = transform.transform(Xt)
408         return self.steps[-1][-1].predict(Xt, **predict_params)
409 
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
1248 
1249         # use the same matrix-building strategy as fit_transform
-> 1250         _, X = self._count_vocab(raw_documents, fixed_vocab=True)
1251         if self.binary:
1252             X.data.fill(1)
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
1108         for doc in raw_documents:
1109             feature_counter = {}
-> 1110             for feature in analyze(doc):
1111                 try:
1112                     feature_idx = vocabulary[feature]
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
97 
98     if decoder is not None:
---> 99         doc = decoder(doc)
100     if analyzer is not None:
101         doc = analyzer(doc)
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in decode(self, doc)
217 
218         if doc is np.nan:
--> 219             raise ValueError("np.nan is an invalid document, expected byte or "
220                              "unicode string.")
221 
ValueError: np.nan is an invalid document, expected byte or unicode string.

I have tried everything to resolve this error, but nothing works. Please tell me what I am doing wrong here. The error is thrown only after this line: predictions7=pp.predict(X2_test). I have pasted the error above.

Solution:

Replace pp.fit(X2_train.astype("U"), y2_train.astype("U")) with
pp.fit(X2_train.astype("U").str.lower(), y2_train.astype("U").str.lower())
Replace predictions7=pp.predict(X2_test) with predictions7=pp.predict(X2_test.astype("U"))

For anyone else who runs into this in the future: as @Prayson W. Daniel pointed out, there are most likely NaN values in the dataset.

You can rewrite the entire dataset, as @Prayson W. Daniel suggests:

df = df.fillna('')
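As a quick check (toy frame with a hypothetical `title` column), filling NaN with an empty string is enough for CountVectorizer to accept every row:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# toy frame with one missing title
df = pd.DataFrame({"title": ["fake news story", np.nan, "real headline"]})

# without fillna, fit_transform raises "np.nan is an invalid document"
vec = CountVectorizer()
X = vec.fit_transform(df["title"].fillna(""))
print(X.shape)  # one row per document, including the formerly-missing one
```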

Alternatively, you can apply it downstream when you create the train/test split. For example, if only your X column contains NaN, you could use something like:

X_train, X_test, y_train, y_test = train_test_split(
    foo['bar'].fillna('no text'), labels,
    test_size=0.30,
    random_state=53
)
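Incidentally, this also explains why pp.fit succeeded while pp.predict failed: astype("U") silently converts np.nan into the literal string 'nan', so the training data never contained a real NaN. A minimal numpy illustration:

```python
import numpy as np

arr = np.array(["hello", np.nan], dtype=object)
converted = arr.astype("U")
print(converted)  # np.nan has become the plain string 'nan'
```

This is also why fillna('') before splitting is usually cleaner: it avoids feeding the classifier spurious 'nan' tokens.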
