Here is the code I wrote:
X2 = df['title']
y2 = df['news_type']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=42)
pp = Pipeline([
    ('bow', CountVectorizer(analyzer=final)),
    ('tfidf', TfidfTransformer()),
    ('classifier', RandomForestClassifier())
])
pp.fit(X2_train.astype("U"), y2_train.astype("U"))
predictions7 = pp.predict(X2_test)
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-300-2bed28a1314e> in <module>
----> 1 predictions7=pp.predict(X2_test)
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
117
118 # lambda, but not partial, allows help() to work with update_wrapper
--> 119 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
120 # update the docstring of the returned function
121 update_wrapper(out, self.fn)
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
405 Xt = X
406 for _, name, transform in self._iter(with_final=False):
--> 407 Xt = transform.transform(Xt)
408 return self.steps[-1][-1].predict(Xt, **predict_params)
409
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents)
1248
1249 # use the same matrix-building strategy as fit_transform
-> 1250 _, X = self._count_vocab(raw_documents, fixed_vocab=True)
1251 if self.binary:
1252 X.data.fill(1)
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
1108 for doc in raw_documents:
1109 feature_counter = {}
-> 1110 for feature in analyze(doc):
1111 try:
1112 feature_idx = vocabulary[feature]
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
97
98 if decoder is not None:
---> 99 doc = decoder(doc)
100 if analyzer is not None:
101 doc = analyzer(doc)
/home/monika/snap/jupyter/common/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in decode(self, doc)
217
218 if doc is np.nan:
--> 219 raise ValueError("np.nan is an invalid document, expected byte or "
220 "unicode string.")
221
ValueError: np.nan is an invalid document, expected byte or unicode string.
I have tried everything to fix this error but cannot get past it. Please tell me what I am doing wrong here? The error is thrown only after the line predictions7=pp.predict(X2_test). I have pasted the error above.
Solution:
Replace "pp.fit(X2_train.astype("U"),y2_train.astype("U"))" with
"pp.fit((X2_train.astype("U").str.lower()),(y2_train.astype("U").str.lower()))"
and replace "predictions7=pp.predict(X2_test)" with "predictions7=pp.predict(X2_test.astype("U"))"
For anyone who runs into this problem in the future: as @Prayson W. Daniel pointed out, there are most likely NaN values in the dataset.

You can fill them across the whole dataset, as @Prayson W. Daniel suggested:

df = df.fillna('')
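A minimal sketch of that whole-dataset fix, on a toy frame with assumed column names matching the question:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the original df (column names assumed)
df = pd.DataFrame({
    "title": ["Breaking news", np.nan, "Sports update"],
    "news_type": ["world", "world", np.nan],
})

df = df.fillna("")             # replace every NaN with an empty string
print(df.isna().sum().sum())   # 0 -- no missing values remain
```

After this, every document passed to the pipeline is a string, so the "np.nan is an invalid document" ValueError cannot occur.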
Alternatively, you can apply the fix downstream when you create the train/test split. For example, if only your X column contains NaN, you can use something like:
X_train, X_test, y_train, y_test = train_test_split(
foo['bar'].fillna('no text'), labels,
test_size=0.30,
random_state=53
)