CalibratedClassifierCV ValueError: could not convert string to float



Dataframe:

id    review                                              name         label
1     it is a great product for turning lights on.        Ashley       
2     plays music and have a good sound.                  Alex        
3     I love it, lots of fun.                             Peter        

I want to use a probabilistic classifier (linear_svc) to predict the label (the probability of 1) from the review. My code:

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets
#Load  dataset
X = training['review']
y = training['label']
linear_svc = LinearSVC()     #The base estimator
# This is the calibrated classifier which can give probabilistic classifier
calibrated_svc = CalibratedClassifierCV(linear_svc,
method='sigmoid',  #sigmoid will use Platt's scaling. Refer to documentation for other methods.
cv=3) 
calibrated_svc.fit(X, y)

# predict
prediction_data = predict_data['review']
predicted_probs = calibrated_svc.predict_proba(prediction_data)

It gives the following error on calibrated_svc.fit(X, y):

ValueError: could not convert string to float: 'it is a great product for turning…'

I would appreciate your help.

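SVM models cannot work on text data directly. You first need to extract numeric features from the text. I suggest reading up on some NLP basics such as Bag of Words and TF-IDF. In any case, for the example you gave, a minimal working pipeline would be: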

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
# Load dataset
X = training['review']
y = training['label']
# Turn the raw text into TF-IDF features, then feed them to the SVM
linear_svc = make_pipeline(TfidfVectorizer(), LinearSVC())
# Calibrated wrapper that adds probability estimates to the classifier
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',
                                        cv=3)
calibrated_svc.fit(X, y)

# predict
prediction_data = predict_data['review']
predicted_probs = calibrated_svc.predict_proba(prediction_data)

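You may also want to clean the text by removing special characters, lowercasing, stemming, and so on. Have a look at the spaCy library for text processing.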
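
A rough sketch of what that cleaning could look like, using only the standard re module for lowercasing and stripping special characters (stemming would need something like spaCy or NLTK and is left out here). clean_text is a hypothetical helper name, not something from the question.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("I LOVE it!!! Lots of fun :)")

# Passing a custom preprocessor replaces the vectorizer's built-in
# lowercasing / accent stripping step; tokenization still happens afterwards.
vectorizer = TfidfVectorizer(preprocessor=clean_text)
features = vectorizer.fit_transform(["I LOVE it!!!", "Lots of FUN :)"])
```

The same vectorizer can be dropped into make_pipeline in place of the plain TfidfVectorizer().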

Try this:

from sklearn.feature_extraction.text import TfidfVectorizer
X = training['review']
y = training['label']    
prediction_data = predict_data['review']
tfv = TfidfVectorizer(min_df=1, stop_words = 'english')
tfv.fit(list(X) + list(prediction_data))
X =  tfv.transform(X) 
prediction_data = tfv.transform(prediction_data)

Then build the model:

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
linear_svc = LinearSVC()
calibrated_svc = CalibratedClassifierCV(linear_svc, method='sigmoid', cv=3)
calibrated_svc.fit(X, y)
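
To get the probability of label 1 out of the fitted model, something along these lines should work. It is only a sketch: the reviews and the 0/1 labels are invented, and it follows the vectorizing steps above (fitting the vectorizer on both the training and the prediction text).

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data; the labels are invented for illustration only.
train_reviews = [
    "it is a great product for turning lights on.",
    "plays music and have a good sound.",
    "I love it, lots of fun.",
    "stopped working after a week.",
    "terrible sound and poor build quality.",
    "would not recommend, waste of money.",
]
train_labels = [1, 1, 1, 0, 0, 0]
new_reviews = ["good sound and lots of fun"]

# Vectorize the text as in the steps above.
tfv = TfidfVectorizer(min_df=1, stop_words="english")
tfv.fit(train_reviews + new_reviews)
X = tfv.transform(train_reviews)
X_new = tfv.transform(new_reviews)

calibrated_svc = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
calibrated_svc.fit(X, train_labels)

probs = calibrated_svc.predict_proba(X_new)

# Columns of `probs` follow the order of calibrated_svc.classes_,
# so look up which column belongs to label 1.
positive_col = list(calibrated_svc.classes_).index(1)
prob_of_1 = probs[:, positive_col]
print(prob_of_1)
```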
