ValueError: X has 13 features, but RandomForestClassifier is expecting 30 features as input



When I try to make a prediction, I get this error:

input_data=[[58,    0,  0,  100,    248,    0,  0,  122,    0,  1,  1,  0,  2]]
prediction = random_forest.predict(input_data)
print(prediction)

I encoded the categorical columns with get_dummies, which increased the number of features to 30:

categorical_val.remove('target')
dataset = pd.get_dummies(df, columns = categorical_val)
# dataset=df
from sklearn.preprocessing import StandardScaler
s_sc = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[col_to_scale] = s_sc.fit_transform(dataset[col_to_scale])

I tried several classification models; one of them is RandomForest:

from sklearn.ensemble import RandomForestClassifier
# create classifier object
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 0)
random_forest.fit(X_train, y_train) 
pred=random_forest.predict(X_test)

Error:

/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but RandomForestClassifier was fitted with feature names
"X does not have valid feature names, but"
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-be47ec6e672c> in <module>()
8 
9 input_data=[[58,        0,      0,      100,    248,    0,      0,      122,    0,      1,      1,      0,      2]]
---> 10 prediction = random_forest.predict(input_data)
11 print(prediction)
12 
4 frames
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
399         if n_features != self.n_features_in_:
400             raise ValueError(
--> 401                 f"X has {n_features} features, but {self.__class__.__name__} "
402                 f"is expecting {self.n_features_in_} features as input."
403             )
ValueError: X has 13 features, but RandomForestClassifier is expecting 30 features as input.

I know the error is caused by get_dummies(), but if I don't use it, the model's accuracy changes.

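The problem is that the model was trained on a feature set of one size (30 columns after encoding), while input_data=[[58, 0, 0, 100, 248, 0, 0, 122, 0, 1, 1, 0, 2]] has only the 13 raw columns, so it does not match the training shape. What get_dummies() does is turn every distinct value of a categorical column into its own column: if a column contains the values 1, 2, 3 (or a, b, c), get_dummies creates three columns from it. So when you pass input for prediction, you have to expand each categorical value in the same way, into as many columns as that column has categories. Each of those columns holds 0 or 1: 0 means the category is absent, 1 means it is present.

For example, take the 3-column row [[2,2,3]], where the first two columns have 2 categories each and the third has 3. The encoded dataset then has the columns [1,2,1,2,1,2,3], and the expanded form of [[2,2,3]] is [0,1,0,1,0,0,1]. I hope this helps.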
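The expansion described above can be sketched with pandas. This is a minimal illustration on toy data, not the heart-disease dataset from the question: the column names `a`, `b`, `c` and the helper `encode_like_training` are made up for the example. It encodes a single raw row with get_dummies and then uses `DataFrame.reindex` to line it up with the columns seen during training, filling the categories that do not occur in the row with 0.

```python
import pandas as pd

# Toy training data (hypothetical): two 2-category columns and one 3-category column.
train = pd.DataFrame({"a": [1, 2, 1, 2], "b": [1, 2, 2, 1], "c": [1, 2, 3, 1]})
cat_cols = ["a", "b", "c"]

# Encode the training data and remember the resulting column layout.
train_encoded = pd.get_dummies(train, columns=cat_cols)
train_columns = train_encoded.columns  # a_1, a_2, b_1, b_2, c_1, c_2, c_3


def encode_like_training(raw_row):
    """One-hot encode a single raw row and align it to the training columns."""
    row = pd.DataFrame([raw_row], columns=train.columns)
    row = pd.get_dummies(row, columns=cat_cols)
    # Categories that do not occur in this row get a column filled with 0.
    return row.reindex(columns=train_columns, fill_value=0).astype(int)


encoded = encode_like_training([2, 2, 3])
print(list(encoded.columns))
print(encoded.values.tolist())  # [[0, 1, 0, 1, 0, 0, 1]]
```

In the question's setup the same idea would presumably apply to the 13 raw values: build a one-row DataFrame with the feature column names of df, encode it, reindex it against the columns of X_train, and scale the numeric columns with s_sc.transform (not fit_transform) before calling random_forest.predict.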
