My DataFrame looks like this:
  testId  wordNumber_no    difficulty  containsPhoto  complicatedWords Verdict
0     t1            140           NaN              0      7.653800e+06    Easy
1     t2            300           NaN              1      7.645800e+06    Hard
2     t3            394  7.653800e+06              0               NaN    Hard
...
To predict Verdict, I used XGBoost without any trouble, and it works well. I would also like to try AdaBoost:
import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
cols_to_drop = ['testId']
df.drop(cols_to_drop, axis=1, inplace=True)
X = df.drop('Verdict', axis=1)
y = df['Verdict']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5) #not sure if random_state is needed, it fails both with and without it
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1)
model = abc.fit(X_train, y_train)
y_pred = model.predict(X_test)
But when fitting the model, I get ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
What I tried:
Since df.isnull().any() returns True for some columns, I ran df = df.fillna(method='ffill'), but the error persists. Then I tried df = df.fillna(lambda x: x.median()), but because of the lambda I got TypeError: float() argument must be a string or a number, not 'function'. Is there a workaround for this?
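A minimal sketch of the two attempts above, run on a toy frame whose values only imitate the sample rows (the frame and the variable names are illustrative assumptions, not the real data). It suggests why the forward fill can leave NaNs behind: a NaN in the first rows of a column has no earlier value to copy. It also shows that fillna takes values (a scalar, dict, or Series) rather than a function, so the per-column medians can be passed directly.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the question's layout (values are illustrative).
df = pd.DataFrame({
    "wordNumber_no": [140, 300, 394],
    "difficulty": [np.nan, np.nan, 7.6538e6],
    "containsPhoto": [0, 1, 0],
    "complicatedWords": [7.6538e6, 7.6458e6, np.nan],
})

# Forward fill copies the previous valid value downward, so NaNs at the
# top of a column have nothing to copy from and stay NaN.
ffilled = df.ffill()
remaining_after_ffill = int(ffilled.isna().sum().sum())

# fillna accepts a scalar, dict, or Series, not a callable. Passing the
# per-column medians as a Series fills each column with its own median.
filled = df.fillna(df.median(numeric_only=True))
remaining_after_median = int(filled.isna().sum().sum())

print("NaNs left after ffill:", remaining_after_ffill)
print("NaNs left after median fill:", remaining_after_median)
```

If the real data also contains infinite values, which the error message mentions as a possible cause, those would need separate handling, since neither fill touches them.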
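- You can drop every row that contains a NaN. Note that dropna returns a new DataFrame, so assign the result:
df = df.dropna()
- Or replace the NaNs with the column mean. Do this for every column that has missing values, again assigning the result back:
df[col] = df[col].fillna(df[col].mean())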
Either way, make sure you remove all NaNs from the initial dataset first, and only then split it into training and test samples.
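Put together, the mean-fill option might look like the sketch below. The data is synthetic and only imitates the question's columns (the values, the missing-value pattern, and the rule that generates Verdict are all assumptions made for illustration); the AdaBoost settings are the ones from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data; all values are made up.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "wordNumber_no": rng.integers(100, 500, size=n).astype(float),
    "difficulty": rng.normal(7.6e6, 1e4, size=n),
    "containsPhoto": rng.integers(0, 2, size=n),
})
df["Verdict"] = np.where(df["wordNumber_no"] > 300, "Hard", "Easy")

# Knock out some entries to imitate the missing values.
df.loc[rng.choice(n, size=15, replace=False), "difficulty"] = np.nan
df.loc[rng.choice(n, size=10, replace=False), "wordNumber_no"] = np.nan

# Fill NaNs with each feature column's mean, assigning the result back.
feature_cols = [c for c in df.columns if c != "Verdict"]
for col in feature_cols:
    df[col] = df[col].fillna(df[col].mean())

# Only now split into training and test samples.
X = df[feature_cols]
y = df["Verdict"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5
)

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

Swapping the loop for df = df.dropna() gives the first option instead, at the cost of losing every row that has at least one missing value.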