NaN causing problems in scikit-learn



My DataFrame looks like this:

   testId  wordNumber_no    difficulty  containsPhoto  complicatedWords  Verdict
0      t1            140           NaN              0      7.653800e+06     Easy
1      t2            300           NaN              1      7.645800e+06     Hard
2      t3            394  7.653800e+06              0               NaN     Hard
...

To predict Verdict, I used XGBoost without any trouble and it worked well. I would also like to try AdaBoost:

import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
cols_to_drop = ['testId'] 
df.drop(cols_to_drop, axis=1, inplace=True)
X = df.drop('Verdict', axis=1)
y = df['Verdict']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5) #not sure if random_state is needed, it fails both with and without it
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1)
model = abc.fit(X_train, y_train)
y_pred = model.predict(X_test)

But when fitting the model, I get ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

What I have tried:

Since df.isnull().any() returned True for some columns, I ran df = df.fillna(method='ffill'), but the error persisted. Then I tried df = df.fillna(lambda x: x.median()), but because of the lambda function I got TypeError: float() argument must be a string or a number, not 'function'. Is there a workaround for this?

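  1. You can drop every row that contains a NaN. dropna returns a new DataFrame, so assign the result:

    df = df.dropna()

  2. Replace each NaN with the mean of its column. Do this for every column that has missing values:

    df[col] = df[col].fillna(df[col].mean())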
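
The two options above, plus a median variant of the second, can be sketched on a small frame. This is only an illustration: the toy data is made up to resemble the rows in the question, and the variable names (ffilled, dropped, filled, median_filled) are hypothetical. It also shows one likely reason forward fill did not help: a NaN at the top of a column has no earlier value to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame modelled on the rows shown in the question.
df = pd.DataFrame({
    "wordNumber_no": [140, 300, 394],
    "difficulty": [np.nan, np.nan, 7.6538e6],
    "containsPhoto": [0, 1, 0],
    "complicatedWords": [7.6538e6, 7.6458e6, np.nan],
    "Verdict": ["Easy", "Hard", "Hard"],
})

# Forward fill copies the previous valid value downwards, so NaN values
# at the top of a column have nothing to copy from and stay NaN.
ffilled = df.ffill()
leading_nan_left = ffilled["difficulty"].isna().sum()

# Option 1: drop rows that contain any NaN (returns a new frame).
dropped = df.dropna()

# Option 2: fill each numeric column with its own mean.
numeric_cols = df.select_dtypes(include="number").columns
filled = df.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Median variant: fillna accepts a Series indexed by column name,
# not a function, so compute the medians first and pass them in.
median_filled = df.fillna(df.median(numeric_only=True))

print(leading_nan_left, len(dropped), filled.isna().sum().sum())
```

On this toy frame every row has at least one NaN, so option 1 leaves no rows at all; it is worth checking how many rows survive before choosing it.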
    

Make sure you remove all NaN values from the initial dataset first, and only then split it into training and test samples.
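
As a rough end-to-end sketch of that order of operations (fill first, then split, then fit), assuming synthetic stand-in data: the column names follow the question, but the values, the sample size and the positions of the missing entries are invented for illustration. XGBoost accepts NaN inputs natively, which is presumably why it ran without this step, while AdaBoostClassifier raised the error shown in the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names follow the question, values are made up.
rng = np.random.default_rng(5)
n = 40
df = pd.DataFrame({
    "wordNumber_no": rng.integers(100, 500, size=n),
    "difficulty": rng.normal(7.65e6, 1e4, size=n),
    "containsPhoto": rng.integers(0, 2, size=n),
    "complicatedWords": rng.normal(7.65e6, 1e4, size=n),
    "Verdict": np.where(np.arange(n) % 2 == 0, "Easy", "Hard"),
})
# Knock out a few values, including the first rows, to mimic the question.
df.loc[[0, 1, 7], "difficulty"] = np.nan
df.loc[[2, 11], "complicatedWords"] = np.nan

X = df.drop("Verdict", axis=1)
y = df["Verdict"]

# Fill the gaps with column means before splitting, as the answer suggests.
X = X.fillna(X.mean())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5
)

abc = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=5)
model = abc.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(len(y_pred))
```

Because the labels here are assigned arbitrarily, the predictions carry no meaning; the point is only that fit no longer raises once the features contain no NaN.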
