SKlearn 输入时出现随机森林错误

我正在尝试为我的随机森林运行适合，但我收到以下错误：

forest.fit(train[features], y)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-603415b5d9e6> in <module>()
----> 1 forest.fit(train[rubio_top_corr], y)
/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
    210         """
    211         # Validate or convert input data
--> 212         X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    213         if issparse(X):
    214             # Pre-sort indices to avoid that each individual tree of the
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    396                              % (array.ndim, estimator_name))
    397         if force_all_finite:
--> 398             _assert_all_finite(array)
    399 
    400     shape_repr = _shape_repr(array.shape)
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
     52             and not np.isfinite(X).all()):
     53         raise ValueError("Input contains NaN, infinity"
---> 54                          " or a value too large for %r." % X.dtype)
     55 
     56 
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

我已经将我的功能的数据帧从 float64 强制到 float32，并确保没有空值，所以不确定是什么引发了此错误。让我知道放入更多代码是否有帮助。

更新

它最初是一个熊猫数据帧，我删除了所有 NaN。原始数据帧是带有受访者信息的调查结果，我删除了除 dv 之外的所有问题。我通过运行返回 0 的rforest_df.isnull().sum()仔细检查了这一点。这是我用于建模的完整代码。

rforest_df = qfav3_only
rforest_df[features] = rforest_df[features].astype(np.float32)
rforest_df['is_train'] = np.random.uniform(0, 1, len(rforest_df)) <= .75
train, test = rforest_df[rforest_df['is_train']==True], rforest_df[rforest_df['is_train']==False]
forest = RFC(n_jobs=2,n_estimators=50)
y, _ = pd.factorize(train['K6_QFAV3'])
forest.fit(train[features], y)

更新

这是 y 数据的样子

array([ 0,  1,  2,  3,  4,  3,  3,  5,  6,  7,  8,  7,  9,  6, 10,  6, 11,
        7, 11,  3,  7,  9,  6,  5,  9, 11, 12, 13,  6, 11,  3,  3,  6, 14,
       15,  0,  9,  9,  2,  0, 11,  3,  9,  4,  9,  7,  3,  4,  9, 12,  9,
        7,  6, 13,  6,  0,  0, 16,  6, 11,  4, 10, 11, 11, 17,  3,  6, 16,
        3,  4, 18, 19,  7, 11,  5, 11,  5,  4,  0,  6, 17,  7,  2,  3,  5,
       11,  8,  9, 18,  6,  9,  8,  5, 16, 20,  0,  4,  8, 13, 16,  3, 20,
        0,  5,  4,  2, 11,  0,  3,  0,  6,  6,  6,  9,  4,  6,  5, 11,  0,
       13,  6,  2, 11,  7,  5,  6, 18, 12, 21, 17,  3,  6,  0, 13, 21,  7,
        3,  2, 18, 22,  7,  3,  2,  6,  7,  8,  4,  0,  7, 12,  3,  7,  3,
        2, 11, 19, 11,  6,  2,  9,  3,  7,  9,  9,  5,  6,  8,  0, 18, 11,
        3, 12,  2,  6,  4,  7,  7, 11,  3,  6,  6,  0,  6, 12, 15,  3,  9,
        3,  3,  0,  5,  9,  7,  9, 11,  7,  3, 20,  0,  7,  6,  6, 23, 15,
       19,  0,  3,  6, 16, 13,  5,  6,  6,  3,  6, 11,  9,  0,  6, 23, 16,
        4,  0,  6, 17, 11, 17, 11,  4,  3, 13,  3, 17, 16, 11,  7,  4, 24,
        5,  2,  7,  7,  8,  3,  3, 11,  8,  7, 23,  7,  7, 11,  7, 11,  6,
       15,  3, 25,  7,  4,  5,  3, 17, 20,  3, 26,  7,  9,  6,  6, 17, 20,
        1,  0, 11,  9, 16, 20,  7,  7, 26,  3,  6, 20,  7,  2, 11,  7, 27,
        9,  4, 26, 28,  8,  6,  9, 19,  7, 29,  3,  2, 26, 30,  6, 31,  6,
       18,  3,  0, 18,  4,  7, 32,  0,  2,  8,  0,  5,  9,  4, 16,  6, 23,
        0,  7,  0,  7,  9,  6,  8,  3,  7,  9,  3,  3, 12, 11,  8, 19, 20,
        7,  3,  5, 11,  3, 11,  8,  4,  4,  6,  9,  4,  1,  3,  0,  9,  9,
        6,  7,  8, 33,  8,  7,  9, 34, 11, 11,  6,  9,  9, 17,  8, 19,  0,
        7,  4, 17,  6,  7,  0,  4, 12,  7,  6,  4, 16, 12,  9,  6,  6,  6,
        6, 26, 13,  9,  7,  2,  7,  3, 11,  3,  6,  7, 19,  4,  8,  9, 13,
       11, 15, 11,  4, 18,  7,  7,  7,  0,  5,  4,  6,  0,  3,  7,  4, 25,
       18,  6, 19,  7,  9,  4, 20,  6,  3,  7,  4, 35, 15, 11,  2, 12,  0,
        7, 32,  6, 18,  9,  9,  6,  2,  3, 19, 36, 32,  0,  7,  0,  9, 37,
        3,  5,  6,  5, 34,  2,  6,  0,  7,  0,  7,  3,  7,  4, 18, 18,  7,
        3,  7, 16,  9, 19, 13,  4, 16, 19,  3, 19, 38,  9,  4,  9,  8,  0,
       17,  0,  2,  3,  5,  6,  5, 11, 11,  2,  9,  5, 33,  9,  5,  6, 20,
       13,  3, 39, 13,  7,  0,  9,  0,  4,  6,  7, 16,  7,  0, 21,  5,  3,
       18,  5, 20,  2,  2, 14,  6, 17, 11, 11, 16, 16,  9,  8, 11,  3, 23,
        0, 11,  0,  6,  0,  0,  3, 16,  6,  7,  5,  9,  7, 13,  0, 20,  0,
       25,  6, 16,  8,  4,  4,  2,  8,  7,  5, 40,  3,  8,  5, 12,  8,  9,
        6,  6,  6,  6,  3,  7, 26,  4,  0, 13,  4,  3, 13, 12,  7,  7,  6,
        7, 19, 15,  0, 33,  4,  5,  5, 20,  3, 11,  5,  4,  7,  9,  7, 11,
       36,  9,  0,  6,  6, 11,  6,  4,  2,  5, 18,  8,  5,  5,  2, 25,  4,
       41,  7,  7,  5,  7,  3, 36, 11,  6,  9,  0,  9,  0, 16, 42, 11, 11,
       18,  9,  5, 36,  2,  9,  6,  3, 43,  9, 17, 13,  5,  9,  3,  4,  6,
       44, 37,  0, 45,  2, 18,  8, 46,  2, 12,  9,  9,  3, 16,  6, 12,  9,
        0, 11, 11,  0, 25,  8, 17,  4,  4,  3, 11,  3, 11,  6,  6,  9,  7,
       23,  0,  2,  0,  3,  3,  4,  4,  9,  5, 11, 16,  7,  3, 18, 11,  7,
        6,  6,  6,  5,  9,  6,  3,  9,  7, 17, 11,  4,  9,  2,  3,  0, 26,
        9,  0, 20,  8,  9,  6, 11,  6,  6,  7, 26,  6,  6,  4, 19,  5, 41,
       19, 18, 29,  6,  5, 13,  6, 11,  7,  7,  6,  8,  5,  0,  3, 13, 17,
        6, 20, 11,  6,  9,  6,  2,  7, 11,  9, 20, 12,  7,  6,  8,  7,  4,
        6,  2,  0,  7,  9, 26,  9, 16,  7,  4, 45,  7,  0, 23,  8,  4, 19,
        4, 26, 11,  4,  4,  5,  7,  3,  0, 29, 12,  3,  4, 11,  4, 12,  8,
        7,  5,  0, 47, 12,  0, 25,  6, 16, 20,  5,  8,  4,  4, 11, 12,  0,
        6,  3, 11,  4,  3, 48,  3,  6,  7,  4,  7,  0,  3,  7,  3, 18,  6,
        2,  9,  9, 11,  3,  9,  6, 18, 16,  6, 34,  2,  7,  4,  3, 45,  5,
        0,  7,  2, 17, 17,  9, 18,  5,  6,  5, 15,  5,  7,  6,  9,  0,  7,
       12, 17])

我首先建议您通过以下方式检查 train[features] df 中每一列的数据类型：

print train[features].dtypes

如果您发现存在非数字列，则可以检查这些列以确保没有任何会导致问题的意外值（例如字符串、NaN 等）。如果您不介意删除非数字列，只需使用以下方法选择所有数字列：

numeric_cols = X.select_dtypes(include=['float64','float32']).columns

如果需要，您还可以使用 int dtypes 添加列。

如果遇到太大或太小的值，模型无法处理，则表明缩放数据是一个好主意。在 sklearn 中，这可以按如下方式完成：

scaler = MinMaxScaler(feature_range=(0,1),copy=True).fit(train[features])
train[features] = scaler.transform(train[features])

最后，您应该考虑使用 sklearn 的 Imputer 插补缺失值，并用如下所示的内容填充 NaN：

train[features].fillna(0, inplace=True)

当数据集中有空字符串（如 ''）时，就会发生这种情况。还尝试打印类似的东西

pd.value_counts()

甚至

sorted(list(set(...)))

或

获取循环中数据集中每一列的最小值或最大值。

上面使用 MinMaxScaler 的示例可能有效，但缩放功能在 RF 中效果不佳。

相关内容

最新更新

热门标签：