I am trying to run fit for my random forest, but I am getting the following error:
forest.fit(train[features], y)
returns
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-603415b5d9e6> in <module>()
----> 1 forest.fit(train[rubio_top_corr], y)
/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
210 """
211 # Validate or convert input data
--> 212 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
213 if issparse(X):
214 # Pre-sort indices to avoid that each individual tree of the
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
396 % (array.ndim, estimator_name))
397 if force_all_finite:
--> 398 _assert_all_finite(array)
399
400 shape_repr = _shape_repr(array.shape)
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
52 and not np.isfinite(X).all()):
53 raise ValueError("Input contains NaN, infinity"
---> 54 " or a value too large for %r." % X.dtype)
55
56
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I have already coerced my features DataFrame from float64 to float32 and made sure there are no nulls, so I'm not sure what is triggering this error. Let me know if it would help to post more code.
Update
It was originally a pandas DataFrame, and I dropped all the NaNs. The original DataFrame holds survey results with respondent information, and I dropped every question except the DV. I double-checked this by running rforest_df.isnull().sum(), which returned 0. Here is the full code I use for modelling.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as RFC

rforest_df = qfav3_only
rforest_df[features] = rforest_df[features].astype(np.float32)
rforest_df['is_train'] = np.random.uniform(0, 1, len(rforest_df)) <= .75
train, test = rforest_df[rforest_df['is_train']==True], rforest_df[rforest_df['is_train']==False]
forest = RFC(n_jobs=2, n_estimators=50)
y, _ = pd.factorize(train['K6_QFAV3'])
forest.fit(train[features], y)
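Note that isnull() only reports NaN, not ±inf, so a frame can pass the check above and still fail sklearn's finiteness validation. A minimal illustration with toy data (not my actual survey frame):

```python
import numpy as np
import pandas as pd

# isnull() only sees NaN; +/-inf passes it silently
df = pd.DataFrame({"a": [1.0, np.inf], "b": [2.0, 3.0]})
print(df.isnull().sum().sum())           # 0 -- looks clean
print(np.isfinite(df.to_numpy()).all())  # False -- inf slipped through
```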
Update
Here is what the y data looks like:
array([ 0, 1, 2, 3, 4, 3, 3, 5, 6, 7, 8, 7, 9, 6, 10, 6, 11,
7, 11, 3, 7, 9, 6, 5, 9, 11, 12, 13, 6, 11, 3, 3, 6, 14,
15, 0, 9, 9, 2, 0, 11, 3, 9, 4, 9, 7, 3, 4, 9, 12, 9,
7, 6, 13, 6, 0, 0, 16, 6, 11, 4, 10, 11, 11, 17, 3, 6, 16,
3, 4, 18, 19, 7, 11, 5, 11, 5, 4, 0, 6, 17, 7, 2, 3, 5,
11, 8, 9, 18, 6, 9, 8, 5, 16, 20, 0, 4, 8, 13, 16, 3, 20,
0, 5, 4, 2, 11, 0, 3, 0, 6, 6, 6, 9, 4, 6, 5, 11, 0,
13, 6, 2, 11, 7, 5, 6, 18, 12, 21, 17, 3, 6, 0, 13, 21, 7,
3, 2, 18, 22, 7, 3, 2, 6, 7, 8, 4, 0, 7, 12, 3, 7, 3,
2, 11, 19, 11, 6, 2, 9, 3, 7, 9, 9, 5, 6, 8, 0, 18, 11,
3, 12, 2, 6, 4, 7, 7, 11, 3, 6, 6, 0, 6, 12, 15, 3, 9,
3, 3, 0, 5, 9, 7, 9, 11, 7, 3, 20, 0, 7, 6, 6, 23, 15,
19, 0, 3, 6, 16, 13, 5, 6, 6, 3, 6, 11, 9, 0, 6, 23, 16,
4, 0, 6, 17, 11, 17, 11, 4, 3, 13, 3, 17, 16, 11, 7, 4, 24,
5, 2, 7, 7, 8, 3, 3, 11, 8, 7, 23, 7, 7, 11, 7, 11, 6,
15, 3, 25, 7, 4, 5, 3, 17, 20, 3, 26, 7, 9, 6, 6, 17, 20,
1, 0, 11, 9, 16, 20, 7, 7, 26, 3, 6, 20, 7, 2, 11, 7, 27,
9, 4, 26, 28, 8, 6, 9, 19, 7, 29, 3, 2, 26, 30, 6, 31, 6,
18, 3, 0, 18, 4, 7, 32, 0, 2, 8, 0, 5, 9, 4, 16, 6, 23,
0, 7, 0, 7, 9, 6, 8, 3, 7, 9, 3, 3, 12, 11, 8, 19, 20,
7, 3, 5, 11, 3, 11, 8, 4, 4, 6, 9, 4, 1, 3, 0, 9, 9,
6, 7, 8, 33, 8, 7, 9, 34, 11, 11, 6, 9, 9, 17, 8, 19, 0,
7, 4, 17, 6, 7, 0, 4, 12, 7, 6, 4, 16, 12, 9, 6, 6, 6,
6, 26, 13, 9, 7, 2, 7, 3, 11, 3, 6, 7, 19, 4, 8, 9, 13,
11, 15, 11, 4, 18, 7, 7, 7, 0, 5, 4, 6, 0, 3, 7, 4, 25,
18, 6, 19, 7, 9, 4, 20, 6, 3, 7, 4, 35, 15, 11, 2, 12, 0,
7, 32, 6, 18, 9, 9, 6, 2, 3, 19, 36, 32, 0, 7, 0, 9, 37,
3, 5, 6, 5, 34, 2, 6, 0, 7, 0, 7, 3, 7, 4, 18, 18, 7,
3, 7, 16, 9, 19, 13, 4, 16, 19, 3, 19, 38, 9, 4, 9, 8, 0,
17, 0, 2, 3, 5, 6, 5, 11, 11, 2, 9, 5, 33, 9, 5, 6, 20,
13, 3, 39, 13, 7, 0, 9, 0, 4, 6, 7, 16, 7, 0, 21, 5, 3,
18, 5, 20, 2, 2, 14, 6, 17, 11, 11, 16, 16, 9, 8, 11, 3, 23,
0, 11, 0, 6, 0, 0, 3, 16, 6, 7, 5, 9, 7, 13, 0, 20, 0,
25, 6, 16, 8, 4, 4, 2, 8, 7, 5, 40, 3, 8, 5, 12, 8, 9,
6, 6, 6, 6, 3, 7, 26, 4, 0, 13, 4, 3, 13, 12, 7, 7, 6,
7, 19, 15, 0, 33, 4, 5, 5, 20, 3, 11, 5, 4, 7, 9, 7, 11,
36, 9, 0, 6, 6, 11, 6, 4, 2, 5, 18, 8, 5, 5, 2, 25, 4,
41, 7, 7, 5, 7, 3, 36, 11, 6, 9, 0, 9, 0, 16, 42, 11, 11,
18, 9, 5, 36, 2, 9, 6, 3, 43, 9, 17, 13, 5, 9, 3, 4, 6,
44, 37, 0, 45, 2, 18, 8, 46, 2, 12, 9, 9, 3, 16, 6, 12, 9,
0, 11, 11, 0, 25, 8, 17, 4, 4, 3, 11, 3, 11, 6, 6, 9, 7,
23, 0, 2, 0, 3, 3, 4, 4, 9, 5, 11, 16, 7, 3, 18, 11, 7,
6, 6, 6, 5, 9, 6, 3, 9, 7, 17, 11, 4, 9, 2, 3, 0, 26,
9, 0, 20, 8, 9, 6, 11, 6, 6, 7, 26, 6, 6, 4, 19, 5, 41,
19, 18, 29, 6, 5, 13, 6, 11, 7, 7, 6, 8, 5, 0, 3, 13, 17,
6, 20, 11, 6, 9, 6, 2, 7, 11, 9, 20, 12, 7, 6, 8, 7, 4,
6, 2, 0, 7, 9, 26, 9, 16, 7, 4, 45, 7, 0, 23, 8, 4, 19,
4, 26, 11, 4, 4, 5, 7, 3, 0, 29, 12, 3, 4, 11, 4, 12, 8,
7, 5, 0, 47, 12, 0, 25, 6, 16, 20, 5, 8, 4, 4, 11, 12, 0,
6, 3, 11, 4, 3, 48, 3, 6, 7, 4, 7, 0, 3, 7, 3, 18, 6,
2, 9, 9, 11, 3, 9, 6, 18, 16, 6, 34, 2, 7, 4, 3, 45, 5,
0, 7, 2, 17, 17, 9, 18, 5, 6, 5, 15, 5, 7, 6, 9, 0, 7,
12, 17])
I would first suggest that you check the dtype of each column in your train[features] DataFrame by running:
print train[features].dtypes
If you find non-numeric columns, you can inspect them to make sure they contain no unexpected values (e.g. strings, NaNs, etc.) that would cause problems. If you don't mind dropping the non-numeric columns, simply select all numeric columns with:
numeric_cols = X.select_dtypes(include=['float64','float32']).columns
You can also include the int dtypes if you need them.
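A quick sketch of that selection on a hypothetical frame, using np.number to cover int and float columns in one go:

```python
import numpy as np
import pandas as pd

# toy frame with a string column mixed in (made-up example data)
X = pd.DataFrame({"age": [25, 32], "score": [0.5, 0.7], "name": ["a", "b"]})

# np.number matches every int and float dtype at once
numeric_cols = X.select_dtypes(include=[np.number]).columns
print(list(numeric_cols))  # ['age', 'score']
```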
If you run into values too large or too small for the model to handle, that is a sign that scaling the data is a good idea. In sklearn this can be done as follows:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1), copy=True).fit(train[features])
train[features] = scaler.transform(train[features])
Finally, you should consider imputing the missing values with sklearn's Imputer, or filling the NaNs with something like:
train[features].fillna(0, inplace=True)
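A small illustration of what the fill does on a toy column (0 is just one choice; the column mean or median is often a safer fill value):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.fillna(0).tolist())         # [1.0, 0.0, 3.0]
print(s.fillna(s.mean()).tolist())  # [1.0, 2.0, 3.0] -- mean ignores the NaN
```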
This happens when there is an empty string like '' in the dataset. Also try printing something like
pd.value_counts()
or even
sorted(list(set(...)))
or getting the min or max of each column of the dataset in a loop.
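That per-column scan could look like this (toy frame; a stray infinity stands out immediately in the max):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.inf], "b": [-5.0, 2.0]})

# print the extremes of every column; non-finite values show up right away
for col in df.columns:
    print(col, df[col].min(), df[col].max())
```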
The MinMaxScaler example above may work, but scaling the features has little effect on an RF.