I read my input into a pandas DataFrame and fill the NaN values with:
df = df.fillna(0)
After that, I split the data into training and test sets and run a classifier with sklearn.
features = df.drop('class',axis=1)
labels = df['class']
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)
clf.fit(features_train, labels_train)
But I still get a NaN error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
It seems that fillna() did not catch the missing data. How can I find where the "NaN" is?
df.isnull().sum()
shows how many NaN values each column of the DataFrame contains.
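The error message also mentions infinity and values too large for float32, and fillna() replaces neither of those. Below is a minimal sketch of checking for both. It assumes a small made-up frame: the column names a and b and all values are illustrative only, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN in column 'a', one infinity in column 'b'.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.inf, 5.0, 6.0],
    'class': [0, 1, 0],
})

filled = df.fillna(0)  # removes the NaN, but leaves the infinity in place

# Restrict the checks to numeric columns.
numeric = filled.select_dtypes(include='number')

# Count infinite values per column.
inf_per_column = np.isinf(numeric).sum()
print(inf_per_column)

# Count values whose magnitude exceeds the float32 range (infinities included).
too_large = (numeric.abs() > np.finfo(np.float32).max).sum()
print(too_large)

# One possible cleanup: turn infinities into NaN, then fill as before.
cleaned = filled.replace([np.inf, -np.inf], np.nan).fillna(0)
```

Whether 0 is a sensible replacement for an infinite value depends on the data; the point here is only that the infinities have to be handled separately from the NaNs.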
TL;DR: pip install pandas --upgrade
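I ran into this problem today. sklearn's train_test_split() seems to have an issue when it handles sparse arrays whose columns are all zeros. I filed a bug on the scikit-learn GitHub repo, and they quickly replied that upgrading pandas fixes it: https://github.com/scikit-learn/scikit-learn/issues/22133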
Steps/Code to Reproduce
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.model_selection import train_test_split
X = pd.DataFrame.sparse.from_spmatrix(sparse.eye(5))
y = pd.Series(np.zeros(5))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# output as expected (when every input column has at least one non zero value)
print(X_train)
X_train, X_test, y_train, y_test = train_test_split(X[1:], y[1:], test_size=0.2, random_state=42)
# output column is all NaN (when an input column contains only zeros)
print(X_train)
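The first train_test_split() call produces the expected output because every column has at least one non-zero row, whereas the second produces NaN in the first column because all of its rows are zero.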
     0    1    2    3    4
--------------------------
4  0.0  0.0  0.0  0.0  1.0
2  0.0  0.0  1.0  0.0  0.0
0  1.0  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  1.0  0.0

     0    1    2    3    4
--------------------------
4  NaN  0.0  0.0  0.0  1.0
1  NaN  1.0  0.0  0.0  0.0
3  NaN  0.0  0.0  1.0  0.0
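To check whether a split has introduced NaNs like this, one option is to compare the NaN status of each column before and after. The sketch below uses a hypothetical helper, new_nan_columns, and simulates the faulty output by hand instead of calling train_test_split(), so it does not depend on the installed pandas version.

```python
import numpy as np
import pandas as pd


def new_nan_columns(before, after):
    """Return columns that have no NaN in `before` but at least one in `after`.

    Hypothetical helper for illustration; both frames must share the same columns.
    """
    had_nan = before.isna().any()
    has_nan = after.isna().any()
    newly_broken = has_nan & ~had_nan
    return [col for col, flag in newly_broken.items() if flag]


# Input resembling X[1:] from the report: column 0 contains only zeros.
before = pd.DataFrame(np.eye(5)).iloc[1:]

# Simulated faulty split result: rows 4, 1, 3 with column 0 turned into NaN.
after = before.loc[[4, 1, 3]].copy()
after[0] = np.nan

broken = new_nan_columns(before, after)
print(broken)
```

Running the same comparison on the real X[1:] and X_train would show whether the installed versions are affected.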
You asked how to find where the "NaN" is. Would it help to visualize where the problematic data sits in the frame?
You could try matplotlib.pyplot.spy:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# let's make some initial clean data
df = pd.DataFrame(
data={
'alpha': [0, 1, 2],
'beta': [3, 4, 5],
'gamma': [6, 7, 8]
},
index=['one', 'two', 'three']
)
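# add some problematic points:
# NaNs, infinities, and values that are
# simply not numeric
# (cast to object first so each column can hold mixed types)
df = df.astype(object)
df.loc['one', 'beta'] = 'not a number but not NaN'
df.loc['two', 'alpha'] = np.nan  # np.NaN was removed in NumPy 2.0
df.loc['three', 'gamma'] = np.inf  # np.infty was removed in NumPy 2.0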
fig, axes = plt.subplots(1, 3)
axes[0].spy(df.isnull())
axes[0].set_title('NaN elements')
axes[1].spy(df == np.inf)
axes[1].set_title('infinite elements')
# applymap is deprecated since pandas 2.1; use df.map there instead
axes[2].spy(~df.applymap(np.isreal))
axes[2].set_title('Non numeric elements')
plt.show()
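If a plot is inconvenient (for example on a large frame or in a script), the same kind of boolean mask can be turned into a list of (row label, column label) pairs. This is a rough sketch on a small numeric frame; the helper name locate is made up for this example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one NaN and one infinity.
df = pd.DataFrame(
    {
        'alpha': [0.0, np.nan, 2.0],
        'beta': [3.0, 4.0, 5.0],
        'gamma': [6.0, 7.0, np.inf],
    },
    index=['one', 'two', 'three'],
)


def locate(mask):
    """Return (row label, column label) pairs where a boolean frame is True."""
    rows, cols = np.nonzero(mask.to_numpy())
    return [(mask.index[r], mask.columns[c]) for r, c in zip(rows, cols)]


nan_cells = locate(df.isna())
inf_cells = locate(df.isin([np.inf, -np.inf]))

print('NaN at:', nan_cells)
print('inf at:', inf_cells)
```

The returned labels can be passed straight to df.loc to inspect or repair the offending cells.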