Debugging sklearn's "NaN" error with pandas DataFrame input

I read my input into a pandas DataFrame and fill the NaN values with:

df = df.fillna(0)

After that, I split the data into training and test sets and use sklearn for classification.

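from sklearn.model_selection import train_test_split

# clf is any scikit-learn classifier created beforehand
features = df.drop('class', axis=1)
labels = df['class']
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)
clf.fit(features_train, labels_train)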

But I still get an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

It seems that fillna() did not catch all of the missing data. How can I find where the "NaN" is?

df.isnull().sum()

This shows, per column, how many NaN values the DataFrame contains.
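The error message also mentions infinity, and fillna() only replaces NaN, so a frame can pass the isnull() check and still be rejected. The following is a minimal sketch on a hypothetical toy frame (the column names "a" and "b" and the values are assumptions, not the asker's data) that counts NaN and infinite values per column and then replaces both. Whether 0 is a sensible replacement for an infinite value depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real frame.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.inf, 5.0, 6.0],
    "class": [0, 1, 0],
})

df = df.fillna(0)  # replaces NaN only; +/-inf is left in place

# Count the remaining problem values per numeric column.
numeric = df.select_dtypes(include="number")
nan_per_column = numeric.isna().sum()
inf_per_column = numeric.isin([np.inf, -np.inf]).sum()
print("NaN per column:\n", nan_per_column)
print("inf per column:\n", inf_per_column)

# One possible cleanup: turn infinities into NaN, then fill as before.
cleaned = df.replace([np.inf, -np.inf], np.nan).fillna(0)
all_finite = bool(
    np.isfinite(cleaned.select_dtypes(include="number").to_numpy()).all()
)
print("all values finite after cleanup:", all_finite)
```

If inf_per_column is non-zero after fillna(), the infinities are a likely cause of the ValueError.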

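TL;DR: pip install pandas --upgrade

I ran into this problem today. sklearn's train_test_split() appears to have a problem when it handles sparse arrays in which a column is entirely zero. I filed a bug on the scikit-learn GitHub repo, and they responded quickly with the fix of upgrading pandas: https://github.com/scikit-learn/scikit-learn/issues/22133

Steps/Code to Reproduce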

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.model_selection import train_test_split
X = pd.DataFrame.sparse.from_spmatrix(sparse.eye(5))
y = pd.Series(np.zeros(5))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# output as expected (every input column has at least one non-zero value)
print(X_train)
X_train, X_test, y_train, y_test = train_test_split(X[1:], y[1:], test_size=0.2, random_state=42)
# the first output column is all NaN (that input column contains only zeros)
print(X_train)

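The first train_test_split() call produces the expected output because every column has at least one non-zero row. The second call outputs NaN in the first column because every row in that column is zero.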

     0    1    2    3    4
--------------------------
4  0.0  0.0  0.0  0.0  1.0
2  0.0  0.0  1.0  0.0  0.0
0  1.0  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  1.0  0.0

     0    1    2    3    4
--------------------------
4  NaN  0.0  0.0  0.0  1.0
1  NaN  1.0  0.0  0.0  0.0
3  NaN  0.0  0.0  1.0  0.0
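To check whether a frame is affected by this kind of problem, it can help to see which columns are stored with a sparse dtype. The sketch below is an assumption-laden illustration, not something taken from the issue thread: it builds a hypothetical two-column sparse frame, lists its sparse columns, and converts them to ordinary dense columns as a possible workaround when upgrading pandas is not an option.

```python
import pandas as pd

# Hypothetical toy frame: column "a" is sparse and entirely zero,
# similar to the all-zero column in the reproduction.
X = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 0.0], fill_value=0.0),
    "b": pd.arrays.SparseArray([0.0, 1.0, 0.0], fill_value=0.0),
})

# List the columns that use a sparse dtype.
sparse_cols = [col for col, dtype in X.dtypes.items()
               if isinstance(dtype, pd.SparseDtype)]
print("sparse columns:", sparse_cols)

# Assumed workaround: densify the sparse columns before splitting,
# so downstream code sees ordinary NumPy-backed columns.
X_dense = X.copy()
for col in sparse_cols:
    X_dense[col] = X_dense[col].sparse.to_dense()

print(X_dense.dtypes)
print("NaN count after densifying:", int(X_dense.isna().sum().sum()))
```

Densifying costs memory on large, mostly empty frames, so upgrading pandas as suggested in the issue remains the simpler fix.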

You asked

how to find where the "NaN" is

Would it help to visualize where the problem data sits in the frame?

You could try matplotlib.pyplot.spy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# let's make some initial clean data
# dtype=object so the mixed values assigned below fit in any column
df = pd.DataFrame(
    data={
        'alpha': [0, 1, 2],
        'beta': [3, 4, 5],
        'gamma': [6, 7, 8]
    },
    index=['one', 'two', 'three'],
    dtype=object
)
# add some problematic points:
# NaNs, infinities and stuff that is
# just not numeric
df.loc['one', 'beta'] = 'not a number but not NaN'
df.loc['two', 'alpha'] = np.nan
df.loc['three', 'gamma'] = np.inf
fig, axes = plt.subplots(1, 3)
axes[0].spy(df.isnull())
axes[0].set_title('NaN elements')
axes[1].spy(df == np.inf)
axes[1].set_title('infinite elements')
# applymap is deprecated since pandas 2.1; DataFrame.map is the newer name
axes[2].spy(~df.applymap(np.isreal).astype(bool))
axes[2].set_title('Non numeric elements')
plt.show()
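For a large frame, or when no display is available, a text listing of the offending cells can complement the plot. This is a rough sketch that rebuilds the same kind of toy frame; the helper name true_cells is hypothetical, and pd.to_numeric(errors="coerce") is used here as one possible test for "not numeric", which is a different check from np.isreal.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN, one non-numeric string and one infinity.
df = pd.DataFrame(
    {
        "alpha": [0, np.nan, 2],
        "beta": ["not a number but not NaN", 4, 5],
        "gamma": [6, 7, np.inf],
    },
    index=["one", "two", "three"],
)

def true_cells(mask):
    """Return (row label, column label) pairs where a boolean frame is True."""
    rows, cols = np.where(mask.to_numpy())
    return [(mask.index[r], mask.columns[c]) for r, c in zip(rows, cols)]

# Cells that are NaN.
nan_cells = true_cells(df.isna())
# Cells that are +inf or -inf.
inf_cells = true_cells(df.isin([np.inf, -np.inf]))
# Cells that are present but cannot be parsed as a number.
non_numeric = df.apply(
    lambda col: pd.to_numeric(col, errors="coerce").isna() & col.notna()
)
non_numeric_cells = true_cells(non_numeric)

print("NaN:", nan_cells)
print("infinite:", inf_cells)
print("non-numeric:", non_numeric_cells)
```

The returned labels can be passed straight to df.loc to inspect or repair each cell.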
