SettingWithCopyWarning when standardizing only the numeric columns of a Pandas DataFrame with sklearn



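I get a SettingWithCopyWarning from Pandas when I do the following. I understand what the warning means and I know I can turn it off, but I'm curious whether I'm performing this kind of standardization incorrectly on a pandas DataFrame (my data mixes categorical and numeric columns). On inspection my numbers look fine, but I'd like to clean up my syntax to make sure I'm using Pandas correctly.

I'm also curious whether there is a better workflow for this kind of operation on datasets with mixed data types.

My process, on some toy data, looks like this: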

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from typing import List
# toy data with categorical and numeric data
df: pd.DataFrame = pd.DataFrame([['0',100,'A', 10],
['1',125,'A',15],
['2',134,'A',20],
['3',112,'A',25],
['4',107,'B',35],
['5',68,'B',50],
['6',321,'B',10],
['7',26,'B',27],
['8',115,'C',64],
['9',100,'C',72],
['10',74,'C',18],
['11',63,'C',18]], columns = ['id', 'weight','type','age'])
df.dtypes
# id        object
# weight     int64
# type      object
# age        int64
# dtype: object
# select categorical data for later operations
cat_cols: List = df.select_dtypes(include=['object']).columns.values.tolist()
# select numeric columns for later operations
numeric_cols: List = df.columns[df.dtypes.apply(lambda x: np.issubdtype(x, np.number))].values.tolist()
# prepare data for modeling by splitting into train and test
# use only standardization means/standard deviations from the TRAINING SET only 
# and apply them to the testing set as to avoid information leakage from training set into testing set
X: pd.DataFrame = df.copy()
y: pd.Series = df.pop('type')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# perform standardization of numeric variables using the mean and standard deviations of the training set only
X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train_numeric_tmp)
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])

<ipython-input-15-74f3f6c70f6a>:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

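Your X_train and X_test are still slices of the original DataFrame. Modifying a slice triggers the warning, and the modification often does not take effect the way you intend.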
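The slice-versus-copy distinction can be seen on a tiny frame. The sketch below is only an illustration: the helper `raised_copy_warning` and the two-column frame are made up for this example, and whether the first call reports a warning depends on the installed pandas version (recent versions with copy-on-write no longer emit it).

```python
import warnings
import pandas as pd

df = pd.DataFrame({"weight": [100, 125, 134, 112], "type": ["A", "A", "B", "B"]})

def raised_copy_warning(frame):
    """Overwrite a column and report whether pandas emitted SettingWithCopyWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame["weight"] = frame["weight"] / 100.0
    return any(type(w.message).__name__ == "SettingWithCopyWarning" for w in caught)

subset = df[df["type"] == "A"]       # derived from df; pandas may flag later writes
safe = df[df["type"] == "A"].copy()  # explicit, independent copy

print(raised_copy_warning(subset))   # may be True on pandas versions that emit the warning
print(raised_copy_warning(safe))     # False: writing to an explicit copy is unambiguous
print(df["weight"].tolist())         # the original frame keeps its integer values
```

Taking the explicit copy tells pandas that the new object is meant to be modified independently of `df`.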

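You can either transform before train_test_split, or run X_train = X_train.copy() after the split and then transform.

The second approach avoids the information leakage mentioned in your code comments. Like this: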

# these 2 lines don't look good to me
# X: pd.DataFrame = df.copy()    # don't you drop the label?
# y: pd.Series = df.pop('type')  # y = df['type']
# pass them directly instead
features = [c for c in df if c!='type']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['type'], 
test_size = 0.2, 
random_state = 0)
# now copy what we want to transform
X_train = X_train.copy()
X_test = X_test.copy()
## Code below should work without warning
############
# perform standardization of numeric variables using the mean and standard deviations of the training set only
# you don't need to copy the data to fit
# X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
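For reference, the split-copy-fit-transform steps above can be collected into one self-contained sketch. The function name `split_and_standardize` and the way numeric columns are detected with `select_dtypes` are choices made for this illustration, not part of the original answer; the toy data mirrors the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_standardize(df, target, test_size=0.2, random_state=0):
    """Split first, then scale numeric columns using training-set statistics only."""
    features = [c for c in df.columns if c != target]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=test_size, random_state=random_state
    )
    # Explicit copies, so the assignments below do not target a slice of df
    X_train, X_test = X_train.copy(), X_test.copy()
    numeric_cols = X_train.select_dtypes(include="number").columns.tolist()
    scaler = StandardScaler().fit(X_train[numeric_cols])  # fit on training rows only
    X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
    X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
    return X_train, X_test, y_train, y_test, scaler

toy = pd.DataFrame({
    "id": [str(i) for i in range(12)],
    "weight": [100, 125, 134, 112, 107, 68, 321, 26, 115, 100, 74, 63],
    "type": list("AAAABBBBCCCC"),
    "age": [10, 15, 20, 25, 35, 50, 10, 27, 64, 72, 18, 18],
})
X_train, X_test, y_train, y_test, scaler = split_and_standardize(toy, "type")
print(X_train[["weight", "age"]].mean().round(6))  # training means should be close to zero
```

The test rows are scaled with the training mean and standard deviation, so their own column means will generally not be zero.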

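I'll try to explain how pd.get_dummies() and OneHotEncoder() convert categorical data into dummy columns. I recommend the OneHotEncoder() transformer, because it is an sklearn transformer and you can use it in a Pipeline if you want.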
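As one possible illustration of why an sklearn transformer is convenient here, the sketch below combines the scaler and the encoder with `ColumnTransformer`. Note that `ColumnTransformer` is not mentioned in the original answer; it is brought in as an assumption about what a pipeline-friendly workflow for mixed column types could look like, and the column names follow the toy data from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "weight": [100, 125, 134, 112, 107, 68, 321, 26, 115, 100, 74, 63],
    "age": [10, 15, 20, 25, 35, 50, 10, 27, 64, 72, 18, 18],
    "type": list("AAAABBBBCCCC"),
})

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["weight", "age"]),               # scale numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),  # one-hot encode categories
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_train, X_test = train_test_split(df, test_size=0.25, random_state=0)
train_arr = preprocess.fit_transform(X_train)  # statistics come from the training rows only
test_arr = preprocess.transform(X_test)        # the same fitted statistics are reused
print(train_arr.shape, test_arr.shape)
```

Because the transformer returns new arrays instead of assigning back into X_train, no slice of the original DataFrame is modified, and the whole object can be used as a step in a Pipeline.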

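First, OneHotEncoder(): it does the same job as the pd.get_dummies function, but this class returns a NumPy array or a sparse matrix. You can read more about this class in the scikit-learn documentation: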

from sklearn.preprocessing import OneHotEncoder
X_train_cat = X_train[["type"]]
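# sparse_output=False returns a dense array; on scikit-learn < 1.2 the parameter is named sparse
cat_encoder = OneHotEncoder(sparse_output=False)
X_train_cat_1hot = cat_encoder.fit_transform(X_train_cat) #This is a numpy ndarray!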
#If you want to make a DataFrame again, you can do so like below:
#X_train_cat_1hot = pd.DataFrame(X_train_cat_1hot, columns=cat_encoder.categories_[0])
#You can also concatenate this transformed dataframe with your numerical transformed one.

Second, pd.get_dummies():

df_dummies = pd.get_dummies(X_train[["type"]])
X_train = pd.concat([X_train, df_dummies], axis=1).drop("type", axis=1)
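One thing to watch with pd.get_dummies() is that it only creates columns for the categories present in the frame it is given, so the training and test sets can end up with different columns. The sketch below, on made-up frames, shows one way to align them with `reindex`; this alignment step is an addition for illustration and is not part of the original answer.

```python
import pandas as pd

train = pd.DataFrame({"age": [10, 35, 64], "type": ["A", "B", "C"]})
test = pd.DataFrame({"age": [15, 50], "type": ["A", "B"]})  # no "C" rows in the test set

train_dummies = pd.get_dummies(train, columns=["type"])
test_dummies = pd.get_dummies(test, columns=["type"])

# Give the test frame the training columns; categories it never saw are filled with 0
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_dummies.columns.tolist())
```

OneHotEncoder avoids this manual step because it remembers the categories it saw during fit.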
