Column missing in scikit-learn/pandas function



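I'm trying to train this random forest classifier to check whether my preprocessing works. I think I made a mistake when separating my training data from my labels, since the error message mentions 'price', but I can't tell exactly what is wrong.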

Code:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

def diamond_preprocess(data_dir):
    data = pd.read_csv(data_dir)
    cleaned_data = data.drop(['id', 'depth_percent'], axis=1)  # Features I don't want
    x = cleaned_data.drop(['price'], axis=1)  # Train data
    y = cleaned_data['price']  # Label data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    numerical_features = cleaned_data.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = cleaned_data.select_dtypes(include=['object']).columns
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),  # Fill in missing data with median
        ('scaler', StandardScaler())  # Scale data
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Fill in missing data with 'missing'
        ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One hot encode categorical data
    ])
    preprocessor_pipeline = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ])
    rf = Pipeline(steps=[('preprocessor', preprocessor_pipeline),
                         ('classifier', RandomForestClassifier())])
    rf.fit(x_train, y_train)

cleaned_data.columns: Index(['carat', 'cut', 'color', 'clarity', 'table', 'price', 'length', 'width', 'depth'], dtype='object')

Error:

  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'price'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\utils\__init__.py", line 396, in _get_column_indices
    col_idx = all_columns.get_loc(col)
  File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'price'
The above exception was the direct cause of the following exception:
ValueError: A given column is not a column of the dataframe

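It seems crazy that I would be feeding x_train (which excludes price, since it is my training data) into a preprocessing pipeline that includes the 'price' feature. That shouldn't be a problem, since my labels are all integer 'price' values that should not need preprocessing, right? Do I need a separate transformer for the labels?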

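You are building the ColumnTransformer from the columns of the cleaned_data DataFrame rather than from those of x_train. Because cleaned_data still contains price, and price is numeric, it ends up in numerical_features even though x_train has no such column.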
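To see the mismatch concretely, here is a minimal sketch. It uses a tiny made-up DataFrame (the values and the reduced column set are assumptions for illustration, not the asker's data) and lists the columns that were selected from the full frame but are absent from the feature frame.

```python
import pandas as pd

# Hypothetical miniature of the diamonds data; the values are made up.
cleaned_data = pd.DataFrame({
    "carat": [0.23, 0.31, 0.90, 1.01],
    "cut": ["Ideal", "Good", "Premium", "Ideal"],
    "price": [326, 335, 2757, 4500],
})
x = cleaned_data.drop(["price"], axis=1)

# Selecting from the full frame picks up the label column as well.
numerical_features = cleaned_data.select_dtypes(include=["int64", "float64"]).columns

# Columns the ColumnTransformer would be asked for but cannot find in x.
missing = [col for col in numerical_features if col not in x.columns]
print(missing)
```

Any name that shows up in `missing` is a column the transformer will look up in the feature frame and fail to find.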

You can fix this by computing your categorical and numerical features from x_train instead:

numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = x_train.select_dtypes(include=['object']).columns

Or, better, let sklearn.compose.make_column_selector do the selection:

from sklearn.compose import make_column_selector
preprocessor_pipeline = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, make_column_selector(dtype_exclude=object)),
        ('cat', categorical_transformer, make_column_selector(dtype_include=object))
    ])
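For reference, the selector-based variant can be put together end to end roughly as follows. This is only a sketch under stated assumptions: the small x_train and y_train frames are made-up stand-ins for the diamonds data, and n_estimators and random_state are set only to keep the example quick and repeatable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the diamonds features and labels (made-up values).
# 'cut' is given an explicit object dtype so the dtype-based selectors below
# behave the same regardless of the pandas default string dtype.
x_train = pd.DataFrame({
    "carat": [0.23, 0.31, 0.90, 1.01, 0.50, 0.70],
    "cut": pd.Series(["Ideal", "Good", "Premium", "Ideal", "Good", "Premium"], dtype=object),
})
y_train = pd.Series([326, 335, 2757, 4500, 1200, 2100], name="price")

numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The selectors are callables evaluated on the frame passed to fit,
# so they can only return columns that x_train actually has.
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, make_column_selector(dtype_exclude=object)),
    ("cat", categorical_transformer, make_column_selector(dtype_include=object)),
])

rf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
rf.fit(x_train, y_train)
predictions = rf.predict(x_train)
print(predictions.shape)
```

Since the column lists are resolved at fit time from whatever frame is passed in, the label column cannot leak into the transformer specification the way it did with the precomputed lists.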
