Scikit-pandas, cross_val_score number of features

I am getting familiar with scikit-learn and its pandas integration using the Titanic tutorial on Kaggle. I have cleaned the data and would like to make some predictions. I can call fit and transform on the pipeline without problems; unfortunately, I get an error when I try to do the same with cross_val_score.

I am using the cross_val_score from sklearn-pandas.

The code is as follows:

mapping = [
        ('Age', None),
        ('Embarked',LabelBinarizer()),
        ('Fare',None),
        ('Pclass',LabelBinarizer()),
        ('Sex',LabelBinarizer()),
        ('Group',LabelBinarizer()),
        ('familySize',None),
        ('familyType',LabelBinarizer()),
        ('Title',LabelBinarizer())
    ]

pipe = Pipeline([
    ('featurize', DataFrameMapper(mapping)), 
    ('logReg', LogisticRegression())
    ])
X = df_train[df_train.columns.drop('Survived')]
y = df_train['Survived']
#model = pipe.fit(X = X, y = y)
#prediction = model.predict(df_train)
score = cross_val_score(pipe, X = X, y = y, scoring = 'accuracy')

df_train is a pandas DataFrame containing my whole training set, including the outcome. The two commented-out lines:

model = pipe.fit(X = X, y = y)
prediction = model.predict(df_train)

work fine, and prediction returns an array with the predicted outcomes. Calling cross_val_score with the same inputs, I get the following error:

X has 20 features per sample; expecting 19
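
For context, my understanding is that cross_val_score refits a fresh clone of the whole pipeline (so the DataFrameMapper and its LabelBinarizers too) on every training fold, roughly like the sketch below using plain scikit-learn (the 3 folds and the positional .iloc indexing are my assumptions, not necessarily what sklearn-pandas does internally):

from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

scores = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    fold_pipe = clone(pipe)                               # fresh, unfitted copy of the pipeline
    fold_pipe.fit(X.iloc[train_idx], y.iloc[train_idx])   # mapper + LogisticRegression refit on this fold only
    pred = fold_pipe.predict(X.iloc[test_idx])
    scores.append(accuracy_score(y.iloc[test_idx], pred))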

The full code is below; it can be run with the Titanic CSV files from Kaggle (https://www.kaggle.com/c/titanic/data):

#%% Libraries import
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper, cross_val_score 
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
#%% Read the data
path = 'E:/Kaggle/Titanic/Data/'
file_training = 'train.csv'
file_test = 'test.csv'
#Import the training and test dataset and concatenate them
df_training = pd.read_csv(path + file_training, header = 0, index_col = 'PassengerId')
df_test = pd.read_csv(path + file_test, header = 0, index_col = 'PassengerId')
# Work on the concatenated training and test data for feature engineering and clean-up
df = pd.concat([df_training, df_test], keys = ['train','test'])

#%% Initial data exploration and cleaning
df.describe(include = 'all')
pd.isnull(df).sum() > 0
#%% Preprocesing and Cleanup
#Create new columns with the name (to identify individuals part of a family)
df['LName'] = df['Name'].apply(lambda x:x.split(',')[0].strip())
df['FName'] = df['Name'].apply(lambda x:x.split(',')[1].split('.')[1].strip())
#Get the title
df['Title'] = df['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
titleDic = { 
        'Master' : 'kid',
        'Mlle' : 'unmarriedWoman',
        'Miss' : 'unmarriedWoman',
        'Ms' : 'unmarriedWoman',
        'Jonkheer' : 'noble',
        'Don' : 'noble',
        'Dona' : 'noble',
        'Sir' : 'noble',
        'Lady' : 'noble',
        'the Countess' : 'noble',
        'Capt' : 'ranked',
        'Major' : 'ranked',
        'Col' : 'ranked',
        'Mr' : 'standard',
        'Mme' : 'standard',
        'Mrs' : 'standard',
        'Dr' : 'academic',
        'Rev' : 'academic'
        }
df['Group'] = df['Title'].map(titleDic)
#%% Working with the family size
#Get the family size
df['familySize'] = df['Parch'] + df['SibSp'] + 1
#Add a family tag (single, couple, small, large)
df['familyType'] = pd.cut(df['familySize'],
                          [1, 2, 3, 5, np.inf],
                          labels = ['single', 'couple', 'sFamily', 'bFamily'],
                          right = False)
#%% Filling empty values
#Fill empty values with the mean or mode for the column
#Fill the missing values with mean for age per title, class and gender. Store value in AgeFull variable
agePivot = pd.DataFrame(df.groupby(['Group', 'Sex'])['Age'].median())
agePivot.columns = ['AgeFull']
df = pd.merge(df, agePivot, left_on = ['Group', 'Sex'], right_index = True)
df.loc[df['Age'].isnull(),['Age']] = df['AgeFull']
#Embark location missing values
embarkPivot = pd.DataFrame(df.groupby(['Group'])['Embarked'].agg(lambda x:x.value_counts().index[0]))
embarkPivot.columns = ['embarkFull']
df = pd.merge(df, embarkPivot, left_on = ['Group'], right_index = True)
df.loc[df['Embarked'].isnull(),['Embarked']] = df['embarkFull']
#Fill the missing fare value
df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].mean()
#%% Final clean-up (drop temporary columns)
df = df.drop(['AgeFull', 'embarkFull'], axis = 1)
#%% Preparation for training
df_train = df.loc['train']
df_test = df.loc['test']
#Creation of dummy variables
mapping = [
            ('Age', None),
            ('Embarked',LabelBinarizer()),
            ('Fare',None),
            ('Pclass',LabelBinarizer()),
            ('Sex',LabelBinarizer()),
            ('Group',LabelBinarizer()),
            ('familySize',None),
            ('familyType',LabelBinarizer()),
            ('Title',LabelBinarizer())
        ]
pipe = Pipeline(steps = [
        ('featurize', DataFrameMapper(mapping)), 
        ('logReg', LogisticRegression())
        ])
#Uncommenting the line below fixes the code - why?
#df_train = df_train.sort_index()
X = df_train[df_train.columns.drop(['Survived'])]
y = df_train.Survived
score = cross_val_score(pipe, X = df_train, y = df_train.Survived, scoring = 'accuracy')

Interestingly, I fixed the problem simply by sorting the DataFrame by its index before passing it to cross_val_score with the pipeline:

df_train = df_train.sort_index()
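
One way to see what the sort changes might be to fit just the DataFrameMapper step on each training fold and compare how many feature columns it produces (a quick sketch; the 3-fold split and clone-per-fold are my assumptions about what cross_val_score mirrors):

from sklearn.base import clone
from sklearn.model_selection import KFold

for train_idx, _ in KFold(n_splits=3).split(df_train):
    mapper = clone(pipe.named_steps['featurize'])         # unfitted copy of the DataFrameMapper step
    n_cols = mapper.fit_transform(df_train.iloc[train_idx]).shape[1]
    print(n_cols)  # if this differs between folds, a LabelBinarizer presumably saw fewer categories in that fold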

Can someone explain to me why this affects how scikit-learn behaves?
