I'm very new to this, so any kind of information would help. Apologies if this is a trivial question. I'm working with a medium-sized dataset that contains a lot of zeros. We've applied a number of models; the k=10 CV SKF score is above 0.85, but the roc_auc score stays around 0.5. I'm using sklearn. Below is the code snippet.
from itertools import combinations
import numpy as np
from numpy import array, array_equal
import pandas as pd
import xgboost as xgb
from sklearn import cross_validation as cv
from sklearn import metrics

train_dataset = pd.read_csv('./input/train.csv', index_col='ID')
test_dataset = pd.read_csv('./input/test.csv', index_col='ID')
#print_shapes()
# How many nulls are there in the datasets?
nulls_train = train_dataset.isnull().sum().sum()
nulls_test = test_dataset.isnull().sum().sum()
#print('There are {} nulls in TRAIN and {} nulls in TEST dataset.'.format(nulls_train, nulls_test))
# Remove constant features
def identify_constant_features(dataframe):
    count_uniques = dataframe.apply(lambda x: len(x.unique()))
    constants = count_uniques[count_uniques == 1].index.tolist()
    return constants
constant_features_train = set(identify_constant_features(train_dataset))
#print('There were {} constant features in TRAIN dataset.'.format(len(constant_features_train)))
# Drop the constant features
train_dataset.drop(constant_features_train, inplace=True, axis=1)
#print_shapes()
# Remove equals features
def identify_equal_features(dataframe):
    features_to_compare = list(combinations(dataframe.columns.tolist(), 2))
    equal_features = []
    for compare in features_to_compare:
        is_equal = array_equal(dataframe[compare[0]], dataframe[compare[1]])
        if is_equal:
            equal_features.append(list(compare))
    return equal_features
equal_features_train = identify_equal_features(train_dataset)
#print('There were {} pairs of equal features in TRAIN dataset.'.format(len(equal_features_train)))
# Remove the second feature of each pair.
features_to_drop = array(equal_features_train)[:,1]
train_dataset.drop(features_to_drop, axis=1, inplace=True)
#print_shapes()
# Define the variables model.
y_name = 'TARGET'
feature_names = train_dataset.columns.tolist()
feature_names.remove(y_name)
X = train_dataset[feature_names]
y = train_dataset[y_name]
# Save the features selected for later use.
pd.Series(feature_names).to_csv('features_selected_step1.csv', index=False)
#print('Features selected\n{}'.format(feature_names))
# Proportion of classes
y.value_counts()/len(y)
skf = cv.StratifiedKFold(y, n_folds=10, shuffle=True)
score_metric = 'roc_auc'
scores = {}
def score_model(model):
    return cv.cross_val_score(model, X, y, cv=skf, scoring=score_metric)
clfxgb = xgb.XGBClassifier()
clfxgb = clfxgb.fit(X, y)
probxgb = clfxgb.predict(X)
# #print 'XGB', np.shape(probxgb)
print metrics.roc_auc_score(y, probxgb)
Output:
0.502140359687
For the CV SKF score:
cv.cross_val_score(xgb.XGBClassifier(), X, y, cv=skf, scoring=score_metric)
Output: array([ 0.83114251,  0.84162387,  0.83580491])
We create the submission .csv file as follows:
test_dataset.drop(constant_features_train, inplace=True, axis=1)
test_dataset.drop(features_to_drop, axis=1, inplace=True)
print test_dataset.shape
X_SubTest = test_dataset
df_test = pd.read_csv('./input/test.csv')
id_test = df_test['ID']
predTest = clfxgb.predict(X_SubTest)
submission = pd.DataFrame({"ID":id_test, "TARGET":predTest})
submission.to_csv("submission_svm_23-3.csv", index=False)
You are not using the cross-validation information to train your model: roc_auc and the cross-validation score mean very different things. To get a higher ROC score you need to do model selection, that is, pick the model (with the best parameters) that achieves the highest cross-validation score. One way to do this is to use something like GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) to search the space of potential models over different parameter settings of the XGBoost model; a sketch follows below. That way you are choosing your model precisely because it scored well under cross-validation.
Here is a detailed example from Kaggle: https://www.kaggle.com/tanitter/introducing-kaggle-scripts/grid-search-xgboost-with-scikit-learn/run/23363
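For concreteness, here is a minimal sketch of that approach, written against the same old-style sklearn API used in the question (sklearn.grid_search; from sklearn 0.18 onward GridSearchCV lives in sklearn.model_selection). The parameter grid is illustrative only, not tuned for this dataset, and the X, y and skf objects are assumed to be the ones defined in the question:

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in 0.18+
import xgboost as xgb

# Illustrative grid, for demonstration only -- adjust the ranges to your data.
param_grid = {
    'max_depth': [3, 5, 7],        # tree depth
    'learning_rate': [0.01, 0.1],  # shrinkage per boosting round
    'n_estimators': [100, 300],    # number of boosting rounds
}
grid = GridSearchCV(
    xgb.XGBClassifier(),
    param_grid,
    scoring='roc_auc',  # select on the metric you actually care about
    cv=skf,             # reuse the stratified folds defined in the question
)
grid.fit(X, y)
print grid.best_score_    # best cross-validated roc_auc
print grid.best_params_   # the parameter combination that achieved it
best_model = grid.best_estimator_  # refit on all of X, y (refit=True is the default)

You would then use best_model in place of the bare clfxgb when generating the submission file.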