Prevent RandomizedSearchCV KNN classifier from predicting everything as one class



I'm using RandomizedSearchCV with a KNeighborsClassifier to try to predict loan defaults.

RandomizedSearchCV seems great in theory, but when I test it, the best estimator it finds is one that predicts the same label for every sample.

(The data is split roughly 75% PAIDOFF / 25% defaults, so I get about 75% accuracy, but it is just predicting PAIDOFF for everything.)

import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameter ranges to sample from
n_neighbors = [int(x) for x in np.linspace(start=1, stop=len(X_train) / 3, num=5)]
weights = ['uniform', 'distance']
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
leaf_size = [int(x) for x in np.linspace(10, 100, num=5)]
p = [1, 2]

random_grid = {'n_neighbors': n_neighbors,
               'weights': weights,
               'algorithm': algorithm,
               'leaf_size': leaf_size,
               'p': p}

knn_clf = KNeighborsClassifier()
knn_random = RandomizedSearchCV(estimator=knn_clf, param_distributions=random_grid,
                                n_iter=25, cv=3, verbose=1)
knn_random.fit(X_train, y_train)

Is there anything I can do to combat this? Is there a parameter I can pass to stop it from happening? Or something I can do to my data?

y_test:

38        PAIDOFF
189       PAIDOFF
140       PAIDOFF
286    COLLECTION
142       PAIDOFF
101       PAIDOFF
187       PAIDOFF
139       PAIDOFF
149       PAIDOFF
11        PAIDOFF
269    COLLECTION
231       PAIDOFF
258       PAIDOFF
84        PAIDOFF
242       PAIDOFF
344    COLLECTION
104       PAIDOFF
214       PAIDOFF
109       PAIDOFF
76        PAIDOFF
41        PAIDOFF
262    COLLECTION
125       PAIDOFF
107       PAIDOFF
27        PAIDOFF
14        PAIDOFF
92        PAIDOFF
194       PAIDOFF
113       PAIDOFF
333    COLLECTION
...    
320    COLLECTION
15        PAIDOFF
72        PAIDOFF
122       PAIDOFF
243       PAIDOFF
184       PAIDOFF
294    COLLECTION
280    COLLECTION
218       PAIDOFF
197       PAIDOFF
133       PAIDOFF
143       PAIDOFF
179       PAIDOFF
249       PAIDOFF
80        PAIDOFF
331    COLLECTION
137       PAIDOFF
103       PAIDOFF
120       PAIDOFF
248       PAIDOFF
5         PAIDOFF
236       PAIDOFF
219       PAIDOFF
322    COLLECTION
283    COLLECTION
135       PAIDOFF
124       PAIDOFF
293    COLLECTION
166       PAIDOFF
85        PAIDOFF

Predictions:

array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF'], dtype=object)

This is a typical imbalanced-data problem. A couple of simple things you can try are up-sampling the minority class or down-sampling the majority class, then running the search again (see the sketch below). A better approach is to change the algorithm and use an SVC or a neural network, which can weight the loss heavily toward the minority cases.
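Up-sampling before the search could look roughly like this. This is a minimal sketch: it assumes X_train is a DataFrame, y_train is a Series with 'PAIDOFF' / 'COLLECTION' labels, and 'loan_status' is just a placeholder column name, so adjust the names to your data.

import pandas as pd
from sklearn.utils import resample

# Put features and labels back together so rows stay aligned while resampling
train = pd.concat([X_train, y_train.rename('loan_status')], axis=1)

majority = train[train['loan_status'] == 'PAIDOFF']
minority = train[train['loan_status'] == 'COLLECTION']

# Up-sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

balanced = pd.concat([majority, minority_upsampled])
X_bal = balanced.drop(columns='loan_status')
y_bal = balanced['loan_status']

# Rerun the same RandomizedSearchCV on the balanced training set
knn_random.fit(X_bal, y_bal)

Only resample the training data, never the test set, so your evaluation still reflects the real class distribution.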

For example, sklearn's sklearn.svm.SVC has a class_weight='balanced' parameter that will help here. It essentially scales the misclassification cost of each class by the inverse of its frequency in the input data, so errors on the minority class count for more.

"平衡"模式使用y的值自动调整与输入数据中的类频率成反比的权重作为
