I'm using RandomizedSearchCV with KNeighborsClassifier to try to predict loan defaults.
RandomizedSearchCV seems great in theory, but when I test it, the best estimator it finds is one that predicts the same label for everything.
(The data is split roughly 75% PAIDOFF / 25% COLLECTION, so I get 75% accuracy, but it just predicts PAIDOFF for everything.)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV

# Candidate hyperparameter values
n_neighbors = [int(x) for x in np.linspace(start=1, stop=len(X_train) / 3, num=5)]
weights = ['uniform', 'distance']
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
leaf_size = [int(x) for x in np.linspace(10, 100, num=5)]
p = [1, 2]

random_grid = {'n_neighbors': n_neighbors,
               'weights': weights,
               'algorithm': algorithm,
               'leaf_size': leaf_size,
               'p': p}

knn_clf = KNeighborsClassifier()
knn_random = RandomizedSearchCV(estimator=knn_clf, param_distributions=random_grid,
                                n_iter=25, cv=3, verbose=1)
knn_random.fit(X_train, y_train)
Is there anything I can do to combat this? Is there a parameter I can pass to stop it from happening? Or is there something I can do to my data?
y_test:
38 PAIDOFF
189 PAIDOFF
140 PAIDOFF
286 COLLECTION
142 PAIDOFF
101 PAIDOFF
187 PAIDOFF
139 PAIDOFF
149 PAIDOFF
11 PAIDOFF
269 COLLECTION
231 PAIDOFF
258 PAIDOFF
84 PAIDOFF
242 PAIDOFF
344 COLLECTION
104 PAIDOFF
214 PAIDOFF
109 PAIDOFF
76 PAIDOFF
41 PAIDOFF
262 COLLECTION
125 PAIDOFF
107 PAIDOFF
27 PAIDOFF
14 PAIDOFF
92 PAIDOFF
194 PAIDOFF
113 PAIDOFF
333 COLLECTION
...
320 COLLECTION
15 PAIDOFF
72 PAIDOFF
122 PAIDOFF
243 PAIDOFF
184 PAIDOFF
294 COLLECTION
280 COLLECTION
218 PAIDOFF
197 PAIDOFF
133 PAIDOFF
143 PAIDOFF
179 PAIDOFF
249 PAIDOFF
80 PAIDOFF
331 COLLECTION
137 PAIDOFF
103 PAIDOFF
120 PAIDOFF
248 PAIDOFF
5 PAIDOFF
236 PAIDOFF
219 PAIDOFF
322 COLLECTION
283 COLLECTION
135 PAIDOFF
124 PAIDOFF
293 COLLECTION
166 PAIDOFF
85 PAIDOFF
Predictions:
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
'PAIDOFF', 'PAIDOFF'], dtype=object)
This is a classic imbalanced-data problem. A couple of simple things you can try are upsampling the minority class or downsampling the majority class, and then trying again. A better approach is to change the algorithm: use an SVC or a neural network, which can weight the loss heavily on minority-class mistakes.
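A minimal sketch of the upsampling idea, using sklearn.utils.resample. The toy X_train / y_train arrays and the 75/25 split are stand-ins for your data, assumed here for illustration:

```python
import numpy as np
from sklearn.utils import resample

# Toy stand-in for the imbalanced training data (75% / 25%)
rng = np.random.RandomState(0)
X_train = rng.rand(100, 4)
y_train = np.array(['PAIDOFF'] * 75 + ['COLLECTION'] * 25)

# Separate majority and minority rows
maj_mask = y_train == 'PAIDOFF'
X_maj, y_maj = X_train[maj_mask], y_train[maj_mask]
X_min, y_min = X_train[~maj_mask], y_train[~maj_mask]

# Upsample the minority class with replacement to match the majority count
X_min_up, y_min_up = resample(X_min, y_min,
                              replace=True,
                              n_samples=len(y_maj),
                              random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.unique(y_bal, return_counts=True))  # both classes now have 75 rows
```

Note that any resampling should be done on the training split only, so the test set still reflects the real class distribution.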
For example, sklearn's sklearn.svm.SVC has a class_weight='balanced' parameter that will help here. It essentially scales the cost of the minority class by the inverse of its proportion in the input data. From the scikit-learn docs:

The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).