How to get better precision and recall with an imbalanced dataset in Python



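I am working on a Medicare fraud detection model. The data is extremely imbalanced, with 14 positive fraud cases and about 1 million non-fraud cases. I originally had 8 features, but after one-hot encoding the categorical variables I have 103 features (this is because I have 94 unique provider types). I created a pipeline that combines a logistic regression classifier with SMOTE.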

##########
# Use pipeline - combination of SMOTE and logistic regression model
from imblearn.over_sampling import SMOTE
# Use imblearn's Pipeline: it accepts samplers such as SMOTE and applies them only during fit
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(random_state=27, sampling_strategy="minority")
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

# Split the data X and y into a training and a test set, then fit the pipeline on the training data
# (PartB_encoded is the one-hot-encoded DataFrame described above)
y = PartB_encoded['Is_fraud']
X = PartB_encoded.drop(['Is_fraud'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)

print("Accuracy score: ", accuracy_score(y_true=y_test, y_pred=predicted))
print("Precision score: ", precision_score(y_true=y_test, y_pred=predicted))
print("Recall score: ", recall_score(y_true=y_test, y_pred=predicted))

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)

This is my output:

Accuracy score:  0.9333130935552119
Precision score:  2.3716352424997034e-05
Recall score:  0.09090909090909091
Classification report:
              precision    recall  f1-score   support
       False       1.00      0.93      0.97    632407
        True       0.00      0.09      0.00        11
    accuracy                           0.93    632418
   macro avg       0.50      0.51      0.48    632418
weighted avg       1.00      0.93      0.97    632418
Confusion matrix:
[[590243  42164]
 [    10      1]]
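As a quick sanity check, the three scores can be recomputed by hand from the confusion matrix. This is only a small sketch of the arithmetic; the matrix values are copied from the output above, and the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are the true labels (False, True); columns are the predicted labels.
conf_mat = np.array([[590243, 42164],
                     [10, 1]])
tn, fp, fn, tp = conf_mat.ravel()

precision = tp / (tp + fp)             # 1 true positive out of 42165 flagged cases
recall = tp / (tp + fn)                # 1 of the 11 fraud cases in the test set was caught
accuracy = (tp + tn) / conf_mat.sum()  # dominated by the 590243 true negatives

print(f"precision={precision:.3e}, recall={recall:.4f}, accuracy={accuracy:.4f}")
```

The high accuracy comes almost entirely from the negative class, which is why it says little about how well fraud is detected.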

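Clearly, my precision and recall are both far too low to be acceptable. How can I improve them? I am considering undersampling, but I am worried that I would throw away too much data if I cut the negative class from about 1 million records down to 14 records to match my positive class. I am also considering dropping features, but I am not sure how to decide which features to drop.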

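We ran into a similar problem in financial fraud detection, where actual fraud is typically less than 0.1% of the data. You need to undersample the majority class, while taking care that the various sub-groups inside it stay represented. So first cluster your majority population, then select records from each cluster to build a reduced population for the majority class. Try majority-to-minority ratios such as 80:20, 90:10, and so on until you reach respectable precision and recall. Oversampling techniques like SMOTE are not really advisable, because in most cases the synthetically generated data will differ from real data.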
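The cluster-then-sample idea described above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: the function name `cluster_undersample`, the `majority_share` parameter, the use of scikit-learn's `MiniBatchKMeans`, and the proportional per-cluster quota are all assumptions made for illustration, and the demo data is synthetic rather than the dataset from the question.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def cluster_undersample(X, y, majority_share=0.9, n_clusters=20, random_state=27):
    """Return row positions that keep every minority row and a
    cluster-balanced sample of majority rows (hypothetical helper).

    majority_share=0.9 targets roughly a 90:10 majority:minority mix.
    """
    X = np.asarray(X)
    y = np.asarray(y).astype(bool)
    rng = np.random.default_rng(random_state)

    minority_idx = np.flatnonzero(y)
    majority_idx = np.flatnonzero(~y)

    # Number of majority rows needed to reach the requested share.
    n_keep = int(round(len(minority_idx) * majority_share / (1.0 - majority_share)))
    n_keep = min(n_keep, len(majority_idx))

    # Cluster only the majority class so its sub-populations become visible.
    n_clusters = min(n_clusters, len(majority_idx))
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=random_state, n_init=3)
    labels = km.fit_predict(X[majority_idx])

    kept = []
    for c in np.unique(labels):
        members = majority_idx[labels == c]
        # Proportional quota, but at least one row so no cluster disappears.
        quota = max(1, int(round(n_keep * len(members) / len(majority_idx))))
        quota = min(quota, len(members))
        kept.append(rng.choice(members, size=quota, replace=False))

    selected = np.concatenate([minority_idx] + kept)
    rng.shuffle(selected)
    return selected


# Synthetic stand-in data: 5000 negatives and 14 positives with 8 features.
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(5014, 8))
y_demo = np.zeros(5014, dtype=bool)
y_demo[:14] = True

idx = cluster_undersample(X_demo, y_demo, majority_share=0.9, n_clusters=10)
print("rows kept:", len(idx), "positives kept:", int(y_demo[idx].sum()))
```

In practice the selection would presumably be applied to the training split only (for a DataFrame, something like `X_train.iloc[idx]`), with the test set left at its real class distribution so that the reported precision and recall stay meaningful. Repeating the call with `majority_share` set to 0.8, 0.9, and so on is one way to try the different ratios.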
