I am working on a Medicare fraud detection model. The data is extremely imbalanced: 14 positive fraud cases and roughly 1 million non-fraud cases. I originally had 8 features, but after one-hot encoding the categorical variables I have 103 features (this is because I have 94 unique provider types). I built a pipeline that combines SMOTE with a logistic regression classifier.
##########
# Use a pipeline - combination of SMOTE and logistic regression model
# Note: SMOTE requires the imblearn Pipeline, not sklearn's, so that
# resampling is applied only during fit and not at predict time
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             classification_report, confusion_matrix)

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(random_state=27, sampling_strategy="minority")
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression', model)])

# Split your data X and y into a training and a test set, and fit the pipeline on the training data
y = PartB_encoded['Is_fraud']
X = PartB_encoded.drop(['Is_fraud'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27)
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)

print("Accuracy score: ", accuracy_score(y_true=y_test, y_pred=predicted))
print("Precision score: ", precision_score(y_true=y_test, y_pred=predicted))
print("Recall score: ", recall_score(y_true=y_test, y_pred=predicted))

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
Here is my output:
Accuracy score: 0.9333130935552119
Precision score: 2.3716352424997034e-05
Recall score: 0.09090909090909091
Classification report:
               precision    recall  f1-score   support

       False        1.00      0.93      0.97    632407
        True        0.00      0.09      0.00        11

    accuracy                            0.93    632418
   macro avg        0.50      0.51      0.48    632418
weighted avg        1.00      0.93      0.97    632418

Confusion matrix:
 [[590243  42164]
 [    10      1]]
Clearly my recall and precision are both far too low, which is unacceptable. How can I improve recall and precision? I am considering undersampling, but if I shrink the negative class from roughly 1 million records down to 14 to match my positive class, I worry I would be throwing away too much data. I am also considering removing features, but I am not sure how to decide which ones to remove.
We faced a similar problem with financial fraud detection, where actual fraud is typically less than 0.1% of the data. You have to undersample the majority class while taking care that the representation of its various internal sub-groups stays intact. So first cluster your majority class, then sample from each cluster to build a reduced population for the majority class. Try ratios like 80:20, 90:10, and so on until you reach respectable precision and recall. Oversampling techniques like SMOTE are not really advisable, because in most cases synthetically generated data will differ from the real data.
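A minimal sketch of that cluster-then-sample idea, using a toy dataset in place of your real one (the cluster count, the 90:10 target ratio, and all variable names here are illustrative choices, not something from your setup):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(27)

# Toy stand-in for the real data: 1000 majority rows, 14 minority rows
X = pd.DataFrame(rng.normal(size=(1014, 5)))
y = pd.Series([False] * 1000 + [True] * 14)

X_maj, X_min = X[~y], X[y]

# 1) Cluster the majority class so its internal structure is preserved
n_clusters = 5
labels = KMeans(n_clusters=n_clusters, random_state=27, n_init=10).fit_predict(X_maj)

# 2) Target roughly a 90:10 majority:minority ratio, drawing from each
#    cluster in proportion to its size
target_majority = len(X_min) * 9
keep = []
for c in range(n_clusters):
    cluster_idx = np.flatnonzero(labels == c)
    n_keep = int(round(target_majority * len(cluster_idx) / len(X_maj)))
    keep.extend(rng.choice(cluster_idx, size=min(max(n_keep, 1), len(cluster_idx)), replace=False))

X_balanced = pd.concat([X_maj.iloc[keep], X_min])
y_balanced = pd.Series([False] * len(keep) + [True] * len(X_min), index=X_balanced.index)

You would then fit the classifier on `X_balanced` / `y_balanced` and evaluate on the untouched (still imbalanced) test set, repeating with different ratios until precision and recall are acceptable.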