Logistic regression with repeated k-fold cross-validation in R



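I am trying to predict whether water is safe to drink. The dataset is available at: https://www.kaggle.com/adityakadiwal/water-potability?select=water_potability.csv. Assume the data frame consists of Ph, Hardness, Solids, Chloramines and Potability.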

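I would like to fit a logistic regression with 10-fold cross-validation (10 is just an example; I would like to try other options as well). Bearing in mind the computing power this requires, I would also like to repeat the procedure 5 times with different random 10-fold splits and then choose the best model.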

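I have come across k-fold functions and the glm function, but I don't know how to combine them so that the process is repeated 5 times with random splits. Later, I would also like to build something similar with KNN. I would appreciate any help with this.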

Some code:

library(readr)  # read_csv()
library(caret)  # trainControl(), train()

df <- read_csv("water_potability.csv")
train_model <- trainControl(method = "repeatedcv",
                            number = 10, repeats = 5)
model <- train(Potability ~ ., data = df, method = "regLogistic",
               trControl = train_model)

However, I would prefer to use non-regularized logistic regression.

You can do the following (based on some sample data, loaded from the URL in the code):

library(caret)
# Sample data since your post doesn't include sample data
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# Make sure the response `admit` is a `factor`
df$admit <- factor(df$admit)
# Set up 10-fold CV, repeated 5 times
train_model <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# Train the model
model <- train(
    admit ~ .,
    data = df,
    method = "glm",
    family = "binomial",
    trControl = train_model)
model
#Generalized Linear Model 
#
#400 samples
#  3 predictor
#  2 classes: '0', '1' 
#
#No pre-processing
#Resampling: Cross-Validated (10 fold, repeated 5 times) 
#Summary of sample sizes: 359, 361, 360, 360, 359, 361, ... 
#Resampling results:
#
#  Accuracy   Kappa    
#  0.7020447  0.1772786
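
To show how a k-fold split and glm fit together without caret, here is a minimal base-R sketch of repeated k-fold cross-validation. The helper name `repeated_cv_glm` and the simulated data are hypothetical and only serve as illustration; this is a rough outline of the idea, not a copy of what caret does internally.

```r
# Sketch: repeated k-fold CV for an unpenalised logistic regression (base R only).
# `repeated_cv_glm` is a hypothetical helper, not part of any package.
repeated_cv_glm <- function(formula, data, k = 10, repeats = 5) {
  response <- all.vars(formula)[1]      # name of the response column
  lev <- levels(data[[response]])       # response is assumed to be a 2-level factor
  acc <- matrix(NA_real_, nrow = repeats, ncol = k)
  for (r in seq_len(repeats)) {
    # Draw a fresh random fold assignment for every repeat
    folds <- sample(rep(seq_len(k), length.out = nrow(data)))
    for (i in seq_len(k)) {
      train <- data[folds != i, , drop = FALSE]
      test  <- data[folds == i, , drop = FALSE]
      fit   <- glm(formula, data = train, family = binomial)
      # glm models the probability of the second factor level
      prob  <- predict(fit, newdata = test, type = "response")
      pred  <- ifelse(prob > 0.5, lev[2], lev[1])
      acc[r, i] <- mean(pred == test[[response]])
    }
  }
  acc
}

# Simulated data (an assumption, so that the example runs without downloads)
set.seed(42)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(rbinom(n, 1, plogis(0.5 + 1.2 * sim$x1 - 0.8 * sim$x2)))

acc <- repeated_cv_glm(y ~ x1 + x2, sim, k = 10, repeats = 5)
dim(acc)    # one row per repeat, one column per fold
mean(acc)   # held-out accuracy averaged over all 50 fits
```

Note that the 5 repeats all evaluate the same model specification, so they are usually averaged into one performance estimate rather than used to pick a "best" repeat; the final model is then refitted on the full data, which is also what `train` returns as `model$finalModel`.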

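For good measure, we can also look at the confusion matrix. Note that predict(model) predicts on the same data the model was fitted on, so these figures describe in-sample fit rather than cross-validated performance: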


confusionMatrix(predict(model), df$admit)
#Confusion Matrix and Statistics
#
#          Reference
#Prediction   0   1
#         0 253  98
#         1  20  29
#
#              Accuracy : 0.705           
#                95% CI : (0.6577, 0.7493)
#   No Information Rate : 0.6825          
#   P-Value [Acc > NIR] : 0.1809          
#
#                 Kappa : 0.1856          
#
#Mcnemar's Test P-Value : 1.356e-12       
#                                          
#            Sensitivity : 0.9267          
#            Specificity : 0.2283          
#         Pos Pred Value : 0.7208          
#         Neg Pred Value : 0.5918          
#             Prevalence : 0.6825          
#         Detection Rate : 0.6325          
#   Detection Prevalence : 0.8775          
#      Balanced Accuracy : 0.5775          
#                                          
#       'Positive' Class : 0     
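
Since the question also mentions KNN: the same `trainControl` object can be reused with `method = "knn"`. The following is a sketch on simulated data (the data, the object names `sim`, `ctrl` and `knn_model`, and the candidate values of `k` are assumptions chosen for illustration). Centering and scaling are added because KNN is distance-based.

```r
library(caret)

# Simulated two-class data (an assumption, so that the example is self-contained)
set.seed(42)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(sim$x1 + sim$x2 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Same resampling scheme as above: 10-fold CV, repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Tune the number of neighbours over a small, arbitrary grid
knn_model <- train(
    y ~ .,
    data = sim,
    method = "knn",
    preProcess = c("center", "scale"),
    tuneGrid = data.frame(k = c(3, 5, 7, 9)),
    trControl = ctrl)

knn_model$results    # resampled accuracy and kappa for each candidate k
knn_model$bestTune   # the k with the best resampled accuracy
```

Here the repeated cross-validation does choose something, namely the value of `k`, whereas for plain `glm` there is no tuning parameter to select.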
