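I am trying to predict whether water is safe to drink. The dataset is available here: https://www.kaggle.com/adityakadiwal/water-potability?select=water_potability.csv. Assume the data frame consists of Ph, Hardness, Solids, Chloramines, and Potability.
I would like to fit a logistic regression with 10-fold cross-validation (10 is only an example; I would like to try other values as well). Computing power permitting, I would also like to repeat this 5 times, each time with a different random 10-fold split, and then choose the best model.
I have come across the k-fold functions and the glm function, but I don't know how to combine them so that the process is repeated 5 times with random splits. Later, I would also like to build something similar with KNN. I would appreciate any help on this.
Some code: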
library(readr)  # provides read_csv()
library(caret)  # provides trainControl() and train()

df <- read_csv("water_potability.csv")
train_model <- trainControl(method = "repeatedcv",
                            number = 10, repeats = 5)
model <- train(Potability ~ ., data = df, method = "regLogistic",
               trControl = train_model)
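However, I would prefer to use non-regularized logistic regression.
You can do the following (using some sample data, since none was provided):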
library(caret)
# Sample data since your post doesn't include sample data
df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# Make sure the response `admit` is a `factor`
df$admit <- factor(df$admit)
# Set up 10-fold CV
train_model <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# Train the model
model <- train(
    admit ~ .,
    data = df,
    method = "glm",
    family = "binomial",
    trControl = train_model)
model
#Generalized Linear Model
#
#400 samples
# 3 predictor
# 2 classes: '0', '1'
#
#No pre-processing
#Resampling: Cross-Validated (10 fold, repeated 5 times)
#Summary of sample sizes: 359, 361, 360, 360, 359, 361, ...
#Resampling results:
#
#  Accuracy   Kappa
#  0.7020447  0.1772786
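The question also mentions KNN and choosing the best model. A minimal sketch of one way to compare the two with caret follows, assuming simulated stand-in data: the names `toy`, `x1`, `x2`, and `y` are hypothetical and not part of the water dataset. The fold assignments are created once with `createMultiFolds()` and passed through `index`, so both models are scored on identical resamples.

```r
library(caret)

# Simulated stand-in data (an assumption for illustration only)
set.seed(1)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "yes", "no"))

# Create the 10-fold x 5-repeat training indices once,
# so every model sees exactly the same splits
set.seed(2)
folds <- createMultiFolds(toy$y, k = 10, times = 5)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     index = folds)

# Plain (non-regularized) logistic regression
fit_glm <- train(y ~ ., data = toy, method = "glm", family = "binomial",
                 trControl = ctrl)

# KNN: predictors are centered and scaled, and 5 candidate values of k are tried
fit_knn <- train(y ~ ., data = toy, method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 5, trControl = ctrl)

# Collect the per-resample accuracy and kappa of both models side by side
cmp <- resamples(list(glm = fit_glm, knn = fit_knn))
summary(cmp)
```

For the KNN fit, `fit_knn$bestTune` should hold the value of `k` that caret selected from the candidates it tried.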
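We can also look at the confusion matrix for good measure. Note that predict(model) without new data predicts on the training data, so the metrics below are in-sample rather than cross-validated: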
confusionMatrix(predict(model), df$admit)
#Confusion Matrix and Statistics
#
#          Reference
#Prediction   0   1
#         0 253  98
#         1  20  29
#
#               Accuracy : 0.705
#                 95% CI : (0.6577, 0.7493)
#    No Information Rate : 0.6825
#    P-Value [Acc > NIR] : 0.1809
#
#                  Kappa : 0.1856
#
# Mcnemar's Test P-Value : 1.356e-12
#
#            Sensitivity : 0.9267
#            Specificity : 0.2283
#         Pos Pred Value : 0.7208
#         Neg Pred Value : 0.5918
#             Prevalence : 0.6825
#         Detection Rate : 0.6325
#   Detection Prevalence : 0.8775
#      Balanced Accuracy : 0.5775
#
#       'Positive' Class : 0
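The confusion matrix above is computed from predictions on the same rows the model was fitted to. If a matrix built from the held-out folds is preferred, one option is to ask caret to keep the resampled predictions. This is a minimal sketch on simulated stand-in data; the names `toy`, `x1`, `x2`, and `y` are hypothetical and used only for illustration.

```r
library(caret)

# Simulated stand-in data (an assumption for illustration only)
set.seed(1)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "yes", "no"))

# savePredictions = "final" keeps the held-out predictions of the final model
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     savePredictions = "final")

set.seed(2)
fit <- train(y ~ ., data = toy, method = "glm", family = "binomial",
             trControl = ctrl)

# fit$pred has one row per observation per repeat (here 300 x 5 rows);
# each prediction was made by a model that did not see that row
cv_cm <- confusionMatrix(fit$pred$pred, fit$pred$obs)
cv_cm
```

Because every observation is held out once per repeat, the cell counts sum to the number of rows times the number of repeats rather than to the number of rows.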