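I want to analyze the data set data(wine), which is available in the R package gclus. How can I split the data set into a training set and a test set in a 70:30 ratio?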
You can split your data like this:
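library(gclus)
data("wine")
# Number of rows for the training set: 70% of the data, rounded down
sample_size <- floor(0.70 * nrow(wine))
# Fix the random seed so the same rows are drawn on every run
set.seed(123)
# Draw the training row indices without replacement
train_index <- sample(seq_len(nrow(wine)), size = sample_size)
# Training rows are the sampled ones; test rows are everything else
train <- wine[train_index, ]
test <- wine[-train_index, ]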
Check the sizes of the data sets:
> nrow(wine)
[1] 178
> nrow(train)
[1] 124
> nrow(test)
[1] 54
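As a rough sketch, the steps above can be wrapped in a small helper and followed by a sanity check that the two parts do not overlap. The function name split_train_test, its arguments, and the toy data frame are assumptions made for illustration (the toy data only stands in for wine so the block runs without gclus installed); they are not part of the original answer.

```r
# Hypothetical helper: split a data frame into train/test parts by row index.
split_train_test <- function(df, train_frac = 0.70, seed = 123) {
  stopifnot(is.data.frame(df), train_frac > 0, train_frac < 1)
  n_train <- floor(train_frac * nrow(df))
  # Guard against an empty training set; df[-integer(0), ] would select no rows
  stopifnot(n_train >= 1)
  set.seed(seed)  # fix the RNG so the same rows are drawn on every run
  train_index <- sample(seq_len(nrow(df)), size = n_train)
  list(
    train = df[train_index, , drop = FALSE],
    test  = df[-train_index, , drop = FALSE]
  )
}

# Stand-in data with 178 rows (the same row count as wine); assumed, not real data.
toy <- data.frame(row = seq_len(178), x = rnorm(178))
parts <- split_train_test(toy)

# The two parts should share no rows and together cover the whole data set.
stopifnot(length(intersect(parts$train$row, parts$test$row)) == 0)
stopifnot(nrow(parts$train) + nrow(parts$test) == nrow(toy))
```

With the real data, the call would be split_train_test(wine) after data("wine"). Keeping the seed as an argument makes it easy to draw a different split while staying reproducible.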
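Here is another approach, a very nice one from @Quinten: first we create an id for each row, then draw the training set with sample_frac(), and finally anti_join() the original wine with train_wine to get the test set: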
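#install.packages("gclus")
library(gclus)
library(dplyr)
data("wine")
# Add a unique id per row so the sampled rows can be matched later
wine <- wine %>%
  mutate(id = row_number())
# Randomly draw 70% of the rows for training
train_wine <- wine %>%
  sample_frac(.70)
# Test set: every row of wine whose id is not in train_wine
test_wine <- anti_join(wine, train_wine, by = 'id')
nrow(train_wine)
nrow(test_wine)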
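In dplyr 1.0.0 and later, sample_frac() is marked as superseded in favour of slice_sample(). A minimal sketch of the same id-based split with slice_sample(prop = ...) might look like the following; the toy data frame and the set.seed() call are assumptions added here (the toy data stands in for wine, and the seed only makes the draw repeatable).

```r
library(dplyr)

# Stand-in data with 178 rows; with the real data, start from data("wine") instead.
toy <- data.frame(x = rnorm(178)) %>%
  mutate(id = row_number())

set.seed(123)  # assumed seed, so the same rows are drawn on every run

# Randomly draw about 70% of the rows for training
train_toy <- toy %>%
  slice_sample(prop = 0.70)

# Test set: rows whose id does not appear in the training set
test_toy <- anti_join(toy, train_toy, by = "id")

nrow(train_toy)
nrow(test_toy)
```

The exact training size depends on how the sampling function rounds 0.70 * 178, so it is worth checking nrow() rather than assuming it matches the base R version.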