R: Random sampling subject to a constraint



I am working with the R programming language. I have the following data:

set.seed(123)
library(dplyr)
# 100 observations spread over 25 ids (ids repeat)
id_sample <- 1:25
id <- sample(id_sample, size = 100, replace = TRUE)
var_1 <- rnorm(100, 100, 100)
var_2 <- rnorm(100, 100, 100)
var_3 <- rnorm(100, 100, 100)
data <- data.frame(id, var_1, var_2, var_3)
# index = order of appearance within each id (the "chronological order")
my_data <- data.frame(data %>% group_by(id) %>% mutate(index = row_number()))
my_data <- my_data[order(my_data$id), ]
# count = number of observations per id
groups <- data.frame(my_data %>% group_by(id) %>% summarise(count = n()))
final_data <- merge(x = my_data, y = groups, by = "id", all.x = TRUE)

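I want to randomly split this data into two data sets in a 70%-30% ratio (data_set_A, data_set_B) in such a way that the "chronological order" within each id is preserved. By this I mean:

  • Suppose id = 1 appears 5 times (id = (1,1,1,1,1), index = (1,2,3,4,5)): data_set_A can contain index = (1,2) and data_set_B can contain index = (3,4,5)

  • But data_set_A cannot contain index = (1,3,5) while data_set_B contains index = (2,4)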
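To make the constraint concrete, here is a minimal sketch of a check, assuming the rule is "for every id, every index in data_set_A comes before every index in data_set_B". The helper name `preserves_order` is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE when, for every id present in both sets,
# the largest index in set A is smaller than the smallest index in set B.
preserves_order <- function(set_a, set_b) {
  max_a <- tapply(set_a$index, set_a$id, max)
  min_b <- tapply(set_b$index, set_b$id, min)
  shared <- intersect(names(max_a), names(min_b))
  all(max_a[shared] < min_b[shared])
}

# Toy data: id = 1 observed five times
toy <- data.frame(id = rep(1, 5), index = 1:5)

# Allowed split: A = (1,2), B = (3,4,5)
preserves_order(toy[toy$index %in% 1:2, ], toy[toy$index %in% 3:5, ])

# Disallowed split: A = (1,3,5), B = (2,4)
preserves_order(toy[toy$index %in% c(1, 3, 5), ], toy[toy$index %in% c(2, 4), ])
```

The first call returns TRUE and the second returns FALSE, matching the two bullet points above.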

I am not sure how to express this constraint in the random sampling. This is what I have so far:

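# Plain 70/30 split: rows are drawn without regard to id or index
n <- as.integer(nrow(final_data) * 0.7)
data_70 <- final_data[sample(nrow(final_data), n), ]
# With no "by" given, anti_join matches on all shared columns
data_30 <- anti_join(final_data, data_70)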

Can someone please show me how to do this?

Thanks!

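# Give each row a unique key so the anti_join below removes exactly the sampled rows
final_data$row_number <- rownames(final_data)
# Sample 70% of the rows within each id (the per-group size is rounded down)
data_70 <- final_data %>% group_by(id) %>% slice_sample(prop = .7)
# Everything that was not sampled becomes the 30% set
data_30 <- final_data %>% anti_join(data_70, by = "row_number")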
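Note that `slice_sample()` draws rows at random within each id, so the sampled indices are not guaranteed to be consecutive. One possible way to enforce the ordering is sketched below. It is a different technique from the code above: it draws one random cut point per id and sends everything up to the cut to data_set_A and the rest to data_set_B. The Binomial(count, 0.7) draw and the `cut_point` column are assumptions made for illustration, and the split is about 70/30 on average rather than exactly. The data is rebuilt in a reduced form so that the block runs on its own.

```r
library(dplyr)

set.seed(123)

# Reduced version of the question's data: id, one variable, index and count
id <- sample(1:25, size = 100, replace = TRUE)
final_data <- data.frame(id, var_1 = rnorm(100, 100, 100)) %>%
  arrange(id) %>%
  group_by(id) %>%
  mutate(index = row_number(), count = n()) %>%
  ungroup()

# Assumption: one random cut point per id, drawn from Binomial(count, 0.7),
# so that on average about 70% of each id's rows fall at or before the cut
cuts <- final_data %>%
  distinct(id, count) %>%
  mutate(cut_point = rbinom(n(), size = count, prob = 0.7))

split_data <- final_data %>% left_join(cuts, by = c("id", "count"))

# Rows up to the cut go to A, rows after the cut go to B
data_set_A <- split_data %>% filter(index <= cut_point) %>% select(-cut_point)
data_set_B <- split_data %>% filter(index > cut_point) %>% select(-cut_point)

# Share of rows that ended up in A
nrow(data_set_A) / nrow(final_data)
```

Because the cut point can be 0 or equal to count, some ids may end up entirely in one set. If an exact 70% per id is preferred over randomness, the `rbinom()` draw could be replaced with `round(0.7 * count)`.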
