This is my data frame:
data<-
ID Group Modules
1 Male Physics
1 Male Chemistry
2 Female Biology
2 Female Physics
2 Female Chemistry
3 Male Physics
3 Male Biology
3 Male Chemistry
4 Male Physics
4 Male Biology
4 Male Chemistry
5 Male Physics
5 Male Biology
5 Male Chemistry
6 Male Physics
6 Male Biology
6 Male Chemistry
7 Female Physics
7 Female Biology
8 Female Chemistry
8 Male Physics
8 Male Biology
9 Male Chemistry
9 Male Physics
10 Male Biology
10 Male Chemistry
10 Male Physics
11 Male Biology
11 Male Chemistry
11 Male Physics
12 Female Biology
12 Female Chemistry
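In the data above there are more males (n=9) than females (n=3). I want to randomly select 3 males without replacement, so that I end up with 3 males and 3 females. I also want to keep the repeated IDs, so my desired result is: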
newdata<-
ID Group Modules
1 Male Physics
1 Male Chemistry
2 Female Biology
2 Female Physics
2 Female Chemistry
3 Male Physics
3 Male Biology
3 Male Chemistry
7 Female Physics
7 Female Biology
7 Female Chemistry
12 Female Physics
12 Female Biology
6 Male Physics
6 Male Biology
6 Male Chemistry
Here is my code:
samples_per_group<-6
newdata<-data%>% group_by(Group)%>%slice(sample(n(),min(samples_per_group, n())))%>%ungroup()
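When I run this, it picks a sample size of 6 (3 per group), but it only takes one row from each participant instead of returning all of that participant's rows. Basically, I want to select 3 IDs in each group, regardless of how many times each ID is repeated. Any help is welcome. Thanks.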
If you want to sample IDs, you need to extract the IDs and sample those:
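library(dplyr)

# One row per participant (ID) together with its group
ids = data %>%
  distinct(ID, Group)

# Number of distinct IDs in the smallest group
smallest_group = ids %>%
  count(Group) %>%
  pull(n) %>%
  min()

# Sample that many IDs per group, then join back to recover every row for those IDs
newdata = ids %>%
  group_by(Group) %>%
  sample_n(size = smallest_group) %>%
  ungroup() %>%
  left_join(data, by = c("ID", "Group"))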
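Something like the above should work. Getting a single number out of the groups in the middle of a dplyr chain is awkward. It is doable, but I think it is simpler to break the pipe and extract the number. We sample 3 (or however many) IDs per group, then join back to the main data to get all the rows corresponding to those IDs.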
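For comparison, here is a rough sketch of what the single-chain version mentioned above might look like. The toy data frame `toy` and the helper columns `n_ids` and `target` are made up for illustration and do not come from the question; `set.seed()` is there only to make the sketch reproducible.

```r
library(dplyr)

# Hypothetical toy data: 4 male IDs and 2 female IDs, several rows per ID
toy <- data.frame(
  ID      = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6),
  Group   = c("Male", "Male", "Female", "Female", "Female", "Male", "Male",
              "Male", "Male", "Male", "Female", "Female"),
  Modules = c("Physics", "Chemistry", "Biology", "Physics", "Chemistry",
              "Physics", "Biology", "Chemistry", "Physics", "Biology",
              "Biology", "Chemistry")
)

set.seed(1)  # only so that the sketch is reproducible

balanced <- toy %>%
  distinct(ID, Group) %>%                 # one row per ID
  group_by(Group) %>%
  mutate(n_ids = n()) %>%                 # distinct IDs in this group
  ungroup() %>%
  mutate(target = min(n_ids)) %>%         # size of the smallest group
  group_by(Group) %>%
  filter(row_number() %in% sample(n(), first(target))) %>%  # sample IDs per group
  ungroup() %>%
  select(ID, Group) %>%
  inner_join(toy, by = c("ID", "Group"))  # bring back all rows for sampled IDs

# Each group should now contain the same number of distinct IDs
balanced %>% distinct(ID, Group) %>% count(Group)
```

This keeps everything in one pipe at the cost of two throwaway columns, which is the kind of overhead that makes breaking the pipe the easier option to read.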