r - Flag every nth element by group in data.table



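My data consists of a set of observations on different groups, and each group has a different number of observations. I want to create a variable that flags selected observations with a "1" for further manual QA/QC. The flags should be regularly spaced within a group, but the spacing may differ between groups. The spacing is derived by dividing each group's length by a constant (5 in this example).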

The data looks like this:

dt<-data.table(places=c(rep("A",10), rep("B",20))) #the data
dt2<-data.table(places=c("A","B"), spacing=c(2,4)) #the spacings by group to apply to the data
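The spacings in dt2 do not have to be typed by hand. A minimal sketch, assuming the rule is "group size divided by 5, rounded down" (the rounding and the name dt2_derived are assumptions made for illustration):

```r
library(data.table)

dt <- data.table(places = c(rep("A", 10), rep("B", 20)))

# Assumed rule: spacing = floor(group size / 5).
# .N is the number of rows in the current group.
dt2_derived <- dt[, .(spacing = floor(.N / 5)), by = places]
dt2_derived
```

For the sample data this yields a spacing of 2 for group A and 4 for group B, the same values as dt2.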

Then I would apply some code to generate the flag (or sequence):

dt$sequence<- ????

so that the result looks like:

places  sequence
A       1
A   
A       1
A   
...
B       1
B   
B   
B   

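Essentially, I want each group to "count" up to the ideal spacing already determined for that group, and to flag a "1" each time the count recycles. I just can't figure out how to feed the spacing and group combinations into data.table.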

Here is another option:

dt[, sq := dt2[.SD, on=.(places), +((rowid(i.places)-1) %% spacing == 0L)]]

Output:

places sq
1:      A  1
2:      A  0
3:      A  1
4:      A  0
5:      A  1
6:      A  0
7:      A  1
8:      A  0
9:      A  1
10:      A  0
11:      B  1
12:      B  0
13:      B  0
14:      B  0
15:      B  1
16:      B  0
17:      B  0
18:      B  0
19:      B  1
20:      B  0
21:      B  0
22:      B  0
23:      B  1
24:      B  0
25:      B  0
26:      B  0
27:      B  1
28:      B  0
29:      B  0
30:      B  0

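You can feed data.table the spacing and group combination with the join dt2[.SD, on=.(places)], then use rowid to generate a within-group sequence, and finally take the modulo to find the rows where the sequence value minus one is divisible by the spacing.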
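The three steps can also be written out one at a time. The sketch below is not the one-liner from the answer: it swaps in an update join and two helper columns (spacing and idx, names chosen here for illustration) so each intermediate result can be inspected.

```r
library(data.table)

dt  <- data.table(places = c(rep("A", 10), rep("B", 20)))
dt2 <- data.table(places = c("A", "B"), spacing = c(2, 4))

# Step 1: look up each row's spacing with an update join on 'places'.
dt[dt2, on = .(places), spacing := i.spacing]

# Step 2: number the rows within each group (1, 2, 3, ...).
dt[, idx := rowid(places)]

# Step 3: flag the first row of every block of 'spacing' rows.
dt[, sq := as.integer((idx - 1L) %% spacing == 0)]

# Drop the helper columns so only 'places' and 'sq' remain.
dt[, c("spacing", "idx") := NULL]
```

Subtracting 1 before the modulo is what puts the flag on the first row of each group; without it the flag would land on rows 2, 4, ... for A and 4, 8, ... for B.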

I arrived at a data.table solution:

dtest[, sequence := rep(seq_len(floor(.N/5)),length.out=.N), by = places]
dtest[sequence!=1,sequence:=NA]

I had never used length.out before...
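For readers who have not met length.out either, here is a small base R sketch of what rep() does in that solution. The variable names are illustrative, and the lengths 10 and 20 mirror groups A and B.

```r
# rep() recycles its input until the result has 'length.out' elements.
short_cycle <- rep(seq_len(2), length.out = 10)  # 1 2 1 2 1 2 1 2 1 2
long_cycle  <- rep(seq_len(4), length.out = 20)  # 1 2 3 4 repeated 5 times

# Positions where the counter restarts at 1 are the rows that keep a flag.
which(short_cycle == 1)
which(long_cycle == 1)
```

Every element other than 1 is then set to NA by the second line of the solution, which leaves one flag per cycle.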

Per our conversation, here are dplyr solutions. Each one begins with the following setup:

library(data.table)
library(dplyr)

dt <- data.table(places=c(rep("A",10), rep("B",20))) #the data

For the two approaches discussed:

  1. Universal divisor (here 5):
# The divisor to be applied universally across all groups.
universal_divisor <- 5

# The vectorized function you specified.
f <- function(group_length, divisor){
  return(floor(group_length / divisor))
}

dt_universal <- dt %>%
  # Group in order to index each row WITHIN its group.
  group_by(places) %>%
  # Mark a 1 at each point calculated by the given function 'f' from the
  # group size, against the universal divisor; otherwise make blank (NA).
  mutate(sequence = if_else(row_number() %% f(n(), universal_divisor) == 0,
                            1, as.numeric(NA))) %>%
  ungroup() %>% as.data.table()
  2. Custom spacing:
# Your spacings by group to apply to the data.
dt2 <- data.table(places=c("A","B"), spacing=c(2,4))

dt_custom <- dt %>%
  # Match each row to the custom spacing value for its 'place'.
  left_join(dt2, by = "places") %>%
  # Group in order to index each row WITHIN its group.
  group_by(places) %>%
  # Mark with a 1 at the desired spacing; otherwise make blank (NA).
  transmute(places,
            sequence = if_else(row_number() %% spacing == 0,
                               1, as.numeric(NA))) %>%
  ungroup() %>% as.data.table()

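Each approach will output the data.table below. Although some of these operations can be done more efficiently with data.table, I personally find the dplyr workflow very transparent and flexible.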

places sequence
1:      A       NA
2:      A        1
3:      A       NA
4:      A        1
5:      A       NA
6:      A        1
7:      A       NA
8:      A        1
9:      A       NA
10:      A        1
11:      B       NA
12:      B       NA
13:      B       NA
14:      B        1
15:      B       NA
16:      B       NA
17:      B       NA
18:      B        1
19:      B       NA
20:      B       NA
21:      B       NA
22:      B        1
23:      B       NA
24:      B       NA
25:      B       NA
26:      B        1
27:      B       NA
28:      B       NA
29:      B       NA
30:      B        1
places sequence
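For comparison, the custom-spacing approach can be sketched in data.table alone. This is one possible translation rather than part of the original answer. It assumes a data.table version that provides fifelse() (1.12.4 or later) and reuses the name dt_custom only for convenience.

```r
library(data.table)

dt  <- data.table(places = c(rep("A", 10), rep("B", 20)))
dt2 <- data.table(places = c("A", "B"), spacing = c(2, 4))

# Join the spacing onto every row of dt; row order follows dt.
dt_custom <- dt2[dt, on = .(places)]

# Flag rows whose within-group index is a multiple of the spacing;
# all other rows stay NA, as in the dplyr output.
dt_custom[, sequence := fifelse(rowid(places) %% spacing == 0, 1, NA_real_)]

# Remove the helper column so only 'places' and 'sequence' remain.
dt_custom[, spacing := NULL]
```

As in the dplyr version, the flag falls on the last row of each block (rows 2, 4, ... for A), not on the first row as in the rowid answer earlier in the thread.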
