I asked a similar question before, but I need some further output and decided to post a new question.
I have a data.table object like this:
library(data.table)
cells <- c(100, 1,1980,1,0,1,1,0,1,0,
150, 1,1980,1,1,1,0,0,0,1,
99 , 1,1980,1,1,1,1,0,0,0,
899, 1,1980,0,1,0,1,1,1,1,
789, 1,1982,1,1,1,0,1,1,1 )
colname <- c("number","sex", "birthy", "2004","2005", "2006", "2007", "2008", "2009","2010")
rowname <- c("1","2","3","4","5")
y <- matrix(cells, nrow=5, ncol=10, byrow=TRUE, dimnames = list(rowname,colname))
y <- data.table(y, keep.rownames = TRUE)
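A value of 1 in the 2004 column means the person was continuously insured during 2004. People who were insured for three consecutive years can take part in the study. I need the subset of this data.table containing all observations for which the following condition is true: 2004+2005+2006 = 3 or 2005+2006+2007 = 3 or 2006+2007+...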
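As a rough illustration of the condition itself (not the approach taken in the code that follows), the "any three-year window sums to 3" test can be sketched in base R on a plain matrix. The names `m`, `starts`, `full_window` and `qualifies` are made up for this sketch, and the matrix only repeats the year columns of the example data.

```r
# Year columns of the five example rows, rebuilt as a plain matrix.
yrs <- as.character(2004:2010)
m <- matrix(c(1, 0, 1, 1, 0, 1, 0,
              1, 1, 1, 0, 0, 0, 1,
              1, 1, 1, 1, 0, 0, 0,
              0, 1, 0, 1, 1, 1, 1,
              1, 1, 1, 0, 1, 1, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(as.character(1:5), yrs))

# Column index at which each three-year window starts (2004, ..., 2008).
starts <- seq_len(ncol(m) - 2L)

# One column per window: TRUE where all three years in that window are 1.
full_window <- sapply(starts, function(i) rowSums(m[, i:(i + 2L)]) == 3)

# A person qualifies if at least one window is fully insured.
qualifies <- rowSums(full_window) > 0
m[qualifies, ]  # rows 2, 3, 4 and 5; row 1 has no run of three
```

This only answers "who qualifies"; it neither resets isolated 1s nor records the start year.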
#using melt and rle function to restructure the data
tmp <- melt(y, id = "rn", measure.vars = patterns("^20"),
variable.factor = FALSE, variable.name = "year")[, rle(value), by = rn]
#subset data based on condition, keeping only the first relevant sequence
tmp2 <- tmp[(values == 1 & lengths >= 3), .(rn,lengths)][, .SD[1,], by=rn]
##selecting only rows with value=1 and min 3 in a row
##keeping only the variable rn
tmp3 <- tmp[values == 1, which(max(lengths) >= 3), by = rn]$rn
##using the row-number to select observations from data.table
##merging length of sequence
dt <- merge(y[as.integer(tmp3)],tmp2, by="rn")
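Is there a way to turn all 1s into 0s if they are not part of a sequence? For example, for rn==4 the variable "2005" needs to be zero.
I also need a new variable "begy" containing the year in which the sequence begins, for example rn==5 and begy==2004. Any suggestions would be greatly appreciated...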
New solution:
# define a custom function in order to only keep the sequences
# with 3 (or more) consecutive years
rle3 <- function(x) {
r <- rle(x)
r$values[r$lengths < 3 & r$values == 1] <- 0
inverse.rle(r)
}
# replace all '1'-s that do not belong to a sequence of at least 3 to '0'
# create 'begy'-variable
melt(y, id = 1:4, measure.vars = patterns("^20"),
variable.factor = FALSE, variable.name = "year"
)[, value := rle3(value), by = rn
][, begy := year[value == 1][1], rn
][, dcast(.SD[!is.na(begy)], ... ~ year, value.var = "value")]
This gives:
   rn number sex birthy begy 2004 2005 2006 2007 2008 2009 2010
1:  2    150   1   1980 2004    1    1    1    0    0    0    0
2:  3     99   1   1980 2004    1    1    1    1    0    0    0
3:  4    899   1   1980 2007    0    0    0    1    1    1    1
4:  5    789   1   1982 2004    1    1    1    0    1    1    1
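To see what the helper does in isolation, here is a small sketch that repeats the `rle3()` definition and applies it to the year values of two of the example rows. The vector names `row1` and `row4` are just labels chosen for this sketch.

```r
# Same helper as in the answer: zero out runs of 1s shorter than 3.
rle3 <- function(x) {
  r <- rle(x)                                   # run lengths and run values
  r$values[r$lengths < 3 & r$values == 1] <- 0  # short runs of 1 become 0
  inverse.rle(r)                                # expand back to full length
}

row4 <- c(0, 1, 0, 1, 1, 1, 1)  # 2004..2010 for rn == 4
row1 <- c(1, 0, 1, 1, 0, 1, 0)  # 2004..2010 for rn == 1

rle3(row4)  # the isolated 1 in 2005 is reset; the 2007-2010 run is kept
rle3(row1)  # no run of three, so every value becomes 0
```

Because `rle()` works on consecutive elements, the helper relies on the values being in year order within each `rn`, which `melt()` preserves here.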
Old solution:
# define a custom function in order to only keep the sequences
# with 3 (or more) consecutive years
rle3 <- function(x) {
r <- rle(x)
r$values[r$lengths < 3 & r$values == 1] <- 0
inverse.rle(r)
}
# create a reference 'data.table' with only the row to keep
# and the start year of the (first) sequence (row 5 has 2 sequences of 3)
x <- melt(y, id = "rn", measure.vars = patterns("^20"),
variable.factor = FALSE, variable.name = "year"
)[, value := rle3(value), by = rn
][value == 1, .SD[1], rn]
# join 'x' with 'y' to add 'begy' and filter out the row with no sequences of 3
y[x, on = "rn", begy := year][!is.na(begy)]
This gives:
   rn number sex birthy 2004 2005 2006 2007 2008 2009 2010 begy
1:  2    150   1   1980    1    1    1    0    0    0    1 2004
2:  3     99   1   1980    1    1    1    1    0    0    0 2004
3:  4    899   1   1980    0    1    0    1    1    1    1 2007
4:  5    789   1   1982    1    1    1    0    1    1    1 2004
An option using a rolling window:
#convert into long format (keeping all id columns) and convert data types
DT <- melt(y, id=c("rn", "number", "sex", "birthy"), measure.vars=patterns("^20"),
variable.factor=FALSE, variable.name="YEAR",
value.factor=FALSE, value.name="VALUE")
cols <- c("YEAR", "VALUE")
DT[, (cols) := lapply(.SD, as.integer), .SDcols=cols]
setorder(DT, rn, YEAR)
#for each row, sum the value that is between current year and 2 years from now
winsize <- 3L
DT[, roll3y := vapply(YEAR,
function(k) sum(VALUE[between(YEAR, k, k + winsize - 1L)]),
integer(1L)),
by=.(rn)]
#> Is there a way to turn all 1 to 0 if they are not part of a sequence?
#reset all values to 0 and set only those in sequence to 1
DT[, VALUE := 0L][
c(outer(DT[roll3y==winsize, which=TRUE], 0L:2L, `+`)), VALUE := 1L]
#> I also need a new variable "begy" containing the year of the beginning of the sequence.
#identify the year that a sequence of 3 ones appears
longDT <- rbindlist(list(DT,
DT[roll3y==winsize, .(YEAR="begy", VALUE=min(YEAR)), by=.(rn,number,sex,birthy)]),
use.names=TRUE, fill=TRUE)
#get desired output
dcast(longDT, rn + number + sex + birthy ~ YEAR, value.var="VALUE")
Output:
rn number sex birthy 2004 2005 2006 2007 2008 2009 2010 begy
1: 1 100 1 1980 0 0 0 0 0 0 0 NA
2: 2 150 1 1980 1 1 1 0 0 0 0 2004
3: 3 99 1 1980 1 1 1 1 0 0 0 2004
4: 4 899 1 1980 0 0 0 1 1 1 1 2007
5: 5 789 1 1982 1 1 1 0 1 1 1 2004
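As a possible alternative to the `vapply()`/`between()` window above, a left-aligned rolling sum can be sketched with `data.table::frollsum()`. This is a different technique from the one in the answer and assumes the years are consecutive with no gaps; the names `years`, `row4`, `win` and `begy` exist only in this sketch.

```r
library(data.table)

years <- 2004:2010
row4  <- c(0, 1, 0, 1, 1, 1, 1)  # 2004..2010 for rn == 4

# Sum of the current year and the two following years;
# the last two positions have no complete window and are NA.
win <- frollsum(row4, n = 3L, align = "left")

# First year whose three-year window is fully insured.
begy <- years[which(win == 3)[1L]]
begy
```

Unlike the `between()` version, which compares actual year values, this sketch counts positions, so a missing year would silently shift the window.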
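The OP has requested
- to turn all 1s into 0s if they are not part of a sequence of 3 or more consecutive years,
- to add a new column containing the year in which "the" sequence begins.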
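Note that the second requirement is ambiguous because there may be more than one sequence of 3 or more consecutive years, e.g., in row 5. Here, we take the start year of the first (oldest) sequence.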
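The ambiguity can be made concrete with a short base R sketch that lists every start year of a qualifying run for row 5. The variable names (`row5`, `run_start`, `all_begy`) are illustrative only and do not appear in the answer.

```r
years <- 2004:2010
row5  <- c(1, 1, 1, 0, 1, 1, 1)  # 2004..2010 for rn == 5

r <- rle(row5)

# Position at which each run starts within the vector.
run_start <- cumsum(c(1, head(r$lengths, -1)))

# Start years of all runs of 1s lasting at least 3 years.
all_begy <- years[run_start[r$values == 1 & r$lengths >= 3]]
all_begy     # two candidates for row 5
all_begy[1]  # the first (oldest) one, which is the convention used here
```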
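The solution below
- reshapes from wide to long format,
- computes the length of each streak of consecutive years,
- turns 1s into 0s if they are not part of a sequence of 3 or more consecutive years,
- picks the start year of the first sequence,
- drops rows without a qualifying sequence (no begy found), and
- finally, reshapes back to wide format.
Neither a rolling window nor a custom function is required.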
library(data.table)
melt(y, , patterns("^\\d"))[
order(rn), N := .N, by = .(rleid(value), rn)][
value == 1 & N < 3, value := 0][
, begy := first(variable[value == 1]), by = rn][
, dcast(.SD[!is.na(begy), -"N"], ... ~ variable)]
   rn number sex birthy begy 2004 2005 2006 2007 2008 2009 2010
1:  2    150   1   1980 2004    1    1    1    0    0    0    0
2:  3     99   1   1980 2004    1    1    1    1    0    0    0
3:  4    899   1   1980 2007    0    0    0    1    1    1    1
4:  5    789   1   1982 2004    1    1    1    0    1    1    1
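The streak-length step of the chain above can be sketched on a single vector. This uses `rleid()` from data.table together with base R's `ave()` instead of the grouped `N := .N` assignment, so it illustrates the idea rather than copying the code above; `x`, `id` and `n` are names chosen for the sketch.

```r
library(data.table)

x <- c(0, 1, 0, 1, 1, 1, 1)  # 2004..2010 for rn == 4

# Run id: increases by one each time the value changes.
id <- rleid(x)

# Length of the run each element belongs to.
n <- as.integer(ave(x, id, FUN = length))

# Reset 1s that sit in a run shorter than 3.
x[x == 1 & n < 3] <- 0
x
```

In the full chain, adding `rn` to the grouping keeps a streak from spilling over from one person to the next.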