R: Iteratively replace values within groups in a conditional cummin fashion



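A subset of a very large dataset has two dimensions: one is the group ORG, the other is the distance dist. For example:

  1. Row 3 means that within a 15 km radius (in some coordinate system) there are no (N=0) French firms.
  2. In row 6, there is one (N=1) French firm, established in 1992 (FirstEntry=1992), within a 30 km radius (in some coordinate system).

I need to efficiently generate a new column FirstEntry2, as follows: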

    ORG dist N FirstEntry FirstEntry2
 1: FRA    5 0         NA          NA
 2: FRA   10 0         NA          NA
 3: FRA   15 0         NA          NA
 4: FRA   20 0         NA          NA
 5: FRA   25 0         NA          NA
 6: FRA   30 1       1992        1992 # the first valid firm A w/in 30km radius
 7: FRA   35 2       1994        1992 # firm A must be earliest w/in 35km as well, so replace this with 1992
 8: FRA   40 2       1994        1992 # the same as previous row
 9: FRA   45 2       1994        1992 # the same as previous row
10: FRA   99 2       1994        1992 # the same as previous row
11: JPN    5 0         NA          NA
12: JPN   10 0         NA          NA
13: JPN   15 0         NA          NA
14: JPN   20 0         NA          NA
15: JPN   25 0         NA          NA
16: JPN   30 0         NA          NA
17: JPN   35 1       1995        1995 # w/in 35km, this is earliest, though afar there's a firm est. in 1992
18: JPN   40 2       1992        1992 # so, FirstEntry2 in this row no need to be replaced
19: JPN   45 2       1992        1992 # the same reason, no replace
20: JPN   99 2       1992        1992 # the same reason, no replace
21: DEU    5 0         NA          NA
22: DEU   10 1       1998        1998 # the first valid firm C, w/in 10km radius
23: DEU   15 2       1999        1998 # this firm C must be earliest w/in 15km as well, so replace this with 1998
24: DEU   20 2       1999        1998 # the same as previous row
25: DEU   25 2       1999        1998 # the same as previous row
26: DEU   30 2       1999        1998 # the same as previous row
27: DEU   35 2       1999        1998 # the same as previous row
28: DEU   40 2       1999        1998 # the same as previous row
29: DEU   45 2       1999        1998 # the same as previous row
30: DEU   99 2       1999        1998 # the same as previous row
# Sorry, there were mistakes when I posted it here at first. (edited)
library(data.table)

test <- data.table(ORG = c(rep("FRA", 10), rep("JPN", 10), rep("DEU", 10)),
                   dist = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 99,
                            5, 10, 15, 20, 25, 30, 35, 40, 45, 99,
                            5, 10, 15, 20, 25, 30, 35, 40, 45, 99),
                   N = c(0L, 0L, 0L, 0L, 0L, 1L, 2L, 2L, 2L, 2L,
                         0L, 0L, 0L, 0L, 0L, 0L, 1L, 2L, 2L, 2L,
                         0L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
                   FirstEntry = c(NA, NA, NA, NA, NA, 1992, 1994, 1994, 1994, 1994,
                                  NA, NA, NA, NA, NA, NA, 1995, 1992, 1992, 1992,
                                  NA, 1998, rep(1999, 8)),
                   FirstEntry2 = c(NA, NA, NA, NA, NA, 1992, 1992, 1992, 1992, 1992,
                                   NA, NA, NA, NA, NA, NA, 1995, 1992, 1992, 1992,
                                   NA, rep(1998, 9)))

I tried something like this, but it does not give the desired result:

test[, FirstEntry2 := shift(FirstEntry), by = .(ORG, cumsum(c(1, +(FirstEntry > shift(FirstEntry) & !is.na(FirstEntry))[-1])))] 

How should I do this correctly? Many thanks!
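One plausible reason the attempt above goes wrong can be sketched on a single small vector. The vector `x` and the names `prev`, `flag` and `key` are made up for illustration, and `prev` is a base-R stand-in for `shift(x)`. Comparing against NA yields NA, and `cumsum()` carries that NA forward, so the grouping key stops separating rows once the first non-NA year appears.

```r
# Hypothetical mini example: two NA rows followed by three entry years.
x <- c(NA, NA, 1992, 1994, 1994)

# Previous value of x (what shift(x) would return with default settings).
prev <- c(NA, head(x, -1))

# Same condition as in the attempt: "value increased and is not NA".
flag <- x > prev & !is.na(x)
flag
# FALSE FALSE    NA  TRUE FALSE

# Grouping key built the same way as in the attempt.
key <- cumsum(c(1, +flag[-1]))
key
# 1  1 NA NA NA
```

Since every row from the first valid year onward shares the key NA, `shift(FirstEntry)` within that group only lags the column by one row and never takes a running minimum.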

I found a solution:

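# Replace every NA with a large sentinel so that cummin() does not propagate NA
for (col in names(test)) set(test, which(is.na(test[[col]])), col, value = 9999)
# Running minimum of FirstEntry within each ORG (rows are assumed sorted by dist)
test[, FirstEntry3 := cummin(FirstEntry), by = .(ORG)]
# Compare with the expected column
identical(test$FirstEntry2, test$FirstEntry3)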

No!! My brain just wasn't working...
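The sentinel approach above overwrites the NAs in the table with 9999. If the NAs should stay, a possible variant is to use Inf as a temporary placeholder inside the grouped expression and turn it back into NA afterwards. This is only a sketch: the column name FirstEntry3 follows the solution above, the vector `expected` is just the FirstEntry2 column from the question, and the rows are assumed to be sorted by dist within each ORG.

```r
library(data.table)

# cummin() propagates NA: once an NA is seen, all later elements are NA too.
cummin(c(NA, 1992, 1994))
# NA NA NA

# Same example data as in the question (only the columns needed here).
test <- data.table(
  ORG  = rep(c("FRA", "JPN", "DEU"), each = 10),
  dist = rep(c(5, 10, 15, 20, 25, 30, 35, 40, 45, 99), times = 3),
  FirstEntry = c(rep(NA, 5), 1992, rep(1994, 4),
                 rep(NA, 6), 1995, rep(1992, 3),
                 NA, 1998, rep(1999, 8))
)

# Expected result, copied from the FirstEntry2 column in the question.
expected <- c(rep(NA, 5), rep(1992, 5),
              rep(NA, 6), 1995, rep(1992, 3),
              NA, rep(1998, 9))

# Running minimum per ORG: treat NA as Inf so it never wins,
# then map the leftover Inf placeholders back to NA.
test[, FirstEntry3 := {
  x <- cummin(replace(FirstEntry, is.na(FirstEntry), Inf))
  replace(x, is.infinite(x), NA)
}, by = ORG]

all.equal(test$FirstEntry3, expected)
```

If the table is not already ordered, calling `setorder(test, ORG, dist)` first would be needed, since `cummin()` depends on row order.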
