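A subset of a very large dataset has two dimensions: one is the group ORG, the other is the distance dist. For example:
- Row 3 means that within a 15 km radius (in some coordinate system) there are no (N=0) French firms.
- In row 6, there is one (N=1) French firm, established in 1992 (FirstEntry=1992), within a 30 km radius (in some coordinate system).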
I need to efficiently generate a new column, FirstEntry2, as follows:
    ORG dist N FirstEntry FirstEntry2
 1: FRA    5 0         NA          NA
 2: FRA   10 0         NA          NA
 3: FRA   15 0         NA          NA
 4: FRA   20 0         NA          NA
 5: FRA   25 0         NA          NA
 6: FRA   30 1       1992        1992 # the first valid firm A w/in 30km radius
 7: FRA   35 2       1994        1992 # firm A must be earliest w/in 35km as well, so replace this with 1992
 8: FRA   40 2       1994        1992 # the same as previous row
 9: FRA   45 2       1994        1992 # the same as previous row
10: FRA   99 2       1994        1992 # the same as previous row
11: JPN    5 0         NA          NA
12: JPN   10 0         NA          NA
13: JPN   15 0         NA          NA
14: JPN   20 0         NA          NA
15: JPN   25 0         NA          NA
16: JPN   30 0         NA          NA
17: JPN   35 1       1995        1995 # w/in 35km, this is earliest, though afar there's a firm est. in 1992
18: JPN   40 2       1992        1992 # so, FirstEntry2 in this row no need to be replaced
19: JPN   45 2       1992        1992 # the same reason, no replace
20: JPN   99 2       1992        1992 # the same reason, no replace
21: DEU    5 0         NA          NA
22: DEU   10 1       1998        1998 # the first valid firm C, w/in 10km radius
23: DEU   15 2       1999        1998 # this firm C must be earliest w/in 15km as well, so replace this with 1998
24: DEU   20 2       1999        1998 # the same as previous row
25: DEU   25 2       1999        1998 # the same as previous row
26: DEU   30 2       1999        1998 # the same as previous row
27: DEU   35 2       1999        1998 # the same as previous row
28: DEU   40 2       1999        1998 # the same as previous row
29: DEU   45 2       1999        1998 # the same as previous row
30: DEU   99 2       1999        1998 # the same as previous row
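Put as a rule, FirstEntry2 looks like the running minimum of FirstEntry within each ORG as dist grows, staying NA until the first firm appears. Here is a minimal base-R sketch of that reading, applied to the JPN values only. The helper name running_min_na is hypothetical and is used purely to spell out the rule; it is not meant as an efficient solution.

```r
# Hypothetical helper: running minimum that stays NA until the first non-NA value.
running_min_na <- function(x) {
  out <- rep(NA_real_, length(x))
  best <- NA_real_
  for (i in seq_along(x)) {
    # Update the best (earliest) year seen so far, skipping NA entries
    if (!is.na(x[i]) && (is.na(best) || x[i] < best)) best <- x[i]
    out[i] <- best
  }
  out
}

# FirstEntry for JPN, rows 11 to 20 of the table above
jpn <- c(NA, NA, NA, NA, NA, NA, 1995, 1992, 1992, 1992)
running_min_na(jpn)
```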
# Sorry, there were mistakes when I posted it here at first. (edited)
library(data.table)

test <- data.table(
  ORG = c(rep("FRA", 10), rep("JPN", 10), rep("DEU", 10)),
  dist = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 99,
           5, 10, 15, 20, 25, 30, 35, 40, 45, 99,
           5, 10, 15, 20, 25, 30, 35, 40, 45, 99),
  N = c(0L, 0L, 0L, 0L, 0L, 1L, 2L, 2L, 2L, 2L,
        0L, 0L, 0L, 0L, 0L, 0L, 1L, 2L, 2L, 2L,
        0L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  FirstEntry = c(NA, NA, NA, NA, NA, 1992, 1994, 1994, 1994, 1994,
                 NA, NA, NA, NA, NA, NA, 1995, 1992, 1992, 1992,
                 NA, 1998, rep(1999, 8)),
  FirstEntry2 = c(NA, NA, NA, NA, NA, 1992, 1992, 1992, 1992, 1992,
                  NA, NA, NA, NA, NA, NA, 1995, 1992, 1992, 1992,
                  NA, rep(1998, 9))
)
I tried something like this, but it does not give the desired result:
test[, FirstEntry2 := shift(FirstEntry), by = .(ORG, cumsum(c(1, +(FirstEntry > shift(FirstEntry) & !is.na(FirstEntry))[-1])))]
How should I do this properly? Many thanks!
I found a solution:
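# Replace every NA in every column with the sentinel 9999 so that cummin() never sees an NA
for (col in names(test)) set(test, which(is.na(test[[col]])), col, value = 9999)
# Running minimum of FirstEntry within each ORG
test[, FirstEntry3 := cummin(FirstEntry),
     by = .(ORG)]
# FirstEntry2 had its NAs replaced by 9999 as well, so the two columns can be compared directly
identical(test$FirstEntry2, test$FirstEntry3)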
D'oh! My brain just wasn't working...
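One drawback of the 9999 sentinel is that it stays in the data and touches every column. A possible variant keeps the NAs by treating them as +Inf only inside cummin() and converting them back afterwards. The sketch below uses a small stand-in table, dt, built from the FRA and JPN rows for dist 25 to 45; dt and the local names x and m are assumptions made for illustration, not part of the original data.

```r
library(data.table)

# Small stand-in for `test`: only the columns needed here, two groups
dt <- data.table(
  ORG = rep(c("FRA", "JPN"), each = 5),
  dist = rep(c(25, 30, 35, 40, 45), times = 2),
  FirstEntry = c(NA, 1992, 1994, 1994, 1994,
                 NA, NA, 1995, 1992, 1992)
)

# Assumed approach: NA -> Inf before cummin(), then Inf -> NA afterwards,
# so rows before the first firm stay NA and no sentinel is left in the data
dt[, FirstEntry3 := {
  x <- FirstEntry
  x[is.na(x)] <- Inf
  m <- cummin(x)
  m[is.infinite(m)] <- NA_real_
  m
}, by = ORG]

dt
```

Run on a freshly built test (before the NA replacement loop), the same j expression should produce a column that matches FirstEntry2, NAs included.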