r-modify函数,用于查找每行数字连续运行的最大长度



我想在这里修改问题的答案,或者有一个新的解决方案来包含另一列,该列显示"0"的第二大连续运行。我的示例数据和代码如下,函数在month列上运行,second largest run列是我希望添加的内容。我正在处理一个大型数据集,所以效率越高越好,任何想法都会受到赞赏,谢谢。

样本数据

structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9), V1 = c("A", 
"B", "A", "B", "B", "A", "A", "B", "B"), V2 = c(21, 233, 185, 
85, 208, 112, 238, 66, 38), V3 = c(149, 250, 218, 104, 62, 19, 
175, 168, 28), Jan = c(10, 20, 10, 12, 76, 28, 137, 162, 101), 
Feb = c(20, 25, 15, 0, 89, 0, 152, 177, 119), March = c(0, 
28, 20, 14, 108, 0, 165, 194, 132), April = c(0, 34, 25, 
16, 125, 71, 181, 208, 149), May = c(25, 0, 30, 22, 135, 
0, 191, 224, 169), June = c(29, 0, 35, 24, 145, 0, 205, 244, 
187), July = c(34, 0, 40, 28, 163, 0, 217, 256, 207), August = c(37, 
0, 45, 29, 173, 0, 228, 276, 221), Sep = c(0, 39, 50, 31, 
193, 0, 239, 308, 236), Oct = c(0, 48, 55, 35, 210, 163, 
252, 0, 247), Nov = c(48, 55, 60, 40, 221, 183, 272, 0, 264
), Dec = c(50, 60, 65, 45, 239, 195, 289, 0, 277), `Second largest run` = c(1, 
NA, NA, NA, NA, 2, NA, NA, NA), result = c(2, 4, -Inf, 1, 
-Inf, 5, -Inf, 3, -Inf)), row.names = c(NA, -9L), class = c("tbl_df", 
"tbl", "data.frame"))

代码

most_consecutive_val = function(x, val = 0) {
with(rle(x), max(lengths[values == val]))
}

test$result=apply(test[,-c(1:4,17)], MARGIN = 1, most_consecutive_val)

我们希望sort输出,然后提取所需索引,而不是从行程编码(rle(函数中获取max。当我们请求一个不存在的索引时,我们会得到NA——例如,第2行中没有第二个零。

ordered_runs = function(x, val = 0, idx = 1) {
with(rle(x), sort(lengths[values == val], decreasing = TRUE))[idx]
}
test$result_1 <- apply(test[,-c(1:4,17:18)], MARGIN = 1, ordered_runs, idx = 1)
test$result_2 <- apply(test[,-c(1:4,17:18)], MARGIN = 1, ordered_runs, idx = 2)

输出与您预期的略有不同——(1(使用NA而不是-Inf,以及(2(在第一行中,我认为第二行有2个零。

> test[,c(1,17:20)]
# A tibble: 9 x 5
ID `Second largest run` result result_1 result_2
<dbl>                <dbl>  <dbl>    <int>    <int>
1     1                    1      2        2        2
2     2                   NA      4        4       NA
3     3                   NA   -Inf       NA       NA
4     4                   NA      1        1       NA
5     5                   NA   -Inf       NA       NA
6     6                    2      5        5        2
7     7                   NA   -Inf       NA       NA
8     8                   NA      3        3       NA
9     9                   NA   -Inf       NA       NA

这里有一个使用data.table的选项,它对于OP的大型数据集来说应该很快,并且还可以同时识别所有零序列:

library(data.table)
setDT(DF)
cols <- c("Jan", "Feb", "March", "April", "May", "June", "July", "August", "Sep", "Oct", "Nov", "Dec")
#convert into a long format
m <- melt(DF, measure.vars=cols)[
#identify consecutive sequences of the same number and count
order(ID), c("rl", "rw") := .(rl <- rleid(ID, value), rowid(rl))][
#extract the last element where values = 0 (that is the length of sequences of zeros)
value == 0L, .(ID=ID[.N], len=rw[.N]), rl][
#sort in descending order for length of sequences
order(ID, -len)]
#pivot into wide format and perform a update join
wide <- dcast(m, ID ~ rowid(ID), value.var="len")
DF[wide, on=.(ID), (names(wide)) := mget(names(wide))]

输出:

ID V1  V2  V3 Jan Feb March April May June July August Sep Oct Nov Dec  1  2
1:  1  A  21 149  10  20     0     0  25   29   34     37   0   0  48  50  2  2
2:  2  B 233 250  20  25    28    34   0    0    0      0  39  48  55  60  4 NA
3:  3  A 185 218  10  15    20    25  30   35   40     45  50  55  60  65 NA NA
4:  4  B  85 104  12   0    14    16  22   24   28     29  31  35  40  45  1 NA
5:  5  B 208  62  76  89   108   125 135  145  163    173 193 210 221 239 NA NA
6:  6  A 112  19  28   0     0    71   0    0    0      0   0 163 183 195  5  2
7:  7  A 238 175 137 152   165   181 191  205  217    228 239 252 272 289 NA NA
8:  8  B  66 168 162 177   194   208 224  244  256    276 308   0   0   0  3 NA
9:  9  B  38  28 101 119   132   149 169  187  207    221 236 247 264 277 NA NA

数据:

DF <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9), V1 = c("A", 
"B", "A", "B", "B", "A", "A", "B", "B"), V2 = c(21, 233, 185, 
85, 208, 112, 238, 66, 38), V3 = c(149, 250, 218, 104, 62, 19, 
175, 168, 28), Jan = c(10, 20, 10, 12, 76, 28, 137, 162, 101), 
Feb = c(20, 25, 15, 0, 89, 0, 152, 177, 119), March = c(0, 
28, 20, 14, 108, 0, 165, 194, 132), April = c(0, 34, 25, 
16, 125, 71, 181, 208, 149), May = c(25, 0, 30, 22, 135, 
0, 191, 224, 169), June = c(29, 0, 35, 24, 145, 0, 205, 244, 
187), July = c(34, 0, 40, 28, 163, 0, 217, 256, 207), August = c(37, 
0, 45, 29, 173, 0, 228, 276, 221), Sep = c(0, 39, 50, 31, 
193, 0, 239, 308, 236), Oct = c(0, 48, 55, 35, 210, 163, 
252, 0, 247), Nov = c(48, 55, 60, 40, 221, 183, 272, 0, 264
), Dec = c(50, 60, 65, 45, 239, 195, 289, 0, 277), `1` = c(2L, 
4L, NA, 1L, NA, 5L, NA, 3L, NA), `2` = c(2L, NA, NA, NA, 
NA, 2L, NA, NA, NA)), row.names = c(NA, -9L), class = "data.frame")

相关内容

  • 没有找到相关文章

最新更新