Cumulative count of unique values in R

A simplified version of my dataset looks like this:

depth value
   1     a
   1     b
   2     a
   2     b
   2     b
   3     c

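I would like to create a new dataset where, for each value of "depth", I have the cumulative number of unique values, counting from the top. For example: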

depth cumsum
 1      2
 2      2
 3      3

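Any ideas on how to do this? I am relatively new to R.

I find this is a perfect case for using factor and setting the levels carefully. I'm using data.table here. Make sure your value column is character (not an absolute requirement).

Step 1: Convert your data.frame to a data.table, keeping only the unique rows.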

require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth") # just to be sure before factoring "value"

Step 2: Convert value to a factor and coerce it to numeric. Make sure to set the levels yourself (this is important).
```
dt[, id := as.numeric(factor(value, levels = unique(value)))]
```
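
To see why the levels matter, here is a minimal base R sketch. The `value` vector below is hypothetical (it is not the question's data) and is chosen so that the values do not first appear in alphabetical order. With the default alphabetical levels the ids do not follow order of appearance; with `levels = unique(value)` they do, and the running maximum of the ids then matches the running count of distinct values.

```r
# Hypothetical values that do NOT first appear in alphabetical order.
value <- c("b", "a", "b", "c", "a")

# Default levels are sorted alphabetically: a = 1, b = 2, c = 3.
default_id <- as.numeric(factor(value))

# Levels from unique(value) follow order of first appearance: b = 1, a = 2, c = 3.
ordered_id <- as.numeric(factor(value, levels = unique(value)))

print(default_id)  # values: 2 1 2 3 1
print(ordered_id)  # values: 1 2 1 3 2

# With first-appearance ids, the running maximum equals the
# running count of distinct values seen so far.
running_max   <- cummax(ordered_id)
running_count <- cumsum(!duplicated(value))
print(running_max)    # values: 1 2 2 3 3
print(running_count)  # values: 1 2 2 3 3
```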

Step 3: Set the key columns to depth and id for subsetting, and select only the last value for each depth.

 setkey(dt, "depth", "id")
 dt.out <- dt[J(unique(depth)), mult="last"][, value := NULL]
#    depth id
# 1:     1  2
# 2:     2  2
# 3:     3  3

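Step 4: Since the value in each row should be at least as large as the value in the previous row as depth increases, use cummax to get the final output.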
```
dt.out[, id := cummax(id)]
```
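
As a rough illustration of why the cummax step is needed, here is a base R sketch. It uses the question's data plus one hypothetical extra row (depth 4, value "a") that is not in the original, and it uses tapply() as a stand-in for the keyed `mult="last"` subset. A depth that only contains values seen earlier gets a small per-depth maximum, and the running maximum repairs that.

```r
# Data from the question, plus one hypothetical extra row (depth 4, value "a").
depth <- c(1, 1, 2, 2, 2, 3, 4)
value <- c("a", "b", "a", "b", "b", "c", "a")

# Ids in order of first appearance: a = 1, b = 2, c = 3.
id <- as.numeric(factor(value, levels = unique(value)))

# Largest id within each depth. This stands in for keeping the last row
# per depth after keying by (depth, id).
last_id <- as.vector(tapply(id, depth, max))
print(last_id)  # values: 2 2 3 1 (depth 4 only contains "a")

# The running maximum restores the cumulative count of unique values.
final_id <- cummax(last_id)
print(final_id)  # values: 2 2 3 3
```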

Edit: The code above was for illustration. In practice you don't need the third column at all. This is how I'd write the final code.

require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth")
dt[, value := as.numeric(factor(value, levels = unique(value)))]
setkey(dt, "depth", "value")
dt.out <- dt[J(unique(depth)), mult="last"]
dt.out[, value := cummax(value)]

Here is a more complex example, with the output of the code:

df <- structure(list(depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6), 
                value = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 6L, 1L, 1L), 
                .Label = c("a", "b", "c", "d", "f", "g"), class = "factor")), 
                .Names = c("depth", "value"), row.names = c(NA, -11L), 
                class = "data.frame")
#    depth value
# 1:     1     2
# 2:     2     4
# 3:     3     4
# 4:     4     5
# 5:     5     6
# 6:     6     6

One attempt, using dplyr:

library(dplyr)

df %>%
  #group_by(group) %>% # if you have a third variable and want the same result within each group
  mutate(cum_unique_entries = cumsum(!duplicated(value))) %>% # running count of first appearances
  group_by(depth) %>% # add more grouping variables here for more layers
  summarise(cum_unique_entries = last(cum_unique_entries)) # keep the last running count per depth
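
For readers without dplyr, the same idea can be sketched in base R. This is only an illustration on the question's data; the use of tapply() and the name `res` are my own choices, not part of the answer above. It assumes the rows are already sorted by depth.

```r
df <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                 value = c("a", "b", "a", "b", "b", "c"))

# Running count of distinct values, row by row
# (assumes rows are already sorted by depth).
running <- cumsum(!duplicated(df$value))
print(running)  # values: 1 2 2 2 2 3

# Keep the last running count within each depth.
last_per_depth <- tapply(running, df$depth, function(x) x[length(x)])
res <- data.frame(depth = sort(unique(df$depth)),
                  cum_unique_entries = as.vector(last_per_depth))
print(res)
```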

Here is another attempt:

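# levels = unique(...) numbers the values in order of first appearance;
# the default alphabetical levels only work if values first appear in sorted order
numvals <- cummax(as.numeric(factor(mydf$value, levels = unique(mydf$value))))
aggregate(numvals, list(depth=mydf$depth), max)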

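This gives the expected result for the sample data, and it also seems to work with @Arun's example.

A good first step is to create a column of TRUE or FALSE, where it is TRUE for the first occurrence of each value and FALSE for later occurrences of that value. This is easily done with duplicated: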

mydata$first.appearance = !duplicated(mydata$value)

Restructuring data is best done with aggregate. Here, it sums the first.appearance column within each subset of depth:

newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)

The result looks like this:

  depth first.appearance
1     1                2
2     2                0
3     3                1

This is still not a cumulative sum. For that, use the cumsum function (and then drop the old column):

newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

To sum up:

mydata$first.appearance = !duplicated(mydata$value)
newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)
newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

Output:

  depth cumsum
1     1      2
2     2      2
3     3      3

This can be written fairly cleanly as a single SQL statement using the sqldf package. Assume DF is the original data frame:

library(sqldf)
sqldf("select b.depth, count(distinct a.value) as cumsum
    from DF a join DF b 
    on a.depth <= b.depth
    group by b.depth"
)
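
If it helps to see what the self-join is doing, here is a rough base R sketch of the same logic on the question's data. It is not what sqldf runs internally; the names `pairs`, `counts` and `res` are just for illustration. Each row (alias b) is paired with every row (alias a) at the same or a shallower depth, and the distinct values of a are counted per depth of b.

```r
DF <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                 value = c("a", "b", "a", "b", "b", "c"))

# Cross join: every row of DF (as "a") paired with every row of DF (as "b").
pairs <- merge(DF, DF, by = NULL, suffixes = c(".a", ".b"))

# Join condition from the SQL statement: a.depth <= b.depth.
pairs <- pairs[pairs$depth.a <= pairs$depth.b, ]

# group by b.depth, count(distinct a.value)
counts <- tapply(pairs$value.a, pairs$depth.b, function(v) length(unique(v)))
res <- data.frame(depth = as.numeric(names(counts)),
                  cumsum = as.vector(counts))
print(res)
```

The cross join grows with the square of the number of rows, so this is meant as an explanation of the query rather than something to run on large data.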

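Here is another solution using lapply(). unique(df$depth) produces the vector of unique depth values; for each of them, subset value to only those entries whose depth is less than or equal to that depth value, and take the length of the unique value entries. That length is stored in cumsum, and depth=x records the depth level it belongs to. do.call(rbind, ...) then combines the pieces into a single data frame.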

do.call(rbind,lapply(unique(df$depth), 
               function(x)
             data.frame(depth=x,cumsum=length(unique(df$value[df$depth<=x])))))
  depth cumsum
1     1      2
2     2      2
3     3      3
