Aggregating sequential and grouped data in R



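I have a dataset that looks like this toy example. The data describe the location a person moved to and how long ago the move took place. For example, person 1 started out in a rural area but moved to a city 463 days ago (second row), moved from that city to a town 415 days ago (third row), and so on.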

set.seed(123)
df <- as.data.frame(sample.int(1000, 10))
colnames(df) <- "time"
df$destination <- as.factor(sample(c("city", "town", "rural"), size = 10, replace = TRUE, prob = c(.50, .25, .25)))
df$user <- sample.int(3, 10, replace = TRUE)
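# Sort by user, then by descending time, and keep the result,
# since the answers below rely on this row order
df <- df[order(df[,"user"], -df[,"time"]), ]
df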

Data:

time destination user
 526       rural    1
 463        city    1
 415        town    1
 299        city    1
 179       rural    1
 938        town    2
 229        town    2
 118        city    2
 818        city    3
 195        city    3

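I would like to aggregate these data into the format below. That is, count each type of relocation within each user and sum the counts over all users into a single table. How can I do this (preferably without writing a loop)?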

from  to     count
city  city   1
city  town   1
city  rural  1
town  city   2
town  town   1
town  rural  0
rural city   1
rural town   0
rural rural  0

One possible approach based on the data.table package:

library(data.table)
cases <- unique(df$destination)
setDT(df)[, .(from=destination, to=shift(destination, -1)), by=user
          ][CJ(from=cases, to=cases), .(count=.N), by=.EACHI, on=c("from", "to")]

#      from     to count
#    <char> <char> <int>
# 1:   city   city     1
# 2:   city  rural     1
# 3:   city   town     1
# 4:  rural   city     1
# 5:  rural  rural     0
# 6:  rural   town     0
# 7:   town   city     2
# 8:   town  rural     0
# 9:   town   town     1
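
The pairing step above relies on shift with a negative offset, which data.table treats as a lead. Here is a small sketch on a made-up vector (x, nxt and pairs are hypothetical names used only for illustration), assuming a data.table version that accepts a negative n:

```r
library(data.table)

x <- c("rural", "city", "town")   # hypothetical destination sequence for one user

nxt <- shift(x, -1)               # next element; the last position has no successor, so it is NA
pairs <- data.table(from = x, to = nxt)

# The row whose 'to' is NA matches no (from, to) pair in the CJ() grid,
# so it does not contribute to any count in the join.
print(pairs)
```

Because the join is done with by=.EACHI against the full CJ grid, every from/to combination gets a row, including those with no matches.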

Here is a data.table option:

setDT(df)[
    ,
    setNames(
        rev(data.frame(embed(as.character(destination), 2))),
        c("from", "to")
    ), user
][, count := .N, .(from, to)][]

which gives

   user  from    to count
1:    1 rural  city     1
2:    1  city  town     1
3:    1  town  city     2
4:    1  city rural     1
5:    2  town  town     1
6:    2  town  city     2
7:    3  city  city     1
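
To see what embed contributes in the option above, here is a base R sketch on user 1's destinations (x, m and pairs are names introduced only for illustration). With dimension 2, embed returns a matrix whose first column holds the current value and whose second column holds the previous one, which is why the answer reverses the columns before naming them from and to:

```r
x <- c("rural", "city", "town", "city", "rural")  # user 1's destinations, in order

m <- embed(x, 2)   # column 1 = current value, column 2 = previous value

# Put the previous value first so each row reads from -> to
pairs <- data.frame(from = m[, 2], to = m[, 1])
print(pairs)
```

One caveat, as far as I can tell: embed stops with an error when the vector is shorter than the requested dimension, so a user with a single row would need separate handling.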

Here is a tidyverse solution:

library(dplyr)
library(purrr)
df %>%
  group_split(user) %>%
  map_dfr(~ bind_cols(as.character(.x[["destination"]][-nrow(.x)]), 
                  as.character(.x[["destination"]][-1])) %>%
        set_names("from", "to")) %>%
  group_by(from, to) %>%
  count()
# A tibble: 6 x 3
# Groups:   from, to [6]
  from  to        n
  <chr> <chr> <int>
1 city  city      1
2 city  rural     1
3 city  town      1
4 rural city      1
5 town  city      2
6 town  town      1
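
The result above lists only the six transitions that actually occur, while the target table in the question also shows pairs with a count of 0. Below is a minimal sketch of one way to add them, assuming tidyr is available and using its complete function. It builds the pairs with a grouped lead rather than the group_split/map_dfr pipeline above. The places vector, the res name and the inline copy of the toy data are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data copied from the question, already sorted by user and descending time
df <- data.frame(
  time = c(526, 463, 415, 299, 179, 938, 229, 118, 818, 195),
  destination = c("rural", "city", "town", "city", "rural",
                  "town", "town", "city", "city", "city"),
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3)
)

places <- c("city", "town", "rural")  # every location that should appear in the grid

res <- df %>%
  group_by(user) %>%
  mutate(from = destination, to = lead(destination)) %>%  # next destination within a user
  ungroup() %>%
  filter(!is.na(to)) %>%                                   # drop each user's last row
  count(from, to, name = "count") %>%
  complete(from = places, to = places, fill = list(count = 0L))  # add missing pairs as 0

print(res)
```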

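Here is a dplyr-only solution:

  1. Create to with the lead function and combine it with from via paste0 into a helper column.
  2. Remove the NA rows introduced by lead.
  3. Use add_count to add the count column n.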
df %>% 
  group_by(user) %>% 
  rename(from = destination) %>% 
  mutate(to = lead(from), .before=3) %>% 
  mutate(helper = paste0(from, to)) %>% 
  filter(!is.na(to)) %>% 
  group_by(helper) %>% 
  add_count(helper, from, to) %>% 
  ungroup() %>% 
  select(user, from, to, n)

Output:

   user from  to        n
  <int> <fct> <fct> <int>
1     1 rural city      1
2     1 city  town      1
3     1 town  city      2
4     1 city  rural     1
5     2 town  town      1
6     2 town  city      2
7     3 city  city      1
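
For comparison, the same tally can be sketched in base R with table, which also yields the matrix form mentioned in the question. This is only a sketch under the assumption that the rows are already sorted by user and descending time. The names places, same_user, tab and long are introduced for illustration.

```r
# Toy data copied from the question, already sorted by user and descending time
df <- data.frame(
  time = c(526, 463, 415, 299, 179, 938, 229, 118, 818, 195),
  destination = c("rural", "city", "town", "city", "rural",
                  "town", "town", "city", "city", "city"),
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3)
)

places <- c("city", "town", "rural")
n <- nrow(df)

from <- df$destination[-n]                 # every row except the last
to   <- df$destination[-1]                 # every row except the first
same_user <- df$user[-n] == df$user[-1]    # keep only pairs within one user

# Fixing the factor levels makes table() report unseen pairs as 0
tab <- table(
  from = factor(from[same_user], levels = places),
  to   = factor(to[same_user], levels = places)
)

long <- as.data.frame(tab, responseName = "count")  # long format: from, to, count
print(tab)
```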
