r-按id合并数据帧,同时交织年份并在年份之间结转值

  • 本文关键字:之间 合并 id 数据帧 r merge
  • 更新时间 :
  • 英文 :


我有两个要合并的数据帧。根据id&CCD_ 2。

其中之一是";"main";,另一个添加信息。然而,我不能以常规方式(即merge()dplyr::left_join()(合并它们,因为它们中的year值不一定与每个id匹配。因此,我想按时间顺序将第二个表中的已知内容转移到主表中的每year行中。

在下面的例子中,我有两张关于军官的表格。";主";一个有3列用于idyear和另一个col_1:

df_main_info <-
tribble(~id, ~year, ~col_1, 
1,   2008,  "foo",
1,   2005,  "bar",
1,   2010,  "blah", 
1,   2020,  "bar",  
2,   1999,  "foo", 
2,   2020,  "foo",  
3,   2002,  "bar",
3,   2010,  "bar",
4,   2003,  "foo",
4,   2010,  "bar"
)

我有一个额外的表格,里面有idyear列,每个军官什么时候获得军衔,是什么军衔:

df_ranks_history <-
tribble(~id, ~year, ~army_rank,
1,   2005,  "second_lieutenant",
1,   2010,  "first_lieutenant",
1,   2018,  "major",
1,   2021,  "colonel",
2,   2002,  "major",
2,   2018,  "colonel",
3,   1995,  "second_lieutenant",
3,   2000,  "captain",
3,   2012,  "colonel"
)

年份不完全匹配。但是,例如,如果id = 3警官在2000年成为"captain",那么我们知道在2002年仍然是这样,所以我们可以输入";船长;第7行的CCD_ 15中。

因此,所需的输出应该是:


desired_output <-
tribble(~id, ~year, ~col_1, ~army_rank,
1,   2008,  "foo",   "second_lieutenant",
1,   2005,  "bar",   "second_lieutenant",
1,   2010,  "blah",  "first_lieutenant",
1,   2020,  "bar",   "major",
2,   1999,  "foo",   NA,
2,   2020,  "foo",   "colonel",
3,   2002,  "bar",   "captain",
3,   2010,  "bar",   "captain",
4,   2003,  "foo",   NA,
4,   2010,  "bar",   NA
)

如果这是相关的,则排名按特定顺序排列:

us_army_officer_ranks <- c("second_lieutenant", 
"first_lieutenant", 
"captain", 
"major", 
"lieutenant_colonel", 
"colonel")
# colonel > lieutenant_colonel > major > captain > first_lieutenant > second_lieutenant
library(dplyr)
library(tidyr)
df_main_info %>% 
full_join(df_ranks_history, by = c("id", "year")) %>%
group_by(id) %>%
arrange(id, year) %>%
fill(army_rank, .direction = "down") %>%
filter(!is.na(col_1))
# # A tibble: 10 × 4
# # Groups:   id [4]
#       id  year col_1 army_rank        
#    <dbl> <dbl> <chr> <chr>            
#  1     1  2005 bar   second_lieutenant
#  2     1  2008 foo   second_lieutenant
#  3     1  2010 blah  first_lieutenant 
#  4     1  2020 bar   major            
#  5     2  1999 foo   NA               
#  6     2  2020 foo   colonel          
#  7     3  2002 bar   captain          
#  8     3  2010 bar   captain          
#  9     4  2003 foo   NA               
# 10     4  2010 bar   NA    
library(data.table)
setDT(df_main_info)
setDT(df_ranks_history)
df_ranks_history[df_main_info, on = list(id, year), roll = +Inf]
id year         army_rank col_1
1:  1 2008 second_lieutenant   foo
2:  1 2005 second_lieutenant   bar
3:  1 2010  first_lieutenant  blah
4:  1 2020             major   bar
5:  2 1999              <NA>   foo
6:  2 2020           colonel   foo
7:  3 2002           captain   bar
8:  3 2010           captain   bar
9:  4 2003              <NA>   foo
10:  4 2010              <NA>   bar

最新更新