具有完整层次结构的长格式父子数据帧到宽格式



我有以下长格式数据帧,其中两列包含嵌套的父子层次结构:

parent,child,child_level
d     ,sf   ,x
d     ,st   ,x
d     ,s0   ,x
sf    ,gr4  ,l
sf    ,gr3  ,l
st    ,grd  ,l
st    ,gr9  ,l
s0    ,n7   ,l
s0    ,b12  ,l
grd   ,nyvc ,b
gr3   ,trub2,b
b12   ,ngb2 ,b
b12   ,ggb8 ,b
nyvc  ,xtr2d,i
trub2 ,xtuD ,i
gr4   ,stab3,i
gr9   ,ubc8 ,i
n7    ,ubc2 ,i
ggb8  ,drik2,i

我的目标是将父列和子列取消嵌套为宽格式。列名称应对应于child_level列中的相应级别:

,x ,l  ,b    ,i
d,sf,gr4,NA   ,stab3
d,sf,gr3,trub2,xtuD
d,st,grd,nyvc ,xtr2d
d,st,gr9,NA   ,ubc8
d,s0,n7 ,NA   ,ubc2
d,s0,b12,ngb2 ,NA
d,s0,b12,ggb8 ,drik2

没有表示顺序或级别等级的其他信息,这些信息仅从子列和父列中出现。此外,并非输出表的所有行都包含所有层次结构级别 - 这些级别应使用 NA 填充。

EDIT1(为了澄清):R 或 Python 解决方案可以工作,所以我提供了一个通用的输入表(例如作为 csv 阅读)。此外,下表是手工制作的 - 我实际上不知道如何以编程方式到达那里。

EDIT2:行不是有序的,即子级别可以按任何顺序排列,因此它必须是某种递归方法。

更新了答案

我现在有一种方法,只要你知道后代的顺序,即我们需要对child_level向量进行排序的顺序,它就有效。如果你不知道,我们应该能够计算订单,但现在我认为它是已知的。

该方法基于:

  1. 首先还要计算parent_level
  2. parent_levelchild_level嵌套数据框
  3. 一个自定义函数,可以与purrr::accumulate2purrr::reduce2一起使用,该函数使用left_join将所有 data.frame 联接在一行中,如果再次联接现有列,则相应的列将合并为一个

在应用此自定义join_merge函数之前:

  1. 嵌套的数据帧需要按降序排序(child_level
  2. )
  3. 列名parentchild将替换为parent_levelchild_level
  4. 最后,parent_levelchild_level被组合成一个名为arg_ls的向量,该向量作为.y参数传递给accumulate2(或者reduce2

我希望这适用于您的真实数据。

library(tidyverse)
dat <- tribble(
~ parent, ~child, ~child_level,
"d"     ,"sf"   ,"x",
"d"     ,"st"   ,"x",
"d"     ,"s0"   ,"x",
"sf"    ,"gr4"  ,"l",
"sf"    ,"gr3"  ,"l",
"st"    ,"grd"  ,"l",
"st"    ,"gr9"  ,"l",
"s0"    ,"n7"   ,"l",
"s0"    ,"b12"  ,"l",
"grd"   ,"nyvc" ,"b",
"gr3"   ,"trub2","b",
"b12"   ,"ngb2" ,"b",
"b12"   ,"ggb8" ,"b",
"nyvc"  ,"xtr2d","i",
"trub2" ,"xtuD" ,"i",
"gr4"   ,"stab3","i",
"gr9"   ,"ubc8" ,"i",
"n7"    ,"ubc2" ,"i",
"ggb8"  ,"drik2","i"
)
# in a first step we calculate the `parent_level`
dat <- dat %>% 
left_join(., select(., -parent), by = c("parent" = "child")) %>% 
rename("child_level" = "child_level.x",
"parent_level" = "child_level.y") %>% 
mutate(parent_level = replace_na(parent_level, "o"))
# we need this function to work with accumulate2 or reduce2
join_merge <- function(df1, df2, .rename) {
res <- left_join(df1, df2, by = .rename[1]) 
# in case an existing column is joined again, we need to merge it together
if(length(colnames(select(res, starts_with(all_of(.rename[2]))))) > 1) {
res <- mutate(res, across(matches(paste0(.rename[2], ".x")), 
~ if_else(is.na(.x), eval(sym(paste0(.rename[2], ".y"))), .x))) %>% 
select(-all_of(paste0(.rename[2], ".y"))) %>% 
rename(!! .rename[2] := paste0(.rename[2], ".x"))
}
res
}

# accumulate is used to show how the final result is buildt
dat %>% 
nest_by(child_level, parent_level) %>% 
arrange(child_level == "i", desc(child_level)) %>% 
mutate(arg_ls = list(c(parent_level, child_level))) %>% 
mutate(data = list(rename_with(data,
~ paste0(child_level),
"child") %>%
rename_with(~ paste0(parent_level),
"parent"))) %>%
ungroup %>% 
mutate(dat_acc = accumulate2(data,
arg_ls[-1],
join_merge)) %>% 
pull(dat_acc)
#> [[1]]
#> # A tibble: 3 x 2
#>   o     x    
#>   <chr> <chr>
#> 1 d     sf   
#> 2 d     st   
#> 3 d     s0   
#> 
#> [[2]]
#> # A tibble: 6 x 3
#>   o     x     l    
#>   <chr> <chr> <chr>
#> 1 d     sf    gr4  
#> 2 d     sf    gr3  
#> 3 d     st    grd  
#> 4 d     st    gr9  
#> 5 d     s0    n7   
#> 6 d     s0    b12  
#> 
#> [[3]]
#> # A tibble: 7 x 4
#>   o     x     l     b    
#>   <chr> <chr> <chr> <chr>
#> 1 d     sf    gr4   <NA> 
#> 2 d     sf    gr3   trub2
#> 3 d     st    grd   nyvc 
#> 4 d     st    gr9   <NA> 
#> 5 d     s0    n7    <NA> 
#> 6 d     s0    b12   ngb2 
#> 7 d     s0    b12   ggb8 
#> 
#> [[4]]
#> # A tibble: 7 x 5
#>   o     x     l     b     i    
#>   <chr> <chr> <chr> <chr> <chr>
#> 1 d     sf    gr4   <NA>  <NA> 
#> 2 d     sf    gr3   trub2 xtuD 
#> 3 d     st    grd   nyvc  xtr2d
#> 4 d     st    gr9   <NA>  <NA> 
#> 5 d     s0    n7    <NA>  <NA> 
#> 6 d     s0    b12   ngb2  <NA> 
#> 7 d     s0    b12   ggb8  drik2
#> 
#> [[5]]
#> # A tibble: 7 x 5
#>   o     x     l     b     i    
#>   <chr> <chr> <chr> <chr> <chr>
#> 1 d     sf    gr4   <NA>  stab3
#> 2 d     sf    gr3   trub2 xtuD 
#> 3 d     st    grd   nyvc  xtr2d
#> 4 d     st    gr9   <NA>  ubc8 
#> 5 d     s0    n7    <NA>  ubc2 
#> 6 d     s0    b12   ngb2  <NA> 
#> 7 d     s0    b12   ggb8  drik2

创建于 2020-12-22 由 reprex 软件包 (v0.3.0)

如果没有进一步的信息,这里是我通常如何处理这个问题。我正在展示一个tidyverse解决方案,但当然这也可以在基础 R 中完成。

data <- structure(list(child = structure(c(10L, 11L, 9L, 4L, 5L, 5L, 
6L, 1L, 8L, 13L, 7L, 3L, 16L, 17L, 12L, 15L, 14L, 2L), .Label = c("b12", 
"drik2", "ggb8", "gr4", "grd", "n7", "ngb2", "nyvc", "s0", "sf", 
"st", "stab3", "trub2", "ubc2", "ubc8", "xtr2d", "xtuD"), class = "factor"), 
parent = structure(c(2L, 2L, 2L, 11L, 11L, 12L, 10L, 10L, 
7L, 4L, 1L, 1L, 9L, 13L, 5L, 6L, 8L, 3L), .Label = c("b12", 
"d", "ggb8", "gr3", "gr4", "gr9", "grd", "n7", "nyvc", "s0", 
"sf", "st", "trub2"), class = "factor"), child_level = structure(c(4L, 
4L, 4L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L), .Label = c("b", "i", "t", "x"), class = "factor")), class = "data.frame", row.names = c(NA, 
-18L))
library(tidyverse)
pivot <- data %>% mutate(unique = rownames(data)) %>% pivot_wider(id_cols = unique, names_from = child_level, values_from = child) %>% select(!unique)

输入如下所示:

# > data
#    child parent child_level
# 1     sf      d           x
# 2     st      d           x
# 3     s0      d           x
# 4    gr4     sf           t
# 5    grd     sf           t
# 6    grd     st           t
# 7     n7     s0           t
# 8    b12     s0           t
# 9   nyvc    grd           b
# 10 trub2    gr3           b
# 11  ngb2    b12           b
# 12  ggb8    b12           b
# 13 xtr2d   nyvc           i
# 14  xtuD  trub2           i
# 15 stab3    gr4           i
# 16  ubc8    gr9           i
# 17  ubc2     n7           i
# 18 drik2   ggb8           i

它将输出以下内容:

# > pivot
# # A tibble: 18 x 4
#    x     t     b     i    
#    <fct> <fct> <fct> <fct>
#  1 sf    NA    NA    NA   
#  2 st    NA    NA    NA   
#  3 s0    NA    NA    NA   
#  4 NA    gr4   NA    NA   
#  5 NA    grd   NA    NA   
#  6 NA    grd   NA    NA   
#  7 NA    n7    NA    NA   
#  8 NA    b12   NA    NA   
#  9 NA    NA    nyvc  NA   
# 10 NA    NA    trub2 NA   
# 11 NA    NA    ngb2  NA   
# 12 NA    NA    ggb8  NA   
# 13 NA    NA    NA    xtr2d
# 14 NA    NA    NA    xtuD 
# 15 NA    NA    NA    stab3
# 16 NA    NA    NA    ubc8 
# 17 NA    NA    NA    ubc2 
# 18 NA    NA    NA    drik2

最新更新