我有以下长格式数据帧,其中两列包含嵌套的父子层次结构:
parent,child,child_level
d ,sf ,x
d ,st ,x
d ,s0 ,x
sf ,gr4 ,l
sf ,gr3 ,l
st ,grd ,l
st ,gr9 ,l
s0 ,n7 ,l
s0 ,b12 ,l
grd ,nyvc ,b
gr3 ,trub2,b
b12 ,ngb2 ,b
b12 ,ggb8 ,b
nyvc ,xtr2d,i
trub2 ,xtuD ,i
gr4 ,stab3,i
gr9 ,ubc8 ,i
n7 ,ubc2 ,i
ggb8 ,drik2,i
我的目标是将父列和子列取消嵌套为宽格式。列名称应对应于child_level列中的相应级别:
,x ,l ,b ,i
d,sf,gr4,NA ,stab3
d,sf,gr3,trub2,xtuD
d,st,grd,nyvc ,xtr2d
d,st,gr9,NA ,ubc8
d,s0,n7 ,NA ,ubc2
d,s0,b12,ngb2 ,NA
d,s0,b12,ggb8 ,drik2
没有表示顺序或级别等级的其他信息,这些信息仅从子列和父列中出现。此外,并非输出表的所有行都包含所有层次结构级别 - 这些级别应使用 NA 填充。
EDIT1(为了澄清):R 或 Python 解决方案可以工作,所以我提供了一个通用的输入表(例如作为 csv 阅读)。此外,下表是手工制作的 - 我实际上不知道如何以编程方式到达那里。
EDIT2:行不是有序的,即子级别可以按任何顺序排列,因此它必须是某种递归方法。
更新了答案
我现在有一种方法,只要你知道后代的顺序,即我们需要对child_level
向量进行排序的顺序,它就有效。如果你不知道,我们应该能够计算订单,但现在我认为它是已知的。
该方法基于:
- 首先还要计算
parent_level
- 按
parent_level
和child_level
嵌套数据框 - 一个自定义函数,可以与
purrr::accumulate2
或purrr::reduce2
一起使用,该函数使用left_join
将所有 data.frame 联接在一行中,如果再次联接现有列,则相应的列将合并为一个
在应用此自定义join_merge
函数之前:
- 嵌套的数据帧需要按降序排序(
child_level
) - 列名
parent
和child
将替换为parent_level
和child_level
- 最后,
parent_level
和child_level
被组合成一个名为arg_ls
的向量,该向量作为.y
参数传递给accumulate2
(或者reduce2
我希望这适用于您的真实数据。
library(tidyverse)
dat <- tribble(
~ parent, ~child, ~child_level,
"d" ,"sf" ,"x",
"d" ,"st" ,"x",
"d" ,"s0" ,"x",
"sf" ,"gr4" ,"l",
"sf" ,"gr3" ,"l",
"st" ,"grd" ,"l",
"st" ,"gr9" ,"l",
"s0" ,"n7" ,"l",
"s0" ,"b12" ,"l",
"grd" ,"nyvc" ,"b",
"gr3" ,"trub2","b",
"b12" ,"ngb2" ,"b",
"b12" ,"ggb8" ,"b",
"nyvc" ,"xtr2d","i",
"trub2" ,"xtuD" ,"i",
"gr4" ,"stab3","i",
"gr9" ,"ubc8" ,"i",
"n7" ,"ubc2" ,"i",
"ggb8" ,"drik2","i"
)
# in a first step we calculate the `parent_level`
dat <- dat %>%
left_join(., select(., -parent), by = c("parent" = "child")) %>%
rename("child_level" = "child_level.x",
"parent_level" = "child_level.y") %>%
mutate(parent_level = replace_na(parent_level, "o"))
# we need this function to work with accumulate2 or reduce2
join_merge <- function(df1, df2, .rename) {
res <- left_join(df1, df2, by = .rename[1])
# in case an existing column is joined again, we need to merge it together
if(length(colnames(select(res, starts_with(all_of(.rename[2]))))) > 1) {
res <- mutate(res, across(matches(paste0(.rename[2], ".x")),
~ if_else(is.na(.x), eval(sym(paste0(.rename[2], ".y"))), .x))) %>%
select(-all_of(paste0(.rename[2], ".y"))) %>%
rename(!! .rename[2] := paste0(.rename[2], ".x"))
}
res
}
# accumulate is used to show how the final result is buildt
dat %>%
nest_by(child_level, parent_level) %>%
arrange(child_level == "i", desc(child_level)) %>%
mutate(arg_ls = list(c(parent_level, child_level))) %>%
mutate(data = list(rename_with(data,
~ paste0(child_level),
"child") %>%
rename_with(~ paste0(parent_level),
"parent"))) %>%
ungroup %>%
mutate(dat_acc = accumulate2(data,
arg_ls[-1],
join_merge)) %>%
pull(dat_acc)
#> [[1]]
#> # A tibble: 3 x 2
#> o x
#> <chr> <chr>
#> 1 d sf
#> 2 d st
#> 3 d s0
#>
#> [[2]]
#> # A tibble: 6 x 3
#> o x l
#> <chr> <chr> <chr>
#> 1 d sf gr4
#> 2 d sf gr3
#> 3 d st grd
#> 4 d st gr9
#> 5 d s0 n7
#> 6 d s0 b12
#>
#> [[3]]
#> # A tibble: 7 x 4
#> o x l b
#> <chr> <chr> <chr> <chr>
#> 1 d sf gr4 <NA>
#> 2 d sf gr3 trub2
#> 3 d st grd nyvc
#> 4 d st gr9 <NA>
#> 5 d s0 n7 <NA>
#> 6 d s0 b12 ngb2
#> 7 d s0 b12 ggb8
#>
#> [[4]]
#> # A tibble: 7 x 5
#> o x l b i
#> <chr> <chr> <chr> <chr> <chr>
#> 1 d sf gr4 <NA> <NA>
#> 2 d sf gr3 trub2 xtuD
#> 3 d st grd nyvc xtr2d
#> 4 d st gr9 <NA> <NA>
#> 5 d s0 n7 <NA> <NA>
#> 6 d s0 b12 ngb2 <NA>
#> 7 d s0 b12 ggb8 drik2
#>
#> [[5]]
#> # A tibble: 7 x 5
#> o x l b i
#> <chr> <chr> <chr> <chr> <chr>
#> 1 d sf gr4 <NA> stab3
#> 2 d sf gr3 trub2 xtuD
#> 3 d st grd nyvc xtr2d
#> 4 d st gr9 <NA> ubc8
#> 5 d s0 n7 <NA> ubc2
#> 6 d s0 b12 ngb2 <NA>
#> 7 d s0 b12 ggb8 drik2
创建于 2020-12-22 由 reprex 软件包 (v0.3.0)
如果没有进一步的信息,这里是我通常如何处理这个问题。我正在展示一个tidyverse
解决方案,但当然这也可以在基础 R 中完成。
data <- structure(list(child = structure(c(10L, 11L, 9L, 4L, 5L, 5L,
6L, 1L, 8L, 13L, 7L, 3L, 16L, 17L, 12L, 15L, 14L, 2L), .Label = c("b12",
"drik2", "ggb8", "gr4", "grd", "n7", "ngb2", "nyvc", "s0", "sf",
"st", "stab3", "trub2", "ubc2", "ubc8", "xtr2d", "xtuD"), class = "factor"),
parent = structure(c(2L, 2L, 2L, 11L, 11L, 12L, 10L, 10L,
7L, 4L, 1L, 1L, 9L, 13L, 5L, 6L, 8L, 3L), .Label = c("b12",
"d", "ggb8", "gr3", "gr4", "gr9", "grd", "n7", "nyvc", "s0",
"sf", "st", "trub2"), class = "factor"), child_level = structure(c(4L,
4L, 4L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("b", "i", "t", "x"), class = "factor")), class = "data.frame", row.names = c(NA,
-18L))
library(tidyverse)
pivot <- data %>% mutate(unique = rownames(data)) %>% pivot_wider(id_cols = unique, names_from = child_level, values_from = child) %>% select(!unique)
输入如下所示:
# > data
# child parent child_level
# 1 sf d x
# 2 st d x
# 3 s0 d x
# 4 gr4 sf t
# 5 grd sf t
# 6 grd st t
# 7 n7 s0 t
# 8 b12 s0 t
# 9 nyvc grd b
# 10 trub2 gr3 b
# 11 ngb2 b12 b
# 12 ggb8 b12 b
# 13 xtr2d nyvc i
# 14 xtuD trub2 i
# 15 stab3 gr4 i
# 16 ubc8 gr9 i
# 17 ubc2 n7 i
# 18 drik2 ggb8 i
它将输出以下内容:
# > pivot
# # A tibble: 18 x 4
# x t b i
# <fct> <fct> <fct> <fct>
# 1 sf NA NA NA
# 2 st NA NA NA
# 3 s0 NA NA NA
# 4 NA gr4 NA NA
# 5 NA grd NA NA
# 6 NA grd NA NA
# 7 NA n7 NA NA
# 8 NA b12 NA NA
# 9 NA NA nyvc NA
# 10 NA NA trub2 NA
# 11 NA NA ngb2 NA
# 12 NA NA ggb8 NA
# 13 NA NA NA xtr2d
# 14 NA NA NA xtuD
# 15 NA NA NA stab3
# 16 NA NA NA ubc8
# 17 NA NA NA ubc2
# 18 NA NA NA drik2