r-如果存在于同一组中,则折叠行



大家好,我有一个数据帧,比如:

Group       family
1     A      Canidae
2     B      Canidae
3     A      Felidae
4     B      Canidae
5     C Elephantidae
6     C    Galinacae
7     D    Galinacae
8     D     Siuridae
9     E       Apidae

我想折叠存在familyGroup组(例如:

犬科存在于AB中,因此我将各组的所有唯一值合并并添加到family2

Group family2
A,B   Canidae,Felidae 

然后我继续,我看到Elephantidae和Galinacae都在C中,而thtaGalinaca也在D中,所以我崩溃了:

Group family2
A,B   Canidae,Felidae 
C,D   Elephantidae,Galinacae,Siuridae 

最后我们应该得到:

Group family2
A,B   Canidae,Felidae 
C,D   Elephantidae,Galinacae,Siuridae 
E     Apidae 

有人有主意吗?

以下是数据,请给这样的东西?非常感谢你的帮助和时间。

如果有帮助的话,以下是数据:

structure(list(Group = structure(c(1L, 2L, 1L, 2L, 3L, 3L, 4L, 
4L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor"), 
family = structure(c(2L, 2L, 4L, 2L, 3L, 5L, 5L, 6L, 1L), .Label = c("Apidae", 
"Canidae", "Elephantidae", "Felidae", "Galinacae", "Siuridae"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-9L))

这是我的一些查找函数的解决方案

# A lookup function that look for intersect between group 
# if there are at least one intersect - those group will be combined
look_up_group <- function(one_group, lookup_list) {
matched_list <- map(lookup_list, function(x) {
tryCatch(
{
intersect(x, one_group)
}, error = function(e) {
stop(paste0("Error in lookup function: one_group=", one_group, "; x=", x))
}) 
})

index <- which(unlist(map(matched_list, function(x) { length(x) > 0 })))
sort(unique(unlist(lookup_list[index])))
}

df %>%
# First remove all duplicated rows - exactly the same for both Group, Family
filter(!duplicated(.)) %>%
# arrange in alphabetical order
arrange(Group, family) %>%
# create a Group_2 which is combination of all Group for each family
group_by(family) %>%
mutate(Group_2 = list(Group)) %>%
ungroup() %>%
# Create Group_3 which is the full combined Group for all intersect Group
mutate(Group_3 = map(.[["Group_2"]], function(x) { look_up_group(one_group = x, lookup_list = .[["Group_2"]]) })) %>%
# Combine all Group_3 into a Group_final
mutate(Group_final = unlist(map(Group_3, function(x) { paste (x, collapse = ",")} ))) %>%
# Finally put them all together.
select(Group_final, family) %>%
group_by(Group_final) %>%
summarize(family = paste(family, collapse = ","), .groups = "drop")

这是的最终输出

Group_final family                                   
* <chr>       <chr>                                    
1 A,B         Canidae,Felidae,Canidae                  
2 C,D         Elephantidae,Galinacae,Galinacae,Siuridae
3 E           Apidae   

为了更容易理解,这里是每个步骤的细节

第一步

# remove duplicate & create variable Group_2
tmp <- df %>%
filter(!duplicated(.)) %>%
arrange(Group, family) %>%
group_by(family) %>%
mutate(Group_2 = list(Group)) %>%
ungroup()

我们有这个数据-

Group family       Group_2  
<fct> <fct>        <list>   
1 A     Canidae      <fct [2]>
2 A     Felidae      <fct [1]>
3 B     Canidae      <fct [2]>
4 C     Elephantidae <fct [1]>
5 C     Galinacae    <fct [2]>
6 D     Galinacae    <fct [2]>
7 D     Siuridae     <fct [1]>
8 E     Apidae       <fct [1]>

Group_2看起来像这个

> tmp$Group_2
[[1]]
[1] A B
Levels: A B C D E
[[2]]
[1] A
Levels: A B C D E
[[3]]
[1] A B
Levels: A B C D E
[[4]]
[1] C
Levels: A B C D E
[[5]]
[1] C D
Levels: A B C D E
[[6]]
[1] C D
Levels: A B C D E
[[7]]
[1] D
Levels: A B C D E
[[8]]
[1] E
Levels: A B C D E

然后下一步创建Group_3并将它们组合到Group_final中

# Create Group_3
tmp <- tmp %>% 
mutate(Group_3 = map(.[["Group_2"]],
function(x) { look_up_group(one_group = x, lookup_list = .[["Group_2"]]) })) %>%
mutate(Group_final = unlist(map(Group_3, function(x) { paste (x, collapse = ",")} )))

这是新的tmp

# A tibble: 8 x 5
Group family       Group_2   Group_3   Group_final
<fct> <fct>        <list>    <list>    <chr>      
1 A     Canidae      <fct [2]> <fct [2]> A,B        
2 A     Felidae      <fct [1]> <fct [2]> A,B        
3 B     Canidae      <fct [2]> <fct [2]> A,B        
4 C     Elephantidae <fct [1]> <fct [2]> C,D        
5 C     Galinacae    <fct [2]> <fct [2]> C,D        
6 D     Galinacae    <fct [2]> <fct [2]> C,D        
7 D     Siuridae     <fct [1]> <fct [2]> C,D        
8 E     Apidae       <fct [1]> <fct [1]> E     

然后最后一步组合将族放在Group_final中

tmp %>%
select(Group_final, family) %>%
group_by(Group_final) %>%
summarize(family = paste(family, collapse = ","), .groups = "drop")

的最终结果

# A tibble: 3 x 2
Group_final family                                   
* <chr>       <chr>                                    
1 A,B         Canidae,Felidae,Canidae                  
2 C,D         Elephantidae,Galinacae,Galinacae,Siuridae
3 E           Apidae 

[更新:添加调试tryCatch]

最新更新