R - Group ids based on at least one common value



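I have a list whose elements are integer vectors, and I would like to merge (accumulate) these elements whenever they share at least one value. Elements that share no values with any other element should stay as they are. Here is my sample data: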

x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8), c(6, 9, 7), 7, c(5, 8), 10, 11)

And this is my desired output:

desired_result <- list(c(1, 2, 3, 4, 5, 8), 
c(6, 9, 7), 
10, 
11)

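I would first like to do this with the reduce or accumulate functions from purrr, but any other tidyverse solution is welcome. So far I have tried the following, but it only gives me a single union and apparently discards the rest: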

x %>% 
reduce(~ if(any(.x %in% .y)) union(.x, .y) else .x)
[1] 1 2 3 4 5 8

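In general, I am looking for a way to group integers (ids) that share common values, much like clustering, but unfortunately my efforts have been in vain so far.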

Thank you very much in advance for your help.

I suspect there is a set-cover solution, but in the meantime, here is a graph approach:

First, let's convert the integer vectors into an edge list so that it can be made into a graph. We can use expand.grid:

library(igraph)
edgelist <- do.call(rbind, lapply(x, \(x) expand.grid(x, x))) # the \(x) lambda needs R version >= 4.1.0

Now we have a two-column data.frame showing the connections (a set of edges) between all the integers.
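To make the edge list concrete, here is a small base R sketch of what expand.grid returns for a single element. The names `edges` and `singleton` are only illustrative. Note that every value is also paired with itself, which is what keeps stand-alone elements such as 10 and 11 in the graph as their own components.

```r
# All ordered pairs of values within one list element
edges <- expand.grid(c(1, 2), c(1, 2))
print(edges)
#   Var1 Var2
# 1    1    1
# 2    2    1
# 3    1    2
# 4    2    2

# A length-one element still yields one row: a self-loop (10, 10)
singleton <- expand.grid(10, 10)
print(singleton)
```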

graph.data.frame can conveniently build a graph from this edge list.

From there we can use igraph::components to extract the connected components.

g <- graph.data.frame(edgelist)
split(names(components(g)$membership),components(g)$membership)
#$`1`
#[1] "1" "2" "3" "4" "5" "8"
#$`2`
#[1] "6" "9" "7"
#$`3`
#[1] "10"
#$`4`
#[1] "11"

Or with the tidyverse:

library(dplyr); library(purrr)
map_dfr(x, ~expand.grid(.x,.x)) %>%
graph.data.frame() %>%
components() %>% 
pluck(membership) %>%
stack() %>%
{split(as.numeric(as.character(.[,2])),.[,1])}
$`1`
[1] 1 2 3 4 5 8
$`2`
[1] 6 9 7
$`3`
[1] 10
$`4`
[1] 11
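For readers who want to stay closer to the reduce idea from the question, here is a minimal sketch that does not use igraph. It is not taken from any of the answers: the helper name `merge_groups` is hypothetical, and base R's Reduce stands in for purrr::reduce. The key difference from the attempt in the question is that the accumulator is a list of disjoint groups rather than a single vector, so nothing gets discarded.

```r
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8),
          c(6, 9, 7), 7, c(5, 8), 10, 11)

# Hypothetical helper: fold one vector `v` into a list of disjoint groups.
# Every existing group that shares a value with `v` is merged with it;
# all other groups are carried over unchanged.
merge_groups <- function(groups, v) {
  hit <- vapply(groups, function(g) any(g %in% v), logical(1))
  merged <- unique(c(unlist(groups[hit]), v))
  c(groups[!hit], list(merged))
}

# Start from an empty list of groups and fold over the input
result <- Reduce(merge_groups, x, list())
print(result)
```

The groups may come out in a different order than in the desired output, because a group moves to the end of the list each time it is extended.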

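One approach:

Set i is adjacent to set j iff intersect(i, j) != the empty set. We want to find the connected components of the matrix that has a 1 at position (i, j) when set i is adjacent to set j, and 0 otherwise. The first 4 lines build the adjacency matrix, lines 5 and 6 find the connected components, and the rest splits the list based on that membership and takes the unique values.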

library(tidyverse)
library(igraph)
map(x, function(a) map_int(x, ~length(base::intersect(a, .x)) > 0) * 1L) %>% 
reduce(rbind) %>%
graph.adjacency() %>%
as.undirected() %>%
components() %>%
pluck("membership") %>%
split(seq_along(.), .) %>%
map(~unique(unlist(x[.x])))
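The adjacency matrix at the heart of this approach can also be inspected on its own. The following is a base R sketch under the same definition (entry (i, j) is 1 when elements i and j of the list share a value); the name `adj` is only illustrative.

```r
x <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 8),
          c(6, 9, 7), 7, c(5, 8), 10, 11)

# adj[i, j] is 1 when x[[i]] and x[[j]] have at least one value in common
adj <- sapply(x, function(a) sapply(x, function(b) as.integer(any(a %in% b))))

dim(adj)      # one row and one column per list element
adj[6, 7]     # c(6, 9, 7) and 7 share the value 7
adj[1, 9]     # c(1, 2) and 10 share nothing
```

The matrix is symmetric with ones on the diagonal, since every element trivially intersects itself.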

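Thanks to a very informative post introduced by my dear friend @Ian Campbell, I wanted to challenge myself to write a custom function for this. It is still a first version, not very elegant, and it can certainly be improved a lot, but for now it is stable: I have tried it on a few inputs and it has not let me down.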

anoush <- function(x) {
# First we check whether x is a list
stopifnot(is.list(x)) 
# Then we take every element of the input and calculate the intersect between
# that element & others. In case there were some we would store the indices 
# in `vec` vector. So in the end we have a list called `ind` whose elements 
# are all the indices connected with the corresponding elements of the original 
# list for example first element of `ind` is `1`, `2`, `3` which means in 
# the original list these elements have common values.

ind <- lapply(1:length(x), function(a) {
vec <- c()
for(i in 1:length(x)) {
if(length(unique(base::intersect(x[[a]], x[[i]]))) > 0) {
vec <- c(vec, i)
}
}
vec 
})
# Then we go on to again compare each element of `ind` with other elements
# in case there were any intersect, we will calculate the `union` of them.
# for each element we will end up with a list of accumulated values,
# but in the end we use `Reduce` to capture only the last one. So for each
# element of `ind` we end up having a collection of indices that also 
# result in duplicated values. For example elements `1` through `5` of 
# `dup_ind` contains the same value cause in the original list these 
# elements have common values.
dup_ind <- lapply(1:length(ind), function(a) {
out <- c()
for(i in 1:length(ind)) {
if(length(unique(base::intersect(ind[[a]], ind[[i]]))) > 0) {
out[[i]] <- union(ind[[a]], ind[[i]])
}
vec2 <- Reduce("union", out)
}
vec2
}) 
# Here we get rid of the duplicated elements of the list by means of the
# `relist` function, and since in this process all the duplicated elements
# will turn to `integer(0)` I have filtered those out.

un <- unlist(dup_ind)
res <- Map(`[`, dup_ind, relist(!duplicated(un), skeleton = dup_ind))
res2 <- Filter(length, res)

sapply(res2, function(a) unique(unlist(lapply(a, function(b) `[[`(x, b)))))

}

Output

> anoush(x)
[[1]]
[1] 1 2 3 4 5 8
[[2]]
[1] 6 9 7
[[3]]
[1] 10
[[4]]
[1] 11
