I have two datasets, shown below:
df1 <- data.frame(Grade = c("G3","G3","G3","G3","G3","G3","G3","G3","G3","G3"),
names = c("Harper","Mason","Evelyn","Ella","Avery",
"Jackson","Olivia","Isla","Emily","Poppy"))
> df1
   Grade   names
1     G3  Harper
2     G3   Mason
3     G3  Evelyn
4     G3    Ella
5     G3   Avery
6     G3 Jackson
7     G3  Olivia
8     G3    Isla
9     G3   Emily
10    G3   Poppy
df2 <- data.frame(Grade = c("G3","G3","G3","G3","G3","G3","G3"),
names = c("Harper","Mason","Ava","Avery","Isabella",
"Jessica","Emily"))
> df2
  Grade    names
1    G3   Harper
2    G3    Mason
3    G3      Ava
4    G3    Avery
5    G3 Isabella
6    G3  Jessica
7    G3    Emily
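In a new data frame, I want to store four pieces of information:
(a) the common names, (b) the names unique to df1,
(c) the names unique to df2, and (d) a count for each column.
So the dataset should look like this: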
> final
  Grade common.names unique.df1 unique.df2
1    G3       Harper     Evelyn        Ava
2    G3        Mason       Ella   Isabella
3    G3        Avery    Jackson    Jessica
4    G3        Emily     Olivia       <NA>
5    G3         <NA>       Isla       <NA>
6    G3         <NA>      Poppy       <NA>
7 Count            4          6          3
I tried compare() from library(compare), but it does not seem to work for finding the common names.
comparison <- compare(df1,df2,allowAll=TRUE)
comparison$tM
> comparison$tM
  Grade   names
1    G3   AVERY
2    G3    ELLA
3    G3  EVELYN
4    G3  HARPER
5    G3 JACKSON
6    G3   MASON
7    G3  OLIVIA
Any thoughts on this? Thanks!
You can write a function:
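join <- function(x, y)
{
  # Join on every column the two data frames share (here Grade and names)
  join_by <- intersect(names(x), names(y))
  # Transpose each result so that every original row becomes a column
  a <- data.table::transpose(dplyr::inner_join(x, y, join_by))  # rows in both
  b <- data.table::transpose(dplyr::anti_join(x, y, join_by))   # rows only in x
  d <- data.table::transpose(dplyr::anti_join(y, x, join_by))   # rows only in y
  counts <- setNames(lengths(e <- list(a, b, d)),
                     c("common.names", "unique.df1", "unique.df2"))
  # Stack the longest piece first so the retained Grade column has no NA
  ord <- order(counts, decreasing = TRUE)
  f <- do.call(plyr::rbind.fill, e[ord])
  # Transpose back and drop the repeated Grade columns
  s <- data.table::transpose(f)[-c(3, 5)]
  # Label the columns, then restore the requested column order by name
  setNames(s, c("V1", names(counts)[ord]))[c("V1", names(counts))]
}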
join(df1,df2)
  V1 common.names unique.df1 unique.df2
1 G3       Harper     Evelyn        Ava
2 G3        Mason       Ella   Isabella
3 G3        Avery    Jackson    Jessica
4 G3        Emily     Olivia       <NA>
5 G3         <NA>       Isla       <NA>
6 G3         <NA>      Poppy       <NA>
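The result above does not include the count row asked for in the question. Here is a minimal sketch of one way to append it, assuming the result has the four columns shown and that the counts should be the number of non-NA entries per column. add_count_row is a hypothetical helper name, not part of any package, and res is a hand-typed stand-in for the join(df1, df2) output.

```r
# Hypothetical helper: append a "Count" row holding the number of
# non-NA entries in every column except the first.
add_count_row <- function(df) {
  df[] <- lapply(df, as.character)          # make every column character
  counts <- colSums(!is.na(df[-1]))         # non-NA entries per column
  out <- rbind(df, c("Count", unname(counts)))
  rownames(out) <- NULL
  out
}

# Stand-in for the join(df1, df2) result, typed in by hand
res <- data.frame(
  V1           = rep("G3", 6),
  common.names = c("Harper", "Mason", "Avery", "Emily", NA, NA),
  unique.df1   = c("Evelyn", "Ella", "Jackson", "Olivia", "Isla", "Poppy"),
  unique.df2   = c("Ava", "Isabella", "Jessica", NA, NA, NA),
  stringsAsFactors = FALSE
)
with_count <- add_count_row(res)
with_count
```

Because the count is stored in the same columns as the names, every column ends up as character, just as in the desired output in the question.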
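Here is an option where we split the datasets by "Grade" (assuming there are multiple "Grade" values), loop through the lists with Map, get the common and unique elements of the two datasets (with intersect and setdiff, respectively), create a data.frame with cbind.fill (from rowr), and rbind the list elements.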
library(rowr)
lst1 <- split(as.character(df1$names), df1$Grade)
lst2 <- split(as.character(df2$names), df2$Grade)
out <- do.call(rbind, unname(Map(function(x, y, z) {
  cn <- intersect(x, y)
  un1 <- setdiff(x, y)
  un2 <- setdiff(y, x)
  cbind(Grade = z, cbind.fill(cn, un1, un2, fill = NA))
}, lst1, lst2[names(lst1)], names(lst1))))
names(out)[-1] <- c("common.names", "unique.df1", "unique.df2")
out[] <- lapply(out, as.character)
rbind(out, c(Grade = 'Count', colSums(!is.na(out[-1]))))
#  Grade common.names unique.df1 unique.df2
#1    G3       Harper     Evelyn        Ava
#2    G3        Mason       Ella   Isabella
#3    G3        Avery    Jackson    Jessica
#4    G3        Emily     Olivia       <NA>
#5    G3         <NA>       Isla       <NA>
#6    G3         <NA>      Poppy       <NA>
#7 Count            4          6          3
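For comparison, the same split / intersect / setdiff idea can be sketched in base R alone, padding the shorter vectors with NA instead of calling cbind.fill. This is only a sketch, assuming both data frames have the columns Grade and names. compare_names is a hypothetical function name introduced here for illustration.

```r
# Hypothetical helper: common and unique names per Grade, base R only.
compare_names <- function(d1, d2) {
  lst1 <- split(as.character(d1$names), d1$Grade)
  lst2 <- split(as.character(d2$names), d2$Grade)
  pieces <- lapply(names(lst1), function(g) {
    x <- lst1[[g]]
    # Treat a grade that is missing from d2 as an empty set of names
    y <- if (is.null(lst2[[g]])) character(0) else lst2[[g]]
    cols <- list(common.names = intersect(x, y),
                 unique.df1   = setdiff(x, y),
                 unique.df2   = setdiff(y, x))
    n <- max(lengths(cols))
    # `length<-` pads each vector with NA up to length n
    cols <- lapply(cols, `length<-`, n)
    data.frame(Grade = rep(g, n), cols, stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  # Append the count of non-NA entries per column as a final row
  rbind(out, c("Count", unname(colSums(!is.na(out[-1])))))
}

df1 <- data.frame(Grade = rep("G3", 10),
                  names = c("Harper", "Mason", "Evelyn", "Ella", "Avery",
                            "Jackson", "Olivia", "Isla", "Emily", "Poppy"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Grade = rep("G3", 7),
                  names = c("Harper", "Mason", "Ava", "Avery", "Isabella",
                            "Jessica", "Emily"),
                  stringsAsFactors = FALSE)
final <- compare_names(df1, df2)
final
```

As in the answer above, the count row covers all grades together. With several grades one might prefer a count per grade, which this sketch does not attempt.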