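I have two data frames containing ratings from two independent raters. Column x encodes the publication year of the paper with a given reference ID (Ref.ID). For some papers, several samples were coded. This is reflected in the variable "Sample.ID" (for example, in df1 three samples were coded for Ref.ID "C"). The combination of reference ID and sample ID is stored in the variable "Ref.Sample.ID". I want to know where df1 and df2 differ in their coding of variable x. Note that df2 has one row fewer than df1, because the rater in df2 coded only two samples for Ref.ID "C", whereas the rater in df1 coded three.
I am trying to find R code that exposes the mismatches between df1 and df2. A mismatch can arise either because a different number of rows was coded for a Ref.ID, or because x differs between df1 and df2 for the same Ref.Sample.ID.
Does anyone know the best way to do this? I'd be grateful for any hint :)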
df1 <- read.table(text="
Ref.ID Sample.ID Ref.Sample.ID x y
A 1 A-1 2000 a
B 1 B-1 1992 a
C 1 C-1 2018 b
C 2 C-2 2018 b
C 3 C-3 2018 b
D 1 D-1 2011 c
D 1 D-1 2011 c
E 1 E-1 1990 a
F 1 F-1 1990 c
G 1 G-1 2015 d
G 2 G-2 2015 d
G 3 G-3 2015 d", header=TRUE)
# Note df2 has one row less than df1!
df2 <- read.table(text="
Ref.ID Sample.ID Ref.Sample.ID x y
A 1 A-1 2000 a
B 1 B-1 1992 a
C 1 C-1 2018 b
C 2 C-2 2018 b
D 1 D-1 2011 a
D 2 D-2 2011 a
E 1 E-1 1991 a
F 1 F-1 1990 d
G 1 G-1 2011 d
G 2 G-2 2011 d
G 3 G-3 2011 c", header=TRUE)
The end result should be a vector of the distinct Ref.Sample.IDs for which df1 and df2 differ on x or on y.
For example, for x: "C-3" "E-1" "G-1" "G-2" "G-3" "D-2"
For y: "C-3" "D-1" "F-1" "G-3" "D-2"
This uses both tidyr and dplyr.
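You can first apply pivot_longer to both data frames, so that x and y each get their own row to compare. Then use anti_join to find the differences between the two data frames. Because anti_join only returns the rows of its first argument that have no match in the second, running it in both directions catches extra, missing, and differing rows in either data frame.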
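As a rough illustration of why the join is run in both directions, here is a minimal sketch on two made-up tables (`a`, `b`, and their columns `id` and `val` are hypothetical and are not the data from the question):

```r
library(dplyr)

# Hypothetical toy tables, not the data from the question
a <- data.frame(id = c("A-1", "B-1", "C-1"), val = c("1", "2", "3"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c("A-1", "B-1", "D-1"), val = c("1", "9", "4"),
                stringsAsFactors = FALSE)

# Rows of a that have no exact (id, val) match in b
only_in_a <- anti_join(a, b, by = c("id", "val"))
# Rows of b that have no exact (id, val) match in a
only_in_b <- anti_join(b, a, by = c("id", "val"))

# Stacking both directions gives every row that is extra, missing, or different
bind_rows(only_in_a, only_in_b)
```

Here "B-1" shows up in both directions because its value differs, while "C-1" and "D-1" each show up once because they exist in only one table.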
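Finally, to get the end result, filter on x or y, select Ref.Sample.ID as the column of interest, and call distinct() to drop duplicates. If you want all results in a single data frame, you can use group_by(var) instead of filter.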
library(tidyverse)

# values_transform casts x (integer) and y (character) to character so both fit in one column
df1_long <- pivot_longer(df1, cols = c(x, y), names_to = "var", values_to = "val", values_transform = list(val = as.character))
df2_long <- pivot_longer(df2, cols = c(x, y), names_to = "var", values_to = "val", values_transform = list(val = as.character))

# Rows found in only one of the two data frames (joined on all shared columns)
df_diff <- bind_rows(anti_join(df1_long, df2_long), anti_join(df2_long, df1_long))

df_diff %>%
  filter(var == "x") %>%
  select(Ref.Sample.ID) %>%
  distinct()
Output
# A tibble: 6 x 1
Ref.Sample.ID
<chr>
1 C-3
2 E-1
3 G-1
4 G-2
5 G-3
6 D-2
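Since the question asks for vectors rather than a tibble, one possible variation is to collect the mismatching IDs for x and y in a single pass and split them by variable. The following is only a sketch: the helper `to_long` and the names `mismatches` and `mismatch_ids` are made up for illustration, the data is rebuilt compactly from the question's tables, and `values_transform` assumes tidyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

# Compact re-creation of the question's data
df1 <- data.frame(
  Ref.ID = c("A", "B", "C", "C", "C", "D", "D", "E", "F", "G", "G", "G"),
  Sample.ID = c(1L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L),
  x = c(2000L, 1992L, 2018L, 2018L, 2018L, 2011L, 2011L, 1990L, 1990L, 2015L, 2015L, 2015L),
  y = c("a", "a", "b", "b", "b", "c", "c", "a", "c", "d", "d", "d"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Ref.ID = c("A", "B", "C", "C", "D", "D", "E", "F", "G", "G", "G"),
  Sample.ID = c(1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 3L),
  x = c(2000L, 1992L, 2018L, 2018L, 2011L, 2011L, 1991L, 1990L, 2011L, 2011L, 2011L),
  y = c("a", "a", "b", "b", "a", "a", "a", "d", "d", "d", "c"),
  stringsAsFactors = FALSE
)
df1$Ref.Sample.ID <- paste(df1$Ref.ID, df1$Sample.ID, sep = "-")
df2$Ref.Sample.ID <- paste(df2$Ref.ID, df2$Sample.ID, sep = "-")

# Hypothetical helper: one row per (sample, variable), values cast to character
to_long <- function(df) {
  pivot_longer(df, cols = c(x, y), names_to = "var", values_to = "val",
               values_transform = list(val = as.character))
}

keys <- c("Ref.ID", "Sample.ID", "Ref.Sample.ID", "var", "val")
df_diff <- bind_rows(
  anti_join(to_long(df1), to_long(df2), by = keys),
  anti_join(to_long(df2), to_long(df1), by = keys)
)

# One row per (variable, ID) pair, then one character vector per variable
mismatches <- distinct(df_diff, var, Ref.Sample.ID)
mismatch_ids <- split(mismatches$Ref.Sample.ID, mismatches$var)

mismatch_ids$x
mismatch_ids$y
```

With the question's data, `mismatch_ids$x` should contain "C-3", "E-1", "G-1", "G-2", "G-3", "D-2" and `mismatch_ids$y` should contain "C-3", "D-1", "F-1", "G-3", "D-2", matching the vectors requested in the question.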
Data
df1 <- structure(list(Ref.ID = c("A", "B", "C", "C", "C", "D", "D",
"E", "F", "G", "G", "G"), Sample.ID = c(1L, 1L, 1L, 2L, 3L, 1L,
1L, 1L, 1L, 1L, 2L, 3L), Ref.Sample.ID = c("A-1", "B-1", "C-1",
"C-2", "C-3", "D-1", "D-1", "E-1", "F-1", "G-1", "G-2", "G-3"
), x = c(2000L, 1992L, 2018L, 2018L, 2018L, 2011L, 2011L, 1990L,
1990L, 2015L, 2015L, 2015L), y = c("a", "a", "b", "b", "b", "c",
"c", "a", "c", "d", "d", "d")), class = "data.frame", row.names = c(NA,
-12L))
df2 <- structure(list(Ref.ID = c("A", "B", "C", "C", "D", "D", "E",
"F", "G", "G", "G"), Sample.ID = c(1L, 1L, 1L, 2L, 1L, 2L, 1L,
1L, 1L, 2L, 3L), Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2",
"D-1", "D-2", "E-1", "F-1", "G-1", "G-2", "G-3"), x = c(2000L,
1992L, 2018L, 2018L, 2011L, 2011L, 1991L, 1990L, 2011L, 2011L,
2011L), y = c("a", "a", "b", "b", "a", "a", "a", "d", "d", "d",
"c")), class = "data.frame", row.names = c(NA, -11L))