r - Detect mismatches between two data frames based on specific columns



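I have two data frames containing the ratings of two independent raters. Column x codes the publication year of the paper with a given reference ID (Ref.ID). For some papers, multiple samples were coded. This information is reflected in the variable "Sample.ID" (e.g., in df1, three samples were coded for Ref.ID "C"). The combination of reference ID and sample ID is given in the variable "Ref.Sample.ID". I want to know where df1 and df2 differ in their coding of the variable x. Note that df2 has one row fewer than df1, because the rater in df2 coded only two samples for Ref.ID "C", whereas the rater in df1 coded three.

I am trying to find R code that will expose the mismatches between df1 and df2. A mismatch can arise either because the number of rows coded per reference ID differs, or because x differs between df1 and df2 for the same Ref.Sample.ID.

Does anyone know the best way to do this? I'd be grateful for any hint :)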

df1 <- read.table(text="
Ref.ID    Sample.ID    Ref.Sample.ID     x       y
A         1            A-1               2000    a    
B         1            B-1               1992    a
C         1            C-1               2018    b 
C         2            C-2               2018    b   
C         3            C-3               2018    b   
D         1            D-1               2011    c 
D         1            D-1               2011    c
E         1            E-1               1990    a      
F         1            F-1               1990    c   
G         1            G-1               2015    d   
G         2            G-2               2015    d    
G         3            G-3               2015    d", header=TRUE)
# Note df2 has one row less than df1!
df2 <- read.table(text="
Ref.ID    Sample.ID    Ref.Sample.ID     x       y     
A         1            A-1               2000    a   
B         1            B-1               1992    a
C         1            C-1               2018    b
C         2            C-2               2018    b   
D         1            D-1               2011    a 
D         2            D-2               2011    a
E         1            E-1               1991    a       
F         1            F-1               1990    d   
G         1            G-1               2011    d    
G         2            G-2               2011    d     
G         3            G-3               2011    c", header=TRUE)

The final result should be a vector of the distinct Ref.Sample.IDs for which df1 and df2 differ on x or y.

E.g. for x: "C-3" "E-1" "G-1" "G-2" "G-3" "D-2"

For y: "C-3" "D-1" "F-1" "G-3" "D-2"

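This uses both tidyr and dplyr.

You can first apply pivot_longer to both data frames, so that x and y each get their own row to compare. Then use anti_join to find the differences between the two data frames. Running it in both directions catches rows that are extra, missing, or different in either data frame.

Finally, to get the end result, filter on var for x or y, select Ref.Sample.ID as the column of interest, and call distinct() to remove duplicates. If you want all results in a single data frame, you can use group_by(var) instead of filter.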

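library(tidyverse)

# Reshape so that x and y each get their own row. The values are converted to
# character so that the integer x and the character y can share one column.
df1_long <- pivot_longer(df1, cols = c(x, y), names_to = "var", values_to = "val", values_transform = list(val = as.character))
df2_long <- pivot_longer(df2, cols = c(x, y), names_to = "var", values_to = "val", values_transform = list(val = as.character))

# Rows present in one data frame but not in the other, in both directions
df_diff <- bind_rows(anti_join(df1_long, df2_long), anti_join(df2_long, df1_long))

# Distinct Ref.Sample.IDs that differ on x
df_diff %>%
  filter(var == "x") %>%
  select(Ref.Sample.ID) %>%
  distinct()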

Output

# A tibble: 6 x 1
Ref.Sample.ID
<chr>        
1 C-3          
2 E-1          
3 G-1          
4 G-2          
5 G-3          
6 D-2 
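
The question asks for vectors, whereas the pipeline above returns a tibble. As a minimal sketch (not part of the original answer), the same idea can produce one character vector per variable by keeping the distinct (var, Ref.Sample.ID) pairs and splitting them by var. The helper `to_long`, the `keys` vector, and the name `mismatch_ids` are hypothetical, and the example data is rebuilt here with only the columns the comparison needs.

```r
library(dplyr)
library(tidyr)

# Example data from the question, reduced to the columns used here (assumption:
# Ref.Sample.ID alone identifies a sample, so Ref.ID and Sample.ID are omitted)
df1 <- data.frame(
  Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2", "C-3", "D-1", "D-1",
                    "E-1", "F-1", "G-1", "G-2", "G-3"),
  x = c(2000L, 1992L, 2018L, 2018L, 2018L, 2011L, 2011L,
        1990L, 1990L, 2015L, 2015L, 2015L),
  y = c("a", "a", "b", "b", "b", "c", "c", "a", "c", "d", "d", "d")
)
df2 <- data.frame(
  Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2", "D-1", "D-2",
                    "E-1", "F-1", "G-1", "G-2", "G-3"),
  x = c(2000L, 1992L, 2018L, 2018L, 2011L, 2011L,
        1991L, 1990L, 2011L, 2011L, 2011L),
  y = c("a", "a", "b", "b", "a", "a", "a", "d", "d", "d", "c")
)

# Hypothetical helper: one row per (sample, variable), values as character
to_long <- function(df) {
  pivot_longer(df, cols = c(x, y), names_to = "var", values_to = "val",
               values_transform = list(val = as.character))
}

keys <- c("Ref.Sample.ID", "var", "val")
df1_long <- to_long(df1)
df2_long <- to_long(df2)

# Rows found in only one of the two data frames
df_diff <- bind_rows(
  anti_join(df1_long, df2_long, by = keys),
  anti_join(df2_long, df1_long, by = keys)
)

# Keep each (var, Ref.Sample.ID) pair once, then split into a named list
diff_ids <- distinct(df_diff, var, Ref.Sample.ID)
mismatch_ids <- split(diff_ids$Ref.Sample.ID, diff_ids$var)

mismatch_ids$x
mismatch_ids$y
```

`mismatch_ids$x` and `mismatch_ids$y` are then plain character vectors, so they can be passed directly to `%in%` or `setdiff()` for further checks.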

Data

df1 <- structure(list(Ref.ID = c("A", "B", "C", "C", "C", "D", "D", 
"E", "F", "G", "G", "G"), Sample.ID = c(1L, 1L, 1L, 2L, 3L, 1L, 
1L, 1L, 1L, 1L, 2L, 3L), Ref.Sample.ID = c("A-1", "B-1", "C-1", 
"C-2", "C-3", "D-1", "D-1", "E-1", "F-1", "G-1", "G-2", "G-3"
), x = c(2000L, 1992L, 2018L, 2018L, 2018L, 2011L, 2011L, 1990L, 
1990L, 2015L, 2015L, 2015L), y = c("a", "a", "b", "b", "b", "c", 
"c", "a", "c", "d", "d", "d")), class = "data.frame", row.names = c(NA, 
-12L))
df2 <- structure(list(Ref.ID = c("A", "B", "C", "C", "D", "D", "E", 
"F", "G", "G", "G"), Sample.ID = c(1L, 1L, 1L, 2L, 1L, 2L, 1L, 
1L, 1L, 2L, 3L), Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2", 
"D-1", "D-2", "E-1", "F-1", "G-1", "G-2", "G-3"), x = c(2000L, 
1992L, 2018L, 2018L, 2011L, 2011L, 1991L, 1990L, 2011L, 2011L, 
2011L), y = c("a", "a", "b", "b", "a", "a", "a", "d", "d", "d", 
"c")), class = "data.frame", row.names = c(NA, -11L))
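
One caveat with this data: anti_join matches on values, so a row that is duplicated in only one data frame (such as the second D-1 row in df1) does not show up as a mismatch on x. If differing row counts should also be flagged, one possible sketch is to count rows per Ref.Sample.ID in each data frame and compare the counts. This is an addition to the answer above, not part of it; the names `n1`, `n2`, and `count_diff` are hypothetical, and only the ID column of the example data is reproduced.

```r
library(dplyr)

# Ref.Sample.ID columns of df1 and df2 from the example data
ids1 <- data.frame(Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2", "C-3", "D-1",
                                     "D-1", "E-1", "F-1", "G-1", "G-2", "G-3"))
ids2 <- data.frame(Ref.Sample.ID = c("A-1", "B-1", "C-1", "C-2", "D-1", "D-2",
                                     "E-1", "F-1", "G-1", "G-2", "G-3"))

# Number of rows each rater coded per Ref.Sample.ID
n1 <- count(ids1, Ref.Sample.ID, name = "n_df1")
n2 <- count(ids2, Ref.Sample.ID, name = "n_df2")

# Join the counts, treat an ID that is absent from one data frame as a count
# of 0, and keep only the IDs whose counts disagree
count_diff <- full_join(n1, n2, by = "Ref.Sample.ID") %>%
  mutate(across(c(n_df1, n_df2), ~ coalesce(.x, 0L))) %>%
  filter(n_df1 != n_df2)

count_diff
```

For the example data this flags C-3 and D-2 (each present in only one data frame) as well as D-1, which appears twice in df1 but once in df2.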
