r - Function to compare rows of a specific column that share the same ID

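I have a large laboratory database in which some IDs have multiple results. I also created another key variable (initials + age + sex) that is used to match records against hospital medical records. However, I noticed that different initials sometimes share the same hospital ID. I would like to write a function that detects this inconsistency.

So, an example of the database: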

df = data.frame(ID = c("5606","5606","5728","5824","5824","5824","5824"),
                key2 = c("TN35M","TN35M","JJ26M","CD47F","CD47F","DG44M","DG44M"),
                date_sample = c("12/03/2012","12/03/2012","19/04/2012","21/05/2012","21/05/2012","19/10/2012","19/10/2012"),
                service = c("ORTHO","ORTHO","BLOC","VISC","VISC","BLOC","BLOC"),
                germe = c("Acinetobacter sp","Burkholderia pseudomallei","Stenotrophomonas maltophilia","Staphylococcus haemolyticus"," Enterobacter cloacae","Escherichia  coli","Pseudomonas aeruginosa"))

ID      key2    date_sample service germe
5606    TN35M   12/03/2012  ORTHO   Acinetobacter sp
5606    TN35M   12/03/2012  ORTHO   Burkholderia pseudomallei
5728    JJ26M   19/04/2012  BLOC    Stenotrophomonas maltophilia
5824    CD47F   21/05/2012  VISC    Staphylococcus haemolyticus
5824    CD47F   21/05/2012  VISC    Enterobacter cloacae
5824    DG44M   19/10/2012  BLOC    Escherichia coli
5824    DG44M   19/10/2012  BLOC    Pseudomonas aeruginosa

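Each ID should have exactly one key2 value. How can I compare the key2 values of the rows that share the same ID, and create an output variable that flags every inconsistent row, so that I can be sure each ID belongs to a single patient and is not shared by more than one patient?

Something like: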


ID       key2   date_sample service germe                        incoherence
5606    TN35M   12/03/2012  ORTHO   Acinetobacter sp                N
5606    TN35M   12/03/2012  ORTHO   Burkholderia pseudomallei       N
5728    JJ26M   19/04/2012  BLOC    Stenotrophomonas maltophilia    N
5824    CD47F   21/05/2012  VISC    Staphylococcus haemolyticus     Y
5824    CD47F   21/05/2012  VISC    Enterobacter cloacae            Y
5824    DG44M   19/10/2012  BLOC    Escherichia coli                Y
5824    DG44M   19/10/2012  BLOC    Pseudomonas aeruginosa          Y

Using dplyr

library(dplyr)
df %>%
group_by(ID) %>%
mutate(incoherence = c("N", "Y")[(n_distinct(key2) > 1) +1])
#   ID    key2 incoherence
#  <fct> <fct> <chr>      
#1 5606  TN35M N          
#2 5606  TN35M N          
#3 5728  JJ26M N          
#4 5824  CD47F Y          
#5 5824  CD47F Y          
#6 5824  DG44M Y          
#7 5824  DG44M Y       
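Once rows are flagged, it can also help to see which IDs are affected and which keys they collide on. The following is only a sketch under assumptions: it uses a reduced copy of the example data (ID and key2 only), and the names `conflicts`, `n_keys` and `keys` are made up for illustration.

```r
library(dplyr)

# Reduced copy of the example data; only the two columns needed here.
df <- data.frame(
  ID   = c("5606", "5606", "5728", "5824", "5824", "5824", "5824"),
  key2 = c("TN35M", "TN35M", "JJ26M", "CD47F", "CD47F", "DG44M", "DG44M"),
  stringsAsFactors = FALSE
)

# One row per ID that is linked to more than one distinct key2 value.
conflicts <- df %>%
  group_by(ID) %>%
  summarise(n_keys = n_distinct(key2),                       # distinct keys per ID
            keys   = paste(unique(key2), collapse = ", ")) %>%  # which keys collide
  filter(n_keys > 1)

print(conflicts)
```

This gives a short list of IDs to review by hand, rather than a flag on every row.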

Using data.table

library(data.table)
setDT(df)[, incoherence := c("N", "Y")[(uniqueN(key2) > 1) +1], by = ID]

You can count the unique values within each group. If the count is greater than 1, the row is a Y (or, in this case, TRUE), i.e.

!with(df, ave(key2, ID, FUN = function(i) length(unique(i)))) == 1
#[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

Note: make sure the variables are character, not factor.
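Since the question asks for a function, the base R idea can be wrapped as below. This is a minimal sketch, not the answerer's code: the name `flag_incoherence` and its arguments `id_col` and `key_col` are hypothetical. It converts both columns to character first, so it should also work when the columns are stored as factors.

```r
# Hypothetical helper: adds an "incoherence" column that is "Y" for every row
# whose ID is associated with more than one distinct key, and "N" otherwise.
flag_incoherence <- function(data, id_col = "ID", key_col = "key2") {
  # Convert to character so factor columns do not cause problems
  ids  <- as.character(data[[id_col]])
  keys <- as.character(data[[key_col]])

  # For each row, the number of distinct keys found within that row's ID group
  n_keys <- ave(seq_along(keys), ids,
                FUN = function(i) length(unique(keys[i])))

  data$incoherence <- ifelse(n_keys > 1, "Y", "N")
  data
}

# Usage on a reduced copy of the example data, deliberately stored as factors
df <- data.frame(
  ID   = c("5606", "5606", "5728", "5824", "5824", "5824", "5824"),
  key2 = c("TN35M", "TN35M", "JJ26M", "CD47F", "CD47F", "DG44M", "DG44M"),
  stringsAsFactors = TRUE
)
res <- flag_incoherence(df)
print(res)
```

Grouping on row indices rather than on the key column itself keeps the result of `ave` numeric, so the comparison with 1 does not depend on character coercion.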
