Here is my dataset:
df = data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
Diagnosis=c("NEG","NEG","POS","NEG","NEG","NEG","NEG","POS","POS",'','',''))
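I want to drop the duplicate cases within each ID using these rules: 1) if any Diagnosis in the group equals "POS", select one "POS" observation; 2) if no Diagnosis equals "POS", check whether any Diagnosis equals "NEG" and select one "NEG" observation; 3) in all other cases, where neither rule 1 nor rule 2 applies, just pick any one record from the group (ID).
This is the dataset I want: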
df = data.frame(ID = c(1,2,3,4),
Diagnosis=c("POS","NEG","POS",''))
This is the code I tried, but it does not give the expected result:
df_unique <- df %>% group_by(ID) %>% filter(Diagnosis==ifelse('POS' %in% Diagnosis, first(Diagnosis=='POS'),ifelse((!('POS' %in% Diagnosis)&('NEG' %in% Diagnosis)),first(Diagnosis=='POS'),first(Diagnosis))))
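One approach is to turn Diagnosis into an ordered factor whose levels follow the desired priority, then arrange (sort) by that column and select the first row of each group.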
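library(dplyr)
# Level order encodes the priority: POS first, then NEG, then the empty string
df$Diagnosis <- factor(df$Diagnosis, levels = c("POS", "NEG", ""), ordered = TRUE)
# Sort by priority, then keep the first row within each ID
df %>%
  group_by(ID) %>%
  arrange(Diagnosis) %>%
  slice(1)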
     ID Diagnosis
  <dbl> <ord>
1     1 "POS"
2     2 "NEG"
3     3 "POS"
4     4 ""
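If you would rather not change the type of the Diagnosis column, the same priority idea can probably be expressed with a temporary rank column. The following is only a sketch: it assumes dplyr 1.0.0 or later (for slice_min), and the names priority, rank and df_unique are made up for illustration.

```r
library(dplyr)

df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 Diagnosis = c("NEG","NEG","POS","NEG","NEG","NEG","NEG","POS","POS","","",""))

# Hypothetical helper: earlier entries win over later ones
priority <- c("POS", "NEG")

df_unique <- df %>%
  # Rank each row by its position in `priority`; anything else gets the worst rank
  mutate(rank = match(as.character(Diagnosis), priority,
                      nomatch = length(priority) + 1L)) %>%
  group_by(ID) %>%
  # Keep exactly one best-ranked row per ID (with_ties = FALSE drops the duplicates)
  slice_min(rank, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-rank)

df_unique
```

Because unmatched values all share the worst rank, any value other than "POS" or "NEG" (not just the empty string) falls under rule 3.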
We can also use distinct after sorting with arrange, since distinct keeps the first row it sees for each ID:
library(dplyr)
library(forcats)
df %>%
  # Put the levels in priority order: POS, then NEG, then the empty string
  mutate(Diagnosis = fct_relevel(Diagnosis, c("POS", "NEG", ""))) %>%
  arrange(ID, Diagnosis) %>%
  distinct(ID, .keep_all = TRUE)
# ID Diagnosis
#1 1 POS
#2 2 NEG
#3 3 POS
#4 4
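For completeness, the sort-then-deduplicate idea does not strictly need any packages. Below is a rough base R sketch of the same logic; the variable names level_order, code, sorted and df_base are my own and purely illustrative.

```r
df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 Diagnosis = c("NEG","NEG","POS","NEG","NEG","NEG","NEG","POS","POS","","",""))

# Integer codes follow the level order, so POS = 1, NEG = 2, "" = 3
level_order <- c("POS", "NEG", "")
code <- as.integer(factor(as.character(df$Diagnosis), levels = level_order))

# Sort by ID, then by priority code (values outside level_order become NA and sort last)
sorted <- df[order(df$ID, code), ]

# duplicated() flags every row after the first one for each ID
df_base <- sorted[!duplicated(sorted$ID), ]
rownames(df_base) <- NULL

df_base
```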