R - Finding the features of one group that distinguish it from another group



I have two groups of patients (ill and healthy). Each patient has ranked features like this:

healthy_patient1 <- data.frame(feature=c("a", "b", "c", "d", "e", "f"), rank = c(0.001, 0.002, 0.002, 0.003, 0.05, 0.067))
healthy_patient2 <- data.frame(feature=c("a", "d", "e", "f", "g", "h", "q"), rank = c(0.001, 0.008, 0.01, 0.02, 0.05, 0.067, 1.2))
healthy_patient3 <- data.frame(feature=c("c", "d", "e", "g", "k", "l"), rank = c(0.003, 0.005, 0.01, 0.02, 0.05, 0.08))
healthy_patient4 <- data.frame(feature=c("b", "e", "g", "d", "k", "q", "o"), rank = c(0.001, 0.008, 0.01, 0.021, 0.054, 0.078, 1.1))
ill_patient1 <- data.frame(feature=c("c", "d", "e", "f", "o", "p", "q"), rank = c(0.002, 0.004, 0.005, 0.006, 0.02, 0.067, 0.09))
ill_patient2 <- data.frame(feature=c("e", "f", "o", "p", "r"), rank = c(0.001, 0.003, 0.02, 0.02, 0.03))
ill_patient3 <- data.frame(feature=c("c", "e", "o", "n", "k", "r"), rank = c(0.003, 0.005, 0.01, 0.03, 0.04, 0.08))
ill_patient4 <- data.frame(feature=c("b", "e", "o", "h", "n", "r", "s"), rank = c(0.002, 0.007, 0.01, 0.02, 0.03, 0.068, 1.1))

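The rank reflects how specific a feature is to a given patient: the lower the rank, the more important the feature. I want to find the features that all healthy patients share and that set them apart from the ill patients. And vice versa: the features that are common to the ill patients but not to the healthy ones.

I also need the sum of ranks for those common features.

I tried this: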

healthy_comm <- intersect(intersect(healthy_patient1$feature, healthy_patient2$feature),intersect(healthy_patient3$feature, healthy_patient4$feature))
ill_comm <- intersect(intersect(ill_patient1$feature, ill_patient2$feature),intersect(ill_patient3$feature, ill_patient4$feature))
setdiff(healthy_comm, ill_comm)
healthy_comm 
[1] "d" "e"
ill_comm 
[1] "e" "o"
setdiff(healthy_comm, ill_comm) 
[1] "d"

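I could go back to the original data and look up the rank sum for "d", but my real dataset has many more patients and features. So perhaps there is a more elegant and efficient solution to this problem.

In this case, the desired output would be "d", sum_rank_healthy(d)=0.037, sum_rank_ill(d)=0.004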

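Here is the basic idea of how it could work:

  1. Add the name of each data frame as a column to every data frame.
  2. Then build the data frames df_healthy and df_ill with bind_rows.
  3. Then apply inner_join by feature (you could also join by rank). From the output you can find the common and the differing features.
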
ill_patient1$patient <- "ill_patient1"
ill_patient2$patient <- "ill_patient2"
ill_patient3$patient <- "ill_patient3"
ill_patient4$patient <- "ill_patient4"
healthy_patient1$patient <- "healthy_patient1"
healthy_patient2$patient <- "healthy_patient2"
healthy_patient3$patient <- "healthy_patient3"
healthy_patient4$patient <- "healthy_patient4"

# dplyr must be loaded before bind_rows() and inner_join() are called
library(dplyr)

df_healthy <- bind_rows(healthy_patient1, healthy_patient2, healthy_patient3, healthy_patient4)
df_ill <- bind_rows(ill_patient1, ill_patient2, ill_patient3, ill_patient4)

inner_join(df_ill, df_healthy, by = "feature")
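The join lists every ill/healthy pairing of a shared feature, but it does not yet say which features occur in all patients of a group, nor what their rank sums are. The following is a minimal sketch of one way to get there with dplyr, not necessarily what the answer had in mind. It assumes "common" means "present in every patient of the group"; the names `all_patients`, `per_feature`, `healthy_only`, `ill_only` and `rank_sums` are made up for illustration.

```r
library(dplyr)

# Example data copied from the question, stored as named lists so that
# bind_rows(.id = ...) can record which patient each row came from.
healthy <- list(
  healthy_patient1 = data.frame(feature = c("a", "b", "c", "d", "e", "f"),
                                rank = c(0.001, 0.002, 0.002, 0.003, 0.05, 0.067)),
  healthy_patient2 = data.frame(feature = c("a", "d", "e", "f", "g", "h", "q"),
                                rank = c(0.001, 0.008, 0.01, 0.02, 0.05, 0.067, 1.2)),
  healthy_patient3 = data.frame(feature = c("c", "d", "e", "g", "k", "l"),
                                rank = c(0.003, 0.005, 0.01, 0.02, 0.05, 0.08)),
  healthy_patient4 = data.frame(feature = c("b", "e", "g", "d", "k", "q", "o"),
                                rank = c(0.001, 0.008, 0.01, 0.021, 0.054, 0.078, 1.1))
)
ill <- list(
  ill_patient1 = data.frame(feature = c("c", "d", "e", "f", "o", "p", "q"),
                            rank = c(0.002, 0.004, 0.005, 0.006, 0.02, 0.067, 0.09)),
  ill_patient2 = data.frame(feature = c("e", "f", "o", "p", "r"),
                            rank = c(0.001, 0.003, 0.02, 0.02, 0.03)),
  ill_patient3 = data.frame(feature = c("c", "e", "o", "n", "k", "r"),
                            rank = c(0.003, 0.005, 0.01, 0.03, 0.04, 0.08)),
  ill_patient4 = data.frame(feature = c("b", "e", "o", "h", "n", "r", "s"),
                            rank = c(0.002, 0.007, 0.01, 0.02, 0.03, 0.068, 1.1))
)

# One long table with a patient and a group column.
all_patients <- bind_rows(
  bind_rows(healthy, .id = "patient") %>% mutate(group = "healthy"),
  bind_rows(ill, .id = "patient") %>% mutate(group = "ill")
)

# Number of patients in each group.
n_per_group <- all_patients %>%
  group_by(group) %>%
  summarise(n_patients = n_distinct(patient), .groups = "drop")

# Per group and feature: how many patients have it, and the sum of its ranks.
per_feature <- all_patients %>%
  group_by(group, feature) %>%
  summarise(n_with_feature = n_distinct(patient),
            sum_rank = sum(rank),
            .groups = "drop") %>%
  left_join(n_per_group, by = "group") %>%
  mutate(common = n_with_feature == n_patients)  # assumption: common = in every patient

common_healthy <- per_feature$feature[per_feature$group == "healthy" & per_feature$common]
common_ill     <- per_feature$feature[per_feature$group == "ill" & per_feature$common]

healthy_only <- setdiff(common_healthy, common_ill)  # "d" for the example data
ill_only     <- setdiff(common_ill, common_healthy)  # "o" for the example data

# Rank sums of the distinguishing features in both groups.
rank_sums <- per_feature %>%
  filter(feature %in% c(healthy_only, ill_only)) %>%
  select(feature, group, sum_rank)

print(rank_sums)
```

For the example data this gives a healthy rank sum of 0.037 and an ill rank sum of 0.004 for "d", matching the output requested in the question. Relaxing the definition of "common" (for instance, present in at least 75% of a group) would only require changing the `common` column.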

You can extend the join with a column that flags whether the two ranks are equal:

library(dplyr)
inner_join(df_ill, df_healthy, by = "feature") %>% 
mutate(common_rank = as.logical(rank.x == rank.y))

Output:

feature rank.x patient.x    rank.y patient.y        common_rank
<chr>    <dbl> <chr>         <dbl> <chr>            <lgl>      
1 c        0.002 ill_patient1  0.002 healthy_patient1 TRUE       
2 c        0.002 ill_patient1  0.003 healthy_patient3 FALSE      
3 d        0.004 ill_patient1  0.003 healthy_patient1 FALSE      
4 d        0.004 ill_patient1  0.008 healthy_patient2 FALSE      
5 d        0.004 ill_patient1  0.005 healthy_patient3 FALSE      
6 d        0.004 ill_patient1  0.021 healthy_patient4 FALSE      
7 e        0.005 ill_patient1  0.05  healthy_patient1 FALSE      
8 e        0.005 ill_patient1  0.01  healthy_patient2 FALSE      
9 e        0.005 ill_patient1  0.01  healthy_patient3 FALSE      
10 e        0.005 ill_patient1  0.008 healthy_patient4 FALSE      
# ... with 29 more rows
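One caveat about the `common_rank` column: `rank.x == rank.y` tests exact floating-point equality, which can report FALSE for values that were computed differently but are meant to be the same. As a hedged alternative, `dplyr::near()` compares within a small tolerance. The sketch below uses illustrative values and hypothetical feature names ("x", "y") that are not taken from the question.

```r
library(dplyr)

# Illustrative values only (not from the question): 0.1 + 0.2 and 0.3 are
# stored as slightly different doubles even though they print the same.
joined <- data.frame(
  feature = c("x", "y"),          # hypothetical feature names
  rank.x  = c(0.1 + 0.2, 0.002),
  rank.y  = c(0.3, 0.002)
)

flagged <- joined %>%
  mutate(
    exact_match = rank.x == rank.y,     # strict equality
    near_match  = near(rank.x, rank.y)  # equality within a small tolerance
  )

print(flagged)
```

If the ranks are read from a file and never recomputed, the strict comparison is likely fine; the tolerance only matters once ranks come out of arithmetic.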

An option with mget:

library(dplyr)
df1 <-  mget(ls(pattern = 'ill_patient')) %>% bind_rows(.id = 'patient')
df2 <- mget(ls(pattern = 'healthy_patient')) %>% bind_rows(.id = 'patient')
inner_join(df1, df2, by = 'feature')
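Once `mget` has collected the data frames into lists, the nested `intersect` calls from the question can be generalised to any number of patients with `Reduce`. The following base R sketch is one possible way to do that and to compute the rank sums; the helper names `common_features` and `sum_rank` are hypothetical, and the anchored `ls()` patterns assume the data frames are named exactly as in the question.

```r
# Example data copied from the question.
healthy_patient1 <- data.frame(feature = c("a", "b", "c", "d", "e", "f"), rank = c(0.001, 0.002, 0.002, 0.003, 0.05, 0.067))
healthy_patient2 <- data.frame(feature = c("a", "d", "e", "f", "g", "h", "q"), rank = c(0.001, 0.008, 0.01, 0.02, 0.05, 0.067, 1.2))
healthy_patient3 <- data.frame(feature = c("c", "d", "e", "g", "k", "l"), rank = c(0.003, 0.005, 0.01, 0.02, 0.05, 0.08))
healthy_patient4 <- data.frame(feature = c("b", "e", "g", "d", "k", "q", "o"), rank = c(0.001, 0.008, 0.01, 0.021, 0.054, 0.078, 1.1))
ill_patient1 <- data.frame(feature = c("c", "d", "e", "f", "o", "p", "q"), rank = c(0.002, 0.004, 0.005, 0.006, 0.02, 0.067, 0.09))
ill_patient2 <- data.frame(feature = c("e", "f", "o", "p", "r"), rank = c(0.001, 0.003, 0.02, 0.02, 0.03))
ill_patient3 <- data.frame(feature = c("c", "e", "o", "n", "k", "r"), rank = c(0.003, 0.005, 0.01, 0.03, 0.04, 0.08))
ill_patient4 <- data.frame(feature = c("b", "e", "o", "h", "n", "r", "s"), rank = c(0.002, 0.007, 0.01, 0.02, 0.03, 0.068, 1.1))

# Collect each group into a named list. The anchored patterns are an
# assumption about the naming scheme and avoid matching unrelated objects.
healthy_list <- mget(ls(pattern = "^healthy_patient[0-9]+$"))
ill_list     <- mget(ls(pattern = "^ill_patient[0-9]+$"))

# Features present in every patient of a group.
common_features <- function(patients) {
  Reduce(intersect, lapply(patients, function(p) as.character(p$feature)))
}

# Sum of one feature's ranks over all patients of a group
# (patients without the feature contribute 0).
sum_rank <- function(patients, feature) {
  sum(vapply(patients, function(p) sum(p$rank[p$feature == feature]), numeric(1)))
}

healthy_comm <- common_features(healthy_list)
ill_comm     <- common_features(ill_list)

healthy_only <- setdiff(healthy_comm, ill_comm)
ill_only     <- setdiff(ill_comm, healthy_comm)
distinct_features <- c(healthy_only, ill_only)

# One row per distinguishing feature with its rank sum in each group.
result <- data.frame(
  feature          = distinct_features,
  sum_rank_healthy = vapply(distinct_features, function(f) sum_rank(healthy_list, f), numeric(1)),
  sum_rank_ill     = vapply(distinct_features, function(f) sum_rank(ill_list, f), numeric(1)),
  row.names        = NULL
)

print(result)
```

This stays in base R and keeps the question's `intersect`/`setdiff` logic; for large datasets the long-table approach with `bind_rows` is probably easier to extend.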
