Identify subsets that contain only repeats of an expression


I have a dataset like this:

 df<-data.frame(x=c("A","A","A","A", "B","B","B","B","B", 
                   "C","C","C","C","C","D","D","D","D","D"),  
                y= as.factor(c(rep("Eoissp2",4),rep("Eoissp1",5),"Eoissp1","Eoisp4","Automerissp1","Automerissp2","Acharias",rep("Eoissp2",3),rep("Eoissp1",2))))

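For each subset of x, I want to determine whether the corresponding levels of y consist entirely of repeats that contain the expression Eois. A, B, and D should therefore be returned in a vector, because every level in A, B, and D contains the expression Eois, whereas C is made up of a mix of different levels (e.g. Eois, Automeris, and Acharias). For this example, the output would be: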

   output<- c("A", "B", "D")

Using the new df (this also needs stringr loaded for str_detect):

> df %>% filter(str_detect(y,"Eois")) %>% group_by(x) %>% distinct(y) %>% 
    count() %>% filter(n==1) %>% select(x)
# A tibble: 2 x 1
# Groups:   x [2]
  x    
  <fct>
1 A    
2 B   
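
The pipeline above returns only A and B, not D, because it requires exactly one distinct Eois level per group, and D holds both Eoissp1 and Eoissp2. As a rough base R sketch of that check (the variable names `keep` and `n_distinct_eois` are illustrative, not from the original answer), the number of distinct Eois levels per group can be counted like this:

```r
# Data from the question
df <- data.frame(
  x = c(rep("A", 4), rep("B", 5), rep("C", 5), rep("D", 5)),
  y = as.factor(c(rep("Eoissp2", 4), rep("Eoissp1", 5),
                  "Eoissp1", "Eoisp4", "Automerissp1", "Automerissp2", "Acharias",
                  rep("Eoissp2", 3), rep("Eoissp1", 2)))
)

y_chr <- as.character(df$y)
keep  <- grepl("Eois", y_chr)   # rows whose label contains "Eois"

# Count distinct Eois labels within each group of x
n_distinct_eois <- tapply(y_chr[keep], df$x[keep],
                          function(v) length(unique(v)))
n_distinct_eois
# A and B have 1 distinct label; C and D have 2
```

Groups with a count of 1 are the ones that survive `filter(n==1)`, which is why D drops out under this stricter reading.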

(The answer below uses the original df posted by the question's author.)

Using the pipe operator from magrittr and functions from dplyr:

> df %>% group_by(x) %>% distinct(y)
# A tibble: 7 x 2
# Groups:   x [3]
  x     y      
  <fct> <fct>  
1 A     plant1a
2 B     plant1b
3 C     plant1a
4 C     plant2a
5 C     plant3a
6 C     plant4a 
7 C     plant5a

You can then summarize the results like this:

> results <- df %>% group_by(x) %>% distinct(y) %>% 
    count() %>% filter(n==1) %>% select(x)
> results
# A tibble: 2 x 1
# Groups:   x [2]
  x    
  <fct>
1 A    
2 B   

If you know that the original data frame always arrives ordered by x, you can drop the group_by part.

A dplyr-based solution could be:

library(dplyr)
df %>% group_by(x) %>%
  filter(grepl("Eoiss", y)) %>%
  mutate(y = sub("\\d+", "", y)) %>%
  filter(n() >1 & length(unique(y)) == 1) %>%
  select(x) %>% unique(.)
# A tibble: 3 x 1
# Groups: x [3]
#  x     
#  <fctr>
#1 A     
#2 B     
#3 D
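
To see why D is included here, it may help to look at the digit-stripping step in isolation. The sketch below uses a small hand-picked vector (`labels` is an illustrative name, not part of the answer) and shows that, once the species number is removed, Eoissp1 and Eoissp2 collapse to the same label, while the pattern "Eoiss" does not match Eoisp4:

```r
labels <- c("Eoissp1", "Eoissp2", "Eoisp4", "Automerissp1")

# Remove the first run of digits from each label
stripped <- sub("\\d+", "", labels)
stripped

# Only labels matching "Eoiss" are kept by the first filter()
is_eoiss <- grepl("Eoiss", labels)
unique(stripped[is_eoiss])
```

Because group C keeps only a single "Eoiss" row after the first filter, the `n() > 1` condition removes it, leaving A, B, and D.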

Data

df<-data.frame(x=c("A","A","A","A", "B","B","B","B","B", 
                   "C","C","C","C","C","D","D","D","D","D"),  
               y= as.factor(c(rep("Eoissp2",4),
      rep("Eoissp1",5),"Eoissp1","Eoisp4","Automerissp1","Automerissp2",
      "Acharias",rep("Eoissp2",3),rep("Eoissp1",2))))

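For comparison, here is a minimal base R sketch of one possible reading of the question: keep a group only if every one of its y values contains "Eois". This is an assumption about the intended criterion rather than a restatement of either answer above, and the names `all_eois` and `output` are illustrative.

```r
df <- data.frame(
  x = c(rep("A", 4), rep("B", 5), rep("C", 5), rep("D", 5)),
  y = as.factor(c(rep("Eoissp2", 4), rep("Eoissp1", 5),
                  "Eoissp1", "Eoisp4", "Automerissp1", "Automerissp2", "Acharias",
                  rep("Eoissp2", 3), rep("Eoissp1", 2)))
)

# For each group of x, test whether all labels contain "Eois"
all_eois <- tapply(grepl("Eois", as.character(df$y)), df$x, all)

# Names of the groups that pass the test
output <- names(all_eois)[as.vector(all_eois)]
output
```

Under this reading C is excluded because it also contains Automeris and Acharias labels, which matches the output requested in the question. Unlike the dplyr answer, this version needs no extra packages, but it would also accept a group with a single Eois row.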