Efficient way to create a new variable via fuzzy string matching and a grouped summary in R



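I am trying to use fuzzy string matching to convert strings to specific ids and then run a grouped summary with dplyr. The basic idea is to collapse imperfect gene sequences into a single gene name via a dictionary lookup and count how many times that gene was detected. That way, the counts for the sequence aaaaaaaaaxaa are matched to gene1 and summed together.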

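I can do what I want with for and if statements, comparing the raw data against the dictionary row by row, but I expect this to be inefficient when I scale up (a raw data file averages 15k rows and the dictionary has 200 rows). See below for the solution I am trying to improve; if you can think of a more efficient, more elegant approach, please let me know.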

df <- data.frame(str_var = rep(c("aaaaaa", "aXaaaa", "bbbbbb", "bbbXbb"), 3),
                 grp_var = rep(c("grp1", "grp2"), each = 6),
                 num_var = rep(c(1, 2), 6))
df
#>    str_var grp_var num_var
#> 1   aaaaaa    grp1       1
#> 2   aXaaaa    grp1       2
#> 3   bbbbbb    grp1       1
#> 4   bbbXbb    grp1       2
#> 5   aaaaaa    grp1       1
#> 6   aXaaaa    grp1       2
#> 7   bbbbbb    grp2       1
#> 8   bbbXbb    grp2       2
#> 9   aaaaaa    grp2       1
#> 10  aXaaaa    grp2       2
#> 11  bbbbbb    grp2       1
#> 12  bbbXbb    grp2       2

dictionary <- data.frame(string = c("aaaaaa", "bbbbbb", "cccccc", "dddddd"),
                         id = c("gene1", "gene2", "gene3", "gene4"))
dictionary
#>   string    id
#> 1 aaaaaa gene1
#> 2 bbbbbb gene2
#> 3 cccccc gene3
#> 4 dddddd gene4
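
# Start with no match; rows with no dictionary hit stay NA
df$gene_id <- NA_character_

for (i in seq_len(nrow(df))) {
  for (j in seq_len(nrow(dictionary))) {
    # Allow at most one substitution, no insertions or deletions
    match_found <- agrepl(dictionary$string[j], df$str_var[i],
                          max.distance = list(sub = 1, ins = 0, del = 0, all = 1 - 1e-9))
    if (match_found) {
      df$gene_id[i] <- as.character(dictionary$id[j])
      break  # keep the first matching dictionary entry
    }
  }
}

suppressPackageStartupMessages(library(dplyr))
new_df <- df %>%
  group_by(grp_var, gene_id) %>%
  summarize(gene_count = sum(num_var))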
#> `summarise()` has grouped output by 'grp_var'. You can override using the `.groups` argument.
new_df
#> # A tibble: 4 x 3
#> # Groups:   grp_var [2]
#>   grp_var gene_id gene_count
#>   <chr>   <chr>        <dbl>
#> 1 grp1    gene1            6
#> 2 grp1    gene2            3
#> 3 grp2    gene1            3
#> 4 grp2    gene2            6

Created on 2021-06-08 by the reprex package (v2.0.0)

This may be easier with fuzzyjoin:

library(fuzzyjoin)
# stringdist_left_join() defaults: max_dist = 2, method = "osa"
stringdist_left_join(df, dictionary, by = c("str_var" = "string")) %>%
  group_by(grp_var, gene_id = id) %>%
  summarise(gene_count = sum(num_var), .groups = 'drop')

-output

# A tibble: 4 x 3
  grp_var gene_id gene_count
  <chr>   <chr>        <dbl>
1 grp1    gene1            6
2 grp1    gene2            3
3 grp2    gene1            3
4 grp2    gene2            6
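
If adding a package is not an option, the nested loop from the question can probably be replaced with one vectorized call to base R's adist(). The following is only a sketch under a few assumptions that are not in the original posts: sequences are compared whole-string (agrepl() looks for an approximate substring instead), insertions and deletions are ruled out by giving them an arbitrarily high cost of 10, and the helper names uniq, d, first_hit and lookup are made up for illustration. Each distinct sequence is matched only once, which should help when the 15k rows contain many repeats.

```r
# Toy data copied from the question
df <- data.frame(str_var = rep(c("aaaaaa", "aXaaaa", "bbbbbb", "bbbXbb"), 3),
                 grp_var = rep(c("grp1", "grp2"), each = 6),
                 num_var = rep(c(1, 2), 6))
dictionary <- data.frame(string = c("aaaaaa", "bbbbbb", "cccccc", "dddddd"),
                         id = c("gene1", "gene2", "gene3", "gene4"))

# Distance matrix: one row per distinct sequence, one column per dictionary entry.
# Assumption: a cost of 10 for insertions/deletions means a distance <= 1
# can only come from an exact match or a single substitution.
uniq <- unique(as.character(df$str_var))
d <- adist(uniq, as.character(dictionary$string),
           costs = c(insertions = 10, deletions = 10, substitutions = 1))

# For each distinct sequence, index of the first dictionary entry within
# distance 1 (NA when nothing matches), mirroring the `break` in the loop
first_hit <- apply(d <= 1, 1, function(hit) which(hit)[1])

# Named lookup vector: sequence -> gene id
lookup <- setNames(as.character(dictionary$id)[first_hit], uniq)
df$gene_id <- unname(lookup[as.character(df$str_var)])

# Grouped sum in base R; dplyr's group_by() + summarise() would do the same job
new_df <- aggregate(num_var ~ grp_var + gene_id, data = df, FUN = sum)
names(new_df)[names(new_df) == "num_var"] <- "gene_count"
new_df
```

The distance matrix has one cell per (distinct sequence, dictionary entry) pair, so memory grows with the product of the two; for a dictionary of 200 entries that should stay small. Note that aggregate() with a formula drops rows whose gene_id is NA, so unmatched sequences would disappear from the summary.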
