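OK, I'm going to describe the real data rather than a stand-in, because I don't think a stand-in would make this any easier. To make sense of it all, though, the question involves a little Biochemistry 101.
I work with DNA mutagenesis libraries in which certain DNA positions are randomized, so the encoded protein has randomized amino acid positions as well. DNA is made up of nucleotides (there are four: G, A, T, C), and each amino acid (there are 20, denoted by single letters) is encoded by a group of three nucleotides (a codon).
I have two vectors that describe the codon-to-amino-acid relationship: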
cods <- c("GCT","GCC","GCA","GCG","CGT","CGC","CGA","CGG","AGA","AGG","AAT","AAC","GAT","GAC","TGT","TGC","CAA","CAG","GAA","GAG","GGT","GGC","GGA","GGG","CAT","CAC","TAA","TAG","TGA","ATT","ATC","ATA","CTT","CTC","CTA","CTG","TTA","TTG","AAA","AAG","ATG","TTT","TTC","CCT","CCC","CCA","CCG","TCT","TCC","TCA","TCG","AGT","AGC","ACT","ACC","ACA","ACG","TGG","TAT","TAC","GTT","GTC","GTA","GTG")
aas <- c("A","A","A","A","R","R","R","R","R","R","N","N","D","D","C","C","Q","Q","E","E","G","G","G","G","H","H","*","*","*","I","I","I","L","L","L","L","L","L","K","K","M","F","F","P","P","P","P","S","S","S","S","S","S","T","T","T","T","W","Y","Y","V","V","V","V")
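To illustrate how the two vectors pair up, here is a minimal sketch that turns them into a named lookup vector and translates a short DNA string. Only a handful of codons is used so the snippet stays short; the names cods_subset, aas_subset, codon_table and translate are made up for this illustration and are not part of the code in the question.

```r
# A small subset of the codon table above (hypothetical names, illustration only)
cods_subset <- c("ATG", "TGG", "GCT", "GCC", "TAA")
aas_subset  <- c("M",   "W",   "A",   "A",   "*")

# Named vector: names are codons, values are amino acids
codon_table <- setNames(aas_subset, cods_subset)

# Split a DNA string into consecutive triplets and look each one up
translate <- function(dna) {
  starts <- seq(1, nchar(dna) - 2, by = 3)
  codons <- substring(dna, starts, starts + 2)
  paste(codon_table[codons], collapse = "")
}

translate("ATGGCTTGGTAA")
# [1] "MAW*"
```

The same idea works with the full cods and aas vectors via setNames(aas, cods).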
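A randomized position allows only certain nucleotides at specific positions of the codon and is denoted by specific letters (which ones is not important here). For example, the degenerate codon "NYS" allows all four nucleotides (GATC) at the first position, but only C or T at the second position and C or G at the third. I now create all possible triplets for NYS and for another library like so: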
NYS <- expand.grid(list(c("A","C","G", "T"), c("C","T"), c("C","G")))
VRM <- expand.grid(list(c("A","C","G"), c("A","G"), c("A","C")))
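The letters follow the standard IUPAC degenerate nucleotide codes, so the expansion can also be driven by the codon name itself. The following is a rough sketch under that assumption; iupac and expand_codon are hypothetical names introduced here, and only the six codes needed for NYS and VRM are listed.

```r
# IUPAC degenerate nucleotide codes (only the ones needed for NYS and VRM)
iupac <- list(
  N = c("A", "C", "G", "T"),
  Y = c("C", "T"),
  S = c("C", "G"),
  V = c("A", "C", "G"),
  R = c("A", "G"),
  M = c("A", "C")
)

# Expand a three-letter degenerate codon into all concrete codon strings
expand_codon <- function(code) {
  letters3 <- strsplit(code, "")[[1]]
  grid <- expand.grid(unname(iupac[letters3]), stringsAsFactors = FALSE)
  Reduce(paste0, grid)
}

length(expand_codon("NYS"))  # 4 * 2 * 2 = 16 codons
length(expand_codon("VRM"))  # 3 * 2 * 2 = 12 codons
```

This gives the codon strings directly, without the intermediate Var1/Var2/Var3 columns.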
I then compute the amino acids corresponding to all these combinations:
# make codon triplet strings
NYS[,"cods"] <- paste(NYS$Var1, NYS$Var2, NYS$Var3, sep='')
VRM[,"cods"] <- paste(VRM$Var1, VRM$Var2, VRM$Var3, sep='')
#look them up in the aa vector and add a column
NYS[,"aas"] <- aas[match(NYS$cods, cods)]
VRM[,"aas"] <- aas[match(VRM$cods, cods)]
#get only the relevant columns (%>% and select() come from dplyr)
library(dplyr)
VRM <- VRM %>% select("aas", "cods")
NYS <- NYS %>% select("aas", "cods")
NYS$cods <- "NYS"
VRM$cods <- "VRM"
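Now comes the tricky part. Given an input vector that describes the number and type of randomized codons, for example:
library_cods <- c("NYS", "VRM", "NYS", "NYS", "VRM", "VRM")
I want to compute all amino acid sequences that can occur in such a library. I then want to create a data frame containing all unique sequences and their occurrence counts. This is how I do it: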
# make a string that contains n sort()s of the columns as determined by library_cods, evaluate, expand
library(stringr)  # for str_split()
all_combos <- expand.grid(lapply(str_split(paste(gsub("(...)", "sort(\\1$aas)", library_cods), collapse = ","), ",", simplify = T), function(x) eval(parse(text=x))))
# get the string from the rows
unique_seqs <- apply(all_combos, 1, function(x) paste(x, collapse = ""))
#rle() to count
unique_seqs <- data.frame(unclass(rle(sort(unique_seqs))))
#sort by count
unique_seqs <- unique_seqs[order(unique_seqs$lengths, decreasing = T),]
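This all works fine; the problem is that it is really slow. So my main question is: how can I make it faster? On my system, running the two lines that perform the rle() plus the line after them takes 70 s. For comparison, sort -n file | uniq -c | sort -n in bash took ~22 s on the same data. That is better, but still slow, so I thought maybe I should do some math instead of enumerating and counting ^^
As a side question: I also feel my code is clunky (especially the all_combos <- line; I know that evaluating strings as code is bad). If anyone wants to point out how to make my code more efficient, I'd appreciate that too.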
Some steps in the code can be done more concisely. For the triplets, plain vectors are all that is needed; we fetch them later with mget.
NYS <- expand.grid(list(c("A", "C", "G", "T"), c("C", "T"), c("C", "G")))
VRM <- expand.grid(list(c("A", "C", "G"), c("A", "G"), c("A", "C")))
## triplets
NYS <- aas[match(Reduce(paste0, NYS), cods)]
VRM <- aas[match(Reduce(paste0, VRM), cods)]
## input vector
library_cods <- c("NYS", "VRM", "NYS", "NYS", "VRM", "VRM")
# columns as determined by library_cods, evaluate, expand
all_combos <- expand.grid(mget(library_cods))
# get the string from the rows
unique_seqs <- Reduce(paste0, all_combos)
# sort by count
unique_seqs <- data.frame(sort(table(unique_seqs), decreasing=T))
Result of head(unique_seqs):
# unique_seqs Freq
# 1 LRLLRR 729
# 2 ARLLRR 486
# 3 LGLLRR 486
# 4 LRALRR 486
# 5 LRLARR 486
# 6 LRLLGR 486
This runs in about 16 seconds on my system, which is reasonable.
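The "do some math instead of counting" idea from the question can also be taken literally. Because each position is chosen independently, the count of a sequence is the product of how often each of its amino acids occurs at its position, so it is enough to expand over the unique amino acids per position and multiply the per-position counts. Below is a sketch of that approach, not a benchmarked solution. The NYS and VRM vectors are written out by hand so the snippet is self-contained; they hold the same amino acids that aas[match(...)] returns above (in a different order). The names aa_counts, per_pos, seq_grid, cnt_grid and result are made up for this example.

```r
# Amino acids encoded by the 16 NYS codons and the 12 VRM codons
NYS <- c("T","T","I","M","P","P","L","L","A","A","V","V","S","S","F","L")
VRM <- c("K","N","R","S","Q","H","R","R","E","D","G","G")
library_cods <- c("NYS", "VRM", "NYS", "NYS", "VRM", "VRM")

# How often each amino acid occurs within each library type
aa_counts <- lapply(list(NYS = NYS, VRM = VRM), table)

# One count table per position, in the order given by library_cods
per_pos <- unname(aa_counts[library_cods])

# Expand over unique amino acids only; both grids share the same row order
seq_grid <- expand.grid(lapply(per_pos, names), stringsAsFactors = FALSE)
cnt_grid <- expand.grid(lapply(per_pos, as.vector))

# Sequence string per row, and its count as the product of position counts
result <- data.frame(
  unique_seqs = Reduce(paste0, seq_grid),
  Freq        = Reduce(`*`, cnt_grid)
)
result <- result[order(result$Freq, decreasing = TRUE), ]

head(result, 1)
```

For this library the grid has 9^6 = 531,441 rows (nine distinct amino acids per position) instead of the 16^3 * 12^3 = 7,077,888 rows of the full expansion, and no sorting or tabulating of strings is needed to obtain the counts.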