I'm working on a simple bioinformatics problem. I have a working solution, but it is absurdly inefficient. How can I increase my efficiency?
The problem:
Find the patterns of length k in the string g, given that a k-mer can have up to d mismatches.
The strings and patterns are all genomic, so our possible character set is {A, T, C, G}.
I'll call the function FrequentWordsMismatch(g, k, d).
So, here are some helpful examples:
FrequentWordsMismatch('AAAAAAAAAA', 2, 1) → ['AA', 'CA', 'GA', 'TA', 'AC', 'AG', 'AT']
Here's a much longer example, in case you implement this and want to test it:
FrequentWordsMismatch('CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC', 10, 2) → ['GCACACAGAC', 'GCGCACACAC']
With my naive solution, that second example can easily take ~60 seconds, though the first one is quite fast.
The naive solution:
My idea was, for each k-length segment in g, find every possible "neighbor" (i.e., other k-length segments with up to d mismatches) and add those neighbors as keys to a dictionary. I then count how many times each of those neighbor k-mers occurs in the string g and record that in the dictionary.
Obviously this is a terrible way to go about it, because the number of neighbors explodes as k and d increase, and having to scan the string for every one of those neighbors makes this implementation terribly slow. But alas, that's why I'm asking for help.
I'll put my code below. There are certainly plenty of newbie mistakes to unpack, so thanks for your time and attention.
def FrequentWordsMismatch(g, k, d):
    '''
    Finds the most frequent k-mer patterns in the string g, given that those
    patterns can mismatch amongst themselves up to d times

    g (String): Collection of {A, T, C, G} characters
    k (int): Length of desired pattern
    d (int): Number of allowed mismatches
    '''
    counts = {}
    answer = []
    for i in range(len(g) - k + 1):
        kmer = g[i:i+k]
        for neighborkmer in Neighbors(kmer, d):
            counts[neighborkmer] = Count(neighborkmer, g, d)
    maxVal = max(counts.values())
    for key in counts.keys():
        if counts[key] == maxVal:
            answer.append(key)
    return(answer)
def Neighbors(pattern, d):
    '''
    Find all strings with at most d mismatches to the given pattern

    pattern (String): Original pattern of characters
    d (int): Number of allowed mismatches
    '''
    if d == 0:
        return [pattern]
    if len(pattern) == 1:
        return ['A', 'C', 'G', 'T']
    answer = []
    suffixNeighbors = Neighbors(pattern[1:], d)
    for text in suffixNeighbors:
        if HammingDistance(pattern[1:], text) < d:
            for n in ['A', 'C', 'G', 'T']:
                answer.append(n + text)
        else:
            answer.append(pattern[0] + text)
    return(answer)
def HammingDistance(p, q):
    '''
    Find the Hamming distance between two strings

    p (String): String to be compared to q
    q (String): String to be compared to p
    '''
    ham = 0 + abs(len(p)-len(q))
    for i in range(min(len(p), len(q))):
        if p[i] != q[i]:
            ham += 1
    return(ham)
def Count(pattern, g, d):
    '''
    Count the number of times that the pattern occurs in the string g,
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    return len(MatchWithMismatch(pattern, g, d))
def MatchWithMismatch(pattern, g, d):
    '''
    Find the indices at which the pattern occurs in the string g,
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    answer = []
    for i in range(len(g) - len(pattern) + 1):
        if(HammingDistance(g[i:i+len(pattern)], pattern) <= d):
            answer.append(i)
    return(answer)
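To put a number on that neighbor explosion (a quick side calculation, not part of the solution itself): a k-mer over a 4-letter alphabet has sum(C(k, i) * 3**i for i ≤ d) distinct neighbors, and my naive solution rescans all of g for each one.

from math import comb

def neighbor_count(k, d):
    # Strings within Hamming distance d of a fixed k-mer: choose which
    # i positions differ, then pick one of the 3 other letters at each.
    return sum(comb(k, i) * 3**i for i in range(d + 1))

print(neighbor_count(2, 1))   # 7   - the size of the first example's output
print(neighbor_count(10, 2))  # 436 - per window in the ~60 second example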
More tests:
FrequentWordsMismatch('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1) → ['ATGC', 'ATGT', 'GATG']
FrequentWordsMismatch('AGTCAGTC', 4, 2) → ['TCTC', 'CGGC', 'AAGC', 'TGTG', 'GGCC', 'AGGT', 'ATCC', 'ACTG', 'ACAC', 'AGAG', 'ATTA', 'TGAC', 'AATT', 'CGTT', 'GTTC', 'GGTA', 'AGCA', 'CATC']
FrequentWordsMismatch('AATTAATTGGTAGGTAGGTA', 4, 0) → ["GGTA"]
FrequentWordsMismatch('ATA', 3, 1) → ['GTA', 'ACA', 'AAA', 'ATC', 'ATA', 'AGA', 'ATT', 'CTA', 'TTA', 'ATG']
FrequentWordsMismatch('AAT', 3, 0) → ['AAT']
FrequentWordsMismatch('TAGCG', 2, 1) → ['GG', 'TG']
The problem description is ambiguous in several ways, so I'm going by your examples. You seem to want all k-length strings over the alphabet {A, C, G, T} such that the number of matches against contiguous substrings of g is maximal - where "a match" means pairwise character equality, with at most d character inequalities.
I'm ignoring that your HammingDistance() function makes something up even when the inputs have different lengths, mostly because it doesn't make much sense to me ;-) , but partly because it isn't needed to get the results you want in any of the examples you gave.
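For reference, an equal-length version is all these examples need (a minimal sketch, assuming p and q always have the same length):

def hamming_distance(p, q):
    # Count the positions at which equal-length strings p and q differ.
    assert len(p) == len(q), "defined only for equal-length strings"
    return sum(a != b for a, b in zip(p, q))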
The code below produces the results you want in all the examples, in the sense of producing permutations of the output lists you gave. If you want canonical outputs, I'd suggest sorting an output list before returning it.
The algorithm is pretty simple, but relies on itertools to do the heavy combinatorial lifting "at C speed". All the examples run in well under a second.
For each length-k contiguous substring of g, consider all combinations(k, d) sets of d distinct index positions. There are 4**d ways to fill those index positions with letters from {A, C, G, T}, and each such way is "a pattern" that matches the substring with at most d discrepancies. Duplicates are weeded out by remembering the patterns already generated; this is faster than making heroic efforts to generate only unique patterns to begin with.
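To make that concrete, here is the generation step for a single window in isolation (an illustrative sketch only; the full function below folds this into the counting loop):

from itertools import combinations, product

base = "ACGT"  # one length-4 window, with k=4 and d=2 for illustration
patterns = set()
for ixs in combinations(range(4), 2):           # d distinct index positions
    for newchars in product("ACGT", repeat=2):  # 4**d ways to fill them
        cand = list(base)
        for i, ch in zip(ixs, newchars):
            cand[i] = ch
        patterns.add("".join(cand))
print(len(patterns))  # 67 = sum of C(4, i) * 3**i for i in 0..2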
So, overall, the time requirement is O(len(g) * k**d * 4**d) = O(len(g) * (4*k)**d), where, for reasonably small values of k and d, k**d is an overstated stand-in for the binomial coefficient combinations(k, d). The important thing to note is that - unsurprisingly - it's exponential in d.
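Plugging numbers into that bound shows why the slow example stops being slow (back-of-the-envelope arithmetic, using the formula above):

from math import comb

k, d = 10, 2
per_window = comb(k, d) * 4**d  # 45 * 16 = 720 candidate fills per window
print(per_window)
# For a test string of a few hundred characters, that's only a few
# hundred thousand cheap dict updates in total.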
def fwm(g, k, d):
    from itertools import product, combinations
    from collections import defaultdict

    all_subs = list(product("ACGT", repeat=d))
    all_ixs = list(combinations(range(k), d))
    patcount = defaultdict(int)
    for starti in range(len(g)):
        base = g[starti : starti + k]
        if len(base) < k:
            break
        patcount[base] += 1
        seen = set([base])
        basea = list(base)
        for ixs in all_ixs:
            saved = [basea[i] for i in ixs]
            for newchars in all_subs:
                for i, newchar in zip(ixs, newchars):
                    basea[i] = newchar
                candidate = "".join(basea)
                if candidate not in seen:
                    seen.add(candidate)
                    patcount[candidate] += 1
            for i, ch in zip(ixs, saved):
                basea[i] = ch
    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
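For example, checking it against a couple of the test cases from the question (sorting only to get canonical output, as suggested above):

print(sorted(fwm('AAAAAAAAAA', 2, 1)))
# ['AA', 'AC', 'AG', 'AT', 'CA', 'GA', 'TA']
print(sorted(fwm('AATTAATTGGTAGGTAGGTA', 4, 0)))
# ['GGTA']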
EDIT: Generating the patterns uniquely
Rather than weed out duplicates by keeping a set of those seen so far, it's easy enough to prevent generating any duplicates to begin with. In fact, the following code is shorter and simpler, although somewhat subtler. In return for less redundant work, there are layers of recursive calls to the inner() function. Which way is faster appears to depend on the specific inputs.
def fwm(g, k, d):
    from collections import defaultdict

    patcount = defaultdict(int)
    alphabet = "ACGT"
    allbut = {ch: tuple(c for c in alphabet if c != ch)
              for ch in alphabet}

    def inner(i, rd):
        if not rd or i == k:
            patcount["".join(base)] += 1
            return
        inner(i+1, rd)
        orig = base[i]
        for base[i] in allbut[orig]:
            inner(i+1, rd-1)
        base[i] = orig

    for i in range(len(g) - k + 1):
        base = list(g[i : i + k])
        inner(0, d)
    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
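Since which version wins depends on the inputs, a rough harness like this can settle it for your data (a sketch only: fwm_seen and fwm_unique are hypothetical names, assuming you rename the two versions above so both can coexist):

import random
from timeit import timeit

random.seed(1)
g = "".join(random.choice("ACGT") for _ in range(400))  # stand-in test string
for fn in (fwm_seen, fwm_unique):  # the two versions above, renamed
    t = timeit(lambda: fn(g, 10, 2), number=10)
    print(fn.__name__, t / 10, "seconds per call")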
Going by the problem description alone rather than the examples (for the reasons I explained in a comment), one approach is:
s = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGC"
"CGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGG"
"CCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCG"
"GTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACAC"
"ACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"
def frequent_words_mismatch(g, k, d):
    def num_misspellings(x, y):
        return sum(xx != yy for (xx, yy) in zip(x, y))

    seen = set()
    for i in range(len(g) - k + 1):
        seen.add(g[i:i+k])

    # For each unique sequence, add a (key, bin) pair to the bins dictionary
    # (The bin is initialized to a list containing only the sequence, for now)
    bins = {seq: [seq,] for seq in seen}

    # Loop again through the unique sequences...
    for seq in seen:
        # Try to fit it in *all* already-existing bins (based on bin key)
        for bk in bins:
            # Don't re-add seq to its own bin
            if bk == seq: continue
            # Test bin keys, try to find all appropriate bins
            if num_misspellings(seq, bk) <= d:
                bins[bk].append(seq)

    # Get a list of the bin keys (one for each unique sequence) sorted in order of the
    # number of elements in the corresponding bins
    sorted_keys = sorted(bins, key=lambda k: len(bins[k]), reverse=True)

    # largest_bin_key will be the key of the largest bin (there may be ties, so in fact
    # this is *a* key of *one of the bins with the largest length*). That is, it'll
    # be the sequence (found in the string) that the most other sequences (also found
    # in the string) are at most d-distance from.
    largest_bin_key = sorted_keys[0]

    # You can return this bin, as your question description (but not examples) indicate:
    return bins[largest_bin_key]
largest_bin = frequent_words_mismatch(s,10,2)
print(len(largest_bin)) # 13
print(largest_bin)
The (a) largest bin contains:
['CGGCCGCCGG', 'GGGCCGGCGG', 'CGGCCGGCGC', 'AGGCGGCCGG', 'CAGGCGCCGG', 'CGGCCGGCCG', 'CGGTAGCCGG', 'CGGCGGCCGC', 'CGGGCGCCGG', 'CCGGCGCCGG', 'CGGGCCCCGG', 'CCGCCGGCGG', 'GGGCCGCCGG']
It's O(n**2), where n is the number of unique sequences, and it completes in about 0.1 seconds on my computer.
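To check what n actually is here (my own quick addition, using the s defined above):

# n is the number of unique 10-mers present in s; it can't exceed
# len(s) - 9, so the quadratic pairwise pass stays small.
n = len({s[i:i+10] for i in range(len(s) - 10 + 1)})
print(n)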