Python: iterate over a list and select only the strings made up of specific characters



I want to iterate over the list kmers and select only the items that consist entirely of the characters A, T, G and C.

kmers = ["AL", "AT", "GC", "AA", "AP"]
DNA_kmers = []
for kmer in kmers:
    for letter in kmer:
        if letter not in ["A", "T", "G", "C"]:
            pass
        else:
            DNA_kmers.append(kmer)
print("DNA_kmers", DNA_kmers)

Output:

DNA_kmers ['AL', 'AT', 'AT', 'GC', 'GC', 'AA', 'AA', 'AP']

Expected output:

DNA_kmers=["AT","GC","AA"]

The only way I know to do this is:

if "B" in kmer or "D" in kmer or "E" in kmer or "F" in kmer or "H" in kmer or "I" in kmer or "J" in kmer or "K" in kmer or "L" in kmer or "M" in kmer or "N" in kmer or "O" in kmer or "P" in kmer or "Q" in kmer or "R" in kmer or "S" in kmer or "U" in kmer or "V" in kmer or "W" in kmer or "X" in kmer or "Y" in kmer or "Z" in kmer:
    pass

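Your code currently appends an item once for each of its characters that matches, so an item is added even when only one of its characters is valid. We can adjust it to add only items in which every character matches: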

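kmers = ["AL", "AT", "GC", "AA", "AP"]
DNA_kmers = []
for kmer in kmers:
    for letter in kmer:
        if letter not in ["A", "T", "G", "C"]:
            break  # invalid character found: skip this k-mer
    else:
        # runs only if the inner loop finished without break
        DNA_kmers.append(kmer)
print("DNA_kmers", DNA_kmers)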

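In case you are not familiar with it in Python: I have used an else clause on the for loop. Not all languages have this feature. The else block runs if and only if the loop completes all of its iterations, that is, it exits without hitting a break.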
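To make the for/else behaviour concrete, here is a minimal sketch. The helper name first_invalid is hypothetical and exists only for illustration; it returns the first disallowed character, or None when the loop runs to completion.

```python
def first_invalid(kmer, allowed="ATGC"):
    """Return the first character of kmer not in allowed, or None.

    Hypothetical helper, used only to illustrate for/else.
    """
    for letter in kmer:
        if letter not in allowed:
            result = letter
            break  # leaving via break skips the else block
    else:
        # Reached only when the loop was not exited via break
        # (this includes the case of an empty string).
        result = None
    return result


print(first_invalid("AT"))  # None
print(first_invalid("AL"))  # L
```

Note that the else block also runs when the loop body never executes, so an empty string counts as "all characters valid" here.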

There are simpler ways to do what you want. For example, the following does the job with a nested list comprehension:

kmers=["AL","AT","GC","AA","AP"]
allowed = set("AGCT")
print([k for k in kmers if all([c in allowed for c in k])])
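As a variation on the comprehension above, the same check can probably be expressed as a subset test on sets. This is only a sketch of an alternative, not part of the original answer; the variable name dna_kmers is my own.

```python
kmers = ["AL", "AT", "GC", "AA", "AP"]
allowed = set("ACGT")

# set(k) <= allowed is True when every character of k is in allowed
dna_kmers = [k for k in kmers if set(k) <= allowed]
print(dna_kmers)  # ['AT', 'GC', 'AA']
```

Like the all() version, this treats an empty string as valid, because the empty set is a subset of every set.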

A more general solution, and usually a faster one, is to use a regular expression:

import re
kmers=["AL","AT","GC","AA","AP"]
r = re.compile("^[ATGC]*$")
print([k for k in kmers if r.match(k)])
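One caveat with the pattern above: `^[ATGC]*$` also accepts the empty string, and `$` matches just before a trailing newline. If that matters for your data, a stricter variant could use `+` together with `fullmatch`. The following is a sketch under the assumption that empty k-mers and k-mers with a trailing newline should be rejected; the names lenient, strict and candidates are illustrative.

```python
import re

lenient = re.compile("^[ATGC]*$")  # pattern from the answer above
strict = re.compile("[ATGC]+")     # assumed stricter variant

candidates = ["AT", "GC", "", "AT\n", "AP"]

# match() with ^...$ lets "" and "AT\n" through
print([k for k in candidates if lenient.match(k)])     # ['AT', 'GC', '', 'AT\n']

# fullmatch() requires the whole string to match, and + requires
# at least one character
print([k for k in candidates if strict.fullmatch(k)])  # ['AT', 'GC']
```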

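If we restrict the problem to k-mers with k=2, we can optimize further. Matching a fixed-length string (for example with [AGCT]{2}) should make the regular expression slightly faster. We can also use product to build a set for constant-time lookups: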

import itertools
kmers=["AL","AT","GC","AA","AP"]
allowed = {a+b for a,b in itertools.product("AGCT", repeat=2)}
print([k for k in kmers if k in allowed])
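The same lookup-set idea should extend to other values of k. Below is a minimal sketch; the function names build_allowed and filter_kmers are hypothetical, and it assumes every k-mer of interest has the same length k.

```python
import itertools


def build_allowed(k, alphabet="ACGT"):
    """Build the set of all strings of length k over alphabet."""
    return {"".join(p) for p in itertools.product(alphabet, repeat=k)}


def filter_kmers(kmers, k):
    """Keep only the k-mers found in the precomputed lookup set."""
    allowed = build_allowed(k)
    return [kmer for kmer in kmers if kmer in allowed]


print(filter_kmers(["ATG", "GCN", "AAA", "AT"], 3))  # ['ATG', 'AAA']
```

The set holds 4**k entries, so this is only practical for small k; for longer k-mers the per-character check or the regular expression is likely the better choice.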
