Searching a FASTA file in Python and returning the matching read

Newbie here. I want to write a Python function that searches a FASTA file for a gene name and returns the corresponding read.

Example FASTA file:

>name1
AATTCCGG
>name2
ATCGATCG

My code so far (very rudimentary):

def findseq(name):
    with open('cel39.fa', 'rb') as csv_file:
        csv_reader = csv.reader(csv_file)
        for i in csv_reader:
            if i == '>' + name:
                return i+1
                break

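This doesn't actually work, because I can't return 'i+1'. I also can't iterate over len(csv_reader), because the reader has no 'len'. I'm also not sure whether there is a more efficient (but still simple) way to search, so that I don't have to walk through the whole file (which will be thousands of lines) every time.

Specifically, is there a better way to read a FASTA file? And how can I return the read?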

findseq(name1)

should return 'AATTCCGG'

Thanks!!

Take a look at the Python library Biopython.

It contains a large number of useful tools and methods.

Here is their example for parsing a FASTA file:

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

This example prints every record in the FASTA file.

For your purpose:

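from Bio import SeqIO

def findseq(name, fasta_path="cel39.fa"):
    # Return the sequence of the first record whose ID matches name
    for seq_record in SeqIO.parse(fasta_path, "fasta"):
        if seq_record.id == name:
            return str(seq_record.seq)
    return None  # no record with that ID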

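Because a sequence in a FASTA file can span multiple lines, you have to concatenate lines until you reach the next '>'. The code below builds a dictionary with gene names as keys and gene sequences as values.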

# Open in text mode so each line is a str and can be compared with '>'
with open('cel39.fa', 'r') as fp:
    lines = fp.read().splitlines()
geneDict = {}
# Placeholder key so the first header line has something to flush
geneName = 'dummy'
fastaSeq = ''
for line in lines:
    if line.startswith('>'):
        geneDict.update({geneName: fastaSeq})
        geneName = line[1:]
        fastaSeq = ''
    else:
        fastaSeq += line
geneDict.update({geneName: fastaSeq})  # Store the last record
geneDict.pop('dummy')  # Remove the placeholder
print(geneDict['name1'])
print(geneDict['name2'])

This prints:

AATTCCGG
ATCGATCG
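
If you would rather not load the whole file before searching, the same line-joining idea can be written as a generator that stops reading as soon as the requested name is found. This is only a minimal sketch without Biopython: `read_fasta` is a hypothetical helper name, the default path `cel39.fa` is taken from the question, and it assumes the whole header line after '>' is the gene name.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file, one record at a time."""
    name = None
    chunks = []
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header: emit the record collected so far, if any
                if name is not None:
                    yield name, "".join(chunks)
                # Assumption: everything after '>' is the gene name
                name = line[1:].strip()
                chunks = []
            elif name is not None:
                chunks.append(line)
    # Emit the last record in the file
    if name is not None:
        yield name, "".join(chunks)


def findseq(name, path="cel39.fa"):
    """Return the sequence for `name`, or None if it is not in the file."""
    for record_name, sequence in read_fasta(path):
        if record_name == name:
            return sequence  # stop reading once the record is found
    return None
```

For a single lookup, `findseq('name1')` reads only as far as the matching record. If you need many lookups, `sequences = dict(read_fasta('cel39.fa'))` reads the file once, after which `sequences['name1']` is an ordinary dictionary lookup.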
