我有一个很好的代码来读取fasta文件:
from itertools import groupby
def is_header(line):
return line[0] == '>'
def parse_fasta(filename):
if filename.endswith('.gz'):
opener = lambda filename: gzip.open(filename, 'rb')
else:
opener = lambda filename: open(filename, 'r')
with opener(filename) as f:
fasta_iter = (it[1] for it in groupby(f, is_header))
for name in fasta_iter:
name = name.__next__()[1:].strip()
sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
yield name, sequences.upper()
它运行良好,直到我更新到Python 3.10.4,然后当我尝试使用它时,我得到了这个错误:
Traceback (most recent call last):
File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_parser.py", line 21, in parse_fasta
sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
StopIteration
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_split_chr_pls.py", line 112, in <module>
sys.exit(main())
File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_split_chr_pls.py", line 81, in main
plasmid, chromosome = split_sequences_from_fasta_file(filename)
File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_parser.py", line 111, in split_sequences_from_fasta_file
for header, seq in parse_fasta(filename)
RuntimeError: generator raised StopIteration
我在conda(conda 4.13.0(环境中运行代码,并尝试调试该函数,但我被卡住了。我不想为了使用Biopython而失去这个功能。如果你们有任何想法来解决它,我真的很感激。谢谢Paulo
fasta文件示例:
> seq_name
AGGTAGGGA
有趣的是,当我在命令行运行python解释器中的函数时,一切都很好,但当我使用导入的函数从脚本中调用函数时,我就出现了错误。
>>> import gzip
>>> from itertools import groupby
>>> def is_header(line):
... return line[0] == '>'
...
>>> for name, seq in parse_fasta("/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/Genomes/Archaea/Asgard/Chr/GCA_008000775.1_ASM800077v1_genomic.fna.gz"):
... print(name, seq[:50])
...
CP042905.1 Candidatus Prometheoarchaeum syntrophicum strain MK-D1 chromosome, complete genome TAAATATTATAGCCCGTAATAGCAGAGTCACCAACACTTAAAGGTGCATC
>>> quit()
您得到的异常是因为您在代码中的各种迭代器上手动调用__next__
方法。最终,在一个没有任何剩余值的迭代器上执行此操作,会引发一个StopIteration
异常。
在旧得多的Python版本中,在生成器函数中保留未捕获状态是可以的。StopIteration
异常将像任何其他异常一样继续冒泡。对于生成器函数,引发StopIteration
是其行为的预期部分(当函数结束时,无论是通过return
语句,还是通过到达其代码的末尾,都会自动发生(。在Python3.5中,这种行为发生了变化,PEP479使StopIteration
在生成器中未被捕获成为错误。
现在,考虑到代码的逻辑,我不太清楚为什么会得到空迭代器。如果文件采用您所描述的格式,则__next__
调用应该始终有一个值要获取,并且当没有剩余值时,for
循环将接收到StopIteration
,这将抑制它(并结束循环(。也许您的某些文件格式不正确,标题行本身没有后续序列?
在任何情况下,如果您捕获StopIteration
并打印出一些诊断信息,您都可以更好地诊断问题。我会试试:
with opener(filename) as f:
fasta_iter = (it[1] for it in groupby(f, is_header))
for name in fasta_iter:
name = name.__next__()[1:].strip()
try:
sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
except StopIteration as si:
print(f'no sequence for {name}')
raise ValueError() from si
yield name, sequences.upper()
如果您发现丢失的序列是正常的,并且发生在每个文件的末尾,那么您可以通过在except
块中放入return
语句来抑制错误,或者在for
循环(for name_iter, sequences_iter in zip(fasta_iter, fasta_iter):
(中使用zip
来抑制错误。不过,我不太愿意把它作为第一个解决方案,因为如果有额外的头,它会丢弃一个头,而默默地丢失数据通常是个坏主意。