使用Python读取大型FastQ文件

我平均有几个具有500.000.000行(125.000.000序列(的FastQ文件。是否有一种快速读取这些FastQ文件的快速方法。

我想做的是读取每个序列并将前16个序列用作条形码。然后计算每个文件中条形码的数量。

这是我的脚本，需要数小时：

import os, errno
from Bio import SeqIO
import gzip
files = os.listdir(".")
for file in files[:]:
    if not file.endswith(".fastq.gz"):
        files.remove(file)
maps = {}
for file in files:
    print "Now Parsing file %s"%file
    maps[file] = {}
    with gzip.open(file,"r") as handle:
        recs = SeqIO.parse(handle,"fastq")
        for rec in recs:
            tag = str(rec.seq)[0:16]
            if tag not in map[file]:
                maps[file][tag] = 1
            else:
                maps[file][tag] += 1

我有250 GB RAM和20个CPU，可用于多线程...

谢谢。

未经测试，但是这是一种方法，您可以以'尴尬的平行'方式做到这一点：

import multiprocessing as mp
import os, errno
from Bio import SeqIO
import gzip
def ImportFile(file):
    maps = {}
    with gzip.open(file,"r") as handle:
        recs = SeqIO.parse(handle,"fastq")
        for rec in recs:
            tag = str(rec.seq)[0:16]
            if tag not in maps.keys():
                maps[tag] = 1
            else:
                maps[tag] += 1
    return {file:maps}

files = os.listdir(".")
for file in files[:]:
    if not file.endswith(".fastq.gz"):
        files.remove(file)
# I'd test this with smaller numbers before using up all 20 cores
pool = mp.Pool(processes=10)
output = pool.map(ImportFile,files)

相关内容

最新更新

热门标签：