Reading a 50MB file in Python takes too long



I'm trying to analyze a simple but "big" (52MB) .ped file that I generated from a 1000genomes project .vcf file: it has 107 rows and 248189 columns. I don't care about the first 6 columns; the columns I'm interested in contain only the letters 'A', 'C', 'G', 'T', and I need to count their frequencies. Each column holds either one of those letters or a combination of two (e.g. A,A,A,C,A,A,C,...). The problem: I can't load the whole file, because memory usage jumps from 7GB to 16GB (which I can't understand, since the file is only 52MB), so I tried another approach.
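One likely reason 52 MB of text balloons in RAM (a general CPython observation, not something measured on this file): each one-letter field becomes a separate Python string object with tens of bytes of overhead, and pandas adds per-column overhead on top for each of the ~248k columns. A quick check:

```python
import sys

# In CPython 3, even a single-character str carries roughly 50 bytes of
# object overhead, so ~26 million one-letter fields (107 rows x ~248k
# columns) can reach gigabytes before pandas' per-column overhead is added.
print(sys.getsizeof("A"))  # typically around 50 bytes in CPython 3
```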

The columns complement each other in pairs: col 6 and 7 represent the same thing, as do 8 and 9, 10 and 11, and so on, so the file can be analyzed by importing only two columns at a time. By building a dictionary I can avoid loading the whole file while keeping the necessary information. To do that I wrote the following:

import pandas as pd
import io
import matplotlib.pyplot as plt
import numpy as np
import time

def main(end):
    for col in range(6, end, 2):
        # re-read the file, keeping only the current pair of columns
        allels = pd.read_csv('filtered_conv.ped', sep='\t', header=None, usecols=[col, col + 1])
        allel_1 = allels[col].value_counts().rename_axis('Nb').reset_index(name='counts')
        allel_2 = allels[col + 1].value_counts().rename_axis('Nb').reset_index(name='counts')

For the dictionary, I use this inside the loop:

        if len(allel_1['counts']) == 2 and len(allel_2['counts']) == 2:
            alt_nb = allel_1['Nb'][1]
            alt_c = allel_1['counts'][1] + allel_2['counts'][1]
        elif len(allel_1['counts']) == 2:
            alt_nb = allel_1['Nb'][1]
            alt_c = allel_1['counts'][1]
        else:
            alt_nb = allel_2['Nb'][1]
            alt_c = allel_2['counts'][1]
        info = {'ref_Nb': allel_1['Nb'][0],
                'c_ref': allel_1['counts'][0] + allel_2['counts'][0],
                'alt_Nb': alt_nb,
                'c_alt': alt_c}
        freq_count.append(info)

It works, but running this for cols 6 to 240 takes 30 seconds, so completing the whole file would take about 8 hours.

freq_count = []
start_time = time.time()
main(240)
print("--- %s seconds ---" % (time.time() - start_time))

Is there another way to do this? I saw some other SO answers, but categorical data (...dtype='category') actually increased the time from 33 to 38 seconds in this case, and I'm not sure what else to try...

So every row contains 6 metadata columns to skip, then 248183 columns where each value can only be 'A', 'C', 'G' or 'T'? And the file is tab-separated?

with open('filtered_conv.ped') as file:
    # read line by line
    for line in file:
        chars = iter(line)
        # skip characters until the 6 metadata columns (6 tabs) have passed
        tab_count = 0
        while tab_count < 6:
            if next(chars) == "\t":
                tab_count += 1
        # iterate over the remaining characters
        for next_char in chars:
            # do something in case A, C, G or T is found; ' ', '\t' and '\n' are skipped
            if next_char == "A":
                pass  # do something
            elif next_char == "C":
                pass  # do something
            elif next_char == "G":
                pass  # do something
            elif next_char == "T":
                pass  # do something

If you need a progress bar, you can use tqdm like this:

from tqdm import tqdm
...
# read line by line
for line in tqdm(file):
...
