我的文件格式如下:
45 TCGGCGCAGCGTTAGGATTAG 44 TTTCATCTGCCGCCGTTGCCG
43 CGTTTTCGGATGGATCATTGT 43 CTTGCGACGCATTTGGATCAG
35 CCGATTGCTAATCGGCAGTTG 32 TCGCGGTATCCGTCTCTTAAT
31 TTTAATGCTAAGACTACGTGG 31 TTGTGGGTGCATCAGGATTTG
31 ACTTGGCTGGCTAATGTGCAG 31 CGTTCTCTGCGGATTTATCAG
31 TCGCCGACGCCTTATTGTAAT 30 ACGAAGCCTTAGTGGATGCTT
15 ACGATGATGATGCCTCATCTT 3 ATTACTGAGCTTAAGGCGAAG
它有1000万行,第一列是出现,第二列是DNA序列。我想先将DNA转换为肽。然后,我想计算7个位置的20个氨基酸的总体发生(列),总体发生在6个位置,总数为400个位置,在5个位置以8000个氨基酸三聚体的总体出现,并在4个位置出现160000的总体出现。这是我的代码:
#!/usr/bin/env python
#!/usr/bin/env python
def fre(aa,cn):
import numpy as np
m=len(aa)
position_monomer=np.zeros((m,7))
position_dimer=np.zeros((m,6))
position_trimer=np.zeros((m,5))
position_tetramer=np.zeros((m,4))
fre_monomer=np.zeros((20,7))
fre_dimer=np.zeros((400,6))
fre_trimer=np.zeros((8000,5))
fre_tetramer=np.zeros((160000,4))
aa_default=['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
for i in xrange(m):
for j in xrange(7):
position_monomer[i][j]=aa_default.index(aa[i][j])
fre_monomer[position_monomer[i][j]][j]=fre_monomer[position_monomer[i][j]][j]+cn[i]
for i in xrange(m):
for j in xrange(6):
position_dimer[i][j]=20*aa_default.index(aa[i][j])+aa_default.index(aa[i][j+1])
fre_dimer[position_dimer[i][j]][j]=fre_dimer[position_dimer[i][j]][j]+cn[i]
for i in xrange(m):
for j in xrange(5):
position_trimer[i][j]=400*aa_default.index(aa[i][j])+20*aa_default.index(aa[i][j+1])+aa_default.index(aa[i][j+2])
fre_trimer[position_trimer[i][j]][j]=fre_trimer[position_trimer[i][j]][j]+cn[i]
for i in xrange(m):
for j in xrange(4):
position_tetramer[i][j]=8000*aa_default.index(aa[i][j])+400*aa_default.index(aa[i][j+1])+20*aa_default.index(aa[i][j+2])+aa_default.index(aa[i][j+3])
fre_tetramer[position_tetramer[i][j]][j]=fre_tetramer[position_tetramer[i][j]][j]+cn[i]
return fre_monomer,fre_dimer,fre_trimer,fre_tetramer
#!/usr/bin/env python
#!/usr/bin/env python
from __future__ import division
import numpy as np
from Position import fre
from Xdelete import X_delete
from Bio.Seq import translate
import time
start=time.clock()
textfile = open(r"C:UsersShangyang LiDesktopFINAL 8 DATA SETSillumina_phd7_allseqs.txt","r")
data = []
copy_number=[]
DNA=[]
peptide=[]
for line in textfile:
data.append(line.split())
for row in range(len(data)):
copy_number.append(int(data[row][0]))
DNA.append(data[row][1])
peptide=[translate(line) for line in DNA]
(peptides,copy_numbers)=X_delete(peptide,copy_number)
(fre_monomer,fre_dimer,fre_trimer,fre_tetramer)=fre(peptides,copy_numbers)
,但我想获得单体,二聚体,三聚体,四聚体的频率,现在是30分钟到1000万个数据大小。我想使用多处理来解决这个问题,但我不知道如何使用它。
尝试结合一些循环。如果两个外循环进行完全相同的迭代,请组合其内部环以避免外迭代:
for i in xrange(m):
for j in xrange(7):
....
for i in xrange(m):
for j in xrange(6):
....
变为:
for i in xrange(m):
for j in xrange(7):
....
for j in xrange(6):
....
另外,看看您是否可以避免使用某些index()
调用,因为它们通常必须在整个列表中迭代。