用Python计算蛋白质接触顺序



蛋白质接触顺序(CO)是残基间接触的局部性。CO还与蛋白质的折叠速度相关。较高的接触阶数表示较长的折叠时间,而较低的接触阶数来预测潜在的向下折叠,或在没有自由能屏障的情况下发生的蛋白质折叠。

这里有一个内联Web服务器和一个perl脚本来计算CO。

有没有办法计算python中的CO?

您可以利用这样一个事实,即您知道坐标数组的形状始终是(N,3),因此您可以取消创建数组和调用np.dot,这对于像这里这样的非常小的数组来说效率较低。这样,我就把你的函数重新写成了abs_contact_order2:

from __future__ import print_function, division
import mdtraj as md
import numpy as np
import numba as nb
@nb.njit
def abs_contact_order(xyz, atoms_residue, cutoff_nm=.6):
"""Return the absolute contact order."""
contact_count = 0
seq_distance_sum = 0
cutoff_2 = cutoff_nm*cutoff_nm
N = len(atoms_residue)
for i in xrange(N):
for j in xrange(N):
seq_dist = atoms_residue[j] - atoms_residue[i]
if seq_dist > 0:
d = xyz[j] - xyz[i]
if np.dot(d, d) < cutoff_2:
seq_distance_sum += seq_dist 
contact_count += 1

if contact_count==0.:
return 0.
return seq_distance_sum/float(contact_count)
@nb.njit
def abs_contact_order2(xyz, atoms_residue, cutoff_nm=.6):
"""Return the absolute contact order."""
contact_count = 0
seq_distance_sum = 0
cutoff_2 = cutoff_nm*cutoff_nm
N = len(atoms_residue)
for i in xrange(N):
for j in xrange(N):
seq_dist = atoms_residue[j] - atoms_residue[i]
if seq_dist > 0:
d = 0.0
for k in xrange(3):
d += (xyz[j,k] - xyz[i,k])**2
if d < cutoff_2:
seq_distance_sum += seq_dist 
contact_count += 1

if contact_count==0.:
return 0.
return seq_distance_sum/float(contact_count)

然后计时:

%timeit abs_co = abs_contact_order(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
1 loop, best of 3: 723 ms per loop
%timeit abs_co = abs_contact_order2(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
10 loops, best of 3: 28.2 ms per loop

我已经从函数中删除了print语句,它允许您在nopython模式下运行Numba-jit。如果您真的需要这些信息,我会返回所有必要的数据,并编写一个检查返回值并根据需要打印任何调试信息的瘦包装器。

还要注意,当对Numba函数进行计时时,您应该首先通过在计时循环之外调用函数来"预热"jit。通常,如果你要多次调用该函数,Numba jit该函数所需的时间比一次调用的时间要长,所以你并不能很好地了解代码的速度。但是,如果你只调用一次函数,并且启动时间很重要,那么继续按原样计时。

更新:

你可以进一步加快速度,因为你在(i,j)和(j,i)对上循环,它们是对称的:

@nb.njit
def abs_contact_order3(xyz, atoms_residue, cutoff_nm=.6):
"""Return the absolute contact order."""
contact_count = 0
seq_distance_sum = 0
cutoff_2 = cutoff_nm*cutoff_nm
N = len(atoms_residue)
for i in xrange(N):
for j in xrange(i+1,N):
seq_dist = atoms_residue[j] - atoms_residue[i]
if seq_dist > 0:
d = 0.0
for k in xrange(3):
d += (xyz[j,k] - xyz[i,k])**2
if d < cutoff_2:
seq_distance_sum += 2*seq_dist 
contact_count += 2

if contact_count==0.:
return 0.
return seq_distance_sum/float(contact_count)

和定时:

%timeit abs_co = abs_contact_order3(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
100 loops, best of 3: 15.7 ms per loop

确实存在!使用优秀的MDTraj这不是问题:

import numpy as np
import mdtraj as md
@jit
def abs_contact_order(xyz, atoms_residue, cutoff_nm=.6):
"""Return the absolute contact order."""
contact_count = 0
seq_distance_sum = 0
cutoff_2 = cutoff_nm*cutoff_nm
N = len(atoms_residue)
for i in xrange(N):
for j in xrange(N):
seq_dist = atoms_residue[j] - atoms_residue[i]
if seq_dist > 0:
d = xyz[j] - xyz[i]
if np.dot(d, d) < cutoff_2:
seq_distance_sum += seq_dist 
contact_count += 1

if contact_count==0.:
print("Warning, no contacts found!")
return 0.
print("contact_count: ", contact_count)
print("seq_distance_sum: ", seq_distance_sum)
return seq_distance_sum/float(contact_count)

运行使用:

traj = md.load("test.pdb")
print("Atoms: ", traj.n_atoms)
seq_atoms = np.array([a.residue.resSeq for a in traj.top.atoms], dtype=int)
abs_co = abs_contact_order(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
print("Absolute Contact Order: ", abs_co)
print("Relative Contact Order: ", abs_co/traj.n_residues)

如果没有numba,这需要16秒,只需添加一个@jit,时间就会减少到~1s。

原始的perl脚本需要大约8s

perl contactOrder.pl -r -a -c 6 test.pdb

测试文件以及jupyter笔记本都在这里。

最新更新