我想知道是否有人知道任何工具可以让我计算多序列比对中任何特定位置的氨基酸频率。
例如,如果我有三个序列:
Species 1 - MMRSA
Species 2 - MMLSA
Species 3 - MMRTA
我想找到一种按位置搜索以下输出的方法:
Position 1 - M = 3;
Position 2 - M = 3;
Position 3 - R = 2, L = 1;
Position 4 - S = 2, T = 1;
Position 5 - A = 3.
谢谢!我熟悉R和Linux,但如果有其他软件可以做到这一点,我相信我可以学习。
使用R:
x <- read.table(text = "Species 1 - MMRSA
Species 2 - MMLSA
Species 3 - MMRTA")
ixCol = 1
table(sapply(strsplit(x$V4, ""), "[", ixCol))
# M
# 3
ixCol = 4
table(sapply(strsplit(x$V4, ""), "[", ixCol))
# S T
# 2 1
根据输入文件格式,可能存在专门构建的生物导管包/功能。
这真的很容易解析,你可以使用任何选择的语言。下面是Python中的一个例子,使用dict和Counter将数据组装到一个简单的对象中。
from collections import defaultdict, Counter
msa = '''
Species 1 - MMRSA
Species 2 - MMLSA
Species 3 - MMRTA
'''
r = defaultdict(list) #dictionary having the sequences index as key and the list of aa found at that index as value
for line in msa.split('n'):
line = line.strip()
if line:
sequence = line.split(' ')[-1]
for i, aa in enumerate(list(sequence)):
r[i].append(aa)
count = {k:Counter(v) for k,v in r.items()}
print(count)
#{0: Counter({'M': 3}), 1: Counter({'M': 3}), 2: Counter({'R': 2, 'L': 1}), 3: Counter({'S': 2, 'T': 1}), 4: Counter({'A': 3})}
按指定打印输出:
for k, v in count.items():
print(f'Position {k+1} :', end=' ') #add 1 to start counting from 1 instead of 0
for aa, c in v.items():
print(f'{aa} = {c};', end=' ')
print()
它打印:
Position 1 : M = 3;
Position 2 : M = 3;
Position 3 : R = 2; L = 1;
Position 4 : S = 2; T = 1;
Position 5 : A = 3;