Calculating the GC content of each sequence in a list separately

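I have a list of sequences and I am trying to calculate the GC content of each one as a percentage (that is, the percentage of the letters 'G', 'g', 'C', 'c' in the sequence).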

#series of sequences
seq0,seq1,seq2,seq3,seq4,seq5 = 'CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTGACCCAAGGACCCGGCCAC','attgccattatataACCCGGCCACCCCCATAGGCAGATGTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG','TCXCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC','CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAAAAGAAttacggacttaGTCAGCCCCGCAGGGG','ATGGGGTGATCGTCGCTCGCGGGCTCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA','AAAAAAGAAGTCGCTCGCGTCGCTCGCGGGCTGGGCTCTGTCTGCGTCGCTCGCGGGCTAGAGAGCCAGGGTGA'
#sequences aggregated into a list
NTs = [seq0,seq1,seq2,seq3,seq4,seq5]
#specifying nucleotides
nucleotides = ['G','A','C','T', 'U']
#checking and removing if there are any non-nucleotide characters present
if any(x not in nucleotides for x in NTs):
    print("ERROR: non-nucleotide characters present")
[''.join(i for i in x if i.upper() in nucleotides) for x in NTs]
#calculating GC percent of each sequence using the aggregated list
gCountseq0 = seq0.count('G') + seq0.count('g')
cCountseq0 = seq0.count('C') + seq0.count('c')
gcContentseq0 = ((gCountseq0 + cCountseq0)*100) / len(seq0)
print('The GC content of seq0 is',gcContentseq0,'%')

From this I only get the output:

ERROR: non-nucleotide characters present
The GC content of seq0 is 70.0 %

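Ultimately, I am trying to get something that looks like the output below, but I am a bit stuck. I don't know how to use the NTs list as the input to the GC% calculation so that I can process all the sequences at once instead of each one individually.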

ERROR: non-nucleotide characters present in seq2
The GC content of seq0 is x %
The GC content of seq1 is x %
The GC content of seq2 is x %
The GC content of seq3 is x %
The GC content of seq4 is x %
The GC content of seq5 is x %

You only need to iterate over the list of sequences (NTs) in a loop and calculate the GC content on each iteration.

Here is a function for the GC calculation:

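def GC_calc(fa_string):
    # Upper-case once so that lowercase 'g' and 'c' are counted too
    _string = fa_string.upper()
    _G = _string.count('G')
    _C = _string.count('C')
    # GC percentage; raises ZeroDivisionError for an empty string
    return (_G + _C)/len(_string) * 100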
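If non-nucleotide characters are stripped before the calculation, a sequence could in principle end up empty, and the division above would then fail. A minimal sketch of a guarded variant follows; the name `gc_calc_safe` and the choice to return 0.0 for an empty sequence are assumptions made for illustration, not part of the original answer.

```python
def gc_calc_safe(fa_string):
    """GC percentage of fa_string; 0.0 for an empty string (assumed convention)."""
    seq = fa_string.upper()
    if not seq:
        # Nothing to count, so avoid dividing by zero.
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


print(gc_calc_safe("ATGC"))
print(gc_calc_safe(""))
```

Raising an exception instead of returning 0.0 would be an equally reasonable choice if an empty sequence should be treated as an error.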

And here is the loop:

for i,j in zip(names, NTs):
    print(f'The GC content of {i} is {GC_calc(j)} %')

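Here I use the zip function to iterate over the names and the sequences at the same time. I think this is the better way. To do this, you need to pass zip a list of sequence names as well. Note that zip stops at the shorter of the two lists, so the names list needs one entry per sequence.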

# one name per sequence, in the same order as NTs
names = ['seq0', 'seq1', 'seq2', 'seq3', 'seq4', 'seq5']
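The error message in the question fires for every input because `x not in nucleotides` compares each whole sequence string against the list of single letters, so no sequence ever matches. One possible way to get the per-sequence error and the GC lines together is sketched below. The names `VALID_NUCLEOTIDES` and `gc_report`, the one-decimal formatting, and the short demo sequences are all assumptions made for illustration; the allowed alphabet is taken from the question.

```python
# Hypothetical helper names; the nucleotide alphabet is taken from the question.
VALID_NUCLEOTIDES = set("GACTU")


def gc_report(names, seqs):
    """Return report lines: error lines first, then one GC line per sequence."""
    errors, results = [], []
    for name, seq in zip(names, seqs):
        upper = seq.upper()
        # Check character by character, not whole strings against the list.
        cleaned = "".join(ch for ch in upper if ch in VALID_NUCLEOTIDES)
        if len(cleaned) != len(upper):
            errors.append(f"ERROR: non-nucleotide characters present in {name}")
        # Assumed convention: report 0.0 if no valid characters remain.
        gc = (cleaned.count("G") + cleaned.count("C")) / len(cleaned) * 100 if cleaned else 0.0
        results.append(f"The GC content of {name} is {gc:.1f} %")
    return errors + results


# Short made-up sequences for illustration, not the ones from the question.
demo_names = ["seq0", "seq1", "seq2"]
demo_seqs = ["GGCC", "atgc", "GCXA"]
for line in gc_report(demo_names, demo_seqs):
    print(line)
```

Here the GC percentage is computed on the cleaned sequence, so the stray 'X' in the third demo sequence is reported and then excluded from the length. Calling `gc_report(names, NTs)` with the lists from the question should give output in the shape asked for.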
