Sample file:
Column header 95: A|T|E|A|A|Y|E|A|E|A
Column header 96: W|I|Q|Q|A|L|P|K|E|A
Column header 97: S|D|F|Q|G|Y|E|A|E|A
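I want to calculate the percentage composition of amino acids for each column in a CSV file. I can compute the percentages for the first column, but I can't work out how to iterate over the remaining columns and print the percentages for all of them.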
import csv
with open('test.csv', 'r') as f:
    reader = csv.reader(f)
    column = [row[0] for row in reader]
amino_acids = {}
for aa in column:
    if aa in amino_acids:
        amino_acids[aa] += 1
    else:
        amino_acids[aa] = 1
for aa, count in amino_acids.items():
    #print(f'{aa}: {count}')
    percentage = count / len(column) * 100
    print(f"{aa}: {percentage: .2f}%")
Expected output:
column header 95:
A=50%
E=30% and so on
similarly for the remaining columns.
Please advise.
It's not clear how your input is structured, but you can apply the following code to each line.
Code:
ls = 'A|T|E|A|A|Y|E|A|E|A'.split('|')
['{}={}%'.format(i, ls.count(i)/len(ls)*100) for i in set(ls)]
Output:
['T=10.0%', 'A=50.0%', 'E=30.0%', 'Y=10.0%']
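To run this over every line rather than a single hard-coded string, the same counting can be wrapped in a small helper. This is only a sketch: `composition` is a hypothetical name, and it assumes each line looks like the sample above (a header, a colon, then residues separated by `|`).

```python
from collections import Counter

def composition(line):
    """Return (header, {amino_acid: percent}) for one 'header: A|T|E' line."""
    # Split once on the first ':'; everything after it is the sequence.
    header, _, sequence = line.partition(':')
    residues = [aa.strip() for aa in sequence.split('|') if aa.strip()]
    counts = Counter(residues)
    total = len(residues)
    # Sort by amino acid so the output order is stable (a set has no fixed order).
    return header.strip(), {aa: 100 * n / total for aa, n in sorted(counts.items())}

# Two lines copied from the sample file in the question.
sample = [
    'Column header 95: A|T|E|A|A|Y|E|A|E|A',
    'Column header 96: W|I|Q|Q|A|L|P|K|E|A',
]
for line in sample:
    header, percents = composition(line)
    print(header + ':')
    for aa, pct in percents.items():
        print(f'{aa}={pct:g}%')
```

For a real file, the `sample` list could be replaced by iterating over the open file object.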
Process the file with basic Python file reading, since it is not really a CSV file.
from collections import Counter

def show_stats(filename):
    ' shows the percentage of amino acids for each line in file '
    with open(filename, 'r') as f:
        for line in f:
            line = line.rstrip().split(':')  # remove trailing '\n' and split on ':'
            column_info, sequence = line  # separate into column info and amino acid sequence
            sequence = sequence.strip().split('|')  # remove leading & trailing whitespace and split on '|'
            amino_acids = Counter(sequence)  # count of each amino acid in sequence
            percent_convert_factor = 100.0/sum(amino_acids.values())  # 100 divided by total count (for conversion to percent)
            for k in amino_acids:
                amino_acids[k] *= percent_convert_factor  # convert counts to percentage
            amino_acids = dict(sorted(amino_acids.items(), key = lambda kv: kv[0]))  # in ascending order by amino acid
            print(column_info)  # column header
            print('\n'.join(f"{aa}={percentage: .2f}%" for aa, percentage in amino_acids.items()))  # amino acid percentages

# Process file
show_stats('test.csv')
Output:
Column header 95
A= 50.00%
E= 30.00%
T= 10.00%
Y= 10.00%
Column header 96
A= 20.00%
E= 10.00%
I= 10.00%
K= 10.00%
L= 10.00%
P= 10.00%
Q= 20.00%
W= 10.00%
Column header 97
A= 20.00%
D= 10.00%
E= 20.00%
F= 10.00%
G= 10.00%
Q= 10.00%
S= 10.00%
Y= 10.00%
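If the data really is a column-oriented CSV, as the `row[0]` indexing in the question suggests, the columns can be walked by transposing the rows with `zip(*rows)`. The sketch below is an assumption about that layout, not something the sample file confirms: it supposes the first row holds the column headers and every later row holds one residue per column. `column_percentages` and the in-memory `demo` data are hypothetical.

```python
import csv
import io
from collections import Counter

def column_percentages(rows):
    """Yield (header, {amino_acid: percent}) for each column of a row iterable."""
    rows = [row for row in rows if row]        # skip blank lines
    if not rows:
        return
    header, *body = rows                       # assumed: first row holds column names
    # zip(*body) transposes rows into columns (truncating to the shortest row).
    for name, column in zip(header, zip(*body)):
        counts = Counter(column)
        yield name, {aa: 100 * n / len(column) for aa, n in sorted(counts.items())}

# Hypothetical in-memory stand-in for a column-oriented test.csv.
demo = io.StringIO("95,96\nA,W\nT,I\nE,Q\nA,Q\n")
for name, percents in column_percentages(csv.reader(demo)):
    print(f'column header {name}:')
    for aa, pct in percents.items():
        print(f'{aa}={pct:.2f}%')
```

With a file on disk, `csv.reader(open_file)` would take the place of the `io.StringIO` object.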