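I have a large text file downloaded from the MS-DIAL metabolomics MSP spectral kit, which contains EI-MS and MS/MS spectra.
The file opens as plain text, and the compounds look like this: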
NAME: C11H11NO5; PlaSMA ID-967
PRECURSORMZ: 238.0712
PRECURSORTYPE: [M+H]+
FORMULA: C11H11NO5
Ontology: Formula predicted
INCHIKEY:
SMILES:
RETENTIONTIME: 1.74
CCS: -1
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos
Num Peaks: 2
192.06602 53
238.0757 31
NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141
PRECURSORMZ: 656.19415
PRECURSORTYPE: [M+H]+
FORMULA: C29H35O17
Ontology: Anthocyanidin O-glycosides
INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O
SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1
RETENTIONTIME: 2.81
CCS: 241.3010517
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only
Num Peaks: 0
Each compound consists of the data between one NAME line and the next NAME line.
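What I want to do is remove all compounds whose Num Peaks: value is zero (i.e. Num Peaks: 0). If the 12th line of a compound is Num Peaks: 0, then all data for that compound should be deleted (go 12 lines up and delete).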
In the example above, after deleting the lines from NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141 through Num Peaks: 0, I need to save the data back in txt or msp format.
All I have done so far is import the data as a list:
with open('path/to/MSMS-Public-Pos-VS15.msp') as f:
    lines = f.readlines()
and then create a list of the indices at which each compound starts:
indices = [i for i, s in enumerate(lines) if 'NAME' in s]
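I think I now need to find the consecutive indices whose difference is greater than 14 (meaning the number of peaks is greater than 0):
import numpy as np
# to find the difference between consecutive indices.
v = np.diff(indices)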
Select the ones whose difference is exactly 14 and add the element 0 in the first position:
diff14 = np.where(v == 14)
diff14 = np.append([0],diff14[0])
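Now I want to select only the values that are not in diff14, in order to build a new list containing the compounds with more than 0 peaks.
I need some kind of loop to select the right indices, but I don't know how: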
lines[indices[diff14[0]]: indices[diff14[1]]]
lines[indices[diff14[1]+1] : indices[diff14[2]]]
lines[indices[diff14[2]+1] : indices[diff14[3]]]
lines[indices[diff14[3]+1] : indices[diff14[4]]]
Any better ideas or hints are much appreciated.
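This is not as compact or memory-efficient as the other answers, but hopefully it is easier to understand and extend.
The approach I suggest is to parse your input into, for example, a list of lists, where each element holds a single compound. I suggest three steps: (1) parse the data into a list of compounds, (2) iterate over that list and remove the unwanted compounds, (3) write the list back to a file. Depending on the size of the file, this can be done in a single pass over the data or in three separate loops.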
# Step (1) Parse the file
compounds = list()  # store all compounds
with open('compound.txt', 'r') as f:
    # stores a single compound as a list of rows for a given compound.
    # Note: can be improved to e.g. a dictionary or a custom class
    current_compound = list()
    for line in f:
        if line.strip() == '':  # assumes each compound is split by empty line(s)
            print('Empty line')
            # Store previous compound
            if len(current_compound) != 0:
                compounds.append(list(current_compound))
            # prepare for next compound
            current_compound = list()
        else:
            # At this point we could parse this more,
            # e.g. separate into key/value, but let's just append the whole line with trailing newline
            print('Adding', line.strip())
            current_compound.append(line)
    # Store the last compound if the file does not end with an empty line
    if len(current_compound) != 0:
        compounds.append(list(current_compound))
OK, now let's check the progress:
for item in compounds:
    print('\n===Compound===\n', item)
which results in:
===Compound===
['NAME: C11H11NO5; PlaSMA ID-967\n', 'PRECURSORMZ: 238.0712\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C11H11NO5\n', 'Ontology: Formula predicted\n', 'INCHIKEY:\n', 'SMILES:\n', 'RETENTIONTIME: 1.74\n', 'CCS: -1\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos\n', 'Num Peaks: 2\n', '192.06602 53\n', '238.0757 31\n']
===Compound===
['NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141\n', 'PRECURSORMZ: 656.19415\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C29H35O17\n', 'Ontology: Anthocyanidin O-glycosides\n', 'INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O\n', 'SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1\n', 'RETENTIONTIME: 2.81\n', 'CCS: 241.3010517\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only\n', 'Num Peaks: 0\n']
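You can then iterate over this list of compounds and remove those with Num Peaks set to 0 before writing the result back to a file. Let me know if you need help with that part.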
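Steps (2) and (3) could look roughly like the sketch below. It assumes `compounds` is a list of lists of lines, as produced by the parsing step; the helper names `has_peaks` and `write_compounds`, the sample data, and the output file name are made up for illustration. Dropping a compound that has no Num Peaks: line at all is also an assumption, not something the question specifies.

```python
import os
import tempfile


def has_peaks(compound):
    """Return True if the compound's 'Num Peaks:' line holds a value above zero."""
    for row in compound:
        if row.startswith('Num Peaks:'):
            return int(row.split(':', 1)[1]) > 0
    return False  # assumption: a compound without a 'Num Peaks:' line is dropped


def write_compounds(compounds, path):
    """Write every compound to path, followed by one empty separator line."""
    with open(path, 'w') as out:
        for compound in compounds:
            out.writelines(compound)
            out.write('\n')


# Tiny stand-in for the list built in step (1)
compounds = [
    ['NAME: A\n', 'Num Peaks: 2\n', '192.06602 53\n', '238.0757 31\n'],
    ['NAME: B\n', 'Num Peaks: 0\n'],
]

# Step (2): keep only the compounds that have at least one peak
kept = [c for c in compounds if has_peaks(c)]

# Step (3): write the remaining compounds back to a file
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'compound_filtered.msp')  # hypothetical name
    write_compounds(kept, out_path)
    with open(out_path) as f:
        written = f.read()

print(written)
```

Building a new list instead of deleting items from `compounds` while looping over it avoids the usual problem of skipping elements during in-place removal.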
# Read the tmp file created with the text you supplied
with open('tmpWrt.txt', 'r') as filedat:
    filelines = filedat.readlines()
# Open output file object
file_out = open('tmp_out.txt', 'w')
line_count = 0
# Start with an empty section so lines before the first NAME: are harmless
tmp_lines = []
flag = 'n'
# Iterate through all file lines
for line in filelines:
    # If line is beginning of section
    # reset tmp variables
    if line != "\n" and line.split()[0] == "NAME:":
        tmp_lines = []
        flag = 'n'
    tmp_lines.append(line)
    line_count += 1
    # If line is the end of a section and peaks > 0
    # write to file
    if (line == "\n" or line_count == len(filelines)) and flag == 'y':
        #tmp_lines.append("\n")
        for tmp_line in tmp_lines:
            file_out.write(tmp_line)
        # Reset so repeated empty lines do not write the section twice
        tmp_lines = []
        flag = 'n'
    # If peaks > 0 set flag to "y"
    if line != "\n" and line.split()[0] == "Num":
        if int(line.split()[2]) != 0:
            flag = "y"
file_out.close()
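This is a fairly simple way to process the file.
Open the data file and iterate over its lines, storing them in a list (the cache). If a line starts with NAME:, it is the start of a new record, and the cache can be printed if it is not empty.
If a line starts with Num Peaks:, check the value. If it is 0, empty the cache, which causes that record to be forgotten.
Lines containing only whitespace are skipped.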
with open('data') as f:
    line_cache = []
    for line in f:
        if line.startswith('NAME:'):
            if line_cache:
                print(*line_cache, sep='')
                line_cache = []
        elif line.startswith('Num Peaks:'):
            num_peaks = int(line.partition(': ')[2])
            if num_peaks == 0:
                line_cache = []
                continue
        if line.strip():  # filter empty lines
            line_cache.append(line)
    if line_cache:  # don't forget the last record
        print(*line_cache, sep='', end='')
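The output goes to stdout, so it can be redirected to a file in a shell environment. If you want to write directly to a file, you can open it at the beginning and modify the print() statements:
with open('output', 'w') as output, open('data') as f:
    ...
and change the print() calls to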
print(*line_cache, sep='', file=output)
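Put together, the file-writing variant might look like the sketch below. The function name `filter_msp`, the sample text, and the temporary file names are assumptions made for this illustration; the logic follows the line-cache approach described above, with a counter of written records added only to make the result easy to check.

```python
import os
import tempfile


def filter_msp(src, dst):
    """Copy records from src to dst, skipping records with 'Num Peaks: 0'.

    Returns the number of records written.
    """
    kept = 0
    with open(dst, 'w') as output, open(src) as f:
        line_cache = []
        for line in f:
            if line.startswith('NAME:'):
                # A new record starts: flush the previous one if it survived
                if line_cache:
                    print(*line_cache, sep='', file=output)
                    kept += 1
                line_cache = []
            elif line.startswith('Num Peaks:'):
                if int(line.partition(':')[2]) == 0:
                    line_cache = []  # forget this record
                    continue
            if line.strip():  # filter empty lines
                line_cache.append(line)
        if line_cache:  # don't forget the last record
            print(*line_cache, sep='', end='', file=output)
            kept += 1
    return kept


# Small made-up input with three records, one of them without peaks
sample = (
    'NAME: A\nNum Peaks: 2\n1.0 5\n2.0 7\n\n'
    'NAME: B\nNum Peaks: 0\n\n'
    'NAME: C\nNum Peaks: 1\n3.0 9\n'
)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.msp')
    dst = os.path.join(tmp, 'out.msp')
    with open(src, 'w') as f:
        f.write(sample)
    kept = filter_msp(src, dst)
    with open(dst) as f:
        result = f.read()

print(kept)
print(result)
```

Because the file is read line by line and only one record is cached at a time, memory use stays small even for a large MSP file.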