How to conditionally delete a sequence of lines from a txt file using Python



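I have a large text file downloaded from the MS-DIAL metabolomics MSP Spectral Kit, containing EI-MS and MS/MS spectra.

Opened as a text file, the compounds look like this: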

NAME: C11H11NO5; PlaSMA ID-967
PRECURSORMZ: 238.0712
PRECURSORTYPE: [M+H]+
FORMULA: C11H11NO5
Ontology: Formula predicted
INCHIKEY:
SMILES:
RETENTIONTIME: 1.74
CCS: -1
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos
Num Peaks: 2
192.06602   53
238.0757    31
NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141
PRECURSORMZ: 656.19415
PRECURSORTYPE: [M+H]+
FORMULA: C29H35O17
Ontology: Anthocyanidin O-glycosides
INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O
SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1
RETENTIONTIME: 2.81
CCS: 241.3010517
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only
Num Peaks: 0

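Each compound's data runs from one NAME line to the next NAME line.

What I want to do is remove every compound whose Num Peaks: value is zero (i.e. Num Peaks: 0). If a compound's Num Peaks line (the 13th line of a record in the sample above) is Num Peaks: 0, all data for that compound should be deleted (that line and the 12 lines above it).

In the sample above, after removing the lines from NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141 through Num Peaks: 0, I need to save the data back in txt or msp format.

All I have done so far is import the data as a list: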

with open('pathtoMSMS-Public-Pos-VS15.msp') as f:
    lines = f.readlines()

Then I create a list of the indices where each compound starts:

# Match only lines that begin a record, not any line that happens to contain 'NAME'.
indices = [i for i, s in enumerate(lines) if s.startswith('NAME:')]

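I think I now need to pick out the compounds where the difference between consecutive indices is greater than 14 (meaning the number of peaks is greater than 0):

import numpy as np

# Find the difference between consecutive start indices.
v = np.diff(indices)

Select the positions where the difference is exactly 14 and prepend the element 0: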


diff14 = np.where(v == 14)
diff14 = np.append([0],diff14[0])

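Now I want to select only the values that are not in diff14, so that I can build a new list containing the compounds with more than 0 peaks.

I need some kind of loop to select the right indices, but I don't know how to write it:

lines[indices[diff14[0]] : indices[diff14[1]]]
lines[indices[diff14[1]+1] : indices[diff14[2]]]
lines[indices[diff14[2]+1] : indices[diff14[3]]]
lines[indices[diff14[3]+1] : indices[diff14[4]]]

Any better ideas or hints are much appreciated.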
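For reference, the index-based idea can be finished without comparing index differences at all. The following is only a sketch under assumptions: the helper name `keep_compounds_with_peaks` and the tiny `lines` sample are made up for illustration, and every record is assumed to start with a line beginning `NAME:` and to contain a `Num Peaks:` line.

```python
def keep_compounds_with_peaks(lines):
    """Return the lines of every compound whose 'Num Peaks:' value is above 0."""
    # Start index of each compound record.
    starts = [i for i, s in enumerate(lines) if s.startswith('NAME:')]
    # Pair each start with the next start; the last record runs to the end.
    ends = starts[1:] + [len(lines)]
    kept = []
    for start, end in zip(starts, ends):
        block = lines[start:end]
        for row in block:
            if row.startswith('Num Peaks:'):
                if int(row.split(':', 1)[1]) > 0:
                    kept.extend(block)
                break  # only the first 'Num Peaks:' line of a record matters
    return kept


# Hypothetical, shortened sample standing in for f.readlines().
lines = [
    'NAME: A\n', 'Num Peaks: 2\n', '192.06602   53\n', '238.0757    31\n',
    'NAME: B\n', 'Num Peaks: 0\n',
]
kept = keep_compounds_with_peaks(lines)
```

The resulting `kept` list could then be written back with `writelines()` on a file opened for writing.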

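This is not as compact or memory-efficient as the other answers, but hopefully it is easier to understand and extend.

The approach I suggest is to parse your input into, for example, a list of lists, where each element holds a single compound. I suggest three steps: (1) parse the data into a list of compounds, (2) iterate over that list and remove the unwanted compounds, (3) write the list back out to a file. Depending on the size of the file, this can be done in a single pass over the data or in three separate loops.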

# Step (1) Parse the file
compounds = list()  # store all compounds
with open('compound.txt', 'r') as f:
    # Stores a single compound as a list of rows for a given compound.
    # Note: can be improved to e.g. a dictionary or a custom class
    current_compound = list()
    for line in f:
        if line.strip() == '':  # assumes each compound is split by empty line(s)
            print('Empty line')
            # Store previous compound
            if len(current_compound) != 0:
                compounds.append(list(current_compound))
            # Prepare for next compound
            current_compound = list()
        else:
            # At this point we could parse this more,
            # e.g. separate into key/value, but let's just append the whole line with trailing newline
            print('Adding', line.strip())
            current_compound.append(line)
    # Store the last compound if the file does not end with an empty line
    if len(current_compound) != 0:
        compounds.append(list(current_compound))

OK, now let's check the progress:

for item in compounds:
    print('\n===Compound===\n', item)

This results in:

===Compound===
 ['NAME: C11H11NO5; PlaSMA ID-967\n', 'PRECURSORMZ: 238.0712\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C11H11NO5\n', 'Ontology: Formula predicted\n', 'INCHIKEY:\n', 'SMILES:\n', 'RETENTIONTIME: 1.74\n', 'CCS: -1\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos\n', 'Num Peaks: 2\n', '192.06602   53\n', '238.0757    31\n']

===Compound===
 ['NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141\n', 'PRECURSORMZ: 656.19415\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C29H35O17\n', 'Ontology: Anthocyanidin O-glycosides\n', 'INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O\n', 'SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1\n', 'RETENTIONTIME: 2.81\n', 'CCS: 241.3010517\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only\n', 'Num Peaks: 0\n']

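You can then iterate over this list of compounds and remove those with Num Peaks set to 0 before writing the rest back to a file. Let me know if you need help with that part.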
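Steps (2) and (3) could look roughly like the sketch below. It is not part of the original answer: the helper names `has_peaks` and `write_compounds`, the output file name, and the two-record sample are assumptions for illustration. It also assumes that a record without any `Num Peaks:` line should be dropped, and that records are written back separated by one empty line.

```python
import os
import tempfile


def has_peaks(compound):
    """Return True if the compound's 'Num Peaks:' line reports a value above 0."""
    for row in compound:
        if row.startswith('Num Peaks:'):
            return int(row.split(':', 1)[1]) > 0
    return False  # assumption: records without a 'Num Peaks:' line are dropped


def write_compounds(compounds, path):
    """Write each compound to path, followed by one empty separator line."""
    with open(path, 'w') as out:
        for compound in compounds:
            out.writelines(compound)
            out.write('\n')


# Tiny stand-in for the `compounds` list built in step (1).
compounds = [
    ['NAME: A\n', 'Num Peaks: 2\n', '192.06602   53\n', '238.0757    31\n'],
    ['NAME: B\n', 'Num Peaks: 0\n'],
]

# Step (2): keep only the compounds that have at least one peak.
kept = [c for c in compounds if has_peaks(c)]

# Step (3): write them out (a temporary directory is used here only for the demo).
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'compound_filtered.msp')
    write_compounds(kept, out_path)
    with open(out_path) as f:
        written = f.read()
```

Building a new list with a comprehension avoids deleting elements from `compounds` while iterating over it.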

# Read the tmp file created from the text you supplied
# (sections are assumed to be separated by blank lines)
with open('tmpWrt.txt', 'r') as filedat:
    filelines = filedat.readlines()

tmp_lines = []  # lines of the section currently being read
flag = 'n'      # becomes 'y' once the section reports peaks > 0
line_count = 0

# Open the output file object
with open('tmp_out.txt', 'w') as file_out:
    # Iterate through all file lines
    for line in filelines:
        fields = line.split()
        # If line is the beginning of a section, reset the tmp variables
        if fields and fields[0] == "NAME:":
            tmp_lines = []
            flag = 'n'
        tmp_lines.append(line)
        line_count += 1
        # If line is the end of a section and peaks > 0, write to file
        if (not fields or line_count == len(filelines)) and flag == 'y':
            file_out.writelines(tmp_lines)
            # Reset so that repeated blank lines do not write the section twice
            tmp_lines = []
            flag = 'n'
        # If peaks > 0 set flag to "y"
        if fields and fields[0] == "Num" and int(fields[2]) != 0:
            flag = "y"

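Here is a fairly simple way to process the file.

Open the data file and iterate over its lines, storing them in a list (the cache). If a line starts with NAME:, it is the start of a new record, and if the cache is not empty, its contents can be printed.

If a line starts with Num Peaks:, check its value. If the value is 0, empty the cache, which causes that record to be discarded.

Lines containing only whitespace are skipped.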

with open('data') as f:
    line_cache = []
    for line in f:
        if line.startswith('NAME:'):
            if line_cache:
                print(*line_cache, sep='')
                line_cache = []
        elif line.startswith('Num Peaks:'):
            num_peaks = int(line.partition(': ')[2])
            if num_peaks == 0:
                line_cache = []
                continue
        if line.strip():        # filter empty lines
            line_cache.append(line)
    if line_cache:    # don't forget the last record
        print(*line_cache, sep='', end='')

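The output goes to stdout, which can be redirected to a file in a shell environment. If you want to write directly to a file, you can open it at the start and modify the print() calls:

with open('output', 'w') as output, open('data') as f:
    ...

and change the print() calls to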

print(*line_cache, sep='', file=output)
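Put together, the file-writing variant might look like the sketch below. The function name `filter_records`, its `src_path` and `dst_path` parameters, and the demo file names are assumptions for illustration; the loop itself follows the cache logic described above.

```python
import os
import tempfile


def filter_records(src_path, dst_path):
    """Copy records from src_path to dst_path, skipping those with 'Num Peaks: 0'."""
    with open(dst_path, 'w') as output, open(src_path) as f:
        line_cache = []
        for line in f:
            if line.startswith('NAME:'):
                # A new record starts: flush the previous one if it was kept.
                if line_cache:
                    print(*line_cache, sep='', file=output)
                line_cache = []
            elif line.startswith('Num Peaks:'):
                if int(line.partition(': ')[2]) == 0:
                    line_cache = []  # forget the record with no peaks
                    continue
            if line.strip():  # filter empty lines
                line_cache.append(line)
        if line_cache:  # don't forget the last record
            print(*line_cache, sep='', end='', file=output)


# Demo on a small made-up input written to a temporary directory.
sample = (
    'NAME: A\nNum Peaks: 2\n1 2\n3 4\n\n'
    'NAME: B\nNum Peaks: 0\n\n'
    'NAME: C\nNum Peaks: 1\n5 6\n'
)
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, 'data')
    dst = os.path.join(tmp_dir, 'output')
    with open(src, 'w') as f:
        f.write(sample)
    filter_records(src, dst)
    with open(dst) as f:
        result = f.read()
```

Because only one record is cached at a time, this version never holds the whole file in memory.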
