How to implement Pythonic line-continuation reading



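I am trying to write a Python script that reads and extracts lines from an ASCII text file. This seemed fairly easy, but I ran into a problem I cannot solve on my own. The file I am trying to read contains text and some lines that start with *tr999. This keyword can be written in upper or lower case, the number of digits varies, and the * is optional. The asterisk can appear before or after the keyword. The keyword is followed by numbers, which can be int or float. To catch the keyword I use the Python regular expression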

re.search("[*]{0,1}[Tt][Rr][0-9]{1,5}[*]{0,1}",line)

The text file looks like this:

tr10* 1 2 3 22 1 1 13 12 33 33 33
*Tr20 12 22 -1 2  2 2 5 5 5 6 6 6 77
Tr20 1 1 1 &
      2 0 0
      1 1 1
      2 2 2
c that is a comment and below is the problem case '&' is missing
*tr22221 2 2 2
      1 1 1
      2 2 2

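The code I wrote fails to catch the last case, where the continuation signal & is missing. Using & to continue a line is optional; it can be replaced by several spaces at the beginning of the continuation line.

The code I wrote is: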

import sys
import re

fp = open(sys.argv[1], 'r')

# get the numbers only
def loop_conv(string):
    conv = []
    for i in string.split(" "):
        try:
            conv.append(float(i))
        except ValueError:
            pass
    return conv

# extract the information
def extract_trans_card(line, fp):
    extracted = False
    if len(line) > 2 and not re.search("[cC]", line.split()[0]) and re.search("[*]{0,1}[Tt][Rr][0-9]{1,5}[*]{0,1}", line):
        extracted = True
        trans_card = []
        trans_card.append(line.split()[0])
        line_old = line
        # this part here is because after the read signal,
        # data to be extracted might be on the same line
        for val in loop_conv(line):
            trans_card.append(val)
        # this part here fails. I am not able to catch the case where '&' is missing.
        # I tried to peek at the next line with seek() but I got a system error.
        # the idea is to loop as long as I have a continuation line
        while (re.search(r"^(\s){5,60}", line) or re.search("[&$]", line_old)) and len(trans_card) < 13:
            line = fp.readline()
            for val in loop_conv(line):
                trans_card.append(val)
            line_old = line

        #print('M',trans_card)
        print('value', trans_card)
        trans_card = []
    return extracted

# read the file with a loop
for line in fp:
    if not extract_trans_card(line, fp):
        print(line, end='')

The output is:

value ['tr10*', 1.0, 2.0, 3.0, 22.0, 1.0, 1.0, 13.0, 12.0, 33.0, 33.0, 33.0]
value ['*Tr20', 12.0, 22.0, -1.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 77.0]
value ['Tr20', 1.0, 1.0, 1.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
c that is a comment and below is the problem case '&' is missing
value ['*tr22221', 2.0, 2.0, 2.0]
      1 1 1
      2 2 2

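The last card is the problem: the lines 1 1 1 and 2 2 2 are ignored and simply echoed. This looks similar to the way Python continues lines, either by whitespace or by using &. I hope someone can help me with this and point out the right way to approach it.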

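The problem with your code's workflow is that, when the continuation signal is optional, it is hard to detect the last line that belongs to the current trans_card without disturbing the next trans_card.

Since the start (header) of a trans_card can be found with re.search(r"[*]?[Tt][Rr][0-9]{1,5}[*]?", line), it is easier to finish processing the previous trans_card whenever this header pattern is detected.

Below is a sample, roughly copied from your code logic, that saves the resulting trans_cards into a list of lists: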

import sys
import re

# get the floats only from line, copied from your code
def loop_conv(string):
    conv = []
    for i in string.split(" "):
        try:
            conv.append(float(i))
        except ValueError:
            pass
    return conv

# set previous trans_card with non-EMPTY vals list
def set_prev_trans_card(card, vals):
    if len(vals):
        card.append(vals)
        #print ('value: {}'.format(vals))

# below new code logic:
with open(sys.argv[1], 'r') as fp:
    trans_card = []
    # a list to save items retrieved from lines associated with the same trans_card
    values = []
    # set up a flag to identify header
    is_header = 0
    for line in fp:
        # skip blank lines (line.split()[0] would raise IndexError on them)
        if not line.strip():
            continue
        # if line is a comment, then skip it
        if re.search("[cC]", line.split()[0]):
            #print(line, end='')
            continue
        # if line is a header, append the existing values[] (from the previous trans_card)
        # list to trans_card[] and then reset values[]
        if len(line) > 2 and re.search(r"[*]?[Tt][Rr][0-9]{1,5}[*]?", line):
            # append values[] to trans_card
            set_prev_trans_card(trans_card, values)
            # reset values[] to the first \S+ on the header
            values = [line.split()[0]]
            # set is_header flag to 1
            is_header = 1
        # if line ends with &\n, then concatenate the next lines
        while line.endswith('&\n'):
            line += ' ' + fp.readline()
        # add all numbers (floats) from the header, or from lines that start with
        # 5-60 whitespace characters, into the values[] list, and reset is_header to 0
        if is_header or re.search(r"^(\s){5,60}", line):
            values.extend(loop_conv(line))
            is_header = 0
    # append the last values[] to trans_card
    set_prev_trans_card(trans_card, values)

for v in trans_card:
    print('value: {}'.format(v))

The output is:

value: ['tr10*', 1.0, 2.0, 3.0, 22.0, 1.0, 1.0, 13.0, 12.0, 33.0, 33.0, 33.0]
value: ['*Tr20', 12.0, 22.0, -1.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 77.0]
value: ['Tr20', 1.0, 1.0, 1.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
value: ['*tr22221', 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]

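Note: I skipped the len(trans_card) < 13 condition from your code, assuming it was only there to prevent an infinite while loop. If it serves another purpose, it should be easy to add to the sample code above.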
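If the limit does matter, one possible way to bring it back is to stop accepting continuation values once a card is full. This is only a sketch under an assumed meaning of the limit: the helper name add_line_values and the constant MAX_CARD_LEN are hypothetical, and the value 13 is taken from the condition in the question (the header token plus 12 numbers). Like the original while condition, it checks the length before each continuation line rather than after each value.

```python
MAX_CARD_LEN = 13  # assumption: header token plus 12 numbers, from `len(trans_card) < 13`

def add_line_values(values, numbers, is_header, max_len=MAX_CARD_LEN):
    """Extend `values` in place with `numbers` and return it.

    Header lines are always accepted; a continuation line is accepted only
    while the card still holds fewer than `max_len` entries.
    """
    if is_header or len(values) < max_len:
        values.extend(numbers)
    return values

# Small demonstration with the last card from the question.
card = ['*tr22221']
add_line_values(card, [2.0, 2.0, 2.0], is_header=True)
add_line_values(card, [1.0, 1.0, 1.0], is_header=False)
print(card)
```

In the sample code above, such a helper would take the place of the plain values.extend(loop_conv(line)) call.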

You may also want to add ^ to the comment and header patterns so that they match only at the start of the string instead of anywhere in it.
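As a rough illustration of that suggestion, the sketch below anchors both patterns and classifies a few lines. The names HEADER_RE, COMMENT_RE and classify are hypothetical, and the comment pattern assumes that a comment is a lone c or C in the first column followed by whitespace, which the question does not state explicitly.

```python
import re

# Anchored variants of the two patterns (assumptions, not the original code).
HEADER_RE = re.compile(r"^[*]?[Tt][Rr][0-9]{1,5}[*]?")
COMMENT_RE = re.compile(r"^[cC](\s|$)")  # assumes a lone 'c'/'C' in column 1 marks a comment

def classify(line):
    """Return 'comment', 'header' or 'other' for one physical line."""
    if COMMENT_RE.match(line):
        return "comment"
    if HEADER_RE.match(line):
        return "header"
    return "other"

for sample in ["tr10* 1 2 3", "c see tr10 above", "      1 1 1", "counts tr5 2"]:
    print(classify(sample), repr(sample))
```

With the unanchored patterns, a line such as "counts tr5 2" would be treated as a comment (its first token contains a c) and would also match the header search; the anchored versions classify it as neither.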

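Here is a Pythonic way to process a file (or in fact any iterable whose items are strings, which may or may not end with a newline) where a continuation can be specified either by an "&" in the last column of the current "record" (Python itself uses "\") or by a space character at the start of the next "record":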

import re

def read_lines_with_continue(line_iter):
    """This function is passed an iterator where each iteration returns the next line.
    It joins logical continuations: a line that ends with '&' is continued by the
    next line, and a line that begins with a space continues the previous one."""
    next_line = ''
    saw_continue = True
    for line in line_iter:
        # get rid of any trailing '&'
        edited_line = re.sub(r'&$', '', line)
        if saw_continue:
            next_line += edited_line
            saw_continue = False
        elif line.startswith(' '):
            next_line += edited_line
        else:
            # a new record starts here: emit the finished one, if any
            if next_line != '':
                yield next_line
            next_line = edited_line
        if line != edited_line:
            saw_continue = True
    if next_line != '':
        yield next_line

lines = [
    '1abc',
    '2def&',
    'ghi',
    ' xyz',
    ' ver&',
    'jkl',
    '3aaa',
    '4xxx',
    ' yyy'
]

# instead of passing a list, you could also pass a file
# (lines read from a file keep their trailing '\n', so strip it first if needed)
for l in read_lines_with_continue(lines):
    print(l)

The output is:

1abc
2defghi xyz verjkl
3aaa
4xxx yyy
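To connect this generator idea back to the data in the question, here is a minimal self-contained sketch that first joins physical lines into logical records and then parses each record into a card. It is an illustration under stated assumptions, not the code from either answer: the names logical_records, parse_card and HEADER_RE are hypothetical, the header pattern is the anchored variant, any leading whitespace counts as a continuation (the question mentions 5 to 60 spaces), and a comment line is assumed to end the current card.

```python
import io
import re

HEADER_RE = re.compile(r"^[*]?[Tt][Rr][0-9]{1,5}[*]?")  # anchored variant (assumption)

def logical_records(lines):
    """Yield one joined string per record.

    A line ending in '&' is continued by the next line; a line starting with
    whitespace continues the current record.
    """
    record = ''
    pending = False  # True when the previous physical line ended with '&'
    for raw in lines:
        line = raw.rstrip('\n')
        continued = line.rstrip().endswith('&')
        if continued:
            line = line.rstrip()[:-1]  # drop the trailing '&'
        if pending or (record and line[:1].isspace()):
            record += ' ' + line
        else:
            if record:
                yield record
            record = line
        pending = continued
    if record:
        yield record

def parse_card(record):
    """Return [header, float, ...] for a card record, or None for anything else."""
    tokens = record.split()
    if not tokens or not HEADER_RE.match(tokens[0]):
        return None
    card = [tokens[0]]
    for tok in tokens[1:]:
        try:
            card.append(float(tok))
        except ValueError:
            pass  # ignore non-numeric tokens
    return card

SAMPLE = """\
Tr20 1 1 1 &
      2 0 0
      1 1 1
      2 2 2
c that is a comment and below is the problem case '&' is missing
*tr22221 2 2 2
      1 1 1
      2 2 2
"""

# io.StringIO stands in for an open file; any iterable of lines works.
cards = [c for c in map(parse_card, logical_records(io.StringIO(SAMPLE))) if c]
for card in cards:
    print(card)
```

Splitting the work this way keeps the continuation rules in one place, so the card parser only ever sees complete records.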
