Parsing a structured (machine-generated) text file (config file) into a structured table format



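The main goal is to get from a more or less readable config file to a table format that anyone can read, in order to gain deeper insight into the machines and their configuration standard.

I have a config file: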

******A MANO:111111         ,20190726,001,0914,06621242746     
DXS*HAWA776A0A*VA*V0/6*1
ST*001*0001
ID1*HAW250755*VMI1-9900****250755*6*0
CB1*021545*DeBright*7.030.16*3.02*250755
PA1*0*100
PA1*1*60
PA2*2769*166140*210*12600*0*0*0*0
******E MANO:111111         ,20190726,001,0914,06621242746     
******A MANO:222222         ,20190726,001,0914,06621242746     
DXS*HAWA776A0A*VA*V0/6*1
ST*001*0001
ID1*HAW250755*VMI1-9900****250755*6*0
CB1*021545*DeBright*7.030.16*3.02*250755
PA1*0*100
PA1*1*60
PA2*2769*166140*210*12600*0*0*0*0
******E MANO:222222         ,20190726,001,0914,06621242746   

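The file contains several objects. Each one always starts with 'A MANO:' and ends with 'E MANO:', followed by the object number. All lines in between are the attributes of the object (the settings of the machine). Not all objects have the same number of settings: one object might have 55 lines, another 199.

What I have tried so far: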

from pyparsing import *
'''
grammar:
object_nr ::= Word(nums, exact=6)
num ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
'''
path_input = r'\......'
with open(path_input) as input_file:
    line = input_file.readline()
    cnt = 1
object_nr_parser = Word(nums, exact=6)
for match, start, stop in object_nr_parser.scanString(input_file):
    print(match, start, stop)

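This prints: ['201907'] 116 122 ['019211'] 172 178

So it finds a number together with its start and end position in the string. But this is not the number I am looking for, and it is not correct either. I cannot even find the second number anywhere in the config file.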

Is pyparsing the right way to solve this problem, or is there a more convenient way to do it? Where am I making the mistake?

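In the end it would be great to have one object per machine with its attributes, which would be all the lines between 'A MANO' and 'E MANO'.

The expected result would look like this: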

{"object": "111111",
"line1":"DXS*HAWA776A0A*VA*V0/6*1",
"line2":"ST*001*0001",
"line3":"ID1*HAW250755*VMI1-9900****250755*6*0",
"line4":"CB1*021545*DeBright*7.030.16*3.02*250755",
"line5":"PA1*0*100",
"line6":"PA1*1*60",
"line7":"PA2*2769*166140*210*12600*0*0*0*0"},
{"object": "222222",
"line1":"DXS*HAWA776A0A*VA*V0/6*1",
"line2":"ST*001*0001",
"line3":"ID1*HAW250755*VMI1-9900****250755*6*0",
"line4":"CB1*021545*DeBright*7.030.16*3.02*250755",
"line5":"PA1*0*100",
"line6":"PA1*1*60",
"line7":"PA2*2769*166140*210*12600*0*0*0*0",
"line8":"PA2*2769*166140*210*12600*0*0*0*0",
"line9":"PA2*2769*166140*210*12600*0*0*0*0",
"line10":"PA2*2769*166140*210*12600*0*0*0*0"}

I am not sure whether this is the best solution for the purpose, but it is what comes to mind at the moment.

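One of the dirtiest ways to get this done would be to use regular expressions, rearranging the line breaks and replacing all remaining line breaks with ';'. I don't think that is a solution that should be used.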

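You can parse it line by line:

import re
with open('file.txt', 'r') as f:
    lines = f.readlines()
    lines = [x.strip() for x in lines]
result = []
name = ''
i = 1
for line in lines:
    if 'A MANO' in line:
        # Start marker: capture the digits that follow 'A MANO:'
        name = re.findall(r'A MANO:(\d+)', line)[0]
        result.append({'object': name})
        i = 1  # restart the line counter for the new object
    elif 'E MANO' not in line:
        # Every line that is not an end marker is a setting of the current object
        result[-1][f'line{i}'] = line
        i += 1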

Output:

[{
        'object': '111111',
        'line1': 'DXS*HAWA776A0A*VA*V0/6*1',
        'line2': 'ST*001*0001',
        'line3': 'ID1*HAW250755*VMI1-9900****250755*6*0',
        'line4': 'CB1*021545*DeBright*7.030.16*3.02*250755',
        'line5': 'PA1*0*100',
        'line6': 'PA1*1*60',
        'line7': 'PA2*2769*166140*210*12600*0*0*0*0'
    }, {
        'object': '222222',
        'line1': 'DXS*HAWA776A0A*VA*V0/6*1',
        'line2': 'ST*001*0001',
        'line3': 'ID1*HAW250755*VMI1-9900****250755*6*0',
        'line4': 'CB1*021545*DeBright*7.030.16*3.02*250755',
        'line5': 'PA1*0*100',
        'line6': 'PA1*1*60',
        'line7': 'PA2*2769*166140*210*12600*0*0*0*0'
    }
]
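As for where the original attempt goes wrong: it passes the file object `input_file` to `scanString` rather than the text read from the file, so the reported matches probably do not come from the file contents at all. Even with the right text, a bare "any six digits" pattern is ambiguous on a header line. The following is a minimal sketch using only `re`, with the header line copied from the sample above; the variable names `loose` and `anchored` are just illustrative.

```python
import re

# Header line copied from the sample file in the question.
header = "******A MANO:111111         ,20190726,001,0914,06621242746     "

# A bare six-digit pattern also matches inside the date and the trailing
# number, so it cannot single out the object number.
loose = re.findall(r"\d{6}", header)

# Anchoring on the literal marker captures only the object number.
anchored = re.search(r"A MANO:(\d+)", header).group(1)

print(loose)     # ['111111', '201907', '066212']
print(anchored)  # 111111
```

Note the backslash in `\d`: without it the pattern looks for a literal letter `d` and matches nothing on this line.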

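However, I would suggest a more compact output format:

import re
with open('file.txt', 'r') as f:
    lines = f.readlines()
    lines = [x.strip() for x in lines]
result = {}
name = ''
for line in lines:
    if 'A MANO' in line:
        # Start marker: use the object number as the dictionary key
        name = re.findall(r'A MANO:(\d+)', line)[0]
        result[name] = []
    elif 'E MANO' not in line:
        # Collect every setting line under the current object
        result[name].append(line)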

Output:

{
    '111111': ['DXS*HAWA776A0A*VA*V0/6*1', 'ST*001*0001', 'ID1*HAW250755*VMI1-9900****250755*6*0', 'CB1*021545*DeBright*7.030.16*3.02*250755', 'PA1*0*100', 'PA1*1*60', 'PA2*2769*166140*210*12600*0*0*0*0'],
    '222222': ['DXS*HAWA776A0A*VA*V0/6*1', 'ST*001*0001', 'ID1*HAW250755*VMI1-9900****250755*6*0', 'CB1*021545*DeBright*7.030.16*3.02*250755', 'PA1*0*100', 'PA1*1*60', 'PA2*2769*166140*210*12600*0*0*0*0']
}
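Since the stated goal is a table, one possible next step is to flatten this dictionary into rows. The sketch below assumes that `*` is the field separator and that the first field of each line is a record tag (`DXS`, `PA1`, ...); neither is confirmed by the question. The helper name `to_rows` is hypothetical, and the sample data is shortened to two settings per object.

```python
import csv
import io

# Result in the compact format shown above, shortened for the example.
result = {
    "111111": ["DXS*HAWA776A0A*VA*V0/6*1", "PA1*0*100"],
    "222222": ["ST*001*0001", "PA1*1*60"],
}

def to_rows(parsed):
    """Yield one row per setting: object number, record tag, then the fields."""
    for object_nr, settings in parsed.items():
        for setting in settings:
            # Assumption: '*' separates fields and the first field is the tag.
            tag, *fields = setting.split("*")
            yield [object_nr, tag, *fields]

rows = list(to_rows(result))

# Written to an in-memory buffer here; a file opened with newline="" works the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Rows have different lengths because the records carry different numbers of fields, so a fixed header would need to be decided per record tag.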
