使用 Python 拆分数据并将排序的数据分配给列以进行 Excel 视图



嗨,我在文本文件中有一组如下所示的数据(用于替换学校数据的虚拟数据(

01-01-1998 00:00:00 AM  GP: D(B):1234 to time difference. Hourly Avg:-3 secs
01-01-1998 00:00:12 AM  GP: D(A): 2345 to time difference. Hourly Avg:0 secs
01-01-1998 00:08:08 AM  SYS: The Screen Is now minimised.
01-01-1998 00:09:10 AM  00:09:10 AM SC: Findcorrect: W. D:1. Count one two three four five.       #there are somehow some glitch in the system showing 2 timestamp
01-01-1998 00:14:14 AM  SC: D1 test. Old:111, New:222, Calculated was 123, out of 120 secs.    
01-01-1998 01:06:24 AM  ET: Program Disconnected event.

我想整理数据显示为下面的预期结果,格式为

[['Timestamp','System','Di','Message']    #  <-- header
['01-01-1998 00:00:00 AM', 'GP:','D(B):','1234 to time difference. Hourly Avg:-3 secs'],
['01-01-1998 00:00:12 AM', 'GP:','D(A):', '2345 to time difference. Hourly Avg:0 secs'],
['01-01-1998 00:08:08 AM', 'SYS:','','The Screen Is now minimised.'],   #<-- with a blank
['01-01-1998 00:09:10 AM', 'SC:','','Findcorrect: HW. D:1. Count one two three four five.'],
['01-01-1998 00:14:14 AM', 'SC:','D1','test. Old:111, New:222, Calculated was 123, out of 120 secs.' ],
['01-01-1998 01:06:24 AM', 'ET:','', 'Program Disconnected event.']]

基本上,我想从一开始就添加一个标题,然后允许数据适合 4 列(时间戳、ststem、Di 和消息(。一些可能是问题的问题是,在源数据的第 4 行,它会在系统中复制一个额外的时间戳,从而导致需要清除它。

Di 字段将主要包含"D(A(、D(B(、D1 和 D2

字符串的其余部分将位于"消息"字段下。 然后,最终的输出将发送到Excel文件以进行数据分析。

如下所示的代码是我尝试过的(在python中很弱(

import re
from itertools import islice
from itertools import groupby
from operator import itemgetter
import xlsxwriter
import pandas as pd
import os
content = []
with open("DTBL.log","r") as infile:
for line in infile.readlines():
words = line.split(" ")
#print(line)
#content.append(words).split(',')
print(words)
timestamp = line[:22]
system = line[23:28]
##print(timestamp)
content.append(timestamp + ', ' + system + ','+(line[29:]).rstrip('n'))
print(content)
for l in content:
if l[-2].isdigit():

由于缺乏python知识,代码尚未完全开发,我将非常感谢任何指导或示例! 一些问题需要思考,我使用熊猫/数据帧吗?或者我可以在没有PD的情况下做到这一点吗?

编辑:第一行数据更新为"D(B(1234",数字和 D(B( 之间不应有任何空格

清理此混乱数据的代码部分使用正则表达式,部分使用字符串插值。

清理后的 csv 的编写使用模块 csv,因为需要在文本中屏蔽内部,(例如符合Old:111, New:222, ...(:

创建演示文件:

with open("data.txt","w") as w:
w.write("""01-01-1998 00:00:00 AM  GP: D(B): 1234 to time difference. Hourly Avg:-3 secs
01-01-1998 00:00:12 AM  GP: D(A): 2345 to time difference. Hourly Avg:0 secs
01-01-1998 00:08:08 AM  SYS: The Screen Is now minimised.
01-01-1998 00:09:10 AM  00:09:10 AM SC: Findcorrect: W. D:1. Count one two three four five.       #there are somehow some glitch in the system showing 2 timestamp
01-01-1998 00:14:14 AM  SC: D1 test. Old:111, New:222, Calculated was 123, out of 120 secs.    
01-01-1998 01:06:24 AM  ET: Program Disconnected event.""")

解析并编写它:

import re
def parseLine(line):
# get the timestamp
ts = re.match(r"d{2}-d{2}-d{4} d{2}:d{2}:d{2} +(?:AM|PM)",line)
# get all but the timestamp - cleaning the double-time issue
cleaned = re.sub(r"^d{2}-d{2}-d{4} (d{2}:d{2}:d{2} (AM|PM) +)+","", line)
# split cleaned part based on occurence of ["D(A)", "D(B)", "D1", "D2"]
if any(k in cleaned.split(":")[1] for k in ["D(A)", "D(B)", "D1", "D2"]):
system, di, msg = cleaned.split(" ", maxsplit = 2)
else:
di = ""
system, msg = cleaned.split(":", maxsplit = 1)
# return each line as list of cleaned stuff:
return [ts[0].strip() ,system.strip(), di.strip(), msg.strip()]
# fixed header, lines will be appended   
p = [['Timestamp','System','Di','Message']]
with open("data.txt","r") as r:
for l in r:
l = l.strip()
p.append(parseLine(l))
import csv
with open("c.csv","w",newline="") as w:
writer = csv.writer(w,quoting=csv.QUOTE_ALL)
writer.writerows(p)

读取并输出写入的文件:

with open("c.csv") as r:
print(r.read())

文件内容(屏蔽的csv(否则st. Old:111, New:222, Calculated was 123, ...会损坏您的格式:

"Timestamp","System","Di","Message"
"01-01-1998 00:00:00 AM","GP:","D(B):","1234 to time difference. Hourly Avg:-3 secs"
"01-01-1998 00:00:12 AM","GP:","D(A):","2345 to time difference. Hourly Avg:0 secs"
"01-01-1998 00:08:08 AM","SYS","","The Screen Is now minimised."
"01-01-1998 00:09:10 AM","SC","","Findcorrect: W. D:1. Count one two three four five.       #there are somehow some glitch in the system showing 2 timestamp"
"01-01-1998 00:14:14 AM","SC:","D1","test. Old:111, New:222, Calculated was 123, out of 120 secs."
"01-01-1998 01:06:24 AM","ET","","Program Disconnected event."

最新更新