How to extract and validate XML content from a log file in Python



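I need to parse some log files whose content resembles XML, but there is no root element and plain text is interleaved between the XML fragments.

The log file format is: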

2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373011</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:30:02Z</TimeStamp>
</Outcome>
2019-09-12 15:32:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373012</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:32:02Z</TimeStamp>
</Outcome>

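Since this is a log file rather than an XML document, can I still use the ElementTree library? I need to verify that Measured is Ok for each item ID.

I tried the following, but it does not work: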

import xml.etree.ElementTree as ET
import re

with open('C:lovelyLibrariessite.log') as f:
    xml = f.read()

# Wrap the whole log in a synthetic <root> element, then parse it as one document
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")

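It is probably impossible to parse a file that mixes arbitrary fragments of text with XML. The text parts are quite likely to contain content that looks like XML but is not well-formed (such as <?xml[^>]+?>); in the general case, telling this apart from the real XML is not possible.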
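For this particular log, a narrower approach may still be workable. The sketch below assumes that every record of interest is a well-formed, non-nested `<Outcome>...</Outcome>` block; under that assumption each block can be cut out with a regular expression and parsed on its own with ElementTree, so stray tokens such as `<ThreadPool>` in the log prefix never reach the parser. The names `LOG_TEXT`, `measured_by_item` and `failed_items` are made up for illustration, and the sample text is copied from the question.

```python
import re
import xml.etree.ElementTree as ET

# Sample copied from the question; in practice this would be read from the log file.
LOG_TEXT = """\
2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373011</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:30:02Z</TimeStamp>
</Outcome>
2019-09-12 15:32:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373012</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:32:02Z</TimeStamp>
</Outcome>
"""

# Assumption: each <Outcome> block is well-formed and blocks are never nested.
OUTCOME_RE = re.compile(r"<Outcome>.*?</Outcome>", re.DOTALL)


def measured_by_item(log_text):
    """Map each ItemId to the text of its AResult/Measured element."""
    results = {}
    for match in OUTCOME_RE.finditer(log_text):
        # Parse only the extracted fragment, not the surrounding log text.
        outcome = ET.fromstring(match.group(0))
        results[outcome.findtext("ItemId")] = outcome.findtext("AResult/Measured")
    return results


def failed_items(log_text):
    """Return the ItemIds whose Measured value is not exactly 'Ok'."""
    return [
        item_id
        for item_id, measured in measured_by_item(log_text).items()
        if measured != "Ok"
    ]


print(measured_by_item(LOG_TEXT))  # {'373011': 'Ok', '373012': 'Ok'}
print(failed_items(LOG_TEXT))      # []
```

If an `<Outcome>` block in a real log turns out to be truncated or malformed, `ET.fromstring` raises `ET.ParseError` for that fragment, so a caller would probably want to catch it per block rather than abort the whole scan.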

Try this. It is highly fault-tolerant and treats the data as text.

from simplified_scrapy import SimplifiedDoc
html = '''
2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373011</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:30:02Z</TimeStamp>
</Outcome>
2019-09-12 15:32:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373012</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:32:02Z</TimeStamp>
</Outcome>
'''
doc = SimplifiedDoc(html)
# Outcome = doc.Outcome
# The plural attribute collects every <Outcome> element (see the result below)
Outcomes = doc.Outcomes
print(Outcomes.ItemId.text, Outcomes.AreaId.text)

Result:

['373011', '373012'] ['232', '232']

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
