How to extract and validate XML content from a log file in Python

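I need to parse some log files whose content resembles XML, but there is no root element and plain text is interleaved with the markup.

The log file format is: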

2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373011</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:30:02Z</TimeStamp>
</Outcome>
2019-09-12 15:32:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373012</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:32:02Z</TimeStamp>
</Outcome>

Given that this is a log file, can I use the ElementTree library? I need to verify that Measured is Ok for the different item IDs.

I tried this, but it does not work:

import xml.etree.ElementTree as ET
import re
with open('C:lovelyLibrariessite.log') as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")

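It is probably not possible to parse a file that contains arbitrary text fragments mixed in with XML. The text parts are quite likely to contain content that looks like XML but is not well-formed (such as <?xml[^>]+?>); in the general case, telling that apart from the real XML is impossible.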
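To make this concrete, here is a minimal sketch using only the standard library. It assumes that the `<ThreadPool>` token in the sample log lines is the kind of XML-looking text that trips the parser; the helper name `try_parse_wrapped` is hypothetical and exists only for illustration.

```python
import xml.etree.ElementTree as ET

# Two lines copied from the sample log: a plain-text log line, then real XML.
sample = """2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
"""

def try_parse_wrapped(text):
    """Wrap text in a synthetic <root> element and report whether it parses."""
    try:
        ET.fromstring("<root>" + text + "</root>")
        return True
    except ET.ParseError:
        return False

# <ThreadPool> is read as an opening tag that is never closed,
# so wrapping the raw log in <root> is not enough.
print(try_parse_wrapped(sample))
# The XML fragment on its own is well-formed.
print(try_parse_wrapped("<Keepalive />"))
```

The wrapping trick only works when everything between the synthetic root tags is itself well-formed, which the log lines do not guarantee.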

Try SimplifiedDoc from the simplified_scrapy library. It is highly fault-tolerant and treats the data as text.

from simplified_scrapy import SimplifiedDoc
html = '''
2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373011</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:30:02Z</TimeStamp>
</Outcome>
2019-09-12 15:32:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373012</ItemId>
<AreaId>232</AreaId>
<CarrierId>131</CarrierId>
<AResult>
<Measured>Ok</Measured>
</AResult>
<TimeStamp>2019-09-12T19:32:02Z</TimeStamp>
</Outcome>
'''
doc = SimplifiedDoc(html)
# Outcome = doc.Outcome  # singular form: a single <Outcome> element
Outcomes = doc.Outcomes  # plural form: every <Outcome> element
print(Outcomes.ItemId.text, Outcomes.AreaId.text)

Result:

['373011', '373012'] ['232', '232']

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
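If adding a third-party dependency is not an option, a standard-library alternative might look like the sketch below. It is not the approach above, and it rests on assumptions: every `<Outcome>` block is complete, blocks are not nested, and no log line is written in the middle of a block. The names `OUTCOME_RE` and `measured_by_item` are hypothetical. Each block is cut out with a regular expression, parsed on its own with ElementTree, and the Measured value is collected per ItemId.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match of one complete <Outcome>...</Outcome> block (assumed not nested).
OUTCOME_RE = re.compile(r"<Outcome>.*?</Outcome>", re.DOTALL)

def measured_by_item(log_text):
    """Map each ItemId to the text of its AResult/Measured element."""
    results = {}
    for match in OUTCOME_RE.finditer(log_text):
        try:
            outcome = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # skip blocks that are truncated or malformed
        item_id = outcome.findtext("ItemId")
        results[item_id] = outcome.findtext("AResult/Measured")
    return results

# Shortened copy of the sample log from the question.
log_text = """2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373011</ItemId>
<AResult>
<Measured>Ok</Measured>
</AResult>
</Outcome>
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
<ItemId>373012</ItemId>
<AResult>
<Measured>Ok</Measured>
</AResult>
</Outcome>
"""

results = measured_by_item(log_text)
# Item IDs whose Measured value is anything other than "Ok".
failed = [item for item, measured in results.items() if measured != "Ok"]
print(results)
print(failed)
```

Parsing each block separately keeps the non-XML log lines away from the parser entirely, so a stray token such as `<ThreadPool>` cannot break the run.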
