How can I extract data from multiple lines of text and save the result as JSON?



Suppose the text in the log file has the following format:

DEBUG: {"id":12311,"pool_num":"4125441212441893","full_name":"john doe","mobile":"000000","image_1":"upload\/d7379280d549499dd9c948341298703ee.jpeg","image_2":"upload\/4a190fb8941a3d746cff01aa945b.jpeg","image_3":"upload\/3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"}
DEBUG: {"id":12312,"pool_num":"89451222214511221","full_name":"jane doe","mobile":"000000","image_1":"upload\/d7379280d5494asdasd9c948341298123.jpeg","image_2":"upload\/4a190fb89asd123746cff01aa945b.jpeg","image_3":"upload\/3afd55aebb4dadasd15b9ac335dd9236661.jpeg"}
DEBUG: {"id":12313,"pool_num":"12312345612312312","full_name":"smith doe","mobile":"000000","image_1":"upload\/d7379280d549499dd9c948341298701551.jpeg","image_2":"upload\/123easfdsdagdfhdf213432123123.jpeg","image_3":"upload\/3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"}
DEBUG: {"id":12314,"pool_num":"82123423444112345","full_name":"adam doe","mobile":"000000","image_1":"upload\/d7379280d549499dd9c9483412987666.jpeg","image_2":"upload\/asfda1234235we3rtsdasdasdah456.jpeg","image_3":"upload\/3afd55aebb4d1461a4e15b9ac335dd94216.jpeg"}

Currently I can extract some of the data with this regular expression:

\b(?:pool_num|full_name|image_1|image_2|image_3)\":\"([^"]+)

Demo: https://regex101.com/r/ZmXaVl/1

However, the captured text still contains "\" and has not been cleaned up.

I want to extract clean values for pool_num, full_name, image_1, image_2, and image_3, and save them in JSON format to a .txt file.

My expected output is:

[
{
"pool_num" : 4125441212441893,
"full_name" : "john doe",
"image_1" : "d7379280d549499dd9c948341298703ee.jpeg",
"image_2" : "4a190fb8941a3d746cff01aa945b.jpeg",
"image_3" : "3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"
},
{
"pool_num" : 89451222214511221,
"full_name" : "jane doe",
"image_1" : "d7379280d5494asdasd9c948341298123.jpeg",
"image_2" : "4a190fb89asd123746cff01aa945b.jpeg",
"image_3" : "3afd55aebb4dadasd15b9ac335dd9236661.jpeg"
},
{
"pool_num" : 12312345612312312,
"full_name" : "smith doe",
"image_1" : "d7379280d549499dd9c948341298701551.jpeg",
"image_2" : "123easfdsdagdfhdf213432123123.jpeg",
"image_3" : "3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"
},
{
"pool_num" : 82123423444112345,
"full_name" : "adam doe",
"image_1" : "d7379280d549499dd9c9483412987666.jpeg",
"image_2" : "asfda1234235we3rtsdasdasdah456.jpeg",
"image_3" : "3afd55aebb4d1461a4e15b9ac335dd94216.jpeg"
}
]

What is the best way to get the desired output in Python?

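Here is one possible solution: it picks out the lines in the log that start with 'DEBUG: ', takes the JSON part of each line, and loads it with the json module, as suggested in @Tomerikoo's comment.

This yields a list of dictionaries close to the expected output listed in the question. Note that it keeps every key (including id and mobile), leaves the "upload/" prefix on the image paths, and keeps pool_num as a string, so the fields still need to be filtered and cleaned afterwards.

This solution relies on the lines starting with 'DEBUG: '. It can also be adapted to parse lines that carry an additional prefix.

If this approach fits the problem, it will be more robust than regex-based solutions, because the JSON parser decodes escapes such as "\/" itself.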

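import json
import pprint

# `log` is assumed to already hold the full log text as a single string.
pp = pprint.PrettyPrinter(indent=4)
mydata = []
lines = log.split("\n")
for line in lines:
    # Only lines that carry a JSON payload are of interest.
    if line.startswith("DEBUG: {"):
        # Everything after the prefix is the JSON object.
        json_string = line.split("DEBUG: ", 1)[1]
        mydata.append(json.loads(json_string))

pp.pprint(mydata)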
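
To get from the parsed dictionaries to the exact output requested in the question, the remaining steps are to keep only the five wanted fields, drop the "upload/" directory from the image paths, turn pool_num into a number, and write the list to a file. The following is a minimal sketch of that, not a definitive implementation. The function names extract_records and save_records, the constants WANTED and MARKER, and the file name output.txt are hypothetical names chosen for illustration; it also assumes pool_num always contains only digits. It uses find() instead of startswith() so that lines with an extra prefix before 'DEBUG: ' are still picked up.

```python
import json

# Fields requested in the question; other keys (id, mobile) are dropped.
WANTED = ("pool_num", "full_name", "image_1", "image_2", "image_3")
MARKER = "DEBUG: "


def extract_records(log_text):
    """Return one cleaned dict per DEBUG line that carries a JSON object."""
    records = []
    for line in log_text.splitlines():
        # find() rather than startswith() tolerates an extra prefix
        # (for example a timestamp) in front of the marker.
        start = line.find(MARKER + "{")
        if start == -1:
            continue
        try:
            raw = json.loads(line[start + len(MARKER):])
        except json.JSONDecodeError:
            continue  # skip truncated or malformed payloads
        record = {}
        for key in WANTED:
            value = raw.get(key)
            if value is not None:
                if key == "pool_num":
                    # Assumption: pool_num is always a string of digits.
                    value = int(value)
                elif key.startswith("image_"):
                    # json.loads has already turned "\/" into "/";
                    # keep only the file name after the last slash.
                    value = value.rsplit("/", 1)[-1]
            record[key] = value
        records.append(record)
    return records


def save_records(records, path):
    """Write the records to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)


# First log line from the question, kept as raw strings so "\/" survives.
sample = (
    r'DEBUG: {"id":12311,"pool_num":"4125441212441893","full_name":"john doe",'
    r'"mobile":"000000","image_1":"upload\/d7379280d549499dd9c948341298703ee.jpeg",'
    r'"image_2":"upload\/4a190fb8941a3d746cff01aa945b.jpeg",'
    r'"image_3":"upload\/3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"}'
)

records = extract_records(sample)
print(json.dumps(records, indent=4))
save_records(records, "output.txt")  # hypothetical output file name
```

One design note: Python stores pool_num exactly as an integer, but some of the sample values are larger than 2^53, so a JavaScript-based consumer of the file may round them. If that matters, leaving pool_num as a string is the safer choice.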
