Suppose the text in a log file has the following format:
DEBUG: {"id":12311,"pool_num":"4125441212441893","full_name":"john doe","mobile":"000000","image_1":"upload\/d7379280d549499dd9c948341298703ee.jpeg","image_2":"upload\/4a190fb8941a3d746cff01aa945b.jpeg","image_3":"upload\/3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"}
DEBUG: {"id":12312,"pool_num":"89451222214511221","full_name":"jane doe","mobile":"000000","image_1":"upload\/d7379280d5494asdasd9c948341298123.jpeg","image_2":"upload\/4a190fb89asd123746cff01aa945b.jpeg","image_3":"upload\/3afd55aebb4dadasd15b9ac335dd9236661.jpeg"}
DEBUG: {"id":12313,"pool_num":"12312345612312312","full_name":"smith doe","mobile":"000000","image_1":"upload\/d7379280d549499dd9c948341298701551.jpeg","image_2":"upload\/123easfdsdagdfhdf213432123123.jpeg","image_3":"upload\/3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"}
DEBUG: {"id":12314,"pool_num":"82123423444112345","full_name":"adam doe","mobile":"000000","image_1":"upload\/d7379280d549499dd9c9483412987666.jpeg","image_2":"upload\/asfda1234235we3rtsdasdasdah456.jpeg","image_3":"upload\/3afd55aebb4d1461a4e15b9ac335dd94216.jpeg"}
Currently I can extract some of the data with this regular expression:
\b(?:pool_num|full_name|image_1|image_2|image_3)\":\"([^"]+)
Demo: https://regex101.com/r/ZmXaVl/1
However, the extracted text still contains "\" and has not been cleaned up.
I want to extract clean values from pool_num, full_name, image_1, image_2 and image_3, and save them in JSON format to a .txt file.
My expected output is:
[
    {
        "pool_num" : 4125441212441893,
        "full_name" : "john doe",
        "image_1" : "d7379280d549499dd9c948341298703ee.jpeg",
        "image_2" : "4a190fb8941a3d746cff01aa945b.jpeg",
        "image_3" : "3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"
    },
    {
        "pool_num" : 89451222214511221,
        "full_name" : "jane doe",
        "image_1" : "d7379280d5494asdasd9c948341298123.jpeg",
        "image_2" : "4a190fb89asd123746cff01aa945b.jpeg",
        "image_3" : "3afd55aebb4dadasd15b9ac335dd9236661.jpeg"
    },
    {
        "pool_num" : 12312345612312312,
        "full_name" : "smith doe",
        "image_1" : "d7379280d549499dd9c948341298701551.jpeg",
        "image_2" : "123easfdsdagdfhdf213432123123.jpeg",
        "image_3" : "3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"
    },
    {
        "pool_num" : 82123423444112345,
        "full_name" : "adam doe",
        "image_1" : "d7379280d549499dd9c9483412987666.jpeg",
        "image_2" : "asfda1234235we3rtsdasdasdah456.jpeg",
        "image_3" : "3afd55aebb4d1461a4e15b9ac335dd94216.jpeg"
    }
]
What is the best Pythonic way to get the desired output?
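Here is one possible solution: take the lines of the log that start with 'DEBUG: ', pull out the JSON part of each line, and load it with json.loads, as suggested in @Tomerikoo's comment.
This yields a list of dictionaries with the same structure as the expected output in the question. Note that json.loads keeps every field on the line (including id and mobile), decodes the escaped "\/" as "/", and leaves pool_num as a string, so the image values still carry the upload/ prefix.
This solution relies on the lines starting with 'DEBUG: '. It can also be adapted to parse lines that carry an additional prefix.
If this approach fits the problem, it should be more resilient than a regex-based solution.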
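import json
import pprint

# `log` is assumed to hold the full text of the log file as one string.
pp = pprint.PrettyPrinter(indent=4)
mydata = []
lines = log.split("\n")
for line in lines:
    # Only lines of the form 'DEBUG: {...}' carry a JSON payload.
    if line.startswith("DEBUG: {"):
        # Split once, so a later 'DEBUG: ' inside the payload is left alone.
        json_string = line.split("DEBUG: ", 1)[1]
        mydata.append(json.loads(json_string))
pp.pprint(mydata)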
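To get closer to the exact output in the question, the parsed dictionaries could be trimmed and cleaned before saving. The following is only a sketch under a few assumptions: the names extract_records, save_records, WANTED and PREFIX are made up for illustration, pool_num is converted to an integer only when it consists purely of digits, and the image values are reduced to whatever follows the last "/". It also uses str.partition so that lines with extra text before 'DEBUG: ' (a timestamp, for example) are still picked up.

```python
import json

# Fields named in the question; everything else (id, mobile) is dropped.
WANTED = ("pool_num", "full_name", "image_1", "image_2", "image_3")
PREFIX = "DEBUG: "


def extract_records(log_text):
    """Return one cleaned dict per 'DEBUG: {...}' line found in log_text."""
    records = []
    for line in log_text.splitlines():
        # partition tolerates extra text before the marker (assumed case).
        _, marker, payload = line.partition(PREFIX)
        if not marker or not payload.startswith("{"):
            continue
        try:
            raw = json.loads(payload)  # also decodes the escaped "\/" as "/"
        except json.JSONDecodeError:
            continue  # skip lines whose payload is not valid JSON
        record = {}
        for key in WANTED:
            value = raw.get(key)
            if key == "pool_num" and isinstance(value, str) and value.isdecimal():
                value = int(value)  # the expected output shows a bare number
            elif key.startswith("image_") and isinstance(value, str):
                value = value.rsplit("/", 1)[-1]  # drop the "upload/" directory
            record[key] = value
        records.append(record)
    return records


def save_records(records, path):
    """Write the records to path as an indented JSON array."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=4)


if __name__ == "__main__":
    sample = r'DEBUG: {"id":12311,"pool_num":"4125441212441893","full_name":"john doe","mobile":"000000","image_1":"upload\/d7379280d549499dd9c948341298703ee.jpeg","image_2":"upload\/4a190fb8941a3d746cff01aa945b.jpeg","image_3":"upload\/3afd55aebb4d1461a4e15b9ac335dd92380.jpeg"}'
    print(json.dumps(extract_records(sample), indent=4))
```

Calling save_records(extract_records(log), "output.txt") would then write the list to a text file; the file name here is just a placeholder. Fields missing from a line come out as null in this sketch, which may or may not be what you want.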