How to parse invalid JSON delimited by newlines



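MongoDB log messages are in JSON format and live in a file named mongod.log, with each log message separated by a newline character (\n).

I am trying to:
  • Capture each line (log message) that is valid JSON
  • Convert the JSON into a Python dictionary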

The error I keep getting is

json.decoder.JSONDecodeError: Extra data: line 2 column 1

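I understand this is raised because the log file as a whole is not valid JSON; only the individual log messages are.

How can I iterate over them one by one?

At first I thought of writing my own iter/next loop that grabs each line and moves on once a valid JSON log message has been processed. I now see there is a way to parse newline-delimited JSON with json.decoder.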

import json

file_path = 'mongod.log'
with open(file_path, 'r') as file:
    # json.load expects the whole file to be one JSON document, hence the error
    data = json.load(file)
print(data)

mongod.log

{"t":{"$date":"2021-03-09T15:50:43.475-06:00"},"s":"I",  "c":"CONTROL",  "id":20712,   "ctx":"LogicalSessionCacheReap","msg":"Sessions collection is not set up; waiting until next sessions reap interval","attr":{"error":"NamespaceNotFound: config.system.sessions does not exist"}}
{"t":{"$date":"2021-03-10T10:33:51.002-06:00"},"s":"I",  "c":"CONTROL",  "id":23377,   "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}
{"t":{"$date":"2021-04-02T21:38:59.486-05:00"},"s":"I",  "c":"CONTROL",  "id":20714,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"NotYetInitialized: Replication has not yet been configured"}}

You can import this file like this:

import json
import pandas as pd

file_path = 'mongod.log'
with open(file_path, 'r') as f:
    # Parse each line as its own JSON document
    df = pd.DataFrame([json.loads(line) for line in f])
print(df)

…which outputs:

                                            t  s        c     id                         ctx                                                msg                                               attr
0  {'$date': '2021-03-09T15:50:43.475-06:00'}  I  CONTROL  20712     LogicalSessionCacheReap  Sessions collection is not set up; waiting unt...  {'error': 'NamespaceNotFound: config.system.se...
1  {'$date': '2021-03-10T10:33:51.002-06:00'}  I  CONTROL  23377               SignalHandler                                    Received signal              {'signal': 15, 'error': 'Terminated'}
2  {'$date': '2021-04-02T21:38:59.486-05:00'}  I  CONTROL  20714  LogicalSessionCacheRefresh  Failed to refresh session cache, will try agai...  {'error': 'NotYetInitialized: Replication has ...
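
If plain Python dictionaries are wanted rather than a DataFrame, as the question describes, the same line-by-line idea can be sketched with the standard library alone. The helper name iter_log_records and the choice to silently skip blank or unparseable lines are assumptions made for illustration, not part of the original answer:

```python
import json


def iter_log_records(path):
    """Yield one dict per line that parses as a JSON object; skip the rest."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Assumed policy: skip lines that are not valid JSON
                continue
            if isinstance(record, dict):
                yield record


if __name__ == "__main__":
    # Print the message text of every valid log entry
    for record in iter_log_records("mongod.log"):
        print(record["msg"])
```

Because this is a generator, only one line is held in memory at a time, which may matter for large log files.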

Alternatively, use pd.read_json, as suggested in the comments:

import pandas as pd

file_path = 'mongod.log'
# lines=True reads the file as one JSON object per line
df = pd.read_json(file_path, lines=True)
print(df)
# Same DataFrame as with the first approach

For the two columns that hold a dictionary in every row, you can continue like this:

dict_cols = ['t', 'attr']
# Pop each dict column, expand it into its own columns, and append them to df
res = pd.concat(
    [df, *(pd.json_normalize(df.pop(col)) for col in dict_cols)],
    axis=1,
)
print(res)

…which outputs:

   s        c     id                         ctx                                                msg                          $date                                              error  signal
0  I  CONTROL  20712     LogicalSessionCacheReap  Sessions collection is not set up; waiting unt...  2021-03-09T15:50:43.475-06:00  NamespaceNotFound: config.system.sessions does...     NaN
1  I  CONTROL  23377               SignalHandler                                    Received signal  2021-03-10T10:33:51.002-06:00                                         Terminated    15.0
2  I  CONTROL  20714  LogicalSessionCacheRefresh  Failed to refresh session cache, will try agai...  2021-04-02T21:38:59.486-05:00  NotYetInitialized: Replication has not yet bee...     NaN
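
As a possible shortcut, pd.json_normalize can also be applied directly to the list of parsed records, which flattens the nested dictionaries in one step. This is only a sketch: the helper name load_flat is hypothetical, and unlike the result above the nested keys come out as dotted column names (t.$date, attr.error, attr.signal) rather than $date, error and signal.

```python
import json

import pandas as pd


def load_flat(path):
    """Read newline-delimited JSON and flatten nested dicts into columns."""
    with open(path, "r", encoding="utf-8") as f:
        # One JSON document per non-blank line
        records = [json.loads(line) for line in f if line.strip()]
    # Nested keys are joined with '.', e.g. {'t': {'$date': ...}} -> 't.$date'
    return pd.json_normalize(records)


if __name__ == "__main__":
    print(load_flat("mongod.log"))
```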
