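MongoDB log messages are in JSON format and reside in a file named mongod.log. Each log message is separated by a newline character (\n).
I am trying to:
- capture each line (log message) as valid JSON
- convert the JSON into a Python dictionary
The error I keep getting is
json.decoder.JSONDecodeError: Extra data: line 2 column 1
I know this is raised because the log file as a whole is not valid JSON; only the individual log messages are.
How do I iterate over them one by one?
At first I thought of writing my own iterator (__iter__ / __next__) that grabs each line and moves on once the valid JSON (log message) has been processed. I now see there is a way to parse newline-delimited JSON with the json decoder.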
import json

file_path = 'mongod.log'
with open(file_path, 'r') as file:
    # json.load expects the whole file to be a single JSON document
    data = json.load(file)
print(data)
mongod.log
{"t":{"$date":"2021-03-09T15:50:43.475-06:00"},"s":"I", "c":"CONTROL", "id":20712, "ctx":"LogicalSessionCacheReap","msg":"Sessions collection is not set up; waiting until next sessions reap interval","attr":{"error":"NamespaceNotFound: config.system.sessions does not exist"}}
{"t":{"$date":"2021-03-10T10:33:51.002-06:00"},"s":"I", "c":"CONTROL", "id":23377, "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}
{"t":{"$date":"2021-04-02T21:38:59.486-05:00"},"s":"I", "c":"CONTROL", "id":20714, "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"NotYetInitialized: Replication has not yet been configured"}}
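You can import this file like this:
import json
import pandas as pd

file_path = 'mongod.log'
with open(file_path, 'r') as f:
    # parse each line as its own JSON document, one row per log message
    df = pd.DataFrame([json.loads(line) for line in f])
print(df)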
…output:
                                            t  s        c     id                         ctx                                                msg                                               attr
0  {'$date': '2021-03-09T15:50:43.475-06:00'}  I  CONTROL  20712     LogicalSessionCacheReap  Sessions collection is not set up; waiting unt...  {'error': 'NamespaceNotFound: config.system.se...
1  {'$date': '2021-03-10T10:33:51.002-06:00'}  I  CONTROL  23377               SignalHandler                                    Received signal              {'signal': 15, 'error': 'Terminated'}
2  {'$date': '2021-04-02T21:38:59.486-05:00'}  I  CONTROL  20714  LogicalSessionCacheRefresh  Failed to refresh session cache, will try agai...  {'error': 'NotYetInitialized: Replication has ...
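If a DataFrame is not needed, the same per-line json.loads call gives plain Python dictionaries, which is what the question asks for. Below is a minimal sketch of that idea; the helper name iter_log_entries and the choice to report and skip malformed lines are my own assumptions, not something from the original post.

```python
import json


def iter_log_entries(lines):
    """Yield one dict per non-empty line (hypothetical helper, not part of any library)."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines, e.g. a trailing newline at end of file
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Assumption: a malformed line is reported and skipped rather than aborting.
            print(f"skipping line {line_number}: {exc}")


# One message copied from the sample log, one broken line, one blank line.
sample_lines = [
    '{"t":{"$date":"2021-03-10T10:33:51.002-06:00"},"s":"I", "c":"CONTROL", "id":23377, "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}\n',
    'not json\n',
    '\n',
]
entries = list(iter_log_entries(sample_lines))
print(entries[0]["msg"])
```

An open file object iterates line by line, so with the real log this would presumably be used as: with open('mongod.log') as f, then for entry in iter_log_entries(f). Because it is a generator, only one line is held in memory at a time.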
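Alternatively, use pd.read_json, as suggested in the comments:
import pandas as pd

file_path = 'mongod.log'
# lines=True reads the file as one JSON object per line
df = pd.read_json(file_path, lines=True)
print(df)
# same df as in the first way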
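For the two columns that hold a dictionary in every row, you can continue like this:
dict_cols = ['t', 'attr']
# pop each dict column, expand it into its own columns, and append them to the rest of df
res = pd.concat(
    [df, *(pd.json_normalize(df.pop(col)) for col in dict_cols)],
    axis=1,
)
print(res)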
…output:
   s        c     id                         ctx                                                msg                          $date                                              error  signal
0  I  CONTROL  20712     LogicalSessionCacheReap  Sessions collection is not set up; waiting unt...  2021-03-09T15:50:43.475-06:00  NamespaceNotFound: config.system.sessions does...     NaN
1  I  CONTROL  23377               SignalHandler                                    Received signal  2021-03-10T10:33:51.002-06:00                                         Terminated    15.0
2  I  CONTROL  20714  LogicalSessionCacheRefresh  Failed to refresh session cache, will try agai...  2021-04-02T21:38:59.486-05:00  NotYetInitialized: Replication has not yet bee...     NaN
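For plain dictionaries, a similar flattening can be sketched without pandas. This is only an illustration under my own assumptions: flatten_entry is a hypothetical helper, it only goes one level deep, and an inner key that collides with a top-level key would silently overwrite it.

```python
def flatten_entry(entry, dict_keys=("t", "attr")):
    """Merge the nested dicts found under dict_keys into the top level (hypothetical helper)."""
    flat = {}
    for key, value in entry.items():
        if key in dict_keys and isinstance(value, dict):
            # inner keys such as "$date" or "error" become top-level keys
            flat.update(value)
        else:
            flat[key] = value
    return flat


# The second message from the sample log, already parsed into a dict.
entry = {
    "t": {"$date": "2021-03-10T10:33:51.002-06:00"},
    "s": "I",
    "c": "CONTROL",
    "id": 23377,
    "ctx": "SignalHandler",
    "msg": "Received signal",
    "attr": {"signal": 15, "error": "Terminated"},
}
flat = flatten_entry(entry)
print(flat)
```

Unlike the DataFrame version, keys that are missing from a message (for example signal) are simply absent from the resulting dict instead of showing up as NaN.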