Large NDJSON file fails to load correctly in Python

I have a 5 GB JSON file. I want to load it and do some EDA on it to find out where the relevant information is.

I tried:

import json
import pprint
json_fn = 'abc.ndjson'
data = json.load(open(json_fn, 'rb'))
pprint.pprint(data, depth=2)

But this just crashes with:

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I also tried:

import ijson
with open(json_fn) as f:
    items = ijson.items(f, 'item', multiple_values=True)  # "multiple values" needed as it crashes otherwise with a "trailing garbage parse error" (https://stackoverflow.com/questions/59346164/ijson-fails-with-trailing-garbage-parse-error)
    print('Data loaded - no processing ...')
    print("---items---")
    print(items)
    for item in items:
        print("---item---")
        print(item)

But this only printed:

Data loaded, now importing
---items---
<_yajl2.items object at 0x7f436de97440>
Process finished with exit code 0

The NDJSON file contains valid ASCII characters (checked with vi), but its lines are very long, so I cannot really make sense of it in a text editor.

The file starts like this:

{"visitId":257057,"staticFeatures":[{"type":"CODES","value":"9910,51881,42833,486,4280,42731,2384,V5861,9847,3962,49320,3558,2720,4019,99092"},{"type":"visitID","value":"357057"},{"type":"VISITOR_ID","value":"68824"}, {"type":"ADMISSION_ID","value":"788457"},{"type":"AGE","value":"34"}, ...

What am I doing wrong? How can I process this file?

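You are using the prefix item. For that prefix to match anything, the JSON must have a list as its top-level element. Your file has an object at the top level of every line, so nothing matches and the loop body never runs.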

For example, consider the JSON below:

data2.json

[
    {
        "Identifier": "21979c09fc4e6574"
    },
    {
        "Identifier": "e6235cce58ec8b9c"
    }
]

Code:

import ijson

# 'item' matches each element of the top-level list
with open('data2.json') as fp:
    items = ijson.items(fp, 'item')
    for x in items:
        print(x)

Output:

{'Identifier': '21979c09fc4e6574'}
{'Identifier': 'e6235cce58ec8b9c'}

Another example:

data.json

{
    "earth": {
        "europe": [
            {"name": "Paris", "type": "city", "info": {  }},
            {"name": "Thames", "type": "river", "info": {  }}
        ],
        "america": [
            {"name": "Texas", "type": "state", "info": {  }}
        ]
    }
}

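The JSON above does not have a list as its top-level element, so ijson.items() needs a prefix that leads to the list you want. Here the prefix should be 'earth.europe.item'.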

Code:

import ijson

# 'earth.europe.item' matches each element of the list under earth -> europe
with open('data.json') as fp:
    items = ijson.items(fp, 'earth.europe.item')
    for x in items:
        print(x)

Output:

{'name': 'Paris', 'type': 'city', 'info': {}}
{'name': 'Thames', 'type': 'river', 'info': {}}
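For the NDJSON file in the question, every line is a complete JSON document of its own, so one option that avoids prefixes altogether is to parse the file line by line with the standard library. The following is only a minimal sketch under that assumption (one JSON value per line, no value spanning several lines); the helper names iter_ndjson and peek are made up for illustration and are not part of json or ijson.

```python
import json
from itertools import islice
from pprint import pprint


def iter_ndjson(path):
    """Yield one parsed JSON value per non-empty line of an NDJSON file.

    Only one line is held in memory at a time, so the whole 5 GB file
    is never loaded at once. (Hypothetical helper, assumes one JSON
    document per line.)
    """
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line was malformed instead of failing blindly
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


def peek(path, n=3, depth=2):
    """Pretty-print the first n records, truncated to the given nesting depth."""
    for record in islice(iter_ndjson(path), n):
        pprint(record, depth=depth)
```

Calling peek('abc.ndjson') should then print the first three records at depth 2, which is roughly what the original json.load plus pprint attempt was after. If you prefer to stay with ijson, passing an empty prefix, ijson.items(f, '', multiple_values=True), should yield each top-level object in turn.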
