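I have a very large JSON document (~100 GB) and I am trying to use jq to extract specific objects that meet a given criterion. Because the file is too large to read into memory, I need to use the --stream option.
I know how to write a select that extracts what I need when I am not streaming, but I could use some help figuring out how to set up the command for streaming.
Here is a sample of my document, example.json: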
{
  "reporting_entity_name" : "INSURANCE COMPANY",
  "reporting_entity_type" : "INSURER",
  "last_updated_on" : "2022-12-01",
  "version" : "1.0.0",
  "in_network" : [ {
    "negotiation_arrangement" : "ffs",
    "name" : "ER VISIT",
    "billing_code_type" : "CPT",
    "billing_code_type_version" : "2022",
    "billing_code" : "99285",
    "description" : "HIGHEST LEVEL ER VISIT",
    "negotiated_rates" : [ {
      "provider_groups" : [ {
        "npi" : [ 111111111, 222222222],
        "tin" : {
          "type" : "ein",
          "value" : "99-9999999"
        }
      } ],
      "negotiated_prices" : [ {
        "negotiated_type" : "negotiated",
        "negotiated_rate" : 550.50,
        "expiration_date" : "9999-12-31",
        "service_code" : [ "23" ],
        "billing_class" : "institutional"
      } ]
    } ]
  }
  ]
}
I am trying to extract the in_network objects whose billing_code equals 99285.
If I could do this without streaming, I would run:
jq '.in_network[] | select(.billing_code == "99285")' example.json
Expected output:
{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}
Any help with how to do this using the --stream option would be greatly appreciated!
If the objects in the .in_network array each fit into memory individually, truncate the stream at the array items (two levels deep):
jq --stream -n '
fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
| select(.billing_code == "99285")
' example.json
{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}
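To see why the truncation depth is 2, it can help to look at the raw events that --stream produces. The sketch below is only an illustration: the miniature document and the temporary file are made up for the demo and are not part of example.json. Every event is a [path, leaf] pair, and each item of in_network sits under a path that starts with "in_network" followed by an array index, so dropping those two path components leaves events that fromstream can rebuild into one object per item.

```shell
# Hypothetical miniature of example.json, written to a temporary file.
demo=$(mktemp)
printf '%s\n' '{"version":"1.0.0","in_network":[{"billing_code":"99285","name":"ER VISIT"},{"billing_code":"12345","name":"OTHER"}]}' > "$demo"

# 1. Raw streaming form: one [path, leaf] event per scalar, plus
#    closing events that carry only a path.
jq -c --stream '.' "$demo"

# 2. Same pipeline as above: keep events under "in_network", strip the
#    first two path components, rebuild each item, then filter it.
result=$(jq -c --stream -n '
  fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
  | select(.billing_code == "99285")
' "$demo")
echo "$result"   # {"billing_code":"99285","name":"ER VISIT"}

rm -f "$demo"
```

Only one array item is reassembled at a time, which is why this works as long as each item fits into memory on its own.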
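You will find that jq --stream is very slow, even at 10 GB. Since jq is intended to complement other shell tools, I would suggest using jstream (https://github.com/bcicen/jstream), or my own jm or jm.py (https://github.com/pkoppstein/jm), to "splat" the array and pipe the result to jq. To achieve the same effect as the jq filter: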
jm --pointer /in_network example.json |
jq 'select(.billing_code == "99285")'