How to use `select` with the jq --stream option?



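I have a very large JSON document (~100 GB) and I am trying to use jq to extract specific objects that meet a given criterion. Because the file is so large, I cannot read it into memory and need to use the --stream option.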

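I know how to run a select that extracts what I need when I am not streaming, but I could use some help working out how to write the command for the streaming case.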

Here is a sample of my document, example.json:

{
"reporting_entity_name" : "INSURANCE COMPANY",
"reporting_entity_type" : "INSURER",
"last_updated_on" : "2022-12-01",
"version" : "1.0.0",
"in_network" : [ {
"negotiation_arrangement" : "ffs",
"name" : "ER VISIT",
"billing_code_type" : "CPT",
"billing_code_type_version" : "2022",
"billing_code" : "99285",
"description" : "HIGHEST LEVEL ER VISIT",
"negotiated_rates" : [ {
"provider_groups" : [ {
"npi" : [ 111111111, 222222222],
"tin" : {
"type" : "ein",
"value" : "99-9999999"
}
} ],
"negotiated_prices" : [ {
"negotiated_type" : "negotiated",
"negotiated_rate" : 550.50,
"expiration_date" : "9999-12-31",
"service_code" : [ "23" ],
"billing_class" : "institutional"
} ]
} ]
}
]
}

I am trying to grab the in_network objects whose billing_code equals 99285.

If I could do this without streaming, I would run:

jq '.in_network[] | select(.billing_code == "99285")' example.json

Expected output:

{
"negotiation_arrangement": "ffs",
"name": "ER VISIT",
"billing_code_type": "CPT",
"billing_code_type_version": "2022",
"billing_code": "99285",
"description": "HIGHEST LEVEL ER VISIT",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
111111111,
222222222
],
"tin": {
"type": "ein",
"value": "99-9999999"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 550.5,
"expiration_date": "9999-12-31",
"service_code": [
"23"
],
"billing_class": "institutional"
}
]
}
]
}

Any help on how to do this with the --stream option would be greatly appreciated!

If each object in the .in_network array fits into memory on its own, truncate the stream at the array items (two levels deep):

jq --stream -n '
fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
| select(.billing_code == "99285")
' example.json
{
"negotiation_arrangement": "ffs",
"name": "ER VISIT",
"billing_code_type": "CPT",
"billing_code_type_version": "2022",
"billing_code": "99285",
"description": "HIGHEST LEVEL ER VISIT",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
111111111,
222222222
],
"tin": {
"type": "ein",
"value": "99-9999999"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 550.5,
"expiration_date": "9999-12-31",
"service_code": [
"23"
],
"billing_class": "institutional"
}
]
}
]
}
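
To make the command above less opaque, here is a small Python sketch of the event model it relies on: `--stream` turns the document into `[path, leaf]` events plus closing `[path]` events, `2|truncate_stream` drops the first two path components (`"in_network"` and the array index), and `fromstream` reassembles one object per array item. The function names below are hypothetical and only modeled on jq's `tostream`, `truncate_stream` and `fromstream`; the sketch works on an already-parsed value, so it saves no memory and is purely illustrative.

```python
def to_stream(value, path=()):
    """Yield jq-style [path, leaf] events plus closing [path] events."""
    if isinstance(value, (dict, list)) and value:
        keys = list(value) if isinstance(value, dict) else range(len(value))
        for key in keys:
            yield from to_stream(value[key], path + (key,))
        yield [list(path + (key,))]      # closing event: path of the last child
    else:
        yield [list(path), value]        # scalar or empty container


def truncate_stream(depth, events):
    """Keep events deeper than `depth`, dropping the first `depth` path parts."""
    for event in events:
        if len(event[0]) > depth:
            yield [event[0][depth:], *event[1:]]


def set_path(root, path, leaf):
    """Store `leaf` at `path`, creating dicts/lists on demand (events arrive in order)."""
    if not path:
        return leaf
    if root is None:
        root = {} if isinstance(path[0], str) else []
    node = root
    for key, nxt in zip(path, path[1:]):
        empty = {} if isinstance(nxt, str) else []
        if isinstance(node, dict):
            node = node.setdefault(key, empty)
        else:
            if key == len(node):
                node.append(empty)
            node = node[key]
    last = path[-1]
    if isinstance(node, dict) or last < len(node):
        node[last] = leaf
    else:
        node.append(leaf)
    return root


def from_stream(events):
    """Rebuild complete values; emit one each time a top-level value closes."""
    acc = None
    for event in events:
        path = event[0]
        if len(event) == 2:
            acc = set_path(acc, path, event[1])
            done = len(path) == 0
        else:
            done = len(path) == 1
        if done:
            yield acc
            acc = None


def select_in_network(doc, billing_code):
    """Mirror of the jq filter: stream, keep in_network, truncate, rebuild, select."""
    events = (e for e in to_stream(doc) if e[0] and e[0][0] == "in_network")
    for item in from_stream(truncate_stream(2, events)):
        if isinstance(item, dict) and item.get("billing_code") == billing_code:
            yield item
```

Only one array item is held in the accumulator at a time, which is why the approach depends on each in_network object fitting into memory individually.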

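You will find that jq --stream is very slow, even on a 10 GB file. Since jq is designed to complement other shell tools, I would suggest using jstream (https://github.com/bcicen/jstream) or my own jm or jm.py (https://github.com/pkoppstein/jm) to "splat" the array, and piping the result to jq.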

For example, to achieve the same effect as your jq filter:

jm --pointer /in_network example.json |
jq 'select(.billing_code == "99285")'
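
The second stage of that pipeline is simple enough that it could also be written without jq. The following is a minimal Python sketch of the filter step, assuming the upstream tool writes one JSON value per line; that output format and the function name `select_billing_code` are assumptions made for illustration, not something the tools' documentation is quoted on here.

```python
import json


def select_billing_code(lines, billing_code):
    """Yield each JSON object from `lines` whose billing_code matches.

    `lines` is any iterable of strings, one JSON value per line (assumed format).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines between values
        item = json.loads(line)
        if isinstance(item, dict) and item.get("billing_code") == billing_code:
            yield item
```

In a script, passing `sys.stdin` as `lines` and printing each result with `json.dumps` would play the role of the `jq 'select(...)'` stage; only one array item is decoded at a time.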
