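How do I split an array into chunks using jq?

I have a very large JSON file containing an array. Is it possible to use jq to split this array into several smaller arrays of a fixed size? Suppose my input is [1,2,3,4,5,6,7,8,9,10] and I want to split it into chunks of 3 elements. The desired output from jq would be: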

[1,2,3]
[4,5,6]
[7,8,9]
[10]

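In reality, my input array has nearly three million elements, all of them UUIDs.

There is an (undocumented) builtin, _nwise, that meets the functional requirements: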

$ jq -nc '[1,2,3,4,5,6,7,8,9,10] | _nwise(3)'
[1,2,3]
[4,5,6]
[7,8,9]
[10]

Also:

$ jq -nc '_nwise([1,2,3,4,5,6,7,8,9,10];3)' 
[1,2,3]
[4,5,6]
[7,8,9]
[10]

As an aside, _nwise works on both arrays and strings.

(I believe it is undocumented because there was some doubt about an appropriate name.)

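TCO (tail-call optimized) version

Unfortunately, the builtin version is carelessly defined and will not perform well on large arrays. Here is an optimized version (it should be about as efficient as a non-recursive version):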

def nwise($n):
  def _nwise:
    if length <= $n then . else .[0:$n] , (.[$n:]|_nwise) end;
  _nwise;

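For an array of 3 million elements, this performs well: 3.91 seconds on an old Mac, with a maximum resident size of 162746368.

Note that this version (using tail-call optimized recursion) is actually faster than the foreach-based version of nwise/2 shown elsewhere on this page.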
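As a quick check, the recursive definition above can be run against the sample array from the question. This is only a sketch: it assumes jq 1.5 or later is on the PATH, and the shell variable name chunks is made up for illustration.

```shell
# Assumption: jq >= 1.5 is installed. "chunks" is a hypothetical variable name.
chunks=$(jq -nc '
  def nwise($n):
    def _nwise:
      # Emit the array itself once it fits in one chunk; otherwise emit the
      # first $n items and recurse on the remainder (a tail call).
      if length <= $n then . else .[0:$n], (.[$n:] | _nwise) end;
    _nwise;
  [range(1; 11)] | nwise(3)
')
printf '%s\n' "$chunks"
# [1,2,3]
# [4,5,6]
# [7,8,9]
# [10]
```

Note that, as written, an empty input array yields a single empty chunk rather than no output.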

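The following stream-oriented definition of window/3, due to Cédric Connes (GitHub: connesc), generalizes _nwise and illustrates a "boxing technique" that avoids the need for an end-of-stream marker, so it can be used even if the stream contains the non-JSON value nan. A definition of _nwise/1 in terms of window/3 is also included.

The first argument of window/3 is interpreted as a stream. $size is the window size, and $step specifies the number of values to skip. For example,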

window(1,2,3; 2; 1)

yields:

[1,2]
[2,3]

window/3 and _nwise/1

def window(values; $size; $step):
  def checkparam(name; value):
    if (value | isnormal) and value > 0 and (value | floor) == value then .
    else error("window \(name) must be a positive integer")
    end;
  checkparam("size"; $size)
  | checkparam("step"; $step)
  # We need to detect the end of the loop in order to produce the terminal partial group (if any).
  # For that purpose, we introduce an artificial null sentinel, and wrap the input values into singleton arrays in order to distinguish them.
  | foreach ((values | [.]), null) as $item (
      {index: -1, items: [], ready: false};
      (.index + 1) as $index
      # Extract items that must be reused from the previous iteration
      | if (.ready | not) then .items
        elif $step >= $size or $item == null then []
        else .items[-($size - $step):]
        end
      # Append the current item unless it must be skipped
      | if ($index % $step) < $size then . + $item
        else .
        end
      | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
      if .ready then .items else empty end
    );
def _nwise($n): window(.[]; $n; $n);

Source:

https://gist.github.com/connesc/d6b87cbacae13d4fd58763724049da58
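To see both definitions in action, the sketch below runs the window(1,2,3; 2; 1) example and then the _nwise/1 wrapper on the question's array. It assumes jq 1.5 or later; the error message is written with a \(name) string interpolation, and the shell variable name windows is made up for illustration.

```shell
# Assumption: jq >= 1.5 is installed. "windows" is a hypothetical variable name.
windows=$(jq -nc '
  def window(values; $size; $step):
    def checkparam(name; value):
      if (value | isnormal) and value > 0 and (value | floor) == value then .
      else error("window \(name) must be a positive integer")
      end;
    checkparam("size"; $size)
    | checkparam("step"; $step)
    # Each value is boxed as [value]; a bare null marks the end of the stream.
    | foreach ((values | [.]), null) as $item (
        {index: -1, items: [], ready: false};
        (.index + 1) as $index
        | if (.ready | not) then .items
          elif $step >= $size or $item == null then []
          else .items[-($size - $step):]
          end
        | if ($index % $step) < $size then . + $item else . end
        | {$index, items: ., ready: (length == $size or ($item == null and length > 0))};
        if .ready then .items else empty end
      );
  def _nwise($n): window(.[]; $n; $n);
  window(1,2,3; 2; 1),
  ([range(1; 11)] | _nwise(3))
')
printf '%s\n' "$windows"
# [1,2]
# [2,3]
# [1,2,3]
# [4,5,6]
# [7,8,9]
# [10]
```

With $step equal to $size the windows do not overlap, which is why _nwise/1 can be expressed as window(.[]; $n; $n).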

Here is a simple one that works for me:

def chunk(n):
  range(length/n|ceil) as $i | .[n*$i:n*$i+n];

Example usage:

jq -n '
  def chunk(n): range(length/n|ceil) as $i | .[n*$i:n*$i+n];
  [range(5)] | chunk(2)'
[
0,
1
]
[
2,
3
]
[
4
]

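Bonus: it uses no recursion and does not rely on _nwise, so it also works with jaq.

If the array is too large to fit comfortably in memory, I would adopt the strategy suggested by @CharlesDuffy: stream the array elements into a second invocation of jq that uses a stream-oriented version of nwise, such as: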

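def nwise(stream; $n):
  foreach (stream, nan) as $x ([];
    if length == $n then [$x] else . + [$x] end;
    # The type check guards isnan, which raises an error on non-numeric items such as strings.
    if (.[-1] | type == "number" and isnan) and length>1 then .[:-1]
    elif length == $n then .
    else empty
    end);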

The "driver" for the above would be:

nwise(inputs; 3)

But remember to use the -n command-line option.

To create the stream from an arbitrary array:

$ jq -cn --stream '
fromstream( inputs | (.[0] |= .[1:])
| select(. != [[]]) )' huge.json

So the shell pipeline might look like this:

$ jq -cn --stream '
fromstream( inputs | (.[0] |= .[1:])
| select(. != [[]]) )' huge.json |
jq -n -f nwise.jq

This approach performs well. For grouping a stream of 3 million items into groups of 3 using nwise/2,

/usr/bin/time -lp

for the second invocation of jq gives:

user         5.63
sys          0.04
1261568  maximum resident set size

Caveat: this definition uses nan as the end-of-stream marker. Since nan is not a JSON value, this cannot be a problem when handling JSON streams.
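The two-stage pipeline can be tried end to end on a small array fed through standard input. This is a sketch only: it assumes jq 1.5 or later, inlines the nwise/2 program instead of reading it from nwise.jq, adds a defensive type check before isnan (an assumption made here so that string items such as UUIDs do not raise an error), and uses a made-up shell variable name, grouped.

```shell
# Assumption: jq >= 1.5 is installed. "grouped" is a hypothetical variable name.
grouped=$(printf '%s' '[1,2,3,4,5,6,7]' |
  # Stage 1: turn the top-level array into a stream of its items.
  jq -cn --stream 'fromstream(inputs | (.[0] |= .[1:]) | select(. != [[]]))' |
  # Stage 2: group the streamed items three at a time.
  jq -nc '
    def nwise(stream; $n):
      foreach (stream, nan) as $x ([];
        if length == $n then [$x] else . + [$x] end;
        # nan marks the end of the stream; the type check is a defensive addition.
        if (.[-1] | type == "number" and isnan) and length > 1 then .[:-1]
        elif length == $n then .
        else empty
        end);
    nwise(inputs; 3)
  ')
printf '%s\n' "$grouped"
# [1,2,3]
# [4,5,6]
# [7]
```

Only the second stage holds a group in memory, and never more than $n items plus the marker.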

The following is a hack, to be sure, but a memory-efficient one, even for an arbitrarily long list:

jq -c --stream 'select(length==2)|.[1]' <huge.json |
  jq -nc 'foreach inputs as $i (null; null; [$i,try input,try input])'

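The first part of the pipeline streams in the input JSON file and emits one line per element, assuming the array consists of atomic values (where [] and {} count as atomic values here). Because it runs in streaming mode, it does not need to hold the entire content in memory, even though the input is a single document.

The second part of the pipeline repeatedly reads up to three items and assembles them into a list.

This should avoid needing more than three pieces of data in memory at a time.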
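The behavior at the end of the input, where fewer than three items remain, can be checked on a seven-element array. This is a sketch assuming jq 1.5 or later, with the array supplied on standard input instead of huge.json and a made-up shell variable name, triples.

```shell
# Assumption: jq >= 1.5 is installed. "triples" is a hypothetical variable name.
triples=$(printf '%s' '[1,2,3,4,5,6,7]' |
  # Keep only [path, value] events and emit the value, one per line.
  jq -c --stream 'select(length==2)|.[1]' |
  # For each item read by inputs, pull up to two more with input;
  # "try" swallows the error raised once the input is exhausted.
  jq -nc 'foreach inputs as $i (null; null; [$i, try input, try input])')
printf '%s\n' "$triples"
# [1,2,3]
# [4,5,6]
# [7]
```

The group size is fixed by the number of try input terms, so changing it means editing the program rather than passing a parameter.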
