Looping over (somewhat structured) chunks of a file in bash



I have a file containing "structured" text like the following:

>some multiline-
>text
---
>in multiple chunks (this one for instance is the second of this sample)
---
>Their number, sizes and content are irregular
---
>they can't
>be known in
>advance
---
>And they'll contain pretty much any char whatsoever known or unknown in the universe
>like 𒊺/🫘/n/...
---

I would like to be able to read the chunks with a for loop (this is a strong preference), along the lines of:

for chunk in $(someUnknownMagic --over content.file)
do
echo "I'll do something with ${chunk}"
done

I'm pretty sure there is a simple answer, but I can't use:

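  • IFS: it is a list of single-character delimiters, not a multi-character pattern
  • sed (to reduce my pattern '\n---\n' to something simpler) + IFS: I would have to pick a delimiter character, and any character I pick might appear in my chunks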
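
To illustrate the first point: every character in IFS acts as a separator on its own, so a value such as `---` cannot stand for one three-character delimiter. A small sketch, using a made-up string and a made-up array name `parts`:

```shell
#!/usr/bin/env bash
# Illustrative input: the intent would be to split on "---" only.
text='a-b---c'

# IFS='---' is just the single separator "-" listed three times,
# so every "-" splits, including the one inside "a-b".
IFS='---' read -r -a parts <<< "$text"

printf 'number of fields: %s\n' "${#parts[@]}"
printf '[%s]\n' "${parts[@]}"
```

The string is cut at each `-`, and the consecutive dashes produce empty fields instead of acting as one separator.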

So I'm out of ideas (though I'm sure there are plenty of options)…

"sed (to simplify my pattern '\n---\n') + IFS"

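Great! Insert a byte that cannot occur in the text (the zero byte below) to separate the chunks, then read them as a stream delimited by that byte. Using GNU sed: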

sed -z 's/\n---\n/\x00/g' content.file |
while IFS= read -r -d '' chunk; do
    echo "$chunk"
done
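
If a plain for loop is preferred, the NUL-delimited stream can first be collected into an array. This is a minimal sketch, assuming bash 4.4 or later (for `mapfile -d`) and GNU sed. The helper name `split_chunks` and the sample file are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit the chunks of file "$1" as a NUL-delimited stream.
# Assumes GNU sed (-z) and that every chunk, including the last, is followed by a "---" line.
split_chunks() {
    sed -z 's/\n---\n/\x00/g' "$1"
}

# Sample input, for demonstration only.
sample=$(mktemp)
printf '%s\n' '>some multiline-' '>text' '---' '>second chunk' '---' > "$sample"

# mapfile -d '' splits on NUL bytes; -t drops the delimiter (bash 4.4+).
mapfile -d '' -t chunks < <(split_chunks "$sample")
rm -f "$sample"

for chunk in "${chunks[@]}"; do
    echo "I'll do something with ${chunk}"
done
```

Because bash variables cannot hold a NUL byte, the separator can never collide with the content of a chunk.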

Really, just iterate over the lines and accumulate them until you find a --- line. Does it need to be any more complicated?

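chunk=""
while IFS= read -r line; do
    if [[ $line == '---' ]]; then
        # separator line: emit the accumulated chunk and start a new one
        # (the chunk already ends with a newline, so no extra one is printed)
        printf '%s' "$chunk"
        chunk=""
    else
        chunk+=$line$'\n'
    fi
done < content.file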
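
The same accumulation can also fill an array, which allows a for loop afterwards without relying on GNU sed. The following is only a sketch. The function name `read_chunks` and the global array `chunks` are made up here, and keeping a final chunk that has no closing --- line is an assumption about the desired behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the chunks of file "$1" into the global array "chunks".
read_chunks() {
    local line chunk=""
    chunks=()
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == '---' ]]; then
            chunks+=("${chunk%$'\n'}")   # drop the newline after the chunk's last line
            chunk=""
        else
            chunk+=$line$'\n'
        fi
    done < "$1"
    # Assumption: a trailing chunk without a closing --- line is kept as well.
    if [[ -n $chunk ]]; then
        chunks+=("${chunk%$'\n'}")
    fi
}

# Sample input, for demonstration only.
sample=$(mktemp)
printf '%s\n' 'a' 'b' '---' 'c' '---' 'd' > "$sample"
read_chunks "$sample"
rm -f "$sample"

for chunk in "${chunks[@]}"; do
    printf 'chunk: %s\n' "$chunk"
done
```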

When there is a repeating pattern, we can iterate over it. In your example, the pattern is ---

So we can solve it like this…

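#!/bin/bash
file=$(<"$1");
# read-only integer: the number of lines that contain ---
# (-e keeps grep from parsing the pattern as an option)
declare -ir chunk_max=$(grep -c -e '---' <<< "$file");
for ((index=0; index < $chunk_max; ++index )); do
    # keep everything before the first ---
    chunk="${file%%---*}";
    # drop everything up to and including the first ---
    file="${file#*---}";
    echo "chunk[ $index ]";
    echo "$chunk";
done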

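The script:

  • counts how many --- lines there are
  • loops up to chunk_max
  • extracts the first chunk by deleting the longest match of ---* from the right
  • updates file by removing the chunk just extracted (the shortest match of *--- from the left)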

Output
chunk[ 0 ]
>some multiline-
>text
chunk[ 1 ]
>in multiple chunks (this one for instance is the second of this sample)
chunk[ 2 ]
>Their number, sizes and content are irregular
chunk[ 3 ]
>they can't
>be known in
>advance
chunk[ 4 ]
>And they'll contain pretty much any char whatsoever known or unknown in the universe
>like 𒊺/🫘/n/...

Some bug fixes

  • if a content line merely contains ---, the grep count is wrong
  • strip the newline before and after each chunk

#!/bin/bash
file=$(<"$1");
# count only the lines that consist of exactly ---
declare -ir chunk_max=$(grep -c '^---$'  <<< "$file");
for ((index=0; index < $chunk_max; ++index )); do
    # from the end, delete everything back to the first "\n---"
    chunk="${file%%$'\n'---*}";
    # from the beginning, delete everything up to the first "---\n"
    file="${file#*---$'\n'}";
    # print
    echo "chunk[ $index ]";
    echo "$chunk";
done

Output
chunk[ 0 ]
>some multiline-
>text
chunk[ 1 ]
>in multiple chunks (this one for instance is the second of this sample)
chunk[ 2 ]
>Their number, sizes and content are irregular
chunk[ 3 ]
>they can't
>be known in
>advance
chunk[ 4 ]
>And they'll contain pretty much any char whatsoever known or unknown in the universe
>like 𒊺/🫘/n/...

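Notes

  • $'...' interprets backslash escapes, so $'\n' is a newline character rather than a backslash followed by n
  • ${VAR%%PATTERN} deletes the longest match from the right (the end); a single % deletes the shortest match
  • ${VAR##PATTERN} deletes the longest match from the left (the beginning); a single # deletes the shortest match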
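
The difference between these operators is easiest to see on a short value. A small sketch, with a made-up string `s`:

```shell
#!/usr/bin/env bash
# Illustrative value only: three fields separated by "---".
s='a---b---c'

echo "${s%%---*}"   # longest match removed from the end:        a
echo "${s%---*}"    # shortest match removed from the end:       a---b
echo "${s#*---}"    # shortest match removed from the beginning: b---c
echo "${s##*---}"   # longest match removed from the beginning:  c

# $'\n' is one real newline character; '\n' is a backslash followed by n.
nl=$'\n'
literal='\n'
echo "${#nl} ${#literal}"   # lengths: 1 2
```

The scripts above combine %% (to keep only the first chunk) with a single # (to drop only the first chunk and its separator).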