Bash text parsing with multiple nested conditions



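I have the following code, which checks for lines with more than 10 words and splits them at the first comma. It repeats the process, so any newly split line that still has more than 10 words and a comma is split again (in the end, no line has both more than 10 words and a comma).

How can I edit this code to do the following: after all comma splitting is done (which the current code already does), check whether any resulting line has more than 10 words and split it at the first occurrence of " and " (with the spaces)?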

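#!/usr/bin/env bash
input=input.txt
temp=$(mktemp "${input}.XXXX")
# remove the temp file on exit; single quotes defer expansion until the trap runs
trap 'rm -f "$temp"' EXIT
# awk exits 0 only if it split at least one line, so the loop repeats until nothing changes
while awk '
    BEGIN { retval = 1 }
    NF >= 10 && /, / {
        sub(/, /, "," ORS)
        retval = 0
    }
    1
    END { exit retval }
' "$input" > "$temp"; do
    mv -v "$temp" "$input"
done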

Sample input:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10 Word11
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10, Word11 Word12 Word13 Word14 Word15 Word16 
Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10 Word11 and Word12 Word13 Word14 Word15 
Word1 Word2 Word3 Word4 and Word5

Expected output:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9
Word1 Word2 Word3 Word4, 
Word5 Word6 Word7 Word8 Word9 Word10 Word11
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10,
Word11 Word12 Word13 Word14 Word15 Word16 
Word1 Word2 Word3 Word4, 
Word5 Word6 Word7 Word8 Word9 Word10 Word11 and
Word12 Word13 Word14 Word15 
Word1 Word2 Word3 Word4 and Word5

Thanks in advance!

Please try the following:

awk '{
    while (split($0, a, "( +and +)|( +)") > 10 && match($0, "( +and +)|,")) {
        if (match($0, "[^,]+,")) {
            # put a newline after the 1st comma
            print substr($0, 1, RLENGTH)
            $0 = substr($0, RLENGTH + 1)
        } else {
            # put a newline before the 1st substring " and "
            n = split($0, a, " +and +")
            if (a[1] == "") {               # $0 starts with " and "
                a[1] = " and " a[2]
                for (i = 2; i < n; i++) {
                    a[i] = a[i+1]
                }
                n--
                if (n == 1) break           # the leading " and " is the only one: stop, or the loop never ends
            }
            print a[1]
            $0 = " and " a[2]
            for (i = 3; i <= n; i++) {      # there are two or more " and "
                $0 = $0 " and " a[i]
            }
        }
    }
    print
}' input.txt

Output for the given input:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10 Word11
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10,
Word11 Word12 Word13 Word14 Word15 Word16
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10 Word11
and Word12 Word13 Word14 Word15
Word1 Word2 Word3 Word4 and Word5

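[Explanation]

  • It iterates over the same record while the pattern space contains more than 10 fields (not counting the word "and") && the pattern space contains a line delimiter (a comma or " and "), so that a split can succeed.
  • If the pattern space contains a comma, it prints the left-hand part and updates the pattern space with the right-hand part.
  • If the pattern space contains the word " and ", the processing is a bit harder, because the word stays in the updated pattern space. My approach may not be elegant, but it works even if the record contains multiple (two or more) "and"s.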
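
As a complement to the steps above, here is a minimal sketch of the two-phase order the question asks for: split after commas first, then split each resulting piece after its first " and ", leaving "and" at the end of the line as in the expected output. The function name split_long_lines and the variable max are assumptions made for illustration; max=10 mirrors the NF >= 10 test in the question's script.

```shell
# Hypothetical helper: reads text on stdin, writes the split lines to stdout.
split_long_lines() {
    awk -v max=10 '
    # Phase 2: split a comma-free piece after its first " and " while it is still long.
    function emit(piece,    w) {
        while (split(piece, w, " ") >= max && match(piece, / and /)) {
            print substr(piece, 1, RSTART + 3)        # keep " and" at the end of the line
            piece = substr(piece, RSTART + RLENGTH)   # continue with the text after " and "
        }
        print piece
    }
    # Phase 1: cut after the first ", " while the remaining text is still long.
    {
        line = $0
        while (split(line, w, " ") >= max && (p = index(line, ", ")) > 0) {
            emit(substr(line, 1, p))                  # head, including the comma
            line = substr(line, p + 2)                # tail, after the comma and space
        }
        emit(line)
    }'
}
```

It can be run as split_long_lines < input.txt. Because everything happens inside one awk process, no temporary file or repeated pass over the input is needed.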

[Edit]

If you want to count the word and as part of the word count, replace the 2nd line:

while (split($0, a, "( +and +)|( +)") > 10 && match($0, "( +and +)|,")) {

with:

while (NF > 10 && match($0, "( +and +)|,")) {
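
To see how the two conditions count words differently, the following small sketch prints both counts for each input line. The helper name count_words is an assumption for illustration only; it relies on awk picking the longest match at a given position, so " and " is consumed as a single separator.

```shell
# Hypothetical helper: for each input line, prints NF and the split() count.
count_words() {
    # NF counts every word; split() treats " and " as a separator, so "and" is not counted.
    awk '{ print NF, split($0, a, "( +and +)|( +)") }'
}

echo 'Word1 Word2 and Word3' | count_words
```

For this line the first number is 4 (all words) and the second is 3 ("and" excluded).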

Moreover, if you allow the word and to stay at the end of the preceding line, the script can be simplified a bit:

awk '{
    while (NF > 10 && match($0, "( +and +)|,")) {
        if (match($0, "[^,]+,")) {
            # put a newline after the 1st comma
            print substr($0, 1, RLENGTH)
            $0 = substr($0, RLENGTH + 1)
        } else {
            # put a newline after the 1st substring " and "
            n = split($0, a, " +and +")
            print a[1] " and"
            $0 = " " a[2]
            for (i = 3; i <= n; i++) {      # there are two or more " and "
                $0 = $0 " and " a[i]
            }
        }
    }
    print
}' input.txt

In addition, if Perl is an option for you, you can say:

perl -ne '{
    while (split > 10 && /( +and +)|,/) {
        if (/^.*?(, *| +and +)/) {
            print $&, "\n";
            $_ = " $'\''";
        }
    }
    print
}' input.txt

Hope this helps.

Is this the answer you are looking for?

echo "Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10, Word11 Word12 Word13 Word14 Word15 Word16 Word17 Word18 Word19 Word20 Word21 and Word22 Word23 Word24." |
grep -oE '[a-zA-Z0-9,.]+' |
awk '
BEGIN {
    cnt = 0
}
{
    str = str " " $0
    if ($0 ~ /,$/) {
        print str
        cnt = 0
        str = ""
    }
    else if (cnt < 10) {
        cnt++
    }
    else {
        print str
        cnt = 0
        str = ""
    }
}
END {
    print str
}' |
sed 's/^ *//'

Output:

Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10,
Word11 Word12 Word13 Word14 Word15 Word16 Word17 Word18 Word19 Word20 Word21
and Word22 Word23 Word24.
