sed regex over multiple lines does not capture everything



I have this text file (example):

<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

<This is a line of text with a year=33020 month=12 in it
This line of text does not have a year or month in it
This year=33020 is the current year the current month=1
This is the year=33020 the month=2/>

I am using Linux sed (sed (GNU sed) 4.2.2) with this regexp:

sed -En 'N;s/\<(This.*2020.*[\s\S\n]*?)\>/\1/gp' test2.txt

It captures only this string:

<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it

I am trying to capture everything from < to > as a group.

What am I doing wrong?

If you want to print the paragraphs (delimited by <...>) that start with <This and contain 2020, and only those, you can try:

sed -En '/^</!d;:a;/>$/!{N;ba;};/<This.*2020/p;' test2.txt

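As long as the pattern space does not start with <, it is deleted and a new cycle starts (/^</!d).

Then, as long as the pattern space does not end with >, the next line is appended. This does not start a new cycle; instead we branch back to the label a (/>$/!{N;ba;}).

Once a complete paragraph is stored in the pattern space, we leave this loop and apply the last command (/<This.*2020/p): if the pattern space matches your pattern, it is printed. Finally, a new cycle starts.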
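
For readers who find the one-liner dense, the same logic can be written as a commented multi-line script. This is only a sketch: it assumes GNU sed, reuses the sample data from the question, and writes it to a hypothetical file named sample.txt.

```shell
# Recreate the sample input from the question (sample.txt is a hypothetical name).
cat > sample.txt <<'EOF'
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

<This is a line of text with a year=33020 month=12 in it
This line of text does not have a year or month in it
This year=33020 is the current year the current month=1
This is the year=33020 the month=2/>
EOF

# Same commands as the one-liner, one per line (GNU sed assumed).
result=$(sed -En '
  # Discard lines until one starts with "<".
  /^</!d
  :a
  # Until the pattern space ends with ">", append the next line and loop.
  />$/!{
    N
    ba
  }
  # A whole paragraph is now in the pattern space: print it if it matches.
  /<This.*2020/p
' sample.txt)

printf '%s\n' "$result"
```

Only the first paragraph should be printed, because 33020 does not contain the substring 2020.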

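Of course, the regular expressions must be adapted to your needs. If the paragraph delimiters can be preceded (or followed) by whitespace, for example, use: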

sed -En '/^[[:space:]]*</!d;:a;/>[[:space:]]*$/!{N;ba;};/<This.*2020/p;' test2.txt
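
To see the difference between the two variants, here is a small sketch that runs both on input whose delimiters are surrounded by spaces. The file name padded.txt and its three lines are made up for illustration, and GNU sed is assumed.

```shell
# Hypothetical input: the first paragraph has spaces before "<" and after ">".
printf '%s\n' '  <This year=2020' 'second line/>  ' '<This year=1999/>' > padded.txt

# Strict variant: "<" must be the first character, ">" the last.
strict=$(sed -En '/^</!d;:a;/>$/!{N;ba;};/<This.*2020/p;' padded.txt)

# Lenient variant: whitespace is allowed around the delimiters.
lenient=$(sed -En '/^[[:space:]]*</!d;:a;/>[[:space:]]*$/!{N;ba;};/<This.*2020/p;' padded.txt)

printf 'strict:  [%s]\n' "$strict"
printf 'lenient: [%s]\n' "$lenient"
```

The strict variant skips the padded paragraph entirely, while the lenient one prints it with its surrounding whitespace intact.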

In GNU Awk, you can specify RS as a regular expression.

bash$ gawk -v RS='[<>]' '/This.*2020/' <<:
> <This is a line of text with a year=2020 month=12 in it
> This line of text does not have a year or month in it
> This year=2021 is the current year the current month=1
> This is the year=2021 the month=2/>
> 
> <This is a line of text with a year=33020 month=12 in it
> This line of text does not have a year or month in it
> This year=33020 is the current year the current month=1
> This is the year=33020 the month=2/>
> :
This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/

As you can see, this also strips the delimiters, but adding them back is not hard (hint: { print "<" $0 ">" }).
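
Putting the hint together, a sketch that prints the matching paragraph with its delimiters restored might look like this. It prefers gawk but falls back to whatever awk is on the PATH, which is an assumption: a regular-expression RS is not guaranteed by POSIX awk. The file name sample.txt is hypothetical.

```shell
# Recreate the sample input from the question (sample.txt is a hypothetical name).
cat > sample.txt <<'EOF'
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

<This is a line of text with a year=33020 month=12 in it
This line of text does not have a year or month in it
This year=33020 is the current year the current month=1
This is the year=33020 the month=2/>
EOF

# Prefer gawk; otherwise use the default awk (assumed to accept a regex RS).
AWK=$(command -v gawk || command -v awk)

# RS='[<>]' splits records on either delimiter, which removes them,
# so matching records are printed with "<" and ">" added back.
paragraph=$("$AWK" -v RS='[<>]' '/This.*2020/ { print "<" $0 ">" }' sample.txt)

printf '%s\n' "$paragraph"
```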
