I have this text file (example):
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>
<This is a line of text with a year=33020 month=12 in it
This line of text does not have a year or month in it
This year=33020 is the current year the current month=1
This is the year=33020 the month=2/>
Using Linux sed (GNU sed 4.2.2) with this regexp:
sed -En 'N;s/\<(This.*2020.*[\s\S\n]*?)\>/\1/gp' test2.txt
It only captures this string:
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
I am trying to capture <…> as a group.
What am I doing wrong?
If you want to print the paragraphs (delimited by <...>) that start with <This and contain 2020, and only those, you can try:
sed -En '/^</!d;:a;/>$/!{N;ba;};/<This.*2020/p;' test2.txt
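As long as the pattern space does not start with <, it is deleted and a new cycle starts (/^</!d).
Then, as long as the pattern space does not end with >, the next input line is appended to it. This does not start a new cycle; instead we branch back to the a label (/>$/!{N;ba;}).
Once a complete paragraph is held in the pattern space, we leave this loop and apply the last command (/<This.*2020/p): if the pattern space matches your pattern, it is printed. Finally, a new cycle starts.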
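The steps above can also be spelled out as a commented multi-line script, which may be easier to adapt. This is only a sketch, assuming GNU sed; the file name sample.txt and the function name print_matching_paragraphs are made up for illustration.

```shell
# Hypothetical sample file with the same content as the example in the question.
cat > sample.txt <<'EOF'
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>
<This is a line of text with a year=33020 month=12 in it
This line of text does not have a year or month in it
This year=33020 is the current year the current month=1
This is the year=33020 the month=2/>
EOF

# Hypothetical helper: print the <...> paragraphs of file $1
# that start with <This and contain 2020.
print_matching_paragraphs() {
  sed -En '
    # Pattern space does not start with <: delete it and start a new cycle.
    /^</!d
    :a
    # Paragraph not closed yet: append the next line and branch back to a.
    />$/!{
      N
      ba
    }
    # A complete paragraph is in the pattern space: print it if it matches.
    /<This.*2020/p
  ' "$1"
}

print_matching_paragraphs sample.txt
```

With this sample only the first paragraph should be printed, since 33020 does not contain the string 2020.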
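Of course, the regular expressions must be adapted to your needs. If the paragraph delimiters can be preceded (or followed) by whitespace, for example, use: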
sed -En '/^[[:space:]]*</!d;:a;/>[[:space:]]*$/!{N;ba;};/<This.*2020/p;' test2.txt
In GNU Awk, you can specify RS as a regular expression.
bash$ gawk -v RS='[<>]' '/This.*2020/' <<:
> <This is a line of text with a year=2020 month=12 in it
> This line of text does not have a year or month in it
> This year=2021 is the current year the current month=1
> This is the year=2021 the month=2/>
>
> <This is a line of text with a year=33020 month=12 in it
> This line of text does not have a year or month in it
> This year=33020 is the current year the current month=1
> This is the year=33020 the month=2/>
> :
This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/
As you can see, this also strips the delimiters, but adding them back is not too hard (hint: { print "<" $0 ">" }).
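Putting the hint together, a minimal sketch might look like the following. It assumes an awk that accepts a regular expression as RS (gawk does; POSIX does not require it), and the names paragraphs, awk_bin and print_with_delimiters are illustrative only.

```shell
# Illustrative input: two <...> paragraphs, only the first contains 2020.
paragraphs='<This is a line of text with a year=2020 month=12 in it
This is the year=2021 the month=2/>
<This is a line of text with a year=33020 month=12 in it
This is the year=33020 the month=2/>'

# Prefer gawk; fall back to the default awk (assumption: it also supports a regexp RS).
awk_bin="$(command -v gawk || command -v awk)"

# Hypothetical helper: read stdin, split records at every < or >,
# and print matching records with the stripped delimiters added back.
print_with_delimiters() {
  "$awk_bin" -v RS='[<>]' '/This.*2020/ { print "<" $0 ">" }'
}

printf '%s\n' "$paragraphs" | print_with_delimiters
```

Because the record separator is consumed while splitting, the action has to re-add both < and > itself.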