Non-greedy match and replace with gensub and awk



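I am trying to clean up a batch of POS-tagged sentences with AWK. Each sentence can contain zero, one, or several tags of the form POS{word,type}. I am stuck on sentences that contain more than one tag, because I cannot find a way to make the regular expression non-greedy. Example:

Input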

sentence_1,My POS{tailor,noun} is POS{rich,adj}.

Desired output

sentence_1,My tailor is rich.

Where I am so far

echo "sentence_1,My POS{tailor,noun} is POS{rich,adj}."|awk -F "," 'BEGIN{OFS=","} {_id=$1;$1="";s=gensub(/POS\{(.+?),.+?}/, "\\1", "gm", $0); print _id s}'

I get the wrong output:

sentence_1,My tailor,noun} is POS{rich.

The regular expression is being greedy. I know awk cannot handle non-greedy expressions, so how would you do this? Thanks in advance.

With the samples you have shown, please try the following. It was written and tested in GNU awk, and I believe it should work in any awk.

echo "sentence_1,My POS{tailor,noun} is POS{rich,adj}." |
awk '
{
  first=val=finalVal=""
  count=0
  while(match($0,/[a-zA-Z]+ POS\{[^,]*/)){
    if(++count==1){
      first=substr($0,1,RSTART-1)
    }
    val=substr($0,RSTART,RLENGTH)
    sub(/POS\{/,"",val)
    finalVal=(finalVal?finalVal OFS:"")val
    $0=substr($0,RSTART+RLENGTH)
  }
  print first finalVal
}'

Or, if anything follows the last tag POS{rich,adj} (such as the trailing . in this example), use the following instead:

echo "sentence_1,My POS{tailor,noun} is POS{rich,adj}." |
awk '
{
  first=val=finalVal=""; count=0   ## reset per input line
  while(match($0,/[a-zA-Z]+ POS\{[^,]*/)){
    if(++count==1){
      first=substr($0,1,RSTART-1)
    }
    val=substr($0,RSTART,RLENGTH)
    sub(/POS\{/,"",val)
    finalVal=(finalVal?finalVal OFS:"")val
    $0=substr($0,RSTART+RLENGTH)
  }
  sub(/.*}/,"")
  print first finalVal $0
}'

Explanation: a detailed, commented version of the first program above.

echo "sentence_1,My POS{tailor,noun} is POS{rich,adj}." |  ## Print the sample line and pipe it into awk.
awk '                                          ## Start of the awk program.
{
  first=val=finalVal=""                        ## Reset the variables for each input line.
  count=0                                      ## Reset the match counter.
  while(match($0,/[a-zA-Z]+ POS\{[^,]*/)){     ## Loop while the line still contains: letters, a space, POS{, then text up to the next comma.
    if(++count==1){                            ## On the first match only:
      first=substr($0,1,RSTART-1)              ## Save everything before the match, e.g. sentence_1,
    }
    val=substr($0,RSTART,RLENGTH)              ## val is the matched text, e.g. My POS{tailor
    sub(/POS\{/,"",val)                        ## Remove the literal POS{ from val.
    finalVal=(finalVal?finalVal OFS:"")val     ## Append val to finalVal, separated by OFS.
    $0=substr($0,RSTART+RLENGTH)               ## Keep only the part of the line after the match.
  }
  print first finalVal                         ## Print the saved prefix followed by the collected values.
}'

Here is a sed solution that uses negated bracket expressions:

s='sentence_1,My POS{tailor,noun} is POS{rich,adj}.'
sed -E 's/POS\{([^,]+),[^}]*}/\1/g' <<< "$s"
sentence_1,My tailor is rich.

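RegEx explanation:

  • POS\{: matches the literal text POS{
  • ([^,]+): matches one or more non-comma characters and captures them in group #1
  • ,: matches a comma
  • [^}]*: matches zero or more characters that are not }
  • }: matches a closing }
  • \1: the replacement, a back-reference to capture group #1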
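To show how the expression behaves on lines with zero, one, or several tags, here is a minimal sketch. The function name strip_pos_tags and the extra sample sentences are made up for illustration, and the sketch assumes tags never nest and that the word part contains no comma.

```shell
# Hypothetical helper: read lines on stdin and replace every
# POS{word,type} tag with just its word.
strip_pos_tags() {
  # [^,}]+ and [^}]* cannot run past a comma or a closing brace,
  # which gives the effect of a non-greedy match without needing one.
  sed -E 's/POS\{([^,}]+),[^}]*\}/\1/g'
}

# Sample input: several tags, no tags, one tag.
printf '%s\n' \
  'sentence_1,My POS{tailor,noun} is POS{rich,adj}.' \
  'sentence_2,No tags here.' \
  'sentence_3,A POS{cat,noun} sleeps.' | strip_pos_tags
```

This should print the three sentences with the tags reduced to their words, leaving the untagged line unchanged.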

Or a bit "simpler" (?) with gawk's gensub (as originally attempted):

$ echo 'sentence_1,My POS{tailor,noun} is POS{rich,adj}' | gawk '{s=gensub(/POS\{([^,]+),[^}]+}/, "\\1", "G", $0); print s}'
sentence_1,My tailor is rich
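
Note that gensub is a gawk extension, so the one-liner above will not run under other awks such as mawk or BSD awk. A portable equivalent can be sketched with only match(), substr(), and sub(). The function name strip_pos_awk is hypothetical, and the sketch assumes tags never nest and that the word part contains no comma.

```shell
# Hypothetical helper: portable awk version without gensub().
strip_pos_awk() {
  awk '{
    out = ""
    # Find the leftmost POS{word,type} tag that is still in the line.
    while (match($0, /POS[{][^,}]+,[^}]*[}]/)) {
      word = substr($0, RSTART + 4, RLENGTH - 5)  # "word,type" without POS{ and }
      sub(/,.*/, "", word)                        # drop ",type", keep the word
      out = out substr($0, 1, RSTART - 1) word    # text before the tag, then the word
      $0 = substr($0, RSTART + RLENGTH)           # continue after the tag
    }
    print out $0                                  # append whatever follows the last tag
  }'
}

echo 'sentence_1,My POS{tailor,noun} is POS{rich,adj}.' | strip_pos_awk
```

The bracket expressions [{] and [}] are used instead of backslash escapes so that the braces are read literally by any awk. Lines without tags pass through unchanged.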
