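I am trying to clean up a bunch of sentences that carry POS tags, using AWK. Each sentence can have zero, one, or more tags of the form POS{word,type}.
I am struggling with sentences that contain several tags: I cannot find a way to make the regular expression non-greedy. Example
Input: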
sentence_1,My POS{tailor,noun} is POS{rich,adj}.
Desired output:
sentence_1,My tailor is rich.
What I have so far:
echo "sentence_1,My POS{tailor,noun} is POS{rich,adj}." | awk -F "," 'BEGIN{OFS=","} {_id=$1; $1=""; s=gensub(/POS\{(.+?),.+?\}/, "\\1", "gm", $0); print _id s}'
I get the wrong output:
sentence_1,My tailor,noun} is POS{rich.
The regex is greedy. I know awk does not support non-greedy quantifiers, so how would you do this? Thanks in advance.
For the samples you have shown, could you try the following? It was written and tested in GNU awk, and I believe it should work in any awk.
echo "sentence_1,My POS{tailor,noun} is POS{rich,adj}." |
awk '
{
  first=val=finalVal=""
  count=0
  while(match($0,/[a-zA-Z]+ POS\{[^,]*/)){
    if(++count==1){
      first=substr($0,1,RSTART-1)
    }
    val=substr($0,RSTART,RLENGTH)
    sub(/POS\{/,"",val)
    finalVal=(finalVal?finalVal OFS:"")val
    $0=substr($0,RSTART+RLENGTH)
  }
  print first finalVal
}'
Or, if anything follows the last tag (like the . after POS{rich,adj} in this example), use the following, which also prints the rest of the line:
echo "sentence_1,My POS{tailor,noun} is POS{rich,adj}." |
awk '
{
  first=val=finalVal=""
  count=0
  while(match($0,/[a-zA-Z]+ POS\{[^,]*/)){
    if(++count==1){
      first=substr($0,1,RSTART-1)
    }
    val=substr($0,RSTART,RLENGTH)
    sub(/POS\{/,"",val)
    finalVal=(finalVal?finalVal OFS:"")val
    $0=substr($0,RSTART+RLENGTH)
  }
  sub(/.*}/,"")
  print first finalVal $0
}'
Explanation: a detailed, commented version of the first solution above.
echo "sentence_1,My POS{tailor,noun} is POS{rich,adj}." |   ## Print the sample line and pipe it into awk.
awk '                                       ## Start the awk program.
{
  first=val=finalVal=""                     ## Reset the variables for each input line.
  count=0                                   ## Reset the match counter.
  while(match($0,/[a-zA-Z]+ POS\{[^,]*/)){  ## Loop while the line still contains: letters, a space, POS{ and everything up to the next comma.
    if(++count==1){                         ## On the first match only:
      first=substr($0,1,RSTART-1)           ## save everything before the match, e.g. sentence_1,
    }
    val=substr($0,RSTART,RLENGTH)           ## val is the matched text, e.g. My POS{tailor
    sub(/POS\{/,"",val)                     ## Remove POS{ from val, leaving e.g. My tailor
    finalVal=(finalVal?finalVal OFS:"")val  ## Append val to finalVal, separated by OFS (a space by default).
    $0=substr($0,RSTART+RLENGTH)            ## Keep only the part of the line after the match.
  }
  print first finalVal                      ## Print the saved prefix followed by the collected words.
}'
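Here is a sed solution using negated bracket expressions:
s='sentence_1,My POS{tailor,noun} is POS{rich,adj}.'
sed -E 's/POS\{([^,]+),[^}]*\}/\1/g' <<< "$s"
sentence_1,My tailor is rich.
RegEx explanation:
POS\{ : matches the literal text POS{
([^,]+) : matches one or more non-comma characters and captures them in group #1
, : matches a comma
[^}]* : matches zero or more characters other than }
\} : matches a literal }
\1 : the replacement, a back-reference to capture group #1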
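To see why the negated bracket expressions matter, the two patterns can be run side by side. This is only an illustrative sketch: the variable names greedy and negated are made up for the comparison, and it assumes a sed that accepts -E (GNU and BSD sed both do).

```shell
s='sentence_1,My POS{tailor,noun} is POS{rich,adj}.'

# Greedy version: .+ stretches to the LAST comma and the LAST } on the line,
# so one match swallows both tags.
greedy=$(printf '%s\n' "$s" | sed -E 's/POS\{(.+),.+\}/\1/g')

# Negated bracket expressions: [^,]+ stops at the first comma and [^}]* stops
# at the first }, so each tag is matched on its own.
negated=$(printf '%s\n' "$s" | sed -E 's/POS\{([^,]+),[^}]*\}/\1/g')

printf 'greedy:  %s\n' "$greedy"
printf 'negated: %s\n' "$negated"
```

The greedy line reproduces the wrong output from the question, while the negated version yields the desired sentence.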
Or, a bit "simpler" (?), with gawk's gensub (as originally attempted):
$ echo 'sentence_1,My POS{tailor,noun} is POS{rich,adj}' | gawk '{s=gensub(/POS\{([^,]+),[^}]+\}/, "\\1", "G", $0); print s}'
sentence_1,My tailor is rich
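Since gensub is a gawk extension, the same idea can be sketched for other awks with a match() loop and the same negated bracket expressions. This is a minimal sketch, assuming every tag has the form POS{word,type}; the function name strip_pos is hypothetical and only serves the example.

```shell
# strip_pos: read lines on stdin and replace each POS{word,type} with word.
strip_pos() {
  awk '{
    out = ""
    # [{] and [}] match literal braces; [^,}]+ and [^}]* cannot run past one tag.
    while (match($0, /POS[{][^,}]+,[^}]*[}]/)) {
      tag = substr($0, RSTART + 4, RLENGTH - 5)   # text between POS{ and }
      sub(/,.*/, "", tag)                         # drop the comma and the type
      out = out substr($0, 1, RSTART - 1) tag     # text before the tag, then the word
      $0 = substr($0, RSTART + RLENGTH)           # continue after the tag
    }
    print out $0                                  # append whatever is left
  }'
}

echo 'sentence_1,My POS{tailor,noun} is POS{rich,adj}.' | strip_pos
```

Lines without any tag never enter the loop and are printed unchanged, which covers the "zero tags" case mentioned in the question.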