I have an awk command that runs without errors:
awk -F';' ' # delimiter
NR==FNR { # process the stem file
gsub(/"/,"") # off with the double quotes
a[$2]=$1 # hash
next
}
{
if($1 in a) # if corpus entry found in stem
print "XXX" # output
else
print $1
}' stem.txt corpus.txt
It returns:
this
is
XXX
XXX
as
XXX
XXX
But I want the output to include the word "test". The expected result is:
this
is
XXX
XXX
as
test
XXX
This is because "test" is identical in column 1 and column 2 of the stem file.
# cat stem.txt
"test";"tested";"test";"Suffix";"A";"7673";"321: 0 xxx"
"test";"testing";"test";"Suffix";"A";"7673";"322: 0 xxx"
"test";"test";"test";"Suffix";"b";"5942";"001: 0 xxx"
"break";"broke";"break";"Suffix";"b";"5942";"002: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"break";"breaks";"break";"Suffix";"c";"5778";"001: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"
Only records whose column 1 and column 2 differ should be eligible for comparison against the corpus file.
# cat corpus.txt
this
is
broken
testing
as
test
told
I tried modifying the if clause, but it does not seem to work:
if($1 in a && a[$1] == a[$2])
With your shown samples, please try the following.
awk '
BEGIN{
FS=OFS=";"
}
{
gsub(/"/,"")
}
FNR==NR{
if($1!=$2){ arr[$2] }
next
}
{
print ($0 in arr)?"XXX":$0
}' stem.txt corpus.txt
With the shown samples, the output is as follows:
this
is
XXX
XXX
as
test
XXX
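To see which keys the FNR==NR block ends up storing in arr, here is a small sketch that prints the second column of every stem row whose first two columns differ. The temp directory, the shortened four-row stem.txt, and the `eligible` variable are scaffolding I added so it runs on its own; they are not part of the original samples.

```shell
# Self-contained sketch: recreate a few rows of stem.txt in a temp directory.
tmp=$(mktemp -d)
cat > "$tmp/stem.txt" <<'EOF'
"test";"tested";"test";"Suffix";"A";"7673";"321: 0 xxx"
"test";"test";"test";"Suffix";"b";"5942";"001: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"
EOF

# Strip the quotes, then print column 2 only when columns 1 and 2 differ.
eligible=$(awk -F';' '{ gsub(/"/, "") } $1 != $2 { print $2 }' "$tmp/stem.txt")
printf '%s\n' "$eligible"
rm -rf "$tmp"
```

The row where both columns are "test" is skipped, so "test" never becomes a key and is printed unchanged when it shows up in corpus.txt.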
Explanation: adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=";" ##Setting field and output field separators to ; here.
}
{
gsub(/"/,"") ##Globally removing all double quotes from the current line here.
}
FNR==NR{ ##Checking condition which will be TRUE when stem.txt is being read.
if($1!=$2){ arr[$2] } ##Checking if 1st and 2nd fields are NOT equal then set arr index as $2.
next ##next will skip all further statements from here.
}
{
print ($0 in arr)?"XXX":$0 ##Printing XXX if current line is present in arr, else printing current line.
}' stem.txt corpus.txt ##Mentioning Input_file names here.
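As for why the attempted condition `if($1 in a && a[$1] == a[$2])` does not help: lines in corpus.txt have a single field, so `$2` is empty there and `a[$2]` evaluates to an empty string, which never equals the stored stem. If you would rather keep the structure of the original program, a minimal sketch is to compare the stored stem with the word itself. The temp directory and the `result` variable are only scaffolding I added to make it self-contained.

```shell
# Recreate the sample files from the question in a temp directory.
tmp=$(mktemp -d)
cat > "$tmp/stem.txt" <<'EOF'
"test";"tested";"test";"Suffix";"A";"7673";"321: 0 xxx"
"test";"testing";"test";"Suffix";"A";"7673";"322: 0 xxx"
"test";"test";"test";"Suffix";"b";"5942";"001: 0 xxx"
"break";"broke";"break";"Suffix";"b";"5942";"002: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"break";"breaks";"break";"Suffix";"c";"5778";"001: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"
EOF
printf '%s\n' this is broken testing as test told > "$tmp/corpus.txt"

result=$(awk -F';' '
NR==FNR {                 # first file: stem.txt
    gsub(/"/, "")         # remove the double quotes
    a[$2] = $1            # inflected form -> stem, as in the question
    next
}
{
    # Mask only when the word is a known form AND differs from its stem.
    if (($1 in a) && a[$1] != $1)
        print "XXX"
    else
        print $1
}' "$tmp/stem.txt" "$tmp/corpus.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

This keeps the `a[$2]=$1` hash from the question and changes only the test in the second block. For "test" the stored stem equals the word, so it falls through to the else branch and is printed as is.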