Compare against a CSV file only where two of its columns do not match



I have an awk command that works as expected.

awk -F';' '          # delimiter
NR==FNR {             # process the stem file
gsub(/"/,"")      # off with the double quotes
a[$2]=$1          # hash
next
}
{
if($1 in a)       # if corpus entry found in stem
print "XXX"   # output
else
print $1
}' stem.txt corpus.txt

It returns:

this
is
XXX
XXX
as
XXX
XXX

But I want the output to include the word "test". The expected result is:

this
is
XXX
XXX
as
test
XXX

This is because "test" is the same in column 1 and column 2 of the stem file.

# cat stem.txt
"test";"tested";"test";"Suffix";"A";"7673";"321: 0 xxx"
"test";"testing";"test";"Suffix";"A";"7673";"322: 0 xxx"
"test";"test";"test";"Suffix";"b";"5942";"001: 0 xxx"
"break";"broke";"break";"Suffix";"b";"5942";"002: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"break";"breaks";"break";"Suffix";"c";"5778";"001: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"

Only records where column 1 and column 2 do not match should qualify for comparison against the corpus file.

# cat corpus.txt
this
is
broken
testing
as
test
told

I tried modifying the if clause, but it does not seem to work:

if($1 in a && a[$1] == a[$2])

With your shown samples, please try the following.

awk '
BEGIN{
FS=OFS=";"
}
{
gsub(/"/,"")
}
FNR==NR{
if($1!=$2){ arr[$2] }
next
}
{
print ($0 in arr)?"XXX":$0
}' stem.txt corpus.txt

With the shown samples, the output is as follows:

this
is
XXX
XXX
as
test
XXX

Explanation: a detailed, commented version of the above.

awk '                            ##Starting awk program from here.
BEGIN{                           ##Starting BEGIN section from here.
FS=OFS=";"                     ##Setting field and output field separators to ; here.
}
{
gsub(/"/,"")                   ##Globally removing all double quotes from the current line here.
}
FNR==NR{                         ##Checking condition which will be TRUE when stem.txt is being read.
if($1!=$2){ arr[$2] }          ##Checking if 1st and 2nd fields are NOT equal then set arr index as $2.
next                           ##next will skip all further statements from here.
}
{
print ($0 in arr)?"XXX":$0     ##Printing XXX if current line is present in arr, else printing current line.
}' stem.txt corpus.txt           ##Mentioning Input_file names here.
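
To check the approach without touching real files, here is a minimal self-contained sketch that recreates the sample data from the question in a scratch directory and runs the same logic. The `dir` and `result` variables and the `seen` array name are only illustrative, and `mktemp -d` is assumed to be available (it is on most Linux and macOS systems).

```shell
#!/bin/sh
# Scratch directory (hypothetical location) holding copies of the sample files.
dir=$(mktemp -d)

cat > "$dir/stem.txt" <<'EOF'
"test";"tested";"test";"Suffix";"A";"7673";"321: 0 xxx"
"test";"testing";"test";"Suffix";"A";"7673";"322: 0 xxx"
"test";"test";"test";"Suffix";"b";"5942";"001: 0 xxx"
"break";"broke";"break";"Suffix";"b";"5942";"002: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"break";"breaks";"break";"Suffix";"c";"5778";"001: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"
EOF

cat > "$dir/corpus.txt" <<'EOF'
this
is
broken
testing
as
test
told
EOF

result=$(awk -F';' '
NR==FNR {                       # first file: stem.txt
    gsub(/"/, "")               # strip the double quotes; fields are re-split
    if ($1 != $2) seen[$2]      # remember column 2 only when it differs from column 1
    next
}
{ print (($1 in seen) ? "XXX" : $1) }   # second file: corpus.txt
' "$dir/stem.txt" "$dir/corpus.txt")

printf '%s\n' "$result"
rm -rf "$dir"
```

As for why the attempt in the question likely failed: lines in corpus.txt contain only one field, so `$2` is empty there and `a[$2]` is an empty string, which means `a[$1] == a[$2]` is never true for a real stem. The column comparison has to happen while stem.txt is being read, as it does above.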
