Using awk to compare one file against two separate lookup files



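Basically, I want to check whether the strings in lookup_1 and lookup_2 exist in my xyz.txt file, replace the ones that do, and redirect the output to an output file. Also, my code currently replaces every occurrence of the lookup_1 strings, even when they appear only as substrings, but I need it to replace only on a complete-word match. Can you help adjust the code to achieve that?

Code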

awk '
FNR==NR { if ($0 in lookups)                # skip lookup values already seen
              next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {             # mask each field of the lookup value
              oldstr=$i
              newstr=""
              while (oldstr) {              # keep the 1st character of every 3-character chunk, mask the rest
                  len=length(oldstr)
                  newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                  oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
        { for (i in lookups) {
              ndx=index($0,i)               # index() finds the value anywhere, including inside longer words
              while (ndx > 0) {
                  $0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
                  ndx=index($0,i)
              }
          }
          print
        }
' lookup_1 xyz.txt > output.txt

lookup_1

ha
achine
skhatw
at
ree
ter
man
dun

lookup_2

United States
CDEXX123X
Institution

xyz.txt

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file 
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States

Current output

[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States

Desired output

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##

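We can make a couple of changes to the current code:

  • feed the result of cat lookup_1 lookup_2 into awk so that the two files look like a single file to awk (see the last line of the new code)
  • build regexes with the word-boundary operators (\< and \>) and use them to perform the replacements (see the second half of the new code)

New code: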

awk '
# the FNR==NR block of code remains the same
FNR==NR { if ($0 in lookups)
              next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                  len=length(oldstr)
                  newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                  oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
# complete rewrite of the following block to perform replacements based on a regex using word boundaries
        { for (i in lookups) {
              regex= "\\<" i "\\>"            # build regex; backslashes are doubled so the string holds \< and \> (GNU awk operators)
              gsub(regex,lookups[i])          # replace strings that match regex
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt            # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code

This generates:

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
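
The reason for the combined stream can be checked in isolation. FNR==NR is true only while awk reads its first input, so a second lookup file passed as a separate argument would fall through to the replacement block. A minimal sketch, assuming throwaway files a.txt and b.txt (hypothetical names) stand in for lookup_1 and lookup_2, and c.txt for xyz.txt. It pipes cat into awk and names standard input as "-", which should behave like the <(cat ...) form without needing bash process substitution:

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
printf 'ha\nter\n' > "$tmp/a.txt"              # stands in for lookup_1
printf 'Institution\n' > "$tmp/b.txt"          # stands in for lookup_2
printf 'line one\nline two\n' > "$tmp/c.txt"   # stands in for xyz.txt

# Count how many lines land in the FNR==NR block and how many in the main block.
count='FNR==NR { lookups++; next } { data++ } END { print lookups+0, data+0 }'

# Separate arguments: only a.txt satisfies FNR==NR, so b.txt is counted as data.
separate=$(awk "$count" "$tmp/a.txt" "$tmp/b.txt" "$tmp/c.txt")

# Combined stream: a.txt and b.txt arrive as one input, so both satisfy FNR==NR.
combined=$(cat "$tmp/a.txt" "$tmp/b.txt" | awk "$count" - "$tmp/c.txt")

echo "separate: $separate"
echo "combined: $combined"
rm -rf "$tmp"
```

With these sample files the first count is "2 3" (the line from b.txt is treated as data) and the second is "3 2".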

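Notes:

  • the word-boundary operators (\< and \>) match the empty string at the beginning and end of a word, i.e. at the transition between word and non-word characters; they are GNU awk extensions, and in awk a word is defined as a sequence of digits, letters and underscores; for more details see the GNU awk documentation on regexp operators
  • all of the sample lookup values are made up of awk words (each begins and ends with a word character), so this new code works as required
  • your previous question included lookup values that cannot be considered awk "words" (e.g., @vanti Finserv Co.11:11 - CapitalMS&CO(NY)), in which case this new code may fail to replace those lookup values
  • for lookup values that contain non-word characters it is not clear how a "word match" should be defined, since you would also need to decide when a non-word character (e.g., @) is treated as part of the lookup string and when it is treated as a word boundary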
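
The point about awk "words" can be checked before running the replacement. Because \< only matches immediately before a word character and \> only immediately after one, a lookup value can match \<value\> only if its first and last characters are word characters. A small sketch, assuming the bracket expression [A-Za-z0-9_] as an ASCII-only approximation of awk's word characters; @vanti is taken from the note above and the other values from the sample lookup files:

```shell
# Report which lookup values could match when wrapped in \< and \>.
report=$(printf '%s\n' 'ter' 'United States' 'CDEXX123X' '@vanti' |
awk '{
    first = substr($0, 1, 1)
    last  = substr($0, length($0), 1)
    # Both ends must be word characters for \<value\> to have a chance of matching.
    ok  = (first ~ /[A-Za-z0-9_]/ && last ~ /[A-Za-z0-9_]/)
    tag = ok ? "ok" : "skip"
    print tag ": " $0
}')
echo "$report"
```

Values reported as "skip" (here only @vanti) are the ones the word-boundary regex would silently leave untouched.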

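If you need to replace lookup values that contain (awk) non-word characters, you could try replacing the word-boundary operators with \W, although this causes problems for lookup values that are (awk) "words".

One possible workaround is to run a pair of regex replacements for each lookup value, e.g.: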

awk '
FNR==NR { ... no changes to this block of code ... }
        { for (i in lookups) {
              regex= "\\<" i "\\>"
              gsub(regex,lookups[i])
              regex= "\\W" i "\\W"            # \W also consumes the non-word character on each side of the match
              gsub(regex,lookups[i])
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt

You will need to determine whether the second regex violates the "word match" requirement.
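
Another option, offered here as an assumption rather than as part of the answer above, is to avoid regexes for the lookup value altogether: find it literally with index() and accept the hit only when the characters on either side are not word characters. This treats characters such as @ or ( as part of the lookup string and sidesteps regex metacharacters in lookup values. The function name replace_whole, the sample line, and the masked string @##n## are hypothetical, and [A-Za-z0-9_] is an ASCII-only stand-in for awk's word characters:

```shell
result=$(printf '%s\n' 'user ter met (ter) and @vanti in terminal' |
awk '
# Replace literal occurrences of word in str with repl, but only where the
# match is not flanked by a letter, digit, or underscore.
function replace_whole(str, word, repl,    out, pos, prev, before, after) {
    if (word == "") return str
    out = ""; prev = ""
    while ((pos = index(str, word)) > 0) {
        # prev remembers the original character just before the unscanned text.
        before = (pos > 1) ? substr(str, pos - 1, 1) : prev
        after  = substr(str, pos + length(word), 1)
        if (before ~ /[A-Za-z0-9_]/ || after ~ /[A-Za-z0-9_]/) {
            # Part of a longer word: copy through the first matched character and rescan.
            out  = out substr(str, 1, pos)
            prev = substr(str, pos, 1)
            str  = substr(str, pos + 1)
        } else {
            # Whole-word hit: emit the replacement and continue after the match.
            out  = out substr(str, 1, pos - 1) repl
            prev = substr(word, length(word), 1)
            str  = substr(str, pos + length(word))
        }
    }
    return out str
}
{
    $0 = replace_whole($0, "ter", "t##")
    $0 = replace_whole($0, "@vanti", "@##n##")
    print
}')
echo "$result"
```

On the sample line, the standalone ter and the (ter) in parentheses are masked, terminal is left alone, and @vanti is replaced even though it starts with a non-word character. Whether that matches the intended meaning of a "word match" is still a decision for the asker.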
