Comparing two numerical ranges in two distinct files with awk and printing all lines from file1 and the matching ones from file2



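This new question follows up on a recent question: Compare two numerical ranges in two distinct files with awk. The proposed solution works perfectly, but it is not practical for my downstream analysis (the problem was the way I phrased my question, not the solution itself).

I have a file with 3 columns. Columns 2 and 3 define a numerical range. The data are sorted by column 2 in ascending order. The numerical ranges never overlap.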

file1

S   24     96
S   126    352
S   385    465
S   548    600
S   621    707
S   724    736

I have a second file, file2 (test), with a similar structure.

file2

S   27     93
S   123    348
S   542    584
S   726    740
S   1014   2540
S   12652  12987

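Desired output: print every line of file1 and, next to it, the line of file2 whose numerical range overlaps it (partial overlaps included). If no range in file2 overlaps a range of file1, print a zero next to the file1 range.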

S   24    96     S   27    93       * 27-93 overlaps with 24-96
S   126   352    S   123   348      * 123-348 overlaps with 126-352
S   385   465    0                  * nothing in file2 overlaps with this range
S   548   600    S   542   584      * 542-584 overlaps with 548-600
S   621   707    0                  * nothing in file2 overlaps with this range
S   724   736    S   726   740      * 726-740 overlaps with 724-736

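Based on @EdMorton's answer to my previous question, I modified the print commands of the tst.awk script to add these new features. I also changed the argument order from file1/file2 to file2/file1 so that all lines of file1 are printed (whether or not there is a match in the second file).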

'NR == FNR {
begs2ends[$2] = $3
next
}
{
for (beg in begs2ends) {
end = begs2ends[beg] + 0
beg += 0
if (    ( ($2 >= beg) && ($2 <= end) ) ||
( ($3 >= beg) && ($3 <= end) ) ||
( ($2 <= beg) && ($3 >= end) )  ) {
print $0,"\t",$1,"\t",beg,"\t",end
else 
print $0,"\t","0"
next
}
}
}

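Note: $1 is the same in file1 and file2, which is why I used print ... $1 to make it appear. I don't know how to print it from file2 rather than from file1 (if I understand correctly, this $1 refers to file1).

I then launch the analysis with awk -f tst.awk file2 file1

The script does not accept the else statement and I don't understand why. I assume it is related to the loop, but I tried several changes without success. Thanks if you can help me with this.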

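Assumptions:

  • a range from file1 can only overlap with one range from file2

The current code is almost correct; it just needs a bit of work on the placement of the braces (consistent indentation helps):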

awk '
BEGIN     { OFS="\t" }                                 # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }
{
    # $1=$1                                    # uncomment to have current line ($0) reformatted with "\t" delimiters during print
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0,$1,beg,end                  # spacing within $0 unchanged, 3 new fields prefaced with "\t"
            next
        }
    }
    # if we get this far it is because we have exhausted the "for" loop
    # (ie, found no overlaps) so print current line + "0"
    print $0,"0"                               # spacing within $0 unchanged, 1 new field prefaced with "\t"
}
' file2 file1

This generates:

S   24     96   S       27      93
S   126    352  S       123     348
S   385    465  0
S   548    600  S       542     584
S   621    707  0
S   724    736  S       726     740

If the $1=$1 line is uncommented, the output becomes:

S       24      96      S       27      93
S       126     352     S       123     348
S       385     465     0
S       548     600     S       542     584
S       621     707     0
S       724     736     S       726     740
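
As for why the original script was rejected: in awk, as in C, an else must directly follow a complete if statement, so a block opened with { after the condition has to be closed with } before the else. The question's code never closed that block, hence the syntax error. Even with balanced braces, an else placed inside the for loop would report "0" as soon as one non-overlapping file2 range is examined, which is why the "0" line above is printed only after the loop has finished. A minimal sketch of the brace placement, assuming made-up inline data (the function name overlap_demo, the single stored range and the "overlap" label are illustrative and not part of the original scripts):

```shell
overlap_demo() {
    # Two input lines stand in for file1; one hard-coded range stands in for file2.
    printf 'S 24 96\nS 385 465\n' |
    awk '
        BEGIN { beg = 27; end = 93 }           # hypothetical single file2 range
        {
            if ( ($2 <= end) && ($3 >= beg) ) {
                print $0, "overlap"
            }                                  # the "if" block is closed before "else"
            else {
                print $0, "0"
            }
        }'
}
overlap_demo
```

With these two sample lines the first one is reported as an overlap and the second one gets a "0".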

A slight variation on @markp-fuso's answer

Works with GNU awk: save as overlaps.awk

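BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }   # gawk only: traverse arrays in ascending numeric index order
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
    # first file on the command line (file2): remember each line and its range
    line[FNR] = $0
    lo[FNR] = $2
    hi[FNR] = $3
    next
}
{
    overlap = "0"
    for (i in line) {
        # the third test catches a stored range that fully contains the current one
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3) || in_range($2, lo[i], hi[i])) {
            overlap = line[i]
            delete line[i]      # each stored range is reported at most once
            break
        }
    }
    print $0, overlap
}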

Then

gawk -f overlaps.awk file2 file1 | column -t

Output
S  24   96   S  27   93
S  126  352  S  123  348
S  385  465  0
S  548  600  S  542  584
S  621  707  0
S  724  736  S  726  740
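
The overlap conditions in the scripts above can also be collapsed into a single comparison: two closed ranges [a,b] and [c,d] overlap exactly when a <= d and c <= b. The following is only a sketch of that test, assuming a hypothetical helper named overlaps and a wrapper function overlap_check that are not part of any of the answers; the numbers are taken from the sample files:

```shell
overlap_check() {
    awk '
        # hypothetical helper: returns 1 when [a,b] and [c,d] share at least one value
        function overlaps(a, b, c, d) { return (a <= d) && (c <= b) }
        BEGIN {
            print overlaps(24, 96, 27, 93)      # file2 range lies inside the file1 range
            print overlaps(126, 352, 123, 348)  # partial overlap
            print overlaps(385, 465, 542, 584)  # disjoint ranges
        }'
}
overlap_check
```

This prints 1, 1 and 0, one value per line. The single comparison also covers the case where one range fully contains the other, which the endpoint-by-endpoint tests have to handle with a separate clause.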

$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {
    # first file on the command line: store every line
    ranges[++numRanges] = $0
    next
}
{
    overlapped = 0
    for ( i=1; i<=numRanges; i++ ) {
        range = ranges[i]
        split(range,vals)
        beg = vals[2]+0
        end = vals[3]+0
        if (    ( ($2 >= beg) && ($2 <= end) ) ||
                ( ($3 >= beg) && ($3 <= end) ) ||
                ( ($2 <= beg) && ($3 >= end) )  ) {
            overlapped = 1
            break
        }
    }
    if ( overlapped ) {
        print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
    }
    else {
        print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
    }
}

$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S   24     96   S   27     93   * 27-93 overlaps with 24-96
S   126    352  S   123    348  * 123-348 overlaps with 126-352
S   385    465  0               * nothing in file2 overlaps with this range
S   548    600  S   542    584  * 542-584 overlaps with 548-600
S   621    707  0               * nothing in file2 overlaps with this range
S   724    736  S   726    740  * 726-740 overlaps with 724-736
