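This question is a follow-up to a recent one: comparing two numeric ranges in two distinct files with awk. The proposed solution worked perfectly but turned out to be impractical for my downstream analysis (I had misjudged my own problem; the solution itself was fine).
I have a file with 3 columns. Columns 2 and 3 define a numeric range. The data are sorted on column 2 from the smallest to the largest value. The numeric ranges never overlap.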
file1
S 24 96
S 126 352
S 385 465
S 548 600
S 621 707
S 724 736
I have a second file, file2 (test), with a similar structure.
file2
S 27 93
S 123 348
S 542 584
S 726 740
S 1014 2540
S 12652 12987
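Desired output: print every line of file1 and, next to it, the line of file2 whose numeric range overlaps it (partial overlaps included). If no range in file2 overlaps the file1 range, print a zero next to the file1 range.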
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
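Based on @EdMorton's answer to the previous question, I modified the print commands of the tst.awk script to add these new features. I also changed the order of the file arguments from file1 file2 to file2 file1, so that every line of file1 is printed (whether or not there is a match in the second file).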
NR == FNR {
    begs2ends[$2] = $3
    next
}
{
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0,"\t",$1,"\t",beg,"\t",end
        else
            print $0,"\t","0"
        next
        }
    }
}
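Note: $1 is the same in file1 and file2, which is why I used print ... $1 to make it appear. I don't know how to print it from file2 rather than file1 (if I understand correctly, this $1 refers to file1).
I then run the analysis with awk -f test.awk file2 file1
The script does not accept the else statement, and I don't understand why. I assume it has something to do with the loop, but I tried several changes without success. Thanks if you can help me with this.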
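Assumption: a range from file1 can overlap with only one range from file2.
The current code is almost correct; it just needs some work on the placement of the braces (consistent indentation helps):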
awk '
BEGIN { OFS="\t" }                        # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }
{
    # $1=$1                               # uncomment to have current line ($0) reformatted with "\t" delimiters during print
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0,$1,beg,end           # spacing within $0 unchanged, 3 new fields prefaced with "\t"
            next
        }
    }
    # if we get this far it is because we have exhausted the "for" loop
    # (ie, found no overlaps) so print current line + "0"
    print $0,"0"                          # spacing within $0 unchanged, 1 new field prefaced with "\t"
}
' file2 file1
This generates:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
If the $1=$1 line is uncommented, the output becomes:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
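As a side note on the `else` error in the question: in awk an `else` has to follow directly after the statement (or the closing brace) of its `if`. In the question's script the `else` appears while the `{` opened by the `if` is still open, so the parser finds no `if` to attach it to. Here is a minimal sketch of the braced form awk accepts; the input values and the shell variable `result` are made up for illustration only:

```shell
# Close the "if" block with "}" before writing "else".
result=$(printf '5\n15\n' | awk '
{
    if ($1 > 10) {
        print $1, "big"
    } else {
        print $1, "small"
    }
}')
printf '%s\n' "$result"
```

Fixing the brace alone would still not give the desired output, because an `else` inside the loop prints the zero as soon as the first file2 range fails to overlap. That is why the script above prints "0" only after the `for` loop has finished.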
A slight variation on @markp-fuso's answer.
For GNU awk: save as overlaps.awk
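BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }   # gawk only: visit array indices in ascending numeric order
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
    # first file (file2): remember each line and its range
    line[FNR] = $0
    lo[FNR] = $2
    hi[FNR] = $3
    next
}
{
    overlap = "0"
    for (i in line) {
        # the third test catches a file2 range that fully contains the current range
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3) || in_range($2, lo[i], hi[i])) {
            overlap = line[i]
            delete line[i]        # each file2 range is used at most once
            break
        }
    }
    print $0, overlap
}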
then
gawk -f overlaps.awk file2 file1 | column -t
outputs
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {
    ranges[++numRanges] = $0
    next
}
{
    overlapped = 0
    for ( i=1; i<=numRanges; i++ ) {
        range = ranges[i]
        split(range,vals)
        beg = vals[2]+0
        end = vals[3]+0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            overlapped = 1
            break
        }
    }
    if ( overlapped ) {
        print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
    }
    else {
        print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
    }
}
$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
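The three-clause overlap condition used in the scripts above can also be written more compactly: two closed ranges [b1,e1] and [b2,e2] (each with begin <= end) overlap exactly when b1 <= e2 and b2 <= e1. The following is only a sketch of that idea, run against the sample data from the question. The helper name `overlaps`, the variable `match_line`, the shell variable `overlap_out` and the scratch directory are my own assumptions, not part of any of the scripts above.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf 'S 24 96\nS 126 352\nS 385 465\nS 548 600\nS 621 707\nS 724 736\n' > "$tmpdir/file1"
printf 'S 27 93\nS 123 348\nS 542 584\nS 726 740\nS 1014 2540\nS 12652 12987\n' > "$tmpdir/file2"

overlap_out=$(awk '
# overlaps() is a hypothetical helper: true when [b1,e1] and [b2,e2] share a point
function overlaps(b1, e1, b2, e2) { return (b1 <= e2) && (b2 <= e1) }
BEGIN { OFS = "\t" }
NR == FNR { ranges[++numRanges] = $0; next }   # first file (file2): remember every range
{
    match_line = "0"                           # default when nothing overlaps
    for (i = 1; i <= numRanges; i++) {
        split(ranges[i], vals)
        if (overlaps($2 + 0, $3 + 0, vals[2] + 0, vals[3] + 0)) {
            match_line = ranges[i]
            break
        }
    }
    print $0, match_line
}' "$tmpdir/file2" "$tmpdir/file1")
printf '%s\n' "$overlap_out"
rm -rf "$tmpdir"
```

Like the scripts above, this stops at the first overlapping file2 range, so it relies on the same assumption that a file1 range overlaps at most one file2 range.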