Find the closest value: multi-column condition

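Following on from my first question, I would like to extend the condition so that the closest value is looked up across two different files using both the first and the second column, and then print specific columns.

File 1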

1 2 3 4 a1
1 4 5 6 b1
8 5 9 11 c1

File 2

1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d 
2 4 5 e
9 4 1 f 
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j

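So, for example, for each $1 in File 1 I need to find the closest value of $1 in File 2, and then, among those rows, search for the closest value of $2.

Output: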

1 2 a1*
1 2 b*
1 4 b1 
1 4 d 
8 5 c1 
9 5 g 
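* The first starred line comes from File 1 and the second from File 2. For the first column of File 1, the closest value in the first column of File 2 is 1; the second condition is that the second column must also be the closest value, which in this case is 2. From File 1 I print $1, $2 and $5, and from File 2 I print $1, $2 and $4.

The other output lines follow the same process.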

The solution for finding the closest value is in my other post, given by @Tensibai. But any solution will do. Thanks!

It is a bit convoluted, but it works:

# The extra parameters after "searched" are only there to make those variables local
function closest(array, searched,    distance, skeys, mkeys, x, tmp, found1, found2) {
  distance = 999999 # this should be higher than the largest possible difference to avoid returning an empty key
  split(searched, skeys, OFS)
  # Get the first part of the key
  for (x in array) { # loop over the array to get its keys
    split(x, mkeys, OFS) # split the array key
    # +0 forces a numeric comparison; tmp is the absolute difference between the key and the target
    tmp = (mkeys[1]+0 > skeys[1]+0) ? mkeys[1] - skeys[1] : skeys[1] - mkeys[1]
    if (tmp < distance) { # if the distance is less than the previous best, update
      distance = tmp
      found1 = mkeys[1] # and save the closest key found so far
    }
  }
  # At this point we have the first part of the key, let's redo the work for the second part
  distance = 999999
  for (x in array) {
    split(x, mkeys, OFS)
    if (mkeys[1] == found1) { # filter on the first part of the key
      # same absolute difference as above, this time on the second part of the key
      tmp = (mkeys[2]+0 > skeys[2]+0) ? mkeys[2] - skeys[2] : skeys[2] - mkeys[2]
      if (tmp < distance) { # if the distance is less than the previous best, update
        distance = tmp
        found2 = mkeys[2] # and save the closest key found so far
      }
    }
  }
  # Now we have the second field as well
  return (found1 OFS found2)  # return the combined key from our two searches
}
{
   if (NR > FNR) { # second file: FNR (record number within the current file) is now less than NR (total records read)
     b[($1 OFS $2)] = $4 # make an array with "$1 $2" as key and $4 as value
   } else {
     key = ($1 OFS $2) # build the key once to avoid recomputing it later
     akeys[max++] = key # store the keys in file order, as for (x in array) does not guarantee any order
     a[key] = $5 # make an array with the key stored previously and $5 as value
   }
}
END { # both files have been parsed, print the result
  for (i = 0; i < max; i++) { # loop by numeric index to keep the file order (for (i in akeys) would not guarantee it)
    print akeys[i], a[akeys[i]] # print the entry from the first file (key then value)
    if (akeys[i] in b) { # if the same key exists in the second file
      print akeys[i], b[akeys[i]] # then print it
    } else {
      bindex = closest(b, akeys[i]) # call the function to find the closest key from the second file
      print bindex, b[bindex] # print what we found
    }
  }
}

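Note that I am using OFS to combine the fields into keys, so if you change OFS to alter the output format, the script will still behave properly.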
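As a small illustration of the OFS remark, here is a sketch (not part of the answer's script) that builds a key with OFS and splits it again with the same OFS. The `;` separator and the variable names are assumptions chosen for the demo. A single character such as `;` is used literally by split(), whereas a multi-character separator would be treated as a regular expression, so that case may need more care.

```shell
# Hypothetical demo: build a key with OFS, then split it again with the same OFS.
result=$(printf '1 2 3 4 a1\n' | awk -v OFS=';' '{
  key = $1 OFS $2             # combined key built from the first two fields
  n = split(key, parts, OFS)  # splitting on OFS recovers both parts
  print key, $5, n            # print also joins its arguments with OFS
}')
echo "$result"
```

The key and the printed line use the same separator, which is why changing OFS does not break the lookups.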

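Warning: this should be fine with relatively short files, but since the array built from the second file is now traversed twice, each search will take twice as long.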
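One way to avoid the second traversal is to compare both distances in a single pass. The sketch below is an alternative technique, not the answer's method: it keeps the row whose first-column distance is smallest and uses the second-column distance only to break ties. The function name `closest_pair` and its arguments are hypothetical. Unlike the two-loop version, when two different first-column values are equally close, rows for both are considered.

```shell
# Hypothetical single-pass variant: closest on column 1, ties broken on column 2.
closest_pair() {
  # $1, $2 = target values for columns 1 and 2; candidate rows are read from stdin
  awk -v t1="$1" -v t2="$2" '
    function abs(v) { return v < 0 ? -v : v }
    {
      d1 = abs($1 - t1); d2 = abs($2 - t2)
      # keep this row if it is the first one, or strictly better than the best so far
      if (NR == 1 || d1 < b1 || (d1 == b1 && d2 < b2)) {
        b1 = d1; b2 = d2; best = $1 OFS $2 OFS $4
      }
    }
    END { print best }
  '
}

# Rows of File 2 from the question
file2='1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d
2 4 5 e
9 4 1 f
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j'

result=$(printf '%s\n' "$file2" | closest_pair 8 5)
echo "$result"
```

This reads the candidates once per search, at the cost of the slightly different tie handling described above.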

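If your files are sorted, a better search algorithm can be used (but that was not the case in the previous question, and you wanted to keep the order from the file). A first improvement in that case would be to break out of the for loop as soon as the distance starts to grow larger than the previous one.

Output for the sample files: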
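The early-exit idea for sorted input can be sketched as follows, for a single column only. It assumes the candidates arrive in ascending numeric order, one per line; the function name `nearest_sorted` is hypothetical. On a tie the later value wins, because the scan only stops when the distance strictly increases.

```shell
# Hypothetical sketch: nearest value in an ascending list, stopping early.
nearest_sorted() {
  # $1 = target; sorted candidates are read from stdin, one per line
  awk -v target="$1" '
    {
      d = $1 - target; if (d < 0) d = -d  # absolute distance to the target
      if (NR > 1 && d > best) exit        # distance started growing: stop (END still runs)
      best = d; found = $1
    }
    END { print found }
  '
}

result=$(printf '%s\n' 1 2 9 11 | nearest_sorted 8)
echo "$result"
```

Here the scan stops at 11 without looking at anything after it, since 9 was already closer to 8.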

$ mawk -f closest2.awk f1 f2
1 2 a1
1 2 b
1 4 b1
1 4 d
8 5 c1
9 5 g
