How do I find the range that a given number falls within?

I have two files.

File 1

SNP data consisting of a chromosome and its position (about 400,000 entries):

chr pos
a1 456
a2 789
. .
. . 
so on

File 2

GTF data consisting of chromosome, position_start, position_end, and detail (about 500,000 entries):

chr pos_start pos_end detail
a1 100 400 gene1
a1 401 700 gene2
a2 200 500 gene3
a2 501 900 gene4
. .
. . 
so on

Desired result

chr pos chr pos_start pos_end detail
a1 456 a1 401 700 gene2
a2 789 a2 501 900 gene4

I get this result with a shell script, running the following

(grep "$chr" file2.gtf | awk '{if($2 <= '$pos' && $3 >= '$pos') print $0}')

inside a while loop, but it takes far too long to process every position in file1.

Does anyone know a more efficient way to do this in shell, Python, or Perl?

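Here is a Perl version. The basic idea is that it caches the GTF data in a hash table, and then for each line of the SNP file it only looks at the GTF entries that match that chromosome.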

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use autodie;

my $snp_file = "file1.txt";
my $gtf_file = "file2.txt";

# Read the gtf data into a hash of arrays, keyed by chromosome
my %gtf;
open my $file, "<", $gtf_file;
my $hdr = <$file>; # Discard header line
while (<$file>) {
    chomp;
    my @cols = split /\s+/;
    push @{$gtf{$cols[0]}}, \@cols; # Store each row as an array reference
}
close $file;

open $file, "<", $snp_file;
$hdr = <$file>; # Discard header line
say "chr\tpos\tchr\tstart\tend\tdetail";

# Read the snp data
$" = "\t"; # Use tab for array element separator
while (<$file>) {
    chomp;
    my ($chr, $pos) = split /\s+/;
    # Look up all matches of this chromosome in the gtf hash and filter just
    # the ones where pos is in range. The "// []" guards against a chromosome
    # that never appears in the gtf file.
    my @matches = grep { $pos >= $_->[1] && $pos <= $_->[2] } @{ $gtf{$chr} // [] };
    # And print them out.
    for my $match (@matches) {
        say "$chr\t$pos\t@$match";
    }
}
close $file;
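
Since the question also mentions Python, here is a rough sketch of the same group-by-chromosome idea in that language. It is not a port of the Perl script. The function names `load_gtf` and `annotate` and the inline sample strings are made up for illustration, and the sketch assumes whitespace-separated columns with one header line, as in the samples in the question.

```python
import io
from collections import defaultdict

# Sample data copied from the question (header line included).
GTF_TEXT = """chr pos_start pos_end detail
a1 100 400 gene1
a1 401 700 gene2
a2 200 500 gene3
a2 501 900 gene4
"""
SNP_TEXT = """chr pos
a1 456
a2 789
"""

def load_gtf(lines):
    """Group (start, end, detail) tuples by chromosome."""
    lines = iter(lines)
    next(lines, None)  # discard header line
    by_chr = defaultdict(list)
    for line in lines:
        cols = line.split()
        if len(cols) < 4:
            continue  # skip blank or short lines
        by_chr[cols[0]].append((int(cols[1]), int(cols[2]), cols[3]))
    return by_chr

def annotate(snp_lines, by_chr):
    """Yield one output row per (SNP, containing interval) pair."""
    snp_lines = iter(snp_lines)
    next(snp_lines, None)  # discard header line
    for line in snp_lines:
        cols = line.split()
        if len(cols) < 2:
            continue
        chrom, pos = cols[0], int(cols[1])
        # Only intervals on the same chromosome are scanned.
        for start, end, detail in by_chr.get(chrom, []):
            if start <= pos <= end:
                yield (chrom, pos, chrom, start, end, detail)

by_chr = load_gtf(io.StringIO(GTF_TEXT))
for row in annotate(io.StringIO(SNP_TEXT), by_chr):
    print("\t".join(str(field) for field in row))
```

With real files you would pass open file handles instead of the `io.StringIO` objects.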

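If you're going to do a lot with this data, another option I'd go for is to load it all into SQLite or some other database and use SQL to find the results. That way you don't have to keep re-reading the data files; you just look things up in pre-populated tables (with appropriate indexes for efficiency).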
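
A minimal sketch of the database route, using Python's standard `sqlite3` module and an in-memory database. The table names (`snp`, `gtf`), the column names, and the index name are assumptions chosen for this example, and the rows are the sample data from the question rather than the real files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to keep the tables between runs
conn.execute("CREATE TABLE snp (chr TEXT, pos INTEGER)")
conn.execute(
    "CREATE TABLE gtf (chr TEXT, pos_start INTEGER, pos_end INTEGER, detail TEXT)"
)

# Sample rows from the question; real data would be bulk-loaded from the files.
conn.executemany("INSERT INTO snp VALUES (?, ?)", [("a1", 456), ("a2", 789)])
conn.executemany(
    "INSERT INTO gtf VALUES (?, ?, ?, ?)",
    [
        ("a1", 100, 400, "gene1"),
        ("a1", 401, 700, "gene2"),
        ("a2", 200, 500, "gene3"),
        ("a2", 501, 900, "gene4"),
    ],
)

# An index on (chr, pos_start) lets the join narrow down candidate rows.
conn.execute("CREATE INDEX gtf_chr_start ON gtf (chr, pos_start)")

rows = conn.execute(
    """
    SELECT s.chr, s.pos, g.chr, g.pos_start, g.pos_end, g.detail
    FROM snp AS s
    JOIN gtf AS g
      ON g.chr = s.chr
     AND s.pos BETWEEN g.pos_start AND g.pos_end
    ORDER BY s.chr, s.pos
    """
).fetchall()

for row in rows:
    print(*row, sep="\t")
```

Whether the index helps much on range conditions depends on the data and the query planner, so it is worth checking with `EXPLAIN QUERY PLAN` on the real tables.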

I think this awk script does what you need:

awk '
FNR==1  { next }
FNR==NR { chr[FNR]=$1; start[FNR]=$2; end[FNR]=$3; det[FNR]=$4; N=FNR; next }
{
    c=$1; p=$2;
    for (i=2; i<=N; i++) {
        if ((c==chr[i]) && (p>=start[i]) && (p<=end[i])) {
            print c, p, chr[i], start[i], end[i], det[i]
            next
        }
    }
}
' file2 file1

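So, starting from the last line, note that a single invocation of awk processes both files.

During processing, the first line of each file is ignored by checking whether the line number within the current file is 1 and, if so, skipping it: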

FNR==1  { next}

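Then, if the record number within the current file equals the total number of records awk has processed so far, we must be reading the first file. So we save each field in arrays indexed by line number and skip any further processing: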

FNR==NR { chr[FNR]=$1; start[FNR]=$2; end[FNR]=$3; det[FNR]=$4; N=FNR; next}

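Otherwise, we must be processing the second file. In that case, we loop over all the arrays saved from the first file looking for a matching entry. If we find one with the position in range, we print the pieces you want and move straight on to the next record: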

{
    c=$1; p=$2;
    for (i=2; i<=N; i++) {
        if ((c==chr[i]) && (p>=start[i]) && (p<=end[i])) {
            print c, p, chr[i], start[i], end[i], det[i]
            next
        }
    }
}
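
The loop above scans every saved interval for each SNP, which can be slow with roughly 500,000 intervals and 400,000 positions. One possible alternative, sketched here in Python, is to sort each chromosome's intervals by start and use binary search. This is a different technique from the awk script, not a translation of it. It assumes the intervals on a chromosome do not overlap (true of the sample data, not checked for real GTF files), and the names `build_index` and `find_interval` are invented for the example.

```python
from bisect import bisect_right

def build_index(gtf_rows):
    """Map chromosome -> (sorted start positions, intervals sorted by start)."""
    by_chr = {}
    for chrom, start, end, detail in gtf_rows:
        by_chr.setdefault(chrom, []).append((start, end, detail))
    index = {}
    for chrom, intervals in by_chr.items():
        intervals.sort()  # sorts by start first
        index[chrom] = ([iv[0] for iv in intervals], intervals)
    return index

def find_interval(index, chrom, pos):
    """Return the (start, end, detail) containing pos, or None.

    Assumes intervals on one chromosome do not overlap, so only the last
    interval starting at or before pos can contain it.
    """
    if chrom not in index:
        return None
    starts, intervals = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval with start <= pos
    if i >= 0 and intervals[i][1] >= pos:
        return intervals[i]
    return None

# Sample rows from the question.
gtf_rows = [
    ("a1", 100, 400, "gene1"),
    ("a1", 401, 700, "gene2"),
    ("a2", 200, 500, "gene3"),
    ("a2", 501, 900, "gene4"),
]
index = build_index(gtf_rows)
for chrom, pos in [("a1", 456), ("a2", 789)]:
    hit = find_interval(index, chrom, pos)
    if hit is not None:
        print(chrom, pos, chrom, *hit)
```

Each lookup then costs a binary search rather than a full scan. If intervals can overlap, this sketch would miss some matches and an interval tree or a sweep over sorted data would be needed instead.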
