Finding matches between 2 files (how to improve efficiency)



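@file1 contains only Startpoint-Endpoint pairs, one pair per index. file2 is a text file, and in @file2 each index holds one line. I am trying to search @file2 line by line for each pair in @file1. Once an exact match is found, I want to extract information1 from file2 and print it. For now, though, I am only trying to find the matching pairs in file2. The pattern to match has the following format: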

Matching case:

$file1[0]

Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)

It is a match if file2 contains the following:

Line with other stuff
Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)
information1:
information2:
Lines with other stuff

Non-matching case:

From file1:

Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)

From file2:

Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /different endpoint pair/ (positive-triggered)
information1:
information2:

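The text of file2 is stored in @file2. For file1, I have successfully extracted each Startpoint and the Endpoint on the following line, and stored them in @file1 in the format shown above. (Extracting and storing the pairs is not a problem, so I won't show that code; it took about 4 minutes.) I then split each element of @file1 into its startpoint and endpoint and check file2 line by line. If the startpoint matches, I go on to check the endpoint on the next line. A pair counts as a match only when the line right after the startpoint matches the endpoint; otherwise the search continues until the last line of file2. The script does the job, but it takes three and a half hours to finish (file1 has about 60k pairs, and there are about 800k lines to check in file2). Is there a more efficient way to do this?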

I am new to Perl scripting, so I apologize for any silly mistakes in my explanation or my code. Here is the code:

#!/usr/bin/perl
use warnings;
my $report = '/home/dir/file2';
open ( $DATA, '<', $report ) || die "Error when opening";
chomp (@file2 = <$DATA>);
#No problem in extracting Start-Endpoint pair from file1 into @file1, so I wont include
#the code for this
$size = scalar @file1;
$size2 = scalar @file2;
for ( $total=0; $total<$size; $total++ ) {
    my @file1_split = split("\n", $file1[$total]);
    chomp @file1_split;
    my $match_endpoint = 0;
    my $split = 0;
    LABEL2: for ( $count=0; $count<$size2; $count++ ) {
        if ( $match_endpoint == 1 ) {
            if ( grep { $_ eq "$file1_split[$split]" } $file2[$count] ) {
                print "Pair($total):Match Pair\n";
                last LABEL2;    #move on to check next start-endpoint pair
            }
            else {
                $split = 0;     #reset back to check the same startpoint
                                #and continue searching until match found or end line of file2
                $match_endpoint = 0;
            }
        }
        elsif ( grep { $_ eq "$file1_split[$split]" } $file2[$count] ) {
            $match_endpoint = 1;    #enable search for endpoint in next line
            $split = 1;             #move on next line to match endpoint
            next;
        }
        elsif ( $count == $size2-1 ) {
            print "no matching found for Path($total)\n";
        }
    }
}

If I understand what your code is trying to do, it looks like this would be more efficient:

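# Assumes @file1 is a flat list: startpoint, endpoint, startpoint, endpoint, ...
my %split = @file1;                 # startpoint line => endpoint line expected next
my %total;
@total{@file1} = (0 .. $#file1);    # line => its index in @file1
my $split;                          # endpoint expected on the current line, if any
for ( @file2 ) {
    if ( $split ) {
        if ( $_ eq $split ) {
            print "Pair($total{$split}):Match Pair\n";
        } else {
            $split{$split} = "";    # endpoint did not follow: report it below
        }
    }
    $split = $split{$_};            # defined only if this line is a known startpoint
    delete $split{$_};              # a startpoint that was seen is no longer pending
}
# whatever is left in %split was never matched in @file2
for ( keys %split ) {
    print "no matching found for Path($total{$_})\n";
}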
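
The question ultimately wants information1 for each matched pair, which neither loop above prints yet. Here is a minimal sketch of one way to get it in a single pass over @file2. It is not taken from the original posts: `find_pairs` is a hypothetical helper, each element of @file1 is assumed to be "Startpoint line\nEndpoint line" (as the split in the question's code suggests), the lines of @file2 are assumed to be chomped with identical whitespace, and information1 is assumed to be the line directly after the Endpoint line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [pair_index, information1_line]
# for every pair in @$pairs found on two consecutive lines of @$lines.
sub find_pairs {
    my ($pairs, $lines) = @_;

    # Map "startpoint\nendpoint" => index of that pair in file1.
    my %index_of;
    for my $i (0 .. $#$pairs) {
        $index_of{ $pairs->[$i] } = $i;
    }

    my @found;
    for my $n (0 .. $#$lines - 1) {
        # Only lines that start a pair are worth a hash lookup.
        next unless $lines->[$n] =~ /^Startpoint:/;
        my $key = "$lines->[$n]\n$lines->[$n + 1]";
        next unless exists $index_of{$key};
        # Assumption: information1 is the line right after the Endpoint line.
        my $info = $n + 2 <= $#$lines ? $lines->[$n + 2] : '';
        push @found, [ $index_of{$key}, $info ];
    }
    return @found;
}

# Small usage example with inline data instead of real files.
my @demo_pairs = ("Startpoint: /source/in_out/map (positive-triggered)\n"
                . "Endpoint: /output/end/scan_all (positive-triggered)");
my @demo_lines = (
    "Line with other stuff",
    "Startpoint: /source/in_out/map (positive-triggered)",
    "Endpoint: /output/end/scan_all (positive-triggered)",
    "information1:",
    "information2:",
);
for my $hit (find_pairs(\@demo_pairs, \@demo_lines)) {
    printf "Pair(%d): %s\n", @$hit;
}
```

Each line of file2 costs at most one hash lookup, so the work grows with the size of file1 plus the size of file2 rather than with their product.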

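If I understand your spec (show the matches), I bet this will finish in under 5 seconds, unless you are using an old Dell D333. To reduce the response time further, you would need to write some extra code so that the while loop is driven by the hash with the fewest keys (you implied that is file1). If you use references to the hashes, you can write a small if-else statement that swaps the hash references, without duplicating the while loop code.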

use strict;
use warnings;

# Build a hash keyed on "startpoint::endpoint" from one report file.
sub makeHash {
    my ($filename) = @_;
    open(my $fh, '<', $filename) || die "Cannot open $filename: $!";
    my %result;
    my ($start, $line);
    while (<$fh>) {
        if ($_ =~ /^Startpoint: (.*)/) {
            $start = $1;    # captured group in regular expression
            $line = $.;     # current line number
        } elsif ($_ =~ /^Endpoint: (.*)/) {
            my $end = $1;
            # accept the Endpoint only if it directly follows its Startpoint
            if (defined $line && $. == ($line + 1)) {
                # braces keep Perl from parsing "$start::" as a package name
                my $key = "${start}::$end";
                # can distinguish start and end lines if necessary
                $result{$key} = {start=>$start, end=>$end, line=>$line};
            }
        }
    }
    close($fh);
    return %result;
}

my %file1 = makeHash("file1");
my %file2 = makeHash("file2");
my $fmt = "%10s %10s %s\n";
my $nmatches = 0;
printf $fmt, "File1", "File2", "Key";
while (my ($key, $f1h) = each %file1) {
    my $f2h = $file2{$key};
    if (defined $f2h) {
        # You have access to hash members start and end if you need to distinguish further
        printf $fmt, $f1h->{line}, $f2h->{line}, $key;
        $nmatches++;
    }
}
print "Found $nmatches matches\n";
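
The suggestion about driving the loop with the smaller hash and swapping hash references could be sketched roughly as follows. This is only an illustration under assumptions: `common_keys` is a hypothetical helper that does not appear in the original answer, and it returns just the shared keys, leaving the caller to look up the per-file details.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the keys present in both hashes, looping over
# whichever hash has fewer keys and probing the larger one.
sub common_keys {
    my ($first, $second) = @_;      # two hash references
    my ($small, $large) = ($first, $second);

    # Swap the references so that the loop below always walks the smaller hash.
    if (scalar(keys %$small) > scalar(keys %$large)) {
        ($small, $large) = ($large, $small);
    }

    my @common;
    for my $key (keys %$small) {
        push @common, $key if exists $large->{$key};
    }
    return @common;
}

# Usage with two small in-memory hashes standing in for %file1 and %file2.
my %demo1 = ('A::B' => { line => 1 }, 'C::D' => { line => 3 });
my %demo2 = ('C::D' => { line => 7 });
for my $key (common_keys(\%demo1, \%demo2)) {
    printf "%10s %10s %s\n", $demo1{$key}{line}, $demo2{$key}{line}, $key;
}
```

Because only the references are swapped, the loop body is written once, and the caller can still index the original hashes by key to tell the file1 entry from the file2 entry.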

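Below is my test data generator (thanks). I generated the worst case: 1,000,000 matches between two identical files. With this generated test data, the matching code above finished in 20 seconds on my MBP.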

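use strict;
use warnings;

# rndStr(N, LIST): build a string of N characters picked at random from LIST
sub rndStr { join'', @_[ map{ rand @_ } 1 .. shift ] }

open(my $f1, '>', "file1") || die "Cannot write file1: $!";
open(my $f2, '>', "file2") || die "Cannot write file2: $!";
for (1..1000000) {
    my $start = rndStr(30, 'A'..'Z');
    my $end = rndStr(30, 'A'..'Z');
    # write the same pair to both files so that every pair matches
    print $f1 "Startpoint: $start\n";
    print $f1 "Endpoint: $end\n";
    print $f2 "Startpoint: $start\n";
    print $f2 "Endpoint: $end\n";
}
close($f1);
close($f2);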
