Finding matches between 2 files (how to improve efficiency)

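@file1 contains only Startpoint-Endpoint pairs, one pair per index. file2 is a text file stored in @file2, one line per index. I am trying to search @file2 line by line for each pair in @file1. Once an exact match is found, I want to extract information1 from file2 and print it. For now, though, I am only trying to find the matching pairs in file2. The patterns to match look like this:

Matching case:

From `$file1[0]`: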

Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)

It is a match if `file2` contains the following:

Line with other stuff
Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)
information1:
information2:
Lines with other stuff

Non-matching case:

From file 1:

Startpoint: /source/in_out/map (positive-triggered) 
Endpoint: /output/end/scan_all (positive-triggered)

From file 2:

Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /different endpoint pair/ (positive-triggered)
information1:
information2:

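The text of file2 is stored in @file2. For file1, I have already extracted each Startpoint and the Endpoint on the following line and stored them in @file1 in the format shown above. (Extracting and storing the pairs works fine, so I won't show that code; it takes about 4 minutes.) I then split each element of @file1 into its Startpoint and Endpoint. I check file2 line by line: if the Startpoint matches, I check the Endpoint against the next line, and I count it as a match only when the line right after the Startpoint matches the Endpoint; otherwise I keep searching until the last line of file2. The script does the job, but it takes three and a half hours to finish (file1 has about 60k pairs, and there are about 800k lines to check in file2). Is there a more efficient way to do this?

I am new to Perl scripting, so I apologize for any silly mistakes in my explanation or my code. Here is the code: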

#!/usr/bin/perl
use warnings;
my $report = '/home/dir/file2';
open ( $DATA, $report ) || die "Error when opening";
chomp (@file2 = <$DATA>);
# No problem in extracting Start-Endpoint pair from file1 into @file1, so I won't
# include the code for this
$size = scalar @file1;
$size2 = scalar @file2;
for ( $total=0; $total<$size; $total++ ) {
    my @file1_split = split(/\n/, $file1[$total]);
    chomp @file1_split;
    my $match_endpoint = 0;
    my $split = 0;
    LABEL2: for ( $count=0; $count<$size2; $count++ ) {
        if ( $match_endpoint == 1 ) {
            if ( grep { $_ eq "$file1_split[$split]" } $file2[$count] ) {
                print "Pair($total):Match Pair\n";
                last LABEL2;         # move on to check next start-endpoint pair
            }
            else {
                # reset back to check the same startpoint and continue
                # searching until match found or end line of file2
                $split = 0;
                $match_endpoint = 0;
            }
        }
        elsif ( grep { $_ eq "$file1_split[$split]" } $file2[$count] ) {
            $match_endpoint = 1; # enable search for endpoint in next line
            $split = 1;          # move on next line to match endpoint
            next;
        }
        elsif ( $count == $size2-1 ) {
            print "no matching found for Path($total)\n";
        }
    }
}

If I understand what your code is trying to do, it looks like this would be more efficient:

# %split treats @file1 as a flat list: start, end, start, end, ...
my %split = @file1;
my %total;
@total{@file1} = (0 .. $#file1);
my $split;
for (@file2) {
    if ($split) {
        if ($_ eq $split) {
            print "Pair($total{$split}):Match Pair\n";
        } else {
            $split{$split} = "";
        }
    }
    $split = $split{$_};
    delete $split{$_};
}
for (keys %split) {
    print "no matching found for Path($total{$_})\n";
}

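If I understand your spec (show the matches), I bet the script below will finish in under 5 seconds, unless you are on an old Dell D333. To cut the run time further, you would need some extra code so that the while loop is driven by the hash with the fewest keys (you implied that is file1). If you use references to the hashes, a small if-else statement can swap the references, so you do not have to duplicate the while loop.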

use strict;
use warnings;

# Build a hash keyed by "start::end" for every Startpoint line that is
# immediately followed by an Endpoint line.
sub makeHash {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my %result;
    my ($start, $line);
    while (<$fh>) {
        if (/^Startpoint: (.*)/) {
            $start = $1;    # captured group in regular expression
            $line = $.;     # current line number
        } elsif (/^Endpoint: (.*)/) {
            my $end = $1;
            if (defined $line && $. == ($line + 1)) {
                # Braces are needed: "$start::$end" would be read as the
                # package variable $start:: followed by $end.
                my $key = "${start}::${end}";
                # can distinguish start and end lines if necessary
                $result{$key} = {start=>$start, end=>$end, line=>$line};
            }
        }
    }
    close($fh);
    return %result;
}

my %file1 = makeHash("file1");
my %file2 = makeHash("file2");
my $fmt = "%10s %10s %s\n";
my $nmatches = 0;
printf $fmt, "File1", "File2", "Key";
while (my ($key, $f1h) = each %file1) {
    my $f2h = $file2{$key};
    if (defined $f2h) {
        # You have access to hash members start and end if you need to distinguish further
        printf $fmt, $f1h->{line}, $f2h->{line}, $key;
        $nmatches++;
    }
}
print "Found $nmatches matches\n";
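
The idea of driving the loop from the smaller hash by swapping hash references could be sketched roughly as follows. This is only an illustration: `common_keys` is a hypothetical helper that does not appear in the answer, and the two small hashes are toy stand-ins for the `%file1` and `%file2` hashes that `makeHash` builds.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns the keys
# present in both hashes, always looping over whichever hash has fewer keys.
sub common_keys {
    my ($h1, $h2) = @_;                 # two hash references
    my ($small, $large) = ($h1, $h2);
    # Swap the references so that $small points at the hash with fewer keys.
    ($small, $large) = ($large, $small) if keys(%$small) > keys(%$large);
    # A single loop serves both cases; each exists() is one hash lookup.
    return grep { exists $large->{$_} } keys %$small;
}

# Toy data standing in for the hashes returned by makeHash (assumed shape).
my %file1 = ('a::b' => 1, 'c::d' => 2);
my %file2 = ('a::b' => 7, 'e::f' => 8, 'g::h' => 9);

my @common = common_keys(\%file1, \%file2);
print "$_\n" for sort @common;
```

Because only the references are swapped, the loop body is written once, and the number of iterations is bounded by the size of the smaller hash regardless of argument order.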

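Below is my test data generator (thanks). I generated the worst case of 1,000,000 matches between two identical files. With the generated test data, the makeHash matching code above finished in 20 seconds on my MBP.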

use strict;
use warnings;
sub rndStr { join '', @_[ map { rand @_ } 1 .. shift ] }
open(F1, ">file1") || die;
open(F2, ">file2") || die;
for (1..1000000) {
    my $start = rndStr(30, 'A'..'Z');
    my $end = rndStr(30, 'A'..'Z');
    print F1 "Startpoint: $start\n";
    print F1 "Endpoint: $end\n";
    print F2 "Startpoint: $start\n";
    print F2 "Endpoint: $end\n";
}
close(F1);
close(F2);
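
Since the question ultimately wants information1 for each matched pair, here is a rough sketch of how that could be picked up in the same single pass. It rests on several assumptions: each element of `@pairs` holds the Startpoint and Endpoint lines joined by a newline (as the question describes for @file1), the information1 line directly follows the Endpoint line (as in the sample), and trailing whitespace is not significant. `strip_trailing` and `extract_info` are hypothetical names, and the value after `information1:` is made up for the demo.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing blanks from every line of a string.
sub strip_trailing {
    my ($s) = @_;
    $s =~ s/[ \t\r]+$//mg;
    return $s;
}

# Hypothetical helper: maps each wanted pair found in the lines to the line
# that follows its Endpoint line. Both arguments are array references.
sub extract_info {
    my ($pairs, $lines) = @_;
    my %wanted;
    $wanted{ strip_trailing($_) } = 1 for @$pairs;
    my @clean = map { strip_trailing($_) } @$lines;
    my %info;
    for my $i (0 .. $#clean - 2) {
        # Two adjacent lines form a candidate key; one hash lookup replaces
        # the scan over all pairs. A repeated pair keeps its last occurrence.
        my $key = "$clean[$i]\n$clean[$i + 1]";
        $info{$key} = $clean[$i + 2] if $wanted{$key};
    }
    return %info;
}

# Demo data modelled on the sample in the question (the value is invented).
my @pairs = ("Startpoint: /source/in_out/map (positive-triggered)\n"
           . "Endpoint: /output/end/scan_all (positive-triggered)");
my @file2 = (
    'Line with other stuff',
    'Startpoint: /source/in_out/map (positive-triggered) ',
    'Endpoint: /output/end/scan_all (positive-triggered)',
    'information1: example value',
    'information2:',
);

my %found = extract_info(\@pairs, \@file2);
print "$found{$_}\n" for sort keys %found;
```

The work is one pass over file2 plus one hash lookup per line, instead of rescanning file2 for every pair. A pair whose Endpoint is the very last line of the file is not reported, because there is no following line to extract.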
