Optimizing a Perl script that correlates records between two files


open( FH, 'MAH' ) or die "$!";
while ( $lines = <FH> ) {
    $SSA = substr( $lines, 194, 9 );
    open( FH1, 'MAH2' ) or die "$!";
    while ( $array1 = <FH1> ) {
        @fieldnames = split( /\|/, $array1 );
        $SSA1       = $fieldnames[1];
        $report4    = $fieldnames[0];
        if ( $SSA =~ /$SSA1/ ) {
            $report5 = $report4;
        }
    }
}

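I am trying to extract the "SSA" value and search for it in the MAH2 file. If it is found, I return the "report4" value. I get the output, but processing takes a very long time. Is there any way to optimize the code so that it finishes quickly?

Each of my files has 300,000 records and is 15 MB in size. Processing currently takes 5 hours.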

Create a lookup table.

my $foo_qfn = 'MAH';
my $bar_qfn = 'MAH2';
# Build the lookup table: SSA (9 characters at offset 194) => line from MAH.
my %foos;
{
    open(my $fh, '<', $foo_qfn)
        or die("Can't open \"$foo_qfn\": $!\n");
    while ( my $foo_line = <$fh> ) {
        my $ssa = substr($foo_line, 194, 9);
        $foos{$ssa} = $foo_line;
    }
}
# Read MAH2 once and look up each SSA in the table.
{
    open(my $fh, '<', $bar_qfn)
        or die("Can't open \"$bar_qfn\": $!\n");
    while ( my $bar_line = <$fh> ) {
        chomp($bar_line);
        my ($report4, $ssa) = split(/\|/, $bar_line);
        my $foo_line = $foos{$ssa};
        ...
    }
}

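Your original code takes time proportional to the number of foos multiplied by the number of bars (O(N*M)).

This version takes time proportional to the number of foos plus the number of bars (O(N+M)), which is dominated by the larger of the two.

In other words, this should be over 100,000 times faster. We are talking seconds, not hours.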
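To make the difference concrete, here is a minimal sketch that counts the work done by each approach on small in-memory data. The sample keys and the sub names `nested_count` and `hash_count` are made up for illustration; they are not part of the original files or code.

```perl
use strict;
use warnings;

# Hypothetical sample data: 9-digit keys standing in for the SSA values
# found in each file. 300 keys per side, 150 of them shared.
my @foo_keys = map { sprintf "%09d", $_ } 1 .. 300;
my @bar_keys = map { sprintf "%09d", $_ } 151 .. 450;

# Nested loops: every foo key is compared against every bar key.
sub nested_count {
    my ($foos, $bars) = @_;
    my ($steps, $matches) = (0, 0);
    for my $foo (@$foos) {
        for my $bar (@$bars) {
            $steps++;
            $matches++ if $foo eq $bar;
        }
    }
    return ($steps, $matches);
}

# Lookup table: one pass to build the hash, one pass to probe it.
sub hash_count {
    my ($foos, $bars) = @_;
    my ($steps, $matches) = (0, 0);
    my %seen;
    for my $foo (@$foos) {
        $steps++;
        $seen{$foo} = 1;
    }
    for my $bar (@$bars) {
        $steps++;
        $matches++ if exists $seen{$bar};
    }
    return ($steps, $matches);
}

my ($n_steps, $n_matches) = nested_count(\@foo_keys, \@bar_keys);
my ($h_steps, $h_matches) = hash_count(\@foo_keys, \@bar_keys);
printf "nested: %d steps, %d matches\n", $n_steps, $n_matches;
printf "hash:   %d steps, %d matches\n", $h_steps, $h_matches;
```

With 300 keys on each side, the nested version performs 90,000 comparisons and the hash version 600 steps, and both find the same 150 matches. The gap grows with the input size, which is where the estimate for 300,000 records per file comes from.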

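If your task is simply to find the records in file2 that correspond to records in file1 via the SSA field, there is another approach that may be faster and simpler than the traditional lookup hash.

You can use a regex constructed from the records in file1 to parse, match, and extract from file2 in a single pass. Yes, Perl can handle a regex with 300,000 alternations! :) This is reasonable in Perl because its regex engine can build an alternation tree. (This applies to 5.10+; before that you can use Regexp::Assemble.)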

## YOUR CODE ##
open( FH, 'MAH' ) or die "$!";
while ( $lines = <FH> ) {
    $SSA = substr( $lines, 194, 9 );
    open( FH1, 'MAH2' ) or die "$!";
    while ( $array1 = <FH1> ) {
        @fieldnames = split( /\|/, $array1 );
        $SSA1       = $fieldnames[1];
        $report4    = $fieldnames[0];
        if ( $SSA =~ /$SSA1/ ) {
            $report5 = $report4;
        }
    }
}

The regex version:

our $file1 = "MAH";
our $file2 = "MAH2";
open our $fh1, "<", $file1 or die $!;
our $ssa_regex = "(?|" .
    join( "|",
        map join("", "^([^|]*)[|](", quotemeta($_), ")(?=[|])"),
        map substr( $_, 194, 9 ),
        <$fh1> ) .
    ")"
;
close $fh1;
open our $fh2, "<", $file2 or die $!;
our @ssa_matches = do { local $/; <$fh2> =~ m/$ssa_regex/mg; };
close $fh2;
undef $ssa_regex;
die "match array contains an odd number of entries??\n" if @ssa_matches % 2;
while (@ssa_matches) {
    my($report4, $SSA1) = splice @ssa_matches, 0, 2;
    ## do whatever with this information ##
}

Let's break it up with some comments.

Reading file1 and building the regex

our $file1 = "MAH";
our $file2 = "MAH2";
# Open file1 as normal.
open our $fh1, "<", $file1 or die $!;
# Build up a regular expression that will match all of the SSA fields.
our $ssa_regex =
    # Start the alternation reset group.  This way you always have $1
    # and $2 regardless of how many groups or total parens there are.
    "(?|" .
    # Join all the alternations together.
    join( "|",
        # Create one regex group that will match the beginning of the line,
        # the first "report4" field, the | delimiter, the SSA, and then
        # make sure the following character is the delimiter.  [|] is
        # another way to escape the | character that can be more clear
        # than \|.
        # Escape any weird characters in the SSA with quotemeta(). Omit
        # this if plain text.
        map join("", "^([^|]*)[|](", quotemeta($_), ")(?=[|])"),
        # Pull out the SSA value with substr().
        map substr( $_, 194, 9 ),
        # Read all the lines of file1 and feed them into the map pipeline.
        <$fh1> ) .
    # Add the closing parenthesis for the alternation reset group.
    ")"
;
# Close file1.
close $fh1;

Reading file2 and applying the regex

# Open file2 as normal.
open our $fh2, "<", $file2 or die $!;
# Read all of file2 and apply the regex to get an array of the wanted
# "report4" field and the matching SSA.
our @ssa_matches =
    # Using a do{} block lets us localize $/ inline.
    do {
        # Undefine $/, the input record separator, which lets us read
        # the entire file as a single string.
        local $/;
        # Read the file as a single string and apply the regex, doing a global
        # multiline match.  /m means to apply the ^ assertion at every line,
        # not just at the beginning of the string.  /g means to perform and
        # return all of the matches at once.
        <$fh2> =~ m/$ssa_regex/mg;
    };
# Close file2 as normal.
close $fh2;
# Clear the memory for the regex if we don't need it anymore.
undef $ssa_regex;
# Make sure we got pairs.
die "match array contains an odd number of entries??\n" if @ssa_matches % 2;
# Now just iterate through @ssa_matches two at a time to do whatever
# you wanted to do with the matched SSA values and that "report4"
# field.  Why is it report4 if it's the first field?
while (@ssa_matches) {
    # Use splice() to pull out and remove the two values from @ssa_matches.
    my($report4, $SSA1) = splice @ssa_matches, 0, 2;
    ## do whatever with this information ##
}

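If we are being pedantic, the regex can be made a bit more compact by factoring out the parts common to every alternative.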

our $ssa_regex = "^([^|]*)[|](" .
    join( "|",
        map quotemeta($_),
        map substr( $_, 194, 9 ),
        <$fh1> ) .
    ")(?=[|])"
;

I make no guarantee that this approach is better or faster than the others, but it is a way to get the job done in fewer steps.
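As a quick check of how the compact form behaves, here is a minimal sketch that applies it to a few in-memory lines instead of real files. The sample SSA values and the contents of `$file2_text` are invented for illustration; only the shape of the regex comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical SSA values, standing in for substr($_, 194, 9) of each
# line of file1.
my @ssas = ('111111111', '333333333');

# Hypothetical file2 contents in the form report4|SSA|other fields.
my $file2_text = join "",
    "R1|111111111|x\n",
    "R2|222222222|y\n",
    "R3|333333333|z\n";

# Build the compact regex: one shared prefix, one alternation of SSAs,
# one shared lookahead for the trailing delimiter.
my $alternation = join "|", map { quotemeta } @ssas;
my $ssa_regex   = qr/^([^|]*)[|]($alternation)(?=[|])/m;

# A global match in list context returns every ($1, $2) pair in order.
my @pairs = $file2_text =~ /$ssa_regex/g;

while (@pairs) {
    my ($report4, $ssa1) = splice @pairs, 0, 2;
    print "$report4 => $ssa1\n";
}
```

Only the R1 and R3 lines are reported, because 222222222 is not among the alternatives.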

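ikegami has already pointed out the better approach of storing one file as a lookup table. Still, allow me to offer a few observations of my own, which may also be applicable.

With this expression, $SSA1 is treated as a regular expression: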

$SSA =~ /$SSA1/

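I find it rare for regular expressions to be stored in a file. Perhaps you meant to do a substring search rather than treat $SSA1 as a regex? If that is the case, this could be: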

index($SSA, $SSA1) >= 0
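The two tests can disagree whenever the stored value contains a regex metacharacter. A minimal sketch with invented values (the `1.3` below is hypothetical, chosen only because `.` is a metacharacter):

```perl
use strict;
use warnings;

# Hypothetical values: the "." in $SSA1 is a regex metacharacter.
my $SSA  = '123456789';
my $SSA1 = '1.3';

# As a regex, "." matches any character, so "1.3" matches "123".
my $regex_hit  = ( $SSA =~ /$SSA1/ )         ? 1 : 0;
# With \Q...\E the metacharacters are escaped and treated literally.
my $quoted_hit = ( $SSA =~ /\Q$SSA1\E/ )     ? 1 : 0;
# index() always does a plain substring search.
my $index_hit  = ( index($SSA, $SSA1) >= 0 ) ? 1 : 0;

print "regex: $regex_hit, quoted regex: $quoted_hit, index: $index_hit\n";
```

Here the unquoted regex reports a match while the quoted regex and index() do not, so switching to index() only changes results for data that contains such characters.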

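OTOH, in that same if statement, the reaction to a successful match is:

$report5 = $report4

When there are several successful matches within the same inner loop, this statement is executed several times, which means that $report5 ends up holding whatever corresponds to the last match.

If there can be at most one match in MAH2, perhaps add a 'last' to leave the inner loop.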

if ( index($SSA, $SSA1) >= 0 ) {
    $report5 = $report4;
    last;
}

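Depending on where the match sits in MAH2, this might save some time. Note, though, that it stops the loop at the first match rather than the last one, so it is not a drop-in replacement for your original code. If it still serves your purpose, perhaps use it.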
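The difference between stopping at the first match and running through to the last can be seen in a small sketch. The rows and the sub name `find_report` are hypothetical, chosen so that the same SSA appears twice:

```perl
use strict;
use warnings;

# Hypothetical MAH2-style rows in the form report4|SSA, where the
# same SSA occurs on the first and the third row.
my @rows = ( "A|123\n", "B|999\n", "C|123\n" );

# Scan the rows for $ssa; optionally leave the loop at the first match.
sub find_report {
    my ($ssa, $stop_at_first, @lines) = @_;
    my $report5;
    for my $line (@lines) {
        chomp( my $copy = $line );
        my ($report4, $ssa1) = split /\|/, $copy;
        if ( index($ssa, $ssa1) >= 0 ) {
            $report5 = $report4;
            last if $stop_at_first;
        }
    }
    return $report5;
}

my $without_last = find_report( '123', 0, @rows );   # keeps the last match
my $with_last    = find_report( '123', 1, @rows );   # keeps the first match
print "without last: $without_last, with last: $with_last\n";
```

Without `last` the result is C, with `last` it is A, so the two variants are only interchangeable when each SSA can match at most one row.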

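However, in the snippet as given, $report5 is only used once, which means that only one match really matters out of all the 90 billion (300,000 × 300,000) iterations we perform. Perhaps leaving the outer loop as well would make sense (again, this may not be what you want).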
