I need to join two large files. The task is similar to what the join utility does, and while playing with my own algorithm I realized it is very close to this question: I've found that coreutils' join is much faster (though it doesn't do exactly what I need). However, after looking at this piece of perl code, which implements join in perl, the code seems very similar to the C code of coreutils' join. Since the logic looks essentially the same, I don't understand why the perl code is so much slower.
Some tests:
time join -j 2 file1.txt file2.txt| mbuffer > result.c
in @ 20.0 MiB/s, out @ 20.0 MiB/s, 86.0 MiB total, buffer 0% full
summary: 95.6 MiByte in 5.9sec - average of 16.2 MiB/s
join -j 2 file1.txt file2.txt 4.99s user 0.09s system 99% cpu 5.091 total
time perl ./join -j 1 file1.txt file2.txt| mbuffer > result.perl
in @ 4092 kiB/s, out @ 4092 kiB/s, 94.0 MiB total, buffer 0% full
summary: 95.6 MiByte in 44.5sec - average of 2202 kiB/s
perl ./join -j 1 file1.txt file2.txt 44.15s user 0.08s system 99% cpu 44.226 total
Any hints on how to improve performance? I suspect it might be related to buffering.
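On the buffering theory, one thing worth trying is to read the input in large blocks with sysread and carve lines out manually, instead of issuing one readline per line. This is only a sketch; the sub name, buffer handling, and chunk size are my own choices, not part of the code above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of block-wise reading: pull 1 MiB at a time with sysread and
# split it into lines with index/substr. Leftover bytes (a partial
# line at the end of a chunk) stay in $$bufref for the next call.
sub read_chunked {
    my ($fh, $bufref, $linesref) = @_;
    my $read = sysread($fh, my $chunk, 1 << 20);   # 1 MiB per syscall
    return 0 unless $read;                         # 0 or undef => done
    $$bufref .= $chunk;
    while ((my $i = index($$bufref, "\n")) >= 0) {
        push @$linesref, substr($$bufref, 0, $i);  # line without "\n"
        substr($$bufref, 0, $i + 1, '');           # drop it from buffer
    }
    return 1;
}
```

Whether this beats perl's own PerlIO buffering is something to benchmark, not assume; it mainly removes the per-line readline and chomp work.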
According to NYTProf, the bottleneck is the get_a_line() sub:
# spent 59.4s (57.3+2.12) within main::get_a_line which was called 25884852 times, avg 2µs/call:
# 15574181 times (34.4s+1.27s) by main::RUNTIME at line 69, avg 2µs/call
# 10309895 times (22.9s+857ms) by main::RUNTIME at line 75, avg 2µs/call
# 478 times (868µs+42µs) by main::RUNTIME at line 85, avg 2µs/call
# 296 times (533µs+30µs) by main::RUNTIME at line 80, avg 2µs/call
# once (12µs+26µs) by main::RUNTIME at line 59
# once (4µs+7µs) by main::RUNTIME at line 60
sub get_a_line {
121 25884852 2.60s my ($aref, $fh) = @_;
122 25884852 29.1s 25884852 2.12s my $not_eof = defined(my $line = <$fh>);
# spent 2.12s making 25884852 calls to main::CORE:readline, avg 82ns/call
123 25884852 5.50s if ($not_eof) {
124 25884851 1.92s chomp $line;
125 25884851 12.5s push (@$aref,
126 defined $delimiter ?
127 [split $delimiter, $line, -1] : [split ' ', $line, -1]);
128 }
129 1 300ns else { push @$aref, undef }
130 25884852 27.4s return $not_eof;
131 }
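Since the profile shows ~26 million calls to get_a_line at ~2µs each, a large share of the cost is per-line sub-call and push overhead. One hedged idea is to fetch a batch of lines per call so that overhead is amortized; the sub name and default batch size below are my own invention, not part of the original script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Variant of get_a_line that reads up to $batch lines per call,
# keeping the same per-line work (chomp + split) but cutting the
# number of sub calls by a factor of $batch. A trailing undef marks
# EOF, mirroring the original get_a_line.
sub get_lines_batch {
    my ($aref, $fh, $batch) = @_;
    $batch //= 1000;
    my $n = 0;
    while ($n < $batch and defined(my $line = <$fh>)) {
        chomp $line;
        push @$aref, [split ' ', $line, -1];
        $n++;
    }
    push @$aref, undef if $n < $batch;   # fewer than $batch => hit EOF
    return $n;                           # lines actually read
}
```

The caller then drains the array instead of asking for one line at a time; the split itself still dominates if the fields aren't all needed.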
Following the suggestions, I put together a minimal test case and got much better results:
#!/usr/bin/perl
use strict;
use warnings;
for (@ARGV) { die "need 2 files" if ! -e };
my ($file1,$file2) = @ARGV;
sub read_f {
my $file = shift;
open my $fh, ($file =~ m/\.gz/ ? "gzip -fdc $file|mbuffer|" : $file)
or die "Can't open file $file: $!\n";
return $fh;
}
sub read_file_line {
my $fh = shift;
if ($fh and my $line = readline $fh) {
return $line;
}
return;
}
my $f1 = read_f($file1);
my $f2 = read_f($file2);
my $line1 = read_file_line($f1);
my $line2 = read_file_line($f2);
while ($line1 or $line2) {
$line1 = read_file_line($f1);
$line2 = read_file_line($f2);
}
Results (mbuffer measures slightly different things for the perl and C runs, reads vs writes, but I think the overall picture is clear enough):
time perl ./testcase.pl sample1.gz sample2.gz
in @ 0.0 kiB/s, out @ 20.0 MiB/s, 164 MiB total, buffer 2% full
summary: 172 MiByte in 8.5sec - average of 20.2 MiB/s
in @ 0.0 kiB/s, out @ 12.0 MiB/s, 202 MiB total, buffer 1% full
summary: 207 MiByte in 11.4sec - average of 18.1 MiB/s
perl ./testcase.pl sample1.gz sample2.gz 12.75s user 2.23s system 130% cpu 11.462 tota
time join -j 2 -a 1 -a 2 <(zcat sample1.gz) <(zcat sample2.gz) | mbuffer > /dev/null
in @ 40.0 MiB/s, out @ 40.0 MiB/s, 326 MiB total, buffer 0% full
summary: 330 MiByte in 6.8sec - average of 48.6 MiB/s
join -j 2 -a 1 -a 2 <(zcat sample1.gz) 6.55s user 0.24s system 99% cpu 6.789 total
mbuffer > /dev/null 0.04s user 0.44s system 7% cpu 6.799 total
So it seems the extra operations (split, join, calling subs, and passing data back and forth) add a lot of time, and since the data set is so large, it all adds up...
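To illustrate that conclusion, this is roughly what an inlined merge loop could look like: no helper subs in the hot path and at most one key extraction per line. It only handles the 1:1 case on inputs already sorted on the join field, and skips join(1)'s -a/-j options entirely, so it is a sketch rather than a drop-in replacement:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hedged sketch of an inlined 1:1 merge join on the first whitespace
# field of two sorted inputs. Duplicate keys and unpairable-line
# output are deliberately left out to keep the hot loop minimal.
sub merge_join {
    my ($fa, $fb) = @_;
    my @out;
    my $la = <$fa>;
    my $lb = <$fb>;
    while (defined $la and defined $lb) {
        my ($ka) = $la =~ /^(\S+)/;
        my ($kb) = $lb =~ /^(\S+)/;
        if    ($ka lt $kb) { $la = <$fa> }    # a is behind: advance a
        elsif ($ka gt $kb) { $lb = <$fb> }    # b is behind: advance b
        else {                                # keys match: emit a pair
            chomp(my $ra = $la);
            (my $rb = $lb) =~ s/^\S+\s*//;    # drop b's key field
            chomp $rb;
            push @out, "$ra $rb";
            $la = <$fa>;
            $lb = <$fb>;
        }
    }
    return @out;
}
```

The point is not that this matches coreutils' speed, but that it keeps the per-line work down to one regex match and one string append, which is the direction the profiling above suggests.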