I need to join two large files. The task is similar to what the join utility does, and while playing with my own algorithm I realized it is very close to this question: I've found that coreutils' join is much faster (though it doesn't do exactly what I need). However, after looking at this piece of perl code, which implements join in perl, the code seems very similar to the C code of coreutils' join. Since the logic looks essentially the same, I don't understand why the perl code is so much slower.
Some tests:
time join -j 2 file1.txt file2.txt| mbuffer > result.c
in @ 20.0 MiB/s, out @ 20.0 MiB/s, 86.0 MiB total, buffer 0% full
summary: 95.6 MiByte in 5.9sec - average of 16.2 MiB/s
join -j 2 file1.txt file2.txt 4.99s user 0.09s system 99% cpu 5.091 total
time perl ./join -j 1 file1.txt file2.txt| mbuffer > result.perl
in @ 4092 kiB/s, out @ 4092 kiB/s, 94.0 MiB total, buffer 0% full
summary: 95.6 MiByte in 44.5sec - average of 2202 kiB/s
perl ./join -j 1 file1.txt file2.txt 44.15s user 0.08s system 99% cpu 44.226 total
Any hints on how to improve performance? I suspect it might be related to buffering.
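On the buffering theory, one thing worth trying is to read the input in large blocks with sysread and carve lines out manually, instead of issuing one readline per line. This is only a sketch; the sub name, buffer handling, and chunk size are my own choices, not part of the code above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of block-wise reading: pull 1 MiB at a time with sysread and
# split it into lines with index/substr. Leftover bytes (a partial
# line at the end of a chunk) stay in $$bufref for the next call.
sub read_chunked {
    my ($fh, $bufref, $linesref) = @_;
    my $read = sysread($fh, my $chunk, 1 << 20);   # 1 MiB per syscall
    return 0 unless $read;                         # 0 or undef => done
    $$bufref .= $chunk;
    while ((my $i = index($$bufref, "\n")) >= 0) {
        push @$linesref, substr($$bufref, 0, $i);  # line without "\n"
        substr($$bufref, 0, $i + 1, '');           # drop it from buffer
    }
    return 1;
}
```

Whether this beats perl's own PerlIO buffering is something to benchmark, not assume; it mainly removes the per-line readline and chomp work.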
According to NYTProf, the bottleneck is the get_a_line() sub:
# spent 59.4s (57.3+2.12) within main::get_a_line which was called 25884852 times, avg 2µs/call:
# 15574181 times (34.4s+1.27s) by main::RUNTIME at line 69, avg 2µs/call
# 10309895 times (22.9s+857ms) by main::RUNTIME at line 75, avg 2µs/call
# 478 times (868µs+42µs) by main::RUNTIME at line 85, avg 2µs/call
# 296 times (533µs+30µs) by main::RUNTIME at line 80, avg 2µs/call
# once (12µs+26µs) by main::RUNTIME at line 59
# once (4µs+7µs) by main::RUNTIME at line 60
sub get_a_line {
121 25884852 2.60s my ($aref, $fh) = @_;
122 25884852 29.1s 25884852 2.12s my $not_eof = defined(my $line = <$fh>);
# spent 2.12s making 25884852 calls to main::CORE:readline, avg 82ns/call
123 25884852 5.50s if ($not_eof) {
124 25884851 1.92s chomp $line;
125 25884851 12.5s push (@$aref,
126 defined $delimiter ?
127 [split $delimiter, $line, -1] : [split ' ', $line, -1]);
128 }
129 1 300ns else { push @$aref, undef }
130 25884852 27.4s return $not_eof;
131 }
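Since the profile shows ~26 million calls to get_a_line at ~2µs each, a large share of the cost is per-line sub-call and push overhead. One hedged idea is to fetch a batch of lines per call so that overhead is amortized; the sub name and default batch size below are my own invention, not part of the original script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Variant of get_a_line that reads up to $batch lines per call,
# keeping the same per-line work (chomp + split) but cutting the
# number of sub calls by a factor of $batch. A trailing undef marks
# EOF, mirroring the original get_a_line.
sub get_lines_batch {
    my ($aref, $fh, $batch) = @_;
    $batch //= 1000;
    my $n = 0;
    while ($n < $batch and defined(my $line = <$fh>)) {
        chomp $line;
        push @$aref, [split ' ', $line, -1];
        $n++;
    }
    push @$aref, undef if $n < $batch;   # fewer than $batch => hit EOF
    return $n;                           # lines actually read
}
```

The caller then drains the array instead of asking for one line at a time; the split itself still dominates if the fields aren't all needed.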
Following the suggestions, I put together a minimal test case and got much better results:
#!/usr/bin/perl
use strict;
use warnings;
for (@ARGV) { die "need 2 files" if ! -e };
my ($file1,$file2) = @ARGV;
sub read_f {
my $file = shift;
open my $fh, ($file =~ m/\.gz/ ? "gzip -fdc $file|mbuffer|" : $file)
or die "Can't open file $file: $!\n";
return $fh;
}
sub read_file_line {
my $fh = shift;
if ($fh and my $line = readline $fh) {
return $line;
}
return;
}
my $f1 = read_f($file1);
my $f2 = read_f($file2);
my $line1 = read_file_line($f1);
my $line2 = read_file_line($f2);
while ($line1 or $line2) {
$line1 = read_file_line($f1);
$line2 = read_file_line($f2);
}
Results (mbuffer measures slightly different things for the perl and C runs, reads vs writes, but I think the overall picture is clear enough):
time perl ./testcase.pl sample1.gz sample2.gz
in @ 0.0 kiB/s, out @ 20.0 MiB/s, 164 MiB total, buffer 2% full
summary: 172 MiByte in 8.5sec - average of 20.2 MiB/s
in @ 0.0 kiB/s, out @ 12.0 MiB/s, 202 MiB total, buffer 1% full
summary: 207 MiByte in 11.4sec - average of 18.1 MiB/s
perl ./testcase.pl sample1.gz sample2.gz 12.75s user 2.23s system 130% cpu 11.462 tota
time join -j 2 -a 1 -a 2 <(zcat sample1.gz) <(zcat sample2.gz) | mbuffer > /dev/null
in @ 40.0 MiB/s, out @ 40.0 MiB/s, 326 MiB total, buffer 0% full
summary: 330 MiByte in 6.8sec - average of 48.6 MiB/s
join -j 2 -a 1 -a 2 <(zcat sample1.gz) 6.55s user 0.24s system 99% cpu 6.789 total
mbuffer > /dev/null 0.04s user 0.44s system 7% cpu 6.799 total
So it seems the extra operations (split, join, calling subs, and passing data back and forth) add a lot of time, and since the data set is so large, it all adds up...
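To illustrate that conclusion, this is roughly what an inlined merge loop could look like: no helper subs in the hot path and at most one key extraction per line. It only handles the 1:1 case on inputs already sorted on the join field, and skips join(1)'s -a/-j options entirely, so it is a sketch rather than a drop-in replacement:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hedged sketch of an inlined 1:1 merge join on the first whitespace
# field of two sorted inputs. Duplicate keys and unpairable-line
# output are deliberately left out to keep the hot loop minimal.
sub merge_join {
    my ($fa, $fb) = @_;
    my @out;
    my $la = <$fa>;
    my $lb = <$fb>;
    while (defined $la and defined $lb) {
        my ($ka) = $la =~ /^(\S+)/;
        my ($kb) = $lb =~ /^(\S+)/;
        if    ($ka lt $kb) { $la = <$fa> }    # a is behind: advance a
        elsif ($ka gt $kb) { $lb = <$fb> }    # b is behind: advance b
        else {                                # keys match: emit a pair
            chomp(my $ra = $la);
            (my $rb = $lb) =~ s/^\S+\s*//;    # drop b's key field
            chomp $rb;
            push @out, "$ra $rb";
            $la = <$fa>;
            $lb = <$fb>;
        }
    }
    return @out;
}
```

The point is not that this matches coreutils' speed, but that it keeps the per-line work down to one regex match and one string append, which is the direction the profiling above suggests.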