Processing a file with 1.5 billion lines using Perl



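I need to process a 300 GB file with 1.5 billion entries and 11 columns, and I need to extract some information from each line. I have split the problem into two parts. First, I read the file and reduce it to the required columns, with some filtering. This produces a so-called reduced file of about 100 GB with 700 million entries.

Then I read the reduced file and process each entry. Here is my code snippet: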

while (my $line = <$fh_file_vss_rlrp_inst>) {
    chomp $line;
    my @columns_line = split('\s+', $line);
    if( scalar @columns_line == 10 && $columns_line[-1] !~ /X|Y|Z|K/ ) {
        print $fh_reduced "$columns_line[1] $columns_line[7] $columns_line[-1]\n";
    }
    $inst_count++;
    if($inst_count % 10000000 == 0) {
        $date = `date`;
        print "Processed $inst_count ... time: $date";
    }
}

Now I read the reduced file for further processing. I have %h_kI_vB in memory, which contains 100,000 entries:

while (my $line = <$fh_file_vss_rlrp_inst>) {
    chomp $line;
    my @columns_line = split('\s+', $line);
    my $wip = $columns_line[0];
    my $this_inst = $columns_line[0];
    my @loc_xyz;
    my @loc_ijk;
    while (grep(/\//, $wip)){
        if (exists($h_kI_vB{$wip})){
            push(@loc_xyz, $h_kI_vB{$wip});
        }
        $wip =~ s/\/[^\/]+$//;
    }
    if (exists($h_kI_vB{$wip})){
        push(@loc_xyz, $h_kI_vB{$wip});
    }

    if (@loc_xyz){
        my $loc_block_string;
        foreach my $loc ( @loc_xyz){
            if (not defined $loc_block_string){
                $loc_block_string = "$loc";
            } else {
                $loc_block_string = $loc_block_string.":$loc";
            }
        }
        $hierarchical_blocks{$this_inst}=$loc_block_string;
    }
    if(exists $hierarchical_blocks{$this_inst}) {
        my @x_y_layer = split(',', $columns_line[1]);
        $x_y_layer[0] =~ s/\(//;
        if(! defined $H_instBlock_coordinates{ $hierarchical_blocks{$this_inst} } ) {
            my @initial_coordinate = ($x_y_layer[0], $x_y_layer[1], $x_y_layer[0], $x_y_layer[1]);
            print $fh_log "For inst:$this_inst $hierarchical_blocks{$this_inst}: @initial_coordinate\n" ;
            @{ $H_instBlock_coordinates{$hierarchical_blocks{$this_inst}} } = @initial_coordinate;
        } else {
            my @old_x1_y1_x2_y2 =  @{ $H_instBlock_coordinates{$hierarchical_blocks{$this_inst}} };
            my $x1 = $old_x1_y1_x2_y2[0];
            my $y1 = $old_x1_y1_x2_y2[1];
            my $x2 = $old_x1_y1_x2_y2[2];
            my $y2 = $old_x1_y1_x2_y2[3];
            if($x_y_layer[0] < $x1) {
                $x1 = $x_y_layer[0];
            } elsif ($x_y_layer[0] > $x2) {
                $x2 = $x_y_layer[0];
            }

            if($x_y_layer[1] < $y1) {
                $y1 = $x_y_layer[1];
            } elsif ($x_y_layer[1] > $y2) {
                $y2 = $x_y_layer[1];
            }

            my @new_x1_y1_x2_y2 = ($x1, $y1, $x2, $y2);

            print $fh_log "For inst:$this_inst Changing coordinate to block $hierarchical_blocks{$this_inst}: @new_x1_y1_x2_y2\n" ;
            @{ $H_instBlock_coordinates{$hierarchical_blocks{$this_inst}} } = @new_x1_y1_x2_y2;
        }
    }

    $inst_count++;
    if($inst_count % 1000000 == 0) {
        $date = `date`;
        print "Processed $inst_count ...time: $date\n";
    }
}

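This is far too slow. I dispatched the job to a remote server with 450 GB of memory, and it runs for 24 hours. I need to optimize the code so that it finishes within 1 hour (worst case).

Thanks in advance.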

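Real performance gains come from fundamentally changing the approach. Since you already have a simple O(N) algorithm, nothing really stands out in that department.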

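Below, I have cleaned up your code and added a few micro-optimizations, but I doubt they will have much of an impact. The reason is that your code is already fast. You say it takes 24 hours to process 700,000,000 lines, which works out to only about 120 μs per line. That is reasonable.

  24 hours / 700,000,000 lines
* 60 minutes/hour
* 60 s/minute
≈ 1.2 * 10^(-4) s/line
= 120 μs/line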
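
To put the one-hour goal in perspective, the same arithmetic can be run for both time budgets. This is only a small sketch: the helper name us_per_line is made up for illustration, and the figures are the ones given in the question (700 million lines, 24 hours now, 1 hour wanted).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: time budget per line in microseconds,
# given a wall-clock budget in hours and a number of lines.
sub us_per_line {
    my ($hours, $lines) = @_;
    return $hours * 3600 / $lines * 1e6;
}

my $lines = 700_000_000;    # size of the reduced file, from the question

printf "current (24 h): %.0f us/line\n", us_per_line(24, $lines);
printf "target  (1 h):  %.1f us/line\n", us_per_line(1,  $lines);
```

The current run spends roughly 123 μs per line, while a one-hour run leaves only about 5 μs per line. A single process doing a split and a few hash lookups per line is unlikely to reach that, which is why the rest of this answer looks at parallelization.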

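You might benefit from parallelization. For example, nothing prevents the second program from running at the same time as the first, using two cores instead of one. It would look like this: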

./prog_a | ./prog_b

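What makes it complicated to parallelize the process further is that processing a line depends on the results of processing earlier lines.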

Still, it could be advantageous to move the computation of $block from prog_b to prog_a, or maybe even to create an intermediate stage in the pipeline.

./prog_a | ./prog_i | ./prog_b

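It is up to you to find the balance between the number of stages and the work done in each stage that gives you the best results. For example, I guessed that parsing the coordinate field of the original file ($fields[7]) would be better done in prog_a, so I moved it from prog_b to prog_a in the version posted below.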

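But the work is still mostly sequential. The next step would be to divide the work that uses $block among multiple cores. This can be done as long as lines with the same $block value always end up being handled by the same instance. I leave this to you.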
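
One possible way to satisfy that constraint is to derive the worker index from a stable checksum of the block key, so that a given block always goes to the same worker. The following is a minimal single-process sketch of the idea, not a finished implementation. The helpers worker_for and update_bbox and the sample records are made up for illustration, and in a real setup each bucket would be a pipe to a separate process rather than an in-memory hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map a block key to a worker index in 0 .. $n_workers-1.
# unpack("%32C*", ...) computes a simple 32-bit byte checksum; any stable
# hash function would do, as long as equal keys give equal results.
sub worker_for {
    my ($block, $n_workers) = @_;
    return unpack("%32C*", $block) % $n_workers;
}

# Hypothetical helper: grow the bounding box [x1, y1, x2, y2] stored under
# $boxes->{$block} so that it includes the point ($x, $y).
sub update_bbox {
    my ($boxes, $block, $x, $y) = @_;
    my $c = $boxes->{$block} //= [ $x, $y, $x, $y ];
    $c->[0] = $x if $x < $c->[0];
    $c->[2] = $x if $x > $c->[2];
    $c->[1] = $y if $y < $c->[1];
    $c->[3] = $y if $y > $c->[3];
    return $c;
}

# Made-up sample records: [ block, x, y ].
my @records = (
    [ "d:e", 5, 7 ], [ "f", 1, 1 ], [ "d:e", 2, 9 ], [ "f", 4, 0 ],
);

# One private bounding-box table per (simulated) worker.
my $n_workers = 2;
my @boxes_of_worker = map { +{} } 1 .. $n_workers;

for my $rec (@records) {
    my ($block, $x, $y) = @$rec;
    my $w = worker_for($block, $n_workers);
    update_bbox($boxes_of_worker[$w], $block, $x, $y);
}

for my $w (0 .. $#boxes_of_worker) {
    for my $block (sort keys %{ $boxes_of_worker[$w] }) {
        print "worker $w: $block => @{ $boxes_of_worker[$w]{$block} }\n";
    }
}
```

Because no block is ever split across workers, the per-worker tables can simply be concatenated at the end, with no merge step.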

#!/usr/bin/perl
# prog_a: filters the original file and prints "instance x y" lines to STDOUT.

use strict;
use warnings;

while (<>) {
    my @fields = split;

    # Using a lookup hash would be faster if you know
    # the specific values C<< $fields[9] >> can take.
    # Something like the following before the loop:
    # C<< my %skip = map { $_ => 1 } qw( X Y Z K ); >>.
    # Then, you'd use C<< !$skip{$fields[9]} >>
    # instead of a regex match.
    if (@fields == 10 && $fields[9] !~ /[XYZK]/) {
        my ($x, $y) = split(/,/, substr($fields[7], 1));
        print "$fields[1] $x $y\n";
    }

    if ($. % 10_000_000 == 0) {
        my $ts = localtime();
        print STDERR "[$ts] prog_a processed $. lines.\n";
    }
}

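
The lookup-hash suggestion in the comments of prog_a can be sketched as follows. Note that the two filters are not strictly equivalent: the regex rejects any value that contains one of the letters, whereas the hash only rejects values that are exactly X, Y, Z or K. Whether that difference matters depends on the data, which is not shown in the question. The helper names and sample values here are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Values to skip, built once before the main loop.
my %skip = map { $_ => 1 } qw( X Y Z K );

# Hypothetical helpers wrapping the two variants of the filter.
sub keep_by_regex { my ($v) = @_; return $v !~ /[XYZK]/ ? 1 : 0; }
sub keep_by_hash  { my ($v) = @_; return $skip{$v}      ? 0 : 1; }

# Made-up sample values for the last field.
for my $value (qw( A X AX )) {
    printf "%-2s regex keeps: %d, hash keeps: %d\n",
        $value, keep_by_regex($value), keep_by_hash($value);
}
```

For the value AX the two variants disagree, so the hash should only be swapped in if the field is known to hold single, exact codes.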
#!/usr/bin/perl
# prog_b: reads the "instance x y" lines from prog_a and tracks a bounding box per block.

use strict;
use warnings;

# Should be safe to use since it's deemed safe enough to be enabled by default in Perl 7.
# Allows us to cleanly avoid repeatedly doing the same hash lookup.
# We could use a reference instead of an alias by replacing
#    \my $coords = \$H_instBlock_coordinates{$block};
# with
#    my $coords_ref = \$H_instBlock_coordinates{$block};
# and changing all other instances of
#    $coords
# with
#    ${$coords_ref}
# But this would be a lot more noisy.
use experimental qw( refaliasing );

my %h_kI_vB = ...;
my $fh_log = ...;   # Placeholder, like %h_kI_vB above: an already-opened log file handle.
my %hierarchical_blocks;
my %H_instBlock_coordinates;

while (<>) {
    my ($this_inst, $x, $y) = split;

    # C<< $this_inst >> contains something like C<< a/b/c >>
    my @loc_xyz;
    {
        my $wip = $this_inst;
        while (1) {
            # If an existing C<< $h_kI_vB{$wip} >> won't ever be a false value
            # (zero or an empty string), replacing C<< exists($h_kI_vB{$wip}) >>
            # with C<< $h_kI_vB{$wip} >> would be a tiny tiny bit faster.
            if (exists($h_kI_vB{$wip})) {
                push(@loc_xyz, $h_kI_vB{$wip});
            }

            # The regex engine is pretty heavy, so while the
            # remainder of the loop could be replaced with
            # C<< $wip =~ s{/[^/]*\z}{} or last; >>, it
            # probably wouldn't be as fast.
            ( my $i = rindex($wip, "/") ) >= 0
                or last;

            substr($wip, $i, length($wip), "");
        }
    }

    if (@loc_xyz) {
        my $block = join(":", @loc_xyz);
        $hierarchical_blocks{$this_inst} = $block;

        # C<< $block >> contains something like C<< d:e:f >>.
        # It may have fewer parts than C<< $this_inst >> did.

        # C<< $coords >> is an alias for C<< $H_instBlock_coordinates{$block} >>.
        \my $coords = \$H_instBlock_coordinates{$block};
        if ($coords) {
            if    ($x < $coords->[0]) { $coords->[0] = $x; }
            elsif ($x > $coords->[2]) { $coords->[2] = $x; }
            if    ($y < $coords->[1]) { $coords->[1] = $y; }
            elsif ($y > $coords->[3]) { $coords->[3] = $y; }
            print $fh_log "For inst:$this_inst Changing coordinate to block $block: @$coords\n";
        } else {
            $coords = [ $x, $y, $x, $y ];
            print $fh_log "For inst:$this_inst $block: @$coords\n";
        }
    }

    if ($. % 1_000_000 == 0) {
        my $ts = localtime();
        print STDERR "[$ts] prog_b processed $. lines.\n";
    }
}
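
The prefix walk at the heart of prog_b can also be tried in isolation. The sketch below pulls the rindex/substr loop into a helper and applies it to a tiny made-up lookup table. The names prefixes and block_for and the table contents are assumptions for illustration only; the real %h_kI_vB is not shown in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a hierarchical path and all of its ancestors,
# longest first, e.g. "a/b/c" gives ("a/b/c", "a/b", "a").
sub prefixes {
    my ($path) = @_;
    my @out = ($path);
    while ((my $i = rindex($path, "/")) >= 0) {
        # Chop off the last "/component" without using the regex engine.
        substr($path, $i, length($path) - $i, "");
        push(@out, $path);
    }
    return @out;
}

# Hypothetical helper: build the block string by looking up every prefix.
sub block_for {
    my ($lookup, $inst) = @_;
    return join(":", map { exists($lookup->{$_}) ? $lookup->{$_} : () } prefixes($inst));
}

# Made-up stand-in for %h_kI_vB.
my %lookup = ( "a/b" => "B1", "a" => "TOP" );

print join(", ", prefixes("a/b/c")), "\n";
print block_for(\%lookup, "a/b/c"), "\n";
```

An instance with no matching prefix yields an empty string, which corresponds to the case where prog_b finds @loc_xyz empty and skips the line.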
