我有一个近 20k 行的巨大 csv 文件,格式如下:
file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d
我需要在顶部放置 2 行具有相同语法的模式(即第 4 列(。然后其余的行将按原样存在。这意味着前两行语法为"perl",然后是"java","python"等。
到目前为止,我已经使用查找和告诉编写了下面的代码以使其优化。但是,它没有按预期工作。
use strict;
use warnings;
open(FP, "+<mycsv.csv");
my %hash = ();
my $cur_pos;
while(<FP>) {
my $line = $_;
chomp $line;
#print "$line aaan";
if($line =~ /^file,tools,/) {next;}
if($line =~ /^w+,w+,w+,(w+),.*$/) {
my $type = $1;
#print "type $typen";
if($hash{$type}->{count} < 2 ) {
#print "--- here type = $type | lastpos = ", $hash{$type}->{lastpos} , "n";
$cur_pos = tell(FP);
my $pos = tell(FP) - length($line);
if($hash{$type}->{lastpos} ) {
my $lastpos = $hash{$type}->{lastpos};
seek(FP, $lastpos, 1);
print FP $line;
seek(FP, $cur_pos, 1);
}
$hash{$type}->{lastpos} = $pos;
}
if(exists $hash{$type} ) {
$hash{$type}->{count} += 1;
} else {
$hash{$type}->{count} = 1;
}
}
}
close(FP);
预期输出应如下所示:
file,tools,edit,syntax,buffers
a,b,c,perl,d
a,e,c,perl,d
a,w,c33,java,d
wa,b,c33,java,d
a,s,c,python,d1
a,f,c,python,dd
a,n,c,php,d3
wa,b,c33,php,d
d,r,hhh,cpp,d0
d,buuu,hhh,cpp,d0
d,m,hhh,c#,d0
wa,b,c33,c#,d
a,o,c,pdf,d3
a,yb,c,c,ddf
d44,b,hhh,nlp,d0
a,be,c,js,d4
a,h,c,perl,dg
a,b,c,perl,dt
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d
任何使其工作的帮助将不胜感激。
谢谢。
我会通过解析文件来收集数据结构中的前几行并将其他行发送到临时文件来解决这个问题。分析完文件后,将数据结构中的行对打印到输出文件中,然后将临时文件添加到输出文件的末尾。
示例代码:
use strict;
use warnings;
use feature ':5.16';
my $infile = 'infile';
my $outfile = 'outfile';
my $tempfile = 'temp';
my $quantity = 2; # or whatever
open my $in, '<', $infile or die 'Could not open infile: ' . $!;
open my $out, '>', $outfile or die 'Could not create output file: ' . $!;
open my $temp, '>', $tempfile or die 'Could not create tempfile: ' . $!;
my $hash = {};
my @order;
my $hdr;
while ( <$in> ) {
if ( $hdr ) {
my @cols = split ",", $_;
my $key = $cols[3];
# have we seen this key before?
if ( ! $hash->{$key} ) {
push @order, $key;
$hash->{$key} = [ $_ ];
}
elsif ( scalar @{$hash->{$key}} < $quantity ) {
push @{$hash->{$key}}, $_;
}
else {
print { $temp } $_;
}
}
else {
# the header line
print { $out } $_;
$hdr = $_;
}
}
# print the collected twofers out into the tempfile
for my $key ( @order ) {
print { $out } @{$hash->{$key}};
}
close $out;
close $temp;
# concatenate the files
system join ' ', ( 'cat', $tempfile, '>>', $outfile );
如果配对的行不必按照它们在源文件中出现的顺序排列,则可以跳过@order
内容。
相同的逻辑,我得到的输出与您的输出几乎没有什么不同。您能否浏览此输出并让我知道是否需要任何更改?方法在评论中内联提及。
use strict;
use warnings;
use feature 'say';
my $syntax = [];
my $NUM = 2; # change number if needed
my $filename = 'file.txt';
my $data = {}; # make a hash of data
open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
chomp $row;
next if $. == 1; # skip header row
my @columns = split (',', $row);
push @$syntax, $columns[3]; # make a list of all syntaxes available
push @{$data->{$columns[3]}}, $row;
}
close $fh;
my $processed = {};
# loop throught the syntax array and print data from hash
# also, make a counter of the number of times that syntax is used.
# it will help us to skip next (n-1) occurence of that syntax
for my $syntax (@$syntax) {
if (!$processed->{$syntax}){
for my $s (splice @{$data->{$syntax}}, 0, $NUM) {
$processed->{$syntax} += 1;
say $s;
}
} else {
$processed->{$syntax} -= 1;
}
}
# print out the remaining values
for my $rem (values %$data){
say for @$rem;
}
输出:
a,b,c,perl,d
a,e,c,perl,d
a,w,c33,java,d
wa,b,c33,java,d
a,s,c,python,d1
a,f,c,python,dd
a,n,c,php,d3
wa,b,c33,php,d
d,r,hhh,cpp,d0
d,buuu,hhh,cpp,d0
d,m,hhh,c#,d0
wa,b,c33,c#,d
a,o,c,pdf,d3
a,yb,c,c,ddf
a,h,c,perl,dg
a,b,c,perl,dt
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,perl,d
我有一个近 20k 行的巨大 CSV 文件,格式如下:
这无论如何都不大。文件大小可能约为一兆字节。
虽然我通常建议逐行处理以确保文件大小的稳健性,但在这种情况下,您知道您正在处理的文件很小。问题是你花在优化这件事上的时间是否值得。
如果我理解正确,您的问题可以通过浪费一些内存来快速解决(在程序员时间内(:
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw( uniqstr );
my $TOP = 2;
(my $header = <DATA>) =~ s/s+z//;
my @header = split /,|s+/, $header;
my %idx = map +($header[$_] => $_), 0 .. $#header;
my @lines = grep /S/, <DATA>;
my %syntax_of = map +($_ => (split /,/, $_)[$idx{syntax}]), @lines;
my @syntaxes = uniqstr map $syntax_of{$_}, @lines;
my %lines_of;
for my $n (0 .. $#lines) {
push @{$lines_of{$syntax_of{$lines[$n]}}}, $n;
}
print "$headern";
for my $syntax (@syntaxes) {
my @top = grep defined, map $lines_of{$syntax}->[$_ - 1], 1 .. $TOP;
print @lines[@top];
# normally, invoking delete on an array slice is not
# but it is just what we need here.
delete @lines[@top];
}
print grep defined, @lines;
__DATA__
file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d
PS:另见领带::文件
PPS:乍一看,如果想花时间在上面,至少有六件事可能倾向于在这里进行调整。