Multi-column minimum representation discovery in Perl



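I have a data CSV with about 20 columns, and each column has several distinct values. Every row after the top (header) row is a separate data sample. I want to programmatically narrow the list down so that I have the fewest data samples that still represent every permutation of the column data.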

Sample data

SERIAL,ACTIVE,COLOR,CLASS,SEASON,SEATS
.0xb468d47cc9749fb862990426ff79aafb,T,GREEN,BETA,SUMMER,3
.0x847129b35bad62f5837eec30dc07a8a4,T,VIOLET,DELTA,SUMMER,1
.0x14b8df88fd6d6547e387f4caa99e52fd,F,ORANGE,ALPHA,SUMMER,4
.0x0a07fb97224caf79ea73d3fdd5495b8f,T,YELLOW,DELTA,WINTER,1
.0x7d747e689bb27b60198283d7b86db409,F,READ,DELTA,SPRING,2
.0x8247524df49bd19c4c316ee070a2dd4a,T,BLUE,GAMA,WINTER,2
.0x4103ed42af6e8e463708a6c629907fb5,T,YELLOW,ALPHA,SPRING,5
.0xc38deea7f02fbfbcdde1d3718d6decb4,T,YELLOW,DELTA,FALL,5
.0xa3d562edcf64e151d7de08ff8f8e0a94,F,VIOLET,DELTA,SUMMER,3
.0x9da58b3b05603325c24629f700c25c97,T,YELLOW,OMEGA,SPRING,4
.0xef0c0e75083229d654c9b111e3af8726,T,BLUE,GAMA,FALL,1
.0xa9022c8713f0aba2a8e1d20475a3104a,T,YELLOW,BETA,SUMMER,2
.0x5bb5f73e6030730610866cee80cfc2fb,F,ORANGE,BETA,FALL,5
.0xc202e5b43dd65525754fdc52b89e7375,T,BLUE,OMEGA,SUMMER,3
.0xfac9145af33a74aedae7cc0442426432,F,READ,BETA,SPRING,1
.0x457949648053f710b4f2d55cb237a91d,T,GREEN,BETA,SPRING,3
.0xed94d4df300f10f5c4dc5d3ac76cf9e5,F,VIOLET,ALPHA,WINTER,15
.0x870130135beed4cbbe06478e368b40b3,F,YELLOW,ALPHA,SPRING,3
.0x3b6f17841edb9651e732e3ffbacbe14a,T,GREEN,OMEGA,SUMMER,3
.0xfb30e054466b9e4cf944c8e48ff74c93,F,VIOLET,DELTA,SUMMER,8
.0xf741ddc71b4a667585acaa35b67dc6c9,F,BLUE,BETA,FALL,4
.0x60257ad6c299e466086cc6e5bb0a9a33,F,VIOLET,OMEGA,SPRING,1
.0xa5d208bfee5a27a7619ba07dcbdaeea0,T,GREEN,OMEGA,FALL,1
.0x53bc78fa8863e53e8c9fb11c5f6d2320,F,GREEN,GAMA,SPRING,2
.0x5a01253ce5cb0a6aa5213f34f0b35416,T,READ,BETA,WINTER,3
.0xaed9a979ba9f6fbf39895b610dde80f4,T,ORANGE,DELTA,WINTER,1
.0xe7769918e36671af77b5d3d59ea15cfe,T,ORANGE,OMEGA,FALL,4
.0x9e5327a1583332e4c56d29c356dbc5d2,T,INDEGO,ALPHA,WINTER,5
.0x79c5c70732ff04b4d00e81ac3a07c3b7,T,READ,OMEGA,FALL,5
.0x55f54d3c9cd2552e286364894aeef62a,F,READ,GAMA,SPRING,15

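Use a hash to record whether a particular combination of column values has been seen before, then consult that hash to decide whether to print a given row.

Here is a fairly basic example that demonstrates the idea: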

filter.pl

#!/usr/bin/env perl
use warnings;
use strict;
die "usage: $0 file col1,col2,col3, ... colnn" unless @ARGV;
my ($file, $columns) = @ARGV;
-f $file or die "$file does not exist!";
defined $columns or die "need to pass in columns!";
my @columns;
for my $col ( split /,/, $columns ) {
    die "Invalid column id $col" unless $col =~ /^[1-9]\d*$/; # 1-based, digits only
    push @columns, $col - 1; # 0-based
}
scalar @columns or die "No columns!";
open my $fh, "<", $file or die "Unable to open $file : $!";
my %uniq;
while (<$fh>) {
    chomp();
    next if $. == 1; # Skip Header
    my (@data) = split /,/, $_; # Use Text::CSV for any non-trivial csv file
    my $key = join '|', @data[ @columns ]; # key will look like 'foo|bar|baz'
    if (not defined $uniq{ $key } ) {
        print $_ . "\n"; # Print the whole line with the first unique set of columns
        $uniq{ $key } = 1; # Now we have seen this combo
    }
}

data.csv

SERIAL,TRUTH,IN,PARALLEL
123,TRUE,YES,5
124,TRUE,YES,5
125,TRUE,YES,3
126,TRUE,NO,5
127,FALSE,YES,1
128,FALSE,YES,3
129,FALSE,NO,7

Output

perl filter.pl data.csv 2,3
123,TRUE,YES,5
126,TRUE,NO,5
127,FALSE,YES,1
129,FALSE,NO,7
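
The same seen-hash idea can also be sketched as a small reusable function, which is easier to test than a script that reads a file. This is only a minimal sketch: the name `first_unique_rows` is hypothetical (it is not part of the answer above), the rows are hard-coded copies of the data.csv lines, and it still splits on plain commas, so it assumes no quoted or embedded commas.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns the rows whose
# values in the chosen 0-based columns have not been seen before.
sub first_unique_rows {
    my ($rows, $cols) = @_;
    my %seen;
    return grep {
        # Build a key such as 'TRUE|YES' from the selected columns.
        my $key = join '|', (split /,/, $_)[@$cols];
        !$seen{$key}++;    # true only the first time this key appears
    } @$rows;
}

# Data rows from data.csv above, header already removed.
my @rows = (
    '123,TRUE,YES,5',
    '124,TRUE,YES,5',
    '125,TRUE,YES,3',
    '126,TRUE,NO,5',
    '127,FALSE,YES,1',
    '128,FALSE,YES,3',
    '129,FALSE,NO,7',
);

# Columns 2 and 3 (1-based) are indexes 1 and 2 here.
my @kept = first_unique_rows(\@rows, [1, 2]);
print "$_\n" for @kept;
```

With indexes 1 and 2 this keeps the same four rows as the filter.pl run above. Passing more indexes (for example every column except SERIAL) tightens the key, so more rows survive.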
