Parsing a vertically delimited file in Perl



I have a file that looks like this:

*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

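Each record block has a different number of lines (for example, the CX entry is not always present). When CX is present, it appears only once per record. We want a hash with the "MH" value as the key and the "CX" terms as the value.

So, parsing the data above, we expect to get a structure like this: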

$VAR = {  "Urinary Bladder" => ["CYST-" , "VESIC-"]};

What is the right way to parse it?

I'm stuck with the following, which doesn't give me the result I want.

use Data::Dumper;
my %bighash;
my $key = "";
my $cx = "";
while (<>) {
   chomp;
   if (/^MH = (\w+/)) {
      $key = $1;     
      push @{$bighash{$key}}, " ";
   }
   elsif ( /^CX = (\w+/)) {
      $cx = $1;
   }
   else {
      push @{$bighash{$key}}, $cx;
   }
} 

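This gets much simpler if you set $/ so that each read returns a whole record. I'm surprised nobody has suggested it. The code below uses paragraph mode ($/ = ''), which needs a blank line between records; if your file has none, set $/ to "*NEWRECORD\n" instead.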

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %bighash;
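$/ = '';    # paragraph mode: each read returns one blank-line-separated block
while (<DATA>) {
  # Only records that contain both an MH line and a CX line are stored
  if (my ($k) = /^MH = (.*?)$/m and my ($v) = /^CX = (.*?)$/m) {
    $bighash{$k} = [ $v =~ /([A-Z]+-)/g ];
  }
}
say Dumper \%bighash;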
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

The output looks like this:

$VAR1 = {
          'Urinary Bladder' => [
                                 'CYST-',
                                 'VESIC-'
                               ]
        };
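Paragraph mode only splits on blank lines. For input where the records follow each other directly, as in the listing in the question, a variant of the same idea is to use the record marker itself as the separator. The following is only a sketch: the helper name parse_records and the in-memory filehandle are illustrative choices, not part of the answer above.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: returns a hash ref mapping each MH value
# to the list of hyphenated terms found on that record's CX line.
sub parse_records {
    my ($fh) = @_;
    my %terms;
    local $/ = "*NEWRECORD\n";    # each read ends at the next record marker
    while ( my $record = <$fh> ) {
        chomp $record;            # chomp removes the trailing "*NEWRECORD\n"
        my ($mh) = $record =~ /^MH = (.*?)$/m;
        my ($cx) = $record =~ /^CX = (.*?)$/m;
        next unless defined $mh && defined $cx;    # skip records without CX
        $terms{$mh} = [ $cx =~ /([A-Z]+-)/g ];
    }
    return \%terms;
}

# Sample input copied from the question, with no blank lines between records.
my $sample = <<'END';
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $result = parse_records($fh);
close $fh;
print Dumper($result);
```

Because $/ is set with local inside the sub, the caller's record separator is left untouched.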

Try the following. It's probably a good idea to go through the changes (or follow Aki's advice):

use strict;
use warnings;
use Data::Dumper;
my %bighash;
my $current_key;
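while ( <DATA> ) {
    chomp;
    if ( m/^MH = (.+)/ ) {
        # Remember the heading of the record being read
        $current_key = $1;
    } elsif ( /^CX = (.+)/ ) {
        my $text = $1;
        # Keep only the upper-case terms that end in a hyphen
        $bighash{ $current_key } = [ $text =~ /([A-Z]+-)/g ];
    }
}
print Dumper ( \%bighash );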
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

Update: now uses regex capture instead of split/grep.
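For comparison, here is a small sketch of both ways of pulling the terms out of a CX line. The split/grep form is only a guess at what the earlier revision looked like; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = 'consider also terms at CYST- and VESIC-';

# split + grep: break the line on whitespace, then keep only the
# tokens made of upper-case letters followed by a hyphen
my @via_split = grep { /^[A-Z]+-$/ } split ' ', $text;

# regex capture: a /g match in list context returns every captured group
my @via_regex = $text =~ /([A-Z]+-)/g;

print "@via_split\n";
print "@via_regex\n";
```

Both lists hold the same two terms for this input; the capture form does the job in a single step.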

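My Perl kung fu is rusty, but that final else block looks suspicious.

Try removing the final else block and moving the 'push' statement into the elsif branch, so that the push happens right after CX is matched.

Also note that MH must always appear before CX within a record; otherwise the logic breaks.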

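  • Fix the regex: /^MH = (\w+/) should be /^MH = (\w+)/. You may want to use \s+ or \s* instead of literal spaces, and (.+) instead of (\w+) if the value can contain spaces, as "Urinary Bladder" does
  • Remove the push from the if block
  • Remove the else
  • Use $key in the elsif block
  • Push $cx onto the list stored in the hash
  • Add use strict; use warnings; to your code

Try these, and if you get stuck I'll help you with the code.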
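Applied to the loop from the question, the fixes listed above might look roughly like this. It is a sketch, not the asker's final code: it reads from an in-memory copy of the sample instead of <>, resets $key at each *NEWRECORD as an extra safeguard, and borrows the /([A-Z]+-)/g term extraction from the other answers.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sample input copied from the question.
my $sample = <<'END';
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my %bighash;
my $key;    # MH value of the record currently being read

while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^\*NEWRECORD/ ) {
        undef $key;                   # new record: forget the previous MH
    }
    elsif ( $line =~ /^MH\s*=\s*(.+?)\s*$/ ) {
        $key = $1;                    # no push here, just remember the key
    }
    elsif ( $line =~ /^CX\s*=\s*(.+?)\s*$/ ) {
        my $cx = $1;
        next unless defined $key;     # CX with no preceding MH: skip it
        # Push only the hyphenated upper-case terms from the CX text
        push @{ $bighash{$key} }, $cx =~ /([A-Z]+-)/g;
    }
}
close $fh;

print Dumper( \%bighash );
```

Records without a CX line never create a hash entry, so only "Urinary Bladder" ends up in %bighash for this sample.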

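It may be simpler to make an initial pass over the file with Config::Tiny or Config::YAML and then go through each record separately. If your file is a gigabyte or more, though, that could eat all your memory.

Here is something I put together quickly; I hope it gives you an idea to start from: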

use strict;
use warnings;
use Data::Dumper;

my @records;
# Set your record separator
{
  local $/ = "*NEWRECORD\n";
  while (<DATA>) {
    # Get rid of your separator
    chomp($_);
    print "Parsing record # $.\n";
    push @records, $_ if ( $_ );
  }
}

foreach (@records) {
  # Get your sub records
  my @lines = split(/\n/, $_);
  my %h = ();
  my %result = ();
  # Create a hash from your sub records
  foreach (@lines) {
    my ($k, $v) = split(/\s*=\s*/, $_, 2);
    $h{$k} = $v;
  }
  # Parse the CX (if the record has one) and strip the lower case comments
  if ( defined $h{ 'CX' } ) {
    $h{ 'CX' } =~ s/[a-z]//g;
    $h{ 'CX' } =~ s/^\s+//;
    # Have the upper case values as an array ref in the result hash
    $result{ $h{ 'MH' } } = [ split( /\s+/, $h{ 'CX' } ) ];
  }
  print Dumper( \%h );
  print "Result:\n";
  print Dumper( \%result );
}
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
