Using Perl's seek to jump to a line in a file and continue reading the file



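My goal is to open a file containing a single fixed-length column (1 character = 2 bytes on my Mac) and read its lines into an array, starting and ending at specified points. The file is very long, so I am using seek to jump to the appropriate starting line. The file is a chromosome sequence, arranged as a single column. I am successfully jumping to the right point in the file, but I am having trouble reading the sequence into the array.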

my @seq = (); # to contain the stretch of sequence I am seeking to retrieve from file.
my $from_bytes = 2*$from - 2; # specifies the "start point" in terms of bytes.
seek( SEQUENCE, $from_bytes, 0 );
my $from_base = <SEQUENCE>;
push ( @seq, $from_base ); # script is going to the correct line and retrieving correct base.
my $count = $from + 1; # here I am trying to continue the read into @seq
while ( <SEQUENCE> ) {
        if ( $count = $to ) { # $to specifies the line at which to stop
              last;
        }
        else {
             push( @seq, $_ );
             $count++;
             next;  
        }
}
print "seq is: @seq\n\n"; # script prints only the first base

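You appear to be reading fixed-width records, each consisting of $to lines of 2 bytes (1 character + 1 newline). That means you can fetch each chromosome sequence with a single read. A short example: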

use strict;
use warnings;
use autodie;
my $record_number    = $ARGV[0];
my $lines_per_record = 4; # change to the correct value
my $record_length    = $lines_per_record * 2;
my $offset           = $record_length * $record_number;
my $fasta_test = "fasta_test.txt";
if (open my $SEQUENCE, '<', $fasta_test) {
    my $sequence_string;
    seek $SEQUENCE, $offset, 0;
    my $chars_read = read($SEQUENCE, $sequence_string, $record_length);
    if ($chars_read) {
        my @seq = split /\n/, $sequence_string; # if you want it as an array
        $sequence_string =~ s/\n//g; # if you want the chromosome sequence as a single string without newlines
        print $sequence_string, "\n";
    } else {
        print STDERR "Failed to read record $record_number!\n";
    }
    close $SEQUENCE;
}
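
The line-by-line loop from the question can probably be kept as well. In that code, `if ( $count = $to )` assigns $to to $count instead of comparing the two, so the condition is true for any nonzero $to and the loop exits before pushing anything; the numeric comparison is `==`. Below is a minimal sketch of the corrected loop, assuming every line is exactly 2 bytes and that $from and $to are 1-based and inclusive. The helper name read_bases and the in-memory sample data are hypothetical, used only so the example runs on its own; in practice the filehandle would come from opening the real sequence file.

```perl
use strict;
use warnings;

# Hypothetical helper: return lines $from..$to (1-based, inclusive) from a
# filehandle whose lines are all exactly 2 bytes (one base plus "\n").
sub read_bases {
    my ($fh, $from, $to) = @_;
    # Line $from starts at byte offset 2 * ($from - 1).
    seek( $fh, 2 * ($from - 1), 0 ) or die "seek failed: $!";
    my @seq;
    my $count = $from;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @seq, $line;
        last if $count == $to;    # numeric comparison (==), not assignment (=)
        $count++;
    }
    return @seq;
}

# Demo on an in-memory filehandle standing in for the real sequence file.
my $data = join '', map { "$_\n" } qw(A C G T A C G T);
open( my $fh, '<', \$data ) or die "open failed: $!";
my @seq = read_bases( $fh, 3, 6 );
close $fh;

print "seq is: @seq\n";    # prints "seq is: G T A C"
```

If the stop line should be excluded, as the original loop seems to intend, moving the `last if` check above the `push` gives that behaviour.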

With more information, someone could probably suggest a better solution.
