从哈希中获取 fasta 文件序列



我正在尝试将 FASTA 文件放入哈希中,以便以后可以操作它,将 ID 作为键,将序列作为值。但我的输出只是打印最后一个 ID 并将所有序列连接在一起。

输入:

>cel-mir-35 MI0000006 Caenorhabditis elegans miR-35 stem-loop
UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUA
UCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC
>cel-mir-36 MI0000007 Caenorhabditis elegans miR-36 stem-loop
CACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUA
UCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGA
>cel-mir-37 MI0000008 Caenorhabditis elegans miR-37 stem-loop
UUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUA
UCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCU
>cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop
GUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCA
CCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU

我的输出是这个

cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU

我想获取每个 ID 和相应的序列作为输出

法典:

use strict;
use warnings;
my %fastahash = ();
my $id        = '';
my $seq       = '';
open FILE, "file.fasta", or die $!;
while ( <FILE> ) {
chomp;
if ( $_ =~ /^>(.+)/ ) {
$id = $1;
}
elsif ( $_ =~ m/^[A-Z]+$/ ) {
$seq .= $_;
}
else {
$fastahash{$id} .= $_;
}
}
foreach my $sequence ( keys %fastahash ) {
print "$id $seqn";
}
close FILE;

我应该更改哪个部分?

另外,如何获取序列作为键和 id 作为值?

您没有正确累积哈希,也没有打印它。

while (<FILE>) {
chomp;
if($_ =~ /^>(.+)/){
$id = $1;
} elsif (/^[A-Z]+$/) {
$seq .= $_;
} else {
$fastahash{$id} = $seq;   # Populate the hash.
}
}
for my $id (keys %fastahash) {
print "$id $fastahash{$id}n";  # Print it.
}

我认为您应该分配$seq时,您将$_分配给fastahash。此外,您永远不会重置 id 或 seq,因此存在潜在的错误。尝试这样的事情:

while (<FILE>) {
chomp;
if (/^>(.+)/) {
$id = $1;
} elsif (/^[A-Z]+$/) {
$seq .= $_;
} else {
$fastahash{$id} = $seq if $id;
$id = undef;
$seq = '';
}
}
$fastahash{$id} = $seq if $id;

我意识到这不是代码审查,但我认为对您的代码进行一些评论会很有用

  • 声明变量时通常不需要定义变量。实际上,如果您将标量变量设置为空字符串,它通常会删除有用的错误消息

  • 最佳做法是使用词法文件句柄和open的三参数形式。所以

    open FILE, "file.fasta", or die $!;
    

    最好写成

    open my $fh, '<', 'file.fasta' or die $!;
    

    (请注意,原始代码中还有一个多余的逗号。

    词法文件句柄通常无需close它们,因为它们在超出范围时销毁时将被隐式关闭

  • 你可能不熟悉Perl的默认变量$_,但是如果你使用它,代码可以做得更清晰、更简洁。

    你已经用了chomp,相当于chomp $_$_ =~ /^>(.+)/只需要/^>(.+)/

  • 注意foreach完全等同于for,大多数熟悉Perl的程序员会更喜欢前者。

我会给你写这样的程序

use strict;
use warnings;
open my $fh, '<', 'file.fasta' or die $!;
my %fasta_hash;
my ($id, $seq);
while ( <$fh> ) {
chomp;
if ( /^>(.+)/ ) {
$id = $1;
}
elsif ( /S/ and not /[^ACGTU]/ ) {
$seq .= $_;
}
else {
$fasta_hash{$id} = $seq;
}
}
for my $id ( keys %fasta_hash ) {
print "$id -- $fasta_hash{$id}n";
}

输出

cel-mir-35 MI0000006 Caenorhabditis elegans miR-35 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC
cel-mir-37 MI0000008 Caenorhabditis elegans miR-37 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCU
cel-mir-36 MI0000007 Caenorhabditis elegans miR-36 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGA

至于如何反转哈希以便将序列用作键,在我上面的版本中,您只需将行$fasta_hash{$id} = $seq;更改为$fasta_hash{$seq} = $id;

最新更新