我正在尝试将 FASTA 文件放入哈希中,以便以后可以操作它,将 ID 作为键,将序列作为值。但我的输出只是打印最后一个 ID 并将所有序列连接在一起。
输入:
>cel-mir-35 MI0000006 Caenorhabditis elegans miR-35 stem-loop
UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUA
UCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC
>cel-mir-36 MI0000007 Caenorhabditis elegans miR-36 stem-loop
CACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUA
UCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGA
>cel-mir-37 MI0000008 Caenorhabditis elegans miR-37 stem-loop
UUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUA
UCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCU
>cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop
GUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCA
CCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
我的输出是这个
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
我想获取每个 ID 和相应的序列作为输出
法典:
use strict;
use warnings;
my %fastahash = ();
my $id = '';
my $seq = '';
open FILE, "file.fasta", or die $!;
while ( <FILE> ) {
chomp;
if ( $_ =~ /^>(.+)/ ) {
$id = $1;
}
elsif ( $_ =~ m/^[A-Z]+$/ ) {
$seq .= $_;
}
else {
$fastahash{$id} .= $_;
}
}
foreach my $sequence ( keys %fastahash ) {
print "$id $seqn";
}
close FILE;
我应该更改哪个部分?
另外,如何获取序列作为键和 id 作为值?
您没有正确累积哈希,也没有打印它。
while (<FILE>) {
chomp;
if($_ =~ /^>(.+)/){
$id = $1;
} elsif (/^[A-Z]+$/) {
$seq .= $_;
} else {
$fastahash{$id} = $seq; # Populate the hash.
}
}
for my $id (keys %fastahash) {
print "$id $fastahash{$id}n"; # Print it.
}
我认为您应该分配$seq
时,您将$_
分配给fastahash。此外,您永远不会重置 id 或 seq,因此存在潜在的错误。尝试这样的事情:
while (<FILE>) {
chomp;
if (/^>(.+)/) {
$id = $1;
} elsif (/^[A-Z]+$/) {
$seq .= $_;
} else {
$fastahash{$id} = $seq if $id;
$id = undef;
$seq = '';
}
}
$fastahash{$id} = $seq if $id;
我意识到这不是代码审查,但我认为对您的代码进行一些评论会很有用
-
声明变量时通常不需要定义变量。实际上,如果您将标量变量设置为空字符串,它通常会删除有用的错误消息
-
最佳做法是使用词法文件句柄和
open
的三参数形式。所以open FILE, "file.fasta", or die $!;
最好写成
open my $fh, '<', 'file.fasta' or die $!;
(请注意,原始代码中还有一个多余的逗号。
词法文件句柄通常无需
close
它们,因为它们在超出范围时销毁时将被隐式关闭 -
你可能不熟悉Perl的默认变量
$_
,但是如果你使用它,代码可以做得更清晰、更简洁。你已经用了
chomp
,相当于chomp $_
,$_ =~ /^>(.+)/
只需要/^>(.+)/
-
注意
foreach
完全等同于for
,大多数熟悉Perl的程序员会更喜欢前者。
我会给你写这样的程序
use strict;
use warnings;
open my $fh, '<', 'file.fasta' or die $!;
my %fasta_hash;
my ($id, $seq);
while ( <$fh> ) {
chomp;
if ( /^>(.+)/ ) {
$id = $1;
}
elsif ( /S/ and not /[^ACGTU]/ ) {
$seq .= $_;
}
else {
$fasta_hash{$id} = $seq;
}
}
for my $id ( keys %fasta_hash ) {
print "$id -- $fasta_hash{$id}n";
}
输出
cel-mir-35 MI0000006 Caenorhabditis elegans miR-35 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC
cel-mir-37 MI0000008 Caenorhabditis elegans miR-37 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCU
cel-mir-36 MI0000007 Caenorhabditis elegans miR-36 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGA
至于如何反转哈希以便将序列用作键,在我上面的版本中,您只需将行$fasta_hash{$id} = $seq;
更改为$fasta_hash{$seq} = $id;