Here is part of my input GenBank file:
LOCUS AC_000005 34125 bp DNA linear VRL 03-OCT-2005
DEFINITION Human adenovirus type 12, complete genome.
ACCESSION AC_000005 BK000405
VERSION AC_000005.1 GI:56160436
KEYWORDS .
SOURCE Human adenovirus type 12
ORGANISM Human adenovirus type 12
Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE 1 (bases 1 to 34125)
AUTHORS Davison,A.J., Benko,M. and Harrach,B.
TITLE Genetic content and evolution of adenoviruses
JOURNAL J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
PUBMED 14573794
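I want to extract the journal title, for example J. Gen. Virol. (without the issue number and pages).
Here is my code. It produces no output, so I would like to know what is wrong with it. I did try capturing with parentheses and $1, $2, etc., and that worked, but my tutor told me not to use that method and to use substr instead.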
foreach my $line (@lines) {
    if ( $line =~ m/JOURNAL/g ) {
        $journal_line = $line;
        $character = substr( $line, $index, 2 );
        if ( $character =~ m/\s\d/ ) {
            print substr( $line, 12, $index - 13 );
            print "\n";
        }
        $index++;
    }
}
Another approach is to use BioPerl, which can parse GenBank files:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $io  = Bio::SeqIO->new(-file => 'AC_000005.1.gb', -format => 'genbank');
my $seq = $io->next_seq;

foreach my $annotation ($seq->annotation->get_Annotations('reference')) {
    print $annotation->location . "\n";
}
If you run this script with AC_000005.1 saved in a file named AC_000005.1.gb, you get:
J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)
J. Virol. 68 (1), 379-389 (1994)
J. Virol. 67 (2), 682-693 (1993)
J. Virol. 63 (8), 3535-3540 (1989)
Nucleic Acids Res. 9 (23), 6571-6589 (1981)
Submitted (03-MAY-2002) MRC Virology Unit, Church Street, Glasgow G11 5JR, U.K.
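Rather than matching and then using substr, use a single regular expression that matches the whole JOURNAL line, with parentheses to capture the text holding the journal information: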
foreach my $line (@lines) {
    if ($line =~ /JOURNAL\s+(.+)/) {
        print "Journal information: $1\n";
    }
}
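The regular expression looks for JOURNAL followed by one or more whitespace characters, and then (.+) captures the rest of the characters on the line.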
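As a rough sketch building on that pattern, the capture can be narrowed so that it stops before the volume number and returns only the title. This assumes the title is always followed by whitespace and then a digit, which holds for the J. Gen. Virol. line above but not necessarily for every JOURNAL entry (a "Submitted ..." line, for instance). The helper name title_from_regex is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the journal title from a JOURNAL line,
# or nothing (undef in scalar context) if the line does not match.
sub title_from_regex {
    my ($line) = @_;
    # (.+?) is non-greedy, so the capture ends at the first
    # whitespace that is followed by a digit (the volume number).
    return $1 if $line =~ /JOURNAL\s+(.+?)\s+\d/;
    return;
}

print title_from_regex('  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)'), "\n";
# prints: J. Gen. Virol.
```

This still relies on $1, so it does not satisfy a "substr only" requirement, but it shows how little the pattern needs to change.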
To get the text without using $1, I think this is what you were trying to do:
if ($line =~ /JOURNAL/) {
    # start just past the word JOURNAL, wherever it sits in the line
    my $ix = index($line, 'JOURNAL') + length('JOURNAL');
    # variable containing the journal name
    my $j_name;
    # while the journal name is not defined (and we are still inside the line)...
    while (! defined $j_name && $ix < length($line)) {
        # get character $ix in the string
        if (substr($line, $ix, 1) =~ /\s/) {
            # if it is whitespace, increase $ix by one
            $ix++;
        }
        else {
            # if it isn't whitespace, we've found the text!
            $j_name = substr($line, $ix);
        }
    }
}
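If you already know how many characters are in the left-hand column, you can simply use substr($line, 12) (or whatever the width is) to retrieve the substring of $line that starts at character 12: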
foreach my $line (@lines) {
    if ($line =~ /JOURNAL/) {
        print "Journal information: " . substr($line, 12) . "\n";
    }
}
You can combine the two techniques to strip the issue number and date from the journal data:
if ($line =~ /JOURNAL/) {
    my $j_name;
    my $digit;
    my $indent = 12;   # the width of the left-hand column
    my $ix = $indent;  # we'll use this to track the characters in our loop
    # stop at the first digit, or at the end of the line if there is none
    while (! $digit && $ix < length($line)) {
        # starting with $ix = the length of the indent,
        # get character $ix in the string
        if (substr($line, $ix, 1) =~ /\d/) {
            # if it is a digit, we've found the number of the journal
            # we can stop looping now. Whew!
            $digit = $ix;
            # set j_name
            # get a substring of $line starting at $indent going to $digit
            # (i.e. of length $digit - $indent)
            $j_name = substr($line, $indent, $digit - $indent);
        }
        $ix++;
    }
    print "Journal information: $j_name\n" if defined $j_name;
}
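For comparison, here is a more compact sketch of the same idea that still uses substr and no capture groups. It relies on Perl's built-in @- array, where $-[0] is the offset at which the last successful match started. The helper name title_from_substr and the default column width of 12 are assumptions for illustration; the width matches a standard GenBank flat file, so adjust it if your lines are laid out differently.

```perl
use strict;
use warnings;

# Hypothetical helper: extracts the journal title using substr only.
# $indent is the assumed width of the left-hand column (default 12).
sub title_from_substr {
    my ($line, $indent) = @_;
    $indent = 12 unless defined $indent;

    return unless $line =~ /JOURNAL/;
    # Match (without capturing) the first whitespace-then-digit pair;
    # $-[0] then holds the offset where that match begins.
    return unless $line =~ /\s\d/;
    my $end = $-[0];
    return if $end <= $indent;

    # Everything between the left-hand column and the volume number
    return substr($line, $indent, $end - $indent);
}

print title_from_substr('  JOURNAL   J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)'), "\n";
# prints: J. Gen. Virol.
```

Unlike the character-by-character loop, this lets the regex engine find the boundary and leaves substr to do the slicing, and it excludes the trailing space before the volume number.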
I think it would be easier to get the data from a published API! ;)