I am trying to extract information about each G protein-coupled receptor from the tables on the following site:
http://www.iuphar-db.org/DATABASE/ObjectDisplayForward?objectId=1&familyId=1
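More specifically, I want to extract the information in the columns (Ligand, Sp., Action, Affinity, Units). So far the extraction keeps writing empty files, so the module does not seem to recognize the table I specified. Here is the code I have written so far. It is designed to loop over the HTML files, each of which holds the information for one G protein-coupled receptor.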
use warnings;
use strict;
use HTML::TableExtract;
my @names = `ls /home/wallakin/LINDA/ligands/iuphar/data/html`;
foreach (@names)
{
    #Delete empty lines in HTML
    open (IN, "</home/wallakin/LINDA/ligands/iuphar/data/html/$_") or die "Can't open html";
    my @htmllines = <IN>;
    close IN;
    for (@htmllines)
    {
        s/^\s*$// or s/^\s*//;
    }
    open (OUT, ">/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");
    print OUT @htmllines;
    close OUT;
    #Extract data from HTML tables based on column headers
    my $te = HTML::TableExtract->new (
        headers => [ qw(Ligand Sp. Action Affinity Units) ],
        depth => 1,
        count => 1
    );
    $te->parse_file("/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");
    my $output = $_;
    $output =~ s/.html/.txt/g;
    open (RESET, ">/home/wallakin/LINDA/ligands/iuphar/data/ligands/$output");
    close RESET;
    open (DATA, ">>/home/wallakin/LINDA/ligands/iuphar/data/ligands/$output");
    binmode (DATA, ":utf8");
    binmode (STDOUT, ":utf8");
    foreach my $ts ($te->tables)
    {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach my $row ($te->rows)
        {
            foreach ( grep {defined} @$row)
            {
                $_ =~ s/\n/ /g;
                $_ =~ s/\r//g;
                #$_ =~ s/\s+/ /g;
            }
            #Each column's data separated by tabs
            print DATA join ("\t", grep {defined} @$row),"\n";
        }
    }
    close DATA;
}
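I previously wrote a program (which, thankfully, worked) that fetched the HTML file for each G protein-coupled receptor, and those files are the input to this program. I am not sure whether I am using the right headers, depth, or count.
I apologize if this post sounds silly in any way; I am new to bioinformatics and to programming in general. Thanks for your help!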
This seems to work for the URL you provided:
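use 5.014;
use strict;
use warnings;
use open qw(:std :utf8);
use HTML::TableExtract;

# Select tables by their column headers only (no depth/count constraints).
my $te = HTML::TableExtract->new(
    headers => [qw(Ligand Sp. Action Affinity Units Reference)],
);
$te->parse_file('sample.html');

my @tables = $te->tables;
for my $t (@tables) {
    # Take the rows from the current table object, not from $te.
    my @rows = $t->rows;
    for my $r (@rows) {
        for my $c (@$r) {
            $c =~ s/\A\s+//;    # strip leading whitespace
            $c =~ s/\s+\z//;    # strip trailing whitespace
        }
        say "@$r";
    }
}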
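To get the tab-separated output the question asks for, the same approach could be wrapped in a small helper. This is only a sketch under some assumptions: `extract_rows` is a hypothetical name, and the inline markup is a made-up stand-in for a saved receptor page, not the real IUPHAR HTML. It uses `parse` and `eof` on a string rather than `parse_file`, collapses whitespace inside each cell, and replaces undefined cells with empty strings so the columns stay aligned.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper (not from the posts above): returns a reference to a
# list of rows, one array reference per data row of every table whose
# header row matches the names in @$headers.
sub extract_rows {
    my ($html, $headers) = @_;
    my $te = HTML::TableExtract->new(headers => $headers);
    $te->parse($html);
    $te->eof;

    my @rows;
    for my $ts ($te->tables) {
        for my $row ($ts->rows) {           # rows of this table object
            my @cells = map {
                my $c = defined $_ ? $_ : '';   # keep column positions
                $c =~ s/\s+/ /g;                # collapse newlines and spaces
                $c =~ s/\A\s+|\s+\z//g;         # trim both ends
                $c;
            } @$row;
            push @rows, \@cells;
        }
    }
    return \@rows;
}

# Made-up sample markup, assumed for illustration only.
my $html = <<'HTML';
<table>
<tr><th>Ligand</th><th>Sp.</th><th>Action</th><th>Affinity</th><th>Units</th></tr>
<tr><td> ligand A </td><td>Hs</td><td>Agonist</td><td>7.5</td><td>pKi</td></tr>
<tr><td>ligand B</td><td>Rn</td><td>Antagonist</td><td>6.1 -
 6.4</td><td>pIC50</td></tr>
</table>
HTML

my $rows = extract_rows($html, [qw(Ligand Sp. Action Affinity Units)]);
print join("\t", @$_), "\n" for @$rows;
```

For the saved pages, each file could be read into a string and passed to the helper, with the joined rows printed to the matching .txt file. The sketch leaves out `depth` and `count` on purpose: those constraints restrict matching to one table position, so if the ligand table sits anywhere else in the page nothing is extracted, which may be why the original script wrote empty files.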