The gist
Here is a snippet of what I'm working with:
my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;
foreach (@blarg_links) {
    $url_temp = $_;
    $page_temp = get( $url_temp ) or die $!;
    $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
    @temp_stuff = $p_temp->look_down(
        _tag => 'foo',
        class => 'bar'
    );
    foreach (@temp_stuff) {
        push(@collector, "http://www.foobar.sx" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
    };
};
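Hopefully it's obvious that what I'm haplessly trying to do is push the link endings found on each page in a list of links into an array called @temp_stuff. So the first link in @blarg_links, when visited, has one or more foo tags with an associated bar class. When as_HTML is applied to each of them, the result should match the href value I want, which then gets pumped into an array of links that hold the data I'm really after... Does that make sense?
Actual data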
my $url2 = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $page2 = get( $url2 ) or die $!;
my $p2 = HTML::TreeBuilder->new_from_content( $page2 );
my @stuff2 = $p2->look_down(
    _tag => 'div',
    class => 'year mini-day-on'
);
my @chem_links;
foreach (@stuff2) {
    push(@chem_links, $1) if $_->as_HTML =~ m/(http:\/\/www.chemistry.ucla.edu\/calendar-node-field-date\/day\/[0-9]{4}-[0-9]{2}-[0-9]{2})/;
};
my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;
foreach (@chem_links) {
    $url_temp = $_;
    $page_temp = get( $url_temp ) or die $!;
    $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
    @temp_stuff = $p_temp->look_down(
        _tag => 'span',
        class => 'field-content'
    );
};
foreach (@temp_stuff) {
    push(@collector, "http://www.chemistry.ucla.edu" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
};
Note: I want to use HTML::TreeBuilder. I know there are alternatives.
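Here is a rough attempt at what I think you want.
It fetches all the links on the first page and visits each one in turn, printing the link found in each <span class="field-content"> element.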
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder;
STDOUT->autoflush;
my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $tree = HTML::TreeBuilder->new_from_url($url);
# Collect the day links from the year calendar.
my @chem_links;
for my $div ( $tree->look_down( _tag => 'div', class => qr{\bmini-day-on\b} ) ) {
    my ($anchor) = $div->look_down(_tag => 'a', href => qr{http://www.chemistry.ucla.edu});
    push @chem_links, $anchor->attr('href');
};
# Visit each day page and gather the seminar links it contains.
my @collector;
for my $url (@chem_links) {
    say $url;
    my $tree = HTML::TreeBuilder->new_from_url($url);
    my @seminars;
    for my $span ( $tree->look_down( _tag => 'span', class => 'field-content' ) ) {
        my ($anchor) = $span->look_down(_tag => 'a', href => qr{/});
        push @seminars, 'http://www.chemistry.ucla.edu'.$anchor->attr('href');
    }
    say " $_" for @seminars;
    say '';
    push @collector, @seminars;
};
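As far as I can tell, the code in the question ends up with links from only the last day page because its second foreach sits outside the first loop, so @temp_stuff is overwritten on every pass and examined only once at the end. The following is a minimal sketch of the nested structure, not a drop-in replacement: the %pages hash and its HTML are hypothetical stand-ins for the fetched pages, so it runs without network access. It also skips spans that contain no link, which is an extra guard I am assuming is wanted.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical in-memory pages standing in for the fetched day pages.
my %pages = (
    day1 => '<html><body><span class="field-content">'
          . '<a href="/seminars/a">A</a></span></body></html>',
    day2 => '<html><body><span class="field-content">'
          . '<a href="/seminars/b">B</a></span>'
          . '<span class="field-content">no link here</span></body></html>',
);

my @collector;
for my $key ( sort keys %pages ) {
    my $tree = HTML::TreeBuilder->new_from_content( $pages{$key} );

    # Handle this page's matches inside the loop, before the next
    # iteration replaces them.
    for my $span ( $tree->look_down( _tag => 'span', class => 'field-content' ) ) {
        my ($anchor) = $span->look_down( _tag => 'a', href => qr{^/} );
        next unless $anchor;    # skip spans that contain no link
        push @collector, 'http://www.chemistry.ucla.edu' . $anchor->attr('href');
    }

    $tree->delete;    # release the parse tree
}

print "$_\n" for @collector;
```

With the inner loop nested like this, @collector accumulates links from every page rather than just the final one.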
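For a more modern web-parsing framework, I'd suggest taking a look at Mojo::UserAgent and Mojo::DOM. Instead of manually walking each part of the HTML tree, you can use the power of CSS selectors to zero in on the specific data you want. There's an 8-minute introductory video on the framework at Mojocast Episode 5.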
# Parses the UCLA Chemistry Calendar and displays all seminar links
use strict;
use warnings;
use Mojo::UserAgent;
use URI;
my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;
# map(attr => 'href') pulls the href from each matched element; calling
# ->attr directly on the collection only works in old Mojolicious releases.
for my $dayhref ($dom->find('div.mini-day-on > a[href*="/day/"]')->map(attr => 'href')->each) {
    my $dayurl = URI->new($dayhref)->abs($url);
    print $dayurl, "\n";
    my $daydom = $ua->get($dayurl->as_string)->res->dom;
    for my $seminarhref ($daydom->find('span.field-content > a[href]')->map(attr => 'href')->each) {
        my $seminarurl = URI->new($seminarhref)->abs($dayurl);
        print " $seminarurl\n";
    }
    print "\n";
}
The output is identical to that of Borodin's solution using HTML::TreeBuilder:
http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-06
 http://www.chemistry.ucla.edu/seminars/nano-rheology-enzymes

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-09
 http://www.chemistry.ucla.edu/seminars/imaging-approach-biology-disease-through-chemistry

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-10
 http://www.chemistry.ucla.edu/seminars/arginine-methylation-%E2%80%93-substrates-binders-function
 http://www.chemistry.ucla.edu/seminars/special-inorganic-chemistry-seminar

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-13
 http://www.chemistry.ucla.edu/events/robert-l-scott-lecture-0

...
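If you want to experiment with the selector without hitting the site, here is a small sketch that runs Mojo::DOM over an inline fragment. The HTML below is a hypothetical stand-in for the calendar markup (only the class names and URL pattern are taken from the code in this thread), and it assumes a reasonably recent Mojolicious, where map(attr => 'href') is the way to pull an attribute from every element in a collection.

```perl
use strict;
use warnings;
use Mojo::DOM;
use URI;

my $base = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';

# Hypothetical fragment imitating two calendar cells, one "on" and one "off".
my $html = <<'HTML';
<div class="year mini-day-on"><a href="/calendar-node-field-date/day/2014-01-06">6</a></div>
<div class="year mini-day-off"><a href="/calendar-node-field-date/day/2014-01-07">7</a></div>
HTML

my $dom = Mojo::DOM->new($html);

# Select anchors that are direct children of an "on" day cell,
# take each href, and resolve it against the base URL.
my @links = $dom->find('div.mini-day-on > a[href*="/day/"]')
                ->map( attr => 'href' )
                ->map( sub { URI->new_abs( $_, $base )->as_string } )
                ->each;

print "$_\n" for @links;
```

Only the anchor inside the mini-day-on cell should be selected; the mini-day-off cell is ignored because the class selector does not match it.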