Trying to figure out how to push the particular links contained in each of a list of links into a separate list of links



General idea


Here is a snippet of what I am working with:

my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;
foreach (@blarg_links) {
        $url_temp = $_;
        $page_temp = get( $url_temp ) or die $!;
        $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
        @temp_stuff = $p_temp->look_down(
                _tag => 'foo',
                class => 'bar'
        );
        foreach (@temp_stuff) {
                push(@collector, "http://www.foobar.sx" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
        };
};

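Hopefully it is clear that what I am haplessly trying to do is push the link endings found on each page in a list of links into an array called @temp_stuff. So the first link in @blarg_links, when visited, has one or more foo tags with an associated bar class; when processed with as_HTML, each of those matches the href value I want, which then gets pumped into an array of links holding the data I am really after... Does that make sense?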


Actual data


my $url2 = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $page2 = get( $url2 ) or die $!;
my $p2 = HTML::TreeBuilder->new_from_content( $page2 );
my @stuff2 = $p2->look_down(
        _tag => 'div',
        class => 'year mini-day-on'
);
my @chem_links;
foreach (@stuff2) {
        push(@chem_links, $1) if $_->as_HTML =~ m/(http:\/\/www\.chemistry\.ucla\.edu\/calendar-node-field-date\/day\/[0-9]{4}-[0-9]{2}-[0-9]{2})/;
};
my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;
foreach (@chem_links) {
        $url_temp = $_;
        $page_temp = get( $url_temp ) or die $!;
        $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
        @temp_stuff = $p_temp->look_down(
                _tag => 'span',
                class => 'field-content'
        );
};
foreach (@temp_stuff) {
                push(@collector, "http://www.chemistry.ucla.edu" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
};

Note: I want to use HTML::TreeBuilder. I know there are alternatives.


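Here is a rough attempt at what I think you want.

It fetches all the links on the first page and visits each one in turn, printing the links found in each <span class="field-content"> element.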

use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder;
STDOUT->autoflush;
my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $tree = HTML::TreeBuilder->new_from_url($url);
my @chem_links;
for my $div ( $tree->look_down( _tag => 'div', class => qr{\bmini-day-on\b} ) ) {
  my ($anchor)= $div->look_down(_tag => 'a', href => qr{http://www.chemistry.ucla.edu});
  push @chem_links, $anchor->attr('href');
};
my @collector;
for my $url (@chem_links) {
  say $url;
  my $tree = HTML::TreeBuilder->new_from_url($url);
  my @seminars;
  for my $span ( $tree->look_down( _tag => 'span', class => 'field-content' ) ) {
    my ($anchor) = $span->look_down(_tag => 'a', href => qr{/});
    push @seminars, 'http://www.chemistry.ucla.edu'.$anchor->attr('href');
  }
  say "  $_" for @seminars;
  say '';
  push @collector, @seminars;
};
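One thing worth pointing out about the code in the question: its second foreach runs only after the loop over @chem_links has finished, so @temp_stuff holds just the elements from the last page visited. The following is a minimal offline sketch of the same per-page collection, not a drop-in replacement. The %pages hash and its HTML are made-up stand-ins for fetched pages, and the `next unless $anchor` guard is an addition for spans that contain no link.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-ins for fetched day pages, so the sketch runs offline.
my %pages = (
    'day/1' => '<html><body>'
             . '<span class="field-content"><a href="/seminars/a">A</a></span>'
             . '<span class="field-content">no link here</span>'
             . '</body></html>',
    'day/2' => '<html><body>'
             . '<span class="field-content"><a href="/seminars/b">B</a></span>'
             . '<span class="field-content"><a href="/events/c">C</a></span>'
             . '</body></html>',
);

my @collector;
for my $page ( sort keys %pages ) {
    my $tree = HTML::TreeBuilder->new_from_content( $pages{$page} );

    # Extract the links while this page's tree is still at hand;
    # doing it after the outer loop would only see the last page.
    for my $span ( $tree->look_down( _tag => 'span', class => 'field-content' ) ) {
        # In scalar context look_down returns the first match, or undef.
        my $anchor = $span->look_down( _tag => 'a', href => qr{^/} );
        next unless $anchor;    # skip spans without a link
        push @collector, 'http://www.chemistry.ucla.edu' . $anchor->attr('href');
    }

    $tree->delete;              # release the parsed tree
}

print "$_\n" for @collector;
```

With the two sample pages above, @collector ends up with three links, one from the first page and two from the second.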

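For a more modern framework for parsing web pages, I would suggest taking a look at Mojo::UserAgent and Mojo::DOM. Instead of manually walking through every section of the HTML tree, you can use the power of CSS selectors to zero in on the specific data you want. There is an 8-minute introductory video on the framework at Mojocast Episode 5.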

# Parses the UCLA Chemistry Calendar and displays all seminar links
use strict;
use warnings;
use Mojo::UserAgent;
use URI;
my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;
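# map(attr => 'href') is used because newer Mojolicious releases no longer
# forward method calls such as ->attr from a collection to its elements.
for my $dayhref ($dom->find('div.mini-day-on > a[href*="/day/"]')->map(attr => 'href')->each) {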
    my $dayurl = URI->new($dayhref)->abs($url);
    print $dayurl, "\n";
    my $daydom = $ua->get($dayurl->as_string)->res->dom;
    for my $seminarhref ($daydom->find('span.field-content > a[href]')->map(attr => 'href')->each) {
        my $seminarurl = URI->new($seminarhref)->abs($dayurl);
        print "  $seminarurl\n";
    }
    print "\n";
}

The output is the same as Borodin's solution using HTML::TreeBuilder:

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-06
  http://www.chemistry.ucla.edu/seminars/nano-rheology-enzymes
http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-09
  http://www.chemistry.ucla.edu/seminars/imaging-approach-biology-disease-through-chemistry
http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-10
  http://www.chemistry.ucla.edu/seminars/arginine-methylation-%E2%80%93-substrates-binders-function
  http://www.chemistry.ucla.edu/seminars/special-inorganic-chemistry-seminar
http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-13
  http://www.chemistry.ucla.edu/events/robert-l-scott-lecture-0
...
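
As a side note on the URI calls above: resolving each href against the page it came from copes with root-relative, absolute, and page-relative links alike, whereas prepending the host by string concatenation only works for root-relative ones. A small sketch, where the third href is a hypothetical example and not taken from the real site:

```perl
use strict;
use warnings;
use URI;

# The page the hrefs were found on.
my $base = 'http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-06';

my @hrefs = (
    '/seminars/nano-rheology-enzymes',                                # root-relative
    'http://www.chemistry.ucla.edu/events/robert-l-scott-lecture-0',  # already absolute
    'other-page',                                                     # hypothetical page-relative link
);

# URI->new_abs resolves each href against $base.
my @resolved = map { URI->new_abs( $_, $base )->as_string } @hrefs;

print "$_\n" for @resolved;
```

The already-absolute link comes back unchanged, while plain concatenation with the host name would have produced a malformed URL for it.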
