How do I print the URLs from an HTML page to STDOUT, and then iterate over them to do the same thing?
use strict;
use warnings;
use 5.010;
use LWP::Simple qw(get);
use HTML::TreeBuilder 5 -weak;
my $name = 'perl';
my $limit = 100;
my $offset = 0;
my $total;
while (1) {
    my $url = "http://hoshikuso.jp/?to=hunminj@ktng.com?limit=$limit&offset=$offset";
    my $html = get $url;
    my $tree = HTML::TreeBuilder->new;
    print ($tree);
    $tree->parse($html);
    if (not $total) {
        $total = $tree->look_down('http', 'https')->as_text;
        say $total;
    }
};
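You can do most of that in a pipeline of Mojo::Collection methods. There is even a size method that tells you how many items are in the collection, so you do not have to count them yourself. The grep gets rid of the odd a tags on that page, the map can do whatever you like, and then size gets the total. I have many examples in Mojo Web Clients: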
#!/usr/bin/env perl
use v5.10;
use open qw(:std :utf8);
use warnings;
use strict;
use Mojo::UserAgent;
use Mojo::Util qw(trim);
my $url = 'https://www.tripadvisor.com/Restaurants-g147275-Varadero_Matanzas_Province_Cuba.html';
my $ua = Mojo::UserAgent->new;
my $count = $ua->get( $url )
    ->res
    ->dom
    ->find( 'a[href]' )
    # keep only links whose href starts with /RestaurantsNear-
    ->grep( sub { $_->attr('href') =~ m|\A/RestaurantsNear-| } )
    ->map( sub {
        my $t = trim( $_->all_text );
        printf qq(%s -> "%s"\n), $_->attr("href"), $t;
        })
    ->size;
say "Total is $count";
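To try the grep/map/size pipeline without hitting the network, here is a minimal sketch that runs the same chain over an inline HTML string. The markup and the hrefs in it are made up for illustration and are not taken from the real page; only the Mojo::DOM and Mojo::Collection calls are the ones used above.

```perl
use strict;
use warnings;
use v5.10;
use Mojo::DOM;

# Hypothetical stand-in for a fetched page (assumption, not the real markup)
my $html = <<'HTML';
<ul>
  <li><a href="/RestaurantsNear-1.html"> First place </a></li>
  <li><a href="/Other-2.html">Skipped by grep</a></li>
  <li><a name="anchor">No href, skipped by the selector</a></li>
  <li><a href="/RestaurantsNear-3.html">Second place</a></li>
</ul>
HTML

my $links = Mojo::DOM->new($html)
    ->find('a[href]')                                             # only a tags that have an href
    ->grep( sub { $_->attr('href') =~ m|\A/RestaurantsNear-| } )  # keep matching hrefs
    ->map( sub { $_->attr('href') } );                            # collection of href strings

say for $links->each;              # one URL per line on STDOUT
say 'Total is ', $links->size;     # size counts the items for you
```

Returning the href from map (instead of printing inside it) keeps the collection reusable: the same $links can be printed, counted, or passed on to another step.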
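The code below should get you well on your way. These three documentation pages should be all you need to go further: Mojo::DOM, Mojo::Collection, and Mojo::UserAgent.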
#!/usr/bin/env perl
use warnings;
use strict;
use Mojo::UserAgent;
my $url = 'https://www.tripadvisor.com/Restaurants-g147275-Varadero_Matanzas_Province_Cuba.html';
my $ua = Mojo::UserAgent->new;
my $total=0;
my $page = $ua->get( $url )->res->dom() ; ## returns Mojo::DOM object of whole page
for my $node ( $page->find( "a[href]" )->each() ) ## list of Mojo::DOM objects, one per a tag with an href attribute
{
    print "###########\n";
    print $node->text() . "\n";
    print $node->attr("href") . "\n";
    $total++;
}
print "\n\ntotal A tags with href attribute : $total\n";
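For the "iterate over them" half of the question, the hrefs usually need to be turned into absolute URLs before they can be fetched in turn. The following is only a sketch of one way to do that with Mojo::URL's to_abs; the base URL, the markup, and the %seen/@queue names are assumptions made up for the example, not something from the answers above.

```perl
use strict;
use warnings;
use v5.10;
use Mojo::DOM;
use Mojo::URL;

# Hypothetical base URL and markup, used only for illustration
my $base = Mojo::URL->new('https://example.com/dir/index.html');
my $html = '<a href="/a.html">A</a> <a href="b.html">B</a> <a href="/a.html">A again</a>';

my %seen;     # absolute URLs already queued
my @queue;    # URLs to visit next, in page order

for my $node ( Mojo::DOM->new($html)->find('a[href]')->each ) {
    # Resolve the (possibly relative) href against the page it came from
    my $abs = Mojo::URL->new( $node->attr('href') )->to_abs($base)->to_string;
    next if $seen{$abs}++;    # skip duplicates
    push @queue, $abs;
}

say for @queue;
```

Each URL in @queue could then be passed to $ua->get and run through the same loop; keeping %seen across pages avoids fetching the same link twice.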