Extract all links of a specific form

I have a page (e.g. http://www.stephenfry.com/) from which I want to scrape all the links. I want to put every link of the form http://www.stephenfry.com/WHATEVER into an array. All I have so far is the following:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;
# I ONLY WANT TO USE JUST THESE
my $url = 'http://www.stephenfry.com/';
my $doc = get( $url );
my $adt = HTML::Tree->new();
$adt->parse( $doc );
my @juice = $adt->look_down(
    _tag => 'a',
    href => 'REGEX?'
);

I don't know how to collect only those links.

You need to use the extract_links() method rather than look_down():

use strict;
use warnings;
use LWP::Simple;
use HTML::Tree;
my %seen;
my $url = 'http://www.stephenfry.com/';
my $doc = get($url);
my $adt = HTML::Tree->new();
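$adt->parse($doc);
$adt->eof;    # signal end of input so the tree is finalised
# extract_links returns an array ref of [ $link, $element, $attr, $tag ] entries
my $links_array_ref = $adt->extract_links('a');
# keep each matching URL once, in order of first appearance
my @links = grep { /www\.stephenfry\.com/ and !$seen{$_}++ } map $_->[0],
  @$links_array_ref;
print "$_\n" for @links;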

Partial output:

http://www.stephenfry.com/
http://www.stephenfry.com/blog/
http://www.stephenfry.com/category/blessays/
http://www.stephenfry.com/category/features/
http://www.stephenfry.com/category/general/
...
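
If you would rather stay with look_down(), it also accepts a compiled regex (qr//) as an attribute criterion, which is what the 'REGEX?' placeholder in the question was reaching for. The following is only a minimal sketch: the inline HTML is made-up sample markup standing in for the fetched page, and HTML::TreeBuilder (shipped in the same HTML-Tree distribution) is used so the example runs without network access.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for the fetched page.
my $html = <<'END_HTML';
<html><body>
<a href="http://www.stephenfry.com/blog/">Blog</a>
<a href="http://www.stephenfry.com/blog/">Blog again</a>
<a href="http://example.com/elsewhere">Elsewhere</a>
<a name="top">No href here</a>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Match <a> elements whose href attribute matches the regex.
my @anchors = $tree->look_down(
    _tag => 'a',
    href => qr{^http://www\.stephenfry\.com/},
);

# Pull out the href values and drop duplicates, keeping first-seen order.
my %seen;
my @links = grep { !$seen{$_}++ } map { $_->attr('href') } @anchors;

print "$_\n" for @links;

$tree->delete;    # free the tree explicitly
```

Unlike extract_links(), look_down() returns the elements themselves, so the href has to be read from each one with attr().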

Using WWW::Mechanize may be simpler, and it does return more links:

use strict;
use warnings;
use WWW::Mechanize;
my %seen;
my $mech = WWW::Mechanize->new();
$mech->get('http://www.stephenfry.com/');
# keep each matching URL once, in order of first appearance
my @links = grep { /www\.stephenfry\.com/ and !$seen{$_}++ } map $_->url,
  $mech->links();
print $_, "\n" for @links;

Partial output:

http://www.stephenfry.com/wp-content/themes/fry/images/favicon.png
http://www.stephenfry.com/wp-content/themes/fry/style.css
http://www.stephenfry.com/wordpress/xmlrpc.php
http://www.stephenfry.com/feed/
http://www.stephenfry.com/comments/feed/
...
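
The extra entries (favicon, stylesheet, feeds) probably appear because links() reports more than <a> tags; it also picks up elements such as <link>. If only anchors are wanted, one possible approach is find_all_links() with a tag filter. This is a sketch under assumptions: the https? in the pattern is my own addition in case the site redirects to HTTPS, and the output depends on the live page.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.stephenfry.com/');

# tag => 'a' limits matches to anchor elements;
# url_abs_regex is tested against each link's absolute URL.
# The optional "s" in https? is an assumption, not part of the original filter.
my @found = $mech->find_all_links(
    tag           => 'a',
    url_abs_regex => qr{^https?://www\.stephenfry\.com/},
);

# url_abs returns a URI object; stringify it and drop duplicates.
my %seen;
my @links = grep { !$seen{$_}++ } map { $_->url_abs->as_string } @found;

print "$_\n" for @links;
```

Matching on the absolute URL also catches relative hrefs such as /blog/, which a filter on the raw url would miss.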

Hope this helps!
