Using Perl to strip everything from a string except an HTML anchor link

Using Perl, how can I use a regex to take a string that has random HTML in it, including one HTML link with anchor text, like this:

  <a href="http://example.com" target="_blank">Whatever Example</a>

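and have it keep ONLY that, discarding everything else? It should work no matter what other attributes appear in the <a tag alongside href, such as title=, style=, or anything else, and it should keep the anchor text ("Whatever Example") and the closing </a>.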

You can take advantage of a stream parser such as HTML::TokeParser::Simple:

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $html = <<EO_HTML;
Using Perl, how can I use a regex to take a string that has random HTML in it
with one HTML link with anchor, like this:
   <a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a>
       and it leave ONLY that and get rid of everything else? No matter what
   was inside the href attribute with the <a, like title=, or style=, or
   whatever. and it leave the anchor: "Whatever Example" and the </a>?
EO_HTML
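# Parse the string as a stream of tokens.
my $parser = HTML::TokeParser::Simple->new(string => $html);
# For each opening <a> tag, print the tag verbatim, then the link text up to
# the closing tag (get_text returns text only, so nested tags like <i> are dropped).
while (my $tag = $parser->get_tag('a')) {
    print $tag->as_is, $parser->get_text('/a'), "</a>\n";
}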

Output:

 $ ./whatever.pl
 <a href="http://example.com" target="_blank">Whatever Interesting Example</a>

If you need a simple regex solution, a naive approach could be:

my @anchors = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;
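
To see what that pattern returns, here is a minimal, self-contained sketch. The sample string in `$text` is made up for illustration and is not from the question; it assumes well-formed input with a single link.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample input: one link surrounded by unrelated markup.
my $text = <<'EO_HTML';
<p>Some <b>random</b> HTML before the link.</p>
<a href="http://example.com" target="_blank"
   title="demo">Whatever Example</a>
<div>More markup after it.</div>
EO_HTML

# /g collects every match, /s lets "." cross newlines, /i ignores tag case.
# In list context the match returns the captured groups, one per anchor.
my @anchors = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;

print "$_\n" for @anchors;
```

Using `@` as the pattern delimiter avoids having to escape the `/` in `</a>`.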

However, as @dan1111 mentioned, regular expressions are not the right tool for parsing HTML, for a variety of reasons.
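
As one illustration of how the naive pattern can go wrong, the sketch below feeds it a made-up string (my own example, not from the question) in which an `<abbr>` element precedes the link. Because `<a` also matches the start of `<abbr`, the single match begins at the wrong tag.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical input: an <abbr> element appears before the only real link.
my $text = '<p><abbr title="HyperText">HTML</abbr> and '
         . '<a href="http://example.com">a link</a></p>';

# Same pattern as in the regex answer: "<a" is also a prefix of "<abbr".
my @anchors = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;

# The one match starts at <abbr ...> and runs through the link's </a>.
print "$_\n" for @anchors;
```

Tightening the pattern (for example with a word boundary after `<a`) would address this particular case, but links inside HTML comments or scripts would still be matched.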

If you need a reliable solution, look for an HTML parser module.
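
As a rough sketch of the parser route, the following uses HTML::TokeParser::Simple (a CPAN module, so it must be installed) and walks the token stream, copying every token between an opening `<a>` and its closing `</a>` verbatim. Unlike `get_text`, this keeps markup nested inside the link. The input string and the `$inside` and `$out` variables are my own for illustration, and it assumes links are not nested.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical input: noise around a single link that contains nested markup.
my $html = 'Noise <b>before</b> '
         . '<a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a>'
         . ' noise after.';

my $parser = HTML::TokeParser::Simple->new(string => $html);
my $inside = 0;    # true while we are between <a ...> and </a>
my $out    = '';

while (my $token = $parser->get_token) {
    $inside = 1 if $token->is_start_tag('a');
    $out .= $token->as_is if $inside;          # copy the token's original text
    if ($inside && $token->is_end_tag('a')) {
        $out .= "\n";
        $inside = 0;
    }
}

print $out;
```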
