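I have been trying to write a Perl script to scrape Amazon and download product reviews, but I haven't been able to get it to work. I am using the Perl modules LWP::Simple and HTML::TreeBuilder::XPath for this.
Given this HTML: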
<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small">
<span class="a-size-mini a-color-state a-text-bold">
Verified Purchase
</span>
<div class="a-section">
I bought this to replace an earlier model that got lost in transit when we moved. It is a real handy helper to have when making tortillas. Follow the recipe for flour tortillas in the little recipe book that comes with it. I make a few changes
</div>
</div>
</div>
</div>
I want to extract the product review text. To do that, I wrote:
use LWP::Simple;
#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;
# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
my @data = $tree->findvalues('div[@class ="a-section"]');
foreach (@data)
{
    print "$_\n";
}
But I'm not getting any output. Can anyone point out my mistake?
I think the XPath should be '//div[@class ="a-section"]' (add // at the start of the expression so that the div is found anywhere in the HTML).
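As choroba says, the XPath expression should start with // so that it finds descendants of type div. As it stands, you are searching for <div> elements at the root of the document, and there are none.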
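The difference between the relative and the descendant form can be seen on a small in-memory document. This is only a minimal sketch: the HTML string is made up for illustration and stands in for the fetched Amazon page, so nothing is downloaded.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up document standing in for the fetched page (an assumption).
my $html = <<'HTML';
<html><body>
<div class="a-row"><div class="a-section">Review text</div></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Relative path: only looks for div children of the starting node.
my @relative = $tree->findnodes('div[@class="a-section"]');

# Descendant path: looks for matching div elements anywhere in the tree.
my @descendant = $tree->findnodes('//div[@class="a-section"]');

printf "relative: %d, descendant: %d\n", scalar @relative, scalar @descendant;

$tree->delete;
```

With this sample the relative expression should find nothing, while the // form should find the nested div.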
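You are also looking for a class attribute that is exactly equal to a-section, whereas the class attribute of each div element can hold several classes, such as
class="a-section a-subheader a-breadcrumb celwidget"
and you want a match when any one of them is a-section.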
There are a few ways to fix this. The most obvious is to use the XPath contains function to check whether a-section appears anywhere in the class string, like this
use strict;
use warnings;
use feature 'say';
use LWP::Simple;
use HTML::TreeBuilder::XPath;
my $asin = 'B0031EJBI4';
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
my $content = get($url) or die "Could not retrieve $url\n";
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
my @nodes = $tree->findnodes('//div[contains(@class, "a-section")]');
say scalar @nodes;
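That reports 60 such nodes in the page. This is the correct result and you may not want to go any further, but the solution isn't safe, because it will match nodes such as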
<div class="aaa-sections">
as well. To fix this properly you need to fall back to the non-XPath HTML::Element method look_down, like this, which insists on a word boundary before and after a-section.
my @nodes = $tree->look_down(
    _tag  => 'div',
    class => qr/\ba-section\b/,
);
say scalar @nodes;
Again, the result is a correct 64.
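But even this solution doesn't cater for classes that start or end with a non-word character, such as -section: \b needs a word character on one side, so /\b-section\b/ will not match when -section stands alone as a class. The most general solution is to use a subroutine in the look_down criteria, like this, which splits the class string on whitespace (' ' is correct: don't change it to / / or /\s+/) and builds a %classes hash that uses all of the substrings as keys. The presence of the a-section class is then just the value of $classes{'a-section'}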
@nodes = $tree->look_down(
    _tag => 'div',
    sub {
        return unless my $class = $_[0]->attr('class');
        my %classes = map { $_ => 1 } split ' ', $class;
        $classes{'a-section'};
    }
);
say scalar @nodes;
Once again the result for this page is 64, but this solution will work for any class string.
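To see how the three matching strategies differ without fetching anything, they can be compared on plain class strings. This is a rough sketch only: the sample strings and the helper names by_substring, by_boundary and by_token are invented for illustration and do not come from the page or from any module.

```perl
use strict;
use warnings;

# Substring test, the same idea as XPath contains().
sub by_substring {
    my ($class, $want) = @_;
    return index($class, $want) >= 0 ? 1 : 0;
}

# Word-boundary regex, as in the qr/\ba-section\b/ criterion.
sub by_boundary {
    my ($class, $want) = @_;
    return $class =~ /\b\Q$want\E\b/ ? 1 : 0;
}

# Split on whitespace and look the class up as a whole token.
sub by_token {
    my ($class, $want) = @_;
    my %classes = map { $_ => 1 } split ' ', $class;
    return $classes{$want} ? 1 : 0;
}

# Made-up [class attribute, wanted class] pairs.
my @samples = (
    [ 'a-section a-subheader', 'a-section' ],
    [ 'aaa-sections',          'a-section' ],
    [ 'a-row -section',        '-section'  ],
);

printf "%-24s %-10s %9s %8s %5s\n",
    'class attribute', 'wanted', 'substring', 'boundary', 'token';

for my $sample (@samples) {
    my ($class, $want) = @$sample;
    printf "%-24s %-10s %9d %8d %5d\n",
        $class, $want,
        by_substring($class, $want),
        by_boundary($class, $want),
        by_token($class, $want);
}
```

For the second sample only the substring test should report a match, and for the third the boundary regex should miss a class that the token lookup finds.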
use LWP::Simple;
#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;
# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Assemble the URL from the passed ASIN.
my $url = "http://www.amazon.com/gp/product/B00R3DO58K/ref=s9_ri_gw_g74_i2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-3&pf_rd_r=01F13XCKC1KBQAJ4EY87&pf_rd_t=36701&pf_rd_p=1970558902&pf_rd_i=desktop";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
my @data = $tree->findvalues('//span[@class="vtp-byline-text"]');
#print $content;
foreach (@data)
{
    print "$_\n";
}