How to extract Amazon reviews from HTML



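I have been trying to write a Perl script that scrapes Amazon and downloads product reviews, but I haven't been able to get it working. I am using the Perl modules LWP::Simple and HTML::TreeBuilder::XPath.

Given this HTML: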

<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small">
  <span class="a-size-mini a-color-state a-text-bold">
    Verified Purchase
  </span>
  <div class="a-section">
    I bought this to replace an earlier model that got lost in transit when we moved. It is a real handy helper to have when making tortillas. Follow the recipe for flour tortillas in the little recipe book that comes with it. I make a few changes
  </div>
</div>
</div>
</div>

I want to extract the text of the product review. To do that, I wrote:

use LWP::Simple;
#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;
# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
my @data = $tree->findvalues('div[@class ="a-section"]');
foreach (@data)
{
    print "$_\n";
}

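But I get no output. Can anyone point out my mistake?

I think the XPath should be '//div[@class ="a-section"]' (add // at the start of the expression so that the div is found anywhere in the HTML).

As choroba says, the XPath expression needs to begin with // to find div elements that are descendants at any depth. As written, you are searching for <div> elements at the root of the document, and there are none.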
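A minimal sketch of that difference, run against a small hand-written HTML string instead of the live Amazon page. The markup and variable names here are illustrative assumptions, not taken from the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical stand-in for the downloaded page.
my $html = <<'HTML';
<html><body>
  <div id="outer">
    <div class="a-section">Review text</div>
  </div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Relative path: only considers <div> elements directly below the context node.
my @relative = $tree->findvalues('div[@class="a-section"]');

# Descendant path: considers every <div> in the document.
my @anywhere = $tree->findvalues('//div[@class="a-section"]');

printf "relative: %d, anywhere: %d\n", scalar @relative, scalar @anywhere;

# HTML::TreeBuilder trees should be deleted explicitly when no longer needed.
$tree->delete;
```

With this input the relative expression should find nothing, while the // form finds the nested div.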

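You are also looking for a class attribute that is exactly equal to a-section, whereas the class attribute of a div element can contain several classes, such as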

class="a-section a-subheader a-breadcrumb celwidget"

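and you want to match when any one of them is a-section.

There are a few ways to solve this. The most obvious is to use the XPath contains function to check whether a-section appears anywhere in the class string, like this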

use strict;
use warnings;
use feature 'say';
use LWP::Simple;
use HTML::TreeBuilder::XPath;
my $asin = 'B0031EJBI4';
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
my $content = get($url) or die "Could not retrieve $url";
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
# Match any div whose class string contains "a-section" as a substring.
my @nodes = $tree->findnodes('//div[contains(@class, "a-section")]');
say scalar @nodes;

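which reports 64 such nodes in the page. That is the correct result and you may not want to go any further, but the solution isn't safe because it will also match nodes such as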

<div class="aaa-sections">

as well. To solve this properly you need to fall back to the non-XPath HTML::Element method look_down, like this, which insists on a word boundary before and after a-section.

my @nodes = $tree->look_down(
  _tag => 'div',
  class => qr/\ba-section\b/,
);
say scalar @nodes;

Again, the result is a correct 64.
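The two tests can be compared without fetching a page at all. This is a minimal sketch in plain Perl: the helper names substring_match and bounded_match are hypothetical, and the sample class strings are illustrative.

```perl
use strict;
use warnings;

# Substring test: the same idea as XPath contains(@class, "a-section").
sub substring_match { return index($_[0], 'a-section') >= 0 ? 1 : 0 }

# Word-boundary test: the same idea as look_down with qr/\ba-section\b/.
sub bounded_match { return $_[0] =~ /\ba-section\b/ ? 1 : 0 }

# Illustrative class strings; only the first two really carry the a-section class.
my @class_strings = (
    'a-section',
    'a-section a-subheader a-breadcrumb celwidget',
    'aaa-sections',
);

for my $class (@class_strings) {
    printf "%-48s substring=%d bounded=%d\n",
        "'$class'", substring_match($class), bounded_match($class);
}
```

The substring test accepts aaa-sections, while the word-boundary test rejects it.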

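But even this solution doesn't allow for classes that begin or end with a non-word character, such as -section, because /\b-section\b/ can never match a class that stands on its own: there is no word boundary between whitespace (or the start of the string) and the hyphen. The most general solution is to use a subroutine in the look_down criteria, like this, which splits the class string on whitespace (' ' is correct: don't change it for / / or /\s+/) and builds a %classes hash that uses all the substrings as keys. The presence of the a-section class is then just the value of $classes{'a-section'}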

@nodes = $tree->look_down(
  _tag => 'div',
  sub {
    return unless my $class = $_[0]->attr('class');
    my %classes = map { $_ => 1 } split ' ', $class;
    $classes{'a-section'};
  }
);
say scalar @nodes;

The result for this page is once again 64, but this solution will work for any class string.
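The class test from that subroutine can be tried on its own, without a parsed tree. This is a sketch under stated assumptions: has_class is a hypothetical helper name and the class strings are made up for illustration. It also shows why split ' ' is preferred over / / or /\s+/: it keeps empty strings out of the list.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $wanted is one of the whitespace-separated
# class names in $class_attr.
sub has_class {
    my ($class_attr, $wanted) = @_;
    return 0 unless defined $class_attr;
    my %classes = map { $_ => 1 } split ' ', $class_attr;
    return $classes{$wanted} ? 1 : 0;
}

print has_class('a-section a-subheader', 'a-section'), "\n";  # exact class present
print has_class('aaa-sections',          'a-section'), "\n";  # substring only
print has_class('-section widget',       '-section'),  "\n";  # non-word first character

# split ' ' skips leading whitespace and treats runs of whitespace as one separator.
my @awk_style = split ' ',   "  a-section  celwidget";
# split /\s+/ yields an empty leading field when the string starts with whitespace.
my @regex     = split /\s+/, "  a-section  celwidget";
# split / / yields an empty field between two consecutive spaces.
my @single    = split / /,   "a-section  celwidget";

printf "fields: ' '=%d, /\\s+/=%d, / /=%d\n",
    scalar @awk_style, scalar @regex, scalar @single;
```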

use LWP::Simple;
#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;
# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Product page URL (hard-coded here; $asin is not interpolated into it).
my $url = "http://www.amazon.com/gp/product/B00R3DO58K/ref=s9_ri_gw_g74_i2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-3&pf_rd_r=01F13XCKC1KBQAJ4EY87&pf_rd_t=36701&pf_rd_p=1970558902&pf_rd_i=desktop";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);

die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
my @data = $tree->findvalues('//span[@class="vtp-byline-text"]');

#print $content;
foreach (@data)
{
    print "$_\n";
}
