Perl: scraping a website and downloading its PDF files with Selenium::Chrome



I'm working on scraping a website in Perl with Selenium::Chrome, and I'd like to know how to download all the PDF files from 2017 through 2021 and store them in a folder, from this site: https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021. This is what I have done so far:

use strict;
use warnings;
use Time::Piece;
use POSIX qw(strftime);
use Selenium::Chrome;
use File::Slurp;
use File::Copy qw(copy);
use File::Path qw(make_path remove_tree);
use LWP::Simple;

my $collection_name = "mre_zen_test3";
make_path("$collection_name");
#DECLARE SELENIUM DRIVER
my $driver = Selenium::Chrome->new;
#NAVIGATE TO SITE
print "trying to get toc_url\n";
$driver->navigate('https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021');
sleep(8);
#GET PAGE SOURCE
my $toc_content = $driver->get_page_source();
$toc_content =~ s/[^\x00-\x7f]//g;
write_file("toc.html", $toc_content);
print "writing toc.html\n";
sleep(5);
$toc_content = read_file("toc.html");

This script only downloads the whole content of the page. I hope someone here can help me and teach me. Thank you very much.

Here is some working code, to hopefully help you get started:

use warnings;
use strict;
use feature 'say';
use Path::Tiny;  # only for convenience
use Selenium::Chrome;

my $base_url = q(https://www.fda.gov/drugs/)
    . q(warning-letters-and-notice-violation-letters-pharmaceutical-companies/);

my $show = 1;  # to see navigation; set to false for headless operation

# A little demo of how to set some browser options
my %chrome_capab = do {
    my @cfg = ($show)
        ? ('window-position=960,10', 'window-size=950,1180')
        : 'headless';
    'extra_capabilities' => { 'goog:chromeOptions' => { args => [ @cfg ] } }
};

my $drv = Selenium::Chrome->new( %chrome_capab );

my @years = 2017..2021;

foreach my $year (@years) {
    my $url = $base_url . "untitled-letters-$year";

    $drv->get($url);
    say "\nPage title: ", $drv->get_title;
    sleep 1 if $show;

    my $elem = $drv->find_element(
        q{//li[contains(text(), 'PDF')]/a[contains(text(), 'Untitled Letter')]}
    );
    sleep 1 if $show;

    # Downloading the file is surprisingly not simple with Selenium (see text),
    # but as we have found the link we can get its URL and then use the
    # Selenium-provided user-agent (it's LWP::UserAgent)
    my $href = $elem->get_attribute('href');
    say "pdf's url: $href";

    my $response = $drv->ua->get($href);
    die $response->status_line if not $response->is_success;
    say "Downloading 'Content-Type': ", $response->header('Content-Type');

    my $filename = "download_$year.pdf";
    say "Save as $filename";
    path($filename)->spew( $response->decoded_content );
}

This takes some shortcuts, switches approaches, and sidesteps a few issues (which would need addressing for fuller use of this useful tool). It downloads one PDF from each page; to download all of them we need to change the XPath expression used to locate them:

my @hrefs =
    map { $_->get_attribute('href') }
    $drv->find_elements(
        # There's no ends-with(...) in XPath 1.0 (nor matches() with regex)
        q{//li[contains(text(), '(PDF)')]}
      . q{/a[starts-with(@href, '/media/') and contains(@href, '/download')]}
    );

Now loop over the links, form the filenames more carefully, and download each one as in the program above. I can fill these gaps further if needed.
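To sketch that step: the helper name `pdf_filename` and the `letter_<year>_<id>.pdf` naming scheme below are my own illustrative choices, and the `/media/<id>/download` href shape is assumed from the XPath filter above, so verify it against the actual page.

```perl
use strict;
use warnings;
use feature 'say';

# Derive a distinct filename from an href like '/media/146371/download'.
# (Helper name and naming scheme are illustrative; the '/media/<id>/download'
# shape is an assumption taken from the XPath filter above.)
sub pdf_filename {
    my ($href, $year) = @_;
    my ($id) = $href =~ m{/media/([0-9]+)/download};
    $id //= 'unknown';
    return "letter_${year}_${id}.pdf";
}

say pdf_filename('/media/146371/download', 2021);  # letter_2021_146371.pdf

# Inside the per-year loop of the program above, one might then do:
# for my $href (@hrefs) {
#     # get_attribute may return an absolute URL; handle both cases
#     my $url = $href =~ m{^http} ? $href : "https://www.fda.gov$href";
#     my $response = $drv->ua->get($url);
#     die $response->status_line if not $response->is_success;
#     path( pdf_filename($href, $year) )->spew( $response->decoded_content );
# }
```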

This code puts the PDF files in the working directory on disk. Please review it before running, to make sure nothing gets overwritten!

See Selenium::Remote::Driver for a starting point.


Note: Selenium isn't needed for this particular task; it's all direct HTTP requests with no JavaScript, so LWP::UserAgent or Mojo would do just fine. But I take it that you want to learn how to use Selenium, which is often needed and useful to know.
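For completeness, here is a minimal non-Selenium sketch of the same idea using Mojo::UserAgent. The CSS selector and the absolute-URL base are my assumptions about the page's markup (mirroring the XPath filter above), so check them before relying on this.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;

my $ua   = Mojo::UserAgent->new(max_redirects => 5);
my $base = 'https://www.fda.gov';

for my $year (2017 .. 2021) {
    my $url = "$base/drugs/warning-letters-and-notice-violation-letters"
            . "-pharmaceutical-companies/untitled-letters-$year";

    my $dom = $ua->get($url)->result->dom;

    # Same filter as the XPath version: links under /media/ ending in /download
    # (this selector reflects my reading of the page; verify it first)
    for my $link ($dom->find('a[href^="/media/"]')->each) {
        my $href = $link->attr('href');
        next unless $href =~ m{/download\z};

        my ($id) = $href =~ m{/media/([0-9]+)/};
        $id //= 'unknown';

        say "Saving $base$href";
        $ua->get($base . $href)->result->save_to("letter_${year}_${id}.pdf");
    }
}
```

No browser is driven here at all: `Mojo::UserAgent` fetches the page, `Mojo::DOM` selects the links, and `save_to` writes each response body straight to disk.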
