Hi, I'm scraping a site that has articles, and each article contains a link to a file. I managed to scrape all the article links; now I want to visit each one and collect the link inside it, rather than having to save the results of the first scrape to a JSON file and then write another script.
The thing is, I'm new to this, so I don't really know how to go about it. Thanks in advance!
import scrapy

class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = [
        "http://www.sante.gouv.sn/actualites/"
    ]

    def parse(self, response):
        base = "http://www.sante.gouv.sn/actualites/"
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            href = link.css("a::attr(href)").extract()
            # here instead of yield, i want to parse the href and then maybe yield the result of THAT parse
            yield {
                "title": title,
                "href": href
            }

        # next step: for each href, parse again and get the link in that page to the pdf file
        # the pdf link can easily be collected with response.css(".file a::attr(href)").get()
        # then write that link to a json file
        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
You can make a request to those pdf links with a new callback and put the extraction logic there.
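For instance, here is a minimal sketch of that idea built on the question's own selectors; the parse_article callback name and the cb_kwargs usage are just one way to pass the title along, and the .file a selector is the one mentioned in the question, so adjust it if the article pages differ:

import scrapy

class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = ["http://www.sante.gouv.sn/actualites/"]

    def parse(self, response):
        # Follow each article link and hand the title over to the next callback
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            href = link.css("a::attr(href)").get()
            if href:
                yield response.follow(href, callback=self.parse_article,
                                      cb_kwargs={"title": title})

        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response, title):
        # Extract the pdf link on the article page and yield one item per article
        yield {
            "title": title,
            "article_url": response.url,
            "pdf": response.css(".file a::attr(href)").get(),
        }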
A CrawlSpider is better suited to this than a plain basic spider. The basic spider template is generated by default, so you have to specify which template to use when generating the spider.
Assuming you have already created the project, run this from its root folder:
$ scrapy genspider -t crawl sgbd sante.sec.gouv.sn
Open the sgbd.py file; you will notice it looks a bit different from the basic spider template.
If you are not familiar with XPath, it is worth reading up on it first. LinkExtractor & Rule define your spider's behavior, as described in the Scrapy documentation.
Edit the file:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SgbdSpider(CrawlSpider):
    name = 'sgbd'
    allowed_domains = ['sante.sec.gouv.sn']
    start_urls = ['https://sante.sec.gouv.sn/actualites']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

    def set_user_agent(self, request, spider):
        request.headers['User-Agent'] = self.user_agent
        return request

    # First rule gets the links to the articles; callback is the function executed
    # after following the link to each article.
    # Second rule handles pagination.
    # Couldn't get it to work when passing css selectors in LinkExtractor as
    # restrict_css, used XPaths instead.
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[2]/ul/li[11]/a'),
            callback='parse_item',
            follow=True,
            process_request='set_user_agent',
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[1]/div/div/div/div[3]/span/div/h4/a'),
            process_request='set_user_agent',
        )
    )

    # Extract title & link to pdf
    def parse_item(self, response):
        yield {
            'title': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/font/font/text()').get(),
            'href': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/@href').get()
        }
Unfortunately, that is as far as I could take it, since the site was unreachable for me even with different proxies and the response times were too long. You may need to tweak those XPaths further. Good luck.
Run the spider and save the output to JSON:
$ scrapy crawl sgbd -o results.json
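If you would rather not pass -o on every run, a rough equivalent (assuming Scrapy 2.1 or newer, where the FEEDS setting is available) is to configure a feed export in the project's settings.py:

# settings.py: write all scraped items to results.json as JSON
FEEDS = {
    "results.json": {
        "format": "json",
    },
}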
Parse the link in another function, then parse again in yet another function if needed. You can yield whatever result you want from any of those functions.
I agree with @bens_ak47 and user9958765, use a separate function.
For example, change this:
yield scrapy.Request(next_page, callback=self.parse)
to this:
yield scrapy.Request(next_page, callback=self.parse_pdffile)
Then add the new method:
def parse_pdffile(self, response):
    print(response.url)
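From there, a minimal sketch of what parse_pdffile could look like, reusing the .file a selector mentioned in the question for the pdf link (adjust it to whatever the article pages actually use):

def parse_pdffile(self, response):
    # Pull the pdf link out of the article page and yield it as an item
    pdf_href = response.css(".file a::attr(href)").get()
    yield {
        "page": response.url,
        "pdf": response.urljoin(pdf_href) if pdf_href else None,
    }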