How to scrape the result of another scrape within the same parse function



Hi, I'm scraping a website that lists articles, and each article contains a link to a file. I managed to scrape all the article links, and now I want to visit each one and collect the links inside it, rather than having to save the results of the first scrape to json and then write another script.

The thing is, I'm new to this, so I don't really know how to go about it. Thanks in advance!

import scrapy


class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = [
        "http://www.sante.gouv.sn/actualites/"
    ]

    def parse(self, response):
        base = "http://www.sante.gouv.sn/actualites/"
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            href = link.css("a::attr(href)").extract()
            # here instead of yield, i want to parse the href and then maybe yield the result of THAT parse.
            yield {
                "title": title,
                "href": href
            }
        # next step for each href, parse again and get link in that page for pdf file
        #   pdf link can easily be collected with response.css(".file a::attr(href)").get()
        #   then write that link in a json file
        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

You can issue a request to those pdf links with a new callback, and put the extraction logic there.
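
For example, a minimal sketch that stays with the question's basic spider and just chains a second callback; parse_article is a name chosen here, cb_kwargs needs Scrapy 1.7+, and the .file a selector is the one mentioned in the question's comments:

import scrapy


class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = ["http://www.sante.gouv.sn/actualites/"]

    def parse(self, response):
        for link in response.css(".card-title a"):
            title = link.css("::text").get()
            href = link.css("::attr(href)").get()
            # follow each article and keep its title attached to the request
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_article,
                cb_kwargs={"title": title},
            )

        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_article(self, response, title):
        # ".file a" is the selector the question says works for the pdf link
        pdf = response.css(".file a::attr(href)").get()
        yield {
            "title": title,
            "href": response.urljoin(pdf) if pdf else None,
        }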

  • A CrawlSpider handles this better than the simple basic spider. The basic spider template is generated by default, so you have to specify which template to use when generating the spider.

  • Assuming you've already created the project & are in its root folder:

    $ scrapy genspider -t crawl sgbd sante.sec.gouv.sn
    
  • Open the sgbd.py file; you'll notice it's slightly different from the basic spider template.

  • If you're not familiar with XPath, note that everything below relies on it.

  • LinkExtractor & Rule define your spider's behaviour; see the documentation for details.

  • Edit the file:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class SgbdSpider(CrawlSpider):
        name = 'sgbd'
        allowed_domains = ['sante.sec.gouv.sn']
        start_urls = ['https://sante.sec.gouv.sn/actualites']
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

        def set_user_agent(self, request, spider):
            request.headers['User-Agent'] = self.user_agent
            return request

        # First rule gets the links to the articles; callback is the function executed after following the link to each article
        # Second rule handles pagination
        # Couldn't get it to work when passing css selectors in LinkExtractor as restrict_css,
        # used XPaths instead
        rules = (
            Rule(
                LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[2]/ul/li[11]/a'),
                callback='parse_item',
                follow=True,
                process_request='set_user_agent',
            ),
            Rule(
                LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[1]/div/div/div/div[3]/span/div/h4/a'),
                process_request='set_user_agent',
            ),
        )

        # Extract title & link to pdf
        def parse_item(self, response):
            yield {
                'title': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/font/font/text()').get(),
                'href': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/@href').get()
            }
    
  • Unfortunately that's as far as I could get, because the site was unreachable even through different proxies; the response times were too long. You may need to tweak those XPaths further (a CSS-based variant of parse_item is sketched after this list). Good luck.

  • Run the spider & save the output to json:

    $ scrapy crawl sgbd -o results.json
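
The absolute XPaths above are tied to one exact page layout, so here is a hedged sketch of parse_item rewritten around the CSS selector the question already reports for the pdf link (.file a); the h1 selector used for the title is only an assumption about the article markup and may need adjusting:

    # Sketch of a CSS-based parse_item.
    # ".file a" comes from the question; "h1::text" is an assumed title selector.
    def parse_item(self, response):
        pdf_href = response.css(".file a::attr(href)").get()
        yield {
            'title': response.css('h1::text').get(),
            'href': response.urljoin(pdf_href) if pdf_href else None,
        }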
    

Parse the link in another function, then parse again in yet another function. You can yield whatever result you want from any of these functions.

I agree with @bens_ak47 & user9958765: use a separate function.

For example, change this:

yield scrapy.Request(next_page, callback=self.parse)

to this:

yield scrapy.Request(next_page, callback=self.parse_pdffile)

Then add the new method:

def parse_pdffile(self, response):
    print(response.url)
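
Note that in the question's spider it is the article href, not next_page, that should get the new callback (the pagination request should keep pointing at self.parse). Inside the new method you can then yield an item instead of printing, so the pdf link ends up in the exported json; a minimal sketch, again assuming the .file a selector from the question's comments:

def parse_pdffile(self, response):
    # ".file a" is the selector mentioned in the question's comments
    pdf = response.css(".file a::attr(href)").get()
    yield {
        "article": response.url,
        "pdf": response.urljoin(pdf) if pdf else None,
    }

Then run, for example, scrapy crawl sgbd -o results.json to write the yielded items to a json file.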
