Hi, I'm scraping a site that has articles, and each article contains a link to a file. I managed to scrape all the article links; now I want to visit each one and collect the link inside it, rather than having to save the results of the first scrape to a JSON file and then write another script.
The thing is, I'm new to this, so I don't really know how to go about it. Thanks in advance!
import scrapy

class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = [
        "http://www.sante.gouv.sn/actualites/"
    ]

    def parse(self, response):
        base = "http://www.sante.gouv.sn/actualites/"
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            href = link.css("a::attr(href)").extract()
            # here instead of yield, i want to parse the href and then maybe yield the result of THAT parse
            yield {
                "title": title,
                "href": href
            }

        # next step: for each href, parse again and get the link in that page to the pdf file
        # the pdf link can easily be collected with response.css(".file a::attr(href)").get()
        # then write that link to a json file
        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
You can make a request to those pdf links with a new callback and put the extraction logic there.
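For instance, here is a minimal sketch of that idea built on the question's own selectors; the parse_article callback name and the cb_kwargs usage are just one way to pass the title along, and the .file a selector is the one mentioned in the question, so adjust it if the article pages differ:

import scrapy

class SgbdSpider(scrapy.Spider):
    name = "sgbd"
    start_urls = ["http://www.sante.gouv.sn/actualites/"]

    def parse(self, response):
        # Follow each article link and hand the title over to the next callback
        for link in response.css(".card-title a"):
            title = link.css("a::text").get()
            href = link.css("a::attr(href)").get()
            if href:
                yield response.follow(href, callback=self.parse_article,
                                      cb_kwargs={"title": title})

        next_page = response.css("li.pager-next a::attr(href)").get()
        if next_page is not None and next_page.split("?page=")[-1] != "35":
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response, title):
        # Extract the pdf link on the article page and yield one item per article
        yield {
            "title": title,
            "article_url": response.url,
            "pdf": response.css(".file a::attr(href)").get(),
        }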
A CrawlSpider is better suited to this than a plain basic spider. The basic spider template is generated by default, so you have to specify which template to use when generating the spider.
Assuming you have already created the project, run this from its root folder:
$ scrapy genspider -t crawl sgbd sante.sec.gouv.sn
Open the sgbd.py file; you will notice it looks a bit different from the basic spider template.
If you are not familiar with XPath, it is worth reading up on it first. LinkExtractor & Rule define your spider's behavior, as described in the Scrapy documentation.
Edit the file:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SgbdSpider(CrawlSpider):
    name = 'sgbd'
    allowed_domains = ['sante.sec.gouv.sn']
    start_urls = ['https://sante.sec.gouv.sn/actualites']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'

    def set_user_agent(self, request, spider):
        request.headers['User-Agent'] = self.user_agent
        return request

    # First rule gets the links to the articles; callback is the function executed
    # after following the link to each article.
    # Second rule handles pagination.
    # Couldn't get it to work when passing css selectors in LinkExtractor as
    # restrict_css, used XPaths instead.
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[2]/ul/li[11]/a'),
            callback='parse_item',
            follow=True,
            process_request='set_user_agent',
        ),
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="main-content"]/div[1]/div/div[1]/div/div/div/div[3]/span/div/h4/a'),
            process_request='set_user_agent',
        )
    )

    # Extract title & link to pdf
    def parse_item(self, response):
        yield {
            'title': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/font/font/text()').get(),
            'href': response.xpath('//*[@id="main-content"]/section/div[1]/div[1]/article/div[2]/div[2]/div/span/a/@href').get()
        }
Unfortunately, that is as far as I could take it, since the site was unreachable for me even with different proxies and the response times were too long. You may need to tweak those XPaths further. Good luck.
Run the spider and save the output to JSON:
$ scrapy crawl sgbd -o results.json
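If you would rather not pass -o on every run, a rough equivalent (assuming Scrapy 2.1 or newer, where the FEEDS setting is available) is to configure a feed export in the project's settings.py:

# settings.py: write all scraped items to results.json as JSON
FEEDS = {
    "results.json": {
        "format": "json",
    },
}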
Parse the link in another function, then parse again in yet another function if needed. You can yield whatever result you want from any of those functions.
I agree with @bens_ak47 and user9958765, use a separate function.
For example, change this:
yield scrapy.Request(next_page, callback=self.parse)
to this:
yield scrapy.Request(next_page, callback=self.parse_pdffile)
Then add the new method:
def parse_pdffile(self, response):
    print(response.url)
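From there, a minimal sketch of what parse_pdffile could look like, reusing the .file a selector mentioned in the question for the pdf link (adjust it to whatever the article pages actually use):

def parse_pdffile(self, response):
    # Pull the pdf link out of the article page and yield it as an item
    pdf_href = response.css(".file a::attr(href)").get()
    yield {
        "page": response.url,
        "pdf": response.urljoin(pdf_href) if pdf_href else None,
    }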