How to crawl and scrape data at the same time?

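This is my first experience with web scraping, and I'm not sure whether I'm doing it well. The thing is, I want to crawl and scrape data at the same time: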

Get all the links that I'm going to scrape
Store them in MongoDB
Visit them one by one to scrape their content

# Crawling: get all links to be scraped later on
from scrapy import Field, Item, Request, Selector, Spider


class Link(Item):
    # item holding one collected link (definition assumed; not shown in the post)
    url = Field()


class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start for start in range(0, 10000, 20)]

    def parse(self, response):
        # loop for all pages
        next_page = Selector(response).xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract()
        if not not next_page:
            yield Request("https://" + next_page[0], callback=self.parse)
        # loop for all links in a single page
        links = Selector(response).xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')
        items = []  # links collected from this page
        for link in links:
            item = Link()
            url = response.urljoin(link.xpath('a/@href')[0].extract())
            item['url'] = url
            items.append(item)
        for item in items:
            yield item

# Scraping: get all the stored links on MongoDB and scrape them????

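What exactly is your use case? Are you mainly interested in the links, or in the content of the pages they point to? That is, is there any reason to store the links in MongoDB first and scrape the pages later? If you really do need to store the links in MongoDB, the best way is to use an item pipeline to store the items. The Scrapy documentation on item pipelines even includes an example of storing items in MongoDB. If you need something more sophisticated, look at the scrapy-mongodb package.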
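As a rough illustration of the item pipeline approach, here is a minimal sketch. The class name MongoLinkPipeline, the setting names MONGO_URI and MONGO_DATABASE, the database name "jobs" and the collection name "links" are all assumptions made for this example, not something taken from the question. pymongo is imported lazily so that a stand-in collection can be injected without a running MongoDB server.

```python
class MongoLinkPipeline(object):
    """Sketch of an item pipeline that upserts each scraped link into MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", mongo_db="jobs",
                 collection_name="links", collection=None):
        # All default values here are placeholders chosen for this sketch.
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name
        self.collection = collection  # optional pre-built (or fake) collection
        self.client = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are hypothetical setting names.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "jobs"),
        )

    def open_spider(self, spider):
        # Connect only if no collection was injected.
        if self.collection is None:
            import pymongo
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)
        # Upsert keyed on the URL so that re-crawling does not store duplicates.
        self.collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        return item
```

The pipeline would then be enabled in the project settings, for example with ITEM_PIPELINES = {"myproject.pipelines.MongoLinkPipeline": 300}, where the module path is again only a placeholder.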

Apart from that, here are some comments on the actual code you posted:

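Instead of Selector(response).xpath(...), just use response.xpath(...).
If you only need the first element extracted from a selector, use extract_first() instead of extract() followed by indexing.
Don't write if not not next_page:, write if next_page:.
The second loop over items is not needed; yield each item inside the loop over links.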
