How to scrape a classifieds site



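I am trying to write a crawler with Scrapy that scrapes a classifieds-type (target) site and then fetches information from the pages the target site links to. The Scrapy tutorial only helps me get the links from the target URL, not the second level of data collection I am after. Any leads?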

For example, the target site would be:

start_url = "http://newyork.craigslist.org/search/cta"

For every link on the target site, I want to go to each listing and get the price, seller, location, and phone or email.

import scrapy
# Current import paths; the old scrapy.contrib.* modules and the Python 2 urlparse module are gone in recent releases
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urljoin

class CompItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    location = scrapy.Field()


class criticspider(CrawlSpider):
    name = "craig"
    allowed_domains = ["newyork.craigslist.org"]
    start_urls = ["http://newyork.craigslist.org/search/cta"]

    def parse(self, response):
        sites = response.xpath('//div[@class="content"]')
        items = []
        for site in sites:
            item = CompItem()
            item['name'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="pl"]/a/text()').extract()
            item['price'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="l2"]/span[@class="price"]/text()').extract()
            item['location'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="l2"]/span[@class="pnr"]/small/text()').extract()
            items.append(item)
        return items
