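I'm trying to write a spider with Scrapy that crawls a classifieds-type (target) site and collects information from the pages its links point to. The Scrapy tutorial only got me as far as pulling the links from the target URL, not the second level of data collection I'm after. Any leads?
For example, the target site would be: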
start_url = "http://newyork.craigslist.org/search/cta"
For every link on the target site, I want to visit each listing and get the price, seller, location, and phone or email.
import scrapy
# Imported for the planned link-following step; not used yet.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urljoin


class CompItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    location = scrapy.Field()


# A plain Spider is used because CrawlSpider reserves parse() for its own
# rule handling, so overriding it there is discouraged.
class criticspider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["newyork.craigslist.org"]
    start_urls = ["http://newyork.craigslist.org/search/cta"]

    def parse(self, response):
        # Only the search results page is scraped here; listings are not followed.
        sites = response.xpath('//div[@class="content"]')
        items = []
        for site in sites:
            item = CompItem()
            item['name'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="pl"]/a/text()').extract()
            item['price'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="l2"]/span[@class="price"]/text()').extract()
            item['location'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="l2"]/span[@class="pnr"]/small/text()').extract()
            items.append(item)
        return items