Python Scrapy: extracting specific XPath fields



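I have the following structure (sample). I am using Scrapy to extract the details. I need to extract the "href" attribute and the link text, such as "Accounting". I am using the code below. I am new to XPath, so any help with extracting these specific fields would be appreciated.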

<div class = 'something'>
    <ul>
        <li><a href="http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="1">Accounting</a></li> 
        <li><a href="http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="2">Administrative</a></li> 
        <li><a href="http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="3">Advertising</a></li> 
        <li><a href="http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="4">Airline</a></li> 
    </ul>
</div>

My code is:

from scrapy.spider import BaseSpider
from jobfetch.items import JobfetchItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose

class JobFetchSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco Page"""
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com/"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        count = 0
        for sel in response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[2]/article/div[2]/ul[1]'):
            item = JobfetchItem()
            item['title'] = sel.extract()
            item['link'] = sel.extract()
            count = count+1
            print item
        yield item

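Problems with the code:

  • yield item should be inside the loop, since that is where you instantiate the items
  • the XPath you have is messy and not very reliable: it depends heavily on the positions of elements inside their parent tags and starts almost from the top-level parent of the document
  • the XPath is also incorrect: it should go down to the a elements inside the li elements inside the ul
  • sel.extract() would only give you the markup of the extracted ul element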

For the sake of an example, use a CSS selector here to get to the a elements inside the li tags:

import scrapy
from jobfetch.items import JobfetchItem


class JobFetchSpider(scrapy.Spider):
    name = "Jobsearch"
    # allowed_domains takes bare domain names, so no trailing slash
    allowed_domains = ["jobsearch.about.com"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        # Each sel is one <a> element inside the article's link list
        for sel in response.css('article[itemprop="articleBody"] div.expert-content-text > ul > li > a'):
            item = JobfetchItem()
            item['title'] = sel.xpath('text()').extract()[0]  # link text
            item['link'] = sel.xpath('@href').extract()[0]    # target URL
            yield item

Running the spider produces:

{'link': u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm', 'title': u'Accounting'}
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm', 'title': u'Administrative'}
...
{'link': u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm', 'title': u'Yacht Jobs'}

FYI, we can also use xpath():

//article[@itemprop="articleBody"]//div[@class="expert-content-text"]/ul/li/a

Use the following expressions (shown here in an interactive shell session) to extract the data you want to scrape.

In [1]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/text()').extract()
Out[1]: 
[u'Accounting',
 u'Administrative',
 u'Advertising',
 u'Airline',
 u'Animal',
 u'Alternative Energy',
 u'Auction House',
 u'Banking',
 u'Biotechnology',
 u'Business',
 u'Business Intelligence',
 u'Chef',
 u'College Admissions',
 u'College Alumni Relations and Development ',
 u'College Student Services',
 u'Construction',
 u'Consulting',
 u'Corporate',
 u'Cruise Ship',
 u'Customer Service',
 u'Data Science',
 u'Engineering',
 u'Entry Level Jobs',
 u'Environmental',
 u'Event Planning',
 u'Fashion',
 u'Film',
 u'First Job',
 u'Fundraiser',
 u'Healthcare/Medical',
 u'Health/Safety',
 u'Hospitality',
 u'Human Resources',
 u'Human Services / Social Work',
 u'Information Technology (IT)',
 u'Insurance',
 u'International Affairs / Development',
 u'International Business',
 u'Investment Banking',
 u'Law Enforcement',
 u'Legal',
 u'Maintenance',
 u'Management',
 u'Manufacturing',
 u'Marketing',
 u'Media',
 u'Museum',
 u'Music',
 u'Non Profit',
 u'Nursing',
 u'Outdoor ',
 u'Public Administration',
 u'Public Relations',
 u'Purchasing',
 u'Radio',
 u'Real Estate ',
 u'Restaurant',
 u'Retail',
 u'Sales',
 u'School',
 u'Science',
 u'Ski and Snow Jobs',
 u'Social Media',
 u'Social Work',
 u'Sports',
 u'Television',
 u'Trades',
 u'Transportation',
 u'Travel',
 u'Yacht Jobs']

In [2]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/@href').extract()
Out[2]: 
[u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm',
 u'http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/animal-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/alternative-energy-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/auction-house-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/banking-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/biotechnology-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/business-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/business-intelligence-job-titles.htm',
 u'http://culinaryarts.about.com/od/culinaryfundamentals/a/whatisachef.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-admissions-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-alumni-relations-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-student-service-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/construction-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/consulting-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/c-level-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/cruise-ship-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/customer-service-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/data-science-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/engineering-job-titles.htm',
 u'http://jobsearch.about.com/od/best-jobs/a/best-entry-level-jobs.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/environmental-job-titles.htm',
 u'http://eventplanning.about.com/od/eventcareers/tp/corporateevents.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/fashion-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/film-job-titles.htm',
 u'http://jobsearch.about.com/od/justforstudents/a/first-job-list.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/fundraiser-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/health-care-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/health-safety-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/hospitality-job-titles.htm',
 u'http://humanresources.about.com/od/HR-Roles-And-Responsibilities/fl/human-resources-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/human-services-social-work-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/it-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/insurance-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/international-affairs-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/international-business-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/investment-banking-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/law-enforcement-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/legal-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/maintenance-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/management-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/manufacturing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/marketing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/media-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/museum-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/music-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/nonprofit-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/nursing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/public-administration-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/public-relations-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/purchasing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/radio-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/real-estate-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/restaurant-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/retail-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/sales-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/high-school-middle-school-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/science-job-titles.htm',
 u'http://jobsearch.about.com/od/skiandsnowjobs/a/skijob2_2.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/social-media-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/social-work-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/sports-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/television-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/trades-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/transportation-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/travel-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm']
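
The two lists above line up by position, so they can be combined into title/link pairs. A small sketch using only the first three values from each output (the full lists work the same way); the names titles, links and records are just for illustration.

```python
# First three values from each of the two outputs above.
titles = [u'Accounting', u'Administrative', u'Advertising']
links = [
    u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
    u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm',
    u'http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm',
]

# Pair by position; strip() removes stray whitespace
# (some titles in the full list end with a space).
records = [{'title': t.strip(), 'link': l} for t, l in zip(titles, links)]
```

Pairing by position assumes every a element has both text and an href. If that might not hold, looping over the a elements and reading both fields from each one is the safer route.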
