Trouble parsing an HTML table with Scrapy (XPath & Python)



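I want Scrapy to read an HTML table and return a CSV containing the information I need from that table.

Each item field corresponds to a th element inside a tr. I think the problem is in my XPath, but I'm also not sure I'm using the selectors correctly.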

    # -*- coding: utf-8 -*-
    from scrapy.spiders import Spider
    from scrapy.selector import Selector
    from scraper_app.items import GenomeCanadaGrants


    class GCSpider(Spider):
        """Spider for Genome Canada Awards page."""
        name = 'GCSpider'
        allowed_domains = ["genomereports.ca"]  # domain only, no URL scheme
        start_urls = ["http://genomereports.ca/section.php?Action=List2&Lang=En&addnew=&Report=consolidated_commitments.php&Report_Text=Funding+Commitments&Nav=Section&ID=3&Login=&Password=&Consolidated_Centre=ALL&Consolidated_Category=ALL&Consolidated_Sector=ALL&Consolidated_Competition=ALL&Consolidated_FY=ALL&Consolidated_Status=ALL"]

        def parse(self, response):
            sel = Selector(response)
            # Rows of the first table directly under <body>
            grants = sel.xpath('//html/body/table[1]/tr')
            items = []
            for row in grants:
                item = GenomeCanadaGrants()
                item['Province'] = row.xpath('./th[1]/text()').extract()  # Column Header: CENTRE
                item['Sector'] = row.xpath('./th[2]/text()').extract()  # Column Header: SECTOR
                item['Fund'] = row.xpath('./th[3]/text()').extract()  # Column Header: PROGRAM & Fiscal Yr Awarded
                item['Date'] = row.xpath('./th[3]/text()').re(r'\d+\d-\d+\d')  # Column Header: PROGRAM & Fiscal Yr Awarded
                item['Status'] = row.xpath('./th[4]/text()').extract()  # Column Header: STATUS
                item['Principal_Investigator'] = row.xpath('./th[5]/text()').extract()  # Column Header: PROJECT LEADER(S)
                item['Project_Title'] = row.xpath('./th[6]/text()').extract()  # Column Header: PROJECT TITLE
                item['Additional_Information'] = row.xpath('./th[7]//a/@href').extract()  # Link to PDF with Project Details
                item['Amount'] = row.xpath('./th[8]/text()').extract()  # Column Header: APPROVED BUDGET
                item['GC_Contribution'] = row.xpath('./th[9]/text()').extract()  # Column Header: GC CONTRIBUTION
                items.append(item)
            return items

I think the problem is simply that you are not locating the right table: it sits inside the div with id="content_frame".

Here is sample code that works for me:

    def parse(self, response):
        for row in response.css('div#content_frame table tr'):
            try:
                cells = row.xpath(".//th//text()")
                first_cell = cells[0].extract()
                print(first_cell)
            except IndexError:
                continue  # skip rows that do not contain a CENTRE value

It prints the contents of the first column:

    CENTRE
    Genome British Columbia
    Genome British Columbia
    Genome British Columbia
    Genome Alberta
    Genome Alberta
    Genome Alberta
    ...
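
To get from that first column to the full set of fields, something along these lines might work. It is only a sketch under a few assumptions: the column order is taken from the comments in the question's code, the fiscal year is assumed to look like digits-hyphen-digits, and `FIELDS`, `FISCAL_YEAR` and `row_to_item` are hypothetical names introduced here. It yields plain dicts rather than `GenomeCanadaGrants` items to stay self-contained, and `Additional_Information` ends up holding the link text; selecting `.//a/@href` on that cell would give the PDF URL instead.

```python
import re

# Field names come from the question's item; the column order is an
# assumption based on the comments in the question's code.
FIELDS = [
    "Province",                # CENTRE
    "Sector",                  # SECTOR
    "Fund",                    # PROGRAM & Fiscal Yr Awarded
    "Status",                  # STATUS
    "Principal_Investigator",  # PROJECT LEADER(S)
    "Project_Title",           # PROJECT TITLE
    "Additional_Information",  # link to PDF with project details
    "Amount",                  # APPROVED BUDGET
    "GC_Contribution",         # GC CONTRIBUTION
]

# Assumption: the fiscal year is written as digits-hyphen-digits.
FISCAL_YEAR = re.compile(r"\d+-\d+")


def row_to_item(cells):
    """Map one row's cell texts to a dict; return None for short rows."""
    if len(cells) < len(FIELDS):
        return None
    item = {name: text.strip() for name, text in zip(FIELDS, cells)}
    match = FISCAL_YEAR.search(item["Fund"])
    item["Date"] = match.group(0) if match else ""
    return item


def parse(self, response):
    """Sketch of a parse method to place inside the spider class."""
    for row in response.css("div#content_frame table tr"):
        # One string per <th>, joining any nested text nodes.
        cells = [
            " ".join(th.xpath(".//text()").extract())
            for th in row.xpath("./th")
        ]
        item = row_to_item(cells)
        # Skip short rows and the header row (its first cell is "CENTRE").
        if item is not None and item["Province"] != "CENTRE":
            yield item
```

With a `parse` like this in the spider, running `scrapy crawl GCSpider -o grants.csv` should write the yielded items to a CSV file through Scrapy's feed exports.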
