Scrapy: storing items across multiple FormRequest pages? meta? Python



So I have a scraper that handles a single form request, and it works: I can even see the scraped data printed in the terminal as it comes off this version of the page:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy.selector import Selector
from scrapy.utils.response import open_in_browser
from myproject.items import swimItem  # wherever swimItem is defined

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W", "StrkDist": "10025",
                      "How_Many": "50", "foolOldPerl": ""},
            callback=self.swimparse1, dont_click=True)]

    def swimparse1(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []
        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
        return items

However, when I add a second FormRequest and callback, it only scrapes the items from the second request. It also only prints the scrape from the second page, as if the first page's scrape is being skipped entirely?:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["example.website"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W", "StrkDist": "10025",
                      "How_Many": "50", "foolOldPerl": ""},
            callback=self.swimparse1, dont_click=True)]

    def swimparse1(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []
        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
        # Note: this return discards `items`; only the new request goes on.
        return [FormRequest.from_response(
            response, formname="TTForm",
            formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                      "lowage": "", "highage": "", "sex": "W", "StrkDist": "40025",
                      "How_Many": "50", "foolOldPerl": ""},
            callback=self.swimparse2, dont_click=True)]

    def swimparse2(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []
        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
        return items

My guesses: a) How do I export or pass the items from the first scrape into the second scrape, so that all the item data ends up aggregated together, as if it had been scraped from a single page?

b) Or, if the first scrape is being skipped entirely, how do I stop it from being skipped and pass those items on to the next callback?

Thanks!

PS: I have also tried using:

item = response.request.meta = ["item]
item = response.request.meta = []
item = response.request.meta = ["names":item, "age":item, "free":item, "team":item]

All of these raise a KeyError or some other exception.

I've also tried modifying the FormRequest to include meta={"names": item, "age": item, "free": item, "team": item}. That raises no error, but it doesn't scrape or store anything either.
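For what it's worth, meta is normally supplied as a dict keyword argument when constructing the request, and read back via response.meta in the next callback, rather than being assigned to response.request.meta. A minimal, Scrapy-free sketch of that hand-off (the Fake* classes below are illustrative stand-ins, not Scrapy APIs):

```python
class FakeRequest:
    """Stand-in for scrapy.Request: just carries a callback and a meta dict."""
    def __init__(self, callback, meta=None):
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    """Stand-in for a Scrapy response: exposes the request's meta as .meta."""
    def __init__(self, request):
        self.request = request
        self.meta = request.meta

def parse_first(items_so_far):
    # Attach the accumulated items under ONE key in meta...
    return FakeRequest(callback=parse_second, meta={"items": items_so_far})

def parse_second(response):
    # ...and read them back in the next callback, then keep appending.
    carried = response.meta["items"]
    carried.append({"names": "swimmer B", "age": "7"})
    return carried

request = parse_first([{"names": "swimmer A", "age": "6"}])
result = request.callback(FakeResponse(request))
print(len(result))  # prints: 2
```

The key point is that meta is one dict passed into the request; pulling individual fields back out happens on the response side.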

EDIT: I tried using yield like this:

class MySpider(BaseSpider):
    name = "swim"
    start_urls = ["www.website.com"]
    DOWNLOAD_DELAY = 30.0

    def parse(self, response):
        open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []
        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["free"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
            yield [FormRequest.from_response(
                response, formname="TTForm",
                formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                          "lowage": "", "highage": "", "sex": "W", "StrkDist": "10025",
                          "How_Many": "50", "foolOldPerl": ""},
                callback=self.parse, dont_click=True)]
        for row in rows[4:54]:
            item = swimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["fly"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
            yield [FormRequest.from_response(
                response, formname="TTForm",
                formdata={"Ctype": "A", "Req_Team": "", "AgeGrp": "0-6",
                          "lowage": "", "highage": "", "sex": "W", "StrkDist": "40025",
                          "How_Many": "50", "foolOldPerl": ""},
                callback=self.parse, dont_click=True)]

It still doesn't scrape anything. I know the XPath is correct, because when I scraped just one form (using return instead of yield) it worked fine. I've read the Scrapy docs, but they weren't much help :(

You are missing a very simple solution: change return to yield.

Then you don't have to accumulate items in a list; just yield as many items and requests as you like from the callback, and Scrapy will take care of the rest.

From the Scrapy documentation:

from scrapy.selector import Selector
from scrapy.spider import Spider
from scrapy.http import Request
from myproject.items import MyItem

class MySpider(Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]
    def parse(self, response):
        sel = Selector(response)
        for h3 in sel.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in sel.xpath('//a/@href').extract():
            yield Request(url, callback=self.parse)
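Applied to the spider in the question, this means swimparse1 should yield each item and then yield the second FormRequest itself (a single request object, not a list wrapped around it). The engine loop that makes this pattern work can be sketched without Scrapy at all: items coming out of a callback generator get collected, requests get scheduled and followed. All names below are illustrative, not real Scrapy internals:

```python
from collections import deque

def parse_page1(response):
    # A callback may yield plain dicts (items)...
    for name in response["rows1"]:
        yield {"names": name, "free": "time1"}
    # ...and then yield a follow-up "request" marker carrying its callback.
    yield ("request", parse_page2)

def parse_page2(response):
    for name in response["rows2"]:
        yield {"names": name, "fly": "time2"}

def crawl(response, first_callback):
    """Toy engine loop: collect yielded items, schedule yielded requests."""
    items, pending = [], deque([first_callback])
    while pending:
        callback = pending.popleft()
        for result in callback(response):
            if isinstance(result, dict):
                items.append(result)        # an item: collect it
            else:
                pending.append(result[1])   # a request: follow it later
    return items

fake_response = {"rows1": ["Ann", "Bea"], "rows2": ["Cal"]}
print(len(crawl(fake_response, parse_page1)))  # prints: 3
```

Because both callbacks feed the same collection step, the items from the first page are no longer lost when the second request is issued, which is exactly the aggregation asked for in guess (a).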
