Remove/exclude non-breaking spaces from Scrapy results



I'm currently trying to scrape a website for article prices, and I've run into a problem (I somehow managed to work around the prices being generated dynamically, which was a huge pain).

I can receive the price and the article name without any problem, but every other result for 'price' is "\xa0". I've tried removing it with 'normalize-space()', to no avail.

My code:
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from horni.items import HorniItem
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys
class mySpider(scrapy.Spider):
    name = "placeholder"
    allowed_domains = ["placeholder.com"]
    start_urls = ["https://www.placeholder.com"]
    def __init__(self):
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)
    def spider_closed(self, spider):
        self.driver.close()
    def parse(self, response):
        self.driver.get("https://www.placeholder.com")
        response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//body'):
            item = HorniItem()
            item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
            item['price'] = post.xpath('//p[@class="display-price"]/span/text()').extract()
            yield item

\xa0 is the non-breaking space in Latin-1. Replace it like this:

string = string.replace(u'\xa0', u' ')
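
For instance, a scraped price string (the value below is just a made-up example) would be cleaned up like this:

price = u'1\xa0299,00'                    # hypothetical raw value scraped from the page
price = price.replace(u'\xa0', u' ')      # swap the non-breaking space for a normal one
print(price)                              # -> 1 299,00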

Update:

You can apply this in your code as follows:

for post in response.xpath('//body'):
    item = HorniItem()
    item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
    item['price'] = post.xpath('//p[@class="display-price"]/span/text()').extract_first(default='')
    item['price'] = item['price'].replace(u'\xa0', u' ')
    if item['price'].strip():
        yield item

Here you replace the character first and then only yield the item if the price is not empty.
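
If you prefer to keep the cleanup out of parse(), the same replacement can also be done through an Item Loader input processor. This is only a sketch, assuming your HorniItem has article_name and price fields:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst  # itemloaders.processors in newer Scrapy
from horni.items import HorniItem

def clean_price(value):
    # Replace the non-breaking space and strip surrounding whitespace;
    # returning None makes MapCompose drop empty values entirely
    cleaned = value.replace(u'\xa0', u' ').strip()
    return cleaned or None

class HorniItemLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(clean_price)

# inside parse():
# for post in response.xpath('//body'):
#     loader = HorniItemLoader(item=HorniItem(), selector=post)
#     loader.add_xpath('article_name', '//a[@class="title-link"]/span/text()')
#     loader.add_xpath('price', '//p[@class="display-price"]/span/text()')
#     yield loader.load_item()

This way every scraped price is cleaned in one place, and items with an empty price simply come out without that field.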
