runspider returns an empty file [DEBUG: Crawled (200)]



To analyze the prices of different products, I created a routine that downloads them with the scrapy library, but when I run it the output file comes back empty.

I have saved the scrapy.exe file in the same working directory as the .py file I am running.

Here is my code:

import scrapy
from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.loader.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from bs4 import BeautifulSoup

class Articulo(Item):
    titulo = Field()
    precio = Field()
    descripcion = Field()

class MercadoLibreCrawler(CrawlSpider):
    name = 'mercadoLibre'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'CLOSESPIDER_PAGECOUNT': 5
    }
    download_delay = 1

    allowed_domains = ['articulo.mercadolibre.cl', 'listado.mercadolibre.cl']   # more domains can be added, comma-separated

    start_urls = ['https://listado.mercadolibre.cl/animales-mascotas/caballos/']

    rules = (
        Rule(  # RULE #1 => HORIZONTAL CRAWLING THROUGH PAGINATION
            LinkExtractor(
                allow=r'/_Desde_\d+'
            ), follow=True),

        Rule(  # RULE #2 => VERTICAL CRAWLING INTO PRODUCT DETAIL PAGES
            LinkExtractor(
                allow=r'/MLC-'
            ), follow=True, callback='parse_items'),
    )

    def limpiarTexto(self, texto):
        # Replace newlines, carriage returns and tabs with spaces, then trim
        nuevoTexto = texto.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').strip()
        return nuevoTexto

    def parse_items(self, response):
        item = ItemLoader(Articulo(), response)
        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('descripcion', '//div[@class="ui-pdp-description__content"]/p/text()', MapCompose(self.limpiarTexto))
        item.add_xpath('precio', '//span[@class="andes-money-amount__fraction"]/text()', MapCompose(self.limpiarTexto))
        yield item.load_item()
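As an aside, the whitespace cleanup in limpiarTexto can be done in one step with a regex. This is just a sketch of an alternative, not what the spider above uses:

```python
import re

def limpiar_texto(texto):
    # Collapse newlines, carriage returns, tabs and runs of spaces
    # into single spaces, then trim both ends.
    return re.sub(r'\s+', ' ', texto).strip()

print(limpiar_texto('  Caballo\n\tcriollo \r chileno  '))  # -> 'Caballo criollo chileno'
```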

The code runs without raising any errors, yet the resulting file is empty. I think the problem lies in this "DEBUG: Crawled (200) ... (referer: None)" message, but I don't quite understand how to fix it.

C:\Users\gusta\OneDrive\Documentos\Empresa>scrapy runspider 20220910_scraping_mercado_libre.py -o mercado_libre.csv -t csv
C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\commands\__init__.py:131: ScrapyDeprecationWarning: The -t command line option is deprecated in favor of specifying the output format within the output URI. See the documentation of the -o and -O options for more information.
feeds = feed_process_params_from_cli(
2022-09-11 02:55:48 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2022-09-11 02:55:48 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.19044-SP0
2022-09-11 02:55:49 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_PAGECOUNT': 1,
'SPIDER_LOADER_WARN_ONLY': True,
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36'}
2022-09-11 02:55:49 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-09-11 02:55:49 [scrapy.extensions.telnet] INFO: Telnet Password: 2f2010a00a1f6efa
2022-09-11 02:55:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-09-11 02:55:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-09-11 02:55:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-09-11 02:55:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-09-11 02:55:49 [scrapy.core.engine] INFO: Spider opened
2022-09-11 02:55:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-09-11 02:55:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-09-11 02:55:50 [filelock] DEBUG: Attempting to acquire lock 1646952198736 on C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-11 02:55:50 [filelock] DEBUG: Lock 1646952198736 acquired on C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-11 02:55:50 [filelock] DEBUG: Attempting to release lock 1646952198736 on C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-11 02:55:50 [filelock] DEBUG: Lock 1646952198736 released on C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-11 02:55:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://listado.mercadolibre.cl/animales-mascotas/caballos/> (referer: None)
2022-09-11 02:55:50 [scrapy.core.engine] INFO: Closing spider (closespider_pagecount)
2022-09-11 02:55:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 332,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 105595,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.003421,
'finish_reason': 'closespider_pagecount',
'finish_time': datetime.datetime(2022, 9, 11, 5, 55, 51, 184037),
'httpcompression/response_bytes': 722794,
'httpcompression/response_count': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 10,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 52,
'scheduler/enqueued/memory': 52,
'start_time': datetime.datetime(2022, 9, 11, 5, 55, 50, 180616)}
2022-09-11 02:55:51 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)

I can see a typo in your rule here:

Rule(   # RULE #2 => VERTICAL CRAWLING INTO PRODUCT DETAIL PAGES
    LinkExtractor(
        allow=r'/MCL-'
    ), follow=True, callback='parse_items')

It should be MLC (not MCL):

Rule(   # RULE #2 => VERTICAL CRAWLING INTO PRODUCT DETAIL PAGES
    LinkExtractor(
        allow=r'/MLC-'
    ), follow=True, callback='parse_items')
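A quick way to see why the typo matters is to test both allow patterns against sample URLs with Python's re module. The URLs below are made up for illustration, following the shapes MercadoLibre Chile uses for product and listing pages:

```python
import re

# Hypothetical sample URLs in MercadoLibre Chile's link shapes.
urls = [
    'https://articulo.mercadolibre.cl/MLC-123456789-montura-chilena-_JM',
    'https://listado.mercadolibre.cl/animales-mascotas/caballos/_Desde_51',
]

matches_mlc = [u for u in urls if re.search(r'/MLC-', u)]  # matches the product URL
matches_mcl = [u for u in urls if re.search(r'/MCL-', u)]  # empty: the typo matches nothing
print(matches_mlc)
print(matches_mcl)
```

With the MCL pattern the extractor never finds a product link, so parse_items is never called and the feed stays empty.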

After applying that fix, another error shows up in the item processor. It should be:

item.add_xpath('descripcion', '//div[@class="ui-pdp-description__content"]/p/text()', MapCompose(self.limpiarTexto))
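For context on what that processor does: MapCompose runs each function over every extracted value. A simplified pure-Python sketch of the idea (the real itemloaders processor also flattens iterable results and accepts extra keyword arguments):

```python
def map_compose(*functions):
    # Simplified stand-in for itemloaders' MapCompose: apply each
    # function to every value in turn, dropping None results.
    def process(values):
        for func in functions:
            values = [r for r in (func(v) for v in values) if r is not None]
        return values
    return process

limpiar = map_compose(str.strip)
print(limpiar(['  40.000  ', '\nPrecio del caballo\n']))  # -> ['40.000', 'Precio del caballo']
```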

Latest update