How to scrape links from a list inside a div



Consider this statement:

url=hxs.xpath('//ul[@class="product-wrapper product-wrapper-four-tile"]/li/div/div/div/div/div/a').get()

Output:

'<a href="https://www.michaelkors.com/gemma-large-tri-color-pebbled-leather-tote/_/R-US_30S9LGXT3T?color=1791"><div class="product-image-container"><div><div class="LazyLoad"><img src="data:image/png;base64,...'

I need to crawl links that are nested inside several divs. The statement above correctly gives me the anchor element. Since it is a string, I apply a regex to it and then yield the request:

 WEB_URL_REGEX = r"""(?i)b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.-]+[.](?:com|net|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^s()<>{}[]]+|([^s()]*?([^s()]+)[^s()]*?)|([^s]+?))+(?:([^s()]*?([^s()]+)[^s()]*?)|([^s]+?)|[^s`!()[]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.-][a-z0-9]+)*[.](?:com|net)b/?(?!@)))"""
 listing_url = re.findall(WEB_URL_REGEX, url)[0]
 yield scrapy.Request(listing_url, callback=self.parse_produrls)

The URL is extracted correctly. However, it is producing the following error:

Traceback:

2019-07-15 01:21:15 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.michaelkors.com/women/clothing/skirts-shorts/_/N-28en> (referer: https://www.michaelkors.com/women/clothing/dresses/_/N-28ei)
Traceback (most recent call last):
  File "C:Usersfatima.arshadAppDataLocalContinuumanaconda3libsite-packagesscrapyutilsdefer.py", line 102, in iter_errback
    yield next(it)
  File "C:Usersfatima.arshadAppDataLocalContinuumanaconda3libsite-packagesscrapyspidermiddlewaresoffsite.py", line 29, in process_spider_output
    for x in result:
  File "C:Usersfatima.arshadAppDataLocalContinuumanaconda3libsite-packagesscrapyspidermiddlewaresreferer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:Usersfatima.arshadAppDataLocalContinuumanaconda3libsite-packagesscrapyspidermiddlewaresurllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:Usersfatima.arshadAppDataLocalContinuumanaconda3libsite-packagesscrapyspidermiddlewaresdepth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:Usersfatima.arshadAppDataLocalContinuumanaconda3libsite-packagesscrapyspiderscrawl.py", line 78, in _parse_response
    for requests_or_item in iterate_spider_output(cb_res):
  File "C:Usersfatima.arshadMKMKspidersMichaelKors.py", line 107, in parse_list
    listing_url = re.findall(WEB_URL_REGEX, url)[0]
  File "C:Usersfatima.arshadAppDataLocalContinuumanaconda3libre.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object

Edit: the cause may be that the url variable is not a string. If I append /text() to the end of hxs.xpath(...), the returned list is empty.
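
For what it's worth, hxs.xpath(...).get() returns None when the XPath matches nothing, and passing None to re.findall() raises exactly this TypeError. A minimal guard, sketched with the names from the question, might be:

url = hxs.xpath('//ul[@class="product-wrapper product-wrapper-four-tile"]'
                '/li/div/div/div/div/div/a').get()
if url is not None:  # .get() returns None when nothing matched
    listing_url = re.findall(WEB_URL_REGEX, url)[0]
    yield scrapy.Request(listing_url, callback=self.parse_produrls)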

There is no need to use a regular expression here. There are much simpler ways:

def parse_list(self, response):
    for product_url in response.xpath('//ul[@class="product-wrapper product-wrapper-four-tile"]//li[@class="product-name-container"]/a/@href').getall():
        yield scrapy.Request(response.urljoin(product_url), callback=self.parse_product)
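As a side note, if you are on Scrapy 1.4 or later, yield response.follow(product_url, callback=self.parse_product) joins relative URLs for you, so the explicit response.urljoin() call can be dropped.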

Some of the values you are getting are not str, so it would be wise to str() them and inspect the result. Hopefully this points you further toward solving the problem.

listing_url = str(re.findall(WEB_URL_REGEX, url)[0])
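
If the underlying issue is that url can be None, a slightly more defensive variant of the same idea (just a sketch) would also avoid the IndexError that an empty match list would raise:

matches = re.findall(WEB_URL_REGEX, str(url))  # str() tolerates a None url
if matches:  # findall() returns [] when nothing matches
    listing_url = str(matches[0])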

Am I right that you want the href of every link in that list? Then you can use this XPath expression. Or am I missing something?

urls=hxs.xpath('//ul[@class="product-wrapper product-wrapper-four-tile"]/li/div/div/div/div/div/a/@href').getall()
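
To turn those hrefs into requests, a follow-up loop might look like this (assuming the response object is in scope, as in the other answer, and reusing the parse_produrls callback from the question):

for href in urls:
    # hrefs may be relative, so join them against the page URL first
    yield scrapy.Request(response.urljoin(href), callback=self.parse_produrls)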
