Scrapy returns None for an XPath query

Hi, so I'm scraping the site https://www.centralbankofindia.co.in with Scrapy. I get a response, but when I query the address with XPath I get None.

start_urls = [
    "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page={}".format(
        i
    )
    for i in range(0, 5)
]
brand_name = "Central Bank of India"
spider_type = "chain"

# XPaths for the address cell in the first three rows:
# //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[1]/td[2]/div/span[2]
# //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[2]/td[2]/div/span[2]
# //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[3]/td[2]/div/span[2]

def parse(self, response, **kwargs):
    """Parse response."""
    # print(response.text)
    for id in range(1, 11):
        address = self.get_text(
            response,
            f'//*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[{id}]/td[2]/div/span[2]',
        )
        print(address)

def get_text(self, response, path):
    sol = response.xpath(path).extract_first()
    return sol

The spans holding the addresses on the site don't have a unique id - is that what's causing the problem?

I think the XPath you created is too complex. You should skip some elements and use // instead.

Some browsers may show tbody in DevTools, but it may not exist in the HTML that Scrapy gets from the server, so it is best to always skip it.
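You can check this quickly in scrapy shell; here is a minimal sketch (the shortened XPath is my assumption about how much of the path can be skipped):

# scrapy shell "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page=0"

# Absolute path with tbody, as shown by DevTools - may return None
# because the HTML sent by the server has no <tbody>:
response.xpath('//*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[1]/td[2]/div/span[2]').extract_first()

# Skipping the intermediate elements (and tbody) with //:
response.xpath('//*[@id="block-cbi-content"]//tr[1]/td[2]//span[2]').extract_first()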

You can also use extract() instead of looping over tr[{id}] with extract_first().

This XPath works for me:

all_items = response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract()

for address in all_items:
    print(address)

BTW: I use text() in the XPath to get the address without the HTML tags.
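To see the difference, compare these two queries (a sketch; the example values are placeholders, not real output):

# Without /text() - returns the element with its tags,
# e.g. '<span class="...">ADDRESS</span>'
response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]').extract_first()

# With /text() - returns only the text node, e.g. 'ADDRESS'
response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract_first()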


Full working code.

You can put it all in one file and run it as python script.py without creating a project.

It saves the results in output.csv.

In start_urls I set only the link to the first page, because parse() searches the HTML for the link to the next page - this way it gets all the pages instead of only range(0, 5).

#!/usr/bin/env python3

import scrapy


class MySpider(scrapy.Spider):

    start_urls = [
        # f"https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page={i}"
        # for i in range(0, 5)

        # only first page - links to other pages it will find in HTML
        "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page=0"
    ]

    name = "Central Bank of India"

    def parse(self, response):
        print(f'url: {response.url}')

        all_items = response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract()

        for address in all_items:
            print(address)
            yield {'address': address}

        # get link to next page
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()

        if next_page:
            print(f'Next Page: {next_page}')
            yield response.follow(next_page)


# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in Scrapy 2.1
})

c.crawl(MySpider)
c.start()
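Run it like any other script (assuming you saved it as script.py):

python script.py

When it finishes, output.csv will contain the scraped addresses, one per row.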
