403: HTTP status code is not handled or not allowed



I am trying to get a list of locations from https://www.taylorwimpey.co.uk/sitemap. It opens fine in my browser, but when I try it with Scrapy I get nothing:

2022-04-30 11:49:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-30 11:49:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.taylorwimpey.co.uk/sitemap> (referer: None)
2022-04-30 11:49:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.taylorwimpey.co.uk/sitemap>: HTTP status code is not handled or not allowed
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Closing spider (finished)
Starting csv blank line cleaning
2022-04-30 11:49:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 233,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2020,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'elapsed_time_seconds': 2.297067,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 30, 10, 49, 22, 111984),
'httpcompression/response_bytes': 3932,
'httpcompression/response_count': 1,
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 30, 10, 49, 19, 814917)}
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Spider closed (finished)

I have tried adjusting things in settings.py, such as changing the user agent, but nothing has worked so far.

My code is:
import scrapy
from TaylorWimpey.items import TaylorwimpeyItem
from scrapy.http import TextResponse
from selenium import webdriver

class taylorwimpeySpider(scrapy.Spider):

    name = "taylorwimpey"
    allowed_domains = ["taylorwimpey.co.uk"]
    start_urls = ["https://www.taylorwimpey.co.uk/sitemap"]

    def __init__(self):
        try:
            self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
        except:
            self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")

    def parse(self, response): # build a list of all locations
        self.driver.get(response.url)
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')

        url_list1 = []

        for href in response1.xpath('//div[@class="content-container"]/ul/li/a/@href'):
            url = response1.urljoin(href.extract())
            url_list1.append(url)
            print(url)

Any suggestions?

You are getting a 403 because the website is behind Cloudflare protection.

https://www.taylorwimpey.co.uk/sitemap could be using a CNAME configuration
https://www.taylorwimpey.co.uk/sitemap is using Cloudflare CDN/Proxy!
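The same check can be made roughly from the response headers, since responses served through Cloudflare normally carry a `Server: cloudflare` header and a `CF-RAY` header. The following is only a minimal sketch: `looks_like_cloudflare` is a hypothetical helper name, and it assumes the headers are already available as a plain dict of strings.

```python
def looks_like_cloudflare(headers: dict) -> bool:
    """Heuristic: True if the response headers suggest a Cloudflare proxy.

    Hypothetical helper for illustration; it only inspects two headers
    that Cloudflare normally adds and is not a definitive test.
    """
    # Header names are case-insensitive, so normalize before comparing.
    normalized = {str(k).lower(): str(v).lower() for k, v in headers.items()}
    return normalized.get("server") == "cloudflare" or "cf-ray" in normalized


if __name__ == "__main__":
    # Example header values are made up for the demonstration.
    print(looks_like_cloudflare({"Server": "cloudflare", "CF-RAY": "example-id"}))
    print(looks_like_cloudflare({"Server": "nginx"}))
```

Scrapy stores header names and values as bytes, so they would need to be decoded to strings before being passed to a helper like this.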

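Scrapy combined with Selenium, as in the spider above, can't handle it, because Scrapy's own request is the one that gets blocked. Selenium on its own can handle this case and gets past the protection without trouble.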
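As background on the log message itself: "HTTP status code is not handled or not allowed" is emitted by Scrapy's HttpErrorMiddleware, which drops non-2xx responses before they reach the spider callback, so `parse` never runs for the 403. If you want to stay with the spider, one thing that might be worth trying is letting the 403 through so that `parse` is called and Selenium loads the page. This is a sketch of a settings.py fragment, not tested against this site:

```python
# settings.py (fragment) - an untested assumption for this particular site.
# Pass 403 responses on to the spider callback instead of letting
# HttpErrorMiddleware discard them.
HTTPERROR_ALLOWED_CODES = [403]
```

The per-spider equivalent is the class attribute `handle_httpstatus_list = [403]`. The standalone Selenium script that follows avoids Scrapy's downloader altogether.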

import time
import pandas as pd
# Selenium 4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Browser options
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Keep Chrome open after the script finishes
options.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get('https://www.taylorwimpey.co.uk/sitemap')
time.sleep(2)  # pause briefly so the page can finish loading

# Collect the href of every link in the sitemap list
urls = []
for link in driver.find_elements(By.XPATH, '//*[@class="content-container"]/ul/li/a'):
    urls.append(link.get_attribute('href'))

df = pd.DataFrame(urls, columns=['Links'])
print(df)

Output:

Links
0     https://www.taylorwimpey.co.uk/new-homes/abera...
1     https://www.taylorwimpey.co.uk/new-homes/aberarth
2     https://www.taylorwimpey.co.uk/new-homes/aberavon
3     https://www.taylorwimpey.co.uk/new-homes/aberdare
4     https://www.taylorwimpey.co.uk/new-homes/aberdeen
...                                                 ...
1691   https://www.taylorwimpey.co.uk/new-homes/yateley
1692  https://www.taylorwimpey.co.uk/new-homes/yealm...
1693    https://www.taylorwimpey.co.uk/new-homes/yeovil
1694      https://www.taylorwimpey.co.uk/new-homes/york
1695  https://www.taylorwimpey.co.uk/new-homes/ystra...
[1696 rows x 1 columns]
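If what you need is the location names rather than the full links, the last path segment of each URL can be extracted with the standard library. This is a minimal sketch that assumes every link has the form `/new-homes/<location>`, as in the output above; `location_from_url` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def location_from_url(url: str) -> str:
    """Return the last path segment of a URL, e.g. 'york'.

    Hypothetical helper; assumes links shaped like
    https://www.taylorwimpey.co.uk/new-homes/<location>.
    """
    # Drop any trailing slash so the final segment is never empty.
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


if __name__ == "__main__":
    sample = [
        "https://www.taylorwimpey.co.uk/new-homes/aberdeen",
        "https://www.taylorwimpey.co.uk/new-homes/york",
    ]
    print([location_from_url(u) for u in sample])
```

Applied to the DataFrame from the script above, this could be used as `df['Links'].map(location_from_url)`.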

The script uses webdriver_manager's ChromeDriverManager to download a matching chromedriver.