Access denied [403] when accessing a website with BeautifulSoup in Python

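I want to scrape https://www.jdsports.it/ with BeautifulSoup, but my access is denied.

On my PC I can open the site in a browser without any problem, and my Python program sends the same User-Agent as that browser, yet the program gets a different result, as the output below shows.

EDIT: I think I need cookies to get access to the site. How can I obtain them and use them from the Python program so that I can scrape the site?

- The script works if I use "https://www.jdsports.com", which is the same site for a different region.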
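As a rough illustration of the cookie idea in the edit above, one possible approach is a `requests.Session`, which stores any cookies the server sets and sends them back on later requests. This is only a sketch: the helper names `make_session` and `fetch` and the `Accept` / `Accept-Language` values are assumptions chosen for illustration, and nothing here is confirmed to get past the 403 on this particular site.

```python
import requests

# Same User-Agent string as in the script in this question.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
)


def make_session(user_agent=USER_AGENT):
    """Return a Session that sends browser-like headers on every request."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": user_agent,
        # Assumed values, roughly what a desktop browser might send.
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "it-IT,it;q=0.9,en;q=0.8",
    })
    return session


def fetch(session, url, timeout=10):
    """GET url through the session and return (status code, stored cookies).

    Cookies set by the server are kept in session.cookies and are sent
    automatically on the next request made with the same session.
    """
    response = session.get(url, timeout=timeout)
    return response.status_code, session.cookies.get_dict()
```

Calling `fetch(make_session(), 'https://www.jdsports.it/')` twice with the same session would show whether the server hands out cookies on the first response and whether resending them changes the status code.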

Thanks!

import time
import requests
from bs4 import BeautifulSoup
import smtplib

# Same User-Agent string that my desktop browser sends
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
url = 'https://www.jdsports.it/'

response = requests.get(url, headers=headers)
print(response.status_code)  # 403 in my case

# Parse the body and print it to see what the server actually returned
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

The output is:

<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.it/" on this server.<p>
Reference #18.35657b5c.1589627513.36921df8
</p></body>
</html>

Python Beautifulsoup User-Agent Cookies python-requests

At first I suspected HTTP/2, but I couldn't get that to work either. Maybe you'll have more luck, so here is an HTTP/2 starting point:

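import asyncio
import logging

import httpx

# Debug logging shows the connection and protocol details
logging.basicConfig(format='%(message)s', level=logging.DEBUG)

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
url = 'https://www.jdsports.it/'

async def f():
    # http2=True needs the optional extra: pip install "httpx[http2]"
    async with httpx.AsyncClient(http2=True) as client:
        # httpx 0.20+ uses follow_redirects; older releases called it allow_redirects
        r = await client.get(url, follow_redirects=True, headers=headers)
        print(r.http_version)  # protocol that was actually negotiated
        print(r.text)

asyncio.run(f())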

(Tested on both Windows and Linux.) Could this have something to do with TLS 1.2? That's where I would look next, since curl works.
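To follow up on the TLS idea, a small standard-library sketch like the one below could show what Python negotiates with the server. The helper names `build_context` and `negotiated_tls` are made up for illustration, and the code only reports the handshake result; it does not by itself explain or work around the 403.

```python
import socket
import ssl


def build_context(alpn=("h2", "http/1.1")):
    """Default client context (certificate and hostname checks on) that offers ALPN."""
    ctx = ssl.create_default_context()
    # Offer HTTP/2 and HTTP/1.1 so the server's protocol choice becomes visible.
    ctx.set_alpn_protocols(list(alpn))
    return ctx


def negotiated_tls(host, port=443, timeout=10):
    """Open a TLS connection and return (TLS version, cipher name, ALPN protocol)."""
    ctx = build_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()[0], tls.selected_alpn_protocol()
```

Calling `negotiated_tls('www.jdsports.it')` and comparing the result with what `curl -v` reports for the same host might show whether the two clients already differ at the TLS layer.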
