I am trying to download the HTML file from the following website:
https://www.avto.net/Ads/results.asp?znamka=Audi&model=&modelID=&tip=katerikoli%20tip&znamka2=&model2=&tip2=katerikoli%20tip&znamka3=&model3=&tip3=katerikoli%20tip&cenaMin=0&cenaMax=999999&letnikMin=0&letnikMax=2090&bencin=0&starost2=999&oblika=0&ccmMin=0&ccmMax=99999&mocMin=&mocMax=&kmMin=0&kmMax=9999999&kwMin=0&kwMax=999&motortakt=&motorvalji=&lokacija=0&sirina=&dolzina=&dolzinaMIN=&dolzinaMAX=&nosilnostMIN=&nosilnostMAX=&lezisc=&presek=&prem=&col=&vijakov=&EToznaka=&vozilo=&airbag=&barva=&barvaint=&EQ1=1000000000&EQ2=1000000000&EQ3=1000000000&EQ4=100000000&EQ5=1000000000&EQ6=1000000000&EQ7=1000000120&EQ8=1010000001&EQ9=1000000000&KAT=1010000000&PIA=&PIAzero=&PSLO=&akcija=&paketgarancije=&broker=&prikazkategorije=&kategorija=&ONLvid=&ONLnak=&zaloga=&arhiv=&presort=&tipsort=&stran=1
If I view the page source in Google Chrome, I can get the HTML with no problem. However, I want to download several pages with Python requests, and when I try to fetch the HTML that way I run into an error.
Using:
import requests

# url is the long search URL shown above
response = requests.get(url)
content = response.text
with open('filename', 'w') as dat:
    dat.write(content)
I get the following error:
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
I also tried using allow_redirects=False, but then I get the wrong HTML, which contains only the following text:
Object Moved
This document may be found here.
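Presumably that body is just the server's redirect stub and the real target is in the Location header. A small check along these lines (just a debugging sketch; it assumes the same url and import as above) shows where the redirect points:

import requests

response = requests.get(url, allow_redirects=False)
# With redirects disabled, requests returns the 3xx response itself;
# the interesting part is the Location header, not the body.
print(response.status_code)
print(response.headers.get('Location'))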
I would like to know what I need to do to download this HTML with requests in Python.
If I add the header:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
the code does run, but once again it does not give me the HTML I am looking for. The HTML it produces looks like this:
<html><head><title>avto.net</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script>var ...
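Since the page explicitly asks to "enable JS and disable any ad blocker", it looks like the site serves an anti-bot interstitial to requests that don't look like a real browser. A small helper like this (hypothetical, keyed to the exact text shown above) at least lets me detect when that has happened instead of silently saving the wrong HTML:

def looks_like_bot_check(html):
    # Hypothetical check: the interstitial shown above contains this phrase.
    return 'Please enable JS and disable any ad blocker' in html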
Try defining a header for your requests.get() call, i.e.
import requests
from bs4 import BeautifulSoup

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
}

url = <url-here>

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
This solved the issue for me.
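If you also need the other result pages, you can reuse the same headers with a requests.Session. The sketch below is only an illustration: it assumes the stran query parameter selects the page number, as it appears to in the URL above, and the filenames are arbitrary:

import requests

with requests.Session() as session:
    session.headers.update(headers)   # the headers dict from above
    for page_no in range(1, 4):       # first three result pages, as an example
        page_url = url.replace('stran=1', f'stran={page_no}')
        response = session.get(page_url)
        with open(f'avto_net_page_{page_no}.html', 'w', encoding='utf-8') as dat:
            dat.write(response.text)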