Python web scraping error: HTTP Error 403: Forbidden



I'm new to this and am trying to scrape the Congressional Record. I have a .txt file (url_list.txt) with the websites I want to download. The .txt file's data looks like this:

https://www.congress.gov/congressional-record/2003/3/12/house-section/article/h1752-1
https://www.congress.gov/congressional-record/2003/11/7/house-section/article/h10982-2
https://www.congress.gov/congressional-record/2003/1/29/house-section/article/h231-3

I'm using this code:

import urllib.request

with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
    for line in f:
        url = line
        path = '/Users/myusername/Desktop/py_test' + url.split('/', -1)[-1]
        urllib.request.urlretrieve(url, path.rstrip('\n'))

I get this error:

Traceback (most recent call last):
  File "/Users/myusername/Desktop/py_test/py_try.py", line 7, in <module>
    urllib.request.urlretrieve(url, path.rstrip('\n'))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Any help would be greatly appreciated.

HTTP Error 403 means you have been blocked from accessing the resource you requested.

Check that the URL you are trying to request is correct (try printing the URL to make sure it is), and if it is, you may need to change the request's User-Agent header.
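If you want to stay with urllib rather than switch libraries, a minimal sketch of setting the header looks like this (the single URL is just one line from your file, used here for illustration); urllib's default User-Agent is `Python-urllib/3.x`, which some sites reject outright:

```python
import urllib.request

url = "https://www.congress.gov/congressional-record/2003/3/12/house-section/article/h1752-1"

# Build a Request carrying a browser-like User-Agent instead of the default.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# Opening the Request object sends the custom header with the request:
# with urllib.request.urlopen(req) as resp:
#     html = resp.read().decode("utf-8")
```

Note that urllib normalizes stored header names, so the header is retrievable as `req.get_header("User-agent")`.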

To do that, I recommend using requests instead of urllib, since requests is easier to work with. With requests, your code could look like this:

import requests

with open('/Users/myusername/Desktop/py_test/url_list.txt') as f:
    url_list = f.read().split('\n')

for url in url_list:
    with open('/Users/myusername/Desktop/py_test/' + url.split('/')[-1], 'w') as f:
        with requests.get(url, headers={'User-agent': 'Mozilla/5.0'}) as r:
            f.write(r.text)

If that doesn't work, then you have probably been blocked from accessing the website, and there is not much you can do about it, since the block is server-side rather than client-side.
