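I'm trying to learn the basics of Python, and I came across this exercise in a book on web scraping. I tried to replicate the code but got this error: "urllib.error.HTTPError: HTTP Error 406: Not Acceptable".
Is there something wrong with the code?
I'm using Anaconda and VS Code on Windows 10.
Here is my code: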
from urllib import request
from bs4 import BeautifulSoup

page_url = 'https://alansimpson.me/python/scrape_sample.html'
rawpage = request.urlopen(page_url)
soup = BeautifulSoup(rawpage, 'html5lib')
content = soup.article
links_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        links_list.append({'url' : url, 'img' : img, 'text' : text})
    except AttributeError:
        pass
This is the error I get:
Traceback (most recent call last):
  File "c:UserssrikaOneDriveAIO_Pythonscraper.py", line 6, in <module>
    rawpage = request.urlopen(page_url)
  File "C:ProgramDataAnaconda3liburllibrequest.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "C:ProgramDataAnaconda3liburllibrequest.py", line 523, in open
    response = meth(req, response)
  File "C:ProgramDataAnaconda3liburllibrequest.py", line 632, in http_response
    response = self.parent.error(
  File "C:ProgramDataAnaconda3liburllibrequest.py", line 561, in error
    return self._call_chain(*args)
  File "C:ProgramDataAnaconda3liburllibrequest.py", line 494, in _call_chain
    result = func(*args)
  File "C:ProgramDataAnaconda3liburllibrequest.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 406: Not Acceptable
I tried to install 'urllib', but it is already installed. I also tried adding an exception for "urllib.error".
How can I solve this? Please help!
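You need to add a user-agent header. With that in place, the request works.
If you don't send the user-agent of a real browser, the site assumes you are a bot and blocks you.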
import requests
from bs4 import BeautifulSoup

page_url = 'https://alansimpson.me/python/scrape_sample.html'
# Send a browser-like User-Agent so the site does not reject the request
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html.parser')
content = soup.article
links_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        links_list.append({'url' : url, 'img' : img, 'text' : text})
    except AttributeError:
        # Skip links that have no <img> or <span> child
        pass
print("Total data scraped: " + str(len(links_list)))
for link in links_list:
    print(link)
Output:
Total data scraped: 13
{'url': 'http://www.sixthresearcher.com/python-3-reference-cheat-sheet-for-beginners/', 'img': '../datascience/python/basics/basics256.jpg', 'text': 'Basics'}
{'url': 'https://alansimpson.me/datascience/python/beginner/', 'img': '../datascience/python/beginner/beginner256.jpg', 'text': 'Beginner'}
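To see why the original script looks like a bot, it can help to check which User-Agent urllib sends when none is supplied. The following is a minimal sketch that makes no network request; the short 'Mozilla/5.0' value is only a placeholder, not the full browser string used in the answer.

```python
from urllib.request import Request, build_opener

# A default opener carries the headers urllib adds when a request has none of its own.
default_headers = dict(build_opener().addheaders)
default_ua = default_headers["User-agent"]
print(default_ua)  # a "Python-urllib/<version>" string; the version depends on your Python

# A Request with an explicit header uses that value instead of the default.
req = Request(
    "https://alansimpson.me/python/scrape_sample.html",
    headers={"User-Agent": "Mozilla/5.0"},  # placeholder value for illustration
)
sent_ua = req.get_header("User-agent")  # urllib stores header names capitalized this way
print(sent_ua)
```

A site that filters on this header can recognize the default value as a script rather than a browser, which is consistent with the 406 response in the question.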
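A 406 Not Acceptable status code means the server cannot produce a response that satisfies the headers the client sent with the request. In this case the site appears to reject requests that do not carry a browser-like User-Agent.
Add a User-Agent to the request headers and try again.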
Solution with urllib:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

page_url = 'https://alansimpson.me/python/scrape_sample.html'
# Build the request explicitly so a User-Agent header can be attached
req = Request(page_url)
req.add_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36')
rawpage = urlopen(req).read()
soup = BeautifulSoup(rawpage, 'html5lib')
content = soup.article
links_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        links_list.append({'url' : url, 'img' : img, 'text' : text})
    except AttributeError:
        # Skip links that have no <img> or <span> child
        pass
# Print once, after the loop has collected every link
print(links_list)
Output:
[{'url': 'http://www.sixthresearcher.com/python-3-reference-cheat-sheet-for-beginners/',
'img': '../datascience/python/basics/basics256.jpg',
'text': 'Basics'},
{'url': 'https://alansimpson.me/datascience/python/beginner/',
'img': '../datascience/python/beginner/beginner256.jpg',
'text': 'Beginner'},
{'url': 'https://alansimpson.me/datascience/python/justbasics/',
'img': '../datascience/python/justbasics/justbasics256.jpg',
'text': 'Just the Basics'},
{'url': 'https://alansimpson.me/datascience/python/cheatography/',
'img': '../datascience/python/cheatography/cheatography256.jpg',
'text': 'Cheatography'},
{'url': 'https://alansimpson.me/datascience/python/dataquest/',
'img': '../datascience/python/dataquest/dataquest256.jpg',
'text': 'Dataquest'},
{'url': 'https://alansimpson.me/datascience/python/essentials/',
'img': '../datascience/python/essentials/essentials256.jpg',
'text': 'Essentials'},
{'url': 'https://alansimpson.me/datascience/python/memento/',
'img': '../datascience/python/memento/memento256.jpg',
'text': 'Memento'},
{'url': 'https://alansimpson.me/datascience/python/syntax/',
'img': '../datascience/python/syntax/syntax256.jpg',
'text': 'Syntax'},
{'url': 'https://alansimpson.me/datascience/python/classes/',
'img': '../datascience/python/classes/classes256.jpg',
'text': 'Classes'},
{'url': 'https://alansimpson.me/datascience/python/dictionaries/',
'img': '../datascience/python/dictionaries/dictionaries256.jpg',
'text': 'Dictionaries'},
{'url': 'https://alansimpson.me/datascience/python/functions/',
'img': '../datascience/python/functions/functions256.jpg',
'text': 'Functions'},
{'url': 'https://alansimpson.me/datascience/python/ifwhile/',
'img': '../datascience/python/ifwhile/ifwhile256.jpg',
'text': 'If & While Loops'},
{'url': 'https://alansimpson.me/datascience/python/lists/',
'img': '../datascience/python/lists/lists256.jpg',
'text': 'Lists'}]
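Since the question mentions trying to add an exception for urllib.error, here is one way that could look alongside the User-Agent fix. This is a sketch, not part of either answer: fetch_html, its opener parameter, and the short USER_AGENT value are hypothetical names introduced for illustration.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Placeholder value; the answers above use a full browser User-Agent string.
USER_AGENT = "Mozilla/5.0"

def fetch_html(url, opener=urlopen):
    """Return the page body as bytes, or None if the server answers with an HTTP error."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener(req) as response:
            return response.read()
    except HTTPError as err:
        # err.code is the numeric status (406 in the question), err.reason its text
        print(f"Request to {url} failed: HTTP {err.code} {err.reason}")
        return None
```

Calling fetch_html(page_url) and passing the result to BeautifulSoup would then replace the bare urlopen call, with a check for None before parsing. Catching the exception only reports the failure more cleanly; it is still the User-Agent header that avoids the 406.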