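Is it possible that requests cannot scrape a site reliably (even with timeouts, etc.)?

I want to scrape some Goodreads pages, and my code basically works. However, I am using the requests library, and I have noticed that sometimes when I run the code I get an error, almost every time on a different request.

That is, it happens when I use the following function to walk through the site's pages and fetch a book's edition information for a given ISBN: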

import time

import requests
from bs4 import BeautifulSoup as bs


def get_editions_urls(ed_details):
    # Unpack the tuple with the information about the editions
    url, ed_num, isbn = ed_details
    # Navigate to all pages for books with more than 100 editions
    for page in range((ed_num // 100) + 1):
        r = requests.get(url, params={
            'page': str(page + 1),
            'per_page': '100',
            #'filter_by_format': 'Paperback',
            'utf8': "%E2%9C%93"})
        soup = bs(r.text, 'lxml')
        # Find all elements for the editions of the book
        editions = soup.find_all("div", class_="editionData")
        with open(f"urls_files/{isbn}_urls.txt", 'a', encoding='utf-8') as fp:
            for book in editions:
                if item := book.find("a", class_="bookTitle"):
                    if language := book.find_all("div", class_="dataValue")[-2].text:
                        fp.write(f"https://www.goodreads.com{item['href']}\n" + f"language: {language}\n"
                        )
        # Give the Goodreads server some time between requests
        time.sleep(3)

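In the loop over a book's editions, the code stops at a different book each time: sometimes at the first one, sometimes after five, and so on. I tried changing the time.sleep value and moving the call around, but nothing helped. The error is AttributeError: 'NoneType' object has no attribute 'find', even though I added the conditions precisely for that reason (and sometimes the loop works for every book in my list). As far as I can tell, the site's markup has not changed, so... could the requests library be weak at this? If so, what can I replace it with?

Is it possible that the requests library is weak at this, and if so, what could I use instead?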

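No, but your code is weak. You are not checking the HTTP status code. It should be 200. If you get anything else, abort, dump the HTML to a log or text file, and analyze the content.

That is what you should do: check the response, meaning both the HTTP status code and the HTML the web server actually returned.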
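The check described above could look roughly like the sketch below. The names check_response and UnexpectedResponse and the dump file name are illustrative assumptions, not part of requests; the function only assumes the object it receives exposes status_code, url and text the way a requests response does.

```python
from pathlib import Path


class UnexpectedResponse(Exception):
    """Raised when the server answers with anything other than HTTP 200."""


def check_response(response, dump_path="failed_response.html"):
    """Return the HTML of a response, or dump it to disk and abort.

    `response` is expected to behave like a requests response, i.e. to
    expose `status_code`, `url` and `text`. The function name, exception
    class and default dump path are hypothetical choices for this sketch.
    """
    if response.status_code != 200:
        # Save whatever the server sent so it can be inspected afterwards.
        Path(dump_path).write_text(response.text, encoding="utf-8")
        raise UnexpectedResponse(
            f"HTTP {response.status_code} for {response.url}; "
            f"body saved to {dump_path}"
        )
    return response.text
```

In get_editions_urls, this would replace the direct use of r.text, for example soup = bs(check_response(r), 'lxml'). If saving the body is not needed, calling r.raise_for_status() right after requests.get is a shorter way to abort on 4xx and 5xx answers.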

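The error is that you are trying to parse content that is most likely not present in the response. Your script may be receiving an HTTP 403 page or something similar. You need more validation in your code.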
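As one possible form of that extra validation, the per-edition parsing can be moved into a small helper that returns None instead of raising when an expected element is missing. This is only a sketch: extract_edition is a hypothetical name, and the class names and the "second-to-last dataValue holds the language" rule are taken from the question's code, not verified against the live site.

```python
BASE_URL = "https://www.goodreads.com"


def extract_edition(book):
    """Return (url, language) for one edition element, or None if incomplete.

    `book` is expected to be a BeautifulSoup Tag for a div.editionData
    element. Every lookup is checked before it is used, so a page with
    unexpected markup (for example an error page) yields None rather than
    an AttributeError or IndexError.
    """
    link = book.find("a", class_="bookTitle")
    if link is None:
        return None
    href = link.get("href")
    if not href:
        return None
    values = book.find_all("div", class_="dataValue")
    if len(values) < 2:
        # Assumption from the question: language is the second-to-last value.
        return None
    language = values[-2].get_text(strip=True)
    if not language:
        return None
    return f"{BASE_URL}{href}", language
```

The caller can then count how many editions came back as None and, if a whole page yields nothing, treat that page as suspicious and save its HTML for inspection instead of silently continuing.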
