How do I scrape Chegg textbook solution pages with Python?



Long story short, I'm revisiting the exercises in an old VBA textbook for practice (specifically VBA for Modelers, 5th Edition, by S. Christian Albright).

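While doing so, I wanted to look up the answers to the exercises, which led me to Chegg. I figured I could try scraping the code blocks from the solution pages (example hyperlink below).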

Example Chegg textbook solution page, with the code block and its HTML outlined in red rectangles

I've been trying to get more familiar with Python and thought this would be a good project for learning more about web scraping.

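Below is the code I started with; I soon realized this is not as simple as scraping the HTML from each solution page. My initial plan was just to find all the div elements on the page itself, then loop through each exercise page and scrape the code blocks in the same way.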

#!/usr/bin/python3
# scrapeChegg.py - Scrapes all answer code blocks from each problem exercise in each chapter for a textbook (VBA For Modelers - 5th Edition)
import bs4, os, requests
# Starting URL point
url = 'https://www.chegg.com/homework-help/open-new-workbook-get-vbe-insert-module-enter-following-code-chapter-5-problem-1e-solution-9781285869612-exc'
# Retrieve sol'n HTML
head = {'User Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0'}
res = requests.get(url, headers=head)
try:
    res.status_code
    cheggSoup = bs4.BeautifulSoup(res.text, 'html.parser')
    print(cheggSoup.find_all('div'))
except Exception as exc:
    print('Issue occurred: %s' % (exc))

One of the div results contains the following output:

<p>
Access to this page has been denied because we believe you are using automation tools to browse the
website.
</p>
<p>
This may happen as a result of the following:
</p>
<ul>
<li>
Javascript is disabled or blocked by an extension (ad blockers for example)
</li>
<li>
Your browser does not support cookies
</li>
</ul>
<p>
Please make sure that Javascript and cookies are enabled on your browser and that you are not blocking
them from loading.
</p>
<p>
Reference ID: #5ca2ea20-0052-11ec-8c04-7749576e4445
</p>
</div>
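
A scraper can check for this denial page before treating the response as a solution page. The sketch below is only a heuristic: it assumes the block page keeps the wording quoted above, and the name `looks_blocked` and the marker strings are illustrative choices, not anything Chegg documents.

```python
# Hypothetical helper: the marker phrases are copied from the denial page
# quoted above and could change at any time.
BLOCK_MARKERS = (
    "access to this page has been denied",
    "using automation tools",
)


def looks_blocked(html: str) -> bool:
    """Return True if the HTML looks like the bot-detection page."""
    # Lowercase and collapse whitespace so phrases split across lines still match.
    normalized = " ".join(html.lower().split())
    return any(marker in normalized for marker in BLOCK_MARKERS)


sample = """<p>
Access to this page has been denied because we believe you are using automation tools to browse the
website.
</p>"""

print(looks_blocked(sample))
print(looks_blocked("<h1>VBA for Modelers (5th Edition)</h1>"))
```

In the script above, the check could be run on `res.text` before the text is handed to BeautifulSoup, so that a blocked request is reported instead of being parsed.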

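Based on the above, I can see that the page is blocking me for using automation tools. I have looked at similar questions people have asked about scraping data from Chegg, but many of the solutions are beyond my current knowledge (for example, several of them put more key/value pairs in the headers dictionary, and I'm not sure how to interpret those).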

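Essentially, my question is: how can I build up the knowledge needed to make this project work, if that is even possible, and which resources should I dig into (HTTP, web scraping with Python, and so on)? If anyone has worked on something like this before, I would appreciate any advice on what to read up on or how to make this particular project succeed. Thanks!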

Try adding the hyphen to the User-Agent HTTP header name (User-Agent, not User Agent):

import requests
from bs4 import BeautifulSoup
url = "https://www.chegg.com/homework-help/open-new-workbook-get-vbe-insert-module-enter-following-code-chapter-5-problem-1e-solution-9781285869612-exc"
# Retrieve sol'n HTML
head = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0",
}
res = requests.get(url, headers=head)
soup = BeautifulSoup(res.content, "html.parser")
print(soup.h1.text)

Prints:

VBA for Modelers (5th Edition) Edit editionThis problem has been solved:Solutions for Chapter 5Problem 1E: Open a new workbook, get into the VBE, insert a module, and enter the following code:Sub Variables() Dim nPounds As Integer, dayOfWeek As Integer nPounds = 17.5 dayOfWeek = “Monday” MsgBox nPounds & “ pounds were ordered on ” & dayOfWeekEnd SubThere are two problems here. One causes the program to fail, and the other causes an incorrect result. Explain what they are and then fix them.…
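
The key `User Agent` (with a space) is a different header name from `User-Agent`, so with the original dictionary the browser string was not sent under the header name that servers look at, and requests kept its default `python-requests/...` User-Agent. HTTP header names may not contain spaces, so a small check can catch this kind of typo before any request is made. The sketch below uses only the standard library; `invalid_header_names` is a hypothetical helper name, and the pattern is my reading of the HTTP `token` grammar for field names.

```python
import re

# Characters allowed in an HTTP header field name (the "token" rule).
# A space is not among them, so "User Agent" is rejected.
_TOKEN = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")


def invalid_header_names(headers: dict) -> list:
    """Return the header names that are not valid HTTP tokens."""
    return [name for name in headers if not _TOKEN.match(name)]


firefox = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"

print(invalid_header_names({"User Agent": firefox}))
print(invalid_header_names({"User-Agent": firefox}))
```

The first call reports `User Agent` as invalid, while the second returns an empty list. A check like this only validates spelling at the syntax level; it does not guarantee the site will accept the request.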