How to build a crawler to scrape a website with BS4

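I wrote a script to scrape quotes and their authors' names. In this project I use requests to fetch the page source and bs4 to parse the HTML. I use a while loop to follow the pagination link to the next page, and I want the code to stop when there are no pages left. The code works, but it never stops running.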

Here is my code:

from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    url = 'http://quotes.toscrape.com'
    r = requests.get(url)
    soup = bs(r.text,'html.parser')
    quotes = soup.find_all('span',attrs={"class":"text"})
    authors = soup.find_all('small',attrs={"class":"author"})
    p_link = soup.find('a',text="Next")
    condition = True
    while condition:
        with open('quotes.txt','a') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text+' '+authors[i].text+'\n')
        if p_link not in soup:
            condition = False
            page += 1
            url = 'http://quotes.toscrape.com/page/{}'.format(page)
            r = requests.get(url)
            soup = bs(r.text,'html.parser')
            quotes = soup.find_all('span',attrs={"class":"text"})
            authors = soup.find_all('small',attrs={"class":"author"})
            condition = True
        else:
            condition = False
    print('done')

scrape()

That is because p_link is never in soup. I can see two reasons for this.

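1. You search using the text "Next", but the actual link text appears to be "Next" + a space + a right arrow, so the exact match fails.
2. The tag carries an "href" attribute pointing to the next page, and that attribute has a different value on every page. A link found on the first page will therefore not match the link on later pages.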
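To check the first point without hitting the site, here is a minimal sketch that runs on a hand-written copy of the pager markup. `PAGER_HTML` is my assumption of what the relevant part of the page looks like, not the live page, and `string=` is the newer name for the `text=` argument used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down pager markup (an assumption, not fetched from the site).
PAGER_HTML = """
<ul class="pager">
  <li class="next">
    <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
  </li>
</ul>
"""

soup = BeautifulSoup(PAGER_HTML, "html.parser")

# Exact-string search: the <a> holds "Next " plus a nested <span>,
# so there is no single string equal to "Next" to match against.
p_link = soup.find("a", string="Next")

# The text the link really contains, including the trailing arrow.
link_text = soup.find("a").get_text()

print(p_link)
print(repr(link_text))

# This is the test the loop in the question relies on.
print(p_link not in soup)
```

Since the search comes back empty, `p_link not in soup` is true on every page and the loop keeps requesting new page numbers. As far as I can tell, `in` on a soup object only looks at its direct children anyway, so a nested link would not pass that test even if the search had found it.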

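Also, setting condition to False at the top of the first if block inside the while loop makes no difference, because you set it back to True at the end of that block anyway.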

So...

Instead of searching by the text "Next", use:

soup.find('li',attrs={"class":"next"})

For the condition, use:

if soup.find('li',attrs={"class":"next"}) is None:
    condition = False
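As a quick offline check of that condition, the sketch below runs the same lookup on two hand-written pager fragments. `has_next_page`, `MIDDLE_PAGE` and `LAST_PAGE` are names and markup I made up for illustration; the real pages contain much more HTML.

```python
from bs4 import BeautifulSoup

def has_next_page(html):
    """Return True if the page contains an <li class="next"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("li", attrs={"class": "next"}) is not None

# Hypothetical pager of a page in the middle: both "previous" and "next" are present.
MIDDLE_PAGE = (
    '<ul class="pager">'
    '<li class="previous"><a href="/page/1/">Previous</a></li>'
    '<li class="next"><a href="/page/3/">Next</a></li>'
    '</ul>'
)

# Hypothetical pager of the last page: only "previous" is left.
LAST_PAGE = (
    '<ul class="pager">'
    '<li class="previous"><a href="/page/9/">Previous</a></li>'
    '</ul>'
)

print(has_next_page(MIDDLE_PAGE))
print(has_next_page(LAST_PAGE))
```

Matching on the class of the list item sidesteps both problems above: it depends neither on the exact link text nor on the href value.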

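Finally, if you also want to write the quotes from the last page, I suggest moving the "write to file" part to the end. Or avoid the condition variable altogether, like this: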

from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    while True:
        # The first page lives at the site root; later pages are at /page/<n>
        if page == 1:
            url = 'http://quotes.toscrape.com'
        else:
            url = 'http://quotes.toscrape.com/page/{}'.format(page)
        r = requests.get(url)
        soup = bs(r.text,'html.parser')
        quotes = soup.find_all('span',attrs={"class":"text"})
        authors = soup.find_all('small',attrs={"class":"author"})
        # Write this page before checking for a next one, so the last page is saved too
        with open('quotes.txt','a',encoding='utf-8') as f:
            for quote, author in zip(quotes, authors):
                f.write(quote.text+' '+author.text+'\n')
        # No <li class="next"> means this was the last page
        if soup.find('li',attrs={"class":"next"}) is None:
            break
        page += 1
    print('done')

scrape()
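A variation you might consider, since the "next" link already carries the address of the following page: follow its href instead of counting page numbers. This is only a sketch under my own assumptions. `iter_quotes`, the `fetch` parameter and the `FAKE_SITE` pages on `example.test` are all hypothetical names and data I introduced so the example runs without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def iter_quotes(start_url, fetch):
    """Yield (quote, author) pairs, following the "next" link until it is gone.

    `fetch` is any callable that takes a URL and returns the page HTML as a string.
    """
    url = start_url
    while url is not None:
        soup = BeautifulSoup(fetch(url), "html.parser")
        quotes = soup.find_all("span", attrs={"class": "text"})
        authors = soup.find_all("small", attrs={"class": "author"})
        for quote, author in zip(quotes, authors):
            yield quote.get_text(), author.get_text()

        # Look for the link inside <li class="next">; stop when there is none.
        next_li = soup.find("li", attrs={"class": "next"})
        next_a = next_li.find("a") if next_li is not None else None
        if next_a is not None and next_a.has_attr("href"):
            # The href is relative, so resolve it against the current URL.
            url = urljoin(url, next_a["href"])
        else:
            url = None

# Two made-up pages standing in for the real site.
FAKE_SITE = {
    "http://example.test/": (
        '<span class="text">Q1</span><small class="author">A1</small>'
        '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
    ),
    "http://example.test/page/2/": (
        '<span class="text">Q2</span><small class="author">A2</small>'
    ),
}

results = list(iter_quotes("http://example.test/", FAKE_SITE.__getitem__))
print(results)
```

To run it against the real site you would pass something like `lambda url: requests.get(url).text` as `fetch` and write each pair to the file inside your own loop.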
