Why does Python web scraping raise an IndexError and scrape incomplete data?



I am trying to scrape the page "https://global.oup.com/academic/content/series/v/very-short-introductions-vsi/?type=listing&lang=en&cc=in". When I run the script it raises an IndexError. The site lists around 739 books in total, but the Excel sheet my script produces contains only 719 of them. Please help me with the script: why does it raise an IndexError, and why does it scrape only 719 books?

import requests
from time import sleep
from random import randint
import numpy as np
from bs4 import BeautifulSoup as bs
import openpyxl

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

excel = openpyxl.Workbook()
sheet = excel.active
sheet.title = 'avsi'
sheet.append(['Titles', 'Price', 'Author', 'ISBN', 'Paperback', 'Date'])

pages = np.arange(0, 800, 100)
for page in pages:
    page = requests.get("https://global.oup.com/academic/content/series/v/very-short-introductions-vsi/?prevNumResPerPage=100&prevSortField=1&resultsPerPage=100&sortField=1&type=listing&start="+str(page)+"&lang=en&cc=in")
    soup = bs(page.text, 'html.parser')
    sleep(randint(2,8))
    books = soup.find('div', class_='search_result_list').find_all('tr')
    #print(len(books))
    for book in books:
        titles = book.find_all('td', class_='result_biblio')
        for title in titles:
            name = title.find('a').text
            #print(len(name))
            price = title.select('p')[2].text
            author = title.select('p')[3].text
            #print(name)
            isbn = title.select('p')[4].text.split(" ")[1]
            paperback = title.select('p')[4].text.split(" ")[2]
            date = title.select('p')[4].text.split(" ")[3]
            month = title.select('p')[4].text.split(" ")[4]
            year = title.select('p')[4].text.split(" ")[5]
            #data = [name,price,author]
            #data = [name,price,author,isbn]
            data = [name,price,author,isbn,paperback,date+" "+month+" "+year]
            sheet.append(data)
            excel.save('avs1.xlsx')

Why does it scrape only 719 books?

The answer is quite simple: you iterate over the books page by page, and on every iteration you append one book and save the Excel file. That is where your error message comes into play: when the error is raised the script stops, so no further rows are added to the sheet. The 719 rows are simply all the books that were processed before the error occurred.
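
You can see this behaviour in isolation with a minimal, self-contained sketch (the file name demo.xlsx and the simulated error are made up for illustration): rows that were appended and saved before the exception stay in the file on disk.

import openpyxl

# Minimal demonstration: the workbook is saved after every append, so the
# file on disk keeps every row written before the exception was raised.
wb = openpyxl.Workbook()
ws = wb.active
try:
    for i in range(5):
        ws.append(['book ' + str(i)])
        wb.save('demo.xlsx')  # at this point the file already holds rows 0..i
        if i == 2:
            raise IndexError('simulated missing tag')  # stands in for your real error
except IndexError as exc:
    print('stopped early:', exc)  # demo.xlsx still contains the first three rows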

Why does it give an IndexError?

You are working with ResultSets and picking information by index, assuming that index exists in every result. But on the last page there is a book, near the end, that cannot be added to the cart; its row does not contain all of the expected elements, and that is what triggers the error.
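
You can reproduce the mechanism with a toy snippet (the HTML below is invented and much simpler than the real page; only the structure matters): select('p') returns a list, and indexing past its end raises an IndexError.

from bs4 import BeautifulSoup

# Toy markup: the first cell has five <p> tags, the second (standing in for
# the one book without an add-to-cart block) has only three.
html = (
    '<td class="result_biblio"><p>a</p><p>b</p><p>c</p><p>d</p><p>e</p></td>'
    '<td class="result_biblio"><p>a</p><p>b</p><p>c</p></td>'
)
soup = BeautifulSoup(html, 'html.parser')
for td in soup.find_all('td', class_='result_biblio'):
    ps = td.select('p')
    print(len(ps))
    print(ps[4].text)  # fine for the first cell, IndexError on the second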

How to fix it

Select the books more specifically, for example only those that can actually be added to the cart:

books = soup.select('.search_result_list tr:has(div.result_add)')
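
The :has() pseudo-class is handled by soupsieve, which BeautifulSoup uses for select() since version 4.7, so make sure your bs4 installation is recent enough. As a quick sanity check (a sketch reusing the same URL and selector as above), you can print how many complete rows each page yields and confirm the total matches what the site reports:

import numpy as np
import requests
from bs4 import BeautifulSoup as bs

base = ("https://global.oup.com/academic/content/series/v/very-short-introductions-vsi/"
        "?prevNumResPerPage=100&prevSortField=1&resultsPerPage=100&sortField=1"
        "&type=listing&start={}&lang=en&cc=in")

total = 0
for start in np.arange(0, 800, 100):
    soup = bs(requests.get(base.format(start)).text, 'html.parser')
    rows = soup.select('.search_result_list tr:has(div.result_add)')  # only rows with an add-to-cart block
    print(start, len(rows))
    total += len(rows)
print('rows with an add-to-cart block:', total)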

Or check whether the information you are looking for is available and handle the missing case however suits your needs:

...
try:
    isbn = title.select('p')[4].text.split(" ")[1]
except:
    isbn = None
...

Example

import requests
from time import sleep
from random import randint
import numpy as np
from bs4 import BeautifulSoup as bs
import openpyxl

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

excel = openpyxl.Workbook()
sheet = excel.active
sheet.title = 'avsi'
sheet.append(['Titles', 'Price', 'Author', 'ISBN', 'Paperback', 'Date'])

pages = np.arange(0, 800, 100)
for page in pages:
    page = requests.get("https://global.oup.com/academic/content/series/v/very-short-introductions-vsi/?prevNumResPerPage=100&prevSortField=1&resultsPerPage=100&sortField=1&type=listing&start="+str(page)+"&lang=en&cc=in", headers=headers)
    soup = bs(page.text, 'html.parser')
    sleep(randint(1,5))
    books = soup.select('.search_result_list tr')
    for book in books:
        titles = book.find_all('td', class_='result_biblio')
        for title in titles:
            name = title.find('a').text
            price = title.select('p')[2].text
            author = title.select('p')[3].text
            # all of the following pieces come from the same tag, so it is
            # fine to group them in a single try/except
            try:
                isbn = title.select('p')[4].text.split(' ')[1]
                paperback = title.select('p')[4].text.split(' ')[2]
                date = ' '.join(title.select('p')[4].text.split(' ')[3:6])
            except:
                isbn, paperback, date = (None,)*3
            data = [name, price, author, isbn, paperback, date]
            sheet.append(data)
            excel.save('avs1.xlsx')
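
If you prefer to avoid try/except entirely, another option (a sketch; nth_text is a hypothetical helper, not something from your code or from bs4) is a small function that returns a default value whenever the requested index does not exist:

from bs4 import BeautifulSoup

def nth_text(tag, selector, index, default=''):
    """Return the text of the index-th element matched by selector, or default if missing."""
    matches = tag.select(selector)
    return matches[index].get_text(strip=True) if len(matches) > index else default

# Toy markup: the second cell is missing the <p> tag that normally carries
# the ISBN, format and publication date.
soup = BeautifulSoup(
    '<td class="result_biblio"><p>t</p><p>s</p><p>price</p><p>author</p>'
    '<p>isbn format day month year</p></td>'
    '<td class="result_biblio"><p>t</p><p>s</p></td>',
    'html.parser')

for td in soup.find_all('td', class_='result_biblio'):
    print(nth_text(td, 'p', 4, default='n/a'))  # never raises an IndexError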
