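I wrote a (probably inefficient) web scraper for this site using BeautifulSoup. It works in general, but when it reaches a post that contains JavaScript, the function that collects posts crashes: the loop over the post content (for item in i.find_all("p")[1:]:) stops early, and the later lookup of the post's metadata (i.select('span')[0].get_text()) cannot find the element. An example is the last post here. I could wrap this in exception handling, but I would rather understand the problem and fix it directly. What am I doing wrong?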
from urllib.request import urlopen
import requests as rs
from bs4 import BeautifulSoup as BS
import re
from itertools import chain

posts = []

def post_data(postlist, weblink, rmin, rmax):
    page = rs.get(weblink)
    soup = BS(page.content, 'lxml')
    for d in range(rmin, rmax):
        for i in soup.find_all("div", id="position-"+str(d)):
            text = []
            for item in i.find_all("p")[1:]:
                text.append(item.get_text().replace("\n", "/").replace(",", "$"))
            text = "".join(text)
            text.replace("\n", "/").replace(",", "$")
            postlist.append((weblink, str(d), i.find("strong").get_text(), text, i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
    postlist = list(chain.from_iterable(postlist))

post_data(posts, "http://www.poliscirumors.com/topic/tenure-denial-blog/page/23", 460, 461)
The error is:

  File "p3.py", line 20, in post_data
    postlist.append((weblink, str(d), i.find("strong").get_text() , text , i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
IndexError: list index out of range
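On that line, an IndexError can only come from one of the integer subscripts: i.select('span')[0], i.select('span')[1], or list(...)[0] on the "poststuff" div. A minimal sketch for finding out which one is empty on the failing post is to count what each lookup returns before indexing into it. The helper name lookup_counts is hypothetical, and post is assumed to be a BeautifulSoup Tag such as i in the loop above.

```python
def lookup_counts(post):
    """Count the results of each lookup that the append line indexes into.

    `post` is assumed to be a bs4 Tag (for example one `div#position-N`).
    """
    spans = post.select("span")
    poststuff = post.find("div", class_="poststuff")
    return {
        "p": len(post.find_all("p")),
        "span": len(spans),
        # Iterating a Tag yields its children; a missing div counts as 0.
        "poststuff_children": 0 if poststuff is None else len(list(poststuff)),
    }
```

Calling print(d, lookup_counts(i)) just before postlist.append(...) shows, for the post that fails, which count is smaller than the index being requested.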
Always wrap it in a try first, then debug later:

try:
    postlist.append((weblink, str(d), i.find("strong").get_text(), text, i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
except:
    pass
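A bare except that passes silently also hides every other kind of failure, so a narrower variant may be easier to debug. The sketch below is one possible alternative, not the approach given above: it catches only IndexError and returns a default for the missing position. The helper name nth_text and its default parameter are hypothetical. It works on any sequence whose items have a get_text() method, such as the list returned by i.select('span').

```python
def nth_text(elements, index, default=None):
    """Return the text of elements[index], or `default` if that position is missing."""
    try:
        element = elements[index]
    except IndexError:
        # Fewer elements than expected: report the gap instead of crashing.
        return default
    return element.get_text()
```

In the scraper, i.select('span')[1].get_text() would become nth_text(i.select('span'), 1), so a post with too few span elements yields None in that tuple slot and can be inspected afterwards.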