Web scraper (Python 3.6) crashes when it encounters JavaScript in a string



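I wrote a (possibly inefficient) web scraper for this website using BeautifulSoup. It generally works, but when it reaches a post that contains JavaScript, the function that collects posts crashes: the loop over the post content (for item in i.find_all("p")[1:]:) stops, and the later lookups for the post's metadata (i.select('span')[0].get_text()) cannot find the expected elements. One example is the last post here. I could write exception-handling code, but I would rather understand the problem and fix it directly. What am I doing wrong?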

from urllib.request import urlopen
import requests as rs
from bs4 import BeautifulSoup as BS
import re
from itertools import chain
posts = []
def post_data(postlist, weblink, rmin, rmax):
    page = rs.get(weblink)
    soup = BS(page.content, 'lxml')
    for d in range(rmin, rmax):
        for i in soup.find_all("div", id="position-"+str(d)):
            text = []
            for item in i.find_all("p")[1:]:
                text.append(item.get_text().replace("\n" , "/" ).replace("," , "$" ))
            text = "".join(text)
            text.replace("\n", "/").replace("," , "$" )
            postlist.append((weblink, str(d), i.find("strong").get_text() , text , i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
    postlist=list((chain.from_iterable(postlist)))
post_data(posts, "http://www.poliscirumors.com/topic/tenure-denial-blog/page/23", 460, 461)

The error is as follows:

File "p3.py", line 20, in post_data
postlist.append((weblink, str(d), i.find("strong").get_text() , text , i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
IndexError: list index out of range

Always wrap the call in try first, then debug later:

try:
    postlist.append((weblink, str(d), i.find("strong").get_text() , text , i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
except Exception:
    # skip posts where an expected element is missing
    pass
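If you would rather see which element is missing than silently skip the post, one option is to guard each indexed lookup. The following is only a minimal sketch: text_at and FakeTag are hypothetical names that do not appear in the question, and FakeTag merely stands in for a BeautifulSoup tag so the example runs without fetching the page. It assumes the IndexError comes from i.select('span') returning fewer elements than expected for the problematic post, which the traceback alone does not confirm.

```python
class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup tag (only get_text is needed)."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


def text_at(tags, index, default=""):
    """Return tags[index].get_text(), or `default` if that position does not exist."""
    if 0 <= index < len(tags):
        return tags[index].get_text()
    return default


# Stand-ins for what i.select('span') might return for two different posts.
normal_spans = [FakeTag("author"), FakeTag("timestamp")]
broken_spans = []  # assumption: no <span> was found inside this post

print(text_at(normal_spans, 1))
print(text_at(broken_spans, 0, default="<missing>"))
```

In the scraper, text_at(i.select('span'), 0) would take the place of i.select('span')[0].get_text(); a post that comes back with the default value is then easy to spot and inspect.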
