Web scraper (Python 3.6) crashes when it encounters JavaScript in a string

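I wrote a (probably inefficient) web scraper for this website using BeautifulSoup. It works in general, but when it reaches a post that contains JavaScript, the function that fetches posts crashes: the loop over the post's content (for item in i.find_all("p")[1:]:) stops, and the later lookups for the post's metadata (i.select('span')[0].get_text()) cannot find the expected elements. One example is the last post here. I could wrap this in exception handling, but I would rather understand the problem and fix it directly. What am I doing wrong?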

from urllib.request import urlopen
import requests as rs
from bs4 import BeautifulSoup as BS
import re
from itertools import chain
posts = []
def post_data(postlist, weblink, rmin, rmax):
    page = rs.get(weblink)
    soup = BS(page.content, 'lxml')
    for d in range(rmin, rmax):
        for i in soup.find_all("div", id="position-"+str(d)):
            text = []
            for item in i.find_all("p")[1:]:
                text.append(item.get_text().replace("\n" , "/" ).replace("," , "$" ))
            text = "".join(text)
            text.replace("\n", "/").replace("," , "$" )
            postlist.append((weblink, str(d), i.find("strong").get_text() , text , i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
    postlist=list((chain.from_iterable(postlist)))
post_data(posts, "http://www.poliscirumors.com/topic/tenure-denial-blog/page/23", 460, 461)

The error is as follows:

File "p3.py", line 20, in post_data
postlist.append((weblink, str(d), i.find("strong").get_text() , text , i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
IndexError: list index out of range

Always wrap it in try first, then debug later:

try:
    postlist.append((weblink, str(d), i.find("strong").get_text() , text , i.select('span')[0].get_text(), i.select('span')[1].get_text(), i.span["id"][1:], list(i.find("div", class_="poststuff"))[0]))
except:
    pass
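A blanket except hides which lookup failed. As a narrower alternative, each lookup can be guarded on its own so that a skipped post reports which element was missing. The following is only a minimal sketch: the helper names nth, text_of and missing_fields are hypothetical and not part of the original code or of BeautifulSoup. They rely only on list indexing and on a get_text() method, which is why they should also work on the results of i.select(...) and i.find(...).

```python
def nth(items, index, default=None):
    """Return items[index], or `default` when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


def text_of(node, default=""):
    """Return node.get_text(), or `default` when the lookup produced None."""
    return default if node is None else node.get_text()


def missing_fields(fields):
    """List the names whose lookup came back as None, in insertion order."""
    return [name for name, value in fields.items() if value is None]


# Hypothetical use inside the question's loop, where `i` is one post's div
# and `d` is its position number (shown as comments, since it needs the page):
#
#     spans = i.select("span")
#     fields = {
#         "strong": i.find("strong"),
#         "span[0]": nth(spans, 0),
#         "span[1]": nth(spans, 1),
#         "poststuff": i.find("div", class_="poststuff"),
#     }
#     if missing_fields(fields):
#         print("post", d, "is missing", missing_fields(fields))
#         continue
#     author = text_of(fields["strong"])
```

This does not explain why the elements are missing for posts that contain JavaScript. It only narrows the failure down to a specific element, which is a starting point for inspecting that post's parsed HTML.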

