I am trying to parse a page with BeautifulSoup. This is the function that raises the exception:
def get_page_items(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    name = soup.find(class_="ccpProductDetail__title__text").text.strip()
    price = soup.find(attrs={"data-price-main": "price-main"}).text.strip()
    images_routes_src = soup.find_all(class_="ccpProductDetailSlideshow__slider__wrapper__list__item__image")
    images_routes = []
    try:
        for image in images_routes_src:
            images_routes.append(image['src'].strip())
    except:
        pass
    description_html = soup.find_all(class_="block large")
    description_html[0].div.decompose()
    new_tag = soup.new_tag("h3")
    new_tag.string = 'Hinweise'
    description_html[2].span.replace_with(new_tag)
    beschreibung_html = soup.find(class_="block large text")
    description_html.insert(1, beschreibung_html)
    item = Item(name, price, images_routes, description_html)
    return item
This is where the function is called through a pool:
for index, page in enumerate(pages_urls):
    if page is not pages_urls[-len(pages_urls)]:
        init_BeautifulSoup(pages_urls[index])
    get_all_page_item_links()
    page_items = pool.map(get_page_items, items_urls)
    total_items.extend(page_items)
This is the output:
Traceback (most recent call last):
  File "/Users/rodrigopeniche/Documents/workspace/WebScraping/conrad_scrapping (4).py", line 120, in <module>
    page_items = pool.map(get_page_items, items_urls)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
IndexError: list index out of range
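I have read in many other posts that this sometimes happens with multiprocessing, so I tried calling the function in a plain for loop instead, but I get this error: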
Traceback (most recent call last):
  File "/Users/rodrigopeniche/Documents/workspace/WebScraping/conrad_scrapping (4).py", line 123, in <module>
    page_items.append(get_page_items(url))
  File "/Users/rodrigopeniche/Documents/workspace/WebScraping/conrad_scrapping (4).py", line 78, in get_page_items
    description_html[0].div.decompose()
IndexError: list index out of range
If I run it on just one arbitrary element of the list, the script finishes without errors, for example:
get_all_page_item_links()
item = get_page_items(items_urls[3])
print item.description_html
What is going on here?
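Can you verify that description_html actually contains anything? The traceback points at line 78, in get_page_items, and indexing an empty list raises exactly this error: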
>>> foo = []
>>> foo[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
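To see which pages come back without the expected elements, one option is to run the extraction serially and record the URLs that fail. This is only a rough sketch: find_failing_urls and fake_extract are hypothetical names that are not part of the original script, and in the real script you would pass get_page_items as the extract argument.

```python
def find_failing_urls(urls, extract):
    """Call extract(url) for each URL and collect the ones that raise."""
    failures = []
    for url in urls:
        try:
            extract(url)
        except (IndexError, AttributeError) as exc:
            # IndexError: a find_all() result had fewer elements than expected.
            # AttributeError: find() returned None and an attribute was read from it.
            failures.append((url, repr(exc)))
    return failures


# Hypothetical stand-in for get_page_items, used only to demonstrate the helper.
def fake_extract(url):
    blocks = [] if "empty" in url else ["a", "b", "c"]
    return blocks[0]


failing = find_failing_urls(["ok-page", "empty-page"], fake_extract)
```

Opening one of the reported URLs in a browser should then show whether that page really lacks the "block large" elements.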
I suggest adding an if condition:
description_html = soup.find_all(class_="block large")
if description_html:
    description_html[0].div.decompose()
    new_tag = soup.new_tag("h3")
    new_tag.string = 'Hinweise'
    description_html[2].span.replace_with(new_tag)
else:
    pass
    # Do something here
beschreibung_html = soup.find(class_="block large text")
if beschreibung_html:
    description_html.insert(1, beschreibung_html)
    item = Item(name, price, images_routes, description_html)
else:
    pass
    # do something here
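Note that a truthiness check such as `if description_html:` only guarantees at least one element, while the code also reads index 2. A small helper that returns None for a missing position is one way to cover both accesses. This is a sketch under assumptions: item_at is a hypothetical name, and the plain lists below stand in for the result of soup.find_all(...), which behaves like a list.

```python
def item_at(seq, index):
    """Return seq[index], or None when the sequence is too short."""
    if -len(seq) <= index < len(seq):
        return seq[index]
    return None


# Plain lists used as stand-ins for a find_all() result.
blocks = ["intro", "details"]
first = item_at(blocks, 0)   # element exists
third = item_at(blocks, 2)   # None instead of an IndexError
```

Each result can then be tested with `if third is not None:` before touching `.span` or `.div`, so a page with fewer blocks is skipped or logged rather than crashing the whole pool.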