Why do BeautifulSoup and multiprocessing raise "list index out of range"



I'm trying to parse a page with BeautifulSoup. This is the function that raises the exception:

def get_page_items(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    name = soup.find(class_="ccpProductDetail__title__text").text.strip()
    price = soup.find(attrs={"data-price-main" : "price-main"}).text.strip()
    images_routes_src = soup.find_all(class_="ccpProductDetailSlideshow__slider__wrapper__list__item__image")
    images_routes = []
    try:
        for image in images_routes_src:
            images_routes.append(image['src'].strip())
    except:
        pass

    description_html = soup.find_all(class_="block large")
    description_html[0].div.decompose()
    new_tag = soup.new_tag("h3")
    new_tag.string = 'Hinweise'
    description_html[2].span.replace_with(new_tag)
    beschreibung_html = soup.find(class_="block large text")
    description_html.insert(1, beschreibung_html)
    item = Item(name, price, images_routes, description_html)
    return item

This is where the function is called with the pool:

for index, page in enumerate(pages_urls):
    if page is not pages_urls[-len(pages_urls)]:
        init_BeautifulSoup(pages_urls[index])
    get_all_page_item_links()
    page_items = pool.map(get_page_items, items_urls)
    total_items.extend(page_items)

This is the output:

Traceback (most recent call last):
  File "/Users/rodrigopeniche/Documents/workspace/WebScraping/conrad_scrapping (4).py", line 120, in <module>
    page_items = pool.map(get_page_items, items_urls)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
IndexError: list index out of range

I've seen in many other posts that this sometimes happens with multiprocessing, so I tried calling the function in a plain for loop instead, but I get this error:

Traceback (most recent call last):
  File "/Users/rodrigopeniche/Documents/workspace/WebScraping/conrad_scrapping (4).py", line 123, in <module>
    page_items.append(get_page_items(url))
  File "/Users/rodrigopeniche/Documents/workspace/WebScraping/conrad_scrapping (4).py", line 78, in get_page_items
    description_html[0].div.decompose()
IndexError: list index out of range

If I only try this with a single random element of the list, the script runs without errors, for example:

 get_all_page_item_links()
 item = get_page_items(items_urls[3])
 print item.description_html

What is going on?

Can you verify whether description_html actually contains anything? The traceback points to line 78 in get_page_items, which is description_html[0].div.decompose(), and indexing an empty list raises exactly this error:

>>> foo = []
>>> foo[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
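
Also, pool.map in Python 2.7 re-raises the worker's exception in the parent without saying which input was being processed, so it can help to log the failing URL yourself first. Here is a minimal sketch of one way to do that; safe_get_page_items is a hypothetical wrapper name of mine, not part of your script:

import traceback

def safe_get_page_items(url):
    # Hypothetical wrapper: report which URL triggers the IndexError
    # instead of losing that information inside the pool.
    try:
        return get_page_items(url)
    except IndexError:
        print 'IndexError while processing %s' % url
        traceback.print_exc()
        return None

page_items = pool.map(safe_get_page_items, items_urls)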

I suggest adding an if condition:

description_html = soup.find_all(class_="block large")
if description_html:
    description_html[0].div.decompose()
    new_tag = soup.new_tag("h3")
    new_tag.string = 'Hinweise'
    description_html[2].span.replace_with(new_tag)
else:
    pass
    # Do something here
beschreibung_html = soup.find(class_="block large text")
if beschreibung_html:
    description_html.insert(1, beschreibung_html)
    item = Item(name, price, images_routes, description_html)
else:
    pass
    # do something here
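
If some pages simply don't have those blocks, another option is to return None from the else branches instead of building an Item, and then drop the missing entries after the pool call. A minimal sketch of the caller side, reusing the variable names from your question:

page_items = pool.map(get_page_items, items_urls)
# Drop the pages where get_page_items bailed out because the expected blocks were missing
page_items = [item for item in page_items if item is not None]
total_items.extend(page_items)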
