Extracting an img URL from a website using Python



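This code fetches the image from a website, but for pages that have no img data I get "list index out of range". How can I get around this? I already use a lot of try/except blocks; is there another way besides that?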

url = https://www.redbook.com.au/cars/details/2016-isuzu-d-max-ls-u-high-ride-auto-4x2-my155/SPOT-ITM-445820/

For pages that have no image, I get this error:

list index out of range

For example, this URL:

https://www.redbook.com.au/cars/details/2019-audi-s3-auto-quattro-my19/SPOT-ITM-522293/

How can I skip such cases?

Code:

# -*- coding: utf-8 -*-
import json
import re

import lxml.html as lh
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from lxml import html

cars = []  # global list for storing each car_data object

with open('url.txt') as f:
    # read the file without trailing newlines
    urls = f.read().splitlines()

for url in urls:
    car_data = {}  # use it as a local variable
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    soup = bs(page.content, 'html.parser')

    # this line raises IndexError when the page has no matching <img>
    img_url = tree.xpath('//ul/li/a/img/@src')[0]
    img_url = str(img_url)
    img_url = img_url + '0'
    car_data['image_url'] = img_url
    script = soup.find('script', text=re.compile('CsnInsights.metaData'))
    jsonData = json.loads(script.text.split('CsnInsights.metaData = ')[-1].rsplit(';', 1)[0])

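You can apply the EAFP principle ("easier to ask forgiveness than permission") and handle IndexError, which is the built-in exception raised in this case: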

try:
    img_url = str(tree.xpath('//ul/li/a/img/@src')[0]) + '0'
except IndexError:
    img_url = ''

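Note that I use an empty string as the image URL value when it is unavailable (that is, it cannot be extracted from the HTML), but depending on your case you can choose another value, such as None, or use continue to skip processing the item entirely.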
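Since the question asks for an alternative to try/except, here is a minimal sketch of the check-first approach combined with continue. It relies on the fact that an empty list is falsy in Python. The helper name first_match and the sample_results dictionary are hypothetical; the lists stand in for what tree.xpath('//ul/li/a/img/@src') would return for each page, so the sketch runs without any network access.

```python
def first_match(matches, default=None):
    """Return the first element of an XPath result list, or default if it is empty."""
    return matches[0] if matches else default


# Hypothetical stand-ins for the lists that tree.xpath(...) would return per page.
sample_results = {
    'page-with-image': ['https://example.com/car.jpg?width='],
    'page-without-image': [],
}

cars = []
for url, matches in sample_results.items():
    src = first_match(matches)
    if src is None:
        # No image on this page: skip the item instead of raising IndexError.
        continue
    cars.append({'url': url, 'image_url': str(src) + '0'})

print(cars)
```

Whether to skip the item with continue or to store None (or an empty string) depends on whether records without an image are still wanted in the final data.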
