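When I process some HTML, I want the src URL of an image, but what I get back is an encoded image instead. What am I doing wrong if I want the URL?
Given a URL like this: "http://www.amazon.com/Cheese-Plate-multi-purpose-mounting-plate/dp/B00CI06DWE/"
and a desktop user agent: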
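from io import StringIO

import requests
from lxml import etree

# url is the product URL above; agent is a headers dict with a desktop User-Agent
page = requests.get(url, headers=agent)
page_txt = page.text
html_parser = etree.HTMLParser()
tree = etree.parse(StringIO(page_txt), html_parser)

# Select the <img> element with id="landingImage" and read its src attribute
path = '//img[@id="landingImage"]'
img = tree.xpath(path)
img_src = img[0].get('src')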
With that code, I get back:
'data:image/jpeg;base64,/9j/4AAQSkZJR' (truncated)
when what I want is:
http://ecx.images-amazon.com/images/I/41SNmVfXvhL.SY355.jpg
The src attribute contains a base64-encoded image. You can get the actual URL from the data-a-dynamic-image attribute, which holds a JSON string whose keys are the image URLs:
import json

path = '//img[@id="landingImage"]/@data-a-dynamic-image'
# The attribute value is a JSON object keyed by image URL; print the first key
print(next(iter(json.loads(tree.xpath(path)[0]))))
This prints:
http://ecx.images-amazon.com/images/I/41SNmVfXvhL._SX466_.jpg
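If you want something a little more defensive, here is a minimal sketch that keeps src when it is already a normal URL and only falls back to the JSON attribute when src is an inline data: URI. The helper name pick_image_url and the example.com URLs are made up for illustration, and the code assumes each JSON value is a [width, height] pair, which is worth checking against the actual page before relying on it.

```python
import json


def pick_image_url(src, dynamic_image_json):
    """Return src unless it is an inline data: URI; otherwise pick
    the largest image listed in the JSON attribute value."""
    if src and not src.startswith("data:"):
        return src
    candidates = json.loads(dynamic_image_json)
    # Assumption: each value is a [width, height] pair, so the
    # product of the two gives the pixel area of that variant.
    return max(candidates, key=lambda u: candidates[u][0] * candidates[u][1])


# Made-up sample standing in for the data-a-dynamic-image value
sample = json.dumps({
    "http://example.com/small.jpg": [355, 355],
    "http://example.com/large.jpg": [466, 466],
})

print(pick_image_url("data:image/jpeg;base64,/9j/4AAQ", sample))
```

In the code above you would call it as pick_image_url(img[0].get('src'), img[0].get('data-a-dynamic-image')).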