Python Scrapy image pipeline not downloading (301 error)



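I am trying to download images from the following page on this website: http://39.moscowfilmfestival.ru/miff39/eng/films/?id=39016. However, I get a 301 error and the images are not downloaded. I can download all the other data points without any problem, including image_urls. (I am reusing scraping code that works on other, similar sites.) If I enter the scraped image_urls value into a browser, it returns a page with the image. However, the URL of that page is slightly different, with an extra forward slash (/) inserted: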

submit: http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg
receive: http://moscowfilmfestival.ru/upimg//cache/photo/640/6521.jpg
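
To make the difference between the two URLs explicit, they can be split into components and compared. This is only a small sketch using the standard-library urllib.parse; the helper name url_differences is illustrative and not part of the spider. Going by the two URLs quoted above, the host changes as well as the path.

```python
from urllib.parse import urlsplit

submitted = "http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg"
received = "http://moscowfilmfestival.ru/upimg//cache/photo/640/6521.jpg"


def url_differences(a, b):
    """Return {component: (value_in_a, value_in_b)} for components that differ."""
    parts_a, parts_b = urlsplit(a), urlsplit(b)
    return {
        name: (getattr(parts_a, name), getattr(parts_b, name))
        for name in ("scheme", "netloc", "path", "query")
        if getattr(parts_a, name) != getattr(parts_b, name)
    }


diff = url_differences(submitted, received)
for component, (before, after) in diff.items():
    print(f"{component}: {before!r} -> {after!r}")
```

The comparison reports the netloc and path components: the 39. subdomain is dropped and the path gains a second slash after /upimg.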

The output log for the page above is as follows:

2018-01-02 11:19:40 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:62638/session/949ab9c1-6a0a-6a42-a19a-ef72c55acc33/url {"sessionId": "949ab9c1-6a0a-6a42-a19a-ef72c55acc33", "url": "http://39.moscowfilmfestival.ru//miff39/eng/films/?id=39016"}
2018-01-02 14:46:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://39.moscowfilmfestival.ru//miff39/eng/films/?id=39016> (referer: None)
2018-01-02 14:46:59 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg> (referer: None)
2018-01-02 14:46:59 [scrapy.pipelines.files] WARNING: File (code: 301): Error downloading file from <GET http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg> referred in <None>
2018-01-02 14:46:59 [scrapy.core.scraper] DEBUG: Scraped from <200 http://39.moscowfilmfestival.ru//miff39/eng/films/?id=39016>
{'camera': ['HUANG LIAN'],
'cast': ['GAO ZIFENG, MENG HALYAN, JHAO ZIFENG, HE MIAO, WAN PEILU'],
'country': ['CHINA'],
'design': ['YANG ZHIWEN'],
'director': ['Liang Qiao'],
'festival_edition': ['39th'],
'festival_year': ['2017'],
'image_urls': ['http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg'],
'images': [],
'length': ['107'],
'music': [''],
'producer': ['DUAN PENG'],
'production': ['SUNNYWAY FILM'],
'program': ['Main Competition'],
'script': ['LI YONG'],
'sound': ['HU MAI, HAO CONG'],
'synopsis': ['The story begins with Vince Kang, a reporter in Beijing, having '
'to go back to his hometown to report a crested ibis, one of the '
'national treasures found unexpectedly. During the process of '
'pursuit and hide of the crested ibis, everyone’s interest is '
'revealed and the scars, both mental and physical were rip up. '
'In addition, the environment pollution, an aftermath from '
'China`s development pattern, is brought into daylight. The '
'story, from the perspective of a returnee, reveals the living '
'condition of rural China and exposes the dilemma of humanity. '
'In the end, Vince, the renegade, had no alternative but make a '
'compromise with his birthland.'],
'title': ['CRESTED IBIS'],
'year': ['2017']}

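To fix this, I have tried the following:

  1. I tried to mimic the browser URL by inserting the additional /. No effect.

  2. I tried adding 301 handling to the spider class (handle_httpstatus_all = True) as well as to the settings.py file. No effect.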

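Interestingly, an earlier version of the spider that I wrote mistakenly completed a partial URL with an extra / (between the .ru and miff parts of the URL), and the GET and POST requests worked fine. They worked the same way as with the correct original page URL in the current version of the spider.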

Any help is sincerely appreciated.
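
One thing that may be worth checking, offered as a sketch rather than a confirmed fix: according to the Scrapy documentation, the media pipelines (FilesPipeline, ImagesPipeline) ignore redirects by default and treat a redirected media request as a failed download, which would match the "File (code: 301)" warning in the log. Assuming Scrapy 1.4 or later, the MEDIA_ALLOW_REDIRECTS setting turns redirect handling on for those pipelines. The ITEM_PIPELINES and IMAGES_STORE values below are placeholders, not taken from the question.

```python
# settings.py (fragment)

# Follow HTTP redirects (e.g. 301) when the media pipelines fetch files/images.
# Available in Scrapy 1.4 and later; the default is False.
MEDIA_ALLOW_REDIRECTS = True

# Assumed for illustration only; keep whatever the project already uses.
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "images"  # hypothetical output directory
```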

I suggest you use the urllib library to download the images.

from urllib import request

url = 'http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg'
file_path = r'C:/Users/admin/Desktop/test/6521.jpg'
# urlretrieve follows HTTP redirects and returns (local_path, headers)
saved_path, headers = request.urlretrieve(url, file_path)
print(saved_path)  # local path the image was saved to
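
A slightly more defensive variant of the same idea is sketched below. It assumes only the standard library; the function name download_image and the timeout value are illustrative choices. urlopen follows redirects by default, so the function also returns the final URL, which makes it easy to see whether a 301 was involved.

```python
import shutil
import urllib.request


def download_image(url, dest_path, timeout=30):
    """Save the resource at `url` to `dest_path` and return the final URL.

    urlopen follows HTTP redirects (such as a 301) by default, so the
    returned URL may differ from the one that was requested.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
        return response.geturl()


# Example call (needs network access; the destination path is a placeholder):
# final_url = download_image(
#     "http://39.moscowfilmfestival.ru/upimg/cache/photo/640/6521.jpg",
#     "6521.jpg",
# )
```

Streaming with shutil.copyfileobj avoids holding the whole image in memory, and comparing the returned URL with the requested one shows where the server redirected the request.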
