Scrapy - download files with the correct extension



I have the following spider:

import os

import pandas as pd
import scrapy

# FOLDER, LIST and LinkCheckerItem are defined elsewhere (not shown here).

class Downloader(scrapy.Spider):
    name = "sor_spider"
    download_folder = FOLDER

    def get_links(self):
        # Return the "Value" column as a Series mapping row index -> URL.
        df = pd.read_excel(LIST)
        return df["Value"]

    def start_requests(self):
        urls = self.get_links()
        # Series.items() yields (index, url) pairs; iteritems() was removed in pandas 2.0.
        for index, url in urls.items():
            yield scrapy.Request(url=url, callback=self.download_file, errback=self.errback_httpbin, meta={"index": index}, dont_filter=True)

    def download_file(self, response):
        url = response.url
        index = response.meta["index"]
        # Always application/octet-stream for these links, so not useful here.
        content_type = response.headers['Content-Type']
        # The file is saved under its row index, with no extension.
        download_path = os.path.join(self.download_folder, str(index))
        with open(download_path, "wb") as f:
            f.write(response.body)
        yield LinkCheckerItem(index=index, url=url, code="downloaded")

    def errback_httpbin(self, failure):
        # Record failed requests so they also show up in the exported results.
        yield LinkCheckerItem(index=failure.request.meta["index"], url=failure.request.url, code="error")

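It is supposed to:

  1. Read an Excel file (LIST) containing the links
  2. Visit each link and download the file to FOLDER
  3. Record the result in a LinkCheckerItem (I am exporting these to CSV)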

This generally works fine, but my list contains different types of files - zip, pdf, doc, etc.

These are examples of the links in my LIST:

https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=2c5fb68702294531afd03041e877ca84
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1173293
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1263289
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=eb9f06d2b837401eba9c66c8bf5be813
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=952317
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=1042224
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1160005
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=925955
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166563
http://npoimpuls.ru/templates/npoimpuls/material/documents/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA%20%D0%B0%D1%84%D1%84%D0%B8%D0%BB%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%BD%D1%8B%D1%85%20%D0%BB%D0%B8%D1%86%20%D0%BD%D0%B0%2030.06.2016.pdf
http://нпоимпульс.рф/assets/download/sal30.09.2017.pdf
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166287

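I want it to save each file with its original extension, whatever that is... just like my browser does when it opens the save-file dialog.

I tried using response.headers["Content-type"] to find out the type, but in this case it is always application/octet-stream.

How can I do this?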

You need to parse the Content-Disposition header to get the correct filename.
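
As a rough illustration, here is a minimal sketch of one way to do that using only the standard library. The helper names `filename_from_content_disposition` and `build_download_name` are hypothetical (they are not part of Scrapy), and the sketch assumes the server actually sends a Content-Disposition header with a `filename` parameter, which is not guaranteed for every link.

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename from a Content-Disposition value, or None.

    Hypothetical helper: accepts str or bytes (Scrapy exposes header
    values as bytes) and relies on the stdlib email parser.
    """
    if not header_value:
        return None
    if isinstance(header_value, bytes):
        # latin-1 never fails to decode; a non-ASCII name may come out
        # garbled, but an ASCII extension survives intact.
        header_value = header_value.decode("latin-1")
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def build_download_name(index, header_value, default_ext=""):
    """Combine the row index with the extension advertised by the server."""
    filename = filename_from_content_disposition(header_value)
    ext = os.path.splitext(filename)[1] if filename else default_ext
    return "{}{}".format(index, ext)
```

Inside `download_file`, this could be used as `build_download_name(index, response.headers.get("Content-Disposition"))` in place of `str(index)` when building `download_path`. For direct links such as the `.pdf` URLs in the list, which may not send the header at all, falling back to the extension in `response.url` is one possible option.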
