Downloading a large number of pdf files with aiohttp



I am trying to download a large number of pdf files asynchronously. Python requests does not play well with async functionality.

I have found aiohttp hard to get working for pdf downloads, though, and I could not find an existing thread on this specific task that would be easy to follow for someone new to the Python async world.

Yes, it can be done with ThreadPoolExecutor, but in this case it would be better to stay in a single thread.
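(For comparison, here is a rough sketch of the thread-based approach I mean, assuming the requests library; download_sync is just an illustrative helper, not code I am actually using:)

import requests
from concurrent.futures import ThreadPoolExecutor

def download_sync(url, dest_file):
    # Plain blocking download; each call occupies one worker thread
    resp = requests.get(url)
    if resp.status_code == 200:
        with open(dest_file, "wb") as f:
            f.write(resp.content)

urls = ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"] * 3
with ThreadPoolExecutor(max_workers=5) as pool:
    for i, url in enumerate(urls):
        pool.submit(download_sync, url, f"download_{i}.pdf")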

This code works, but it needs to handle around 100 urls asynchronously:

import asyncio
import aiohttp
import aiofiles

async def main():
    async with aiohttp.ClientSession() as session:
        url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
        async with session.get(url) as resp:
            if resp.status == 200:
                # Write the response body to disk without blocking the event loop
                f = await aiofiles.open('download_pdf.pdf', mode='wb')
                await f.write(await resp.read())
                await f.close()

asyncio.run(main())

Thanks in advance.

You can try something like this. To keep it simple, the same dummy pdf is downloaded to disk several times, under different file names:

from asyncio import Semaphore, gather, run, wait_for
from random import randint

import aiofiles
from aiohttp.client import ClientSession

# Mock a list of different pdfs to download
pdf_list = [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
]

MAX_TASKS = 5
MAX_TIME = 5


async def download(pdf_list):
    tasks = []
    sem = Semaphore(MAX_TASKS)

    async with ClientSession() as sess:
        for pdf_url in pdf_list:
            # Mock a different file name each iteration
            dest_file = str(randint(1, 100000)) + ".pdf"
            tasks.append(
                # Wait max 5 seconds for each download
                wait_for(
                    download_one(pdf_url, sess, sem, dest_file),
                    timeout=MAX_TIME,
                )
            )

        return await gather(*tasks)


async def download_one(url, sess, sem, dest_file):
    async with sem:
        print(f"Downloading {url}")
        async with sess.get(url) as res:
            content = await res.read()
            # Check everything went well
            if res.status != 200:
                print(f"Download failed: {res.status}")
                return
            async with aiofiles.open(dest_file, "+wb") as f:
                await f.write(content)
                # No need to call f.close() when using a with statement


if __name__ == "__main__":
    run(download(pdf_list))
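One caveat: wait_for raises asyncio.TimeoutError when a download exceeds MAX_TIME, and by default gather propagates the first exception. If you would rather let every download run to completion and inspect failures afterwards, a small variation (my own sketch, not part of the answer above) is to replace the last line of download() with:

        results = await gather(*tasks, return_exceptions=True)
        for pdf_url, result in zip(pdf_list, results):
            # Timeouts and other errors show up here as exception objects
            if isinstance(result, Exception):
                print(f"{pdf_url} failed: {result!r}")
        return results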

Keep in mind that firing multiple concurrent requests at a server may get your IP banned for a period of time. In that case, consider adding sleep calls (which somewhat defeats the purpose of using aiohttp) or switching to a classic sequential script. To keep things concurrent but friendlier to the server, the script fires a maximum of 5 requests (MAX_TASKS) at any given time.
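If you do need to slow down, one gentle option (again my own sketch; REQUEST_DELAY is an assumed constant you would tune for the target server) is a variant of download_one above that sleeps while still holding the semaphore, which paces the whole pool instead of serializing it:

from asyncio import sleep

REQUEST_DELAY = 0.5  # assumed pacing interval, adjust for the target server

async def download_one(url, sess, sem, dest_file):
    async with sem:
        # Sleeping inside the semaphore spaces requests out across the pool
        await sleep(REQUEST_DELAY)
        print(f"Downloading {url}")
        async with sess.get(url) as res:
            content = await res.read()
            if res.status != 200:
                print(f"Download failed: {res.status}")
                return
            async with aiofiles.open(dest_file, "+wb") as f:
                await f.write(content)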
