Combining ThreadPoolExecutor and ProcessPoolExecutor from Python's concurrent.futures



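I have to download many bz2-compressed files and decompress them for further processing. Downloading is I/O-bound and decompressing is CPU-bound, so I thought it would be best to combine ThreadPoolExecutor and ProcessPoolExecutor. To be clear: I do not want to wait until all files are downloaded before decompressing them. I would rather use my CPU resources while the other files are still downloading. I read this thread, but it did not seem to help me. I have code like this: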

import bz2
import requests
from concurrent import futures
from io import BytesIO


class Source:

    def __init__(self, url):
        self.url = url
        self.compressed = None
        self.binary = None

    def download(self):
        print(f'Start downloading {self.url}')
        req = requests.get(self.url, timeout=5)
        self.compressed = req.content
        print(f'Finished downloading {self.url}')
        return self

    def unzip(self):
        print(f'Start unzipping {self.url}')
        with bz2.open(BytesIO(self.compressed), 'rb') as file:
            self.binary = file.read()
        print(f'Finished unzipping {self.url}')
        return self


# list_urls is a list of URL strings, defined elsewhere
list_sources_init = [Source(url) for url in list_urls]
with futures.ThreadPoolExecutor() as executor_threads, futures.ProcessPoolExecutor() as executor_processes:
    list_futures_after_download = [
        executor_threads.submit(source.download)
        for source in list_sources_init
    ]
    list_futures_after_unzip = []
    for future in futures.as_completed(list_futures_after_download):
        source = future.result()
        list_futures_after_unzip.append(executor_processes.submit(source.unzip))

list_sources_unzipped = [future.result() for future in list_futures_after_unzip]

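This works, but it seems a bit fishy. I also wonder why the elements of list_sources_init are not downloaded. Originally I planned to work with just this one list and run the parallel operations on its elements. Now I end up with three lists, parts of which contain the same data. The most painful part is that the compressed data is stored in both list_futures_after_download and list_futures_after_unzip.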

I guess there is a better way. But how?

This is more concise. I don't think using threads brings any benefit here:

from multiprocessing import Pool, freeze_support
import requests
import bz2
from io import BytesIO

URL_LIST = []


def processURL(url):
    try:
        with requests.Session() as session:
            r = session.get(url, timeout=5)
            r.raise_for_status()
            with bz2.open(BytesIO(r.content), 'rb') as data:
                return data.read()
    except Exception:
        pass  # will implicitly return None


def main():
    with Pool() as pool:
        results = pool.map(processURL, URL_LIST)
        for r in results:
            print(r)


if __name__ == '__main__':
    freeze_support()
    main()
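For comparison, if overlapping downloads with decompression is still wanted, as the question describes, one possible sketch is below. It is not the code from either post: the names fetch, download_and_unzip, io_pool and cpu_pool are made up for illustration, and urllib.request is swapped in for requests only to keep the example dependency-free. It passes plain bytes between the pools instead of Source objects. Anything submitted to a ProcessPoolExecutor is pickled and the worker acts on a copy, which would explain why the objects in the question's original list are not updated by unzip, and sending only the bytes avoids keeping the compressed data in several lists.

```python
import bz2
from concurrent import futures
from urllib.request import urlopen


def fetch(url, timeout=5):
    """I/O-bound step: return the compressed bytes behind one URL."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_and_unzip(urls, io_pool, cpu_pool, fetch_func=fetch):
    """Return {url: decompressed bytes}; each file is handed to
    cpu_pool as soon as its own download finishes."""
    # Map each download future back to its URL.
    downloads = {io_pool.submit(fetch_func, url): url for url in urls}
    unzips = {}
    for done in futures.as_completed(downloads):
        url = downloads[done]
        # done.result() re-raises any download error at this point.
        # Only the raw bytes are sent to the decompression pool.
        unzips[url] = cpu_pool.submit(bz2.decompress, done.result())
    return {url: future.result() for url, future in unzips.items()}


if __name__ == "__main__":
    list_urls = []  # placeholder: fill in the real .bz2 URLs
    with futures.ThreadPoolExecutor() as io_pool, \
            futures.ProcessPoolExecutor() as cpu_pool:
        results = download_and_unzip(list_urls, io_pool, cpu_pool)
    print(f"Decompressed {len(results)} files")
```

Whether this beats the plain Pool version above depends on the workload; with few, large files the single-pool approach may well be sufficient.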
