Python code takes far longer to decompress a file and write it to Google Cloud Storage than to the local file system

This is strange; all I am doing is decompressing a file and saving it. The file has:

size: 16 MB
extension: .json.gz
Source location: Google Cloud Storage
Destination location: Google Cloud Storage / Local File System

When I use:

%%time
import gzip
import shutil
import gcsfs

gcp_file_system = gcsfs.GCSFileSystem()  # assumes default credentials

with gcp_file_system.open('somebucket/<file.json.gz>', 'rb') as fl_:
    with gzip.open(fl_, 'rb') as f_in:
        with gcp_file_system.open('somebucket/<file.json>', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

it produces: Wall time: 5min 51s

But when I try the same thing with the destination changed to my local machine:

%%time
import gzip
import shutil
import gcsfs

gcp_file_system = gcsfs.GCSFileSystem()  # assumes default credentials

with gcp_file_system.open('somebucket/<file.json.gz>', 'rb') as fl_:
    with gzip.open(fl_, 'rb') as f_in:
        with open('localdir/<file.json>', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

it produces: Wall time: 8.28 s

I am not sure what is at play here, e.g. the buffer size (buf_size), the network speed, or something in the gcsfs backend.
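One variable that is easy to test is the buffer size: gcsfs's open() accepts a block_size argument, and shutil.copyfileobj accepts a chunk length, so both sides of the copy can be tuned. A minimal sketch for such an experiment, with the 16 MiB value chosen purely for illustration:

%%time
import gzip
import shutil
import gcsfs

gcp_file_system = gcsfs.GCSFileSystem()

# 16 MiB, an arbitrary experimental value (matches the file size)
BUF = 16 * 1024 * 1024

with gcp_file_system.open('somebucket/<file.json.gz>', 'rb', block_size=BUF) as fl_:
    with gzip.open(fl_, 'rb') as f_in:
        with gcp_file_system.open('somebucket/<file.json>', 'wb', block_size=BUF) as f_out:
            # copyfileobj defaults to 64 KiB chunks on most platforms
            shutil.copyfileobj(f_in, f_out, length=BUF)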

Instead of using gcsfs file objects, use the BlobReader class from the GCS client library, for example:

Local destination:

%%time
import gzip
import shutil
from google.cloud import storage
from google.cloud.storage import fileio

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('file.json.gz')
reader = fileio.BlobReader(blob)  # buffered, file-like read access to the blob
f_out = open('localdir/file.json', 'wb')
gz = gzip.GzipFile(fileobj=reader, mode="rb")
shutil.copyfileobj(gz, f_out)
f_out.close()
gz.close()
reader.close()

GCS destination:

%%time
import gzip
import shutil
from google.cloud import storage
from google.cloud.storage import fileio

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob_in = bucket.blob('file.json.gz')
reader = fileio.BlobReader(blob_in)
blob_out = bucket.blob('file.json')
writer = fileio.BlobWriter(blob_out)  # streams the decompressed output back to GCS
gz = gzip.GzipFile(fileobj=reader, mode="rb")
shutil.copyfileobj(gz, writer)
gz.close()
reader.close()
writer.close()  # flushes buffered data and finalizes the upload
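If throughput still matters, both BlobReader and BlobWriter take a chunk_size argument that controls how much data each request to GCS moves (for uploads it must be a multiple of 256 KiB). A minimal sketch; the 8 MiB value here is an arbitrary choice for illustration:

%%time
import gzip
import shutil
from google.cloud import storage
from google.cloud.storage import fileio

storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')

# 8 MiB per request; illustrative only, must be a multiple of 256 KiB for uploads
CHUNK = 8 * 1024 * 1024

reader = fileio.BlobReader(bucket.blob('file.json.gz'), chunk_size=CHUNK)
writer = fileio.BlobWriter(bucket.blob('file.json'), chunk_size=CHUNK)
gz = gzip.GzipFile(fileobj=reader, mode="rb")
shutil.copyfileobj(gz, writer, length=CHUNK)
gz.close()
reader.close()
writer.close()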
