This is odd; all I'm trying to do is decompress a file and save it. The file has:
size: 16 MB
extension: .json.gz
source location: Google Cloud Storage
destination location: Google Cloud Storage / local file system
When I use:
%%time
import gzip
import shutil
import gcsfs
with gcp_file_system.open('somebucket/<file.json.gz>', 'rb') as fl_:
    with gzip.open(fl_, 'rb') as f_in:
        with gcp_file_system.open('somebucket/<file.json>', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
it yields: Wall time: 5min 51s
But when I try the same thing with the destination changed to the local machine:
%%time
import gzip
import shutil
import gcsfs
with gcp_file_system.open('somebucket/<file.json.gz>', 'rb') as fl_:
    with gzip.open(fl_, 'rb') as f_in:
        with open('localdir/<file.json>', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
it yields: Wall time: 8.28 s
I'm not sure what's at play here: the buffer size, network speed, or something in the gcsfs backend.
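One factor worth isolating is the chunk size `shutil.copyfileobj` uses: its `length` parameter defaults to 64 KiB on most platforms, and each chunk becomes a separate read/write against the file objects involved, so small chunks multiply round trips when either side is a network object. A minimal local sketch (in-memory buffers standing in for the GCS objects) of passing an explicit chunk size:

```python
import gzip
import io
import shutil

# Build a small gzipped payload in memory as a stand-in for the GCS object.
payload = b'{"key": "value"}\n' * 10_000
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode="wb") as gz:
    gz.write(payload)
compressed.seek(0)

# Decompress with an explicit 1 MiB chunk size instead of the default;
# against a remote file object, larger chunks mean fewer read/write calls
# and thus fewer network round trips.
out = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode="rb") as f_in:
    shutil.copyfileobj(f_in, out, length=1024 * 1024)

assert out.getvalue() == payload
```

Whether this alone explains a 5min-vs-8s gap would need measuring; the chunk size only matters if the destination file object turns each write into its own request.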
Instead of using a gcsfs file, use the BlobReader class from the GCS client library, e.g.:
Local destination:
%%time
import gzip
import shutil
from google.cloud import storage
from google.cloud.storage import fileio
storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('file.json.gz')
reader = fileio.BlobReader(blob)
f_out = open('localdir/file.json','wb')
gz = gzip.GzipFile(fileobj=reader, mode="rb")
shutil.copyfileobj(gz, f_out)
f_out.close()
gz.close()
reader.close()
GCS destination:
%%time
import gzip
import shutil
from google.cloud import storage
from google.cloud.storage import fileio
storage_client = storage.Client()
bucket = storage_client.bucket('my_bucket')
blob_in = bucket.blob('file.json.gz')
reader = fileio.BlobReader(blob_in)
blob_out = bucket.blob('file.json')
writer = fileio.BlobWriter(blob_out)
gz = gzip.GzipFile(fileobj=reader, mode="rb")
shutil.copyfileobj(gz, writer)
gz.close()
reader.close()
writer.close()
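The likely reason this is faster: gzip issues many small reads, and if each one maps to a separate request against the remote object the cost explodes, whereas BlobReader fetches large chunks and serves the small reads from memory. A local sketch (no GCS needed, using an illustrative CountingReader class) shows how the stdlib's io.BufferedReader collapses small reads the same way:

```python
import io

class CountingReader(io.RawIOBase):
    """Raw reader that counts how many read calls reach it (a stand-in
    for a network-backed file object, where each call would be a request)."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.calls = 0
    def readable(self):
        return True
    def readinto(self, b):
        self.calls += 1
        return self._buf.readinto(b)

data = b"x" * (1024 * 1024)  # 1 MiB

# Unbuffered: 10,000 tiny reads hit the underlying reader 10,000 times.
raw = CountingReader(data)
for _ in range(10_000):
    raw.read(100)
unbuffered_calls = raw.calls

# Buffered: the same tiny reads are served from a 256 KiB buffer,
# so the underlying reader is hit only a handful of times.
raw2 = CountingReader(data)
buffered = io.BufferedReader(raw2, buffer_size=256 * 1024)
for _ in range(10_000):
    buffered.read(100)
buffered_calls = raw2.calls

assert buffered_calls < unbuffered_calls
```

BlobReader also accepts a `chunk_size` argument controlling how much it fetches per request, which you can tune the same way as the buffer size here.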