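Background: I'm just getting started with scikit-learn, and at the bottom of the page I read this about joblib versus pickle:
it may be more interesting to use joblib's replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string
I read this Q&A on pickle, "Common use-cases for pickle in Python", and wonder whether the community here can share the differences between joblib and pickle. When should one be used over the other?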
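To make the quoted difference concrete, here is a minimal sketch. The `data` dict and the file name `data.joblib` are made up for illustration. It shows pickle producing an in-memory bytes object, while joblib.dump is given a path and writes to disk.

```python
import os
import pickle
import tempfile

import joblib

# Illustrative payload; any picklable object would do.
data = {"name": "example", "values": list(range(5))}

# Standard-library pickle can serialize straight to an in-memory bytes object.
blob = pickle.dumps(data)
restored_from_bytes = pickle.loads(blob)

# joblib.dump is typically given a file path and writes the pickle to disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.joblib")
    joblib.dump(data, path)
    restored_from_disk = joblib.load(path)

print(restored_from_bytes == data, restored_from_disk == data)
```

As far as I know, recent joblib versions also accept an open file object in place of a path, so the "disk only" restriction in the quote may be looser in practice than it sounds.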
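- joblib is usually faster on large numpy arrays because it has special handling for the array buffers of the numpy data structure. To find out about the implementation details, you can have a look at the source code. It can also compress the data on the fly while pickling, using zlib or lz4.
- joblib also makes it possible to memory map the data buffer of an uncompressed joblib-pickled numpy array when loading it, which makes it possible to share memory between processes.
- If you don't pickle large numpy arrays, then regular pickle can be faster, especially on large collections of small Python objects (e.g. a large dict of str objects), because the pickle module of the standard library is implemented in C while joblib is pure Python.
- Since PEP 574 (pickle protocol 5) was merged in Python 3.8, it is now much more efficient (memory-wise and CPU-wise) to pickle large numpy arrays using the standard library. Large arrays in this context means 4 GB or more.
- But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory-mapped mode with mmap_mode="r".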
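The compression and memory-mapping points above can be sketched as follows. This is only an illustration under assumptions: the `obj` dict, the file names, and the compression level 3 are made up, and the array is all zeros so that it compresses well.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical object holding a nested numpy array (about 8 MB of zeros).
obj = {"weights": np.zeros((1000, 1000), dtype=np.float64), "label": "demo"}

with tempfile.TemporaryDirectory() as tmp_dir:
    compressed_path = os.path.join(tmp_dir, "obj_compressed.joblib")
    raw_path = os.path.join(tmp_dir, "obj_raw.joblib")

    # compress=3 turns on zlib compression (level 3) while writing.
    joblib.dump(obj, compressed_path, compress=3)
    # Without compression the array buffer is stored raw, so it can be memory mapped.
    joblib.dump(obj, raw_path)

    compressed_size = os.path.getsize(compressed_path)
    raw_size = os.path.getsize(raw_path)

    # mmap_mode="r" maps the array from the file read-only instead of copying it.
    mapped = joblib.load(raw_path, mmap_mode="r")
    is_memmap = isinstance(mapped["weights"], np.memmap)

    # Drop the mapping before the temporary directory is cleaned up.
    del mapped

print("compressed bytes:", compressed_size, "raw bytes:", raw_size)
print("nested array is memory mapped:", is_memmap)
```

Memory mapping only applies to the uncompressed file, which is why the sketch writes two separate files.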
Thanks to Gunjan for giving us this script! I modified it to get Python 3 results.
# compare pickle loaders (Python 3)
from time import time
import pickle
import os
import _pickle as cPickle  # C accelerator that Python 3's pickle already uses when available
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases

file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')

# Note: os.path.getsize returns bytes; the "KB" label is kept to match the output shown.
t1 = time()
with open(file, "rb") as f:
    d = pickle.load(f)
print("time for loading file size with pickle", os.path.getsize(file),"KB =>", time()-t1)

t1 = time()
with open(file, "rb") as f:
    cPickle.load(f)
print("time for loading file size with cpickle", os.path.getsize(file),"KB =>", time()-t1)

t1 = time()
joblib.load(file)
print("time for loading file size joblib", os.path.getsize(file),"KB =>", time()-t1)
time for loading file size with pickle 79708 KB => 0.16768312454223633
time for loading file size with cpickle 79708 KB => 0.0002372264862060547
time for loading file size joblib 79708 KB => 0.0006849765777587891
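One caveat when reading numbers like these: in Python 3 the pickle module already uses the C accelerator `_pickle` when it is available, so the large gap between the first two lines may come mostly from the file being cached by the operating system after the first read. A hedged sketch of a fairer measurement, with a warm-up read and repeated runs, could look like this. The helper names `best_load_time` and `load_with_pickle` and the sample data are hypothetical and stand in for 'database.clf'.

```python
import os
import pickle
import tempfile
from time import perf_counter


def best_load_time(loader, path, repeats=5):
    """Return the fastest of `repeats` timed calls to loader(path)."""
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        loader(path)
        timings.append(perf_counter() - start)
    return min(timings)


def load_with_pickle(path):
    # Use a context manager so the file handle is closed after loading.
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical sample data: a large dict of small str objects.
sample = {str(i): "value %d" % i for i in range(100_000)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pkl")
    with open(path, "wb") as handle:
        pickle.dump(sample, handle)

    load_with_pickle(path)  # warm-up read so the timed runs see a cached file
    pickle_seconds = best_load_time(load_with_pickle, path)

print("best pickle load time in seconds:", pickle_seconds)
```

The same helper can be called with `joblib.load` as the loader, since it also takes a path, which keeps the comparison on equal footing.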
I came across the same question, so I tried this out (with Python 2.7), as I needed to load a large pickle file.
# compare pickle loaders (Python 2.7)
from time import time
import pickle
import os
try:
    import cPickle
except ImportError:
    print "Cannot import cPickle"
import joblib

# Note: os.path.getsize returns bytes, although the messages below say "KB".
t1 = time()
d = pickle.load(open("classi.pickle","rb"))
print "time for loading file size with pickle", os.path.getsize("classi.pickle"),"KB =>", time()-t1

t1 = time()
cPickle.load(open("classi.pickle","rb"))
print "time for loading file size with cpickle", os.path.getsize("classi.pickle"),"KB =>", time()-t1

t1 = time()
joblib.load("classi.pickle")
print "time for loading file size joblib", os.path.getsize("classi.pickle"),"KB =>", time()-t1
The output is:
time for loading file size with pickle 1154320653 KB => 6.75876188278
time for loading file size with cpickle 1154320653 KB => 52.6876490116
time for loading file size joblib 1154320653 KB => 6.27503800392
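According to these timings, joblib loads this file faster than both the cPickle and pickle modules. Thanks.
Just a humble note: pickle is better suited to fitted scikit-learn estimators / trained models. In ML applications, trained models are mainly saved and loaded back for prediction.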
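A minimal sketch of that save-and-load-for-prediction workflow, assuming a tiny made-up dataset and a LogisticRegression model chosen only for illustration:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set: one feature, two classes.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)

# Serialize the fitted estimator to bytes, then restore it.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model should predict exactly like the original.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

A pickled estimator is generally only safe to load with the same scikit-learn version that produced it, and only from a source you trust.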