I have a sparse feature matrix that is the result of the following sklearn operations:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
train_data_features = vectorizer.fit_transform(y)
Converting this to a contiguous (dense) array representation would materialize all of the zeros in memory, and the resulting size would be:
train_data_features.shape[0] * train_data_features.shape[1] * train_data_features.dtype.itemsize / 1e6
收益率:"6242.4
这是8GB,而原始的稀疏表示不到1MB。那么,如何解决这个问题,使我能够有效地将结果数组拟合到随机森林分类器中呢?
"
Try this:
import numpy as np

m = np.memmap('train_data_features_dense.mmap', dtype=train_data_features.dtype, mode='w+', shape=train_data_features.shape)
train_data_features.todense(out=m)
# Do whatever work you need with m here: reading, writing, etc.
# Delete the memmap when you are done with it; del flushes the buffers to disk automatically
del m
# To load the memmap again from another script:
m = np.memmap('train_data_features_dense.mmap', dtype=train_data_features.dtype, mode='r+', shape=train_data_features.shape)
But as @yangjie said above, you should operate on the sparse matrix whenever possible.
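For example, recent versions of scikit-learn let tree-based estimators such as RandomForestClassifier take sparse input directly, so you may not need a dense copy at all. A minimal sketch, assuming a hypothetical label array train_labels aligned with the rows of train_data_features:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
# The sparse matrix is passed to fit() as-is; no dense conversion is required
forest.fit(train_data_features, train_labels)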