How to convert a large sparse matrix to an array (details below)



I have a sparse feature matrix, produced with sklearn as follows:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
train_data_features = vectorizer.fit_transform(y)

Converting it to a contiguous array representation would materialize all the zeros in memory, and the resulting size would be:

train_data_features.shape[0] * train_data_features.shape[1] * train_data_features.dtype.itemsize / 1e6

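This yields 6242.4 (in MB, since the byte count is divided by 1e6).

That is roughly 6.2 GB, whereas the original sparse representation is under 1 MB. So how can I work around this and efficiently fit the resulting array to a random forest classifier?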
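For reference, the arithmetic behind that estimate can be sketched in plain Python. The helper names and all figures below are assumptions for illustration: the post does not state the matrix shape or the number of non-zeros. The row count is chosen only because 156,060 rows by 5000 int64 columns happens to reproduce the 6242.4 figure, and the CSR estimate assumes 4-byte indices.

```python
def dense_megabytes(n_rows, n_cols, itemsize):
    """Size in MB of a dense array that stores every cell, zeros included."""
    return n_rows * n_cols * itemsize / 1e6


def csr_megabytes(nnz, n_rows, itemsize, index_itemsize=4):
    """Approximate size in MB of a CSR matrix: data + indices + indptr."""
    data_and_indices = nnz * (itemsize + index_itemsize)
    indptr = (n_rows + 1) * index_itemsize
    return (data_and_indices + indptr) / 1e6


# Hypothetical figures, not taken from the post.
n_rows, n_cols, itemsize = 156_060, 5_000, 8  # 8 bytes per int64 count
nnz = 20_000                                   # assumed number of non-zero cells

dense_mb = dense_megabytes(n_rows, n_cols, itemsize)
sparse_mb = csr_megabytes(nnz, n_rows, itemsize)
print(f"dense: {dense_mb:.1f} MB, sparse (CSR): {sparse_mb:.3f} MB")
```

The dense size grows with rows times columns, while the CSR size grows only with the number of non-zeros plus one index per row, which is why the two differ by orders of magnitude here.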

Try this:

import numpy as np

# Back the dense array with a file on disk instead of RAM
m = np.memmap('train_data_features_dense.mmap', dtype=train_data_features.dtype, mode='w+', shape=train_data_features.shape)
# Write the dense values directly into the memory-mapped buffer
train_data_features.todense(out=m)
# Do some work with m here if you want: reading, writing, etc.
# Delete m once you are done with it; del flushes the buffers automatically
del m
# To load the memmap in another script
m = np.memmap('train_data_features_dense.mmap', dtype=train_data_features.dtype, mode='r+', shape=train_data_features.shape)
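A variant of the same idea, as a minimal sketch rather than the answer's exact method: fill the memmap one block of rows at a time, so only a small slice is ever densified in RAM. The matrix `X`, the temporary file path, and `block_rows` are hypothetical stand-ins introduced for this example.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Hypothetical stand-in for train_data_features.
X = sparse.random(1000, 50, density=0.02, format="csr", random_state=0)

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "dense.mmap")
m = np.memmap(path, dtype=X.dtype, mode="w+", shape=X.shape)

block_rows = 256  # arbitrary block size; tune to the RAM you can spare
for start in range(0, X.shape[0], block_rows):
    stop = min(start + block_rows, X.shape[0])
    # Only this block of rows is converted to a dense array in RAM.
    m[start:stop] = X[start:stop].toarray()

m.flush()  # push pending writes to disk
del m

# Reopen read-only to check the round trip.
reloaded = np.memmap(path, dtype=X.dtype, mode="r", shape=X.shape)
print(np.array_equal(reloaded, X.toarray()))
```

Note that the file on disk still has the full dense size, so this trades RAM for disk space and I/O time.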

But as @yangjie said above, you should operate on the sparse matrix directly wherever possible.
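As a sketch of what staying sparse can look like: scikit-learn's `RandomForestClassifier` accepts SciPy sparse matrices in `fit` and `predict`, so the dense conversion may not be needed at all. The data and labels below are randomly generated placeholders, not the poster's data, and `n_estimators=10` is an arbitrary small value to keep the example fast.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in for train_data_features:
# 200 documents, 50 count features, roughly 5% of cells non-zero.
counts = rng.integers(1, 5, size=(200, 50)) * (rng.random((200, 50)) < 0.05)
X = sparse.csr_matrix(counts)
labels = rng.integers(0, 2, size=200)  # hypothetical binary labels

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, labels)            # X is passed as-is; no toarray()/todense() call
predictions = forest.predict(X)  # prediction also takes the sparse matrix
print(predictions.shape)
```

With the real data, this would amount to passing `train_data_features` straight to `fit` along with the training labels.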
