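I'm trying to do document classification on a large corpus (4 million documents) and keep running into memory errors when using the standard scikit-learn methods. After cleaning/stemming my data, I have a very sparse matrix with a vocabulary of roughly 1 million words. My first thought was to use sklearn.decomposition.TruncatedSVD, but because of memory errors I can't run the .fit() operation with a large enough k (the largest I can manage captures only 25% of the variance of the data). I tried following the sklearn out-of-core classification example here, but still ran out of memory when doing the KNN classification.
I'd like to do the out-of-core matrix transformation manually, applying PCA/SVD to the matrix to reduce its dimensionality, but I need a way to calculate the eigenvectors first. I was hoping to use scipy.sparse.linalg.eigs. Is there a way to compute the eigenvector matrix to complete the code shown below?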
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp
import numpy as np
import cPickle as pkl
from sklearn.neighbors import KNeighborsClassifier
def pickleLoader(pklFile):
    #yield pickled objects one at a time until the end of the file
    try:
        while True:
            yield pkl.load(pklFile)
    except EOFError:
        pass
#sample docs
docs = ['orange green','purple green','green chair apple fruit','raspberry pie banana yellow','green raspberry hat ball','test row green apple']
classes = [1,0,1,0,0,1]
#first k eigenvectors to keep
k = 3
#returns sparse matrix
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(docs)
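#write sparse matrix to file one row at a time, so pickleLoader can stream the rows back
with open('pickleTest.p', 'wb') as f:
    for i in range(tfs.shape[0]):
        pkl.dump(tfs[i], f, pkl.HIGHEST_PROTOCOL)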
#NEEDED - THIS LINE THAT CALCULATES top k eigenvectors
del tfs
x = np.empty([len(docs),k])
#iterate over sparse matrix, one pickled row at a time
with open(r'D:\GitHub\Avitro-Classification\pickleTest.p', 'rb') as f:
    rowCounter = 0
    for dataRow in pickleLoader(f):
        for col in range(k):
            #project the row onto eigenvector number col
            x[rowCounter, col] = np.sum(dataRow * eigenvectors[:, col])
        rowCounter += 1
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(x, classes)
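Any help or guidance would be greatly appreciated! If there is a better way to do this, I'm happy to try a different approach, but I'd like to try KNN on this large sparse dataset, preferably with some dimensionality reduction. (This performed very well on a small test dataset I ran, and I'd hate to lose performance because of silly memory constraints!)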
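For reference, here is a minimal sketch of the kind of computation the missing line would have to do, on the toy documents only. It swaps in scipy.sparse.linalg.svds for eigs, on the assumption that the right singular vectors of the TF-IDF matrix X (which are the eigenvectors of X^T X) are what the projection needs. Nothing is mean-centered, so this is a truncated SVD rather than a true PCA, and it does nothing about the memory problem on the full 4-million-document matrix. The name `eigenvectors` is only the placeholder used in the code above, and the sketch is written for Python 3.

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['orange green', 'purple green', 'green chair apple fruit',
        'raspberry pie banana yellow', 'green raspberry hat ball',
        'test row green apple']
k = 3  # number of components to keep; svds needs k < min(tfs.shape)

tfs = TfidfVectorizer().fit_transform(docs)  # sparse, shape (n_docs, n_terms)

# svds computes k singular triplets of the sparse matrix without densifying it.
# The ordering of the returned values is not relied on here; sort explicitly.
U, s, Vt = svds(tfs, k=k)
order = np.argsort(s)[::-1]
s = s[order]
eigenvectors = Vt[order].T  # columns are right singular vectors, shape (n_terms, k)

# Project the documents one row at a time, as the streaming loop above intends.
x = np.empty((tfs.shape[0], k))
for i in range(tfs.shape[0]):
    row = tfs[i:i + 1]  # 1 x n_terms sparse slice
    x[i, :] = np.asarray(row.dot(eigenvectors)).ravel()
```

Each row of `x` is the corresponding document expressed in the k-dimensional space, so it could be passed to KNeighborsClassifier in place of the sparse rows.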
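Edit: Here is the code I first tried running, which led me down the path of writing my own out-of-core sparse PCA implementation. Any help fixing this memory error would make things a lot easier!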
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
#backslashes are doubled so the trailing one does not escape the closing quote
dataFolder = 'D:\\GitHub\\project\\'
# in the form of a list: [word sample test word, big sample test word test, green apple test word]
descWords = pickle.load(open(dataFolder + 'descriptionWords.p', 'rb'))
vectorizer = TfidfVectorizer()
X_words = vectorizer.fit_transform(descWords)
print(X_words.shape)
del descWords
del vectorizer
svd = TruncatedSVD(algorithm='randomized', n_components=50000, random_state=42)
output = svd.fit_transform(X_words)
Output:
(3995803, 923633)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-27-c0db86bd3830> in <module>()
16
17 svd = TruncatedSVD(algorithm='randomized', n_components=50000, random_state=42)
---> 18 output = svd.fit_transform(X_words)
C:\Python27\lib\site-packages\sklearn\decomposition\truncated_svd.pyc in fit_transform(self, X, y)
173 U, Sigma, VT = randomized_svd(X, self.n_components,
174 n_iter=self.n_iter,
--> 175 random_state=random_state)
176 else:
177 raise ValueError("unknown algorithm %r" % self.algorithm)
C:\Python27\lib\site-packages\sklearn\utils\extmath.pyc in randomized_svd(M, n_components, n_oversamples, n_iter, transpose, flip_sign, random_state, n_iterations)
297 M = M.T
298
--> 299 Q = randomized_range_finder(M, n_random, n_iter, random_state)
300
301 # project M to the (k + p) dimensional space using the basis vectors
C:\Python27\lib\site-packages\sklearn\utils\extmath.pyc in randomized_range_finder(A, size, n_iter, random_state)
212
213 # generating random gaussian vectors r with shape: (A.shape[1], size)
--> 214 R = random_state.normal(size=(A.shape[1], size))
215
216 # sampling the range of A using by linear projection of r
C:\Python27\lib\site-packages\numpy\random\mtrand.pyd in mtrand.RandomState.normal (numpy\random\mtrand\mtrand.c:9968)()
C:\Python27\lib\site-packages\numpy\random\mtrand.pyd in mtrand.cont2_array_sc (numpy\random\mtrand\mtrand.c:2370)()
MemoryError:
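As a rough sanity check on the error above, the sketch below estimates how large the dense arrays involved would be at n_components=50000, assuming 8-byte floats. The helper `dense_gib` is hypothetical and written only for this estimate. The exact shape of the random matrix R in the traceback depends on whether randomized_svd transposed the input and on its oversampling, which I am not assuming here, so the two figures are only order-of-magnitude bounds.

```python
def dense_gib(rows, cols, itemsize=8):
    """Size in GiB of a dense rows x cols array with itemsize-byte elements."""
    return rows * cols * itemsize / 2.0 ** 30

n_samples, n_features = 3995803, 923633  # shape printed in the output above
n_components = 50000

# Dense result of fit_transform: one row per document.
output_gib = dense_gib(n_samples, n_components)

# A dense matrix with one row per feature, e.g. the component matrix.
features_gib = dense_gib(n_features, n_components)

print("documents x components: %.0f GiB" % output_gib)
print("features  x components: %.0f GiB" % features_gib)
```

Both arrays come out at hundreds of GiB or more, so with this many components the failure looks inherent to the requested size rather than something specific to the randomized solver.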
Edit: I forgot to specify "on sparse data" in my first post.