MemoryError: sklearn K-nearest neighbors (KNN)



I am working on Windows 10 64-bit with 12 GB of RAM and a Core i5.

I am currently testing with an Amazon dataset of around 30k.

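There are 246621 items in the training data and 61656 items in the test data.

The other machine learning algorithms I tried in scikit-learn work fine, but KNN raises a MemoryError.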

My code:

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_tfidf, y_train)
prediction['knn'] = knn.predict(X_test_tfidf)
accuracy_score(y_test, prediction['knn'])*100

My error:

MemoryError                               Traceback (most recent call last)
<ipython-input-13-4d958e7f8f5b> in <module>()
1 knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_tfidf, y_train)
----> 2 prediction['knn'] = knn.predict(X_test_tfidf)
3 accuracy_score(y_test, prediction['knn'])*100
~\Anaconda3\lib\site-packages\sklearn\neighbors\classification.py in predict(self, X)
143         X = check_array(X, accept_sparse='csr')
144 
--> 145         neigh_dist, neigh_ind = self.kneighbors(X)
146 
147         classes_ = self.classes_
~\Anaconda3\lib\site-packages\sklearn\neighbors\base.py in kneighbors(self, X, n_neighbors, return_distance)
355             if self.effective_metric_ == 'euclidean':
356                 dist = pairwise_distances(X, self._fit_X, 'euclidean',
--> 357                                           n_jobs=n_jobs, squared=True)
358             else:
359                 dist = pairwise_distances(
~\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1245         func = partial(distance.cdist, metric=metric, **kwds)
1246 
-> 1247     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1248 
1249 
~\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1088     if n_jobs == 1:
1089         # Special case to avoid picklability checks in delayed
-> 1090         return func(X, Y, **kwds)
1091 
1092     # TODO: in some cases, backend='threading' may be appropriate
~\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
244         YY = row_norms(Y, squared=True)[np.newaxis, :]
245 
--> 246     distances = safe_sparse_dot(X, Y.T, dense_output=True)
247     distances *= -2
248     distances += XX
~\Anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
133     """
134     if issparse(a) or issparse(b):
--> 135         ret = a * b
136         if dense_output and hasattr(ret, "toarray"):
137             ret = ret.toarray()
~\Anaconda3\lib\site-packages\scipy\sparse\base.py in __mul__(self, other)
367             if self.shape[1] != other.shape[0]:
368                 raise ValueError('dimension mismatch')
--> 369             return self._mul_sparse_matrix(other)
370 
371         # If it's a list or whatever, treat it like a matrix
~\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in _mul_sparse_matrix(self, other)
538                                     maxval=nnz)
539         indptr = np.asarray(indptr, dtype=idx_dtype)
--> 540         indices = np.empty(nnz, dtype=idx_dtype)
541         data = np.empty(nnz, dtype=upcast(self.dtype, other.dtype))
542 
MemoryError: 

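You can try increasing leaf_size, as suggested in the KNeighborsClassifier documentation:

leaf_size : int, optional (default = 30)

Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

First set algorithm = "kd_tree", then try, for example, leaf_size = 300.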
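The suggestion above might look like the following minimal sketch. The data here is a small synthetic dense array, a hypothetical stand-in for the TF-IDF matrices in the question, and the names X_train, X_test and y_train are illustrative only. One caveat to keep in mind: to my knowledge, scikit-learn's tree-based algorithms need dense input and the estimator falls back to brute force with a warning when fitted on a sparse matrix, so a sparse TF-IDF matrix would probably need some form of dimensionality reduction first for this setting to take effect.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dense stand-in data (hypothetical, not the Amazon dataset).
rng = np.random.RandomState(0)
X_train = rng.rand(1000, 10)
y_train = rng.randint(0, 2, size=1000)
X_test = rng.rand(200, 10)

# Use a KD-tree instead of brute-force pairwise distances, with a larger
# leaf size than the default of 30, as suggested in the answer.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree", leaf_size=300)
knn.fit(X_train, y_train)

# One predicted label per test row.
pred = knn.predict(X_test)
```

A larger leaf_size means fewer tree nodes to store, at the cost of more brute-force comparisons inside each leaf, so the value 300 is only a starting point to experiment with.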
