Python PCA on a matrix too large to fit into memory



I have a csv that is 100,000 rows x 27,000 columns, and I am trying to run PCA on it to produce a 100,000 rows x 300 columns matrix. The csv file is 9GB. Here is what I'm currently doing:

from sklearn.decomposition import PCA as RandomizedPCA
import csv
import sys
import numpy as np
import pandas as pd

dataset = sys.argv[1]
# Load the entire 9GB csv into a DataFrame (this is the step that gets killed)
X = pd.DataFrame.from_csv(dataset)
# Separate the label column from the features
Y = X.pop("Y_Level")
# Scale each feature by its range, centred on the column mean
X = (X - X.mean()) / (X.max() - X.min())
Y = list(Y)
dimensions = 300
sklearn_pca = RandomizedPCA(n_components=dimensions)
# Fit the randomized PCA and project X down to 300 components
X_final = sklearn_pca.fit_transform(X)

When I run the code above, my program gets killed during the .from_csv step. I can get around this by splitting the csv into sets of 10,000 rows, reading them in one at a time, and then calling pd.concat. That lets me get as far as the normalization step (X - X.mean()).... before getting killed. Is my data simply too big for my macbook air, or is there a better way? I would really like to use all of my data in my machine learning application.
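The chunked read looks roughly like this (a sketch of the workaround, assuming only that the csv has a header row; it still ends with one big in-memory frame, so it only delays the failure):

import sys
import pandas as pd

dataset = sys.argv[1]
# Read the 9GB csv in pieces of 10,000 rows instead of one giant from_csv call
chunks = []
for chunk in pd.read_csv(dataset, chunksize=10000):
    chunks.append(chunk)
# Stitch the pieces back together; the concatenated frame still has to fit in memory
X = pd.concat(chunks)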


If I wanted to use IncrementalPCA as suggested in the answer below, is this how I would do it?:

from sklearn.decomposition import IncrementalPCA
import csv
import sys
import numpy as np
import pandas as pd
dataset = sys.argv[1]
chunksize_ = 10000
#total_size is 100000
dimensions = 300
reader = pd.read_csv(dataset, sep = ',', chunksize = chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)
Y = []
for chunk in reader:
    # Pull the label column off each chunk and keep the labels
    y = chunk.pop("virginica")
    Y = Y + list(y)
    # Update the PCA incrementally with this batch of rows
    sklearn_pca.partial_fit(chunk)
X = ???
# This is where I'm stuck: how do I take my final PCA and output it to X?
# The normal transform method takes in an X, which I don't have because I
# couldn't fit it into memory.

I can't find a good example of this online.

Try dividing your data, or loading it into the script in batches, and fitting your PCA on each batch with IncrementalPCA's partial_fit method.

from sklearn.decomposition import IncrementalPCA
import csv
import sys
import numpy as np
import pandas as pd
dataset = sys.argv[1]
chunksize_ = 5 * 25000
dimensions = 300
reader = pd.read_csv(dataset, sep = ',', chunksize = chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)
# First pass over the csv: fit the PCA incrementally, one batch at a time
for chunk in reader:
    y = chunk.pop("Y")
    sklearn_pca.partial_fit(chunk)
# Computed mean per feature
mean = sklearn_pca.mean_
# and stddev
stddev = np.sqrt(sklearn_pca.var_)
Xtransformed = None
# Second pass over the csv: project each chunk with the fitted components
for chunk in pd.read_csv(dataset, sep = ',', chunksize = chunksize_):
    y = chunk.pop("Y")
    Xchunk = sklearn_pca.transform(chunk)
    if Xtransformed is None:
        Xtransformed = Xchunk
    else:
        Xtransformed = np.vstack((Xtransformed, Xchunk))
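If the repeated np.vstack copies become a problem, a small variation (a sketch, not benchmarked on data of this size; the output filename is just an example) is to collect the transformed chunks in a list, stack them once at the end, and save the result:

transformed_chunks = []
labels = []
for chunk in pd.read_csv(dataset, sep = ',', chunksize = chunksize_):
    labels.extend(chunk.pop("Y"))
    # Each projected chunk is only chunksize_ x 300, so the list stays small
    transformed_chunks.append(sklearn_pca.transform(chunk))
# Concatenate once instead of re-copying the growing array on every iteration
Xtransformed = np.vstack(transformed_chunks)
# Persist the 100,000 x 300 matrix so later steps don't need the raw csv
np.save("X_reduced.npy", Xtransformed)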


PCA needs to compute a correlation matrix, and that would be 100,000 x 100,000. If the data is stored as doubles, that's 80GB. I would bet your Macbook does not have 80GB of memory.

For a reasonably sized random subset of the data, the PCA transformation matrix will probably be nearly identical.
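A sketch of that idea (assuming a ~10% sample, roughly 10,000 x 27,000 values, fits in memory, and that your scikit-learn is recent enough to accept svd_solver='randomized'; the "Y_Level" column name is taken from the question's code):

import sys
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

dataset = sys.argv[1]

# Build a random subset by sampling ~10% of the rows from each chunk
sampled = []
for chunk in pd.read_csv(dataset, chunksize=10000):
    sampled.append(chunk.sample(frac=0.1, random_state=0))
subset = pd.concat(sampled)
subset.pop("Y_Level")

# Fit the PCA on the subset only; the randomized solver keeps this affordable
pca = PCA(n_components=300, svd_solver="randomized")
pca.fit(subset)

# Then project the full dataset chunk by chunk with the fitted components
parts = []
for chunk in pd.read_csv(dataset, chunksize=10000):
    chunk.pop("Y_Level")
    parts.append(pca.transform(chunk))
X_final = np.vstack(parts)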
