scikit-learn Normalizer.fit() MemoryError



I am trying to normalize my dataset, which has been converted with DictVectorizer. Even though my machine has 244 GB of memory, I get this MemoryError when I normalize the data. Here is my code snippet,

X is my feature data.

# Normalizer that will normalize the data
normalizer = Normalizer().fit(X)

Error:

File "train_models.py", line 336, in splittingdata
    normalizer = Normalizer().fit(X)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py", line 1426, in fit
    X = check_array(X, accept_sparse='csr')
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    and not np.isfinite(X).all()):
MemoryError

The dataset has 560,000 rows and 23 columns.

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           240G        563M        238G        8.6M        979M        238G
Swap:            0B          0B          0B
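For scale: DictVectorizer's .toarray() plus np.hstack produce one big dense float64 matrix, and the failing line in the traceback, np.isfinite(X).all(), materializes a further boolean array of the same shape. A back-of-the-envelope sketch (the 50,000 one-hot column count below is purely hypothetical; the real figure depends on the cardinality of my categorical columns):

# Hypothetical sizing: 23 raw columns can expand to tens of thousands
# of one-hot columns; 50000 is an assumed figure for illustration only.
rows, cols = 560000, 50000
dense_gb = rows * cols * 8 / 1e9   # float64 matrix from .toarray()/np.hstack
mask_gb = rows * cols * 1 / 1e9    # boolean mask allocated by np.isfinite(X)
print "dense X: %.0f GB, isfinite mask: %.0f GB" % (dense_gb, mask_gb)
# dense X: 224 GB, isfinite mask: 28 GB -> combined this already exceeds 240 GB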

Here is my Python architecture.

Python 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import platform
>>> import sys
>>> platform.architecture(), sys.maxsize
(('64bit', 'ELF'), 9223372036854775807)

Here is my code:

# Imports used by the snippets below (sklearn 0.18-era paths, per the traceback)
import datetime
import gc
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib

def splittingdata(X,Y):
    # Split X and Y into training and testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
    # Normalizer that will normalize the data
    normalizer = Normalizer().fit(X)
    # Normalized Features:
    X_norm = normalizer.transform(X)
    # Split X and Y into training and testing sets for normalized data
    X_norm_train, X_norm_test, Y_norm_train, Y_norm_test = train_test_split(X_norm, Y, test_size=0.33)
    # Store Normalizer
    joblib.dump(normalizer, '../models/normalizer.pkl')
    actualdata = list([X_train, X_test, Y_train, Y_test])
    normalizeddata = list([X_norm_train, X_norm_test, Y_norm_train, Y_norm_test])
    return list([actualdata,normalizeddata])

def data_encoding(data):
    # Build X and Y
    # X : Features
    # Y : Target
    start_time = datetime.datetime.now()
    print "Start time of data encoding : ", start_time
    # Removing id column (listing_id)
    datav1 = data.drop(['id'], axis = 1)
    # Taking out the numeric columns separately
    numeric_cols = ['list','of','numeric','columns']
    #x_numeric = datav1[ numeric_cols ].as_matrix()
    x_numeric = datav1[ numeric_cols ]
    # Constructing list of dictionaries (one dictionary for each column) to use dictvectorizer
    cat_cols = ['list','of','categorical','columns']
    cat_dict = datav1[ cat_cols ].to_dict( orient = 'records' )
    # The DictVectorizer converts data from a dictionary to an array
    vectorizer = DictVectorizer()
    # Convert X to Array
    x_categorical = vectorizer.fit_transform(cat_dict).toarray()
    # Combining numeric and categorical data
    X = np.hstack(( x_numeric, x_categorical ))
    # Store Vectorizer
    joblib.dump(vectorizer, '../models/vectorizer.pkl')
    # Taking out the target variable
    Y = datav1.target_col
    outdata = list([X,Y])
    end_time = datetime.datetime.now()
    print "End time of data encoding : ", end_time
    total_time = end_time - start_time
    print "Total time taken for data encoding : ", total_time
    return outdata

def main():
    #Reading the preprocessed data
    processed_data = pd.read_csv('../data/data.csv', sep=',', encoding='utf-8',index_col=0)
    #processed_data = processed_data.head(5)
    #Encoding dataset
    encoded_data = data_encoding(processed_data)
    #Splitting dataset
    splitted_data = splittingdata(encoded_data[0], encoded_data[1])
    actualdata = splitted_data[0]
    normalizeddata = splitted_data[1]
    output = runmodels(actualdata,normalizeddata)

As suggested in the comments, I tried garbage collection by adding this snippet at the end of the data_encoding function,

#Garbage collection
del x_numeric
del x_categorical
del cat_dict
del datav1
gc.collect()

and similarly added this in my main function, before calling the splittingdata function,

del processed_data
gc.collect()

What I notice is that memory usage climbs to 100% while the vectorizer is running, and once it finishes it only drops back to 47%; I am not sure why it does not drop further even after garbage collection. The Normalizer then pushes it back up to 100% and fails. So even if I rework the code as suggested in the answer, I think the problem will remain. Is there a way to find out which objects are holding most of the memory at runtime?
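One rough way to get that picture is to walk the objects the garbage collector tracks and rank the live numpy arrays by size. The helper biggest_arrays below is hypothetical and only a sketch: gc.get_objects() reports just the objects the collector knows about, so treat the numbers as approximate.

import gc
import numpy as np

def biggest_arrays(n=10):
    # Collect every live ndarray the garbage collector can see and
    # rank them by the size of their underlying buffers.
    arrays = [o for o in gc.get_objects() if isinstance(o, np.ndarray)]
    for a in sorted(arrays, key=lambda x: x.nbytes, reverse=True)[:n]:
        print a.shape, a.dtype, "%.2f GB" % (a.nbytes / 1e9)

Calling biggest_arrays() right after the vectorizer finishes should show whether the dense X is what keeps that 47% occupied.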

In splittingdata(..) you are loading and allocating the data twice. First here:

# Split X and Y into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

and a second time here:

# Split X and Y into training and testing sets for normalized data
X_norm_train, X_norm_test, Y_norm_train, Y_norm_test = train_test_split(X_norm, Y, test_size=0.33)

Remove the first train_test_split line and keep only this:

# Normalizer that will normalize the data
normalizer = Normalizer().fit(X)
# Normalized Features:
X_norm = normalizer.transform(X)
# Split X and Y into training and testing sets for normalized data
X_norm_train, X_norm_test, Y_norm_train, Y_norm_test = train_test_split(X_norm, Y, test_size=0.33)

The edited function should look like this:

def splittingdata(X,Y):
    # Normalizer that will normalize the data
    normalizer = Normalizer().fit(X)
    # Normalized Features:
    X_norm = normalizer.transform(X)
    # Split X and Y into training and testing sets for normalized data
    X_norm_train, X_norm_test, Y_norm_train, Y_norm_test = train_test_split(X_norm, Y, test_size=0.33)
    # Store Normalizer
    joblib.dump(normalizer, '../models/normalizer.pkl')
    # X_train/X_test no longer exist here, so only the normalized split is returned
    normalizeddata = list([X_norm_train, X_norm_test, Y_norm_train, Y_norm_test])
    return normalizeddata
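As an aside, Normalizer scales each row to unit norm independently, so fit() learns nothing; the validation pass in check_array is where the MemoryError is raised. If you do not need a persisted model object, the stateless normalize function does the same job (a minimal sketch):

from sklearn.preprocessing import normalize

# Equivalent per-row scaling without a fitted estimator object
X_norm = normalize(X, norm='l2')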

If you want to keep both, you need to refactor the code as follows:

  1. Make the function splittingdata(..) generic so it supports actual_data and normalized_data independently.
  2. Update your main(..): refactor the code so that the models run separately on actual_data and normalized_data, combining the outputs at the end, something like this:

     def main():
         # Reading the preprocessed data
         processed_data = pd.read_csv('../data/data.csv', sep=',', encoding='utf-8', index_col=0)
         #processed_data = processed_data.head(5)
         # Encoding dataset
         encoded_data = data_encoding(processed_data)
         X, Y = encoded_data[0], encoded_data[1]
         # Splitting dataset: non-normalized first
         actualdata = splittingdata(X, Y)
         # run the models only on actualdata
         output1 = runmodels(actualdata)
         # Normalizer that will normalize the data
         normalizer = Normalizer().fit(X)
         # Normalized Features:
         X_norm = normalizer.transform(X)
         normalizeddata = splittingdata(X_norm, Y)
         output2 = runmodels(normalizeddata)
         # combine output1 and output2 here
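A more direct fix for the memory blow-up itself: the traceback shows check_array(X, accept_sparse='csr'), so Normalizer works on sparse input directly. DictVectorizer already returns a sparse matrix; it is the .toarray() call and the dense np.hstack in data_encoding that materialize the huge matrix. A sketch of a sparse variant, reusing x_numeric and cat_dict from the question:

import scipy.sparse as sp
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import Normalizer

vectorizer = DictVectorizer()                       # sparse output by default
x_categorical = vectorizer.fit_transform(cat_dict)  # CSR matrix, no .toarray()
# Stack the numeric block and the one-hot block without densifying anything
X = sp.hstack([sp.csr_matrix(x_numeric.values), x_categorical], format='csr')

normalizer = Normalizer().fit(X)   # fit/transform accept CSR directly
X_norm = normalizer.transform(X)

train_test_split also accepts scipy sparse matrices, so the rest of the pipeline can stay sparse until a model actually requires dense input.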
