I am trying to normalize my dataset, which was converted with DictVectorizer earlier in the pipeline. Even though my machine has 244 GB of RAM, I get this MemoryError when normalizing the data. Here is my code snippet;
X is my feature data.
# Normalizer that will normalize the data
normalizer = Normalizer().fit(X)
Error:
File "train_models.py", line 336, in splittingdata
normalizer = Normalizer().fit(X)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.py", line 1426, in fit
X = check_array(X, accept_sparse='csr')
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 407, in check_array
_assert_all_finite(array)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
and not np.isfinite(X).all()):
MemoryError
The dataset has 560,000 rows and 23 columns.
$ free -h
total used free shared buff/cache available
Mem: 240G 563M 238G 8.6M 979M 238G
Swap: 0B 0B 0B
Here is my Python architecture:
Python 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import platform
>>> import sys
>>> platform.architecture(), sys.maxsize
(('64bit', 'ELF'), 9223372036854775807)
Here is my code:
def splittingdata(X,Y):
    # Split X and Y into training and testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
    # Normalizer that will normalize the data
    normalizer = Normalizer().fit(X)
    # Normalized Features:
    X_norm = normalizer.transform(X)
    # Split X and Y into training and testing sets for normalized data
    X_norm_train, X_norm_test, Y_norm_train, Y_norm_test = train_test_split(X_norm, Y, test_size=0.33)
    # Store Normalizer
    joblib.dump(normalizer, '../models/normalizer.pkl')
    actualdata = list([X_train, X_test, Y_train, Y_test])
    normalizeddata = list([X_norm_train, X_norm_test, Y_norm_train, Y_norm_test])
    return list([actualdata, normalizeddata])
def data_encoding(data):
    # Build X and Y
    # X : Features
    # Y : Target
    start_time = datetime.datetime.now()
    print "Start time of data encoding : ", start_time
    # Removing id column (listing_id)
    datav1 = data.drop(['id'], axis = 1)
    # Taking out the numeric columns separately
    numeric_cols = ['list','of','numeric','columns']
    #x_numeric = datav1[ numeric_cols ].as_matrix()
    x_numeric = datav1[ numeric_cols ]
    # Constructing a list of dictionaries (one dictionary per row) to use DictVectorizer
    cat_cols = ['list','of','categorical','columns']
    cat_dict = datav1[ cat_cols ].to_dict( orient = 'records' )
    # The DictVectorizer converts data from a dictionary to an array
    vectorizer = DictVectorizer()
    # Convert X to Array
    x_categorical = vectorizer.fit_transform(cat_dict).toarray()
    # Combining numeric and categorical data
    X = np.hstack(( x_numeric, x_categorical ))
    # Store Vectorizer
    joblib.dump(vectorizer, '../models/vectorizer.pkl')
    # Taking out the target variable
    Y = datav1.target_col
    outdata = list([X, Y])
    end_time = datetime.datetime.now()
    print "End time of data encoding : ", end_time
    total_time = end_time - start_time
    print "Total time taken for data encoding : ", total_time
    return outdata
def main():
    # Reading the preprocessed data
    processed_data = pd.read_csv('../data/data.csv', sep=',', encoding='utf-8', index_col=0)
    #processed_data = processed_data.head(5)
    # Encoding dataset
    encoded_data = data_encoding(processed_data)
    # Splitting dataset
    splitted_data = splittingdata(encoded_data[0], encoded_data[1])
    actualdata = splitted_data[0]
    normalizeddata = splitted_data[1]
    output = runmodels(actualdata, normalizeddata)
As suggested in the comments, I tried garbage collection by adding this snippet at the end of the data_encoding function:
#Garbage collection
del x_numeric
del x_categorical
del cat_dict
del datav1
gc.collect()
and similarly added this to my main function before calling the splittingdata function:
del processed_data
gc.collect()
What I noticed is that while the vectorizer runs, memory usage climbs to 100%, and after it finishes it only drops to 47%, even though garbage collection has run; I'm not sure why it doesn't go lower. Then the Normalizer pushes it back to 100% and fails. So even if I refactor the code as suggested in the answer, I think the problem will persist. Is there a way to find out which objects are holding most of the memory at runtime?
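For reference, here is a minimal sketch of one way to answer that last question by listing the largest objects in a given namespace (for example locals() or globals()); the helper name report_largest_objects and the report format are illustrative and not part of the original code. A dedicated tool such as the memory_profiler package is another option.

import sys
import numpy as np

def report_largest_objects(namespace, top_n=10):
    # namespace is a dict, e.g. locals() or globals()
    sizes = []
    for name, obj in namespace.items():
        if isinstance(obj, np.ndarray):
            # For numpy arrays, nbytes reports the size of the underlying buffer
            sizes.append((obj.nbytes, name, obj.shape))
        else:
            # For other objects, sys.getsizeof gives a rough (shallow) size
            sizes.append((sys.getsizeof(obj), name, None))
    sizes.sort(reverse=True)
    for nbytes, name, shape in sizes[:top_n]:
        print "%12d bytes  %-20s %s" % (nbytes, name, shape if shape is not None else '')

# e.g. call report_largest_objects(locals()) right after the vectorizer step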
In splittingdata(..) you are splitting and allocating the data twice. The first split:
# Split X and Y into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
and the second:
# Split X and Y into training and testing sets for normalized data
X_norm_train, X_norm_test, Y_norm_train, Y_norm_test = train_test_split(X_norm, Y, test_size=0.33)
Remove the first train_test_split line and keep only this:
# Normalizer that will normalize the data
normalizer = Normalizer().fit(X)
# Normalized Features:
X_norm = normalizer.transform(X)
# Split X and Y into training and testing sets for normalized data
X_norm_train, X_norm_test, Y_norm_train, Y_norm_test = train_test_split(X_norm, Y, test_size=0.33)
The edited function should look like this:
def splittingdata(X, Y):
    # Normalizer that will normalize the data
    normalizer = Normalizer().fit(X)
    # Normalized Features:
    X_norm = normalizer.transform(X)
    # Split X and Y into training and testing sets for normalized data
    X_norm_train, X_norm_test, Y_norm_train, Y_norm_test = train_test_split(X_norm, Y, test_size=0.33)
    # Store Normalizer
    joblib.dump(normalizer, '../models/normalizer.pkl')
    normalizeddata = list([X_norm_train, X_norm_test, Y_norm_train, Y_norm_test])
    return normalizeddata
If you want to keep both, you need to refactor the code as follows:
- Make the splittingdata(..) function generic so that it handles actual_data and normalized_data independently (a sketch of such a generic version follows after the main() example below).
- Update your main(..): refactor it so the models are run on actual_data and normalized_data separately, and combine the outputs at the end, something like this:

def main():
    # Reading the preprocessed data
    processed_data = pd.read_csv('../data/data.csv', sep=',', encoding='utf-8', index_col=0)
    #processed_data = processed_data.head(5)
    # Encoding dataset
    encoded_data = data_encoding(processed_data)
    X, Y = encoded_data[0], encoded_data[1]
    # Splitting dataset
    # non normalized
    actualdata = splittingdata(X, Y)
    # run only on actualdata
    output1 = runmodels(actualdata)
    # Normalizer that will normalize the data
    normalizer = Normalizer().fit(X)
    # Normalized Features:
    X_norm = normalizer.transform(X)
    normalizeddata = splittingdata(X_norm, Y)
    output2 = runmodels(normalizeddata)
    # combine output1 and output2 here
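For completeness, a minimal sketch of what that generic splittingdata(..) could look like, assuming runmodels(..) accepts a plain [X_train, X_test, Y_train, Y_test] list (this version is an illustration, not code from the original answer):

def splittingdata(X, Y):
    # Generic split: the caller passes in either the raw features X or the
    # normalized features X_norm, so only one split lives in memory at a time.
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
    return list([X_train, X_test, Y_train, Y_test])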