Emptying columns in a DataFrame while keeping memory usage low



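I have a large DataFrame with many columns (130, say) and a millisecond-resolution datetime index. I now want to empty the values in some of the columns. I don't want to drop those columns, because I may need them in the future.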

I tried two methods.

Trial 1: using "" (empty string), but this converts the columns to strings

# Blank out the unused (dummy) columns
def make_not_used_columns_nan(df):
    dummy_cols = [0, 2, 3, 4, 5, 8, 9, 14, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39,
                  40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 57, 58, 59, 63,
                  64, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
                  85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 107,
                  108, 111, 112, 113, 114, 115, 116, 117, 118, 124, 125, 126, 127, 128, 129]
    df[dummy_cols] = ""
    return df

df = make_not_used_columns_nan(df)

Trial 2: using np.nan

def make_not_used_columns_nan(df):
    dummy_cols = [0, 2, 3, 4, 5, 8, 9, 14, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39,
                  40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 51, 52, 53, 54, 57, 58, 59, 63,
                  64, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
                  85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 107,
                  108, 111, 112, 113, 114, 115, 116, 117, 118, 124, 125, 126, 127, 128, 129]
    # np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
    df[dummy_cols] = np.nan
    # Convert to the nullable integer dtype so the columns hold <NA> instead of float NaN
    df[dummy_cols] = df[dummy_cols].astype('Int32')
    return df

df = make_not_used_columns_nan(df)

Initial df

DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: int16(17), int8(113)
memory usage: 683.4 KB

Trial 1 df - using ""

DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: int16(17), int8(29), object(84)
memory usage: 3.2+ MB

Trial 2 df - using np.nan

DatetimeIndex: 4515 entries, 2022-07-20 09:02:31.120000 to 2022-07-20 11:02:20.817000
Columns: 130 entries, 0 to 129
dtypes: Int32(84), int16(17), int8(29)
memory usage: 2.1 MB

What is the best way to empty the columns while keeping memory usage low?

The lowest memory usage I could find comes from using categorical data:

df[dummy_cols] = np.nan
df[dummy_cols] = df[dummy_cols].astype('category')

An example based on your data:

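import pandas as pd
import numpy as np

# Rebuild a frame shaped like yours: 4515 rows x 130 int8 columns
df = pd.DataFrame(np.full((4515, 130), 1, dtype=np.int8),
                  index=np.linspace(0, 1, 4515, dtype='datetime64[ms]'))
# Assign by column label so the columns are replaced and take the new dtype
# (recent pandas versions write .iloc assignments in place and keep the old dtype)
wide_cols = df.columns[-17:]
df[wide_cols] = df[wide_cols].astype(np.int16)
df.info()
# dtypes: int16(17), int8(113)
# memory usage: 683.4 KB

# Blank the first 84 columns, then store them as categoricals
empty_cols = df.columns[:84]
df[empty_cols] = np.nan
df[empty_cols] = df[empty_cols].astype('category')
df.info()
# dtypes: category(84), int16(17), int8(29)
# memory usage: 692.3 KB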
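
To see where the difference comes from, here is a rough sketch that measures a single blanked column under each representation mentioned in this thread. The `candidates` and `sizes` names are only illustrative, and the exact byte counts for the object and category columns may vary between pandas versions.

```python
import numpy as np
import pandas as pd

n = 4515  # row count taken from the question

# One all-empty column per representation discussed in the thread
candidates = {
    "empty string (object)": pd.Series([""] * n, dtype=object),
    "NaN (float64)": pd.Series(np.full(n, np.nan)),
    "NA (Int32)": pd.Series([pd.NA] * n, dtype="Int32"),
    "NaN (category)": pd.Series(np.full(n, np.nan)).astype("category"),
}

# Bytes used by each column's data, excluding the index;
# deep=True also counts the Python string objects in the object column
sizes = {name: s.memory_usage(index=False, deep=True)
         for name, s in candidates.items()}

for name, size in sizes.items():
    print(f"{name:>22}: {size} bytes")
```

An all-NaN categorical column stores only small integer codes, which is why it ends up close to the original int8 footprint.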

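As the NumPy documentation explains, an ndarray is a one-dimensional block of memory (whose size grows with the amount of data) plus an indexing scheme (of fixed size). So the best you can do is shrink each column to a 1-byte dtype, such as bool or byte (int8).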
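
A minimal sketch of the 1-byte idea, under some assumptions: the small frame and the three-column `dummy_cols` list are hypothetical stand-ins for the real data, and since int8 cannot hold NaN, 0 is used here as an assumed "empty" sentinel. That only works if 0 is not a meaningful value in your data.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the 130-column one
df = pd.DataFrame(np.ones((1000, 6), dtype=np.int8))
dummy_cols = [0, 2, 3]  # illustrative subset of columns to blank out

before = df.memory_usage(index=False).sum()

# int8 cannot represent NaN, so 0 acts as the assumed "empty" marker;
# assigning an int8 array keeps each column at one byte per row
for col in dummy_cols:
    df[col] = np.zeros(len(df), dtype=np.int8)

after = df.memory_usage(index=False).sum()
print(before, after)  # both totals are the same number of bytes
```

The trade-off is that the "empty" state is no longer distinguishable from a genuine 0, which the categorical and nullable-integer approaches avoid.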

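As a side note: keep in mind that changing a column's size can be very slow if every entry has to be reallocated. I suggest checking whether saving this memory really helps you, since you will need to refill the columns later.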