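tl;dr: I need help cleaning up the downcast_int(df) function below.
Hello, I'm trying to write my own downcasting function to save memory. I'm curious about alternatives to my code (which is, frankly, quite messy, but it works) that would make it more readable and perhaps faster.
The downcast function modifies my DataFrame in place, and I'm not sure I should be doing that.
Thanks for your help.
Sample df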
df = pd.DataFrame({
'first': [1_000, 200_000],
'second': [-30, -40_000],
'third': ["some", "string"],
'fourth': [4.5, 6.1],
'fifth': [-6, -8]
})
    first  second   third  fourth  fifth
0    1000     -30    some     4.5     -6
1  200000  -40000  string     6.1     -8
df.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 first 2 non-null int64
1 second 2 non-null int64
2 third 2 non-null object
3 fourth 2 non-null float64
4 fifth 2 non-null int64
dtypes: float64(1), int64(3), object(1)
Downcast function
def downcast_int(df):
    """Select all int columns. Convert them to unsigned or signed types."""
    cols = df.select_dtypes(include=['int64']).columns
    cols_unsigned = None
    # There is at least one negative number in a column.
    if (df[cols] < 0).any().any():
        df_unsigned = (df[cols] < 0).any()
        cols_unsigned = df_unsigned[df_unsigned == True].index
        df[cols_unsigned] = df[cols_unsigned].apply(pd.to_numeric, downcast='signed')
    # If there were any changed columns, remove them.
    if cols_unsigned is not None:
        cols = cols.drop(cols_unsigned)
    # Turn the remaining columns into unsigned integers.
    df[cols] = df[cols].apply(pd.to_numeric, downcast='unsigned')
df.info() after downcasting
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 first 2 non-null uint32
1 second 2 non-null int32
2 third 2 non-null object
3 fourth 2 non-null float64
4 fifth 2 non-null int8
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
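Just apply to_numeric() twice: once to get each column down to its smallest signed type, then a second time to reduce it to an unsigned type where possible. The second pass leaves columns that contain negative values untouched.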
df2 = df.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='signed')
df2 = df2.select_dtypes(include=[np.number]).apply(pd.to_numeric, downcast='unsigned')
df[df2.columns] = df2
This gives the same output as your approach:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 first 2 non-null uint32
1 second 2 non-null int32
2 third 2 non-null object
3 fourth 2 non-null float64
4 fifth 2 non-null int8
dtypes: float64(1), int32(1), int8(1), object(1), uint32(1)
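On the concern about modifying the DataFrame in place: one option is to wrap the two-pass idea in a function that works on a copy and returns it. The following is only a minimal sketch under a few assumptions. The name downcast_ints is hypothetical, and it selects integer columns only (np.integer) rather than all numeric columns, since, as far as I know, downcast='signed' can also turn a float column holding only whole numbers into an integer type, which may not be what you want.

```python
import numpy as np
import pandas as pd


def downcast_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with integer columns downcast (hypothetical helper)."""
    out = df.copy()  # leave the caller's DataFrame untouched
    int_cols = out.select_dtypes(include=[np.integer]).columns
    for col in int_cols:
        # Pass 1: shrink to the smallest signed integer type that fits.
        signed = pd.to_numeric(out[col], downcast="signed")
        # Pass 2: switch to an unsigned type only if there are no negatives.
        out[col] = pd.to_numeric(signed, downcast="unsigned")
    return out


df = pd.DataFrame({
    "first": [1_000, 200_000],
    "second": [-30, -40_000],
    "third": ["some", "string"],
    "fourth": [4.5, 6.1],
    "fifth": [-6, -8],
})

result = downcast_ints(df)
print(result.dtypes)  # downcast copy
print(df.dtypes)      # original dtypes are unchanged
```

Returning a copy costs some extra memory while both frames exist, so if the data is very large, reassigning with df = downcast_ints(df) lets the original be released once nothing else references it.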