NumPy equivalent of pandas replace (dictionary mapping)



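I know that working with numpy arrays can be faster than working with pandas.

I would like to know whether there is an equivalent (and faster) way to perform pandas.replace on a numpy array.

In the example below I create a dataframe and a dictionary. The dictionary contains the column names and their corresponding mappings. Is there a function that lets me pass slices of a numpy array, apply the mapping, and get a faster processing time?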

import pandas as pd
import numpy as np
# Dataframe
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data=d)
# dictionary I want to map
d_mapping = {'col1' : {1:2 , 2:1} ,  'col2' : {4:1}}
# result using pandas replace
print(df.replace(d_mapping))
# Instead of a pandas dataframe, I want to perform the same operation on a numpy array
df_np =  df.to_records(index=False)

You could try np.select(). I think how much it helps depends on the number of unique elements to be replaced.

def replace_values(df, d_mapping):
    def replace_col(col):
        # extract numpy array and column name from pd.Series
        col, name = col.values, col.name
        # generate condlist and choicelist:
        # for every key in the mapping create a boolean mask
        condlist = [col == x for x in d_mapping[name].keys()]
        choicelist = list(d_mapping[name].values())
        # pass col as the default so values without a mapping are kept
        return np.select(condlist, choicelist, col)
    return df.apply(replace_col)

Usage:

replace_values(df, d_mapping)
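
As a quick sanity check, here is a self-contained sketch that restates the function and compares its output with df.replace on the sample frame from the question. The names `expected` and `result` and the element-wise comparison at the end are my own additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
d_mapping = {"col1": {1: 2, 2: 1}, "col2": {4: 1}}


def replace_values(df, d_mapping):
    def replace_col(col):
        values, name = col.values, col.name
        mapping = d_mapping[name]
        # one boolean mask per key to be replaced
        condlist = [values == key for key in mapping]
        choicelist = list(mapping.values())
        # unmatched positions fall back to the original values
        return np.select(condlist, choicelist, default=values)

    return df.apply(replace_col)


expected = df.replace(d_mapping)
result = replace_values(df, d_mapping)
print(result)
# element-wise comparison against the pandas result
print((result.to_numpy() == expected.to_numpy()).all())
```

Note that this version assumes every column of the frame has a non-empty entry in the mapping, since np.select needs at least one condition.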

I also believe you could speed up the code above by using lists/arrays in the mapping instead of dicts, replacing the keys() and values() calls with index lookups:

d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
...
# lookups are also expensive
m = d_mapping[name]
condlist = [col == x for x in m[0]]
choicelist = m[1]
...
np.isin(col, m[0]),
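
Since the question asks about a numpy array rather than a dataframe, the same idea could be applied field by field to a structured array like the one returned by df.to_records(index=False). This is only a sketch under that assumption: `replace_fields` is a hypothetical helper name, and the array is built by hand here so the example does not depend on pandas.

```python
import numpy as np

# Structured array standing in for df.to_records(index=False) from the question.
arr = np.array(
    [(1, 4), (2, 5), (3, 6)],
    dtype=[("col1", "i8"), ("col2", "i8")],
)
# List-based mapping: [keys, replacements] per field.
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}


def replace_fields(arr, mapping):
    """Return a copy of arr with the per-field mapping applied (hypothetical helper)."""
    out = arr.copy()
    for name, (keys, values) in mapping.items():
        col = arr[name]
        # one boolean mask per key; unmatched entries keep their value
        condlist = [col == key for key in keys]
        out[name] = np.select(condlist, values, default=col)
    return out


replaced = replace_fields(arr, d_mapping)
print(replaced["col1"], replaced["col2"])
```

Working on a copy keeps the input array unchanged, which mirrors the behaviour of df.replace without inplace=True.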

Update:

Here is a benchmark:

import pandas as pd
import numpy as np
# Dataframe
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
# dictionary I want to map
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
d_mapping_2 = {
    col: dict(zip(*replacement)) for col, replacement in d_mapping.items()
}

def replace_values(df, mapping):
    def replace_col(col):
        col, (m0, m1) = col.values, mapping[col.name]
        return np.select([col == x for x in m0], m1, col)
    return df.apply(replace_col)

from timeit import timeit
print("np.select: ", timeit(lambda: replace_values(df, d_mapping), number=5000))
print("df.replace: ", timeit(lambda: df.replace(d_mapping_2), number=5000))

On my 6-year-old laptop it prints:

np.select:  3.6562702230003197
df.replace:  4.714512745998945

In this run np.select is roughly 20% faster than df.replace.
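
The benchmark above uses a three-row frame, so it mostly measures fixed overhead. To see how the comparison behaves with more rows or more keys to replace, something like the following sketch could be used. The `compare` helper, the random test data, and the chosen sizes are all assumptions of mine; no timings are claimed here, and the numbers will differ between machines and library versions.

```python
from timeit import timeit

import numpy as np
import pandas as pd


def replace_values(df, mapping):
    def replace_col(col):
        values, (keys, choices) = col.values, mapping[col.name]
        return np.select([values == k for k in keys], choices, default=values)

    return df.apply(replace_col)


def compare(n_rows, n_keys, number=10):
    """Time both approaches on one random integer column (hypothetical helper)."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"col1": rng.integers(0, n_keys, size=n_rows)})
    keys = list(range(n_keys))
    # map every key k to k + n_keys so replacements never collide with keys
    choices = [k + n_keys for k in keys]
    mapping_lists = {"col1": [keys, choices]}
    mapping_dicts = {"col1": dict(zip(keys, choices))}
    t_select = timeit(lambda: replace_values(df, mapping_lists), number=number)
    t_replace = timeit(lambda: df.replace(mapping_dicts), number=number)
    return t_select, t_replace


for n_rows, n_keys in [(1_000, 2), (100_000, 2), (100_000, 20)]:
    t_select, t_replace = compare(n_rows, n_keys)
    print(
        f"rows={n_rows:>7} keys={n_keys:>3} "
        f"np.select={t_select:.4f}s df.replace={t_replace:.4f}s"
    )
```

Because np.select builds one boolean mask per key, its cost grows with the number of keys, which is why the key count is varied alongside the row count.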
