使用混合数量的标识符合并左侧



我有一个映射表和一个df,我想在其中应用左合并以在一组或多列上映射附加列。但是,在我的情况下,可用的标识符每行都不同。

下面是一个示例:

maptable =
asset_class currency target
0      Equity      EUR     t1
1          FX      EUR     t2
2       Rates      USD     t3
3       Rates              t3o
4       Bonds              t4o
5       Bonds      AAA     t4

假设我们有以下 df:

df =
asset_class currency
0      Equity      EUR
1      Equity      USD
2      Equity      GBP
3       Rates      EUR
4       Rates      USD
5       Rates      GBP
6       Bonds      AAA
7       Bonds      BBB
8       Bonds      CCC

在这种情况下,所需的结果应该是:

asset_class currency target
0      Equity      EUR     t1   (we have Equity+EUR)
1      Equity      USD    NaN   (we don't have Equity+USD and also not Equity)
2      Equity      GBP    NaN   (we don't have Equity+GBP and also not Equity)
3       Rates      EUR    t3o   (we don't have Rates+EUR, but we do have Rates)
4       Rates      USD     t3   (we have Rates+USD)
5       Rates      GBP    t30   (we don't have Rates+GBP, but we do have Rates)
6       Bonds      AAA     t4   (we have Bonds+AA)
7       Bonds      BBB    t4o   (we don't have Bonds+BBB, but we do have Bonds)
8       Bonds      CCC    t4o   (we don't have Bonds+CCC, but we do have Bonds)

简单地在asset_class和货币上应用左合并是行不通的,因为两个标识符列中的一个具有值的情况,它将被忽略:

df_m = df.merge(maptable, how='left', on=['asset_class','currency'])

另外,非常重要的一点是,如果我们使用更多的标识符列,我们需要在目标列已经映射的情况下进行覆盖。例如,使用"asset_class"和"货币"比仅仅映射"asset_class"具有更高的优先级。出于这个原因,fillna将无法工作,因为我们实际上需要一个update

如何以有效的方式实现这一目标?


示例数据

您可以重新创建上面的示例,如下所示:

import pandas as pd
maptable = pd.DataFrame({
'asset_class': ['Equity', 'FX',   'Rates', 'Rates', 'Bonds', 'Bonds'],
'currency':    ['EUR',    'EUR',  'USD',   '',      '',      'AAA'],
'target':      ['t1',     't2',   't3',    't3o',    't4o',    't4']
})
df = pd.DataFrame({
'asset_class': ['Equity', 'Equity', 'Equity', 'Rates', 'Rates', 'Rates', 'Bonds', 'Bonds', 'Bonds'],
'currency':    ['EUR', 'USD', 'GBP', 'EUR', 'USD', 'GBP', 'AAA', 'BBB', 'CCC'],
})

到目前为止我尝试过什么

这是我到目前为止尝试过的(但这真的很简陋):

def merge_mix(dl, dr, target_cols, id_cols):
"""Apply a merge left with a mixed number of identifiers
:param dl:  target DataFrame on which we want to map the target_cols, contains id_cols but might also
contain target_cols. If non-NA matching target values are found in dr, it will overwrite the values for the
index/col combinations
:param dr:  mapping DataFrame that contains both target_cols and id_cols
:param target_cols: list of column names that we want to map from the dr
:param id_cols: list of columns that we want to use as identifier, can be empty
"""
def is_empty(x):
"""Check if empty"""
if x is not None:
if isinstance(x, str) and x != '':
return False
else:
if not pd.np.isnan(value):
return False
return True
# Append target col
for target_col in target_cols:
if target_col not in dl:
dl.insert(loc=len(dl.columns), column=target_col, value=None)
# Clean dr
dr = dr[id_cols + target_cols]
dr = dr.drop_duplicates(keep='last')
# Loop over all the indices and check which combinations exists
for index in dr.index:
combo_cols = []
for col in id_cols:
value = dr.loc[index, col]
# Add combination if value is not empty
if not is_empty(value):
combo_cols.append(col)
# The combination for this index
dr.loc[index, 'combo_cols'] = "+".join(combo_cols)
dr.loc[index, 'combo_count'] = len(combo_cols)
# Get the unique combo cols combinations. Take first the least granular and then work towards more granular
# as we are working with .update and not with .merge
combos_count = list(dr['combo_count'].unique())  # Unique list
combos_count = [x for x in combos_count if x > 0]  # Take out zero count combo cols
combos_count.sort(reverse=False)  # Sort to move the least granular first
for count in combos_count:
# For a given count, check all combo combinations with this count
dr_cc = dr[dr['combo_count'] == count]
unique_combo_cols_cc = list(dr_cc['combo_cols'].unique())
for combo_col in unique_combo_cols_cc:
# Maptable for given combo col
dr_uc_cc = dr_cc[dr_cc['combo_cols'] == combo_col]
dr_uc_cc = dr_uc_cc.drop_duplicates(keep='last')
# Set index on the id cols for this combo combination
id_cols_uc_cc = combo_col.split('+')
dl = dl.set_index(id_cols_uc_cc)
dr_uc_cc = dr_uc_cc.set_index(id_cols_uc_cc)
# Update matching row, cols
dl.update(dr_uc_cc[target_cols])
dl = dl.reset_index()
return dl

创建一个自定义函数,如果由两个部分组成的元组不存在,则默认只检查元组的第一个组件。

mapdict = {
tuple(filter(pd.notna, (a, c))): t
for a, c, t in maptable.itertuples(index=False)
}
def get(x):
return mapdict.get(x, mapdict.get((x[0], ''), mapdict.get(x[:1])))
list_of_cols = ['asset_class', 'currency']
df.assign(target=[*map(get, zip(*map(df.get, list_of_cols)))])
asset_class currency target
0      Equity      EUR     t1
1      Equity      USD   None
2      Equity      GBP   None
3       Rates      EUR    t3o
4       Rates      USD     t3
5       Rates      GBP    t3o
6       Bonds      AAA     t4
7       Bonds      BBB    t4o
8       Bonds      CCC    t4o

使用pd.merge合并"asset_class", "currency"列上的两个数据帧,获取df_m.

df_m = pd.merge(df, maptable, on=["asset_class", "currency"], how="left")
# df_m
asset_class currency target
0      Equity      EUR     t1
1      Equity      USD    NaN
2      Equity      GBP    NaN
3       Rates      EUR    NaN
4       Rates      USD     t3
5       Rates      GBP    NaN
6       Bonds      AAA     t4
7       Bonds      BBB    NaN
8       Bonds      CCC    NaN

然后从df数据帧中获取mappings字典,该数据帧对应于行,其中货币值'',此字典中的键来自asset_class列,值来自target列。

mappings = maptable[maptable["currency"].eq('')].set_index("asset_class")["target"].to_dict()
# mappings
{'Rates': 't3o', 'Bonds': 't4o'}

现在,从nancoltarget值的df_m中筛选asset_classcol,并使用mappings字典映射此 col 在上一步中获得以创建新的系列s

s = df_m.loc[df_m["target"].isna(), "asset_class"].map(mappings)
# s
1    NaN
2    NaN
3    t3o
5    t3o
7    t4o
8    t4o

然后使用.fillna函数使用系列s填充targetdf_mnan值。


用:

df_m = pd.merge(df, maptable, on=["asset_class", "currency"], how="left")
# {'Bonds': 't4o', 'Rates': 't3o'}
mappings = maptable[maptable["currency"].eq('')].set_index("asset_class")["target"].to_dict()
s = df_m.loc[df_m["target"].isna(), "asset_class"].map(mappings)
df_m["target"] = df_m["target"].fillna(s)
print(df_m)

这将打印:

asset_class currency target
0      Equity      EUR     t1
1      Equity      USD    NaN
2      Equity      GBP    NaN
3       Rates      EUR    t3o
4       Rates      USD     t3
5       Rates      GBP    t3o
6       Bonds      AAA     t4
7       Bonds      BBB    t4o
8       Bonds      CCC    t4o

另一种方法是首先检查maptable中是否存在特定的asset_classcurrency对,用默认值('')填充缺失的,然后合并:

keys = ['asset_class', 'currency']
df_m = df.assign(currency= 
np.where(df.set_index(keys).index.isin(maptable.set_index(keys).index), df['currency'], '') 
).merge(maptable, on=keys, how='left').assign(currency=df['currency'])

结果:

asset_class currency target
0      Equity      EUR     t1
1      Equity      USD    NaN
2      Equity      GBP    NaN
3       Rates      EUR    t3o
4       Rates      USD     t3
5       Rates      GBP    t3o
6       Bonds      AAA     t4
7       Bonds      BBB    t4o
8       Bonds      CCC    t4o

您可以先从合并的数据开始:

merged = df.merge(maptable, how='left', on=['asset_class','currency'])

这将为您提供第一层:

asset_class currency target
0      Equity      EUR     t1
1      Equity      USD    NaN
2      Equity      GBP    NaN
3       Rates      EUR    NaN
4       Rates      USD     t3
5       Rates      GBP    NaN
6       Bonds      AAA     t4
7       Bonds      BBB    NaN
8       Bonds      CCC    NaN

然后从另一个合并中执行fillna,仅匹配currency的默认值''

merged['target'].fillna(df.assign(currency='').merge(maptable, on=['asset_class','currency'], how='left')['target'], inplace=True)

这将为您提供结果:

>>> merged
asset_class currency target
0      Equity      EUR     t1
1      Equity      USD    NaN
2      Equity      GBP    NaN
3       Rates      EUR    t3o
4       Rates      USD     t3
5       Rates      GBP    t3o
6       Bonds      AAA     t4
7       Bonds      BBB    t4o
8       Bonds      CCC    t4o

不用说,根据您的回退值,您需要相应地更新''。 如果是NaN,请使用maptable['currency'].isna()

一个衬里是:

df_m = df.assign(target=
df.merge(maptable, on=['asset_class','currency'], how='left')['target'].fillna( 
df.assign(currency='').merge(maptable, on=['asset_class','currency'], how='left')['target']))

最新更新