我有一个映射表和一个df
,我想在其中应用左合并以在一组或多列上映射附加列。但是,在我的情况下,可用的标识符每行都不同。
下面是一个示例:
maptable =
asset_class currency target
0 Equity EUR t1
1 FX EUR t2
2 Rates USD t3
3 Rates t3o
4 Bonds t4o
5 Bonds AAA t4
假设我们有以下 df:
df =
asset_class currency
0 Equity EUR
1 Equity USD
2 Equity GBP
3 Rates EUR
4 Rates USD
5 Rates GBP
6 Bonds AAA
7 Bonds BBB
8 Bonds CCC
在这种情况下,所需的结果应该是:
asset_class currency target
0 Equity EUR t1 (we have Equity+EUR)
1 Equity USD NaN (we don't have Equity+USD and also not Equity)
2 Equity GBP NaN (we don't have Equity+GBP and also not Equity)
3 Rates EUR t3o (we don't have Rates+EUR, but we do have Rates)
4 Rates USD t3 (we have Rates+USD)
5 Rates GBP t30 (we don't have Rates+GBP, but we do have Rates)
6 Bonds AAA t4 (we have Bonds+AA)
7 Bonds BBB t4o (we don't have Bonds+BBB, but we do have Bonds)
8 Bonds CCC t4o (we don't have Bonds+CCC, but we do have Bonds)
简单地在asset_class和货币上应用左合并是行不通的,因为两个标识符列中的一个具有值的情况,它将被忽略:
df_m = df.merge(maptable, how='left', on=['asset_class','currency'])
另外,非常重要的一点是,如果我们使用更多的标识符列,我们需要在目标列已经映射的情况下进行覆盖。例如,使用"asset_class"和"货币"比仅仅映射"asset_class"具有更高的优先级。出于这个原因,fillna
将无法工作,因为我们实际上需要一个update
。
如何以有效的方式实现这一目标?
示例数据
您可以重新创建上面的示例,如下所示:
import pandas as pd
maptable = pd.DataFrame({
'asset_class': ['Equity', 'FX', 'Rates', 'Rates', 'Bonds', 'Bonds'],
'currency': ['EUR', 'EUR', 'USD', '', '', 'AAA'],
'target': ['t1', 't2', 't3', 't3o', 't4o', 't4']
})
df = pd.DataFrame({
'asset_class': ['Equity', 'Equity', 'Equity', 'Rates', 'Rates', 'Rates', 'Bonds', 'Bonds', 'Bonds'],
'currency': ['EUR', 'USD', 'GBP', 'EUR', 'USD', 'GBP', 'AAA', 'BBB', 'CCC'],
})
到目前为止我尝试过什么
这是我到目前为止尝试过的(但这真的很简陋):
def merge_mix(dl, dr, target_cols, id_cols):
"""Apply a merge left with a mixed number of identifiers
:param dl: target DataFrame on which we want to map the target_cols, contains id_cols but might also
contain target_cols. If non-NA matching target values are found in dr, it will overwrite the values for the
index/col combinations
:param dr: mapping DataFrame that contains both target_cols and id_cols
:param target_cols: list of column names that we want to map from the dr
:param id_cols: list of columns that we want to use as identifier, can be empty
"""
def is_empty(x):
"""Check if empty"""
if x is not None:
if isinstance(x, str) and x != '':
return False
else:
if not pd.np.isnan(value):
return False
return True
# Append target col
for target_col in target_cols:
if target_col not in dl:
dl.insert(loc=len(dl.columns), column=target_col, value=None)
# Clean dr
dr = dr[id_cols + target_cols]
dr = dr.drop_duplicates(keep='last')
# Loop over all the indices and check which combinations exists
for index in dr.index:
combo_cols = []
for col in id_cols:
value = dr.loc[index, col]
# Add combination if value is not empty
if not is_empty(value):
combo_cols.append(col)
# The combination for this index
dr.loc[index, 'combo_cols'] = "+".join(combo_cols)
dr.loc[index, 'combo_count'] = len(combo_cols)
# Get the unique combo cols combinations. Take first the least granular and then work towards more granular
# as we are working with .update and not with .merge
combos_count = list(dr['combo_count'].unique()) # Unique list
combos_count = [x for x in combos_count if x > 0] # Take out zero count combo cols
combos_count.sort(reverse=False) # Sort to move the least granular first
for count in combos_count:
# For a given count, check all combo combinations with this count
dr_cc = dr[dr['combo_count'] == count]
unique_combo_cols_cc = list(dr_cc['combo_cols'].unique())
for combo_col in unique_combo_cols_cc:
# Maptable for given combo col
dr_uc_cc = dr_cc[dr_cc['combo_cols'] == combo_col]
dr_uc_cc = dr_uc_cc.drop_duplicates(keep='last')
# Set index on the id cols for this combo combination
id_cols_uc_cc = combo_col.split('+')
dl = dl.set_index(id_cols_uc_cc)
dr_uc_cc = dr_uc_cc.set_index(id_cols_uc_cc)
# Update matching row, cols
dl.update(dr_uc_cc[target_cols])
dl = dl.reset_index()
return dl
创建一个自定义函数,如果由两个部分组成的元组不存在,则默认只检查元组的第一个组件。
mapdict = {
tuple(filter(pd.notna, (a, c))): t
for a, c, t in maptable.itertuples(index=False)
}
def get(x):
return mapdict.get(x, mapdict.get((x[0], ''), mapdict.get(x[:1])))
list_of_cols = ['asset_class', 'currency']
df.assign(target=[*map(get, zip(*map(df.get, list_of_cols)))])
asset_class currency target
0 Equity EUR t1
1 Equity USD None
2 Equity GBP None
3 Rates EUR t3o
4 Rates USD t3
5 Rates GBP t3o
6 Bonds AAA t4
7 Bonds BBB t4o
8 Bonds CCC t4o
使用pd.merge
合并"asset_class", "currency"
列上的两个数据帧,获取df_m
.
df_m = pd.merge(df, maptable, on=["asset_class", "currency"], how="left")
# df_m
asset_class currency target
0 Equity EUR t1
1 Equity USD NaN
2 Equity GBP NaN
3 Rates EUR NaN
4 Rates USD t3
5 Rates GBP NaN
6 Bonds AAA t4
7 Bonds BBB NaN
8 Bonds CCC NaN
然后从df
数据帧中获取mappings
字典,该数据帧对应于行,其中货币值''
,此字典中的键来自asset_class
列,值来自target
列。
mappings = maptable[maptable["currency"].eq('')].set_index("asset_class")["target"].to_dict()
# mappings
{'Rates': 't3o', 'Bonds': 't4o'}
现在,从nan
coltarget
值的df_m
中筛选asset_class
col,并使用mappings
字典映射此 col 在上一步中获得以创建新的系列s
。
s = df_m.loc[df_m["target"].isna(), "asset_class"].map(mappings)
# s
1 NaN
2 NaN
3 t3o
5 t3o
7 t4o
8 t4o
然后使用.fillna
函数使用系列s
填充target
列df_m
的nan
值。
用:
df_m = pd.merge(df, maptable, on=["asset_class", "currency"], how="left")
# {'Bonds': 't4o', 'Rates': 't3o'}
mappings = maptable[maptable["currency"].eq('')].set_index("asset_class")["target"].to_dict()
s = df_m.loc[df_m["target"].isna(), "asset_class"].map(mappings)
df_m["target"] = df_m["target"].fillna(s)
print(df_m)
这将打印:
asset_class currency target
0 Equity EUR t1
1 Equity USD NaN
2 Equity GBP NaN
3 Rates EUR t3o
4 Rates USD t3
5 Rates GBP t3o
6 Bonds AAA t4
7 Bonds BBB t4o
8 Bonds CCC t4o
另一种方法是首先检查maptable
中是否存在特定的asset_class
和currency
对,用默认值(''
)填充缺失的,然后合并:
keys = ['asset_class', 'currency']
df_m = df.assign(currency=
np.where(df.set_index(keys).index.isin(maptable.set_index(keys).index), df['currency'], '')
).merge(maptable, on=keys, how='left').assign(currency=df['currency'])
结果:
asset_class currency target
0 Equity EUR t1
1 Equity USD NaN
2 Equity GBP NaN
3 Rates EUR t3o
4 Rates USD t3
5 Rates GBP t3o
6 Bonds AAA t4
7 Bonds BBB t4o
8 Bonds CCC t4o
您可以先从合并的数据开始:
merged = df.merge(maptable, how='left', on=['asset_class','currency'])
这将为您提供第一层:
asset_class currency target
0 Equity EUR t1
1 Equity USD NaN
2 Equity GBP NaN
3 Rates EUR NaN
4 Rates USD t3
5 Rates GBP NaN
6 Bonds AAA t4
7 Bonds BBB NaN
8 Bonds CCC NaN
然后从另一个合并中执行fillna
,仅匹配currency
的默认值''
:
merged['target'].fillna(df.assign(currency='').merge(maptable, on=['asset_class','currency'], how='left')['target'], inplace=True)
这将为您提供结果:
>>> merged
asset_class currency target
0 Equity EUR t1
1 Equity USD NaN
2 Equity GBP NaN
3 Rates EUR t3o
4 Rates USD t3
5 Rates GBP t3o
6 Bonds AAA t4
7 Bonds BBB t4o
8 Bonds CCC t4o
不用说,根据您的回退值,您需要相应地更新''
。 如果是NaN
,请使用maptable['currency'].isna()
。
一个衬里是:
df_m = df.assign(target=
df.merge(maptable, on=['asset_class','currency'], how='left')['target'].fillna(
df.assign(currency='').merge(maptable, on=['asset_class','currency'], how='left')['target']))