我想根据另外 2 个的值(列val1_df
和行的val2_df
)索引一个DataFrame
(aaxx_df
)。
>>> val1_df = pd.DataFrame({"A": ["aa", "bb"], "B": ["bb", "aa"]})
A B
0 aa bb
1 bb aa
>>> val2_df = pd.DataFrame({"A": ["xx", "yy"], "B": ["zz", "yy"]})
A B
0 xx zz
1 yy yy
>>> aaxx_df = pd.DataFrame(
... {"aa": [10, 20, 30], "bb": [-10, -20, -30]},
... index=["xx", "yy", "zz"]
... )
aa bb
xx 10 -10
yy 20 -20
zz 30 -30
>>> comb(val1_df, val2_df, aaxx_df)
A B
0 10.0 -30.0
1 -20.0 20.0
我在下面提出了一个适合我的问题的解决方案,但我想,一定有一些更干净的解决方案,可能通过 SQL(在我看来,它与关系数据库问题非常相似)。
def comb(df1, df2, link_df):
comb_df = df1.copy()
for val in link_df:
tmp_df = df2.copy()
tmp_df = tmp_df.applymap(lambda x: link_df.loc[x, val])
comb_df = comb_df.replace({val: np.nan}).combine_first(tmp_df)
return comb_df
如果要使用val2_df(行)和val1_df(列)中的值为aaxx_df编制索引,请使用查找:
vals = aaxx_df.lookup(val2_df.values.flatten(), val1_df.values.flatten())
res = pd.DataFrame(data=vals.reshape(val1_df.shape), columns=val1_df.columns)
print(res)
输出
A B
0 10 -30
1 -20 20
自 2020 年 12 月 26 日起,在 pandas 1.2 中,不推荐使用lookup
方法。这是一个作为函数的最小版本:
def lookup(df, row_labels, col_labels) -> np.ndarray:
"""
Label-based "fancy indexing" function for DataFrame.
Given equal-length arrays of row and column labels, return an
array of the values corresponding to each (row, col) pair.
Parameters
----------
row_labels : sequence
The row labels to use for lookup.
col_labels : sequence
The column labels to use for lookup.
Returns
-------
numpy.ndarray
The found values.
"""
row_idx = df.index.get_indexer(row_labels)
col_idx = df.columns.get_indexer(col_labels)
flat_index = row_idx * len(df.columns) + col_idx
return df.values.flat[flat_index]
代码变为:
vals = lookup(aaxx_df, val2_df.values.flatten(), val1_df.values.flatten())
res = pd.DataFrame(data=vals.reshape(val1_df.shape), columns=val1_df.columns)
print(res)
让我们做一行
out = (
pd.concat([val1_df, val2_df])
.groupby(level=0)
.agg(tuple)
.applymap(lambda x: aaxx_df.loc[x[1], x[0]])
)
A B
0 10 -30
1 -20 20
更新
s = aaxx_df.stack().reindex(
pd.MultiIndex.from_arrays([val2_df.stack(), val1_df.stack()])
)
val1_df[:] = s.values.reshape(val1_df.shape)
A B
0 10 -30
1 -20 20
此代码不太简洁,但创建字典并直接替换字典值可能比使用lookup
更快。此外,您可以使用行 + 列组合的唯一字符串而不是元组来简化代码:
a = val2_df.values + '|' + val1_df.values.tolist()
rows = np.repeat(aaxx_df.index, len(aaxx_df.columns)).tolist()
cols = np.tile(aaxx_df.columns, len(aaxx_df)).tolist()
vals = aaxx_df.values.ravel().tolist() # .tolist() makes code fasterfor dictionary conversion from numpy array
b = dict(zip(f'{rows}|{cols}', vals)) #this line of code is the bottleneck on this small dataframe, but it may have a high fixed cost and lower variable cost than lookup.
pd.DataFrame(a, columns=val1_df.columns).replace(b)
A B
0 10 -30
1 -20 20
对于创建"b"的代码,请参阅Jezrael的答案,以获取类似内容:https://stackoverflow.com/a/50505726/6366770