根据其他 2 个数据帧的值为数据帧编制索引的更好解决方案



我想根据另外 2 个的值(列val1_df和行的val2_df)索引一个DataFrame(aaxx_df)。

>>> val1_df = pd.DataFrame({"A": ["aa", "bb"], "B": ["bb", "aa"]})
A   B
0  aa  bb
1  bb  aa
>>> val2_df = pd.DataFrame({"A": ["xx", "yy"], "B": ["zz", "yy"]})
A   B
0  xx  zz
1  yy  yy
>>> aaxx_df = pd.DataFrame(
...     {"aa": [10, 20, 30], "bb": [-10, -20, -30]}, 
...     index=["xx", "yy", "zz"]
... )
aa  bb
xx  10 -10
yy  20 -20
zz  30 -30
>>> comb(val1_df, val2_df, aaxx_df)
A     B
0  10.0 -30.0
1 -20.0  20.0

我在下面提出了一个适合我的问题的解决方案,但我想,一定有一些更干净的解决方案,可能通过 SQL(在我看来,它与关系数据库问题非常相似)。

def comb(df1, df2, link_df):
comb_df = df1.copy()

for val in link_df:
tmp_df = df2.copy()
tmp_df = tmp_df.applymap(lambda x: link_df.loc[x, val])
comb_df = comb_df.replace({val: np.nan}).combine_first(tmp_df)
return comb_df

如果要使用val2_df(行)和val1_df(列)中的值为aaxx_df编制索引,请使用查找:

vals = aaxx_df.lookup(val2_df.values.flatten(), val1_df.values.flatten())

res = pd.DataFrame(data=vals.reshape(val1_df.shape), columns=val1_df.columns)
print(res)

输出

A   B
0  10 -30
1 -20  20

自 2020 年 12 月 26 日起,在 pandas 1.2 中,不推荐使用lookup方法。这是一个作为函数的最小版本:

def lookup(df, row_labels, col_labels) -> np.ndarray:
"""
Label-based "fancy indexing" function for DataFrame.
Given equal-length arrays of row and column labels, return an
array of the values corresponding to each (row, col) pair.
Parameters
----------
row_labels : sequence
The row labels to use for lookup.
col_labels : sequence
The column labels to use for lookup.
Returns
-------
numpy.ndarray
The found values.
"""
row_idx = df.index.get_indexer(row_labels)
col_idx = df.columns.get_indexer(col_labels)
flat_index = row_idx * len(df.columns) + col_idx
return df.values.flat[flat_index]

代码变为:

vals = lookup(aaxx_df, val2_df.values.flatten(), val1_df.values.flatten())

res = pd.DataFrame(data=vals.reshape(val1_df.shape), columns=val1_df.columns)
print(res)

让我们做一行

out = (
pd.concat([val1_df, val2_df])
.groupby(level=0)
.agg(tuple)
.applymap(lambda x: aaxx_df.loc[x[1], x[0]])
)
A   B
0  10 -30
1 -20  20

更新

s = aaxx_df.stack().reindex(
pd.MultiIndex.from_arrays([val2_df.stack(), val1_df.stack()])
)
val1_df[:] = s.values.reshape(val1_df.shape)
A   B
0  10 -30
1 -20  20

此代码不太简洁,但创建字典并直接替换字典值可能比使用lookup更快。此外,您可以使用行 + 列组合的唯一字符串而不是元组来简化代码:

a = val2_df.values + '|' + val1_df.values.tolist()
rows = np.repeat(aaxx_df.index, len(aaxx_df.columns)).tolist()
cols = np.tile(aaxx_df.columns, len(aaxx_df)).tolist()
vals = aaxx_df.values.ravel().tolist()  # .tolist() makes code fasterfor dictionary conversion from numpy array
b = dict(zip(f'{rows}|{cols}', vals)) #this line of code is the bottleneck on this small dataframe, but it may have a high fixed cost and lower variable cost than lookup.
pd.DataFrame(a, columns=val1_df.columns).replace(b)
A   B
0  10 -30
1 -20  20

对于创建"b"的代码,请参阅Jezrael的答案,以获取类似内容:https://stackoverflow.com/a/50505726/6366770

最新更新