pandas:使用map()根据字典中调用的范围赋值



我尝试分配值(取自数据帧(,这些值需要2个标识符列才能在单独的df中分配值。我尝试了我能想到的一切,现在最接近的是将我的2个标识符打包到一个元组中,并将它们的值与dict一起打包,但我的语法可能不正确(或者我根本不能这样做(。对可能存在的任何其他方式都持开放态度(我不接受元组的想法(。谢谢大家。

import pandas as pd
# look up table for the 'condition'. Every 'cell' has 1 'baseline', 'washon', 'washoff' each 
# or only one 'baseline'
lut_cols = ['filename', 'condition', 'start', 'end']
lut_vals = [['cell1', 'baseline', 1, 24],
['cell1', 'washon', 50, 70],
['cell1', 'washoff', 100, 120],
['cell2', 'baseline', 2, 22],
['cell2', 'washon', 50, 70],
['cell2', 'washoff', 100, 120],
['cell3', 'baseline', 1, 20]]
lut_df = pd.DataFrame(lut_vals, columns=lut_cols)
# the data that needs a 'condition' column mapped based on the 'filename' and 'record' values that
# need to be in the range 'start' to 'end' given in lut_df
data_cols = ['filename', 'record', 'data']
data_vals = [['cell1', 1, 'some_data0'],
['cell1', 1, 'some_data1'],
['cell1', 1, 'some_data2'],
['cell1', 25, 'some_data3'],
['cell1', 25, 'some_data4'],
['cell1', 101, 'some_data5'],
['cell2', 2, 'some_data6'],
['cell2', 2, 'some_data7'],
['cell2', 50, 'some_data8'],
['cell2', 50, 'some_dat9'],
['cell2', 80, 'some_dat10']]

df = pd.DataFrame(data_vals, columns=data_cols)
# the 'filename' and 'start' to 'end' together make up the unique identifiers keys (as tuple)
key = ()
lut_dict = {}
for i in range(lut_df.filename.shape[0]):
key = (lut_df.filename.iloc[i], range(lut_df.start.iloc[i], lut_df.end.iloc[i]))
lut_dict[key] =  lut_df.condition[i]
# the 'record' column is now the index because I thought it would make life easier...
df.set_index('record', inplace=True)
df['condition'] = df['filename'].map(lut_dict)
print(df)

lut_dict供参考:

lut_dict:
{('cell1', range(1, 24)): 'baseline',
('cell1', range(50, 70)): 'washon',
('cell1', range(100, 120)): 'washoff',
('cell2', range(2, 22)): 'baseline',
('cell2', range(50, 70)): 'washon',
('cell2', range(100, 120)): 'washoff',
('cell3', range(1, 20)): 'baseline'}

遗憾的是,这就是输出:

filename    data    condition
record          
1   cell1   some_data0  NaN
1   cell1   some_data1  NaN
1   cell1   some_data2  NaN
25  cell1   some_data3  NaN
25  cell1   some_data4  NaN
101 cell1   some_data5  NaN
2   cell2   some_data6  NaN
2   cell2   some_data7  NaN
50  cell2   some_data8  NaN
50  cell2   some_dat9   NaN
80  cell2   some_dat10  NaN

lut_dict创建一个数据帧,然后将其合并到数据中:

df1 = pd.Series(lut_dict).rename_axis(['filename', 'record']).rename('condition') 
.reset_index().explode('record')
out = df.merge(df1, on=['filename', 'record'], how='left')

输出:

>>> out
filename record        data condition
0     cell1      1  some_data0  baseline
1     cell1      1  some_data1  baseline
2     cell1      1  some_data2  baseline
3     cell1     25  some_data3       NaN
4     cell1     25  some_data4       NaN
5     cell1    101  some_data5   washoff
6     cell2      2  some_data6  baseline
7     cell2      2  some_data7  baseline
8     cell2     50  some_data8    washon
9     cell2     50   some_dat9    washon
10    cell2     80  some_dat10       NaN
>>> df1
filename record condition
0     cell1      1  baseline
0     cell1      2  baseline
0     cell1      3  baseline
0     cell1      4  baseline
0     cell1      5  baseline
..      ...    ...       ...
6     cell3     15  baseline
6     cell3     16  baseline
6     cell3     17  baseline
6     cell3     18  baseline
6     cell3     19  baseline
[142 rows x 3 columns]

您可以使用merge_asof。由于两个数据帧都必须按关键字排序,因此我们首先按要合并的关键字对每个数据帧进行排序。

df = df.sort_values(by='record')
lut_df = lut_df.sort_values(by='start')
out = pd.merge_asof(df, lut_df, left_on='record', right_on='start', suffixes=('','_'))[['filename','data', 'condition']]

输出:

filename        data condition
0     cell1  some_data0  baseline
1     cell1  some_data1  baseline
2     cell1  some_data2  baseline
3     cell2  some_data6  baseline
4     cell2  some_data7  baseline
5     cell1  some_data3  baseline
6     cell1  some_data4  baseline
7     cell2  some_data8    washon
8     cell2   some_dat9    washon
9     cell2  some_dat10    washon
10    cell1  some_data5   washoff

最新更新