Stratifying a pandas column based on a unique id and range cutoffs



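I have a df that breaks income levels down by sex (male and female) across thousands of zip codes. I need to add a column to df2 that maps each person's income level (average, above average, etc.) by zip code.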

The idea is to assign the level of the highest cutoff that a person's income exceeds, or the lowest level by default.

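The income cutoff for each level also varies by zip code. Some zip codes have a limited number of levels (for example, no very high income). Males also have their own levels per zip code, which I have left out for space.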

I think I need to create some kind of dictionary, but I'm not sure how to go about it. Any help would be greatly appreciated, thanks.

Edit: The first df acts as a key, and I want to use it to assign the corresponding row values from its 'Income level' column to df2.

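For example, for a unique id in df2, compare df2['Annual Income'] against df['Annual Income cutoff'] for the matching id. Then assign the highest income level in df whose cutoff the income reaches as the new row value in df2.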

import pandas as pd
import numpy as np
data = [['female',10009,'very high',10000000],['female',10009,'high',100000],['female',10009,'above average',75000],['female', 10009, 'average', 50000]]
df = pd.DataFrame(data, columns = ['Sex', 'Area Code', 'Income level', 'Annual Income cutoff'])
print(df)
      Sex  Area Code   Income level  Annual Income cutoff
0  female      10009      very high              10000000
1  female      10009           high                100000
2  female      10009  above average                75000
3  female      10009        average                 50000
data_2 = [['female',10009, 98000], ['female', 10009, 56000]]
df2 = pd.DataFrame(data_2, columns = ['Sex', 'Area Code', 'Annual Income'])
print(df2)
      Sex  Area Code  Annual Income
0  female      10009          98000
1  female      10009          56000
output_data = [['female',10009, 98000, 'above average'], ['female', 10009, 56000, 'average']]
final_output = pd.DataFrame(output_data, columns = ['Sex', 'Area Code', 'Annual Income', 'Income Level'])
print(final_output)
      Sex  Area Code  Annual Income   Income Level
0  female      10009          98000  above average
1  female      10009          56000        average

One approach is to use pd.merge_asof:

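# merge_asof requires both frames to be sorted on their "on" keys.
# direction='backward' picks, within each (Sex, Area Code) group,
# the largest cutoff that is <= the person's income.
pd.merge_asof(df2.sort_values('Annual Income'),
              df.sort_values('Annual Income cutoff'),
              left_on='Annual Income',
              right_on='Annual Income cutoff',
              by=['Sex', 'Area Code'],
              direction='backward')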

Output:

      Sex  Area Code  Annual Income   Income level  Annual Income cutoff
0  female      10009          56000        average                 50000
1  female      10009          98000  above average                 75000
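
Two details from the question are not covered by the call above: an income below every cutoff gets no match (NaN), while the question asks for the lowest level by default, and the result comes back sorted by income rather than in df2's original order. The following is a rough sketch of one way to handle both, not the only option. The 30000 row and the helper names (`keys`, `left`, `right`, `lowest`, the 'Lowest level' column) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [['female', 10009, 'very high', 10000000],
     ['female', 10009, 'high', 100000],
     ['female', 10009, 'above average', 75000],
     ['female', 10009, 'average', 50000]],
    columns=['Sex', 'Area Code', 'Income level', 'Annual Income cutoff'])

# The 30000 row is an invented example that falls below every cutoff.
df2 = pd.DataFrame(
    [['female', 10009, 98000],
     ['female', 10009, 56000],
     ['female', 10009, 30000]],
    columns=['Sex', 'Area Code', 'Annual Income'])

keys = ['Sex', 'Area Code']

# reset_index() keeps df2's original row position in an 'index' column
# so the original order can be restored after sorting for merge_asof.
left = df2.reset_index().sort_values('Annual Income')
right = df.sort_values('Annual Income cutoff')

merged = pd.merge_asof(left, right,
                       left_on='Annual Income',
                       right_on='Annual Income cutoff',
                       by=keys, direction='backward')

# right is sorted ascending by cutoff, so the first row per group
# is that group's lowest level; use it as the fallback.
lowest = (right.drop_duplicates(keys)[keys + ['Income level']]
          .rename(columns={'Income level': 'Lowest level'}))
merged = merged.merge(lowest, on=keys, how='left')
merged['Income level'] = merged['Income level'].fillna(merged['Lowest level'])

# Restore df2's row order and drop the helper columns.
result = (merged.set_index('index').sort_index()
          .drop(columns=['Annual Income cutoff', 'Lowest level']))
result.index.name = None
print(result)
```

Here the 98000 row maps to 'above average', the 56000 row to 'average', and the 30000 row falls back to 'average', the lowest level defined for that Sex and Area Code.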
