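I have a df that breaks income down by male and female across thousands of zip codes. I need to add a column to df2 that maps each person's income level by zip code (average, above average, etc.).
The idea is to assign the level of the highest cutoff that a person's income exceeds, or by default the lowest level.
The cutoff for each income level also varies by zip code. Some zip codes have a limited number of levels (for example, no very high income). Males also have different levels by zip code, which are not shown here for space reasons.
I think I need to create some dictionaries, but I'm not sure how to handle this. Any help would go a long way, thanks.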
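**Edit:** The first df acts as a key, and I want to use it to assign the corresponding row value from its 'Income level' column to df2.
For example, for a unique id in df2, compare df2['Annual Income'] against df['Annual Income cutoff'] for the matching id. Then assign the highest possible income level from df as the new row value in df2.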
import pandas as pd
import numpy as np
data = [['female',10009,'very high',10000000],['female',10009,'high',100000],['female',10009,'above average',75000],['female', 10009, 'average', 50000]]
df = pd.DataFrame(data, columns = ['Sex', 'Area Code', 'Income level', 'Annual Income cutoff'])
print(df)
      Sex  Area Code   Income level  Annual Income cutoff
0  female      10009      very high              10000000
1  female      10009           high                100000
2  female      10009  above average                 75000
3  female      10009        average                 50000
data_2 = [['female',10009, 98000], ['female', 10009, 56000]]
df2 = pd.DataFrame(data_2, columns = ['Sex', 'Area Code', 'Annual Income'])
print(df2)
      Sex  Area Code  Annual Income
0  female      10009          98000
1  female      10009          56000
output_data = [['female',10009, 98000, 'above average'], ['female', 10009, 56000, 'average']]
final_output = pd.DataFrame(output_data, columns = ['Sex', 'Area Code', 'Annual Income', 'Income Level'])
print(final_output)
      Sex  Area Code  Annual Income   Income Level
0  female      10009          98000  above average
1  female      10009          56000        average
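One approach is to use pd.merge_asof. With direction='backward', each income is matched to the largest cutoff that does not exceed it, within the same Sex and Area Code: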
pd.merge_asof(df2.sort_values('Annual Income'),
              df.sort_values('Annual Income cutoff'),
              left_on='Annual Income',
              right_on='Annual Income cutoff',
              by=['Sex', 'Area Code'], direction='backward')
Output:
      Sex  Area Code  Annual Income   Income level  Annual Income cutoff
0  female      10009          56000        average                 50000
1  female      10009          98000  above average                 75000
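Two details from the question are not covered by the call above: an income below every cutoff comes back as NaN instead of defaulting to the lowest level, and the rows come back sorted by income instead of in df2's original order. A minimal sketch of one way to handle both, assuming the lowest level in each Sex and Area Code group is the one with the smallest cutoff. The helper name `assign_income_level` and the 30000 row are made up for illustration and are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame(
    [['female', 10009, 'very high', 10000000],
     ['female', 10009, 'high', 100000],
     ['female', 10009, 'above average', 75000],
     ['female', 10009, 'average', 50000]],
    columns=['Sex', 'Area Code', 'Income level', 'Annual Income cutoff'])

df2 = pd.DataFrame(
    [['female', 10009, 98000],
     ['female', 10009, 56000],
     ['female', 10009, 30000]],  # hypothetical row below every cutoff
    columns=['Sex', 'Area Code', 'Annual Income'])


def assign_income_level(people, cutoffs):
    """Hypothetical helper: add 'Income level' to people, keeping row order."""
    keys = ['Sex', 'Area Code']

    # merge_asof needs both frames sorted on their merge keys.
    # reset_index() keeps the original position in an 'index' column.
    left = people.reset_index().sort_values('Annual Income')
    right = cutoffs.sort_values('Annual Income cutoff')

    merged = pd.merge_asof(left, right,
                           left_on='Annual Income',
                           right_on='Annual Income cutoff',
                           by=keys, direction='backward')

    # Assumption: the fallback level is the one with the smallest cutoff
    # in each group (right is already sorted by cutoff, so keep='first').
    lowest = (right.drop_duplicates(keys, keep='first')
                   .rename(columns={'Income level': 'lowest level'})
                   [keys + ['lowest level']])
    merged = merged.merge(lowest, on=keys, how='left')
    merged['Income level'] = merged['Income level'].fillna(merged['lowest level'])

    # Restore the original row order and drop the helper columns.
    merged = merged.set_index('index').sort_index().rename_axis(None)
    return merged[[*people.columns, 'Income level']]


result = assign_income_level(df2, df)
print(result)
```

If a Sex and Area Code pair in df2 has no rows at all in df, its income level stays NaN, since there is no lowest level to fall back on.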