How do I approach this problem: correcting sampling bias by adding weights



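Suppose I have a dataset (a sample, or data from a survey) containing 400,000 person ids, each with demographic categories (age, ethnicity and education level). The first rows (ids 0 to 30):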


person id,age,education,ethnicity
0,75_84,Some College,white
1,85_120,HS Diploma,white
2,25_34,Some College,white
3,55_64,HS Diploma,black
4,45_54,Bachelor Degree,white
5,25_34,HS Diploma,white
6,55_64,Some College,white
7,45_54,HS Diploma,white
8,18_24,Some College,white
9,75_84,Some College,white
10,45_54,HS Diploma,black
11,55_64,Some College,white
12,55_64,Graduate Degree,white
13,55_64,Graduate Degree,black
14,18_24,Some College,white
15,25_34,Some College,white
16,25_34,Some College,white
17,45_54,HS Diploma,white
18,65_74,,white
19,55_64,HS Diploma,black
20,55_64,HS Diploma,black
21,55_64,HS Diploma,black
22,35_44,Some College,white
23,35_44,Some College,white
24,35_44,Some College,white
25,18_24,Some College,black
26,55_64,Some College,white
27,55_64,Some College,white
28,55_64,Bachelor Degree,white
29,55_64,Bachelor Degree,white
30,25_34,Bachelor Degree,white

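Using Python, how can I compute a set of individual-level weights (one single weight per person) that removes the bias from the dataset? Within each category, the weights should sum to the count given for that category in the demo ground truth dataset.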
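For a single dimension the requirement has a closed-form answer: every person in a category gets that category's population total divided by the number of sampled people in it. The sketch below illustrates this with the ages of ids 0 to 5 from the sample and the matching age totals from the ground truth table in this question; the variable names are illustrative, not part of the original data.

```python
import pandas as pd

# Ages of ids 0 to 5, copied from the sample rows in the question
sample = pd.DataFrame({"age": ["75_84", "85_120", "25_34", "55_64", "45_54", "25_34"]})

# Population totals for those age groups, copied from the ground truth table
age_totals = {
    "75_84": 5221430,
    "85_120": 2293226,
    "25_34": 16399632,
    "55_64": 15148777,
    "45_54": 16430762,
}

# Number of sampled people per age group
sample_counts = sample["age"].value_counts()

# One weight per person: population total / sample count of that person's group
weights = sample["age"].map(age_totals) / sample["age"].map(sample_counts)
print(weights)
```

The two people in 25_34 each receive half of that group's total. This only balances one dimension at a time; matching age, education and ethnicity simultaneously needs an iterative adjustment.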


demo ground truth dataset:

demographic category,number of individuals
18_24,11839159
25_34,16399632
35_44,15335704
45_54,16430762
55_64,15148777
65_74,9990412
75_84,5221430
0_4,7500407
5_9,7748669
10_14,7815759
15_17,4758751
85_120,2293226
< Than HS Diploma,12274025
Bachelor Degree,16305721
Graduate Degree,9343192
HS Diploma,25799018
Some College,28937146
asian,6145151
black,14626476
hispanic,21953456
islander,190389
white,73838168
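
The ground truth table is flat: age, education and ethnicity categories share a single column. Before any weighting it probably helps to split it into one set of targets per dimension. A minimal sketch, assuming the table is loaded as a DataFrame with the two column names shown above; `split_targets` and `dims` are hypothetical names, and the rows used here are a few copied from the tables in this question.

```python
import pandas as pd

def split_targets(demo, sample, dims=("age", "education", "ethnicity")):
    """Group the flat ground truth table by the sample column each category appears in."""
    lookup = dict(zip(demo["demographic category"], demo["number of individuals"]))
    targets = {}
    for col in dims:
        present = sample[col].dropna().unique()  # categories actually observed
        targets[col] = {cat: int(lookup[cat]) for cat in present if cat in lookup}
    return targets

# A few rows of the ground truth table
demo = pd.DataFrame({
    "demographic category": ["18_24", "25_34", "0_4", "HS Diploma",
                             "Some College", "white", "black"],
    "number of individuals": [11839159, 16399632, 7500407, 25799018,
                              28937146, 73838168, 14626476],
})

# Ids 2, 5, 8 and 25 from the sample
sample = pd.DataFrame({
    "age": ["25_34", "25_34", "18_24", "18_24"],
    "education": ["Some College", "HS Diploma", "Some College", "Some College"],
    "ethnicity": ["white", "white", "white", "black"],
})

targets = split_targets(demo, sample)
print(targets)
```

A category that appears in the ground truth but not in the sample (0_4 in this excerpt) gets no target, because no individual weight could contribute to it.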

import pandas

# df is the individual-level sample with columns: age, education, ethnicity
answer = {'demographic category': [],
          'number of individuals': [],
          }
# Count how many sampled individuals fall into each category of each dimension
for col in ['age', 'education', 'ethnicity']:
    for k in df[col].dropna().unique():
        answer['demographic category'].append(k)
        answer['number of individuals'].append(df[df[col] == k].shape[0])
answer = pandas.DataFrame(answer)
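
One common way to meet several marginal totals at once is raking (iterative proportional fitting): start with equal weights, rescale them so one dimension matches its totals, then the next, and repeat until the weights stop changing. The sketch below is one possible implementation, not a definitive one. `rake_weights`, its parameters and the toy data are hypothetical. People with a missing value in a dimension are left untouched in that dimension's step. This seems to fit the ground truth table, where the education totals add up to 92,659,102 (the same as the age groups from 18_24 upward) and the ethnicity totals add up to 116,753,640, less than the 120,482,688 across all age groups, so presumably not everyone carries every category.

```python
import pandas as pd

def rake_weights(sample, targets, max_iter=1000, tol=1e-10):
    """Raking sketch: rescale weights one dimension at a time until stable.

    sample  : DataFrame, one row per person, one column per dimension
    targets : {column: {category: population total}}
    Returns one weight per person as a Series aligned with sample.index.
    """
    weights = pd.Series(1.0, index=sample.index)
    for _ in range(max_iter):
        max_change = 0.0
        for col, totals in targets.items():
            # Current weighted total per category (missing values are skipped)
            current = weights.groupby(sample[col]).sum()
            # Wanted total / current total; NaN where the category is missing
            # or has no target, in which case the weight is left unchanged
            factor = (sample[col].map(totals) / sample[col].map(current)).fillna(1.0)
            weights = weights * factor
            max_change = max(max_change, (factor - 1.0).abs().max())
        if max_change < tol:  # a full pass changed nothing noticeable
            break
    return weights

# Hypothetical toy sample; the last person has no ethnicity recorded
sample = pd.DataFrame({
    "age": ["18_24", "18_24", "25_34", "25_34", "25_34", "18_24"],
    "ethnicity": ["white", "black", "white", "white", "black", None],
})
# Hypothetical totals, chosen so that a solution exists
targets = {
    "age": {"18_24": 60, "25_34": 40},
    "ethnicity": {"white": 50, "black": 30},
}

weights = rake_weights(sample, targets)
print(weights.groupby(sample["age"]).sum())
print(weights.groupby(sample["ethnicity"]).sum())
```

On the full data the same call would take the 400,000-row DataFrame and targets for age, education and ethnicity. Raking only settles when the targets are mutually consistent and every targeted category has at least one sampled person, so it is worth checking the weighted totals afterwards rather than assuming the loop converged.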
