Pandas DataFrame: merging rows by summing



I want to create a DataFrame from census data. I want to count the number of tax returns filed in each income group.

So far I have written this:

import pandas as pd

census_df = pd.read_csv('../zip code data/19zpallagi.csv')
sub_census_df = census_df[['zipcode', 'agi_stub', 'N02650', 'A02650', 'ELDERLY', 'A07180']].copy()
num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
                  'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
# One new column per income group; rows belonging to other groups are left as NaN
for i, column_name in zip(range(1, 7), num_of_returns):
    sub_census_df[column_name] = sub_census_df[sub_census_df['agi_stub'] == i]['N02650']

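Each zip code has 6 groups attached to it. I want a single row per zip code, with the number of returns for each group as its own column and each zip code appearing only once. I have tried changing NaN to 0 and using groupby('zipcode').sum(), but I get 50 million rows for zip code 0, where only about 800k seem to exist.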
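To see where the NaN values come from, here is a minimal sketch on made-up toy data (not the real census file; the column names `group_1` to `group_3` are hypothetical). Assigning a filtered Series back to the DataFrame aligns on the row index, so only the rows that matched the filter receive a value:

```python
import pandas as pd

# Toy data (assumption: three income groups for a single zip code)
toy = pd.DataFrame({
    "zipcode": [0, 0, 0],
    "agi_stub": [1, 2, 3],
    "N02650": [778140.0, 525940.0, 285700.0],
})

# Same pattern as the loop in the question: the filtered Series keeps only
# the matching row labels, so every other row gets NaN on assignment.
for i, name in zip(range(1, 4), ["group_1", "group_2", "group_3"]):
    toy[name] = toy[toy["agi_stub"] == i]["N02650"]

print(toy)

# Each new column holds exactly one value and NaN everywhere else.
nan_counts = toy[["group_1", "group_2", "group_3"]].isna().sum()
print(nan_counts)
```

The number of rows does not change, which is why each zip code still shows up once per income group.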

This is the DataFrame I currently get:

zipcode agi_stub    N02650  A02650  ELDERLY A07180  Number_of_returns_1_25000   Number_of_returns_25000_50000   Number_of_returns_50000_75000   Number_of_returns_75000_100000  Number_of_returns_100000_200000 Number_of_returns_200000_more   Amount_1_25000  Amount_25000_50000  Amount_50000_75000  Amount_75000_100000 Amount_100000_200000    Amount_200000_more
0   0   1   778140.0    10311099.0  144610.0    2076.0  778140.0    NaN NaN NaN NaN NaN 10311099.0  NaN NaN NaN NaN NaN
1   0   2   525940.0    19145621.0  113810.0    17784.0 NaN 525940.0    NaN NaN NaN NaN NaN 19145621.0  NaN NaN NaN NaN
2   0   3   285700.0    17690402.0  82410.0 9521.0  NaN NaN 285700.0    NaN NaN NaN NaN NaN 17690402.0  NaN NaN NaN
3   0   4   179070.0    15670456.0  57970.0 8072.0  NaN NaN NaN 179070.0    NaN NaN NaN NaN NaN 15670456.0  NaN NaN
4   0   5   257010.0    35286228.0  85030.0 14872.0 NaN NaN NaN NaN 257010.0    NaN NaN NaN NaN NaN 35286228.0  NaN

This is what I want to get:

   zipcode  Number_of_returns_1_25000  Number_of_returns_25000_50000  Number_of_returns_50000_75000  Number_of_returns_75000_100000  Number_of_returns_100000_200000  Number_of_returns_200000_more
0        0                   778140.0                       525940.0                       285700.0                        179070.0                         257010.0                          850.0

Here is one way to do it: use groupby and sum the desired columns.

num_of_returns = ['Number_of_returns_1_25000', 'Number_of_returns_25000_50000', 'Number_of_returns_50000_75000',
                  'Number_of_returns_75000_100000', 'Number_of_returns_100000_200000', 'Number_of_returns_200000_more']
df.groupby('zipcode', as_index=False)[num_of_returns].sum()
zipcode     Number_of_returns_1_25000   Number_of_returns_25000_50000   Number_of_returns_50000_75000   Number_of_returns_75000_100000  Number_of_returns_100000_200000     Number_of_returns_200000_more
0   0   778140.0    525940.0    285700.0    179070.0    257010.0    0.0
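As a self-contained illustration of the same idea, the sketch below builds a small wide table by hand (toy rows, with only three of the six columns) and collapses it per zip code. `sum()` skips NaN by default, and a column that is NaN for every row of a group sums to 0.0, which would explain a 0.0 in a column whose rows are missing from the sample. Passing `min_count=1` keeps such a column as NaN instead.

```python
import numpy as np
import pandas as pd

cols = [
    "Number_of_returns_1_25000",
    "Number_of_returns_25000_50000",
    "Number_of_returns_200000_more",
]

# Toy data: one value per row, NaN elsewhere, as produced by the loop
# in the question. The last column is deliberately left empty.
wide = pd.DataFrame({
    "zipcode": [0, 0, 1001, 1001],
    cols[0]: [778140.0, np.nan, 2550.0, np.nan],
    cols[1]: [np.nan, 525940.0, np.nan, 2230.0],
    cols[2]: [np.nan, np.nan, np.nan, np.nan],
})

# NaN is skipped, so each zip code collapses to a single row.
collapsed = wide.groupby("zipcode", as_index=False)[cols].sum()
print(collapsed)

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN.
strict = wide.groupby("zipcode", as_index=False)[cols].sum(min_count=1)
print(strict)
```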

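This question needs more information before it can be answered properly. For example, you left out what some of the columns in the DataFrame mean: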

- `N1: Number of returns`

- `agi_stub: Size of adjusted gross income`
  According to the IRS, this has the following levels:

  0 = 'No AGI Stub'
  1 = 'Under $1'
  2 = '$1 under $10,000'
  3 = '$10,000 under $25,000'
  4 = '$25,000 under $50,000'
  5 = '$50,000 under $75,000'
  6 = '$75,000 under $100,000'
  7 = '$100,000 under $200,000'
  8 = '$200,000 under $500,000'
  9 = '$500,000 under $1,000,000'
  10 = '$1,000,000 or more'

I took this from https://www.irs.gov/pub/irs-soi/16incmdocguide.doc

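With this information, I think what you are looking for is the number of returns filed at each agi_stub income level. If that is what you mean, it can be done as follows: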

import pandas as pd

data = pd.read_csv("./data/19zpallagi.csv")
# select only the desired columns
data = data[['zipcode', 'agi_stub', 'N1']]
# solution to your problem?
df = data.pivot_table(
    index='zipcode',
    values='N1',
    columns='agi_stub',
    aggfunc=['sum']
)
# bit of cleaning up: flatten the ('sum', level) column MultiIndex
PREFIX = 'agi_stub_level_'
df.columns = [PREFIX + level for level in df.columns.get_level_values(1).astype(str)]

Here is the output.

In [77]: df
Out[77]: 
agi_stub_level_1  agi_stub_level_2  ...  agi_stub_level_5  agi_stub_level_6
zipcode                                      ...                                    
0              50061850.0        37566510.0  ...        21938920.0         8859370.0
1001               2550.0            2230.0  ...            1420.0             230.0
1002               2850.0            1830.0  ...            1840.0             990.0
1005                650.0             570.0  ...             450.0              60.0
1007               1980.0            1530.0  ...            1830.0             460.0
...                   ...               ...  ...               ...               ...
99827               470.0             360.0  ...             170.0              40.0
99833               550.0             380.0  ...             290.0              80.0
99835              1250.0            1130.0  ...             730.0             190.0
99901              1960.0            1520.0  ...            1030.0             290.0
99999            868450.0          644160.0  ...          319880.0          142960.0
[27595 rows x 6 columns]
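If you would rather end up with the column names from the question, a variation on the pivot could look like the sketch below. It runs on a few toy rows rather than the real file, and the `names` mapping is an assumption taken from the question's loop, which pairs agi_stub 1 to 6 with the `Number_of_returns_*` names in order (only the first two are shown here).

```python
import pandas as pd

# Toy long-format data: one row per (zipcode, agi_stub) pair
long_df = pd.DataFrame({
    "zipcode": [1001, 1001, 1002, 1002],
    "agi_stub": [1, 2, 1, 2],
    "N1": [2550.0, 2230.0, 2850.0, 1830.0],
})

# Assumed mapping from agi_stub level to the question's column names
names = {
    1: "Number_of_returns_1_25000",
    2: "Number_of_returns_25000_50000",
}

wide = (
    long_df.pivot_table(index="zipcode", columns="agi_stub", values="N1", aggfunc="sum")
    .rename(columns=names)      # agi_stub levels -> descriptive names
    .rename_axis(columns=None)  # drop the leftover "agi_stub" axis label
    .reset_index()              # make zipcode a regular column again
)
print(wide)
```

Passing `aggfunc="sum"` as a plain string instead of a list gives a single-level column index, so no flattening step is needed afterwards.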
