I have a dataset (a .TSV file) with the following columns. (It is the World Bank's latest World Development Indicators dataset. Nice!)
country countrycode varname 1960 1961 1962
afghanistan AFG GDP 5.6 5.7 5.8
afghanistan AFG Gini .77 .78 .75
afghanistan AFG educ 8.1 8.2 8.3
afghanistan AFG pop 888 889 890
albania ALB GDP 6.6 6.7 6.8
albania ALB Gini .45 .46 .47
albania ALB educ 6.2 6.3 6.4
albania ALB pop 777 778 779
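I need a pandas DataFrame with ['GDP', 'Gini', 'educ', 'pop'] as columns, along with ['country', 'countrycode', 'year']. So the values for "year" are currently columns! I would like exactly one row for each country/year combination.
For example, the columns and the first row should be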
country countrycode year GDP Gini educ pop
afghanistan AFG 1960 5.6 .77 8.1 888
This looks like some complex pivot, or the opposite of "melt", but I can't figure it out.
In [59]: df
Out[59]:
country countrycode varname 1960 1961 1962
0 afghanistan AFG GDP 5.60 5.70 5.80
1 afghanistan AFG Gini 0.77 0.78 0.75
2 afghanistan AFG educ 8.10 8.20 8.30
3 afghanistan AFG pop 888.00 889.00 890.00
4 albania ALB GDP 6.60 6.70 6.80
5 albania ALB Gini 0.45 0.46 0.47
6 albania ALB educ 6.20 6.30 6.40
7 albania ALB pop 777.00 778.00 779.00
In [60]: df = df.set_index(['country', 'countrycode', 'varname'])
In [61]: df.columns.name = 'year'
In [62]: df.stack().unstack('varname')
Out[62]:
varname                       GDP  Gini  educ  pop
country     countrycode year
afghanistan AFG         1960  5.6  0.77   8.1  888
                        1961  5.7  0.78   8.2  889
                        1962  5.8  0.75   8.3  890
albania     ALB         1960  6.6  0.45   6.2  777
                        1961  6.7  0.46   6.3  778
                        1962  6.8  0.47   6.4  779
The result is a frame with a MultiIndex; you can call reset_index to move the MultiIndex levels into regular columns.
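To make that last step concrete, here is a minimal self-contained sketch that rebuilds the sample frame from the question and finishes with reset_index. The inline data, the whitespace separator, and the name `tidy` are assumptions for illustration; only the set_index/stack/unstack/reset_index chain comes from the session above.

```python
import io
import pandas as pd

# Sample rows copied from the question. The real file is tab-separated;
# whitespace is used here only to keep the example readable.
raw = """country countrycode varname 1960 1961 1962
afghanistan AFG GDP 5.6 5.7 5.8
afghanistan AFG Gini .77 .78 .75
afghanistan AFG educ 8.1 8.2 8.3
afghanistan AFG pop 888 889 890
albania ALB GDP 6.6 6.7 6.8
albania ALB Gini .45 .46 .47
albania ALB educ 6.2 6.3 6.4
albania ALB pop 777 778 779
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
df = df.set_index(["country", "countrycode", "varname"])
df.columns.name = "year"

# stack() moves the year columns into the index;
# unstack('varname') spreads the indicators back out as columns.
tidy = df.stack().unstack("varname")

# reset_index() turns the (country, countrycode, year) MultiIndex into columns.
tidy = tidy.reset_index()
tidy.columns.name = None                 # drop the leftover 'varname' label
tidy["year"] = tidy["year"].astype(int)  # header labels were read as strings
print(tidy)
```

Converting `year` to an integer is optional; it is included because the year labels come from the file header and would otherwise stay as strings.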
Group your DataFrame by country and countrycode, then apply your own function:
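In [13]: def f(df):
   ....:     # drop the grouping columns if present (newer pandas may already exclude them)
   ....:     df = df.drop(columns=['country', 'countrycode'], errors='ignore')
   ....:     df = df.set_index('varname')
   ....:     df.index.name = None
   ....:     df = df.T
   ....:     df.index.name = 'year'
   ....:     return df
   ....: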
In [14]: df.groupby(['country', 'countrycode']).apply(f).reset_index()
Out[14]:
country countrycode year GDP Gini educ pop
0 afghanistan AFG 1960 5.6 0.77 8.1 888
1 afghanistan AFG 1961 5.7 0.78 8.2 889
2 afghanistan AFG 1962 5.8 0.75 8.3 890
3 albania ALB 1960 6.6 0.45 6.2 777
4 albania ALB 1961 6.7 0.46 6.3 778
5 albania ALB 1962 6.8 0.47 6.4 779
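Since the question mentions "melt", the same reshape can also be sketched as melt followed by pivot. This is an alternative route, not the method used in either session above. The inline sample data and the names `long_df` and `tidy` are assumptions for illustration, and passing a list to `index` in `pivot` needs pandas 1.1 or later.

```python
import io
import pandas as pd

# Sample rows copied from the question (whitespace-separated for readability).
raw = """country countrycode varname 1960 1961 1962
afghanistan AFG GDP 5.6 5.7 5.8
afghanistan AFG Gini .77 .78 .75
afghanistan AFG educ 8.1 8.2 8.3
afghanistan AFG pop 888 889 890
albania ALB GDP 6.6 6.7 6.8
albania ALB Gini .45 .46 .47
albania ALB educ 6.2 6.3 6.4
albania ALB pop 777 778 779
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# melt: one row per (country, countrycode, varname, year)
long_df = df.melt(
    id_vars=["country", "countrycode", "varname"],
    var_name="year",
    value_name="value",
)

# pivot: spread varname back out into one column per indicator
tidy = long_df.pivot(
    index=["country", "countrycode", "year"],
    columns="varname",
    values="value",
).reset_index()
tidy.columns.name = None  # drop the leftover 'varname' label
print(tidy)
```

The intermediate `long_df` has 24 rows (8 input rows times 3 years), which can be a handy shape to inspect before pivoting back.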
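I'd suggest that @Wouter could put this into his (accepted) answer, since it uses the actual names from the WDI data and gives others who use them something closer to cut-and-paste. Sorry, I'm sure this is not the right way to communicate that...
For any variables you want to keep/use, just give them a name in this dictionary: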
WDIconversions={"Year":'year',
"YearCode":'',
"Country Name":'country_name_wb',
"Country Code":'countryCode_ISO3_WB',
"Inflation, consumer prices (annual %)":'',
"Inflation, GDP deflator (annual %)":'',
"GDP per capita, PPP (constant 2005 international $)":'GDPpc',
"Firms with female participation in ownership (% of firms)":'',
"Investment in energy with private participation (current US$)":'',
"Investment in telecoms with private participation (current US$)":'',
"Investment in transport with private participation (current US$)":'',
"Investment in water and sanitation with private participation (current US$)":'',
"Labor participation rate, female (% of female population ages 15+)":'',
"Labor participation rate, male (% of male population ages 15+)":'',
"Labor participation rate, total (% of total population ages 15+)":'',
"Ratio of female to male labor participation rate (%)":'',
"Life expectancy at birth, female (years)":'',
"Life expectancy at birth, male (years)":'',
"Life expectancy at birth, total (years)":'lifeExpectancy',
"Population, total":'nat_pop',
"GINI index":'GiniWB',
} # etc etc etc
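import pandas as pd

# WBDrawfile is the path to the downloaded WDI .tsv file
dfW = pd.read_table(WBDrawfile)
df = dfW.set_index(['Country Name', 'Country Code', 'Indicator Name'])
del df['Indicator Code']
df.columns.name = 'year'
# years become rows, indicators become columns
df = df.stack().unstack('Indicator Name')
# keep only the indicators that were given a non-empty name, then rename
keep = [kk for kk, ii in WDIconversions.items() if ii and kk in df]
df = df[keep].reset_index().rename(columns=WDIconversions)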
The result is:
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12983 entries, 0 to 12982
Data columns:
country_name_wb 12983 non-null values
countryCode_ISO3_WB 12983 non-null values
year 12983 non-null values
GiniWB 845 non-null values
nat_pop 12601 non-null values
GDPpc 6292 non-null values
educPrimary 4949 non-null values
lifeExpectancy 11077 non-null values
dtypes: float64(5), object(3)
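The filter-and-rename step in the code above can be tried in isolation. The sketch below uses a tiny hypothetical frame with made-up values (not real WDI numbers) and a shortened dictionary named `conversions`; both are assumptions for illustration. It shows that indicators mapped to an empty string, and names missing from the frame, are silently left out.

```python
import pandas as pd

# Hypothetical miniature of the frame after stack()/unstack(); values are made up.
df = pd.DataFrame({
    "Country Name": ["Albania", "Albania"],
    "Country Code": ["ALB", "ALB"],
    "year": ["1960", "1961"],
    "GINI index": [None, 31.0],
    "Population, total": [1.6e6, 1.7e6],
    "Inflation, GDP deflator (annual %)": [1.0, 2.0],
}).set_index(["Country Name", "Country Code", "year"])

conversions = {
    "Country Name": "country_name_wb",
    "Country Code": "countryCode_ISO3_WB",
    "GINI index": "GiniWB",
    "Population, total": "nat_pop",
    "Inflation, GDP deflator (annual %)": "",                     # empty name: dropped
    "Life expectancy at birth, total (years)": "lifeExpectancy",  # not in df: skipped
}

# `old in df` tests column labels only, so the index levels are not selected here;
# they come back as columns via reset_index() and are renamed afterwards.
keep = [old for old, new in conversions.items() if new and old in df]
out = df[keep].reset_index().rename(columns=conversions)
print(list(out.columns))
```

Because the selection is driven by the dictionary, adding a variable to the output only requires giving it a non-empty name there.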