查找数据集与多级列的相关性

这是我数据集的一部分。

Country              Australia                        Belgium
                                      gdp       wage                 gdp       wage
2006-01-01 00:00:00    745,522,000,000.00  23,826.64  409,813,000,000.00  20,228.74
2007-01-01 00:00:00    851,963,000,000.00  24,616.84  471,821,000,000.00  20,486.16
2008-01-01 00:00:00  1,052,580,000,000.00  24,185.70  518,626,000,000.00  20,588.93
2009-01-01 00:00:00    926,448,000,000.00  24,496.84  484,553,000,000.00  21,284.21
2010-01-01 00:00:00  1,144,260,000,000.00  24,373.76  483,548,000,000.00  20,967.05

我想找到两国"GDP"列和"工资"列的相关性。

我试过使用，

df.corr()

但输出结果是空的。

预期输出可以是这样的：

  Country    Correlation
Australia            1.0
  Belgium           0.98

（相关性的值不准确。显示此内容仅用于说明。

我可以运行哪些代码来实现此结果？

编辑：执行行

打印（df.columns）

导致这样的输出

MultiIndex(levels=[['Australia', 'Belgium', 'Brazil', 'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Czech Republic', 'Estonia', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Israel', 'Japan', 'Korea', 'Latvia', 'Lithuania', 'Luxembourg', 'Mexico', 'Netherlands', 'New Zealand', 'Poland', 'Portugal', 'Russian Federation', 'Slovak Republic', 'Slovenia', 'Spain', 'Turkey', 'United Kingdom', 'United States'], ['gdp', 'wage']],
       labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 28, 29, 29, 30, 30, 31, 31], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
       names=['Country', None])

首先将

列转换为带replace的数字并转换为float，然后通过DataFrame.xs DataFrame.corrwith选择级别以进行关联：

#if create DataFrame from file
#df = pd.read_csv(file, header=[0,1], thousands=',')
df = df.replace(',','', regex=True).astype(float)
s = df.xs('gdp', axis=1, level=1).corrwith(df.xs('wage', axis=1, level=1))
print (s)
Australia    0.325915
Belgium      0.521564
dtype: float64

最后为数据帧添加reset_index：

df1 = s.reset_index()
df1.columns = ['Country','Correlation']
print (df1)
     Country  Correlation
0  Australia     0.325915
1    Belgium     0.521564

详情：

print (df.xs('gdp', axis=1, level=1))
                        Australia       Belgium
2006-01-01 00:00:00  7.455220e+11  4.098130e+11
2007-01-01 00:00:00  8.519630e+11  4.718210e+11
2008-01-01 00:00:00  1.052580e+12  5.186260e+11
2009-01-01 00:00:00  9.264480e+11  4.845530e+11
2010-01-01 00:00:00  1.144260e+12  4.835480e+11

相关内容

最新更新

热门标签：