我如何正确导入这个JSON文件到熊猫?



使用的链接:http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_mmor?precision=1&geo=HU&geo=PL&geo=SK&unit=RCH_M&coicop=CP00它有三个国家在一段时间内的通货膨胀数据。这是我到目前为止的代码:

import urllib.request, json
import pandas as pd
import requests
from pandas.io.json import json_normalize
url = r"http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/prc_hicp_mmor?precision=1&geo=HU&geo=PL&geo=SK&unit=RCH_M&coicop=CP00"
with urllib.request.urlopen(url) as url:
data = json.loads(url.read().decode())
df2 = pd.DataFrame(pd.json_normalize(data))
print(df) 

这是输出,这是相当远,我需要的…:我需要JSON文件中提到的每个日期的3个国家的HCIP。

version                                              label  
0     2.0  HICP (2015 = 100) - monthly data (monthly rate...   
href    source     updated  
0  http://ec.europa.eu/eurostat/wdds/rest/data/v2...  Eurostat  2021-08-31   
class                         id            size status.0 status.1  ...  
0  dataset  [unit, coicop, geo, time]  [1, 1, 3, 307]        d        d  ...   
dimension.time.category.label.2020M11 dimension.time.category.label.2020M12  
0                               2020M11                               2020M12   
dimension.time.category.label.2021M01 dimension.time.category.label.2021M02  
0                               2021M01                               2021M02   
dimension.time.category.label.2021M03 dimension.time.category.label.2021M04  
0                               2021M03                               2021M04   
dimension.time.category.label.2021M05 dimension.time.category.label.2021M06  
0                               2021M05                               2021M06   
dimension.time.category.label.2021M07 dimension.time.category.label.2021M08  
0                               2021M07                               2021M08  

你知道我做错了什么吗?或者如何正确导入这种格式?

根据我对你的数据的理解:

  • data['id']指定尺寸顺序
  • data['dimension'][*]['category']指定每个维度的值顺序
  • data['value']为原始值。

其余的似乎都是多余的。

现在最简单的是提取值,让我们也确保索引被正确排序:

>>> val = pd.Series(data['value']).rename(index=int).sort_index()
>>> val
0      2.4
1      1.6
2      1.7
3      2.2
4      0.8
... 
916    0.2
917    0.6
918    0.4
919    0.5
920    0.2
Length: 921, dtype: float64

然后,对于每个维度,我们可以从信息构造一个简单的数据框架,例如geo:

>>> pd.DataFrame({key: val for key, val in data['dimension']['geo']['category'].items()})
index     label
HU      0   Hungary
PL      1    Poland
SK      2  Slovakia

按列index排序,我们将得到label的顺序,并根据data['id']排序,我们得到:

>>> dimensions = [pd.DataFrame({
...     key: val for key, val in data['dimension'][dim]['category'].items()
... }).sort_values('index')['label'].values for dim in data['id']]
>>> dimensions
[array(['Monthly rate of change'], dtype=object), array(['All-items HICP'], dtype=object), array(['Hungary', 'Poland', 'Slovakia'], dtype=object), array(['1996M02', '1996M03', '1996M04', '1996M05', '1996M06', '1996M07',
'1996M08', '1996M09', '1996M10', '1996M11', '1996M12', '1997M01',
'1997M02', '1997M03', '1997M04', '1997M05', '1997M06', '1997M07',
'1997M08', '1997M09', '1997M10', '1997M11', '1997M12', '1998M01',
'1998M02', '1998M03', '1998M04', '1998M05', '1998M06', '1998M07',
'1998M08', '1998M09', '1998M10', '1998M11', '1998M12', '1999M01',
'1999M02', '1999M03', '1999M04', '1999M05', '1999M06', '1999M07',
'1999M08', '1999M09', '1999M10', '1999M11', '1999M12', '2000M01',
'2000M02', '2000M03', '2000M04', '2000M05', '2000M06', '2000M07',
'2000M08', '2000M09', '2000M10', '2000M11', '2000M12', '2001M01',
'2001M02', '2001M03', '2001M04', '2001M05', '2001M06', '2001M07',
'2001M08', '2001M09', '2001M10', '2001M11', '2001M12', '2002M01',
'2002M02', '2002M03', '2002M04', '2002M05', '2002M06', '2002M07',
'2002M08', '2002M09', '2002M10', '2002M11', '2002M12', '2003M01',
'2003M02', '2003M03', '2003M04', '2003M05', '2003M06', '2003M07',
'2003M08', '2003M09', '2003M10', '2003M11', '2003M12', '2004M01',
'2004M02', '2004M03', '2004M04', '2004M05', '2004M06', '2004M07',
'2004M08', '2004M09', '2004M10', '2004M11', '2004M12', '2005M01',
'2005M02', '2005M03', '2005M04', '2005M05', '2005M06', '2005M07',
'2005M08', '2005M09', '2005M10', '2005M11', '2005M12', '2006M01',
'2006M02', '2006M03', '2006M04', '2006M05', '2006M06', '2006M07',
'2006M08', '2006M09', '2006M10', '2006M11', '2006M12', '2007M01',
'2007M02', '2007M03', '2007M04', '2007M05', '2007M06', '2007M07',
'2007M08', '2007M09', '2007M10', '2007M11', '2007M12', '2008M01',
'2008M02', '2008M03', '2008M04', '2008M05', '2008M06', '2008M07',
'2008M08', '2008M09', '2008M10', '2008M11', '2008M12', '2009M01',
'2009M02', '2009M03', '2009M04', '2009M05', '2009M06', '2009M07',
'2009M08', '2009M09', '2009M10', '2009M11', '2009M12', '2010M01',
'2010M02', '2010M03', '2010M04', '2010M05', '2010M06', '2010M07',
'2010M08', '2010M09', '2010M10', '2010M11', '2010M12', '2011M01',
'2011M02', '2011M03', '2011M04', '2011M05', '2011M06', '2011M07',
'2011M08', '2011M09', '2011M10', '2011M11', '2011M12', '2012M01',
'2012M02', '2012M03', '2012M04', '2012M05', '2012M06', '2012M07',
'2012M08', '2012M09', '2012M10', '2012M11', '2012M12', '2013M01',
'2013M02', '2013M03', '2013M04', '2013M05', '2013M06', '2013M07',
'2013M08', '2013M09', '2013M10', '2013M11', '2013M12', '2014M01',
'2014M02', '2014M03', '2014M04', '2014M05', '2014M06', '2014M07',
'2014M08', '2014M09', '2014M10', '2014M11', '2014M12', '2015M01',
'2015M02', '2015M03', '2015M04', '2015M05', '2015M06', '2015M07',
'2015M08', '2015M09', '2015M10', '2015M11', '2015M12', '2016M01',
'2016M02', '2016M03', '2016M04', '2016M05', '2016M06', '2016M07',
'2016M08', '2016M09', '2016M10', '2016M11', '2016M12', '2017M01',
'2017M02', '2017M03', '2017M04', '2017M05', '2017M06', '2017M07',
'2017M08', '2017M09', '2017M10', '2017M11', '2017M12', '2018M01',
'2018M02', '2018M03', '2018M04', '2018M05', '2018M06', '2018M07',
'2018M08', '2018M09', '2018M10', '2018M11', '2018M12', '2019M01',
'2019M02', '2019M03', '2019M04', '2019M05', '2019M06', '2019M07',
'2019M08', '2019M09', '2019M10', '2019M11', '2019M12', '2020M01',
'2020M02', '2020M03', '2020M04', '2020M05', '2020M06', '2020M07',
'2020M08', '2020M09', '2020M10', '2020M11', '2020M12', '2021M01',
'2021M02', '2021M03', '2021M04', '2021M05', '2021M06', '2021M07',
'2021M08'], dtype=object)]

现在简单地用这个来定义你的数据的索引,你就完成了:

>>> val.index = pd.MultiIndex.from_product(dimensions, names=data['id'])
>>> val
unit                    coicop          geo       time
Monthly rate of change  All-items HICP  Hungary   1996M02    2.4
1996M03    1.6
1996M04    1.7
1996M05    2.2
1996M06    0.8
... 
Slovakia  2021M04    0.2
2021M05    0.6
2021M06    0.4
2021M07    0.5
2021M08    0.2
Length: 921, dtype: float64

现在stackunstack你可以根据自己的意愿透视维度:

>>> val.unstack('geo')
geo                                            Hungary  Poland  Slovakia
unit                   coicop         time                              
Monthly rate of change All-items HICP 1996M02      2.4     1.4       0.3
1996M03      1.6     1.6       0.2
1996M04      1.7     2.0       0.3
1996M05      2.2     1.3       0.5
1996M06      0.8     0.9       0.2
...                                                ...     ...       ...
2021M04      0.8     0.7       0.2
2021M05      0.6     0.4       0.6
2021M06      0.4     0.1       0.4
2021M07      0.6     0.4       0.5
2021M08      0.2     0.3       0.2
[307 rows x 3 columns]

最新更新