在使用pandas.read_xml()时理解xml标记

迄今为止从未尝试过解析xml标记进行数据提取。现在我的任务是从统计来源中提取数据。我天真的方法是完全依赖pandas.read_xml((，如下所示：

import io
import pandas as pd
import requests
from zipfile import ZipFile
pd.options.display.max_columns = 8

URL = "https://www.federalreserve.gov/datadownload/Output.aspx?rel=g17&filetype=zip"
r = requests.get(URL)
with ZipFile(io.BytesIO(r.content)) as z:
# =========================================================================
# Select the Largest File Containing the Most of the Data
# =========================================================================
_map = {_.file_size: _.filename for _ in z.filelist}
with z.open(_map[max(_map)]) as f:
df = pd.read_xml(f)
print(df)

在pd.read_xml(f(中，有以下标记：

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>  <message:MessageGroup xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message"  xmlns:common="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/common"  xmlns:frb="http://www.federalreserve.gov/structure/compact/common"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message SDMXMessage.xsd http://www.federalreserve.gov/structure/compact/common frb_common.xsd"><message:Header xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message">
<message:ID>G17</message:ID>
<message:Test>false</message:Test>
<message:Name>G.17 - Industrial Production and Capacity Utilization</message:Name>
<message:Prepared>2022-07-15T00:38:22</message:Prepared>
<message:Sender id="FRB">
<message:Name>Federal Reserve Board</message:Name>
<message:Contact>
<message:Name>Public Affairs</message:Name>
<message:Telephone>(202) 452 - 3204</message:Telephone>
</message:Contact>
</message:Sender>
</message:Header>
<frb:DataSet id="IP_MAJOR_INDUSTRY_GROUPS" xmlns:kf="http://www.federalreserve.gov/structure/compact/G17_IP_MAJOR_INDUSTRY_GROUPS" xsi:schemaLocation="http://www.federalreserve.gov/structure/compact/G17_IP_MAJOR_INDUSTRY_GROUPS G17_IP_MAJOR_INDUSTRY_GROUPS.xsd"  > 
<kf:Series CURRENCY="NA" FREQ="129" SA="SA" SERIES_CODE="B50001" SERIES_NAME="IP.B50001.S" UNIT="Index:_2017_100" UNIT_MULT="1"  > 
<frb:Annotations>
<common:Annotation>
<common:AnnotationType>Short Description</common:AnnotationType>
<common:AnnotationText>Total index; s.a. IP</common:AnnotationText>
</common:Annotation>
<common:Annotation>
<common:AnnotationType>Long Description</common:AnnotationType>
<common:AnnotationText>Total index; s.a. IP</common:AnnotationText>
</common:Annotation>
</frb:Annotations>
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.8773" TIME_PERIOD="1919-01-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6617" TIME_PERIOD="1919-02-28" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.5270" TIME_PERIOD="1919-03-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6078" TIME_PERIOD="1919-04-30" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6348" TIME_PERIOD="1919-05-31" />
... Skip ...
<frb:Obs OBS_STATUS="A" OBS_VALUE="102.9981" TIME_PERIOD="2022-02-28" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="103.7286" TIME_PERIOD="2022-03-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.5224" TIME_PERIOD="2022-04-30" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.5729" TIME_PERIOD="2022-05-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.3648" TIME_PERIOD="2022-06-30" />
</kf:Series>
... Many More Lines Omitted To the End of the Document ...

所示样品的理想输出为：

TIME_PERIOD IP.B50001.S
1919-01-31     4.8773
1919-02-28     4.6617
1919-03-31     4.5270
1919-04-30     4.6078
1919-05-31     4.6348
... Skip ...
2022-02-28   102.9981
2022-03-31   103.7286
2022-04-30   104.5224
2022-05-31   104.5729
2022-06-30   104.3648

目前，它没有返回理想的结果，因为整个目标数据似乎被忽略了。我得到的输出是：

ID   Test                                               Name  
0    G17  false  G.17 - Industrial Production and Capacity Util...   
1   None   None                                               None   
2   None   None                                               None   
3   None   None                                               None   
4   None   None                                               None   
5   None   None                                               None   
6   None   None                                               None   
7   None   None                                               None   
8   None   None                                               None   
9   None   None                                               None   
10  None   None                                               None   
11  None   None                                               None   
12  None   None                                               None   
13  None   None                                               None   
14  None   None                                               None   
Prepared  Sender                                      id  
0   2022-07-15T00:38:22     NaN                                    None   
1                  None     NaN                IP_MAJOR_INDUSTRY_GROUPS   
2                  None     NaN                 IP_DURABLE_GOODS_DETAIL   
3                  None     NaN              IP_NONDURABLE_GOODS_DETAIL   
4                  None     NaN            IP_MINING_AND_UTILITY_DETAIL   
5                  None     NaN                        IP_MARKET_GROUPS   
6                  None     NaN                   IP_SPECIAL_AGGREGATES   
7                  None     NaN  IP_GROSS_VALUE_STAGE_OF_PROCESS_GROUPS   
8                  None     NaN                                     MVA   
9                  None     NaN                                    DIFF   
10                 None     NaN                                     CAP   
11                 None     NaN                                  CAPUTL   
12                 None     NaN                                    GVIP   
13                 None     NaN                                     RIW   
14                 None     NaN                                      KW   
schemaLocation  Series  
0                                                None     NaN  
1   http://www.federalreserve.gov/structure/compac...     NaN  
2   http://www.federalreserve.gov/structure/compac...     NaN  
3   http://www.federalreserve.gov/structure/compac...     NaN  
4   http://www.federalreserve.gov/structure/compac...     NaN  
5   http://www.federalreserve.gov/structure/compac...     NaN  
6   http://www.federalreserve.gov/structure/compac...     NaN  
7   http://www.federalreserve.gov/structure/compac...     NaN  
8   http://www.federalreserve.gov/structure/compac...     NaN  
9   http://www.federalreserve.gov/structure/compac...     NaN  
10  http://www.federalreserve.gov/structure/compac...     NaN  
11  http://www.federalreserve.gov/structure/compac...     NaN  
12  http://www.federalreserve.gov/structure/compac...     NaN  
13  http://www.federalreserve.gov/structure/compac...     NaN  
14  http://www.federalreserve.gov/structure/compac...     NaN

这看起来更像是目录，而不是目标数据本身。

目前，任务分为两部分：

我应该将哪些kwargs传递给pandas.read_xml((以获得更有意义的数据？
如果成功检索到更有意义的数据，那么下一步将是使用zip归档中的辅助文件(例如，文件<G17_IP_MAJOR_INDUSTRY_GROUPS.xsd>(来分配列名，如果第一步的输出不太容易被人阅读的话。

谢谢！

您的示例xml有一些问题，但假设它按照我认为的方式修复了(它太长了，无法粘贴到这里(，那么read_xml()可能无法解析它，从而生成预期的输出。

相反，我会先遍历lxml，然后将某些xpath表达式的输出(见下文(传递给panda以创建数据帧：

from lxml import etree
tree = etree.parse(f)
doc = tree.getroot()
#your expected columns:
cols = ["TIME_PERIOD", "IP.B50001.S"]
#the base xpath expression
expr = '//*[local-name()="Obs"]'
rows = []
for r in doc.xpath(expr):
row = []
#use more xpath expressions to get to the target attributes
row.extend([r.xpath('.//@TIME_PERIOD')[0], r.xpath('.//@OBS_VALUE')[0]])
rows.append(row)
frdf = pd.DataFrame(rows, columns=cols)
frdf

输出：

TIME_PERIOD     IP.B50001.S
0   1919-01-31  4.8773
1   1919-02-28  4.6617
2   1919-03-31  4.5270

等等。

相关内容

最新更新

热门标签：