迄今为止从未尝试过解析xml标记进行数据提取。现在我的任务是从统计来源中提取数据。我天真的方法是完全依赖pandas.read_xml((,如下所示:
import io
import pandas as pd
import requests
from zipfile import ZipFile
pd.options.display.max_columns = 8
URL = "https://www.federalreserve.gov/datadownload/Output.aspx?rel=g17&filetype=zip"
r = requests.get(URL)
with ZipFile(io.BytesIO(r.content)) as z:
# =========================================================================
# Select the Largest File Containing the Most of the Data
# =========================================================================
_map = {_.file_size: _.filename for _ in z.filelist}
with z.open(_map[max(_map)]) as f:
df = pd.read_xml(f)
print(df)
在pd.read_xml(f(中,有以下标记:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?> <message:MessageGroup xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message" xmlns:common="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/common" xmlns:frb="http://www.federalreserve.gov/structure/compact/common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message SDMXMessage.xsd http://www.federalreserve.gov/structure/compact/common frb_common.xsd"><message:Header xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v1_0/message">
<message:ID>G17</message:ID>
<message:Test>false</message:Test>
<message:Name>G.17 - Industrial Production and Capacity Utilization</message:Name>
<message:Prepared>2022-07-15T00:38:22</message:Prepared>
<message:Sender id="FRB">
<message:Name>Federal Reserve Board</message:Name>
<message:Contact>
<message:Name>Public Affairs</message:Name>
<message:Telephone>(202) 452 - 3204</message:Telephone>
</message:Contact>
</message:Sender>
</message:Header>
<frb:DataSet id="IP_MAJOR_INDUSTRY_GROUPS" xmlns:kf="http://www.federalreserve.gov/structure/compact/G17_IP_MAJOR_INDUSTRY_GROUPS" xsi:schemaLocation="http://www.federalreserve.gov/structure/compact/G17_IP_MAJOR_INDUSTRY_GROUPS G17_IP_MAJOR_INDUSTRY_GROUPS.xsd" >
<kf:Series CURRENCY="NA" FREQ="129" SA="SA" SERIES_CODE="B50001" SERIES_NAME="IP.B50001.S" UNIT="Index:_2017_100" UNIT_MULT="1" >
<frb:Annotations>
<common:Annotation>
<common:AnnotationType>Short Description</common:AnnotationType>
<common:AnnotationText>Total index; s.a. IP</common:AnnotationText>
</common:Annotation>
<common:Annotation>
<common:AnnotationType>Long Description</common:AnnotationType>
<common:AnnotationText>Total index; s.a. IP</common:AnnotationText>
</common:Annotation>
</frb:Annotations>
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.8773" TIME_PERIOD="1919-01-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6617" TIME_PERIOD="1919-02-28" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.5270" TIME_PERIOD="1919-03-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6078" TIME_PERIOD="1919-04-30" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="4.6348" TIME_PERIOD="1919-05-31" />
... Skip ...
<frb:Obs OBS_STATUS="A" OBS_VALUE="102.9981" TIME_PERIOD="2022-02-28" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="103.7286" TIME_PERIOD="2022-03-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.5224" TIME_PERIOD="2022-04-30" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.5729" TIME_PERIOD="2022-05-31" />
<frb:Obs OBS_STATUS="A" OBS_VALUE="104.3648" TIME_PERIOD="2022-06-30" />
</kf:Series>
... Many More Lines Omitted To the End of the Document ...
所示样品的理想输出为:
TIME_PERIOD IP.B50001.S
1919-01-31 4.8773
1919-02-28 4.6617
1919-03-31 4.5270
1919-04-30 4.6078
1919-05-31 4.6348
... Skip ...
2022-02-28 102.9981
2022-03-31 103.7286
2022-04-30 104.5224
2022-05-31 104.5729
2022-06-30 104.3648
目前,它没有返回理想的结果,因为整个目标数据似乎被忽略了。我得到的输出是:
ID Test Name
0 G17 false G.17 - Industrial Production and Capacity Util...
1 None None None
2 None None None
3 None None None
4 None None None
5 None None None
6 None None None
7 None None None
8 None None None
9 None None None
10 None None None
11 None None None
12 None None None
13 None None None
14 None None None
Prepared Sender id
0 2022-07-15T00:38:22 NaN None
1 None NaN IP_MAJOR_INDUSTRY_GROUPS
2 None NaN IP_DURABLE_GOODS_DETAIL
3 None NaN IP_NONDURABLE_GOODS_DETAIL
4 None NaN IP_MINING_AND_UTILITY_DETAIL
5 None NaN IP_MARKET_GROUPS
6 None NaN IP_SPECIAL_AGGREGATES
7 None NaN IP_GROSS_VALUE_STAGE_OF_PROCESS_GROUPS
8 None NaN MVA
9 None NaN DIFF
10 None NaN CAP
11 None NaN CAPUTL
12 None NaN GVIP
13 None NaN RIW
14 None NaN KW
schemaLocation Series
0 None NaN
1 http://www.federalreserve.gov/structure/compac... NaN
2 http://www.federalreserve.gov/structure/compac... NaN
3 http://www.federalreserve.gov/structure/compac... NaN
4 http://www.federalreserve.gov/structure/compac... NaN
5 http://www.federalreserve.gov/structure/compac... NaN
6 http://www.federalreserve.gov/structure/compac... NaN
7 http://www.federalreserve.gov/structure/compac... NaN
8 http://www.federalreserve.gov/structure/compac... NaN
9 http://www.federalreserve.gov/structure/compac... NaN
10 http://www.federalreserve.gov/structure/compac... NaN
11 http://www.federalreserve.gov/structure/compac... NaN
12 http://www.federalreserve.gov/structure/compac... NaN
13 http://www.federalreserve.gov/structure/compac... NaN
14 http://www.federalreserve.gov/structure/compac... NaN
这看起来更像是目录,而不是目标数据本身。
目前,任务分为两部分:
我应该将哪些kwargs传递给pandas.read_xml((以获得更有意义的数据?
如果成功检索到更有意义的数据,那么下一步将是使用zip归档中的辅助文件(例如,文件<G17_IP_MAJOR_INDUSTRY_GROUPS.xsd>(来分配列名,如果第一步的输出不太容易被人阅读的话。
谢谢!
您的示例xml有一些问题,但假设它按照我认为的方式修复了(它太长了,无法粘贴到这里(,那么read_xml()
可能无法解析它,从而生成预期的输出。
相反,我会先遍历lxml,然后将某些xpath表达式的输出(见下文(传递给panda以创建数据帧:
from lxml import etree
tree = etree.parse(f)
doc = tree.getroot()
#your expected columns:
cols = ["TIME_PERIOD", "IP.B50001.S"]
#the base xpath expression
expr = '//*[local-name()="Obs"]'
rows = []
for r in doc.xpath(expr):
row = []
#use more xpath expressions to get to the target attributes
row.extend([r.xpath('.//@TIME_PERIOD')[0], r.xpath('.//@OBS_VALUE')[0]])
rows.append(row)
frdf = pd.DataFrame(rows, columns=cols)
frdf
输出:
TIME_PERIOD IP.B50001.S
0 1919-01-31 4.8773
1 1919-02-28 4.6617
2 1919-03-31 4.5270
等等。