r - Converting XML to a data frame with xml2: the SDMX case



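Please take a look at the reprex in this question. SDMX is a data model for disseminating statistical data, and there are tools in Python and R for handling it. SDMX is usually delivered as XML files (and, more recently, as JSON files too). I can handle the simple example given by the URL in the reprex with a dedicated library, but I want to understand what is going on, so I would like to use xml2 and... this is where I bang my head against the wall.

The reason is that in the near future I may have to deal with complex XML files that are close to SDMX but not quite the same, which means I need to be able to do this by hand. Any suggestions are welcome. Thanks.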

library(tidyverse)
library(xml2)
library(rsdmx)

url <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/FDIINDEX/AUT+BEL.4+5+8+9+14.V.INDEX/all?startTime=1997&endTime=2019"

## Very easy if I resort to a dedicated library
sdmx <- readSDMX(url, isURL = TRUE)
stats <- as_tibble(sdmx)  ## and I have my nice tibble
print(stats)
#> # A tibble: 130 x 7
#>    LOCATION SECTOR RESTYPE SERIES TIME_FORMAT obsTime obsValue
#>    <chr>    <chr>  <chr>   <chr>  <chr>       <chr>      <dbl>
#>  1 AUT      4      V       INDEX  P1Y         1997           0
#>  2 AUT      4      V       INDEX  P1Y         2003           0
#>  3 AUT      4      V       INDEX  P1Y         2006           0
#>  4 AUT      4      V       INDEX  P1Y         2010           0
#>  5 AUT      4      V       INDEX  P1Y         2011           0
#>  6 AUT      4      V       INDEX  P1Y         2012           0
#>  7 AUT      4      V       INDEX  P1Y         2013           0
#>  8 AUT      4      V       INDEX  P1Y         2014           0
#>  9 AUT      4      V       INDEX  P1Y         2015           0
#> 10 AUT      4      V       INDEX  P1Y         2016           0
#> # … with 120 more rows

xmlobj <- read_xml(url)
## and then I do not know how to proceed...

Created on 2020-09-01 by the reprex package (v0.3.0)

You should learn XPath. I have added comments to the code to help you follow it:

library(xml2)
library(purrr)  # provides map_dfr() (which needs dplyr installed) and the %>% pipe

url <- "https://stats.oecd.org/restsdmx/sdmx.ashx/GetData/FDIINDEX/AUT+BEL.4+5+8+9+14.V.INDEX/all?startTime=1997&endTime=2019"

# The easiest way to address nodes in this file is to remove the namespaces with xml_ns_strip()
series <- read_xml(url) %>%
  xml_ns_strip() %>%
  xml_find_all("//DataSet/Series")  # find all Series nodes

data <- map_dfr(
  series,
  function(x) {
    data.frame(
      # For each Series node, take the first Value node whose 'concept' attribute
      # is 'LOCATION' and extract its 'value' attribute
      LOCATION = x %>% xml_find_first(".//Value[@concept='LOCATION']") %>% xml_attr("value"),
      SECTOR = x %>% xml_find_first(".//Value[@concept='SECTOR']") %>% xml_attr("value"),
      RESTYPE = x %>% xml_find_first(".//Value[@concept='RESTYPE']") %>% xml_attr("value"),
      SERIES = x %>% xml_find_first(".//Value[@concept='SERIES']") %>% xml_attr("value"),
      TIME_FORMAT = x %>% xml_find_first(".//Value[@concept='TIME_FORMAT']") %>% xml_attr("value"),
      # One row per Obs node; the length-one key columns above are recycled across these rows
      Time = x %>% xml_find_all(".//Obs/Time") %>% xml_text(trim = TRUE) %>% as.integer(),
      ObsValue = x %>% xml_find_all(".//Obs/ObsValue") %>% xml_attr("value") %>% as.numeric()
    )
  }
)
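
Since the question mentions files that are close to SDMX but not identical, here is a minimal sketch of two variations on the approach above: keeping the namespaces instead of stripping them, and reading whatever dimensions appear in the series key rather than hard-coding their names. The inline XML, its namespace URI, the `SeriesKey` wrapper and the helper name `series_to_df` are all made up for illustration and are not taken from the OECD response; the sketch also assumes every series carries the same dimensions and has at least one observation.

```r
library(xml2)

# Hypothetical, hand-written XML that only mimics the shape of an SDMX-like file.
# The namespace URI and element layout are assumptions, not the real OECD message.
doc <- read_xml('
<MessageGroup xmlns="http://example.org/generic">
  <DataSet>
    <Series>
      <SeriesKey>
        <Value concept="LOCATION" value="AUT"/>
        <Value concept="SECTOR" value="4"/>
      </SeriesKey>
      <Obs><Time>1997</Time><ObsValue value="0.1"/></Obs>
      <Obs><Time>2003</Time><ObsValue value="0.2"/></Obs>
    </Series>
    <Series>
      <SeriesKey>
        <Value concept="LOCATION" value="BEL"/>
        <Value concept="SECTOR" value="5"/>
      </SeriesKey>
      <Obs><Time>1997</Time><ObsValue value="0.3"/></Obs>
    </Series>
  </DataSet>
</MessageGroup>')

# xml2 gives a default (unprefixed) namespace the prefix "d1",
# which must then be used in every XPath step.
ns <- xml_ns(doc)

# Hypothetical helper: turn one Series node into a data frame.
series_to_df <- function(s, ns) {
  # Collect every key dimension as a named list: concept -> value
  key <- xml_find_all(s, "./d1:SeriesKey/d1:Value", ns)
  dims <- setNames(as.list(xml_attr(key, "value")), xml_attr(key, "concept"))

  # Querying relative to each Obs keeps time and value aligned;
  # an Obs without a matching child yields NA instead of shifting rows.
  obs <- xml_find_all(s, "./d1:Obs", ns)
  data.frame(
    dims,
    obsTime = xml_text(xml_find_first(obs, "./d1:Time", ns)),
    obsValue = as.numeric(xml_attr(xml_find_first(obs, "./d1:ObsValue", ns), "value")),
    stringsAsFactors = FALSE
  )
}

all_series <- xml_find_all(doc, ".//d1:Series", ns)
result <- do.call(rbind, lapply(all_series, series_to_df, ns = ns))
print(result)
```

If the series in a real file do not all share the same dimensions, `rbind()` will fail on the mismatched columns; in that case something like `dplyr::bind_rows()` may be a better fit, since it fills missing columns with `NA`.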
