Parsing a multi-level JSON file



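I have a good understanding of R, but I am new to the JSON file type and to best practices for parsing it. I am having difficulty building a data frame from a raw JSON file. The JSON file (data below) consists of repeated-measures data, with multiple observations per user.

When the raw file is read into R with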

jdata<-read_json("./raw.json")

it comes in as a "List of 1", and that list is user_id. Within each user_id there are further lists, like so:

jdata$user_id$`sjohnson`$date$`2020-09-25`$city

The last position actually splits into two options, $city or $zip. At the highest level, there are about 89 users in the entire file.

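My goal is to end up with a rectangular data frame, or with multiple data frames that I can merge together, like the one below. I don't actually need the zip code.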

Example table

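I have tried jsonlite and the tidyverse, and the furthest I seem to get is a data frame with a single variable at the lowest level, with city and zip in alternating rows, using this: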

df  <-  as.data.frame(matrix(unlist(jdata), nrow=length(unlist(jdata["users"]))))

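Any help or advice on getting closer to the table above would be greatly appreciated. I have a feeling that I am failing to loop over it at the different levels.

Here is an example of the JSON file structure: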

{
"user_id": {
"sjohnson": {
"date": {
"2020-09-25": {
"city": "Denver",
"zip": "80014"
},
"2020-10-01": {
"city": "Atlanta",
"zip": "30301"
},
"2020-11-04": {
"city": "Jacksonville",
"zip": "14001"
}
}
},
"asmith": {
"date": {
"2020-10-16": {
"city": "Cleavland",
"zip": "34321"
},
"2020-11-10": {
"city": "Elmhurst",
"zip": "00013"
},
"2020-11-10 08:49:36": {
"location": null,
"timestamp": 1605016176013
}
}
}
}
}

Another (direct) solution using rrapply() from the rrapply package:

library(rrapply)
library(dplyr)
rrapply(jdata, how = "melt") %>%
  filter(L5 == "city") %>%
  select(user_id = L2, date = L4, city = value)
#>    user_id       date         city
#> 1 sjohnson 2020-09-25       Denver
#> 2 sjohnson 2020-10-01      Atlanta
#> 3 sjohnson 2020-11-04 Jacksonville
#> 4   asmith 2020-10-16    Cleavland
#> 5   asmith 2020-11-10     Elmhurst

Data
jdata <- jsonlite::fromJSON('{
"user_id": {
"sjohnson": {
"date": {
"2020-09-25": {
"city": "Denver",
"zip": "80014"
},
"2020-10-01": {
"city": "Atlanta",
"zip": "30301"
},
"2020-11-04": {
"city": "Jacksonville",
"zip": "14001"
}
}
},
"asmith": {
"date": {
"2020-10-16": {
"city": "Cleavland",
"zip": "34321"
},
"2020-11-10": {
"city": "Elmhurst",
"zip": "00013"
},
"2020-11-10 08:49:36": {
"location": null,
"timestamp": 1605016176013
}
}
}
}
}')

We can build the structure we want step by step:

library(jsonlite)
library(tidyverse)
df <- fromJSON('{
"user_id": {
"sjohnson": {
"date": {
"2020-09-25": {
"city": "Denver",
"zip": "80014"
},
"2020-10-01": {
"city": "Atlanta",
"zip": "30301"
},
"2020-11-04": {
"city": "Jacksonville",
"zip": "14001"
}
}
},
"asmith": {
"date": {
"2020-10-16": {
"city": "Cleavland",
"zip": "34321"
},
"2020-11-10": {
"city": "Elmhurst",
"zip": "00013"
},
"2020-11-10 08:49:36": {
"location": null,
"timestamp": 1605016176013
}
}
}
}
}')
df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(city = unlist(value)) %>%
  filter(var == 'city') %>%
  select(-var, -value)

which gives:

# A tibble: 5 x 3
user_id  date       city        
<chr>    <chr>      <chr>       
1 sjohnson 2020-09-25 Denver      
2 sjohnson 2020-10-01 Atlanta     
3 sjohnson 2020-11-04 Jacksonville
4 asmith   2020-10-16 Cleavland   
5 asmith   2020-11-10 Elmhurst

Another solution, inspired by @Greg, in which we change the last few lines:

df %>%
  bind_rows() %>%
  pivot_longer(everything(), names_to = 'user_id') %>%
  unnest_longer(value, indices_to = 'date') %>%
  unnest_longer(value, indices_to = 'var') %>%
  mutate(value = unlist(value)) %>%
  pivot_wider(names_from = "var") %>%
  select(user_id, date, city)

This gives almost the same result, except for one extra row where city is NA:

# A tibble: 6 x 3
user_id  date                city        
<chr>    <chr>               <chr>       
1 sjohnson 2020-09-25          Denver      
2 sjohnson 2020-10-01          Atlanta     
3 sjohnson 2020-11-04          Jacksonville
4 asmith   2020-10-16          Cleavland   
5 asmith   2020-11-10          Elmhurst    
6 asmith   2020-11-10 08:49:36 NA    
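
For comparison, the same user, date, city walk can also be written with explicit loops in base R. This is only a minimal sketch, not part of the answers above: the small `jdata` list is a hypothetical stand-in for the parsed file, and `users`, `rows` and `flat` are illustrative names. Entries without a `city` field are kept as NA, as in the `pivot_wider()` variant.

```r
# Hypothetical miniature of the parsed list, mirroring the sample JSON.
jdata <- list(
  user_id = list(
    sjohnson = list(date = list(
      "2020-09-25" = list(city = "Denver", zip = "80014"),
      "2020-10-01" = list(city = "Atlanta", zip = "30301")
    )),
    asmith = list(date = list(
      "2020-10-16" = list(city = "Cleavland", zip = "34321"),
      "2020-11-10 08:49:36" = list(location = NULL, timestamp = 1605016176013)
    ))
  )
)

users <- jdata$user_id

# Outer loop: one small data frame per user.
rows <- lapply(names(users), function(u) {
  dates <- users[[u]]$date
  # Inner loop: pull the city for each date, or NA when it is absent.
  city <- vapply(
    dates,
    function(d) if (is.null(d[["city"]])) NA_character_ else d[["city"]],
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(user_id = u, date = names(dates), city = city,
             stringsAsFactors = FALSE)
})

# Stack the per-user pieces into one rectangular data frame.
flat <- do.call(rbind, rows)
rownames(flat) <- NULL
flat
```

Dropping the NA rows afterwards (for example with `flat[!is.na(flat$city), ]`) would match the output of the first pipeline instead.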

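Here is a custom function unnestable(), built on the tidyverse, that is designed to recursively unnest the contents of a list like the one you describe. See Details for the format of such a list and of the resulting table.

Solution

First, ensure the necessary libraries are present: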

library(jsonlite)
library(tidyverse)

Then define the unnestable() function as follows:

unnestable <- function(v) {
  # If we've reached the bottommost list, simply treat it as a table...
  if(all(sapply(
    X = v,
    # Check that each element is a single value (or NULL).
    FUN = function(x) {
      is.null(x) || purrr::is_scalar_atomic(x)
    },
    simplify = TRUE
  ))) {
    v %>%
      # Replace any NULLs with NAs to preserve blank fields...
      sapply(
        FUN = function(x) {
          if(is.null(x))
            NA
          else
            x
        },
        simplify = FALSE
      ) %>%
      # ...and convert this bottommost list into a table.
      tidyr::as_tibble()
  }
  # ...but if this list contains another nested list, then recursively unnest its
  # contents and combine their tabular results.
  else if(purrr::is_scalar_list(v)) {
    # Take the contents within the nested list...
    v[[1]] %>%
      # ...apply this 'unnestable()' function to them recursively...
      sapply(
        FUN = unnestable,
        simplify = FALSE,
        USE.NAMES = TRUE
      ) %>%
      # ...and stack their results.
      dplyr::bind_rows(.id = names(v)[1])
  }
  # Otherwise, the format is unrecognized and yields no results.
  else {
    NULL
  }
}
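
To make the bottommost-list branch concrete, here is a minimal sketch of that step in isolation. The `leaf` list is a hypothetical stand-in for one innermost record (the entry with a null `location`), and `leaf_filled` and `leaf_tbl` are illustrative names. It assumes the tibble package is installed.

```r
library(tibble)

# A hypothetical innermost record: one NULL field and one scalar field.
leaf <- list(location = NULL, timestamp = 1605016176013)

# Replace NULLs with NA so that every field has exactly one value.
leaf_filled <- sapply(
  leaf,
  FUN = function(x) if (is.null(x)) NA else x,
  simplify = FALSE
)

# A list of length-one values converts to a one-row tibble.
leaf_tbl <- as_tibble(leaf_filled)
leaf_tbl
```

Each such one-row tibble becomes one row of the final table once the recursive branch stacks the results.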

Finally, process the JSON data as follows:

# Read the JSON file into an R list.
jdata <- jsonlite::read_json("./raw.json")

# Flatten the R list into a table, via 'unnestable()'
flat_data <- unnestable(jdata)

# View the raw table.
flat_data

Of course, you can reformat this table as needed:

library(lubridate)
flat_data <- flat_data %>%
  dplyr::transmute(
    user_id = as.character(user_id),
    date = lubridate::as_datetime(date),
    city = as.character(city)
  ) %>%
  dplyr::distinct()

# View the reformatted table.
flat_data

Results

Given a raw.json file like the one sampled here:

{
"user_id": {
"sjohnson": {
"date": {
"2020-09-25": {
"city": "Denver",
"zip": "80014"
},
"2020-10-01": {
"city": "Atlanta",
"zip": "30301"
},
"2020-11-04": {
"city": "Jacksonville",
"zip": "14001"
}
}
},
"asmith": {
"date": {
"2020-10-16": {
"city": "Cleavland",
"zip": "34321"
},
"2020-11-10": {
"city": "Elmhurst",
"zip": "00013"
},
"2020-11-10 08:49:36": {
"location": null,
"timestamp": 1605016176013
}
}
}
}
}

unnestable() will produce a tibble like the following:

# A tibble: 6 x 6
user_id  date                city         zip   location     timestamp
<chr>    <chr>               <chr>        <chr> <lgl>            <dbl>
1 sjohnson 2020-09-25          Denver       80014 NA                  NA
2 sjohnson 2020-10-01          Atlanta      30301 NA                  NA
3 sjohnson 2020-11-04          Jacksonville 14001 NA                  NA
4 asmith   2020-10-16          Cleavland    34321 NA                  NA
5 asmith   2020-11-10          Elmhurst     00013 NA                  NA
6 asmith   2020-11-10 08:49:36 NA           NA    NA       1605016176013

which dplyr will format into the following result:

# A tibble: 6 x 3
user_id  date                city        
<chr>    <dttm>              <chr>       
1 sjohnson 2020-09-25 00:00:00 Denver      
2 sjohnson 2020-10-01 00:00:00 Atlanta     
3 sjohnson 2020-11-04 00:00:00 Jacksonville
4 asmith   2020-10-16 00:00:00 Cleavland   
5 asmith   2020-11-10 00:00:00 Elmhurst    
6 asmith   2020-11-10 08:49:36 NA          

Details

List format

Precisely, the list represents nested groupings by the fields {group_1, group_2, …, group_n}, and it must be of the form:

list(
  group_1 = list(
    "value_1" = list(
      group_2 = list(
        "value_1.1" = list(
          # .
          #  .
          #   .
          group_n = list(
            "value_1.1.….n.1" = list(
              field_a =    1,
              field_b = TRUE
            ),
            "value_1.1.….n.2" = list(
              field_a =   2,
              field_c = "2"
            )
            # ...
          )
        ),
        "value_1.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    ),
    "value_2" = list(
      group_2 = list(
        "value_2.1" = list(
          # .
          #  .
          #   .
          group_n = list(
            "value_2.1.….n.1" = list(
              field_a =   3,
              field_d = 3.0
            )
            # ...
          )
        ),
        "value_2.2" = list(
          # .
          #  .
          #   .
        )
        # ...
      )
    )
    # ...
  )
)

Table format

Given a list of this form, unnestable() will flatten it into a table of the following form:

# A tibble: … x …
group_1 group_2   ... group_n         field_a field_b field_c field_d
<chr>   <chr>     ... <chr>             <dbl> <lgl>   <chr>     <dbl>
1 value_1 value_1.1 ... value_1.1.….n.1       1 TRUE    NA           NA
2 value_1 value_1.1 ... value_1.1.….n.2       2 NA      2            NA
3 value_1 value_1.2 ... value_1.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
j value_2 value_2.1 ... value_2.1.….n.1       3 NA      NA            3
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
k value_2 value_2.2 ... value_2.2.….n.1     ... ...     ...         ...
⋮    ⋮         ⋮                 ⋮              ⋮  ⋮       ⋮             ⋮
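
The NA cells in this table come from the stacking step: records that lack a field are padded when the pieces are combined. Here is a minimal sketch of that behavior, using two hypothetical one-row tibbles (`parts`) in place of the results of the recursive calls. It assumes dplyr and tibble are installed.

```r
library(dplyr)

# Two hypothetical innermost records with partly different fields.
parts <- list(
  "value_1" = tibble::tibble(field_a = 1, field_b = TRUE),
  "value_2" = tibble::tibble(field_a = 3, field_d = 3.0)
)

# Stack them; the list names become the 'group_n' column and any
# field missing from a record is filled with NA.
stacked <- bind_rows(parts, .id = "group_n")
stacked
```

Because `.id` turns the list names into a column at each level of the recursion, every grouping level of the nested list ends up as its own column in the final table.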
