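R - How to flatten a large, complex, deeply nested JSON file into multiple CSV files with a linking identifier

I have a complex JSON file (~8GB) containing publicly available business data. We have decided to split the file into multiple CSV files (or tabs in a .xlsx) so that clients can easily consume the data. These files will be linked by the NZBN column/key.

I'm using R and jsonlite to read in a small sample (before scaling up to the full file). I'm guessing I need some way to specify which keys/columns go into each file (i.e., the first file will have the headers australianBusinessNumber, australianCompanyNumber, australianServiceAddress; the second file will have the headers annualReturnFilingMonth, annualReturnLastFiled, countryOfOrigin...).

Here is a sample of two businesses/entities (I have also altered some of the data, so please ignore the actual values): test file

I've read almost every post on S/O about similar questions, but none has got me anywhere. I've tried variations of purrr, the *apply commands, custom flattening functions and jqr (the R version of "jq"; it looks promising, but I can't seem to get it to run).

Here is one attempt at creating my separate files, but I'm unsure how to include the linking identifier (nzbn), and I keep running into further nested lists (I'm unsure how many levels of nesting there are).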

bulk <- jsonlite::fromJSON("bd_test.json")
coreEntity <- data.frame(bulk$companies)
coreEntity <- coreEntity[,sapply(coreEntity, is.list)==FALSE] 
company <- bulk$companies$entity$company
company <- purrr::reduce(company, dplyr::bind_rows)
shareholding <- company$shareholding
shareholding <- purrr::reduce(shareholding, dplyr::bind_rows)
shareAllocation <- shareholding$shareAllocation
shareAllocation <- purrr::reduce(shareAllocation, dplyr::bind_rows)

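I'm unsure whether it's easier to split the files during the flattening/wrangling process, or to flatten the whole file completely so that I have just one row per business/entity (and then gather the columns as needed). My only concern is that I need to scale this to about 1.3 million nodes (the 8GB JSON file).

Ideally I would like the CSV files to be split every time there is a new collection, with the values in that collection becoming the columns of the new CSV/tab.

Any help or tips would be greatly appreciated.

------- UPDATE -------

Updating because my question was a little vague: I think all I need is some code to produce one of the CSVs/tabs, which I can then replicate for the other collections.

Say, for example, I wanted to create a CSV of the following elements:

entityName (unique linking identifier)
nzbn (unique linking identifier)
emailAddress__uniqueIdentifier
emailAddress__emailAddress
emailAddress__emailPurpose
emailAddress__emailPurposeDescription
emailAddress__startDate

How would I go about it?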

"I'm unsure how many levels of nesting there are"

This will provide the answer quite efficiently:

jq '
  def max(s): reduce s as $s (null; 
    if . == null then $s elif $s > . then $s else . end);
   max(paths|length)' input.json

(With the test file, the answer is 14.)
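For readers without jq to hand, the same depth measurement can be sketched in Python. This is only an illustration, not part of the jq solution: the helper name `max_depth` and the tiny sample are made up, the sample's shape is an assumption taken from the `.companies.entity[]` filters used in this answer, and loading everything with `json.loads` will not scale to the full 8GB file.

```python
import json

def max_depth(node, depth=0):
    """Return the length of the longest path from the root to any element."""
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return depth  # scalar leaf: its path length is the current depth
    # An empty container has no children, so it counts at its own depth.
    return max((max_depth(child, depth + 1) for child in children), default=depth)

# Hypothetical miniature of the test file.
sample = json.loads(
    '{"companies": {"entity": [{"nzbn": "1",'
    ' "emailAddress": [{"emailAddress": "a@b.c"}]}]}}'
)
print(max_depth(sample))  # prints 6
```

The longest path here is companies, entity, 0, emailAddress, 0, emailAddress, which is what `max(paths|length)` counts as well.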

To get an overall view (schema) of the data, you could run:

 jq 'include "schema"; schema' input.json

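schema.jq is available at this gist. It will produce a structural schema.

"Say, for example, I wanted to create a CSV of the following elements:"

Here is a jq solution, apart from the headers: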

.companies.entity[]
| [.entityName, .nzbn]
  + (.emailAddress[] | [.uniqueIdentifier, .emailAddress, .emailPurpose, .emailPurposeDescription, .startDate])
| @csv
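As a rough cross-check of what that filter emits, here is a minimal Python sketch of the same join between the linking keys and each email record. It assumes the `{"companies": {"entity": [...]}}` shape implied by the filter; the names `email_rows` and `to_csv` and every value in the sample are hypothetical. Unlike the jq filter, it also writes a header row, using the column names requested in the question.

```python
import csv
import io

EMAIL_FIELDS = ["uniqueIdentifier", "emailAddress", "emailPurpose",
                "emailPurposeDescription", "startDate"]

def email_rows(data):
    """Yield one row per email record, prefixed with entityName and nzbn."""
    for entity in data["companies"]["entity"]:
        keys = [entity.get("entityName"), entity.get("nzbn")]
        for email in entity.get("emailAddress", []):
            yield keys + [email.get(field) for field in EMAIL_FIELDS]

def to_csv(data):
    """Return the email table as CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["entityName", "nzbn"]
                    + ["emailAddress__" + field for field in EMAIL_FIELDS])
    writer.writerows(email_rows(data))
    return buffer.getvalue()

# Made-up sample data, not taken from the test file.
sample = {"companies": {"entity": [{
    "entityName": "Example Ltd",
    "nzbn": "9429000000000",
    "emailAddress": [{
        "uniqueIdentifier": "1",
        "emailAddress": "info@example.com",
        "emailPurpose": "EMAIL",
        "emailPurposeDescription": "Email",
        "startDate": "2020-01-01",
    }],
}]}}
print(to_csv(sample))
```

Because entityName and nzbn are repeated on every row, the resulting file can be linked back to any other table built the same way.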

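Shareholdings

The shareholding data is complex, so in the following I have used the to_table function defined elsewhere on this page.

The sample data does not include a "company name" field, so in the following I have added a 0-based "company index" field: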

  .companies.entity[]
  | [.entityName, .nzbn] as $ix
  | .company
  | range(0;length) as $cix
  | .[$cix]
  | $ix + [$cix] + (.shareholding[] | to_table(false))

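jqr

The above solutions use the standalone jq executable, but all being well, it should be trivial to use the same filters with jqr. To use jq's include, however, it may be simplest to specify the path explicitly, for example: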

include "schema" {search: "~/.jq"};

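If the input JSON is sufficiently regular, you might find the following flattening function helpful, especially as it can emit a header in the form of an array of strings based on the "paths" to the leaf elements of the input, which can be arbitrarily nested: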

# to_table produces a flat array.
# If hdr == true, then ONLY emit a header line (in prettified form, i.e. as an array of strings);
# if hdr is an array, it should be the prettified form and is used to check consistency.
def to_table(hdr):
  def prettify: map( (map(tostring)|join(":") ));
  def composite: type == "object" or type == "array";
  def check:
     select(hdr|type == "array") 
     | if prettify == hdr then empty
       else error("expected head is \(hdr) but imputed header is \(.)")
       end ;
  . as $in
  | [paths(composite|not)]           # the paths in array-of-array form
  | if hdr==true then prettify
    else check, map(. as $p | $in | getpath($p))
    end;

For example, to produce the desired table (without headers) for .emailAddress, one could write:

.companies.entity[]
| [.entityName, .nzbn] as $ix
| $ix + (.emailAddress[] | to_table(false))
| @tsv

(Adding the headers and checking for consistency are left as an exercise for now, but are dealt with below.)
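To make the idea behind to_table concrete outside jq, here is a small Python sketch of the same leaf-path flattening. It is an approximation rather than a port: the names `leaf_paths`, `header_of` and `row_of` are invented for this example, and the consistency check is split into an optional argument instead of being driven by one overloaded parameter.

```python
def leaf_paths(node, prefix=()):
    """Yield (path, value) for every scalar leaf, in document order."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        yield prefix, node
        return
    for key, child in items:
        yield from leaf_paths(child, prefix + (key,))

def header_of(node):
    """Return the prettified header: path components joined with ':'."""
    return [":".join(str(part) for part in path) for path, _ in leaf_paths(node)]

def row_of(node, expected_header=None):
    """Return the leaf values; optionally verify the header first."""
    if expected_header is not None and header_of(node) != expected_header:
        raise ValueError(
            f"expected header is {expected_header} but imputed header is {header_of(node)}"
        )
    return [value for _, value in leaf_paths(node)]

# Made-up record with one nested level.
record = {"uniqueIdentifier": "1", "address": {"line": ["a", "b"]}}
print(header_of(record))  # ['uniqueIdentifier', 'address:line:0', 'address:line:1']
print(row_of(record))     # ['1', 'a', 'b']
```

As with to_table, the header is derived from the record itself, so two records produce comparable rows only when their leaf paths match exactly.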

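Generating multiple files

More interestingly, you could select the level you want and automatically produce multiple tables. One efficient way to partition the output into separate files is to use awk. For example, you could pipe the output obtained using this jq filter: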

["entityName", "nzbn"] as $common
| .companies.entity[]
| [.entityName, .nzbn] as $ix
| (to_entries[] | select(.value | type == "array") | .key) as $key
| ($ix + [$key] | join("-")) as $filename
| (.[$key][0]|to_table(true)) as $header
# First emit the line giving all the headers:
| $filename, ($common + $header | @tsv),
# Then emit the rows of the table:
  (.[$key][]
   | ($filename,  ($ix + to_table(false) | @tsv)))

Run the filter with jq -r (so that the strings are emitted raw) and pipe its output into awk:

awk -F'\t' 'fn {print >> fn; fn=0;next} {fn=$1".tsv"}'

This will produce the headers in each file; if you want consistency checking, change to_table(false) to to_table($header).
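A variation on the same idea, sketched in Python: instead of one file per entity and collection, this writes one TSV per collection name, which is closer to the "one CSV per collection, linked by nzbn" layout the question asks for. Everything here is an assumption for illustration: the function `split_tables`, the sample values, and the `{"companies": {"entity": [...]}}` shape. The strict header comparison plays the role of to_table($header) and may be too strict for real data with optional fields.

```python
import csv
import os
import tempfile

COMMON = ["entityName", "nzbn"]  # linking columns repeated in every table

def leaf_paths(node, prefix=()):
    """Yield (path, value) for every scalar leaf, in document order."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        yield prefix, node
        return
    for key, child in items:
        yield from leaf_paths(child, prefix + (key,))

def split_tables(data, out_dir):
    """Write <collection>.tsv for each array-valued key; return the names written."""
    writers = {}  # collection name -> (file handle, csv writer, header)
    try:
        for entity in data["companies"]["entity"]:
            ids = [entity.get(name) for name in COMMON]
            for key, records in entity.items():
                if not isinstance(records, list):
                    continue
                for record in records:
                    pairs = list(leaf_paths(record))
                    header = [":".join(map(str, path)) for path, _ in pairs]
                    if key not in writers:
                        # First record of this collection: open the file, emit the header.
                        handle = open(os.path.join(out_dir, key + ".tsv"), "w", newline="")
                        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
                        writer.writerow(COMMON + header)
                        writers[key] = (handle, writer, header)
                    elif header != writers[key][2]:
                        raise ValueError(f"{key}: header {header} differs from {writers[key][2]}")
                    writers[key][1].writerow(ids + [value for _, value in pairs])
    finally:
        for handle, _, _ in writers.values():
            handle.close()
    return sorted(writers)

# Made-up sample data.
sample = {"companies": {"entity": [
    {"entityName": "Example Ltd", "nzbn": "9429000000001",
     "emailAddress": [{"uniqueIdentifier": "1", "emailAddress": "info@example.com"}],
     "phoneNumber": [{"uniqueIdentifier": "7", "phoneNumber": "555 0100"}]},
    {"entityName": "Sample Co", "nzbn": "9429000000002",
     "emailAddress": [{"uniqueIdentifier": "2", "emailAddress": "hello@sample.test"}]},
]}}

with tempfile.TemporaryDirectory() as out_dir:
    print(split_tables(sample, out_dir))  # ['emailAddress', 'phoneNumber']
```

The sketch processes an already-parsed document, so for the full 8GB file it would need a streaming parser in front of it; that part is not shown.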
