Automating the normalization of fractal-like nested JSON



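Question:

I have 100+ JSONs with a fractal-like structure: lists of dicts nested inside one another. The width and depth of the structure vary from one JSON to the next. Each label is one part of a sentence.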

test = [
    {
        "label": "I",
        "children": [
            {
                "label": "want",
                "children": [
                    {
                        "label": "a",
                        "children": [
                            {"label": "coffee"},
                            {"label": "big", "children": [{"label": "piece of cake"}]},
                        ],
                    }
                ],
            },
            {"label": "need", "children": [{"label": "time"}]},
            {
                "label": "like",
                "children": [{"label": "italian", "children": [{"label": "pizza"}]}],
            },
        ],
    },
    {
        "label": "We",
        "children": [
            {"label": "are", "children": [{"label": "ok"}]},
            {"label": "will", "children": [{"label": "rock you"}]},
        ],
    },
]

I would like to automate the normalization of these JSONs to get this kind of output:

sentences = [
'I want a coffee', 
'I want a big piece of cake', 
'I need time', 
'I like italian pizza', 
'We are ok',
'We will rock you',
] 

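In effect, it works like a "path" function: each sentence is the sequence of labels along one path through the structure.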

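What I have tried:

  • pandas.json_normalize, but it needs the meta and record_path parameters to be predefined in order to handle a complex structure;

  • jsonpath_ng with parse('[*]..label'), but I could not find a way to solve the problem with it;

  • a flatten function like the one from this post, which gives: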

{'0label': 'I',
'0children_0label': 'want',
'0children_0children_0label': 'a',
'0children_0children_0children_0label': 'coffee',
'0children_0children_0children_1label': 'big',
'0children_0children_0children_1children_0label': 'piece of cake',
'0children_1label': 'need',
'0children_1children_0label': 'time',
'0children_2label': 'like',
'0children_2children_0label': 'italian',
'0children_2children_0children_0label': 'pizza',
'1label': 'We',
'1children_0label': 'are',
'1children_0children_0label': 'ok',
'1children_1label': 'will',
'1children_1children_0label': 'rock you'}

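I tried splitting the keys to identify the hierarchy, but I ran into an indexing problem. For example, I don't understand why a key like '1children_0label' contains '0label' rather than '1label', which is the index that should refer back to {'1label': 'We'}.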

  • a while loop that outputs a list of "levels", where each level is a list of tuples holding the number of children (the n+1 level) and the label of each item. This was meant to be a first step toward rebuilding the final output, but I could not get any further with it either:

import copy

levels = []
stack = copy.deepcopy(test)
while stack:
    idx = []       # (number of children, label) for each item on this level
    children = []  # items that make up the next level
    for d in stack:
        if 'children' in d:
            n = len(d['children'])
            children.extend(d['children'])
        else:
            n = 0
        idx.append((n, d['label']))
    stack = children
    levels.append(idx)
print(levels)

Output:

[[(3, 'I'), (2, 'We')], [(1, 'want'), (1, 'need'), (1, 'like'), (1, 'are'), (1, 'will')], [(2, 'a'), (0, 'time'), (1, 'italian'), (0, 'ok'), (0, 'rock you')], [(0, 'coffee'), (1, 'big'), (0, 'pizza')], [(0, 'piece of cake')]]

Any help would be appreciated.

Answer:

You can try recursion:

def get_sentences(o):
    if isinstance(o, dict):
        if "children" in o:
            # Prefix every sentence built from the children with this label.
            for item in get_sentences(o["children"]):
                yield o["label"] + " " + item
        else:
            # A dict without "children" is a leaf and ends the sentence.
            yield o["label"]
    elif isinstance(o, list):
        for v in o:
            yield from get_sentences(v)


print(list(get_sentences(test)))

Prints:

[
"I want a coffee",
"I want a big piece of cake",
"I need time",
"I like italian pizza",
"We are ok",
"We will rock you",
]
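
If some of the JSONs are nested deeply enough to hit Python's recursion limit, the same traversal can be written with an explicit stack. The following is only a sketch of that idea, not part of the answer above: the names iter_sentences, sep and sample are made up for illustration, and sample is a shortened copy of the data from the question.

```python
def iter_sentences(nodes, sep=" "):
    """Yield one sentence per path from a top-level item to a leaf, without recursion."""
    # Each stack entry pairs a node with the labels collected above it.
    # Pushing in reverse keeps the output in the same order as the input lists.
    stack = [(node, []) for node in reversed(nodes)]
    while stack:
        node, prefix = stack.pop()
        path = prefix + [node["label"]]
        children = node.get("children")
        if children:
            stack.extend((child, path) for child in reversed(children))
        else:
            # Assumption: a missing or empty "children" list marks a leaf.
            yield sep.join(path)


# Shortened, hypothetical sample based on the structure in the question.
sample = [
    {"label": "I", "children": [
        {"label": "want", "children": [
            {"label": "a", "children": [
                {"label": "coffee"},
                {"label": "big", "children": [{"label": "piece of cake"}]},
            ]},
        ]},
        {"label": "need", "children": [{"label": "time"}]},
    ]},
    {"label": "We", "children": [{"label": "are", "children": [{"label": "ok"}]}]},
]

print(list(iter_sentences(sample)))
```

One behavioral difference to be aware of: for an item whose "children" list is empty, this sketch yields the label collected so far, whereas the recursive version yields nothing for that item. Which of the two is right depends on how the real JSONs mark the end of a sentence.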
