Parsing and flattening complex JSON with Pydantic



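I need to consume JSON from a third-party API, i.e. I have to deal with whatever this API returns and cannot change it.

For this specific task the API returns what it calls an "entity". Yeah, that doesn't mean much. The problem is that the structure is deeply nested, and in my parsing I want to be able to flatten it to some degree. To illustrate, here is an obfuscated example of a single "entity". In the full response this sits in an array named "data", which can contain more than one entity.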

{
"type": "entity",
"id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
"links": {
"self": "https://example.com/api/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb"
},
"attributes": {
"id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
"eid": "efebcc3e-445c-4d85-9689-bb85f46160cb",
"name": "E03075-042",
"description": "",
"createdAt": "2021-07-14T05:58:47.239Z",
"editedAt": "2022-09-22T11:28:53.327Z",
"state": "open",
"fields": {
"Department": {
"value": "Foo"
},
"Description": {
"value": ""
},
"Division": {
"value": "Bar"
},
"Name": {
"value": "E03075-042"
},
"Project": {
"details": {
"description": ""
},
"value": "My Project"
}
}
},
"relationships": {
"createdBy": {
"links": {
"self": "https://example.com/api/rest/v1.0/users/101"
},
"data": {
"type": "user",
"id": "101"
}
},
"editedBy": {
"links": {
"self": "https://example.com/api/rest/v1.0/users/101"
},
"data": {
"type": "user",
"id": "101"
}
},
"ancestors": {
"links": {
"self": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/ancestors"
},
"data": [
{
"type": "entity",
"id": "7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f",
"meta": {
"links": {
"self": "https://example.com/api/rest/v1.0/entities/7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f"
}
}
}
]
},
"owner": {
"links": {
"self": "https://example.com/api/rest/v1.0/users/101"
},
"data": {
"type": "user",
"id": "101"
}
},
"pdf": {
"links": {
"self": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/pdf"
}
}
}
}

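I want to parse this into a data container. I am open to custom parsing and just using a dataclass instead of Pydantic if what I want is not possible with it.

Issues with the data: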

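  • links: the JSON uses self as the field name. I would like to unpack this and have a top-level field simply named link
  • attributes: unnest these as well and do not keep them inside an Attributes model
  • fields: unnest to top level and remove/ignore duplicates (name, description)
  • Project in fields: unnest to top level and only use the value field
  • relationships: unnest, ignore some, and maybe even resolve to actual user names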

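Can I control Pydantic in such a way that it unnests the data the way I prefer and ignores unmapped fields?

Can the parsing also include resolving, which would mean more API calls?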

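Pydantic provides root validators to perform validation on the data of an entire model. But in this case I am not sure it is a good idea to do it all in one giant validation function.

I would probably go with a two-stage parsing setup. The first model should capture the "raw" data, more or less in the schema you expect from the API. The second model should then reflect your own desired data schema.

That way, if you run into an error, you can easily pinpoint whether it came from an unexpected data format from the API or from a mistake in your flattening/parsing process.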
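To make the benefit of the two stages concrete, here is a minimal standard-library sketch. It is not the Pydantic setup itself; the class names RawThing and FlatThing and the two exception types are made up for illustration. The only point is that each stage fails with its own error type, so you can tell which side broke.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


class UnexpectedApiShape(Exception):
    """Stage 1 failed: the payload does not have the shape we expect."""


class FlatteningError(Exception):
    """Stage 2 failed: our own reshaping logic could not handle raw data."""


@dataclass
class RawThing:  # hypothetical stand-in for the "raw" model
    id: str
    links: dict[str, str]
    attributes: dict[str, Any]


@dataclass
class FlatThing:  # hypothetical stand-in for the target model
    id: str
    link: str
    name: str


def parse_raw(payload: dict[str, Any]) -> RawThing:
    # Stage 1: only check that the expected top-level keys are present.
    try:
        return RawThing(
            id=payload["id"],
            links=payload["links"],
            attributes=payload["attributes"],
        )
    except KeyError as exc:
        raise UnexpectedApiShape(f"missing key: {exc}") from exc


def flatten(raw: RawThing) -> FlatThing:
    # Stage 2: reshape data that already passed stage 1 into our own schema.
    try:
        return FlatThing(
            id=raw.id,
            link=raw.links["self"],
            name=raw.attributes["name"],
        )
    except KeyError as exc:
        raise FlatteningError(f"cannot flatten: {exc}") from exc


payload = {
    "id": "abc",
    "links": {"self": "https://example.com/api/v1.0/entities/abc"},
    "attributes": {"name": "E03075-042"},
}
flat = flatten(parse_raw(payload))
```

A payload without "links" raises UnexpectedApiShape, while a raw object whose links lack "self" raises FlatteningError, which is the distinction the two-stage setup is meant to give you.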

Here is an example.

Start by defining base classes for inheritance/less repetition:

from __future__ import annotations

from datetime import datetime
from enum import Enum
from typing import Any

from pydantic import AnyHttpUrl, BaseModel, Field, root_validator, validator
from pydantic.fields import ModelField, SHAPE_LIST


class StateEnum(Enum):
    open = "open"
    something_else = "something_else"


class BaseAttributes(BaseModel):
    id: str
    eid: str
    created_at: datetime = Field(alias="createdAt")
    edited_at: datetime = Field(alias="editedAt")
    state: StateEnum
    # some fields:
    name: str = Field(alias="Name")
    description: str = Field(alias="Description")

    class Config:
        allow_population_by_field_name = True


class RawRelationship(BaseModel):
    links: dict[str, AnyHttpUrl]
    data: dict[str, Any] | list[dict[str, Any]] | None = None


class BaseEntity(BaseModel):
    type: str
    id: str

# More code below...

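The state field just screams "choices", so I went with an enum, just as an idea. I also opted to use Python naming conventions, with the actual data key names as aliases.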
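As a small illustration of what the enum and the alias buy you, here is a sketch with a made-up Stamp model. The second enum member is hypothetical, since the question only shows "open".

```python
from datetime import datetime
from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class State(Enum):
    open = "open"
    closed = "closed"  # hypothetical member, not confirmed by the API


class Stamp(BaseModel):  # hypothetical model, for illustration only
    created_at: datetime = Field(alias="createdAt")
    state: State


# The input uses the API's key name; the attribute uses the Python name.
stamp = Stamp(**{"createdAt": "2021-07-14T05:58:47.239Z", "state": "open"})

# A value outside the enum is rejected during validation.
try:
    Stamp(**{"createdAt": "2021-07-14T05:58:47.239Z", "state": "archived"})
except ValidationError:
    rejected = True
else:
    rejected = False
```

After this runs, stamp.created_at is a datetime and stamp.state is State.open, so an unexpected state from the API surfaces as a validation error instead of slipping through as an arbitrary string.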

Now you can define the RawEntity model to capture the raw API output:

...

class RawAttributes(BaseAttributes):
    fields: dict[str, Any]


class RawEntity(BaseEntity):
    links: dict[str, AnyHttpUrl]
    attributes: RawAttributes
    relationships: dict[str, RawRelationship] = {}

    # skip_on_failure=True: only run this check if all fields validated,
    # otherwise "id" or "attributes" may be missing from values (KeyError).
    @root_validator(skip_on_failure=True)
    def ensure_consistency(cls, values: dict[str, Any]) -> dict[str, Any]:
        if values["id"] != values["attributes"].id:
            raise ValueError("id inconsistent")
        return values

# More code below...

This is a demonstration of a case where a root validator does make sense.
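To see that check fire, here is a reduced sketch with two made-up models, Inner and Outer, assuming the same v1-style root_validator decorator. Passing skip_on_failure=True is a choice of this sketch, so that the check only runs when both fields validated.

```python
from pydantic import BaseModel, ValidationError, root_validator


class Inner(BaseModel):  # hypothetical stand-in for the attributes model
    id: str


class Outer(BaseModel):  # hypothetical stand-in for the raw entity model
    id: str
    attributes: Inner

    @root_validator(skip_on_failure=True)
    def ids_must_match(cls, values):
        # Runs after field validation, so "attributes" is already an Inner.
        if values["id"] != values["attributes"].id:
            raise ValueError("id inconsistent")
        return values


# Matching ids pass.
ok = Outer(**{"id": "a", "attributes": {"id": "a"}})

# Mismatching ids are reported as a regular validation error.
try:
    Outer(**{"id": "a", "attributes": {"id": "b"}})
except ValidationError as exc:
    message = str(exc)
```

The check needs two fields at once, which is exactly what a per-field validator cannot see and why it belongs at the model level.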

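Finally, we can write the target model. We can give it a class method dedicated to parsing a RawEntity into a FlatEntity, which performs some of the flattening tasks. The field-specific handling can again be delegated to validators: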

...

SELF_KEY = "self"


class FlatRelationship(BaseEntity):
    link: AnyHttpUrl


class FlatEntity(BaseAttributes, BaseEntity):
    link: AnyHttpUrl
    # more fields:
    department: str = Field(alias="Department")
    division: str = Field(alias="Division")
    project: str = Field(alias="Project")
    # relationships:
    created_by: FlatRelationship = Field(alias="createdBy")
    edited_by: FlatRelationship = Field(alias="editedBy")
    ancestors: list[FlatRelationship]
    owner: FlatRelationship
    pdf: AnyHttpUrl

    @classmethod
    def from_raw_entity(cls, entity: RawEntity) -> FlatEntity:
        data = entity.dict(exclude={"links", "attributes", "relationships"})
        data["link"] = entity.links[SELF_KEY]
        data |= entity.attributes.dict(exclude={"fields"})
        data |= entity.attributes.fields
        data |= entity.relationships
        return cls.parse_obj(data)

    @validator(
        "name",
        "description",
        "department",
        "division",
        "project",
        pre=True,
    )
    def get_field_value(cls, value: object) -> object:
        if isinstance(value, dict):
            return value["value"]
        return value

    @validator("*", pre=True)
    def flatten_relationship(cls, value: object, field: ModelField) -> object:
        if field.type_ is not FlatRelationship:
            return value
        if not isinstance(value, RawRelationship):
            return value
        if isinstance(value.data, dict):
            return FlatRelationship(**value.data, link=value.links[SELF_KEY])
        if isinstance(value.data, list) and field.shape == SHAPE_LIST:
            return [
                FlatRelationship(**data, link=value.links[SELF_KEY])
                for data in value.data
            ]
        return value

    @validator("pdf", pre=True)
    def get_pdf_link(cls, value: object) -> object:
        if isinstance(value, RawRelationship):
            return value.links[SELF_KEY]
        return value

# More code below...

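As you can see, there is one validator for all the fields that were grouped under "fields" in the source data. There is another one for flattening the "relationships", which essentially turns them into instances of BaseEntity, but with a link field. And there is a separate one for the pdf field.

Note that each of these validators is configured with pre=True, because the incoming data will not be of the declared field type, so our custom validators need to do their job before the default Pydantic field validators kick in.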
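The effect of pre=True can be seen in isolation with a reduced sketch. The two models here are made up for illustration and assume the same v1-style validator decorator as above.

```python
from pydantic import BaseModel, ValidationError, validator


class WithPre(BaseModel):  # hypothetical model
    department: str

    @validator("department", pre=True)
    def unwrap_value(cls, value):
        # Runs before the built-in str validation, so it still sees the dict.
        if isinstance(value, dict):
            return value["value"]
        return value


class WithoutPre(BaseModel):  # hypothetical model with no custom validator
    department: str


# The wrapped and the plain form both end up as a plain string.
wrapped = WithPre(**{"department": {"value": "Foo"}})
plain = WithPre(**{"department": "Foo"})

# Without the pre-validator, the default str validation rejects the dict.
try:
    WithoutPre(**{"department": {"value": "Foo"}})
except ValidationError:
    dict_rejected = True
else:
    dict_rejected = False
```

The isinstance check is what lets the same model accept both the nested API form and already flattened data.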

With this setup, if we put your example data into a dictionary named EXAMPLE, we can test the parser like this:

...

if __name__ == "__main__":
    instance = RawEntity.parse_obj(EXAMPLE)
    parsed = FlatEntity.from_raw_entity(instance)
    print(parsed.json(indent=4))

The output:

{
"type": "entity",
"id": "efebcc3e-445c-4d85-9689-bb85f46160cb",
"eid": "efebcc3e-445c-4d85-9689-bb85f46160cb",
"created_at": "2021-07-14T05:58:47.239000+00:00",
"edited_at": "2022-09-22T11:28:53.327000+00:00",
"state": "open",
"name": "E03075-042",
"description": "",
"link": "https://example.com/api/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb",
"department": "Foo",
"division": "Bar",
"project": "My Project",
"created_by": {
"type": "user",
"id": "101",
"link": "https://example.com/api/rest/v1.0/users/101"
},
"edited_by": {
"type": "user",
"id": "101",
"link": "https://example.com/api/rest/v1.0/users/101"
},
"ancestors": [
{
"type": "entity",
"id": "7h60bcb9-b1c0-4a12-8b6b-12e3eab54e6f",
"link": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/ancestors"
}
],
"owner": {
"type": "user",
"id": "101",
"link": "https://example.com/api/rest/v1.0/users/101"
},
"pdf": "https://example.com/api/rest/v1.0/entities/efebcc3e-445c-4d85-9689-bb85f46160cb/pdf"
}

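I am sure you can tweak/optimize this further for your needs, but the general approach should be clear. By default, Pydantic models ignore unknown keys/fields when parsing data, so that should not be an issue.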
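A quick way to convince yourself of that default is a sketch like the following, with a made-up Slim model and made-up extra keys:

```python
from pydantic import BaseModel


class Slim(BaseModel):  # hypothetical model declaring only two fields
    type: str
    id: str


# "meta" and "unmapped" are not declared on the model and are dropped.
slim = Slim(**{"type": "user", "id": "101", "meta": {"links": {}}, "unmapped": 1})
```

The resulting instance only carries type and id, which is why the ancestors entries in the output above have no meta key even though the source data does.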

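As for your second question, I think that warrants a separate post on this site, but generally speaking I would not perform web requests during validation.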
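One way to keep requests out of validation is to resolve user ids in a separate step after parsing. The following is only a sketch under assumptions: resolve_user_names and fake_fetch are hypothetical names, the user name is invented, and the real lookup would be whatever HTTP call the API requires.

```python
from __future__ import annotations

from typing import Callable


def resolve_user_names(
    user_ids: list[str],
    fetch_name: Callable[[str], str],
) -> dict[str, str]:
    """Look up each distinct user id once, after parsing has finished."""
    names: dict[str, str] = {}
    for user_id in user_ids:
        if user_id not in names:
            names[user_id] = fetch_name(user_id)
    return names


# Stand-in for a real HTTP call; the directory content is made up.
FAKE_DIRECTORY = {"101": "alice"}
calls: list[str] = []


def fake_fetch(user_id: str) -> str:
    calls.append(user_id)
    return FAKE_DIRECTORY[user_id]


# created_by, edited_by and owner all point at user 101 in the example entity.
names = resolve_user_names(["101", "101", "101"], fake_fetch)
```

Because the lookup is passed in as a function, the parsing models stay free of network access and the resolution step can be tested without hitting the API.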
