我如何用serde_json建立一个有状态的流解析器?



我试图做一些有状态的JSON解析与serdeserde_json。我首先检查了如何将可以在Deserialize:: Deserialize()中访问的选项传递给Rust的服务器?,虽然我几乎得到了我需要的东西,但我似乎缺少了一些至关重要的东西。

我想做的是双重的:

  1. 我的JSON是非常大的-太大了,只是读取输入到内存-所以我需要流。(fww,它也有许多嵌套层,所以我需要使用disable_recursion_limit)
  2. 我需要一些有状态的处理,在那里我可以传递一些数据到序列化器,这将影响从输入JSON中保留的数据以及在序列化期间如何转换数据。

例如,我的输入可能是:

{ "documents": [
{ "foo": 1 },
{ "baz": true },
{ "bar": null }
],
"journal": { "timestamp": "2023-04-04T08:28:00" }
}

这里,'documents'数组中的每个对象都非常大,我只需要其中的一个子集。不幸的是,我需要首先找到键值对"documents",然后我需要访问该数组中的每个元素。现在,我不关心其他键值对(如"journal"),但这可能会改变。

我目前的方法如下:

use serde::de::DeserializeSeed;
use serde_json::Value;
/// A simplified state passed to and returned from the serialization.
#[derive(Debug, Default)]
struct Stats {
records_skipped: usize,
}
/// Models the input data; `Documents` is just a vector of JSON values,
/// but it is its own type to allow custom deserialization
#[derive(Debug)]
struct MyData {
documents: Vec<Value>,
journal: Value,
}
struct MyDataDeserializer<'a> {
state: &'a mut Stats,
}
/// Top-level seeded deserializer only so I can plumb the state through
impl<'de> DeserializeSeed<'de> for MyDataDeserializer<'_> {
type Value = MyData;
fn deserialize<D>(mut self, deserializer: D) -> Result<Self::Value, D::Error>
where
D: serde::Deserializer<'de>,
{
let visitor = MyDataVisitor(&mut self.state);
let docs = deserializer.deserialize_map(visitor)?;
Ok(docs)
}
}
struct MyDataVisitor<'a>(&'a mut Stats);
impl<'de> serde::de::Visitor<'de> for MyDataVisitor<'_> {
type Value = MyData;
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(formatter, "a map")
}
fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
where
A: serde::de::MapAccess<'de>,
{
let mut documents = Vec::new();
let mut journal = Value::Null;
while let Some(key) = map.next_key::<String>()? {
println!("Got key = {key}");
match &key[..] {
"documents" => {
// Not sure how to handle the next value in a streaming manner
documents = map.next_value()?;
}
"journal" => journal = map.next_value()?,
_ => panic!("Unexpected key '{key}'"),
}
}
Ok(MyData { documents, journal })
}
}
struct DocumentDeserializer<'a> {
state: &'a mut Stats,
}
impl<'de> DeserializeSeed<'de> for DocumentDeserializer<'_> {
type Value = Vec<Value>;
fn deserialize<D>(mut self, deserializer: D) -> Result<Self::Value, D::Error>
where
D: serde::Deserializer<'de>,
{
let visitor = DocumentVisitor(&mut self.state);
let documents = deserializer.deserialize_seq(visitor)?;
Ok(documents)
}
}
struct DocumentVisitor<'a>(&'a mut Stats);
impl<'de> serde::de::Visitor<'de> for DocumentVisitor<'_> {
type Value = Vec<Value>;
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(formatter, "a list")
}
fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
where
A: serde::de::SeqAccess<'de>,
{
let mut agg_map = serde_json::Map::new();
while let Some(item) = seq.next_element()? {
// If `item` isn't a JSON object, we'll skip it:
let Value::Object(map) = item else { continue };
// Get the first element, assuming we have some
let (k, v) = match map.into_iter().next() {
Some(kv) => kv,
None => continue,
};
// Ignore any null values; aggregate everything into a single map
if v == Value::Null {
self.0.records_skipped += 1;
continue;
} else {
println!("Keeping {k}={v}");
agg_map.insert(k, v);
}
}
let values = Value::Object(agg_map);
println!("Final value is {values}");
Ok(vec![values])
}
}
fn main() {
let fh = std::fs::File::open("input.json").unwrap();
let buf = std::io::BufReader::new(fh);
let read = serde_json::de::IoRead::new(buf);
let mut state = Stats::default();
let mut deserializer = serde_json::Deserializer::new(read);
let mydata = MyDataDeserializer { state: &mut state }
.deserialize(&mut deserializer)
.unwrap();
println!("{mydata:?}");
}

这段代码运行成功,并正确地反序列化了我的输入数据。问题是,我不能弄清楚如何流"文档"数组一个元素的时间。我不知道如何将documents = map.next_value()?;变成将状态传递给DocumentDeserializer的东西。它应该也许使用类似于:

let d = DocumentDeserializer { state: self.0 }
.deserialize(&mut map)
.unwrap();

.deserialize期望serde::Deserializer<'de>,但mapserde::de::MapAccess<'de>

无论如何,这整个事情似乎过于冗长,所以如果这不是普遍接受的或习惯的,我愿意采用另一种方法。正如链接问题中的OP所指出的那样,所有这些样板文件都令人讨厌。

你的问题研究得很透彻,所以我有点怀疑解决方案会这么简单,但你不只是想

"documents" => {
documents = map.next_value_seed(DocumentDeserializer { self.0 })?;
}

游乐场

(就我个人而言,我不会将这些东西命名为…Deserializer…Seed也许?)

相关内容

最新更新