I'm having trouble creating a large BSON file with Json.NET. I have the following test code:
    Imports System.IO
    Imports Newtonsoft.Json

    Public Class Region
        Public Property Id As Integer
        Public Property Name As String
        Public Property FDS_Id As String
    End Class

    Public Class Regions
        Inherits List(Of Region)

        Public Sub New(capacity As Integer)
            MyBase.New(capacity)
        End Sub
    End Class

    Module Module1
        Sub Main()
            Dim writeElapsed2 = CreateFileBson_Stream(GetRegionList(5000000))
            GC.Collect(0)
        End Sub

        ' Builds the in-memory test data. Returns Regions (not List(Of Region))
        ' so the result can be passed to the writers without a narrowing conversion.
        Public Function GetRegionList(count As Integer) As Regions
            Dim regions As New Regions(count)
            For lp = 0 To count - 1
                regions.Add(New Region With {.Id = lp, .Name = lp.ToString, .FDS_Id = lp.ToString})
            Next
            Return regions
        End Function

        Public Function CreateFileBson_Stream(regions As Regions) As Long
            Dim sw As New Stopwatch
            sw.Start()
            Dim lp = 0
            Using stream = New StreamWriter("c:atlasregionsStream.bson")
                Using writer = New Bson.BsonWriter(stream.BaseStream)
                    writer.WriteStartArray()
                    For Each item In regions
                        writer.WriteStartObject()
                        writer.WritePropertyName("Id")
                        writer.WriteValue(item.Id)
                        writer.WritePropertyName("Name")
                        writer.WriteValue(item.Name)
                        writer.WritePropertyName("FDS_Id")
                        writer.WriteValue(item.FDS_Id)
                        writer.WriteEndObject()
                        lp += 1
                        ' Flush everything every million records (this does not help).
                        If lp Mod 1000000 = 0 Then
                            writer.Flush()
                            stream.Flush()
                            stream.BaseStream.Flush()
                        End If
                    Next
                    writer.WriteEndArray()
                End Using
            End Using
            sw.Stop()
            Return sw.ElapsedMilliseconds
        End Function
    End Module
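In the first Using statement I also tried a FileStream instead of a StreamWriter; it made no difference.
CreateFileBson_Stream fails with an OutOfMemoryException at just over 3 million records. The memory profiler in Visual Studio shows memory climbing steadily, even though I'm flushing everything I can.
The list of 5 million regions takes up about 468 MB in memory.
Interestingly, if I use the following code to produce JSON instead, it works, and memory holds steady at around 500 MB: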
    Public Function CreateFileJson_Stream(regions As Regions) As Long
        Dim sw As New Stopwatch
        sw.Start()
        Using stream = New StreamWriter("c:atlasregionsStream.json")
            Using writer = New JsonTextWriter(stream)
                writer.WriteStartArray()
                For Each item In regions
                    writer.WriteStartObject()
                    writer.WritePropertyName("Id")
                    writer.WriteValue(item.Id)
                    writer.WritePropertyName("Name")
                    writer.WriteValue(item.Name)
                    writer.WritePropertyName("FDS_Id")
                    writer.WriteValue(item.FDS_Id)
                    writer.WriteEndObject()
                Next
                writer.WriteEndArray()
            End Using
        End Using
        sw.Stop()
        Return sw.ElapsedMilliseconds
    End Function
I'm fairly sure this is a problem with BsonWriter, but I can't see what else I can do. Any ideas?
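The reason you are running out of memory is as follows. According to the BSON specification, every object or array (called a document in the standard) must begin with a count of the total number of bytes comprising the document: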
    document ::= int32 e_list "\x00"        BSON Document. int32 is the total number of bytes comprising the document.
    e_list   ::= element e_list
               | ""
    element  ::= "\x01" e_name double       64-bit binary floating point
               | "\x02" e_name string       UTF-8 string
               | "\x03" e_name document     Embedded document
               | "\x04" e_name document     Array
               | ...
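So, when writing the root object or array, the total number of bytes to be written to the file must be known before anything is written.
Newtonsoft's BsonDataWriter and the underlying BsonBinaryWriter implement this by caching all the tokens to be written in a tree. Once the contents of the root token have been finalized, they recursively calculate the sizes and only then write the tree out. (The alternatives would have been to make the application, i.e. your code, somehow precalculate this information, which is practically impossible, or to seek back and forth in the output stream to write it, which is possible only for streams where Stream.CanSeek == true.) You are getting the OutOfMemoryException because your system does not have enough resources to hold the token tree.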
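To make the size dependency concrete, here is a minimal sketch in Python (chosen only because it is easy to run; it is not Json.NET code). The helper names are made up for illustration, and only the int32 element type ("\x10" in the BSON spec) is handled. It shows that the prefix of the root array cannot be produced until every child has been encoded:

```python
import struct

def encode_int32_element(name, value):
    # element ::= "\x10" e_name int32, where e_name is a NUL-terminated UTF-8 string
    return b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)

def encode_document(elements):
    body = b"".join(elements)
    # The size counts the 4-byte prefix itself, the elements, and the trailing NUL.
    total_size = 4 + len(body) + 1
    return struct.pack("<i", total_size) + body + b"\x00"

def encode_array_of_documents(docs):
    # A BSON array is a document whose keys are "0", "1", "2", ...
    elements = [
        b"\x03" + str(i).encode("ascii") + b"\x00" + doc
        for i, doc in enumerate(docs)
    ]
    return encode_document(elements)

region = encode_document([encode_int32_element("Id", 7)])
array = encode_array_of_documents([region, region])
print(len(region), len(array))
```

Note that `encode_array_of_documents` has to hold every encoded child before it can emit its first byte, which is essentially the token-tree caching described above.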
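By comparison, the JSON standard does not require byte counts or sizes to be written anywhere in the file. Thus JsonTextWriter can stream your serialized array contents out immediately, without needing to cache anything.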
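For contrast, a rough Python sketch of streaming a JSON array one element at a time. The function names are hypothetical, and this only mimics in spirit what a streaming text writer does; it is not how JsonTextWriter is implemented. Nothing written so far ever depends on what comes later, so memory use is bounded by a single item:

```python
import io
import json

def write_json_array(items, out):
    """Write items as a JSON array to a text stream, one element at a time."""
    out.write("[")
    for i, item in enumerate(items):
        if i:
            out.write(",")
        # Each element is serialized and written immediately; nothing is cached.
        out.write(json.dumps(item))
    out.write("]")

def regions(count):
    # Lazily generates test records shaped like the Region class in the question.
    for i in range(count):
        yield {"Id": i, "Name": str(i), "FDS_Id": str(i)}

buf = io.StringIO()
write_json_array(regions(3), buf)
print(buf.getvalue())
```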
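As a workaround, based on the BSON spec and BsonBinaryWriter, I have created a helper method that incrementally serializes an enumerable to a stream for which Stream.CanSeek == true. It does not need to cache the entire BSON document in memory; instead, it seeks back to the beginning of the stream to write the final byte count: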
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.IO;
    using System.Text;
    using Newtonsoft.Json;
    using Newtonsoft.Json.Bson;
    using Newtonsoft.Json.Serialization;

    public static partial class BsonExtensions
    {
        const int BufferSize = 256;

        public static void SerializeEnumerable<TItem>(IEnumerable<TItem> enumerable, Stream stream, JsonSerializerSettings settings = null)
        {
            // Created based on https://github.com/JamesNK/Newtonsoft.Json/blob/master/Src/Newtonsoft.Json/Bson/BsonBinaryWriter.cs
            // And http://bsonspec.org/spec.html
            if (enumerable == null || stream == null)
                throw new ArgumentNullException();
            if (!stream.CanSeek || !stream.CanWrite)
                throw new ArgumentException("!stream.CanSeek || !stream.CanWrite");

            var serializer = JsonSerializer.CreateDefault(settings);
            var contract = serializer.ContractResolver.ResolveContract(typeof(TItem));
            BsonType rootType;
            if (contract is JsonObjectContract || contract is JsonDictionaryContract)
                rootType = BsonType.Object;
            else if (contract is JsonArrayContract)
                rootType = BsonType.Array;
            else
                // Arrays of primitives are not implemented yet.
                throw new JsonSerializationException(string.Format("Item type \"{0}\" not implemented.", typeof(TItem)));

            stream.Flush(); // Just in case.
            var initialPosition = stream.Position;
            var buffer = new byte[BufferSize];

            WriteInt(stream, (int)0, buffer); // Placeholder; the real size is written at the end.

            ulong index = 0;
            foreach (var item in enumerable)
            {
                // Each array element is: type byte, index as a NUL-terminated key, value.
                if (item == null)
                {
                    stream.WriteByte(unchecked((byte)BsonType.Null));
                    WriteString(stream, index.ToString(NumberFormatInfo.InvariantInfo), buffer);
                }
                else
                {
                    stream.WriteByte(unchecked((byte)rootType));
                    WriteString(stream, index.ToString(NumberFormatInfo.InvariantInfo), buffer);
                    using (var bsonWriter = new BsonDataWriter(stream) { CloseOutput = false })
                    {
                        serializer.Serialize(bsonWriter, item);
                    }
                }
                index++;
            }

            stream.WriteByte((byte)0); // Document terminator.
            stream.Flush();

            // Seek back and overwrite the placeholder with the actual size.
            var finalPosition = stream.Position;
            stream.Position = initialPosition;
            var size = checked((int)(finalPosition - initialPosition));
            WriteInt(stream, size, buffer); // Calculated size.
            stream.Position = finalPosition;
        }

        private static readonly Encoding Encoding = new UTF8Encoding(false);

        // Writes a NUL-terminated UTF-8 string (a BSON cstring).
        private static void WriteString(Stream stream, string s, byte[] buffer)
        {
            if (s != null)
            {
                if (s.Length < buffer.Length / Encoding.GetMaxByteCount(1))
                {
                    var byteCount = Encoding.GetBytes(s, 0, s.Length, buffer, 0);
                    stream.Write(buffer, 0, byteCount);
                }
                else
                {
                    byte[] bytes = Encoding.GetBytes(s);
                    stream.Write(bytes, 0, bytes.Length);
                }
            }
            stream.WriteByte((byte)0);
        }

        // Writes a 32-bit integer in little-endian byte order.
        private static void WriteInt(Stream stream, int value, byte[] buffer)
        {
            unchecked
            {
                buffer[0] = (byte)value;
                buffer[1] = (byte)(value >> 8);
                buffer[2] = (byte)(value >> 16);
                buffer[3] = (byte)(value >> 24);
            }
            stream.Write(buffer, 0, 4);
        }

        enum BsonType : sbyte
        {
            // Taken from https://github.com/JamesNK/Newtonsoft.Json/blob/master/Src/Newtonsoft.Json/Bson/BsonType.cs
            // And also http://bsonspec.org/spec.html
            Number = 1,
            String = 2,
            Object = 3,
            Array = 4,
            Binary = 5,
            Undefined = 6,
            Oid = 7,
            Boolean = 8,
            Date = 9,
            Null = 10,
            Regex = 11,
            Reference = 12,
            Code = 13,
            Symbol = 14,
            CodeWScope = 15,
            Integer = 16,
            TimeStamp = 17,
            Long = 18,
            MinKey = -1,
            MaxKey = 127
        }
    }
Then call it as follows:
    BsonExtensions.SerializeEnumerable(regions, stream)
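The placeholder-then-patch idea behind the helper can be sketched independently of Json.NET. The following Python version is an illustration only: `write_bson_array` and `encode_region` are hypothetical names, and the items are assumed to be already-encoded BSON documents with a single int32 field. It writes a zero size, streams the elements, then seeks back to fill in the real size:

```python
import io
import struct

def write_bson_array(encoded_docs, stream):
    """Stream already-encoded BSON documents into one BSON array."""
    if not stream.seekable():
        raise ValueError("stream must be seekable")
    start = stream.tell()
    stream.write(struct.pack("<i", 0))  # placeholder for the total size
    for index, doc in enumerate(encoded_docs):
        stream.write(b"\x03")  # element type: embedded document
        stream.write(str(index).encode("ascii") + b"\x00")  # array index as key
        stream.write(doc)
    stream.write(b"\x00")  # document terminator
    end = stream.tell()
    stream.seek(start)
    stream.write(struct.pack("<i", end - start))  # patch in the real size
    stream.seek(end)

def encode_region(region_id):
    # Hypothetical stand-in for a serialized item: {"Id": region_id} as int32.
    body = b"\x10Id\x00" + struct.pack("<i", region_id)
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

buf = io.BytesIO()
write_bson_array((encode_region(i) for i in range(3)), buf)
data = buf.getvalue()
print(len(data), struct.unpack("<i", data[:4])[0])
```

Because the items come from a generator, only one encoded document is in memory at a time; the cost is that the target stream must support seeking.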
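Notes:
You can use the method above to serialize to a local FileStream or a MemoryStream, but not to a DeflateStream, which cannot be repositioned.
Serializing an enumerable of primitives is not implemented, but could be.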
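In release 10.0.1, Newtonsoft moved BSON processing into a separate NuGet package, Newtonsoft.Json.Bson, and replaced BsonWriter with BsonDataWriter. If you are using an earlier version of Newtonsoft, the answer above applies equally to the old BsonWriter.
Since Json.NET is written in C# and my primary language is C#, the workaround is also written in C#. If you need it converted to VB.NET, let me know and I can try.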
Demo with some simple unit tests here.
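Found it: BsonWriter is trying to be "smart". Because I am generating the output as an array of regions, it appears to keep the whole array in memory no matter how much flushing you do.
To prove this, I took out the WriteStartArray and WriteEndArray calls and ran the routine again; memory usage stayed at 500 MB and the process ran fine.
My guess is that this is a bug that was fixed in JsonWriter but not in the less widely used BsonWriter.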
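One consequence of dropping the enclosing array is worth spelling out. Assuming each root object is then written as an independent BSON document (I have not verified this against the BsonWriter source), the file is no longer a single BSON array but a sequence of back-to-back documents. Since every document starts with its own int32 length, such a file can still be read back one document at a time. A minimal Python sketch, with hypothetical helper names:

```python
import io
import struct

def read_documents(stream):
    """Yield each raw BSON document from a stream of back-to-back documents."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of stream
        if len(prefix) < 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack("<i", prefix)
        if size < 5:
            raise ValueError("invalid document size")
        rest = stream.read(size - 4)
        if len(rest) != size - 4:
            raise ValueError("truncated document")
        yield prefix + rest

def encode_region(region_id):
    # Hypothetical stand-in for one serialized record: {"Id": region_id} as int32.
    body = b"\x10Id\x00" + struct.pack("<i", region_id)
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

stream = io.BytesIO(b"".join(encode_region(i) for i in range(3)))
docs = list(read_documents(stream))
print(len(docs))
```

A reader that expects a single root array will not accept this layout, so both ends need to agree on the format.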