I have a large file (>50 MB) that contains a JSON hash. Something like this:
{
"obj1": {
"key1": "val1",
"key2": "val2"
},
"obj2": {
"key1": "val1",
"key2": "val2"
}
...
}
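Rather than parsing the entire file and then taking, say, the first ten elements, I'd like to parse each item in the hash one at a time. I don't actually care about the keys, i.e. obj1.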
If I convert the above into this:
{
"key1": "val1",
"key2": "val2"
}
{
"key1": "val1",
"key2": "val2"
}
I can easily get what I want with Yajl streaming:
require 'yajl'
io = File.open(path_to_file)
count = 10
Yajl::Parser.parse(io) do |obj|
puts "Parsed: #{obj}"
count -= 1
break if count == 0
end
io.close
Is there a way to do this without having to alter the file? Perhaps some kind of callback in Yajl?
I ended up solving this with JSON::Stream, which has callbacks for start_document, start_object, and so on.
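I gave my "parser" a to_enum method that emits each "Resource" object as soon as it has been parsed. Note that ResourceCollectionNode is never really used unless you parse the JSON stream to completion, and ResourceNode is a subclass of ObjectNode that exists only for naming purposes, though I may remove it: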
require 'json/stream'
class ResourceParser
METHODS = %w[start_document end_document start_object end_object start_array end_array key value]
attr_reader :result
def initialize(io, chunk_size = 1024)
@io = io
@chunk_size = chunk_size
@parser = JSON::Stream::Parser.new
# register callback methods
METHODS.each do |name|
@parser.send(name, &method(name))
end
end
def to_enum
Enumerator.new do |yielder|
@yielder = yielder
begin
while !@io.eof?
# puts "READING CHUNK"
chunk = @io.read(@chunk_size)
@parser << chunk
end
ensure
@yielder = nil
end
end
end
def start_document
@stack = []
@result = nil
end
def end_document
# @result = @stack.pop.obj
end
def start_object
if @stack.size == 0
@stack.push(ResourceCollectionNode.new)
elsif @stack.size == 1
@stack.push(ResourceNode.new)
else
@stack.push(ObjectNode.new)
end
end
def end_object
if @stack.size == 2
node = @stack.pop
#puts "Stack depth: #{@stack.size}. Node: #{node.class}"
@stack[-1] << node.obj
# puts "Parsed complete resource: #{node.obj}"
@yielder << node.obj
elsif @stack.size == 1
# puts "Parsed all resources"
@result = @stack.pop.obj
else
node = @stack.pop
# puts "Stack depth: #{@stack.size}. Node: #{node.class}"
@stack[-1] << node.obj
end
end
def end_array
node = @stack.pop
@stack[-1] << node.obj
end
def start_array
@stack.push(ArrayNode.new)
end
def key(key)
# puts "Stack depth: #{@stack.size} KEY: #{key}"
@stack[-1] << key
end
def value(value)
node = @stack[-1]
node << value
end
class ObjectNode
attr_reader :obj
def initialize
@obj, @key = {}, nil
end
def <<(node)
if @key
@obj[@key] = node
@key = nil
else
@key = node
end
self
end
end
class ResourceNode < ObjectNode
end
# Node that contains all the resources - a Hash keyed by url
class ResourceCollectionNode < ObjectNode
def <<(node)
if @key
@obj[@key] = node
# puts "Completed Resource: #{@key} => #{node}"
@key = nil
else
@key = node
end
self
end
end
class ArrayNode
attr_reader :obj
def initialize
@obj = []
end
def <<(node)
@obj << node
self
end
end
end
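The ObjectNode class above relies on JSON::Stream delivering a key callback followed by a value (or a completed child node), so `<<` alternates between remembering a key and storing under it. A minimal standalone sketch of that pairing, where PairCollector and pair_up are hypothetical names used only for illustration:

```ruby
# Standalone sketch of the key/value pairing used by ObjectNode.
# PairCollector and pair_up are hypothetical names, not part of JSON::Stream.
class PairCollector
  attr_reader :obj

  def initialize
    @obj = {}
    @key = nil
  end

  # First call remembers the key; the next call stores its value.
  # JSON keys are always strings, so the truthiness test on @key is safe.
  def <<(item)
    if @key
      @obj[@key] = item
      @key = nil
    else
      @key = item
    end
    self
  end
end

# Feeds a flat key, value, key, value... sequence into a collector.
def pair_up(items)
  items.each_with_object(PairCollector.new) { |item, node| node << item }.obj
end

p pair_up(["url", "url_1", "status", 200])
```

Because the check is on the key rather than the value, a null or false JSON value is still stored correctly.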
And an example of the parser in use:
def json
<<-EOJ
{
"1": {
"url": "url_1",
"title": "title_1",
"http_req": {
"status": 200,
"time": 10
}
},
"2": {
"url": "url_2",
"title": "title_2",
"http_req": {
"status": 404,
"time": -1
}
},
"3": {
"url": "url_1",
"title": "title_1",
"http_req": {
"status": 200,
"time": 10
}
},
"4": {
"url": "url_2",
"title": "title_2",
"http_req": {
"status": 404,
"time": -1
}
},
"5": {
"url": "url_1",
"title": "title_1",
"http_req": {
"status": 200,
"time": 10
}
},
"6": {
"url": "url_2",
"title": "title_2",
"http_req": {
"status": 404,
"time": -1
}
}
}
EOJ
end
require 'stringio'
io = StringIO.new(json)
resource_parser = ResourceParser.new(io, 100)
count = 0
resource_parser.to_enum.each do |resource|
count += 1
puts "READ: #{count}"
pp resource
break
end
io.close
Output:
READ: 1
{"url"=>"url_1", "title"=>"title_1", "http_req"=>{"status"=>200, "time"=>10}}
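Only one resource is printed because of the `break`: an Enumerator built with Enumerator.new runs its block only as far as the consumer asks, so the rest of the input is never read. A small sketch of that behaviour without any JSON parsing, where each_chunk is a hypothetical helper and not part of the parser above:

```ruby
require 'stringio'

# Hypothetical helper: returns an Enumerator that reads io in
# chunk_size pieces, and only when the consumer asks for them.
def each_chunk(io, chunk_size)
  Enumerator.new do |yielder|
    until io.eof?
      yielder << io.read(chunk_size)
    end
  end
end

io = StringIO.new("x" * 1000)
each_chunk(io, 100).each do |chunk|
  break # stop after the first chunk
end
puts io.pos # number of bytes consumed before the break
```

With the real parser the position after the break depends on how many chunks were needed to complete the first resource, so a smaller chunk size means less of the file is read before you stop.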
I ran into the same problem and created the gem json-streamer, which saves you from having to write your own callbacks.
For your case the usage would be (v0.4.0):
io = File.open(path_to_file)
streamer = Json::Streamer::JsonStreamer.new(io)
streamer.get(nesting_level:1).each do |object|
p object
end
io.close
Applied to your example, it yields the objects without the "obj" keys:
{
"key1": "val1",
"key2": "val2"
}
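To make the expected result concrete: judging by the output above, I am assuming that nesting_level 1 means "each object one level below the root". For a document small enough to load whole, that corresponds to the hash's values. The sketch below uses only Ruby's standard json library, is not streaming, and level_one_objects is a hypothetical name:

```ruby
require 'json'

# Non-streaming illustration only: this loads the whole document into
# memory, which is exactly what a streaming parser avoids for a large file.
# level_one_objects is a hypothetical helper, not part of json-streamer.
def level_one_objects(json_text)
  JSON.parse(json_text).values
end

doc = '{"obj1": {"key1": "val1", "key2": "val2"}, "obj2": {"key1": "val1", "key2": "val2"}}'
level_one_objects(doc).each { |object| p object }
```

This is handy for checking a streaming result against a small sample of the file.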