How to save data into a multidimensional Ruby hash and then convert that hash to a single JSON file



I'm building a web scraper that collects the following data from a website:

  • Groups
  • Categories
  • Search attributes

I use the following code to save the data into three separate (one-dimensional) JSON files:

require 'mechanize'
require 'json' # provides Array#to_json used when writing the files
@raw_groups_array = []
@raw_categories_array = []
@search_attributes = []
@groups_clean = []
@categories_clean = []
@categories_combined = []
@categories_hash = {}
# Initialize Mechanize object
a = Mechanize.new
# Begin magic
a.get('http://www.marktplaats.nl/') do |page|
  groups = page.search('//*[(@id = "navigation-categories")]//a')
  groups.each do |group|
    @raw_groups_array.push(group)
    @groups_clean.push(group.text)
    a.get(group[:href]) do |page_2|
      categories = page_2.search('//*[(@id = "category-browser")]//a')
      categories.each do |category|
        @raw_categories_array.push(category)
        @categories_clean.push(category.text)
        @categories_combined.push("#{group.text} | #{category.text}")
        a.get(category[:href]) do |page_3|
          search_attributes = page_3.search('//*[contains(concat( " ", @class, " " ), concat( " ", "heading", " " ))]')
          search_attributes.each do |attribute|
            @search_attributes.push("#{group.text} | #{category.text} | #{attribute.text}") unless attribute.text == 'Outlet '
            # Print progress; comment out the line below to silence it.
            # (it has minimal effect on performance)
            puts "#{group.text} | #{category.text} | #{attribute.text}" unless attribute.text == 'Outlet '
          end
        end
      end
    end
  end
end
# Write json files
File.open('json/prestige/prestige_groups.json', 'w') do |f|
  puts '# Writing groups'
  f.write(@groups_clean.to_json)
  puts '|-----------> Done.'
end
File.open('json/prestige/prestige_categories.json', 'w') do |f|
  puts '# Writing categories'
  f.write(@categories_clean.to_json)
  puts '|-----------> Done.'
end
File.open('json/prestige/prestige_combined.json', 'w') do |f|
  puts '# Writing combined'
  f.write(@categories_combined.to_json)
  puts '|-----------> Done.'
end
File.open('json/prestige/prestige_search_attributes.json', 'w') do |f|
  puts '# Writing search attributes'
  f.write(@search_attributes.to_json)
  puts '|-----------> Done.'
end
puts '# Finished.'

The code works, but I'm struggling to refactor it so that it builds a Ruby hash in the following format:

{
  "category"=>{
    "name"=>"#{category}",
    "group"=>"#{group}",
    "search_attributes"=>{
      "1"=>"#{search_attributes[0]}",
      "2"=>"#{search_attributes[1]}",
      "."=>"#{search_attributes[.]}",
      "i"=>"#{search_attributes[i]}", # depending on search_attributes.length
    }
  }
}

I tried this:

...
search_attributes.each do |attribute|
  @categories_hash.store([:category][:name], category.text)
  @categories_hash.store([:category][:group], group.text)
  @categories_hash.store([:category][:search_attributes][:1], attribute.text)
end
...

But I keep getting syntax errors.

Any help would be much appreciated.

Max suggested I try Hash#[], but that returns a hash containing only a single category (the last one):

search_attributes.each_with_index do |attribute, index|
  @categories_hash[:category][:name] = category.text
  @categories_hash[:category][:group] = group.text
  @categories_hash[:category][:search_attributes][:"#{index}"] = attribute.text unless attribute.text == "Outlet "   
end

I've pasted the full code here.

Is there a particular reason you're using Hash#store? That method has no shorthand for nested keys.

I think Hash#[] is the better choice here.

@categories_hash[:category] ||= {}
@categories_hash[:category][:search_attributes] ||= {}
# A bare :1 is not a valid symbol literal, so the key is quoted.
@categories_hash[:category][:search_attributes][:'1'] = attribute.text

||= makes sure each sub-hash is initialized before you try to store anything in it.
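
To see the ||= initialization in isolation, here is a minimal sketch that runs without Mechanize. The attribute strings are made-up sample values, and a local variable stands in for @categories_hash; string keys such as "1" are used because they match the format asked for in the question.

```ruby
require 'json'

# Hypothetical sample values standing in for the scraped attribute text.
attributes = ['Price', 'Condition']

categories_hash = {}
# Create each level only if it does not exist yet.
categories_hash[:category] ||= {}
categories_hash[:category][:search_attributes] ||= {}

attributes.each_with_index do |text, index|
  # String keys ("1", "2", ...) sidestep the invalid :1 symbol literal.
  categories_hash[:category][:search_attributes][(index + 1).to_s] = text
end

puts categories_hash.to_json
# prints {"category":{"search_attributes":{"1":"Price","2":"Condition"}}}
```

Because every write goes to the same :category key, running this once per category would keep only the last one, which is why a separate hash per category is needed.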

With help from here, here, and here, I now have complete working code:

require 'mechanize'
require 'json' # provides Array#to_json used when writing the file
@hashes = []
# Initialize Mechanize object
a = Mechanize.new
# Begin scraping
a.get('http://www.marktplaats.nl/') do |page|
  groups = page.search('//*[(@id = "navigation-categories")]//a')
  groups.each_with_index do |group, index_1|
    a.get(group[:href]) do |page_2|
      categories = page_2.search('//*[(@id = "category-browser")]//a')
      categories.each_with_index do |category, index_2|
        a.get(category[:href]) do |page_3|
          search_attributes = page_3.search('//*[contains(concat( " ", @class, " " ), concat( " ", "heading", " " ))]')
          attributes_hash = {}
          search_attributes.each_with_index do |attribute, index_3|
            attributes_hash[index_3.to_s] = "#{attribute.text unless attribute.text == 'Outlet '}"
          end
          item = {
            id: "#{index_1}.#{index_2}",
            name: category.text,
            group: group.text,
            :search_attributes => attributes_hash
          }
          @hashes << item
          # Print each item as it is pushed; comment this out to silence it
          puts item
        end
      end
    end
  end
end
# Open file and begin
File.open("json/light/#{Time.now.strftime '%Y%m%d%H%M%S'}_light_categories.json", 'w') do |f|
  puts '# Writing category data to JSON file'
  f.write(@hashes.to_json)
  puts "|-----------> Done. #{@hashes.length} written."
end
puts '# Finished.'
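
The hash-building step can also be tried on its own, without hitting the website. The sketch below is only an illustration: build_item is a hypothetical helper that is not part of the script above, and the group, category, and attribute names are made-up sample data. Unlike the script above, which stores an empty string for the 'Outlet ' heading, this variant drops that entry so the keys stay consecutive.

```ruby
require 'json'

# Hypothetical helper (not in the original script): builds the hash for one
# category from plain strings, so it can be exercised without Mechanize.
def build_item(id, group_name, category_name, attribute_texts)
  attributes_hash = {}
  # Drop the unwanted 'Outlet ' heading first so the keys stay consecutive.
  attribute_texts.reject { |text| text == 'Outlet ' }
                 .each_with_index { |text, index| attributes_hash[index.to_s] = text }
  { id: id, name: category_name, group: group_name, search_attributes: attributes_hash }
end

# Made-up sample data standing in for scraped text.
items = [
  build_item('0.0', 'Group A', 'Category 1', ['Price', 'Outlet ', 'Condition']),
  build_item('0.1', 'Group A', 'Category 2', ['Brand'])
]

# pretty_generate produces indented JSON that is easier to inspect by eye.
json = JSON.pretty_generate(items)
puts json

# Parsing the string back confirms the nested structure survives the round trip.
parsed = JSON.parse(json)
puts parsed[0]['search_attributes']['1'] # prints Condition
```

Keeping one hash per category in an array, as the working code does, is what avoids the earlier problem of each category overwriting the previous one.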
