Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)



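Ruby n00b here. I'm trying to scrape one p tag from each URL stored in a CSV file, and write the scraped content together with its URL to a new file (myResults.csv). However, I keep getting an "invalid byte sequence in UTF-8 (ArgumentError)" error, which suggests the URLs are invalid? (They are all standard 'http://www.exmaple.com/page' URLs and work in my browser.)

I've already tried .parse and .encode from similar threads here, but no luck. Thanks for reading.

Code: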

require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url desc]
}
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls') do |url|
    URI.parse(URI.encode(url.chomp))
    begin
    page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
      desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace) 
      csv << [url, desc]
    end
  end
end
end
puts "scraping done!"

Error message:

/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
    from bbb.rb:13:in `block (2 levels) in <main>'
    from bbb.rb:11:in `foreach'
    from bbb.rb:11:in `block in <main>'
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
    from bbb.rb:10:in `<main>'

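Two things:

  1. You say the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.

  2. The problem appears to be the encoding of listOfURLs.xls: Ruby assumes the file is UTF-8 encoded. If the file is not UTF-8 encoded, or contains byte sequences that are not valid UTF-8, you can get this error.

    You should double-check that the file is UTF-8 encoded and contains no illegal characters.

    If you must open a file that is not UTF-8 encoded, try reading it as ISO-8859-1: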

    File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |row|
        puts row
    end
    

Some useful information about invalid byte sequences in UTF-8
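
To confirm that the input file really is the culprit, a minimal sketch along these lines reports which lines are not valid UTF-8. The helper name `invalid_utf8_lines` and the sample bytes are made up for illustration; only `String#force_encoding` and `String#valid_encoding?` come from Ruby itself.

```ruby
# Hypothetical helper: returns the 1-based numbers of the lines in +raw+
# that are not valid UTF-8.
def invalid_utf8_lines(raw)
  raw.each_line.with_index(1).reject do |line, _number|
    line.dup.force_encoding('UTF-8').valid_encoding?
  end.map { |_line, number| number }
end

# Made-up sample: the second line ends in a lone 0xE9 byte ("é" in ISO-8859-1),
# which is not a valid UTF-8 sequence.
sample = "http://www.example.com/page\nhttp://www.example.com/caf\xE9\n".b

bad_lines = invalid_utf8_lines(sample)
puts "invalid UTF-8 on lines: #{bad_lines.inspect}"
```

For a real file, something like `invalid_utf8_lines(File.binread('listOfURLs.xls'))` should work the same way, since `File.binread` returns the raw bytes without assuming an encoding.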

Update:

An example:

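CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
    File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |line|
        # Strip the trailing newline and escape the URL before fetching it.
        # Note: URI.encode and URL support in Kernel#open were removed in
        # Ruby 3.0 (URI.open replaces the latter).
        url = URI.encode(line.chomp)
        begin
            page = Nokogiri.HTML(open(url))
            page.css('.bio media-content').each do |scrape|
                desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
                csv << [url, desc]
            end
        rescue StandardError => e
            # Skip pages that fail to load or parse instead of aborting the run
            warn "Skipping #{url}: #{e.message}"
        end
    end
end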

I'm a bit late here, but this should work for anyone who runs into the same problem in the future: csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
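
For illustration, here is a small sketch of what that force_encoding/encode chain does, using made-up in-memory bytes instead of a file. It assumes the data really is ISO-8859-1; if the file is actually in some other encoding, some characters may come out wrong.

```ruby
# Made-up sample bytes: a URL ending in "café" encoded as ISO-8859-1,
# where the single byte 0xE9 stands for "é".
raw = "http://www.example.com/caf\xE9".b

# Interpreted as UTF-8, these bytes are not a valid sequence.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?

# Label the bytes as ISO-8859-1, then transcode them to real UTF-8.
fixed = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts fixed.valid_encoding?
puts fixed
```

`force_encoding` only changes the label on the bytes, while `encode` actually rewrites them, which is why the two calls are chained in that order.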
