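Ruby n00b here. I'm trying to scrape one p tag from each URL stored in a CSV file, and output the scraped content together with its URL to a new file (myResults.csv). However, I keep getting an "invalid byte sequence in UTF-8 (ArgumentError)" error, which suggests the URLs are invalid? (They are all standard 'http://www.exmaple.com/page' URLs and they work in my browser.)
I have already tried .parse and .encode as suggested in similar threads here, but no luck. Thanks for reading.
Code: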
require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url desc]
}
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls') do |url|
    URI.parse(URI.encode(url.chomp))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end
puts "scraping done!"
Error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'
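Two things:
- You say the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.
- The problem appears to be the encoding of listOfURLs.xls: Ruby assumes the file is UTF-8 encoded. If the file is not UTF-8 encoded, or contains byte sequences that are not valid UTF-8, you can get this error. Double-check that the file is saved as UTF-8 and does not contain any illegal characters.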
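One way to do that check is to read the file as raw bytes and ask Ruby whether each line is valid UTF-8. This is only a minimal sketch and not part of the original code: the helper name invalid_utf8_lines and the sample URLs in the demo are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: returns the 1-based numbers of the lines in +path+
# whose bytes are not valid UTF-8. The file is read as raw bytes, so the
# read itself never raises an encoding error.
def invalid_utf8_lines(path)
  File.binread(path).each_line.with_index(1).reject do |line, _number|
    line.force_encoding(Encoding::UTF_8).valid_encoding?
  end.map { |_line, number| number }
end

# Demo with a throwaway file: the second line ends with the single byte 0xE9,
# which is "é" in ISO-8859-1 but not a valid UTF-8 sequence.
Tempfile.create('urls') do |file|
  file.binmode
  file.write("http://www.example.com/page\nhttp://www.example.com/caf\xE9\n".b)
  file.flush
  p invalid_utf8_lines(file.path) # prints [2]
end
```

If the helper returns an empty array for listOfURLs.xls, the file is valid UTF-8 and the cause is probably somewhere else.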
If you have to open a file that is not UTF-8 encoded, try reading it as ISO-8859-1:
File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') { |row| puts row }
Some useful information about invalid byte sequences in UTF-8
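To see where the ArgumentError comes from: according to the traceback, URI.escape calls gsub, and Ruby's regex-based string methods refuse to scan a string whose bytes are not valid in the encoding it is tagged with. The sketch below reproduces that on a single string. The helper name gsub_error and the sample URL are hypothetical, and the gsub call only stands in for what URI.escape does internally.

```ruby
# Hypothetical helper: runs a regex-based gsub over +str+ and returns the
# error message if Ruby rejects the string, or nil if the scan succeeds.
def gsub_error(str)
  str.gsub(/\s/, '')
  nil
rescue ArgumentError => e
  e.message
end

# A line as Ruby sees it after reading a Latin-1 file while assuming UTF-8:
# the lone byte 0xE9 is not a valid UTF-8 sequence.
line = "http://www.example.com/caf\xE9"
puts line.valid_encoding? # prints false
puts gsub_error(line)     # prints the ArgumentError message

# The same bytes tagged as ISO-8859-1 are valid, so the scan goes through.
latin1 = line.dup.force_encoding(Encoding::ISO_8859_1)
puts latin1.valid_encoding? # prints true
p gsub_error(latin1)        # prints nil
```

This is why passing the encoding to File.foreach can help: the bytes on disk stay the same, only the encoding Ruby assumes for them changes.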
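Update:
An example:
# Written for Ruby 2.0 as in the question; URI.encode and open-uri's
# Kernel#open for URLs were removed in Ruby 3.0.
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  # Read the URL list as ISO-8859-1 so that every byte is valid.
  File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') do |line|
    # Keep the chomped, escaped URL instead of discarding it.
    url = URI.parse(URI.encode(line.chomp)).to_s
    page = Nokogiri.HTML(open(url))
    page.css('.bio media-content').each do |scrape|
      desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
      csv << [url, desc]
    end
  end
end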
I'm a bit late here, but this should work for anyone who runs into the same problem in the future: csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
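For anyone wondering what that one-liner does at the byte level, here is a small sketch. It assumes the input really is ISO-8859-1. The helper name latin1_bytes_to_utf8 and the sample string are made up, and the replace option is left out because every byte value is a valid ISO-8859-1 character, so the conversion has nothing to replace.

```ruby
# Hypothetical helper: treats +raw+ as ISO-8859-1 bytes and returns a new
# string transcoded to UTF-8. force_encoding only relabels the bytes;
# encode is the step that actually rewrites them.
def latin1_bytes_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw = "caf\xE9".b                # four raw bytes, the last one is 0xE9
utf8 = latin1_bytes_to_utf8(raw)

p utf8.valid_encoding?           # prints true
p utf8.bytes.last(2)             # prints [195, 169], the UTF-8 bytes of "é"
puts utf8                        # prints café
```

After this conversion the string is ordinary UTF-8, so gsub, URI handling and CSV output can work on it without encoding errors.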