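I have a situation where a Nokogiri result comes back with hex encoding in it. The problem is that the actual encoding of the result is UTF-8, but it contains hex characters: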
Best 100+ Fishing Pictures | Download Free Images on Unsplash
https%3A%2F%2Funsplash.com%2Fs%2Fphotos%2Ffishing&rut=d1dd8233a6ad628121fa36d8d5a51be0b6fb0eda75e234d5036bf7b49efcf25b
current encoding: UTF-8
Fish Images | Free Vectors, Stock Photos & PSD
https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68
current encoding: UTF-8
How to Use Fish vs. fishes Correctly
https%3A%2F%2Fgrammarist.com%2Fusage%2Ffish%2Dfishes%2F&rut=e0897e219c9b0b125a1442b59e36c49753417a1b7812ae9d3ab0bc3179ffe6b5
current encoding: UTF-8
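Technically the URL is encoded as UTF-8, but it has hex characters. I have not found anything that treats them as hex and translates them to UTF-8, so I do not know how to identify these character groups for translation. Short of writing a complicated method that might work, I thought I would ask whether there is a way to force recognition of the original string and then translate it with force_encoding or something similar.
Does anyone have suggestions on how to do this? Any insight is appreciated. I would rather avoid hand-coding these characters into a method.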
Update: CGI::unescapeHTML(<string>) does not work:
irb(main):024:0> a
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):025:0> CGI::unescapeHTML(a)
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):026:0> CGI::unescapeHTML(a) == a
=> true
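The original question does not show the code behind "the result's encoding is UTF-8, but contains hex characters", so I am not sure I understand that part of the problem.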
In the update, you used the wrong method. unescapeHTML is for decoding HTML entities:
irb(main):010:0> CGI.escapeHTML '<'
=> "&lt;"
irb(main):012:0> CGI.unescapeHTML '&lt;'
=> "<"
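To make the difference concrete, here is a minimal sketch that runs both kinds of input through the CGI methods. The HTML sample string is made up for illustration; the percent-encoded string is the first URL from the question with the trailing rut parameter left off.

```ruby
require "cgi"

# HTML entities (&amp;, &lt;, &gt;) are what escapeHTML / unescapeHTML deal with.
html = CGI.escapeHTML("Fish & <b>chips</b>")
puts html
puts CGI.unescapeHTML(html)

# Percent-encoded sequences (%3A, %2F) are not HTML entities,
# so unescapeHTML returns the string unchanged.
percent = "https%3A%2F%2Funsplash.com%2Fs%2Fphotos%2Ffishing"
unchanged = CGI.unescapeHTML(percent)
puts unchanged == percent

# CGI.unescape is the method that decodes percent-encoding.
decoded = CGI.unescape(percent)
puts decoded
```

The %XX groups are URL percent-encoding (each pair of hex digits stands for one byte), which is why an HTML-entity decoder has nothing to do on that string.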
The method you need is the one that decodes URL (percent-encoded) sequences:
irb(main):017:0> encoded_url = "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):018:0> CGI.unescape encoded_url
=> "https://www.freepik.com/free-photos-vectors/fish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
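If you want only the target URL, a sketch like the following may help. It assumes that everything after the first literal & (the rut=... part) is a separate parameter appended by the page you scraped rather than part of the link, which I cannot confirm from the question. It also suggests that percent-encoded UTF-8 bytes come back as a UTF-8 string, so a force_encoding step should not be needed; the "caf%C3%A9" sample is hypothetical.

```ruby
require "cgi"
require "uri"

raw = "https%3A%2F%2Fgrammarist.com%2Fusage%2Ffish%2Dfishes%2F&rut=e0897e219c9b0b125a1442b59e36c49753417a1b7812ae9d3ab0bc3179ffe6b5"

# Assumption: the first unescaped "&" separates the link from a trailing parameter.
encoded_target, trailing = raw.split("&", 2)

# Decode the percent-encoded part only.
target = CGI.unescape(encoded_target)

# URI.decode_www_form_component is a standard-library alternative for this input.
same = URI.decode_www_form_component(encoded_target)

# Hypothetical sample: %C3%A9 is the UTF-8 byte sequence for "é".
word = CGI.unescape("caf%C3%A9")

puts target
puts trailing
puts same == target
puts word.encoding
```

Both methods also turn a literal + into a space, so they are meant for form-style encoding; that makes no difference for the strings shown in the question.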
If this does not solve your actual problem, I am happy to revise the answer once the question includes the source code.