Ruby encoding from previous hex encoding



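I've run into a situation where my Nokogiri results come back with hex-encoded characters in them. The problem is that the actual encoding of the results is UTF-8, yet they contain hex characters: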

Best 100+ Fishing Pictures | Download Free Images on Unsplash
https%3A%2F%2Funsplash.com%2Fs%2Fphotos%2Ffishing&rut=d1dd8233a6ad628121fa36d8d5a51be0b6fb0eda75e234d5036bf7b49efcf25b
current encoding: UTF-8
Fish Images | Free Vectors, Stock Photos & PSD
https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68
current encoding: UTF-8
How to Use Fish vs. fishes Correctly
https%3A%2F%2Fgrammarist.com%2Fusage%2Ffish%2Dfishes%2F&rut=e0897e219c9b0b125a1442b59e36c49753417a1b7812ae9d3ab0bc3179ffe6b5
current encoding: UTF-8

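Technically, the URLs are encoded as UTF-8 but contain hex characters. I haven't found anything that treats them as hex and translates them to UTF-8, so I don't know how to identify these character groups for translation. Rather than write a complex method that might work, I thought I'd ask whether there is a way to force recognition of the original string's encoding and then translate it with force_encoding or something similar.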

Does anyone have suggestions on how to do this? Any insight is appreciated. I'd rather avoid hand-coding these characters into a method.

Update: CGI::unescapeHTML(<string>) does not work:

irb(main):024:0> a
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):025:0> CGI::unescapeHTML(a)
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):026:0> CGI::unescapeHTML(a) == a
=> true

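The original question doesn't include the source behind "the results' encoding is UTF-8, but contains hex characters", so I don't think I understand that part of the problem.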

In your update, you are using the wrong method. unescapeHTML is for decoding HTML entities:

irb(main):010:0> CGI.escapeHTML '<'
=> "&lt;"
irb(main):012:0> CGI.unescapeHTML '&lt;'
=> "<"

The method you need is the one that decodes URL escape sequences (percent-encoding):

irb(main):017:0> encoded_url = "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
=> "https%3A%2F%2Fwww.freepik.com%2Ffree%2Dphotos%2Dvectors%2Ffish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
irb(main):018:0> CGI.unescape encoded_url
=> "https://www.freepik.com/free-photos-vectors/fish&rut=f68a290a96893c63f8849bc9e89152d97a632d7a95bbf5d0ca2e939b378fff68"
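If only the target URL is wanted, the same decoding step can be wrapped in a small helper. The following is a minimal sketch, not a definitive implementation: `decode_result_url` is a hypothetical name, and it assumes that everything from `&rut=` onward is a trailing token that can be discarded, as the sample output suggests. It uses `URI.decode_www_form_component` from the standard library in place of `CGI.unescape`; both decode `%XX` sequences and turn `+` into a space.

```ruby
require "uri"

# Hypothetical helper: decode a percent-encoded result link and drop
# the trailing "&rut=..." token (assumed to be unwanted here).
def decode_result_url(raw)
  # partition always returns three parts, so this is safe even when
  # "&rut=" is absent; the first part is the encoded URL.
  encoded = raw.partition("&rut=").first
  # Decodes each %XX group and returns a UTF-8 string by default.
  URI.decode_www_form_component(encoded)
end

raw = "https%3A%2F%2Fgrammarist.com%2Fusage%2Ffish%2Dfishes%2F&rut=e0897e219c9b0b125a1442b59e36c49753417a1b7812ae9d3ab0bc3179ffe6b5"
url = decode_result_url(raw)

puts url           # prints https://grammarist.com/usage/fish-fishes/
puts url.encoding  # prints UTF-8
```

No force_encoding call is needed: the percent signs and hex digits are plain ASCII characters inside a valid UTF-8 string, so this is an escaping problem, not a string-encoding one.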

If this doesn't solve your actual problem, I'm happy to revise this answer once the question includes the source code.
