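I am trying to compare string values between Snowflake and SQL Server, and I am running into a problem with Unicode characters: the MD5 hash that SQL Server produces differs from the one Snowflake produces.
What is the best way to resolve this difference so the values can be compared?
Sample code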
SQL Server
/* SQL SERVER
LOWER and CONVERT are used to produce the exact HASH format as Snowflake
*/
SELECT
LOWER(
CONVERT(VARCHAR(1000),
HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)))
, 2)
) AS mismatch;
SELECT
LOWER(
CONVERT(VARCHAR(1000),
HASHBYTES('MD5', CAST('md5_algtest' AS VARCHAR(50)))
, 2)
) AS matches;
Snowflake
/* SNOWFLAKE */
SELECT md5('md5_alg“test”') AS mismatch;
SELECT md5('md5_algtest') AS match;
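Microsoft SQL Server uses UTF-16 encoding to store Unicode characters (NVARCHAR data; VARCHAR data uses the code page of its collation). Snowflake stores all data in the UTF-8 character set.
So you need to convert 'md5_alg“test”' to UTF-8 and then compute the hash over those bytes.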
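To see why the encoding matters, the mismatch can be reproduced outside either database. The following is a minimal Python sketch, not part of the original answer. It assumes the SQL Server database uses a Latin1 collation, so VARCHAR bytes are code page 1252; `md5_hex` is a hypothetical helper name.

```python
import hashlib


def md5_hex(text: str, encoding: str) -> str:
    """Return the lowercase hex MD5 of `text` after encoding it with `encoding`."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


# \u201c and \u201d are the curly quotes from the question.
samples = ["md5_algtest", "md5_alg\u201ctest\u201d"]

for s in samples:
    # utf-8     : the bytes Snowflake hashes
    # cp1252    : VARCHAR bytes under a Latin1 collation (assumption)
    # utf-16-le : NVARCHAR bytes
    for enc in ("utf-8", "cp1252", "utf-16-le"):
        print(f"{s!r:22} {enc:10} {md5_hex(s, enc)}")
```

For the pure ASCII string the UTF-8 and cp1252 hashes are identical, because both encodings produce the same bytes. For the string with curly quotes they differ, since each quote is one byte in cp1252 and three bytes in UTF-8.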
I found a function that does this: https://gist.github.com/sevaa/f084a0a5a994c3bc28e518d5c708d5f6
create function [dbo].[ToUTF8](@s nvarchar(max))
returns varbinary(max)
as
begin
    declare @i int = 1, @n int = datalength(@s)/2, @r varbinary(max) = 0x, @c int, @c2 int, @d varbinary(4)
    while @i <= @n
    begin
        set @c = unicode(substring(@s, @i, 1))
        if (@c & 0xFC00) = 0xD800
        begin
            set @i += 1
            if @i > @n
                return cast(cast('Malformed UTF-16 - two nchar sequence cut short' as int) as varbinary)
            set @c2 = unicode(substring(@s, @i, 1))
            if (@c2 & 0xFC00) <> 0xDC00
                return cast(cast('Malformed UTF-16 - continuation missing in a two nchar sequence' as int) as varbinary)
            set @c = (((@c & 0x3FF) * 0x400) | (@c2 & 0x3FF)) + 0x10000
        end
        if @c < 0x80
            set @d = cast(@c as binary(1))
        if @c >= 0x80 and @c < 0x800
            set @d = cast(((@c * 4) & 0xFF00) | (@c & 0x3F) | 0xC080 as binary(2))
        if @c >= 0x800 and @c < 0x10000
            set @d = cast(((@c * 0x10) & 0xFF0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xe08080 as binary(3))
        if @c >= 0x10000
            set @d = cast(((@c * 0x40) & 0xFF000000) | ((@c * 0x10) & 0x3F0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xf0808080 as binary(4))
        set @r += @d
        set @i += 1
    end
    return @r
end
After creating this function, you can compute the MD5, and it will produce the same values as Snowflake:
SELECT
LOWER(
CONVERT(VARCHAR(32),
HASHBYTES('MD5', [dbo].[ToUTF8]('md5_alg“test”') )
, 2)
) AS mismatch,
LOWER(
CONVERT(VARCHAR(32),
HASHBYTES('MD5', [dbo].[ToUTF8]('md5_algtest') )
, 2)
) AS matches;
mismatch | matches
---|---
80381678898496aba31245b01f40dd25 | cb95937a11e610f6aaf0d06666bde771
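The bit arithmetic in dbo.ToUTF8 can be cross-checked outside the database. The sketch below is a line-by-line Python port written for illustration only (the name `to_utf8` is hypothetical, and this is not the gist author's code). It walks UTF-16 code units the way the T-SQL loop walks NCHAR positions, so its output can be compared against Python's built-in UTF-8 encoder.

```python
def to_utf8(s: str) -> bytes:
    """Encode `s` as UTF-8 using the same arithmetic as the T-SQL function."""
    # Split the string into 16-bit code units, as NVARCHAR stores it.
    # "surrogatepass" lets lone surrogates through so the checks below can fire.
    raw = s.encode("utf-16-le", "surrogatepass")
    units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

    out = bytearray()
    i = 0
    while i < len(units):
        c = units[i]
        if (c & 0xFC00) == 0xD800:  # high surrogate: combine with the next unit
            i += 1
            if i >= len(units):
                raise ValueError("Malformed UTF-16: surrogate pair cut short")
            c2 = units[i]
            if (c2 & 0xFC00) != 0xDC00:
                raise ValueError("Malformed UTF-16: low surrogate missing")
            c = (((c & 0x3FF) * 0x400) | (c2 & 0x3FF)) + 0x10000

        if c < 0x80:  # 1 byte: 0xxxxxxx
            out += c.to_bytes(1, "big")
        elif c < 0x800:  # 2 bytes: 110yyyyy 10xxxxxx
            out += (((c * 4) & 0xFF00) | (c & 0x3F) | 0xC080).to_bytes(2, "big")
        elif c < 0x10000:  # 3 bytes: 1110zzzz 10yyyyyy 10xxxxxx
            out += (((c * 0x10) & 0xFF0000) | ((c * 4) & 0x3F00)
                    | (c & 0x3F) | 0xE08080).to_bytes(3, "big")
        else:  # 4 bytes: 11110www 10zzzzzz 10yyyyyy 10xxxxxx
            out += (((c * 0x40) & 0xFF000000) | ((c * 0x10) & 0x3F0000)
                    | ((c * 4) & 0x3F00) | (c & 0x3F) | 0xF0808080).to_bytes(4, "big")
        i += 1
    return bytes(out)


# Compare the hand-rolled encoder with the built-in one.
for text in ["md5_algtest", "md5_alg\u201ctest\u201d", "caf\u00e9", "\U0001F600"]:
    print(repr(text), to_utf8(text) == text.encode("utf-8"))
```

If the two encoders agree on a sample, hashing the output of dbo.ToUTF8 for that sample should be equivalent to hashing the UTF-8 bytes, which is what Snowflake's md5 does according to the answer above.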
For SQL Server 2019 and later, the following solution worked for me.
SELECT
LOWER(
CONVERT(VARCHAR(1000),
HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)))
, 2)
) AS mismatch,
LOWER(
CONVERT(VARCHAR(1000),
HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8)
, 2)
) AS match
https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-utf-8-support-for-sql-server/ba-p/734928