Comparing string values between SQL Server and Snowflake



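I am trying to compare string values between Snowflake and SQL Server, and I am running into a problem with Unicode characters: the MD5 hash that SQL Server produces differs from the one Snowflake produces for the same string.

What is the best way to resolve this difference for comparison purposes?

Sample code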

SQL Server

/*  SQL SERVER  
LOWER and CONVERT are used to produce the exact HASH format as Snowflake
*/
SELECT 
LOWER(
CONVERT(VARCHAR(1000), 
HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)))
, 2)
) AS mismatch;
SELECT 
LOWER(
CONVERT(VARCHAR(1000), 
HASHBYTES('MD5', CAST('md5_algtest' AS VARCHAR(50)))
, 2)
) AS matches;

Snowflake
/*  SNOWFLAKE   */
SELECT md5('md5_alg“test”') AS mismatch;
SELECT md5('md5_algtest') AS match;

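Microsoft SQL Server stores Unicode (NVARCHAR) data as UTF-16, and VARCHAR data in the code page of its collation. Snowflake stores all data as UTF-8. For non-ASCII characters such as the curly quotes in the example, these encodings produce different bytes, so HASHBYTES and md5 are hashing different input.

So you need to convert 'md5_alg“test”' to UTF-8 and then compute the hash.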
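To see why the hashes diverge, here is a small Python sketch run outside either database. It encodes the two sample strings three ways and hashes the resulting bytes. The helper names `md5_hex` and `encoded_forms` are made up for this illustration, and the choice of code page 1252 for VARCHAR is an assumption about the collation in use; yours may differ.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the lowercase 32-character hex MD5 digest of raw bytes."""
    return hashlib.md5(data).hexdigest()


def encoded_forms(text: str) -> dict:
    """Encode the same text in the three ways discussed above."""
    return {
        # Assumption: VARCHAR under a Latin1 collation uses code page 1252.
        "varchar_cp1252": text.encode("cp1252"),
        # NVARCHAR is stored as UTF-16, little-endian, without a byte-order mark.
        "nvarchar_utf16": text.encode("utf-16-le"),
        # Snowflake stores strings as UTF-8.
        "snowflake_utf8": text.encode("utf-8"),
    }


# The second sample contains the curly quotes U+201C and U+201D.
for sample in ("md5_algtest", "md5_alg\u201ctest\u201d"):
    print(sample)
    for name, raw in encoded_forms(sample).items():
        print(f"  {name:15} {raw.hex():60} {md5_hex(raw)}")
```

For the ASCII-only sample the cp1252 and UTF-8 bytes are identical, which is why that case already matches. Each curly quote is a single byte in cp1252 but three bytes in UTF-8, so the digests differ.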

I found a T-SQL function that performs this conversion: https://gist.github.com/sevaa/f084a0a5a994c3bc28e518d5c708d5f6

create function [dbo].[ToUTF8](@s nvarchar(max))
returns varbinary(max)
as
begin
    -- @n is the number of UTF-16 code units; @r accumulates the UTF-8 bytes
    declare @i int = 1, @n int = datalength(@s)/2, @r varbinary(max) = 0x, @c int, @c2 int, @d varbinary(4)
    while @i <= @n
    begin
        set @c = unicode(substring(@s, @i, 1))
        -- High surrogate: combine it with the low surrogate that must follow
        if (@c & 0xFC00) = 0xD800
        begin
            set @i += 1
            -- Casting the message text to int fails on purpose, which raises an error from inside the function
            if @i > @n
                return cast(cast('Malformed UTF-16 - two nchar sequence cut short' as int) as varbinary)
            set @c2 = unicode(substring(@s, @i, 1))
            if (@c2 & 0xFC00) <> 0xDC00
                return cast(cast('Malformed UTF-16 - continuation missing in a two nchar sequence' as int) as varbinary)
            set @c = (((@c & 0x3FF) * 0x400) | (@c2 & 0x3FF)) + 0x10000
        end
        -- Emit 1 to 4 bytes depending on the code point range
        if @c < 0x80
            set @d = cast(@c as binary(1))
        if @c >= 0x80 and @c < 0x800
            set @d = cast(((@c * 4) & 0xFF00) | (@c & 0x3F) | 0xC080 as binary(2))
        if @c >= 0x800 and @c < 0x10000
            set @d = cast(((@c * 0x10) & 0xFF0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xe08080 as binary(3))
        if @c >= 0x10000
            set @d = cast(((@c * 0x40) & 0xFF000000) | ((@c * 0x10) & 0x3F0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xf0808080 as binary(4))

        set @r += @d
        set @i += 1
    end
    return @r
end

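Once the [dbo].[ToUTF8] function has been created in SQL Server, you can compute the MD5 hash over its output, and it will produce the same value as Snowflake: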

SELECT 
LOWER(
CONVERT(VARCHAR(32), 
HASHBYTES('MD5', [dbo].[ToUTF8](N'md5_alg“test”')  )
, 2)
) AS mismatch,
LOWER(
CONVERT(VARCHAR(32), 
HASHBYTES('MD5',  [dbo].[ToUTF8]('md5_algtest')  )
, 2)
) AS matches;
mismatch                          matches
80381678898496aba31245b01f40dd25  cb95937a11e610f6aaf0d06666bde771
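The conversion that the ToUTF8 function above performs can also be sketched in Python, which makes it easy to cross-check against a built-in encoder. This is a rough port for illustration only, not the gist author's code: the name `to_utf8` is hypothetical, and it uses shifts where the T-SQL version uses multiplications and masks.

```python
import struct


def to_utf8(text: str) -> bytes:
    """Encode text as UTF-8 by walking its UTF-16 code units by hand."""
    raw = text.encode("utf-16-le")
    # One unsigned 16-bit integer per UTF-16 code unit.
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    out = bytearray()
    i = 0
    while i < len(units):
        c = units[i]
        if (c & 0xFC00) == 0xD800:
            # High surrogate: a low surrogate must follow.
            i += 1
            if i >= len(units) or (units[i] & 0xFC00) != 0xDC00:
                raise ValueError("malformed UTF-16 surrogate pair")
            c = (((c & 0x3FF) << 10) | (units[i] & 0x3FF)) + 0x10000
        if c < 0x80:
            out.append(c)
        elif c < 0x800:
            out += bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
        elif c < 0x10000:
            out += bytes([0xE0 | (c >> 12),
                          0x80 | ((c >> 6) & 0x3F),
                          0x80 | (c & 0x3F)])
        else:
            out += bytes([0xF0 | (c >> 18),
                          0x80 | ((c >> 12) & 0x3F),
                          0x80 | ((c >> 6) & 0x3F),
                          0x80 | (c & 0x3F)])
        i += 1
    return bytes(out)


# Cross-check the hand-written encoder against Python's own UTF-8 encoder.
sample = "md5_alg\u201ctest\u201d"
print(to_utf8(sample) == sample.encode("utf-8"))
```

If the two byte strings agree for your test data, hashing either one gives the same MD5 digest, which is the property the SQL Server function relies on.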

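For SQL Server 2019 and later, the following solution worked for me. Applying a UTF-8 collation to the VARCHAR value makes HASHBYTES hash the UTF-8 bytes: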

SELECT 
LOWER(
CONVERT(VARCHAR(1000), 
HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)))
, 2)
) AS mismatch,
LOWER(
CONVERT(VARCHAR(1000), 
HASHBYTES('MD5', CAST('md5_alg“test”' AS VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8)
, 2)
) AS match

https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-utf-8-support-for-sql-server/ba-p/734928
