A more efficient way to compare two tables in Redshift



I have a series of stored procedures containing the logic that populates tables. I edit the stored procedure logic to populate new fields in the tables. Currently, to check how the stored procedures affect the tables, I take a full copy of a table before the change and store it as a new table in the Redshift database, e.g. "table_test", so that I can check that the row counts are the same and the columns contain the same data. This seems like a very inefficient process, storing the whole old table to use for a comparison with the new version of the table.

Is there a better/more efficient way to compare two tables in AWS Redshift?

What I have done in the past when comparing data between databases is to create an "MD5-like" signature per column. In your case, you could do something similar on your "pre" table contents and your "post" table contents. This would only tell you which columns are different, but that may be all you need.

Debugging when there is a difference can be difficult, but you could "save" a copy of the table to S3 for debugging use. This may defeat the speed you are looking for, so you might want to run this way only when there is an issue or when testing is turned on. You could also run such a process by "date" so that you get the mismatching date and column.

I have made such signatures several different ways, as non-Redshift databases are not always as fast as Redshift. Since you are comparing Redshift to Redshift, the comparison process becomes easier and faster. What I would do in this case is to compute MD5(columnN::text) for every column, then convert a portion of the hex result to a BIGINT. You can then SUM() these values for every column. (SUM() is the easiest way to aggregate the column information while using a subset of the MD5 results.) Since MD5 signatures are large, using a subset of the result is fine, as the MD5 hash spreads the "uniqueness" across the entire result. Overflow can be an issue, so subtracting a constant from every value helps with this. The resulting query looks something like:

select 
sum(nvl(strtol(substring({{column.column_name}}, 17, 8), 16) - (1::bigint << 31), 0))
from <<CTE>>;

This is from the jinja2 template I use for this process, which allows me to read the table DDL and cast non-text columns to text in a CTE. Hopefully this snippet is clear enough about how the process works.
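The per-value transform can be mimicked outside the database. Here is a minimal Python sketch of the same arithmetic (`value_signature` is my illustrative name, not part of the process):

```python
import hashlib

def value_signature(value) -> int:
    """Mimic strtol(substring(md5(value::text), 17, 8), 16) - (1::bigint << 31).

    md5() yields a 32-character hex string; SQL's 1-based SUBSTRING(x, 17, 8)
    corresponds to Python's x[16:24].  Those 8 hex characters parse to a
    number in [0, 2^32), and subtracting 2^31 centres it on zero, so that a
    SUM() over many rows is far less likely to overflow a BIGINT.
    """
    hex_digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(hex_digest[16:24], 16) - (1 << 31)
```

The signed range also means positive and negative contributions partially cancel as rows are summed, which is what keeps large tables inside BIGINT.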

===================== UPDATE ====================

I sense some confusion about the use of jinja2. Jinja2 is a templating language that can be used to expand text based on some input. It does not perform any of the SQL work of signing and comparing the tables. It is just a way to make repeating this work for different tables faster.

Here's an example of what creating a signature for a table looks like:

select
(
sum(nvl(strtol(substring(USAF, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(WBAN, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(STATION_NAME, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(CTRY, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(STATE, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(ICAO, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(LAT, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(LON, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(ELEV, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(begin_date, 17, 8), 16) - (1::bigint << 31), 0))
+ sum(nvl(strtol(substring(END_date, 17, 8), 16) - (1::bigint << 31), 0))
)  as "hash"
from (
select
md5(USAF::text) as USAF
, md5(WBAN::text) as WBAN
, md5(STATION_NAME::text) as STATION_NAME
, md5(CTRY::text) as CTRY
, md5(STATE::text) as STATE
, md5(ICAO::text) as ICAO
, md5(LAT::text) as LAT
, md5(LON::text) as LON
, md5(ELEV::text) as ELEV
, md5(floor(extract(epoch from begin_date))::text) as begin_date
, md5(floor(extract(epoch from END_date))::text) as END_date
from public.gsod_stations
);

You can see that for each column some hash (md5 in this case) needs to be calculated, and how this is done depends on the data type. These hashes are then summed to create a column-level hash, and these column-level hashes are summed to create a table-level hash. (The offset of 1::bigint << 31 is used to help prevent overflow for large tables.) This isn't hard to understand, but creating this SQL for every table is a pain.

This is where jinja2 comes in. With the SQL in a template, and the DDL for the tables feeding that template, jinja can generate the SQL.

The jinja SQL template (in Redshift SQL syntax):

{% for table in tables %}
{%- if table.table_name == target or target is not defined %}
{% set vars = {"first_column":true} %}
select
(
{% for column in table.col_names -%}
{%- if not column.skip_compare -%}
{%- if vars.first_column -%}
{%- if vars.update({"first_column": false}) -%} {%- endif -%}
{%- else -%}
{% raw %}+ {% endraw %} 
{%- endif -%}
sum(nvl(strtol(substring({{column.column_name}}, 17, 8), 16) - (1::bigint << 31), 0))
{%- else -%}
-- skipping {{column.column_name}}
{%- endif %}
{% endfor -%}
)  as "hash"
from (
select
{%- set vars = {"first_column":true} %}
{% for column in table.col_names -%}
{%- if not column.skip_compare -%}
{%- if vars.first_column -%}
{%- if vars.update({"first_column": false}) -%} {%- endif -%}
{%- else -%}
{% raw %}, {% endraw %}
{%- endif -%}
{%- if column.RS_col_type in ["date", "timestamp"] -%}
md5(floor(extract(epoch from {{column.column_name}}))::text) as {{column.column_name}}
{%- elif column.RS_col_type in ["boolean", "bool"] -%}
md5(({{column.column_name}}::int)::text) as {{column.column_name}}
{%- else -%}
md5({{column.column_name}}::text) as {{column.column_name}}
{%- endif -%}
{%- else -%}
-- skipping {{column.column_name}}
{%- endif %}
{% endfor -%}
from {{table.RS_schema}}.{{table.table_name}}
)
;
{% endif %}
{% endfor %}
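As a sketch of how such a template is driven (assuming Python with the third-party jinja2 package installed; the trimmed-down template here is mine, just to show the mechanics):

```python
import json
from jinja2 import Template  # third-party: pip install jinja2

# A cut-down version of the template above; the full template also handles
# skip_compare, booleans, and a target-table filter.  Note that Jinja's
# built-in loop.first can replace the vars = {"first_column": true} bookkeeping.
SQL_TEMPLATE = Template("""\
{% for table in tables -%}
select
(
{% for column in table.col_names -%}
{{ "+ " if not loop.first }}sum(nvl(strtol(substring({{ column.column_name }}, 17, 8), 16) - (1::bigint << 31), 0))
{% endfor -%}
) as "hash"
from (
select
{% for column in table.col_names -%}
{{ ", " if not loop.first }}{% if column.RS_col_type in ["date", "timestamp"] -%}
md5(floor(extract(epoch from {{ column.column_name }}))::text) as {{ column.column_name }}
{%- else -%}
md5({{ column.column_name }}::text) as {{ column.column_name }}
{%- endif %}
{% endfor -%}
from {{ table.RS_schema }}.{{ table.table_name }}
);
{% endfor %}""")

# A tiny DDL description in the same shape as the JSON file used by the process.
ddl = json.loads("""
{"tables": [
  {"table_name": "gsod_stations", "RS_schema": "public",
   "col_names": [
     {"column_name": "USAF",       "RS_col_type": "varchar(10)"},
     {"column_name": "begin_date", "RS_col_type": "date"}]}]}
""")

print(SQL_TEMPLATE.render(tables=ddl["tables"]))
```

Rendering expands each column entry into one `sum(...)` term and one `md5(...)` cast, so adding a column to the JSON automatically extends the generated SQL.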

And an example JSON DDL file (holding the information for 2 tables):

{"tables":  
[
{"table_name":"gsod_stations", 
"RS_schema":"public",
"col_names": [
{"column_name":"USAF", "RS_col_type":"varchar(10)", "RS_col_params":"ENCODE   zstd"},
{"column_name":"WBAN", "RS_col_type":"integer", "RS_col_params":"ENCODE   zstd"},
{"column_name":"STATION_NAME", "RS_col_type":"varchar(80)", "RS_col_params":"ENCODE   zstd"},
{"column_name":"CTRY", "RS_col_type":"varchar(30)", "RS_col_params":"ENCODE   zstd"},
{"column_name":"STATE", "RS_col_type":"varchar(30)", "RS_col_params":"ENCODE   zstd"},
{"column_name":"ICAO", "RS_col_type":"varchar(30)", "RS_col_params":"ENCODE   zstd"},
{"column_name":"LAT", "RS_col_type":"float4", "RS_col_params":"ENCODE   zstd"},
{"column_name":"LON", "RS_col_type":"float4", "RS_col_params":"ENCODE   zstd"},
{"column_name":"ELEV", "RS_col_type":"float4", "RS_col_params":"ENCODE   zstd"},
{"column_name":"begin_date", "RS_col_type":"date", "RS_col_params":"ENCODE   zstd"},
{"column_name":"END_date", "RS_col_type":"date", "RS_col_params":"ENCODE   zstd"}
],
"RS_sort_stmnt":"SORTKEY (USAF,WBAN)",
"RS_dist_stmnt":"DISTKEY (USAF)"},
{"table_name":"gsod_weather_station_data", 
"RS_schema":"public",
"col_names": [
{"column_name":"station_wban", "RS_col_type":"varchar(12)", "RS_col_params":"ENCODE zstd"},
{"column_name":"station_id", "RS_col_type":"varchar(10)", "RS_col_params":"NOT NULL ENCODE zstd"},
{"column_name":"wban", "RS_col_type":"integer", "RS_col_params":"NOT NULL ENCODE zstd"},
{"column_name":"yearmoda", "RS_col_type":"date", "RS_col_params":"NOT NULL ENCODE raw"},
{"column_name":"temp", "RS_col_type":"decimal(8,2)", "RS_col_params":"NULL ENCODE zstd"},
{"column_name":"tcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"dewpoint", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"dcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"slp", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"slpcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"stp", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"stpcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"visibility", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"vcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"wind_speed", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"wcount", "RS_col_type":"integer", "RS_col_params":"ENCODE zstd"},
{"column_name":"max_wind_speed", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"max_wind_gust", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"max_temp", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"maxtflags", "RS_col_type":"char(2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"min_temp", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"mintflags", "RS_col_type":"char(2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"precip", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"pflags", "RS_col_type":"char(2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"snow_depth", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"fog", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"rain", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"snow", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"hail", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"thunder", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"tornado", "RS_col_type":"bool", "RS_col_params":"ENCODE zstd"},
{"column_name":"STATION_NAME", "RS_col_type":"varchar(80)", "RS_col_params":"ENCODE zstd"},
{"column_name":"CTRY", "RS_col_type":"char(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"STATE", "RS_col_type":"char(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"ICAO", "RS_col_type":"char(30)", "RS_col_params":"ENCODE zstd"},
{"column_name":"LAT", "RS_col_type":"decimal(8,3)", "RS_col_params":"ENCODE zstd"},
{"column_name":"LON", "RS_col_type":"decimal(8,3)", "RS_col_params":"ENCODE zstd"},
{"column_name":"ELEV", "RS_col_type":"decimal(8,2)", "RS_col_params":"ENCODE zstd"},
{"column_name":"begin_date", "RS_col_type":"date", "RS_col_params":"ENCODE zstd", "skip_compare":true},
{"column_name":"END_date", "RS_col_type":"date", "RS_col_params":"ENCODE zstd"}
],
"RS_sort_stmnt":"SORTKEY (yearmoda, station_wban)",
"RS_dist_stmnt":"DISTKEY (station_wban)"}
]
}

The generated SQL can be run, creating a signature for each table. You will want to compare these signatures between tables, and possibly between databases. This is exactly why I developed this process. Since the tables/databases you are comparing can vary, how you perform the comparison will vary as well, but it basically comes down to comparing these hash values to see if the tables contain the same content.
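The comparison step itself can be illustrated with a minimal Python sketch (hypothetical helper names, not the in-database implementation): identical contents produce identical signatures regardless of row order, while a single changed value shifts the sum.

```python
import hashlib

def _value_sig(value) -> int:
    # Same transform as the SQL: parse 8 hex chars of md5, offset by 2^31.
    return int(hashlib.md5(str(value).encode()).hexdigest()[16:24], 16) - (1 << 31)

def table_signature(rows, columns) -> int:
    """Sum of per-value signatures over all listed columns -> table-level hash."""
    return sum(_value_sig(row[col]) for row in rows for col in columns)

cols = ["USAF", "WBAN"]
pre = [
    {"USAF": "720110", "WBAN": 3849},
    {"USAF": "720113", "WBAN": 54829},
]
post_same = list(reversed(pre))                       # same content, new order
post_changed = [pre[0], {"USAF": "720113", "WBAN": 99999}]
```

Because SUM() is commutative, the signature is insensitive to row order, so the "pre" and "post" tables don't have to be sorted the same way to compare equal.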

Things to note:

  • While Redshift is very fast at generating md5 values and summing them, other databases are not as fast for large data sets. I have had to "dumb down" the hashes to something much simpler for such databases.
  • Any hash is lossy and there is some finite possibility of two things hashing the same or, in this case, two sums of hashes falsely matching. The odds of this are extremely low, but not zero.
  • MD5 hashes are highly unique, and every part of the hash is also unique, just not to the same degree. By selecting a section of the md5 for the sum, the speed of the operation can be improved while maintaining very high confidence that there are no false comparisons.
  • This process can only indicate "match" or "mismatch". Locating the difference(s) will require some hashing of subsets of the tables and/or direct comparison of table values. In other words, having a hash signature for only the old version of the table won't help you if (when) you need to debug why things mismatch. Archiving the old version of the table for such debugging purposes is advisable.
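To put "extremely low but not zero" in rough numbers (a back-of-the-envelope sketch, not from the original): with a 32-bit slice of the md5, a single changed value leaves the summed table signature unchanged only if its new slice happens to equal the old one.

```python
def single_change_miss_probability(bits: int = 32) -> float:
    """Probability that one changed value leaves the summed signature intact:
    the new hash slice must exactly equal the old one, i.e. 2^-bits."""
    return 2.0 ** -bits

# With the 8-hex-character (32-bit) slice used above, this is about 2.3e-10,
# i.e. roughly one missed change per 4 billion single-value edits.
```

Multiple simultaneous changes can in principle cancel in the sum, but the chance of the deltas summing to exactly zero is of the same tiny order.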