我的行约为3400万行,每行都有23列在TPC-DS数据集中的store_sales
表中。
i具有ss_item_sk
和ss_ticket_number
列的composite primary key
。
运行查询SELECT count(DISTINCT <primary key>) ..
后,我可以看到它输出了表中存在的行总数。
现在,我将另一列与primary key
一起添加,即ss_sold_date_sk
。
之后,如果我运行了count
查询,我会比以前打印的行数 。有人可以通过示例向我解释为什么会发生这种情况?
tl; dr
什么时候将列添加到复合主键会停止使其独一无二?
demo
create table mytable (c1 string,c2 string);
insert into mytable values ('A','A'),('B',null),('C',null);
select count(distinct c1) as count_distinct from mytable;
+----------------+
| count_distinct |
+----------------+
| 3 |
+----------------+
正如预期的-3个不同的值 - 'a','b'和'c'
select count(distinct concat(c1,c2)) as count_distinct from mytable;
+----------------+
| count_distinct |
+----------------+
| 1 |
+----------------+
正如预期的。为什么? - 请参阅下一个查询
select c1,c2,concat(c1,c2) as concat_c1_c2 from mytable;
+----+------+--------------+
| c1 | c2 | concat_c1_c2 |
+----+------+--------------+
| A | A | AA |
| B | NULL | NULL |
| C | NULL | NULL |
+----+------+--------------+
与无效的串联得出无效
select count(distinct c1,c2) as count_distinct from mytable;
+----------------+
| count_distinct |
+----------------+
| 1 |
+----------------+
bug !!
这是一个工作的错误:
select count(distinct struct(c1,c2)) as count_distinct from mytable;
+----------------+
| count_distinct |
+----------------+
| 3 |
+----------------+