连接列,同时也是两边的分区键,只修剪左边的分区

  • 本文关键字:分区 修剪 左边 连接 postgresql
  • 更新时间 :
  • 英文 :


(参见https://dbfiddle.uk/00lhcAA6。我在版本13、14和15上得到了相同的结果。

取两个分区在一起的表,它们共享相同的概念分区键(created_at)。

CREATE TABLE part_main (
main_id serial,
create_time timestamptz,
main_val int,
primary key (main_id, create_time)
)
PARTITION BY RANGE (create_time);
CREATE TABLE part_other (
other_id serial,
create_time timestamptz,
main_id int,
other_val text,
primary key (other_id, create_time),
foreign key (main_id, create_time) references part_main(main_id, create_time)
)
PARTITION BY RANGE (create_time);
CREATE TABLE part_main_y2023m02 PARTITION OF part_main
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
CREATE TABLE part_main_y2023m03 PARTITION OF part_main
FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
CREATE TABLE part_main_y2023m04 PARTITION OF part_main
FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');
CREATE TABLE part_other_y2023m02 PARTITION OF part_other
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
CREATE TABLE part_other_y2023m03 PARTITION OF part_other
FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');
CREATE TABLE part_other_y2023m04 PARTITION OF part_other
FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');

和一些测试数据

insert into part_main (create_time, main_val) values ('2023-04-02', 10);
insert into part_other (main_id, create_time, other_val) select main_id, create_time, 'foo' from part_main;

当通过(main_id, create_time)连接这些表并在create_time上过滤时,仅从左侧表中修剪分区。在下面的计划中,可以看到只扫描了一个part_main分区,而不是从part_other扫描了所有三个分区。

explain analyze 
select * from part_main join part_other using (main_id, create_time)
where create_time >= '2023-04-01';
Merge Join  (cost=326.83..356.03 rows=50 width=52) (actual time=0.132..0.135 rows=1 loops=1)
  Merge Cond: ((part_main.main_id = part_other.main_id) AND (part_main.create_time = part_other.create_time))
  ->  Sort  (cost=61.72..63.26 rows=617 width=16) (actual time=0.083..0.083 rows=1 loops=1)
        Sort Key: part_main.main_id, part_main.create_time
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on part_main_y2023m04 part_main  (cost=0.00..33.12 rows=617 width=16) (actual time=0.012..0.013 rows=1 loops=1)
              Filter: (create_time >= '2023-04-01 00:00:00+01'::timestamp with time zone)
  ->  Sort  (cost=265.11..273.13 rows=3210 width=48) (actual time=0.044..0.045 rows=1 loops=1)
        Sort Key: part_other.main_id, part_other.create_time
        Sort Method: quicksort  Memory: 25kB
        ->  Append  (cost=0.00..78.15 rows=3210 width=48) (actual time=0.019..0.021 rows=1 loops=1)
              ->  Seq Scan on part_other_y2023m02 part_other_1  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.004..0.004 rows=0 loops=1)
              ->  Seq Scan on part_other_y2023m03 part_other_2  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.003..0.003 rows=0 loops=1)
              ->  Seq Scan on part_other_y2023m04 part_other_3  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.011..0.011 rows=1 loops=1)

分别过滤左侧和右侧的created_at列,生成预期的计划。

explain analyze 
select * from part_main join part_other using (main_id, create_time)
where part_main.create_time >= '2023-04-01' and part_other.create_time >= '2023-04-01';
Merge Join  (cost=100.23..107.60 rows=6 width=52) (actual time=0.086..0.088 rows=1 loops=1)
  Merge Cond: ((part_main.main_id = part_other.main_id) AND (part_main.create_time = part_other.create_time))
  ->  Sort  (cost=61.72..63.26 rows=617 width=16) (actual time=0.026..0.026 rows=1 loops=1)
        Sort Key: part_main.main_id, part_main.create_time
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on part_main_y2023m04 part_main  (cost=0.00..33.12 rows=617 width=16) (actual time=0.014..0.015 rows=1 loops=1)
              Filter: (create_time >= '2023-04-01 00:00:00+01'::timestamp with time zone)
  ->  Sort  (cost=38.51..39.40 rows=357 width=48) (actual time=0.056..0.056 rows=1 loops=1)
        Sort Key: part_other.main_id, part_other.create_time
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on part_other_y2023m04 part_other  (cost=0.00..23.38 rows=357 width=48) (actual time=0.011..0.011 rows=1 loops=1)
              Filter: (create_time >= '2023-04-01 00:00:00+01'::timestamp with time zone)

所以它看起来像我有一个变通与第二个查询风格,但我有几个问题。

  1. 我不知道第一种和第二种风格之间是否存在语义差异?也就是说,where create_time >= '2023-04-01'where part_main.create_time >= '2023-04-01' and part_other.create_time >= '2023-04-01'是否有可能产生不同的结果?
  2. 如果没有,我是否可以尝试让规划师以更简洁的风格制作理想的计划?

1)没有区别。连接谓词using (main_id, create_time)

的简写。
on part_main.main_id = part_other.main_id and  
part_main.create_time = part_other.create_time

由于这是来自part_main.create_time = part_other.create_time内部连接,使用逻辑规则意味着

part_main.create_time >= '2023-03-01'等价于part_other.create_time >= '2023-03-01'

所以添加它不会改变结果。

2)但是不需要添加第二个条件。

正如文档所说,enable_partitionwise_join的默认值是false,这可能是您的情况。

设置为SET enable_partitionwise_join = true;后,您将看到执行计划只选择相关分区

explain analyze 
select * from part_main join part_other using (main_id, create_time)
where create_time >= '2023-04-01';
Merge Join  (cost=136.26..149.08 rows=17 width=52) (actual time=0.032..0.033 rows=1 loops=1)
Merge Cond: ((part_main.main_id = part_other.main_id) AND (part_main.create_time = part_other.create_time))
->  Sort  (cost=61.72..63.26 rows=617 width=16) (actual time=0.012..0.013 rows=1 loops=1)
Sort Key: part_main.main_id, part_main.create_time
Sort Method: quicksort  Memory: 25kB
->  Seq Scan on part_main_y2023m04 part_main  (cost=0.00..33.12 rows=617 width=16) (actual time=0.007..0.008 rows=1 loops=1)
Filter: (create_time >= '2023-04-01 00:00:00+02'::timestamp with time zone)
->  Sort  (cost=74.54..77.21 rows=1070 width=48) (actual time=0.015..0.016 rows=1 loops=1)
Sort Key: part_other.main_id, part_other.create_time
Sort Method: quicksort  Memory: 25kB
->  Seq Scan on part_other_y2023m04 part_other  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.007..0.008 rows=1 loops=1)
Planning Time: 0.216 ms
Execution Time: 0.052 ms 

所以不需要扩展谓词,另外,使用分区的连接可以获得更好的性能(因为只有相应的分区是一对一连接的;而不是修剪分区,附加它们并连接它们)。

最后,让我对您的primary key (other_id, create_time)做一些评论——显然只有第一列是PK,第二列是为了启用分区而添加的。这就引出了PK是否有意义的问题。

这两个查询之间没有逻辑上的区别,但是有人可能会说有语义上的区别。第一个查询说通过main_id和create_time连接两个表,并在表的create_time列之一上过滤(因为对create_time的引用是不合格的,PostgreSQL可以选择任意一个表)。PostgreSQL对main_id和create_time匹配的一般情况执行连接,但不利用第二个表的分区中哪个是可行的传递逻辑限制。

第二个查询说通过main_id和create_time连接两个表,但也通过create_time显式过滤。最终的结果是一样的,但是如何到达那里的说明是不同的。

没有任何实际的方法来强迫计划者选择一个特定的计划。一般来说,不应该强迫计划者做出选择。目标应该是设计一个环境,使规划人员能够找到一个有效的计划,尽管不一定是最优的计划。

最新更新