(Postgres version 11.2)
I have the following table structure:
CREATE TABLE site_tally
(
id serial,
dt_created timestamp WITHOUT TIME ZONE DEFAULT now() NOT NULL,
dt_updated timestamp WITHOUT TIME ZONE,
geo text NOT NULL,
dt_tally date NOT NULL,
parent_site text NOT NULL,
site_id integer NOT NULL,
tracked boolean NOT NULL,
utm_parameter_id integer NOT NULL,
device text NOT NULL,
layout_id integer NOT NULL,
views integer DEFAULT 0,
PRIMARY KEY (id, geo)
) PARTITION BY LIST (geo);
CREATE UNIQUE INDEX site_tally_uindex
ON site_tally (geo, dt_tally, parent_site, site_id, tracked, utm_parameter_id, device, layout_id);
CREATE TABLE site_tally_uk PARTITION OF site_tally FOR VALUES IN ('UK');
CREATE TABLE site_tally_us PARTITION OF site_tally FOR VALUES IN ('US');
CREATE TABLE site_tally_au PARTITION OF site_tally FOR VALUES IN ('AU');
CREATE TABLE utm_parameters
(
id serial NOT NULL PRIMARY KEY,
dt_created timestamp DEFAULT now() NOT NULL,
source text DEFAULT 'default'::text NOT NULL,
medium text DEFAULT 'default'::text NOT NULL,
campaign text DEFAULT 'default'::text NOT NULL,
term text DEFAULT 'default'::text NOT NULL,
content text DEFAULT 'default'::text NOT NULL
);
CREATE UNIQUE INDEX utm_parameters_source_medium_campaign_term_content_uindex
ON utm_parameters (source, medium, campaign, term, content);
site_tally is partitioned purely for performance reasons, since we never need to query across more than one geo at a time.
I've hit an edge case where one query takes a very long time to run:
SELECT SUM(views) AS views,
term AS utm
FROM site_tally
INNER JOIN utm_parameters ON (utm_parameters.id = utm_parameter_id)
WHERE geo = 'UK'
AND dt_tally >= '2019-08-01'
AND dt_tally <= '2019-08-31'
AND parent_site = 'site1'
AND source = 'source1'
AND medium = 'medium1'
AND campaign = 'campaign1'
AND tracked = FALSE
GROUP BY source,
medium,
campaign,
term;
EXPLAIN ANALYZE:
GroupAggregate (cost=1.11..12152.56 rows=1 width=74) (actual time=88.064..163032.380 rows=351 loops=1)
" Group Key: utm_parameters.source, utm_parameters.medium, utm_parameters.campaign, utm_parameters.term"
-> Nested Loop (cost=1.11..12152.53 rows=1 width=70) (actual time=59.993..163025.340 rows=15823 loops=1)
-> Index Scan using utm_parameters_source_medium_campaign_term_content_uindex on utm_parameters (cost=0.55..8.57 rows=1 width=70) (actual time=0.024..39.883 rows=5994 loops=1)
Index Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text) AND (campaign = 'campaign1'::text))
-> Append (cost=0.56..12143.95 rows=1 width=8) (actual time=26.022..27.188 rows=3 loops=5994)
-> Index Scan using site_tally_uk_geo_dt_tally_parent_site_site_id_tracked_utm__idx on site_tally_uk (cost=0.56..12143.95 rows=1 width=8) (actual time=26.020..27.185 rows=3 loops=5994)
Index Cond: ((geo = 'UK'::text) AND (dt_tally >= '2019-08-01'::date) AND (dt_tally <= '2019-08-31'::date) AND (parent_site = 'site1'::text) AND (tracked = false) AND (utm_parameter_id = utm_parameters.id))
Filter: (NOT tracked)
Planning Time: 0.693 ms
Execution Time: 163032.762 ms
In this particular case there are many terms to group by; a query without term behaves very differently:
SELECT SUM(views) AS views,
campaign AS utm
FROM site_tally
INNER JOIN utm_parameters ON (utm_parameters.id = utm_parameter_id)
WHERE geo = 'UK'
AND dt_tally >= '2019-08-01'
AND dt_tally <= '2019-08-31'
AND parent_site = 'site1'
AND source = 'source1'
AND medium = 'medium1'
AND tracked = FALSE
GROUP BY source,
medium,
campaign;
EXPLAIN ANALYZE:
GroupAggregate (cost=87129.06..87129.13 rows=3 width=48) (actual time=54.451..54.451 rows=1 loops=1)
" Group Key: utm_parameters.source, utm_parameters.medium, utm_parameters.campaign"
-> Sort (cost=87129.06..87129.07 rows=3 width=44) (actual time=50.572..51.398 rows=15823 loops=1)
Sort Key: utm_parameters.campaign
Sort Method: quicksort Memory: 2610kB
-> Hash Join (cost=1583.46..87129.04 rows=3 width=44) (actual time=11.359..46.521 rows=15823 loops=1)
Hash Cond: (site_tally_uk.utm_parameter_id = utm_parameters.id)
-> Append (cost=1322.54..86645.61 rows=84764 width=8) (actual time=4.268..31.765 rows=53612 loops=1)
-> Bitmap Heap Scan on site_tally_uk (cost=1322.54..86221.79 rows=84764 width=8) (actual time=4.267..28.157 rows=53612 loops=1)
Recheck Cond: ((dt_tally <= '2019-08-31'::date) AND (geo = 'UK'::text) AND (dt_tally >= '2019-08-01'::date) AND (parent_site = 'site1'::text) AND (NOT tracked))
Heap Blocks: exact=5237
-> Bitmap Index Scan on site_tally_uk_geo_dt_tally_parent_site_tracked_idx (cost=0.00..1301.35 rows=84764 width=0) (actual time=3.519..3.519 rows=53612 loops=1)
Index Cond: (dt_tally <= '2019-08-31'::date)
-> Hash (cost=260.09..260.09 rows=66 width=44) (actual time=7.083..7.084 rows=5994 loops=1)
Buckets: 8192 (originally 1024) Batches: 1 (originally 1) Memory Usage: 556kB
-> Bitmap Heap Scan on utm_parameters (cost=5.23..260.09 rows=66 width=44) (actual time=1.346..5.862 rows=5994 loops=1)
Recheck Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text))
Heap Blocks: exact=2655
-> Bitmap Index Scan on utm_parameters_source_medium_campaign_term_content_uindex (cost=0.00..5.21 rows=66 width=0) (actual time=0.991..0.992 rows=5994 loops=1)
Index Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text))
Planning Time: 0.571 ms
Execution Time: 54.773 ms
Note: site_tally has more integer-based columns (after the views column) that are also selected as SUM values. I decided to leave them out of the question, as it is already a long one!
Ideally I'd like to speed this query up, so I tried an alternative indexing strategy:
CREATE INDEX testing ON site_tally (geo, dt_tally, parent_site, tracked)
WHERE geo='UK' and dt_tally >= '2019-08-01' and parent_site='site1' and tracked=FALSE;
The query planner won't choose this index, even when I try to make my query more specific with something like dt_tally > '2019-07-31'.
At this point I can't change the unique index on site_tally (other queries rely on that particular column order).
I'd like to understand what is really going on in this query (I'm not too well versed in reading EXPLAIN output).
The estimate for the index scan on utm_parameters is way off.
First, try a simple
ANALYZE utm_parameters;
and see if that does the trick.
If that does not improve things, try collecting more detailed statistics:
ALTER TABLE utm_parameters
    ALTER source SET STATISTICS 1000,
    ALTER medium SET STATISTICS 1000,
    ALTER campaign SET STATISTICS 1000;
ANALYZE utm_parameters;
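To check that the new statistics target actually took effect, you can query the standard pg_attribute catalog (a quick sanity check, nothing specific to your schema):
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'utm_parameters'::regclass
  AND attname IN ('source', 'medium', 'campaign');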
If that does not improve the estimates either, the problem is probably correlation between the columns. Try creating extended statistics:
CREATE STATISTICS utm_parameters_stats (dependencies)
    ON source, medium, campaign FROM utm_parameters;
ANALYZE utm_parameters;
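If you are curious about the functional-dependency coefficients PostgreSQL collected, you can look at the pg_statistic_ext catalog (this is where PostgreSQL 11 keeps them; in later major versions the collected data moved to a separate catalog):
SELECT stxname, stxdependencies
FROM pg_statistic_ext
WHERE stxname = 'utm_parameters_stats';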
It seems that the last option did the trick for you. So what was going on?
PostgreSQL has good statistics for estimating the selectivity of a condition of the form
column = value
Say each of the three conditions has a selectivity of 0.1, that is, each one filters out 90% of the rows. Knowing no better, PostgreSQL assumes that the conditions are statistically independent, so it estimates the combined selectivity of all three conditions as 0.1 * 0.1 * 0.1 = 0.001. That is how the planner can end up expecting a single row from utm_parameters (the rows=1 in your first plan) when the scan actually returns 5994 rows.
Now it happens that the conditions are not independent: for example, if two rows have the same campaign, they are quite likely to have the same medium as well. So PostgreSQL's estimate ends up far below reality. This low estimate makes PostgreSQL choose a nested loop join, which is the best access path for a small outer table. But if the outer table is big, a nested loop join performs terribly. That is why fixing the estimate improves performance.
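To confirm the effect, you can re-run the slow query under EXPLAIN ANALYZE once the extended statistics are in place; the row estimate for utm_parameters should be much larger and the plan should look more like your fast second query (a hash join instead of a nested loop). The exact plan will of course depend on your data:
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(views) AS views,
       term AS utm
FROM site_tally
INNER JOIN utm_parameters ON (utm_parameters.id = utm_parameter_id)
WHERE geo = 'UK'
  AND dt_tally >= '2019-08-01'
  AND dt_tally <= '2019-08-31'
  AND parent_site = 'site1'
  AND source = 'source1'
  AND medium = 'medium1'
  AND campaign = 'campaign1'
  AND tracked = FALSE
GROUP BY source, medium, campaign, term;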