我这里有一个星型模式,我正在查询事实数据表,并希望加入一个非常小的维度表。我真的无法解释以下内容:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id
FROM bi.impressions imp
GROUP BY imp.os_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=868719.08..868719.24 rows=16 width=10) (actual time=12559.462..12559.466 rows=26 loops=1)
-> Seq Scan on impressions imp (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.009..3030.093 rows=35682474 loops=1)
Total runtime: 12559.523 ms
(3 rows)
这需要 ~12600ms,但当然没有连接的数据,所以我无法将imp.os_id"解析"为有意义的内容,所以我添加一个连接:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp, bi.os_desc os
WHERE imp.os_id=os.os_id
GROUP BY imp.os_id, os.os_desc;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1448560.83..1448564.99 rows=416 width=22) (actual time=25565.124..25565.127 rows=26 loops=1)
-> Hash Join (cost=1.58..1180942.29 rows=35682472 width=22) (actual time=0.046..15157.684 rows=35682474 loops=1)
Hash Cond: (imp.os_id = os.os_id)
-> Seq Scan on impressions imp (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.007..3705.647 rows=35682474 loops=1)
-> Hash (cost=1.26..1.26 rows=26 width=14) (actual time=0.028..0.028 rows=26 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 2kB
-> Seq Scan on os_desc os (cost=0.00..1.26 rows=26 width=14) (actual time=0.003..0.010 rows=26 loops=1)
Total runtime: 25565.199 ms
(8 rows)
这有效地使查询的执行时间加倍。我的问题是,我从图片中遗漏了什么?我认为这么小的查找不会对查询执行时间造成巨大差异。
使用(推荐的)显式 ANSI JOIN 语法重写:
SELECT COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os ON os.os_id = imp.os_id
GROUP BY imp.os_id, os.os_desc;
首先,如果展示次数中每一行在os_desc
中找到的匹配项多于或少于一个,则您的第二个查询可能是错误的。
如果对os_id
具有外键约束(保证引用完整性)以及对bi.impressions.os_id
的NOT NULL
约束,则可以排除这种情况。如果是这样,在第一步中,简化为:
SELECT COUNT(*) AS ct, imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os USING (os_id)
GROUP BY imp.os_id, os.os_desc;
如果列NOT NULL
,count(*)
比 count(column)
快,在这里等效。并为计数添加列别名。
更快,但:
SELECT os_id, os.os_desc, sub.ct
FROM (
SELECT os_id, COUNT(*) AS ct
FROM bi.impressions
GROUP BY 1
) sub
JOIN bi.os_desc os USING (os_id)
先聚合,后加入。更多在这里:
- 聚合查询中具有多列的单个列
- PostgreSQL - 按数组排序
HashAggregate (cost=868719.08..868719.24 rows=16 width=10)
HashAggregate (cost=1448560.83..1448564.99 rows=416 width=22)
嗯,宽度从 10 到 22 是翻倍。 也许您应该在分组后而不是之前加入?
以下查询在不增加查询执行时间的情况下解决了该问题。问题仍然存在,为什么添加非常简单的连接会显着增加执行时间,但这可能是Postgres的特定问题,在该领域具有丰富经验的人最终可能会回答它。
WITH
OSES AS (SELECT os_id,os_desc from bi.os_desc)
SELECT
COUNT(impression_id) as imp_count,
os_desc FROM bi.impressions imp,
OSES os
WHERE
os.os_id=imp.os_id
GROUP BY os_desc
ORDER BY imp_count;