在尝试从几个连接的表中提取大量列(~15-20(时,我整理了2个视图来提取必要的信息。但是,在我的本地数据库中(只有 ~1kposts
行(,连接这些视图工作正常;当我在生产数据库(~30kposts
行(上创建这些相同的视图并尝试联接该视图时,我意识到该解决方案无法扩展到测试数据集之外。
我尝试将这两个视图(类别数据(如categories.title
(和创作者数据(如users.display_name
(迁移到 CTEpost_data
,理论上,该视图将充当这些视图的关键版本,并允许我获取符合条件的帖子的所有帖子数据。
我整理了一个示例 DBFiddle 和一些测试数据来解释表结构。实际数据具有更多的列,但这代表了构建查询所需的联接。
table : posts
+-----+-----------+------------+------------------------------------------+----------------------------------------+
| id | parent_id | created_by | message | attachments |
+-----+-----------+------------+------------------------------------------+----------------------------------------+
| 8 | NULL | 8 | laptop for sale | [{"media_id": 1380}] |
| 9 | NULL | 4 | NEW lamp shade up for grabs | [{"media_id": 1442}, {"link_id": 103}] |
| 10 | 1 | 7 | Oooh I could be interested | |
| 11 | 1 | 7 | DMing you now! I've been looking for one | |
+-----+-----------+------------+------------------------------------------+----------------------------------------+
table : users
+----+------------------+---------------------------+
| id | display_name | created_at |
+----+------------------+---------------------------+
| 1 | John Appleseed | 2018-02-20T00:00:00+00:00 |
| 2 | Massimo Jenkins | 2018-05-14T00:00:00+00:00 |
| 3 | Johanna Marionna | 2018-06-05T00:00:00+00:00 |
| 4 | Jackson Creek | 2018-11-15T00:00:00+00:00 |
| 5 | Joe Schmoe | 2019-01-09T00:00:00+00:00 |
| 6 | John Johnson | 2019-02-14T00:00:00+00:00 |
| 7 | Donna Madison | 2019-05-14T00:00:00+00:00 |
| 8 | Jenna Kaplan | 2019-06-23T00:00:00+00:00 |
+----+------------------+---------------------------+
table : categories
+----+------------+------------+-------------------------------------------------------+
| id | created_by | title | description |
+----+------------+------------+-------------------------------------------------------+
| 1 | 2 | Technology | Anything tech; Consumer, business or education tools! |
| 2 | 2 | Home Goods | Anything for the home |
+----+------------+------------+-------------------------------------------------------+
table : categories_posts
+---------+-------------+
| post_id | category_id |
+---------+-------------+
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 1 |
+---------+-------------+
table : users_categories
+---------+-------------+
| user_id | category_id |
+---------+-------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
+---------+-------------+
table : posts_removed
+---------+----------------------+------------+
| post_id | removed_at | removed_by |
+---------+----------------------+------------+
| 10 | 2019-01-22 09:08:14 | 7 |
+---------+----------------------+------------+
在下面的查询中,符合条件的帖子在基本SELECT
中确定;然后,post_data CTE 连接到结果集(限制为 25 行(,并返回 CTE 中的所有列。
WITH post_data AS (
SELECT posts.id,
posts.parent_id,
posts.created_by,
posts.attachments,
categories_posts.category_id,
categories.title,
categories.created_by AS category_created_by,
creator.display_name AS creator_display_name,
creator.created_at AS creator_created_at
/* ... And a whole bunch of other fields from posts, categories_posts, users */
FROM posts
LEFT OUTER JOIN categories_posts
ON categories_posts.post_id = posts.id
LEFT OUTER JOIN categories
ON categories.id = categories_posts.category_id
LEFT OUTER JOIN users creator
ON creator.id = posts.created_by
/* ... And a whole bunch of other joins to facilitate the selected fields */
)
SELECT post_data.*
FROM posts
/* Set up the criteria for the posts selected before getting their data from the CTE */
LEFT OUTER JOIN posts_removed removed ON removed.post_id = posts.id
LEFT OUTER JOIN users user_me ON user_me.id = "1"
LEFT OUTER JOIN users_followed ON users_followed.user_id = posts.created_by
AND users_followed.followed_by = user_me.id
LEFT OUTER JOIN categories_posts ON categories_posts.post_id = posts.id
LEFT OUTER JOIN users_categories ON users_categories.category_id = categories_posts.category_id
LEFT OUTER JOIN posts_removed pp_removed ON pp_removed.post_id = posts.parent_id
/* Join our post_data on the post's ID */
JOIN post_data ON post_data.id = posts.id
WHERE
(
(
users_categories.user_id = user_me.id AND users_categories.left_at IS NULL
) OR categories_posts.category_id IS NULL
) AND (
posts.created_by = user_me.id
OR users_followed.followed_by = user_me.id
OR categories_posts.category_id IS NOT NULL
) AND removed.removed_at IS NULL
AND pp_removed.removed_at IS NULL
AND (post_data.id = posts.id OR post_data.id = posts.parent_id)
ORDER BY posts.id DESC
LIMIT 25
从理论上讲,我认为这是通过根据基本选择条件选择行,然后根据 Post ID 对 CTE 进行索引扫描来工作的;但是,查询优化器似乎选择对posts
表进行全表扫描。
EXPLAIN SELECT
给了我这个信息:
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | extra |
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
| 1 | PRIMARY | posts | ALL | PRIMARY,parent_id,created_by | | | | 33870 | 100 | Using temporary; Using filesort |
| 1 | PRIMARY | removed | eq_ref | PRIMARY | PRIMARY | 8 | posts.id | 1 | 19 | Using where |
| 1 | PRIMARY | user_me | const | PRIMARY | PRIMARY | 8 | const | 1 | 100 | Using where; Using index |
| 1 | PRIMARY | categories_posts | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.id | 1 | 100 | |
| 1 | PRIMARY | categories | eq_ref | PRIMARY | PRIMARY | 8 | categories_posts.category_id | 1 | 100 | Using index |
| 1 | PRIMARY | users_categories | eq_ref | user_id_2,user_id,category_id | user_id_2 | 16 | user_me.id,api.categories_posts.category_id | 1 | 100 | Using where |
| 1 | PRIMARY | users_followed | eq_ref | user_id,followed_by | user_id | 16 | posts.created_by,api.user_me.id | 1 | 100 | Using where; Using index |
| 1 | PRIMARY | pp_removed | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.parent_id | 1 | 19 | Using where |
| 1 | PRIMARY | <derived2> | ALL | | | | | 493911 | 19 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | posts | ALL | | | | | 33870 | 100 | Using temporary |
| 2 | DERIVED | categories_posts | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.id | 1 | 100 | |
| 2 | DERIVED | categories | eq_ref | PRIMARY | PRIMARY | 8 | api.categories_posts.category_id | 1 | 100 | |
| 2 | DERIVED | posts_votes | ref | post_id | post_id | 8 | api.posts.id | 1 | 100 | Using index |
| 2 | DERIVED | pp | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.parent_id | 1 | 100 | |
| 2 | DERIVED | pp_removed | eq_ref | PRIMARY | PRIMARY | 8 | api.pp.id | 1 | 100 | Using index |
| 2 | DERIVED | removed | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.id | 1 | 100 | Using index |
| 2 | DERIVED | creator | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.created_by | 1 | 100 | |
| 2 | DERIVED | usernames | ref | user_id | user_id | 8 | api.creator.id | 1 | 100 | |
| 2 | DERIVED | verifications | ALL | | | | | 4 | 100 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | categories_identifiers | ref | category_id | category_id | 8 | api.categories.id | 1 | 100 | |
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
除此之外,我尝试重构我的查询以尝试在posts
表中强制使用键,例如在选择中使用FORCE INDEX(PRIMARY)
,并将 CTE 移动为基本查询并添加过滤器WHERE id IN ({the original base query})
,但似乎优化器仍然执行全表扫描。
如果解码查询计划中发生的情况很有帮助:
- 在撰写本文时,有33,387
posts
行,但查询计划显示 - 查询计划显示返回33,870行的完整表扫描
- 查询计划还显示派生表 (
<derived2>
( 具有493,911行
我的核心问题是:
当我说子查询应该只从基本选择查询的每个结果行执行一次时,我是否正确?如果是这样,那么 CTE 也应该在
posts.id
上使用 JOIN,并可能使用表索引?为什么查询计划显示它只选择 33,870 行,而只有33,387行?493,911行从何而来?
在这种情况下,如何防止全表扫描?
试一试... 在JOINing
WITH
之前执行LIMIT 25
:
SELECT * FROM
( SELECT ... FROM posts
JOIN categories_posts ...
ORDER BY posts.id DESC
LIMIT 25 ) AS x
JOIN post_data
ON post_data.id IN (x.id, x.parent_id)
ORDER BY posts.id DESC