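I have two groups and I am trying to find their intersection (two columns have to match). I found that joining two temporary tables is about 50 times slower than joining a single temporary table against the original table. This makes no sense to me; maybe someone can enlighten me?
This is the version with two temporary tables: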
CREATE TEMPORARY TABLE attendees (
event_id SMALLINT(5) UNSIGNED,
person_id INT(10) UNSIGNED NOT NULL,
KEY(event_id),
KEY(person_id)
);
INSERT INTO attendees (event_id, person_id)
SELECT event_id, person_id
FROM attendance WHERE year=2013
GROUP BY event_id, person_id;
CREATE TEMPORARY TABLE invitees (
event_id SMALLINT(5) UNSIGNED,
person_id INT(10) UNSIGNED NOT NULL,
KEY(event_id),
KEY(person_id)
);
INSERT INTO invitees (event_id, person_id)
SELECT event_id, person_id
FROM invitations WHERE year=2013
GROUP BY event_id, person_id;
SELECT i.event_id, COUNT(DISTINCT i.person_id)
FROM attendees AS a
INNER JOIN invitees AS i
ON a.person_id = i.person_id AND a.event_id = i.event_id
GROUP BY i.event_id;
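Each of these two temporary tables has fewer than 2,000 rows, yet the final query takes 2.5 seconds on my laptop. I don't understand how that is possible.
On the other hand, with the implementation below the final query takes only 0.05 seconds, even though it goes against the full invitations table (~100,000 rows):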
CREATE TEMPORARY TABLE attendees (
event_id SMALLINT(5) UNSIGNED,
person_id INT(10) UNSIGNED NOT NULL,
KEY(event_id),
KEY(person_id)
);
INSERT INTO attendees (event_id, person_id)
SELECT event_id, person_id
FROM attendance WHERE year=2013
GROUP BY event_id, person_id;
SELECT i.event_id, COUNT(DISTINCT i.person_id)
FROM attendees AS a
INNER JOIN invitations AS i
ON a.person_id = i.person_id AND a.event_id = i.event_id
WHERE i.year=2013
GROUP BY i.event_id;
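It is worth noting that the original tables (attendance and invitations) both have indexes on event_id, person_id and year. The reason I am writing such convoluted code in the first place is that some people attended events they were not invited to, and I have to compute counts for every part of the Venn diagram (attended and invited, attended but not invited, invited but did not attend, and neither).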
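For reference, one region of that Venn diagram (attended but not invited) could be counted with an anti-join along the following lines. This is only a sketch: it assumes the attendees and invitees temporary tables defined above, relies on their rows already being de-duplicated by the GROUP BY in the INSERT, and the column alias attended_not_invited is made up for illustration.

```sql
-- Sketch: per-event count of people who attended without an invitation.
-- Assumes the attendees / invitees temporary tables from this post.
SELECT a.event_id, COUNT(*) AS attended_not_invited
FROM attendees AS a
LEFT JOIN invitees AS i
  ON i.event_id = a.event_id AND i.person_id = a.person_id
-- invitees.person_id is declared NOT NULL, so a NULL here can only mean
-- that no matching invitation row was found.
WHERE i.person_id IS NULL
GROUP BY a.event_id;
```

Swapping the two tables should give the "invited but did not attend" region in the same way. The "neither" region needs a list of all people, which is not shown in this post.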
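I guess my question is: what is going on here that makes the second version so much faster?
My server version is 5.5.36 MySQL Community Server (5.6 had several odd behaviors that broke my site).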
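Only one index can be used per table for the join. Instead of separate indexes on person_id and event_id, create a composite index covering both columns on each of the two tables: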
CREATE TEMPORARY TABLE attendees (
event_id SMALLINT(5) UNSIGNED,
person_id INT(10) UNSIGNED NOT NULL,
KEY(event_id, person_id)
);
CREATE TEMPORARY TABLE invitees (
event_id SMALLINT(5) UNSIGNED,
person_id INT(10) UNSIGNED NOT NULL,
KEY(event_id, person_id)
);
I suspect the original invitations table has an index like this, which makes the join against that table much faster.
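One way to check this suspicion rather than take it on faith is to look at the query plan and at the indexes that actually exist. The statements below are a minimal sketch using the table names from the question; what they print depends on your data and server, so no output is shown here.

```sql
-- Show which index (if any) the optimizer picks for each table in the slow join.
EXPLAIN
SELECT i.event_id, COUNT(DISTINCT i.person_id)
FROM attendees AS a
INNER JOIN invitees AS i
  ON a.person_id = i.person_id AND a.event_id = i.event_id
GROUP BY i.event_id;

-- List the indexes on the original table to see whether a composite
-- index spanning event_id and person_id is already there.
SHOW INDEX FROM invitations;
```

In the EXPLAIN output, the key and ref columns show which index is used and how many columns are matched against it. With the single-column keys I would expect only one of the two join columns to appear in ref for the second table, and both to appear once the composite key is in place.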