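Edit:
Since you suggested creating separate tables for the player and tournament names and replacing the strings with foreign keys, I did the following: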
SELECT DISTINCT tournament INTO tournaments FROM chess_data2
ALTER TABLE tournaments ADD COLUMN id SERIAL PRIMARY KEY
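I repeated this for namew and nameb, and then tried to replace the strings with foreign keys. This is where it gets tricky: I can't get it done in a reasonable amount of time.
I tried the following two approaches:
1) Drop the existing indexes
2) Create separate indexes on namew, nameb, and tournament
3) Run a query that inserts the data I want into a new table: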
SELECT date, whiterank, blackrank, t_round, result,
(SELECT p.id FROM players p WHERE c_d2.namew = p.name) AS whitep,
(SELECT p2.id FROM players p2 WHERE c_d2.nameb = p2.name) AS blackp,
(SELECT t.id FROM tournaments t WHERE c_d2.tournament = t.t_name) AS tournament
INTO final_chess from chess_data2 c_d2
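Unfortunately, it was very slow, so I went back to user Boris Shchegolev. In a comment, he suggested adding a new column to the existing table chess_data2 and updating it. So I did:
ALTER TABLE chess_data2 ADD COLUMN namew_id INTEGER
UPDATE chess_data2 cd2 SET namew_id = (SELECT id FROM players WHERE name = cd2.namew)
I started these queries half an hour ago. The first one was instant, but the second is taking a very long time.
What should I do now?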
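Initial question:
Database schema:
date DATE
namew TEXT
nameb TEXT
whiterank INTEGER
blackrank INTEGER
tournament TEXT
t_round INTEGER
result REAL
id BIGINT
chess_data2_pkey (id)
black_index (nameb, tournament, date)
chess_data2_pkey (id) UNIQUE
w_b_t_d_index (namew, nameb, tournament, date)
white_index (namew, tournament, date)
Problem:
The following statement performs very well (about 60-70 seconds on a database with 3 million entries):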
-- Number of points that the white player has so far accrued throughout the tournament
SELECT
(SELECT coalesce(SUM(result),0) from chess_data2 t2
where (t1.namew = t2.namew) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(SUM(1-result),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS result_in_t_w
from chess_data2 t1
Meanwhile, the select below (with exactly the same WHERE clause) takes a very long time to compute.
-- Number of games that the white player has so far played in the tournament
SELECT
(SELECT coalesce(count(*),0) from chess_data2 t2 where (t1.namew = t2.namew) and
t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(count(*),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1
I tried a different approach (using sum), but it didn't get any better:
-- Number of games that the white player has so far played in the tournament
SELECT
(SELECT coalesce(sum(1),0) from chess_data2 t2 where (t1.namew = t2.namew) and
t1.tournament = t2.tournament and t1.date > t2.date and t1.date < t2.date + 90)
+ (SELECT coalesce(sum(1),0) from chess_data2 t2
where (t1.namew = t2.nameb) and t1.tournament = t2.tournament
and t1.date > t2.date and t1.date < t2.date + 90) AS games_t_w from chess_data2 t1
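Any idea what is going on here and how to fix it? I run these queries with Python 3.5 and psycopg2 in PyCharm. I'm happy to provide any additional information, since this is a very important project for me.
EXPLAIN ANALYZE (for the last query):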
Seq Scan on chess_data2 t1 (cost=0.00..49571932.96 rows=2879185 width=86) (actual time=0.061..81756.896 rows=2879185 loops=1)
  SubPlan 1
    -> Aggregate (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
      -> Index Only Scan using white_index on chess_data2 t2 (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=1 loops=2879185)
           Index Cond: ((namew = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
           Filter: (t1.date < (date + 90))
           Rows Removed by Filter: 1
           Heap Fetches: 6009767
  SubPlan 2
    -> Aggregate (cost=8.58..8.59 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=2879185)
      -> Index Only Scan using black_index on chess_data2 t2_1 (cost=0.56..8.58 rows=1 width=0) (actual time=0.013..0.013 rows=2 loops=2879185)
           Index Cond: ((nameb = t1.namew) AND (tournament = t1.tournament) AND (date < t1.date))
           Filter: (t1.date < (date + 90))
           Rows Removed by Filter: 1
           Heap Fetches: 5303160
Planning time: 0.161 ms
Execution time: 81883.716 ms
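The query performs poorly because of poor table design. The EXPLAIN output makes it clear that the database does use the indexes, but the indexed fields are all TEXT and the indexes are large.
To fix it:
- Create a table names
- Replace namew and nameb with namew_id and nameb_id, both referencing names
- Create a table tournaments
- Replace tournament with tournament_id, referencing tournaments
- Re-create black_index on (nameb_id, tournament_id, date)
- Re-create white_index on (namew_id, tournament_id, date)
- Drop w_b_t_d_index, unless you use it in other queries
- Remove the useless coalesce from the count(*) queries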
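The two re-indexing steps could look roughly like the following. This is only a sketch: it assumes the namew_id, nameb_id and tournament_id columns have already been added to chess_data2 and filled in, and it simply reuses the index names from the question.

```sql
-- Sketch only: assumes chess_data2 already has populated integer columns
-- namew_id, nameb_id and tournament_id.
DROP INDEX IF EXISTS white_index;
DROP INDEX IF EXISTS black_index;

-- Same column order as before (player, tournament, date), but on integer keys.
CREATE INDEX white_index ON chess_data2 (namew_id, tournament_id, date);
CREATE INDEX black_index ON chess_data2 (nameb_id, tournament_id, date);

-- Refresh planner statistics after the change.
ANALYZE chess_data2;
```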
Your query should then look like this:
SELECT
(
SELECT count(*)
FROM chess_data2 t2
WHERE
t1.namew_id = t2.namew_id AND
t1.tournament_id = t2.tournament_id AND
t1.date > t2.date AND
t1.date < t2.date + 90
)
+
(
SELECT count(*)
FROM chess_data2 t2
WHERE
t1.namew_id = t2.nameb_id AND
t1.tournament_id = t2.tournament_id AND
t1.date > t2.date AND
t1.date < t2.date + 90
) AS games_t_w
FROM chess_data2 t1
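One further tweak may be worth testing; it is a suggestion on top of the query above, not something measured on this data. In the EXPLAIN output from the question, t1.date < t2.date + 90 appears as a Filter rather than an Index Cond, because the arithmetic is applied to the indexed column. Moving the arithmetic to the outer row (t2.date > t1.date - 90) expresses the same condition and should let the planner use both date bounds as index conditions. A sketch, assuming the same *_id columns as above:

```sql
-- Sketch: identical result to the query above, with the 90-day window
-- written as a range on the indexed column t2.date.
SELECT
    (
        SELECT count(*)
        FROM chess_data2 t2
        WHERE t2.namew_id = t1.namew_id
          AND t2.tournament_id = t1.tournament_id
          AND t2.date < t1.date
          AND t2.date > t1.date - 90  -- same as t1.date < t2.date + 90
    )
    +
    (
        SELECT count(*)
        FROM chess_data2 t2
        WHERE t2.nameb_id = t1.namew_id
          AND t2.tournament_id = t1.tournament_id
          AND t2.date < t1.date
          AND t2.date > t1.date - 90
    ) AS games_t_w
FROM chess_data2 t1;
```

Comparing EXPLAIN ANALYZE for both versions would show whether the Filter line disappears and whether that helps in practice.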
The whole conversion (except for the composite indexes):
CREATE TABLE games (
id serial not null PRIMARY KEY
, zdate DATE
, namew TEXT
, nameb TEXT
, whiterank INTEGER NOT NULL DEFAULT 0
, blackrank INTEGER NOT NULL DEFAULT 0
, tournament TEXT
, t_round INTEGER NOT NULL DEFAULT 0
, result REAL NOT NULL DEFAULT 0.0
);
CREATE TABLE tournaments
( id serial not null PRIMARY KEY
, tournament TEXT UNIQUE
);
CREATE TABLE players
( id serial not null PRIMARY KEY
, name TEXT UNIQUE
);
INSERT INTO tournaments(tournament)
SELECT DISTINCT tournament
FROM games;
VACUUM ANALYZE tournaments;
INSERT INTO players(name)
SELECT DISTINCT namew
FROM games;
VACUUM ANALYZE players;
INSERT INTO players(name)
SELECT DISTINCT nameb
FROM games g
WHERE NOT EXISTS (
SELECT * FROM players nx WHERE nx.name = g.nameb
);
VACUUM ANALYZE players;
ALTER TABLE games
ADD COLUMN id_w INTEGER
, ADD COLUMN id_b INTEGER
, ADD COLUMN id_t INTEGER
;
UPDATE games g
SET id_w = p.id
FROM players p
WHERE p.name = g.namew;
UPDATE games g
SET id_b = p.id
FROM players p
WHERE p.name = g.nameb;
UPDATE games g
SET id_t = t.id
FROM tournaments t
WHERE t.tournament = g.tournament;
ALTER TABLE games
ALTER COLUMN id_w SET NOT NULL
, ALTER COLUMN id_b SET NOT NULL
, ALTER COLUMN id_t SET NOT NULL
;
CREATE INDEX ON games(id_w);
CREATE INDEX ON games(id_b);
CREATE INDEX ON games(id_t);
ALTER TABLE games
ADD CONSTRAINT fk_games_idw FOREIGN KEY(id_w) REFERENCES players(id)
, ADD CONSTRAINT fk_games_idb FOREIGN KEY(id_b) REFERENCES players(id)
, ADD CONSTRAINT fk_games_idt FOREIGN KEY(id_t) REFERENCES tournaments(id)
, DROP COLUMN namew
, DROP COLUMN nameb
, DROP COLUMN tournament
;
VACUUM ANALYZE games;
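The composite indexes left out of the script might be added along these lines. This is a sketch, not part of the original script: the index names are made up, and the column order simply mirrors white_index and black_index from the question.

```sql
-- Hypothetical index names; column order mirrors white_index / black_index
-- from the question (player, tournament, date).
CREATE INDEX games_w_t_d_idx ON games (id_w, id_t, zdate);
CREATE INDEX games_b_t_d_idx ON games (id_b, id_t, zdate);

-- Refresh planner statistics so the new indexes are considered.
ANALYZE games;
```

With these in place, the single-column indexes on id_w and id_b are probably redundant, since a btree index can also serve lookups on its leading column alone.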