我有两个表,其中第一个非常大(>50M行):
CREATE CACHED TABLE Alldistances (
word1 VARCHAR(70),
word2 VARCHAR(70),
distance INTEGER,
distcount INTEGER
);
和第二个也可以相当大(>5M行):
CREATE CACHED TABLE tempcach (
word1 VARCHAR(70),
word2 VARCHAR(70),
distance INTEGER,
distcount INTEGER
);
两个表都有索引:
CREATE INDEX mulalldis ON Alldistances (word1, word2, distance);
CREATE INDEX multem ON tempcach (word1, word2, distance);
在我的java程序中,我使用准备好的语句来填充/预组织tempcach表中的数据,然后我将表合并到所有距离:
MERGE INTO Alldistances alld USING (
SELECT word1,
word2,
distance,
distcount FROM tempcach
) AS src (
newword1,
newword2,
newdistance,
newcount
) ON (
alld.word1 = src.newword1
AND alld.word2 = src.newword2
AND alld.distance = src.newdistance
) WHEN MATCHED THEN
UPDATE SET alld.distcount = alld.distcount+src.newcount
WHEN NOT MATCHED THEN
INSERT (
word1,
word2,
distance,
distcount
) VALUES (
newword1,
newword2,
newdistance,
newcount
);
然后删除或截断tempchach表,并用新数据填充。在合并期间,我得到了OOM,我猜这是因为整个表在合并期间被加载到内存中。所以我将不得不分批合并,但我可以在SQL或在我的java程序中做。或者是否有一种聪明的方法可以在合并时避免OOM ?
在SQL中可以进行块(批)合并。你需要
- 限制每个块中来自临时表的行数
- 删除相同的行
SELECT语句应该使用ORDER BY和LIMIT
SELECT word1,
word2,
distance,
distcount FROM tempcach
ORDER BY primary key or unique columns
LIMIT 1000
) AS src (
合并后,delete语句将选择要删除的相同行
DELETE FROM tempcach WHERE primary key or unique columns IN
(SELECT primary key or unique columns FROM tempcach
ORDER BY primary key or unique columns LIMIT 1000)
首先,只是因为这种事情惹恼了我,为什么要在子选择中选择临时表的所有字段?为什么不使用更简单的SQL:
MERGE INTO Alldistances alld USING tempcach AS src (
newword1,
newword2,
newdistance,
newcount
) ON (
alld.word1 = src.newword1
AND alld.word2 = src.newword2
AND alld.distance = src.newdistance
) WHEN MATCHED THEN
UPDATE SET alld.distcount = alld.distcount+src.newcount
WHEN NOT MATCHED THEN
INSERT (
word1,
word2,
distance,
distcount
) VALUES (
newword1,
newword2,
newdistance,
newcount
);
你需要让数据库避免将整个表加载到内存中,那就是在两个表上都建立索引。
CREATE INDEX all_data ON Alldistances (word1, word2, distance);
CREATE INDEX tempcach_data ON tempcach (word1, word2, distance);