Why do inserts slow down so much as the database grows?



I'm working on a personal project that generates a lot of data, and I thought it made sense to store it in a local DB. However, I'm seeing insane slowdown as the DB grows, which makes it infeasible to run.

I made a simple test showing what I'm doing. I build a dictionary where I do a bunch of local processing (roughly one million entries), then batch-insert it into the SQLite DB, then loop and do it all again. Here's the code:

from collections import defaultdict
import sqlite3
import datetime
import random


def log(s):
    now = datetime.datetime.now()
    print(str(now) + ": " + str(s))


def create_table():
    conn = create_connection()
    with conn:
        cursor = conn.cursor()
        sql = """
        CREATE TABLE IF NOT EXISTS testing (
            test text PRIMARY KEY,
            number integer
        );"""
        cursor.execute(sql)
    conn.close()


def insert_many(local_db):
    # Upsert: add to the existing count when the key is already present.
    sql = """INSERT INTO testing(test, number) VALUES(?, ?)
             ON CONFLICT(test) DO UPDATE SET number=number+?;"""
    inserts = []
    for key, value in local_db.items():
        inserts.append((key, value, value))
    conn = create_connection()
    with conn:
        cursor = conn.cursor()
        cursor.executemany(sql, inserts)
    conn.close()


def main():
    log("Starting to process records")
    for i in range(1, 21):
        # Accumulate counts locally, then flush the batch to SQLite.
        local_db = defaultdict(int)
        for j in range(0, 1000000):
            s = "Testing insertion " + str(random.randrange(100000000))
            local_db[s] += 1
        log("Created local DB for " + str(1000000 * i) + " records")
        insert_many(local_db)
        log("Finished inserting " + str(1000000 * i) + " records")


def create_connection():
    conn = None
    try:
        conn = sqlite3.connect('/home/testing.db')
    except sqlite3.Error as e:
        print(e)
    return conn


if __name__ == '__main__':
    create_table()
    main()

This runs fine for a bit and then slows down like crazy. Here's the output I just got:

2019-10-23 15:28:59.211036: Starting to process records
2019-10-23 15:29:01.308668: Created local DB for 1000000 records
2019-10-23 15:29:10.147762: Finished inserting 1000000 records
2019-10-23 15:29:12.258012: Created local DB for 2000000 records
2019-10-23 15:29:28.752352: Finished inserting 2000000 records
2019-10-23 15:29:30.853128: Created local DB for 3000000 records
2019-10-23 15:39:12.826357: Finished inserting 3000000 records
2019-10-23 15:39:14.932100: Created local DB for 4000000 records
2019-10-23 17:21:37.257651: Finished inserting 4000000 records
...

As you can see, the first million inserts take 9 seconds, the next million take 16, then it balloons to 10 minutes, then an hour and 40 minutes (!). Is there something weird I'm doing that causes this crazy slowdown, or is this a limitation of sqlite?

(This is more of an extended comment than an answer.)

SQLite only supports B-tree indexes. For strings, which may have varying lengths, the tree stores row IDs. The read complexity of the tree is O(log(n)), where n is the number of rows in the table, but it gets multiplied by the cost of reading and comparing a string value from the table. So unless there's a good reason not to, it's better to have an integer field as the primary key.

What makes it even worse in this case is that the strings you insert share a fairly long common prefix ("Testing insertion "), so finding the first mismatching character takes longer.

Suggestions for speeding things up, ordered by expected effect size:

  • A real database (MariaDB, Postgres) supports hash indexes, which would solve this problem
  • Disable autocommit (skips unnecessary disk writes, which are very expensive); see the sketch after this list
  • Reverse the text string (digits before the fixed text), or keep only the numeric part
  • Use bulk inserts (multiple records in one statement)
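
As a rough sketch of the second and third suggestions combined (assuming the numeric suffix alone identifies a record, which holds for the test data above; the table name testing2 and the helper insert_many_fast are invented for illustration):

import sqlite3

def insert_many_fast(local_db):
    # isolation_level=None turns off the sqlite3 module's implicit
    # transaction handling, so BEGIN/COMMIT are explicit and the whole
    # batch costs a single commit/disk sync.
    conn = sqlite3.connect('/home/testing.db', isolation_level=None)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS testing2 (
            test integer PRIMARY KEY,
            number integer
        );""")
    cursor.execute("BEGIN")
    cursor.executemany(
        """INSERT INTO testing2(test, number) VALUES(?, ?)
           ON CONFLICT(test) DO UPDATE SET number=number+?;""",
        # Keep only the numeric part: strip the constant
        # "Testing insertion " prefix and store an integer key.
        [(int(key.rsplit(" ", 1)[1]), value, value)
         for key, value in local_db.items()])
    cursor.execute("COMMIT")
    conn.close()

With an integer key, each B-tree comparison is a single integer compare instead of a walk down a long shared string prefix.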

@peak's answer avoids the whole problem by not using an index. If you don't need an index at all, that's definitely the way to go.
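
If lookups by test are still needed after loading, one common compromise (a general SQLite technique, not something from either answer; the index name idx_testing_test is invented) is to load into a table with no index and build the index once at the end:

import sqlite3

# Building the index once after the bulk load is a single sorting pass
# over all rows, rather than millions of incremental B-tree insertions
# made while the table grows.
conn = sqlite3.connect('/home/testing.db')
with conn:
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_testing_test ON testing(test);")
conn.close()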

Using your program with one (minor?) modification, I got the very reasonable timings shown below. The modification: use sqlite3.connect rather than pysqlite.connect.

Timings using sqlite3.connect

# Comments are approximations.

2019-10-23 13:00:37.843759: Starting to process records
2019-10-23 13:00:40.253049: Created local DB for 1000000 records
2019-10-23 13:00:50.052383: Finished inserting 1000000 records          # 12s
2019-10-23 13:00:52.065007: Created local DB for 2000000 records
2019-10-23 13:01:08.069532: Finished inserting 2000000 records          # 18s
2019-10-23 13:01:10.073701: Created local DB for 3000000 records
2019-10-23 13:01:28.233935: Finished inserting 3000000 records          # 20s
2019-10-23 13:01:30.237968: Created local DB for 4000000 records
2019-10-23 13:01:51.052647: Finished inserting 4000000 records          # 23s
2019-10-23 13:01:53.079311: Created local DB for 5000000 records
2019-10-23 13:02:15.087708: Finished inserting 5000000 records          # 24s
2019-10-23 13:02:17.075652: Created local DB for 6000000 records
2019-10-23 13:02:41.710617: Finished inserting 6000000 records          # 26s
2019-10-23 13:02:43.712996: Created local DB for 7000000 records
2019-10-23 13:03:18.420790: Finished inserting 7000000 records          # 37s
2019-10-23 13:03:20.420485: Created local DB for 8000000 records
2019-10-23 13:04:03.287034: Finished inserting 8000000 records          # 45s
2019-10-23 13:04:05.593073: Created local DB for 9000000 records
2019-10-23 13:04:57.871396: Finished inserting 9000000 records          # 54s
2019-10-23 13:04:59.860289: Created local DB for 10000000 records       
2019-10-23 13:05:54.527094: Finished inserting 10000000 records         # 57s
...

The cost of TEXT PRIMARY KEY

I believe the main reason for the slowdown is that test is defined as TEXT PRIMARY KEY. That carries a huge indexing cost, as suggested by the following snippet, from a run in which both the "PRIMARY KEY" and "ON CONFLICT" declarations were removed.
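
For concreteness, the modified statements would look roughly like this (a sketch; everything else stays the same, except that insert_many passes (key, value) pairs instead of triples):

# Table and insert without the index or the upsert. With no PRIMARY
# KEY there is no B-tree to maintain, and with no ON CONFLICT clause
# repeated keys simply become duplicate rows.
create_sql = """
    CREATE TABLE IF NOT EXISTS testing (
        test text,
        number integer
    );"""
insert_sql = "INSERT INTO testing(test, number) VALUES(?, ?);"

With those declarations removed, the timings were: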

2019-10-23 13:26:22.627898: Created local DB for 10000000 records
2019-10-23 13:26:24.010171: Finished inserting 10000000 records
...
2019-10-23 13:26:58.350150: Created local DB for 20000000 records
2019-10-23 13:26:59.832137: Finished inserting 20000000 records

That's less than 1.4 seconds at the 10-million-record mark, and not much more at the 20-million-record mark.
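
Note that without ON CONFLICT the per-key totals are no longer maintained during insertion: duplicates pile up as separate rows. If the totals are only needed once at the end (an assumption about the use case, not something the question states), they can be recovered with a single aggregation:

import sqlite3

# Recover per-key totals from the unindexed, duplicate-bearing table.
conn = sqlite3.connect('/home/testing.db')
for test, total in conn.execute(
        "SELECT test, SUM(number) FROM testing GROUP BY test;"):
    pass  # consume (test, total) pairs here
conn.close()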
