为什么使用sqlite的SQLAlchemy插入速度是直接使用sqlite3的25倍



为什么这个简单的测试用例使用SQLAlchemy插入100000行的速度是直接使用sqlite3驱动程序的25倍?我在现实世界的应用程序中也看到过类似的减速。我做错什么了吗?

#!/usr/bin/env python
# Why is SQLAlchemy with SQLite so slow?
# Output from this program:
# SqlAlchemy: Total time for 100000 records 10.74 secs
# sqlite3:    Total time for 100000 records  0.40 secs

import time
import sqlite3
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine 
from sqlalchemy.orm import scoped_session, sessionmaker
Base = declarative_base()
DBSession = scoped_session(sessionmaker())
class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))
def init_sqlalchemy(dbname = 'sqlite:///sqlalchemy.db'):
    engine  = create_engine(dbname, echo=False)
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
def test_sqlalchemy(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
    DBSession.commit()
    print "SqlAlchemy: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"
def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn
def test_sqlite3(n=100000, dbname = 'sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print "sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec"
if __name__ == '__main__':
    test_sqlalchemy(100000)
    test_sqlite3(100000)

我尝试过许多变体(参见http://pastebin.com/zCmzDraU)

SQLAlchemy ORM在同步数据库更改时使用工作单元模式。这种模式远远超出了简单的数据"插入"。它包括使用属性检测系统接收分配给对象的属性,该系统跟踪对象的更改,包括在标识映射中跟踪插入的所有行,该标识映射的效果是,对于每一行,SQLAlchemy必须检索其"上次插入的id"(如果尚未给定),并且还涉及根据需要对要插入的行进行扫描和排序以寻找相关性。为了保持所有这些操作的运行,对象还需要进行一定程度的记账,这对于大量的行来说可能会在处理大型数据结构时花费过多的时间,因此最好对这些数据进行分组。

基本上,工作单元是一个很大程度的自动化,以便在没有显式持久化代码的情况下将复杂的对象图持久化到关系数据库中的任务自动化,而这种自动化是有代价的。

因此ORM基本上不适用于高性能的大容量插件。这就是SQLAlchemy拥有两个独立库的全部原因,如果您查看http://docs.sqlalchemy.org/en/latest/index.html您将看到索引页面的两个不同的部分——一个用于ORM,另一个用于Core。如果不了解两者,就无法有效地使用SQLAlchemy。

对于快速批量插入的用例,SQLAlchemy提供了核心,即ORM构建在其之上的SQL生成和执行系统。有效地使用这个系统,我们可以生成一个与原始SQLite版本有竞争力的INSERT。下面的脚本说明了这一点,以及一个ORM版本,该版本预先分配主键标识符,以便ORM可以使用execute()插入行。这两个ORM版本每次都会对1000条记录进行刷新,这对性能产生了重大影响。

此处观察到的运行时间为:

SqlAlchemy ORM: Total time for 100000 records 16.4133379459 secs
SqlAlchemy ORM pk given: Total time for 100000 records 9.77570986748 secs
SqlAlchemy Core: Total time for 100000 records 0.568737983704 secs
sqlite3: Total time for 100000 records 0.595796823502 sec

脚本:

import time
import sqlite3
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
Base = declarative_base()
DBSession = scoped_session(sessionmaker())
class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))
def init_sqlalchemy(dbname = 'sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print "SqlAlchemy ORM: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"
def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print "SqlAlchemy ORM pk given: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"
def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name":'NAME ' + str(i)} for i in range(n)]
    )
    print "SqlAlchemy Core: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"
def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn
def test_sqlite3(n=100000, dbname = 'sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print "sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec"
if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

另请参阅:http://docs.sqlalchemy.org/en/latest/faq/performance.html

@zzzeek的精彩回答。对于那些想知道查询的相同统计数据的人,我稍微修改了@zzzeek代码,在插入这些记录后立即查询这些记录,然后将这些记录转换为dict列表。

以下是的结果

SqlAlchemy ORM: Total time for 100000 records 11.9210000038 secs
SqlAlchemy ORM query: Total time for 100000 records 2.94099998474 secs
SqlAlchemy ORM pk given: Total time for 100000 records 7.51800012589 secs
SqlAlchemy ORM pk given query: Total time for 100000 records 3.07699990273 secs
SqlAlchemy Core: Total time for 100000 records 0.431999921799 secs
SqlAlchemy Core query: Total time for 100000 records 0.389000177383 secs
sqlite3: Total time for 100000 records 0.459000110626 sec
sqlite3 query: Total time for 100000 records 0.103999853134 secs

有趣的是,使用裸sqlite3进行查询仍然比使用SQLAlchemyCore快3倍。我想这就是您为返回ResultProxy而不是空的sqlite3行所付出的代价。

SQLAlchemyCore大约比使用ORM快8倍。因此,无论怎样,使用ORM进行查询都要慢得多。

这是我使用的代码:

import time
import sqlite3
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String,  create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.sql import select
Base = declarative_base()
DBSession = scoped_session(sessionmaker())
class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))
def init_sqlalchemy(dbname = 'sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print "SqlAlchemy ORM: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"
    t0 = time.time()
    q = DBSession.query(Customer)
    dict = [{'id':r.id, 'name':r.name} for r in q]
    print "SqlAlchemy ORM query: Total time for " + str(len(dict)) + " records " + str(time.time() - t0) + " secs"

def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in range(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print "SqlAlchemy ORM pk given: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"
    t0 = time.time()
    q = DBSession.query(Customer)
    dict = [{'id':r.id, 'name':r.name} for r in q]
    print "SqlAlchemy ORM pk given query: Total time for " + str(len(dict)) + " records " + str(time.time() - t0) + " secs"
def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name":'NAME ' + str(i)} for i in range(n)]
    )
    print "SqlAlchemy Core: Total time for " + str(n) + " records " + str(time.time() - t0) + " secs"
    conn = engine.connect()
    t0 = time.time()
    sql = select([Customer.__table__])
    q = conn.execute(sql)
    dict = [{'id':r[0], 'name':r[0]} for r in q]
    print "SqlAlchemy Core query: Total time for " + str(len(dict)) + " records " + str(time.time() - t0) + " secs"
def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute("CREATE TABLE customer (id INTEGER NOT NULL, name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn
def test_sqlite3(n=100000, dbname = 'sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in range(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print "sqlite3: Total time for " + str(n) + " records " + str(time.time() - t0) + " sec"
    t0 = time.time()
    q = conn.execute("SELECT * FROM customer").fetchall()
    dict = [{'id':r[0], 'name':r[0]} for r in q]
    print "sqlite3 query: Total time for " + str(len(dict)) + " records " + str(time.time() - t0) + " secs"

if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

我还测试了没有将查询结果转换为dicts的情况,统计数据类似:

SqlAlchemy ORM: Total time for 100000 records 11.9189999104 secs
SqlAlchemy ORM query: Total time for 100000 records 2.78500008583 secs
SqlAlchemy ORM pk given: Total time for 100000 records 7.67199993134 secs
SqlAlchemy ORM pk given query: Total time for 100000 records 2.94000005722 secs
SqlAlchemy Core: Total time for 100000 records 0.43700003624 secs
SqlAlchemy Core query: Total time for 100000 records 0.131000041962 secs
sqlite3: Total time for 100000 records 0.500999927521 sec
sqlite3 query: Total time for 100000 records 0.0859999656677 secs

与ORM相比,使用SQLAlchemyCore进行查询的速度大约快20倍。

需要注意的是,这些测试非常肤浅,不应过于认真。我可能错过了一些明显的技巧,可以完全改变统计数据

衡量性能改进的最佳方法是直接在您自己的应用程序中。不要认为我的数据是理所当然的

我会尝试插入表达式测试,然后进行基准测试。

由于OR映射程序的开销,它可能仍然会更慢,但我希望不会慢那么多。

你介意试着发布结果吗。这是非常有趣的东西。

最新更新