如何在使用 python 或 postgres 检查约束和返回值时加快插入速度

我正在尝试加快插入模式的速度。我已经看到了一些关于这个问题的问题，但是它们不处理我的问题，这些是最相关的：

psycopg2：使用一个查询插入多行
如何在PostgreSQL中加快插入性能

我的架构中的表都包含自动生成的键Serial，所以我在插入的末尾使用RETURNING id。我不使用copy的原因是我需要检查约束。特别是某些字段具有唯一的索引，因此我正在检查特定行是否已存在，以便我可以选择其id(为了在自动生成的键上进行映射(

我这样做：

sql = "INSERT INTO %s.%s(%s) VALUES (%s) ON CONFLICT DO NOTHING RETURNING %s;"
sql = sql % (schema_name,table_name,','.join(head[1:]),'%s',head[0])

其中 head 包含表中字段的名称。然后我这样做：

try:
# get the auto generated id back
return_id = pdb.cur.fetchone()[0]     
except TypeError:
sql_exist_id = "SELECT %s FROM %s.%s WHERE %s = '%s'" % (head[0],schema_name,table_name,unique_field,row[1])

我的问题是：有没有办法在检查约束和返回键的同时进行批量插入(就像这个 psycopg2：用一个查询插入多行(？

你可以非常接近...

RETURNING子句将适用于多值插入，如下所示：

insert into some_table (col1, col2) values 
(val1, val2), (val3, val4), 
(val5, val6), (val7, val8) 
returning id

假设some_table有一个serial，你会得到4个结果陈述。它还将与insert into ... select一起使用：

insert into some_table (col1, col2) 
select col1, col2 
from some_other_source returning id

但是在这两种情况下，如果您指定on conflict do nothing，则在发生冲突时，整个语句将失败，因此一个不良记录会杀死事务。

但请注意，returning也适用于删除：

delete from some_table where col1 = 'bob' returning id

将其与@JacobH的建议放在一起：批量加载到临时表中，筛选出冲突的行，然后插入剩余的非冲突行行。你提到了copy，所以我假设已经有.csv大致与目的地匹配。像这样：

with conn.cursor() as stmt:
#create a constraint-free temp table based on your destination table
#the `where 0 = 1` ensures it's empty
stmt.execute('select * into temp tmp_some_table from some_table where 0 = 1') 
#bulk load into the temp table:
with open('some.csv') as infile:
stmt.copy_from(infile, 'tmp_some_table', sep=',', columns=['col1', 'col2'])
#strip out your conflicts. note that we are returning col1 (assumed to be a unique key)
#you can do as many passes as needed...
stmt.execute('delete from tmp_some_table tst using some_table st where tst.col1 = st.col1 returning col1')
dupe_keys = stmt.fetchall() #all your deleted, conflicting keys
#bulk insert non-conflicting values
stmt.execute('insert into some_table (col1, col2) select (col1, col2) from tmp_some_table returning id') 
inserted_ids = stmt.fetchall() #all the ids created from the insert

一个好处是，与迭代和调用execute相比，这应该非常快 - 从本质上讲，您正在 4 个陈述中做任何事情;大部分网络 I/O 在初始 CSV 导入与 N 个记录插入上的 N 次往返上。其他一切都发生在数据库中。临时表可能会保留在内存，如果它相当小。

意识到我在这个答案上迟到了，但我希望它有所帮助。

相关内容

最新更新

热门标签：