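Today our production database (Aurora PostgreSQL 9.6.3) hit a deadlock in which multiple processes were trying to run the same UPDATE query against a single row. We thought deadlocks could only occur when several rows are updated in inconsistent order, so this came as a surprise; nevertheless, it happened during the busiest part of our day.
Here is the transaction in our Python code that contains the UPDATE statement (it is a poor man's UPSERT):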
with self.connection.cursor() as cursor:
    cursor.execute("""UPDATE students SET name = %s WHERE uuid = %s AND activity_id = %s""", (name, uuid, activityId))
    if cursor.rowcount <= 0:
        cursor.execute("""INSERT INTO students (name, uuid, activity_id) VALUES (%s, %s, %s)""", (name, uuid, activityId))
    if cursor.rowcount <= 0:
        self.connection.rollback()
        raise BaseDao.NotUpserted("No student name was updated or inserted for activity_id %d and uuid %s" % (activityId, uuid))
    else:
        self.connection.commit()
Here are the relevant lines from the log, including the simple query that updates just one row:
...
2018-01-19 16:21:27 UTC:[38161]:ERROR: deadlock detected
2018-01-19 16:21:27 UTC:[38161]:DETAIL: Process 38161 waits for ShareLock on transaction 90490253; blocked by process 25147.
Process 25147 waits for ShareLock on transaction 90490267; blocked by process 38161.
Process 38161: UPDATE students SET name = 'foobar' WHERE uuid = 'ca1b2d153cbdc9574cce' AND activity_id = 35473237
Process 25147: UPDATE students SET name = 'foobar' WHERE uuid = 'ca1b2d153cbdc9574cce' AND activity_id = 35473237
...
Here are the two relevant tables:
db=> \d students
Table "public.students"
Column | Type | Modifiers
-------------+------------------------+-------------------------------------------------------------------
id | integer | not null default nextval('students_id_seq'::regclass)
name | character varying(128) | not null
uuid | character varying(40) | not null
activity_id | integer | not null
Indexes:
"students_pkey" PRIMARY KEY, btree (id)
"students_activity_id" btree (activity_id)
Foreign-key constraints:
"activity_id_refs_id_76c08098" FOREIGN KEY (activity_id) REFERENCES activities(id) DEFERRABLE INITIALLY DEFERRED
db=> \d activities
Table "public.activities"
Column | Type | Modifiers
-------------------+--------------------------+----------------------------------------------------------------------
id | integer | not null default nextval('activities_id_seq'::regclass)
start_time | timestamp with time zone | not null
end_time | timestamp with time zone |
activity_type | character varying(2) | not null
activity_id | integer | not null
started_by_id | integer | not null
activity_state | integer | not null
legacy_id | integer |
hide_report | boolean | not null
report_status | integer |
students_finished | text | not null
room_name | text |
last_updated | timestamp with time zone |
state | integer |
Indexes:
"activities_pkey" PRIMARY KEY, btree (id)
"activities_end_time" btree (end_time)
"activities_room_name_c1f9997a_like" btree (room_name text_pattern_ops)
"activities_room_name_c1f9997a_uniq" btree (room_name)
"activities_started_by_id" btree (started_by_id)
Foreign-key constraints:
"started_by_id_refs_id_5ea35c7a" FOREIGN KEY (started_by_id) REFERENCES users(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
TABLE "students" CONSTRAINT "activity_id_refs_id_76c08098" FOREIGN KEY (activity_id) REFERENCES activities(id) DEFERRABLE INITIALLY DEFERRED
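How can we end up in a deadlock like this when only one row is being updated?
I can think of two causes for this deadlock:
- The transaction performing the update contains more than one statement, and the other statements also take locks.
- Triggers are involved that take additional locks.
Deadlocks are not a bug unless they happen too often; failing to handle a deadlock is the bug. Simply retry the failed transaction.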
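Retrying the failed transaction could look roughly like the sketch below. It is not the original poster's code: `run_with_deadlock_retry`, `work`, `max_attempts` and `base_delay` are hypothetical names introduced for illustration. It assumes a DB-API connection whose exceptions expose the PostgreSQL SQLSTATE as a `pgcode` attribute (as psycopg2 does; the question does not say which driver is in use), and it relies on `40P01` being PostgreSQL's code for `deadlock_detected`.

```python
import random
import time

# PostgreSQL's SQLSTATE for deadlock_detected.
DEADLOCK_SQLSTATE = "40P01"


def run_with_deadlock_retry(connection, work, max_attempts=3, base_delay=0.05):
    """Run work(connection) as one transaction, retrying if a deadlock aborts it.

    `work` is a hypothetical callable that executes the statements of the
    transaction on the given connection but does not commit or roll back.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = work(connection)
            # The commit is inside the try block because deferred constraint
            # checks run at commit time and can fail there as well.
            connection.commit()
            return result
        except Exception as exc:
            # The aborted transaction must be rolled back before anything else.
            connection.rollback()
            # Assumption: the driver stores the SQLSTATE in `pgcode`.
            is_deadlock = getattr(exc, "pgcode", None) == DEADLOCK_SQLSTATE
            if not is_deadlock or attempt == max_attempts:
                raise
            # Back off with jitter so the competing sessions do not collide
            # again at exactly the same moment.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

With this helper, the UPDATE-then-INSERT logic from the question would move into a function passed as `work`, with the `commit()` and `rollback()` calls left to the helper. Errors other than a deadlock are re-raised immediately, and a deadlock that persists through `max_attempts` tries is re-raised as well, so the caller still sees it.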