这是上下文,它今天刚刚发生:
- Ubuntu 16.04, PostgreSQL 9.5
- 查询任何使用 PostgreSQL 的页面时无限加载(无消息)
- 使用
psql
运行任何查询时无限加载(无消息) - 没有什么特别的
/var/log/postgresql/postgresql-9.5-main.log
- 没有什么特别的
/var/log/syslog
- 服务器负载正常(每个处理器~1%,内存几乎未使用,磁盘空间正常)
然后,运行sudo service postgresql restart
后一切正常。
在这种情况下,我应该在哪里搜索?这是一个"常见"问题吗?没有日志,我不知道该怎么办。我怎么能相对"确定"它不会再随机发生?
旁注:这是我重新启动后postgreSQL日志的详细信息:
2017-03-16 12:20:52 CET [857-2] LOG: received fast shutdown request
2017-03-16 12:20:52 CET [857-3] LOG: aborting any active transactions
2017-03-16 12:20:52 CET [6829-1] XXXXX@YYYY FATAL: terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6829-2] XXXXX@YYYY STATEMENT: select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6644-1] XXXXX@YYYY FATAL: terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6644-2] XXXXX@YYYY STATEMENT: select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6649-1] XXXXX@YYYY FATAL: terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6649-2] XXXXX@YYYY STATEMENT: select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6580-1] XXXXX@YYYY FATAL: terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6580-2] XXXXX@YYYY STATEMENT: select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6768-1] XXXXX@YYYY FATAL: terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6768-2] XXXXX@YYYY STATEMENT: select * from "users" where "email" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [6253-1] XXXXX@YYYY FATAL: terminating connection due to administrator command
2017-03-16 12:20:52 CET [6253-2] XXXXX@YYYY STATEMENT: alter table "users" add column "email_confirmed" boolean not null default '1'
2017-03-16 12:20:52 CET [8465-1] XXXXX@YYYY FATAL: terminating connection due to administrator command
2017-03-16 12:20:52 CET [913-2] LOG: autovacuum launcher shutting down
2017-03-16 12:20:52 CET [6586-1] XXXXX@YYYY FATAL: terminating connection due to administrator command at character 15
2017-03-16 12:20:52 CET [6586-2] XXXXX@YYYY STATEMENT: select * from "users" where "users"."id" = $1 and "users"."deleted_at" is null limit 1
2017-03-16 12:20:52 CET [31969-1] XXXXX@YYYY FATAL: terminating connection due to administrator command
2017-03-16 12:20:52 CET [6300-1] XXXXX@YYYY FATAL: terminating connection due to administrator command
2017-03-16 12:20:52 CET [6974-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [906-1] LOG: shutting down
2017-03-16 12:20:52 CET [6980-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6978-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6975-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6977-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6976-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6983-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6979-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6982-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6981-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6985-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [6984-1] XXXXX@YYYY FATAL: the database system is shutting down
2017-03-16 12:20:52 CET [906-2] LOG: database system is shut down
2017-03-16 12:20:53 CET [7002-1] LOG: database system was shut down at 2017-03-16 12:20:52 CET
这是导致您的问题的语句:
alter table "users" add column "email_confirmed" boolean not null default '1'
这样的ALTER TABLE
需要在表上安装ACCESS EXCLUSIVE
锁,即使与读取访问权限也存在冲突。
现在,表users
似乎是一个繁忙的表,这通常没有问题(除非两个事务尝试写入同一行),因为它们需要的表锁不会相互冲突。
现在ALTER TABLE
了,必须等到表上所有早期事务(在表上至少有一个ACCESS SHARE
锁)完成。在ALTER TABLE
之后开始的对表的所有查询都必须在其后面排队,因为它们需要的锁与ALTER TABLE
需要的锁冲突。
一旦ALTER TABLE
获得ACCESS EXCLUSIVE
锁定,它应该很快完成,事情就会恢复正常。
现在还剩下一个谜语:日志中的所有其他中断查询看起来都非常短,因此不清楚为什么您必须等待那么长时间 - 它们应该在几微秒内完成。
为了解释这一点,请记住,当查询完成时,锁不会释放,而是一直保持到事务结束。因此,如果在事务中运行短查询,然后使事务保持打开状态,则此空闲数据库会话也将阻止ALTER TABLE
。在正常情况下,您可能不会注意到,特别是如果所有事务都是只读的,但在这种情况下,这将证明是有害的。
顺便说一句,保持打开的事务还有其他负面的副作用,特别是VACUUM
无法完成其工作,并且您的表和索引可能会变得臃肿。
您可以通过从pg_stat_activity
系统视图中检查state
来检查应用程序是否存在该问题。如果您可以定期看到idle in transaction
会话并保持这种状态一段时间,那么您正在查看问题的根本原因。这应该被修复!
您还可以中断ALTER TABLE
语句,而不是重新启动服务器。
如果再次遇到此类问题,请查询pg_locks
系统视图。您将看到运行ALTER TABLE
的会话有一个具有granted = FALSE
的锁,因为它正在等待表锁。您还可以查看哪些其他会话在桌子上锁定。