MariaDb failover threads running high on CPU, even when there is no traffic



We have an API written in Scala 2.12.x and Play Framework 2.5.x. The API uses MariaDb Connector/J 2.5.4 to connect to an AWS Aurora cluster: jdbc:mysql:aurora://some-aurora-cluster

Functionally everything works, except that we noticed high CPU usage even when there is no traffic. Some investigation shows:

[ec2-user@ip-xxx-xxx-xxx-xxx ~]$ top -H
…
6373 root      20   0 4644452 990888  21540 S 14.6 12.6   1:15.68 MariaDb-failove
6374 root      20   0 4644452 990888  21540 S 13.6 12.6   1:16.11 MariaDb-failove
6305 root      20   0 4644452 990888  21540 S 13.3 12.6   1:14.31 MariaDb-failove
6375 root      20   0 4644452 990888  21540 S 12.3 12.6   1:14.59 MariaDb-failove
6372 root      20   0 4644452 990888  21540 S 11.3 12.6   1:15.78 MariaDb-failove
…

The command above shows many MariaDb-failover threads. I'm not sure what they do, or why there are several of them keeping the CPU busy.

[ec2-user@ip-xxx-xxx-xxx-xxx ~]$ netstat -a | less
…
tcp6       0      0 ip-xxx-xxx-xxx-31.:37446 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:37108 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:37648 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:36934 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:36870 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:37254 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
tcp6       0      0 ip-xxx-xxx-xxx-31.:37902 ip-xxx-xxx-xxx-129:mysql TIME_WAIT
…

There are a lot of connections in TIME_WAIT. That is also odd, because there was no traffic when I ran this command.

[ec2-user@ip-xxx-xxx-xxx-xxx ~]$ netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n
1 established)
1 Foreign
1 SYN_SENT
6 LISTEN
37 ESTABLISHED
851 TIME_WAIT

There are hundreds of TIME_WAIT connections, and the number changes every time I run the command.
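The per-state count above also picks up netstat's two header lines (hence the stray "established)" and "Foreign" rows). A minimal sketch of a cleaner count follows; the function names are made up for illustration, and port 3306 is an assumption, since the post only shows the service name "mysql":

```shell
# Count TCP connections per state, reading `netstat -nat` output on stdin.
# NR > 2 skips netstat's two header lines, so "established)" and "Foreign"
# no longer appear as bogus states.
count_tcp_states() {
  awk 'NR > 2 { print $6 }' | sort | uniq -c | sort -n
}

# Count only TIME_WAIT sockets whose remote port is 3306
# (assumed here to be the Aurora/MySQL port).
count_mysql_time_wait() {
  awk '$6 == "TIME_WAIT" && $5 ~ /:3306$/ { n++ } END { print n + 0 }'
}

# Usage:
#   netstat -nat | count_tcp_states
#   netstat -nat | count_mysql_time_wait
```

Running the second function a few seconds apart should show whether the TIME_WAIT churn is really directed at the database rather than at some other peer.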

Does anyone have any insight into whether this is normal, or whether it is something I need to worry about?

Let me know if you have any further questions.

========== More info

ps -aux | grep java

Got the PID: 2655

jstack 2655 > threaddump.log

Here is the content (trimmed):

2020-06-16 16:44:39
Full thread dump OpenJDK 64-Bit Server VM (25.252-b09 mixed mode):
"MariaDb-failover-5" #276 daemon prio=5 os_prio=0 tid=0x00007f4b0400b000 nid=0x18e7 waiting on condition [0x00007f4b0c6d6000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"MariaDb-failover-4" #275 daemon prio=5 os_prio=0 tid=0x00007f4b0400a000 nid=0x18e6 waiting on condition [0x00007f4afbefd000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"MariaDb-failover-3" #274 daemon prio=5 os_prio=0 tid=0x00007f4b04009000 nid=0x18e5 waiting on condition [0x00007f4afbbfa000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"MariaDb-failover-2" #273 daemon prio=5 os_prio=0 tid=0x00007f4b04008000 nid=0x18e4 waiting on condition [0x00007f4b0c5d5000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"MariaDb-failover-1" #265 daemon prio=5 os_prio=0 tid=0x00007f4b2407e800 nid=0x18a1 waiting on condition [0x00007f4b0c2d4000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:269)
at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:203)
at org.mariadb.jdbc.internal.failover.thread.FailoverLoop.doRun(FailoverLoop.java:84)
at org.mariadb.jdbc.internal.failover.thread.TerminableRunnable.run(TerminableRunnable.java:80)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"VM Thread" os_prio=0 tid=0x00007f4b540db800 nid=0xf37 runnable 
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f4b54069000 nid=0xf35 runnable 
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f4b5406a800 nid=0xf36 runnable 
"VM Periodic Task Thread" os_prio=0 tid=0x00007f4b54132800 nid=0xf3e waiting on condition 
JNI global references: 2520

LWP PIDs (from top -H) and their hexadecimal values (the nid in the thread dump):

6373 18e5
6374 18e6
6305 18a1
6375 18e7
6372 18e4
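The mapping above can be reproduced with printf, since the nid that jstack prints is the LWP id in hexadecimal. This is only a sketch; the two function names are hypothetical helpers, not part of any tool:

```shell
# Convert a decimal LWP id (the PID column of `top -H`) into the
# hexadecimal form that jstack prints as nid=0x...
lwp_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the header line of the thread whose nid matches an LWP id.
# $1 = decimal LWP id, $2 = path to a jstack dump file.
# The trailing space in the pattern avoids matching a longer nid.
find_thread() {
  grep "nid=$(lwp_to_nid "$1") " "$2"
}

# Usage:
#   lwp_to_nid 6373                   # prints 0x18e5
#   find_thread 6373 threaddump.log   # prints the "MariaDb-failover-3" line
```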

========== More info

Our API has 5 different database configurations, one for each of 5 different databases. Each has a connection string like this: jdbc:mysql:aurora://some-aurora-cluster

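Note that the aurora mode is there for a better failover experience, which explains the 5 lightweight processes (one per database configuration). But they are chatty, which results in many TIME_WAIT connections and possibly in the elevated CPU usage.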

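Has anyone run into this before, and how did you mitigate it? I would still like to use the aurora mode (or an equivalent), so that we don't have to restart the application when the database fails over.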

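After a few days of searching and research, I finally dug into the MariaDb driver code, specifically the area shown in the thread dump, and followed the call stack from there.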

https://github.com/mariadb-corporation/mariadb-connector-j/blob/master/src/main/java/org/mariadb/jdbc/internal/protocol/AuroraProtocol.java

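I found a reference to a setting called retriesAllDown, which has a default value of 120. Reading further in the MariaDb driver knowledge base pages, I found another setting, failoverLoopRetries, with the same default value of 120.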

You can read more about the MariaDb driver settings here: https://github.com/mariadb-corporation/mariadb-connector-j/blob/3bc66153b51aca188afc50ff35a0123f16c099ed/src/main/java/org/mariadb/jdbc/util/DefaultOptions.java

For our team and our API, we are comfortable with a value of 12 (10% of the default) and decided to use it for both settings. Here is the modified connection string:

jdbc:mysql:aurora://some-aurora-cluster?retriesAllDown=12&failoverLoopRetries=12

This significantly reduced the CPU usage while still keeping the failover capability we need.
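With five database configurations, the same two parameters have to be added to every URL. A small sketch of how that could be scripted, assuming the URLs are assembled in a shell deployment script; with_failover_limits is a hypothetical helper, and 12 is simply the value chosen above:

```shell
# Hypothetical helper: append both retry limits to a JDBC URL.
# $1 = JDBC URL, $2 = retry count (defaults to 12, the value used above).
# Uses "?" when the URL has no query string yet, otherwise "&".
with_failover_limits() {
  url=$1
  retries=${2:-12}
  case "$url" in
    *\?*) sep='&' ;;
    *)    sep='?' ;;
  esac
  printf '%s%sretriesAllDown=%s&failoverLoopRetries=%s\n' \
    "$url" "$sep" "$retries" "$retries"
}

# Usage:
#   with_failover_limits 'jdbc:mysql:aurora://some-aurora-cluster'
```

Keep the resulting URL in quotes wherever it passes through a shell, because an unquoted & would otherwise be treated as a background operator.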

I hope this answer helps someone. I won't mark it as the answer to my original question until it has helped at least 10 other people.
