JVM crash and disk/kernel read errors: is there a correlation?



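We found that the JVM started crashing for no apparent reason after a minor-version update of the JRE, so the update was the first suspect. But after correlating the crash times with the syslog messages, I found that every time a crash occurs, a memory read error is logged by the kernel. The machine has enough RAM, but I guess Linux still uses swap. My assumption is that the disk errors are causing the JVM crashes. Is that a reasonable assumption?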
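One way to check the guess that swap is in use despite ample RAM is to look at the `VmSwap` field of `/proc/<pid>/status` for the JVM process. The following is only a minimal sketch: the function names `swapped_kib` and `swapped_kib_for_pid` are made up for illustration, and it assumes a Linux kernel recent enough to report `VmSwap`.

```python
import re
from pathlib import Path


def swapped_kib(status_text: str) -> int:
    """Return the VmSwap value in kB from /proc/<pid>/status text, or 0 if absent."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def swapped_kib_for_pid(pid: int) -> int:
    """Read /proc/<pid>/status for a live process (hypothetical helper, Linux only)."""
    return swapped_kib(Path(f"/proc/{pid}/status").read_text())
```

A non-zero value for the JVM's pid would mean that some of its pages live on the swap device, so a failed read from that device could plausibly hit the process.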

JVM crash stack

be-7.2.1/lib/jdbc/mysql/mysql-connector-java-5.1.46.jar org.sonar.ce.app.CeServer /home/cicd/sonarqube-7.2.1/temp/sq-process3072857830430806886properties
Picked up _JAVA_OPTIONS: -Xmx60g
2020.02.06 11:51:16 INFO  app[][o.s.a.SchedulerImpl] Process[ce] is up
2020.02.06 11:51:16 INFO  app[][o.s.a.SchedulerImpl] SonarQube is up
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f7df7918531, pid=13479, tid=0x00007f6e59cf9700
#
# JRE version: OpenJDK Runtime Environment (8.0_242-b08) (build 1.8.0_242-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.242-b08 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0xa42531]  Symbol::increment_refcount()+0x1
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
# An error report file with more information is saved as:
# /home/xxx/sonarqube-7.2.1/elasticsearch/hs_err_pid13479.log
[error occurred during error reporting , id 0x7]
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
2020.02.18 07:35:17 WARN  app[][o.s.a.p.AbstractProcessMonitor] Process exited with exit value [es]: 134
2020.02.18 07:35:17 INFO  app[][o.s.a.SchedulerImpl] Process [es] is stopped
2020.02.18 07:35:21 INFO  app[][o.s.a.SchedulerImpl] Process [ce] is stopped
2020.02.18 07:35:24 INFO  app[][o.s.a.SchedulerImpl] Process [web] is stopped
2020.02.18 07:35:24 INFO  app[][o.s.a.SchedulerImpl] SonarQube is stopped

Syslog errors

"kernel: blk_update_request: critical medium error, dev sda, sector"
[libjvm.so+0xa42531]  Symbol::increment_refcount()+0x1
Feb 18 07:35:14 xxx kernel: Read-error on swap-device (253:1:2227784)
Feb 18 07:35:14 xxx kernel: Read-error on swap-device (253:1:2227792)
Feb 18 07:35:14 xxx kernel: Read-error on swap-device (253:1:2227
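The manual correlation described in the question can be sketched roughly as follows. This is an illustration, not the procedure actually used: the function names and the 60-second window are assumptions, the year has to be supplied because the syslog timestamps omit it, and `%b` parsing assumes an English/C locale. A SIGBUS a few seconds after a swap read error would be consistent with a page-in that failed, though the sketch only shows that the events are close in time.

```python
from datetime import datetime, timedelta

# Kernel messages treated as disk read errors (taken from the logs above).
ERROR_MARKERS = ("Read-error on swap-device", "critical medium error")


def parse_app_time(line: str) -> datetime:
    """Parse the leading 'YYYY.MM.DD HH:MM:SS' stamp of a SonarQube log line."""
    return datetime.strptime(line[:19], "%Y.%m.%d %H:%M:%S")


def parse_syslog_time(line: str, year: int) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def errors_near_crash(crash_lines, syslog_lines, year, window_seconds=60):
    """Return (crash_line, syslog_line) pairs whose timestamps lie within the window."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for crash in crash_lines:
        crash_time = parse_app_time(crash)
        for entry in syslog_lines:
            if not any(marker in entry for marker in ERROR_MARKERS):
                continue
            if abs(crash_time - parse_syslog_time(entry, year)) <= window:
                hits.append((crash, entry))
    return hits
```

Fed with the "Process exited with exit value [es]: 134" line and the kernel lines above, it would pair the 07:35:17 exit with the 07:35:14 swap read errors.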

The disk sits on older IBM hardware, of a kind that has already failed on other servers in the past.

=== START OF INFORMATION SECTION ===
Vendor:               IBM
Product:              ServeRAID M5110
Revision:             3.45
Compliance:           SPC-3
User Capacity:        5,996,996,984,832 bytes [5.99 TB]
Logical block size:   512 bytes
Logical Unit id:      0x600605b0072bb48022f34180127fc92d
Serial number:        002dc97f128041f32280b42b07b00506
Device type:          disk
Local Time is:        Wed Feb 19 06:44:22 2020 EET
SMART support is:     Unavailable - device lacks SMART capability

As @user207421 pointed out, the JVM crash was caused by the read error. (The libjvm.so entry in the logs pins it down.)

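A few more read errors have occurred on this machine since then. On those occasions, however, the JVM did not crash and there was no libjvm.so entry in the syslogs.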

Mar  4 03:49:33 xxx kernel: blk_update_request: critical medium error, dev sda, sector 15250816
Mar  4 03:49:33 xxx kernel: XFS (dm-2): metadata I/O error: block 0x425d80 ("xfs_trans_read_buf_map") error 61 numblks 32
Mar  4 03:49:33 xxx kernel: XFS (dm-2): xfs_imap_to_bp: xfs_trans_read_buf() returned error -61.
Mar  4 03:49:37 xx kernel: megaraid_sas 0000:1b:00.0: 17102 (636601577s/0x0002/FATAL) - Unrecoverable 

Latest update