Troubleshooting a CPU Spike Caused by GC

Background

I got an alert in the morning that the CPU usage on one of our test machines had reached 90%. The monitoring showed the usage had been climbing steadily since the early hours of the morning, hitting roughly 96% by around 7 a.m.

Troubleshooting Steps

  1. Run top to see which process is hogging the CPU. As the screenshot below shows, the Java process with PID 12961 was the one in trouble.

    (screenshot: GC_1)

  2. Run ps -ef|grep 12961 to see which Java process it is. It turned out to be SkyWalking (an APM tool; see http://skywalking.apache.org/ if you want to learn more).

  3. Run top -H -p 12961 to see which threads inside the process are consuming the CPU. As the screenshot shows, several threads, among them 12965, 12966, and 12967, stood out. Run printf "%x\n" 12965 to convert thread ID 12965 to hexadecimal, which gives 32a5. (screenshot: GC_2)

  4. Dump the thread stacks with jstack 12961 > ~/12961.jstack and look up the thread whose nid is 32a5. As the output shows, it is a GC thread. The commands for steps 1-4 are consolidated in the first sketch after this list. (screenshot: GC_3)

  5. To double-check, run jstat -gcutil 12961 2000 10 to print the GC statistics. The old generation was at 100% and full GCs were running back to back, so at this point it was fairly safe to conclude that GC was the problem; see the second sketch after this list for how to read the jstat output. (screenshot: GC_4)

  6. Dump the heap with jmap -dump:format=b,file=/home/gateway/12961.bin 12961 and analyze it in MAT. The analysis pointed to MpsArrayQueue as a likely suspect, and SkyWalking's error log showed failures when writing to Elasticsearch. After checking with the ES team, it turned out they had shut down the test-environment ES cluster overnight (SkyWalking relies on ES as its storage layer), which is what caused the trouble on our side. (screenshot: GC_5)
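For future reference, the shell commands from steps 1-4 are collected below. This is a minimal sketch: the PID 12961 and thread ID 12965 are simply the values from this incident, and the ps column flags assume a Linux procps-style ps.

    # 1. List the processes consuming the most CPU (same information as top):
    ps -eo pid,pcpu,comm --sort=-pcpu | head -n 5

    # 2. Confirm what process 12961 actually is:
    ps -ef | grep 12961

    # 3. Show per-thread CPU usage inside the process, then convert the
    #    busiest thread ID to hex so it can be matched against jstack output:
    top -H -p 12961
    printf "%x\n" 12965        # prints 32a5

    # 4. Dump the thread stacks and find that thread by its nid:
    jstack 12961 > ~/12961.jstack
    grep -A 10 "nid=0x32a5" ~/12961.jstack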

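Likewise, a sketch of the GC check and heap dump from steps 5 and 6, with notes on the jstat -gcutil columns that mattered here (column names as printed by the JDK 8 jstat):

    # 5. Sample GC statistics every 2000 ms, 10 times:
    jstat -gcutil 12961 2000 10
    #    O    = old generation occupancy (%); here it was stuck at 100
    #    FGC  = number of full GCs; here it kept growing between samples
    #    FGCT = total full GC time (s), GCT = total GC time (s)

    # 6. Dump the heap for offline analysis in MAT. The dump pauses the JVM
    #    and the file can be large, so be careful on busy machines:
    jmap -dump:format=b,file=/home/gateway/12961.bin 12961

The SkyWalking error log captured at the time is reproduced below.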
2019-05-28 10:43:26,068 - org.apache.skywalking.oap.server.core.analysis.worker.IndicatorPersistentWorker - 106 [pool-16-thread-1] ERROR [] - Request cannot be executed; I/O reactor status: STOPPED
java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED
at org.apache.http.util.Asserts.check(Asserts.java:46) ~[httpcore-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.ensureRunning(CloseableHttpAsyncClientBase.java:90) ~[httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.client.InternalHttpAsyncClient.execute(InternalHttpAsyncClient.java:123) ~[httpasyncclient-4.1.2.jar:4.1.2]
at org.elasticsearch.client.RestClient.performRequestAsync(RestClient.java:366) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestClient.performRequestAsyncNoCatch(RestClient.java:351) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:233) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:198) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:522) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:508) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestHighLevelClient.get(RestHighLevelClient.java:293) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
at org.apache.skywalking.oap.server.library.client.elasticsearch.ElasticSearchClient.get(ElasticSearchClient.java:187) ~[library-client-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.IndicatorEsDAO.get(IndicatorEsDAO.java:48) ~[storage-elasticsearch-plugin-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.core.analysis.worker.IndicatorPersistentWorker.lambda$prepareBatch$0(IndicatorPersistentWorker.java:104) [server-core-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.core.analysis.worker.IndicatorPersistentWorker$$Lambda$200/90606637.accept(Unknown Source) [server-core-6.1.0.jar:6.1.0]
at java.util.HashMap$Values.forEach(HashMap.java:972) [?:1.8.0_40]
at org.apache.skywalking.oap.server.core.analysis.worker.IndicatorPersistentWorker.prepareBatch(IndicatorPersistentWorker.java:101) [server-core-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.core.analysis.worker.IndicatorPersistentWorker.prepareBatch(IndicatorPersistentWorker.java:37) [server-core-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.core.analysis.worker.PersistenceWorker.buildBatchCollection(PersistenceWorker.java:90) [server-core-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.lambda$extractDataAndSave$2(PersistenceTimer.java:95) [server-core-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.core.storage.PersistenceTimer$$Lambda$199/882433893.accept(Unknown Source) [server-core-6.1.0.jar:6.1.0]
at java.util.ArrayList.forEach(ArrayList.java:1249) [?:1.8.0_40]
at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.extractDataAndSave(PersistenceTimer.java:89) [server-core-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.core.storage.PersistenceTimer.lambda$start$0(PersistenceTimer.java:65) [server-core-6.1.0.jar:6.1.0]
at org.apache.skywalking.oap.server.core.storage.PersistenceTimer$$Lambda$193/124351398.run(Unknown Source) [server-core-6.1.0.jar:6.1.0]
at org.apache.skywalking.apm.util.RunnableWithExceptionProtection.run(RunnableWithExceptionProtection.java:36) [apm-util-6.1.0.jar:6.1.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_40]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_40]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_40]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_40]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_40]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_40]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_40]

Postscript

I have not given the exact root cause in this write-up because I have not had time to read the relevant source code yet; readers who are interested are welcome to dig into it themselves. I am recording the process mainly so that similar problems can be investigated without having to look up all these commands again. My advice for future incidents of this kind: preserve the evidence first by following the steps above, and only then restart to restore the service, because once the process is restarted all of the diagnostic information is gone and there is nothing left to analyze. I also plan to write source-code analysis articles on SkyWalking later, and I will give the concrete answer then.