Linux, MongoDB

A case note on an unexpected MongoDB crash (root cause not found)

I. Symptoms

A colleague reported that connections to MongoDB were failing with the following error:

Timed out after 30000 ms while waiting to connect. Client view of cluster state is {type=UNKNOWN, servers=[{address=172.16.6.164:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]; nested exception is com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting to connect. Client view of cluster state is {type=UNKNOWN, servers=[{address=172.16.6.164:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused (Connection refused)}}]

This is a MongoDB replica set: the primary runs on 172.16.6.164 and the secondary on 172.16.19.110.

II. Troubleshooting

1 Check the mongod process

Logged in to the primary MongoDB server and checked for the mongod process; sure enough, it was no longer running.
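For reference, a minimal version of this check looks something like the following (standard Linux tooling only; nothing here is specific to this deployment):

# look for a running mongod process; the [m] keeps grep from matching itself
ps -ef | grep [m]ongod

# or, equivalently, match against the full command line
pgrep -fl mongod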

2 Check the MongoDB log

2022-06-07T11:23:41.307+0800 I NETWORK  [conn571] received client metadata from 172.16.17.17:14386 conn571: { driver: { name: "mongo-java-driver|sync|spring-boot", version: "4.2.3" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "3.10.0-862.el7.x86_64" }, platform: "Java/Oracle Corporation/1.8.0_171-8u171-b11-1~deb9u1-b11" }
2022-06-07T11:23:59.551+0800 I -        [conn559] end connection 172.16.17.17:35155 (85 connections now open)
2022-06-07T11:23:59.551+0800 I -        [conn558] end connection 172.16.17.17:34240 (85 connections now open)
2022-06-07T11:23:59.552+0800 I -        [conn563] end connection 172.16.17.17:25101 (84 connections now open)
2022-06-07T11:24:08.600+0800 I NETWORK  [thread1] connection accepted from 172.16.17.17:37428 #573 (83 connections now open)
2022-06-07T11:24:08.608+0800 I NETWORK  [conn573] received client metadata from 172.16.17.17:37428 conn573: { driver: { name: "mongo-java-driver|sync|spring-boot", version: "4.2.3" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "3.10.0-862.el7.x86_64" }, platform: "Java/Oracle Corporation/1.8.0_171-8u171-b11-1~deb9u1-b11" }
2022-06-07T11:24:08.715+0800 I ACCESS   [conn573] Successfully authenticated as principal biz_manage_dev on biz_manage_dev

The last entry in the log is at 2022-06-07T11:24:08.715+0800, and it is a perfectly normal message. After that the log simply stops; there are no further entries at all.
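The entries above come from the mongod log file. A quick sketch of how to locate and inspect it, assuming the conventional path /var/log/mongodb/mongod.log (the real location is whatever systemLog.path points to in this instance's configuration file, assumed here to be /etc/mongod.conf):

# confirm where this instance writes its log
grep -A 3 systemLog: /etc/mongod.conf

# look at the last entries written before the process disappeared
tail -n 200 /var/log/mongodb/mongod.log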

3 Check the operating system log

[root@k8-minion-ident-test2-164 ~]# dmesg -T|grep -E "oom|kill|out of"
[四 6月  2 10:50:24 2022] iptables invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=-999
[四 6月  2 10:50:24 2022]  [<ffffffff81184d0e>] oom_kill_process+0x24e/0x3c0
[四 6月  2 10:50:24 2022]  [<ffffffff811847ad>] ? oom_unkillable_task+0xcd/0x120
[四 6月  2 10:50:24 2022] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[四 6月  2 12:16:45 2022] conn1206 invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[四 6月  2 12:16:45 2022]  [<ffffffff81184d0e>] oom_kill_process+0x24e/0x3c0
[四 6月  2 12:16:45 2022]  [<ffffffff811847ad>] ? oom_unkillable_task+0xcd/0x120
[四 6月  2 12:16:45 2022] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[四 6月  2 14:46:43 2022] irqbalance invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[四 6月  2 14:46:43 2022]  [<ffffffff81184d0e>] oom_kill_process+0x24e/0x3c0
[四 6月  2 14:46:43 2022]  [<ffffffff811847ad>] ? oom_unkillable_task+0xcd/0x120
[四 6月  2 14:46:43 2022] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[root@k8-minion-ident-test2-164 ~]# 
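Note that the oom-killer entries above are all dated June 2, several days before the crash at 11:24 on June 7, so they do not explain this outage. To double-check the system logs around the crash window, something along these lines can be used (the journalctl form assumes a systemd-based host such as CentOS 7; the time range is only an example):

# system journal around the crash time
journalctl --since "2022-06-07 11:20:00" --until "2022-06-07 11:30:00" | grep -iE "mongod|oom|kill"

# the same check against the classic syslog file
grep "Jun  7 11:2" /var/log/messages | grep -iE "mongod|oom|kill"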

4 Check the secondary's log

2022-06-07T06:59:28.005+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to k8-minion-ident-test2-164:27017, took 4ms (3 connections now open to k8-minion-ident-test2-164:27017)
2022-06-07T07:00:28.003+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Ending idle connection to host k8-minion-ident-test2-164:27017 because the pool meets constraints; 2 connections to that host remain open
2022-06-07T11:24:34.678+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host k8-minion-ident-test2-164:27017 due to bad connection status; 1 connections to that host remain open
2022-06-07T11:24:34.679+0800 I REPL     [replication-1189] Restarting oplog query due to error: HostUnreachable: End of file. Last fetched optime (with hash): { ts: Timestamp 1654572268000|1, t: 1 }[-2529022957656624853]. Restarts remaining: 3
2022-06-07T11:24:34.679+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host k8-minion-ident-test2-164:27017 due to bad connection status; 0 connections to that host remain open
2022-06-07T11:24:34.682+0800 I -        [conn4] end connection 172.16.6.164:56140 (1 connection now open)
2022-06-07T11:24:34.683+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
2022-06-07T11:24:34.684+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host k8-minion-ident-test2-164:27017 due to bad connection status; 1 connections to that host remain open
2022-06-07T11:24:34.684+0800 I REPL     [replication-1189] Scheduled new oplog query Fetcher source: k8-minion-ident-test2-164:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1654572268000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 1 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 917653 -- target:k8-minion-ident-test2-164:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1654572268000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 1 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2022-06-07T11:24:34.685+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Failed to close stream: Transport endpoint is not connected
2022-06-07T11:24:34.685+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to k8-minion-ident-test2-164:27017
2022-06-07T11:24:34.737+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Failed to connect to k8-minion-ident-test2-164:27017 - HostUnreachable: Connection refused
2022-06-07T11:24:34.755+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to k8-minion-ident-test2-164:27017; HostUnreachable: Connection reset by peer
2022-06-07T11:24:34.755+0800 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to k8-minion-ident-test2-164:27017: HostUnreachable: Connection reset by peer
2022-06-07T11:24:34.756+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Dropping all pooled connections to k8-minion-ident-test2-164:27017 due to failed operation on a connection
2022-06-07T11:24:34.756+0800 I REPL     [replication-1189] Restarting oplog query due to error: HostUnreachable: Connection refused. Last fetched optime (with hash): { ts: Timestamp 1654572268000|1, t: 1 }[-2529022957656624853]. Restarts remaining: 2
2022-06-07T11:24:34.756+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to k8-minion-ident-test2-164:27017
2022-06-07T11:24:34.756+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to k8-minion-ident-test2-164:27017
2022-06-07T11:24:34.756+0800 I REPL     [replication-1189] Scheduled new oplog query Fetcher source: k8-minion-ident-test2-164:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1654572268000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 1 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 917659 -- target:k8-minion-ident-test2-164:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1654572268000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 1 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2022-06-07T11:24:34.756+0800 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to k8-minion-ident-test2-164:27017
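The secondary lost its sync source at 11:24:34, matching the time the primary's log went silent. While the primary is down, the replica set state as seen from the surviving node can be confirmed with something like the following (legacy mongo shell; authentication options are omitted and would need to be added for this deployment):

# print each member's name and state (PRIMARY / SECONDARY / unreachable) as seen from 172.16.19.110
mongo --host 172.16.19.110 --quiet --eval "rs.status().members.forEach(function(m){ print(m.name + ' : ' + m.stateStr) })"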

III. Resolution

On the primary server (172.16.6.164), start MongoDB manually.
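How mongod gets started depends on how it was deployed on this server. A hedged sketch, assuming the configuration file is /etc/mongod.conf (or, alternatively, that a mongod systemd unit exists):

# start mongod directly with its configuration file
mongod -f /etc/mongod.conf

# or, if the instance is managed by systemd
systemctl start mongod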

At this point, the original secondary on 172.16.19.110 had taken over as primary. So, on that node, open the mongo shell and run:

 rs.stepDown()        # step down the current primary so that it becomes a secondary

After the step-down, the MongoDB instance on the original 172.16.6.164 server became primary again and resumed serving requests.
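Each node's current role can be verified from the mongo shell, for example (authentication again omitted):

# prints true when run against the node that currently holds the primary role
mongo --host 172.16.6.164 --quiet --eval "print(db.isMaster().ismaster)"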

IV. Reference

Usage of the rs.stepDown() method:

Instructs the primary of the replica set to become a secondary. After the primary steps down, eligible secondaries will hold an election for primary.

The method does not immediately step down the primary. If no electable secondaries are up to date with the primary, the primary waits up to secondaryCatchUpPeriodSecs (by default 10 seconds) for a secondary to catch up. Once an electable secondary is available, the method steps down the primary.

Once stepped down, the original primary becomes a secondary and is ineligible to become primary again for the remainder of the time specified by stepDownSecs.

For a detailed explanation of the method's execution, see Behavior.

https://www.mongodb.com/docs/manual/reference/method/rs.stepDown/
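rs.stepDown() also accepts optional arguments: rs.stepDown(stepDownSecs, secondaryCatchUpPeriodSecs). A usage sketch with illustrative values, run against whichever node is currently primary (172.16.19.110 in this incident); note that on older MongoDB versions the shell connection may be dropped when the step-down succeeds, so an error from the shell does not necessarily mean failure:

# step down for 120 seconds, waiting at most 10 seconds for an electable secondary to catch up
mongo --host 172.16.19.110 --eval "rs.stepDown(120, 10)"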

V. Summary

I have not yet found out why this MongoDB instance suddenly went down. The MongoDB log contains no related error message, and there is no OOM record around the time of the crash (the oom-killer entries in dmesg are all from June 2, days earlier). I am noting the case here for reference in case something similar happens again.
