What happens when a Docker Swarm manager node dies
When a Docker Swarm manager node dies, the impact on the Swarm depends primarily on how many manager nodes the cluster has, i.e. whether you have set up high availability with multiple managers. Here’s what happens and how Docker Swarm handles it:
High Availability in Docker Swarm
To ensure high availability, it is recommended to run an odd number of manager nodes (typically 3 or 5) in your Docker Swarm. The managers keep the cluster state consistent through the Raft consensus algorithm, which requires a majority (a quorum) of managers to be reachable: a Swarm with N managers therefore tolerates the loss of at most (N-1)/2 of them, so 3 managers tolerate 1 failure and 5 tolerate 2.
Manager Node Responsibilities
Manager nodes in Docker Swarm are responsible for:
- Orchestrating and Scheduling Tasks: Deciding where and when to run tasks (containers) across the Swarm.
- Maintaining Cluster State: Managing the state of the Swarm, including services, tasks, and nodes.
- Service Discovery and Networking: Managing the internal DNS and overlay networks for service discovery.
- Health Monitoring: Monitoring the health of nodes and services.
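At any given moment, exactly one manager acts as the Raft leader and performs these duties; the other managers replicate the cluster state and stand by to take over. You can see which node currently holds the leader role with docker node ls. The output below is illustrative (placeholder IDs and hostnames, trailing columns trimmed); the MANAGER STATUS column shows Leader for the current leader and Reachable for the other healthy managers:
docker node ls
ID               HOSTNAME    STATUS   AVAILABILITY   MANAGER STATUS
a1b2c3d4e5f6 *   manager-a   Ready    Active         Leader
g7h8i9j0k1l2     manager-b   Ready    Active         Reachable
m3n4o5p6q7r8     manager-c   Ready    Active         Reachable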
Single Manager Node Failure
If you have a single manager node and it dies, the entire Swarm cluster is affected, as there’s no fallback manager to take over its responsibilities. Containers already running on worker nodes generally keep running, but the cluster can no longer be managed. This can cause:
- Task Scheduling Failure: No new tasks can be scheduled or existing tasks rescheduled if needed.
- State Management Issues: The state of the Swarm (services, nodes, tasks) cannot be modified.
- Service Discovery Impact: Potential disruption in service discovery and internal networking.
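Whether recovery is possible in this scenario depends on the Swarm’s Raft data surviving. As a sketch, assuming the manager’s state directory (/var/lib/docker/swarm by default) is intact, for example after restoring the host or reattaching its disk, you can rebuild a working single-manager Swarm from the existing state:
docker swarm init --force-new-cluster --advertise-addr <MANAGER-IP>:2377
This creates a new single-manager cluster that keeps the existing services, and worker nodes reconnect to it; you can then promote additional managers to restore redundancy.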
Multiple Manager Nodes (High Availability)
If you have set up multiple manager nodes, Docker Swarm can handle the failure of a single manager node gracefully.
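Before relying on this, you can confirm how many managers the Swarm currently has; docker node ls accepts a role filter:
docker node ls --filter role=manager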
Scenario: Multiple Manager Nodes
- Initial Setup: Suppose you have a Swarm with 3 manager nodes (Manager A, Manager B, Manager C) and several worker nodes.
- Manager Node Failure: If Manager A dies:
  - Leader Election: Docker Swarm uses the Raft consensus algorithm to maintain consistency and elect a new leader if the current leader dies. The remaining manager nodes (Manager B and Manager C) automatically elect a new leader.
  - Task Continuity: The new leader continues managing and scheduling tasks, maintaining the cluster’s operational state.
  - Fault Tolerance: The Swarm remains functional and can tolerate the failure of one manager node without service disruption.
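Run from a surviving manager, docker node ls makes the failover visible. Again the output is illustrative (placeholder IDs and hostnames): the dead Manager A reports STATUS Down with MANAGER STATUS Unreachable, while Manager B has taken over as Leader:
docker node ls
ID               HOSTNAME    STATUS   AVAILABILITY   MANAGER STATUS
a1b2c3d4e5f6     manager-a   Down     Active         Unreachable
g7h8i9j0k1l2 *   manager-b   Ready    Active         Leader
m3n4o5p6q7r8     manager-c   Ready    Active         Reachable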
Adding and Removing Manager Nodes
Adding a New Manager Node
To restore redundancy after a manager node failure, you can add a new manager node to the Swarm. Note that a node joining with the manager token becomes a manager immediately, so the promote step is only needed for nodes that joined as workers:
docker swarm join --token <MANAGER-TOKEN> <MANAGER-IP>:2377
docker node promote <NEW-NODE-ID>
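If you do not have the manager join token at hand, any reachable manager can print it, together with the full join command, using docker swarm join-token:
docker swarm join-token manager
The worker token is available the same way with docker swarm join-token worker.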
Removing a Failed Manager Node
You can remove the failed manager node from the Swarm. A manager must first be demoted to a worker before it can be removed:
docker node demote <FAILED-NODE-ID>
docker node rm <FAILED-NODE-ID>
If the node is down and cannot be demoted cleanly, docker node rm --force <FAILED-NODE-ID> removes it anyway.
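To tell whether a manager is dead or merely unreachable over the network, you can query its Raft reachability from another manager; the ManagerStatus.Reachability field used here is part of the standard docker node inspect output:
docker node inspect <FAILED-NODE-ID> --format "{{ .ManagerStatus.Reachability }}"
This prints reachable or unreachable for manager nodes.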
Practical Example
Here’s a step-by-step example illustrating the process:
- Create Swarm and Add Managers: Initialize the Swarm on Manager A, join nodes B and C (using the worker token), then promote them from Manager A:
docker swarm init --advertise-addr <MANAGER-A-IP>
docker swarm join --token <WORKER-TOKEN> <MANAGER-A-IP>:2377
docker node promote <NODE-B-ID>
docker node promote <NODE-C-ID>
- Deploy a Service:
docker service create --name my-service --replicas 3 my-service-image
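As a quick check that all three replicas were scheduled and are running, list the service’s tasks (my-service is the service name chosen above):
docker service ps my-service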
- Manager Node Failure: If Manager A fails, the Swarm cluster will elect a new leader (Manager B or Manager C).
- Monitor and Manage Nodes:
docker node ls
Use this command to see the status of all nodes and identify the failed manager node.
- Recover and Add a New Manager: On the new node, join the Swarm through a surviving manager, then promote it:
docker swarm join --token <WORKER-TOKEN> <MANAGER-B-IP>:2377
docker node promote <NEW-NODE-ID>
Replace <NEW-NODE-ID> with the ID of the new node you are promoting to a manager.
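Finally, once the replacement manager is in place, remove the dead Manager A so the Swarm returns to an odd number of managers, and verify the cluster state (here <MANAGER-A-ID> stands for the failed node’s ID as shown by docker node ls):
docker node demote <MANAGER-A-ID>
docker node rm <MANAGER-A-ID>
docker node ls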