Node Drain Agent Guide

Overview

Node Drain Agent reduces the time and effort of node drain operations, such as those performed during cluster upgrades. With this feature, you can use the CLI (the cnatix command) to check the status of the nodes and pods to be drained in advance. If the check finds no problems, Node Drain Agent marks the target nodes unschedulable and safely evicts their pods.

In addition to the functionality of the kubectl drain command, the agent provides interactive draining commands, a pre-check for pods whose strict PDBs block eviction, and a report of the pre-check results.

Enabling Node Drain Agent

The Node Drain Agent feature is not enabled by default. To enable it, take the following steps:

Upgrade CCOperator to the latest version and specify the Node Drain Agent parameters in values.yaml.
Add cluster-rightscaler-agent to the list of Cluster Controller components that are installed in the cluster.

To upgrade CCOperator, first pull the latest Helm chart (v0.566.0 or later):

helm repo add cloudnatix https://charts.cloudnatix.com
helm repo update
helm pull cloudnatix/ccoperator

Then set the CCOperator version and the Node Drain Agent parameters in values.yaml:

clusterRightscalerAgent:
  nodeDrainAgent:
    enable: true
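
The edited values.yaml takes effect only once it is applied to the CCOperator release. This guide does not show that command, so the following is a minimal sketch using the standard helm upgrade command. The release name (ccoperator) and the namespace (cloudnatix) are assumptions, not values taken from this guide; replace them with the ones used in your installation.

```shell
# Sketch: apply the updated values.yaml to an existing CCOperator release.
# RELEASE and NAMESPACE default to assumed names; override them to match
# the release name and namespace of your own installation.
upgrade_ccoperator() {
  release="${RELEASE:-ccoperator}"
  namespace="${NAMESPACE:-cloudnatix}"
  values_file="${1:-values.yaml}"
  helm upgrade "$release" cloudnatix/ccoperator \
    --namespace "$namespace" \
    --values "$values_file"
}
```

For example, RELEASE=my-release NAMESPACE=my-namespace upgrade_ccoperator values.yaml upgrades the release named my-release with the parameters shown above.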

Next, add cluster-rightscaler-agent to the list of Cluster Controller components. Here is an example command:

cnatix clusters components create \
  --cluster <cluster-name> \
  --kind cluster-rightscaler-agent \
  --version <agent-version> [--upgradable]

The latest stable version of cluster-rightscaler-agent can be found here. When --upgradable is set, the component is auto-upgraded.

Draining nodes with the CLI

You can drain nodes using the cnatix k8s drain run command:

cnatix k8s drain run --cluster <cluster-name> [<node>...]

This command accepts multiple nodes to be drained, separated by spaces. To select the target nodes by label instead, set the --node-label-selector flag and omit the node arguments.

Note: Node arguments and a label selector cannot be used together.

The following is example output from draining worker-node-1 in my-cluster. (Lines beginning with '#' are added for explanation; the actual command does not print them.)
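
The two ways of selecting nodes can be sketched as a small wrapper that rejects the combination described in the note before calling the CLI. The function name drain_nodes, its -l option, and the label pool=batch are hypothetical and exist only for this illustration. The sketch also assumes that --node-label-selector takes a Kubernetes-style label selector as its value, which this guide does not state explicitly.

```shell
# Sketch: call "cnatix k8s drain run" with either node names or a label
# selector, never both. drain_nodes and its -l option are hypothetical.
# Usage: drain_nodes <cluster> [-l <selector>] [<node>...]
drain_nodes() {
  if [ "$#" -lt 1 ]; then
    echo "usage: drain_nodes <cluster> [-l <selector>] [<node>...]" >&2
    return 1
  fi
  cluster="$1"; shift
  selector=""
  if [ "${1:-}" = "-l" ]; then
    if [ "$#" -lt 2 ]; then
      echo "error: -l requires a selector" >&2
      return 1
    fi
    selector="$2"; shift 2
  fi
  # Mirror the CLI rule: node arguments and a selector are mutually exclusive.
  if [ -n "$selector" ] && [ "$#" -gt 0 ]; then
    echo "error: pass either node names or a label selector, not both" >&2
    return 1
  fi
  if [ -n "$selector" ]; then
    # Assumption: the flag value uses Kubernetes label selector syntax.
    cnatix k8s drain run --cluster "$cluster" --node-label-selector "$selector"
  else
    cnatix k8s drain run --cluster "$cluster" "$@"
  fi
}
```

With this wrapper, drain_nodes my-cluster worker-node-1 worker-node-2 drains two named nodes, and drain_nodes my-cluster -l pool=batch drains the nodes carrying the (hypothetical) label pool=batch.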

$ cnatix k8s drain run --cluster my-cluster worker-node-1
# (1) Ask for final confirmation before sending the drain request.
? Request node drain to worker-node-1? Yes
- Requesting node drain... done (uuid: e607dc7a-826d-4d7d-be9a-7df73f6c07b7)
- Preparing node drain (phase: PENDING)... done
- Waiting for node drain to be ready (phase: PENDING)... done
- Checking node drain can be run (phase: READY)... done

# (2) Report pods to be evicted that did not pass the pre-check.
X 13 eviction is blocked for the following reasons
Pods matching a strict PDB: [3]
- shopping/cart-5967dc5db7-7ll2k in node/worker-node-1: pod with strict PDB (DisruptionsAllowed=0)
- shopping/shop-57cc4b4bd7-2vx7z in node/worker-node-1: pod with strict PDB (DisruptionsAllowed=0)
- shopping/shop-57cc4b4bd7-5plbg in node/worker-node-1: pod with strict PDB (DisruptionsAllowed=0)
Pods are managed by DaemonSet: [4]
- shopping/ds-1-dp644 in node/worker-node-1:
- shopping/ds-2-cfb82 in node/worker-node-1:
- kube-system/fluentd-elasticsearch-sxpxv in node/worker-node-1:
- kube-system/kube-proxy-7kjh6 in node/worker-node-1:
Pods with a local storage: [2]
- shopping/local-storage-7d4d88bdc4-wfhlw in node/worker-node-1:
- shopping/prometheus-0 in node/worker-node-1:

# (3) Ask how to deal with pods that did not pass the pre-check.
? Cancel this node drain request? No, continue and choose how to resolve the above pods.
? How to handle "Pods matching a strict PDB"? skip
? How to handle "Pods are managed by DaemonSet"? skip
? How to handle "Pods with a local storage"? force_evict
- Requesting how to handle pod with filter errors... done
- Waiting for filter errors to be resolved (phase: IN_PROGRESS)... done

# (4) Display the eviction progress.
Pod Eviction Status:
✓ shopping/backend-68db66cb46-fx676 in node/worker-node-1 is evicted
✓ shopping/ui-67df8bfbc7-dhcdj in node/worker-node-1 is evicted
✓ shopping/backend-68db66cb46-qcgzj in node/worker-node-1 is evicted
✓ shopping/local-storage-7d4d88bdc4-wfhlw in node/worker-node-1 is evicted
✓ shopping/prometheus-0 in node/worker-node-1 is evicted

Request is SUCCEEDED

Running `k8s drain run` after a node draining request failed

When a node draining request fails, Node Drain Agent does not automatically process subsequent drain requests. To have them processed, you need to explicitly confirm the failure.

Here is an example of a node drain request that failed because the specified node was not found in the cluster:

Request is FAILED: get nodes: nodes "ip-aaa-bbb-ccc-ddd.region.compute.internal" not found

After such a failure, Node Drain Agent stops processing node draining requests (both queued and newly created) until the user explicitly tells it to proceed.

When you run the k8s drain run command in this situation, you'll see a warning like the following:

! Request is pending due to that the last completed request has failed
? Continue node drain request ignoring the last drain failure? (Y/n)

Answering yes to the prompt forces the new request to run.

Canceling a node draining request manually

A node draining request may become detached from the CLI because of an error. In that case, cancel the detached request manually.

To cancel a node draining request, run the k8s drain cancel command and specify the UUID of the request.

To list the node draining requests for your tenant, including detached ones, use the k8s drain list command.

Marking nodes schedulable

The cnatix k8s schedulable create command can be used when you want to mark nodes as schedulable again. For example, you can run the command after node maintenance has been completed.

cnatix k8s schedulable create --cluster <cluster-name> [<node>...]
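
As a sketch of the maintenance workflow described above, the following function marks nodes schedulable and then checks the result with kubectl. The function name mark_schedulable is hypothetical. The sketch assumes that kubectl is configured for the same cluster and that the change is visible as soon as the cnatix command returns; if it is applied asynchronously, the check may need to be retried.

```shell
# Sketch: mark nodes schedulable again, then confirm the result with kubectl.
# mark_schedulable is a hypothetical helper, not part of the cnatix CLI.
# Usage: mark_schedulable <cluster> <node>...
mark_schedulable() {
  if [ "$#" -lt 2 ]; then
    echo "usage: mark_schedulable <cluster> <node>..." >&2
    return 1
  fi
  cluster="$1"; shift
  cnatix k8s schedulable create --cluster "$cluster" "$@" || return 1
  for node in "$@"; do
    # .spec.unschedulable is "true" on a cordoned node and empty otherwise.
    state="$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}')"
    if [ "$state" = "true" ]; then
      echo "$node is still unschedulable" >&2
      return 1
    fi
    echo "$node is schedulable"
  done
}
```

For example, mark_schedulable my-cluster worker-node-1 marks worker-node-1 schedulable and reports whether the node still carries the unschedulable flag.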