Optimizing Kubernetes with Spot Rebalancing

Overview

CloudNatix provides an agent that automatically rebalances On-Demand nodes and Spot nodes in a single cluster. The auto-rebalancing moves workloads from On-Demand to Spot nodes to maximize the utilization of Spot instances and reduce the cluster cost.

Consider the following example scenario to understand how auto-rebalancing works. Suppose that a K8s cluster has a mix of On-Demand nodes and Spot nodes. Some workloads must run on On-Demand nodes for availability reasons, and a node selector or node affinity is used to pin them to On-Demand nodes (e.g., nodeSelector with eks.amazonaws.com/capacityType: ON_DEMAND). All other workloads can run on either On-Demand nodes or Spot nodes (we call these workloads "Spot-eligible").
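As an illustration of the pinning described above, here is a minimal sketch of a pod that is restricted to On-Demand nodes through a nodeSelector. The pod name and container image are hypothetical placeholders; the label key and value are the ones mentioned in the text and assume EKS managed node groups.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-api            # hypothetical workload name
spec:
  # Pin the pod to On-Demand capacity; pods without this selector
  # remain schedulable on either On-Demand or Spot nodes.
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
  containers:
    - name: app
      image: nginx:1.25         # placeholder image
```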

One common technique for clusters with a mix of On-Demand nodes and Spot nodes is to run Cluster Autoscaler with the priority-based expander. This setting allows Cluster Autoscaler to prefer Spot instances but fall back to On-Demand instances when AWS doesn't have available Spot capacity.
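For reference, a priority expander configuration might look like the sketch below. It assumes Cluster Autoscaler runs in kube-system with --expander=priority and that node group names contain "spot" or "on-demand"; those name patterns are hypothetical and must be adapted to the actual node groups.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Cluster Autoscaler looks for a ConfigMap with this exact name
  # in the namespace where it is running.
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    # Higher number = higher priority. Spot groups are tried first.
    50:
      - .*spot.*
    # Fallback when no Spot group can be scaled up.
    10:
      - .*on-demand.*
```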

Cluster Autoscaler, however, does not guarantee the optimal allocation of On-Demand instances and Spot instances. There are often cases where a cluster has On-Demand instances occupied by Spot-eligible workloads (and cluster cost is unnecessarily high due to such On-Demand nodes). Here is an example that leads to such a situation:

1. A Spot-eligible pod is pending.
2. Cluster Autoscaler attempts to scale up the cluster, but AWS is at peak demand and doesn't have sufficient Spot capacity. Cluster Autoscaler falls back to creating an On-Demand node, and the Spot-eligible pod is scheduled there.
3. Demand later drops and Spot capacity becomes available again, but the Spot-eligible pod continues to run on the On-Demand node. Cluster Autoscaler doesn't scale down the node either.

CloudNatix Spot rebalancing technology monitors the cluster and converts such an On-Demand node to a Spot node. More specifically, it takes the following steps:

1. Examine all On-Demand nodes in a cluster and find nodes that are occupied by Spot-eligible pods.
2. If such an On-Demand node is found, check whether the pods running on the node can be moved to other nodes in the cluster.
3. If so, drain the node and reschedule the pods. Spot-eligible pods are likely to be rescheduled to a Spot node.

To achieve Step 3, CloudNatix keeps a Spot node that is occupied by a low-priority pod (called spot-capacity-checker). When the node is drained, Spot-eligible pods can preempt the spot-capacity-checker pod. This pod is also used to check if AWS has available Spot capacity or not.
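The preemption mechanism can be illustrated with plain Kubernetes objects. The sketch below is not the actual spot-capacity-checker manifest; it only assumes the general pattern of a placeholder pod with a priority lower than the default (0), so that any regular pod can preempt it. All names, the image, and the resource requests are hypothetical.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-low-priority   # hypothetical name
value: -10                         # below the default pod priority of 0
globalDefault: false
description: "Placeholder pods that regular workloads may preempt."
---
apiVersion: v1
kind: Pod
metadata:
  name: spot-placeholder           # hypothetical name
spec:
  priorityClassName: placeholder-low-priority
  # Keep the placeholder on Spot capacity only.
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:                  # size of the reserved headroom (assumed)
          cpu: "1"
          memory: 1Gi
```

When pods evicted from a drained On-Demand node cannot otherwise fit, the scheduler can evict the lower-priority placeholder and place them on the Spot node it was holding.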

Enabling On-Demand-to-Spot Optimization

The Spot optimization feature is not enabled by default. To enable the feature, you need to take the following steps:

1. Upgrade CCOperator to the latest version and specify the Spot optimization parameters in values.yaml.
2. Add cluster-rightscaler-agent to the list of Cluster Controller components that are installed in the cluster.

To upgrade CCOperator, you first need to pull the latest Helm chart (v0.417.0 or later).

helm repo add cloudnatix https://charts.cloudnatix.com
helm repo update
helm pull cloudnatix/ccoperator

Then set the CCOperator version and the Spot optimization parameters in values.yaml:

clusterRightscalerAgent:
  spotOptimization:
    enable: true
    dryRun: true
    spotInstanceType: <instance type (e.g., m5.4xlarge)>
    podDefaultSpotEligibility: true # (or false)

spotInstanceType is the instance type of the Spot instances that the target cluster uses. It is used to retrieve the CPU/memory capacity of Spot nodes. When a cluster uses multiple instance types for Spot nodes, these instance types need to have the same CPU/memory capacity. As long as this condition is satisfied, spotInstanceType can be set to any one of the instance types.
podDefaultSpotEligibility should be set to true if all workloads can run on Spot by default, and to false otherwise.
dryRun, when set to true, makes the agent log drainable nodes in K8s events without actually draining any node. This is useful when the agent is initially deployed to a cluster. The default value is false.
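As a concrete illustration of the parameters above, the following sketch fills in the template for a hypothetical cluster whose Spot node groups use m5.4xlarge and m5a.4xlarge. Both types provide 16 vCPUs and 64 GiB of memory, so either one satisfies the same-capacity condition; the choice of instance types here is an assumption for the example only.

```yaml
clusterRightscalerAgent:
  spotOptimization:
    enable: true
    # Start in dry-run mode: drainable nodes are only logged as K8s events.
    dryRun: true
    # Any one of the cluster's Spot instance types, provided they all share
    # the same CPU/memory capacity (16 vCPU / 64 GiB in this example).
    spotInstanceType: m5.4xlarge
    # Treat every workload as Spot-eligible unless it is pinned elsewhere.
    podDefaultSpotEligibility: true
```

Once the logged events look reasonable, dryRun can be set to false (its default) so that the agent actually drains nodes.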

Then the next step is to add cluster-rightscaler-agent to the list of Cluster Controller components. Here is an example command:

cnatix clusters components create \
  --cluster <cluster-name> \
  --kind cluster-rightscaler-agent \
  --version <agent-version> [--upgradable]

The latest stable version of cluster-rightscaler-agent can be found here. The --upgradable flag is optional; set it when the component should be auto-upgraded.

Enabling Spot-to-Spot Rebalancing

In addition to On-Demand-to-Spot rebalancing, CloudNatix supports rebalancing among Spot nodes. To enable the feature, add the following configuration to the CCOperator configmap:

clusterRightscalerAgent:
  spotOptimization:
    enable: true
    enableSpotToSpotRebalancing: true
    candidateSpotInstanceTypes:
    - <instance type (e.g., m5.4xlarge)>
    - ...
    dryRun: true

candidateSpotInstanceTypes is the list of instance types that are allowed to be created in the cluster. The agent continuously scans the current state of the cluster and replaces an expensive Spot instance with a cheaper one.
dryRun, when set to true, makes the agent log drainable nodes in K8s events without actually draining any node. This is useful when the agent is initially deployed to a cluster. The default value is false.

Then the next step is to add cluster-rightscaler-agent to the list of Cluster Controller components. Here is an example command:

cnatix clusters components create \
  --cluster <cluster-name> \
  --kind cluster-rightscaler-agent \
  --version <agent-version> [--upgradable]

Limitations

Currently the feature is available only on AWS.

A related feature CloudNatix offers is the Savings Plan Aware Rebalancer (SPAR), which incorporates AWS Savings Plan information to utilize Spot instances in accordance with the savings plan parameters.
