
User Guide

Enabling AI Ops

Overview

You can enable AI Ops with CloudNatix. The provided features include:

  • LLM hosting with an OpenAI-compatible API
  • Development environment with Jupyter Notebooks
  • GPU federation across multiple GPU clusters

For example, you can host LLM models in your Kubernetes clusters and run chat completions or fine-tuning with those models. You can also submit training jobs to a global K8s cluster hosted by CloudNatix, which schedules jobs across your GPU clusters.

Enabling the AI Ops Feature

AI Ops features are enabled by installing LLMariner.

Prerequisites

  • The cnatix CLI version v0.841.0 or later
  • The llma CLI version v1.25.0 or later (installation procedure)
  • S3-compatible object store

LLMariner stores models (including fine-tuned models) in an S3-compatible object store. If you're using an EKS cluster, you need an S3 bucket and a corresponding IAM role that allows the LLMariner service account to access the bucket.

Please see the LLMariner page for more information.

Step 1. Obtain a cluster registration key

Run the following commands. They output a cluster registration key.

# Set the API base URL to https://api.llm.cloudnatix.com/v1
llma auth login
llma admin clusters register <cluster-name>

You will use the cluster registration key later when installing LLMariner.

If you prefer a GUI, you can also visit https://app.llm.cloudnatix.com and register a new cluster there.
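If you script the registration, you may want to capture the key from the command output. A minimal Python sketch, assuming (hypothetically) that the CLI prints a line of the form `Registration key: <key>` — check your CLI version's actual output and adjust the pattern:

```python
import re

def extract_registration_key(output: str) -> str:
    """Extract the cluster registration key from `llma admin clusters register` output.

    The 'Registration key:' line format is an assumption for illustration;
    the real CLI output may differ.
    """
    match = re.search(r"[Rr]egistration [Kk]ey:\s*(\S+)", output)
    if match is None:
        raise ValueError("registration key not found in CLI output")
    return match.group(1)

# Example with hypothetical CLI output:
sample = "Cluster my-cluster registered.\nRegistration key: clusterkey-abc123\n"
print(extract_registration_key(sample))  # clusterkey-abc123
```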

Step 2. Create a secret for HuggingFace (optional)

If you would like to download models from HuggingFace, create a K8s secret in the cloudnatix namespace.

kubectl create namespace cloudnatix

kubectl create secret generic \
  huggingface-key \
  -n cloudnatix \
  --from-literal=apiKey=${HUGGING_FACE_HUB_TOKEN}

In the next step, we will configure LLMariner to use the secret.
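Equivalently, the secret can be expressed declaratively. The sketch below builds the Kubernetes Secret manifest produced by the `kubectl create secret` command above, using `stringData` so Kubernetes handles the base64 encoding:

```python
def huggingface_secret_manifest(token: str) -> dict:
    """Build a Kubernetes Secret manifest equivalent to the kubectl command above.

    `stringData` lets the API server base64-encode the value for you.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "huggingface-key", "namespace": "cloudnatix"},
        "type": "Opaque",
        "stringData": {"apiKey": token},
    }

manifest = huggingface_secret_manifest("hf_example_token")
print(manifest["metadata"]["name"])  # huggingface-key
```

Serialize this dict to YAML and `kubectl apply -f` it if you prefer manifests over imperative commands.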

Step 3. Create a values.yaml file for LLMariner

Create a values.yaml file used by the LLMariner Helm chart.

You can visit ArtifactHub to review the schema of values.yaml (click "DEFAULT VALUES" or "VALUES SCHEMA").

Here is an example:

global:
  objectStore:
    s3:
      bucket: cloudnatix-aiops
      endpointUrl: ""
      region: us-west-2

  # This is required only when an S3 bucket is accessed with the secret key.
  awsSecret:
    name: aws
    accessKeyIdKey: accessKeyId
    secretAccessKeyKey: secretAccessKey

inference-manager-engine:
  replicaCount: 2
  runtime:
    runtimeImages:
      ollama: mirror.gcr.io/ollama/ollama:0.3.6
      vllm: public.ecr.aws/cloudnatix/llm-operator/vllm-openai:20250115
      triton: nvcr.io/nvidia/tritonserver:24.09-trtllm-python-py3
  model:
    default:
      runtimeName: vllm
    overrides:
      NikolayKozloff/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M-GGUF:
        preloaded: true
        resources:
          limits:
            nvidia.com/gpu: 1
        vllmExtraFlags:
        - --tokenizer
        - deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
      lmstudio-community/phi-4-GGUF/phi-4-Q4_K_M.gguf:
        preloaded: true
        resources:
          limits:
            nvidia.com/gpu: 1
        vllmExtraFlags:
        - --tokenizer
        - microsoft/phi-4

model-manager-loader:
  baseModels:
  - lmstudio-community/phi-4-GGUF/phi-4-Q4_K_M.gguf
  - NikolayKozloff/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M-GGUF
  downloader:
    kind: huggingFace
    huggingFace:
      cacheDir: /tmp/.cache/huggingface/hub
  huggingFaceSecret:
    name: huggingface-key
    apiKeyKey: apiKey
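One easy mistake in this file is listing a model under `model-manager-loader.baseModels` without a matching entry under `inference-manager-engine.model.overrides`, or vice versa. Whether every override strictly requires a `baseModels` entry depends on your setup; the sketch below simply flags mismatches. The dict mirrors the relevant slice of the example above — in practice you would load the real values.yaml with a YAML parser:

```python
# Relevant slice of the example values.yaml, mirrored as a Python dict.
values = {
    "inference-manager-engine": {
        "model": {
            "overrides": {
                "NikolayKozloff/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M-GGUF": {"preloaded": True},
                "lmstudio-community/phi-4-GGUF/phi-4-Q4_K_M.gguf": {"preloaded": True},
            }
        }
    },
    "model-manager-loader": {
        "baseModels": [
            "lmstudio-community/phi-4-GGUF/phi-4-Q4_K_M.gguf",
            "NikolayKozloff/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M-GGUF",
        ]
    },
}

overrides = set(values["inference-manager-engine"]["model"]["overrides"])
base_models = set(values["model-manager-loader"]["baseModels"])

# Flag preloaded overrides that have no corresponding base model to download.
missing = overrides - base_models
assert not missing, f"overrides without a baseModels entry: {missing}"
print("values.yaml model lists are consistent")
```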

Step 4. Install

Run cnatix clusters configure. You will be asked whether you would like to install LLMariner, and then prompted for your cluster registration key and the location of the values.yaml file.

Once the command completes, you can follow the regular CloudNatix installation procedure.

Step 5. Test

Once the installation completes, check the health status of the registered cluster.

llma admin clusters list

You can also see a list of hosted models by typing:

llma models list

Once a model is loaded, you can ask the model a question:

llma chat completions create \
  --model <model-name> \
  --role user \
  --completion "What is k8s?"

Note on Fine-tuning Jobs

LLMariner provides a file upload API, but it is not supported when the LLMariner control plane is hosted by CloudNatix, because the CloudNatix Global Cluster Controller does not store customers' training data in its own storage.

You can still use llma storage files create-link to create file objects without actually uploading the files.

Please visit the LLMariner page for more information.
