Mithril Compute Clusters
Hosted Kubernetes for your Mithril GPU instances. We run the control plane. You bring the nodes and your kubectl.
TL;DR. Create a cluster in the console, attach a reservation or spot bid in the same region, run ml k8s update-kubeconfig locally, and kubectl apply your workload. Under five minutes of hands-on time end-to-end.
What you get
A long-lived Kubernetes control plane that Mithril runs and maintains. You spin up the worker nodes (they're just your reserved or spot GPU instances), and they join the cluster automatically when they boot. Your local kubectl talks to the control plane the same way it would talk to any other Kubernetes cluster.
The architecture is three pieces:
Control plane (API server, scheduler, etcd): run by Mithril. Long-lived. You don't need to touch it.
Nodes (GPU instances): run by you. Created via reservation or spot bid; join the cluster on boot.
kubectl: run by you. Standard Kubernetes from your laptop.
Hosted control plane instances come pre-loaded with Cilium CNI and the NVIDIA GPU Operator, so GPU workloads are schedulable out of the box. You're free to modify the cluster configuration; this isn't a fully managed offering, and Mithril doesn't push updates beyond node lifecycle management.
Quickstart
Prerequisites
A Mithril project with billing configured
The Mithril CLI installed locally (installation guide)
kubectl installed locally
An SSH key registered to your Mithril account before you create the cluster (see SSH keys and the CLI below)
Step 1: Create the cluster
In the Mithril console:
Navigate to Clusters > Create cluster
Pick a region. Your worker nodes must be in the same region as the control plane or they will not join.
Name the cluster. Click Create.
Provisioning can take up to 10 minutes. When the cluster appears in the list with status Available, note the Cluster host IP. That's the public IP your kubectl will talk to.
Step 2: Attach a worker node
You have two paths that both end up looking the same from inside the Kubernetes cluster:
When creating a reservation, select your cluster in the Kubernetes cluster dropdown on the order form. The instance comes up already joined.
Use this for production workloads where you need guaranteed access.
When creating a spot bid, select your cluster in the Kubernetes cluster dropdown. The node joins when the bid wins and leaves when it's preempted.
Use this for experiments, batch jobs, or development.
The cluster's region must match the instance's region, so double-check before you submit the order.
Step 3: Point kubectl at the cluster
From your laptop:
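As a rough sketch, the two CLI calls this step relies on can be run back to back. Only `ml setup` and `ml k8s update-kubeconfig` come from this guide; the `connect_cluster` wrapper is a hypothetical name used for illustration, and the sketch assumes neither command needs extra arguments (check the CLI's own help if yours does).

```shell
# Hypothetical wrapper around the two CLI calls named in this guide.
# Assumption: both commands run without extra arguments.
connect_cluster() {
  # Authenticate the CLI to your Mithril account; stop if that fails.
  ml setup || return 1
  # Fetch the cluster kubeconfig and merge it into ~/.kube/config.
  ml k8s update-kubeconfig
}
```

Paste the function into your shell, then run `connect_cluster`.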
ml setup authenticates the CLI to your Mithril account. ml k8s update-kubeconfig fetches the cluster's kubeconfig and merges it into your local ~/.kube/config under a new context. Your other kubectl contexts keep working.
Verify:
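A minimal check could look like the sketch below. It uses only standard kubectl subcommands. The name of the context that gets merged in isn't stated here, so read it off the `get-contexts` output. `verify_cluster` is a hypothetical helper name.

```shell
# Hypothetical helper: confirm kubectl can see the cluster and its nodes.
verify_cluster() {
  # List all contexts; the current one is marked with an asterisk.
  kubectl config get-contexts || return 1
  # Confirm the API server answers and show which nodes have joined.
  kubectl get nodes -o wide
}
```

If a different context is current, switch with `kubectl config use-context <context-name>` before calling `verify_cluster`. An attached worker should be listed with status Ready once it has booted.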
Step 4: Run a GPU pod
Create hello-gpu.yaml:
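A minimal manifest for this step might look like the following sketch, written to disk with a heredoc. The pod name `hello-gpu` and the CUDA image tag are assumptions; any image that ships `nvidia-smi` should do. `nvidia.com/gpu` is the resource name the NVIDIA GPU Operator's device plugin advertises.

```shell
# Write a single-pod manifest that requests one GPU and runs nvidia-smi once.
cat > hello-gpu.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hello-gpu                 # assumed pod name
spec:
  restartPolicy: Never            # run once, then stay in Completed
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1       # schedule onto a node with a free GPU
EOF
```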
Apply it:
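One way to apply the manifest and read the result, sketched with standard kubectl commands. It assumes the pod is named `hello-gpu`, as in the file name above; `run_hello_gpu` is a hypothetical helper name. The `kubectl wait --for=jsonpath=...` form needs kubectl 1.23 or newer.

```shell
# Hypothetical helper: submit the pod, wait for it to finish, print its output.
run_hello_gpu() {
  kubectl apply -f hello-gpu.yaml || return 1
  # Block until the pod reports phase Succeeded, or give up after 5 minutes.
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/hello-gpu --timeout=300s || return 1
  # The container's stdout is the nvidia-smi table.
  kubectl logs hello-gpu
}
```

If the wait times out, `kubectl describe pod hello-gpu` usually shows why the pod is still pending, for example no node with a free GPU.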
You should see the nvidia-smi table. That's the full loop of adding a worker node and running a GPU workload on it.
Persistent storage
Mithril file shares and block volumes are attachable to nodes in your cluster. Select them on the order form (reservation or spot bid) and the volumes are mounted on every node in that order.
File shares (mithril-file-share): multi-instance, read/write across nodes. Used for shared datasets and checkpoints.
Block volumes: single-instance, persistent disk. Used for per-node scratch space.
For the full decision tree (object store vs. file share vs. block, when to use which), see Persistent storage overview.
At high pod-creation rates, the storage CSI can hit a too many concurrent requests error. It's transient, so stagger pod creation if you're scheduling dozens of pods with persistent volumes at once.
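Staggering can be as simple as pausing between submissions. The sketch below assumes one manifest per file in a directory; `apply_staggered` is a hypothetical helper, and the 2-second default delay is an arbitrary starting point, not a documented limit.

```shell
# Hypothetical helper: apply every manifest in a directory, pausing in between.
# Usage: apply_staggered <dir> [delay-seconds]
apply_staggered() {
  dir="$1"
  delay="${2:-2}"                    # assumed default; tune for your workload
  for manifest in "$dir"/*.yaml; do
    [ -e "$manifest" ] || continue   # skip when the glob matches nothing
    kubectl apply -f "$manifest" || return 1
    sleep "$delay"
  done
}
```

For example, `apply_staggered ./jobs 5` submits each file in `./jobs` five seconds apart.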
To make persistent storage volumes accessible inside the Kubernetes cluster, mount them into pods with a hostPath volume that points at the volume's mount path on the node.
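A hostPath mount could be sketched as follows. The node-side path is a placeholder: this page doesn't state where volumes are mounted, so `/mnt/mithril-file-share` is an assumption you should replace with the real mount path on your nodes. The pod name, file name, and busybox image are also illustrative.

```shell
# HOST_PATH is a placeholder: set it to where the volume is mounted on the node.
HOST_PATH="${HOST_PATH:-/mnt/mithril-file-share}"

# Write a pod manifest that lists the mounted volume from inside a container.
cat > hostpath-pod.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: storage-demo
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox:1.36
      command: ["sh", "-c", "ls -la /data"]
      volumeMounts:
        - name: shared-data
          mountPath: /data          # path inside the container
  volumes:
    - name: shared-data
      hostPath:
        path: ${HOST_PATH}          # path on the node (assumed)
        type: Directory             # fail fast if the path is missing
EOF
```

Apply it with `kubectl apply -f hostpath-pod.yaml`. Because hostPath reads whatever is on the node the pod lands on, this only behaves the same everywhere if every node in the cluster has the volume mounted at that path.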
SSH keys and the CLI
This is the most common source of friction. ml k8s update-kubeconfig is not a pure API call: under the hood it SSHes into the control plane host as the ubuntu user to fetch the kubeconfig. If your SSH key isn't authorized on that host, the command fails with Permission denied (publickey).
Three things to verify if you hit this:
The key is registered to your Mithril account. Console > SSH keys > confirm the public key matches the one you have locally.
The key was registered before the cluster was created. Adding a key to your account after the fact doesn't retroactively authorize it on existing clusters. If you registered the key after creating the cluster, recreate the cluster or contact support.
Your local SSH agent is offering the key. Run ssh-add -l. If your key isn't listed, run ssh-add ~/.ssh/your-key to load it. On macOS, the agent can drop keys between sessions; consider adding an entry for the key to ~/.ssh/config.
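For the macOS case, an entry along these lines is a common way to have OpenSSH reload the key automatically. This is a sketch, not Mithril-specific configuration: `~/.ssh/your-key` is the same placeholder used above, and `Host *` applies the settings to every host, which you may want to narrow. `IgnoreUnknown UseKeychain` keeps the file usable on non-macOS OpenSSH builds that don't know the `UseKeychain` option.

```shell
SSH_CONFIG="${SSH_CONFIG:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$SSH_CONFIG")"

# Append an entry; replace ~/.ssh/your-key with the path to your private key.
cat >> "$SSH_CONFIG" <<'EOF'
Host *
  IgnoreUnknown UseKeychain
  AddKeysToAgent yes
  UseKeychain yes
  IdentityFile ~/.ssh/your-key
EOF

# OpenSSH rejects config files that other users can write to.
chmod 600 "$SSH_CONFIG"
```

Afterwards, `ssh-add -l` should list the key once you've made one SSH connection with it.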
If ml k8s update-kubeconfig succeeds with the SSH step but then fails with Failed to parse local kubeconfig, your local ~/.kube/config is empty, missing, or malformed. Create a minimal stub:
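A minimal stub, assuming the default kubeconfig location, might be written like this. The fields are the standard empty kubeconfig skeleton. Note that the sketch overwrites the file, so it first keeps a `.bak` copy of any non-empty existing config.

```shell
KUBECONFIG_FILE="${KUBECONFIG_FILE:-$HOME/.kube/config}"
mkdir -p "$(dirname "$KUBECONFIG_FILE")"

# Keep a copy of what is there, in case it is malformed rather than empty.
if [ -s "$KUBECONFIG_FILE" ]; then
  cp "$KUBECONFIG_FILE" "$KUBECONFIG_FILE.bak"
fi

# Write an empty but well-formed kubeconfig.
cat > "$KUBECONFIG_FILE" <<'EOF'
apiVersion: v1
kind: Config
clusters: []
contexts: []
users: []
current-context: ""
preferences: {}
EOF
```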
Then re-run ml k8s update-kubeconfig.
Node lifecycle
Mithril manages the join/drain/uncordon lifecycle automatically. You should not run kubeadm on a node; it will break the cluster.
Node joins (boot): automatically joins the cluster and registers with the control plane.
Spot preemption: sends kubectl drain with a 5-minute grace period, then powers off the instance.
Reallocation (spot bid wins again): sends kubectl uncordon so the scheduler resumes placing pods.
Bid termination (instance permanently deleted): removes the node from the cluster.
FAQ
Can I access the cluster from outside Mithril?
Yes. The control plane has a public IP with SSH (port 22) and the Kubernetes API (port 6443) open by default.
Does MCC support autoscaling?
You can configure autoscaling for deployments inside Kubernetes (HPA, etc.) and the standard Kubernetes mechanisms work. There's no officially supported mechanism for autoscaling the underlying cluster size today. Add nodes by placing additional reservations or spot bids with the cluster selected.
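As an illustration of the in-cluster side, a standard `autoscaling/v2` HorizontalPodAutoscaler could look like the sketch below. The Deployment name `my-app` and the replica and CPU numbers are hypothetical. It also assumes a metrics source such as metrics-server is running in the cluster, which this page doesn't confirm.

```shell
# Write an HPA that scales a hypothetical Deployment on average CPU utilization.
cat > hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                  # hypothetical Deployment to scale
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add replicas above 70% average CPU
EOF
```

Apply it with `kubectl apply -f hpa.yaml`. New replicas only get scheduled if the attached nodes have capacity, since the HPA does not add nodes.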
Is this a fully managed Kubernetes service?
No. Mithril manages the control plane lifecycle and node join/drain/remove operations. It does not push updates to cluster configuration, manage RBAC, or handle workload-level concerns. Treat MCC as "managed control plane, self-served everything else."
Can I use ml launch against my cluster?
Yes. In your task YAML, set infra: kubernetes and your task runs against the cluster. See Task YAML > Infra for the full reference.
My node never joined the cluster. What's going on?
Most likely cause: the cluster's region doesn't match the instance's region. Less common: a custom startup script that interferes with the join process, or a network policy on your side blocking outbound to the control plane.
Check the instance logs and contact support if neither applies.