# Mithril Compute Clusters

Mithril Compute Clusters (MCC) provide hosted Kubernetes for your Mithril GPU instances. We run the control plane. You bring the nodes and your `kubectl`.

{% hint style="info" %}
**TL;DR.** Create a cluster in the console, attach a reservation or spot bid in the same region, run `ml k8s update-kubeconfig` locally, and `kubectl apply` your workload. Under five minutes of hands-on time end-to-end.
{% endhint %}

## What you get

A long-lived Kubernetes control plane that Mithril runs and maintains. You spin up the worker nodes (they're just your reserved or spot GPU instances) and they join the cluster automatically when they boot. Your local `kubectl` talks to the control plane the same way it would talk to any other Kubernetes cluster.

The architecture is three pieces:

| Component                                       | Who manages it                                                      |
| ----------------------------------------------- | ------------------------------------------------------------------- |
| **Control plane** (API server, scheduler, etcd) | Mithril. Long-lived. You don't need to touch it.                    |
| **Nodes** (GPU instances)                       | You. Created via reservation or spot bid; join the cluster on boot. |
| `kubectl`                                       | You. Standard Kubernetes from your laptop.                          |

Hosted control plane instances come pre-loaded with **Cilium CNI** and the **NVIDIA GPU Operator**, so GPU workloads are schedulable out of the box. You're free to modify the cluster configuration; this isn't a fully managed offering, and Mithril doesn't push updates beyond node lifecycle management.

## Quickstart

### Prerequisites

* A Mithril project with billing configured
* The Mithril CLI installed locally ([installation guide](/mithril-cli/installation.md))
* `kubectl` installed locally
* An SSH key registered to your Mithril account **before** you create the cluster (see [SSH keys and the CLI](#ssh-keys-and-the-cli) below)

### Step 1: Create the cluster

In the Mithril console:

1. Navigate to **Clusters > Create cluster**
2. Pick a region. Your worker nodes must be in the same region as the control plane or they will not join.
3. Name the cluster. Click **Create**.

Provisioning can take up to 10 minutes. When the cluster appears in the list with status `Available`, note the **Cluster host** IP. That's the public IP your `kubectl` will talk to.

### Step 2: Attach a worker node

There are two ways to attach a node. Both look the same from inside the Kubernetes cluster:

{% tabs %}
{% tab title="Reservation" %}
When creating a reservation, select your cluster in the **Kubernetes cluster** dropdown on the order form. The instance comes up already joined.

Use this for production workloads where you need guaranteed access.
{% endtab %}

{% tab title="Spot" %}
When creating a spot bid, select your cluster in the **Kubernetes cluster** dropdown. The node joins when the bid wins and leaves when it's preempted.

Use this for experiments, batch jobs, or development.
{% endtab %}
{% endtabs %}

{% hint style="warning" %}
The cluster's region must match the instance's region, so double-check before you submit the order.
{% endhint %}

### Step 3: Point `kubectl` at the cluster

From your laptop:

```bash
ml setup
ml k8s update-kubeconfig
```

`ml setup` authenticates the CLI to your Mithril account. `ml k8s update-kubeconfig` fetches the cluster's kubeconfig and merges it into your local `~/.kube/config` under a new context. Your other `kubectl` contexts keep working.

Verify:

```bash
kubectl config current-context  # should show your Mithril cluster
kubectl get nodes               # should list your attached node(s) as Ready
```
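A freshly booted node can take a moment to report `Ready`, so a one-off `kubectl get nodes` may show it as `NotReady` at first. The sketch below wraps two standard `kubectl` calls in helper functions: one blocks until every node is `Ready`, the other prints the GPU count each node advertises. The function names and the overridable `KUBECTL` variable are assumptions made for illustration, not part of the Mithril CLI.

```bash
#!/usr/bin/env bash
# KUBECTL can be overridden for a dry run; it defaults to the real binary.
KUBECTL="${KUBECTL:-kubectl}"

# Block until every node in the current context reports Ready.
# The timeout defaults to 300s; pass another value as the first argument.
wait_for_nodes() {
  local timeout="${1:-300s}"
  "$KUBECTL" wait --for=condition=Ready nodes --all --timeout="$timeout"
}

# Print each node with the number of GPUs it advertises to the scheduler.
# nvidia.com/gpu is the same resource name the hello-gpu manifest requests.
show_gpu_capacity() {
  "$KUBECTL" get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

After sourcing the file, run `wait_for_nodes` followed by `show_gpu_capacity`. A node that shows `<none>` in the `GPUS` column is not yet advertising GPUs to the scheduler.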

### Step 4: Run a GPU pod

Create `hello-gpu.yaml`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-gpu
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never
```

Apply it:

```bash
kubectl apply -f hello-gpu.yaml
kubectl logs hello-gpu
```

You should see the `nvidia-smi` table. That's the full loop of adding a worker node and running a GPU workload on it.
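Running `kubectl logs` immediately after `kubectl apply` can fail while the image is still being pulled. As a minimal sketch, the helper below waits for the pod to reach the `Succeeded` phase before reading its logs. It assumes kubectl 1.23 or newer (for `--for=jsonpath`); the function name, the 120-second timeout, and the overridable `KUBECTL` variable are illustrative choices.

```bash
#!/usr/bin/env bash
# KUBECTL can be overridden for a dry run; it defaults to the real binary.
KUBECTL="${KUBECTL:-kubectl}"

# Apply the manifest, wait for the hello-gpu pod to finish, then print its logs.
run_hello_gpu() {
  local manifest="${1:-hello-gpu.yaml}"
  "$KUBECTL" apply -f "$manifest" || return 1
  # Succeeded means the container (nvidia-smi) exited with status 0.
  "$KUBECTL" wait --for=jsonpath='{.status.phase}'=Succeeded \
    pod/hello-gpu --timeout=120s || return 1
  "$KUBECTL" logs hello-gpu
}
```

If the wait times out, `kubectl describe pod hello-gpu` usually shows whether the pod is still pulling the image or could not be scheduled onto a GPU node.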

## Persistent storage

Mithril file shares and block volumes are attachable to nodes in your cluster. Select them on the order form (reservation or spot bid) and the volumes are mounted on every node in that order.

* **File shares** (`mithril-file-share`): multi-instance, read/write across nodes. Use them for shared datasets and checkpoints.
* **Block volumes**: single-instance, persistent disk. Use them for per-node scratch space.

For the full decision tree (object store vs. file share vs. block, when to use which), see [Persistent storage overview](/compute-and-storage/persistent-storage.md).

{% hint style="info" %}
At high pod-creation rates, the storage CSI can hit a `too many concurrent requests` error. It's transient, so stagger pod creation if you're scheduling dozens of pods with persistent volumes at once.
{% endhint %}
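One way to stagger pod creation is to apply manifests one at a time with a pause between them. The helper below is a sketch under that assumption; the function name, the delay value, and the overridable `KUBECTL` variable are hypothetical, and the right delay for your workload has to be found by trial.

```bash
#!/usr/bin/env bash
# KUBECTL can be overridden for a dry run; it defaults to the real binary.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: apply_staggered <delay-seconds> <manifest>...
# Applies each manifest in order, sleeping <delay-seconds> after each one.
# Stops at the first manifest that fails to apply.
apply_staggered() {
  local delay="$1"
  shift
  local manifest
  for manifest in "$@"; do
    "$KUBECTL" apply -f "$manifest" || return 1
    sleep "$delay"
  done
}
```

For example, `apply_staggered 5 pods/*.yaml` applies every manifest in a `pods/` directory with a five-second gap.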

To make persistent storage volumes accessible within the Kubernetes cluster, mount them into pods with `hostPath`, e.g.:

```yaml
volumes:
  - name: data-volume
    hostPath:
      path: /mnt/your-fileshare-name
      type: Directory
```
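The fragment above only declares the volume; a pod also needs a matching `volumeMounts` entry to see it. The sketch below writes a complete throwaway pod manifest that lists the contents of a file share. The pod name, the `busybox:1.36` image, the `/data` mount path, and the function name are all illustrative assumptions; only the `/mnt/<fileshare-name>` host path comes from the fragment above.

```bash
#!/usr/bin/env bash
# Usage: write_hostpath_pod <fileshare-name> [output-file]
# Writes a pod manifest that mounts /mnt/<fileshare-name> from the node at /data.
write_hostpath_pod() {
  local share="$1"
  local out="${2:-hostpath-demo.yaml}"
  cat > "$out" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-demo
spec:
  containers:
  - name: shell
    image: busybox:1.36
    command: ["ls", "-la", "/data"]
    volumeMounts:
    - name: data-volume      # must match the volume name below
      mountPath: /data
  volumes:
  - name: data-volume
    hostPath:
      path: /mnt/${share}
      type: Directory        # pod fails to start if the directory is missing
  restartPolicy: Never
EOF
}
```

Run `write_hostpath_pod your-fileshare-name`, apply the resulting `hostpath-demo.yaml`, and read the pod logs to confirm the share is visible from inside a container.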

## SSH keys and the CLI

This is the most common source of friction. `ml k8s update-kubeconfig` is not a pure API call: under the hood it SSHes into the control plane host as the `ubuntu` user to fetch the kubeconfig. If your SSH key isn't authorized on that host, the command fails with `Permission denied (publickey)`.

Three things to verify if you hit this:

1. **The key is registered to your Mithril account.** Console > SSH keys > confirm the public key matches the one you have locally.
2. **The key was registered *before* the cluster was created.** Adding a key to your account after the fact doesn't retroactively authorize it on existing clusters. If you registered the key after creating the cluster, recreate the cluster or contact support.
3. **Your local SSH agent is offering the key.** Run `ssh-add -l`. If your key isn't listed, run `ssh-add ~/.ssh/your-key` to load it. On macOS, the agent can drop keys between sessions; consider adding the following to `~/.ssh/config`:

```
Host *
   AddKeysToAgent yes
   UseKeychain yes
```
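Because the CLI connects over SSH as the `ubuntu` user, you can reproduce the same connection by hand to separate key problems from CLI problems. The helper below is a sketch: the function name and the overridable `SSH` variable are assumptions, and the host argument is the **Cluster host** IP shown in the console.

```bash
#!/usr/bin/env bash
# SSH can be overridden for a dry run; it defaults to the real ssh client.
SSH="${SSH:-ssh}"

# Usage: check_cluster_ssh <cluster-host-ip>
# Exits 0 if key-based login as ubuntu works, non-zero otherwise.
check_cluster_ssh() {
  local host="$1"
  # BatchMode=yes disables interactive prompts, so a missing or
  # unauthorized key fails immediately instead of asking for a password.
  "$SSH" -o BatchMode=yes -o ConnectTimeout=10 "ubuntu@${host}" true
}
```

If `check_cluster_ssh <cluster-host-ip>` prints `Permission denied (publickey)`, work through the three checks above. If it succeeds silently, the key is authorized and the problem lies elsewhere.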

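If `ml k8s update-kubeconfig` gets past the SSH step but then fails with `Failed to parse local kubeconfig`, your local `~/.kube/config` is empty, missing, or malformed. Create a minimal stub. Note that this overwrites the file, so copy a malformed config elsewhere first if it contains contexts you want to recover: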

```bash
mkdir -p ~/.kube
cat > ~/.kube/config <<'EOF'
apiVersion: v1
kind: Config
clusters: []
contexts: []
users: []
EOF
```

Then re-run `ml k8s update-kubeconfig`.
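If the existing file might hold contexts you care about, a slightly more defensive variant keeps a timestamped copy before writing the stub. This is a sketch; the function name and the `.bak.<timestamp>` suffix are illustrative choices.

```bash
#!/usr/bin/env bash
# Usage: ensure_kubeconfig_stub [path]   (defaults to ~/.kube/config)
# Backs up a non-empty kubeconfig, then replaces it with a minimal valid stub.
ensure_kubeconfig_stub() {
  local cfg="${1:-$HOME/.kube/config}"
  mkdir -p "$(dirname "$cfg")"
  if [ -s "$cfg" ]; then
    # File exists and is non-empty: keep a copy before overwriting it.
    cp "$cfg" "${cfg}.bak.$(date +%Y%m%d%H%M%S)"
  fi
  cat > "$cfg" <<'EOF'
apiVersion: v1
kind: Config
clusters: []
contexts: []
users: []
EOF
}
```

Calling `ensure_kubeconfig_stub` with no arguments operates on `~/.kube/config`; the backup lands next to it in `~/.kube/`.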

## Node lifecycle

Mithril manages the join/drain/uncordon lifecycle automatically. Do not run `kubeadm` on a node yourself; doing so will break the cluster.

| Event                                              | What Mithril does                                                                |
| -------------------------------------------------- | -------------------------------------------------------------------------------- |
| **Node joins** (boot)                              | Automatically joins the cluster, registers with the control plane                |
| **Spot preemption**                                | Sends `kubectl drain` with a 5-minute grace period, then powers off the instance |
| **Reallocation** (spot bid wins again)             | Sends `kubectl uncordon` so the scheduler resumes placing pods                   |
| **Bid termination** (instance permanently deleted) | Removes the node from the cluster                                                |
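When a node is drained, Kubernetes evicts its pods, and each container receives `SIGTERM` before it is killed. A workload on spot nodes can use that window to save state. The sketch below shows the standard shell pattern for a container entrypoint; `save_checkpoint` is a hypothetical placeholder for your own logic, and how much time the process actually gets depends on the drain grace period and the pod's `terminationGracePeriodSeconds`.

```bash
#!/usr/bin/env bash
# Hypothetical hook: replace the body with your own checkpoint logic.
save_checkpoint() {
  echo "checkpoint saved"
}

# Run a long-lived loop that saves state and exits cleanly on SIGTERM.
run_job() {
  trap 'save_checkpoint; exit 0' TERM
  while :; do
    # Stand-in for one unit of work. Sleeping in the background and
    # waiting on it lets bash run the trap as soon as the signal arrives,
    # rather than after the foreground command finishes.
    sleep 1 &
    wait $!
  done
}
```

Calling `run_job` as the last line of an entrypoint script keeps the container alive until it is signalled. If the script launches a real training process instead of the loop, that process needs its own `SIGTERM` handling or the signal has to be forwarded to it.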

## FAQ

<details>

<summary>Can I access the cluster from outside Mithril?</summary>

Yes. The control plane has a public IP with SSH (port 22) and the Kubernetes API (port 6443) open by default.

</details>

<details>

<summary>Does MCC support autoscaling?</summary>

Workload autoscaling inside Kubernetes (HPA, etc.) works through the standard Kubernetes mechanisms. There's no officially supported mechanism for autoscaling the underlying cluster size today. Add nodes by placing additional reservations or spot bids with the cluster selected.

</details>

<details>

<summary>Is this a fully managed Kubernetes service?</summary>

No. Mithril manages the control plane lifecycle and node join/drain/remove operations. It does not push updates to cluster configuration, manage RBAC, or handle workload-level concerns. Treat MCC as "managed control plane, self-served everything else."

</details>

<details>

<summary>Can I use <code>ml launch</code> against my cluster?</summary>

Yes. In your task YAML, set `infra: kubernetes` and your task runs against the cluster. See [Task YAML > Infra](/mithril-cli/task-yaml/infra.md) for the full reference.

</details>

<details>

<summary>My node never joined the cluster. What's going on?</summary>

Most likely cause: the cluster's region doesn't match the instance's region. Less common: a custom startup script that interferes with the join process, or a network policy on your side blocking outbound traffic to the control plane.

Check the instance logs and contact support if neither applies.

</details>

