# Data & storage

Three ways to get data onto your cluster and persist results.

| Mechanism          | YAML field    | What it does                       | Lifecycle                     |
| ------------------ | ------------- | ---------------------------------- | ----------------------------- |
| Cloud buckets      | `file_mounts` | Sync data to a path on the cluster | Data lives in your bucket     |
| Persistent volumes | `volumes`     | Mount Mithril network storage      | Survives instance termination |
| Ephemeral storage  | None          | Fast local scratch at `/mnt/local` | Deleted with cluster          |

Your local code is handled separately by `workdir`; see [Syncing your code](#syncing-your-code).

### Cloud buckets

Use `file_mounts` to make data available at a path on the cluster.

```yaml
file_mounts:
  /data: s3://my-bucket/training-data
  /models: gs://my-bucket/pretrained
```

#### Local files

You can also mount local files and directories. They are uploaded to a temporary cloud bucket behind the scenes and synced to the cluster:

```yaml
file_mounts:
  /remote/path: /local/path/to/data
  /remote/config.yaml: ./config.yaml
```

#### Supported sources

| Source                     | Example                               |
| -------------------------- | ------------------------------------- |
| AWS S3                     | `s3://my-bucket/path`                 |
| Google Cloud Storage (GCS) | `gs://my-bucket/path`                 |
| Cloudflare R2              | `r2://my-bucket`                      |
| CoreWeave Object Storage   | `cw://my-bucket`                      |
| OCI Object Storage         | `oci://my-bucket@region`              |
| Local directory            | `/absolute/path` or `./relative/path` |
| Local file                 | `./config.yaml`                       |

For the latest list of supported providers → [Cloud Buckets (SkyPilot docs)](https://docs.skypilot.co/en/latest/reference/storage.html)

#### Storage modes

Cloud buckets support three access modes:

| Mode              | Reads                     | Writes                                                       | Best for                           |
| ----------------- | ------------------------- | ------------------------------------------------------------ | ---------------------------------- |
| `MOUNT` (default) | Streamed from bucket      | Replicated to bucket and visible to other VMs                | Shared datasets, multi-node access |
| `COPY`            | Pre-fetched to local disk | Local only, not synced back                                  | Fast I/O on data that fits on disk |
| `MOUNT_CACHED`    | Cached locally on access  | Cached locally, uploaded in background before task completes | Checkpoints and large writes       |

```yaml
file_mounts:
  /data:
    source: s3://my-bucket/dataset
    mode: COPY # MOUNT, COPY, or MOUNT_CACHED
```
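Modes are set per mount, so one task can mix them. A minimal sketch based on the table above, assuming hypothetical bucket paths: the dataset is pre-fetched with `COPY` for fast reads, while checkpoints go through `MOUNT_CACHED` so writes land on local disk first and are uploaded in the background.

```yaml
file_mounts:
  # Read-only training data: pre-fetched to local disk.
  # Writes under /data stay local and are not synced back.
  /data:
    source: s3://my-bucket/dataset        # hypothetical bucket path
    mode: COPY

  # Checkpoints: cached locally, uploaded before the task completes.
  /checkpoints:
    source: s3://my-bucket/checkpoints    # hypothetical bucket path
    mode: MOUNT_CACHED
```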

→ [Cloud Buckets (SkyPilot docs)](https://docs.skypilot.co/en/latest/reference/storage.html) — advanced storage options, bucket creation, CLI management, and YAML reference

### Persistent volumes

Use `volumes` to mount Mithril network storage that survives instance termination, preemption, and restarts. Ideal for training checkpoints and datasets you reuse across runs.

#### Create a volume

```bash
ml sky volumes apply \
  --name my-data \
  --infra mithril/us-central5-a \
  --type mithril-file-share \
  --size 100GB
```

#### Use in task YAML

```yaml
resources:
  infra: mithril/us-central5-a  # must match volume region

volumes:
  /data: my-data
  /checkpoints: my-checkpoints
```
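Because a persistent volume outlives the instance, a job can pick up from its last checkpoint after a preemption or restart. A minimal sketch, assuming a volume named `my-checkpoints` already exists in the same region and that `train.py` accepts the hypothetical `--output` and `--resume-from` flags:

```yaml
resources:
  infra: mithril/us-central5-a  # must match the volume's region

volumes:
  /checkpoints: my-checkpoints  # assumed to exist already

run: |
  # Hypothetical flags: write checkpoints to the volume and resume
  # from whatever a previous run left there.
  python train.py \
    --output /checkpoints \
    --resume-from /checkpoints
```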

#### Volume interfaces

| Interface  | `--type`             | Use case                                |
| ---------- | -------------------- | --------------------------------------- |
| File (NFS) | `mithril-file-share` | Shared access across multiple instances |
| Block      | `mithril-block`      | Single instance, high throughput        |

> Not all regions support both interfaces. Check the Mithril console for availability.

#### Manage volumes

```bash
ml sky volumes ls               # list volumes
ml sky volumes delete my-volume # delete a volume
```

#### Region matching

Volume and cluster must be in the same region:

```yaml
resources:
  infra: mithril/us-central5-a # ← must match

volumes:
  /data: my-volume # ← created in us-central5-a
```

### Ephemeral storage

Every Mithril instance comes with NVMe SSD ephemeral storage at no extra cost, automatically mounted at `/mnt/local`. No YAML configuration needed — it's available on every instance by default.

| Event            | Ephemeral storage                  |
| ---------------- | ---------------------------------- |
| VM restart       | Retained                           |
| Preemption       | Wiped (re-mounted on reallocation) |
| Termination      | Wiped                              |
| Host maintenance | Wiped                              |

Use `/mnt/local` for scratch work, caches, and shuffle buffers. Don't store anything you need to keep — use persistent volumes or object storage (via cloud buckets) for that.

→ [Ephemeral Storage](https://docs.mithril.ai/compute-and-storage/ephemeral-storage) — instance storage specs and detailed behavior
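One way to put `/mnt/local` to work is to point caches and temporary files at it. The sketch below assumes the task YAML accepts SkyPilot's `envs` field and that the directory is writable by the task user; `HF_HOME` (Hugging Face cache) and `TMPDIR` are only examples of variables a workload might read.

```yaml
envs:
  HF_HOME: /mnt/local/hf-cache  # example: Hugging Face cache location
  TMPDIR: /mnt/local/tmp        # example: temp files for shuffle buffers

run: |
  # Create the scratch directories; they are wiped on preemption or
  # termination, so nothing here should be needed after the run.
  mkdir -p "$HF_HOME" "$TMPDIR"
  python train.py
```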

### Syncing your code

`workdir` syncs a local directory to `~/sky_workdir/` on the cluster:

```yaml
workdir: .
```

Your `run` commands execute from `~/sky_workdir/`, so relative paths work as expected. The workdir is re-synced on every `ml launch` and `ml exec`.
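As an illustration, with a hypothetical project that contains `train.py` and `configs/base.yaml`, the task below can reference both by relative path:

```yaml
workdir: .

run: |
  pwd  # run commands start in ~/sky_workdir
  # Paths are relative to the synced copy of the local directory.
  python train.py --config configs/base.yaml
```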

### Choosing the right mechanism

| Scenario                                                             | Use                                      |
| -------------------------------------------------------------------- | ---------------------------------------- |
| Training data                                                        | `file_mounts` with bucket URL            |
| Checkpoints you need across runs                                     | `volumes` (persistent)                   |
| Scratch space for shuffling/caching                                  | `/mnt/local` (ephemeral node-local NVMe) |
| Your code and configs                                                | `workdir`                                |
| Workload output (final weights, LoRA adapters, logs, eval artifacts) | Object storage                           |

### Complete example

```yaml
name: training-run

resources:
  infra: mithril/us-central5-a
  accelerators: B200:8

workdir: .

file_mounts:
  /datasets: s3://my-bucket/imagenet

volumes:
  /checkpoints: my-checkpoints-volume

run: |
  python train.py \
    --data /datasets \
    --scratch /mnt/local \
    --output /checkpoints
```

