Axolotl

Setup

This example is based on https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/mistral.

git clone git@github.com:axolotl-ai-cloud/axolotl.git
cd axolotl/examples

Define workload

resources:
  accelerators: B200:8
  infra: mithril

workdir: mistral

setup: |
  docker pull winglian/axolotl:main-py3.10-cu118-2.0.1

run: |
  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    winglian/axolotl:main-py3.10-cu118-2.0.1 \
    huggingface-cli login --token ${HF_TOKEN} 

  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    winglian/axolotl:main-py3.10-cu118-2.0.1 \
    accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml

envs:
  HF_TOKEN: null # Pass with `--secret HF_TOKEN` in CLI
  

Run workload

  • --cluster axolotl: Name the cluster axolotl. If a cluster with this name already exists, it is reused; otherwise a new one is created.

  • --secret HF_TOKEN: Pass HF_TOKEN as a secret environment variable to the remote cluster. Since no =value is provided, the value is read from your local environment ($HF_TOKEN). Secrets behave like --env but are redacted in logs and YAML outputs for security.

  • -i30 or --idle-minutes-to-autostop 30: Automatically stop the cluster after 30 minutes of idleness (no running or pending jobs in the cluster's job queue).

  • --down: Autodown. Instead of just stopping the cluster when the autostop timer fires, tear it down entirely (delete the cloud resources). Combined with -i30, this means that after all jobs finish and the cluster has been idle for 30 minutes, the cluster is destroyed completely.
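
The flags above can be combined into a single launch command. The following is a minimal sketch, not a confirmed invocation: it assumes the task definition from "Define workload" is saved as axolotl.yaml (a hypothetical file name) inside axolotl/examples, and that the workload is launched with SkyPilot's sky launch, which the ~/sky_workdir paths in the task suggest.

```shell
# Assumption: the YAML from "Define workload" is saved as axolotl.yaml
# (hypothetical file name) in axolotl/examples, next to the mistral workdir.
# Replace the placeholder with your own Hugging Face token.
export HF_TOKEN="<your-hf-token>"

# Launch (or reuse) the cluster named axolotl, forward HF_TOKEN as a secret,
# and tear the cluster down after 30 idle minutes.
sky launch axolotl.yaml --cluster axolotl --secret HF_TOKEN -i30 --down
```

Exporting HF_TOKEN first matters because --secret HF_TOKEN carries no =value, so the token is read from the local environment at launch time.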
