Overview
Defining a basic workload
# task.yaml
resources:
  infra: mithril
  accelerators: B200:8

num_nodes: 2

setup: |
  pip install -r requirements.txt

run: |
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --node_rank=$SKYPILOT_NODE_RANK \
    train.py --distributed

Running the workload
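At run time, every node executes the run section of task.yaml with the SKYPILOT_* variables already set. The sketch below illustrates how the rendezvous address and total process count follow from those variables. The IP addresses and counts are made-up stand-ins, not output from a real cluster, and WORLD_SIZE is computed here only for illustration (torchrun derives it on its own).

```shell
#!/bin/sh
# Hypothetical values standing in for what is exported on each node.
# SKYPILOT_NODE_IPS is a newline-separated list with the head node first.
SKYPILOT_NODE_IPS="10.0.0.1
10.0.0.2"
SKYPILOT_NUM_NODES=2
SKYPILOT_NUM_GPUS_PER_NODE=8
SKYPILOT_NODE_RANK=0

# Every node takes the first line of the list, so all nodes agree on one address.
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)

# Total number of worker processes across the job: nodes x GPUs per node.
WORLD_SIZE=$((SKYPILOT_NUM_NODES * SKYPILOT_NUM_GPUS_PER_NODE))

echo "master=$MASTER_ADDR node_rank=$SKYPILOT_NODE_RANK world_size=$WORLD_SIZE"
```

With the task above (2 nodes, 8 GPUs each), the same arithmetic gives 16 processes, and only SKYPILOT_NODE_RANK differs between the nodes.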
Provisioning and scheduling
Features
Built on SkyPilot