Multi-node training
# train.py
import flow
import torch.distributed as dist

from flow.sdk.decorators import FlowApp

app = FlowApp()

@app.function(
    gpu="h100:8",
    num_instances=4,
    distributed_mode="auto"
)
def distributed_training():
    # Flow automatically sets up cluster discovery and exports environment variables:
    # NUM_NODES, GPU_COUNT, HEAD_NODE_IP, NODE_RANK for torchrun, plus the PyTorch
    # distributed variables RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    dist.init_process_group(backend="nccl")
    # Your training code here...

if __name__ == "__main__":
    task = flow.run(distributed_training)
    print(f"Task submitted: {task.id}")

Automatic Coordination
From Single-Node to Multi-Node
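Going by the decorator in the example above, scaling out is largely a matter of the num_instances and distributed_mode parameters. The sketch below reuses the app object defined earlier; treating num_instances=1 as the single-node case is an assumption, not confirmed Flow behavior.

# Single node: one instance with 8 GPUs (assumed single-node form).
@app.function(gpu="h100:8", num_instances=1)
def train_single_node():
    ...

# Multi-node: the same function scaled to 4 instances, with automatic coordination.
@app.function(gpu="h100:8", num_instances=4, distributed_mode="auto")
def train_multi_node():
    ...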
How Flow's Rendezvous Works
Environment Variables Set by Flow
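The comments in the example above note that NUM_NODES, GPU_COUNT, HEAD_NODE_IP, and NODE_RANK are exported for torchrun. A hedged sketch of how they could map onto torchrun's standard flags; train_loop.py is a hypothetical inner training script, and the fallback port is assumed.

import os
import subprocess

def launch_with_torchrun(script: str = "train_loop.py"):
    # Map Flow's exported variables onto torchrun's launch arguments.
    cmd = [
        "torchrun",
        f"--nnodes={os.environ['NUM_NODES']}",
        f"--nproc_per_node={os.environ['GPU_COUNT']}",
        f"--node_rank={os.environ['NODE_RANK']}",
        f"--master_addr={os.environ['HEAD_NODE_IP']}",
        f"--master_port={os.environ.get('MASTER_PORT', '29500')}",  # assumed fallback port
        script,
    ]
    subprocess.run(cmd, check=True)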
Shared Storage
File Storage for Multi-Node Access
Training Script with Coordinated Checkpointing
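A common coordination pattern, shown here as a generic PyTorch sketch rather than anything Flow-specific: rank 0 writes the checkpoint to shared storage while the other ranks wait at a barrier. The mount path is an assumption.

import os

import torch
import torch.distributed as dist

CKPT_DIR = "/mnt/shared/checkpoints"  # assumed shared-storage mount visible to all nodes

def save_checkpoint(model, optimizer, step):
    # Only rank 0 writes; the barrier keeps other ranks from racing ahead
    # (for example, resuming from a half-written file).
    if dist.get_rank() == 0:
        os.makedirs(CKPT_DIR, exist_ok=True)
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            os.path.join(CKPT_DIR, f"step_{step}.pt"),
        )
    dist.barrier()  # all ranks sync before continuing training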
S3 Dataset Integration
Volume Configuration
Debugging Multi-Node Training
Node-Specific SSH Access
Debugging Workflows
Log Aggregation
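One plain-Python way to keep aggregated output readable is to tag every log record with its node and process rank, using the RANK and NODE_RANK variables Flow exports; this is ordinary logging configuration, not a Flow facility.

import logging
import os

def rank_tagged_logger(name="train"):
    # Prefix each record with node and process rank so interleaved multi-node
    # logs can be grouped after collection.
    rank = os.environ.get("RANK", "0")
    node = os.environ.get("NODE_RANK", "0")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter(f"[node {node} | rank {rank}] %(asctime)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger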
Common Debugging Scenarios
Production Configuration
Cost Protection
Performance Optimization