Multi-node training

Traditional multi-node distributed training requires manual coordination—discovering node IPs, assigning ranks, configuring network communication, and setting up shared storage. Flow eliminates this complexity through automatic coordination.

Instead of managing infrastructure, you focus on your training code:

# train.py
import flow
import torch.distributed as dist

from flow.sdk.decorators import FlowApp

app = FlowApp()

@app.function(
    gpu="h100:8", 
    num_instances=4,
    distributed_mode="auto"
)
def distributed_training():
    # Flow automatically sets up cluster discovery and exports environment variables:
    # NUM_NODES, GPU_COUNT, HEAD_NODE_IP, NODE_RANK for torchrun
    # Plus PyTorch distributed variables: RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    dist.init_process_group(backend="nccl")
    # Your training code here...
    
if __name__ == "__main__":
    task = flow.run(distributed_training)
    print(f"Task submitted: {task.id}")

Simply run it with: python train.py

What happens:

  1. Locally: Python executes train.py, which calls flow.run(distributed_training)

  2. Flow submits the decorated function to run on remote infrastructure

  3. Remotely: Mithril provisions 4 nodes and executes distributed_training() on each

Flow handles node discovery, rank assignment, environment setup, and shared storage, and provides debugging tools. Here's how each piece works.


Automatic Coordination

Flow's rendezvous system eliminates manual node coordination. When you specify num_instances > 1, Flow automatically handles node discovery, rank assignment, and environment setup.

From Single-Node to Multi-Node

Convert single-node training to multi-node by adding two parameters:
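
The two parameters are num_instances and distributed_mode, shown side by side in this sketch (function names are illustrative; the decorator values mirror the example above):

from flow.sdk.decorators import FlowApp

app = FlowApp()

# Single-node: one machine with 8 GPUs
@app.function(gpu="h100:8")
def single_node_training():
    ...

# Multi-node: the same function with two extra parameters
@app.function(
    gpu="h100:8",
    num_instances=4,          # run on four nodes
    distributed_mode="auto",  # let Flow coordinate discovery and ranks
)
def multi_node_training():
    ...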

How Flow's Rendezvous Works

When you submit the task, Flow's coordination system:

  1. Discovers all instances via the provider API

  2. Assigns deterministic ranks based on instance IDs

  3. Elects a leader (rank 0) and resolves its private IP

  4. Sets environment variables for PyTorch distributed training

  5. Synchronizes startup so all nodes begin together

Your training code uses standard PyTorch distributed patterns—dist.init_process_group(), DDP, dist.barrier()—without any Flow-specific modifications.
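
For example, a standard DDP setup works unchanged on each node; a minimal sketch using only stock PyTorch APIs:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model):
    # Reads RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = DDP(model.to(local_rank), device_ids=[local_rank])
    # Ensure every rank has finished setup before training begins.
    dist.barrier()
    return model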

Environment Variables Set by Flow

Flow automatically configures the distributed training environment:

No manual configuration is required; Flow handles the coordination process automatically. The only step left to you is deriving the PyTorch variables from Flow's variables in your training script.
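
If you launch per-GPU workers with torchrun, for example, Flow's variables map directly onto its arguments. A minimal sketch, where the worker script name and the rendezvous port are placeholders:

import os
import subprocess

subprocess.run(
    [
        "torchrun",
        f"--nnodes={os.environ['NUM_NODES']}",
        f"--node_rank={os.environ['NODE_RANK']}",
        f"--nproc_per_node={os.environ['GPU_COUNT']}",
        f"--master_addr={os.environ['HEAD_NODE_IP']}",
        "--master_port=29500",  # placeholder port
        "train_loop.py",        # placeholder worker script
    ],
    check=True,
)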


Shared Storage

Multi-node training requires storage that all nodes can access simultaneously—shared datasets, model checkpoints, and training logs. Flow lets you specify volumes that automatically mount across all nodes in your deployment.

File Storage for Multi-Node Access

When you specify interface="file" in your volume configuration, Flow ensures the storage is accessible across all nodes:
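
A sketch of what this can look like. Only interface="file" comes from this page; the volumes argument shape, mount path, and size below are illustrative placeholders:

# app is the FlowApp() instance from the first example.
@app.function(
    gpu="h100:8",
    num_instances=4,
    distributed_mode="auto",
    # Hypothetical volume spec; only interface="file" is the documented setting.
    volumes={"/shared": {"interface": "file", "size_gb": 500}},
)
def training_with_shared_volume():
    # /shared is mounted on every node, so all ranks read and write the same files.
    ...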

Training Script with Coordinated Checkpointing
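
A minimal sketch of coordinated checkpointing onto a shared volume, using only standard PyTorch distributed calls; the /shared/checkpoints path stands in for your mounted file volume.

import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, step, path="/shared/checkpoints/latest.pt"):
    # Only rank 0 writes; the shared file volume makes the result visible to every node.
    if dist.get_rank() == 0:
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
            path,
        )
    # Block until the checkpoint is on disk before any rank continues training.
    dist.barrier()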

S3 Dataset Integration

For read-only datasets, you can mount S3 buckets that appear on all nodes:
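
The same caveat applies here: the mount syntax below is a placeholder sketch, since only the read-only, all-nodes behavior is described on this page:

# app is the FlowApp() instance from the first example.
@app.function(
    gpu="h100:8",
    num_instances=4,
    distributed_mode="auto",
    # Hypothetical mount spec for a read-only S3 bucket visible on every node.
    volumes={"/datasets": {"s3_bucket": "my-training-data", "read_only": True}},
)
def training_with_s3_dataset():
    # Every node streams the same dataset from /datasets without a local copy.
    ...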

Volume Configuration

Flow supports different storage configurations:

  • File storage (interface="file"): Shared access across all nodes. Use for datasets, checkpoints, and logs that need multi-node read/write access.

  • Block storage (interface="block"): Exclusive access to high-performance block storage.

  • S3 mounts: Read-only access to S3 buckets, mounted on all nodes. Useful for large datasets without persistent storage costs.

When you specify volumes in a multi-node function, Flow ensures they appear simultaneously on every node, eliminating manual synchronization between nodes.


Debugging Multi-Node Training

When distributed training fails, you need to inspect individual nodes—check GPU utilization, NCCL communication, memory usage, and process status. Flow provides node-specific SSH access and log aggregation to simplify multi-node debugging.

Node-Specific SSH Access

Flow lets you SSH directly to any node in your multi-node deployment:
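
The exact Flow command is not reproduced on this page, so the sketch below only shows the underlying idea: open a shell on one node by address. The IP and login user are placeholders for whatever your deployment exposes.

import subprocess

node_ip = "10.0.0.12"  # placeholder: the address of the node you want to inspect
# Opens an interactive shell on that node when run from a terminal.
subprocess.run(["ssh", f"ubuntu@{node_ip}"])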

Debugging Workflows

Check GPU utilization across all nodes to identify performance issues:
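
For example, a rough sketch that runs nvidia-smi on every node over ssh and prints the results side by side; node addresses and the login user are placeholders.

import subprocess

NODE_IPS = ["10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"]  # placeholder addresses

for rank, ip in enumerate(NODE_IPS):
    print(f"--- node {rank} ({ip}) ---")
    subprocess.run([
        "ssh", f"ubuntu@{ip}",
        "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv",
    ])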

Log Aggregation

Flow aggregates logs from all nodes with automatic node identification:

Common Debugging Scenarios

NCCL Communication Issues: Use node-specific SSH to check network connectivity and NCCL environment variables on individual nodes.

Uneven GPU Utilization: SSH to underutilized nodes to check process status, memory usage, and data loading bottlenecks.

Hanging Training: Check if specific nodes are stuck in barriers or collective operations using per-node process inspection.

Flow's node-specific access eliminates the need to manage separate SSH connections or correlate logs manually across multiple instances.


Production Configuration

Production multi-node training requires cost controls and performance optimization. Flow provides built-in safeguards and configuration options for production workloads.

Cost Protection

Flow includes automatic cost protection mechanisms for large-scale training:
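
As an illustration only, a per-task spend ceiling might look like the sketch below; the max_hourly_cost parameter name is hypothetical, not a documented Flow argument.

# app is the FlowApp() instance from the first example.
@app.function(
    gpu="h100:8",
    num_instances=4,
    distributed_mode="auto",
    max_hourly_cost=400.0,  # hypothetical: cap spend per hour across all nodes
)
def production_training():
    ...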

Performance Optimization

Flow automatically provisions high-speed networking for multi-node workloads. For explicit control, specify interconnect requirements:
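
A sketch only; the interconnect parameter name and value below are hypothetical placeholders rather than documented Flow arguments.

# app is the FlowApp() instance from the first example.
@app.function(
    gpu="h100:8",
    num_instances=4,
    distributed_mode="auto",
    interconnect="infiniband",  # hypothetical: request a high-speed fabric explicitly
)
def bandwidth_sensitive_training():
    ...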
