Overview

Python → Petaflops in 15 seconds. Flow procures GPUs through Mithril, spins up InfiniBand-connected instances, and runs your workloads with zero friction.

Background

There's a paradox in GPU infrastructure today: massive GPU capacity sits idle even as AI teams wait in queues, starved for compute. Mithril, the AI-compute omnicloud, dynamically allocates GPUs from a global pool (spanning Mithril's first-party resources and third-party partner-cloud capacity) using efficient two-sided auctions that maximize surplus. Mithril seamlessly supports both reserved-in-advance and just-in-time workloads, maximizing utilization, ensuring availability, and significantly reducing costs.
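To build intuition for how a two-sided (double) auction clears, here is a deliberately simplified sketch: buyers' limit prices are matched against sellers' reserve prices, and trades execute while the marginal bid still meets the marginal ask. This is illustrative only; function names and the midpoint pricing rule are assumptions, not Mithril's production mechanism.

```python
# Toy double-auction clearing, for intuition only.
def clear_double_auction(bids, asks):
    """bids: buyers' limit prices ($/GPU-hr); asks: sellers' reserves.

    Returns (num_trades, clearing_price)."""
    bids = sorted(bids, reverse=True)   # most eager buyers first
    asks = sorted(asks)                 # cheapest sellers first
    k = 0
    while k < min(len(bids), len(asks)) and bids[k] >= asks[k]:
        k += 1                          # the k-th bid still meets the k-th ask
    if k == 0:
        return 0, None                  # no overlap, no trade
    # Price all trades at the midpoint of the marginal bid/ask pair.
    price = (bids[k - 1] + asks[k - 1]) / 2
    return k, price

trades, price = clear_double_auction(
    bids=[3.00, 2.50, 1.20, 0.80],
    asks=[0.50, 1.00, 2.00, 4.00],
)
print(trades, price)  # → 2 1.75
```

Everyone trades at one market-clearing rate, which is what lets prices track real supply and demand rather than posted list prices.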

flow run "python train.py" # -i 8xh100
 Bidding for best‑price GPU node (8×H100) with $12.29/h100-hr limit_price…
 Launching on NVIDIA H100-80GB for $1/h100-hr

Why choose Flow

Status quo GPU provisioning involves quotas, complex setup, and queue delays, even as GPUs sit idle elsewhere or sit stuck in recovery. Flow addresses this:

Dynamic Market Allocation – Efficient two-sided auctions ensure you pay the lowest market-driven prices rather than inflated rates.

Simplified Batch Execution – An intuitive interface designed for cost-effective, high-performance batch workloads without complex infrastructure management.

Provision from 1 to thousands of GPUs for long-term reservations, short-term "micro-reservations" (minutes to weeks), or spot/on-demand needs—all interconnected via InfiniBand. High-performance persistent storage and built-in Docker support further streamline workloads, ensuring rapid data access and reproducibility.


Why Flow + Mithril?

| Pillar | Outcome | How |
| --- | --- | --- |
| Iteration Velocity and Ease | Fresh containers in seconds; from idea to training or serving instantly. | flow dev for DevBox, or flow run to programmatically launch tasks |
| Best Price-Performance via Market-Based Pricing | Preemptible secure jobs for $1/h100-hr | Blind two-sided second-price auction; client-side bid capping |
| Availability and Elasticity | GPUs always available, self-serve; no haggling, no calls. | Uncapped spot + overflow capacity from partner clouds |
| Abstraction and Simplification | InfiniBand VMs, CUDA drivers, auto-managed healing buffer, all pre-arranged. | Mithril virtualization, preconfigured base images, and Mithril capacity management |

"The tremendous demand for AI compute and the large fraction of idle time makes sharing a perfect solution, and Mithril's innovative market is the right approach." – Paul Milgrom, Nobel Laureate (Auction Theory and Mechanism Design)


Pricing & Auctions

How Flow leverages Mithril's Second-Price Auction:

You express your limit price (or use Flow's defaults); GPUs provision instantly at the fair market-clearing rate.

| Your Bid's Limit Price | Current Spot Price | You Pay |
| --- | --- | --- |
| $3.00 | $1.00 | $1.00 |
| $3.00 | $3.50 (spike) | Nothing – your instance is preempted until the spot price falls back below your limit |

  • Your billing price = highest losing bid.

  • Limit price protects from surprises.

  • Resell unused reservations into the auction to recoup costs.
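The "highest losing bid" rule above can be sketched in a few lines. This is illustrative only, not Mithril's billing code; the single fixed `capacity` parameter is a simplification.

```python
# Toy second-price allocation: winners pay the highest losing bid,
# never their own limit price. Not the real billing implementation.
def billing_price(limit_prices, capacity):
    """limit_prices: bidders' max $/h100-hr; capacity: GPUs available."""
    ranked = sorted(limit_prices, reverse=True)
    winners = ranked[:capacity]
    losers = ranked[capacity:]
    # Every winner pays the same market-clearing rate.
    price = losers[0] if losers else 0.0
    return winners, price

winners, price = billing_price([3.00, 2.00, 1.00, 0.50], capacity=2)
# The $3.00 and $2.00 bidders win, and both pay $1.00 (the highest losing bid).
```

Because you pay the highest losing bid rather than your own limit, it is safe to bid your true willingness to pay, which is the classic incentive property of second-price auctions.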

Full Auction Mechanics →


Key Concepts to Get Started

Auctions & Limit Prices

Flow uses Mithril spot instances via second-price auctions. See auction mechanics.

Core Workflows

  • flow dev → interactive loops in seconds.

  • flow run → reproducible batch jobs.

  • flow grab → instant GPU cluster (e.g., flow grab 256)

  • Python API → easy pipelines and orchestration.

Examples

# Grab a micro-cluster instantly  
flow grab 256  # optionally name it: -n micro-cluster

# Launch a batch job on discounted H100s
flow run "python train.py" -i 8xh100

# Frictionlessly leverage an existing SLURM script
flow run job.slurm

# Serverless‑style decorator (attach to a Python function)
@flow.function(gpu="a100")
def train():
    ...
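As a rough mental model for the serverless-style decorator, here is a toy local sketch. It is not the Flow SDK: the real flow.function packages your function and ships it to a remote GPU, whereas this version just wraps the call locally, and the names here (function, embed) are illustrative.

```python
import functools

# Toy sketch of a serverless-style GPU decorator, for intuition only.
def function(gpu="a100"):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # A real implementation would package fn + args, bid for a
            # `gpu` instance, run remotely, and stream back the result.
            print(f"[flow] would run {fn.__name__} on a {gpu}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@function(gpu="a100")
def embed(texts):
    return [len(t) for t in texts]  # placeholder "model"

print(embed(["hello", "gpu"]))  # → [5, 3]
```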

Ideal Use Cases

  • Rapid Experimentation – Quick iterations for research sprints.

  • Instant Elasticity – Scale rapidly from one to thousands of GPUs.

  • Collaborative Research – Shared dev environments with per-task cost controls.

Flow is not yet ideal for: always‑on ≤100 ms inference, strictly on‑prem regulated data, or models that fit on laptop or consumer-grade GPUs.


Architecture (30‑s view)

Your intent ⟶ Flow Execution Layer ⟶ Global GPU Fabric

The Flow SDK abstracts complex GPU auctions, InfiniBand clusters, and multi-cloud management into a single, unified developer interface.


Under the Hood (Advanced)

  • Bid Caps – Protect budgets automatically.

  • Self-Healing – Spot nodes dynamically migrate tasks.

  • Docker/Conda – Pre-built images or dynamic install.

  • Multi-cloud Ready – Mithril today (Oracle and Nebius integrations are internal to Mithril), with more providers coming

  • SLURM Compatible – Run #SBATCH scripts directly.
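For the SLURM-compatible path, a minimal job.slurm that flow run job.slurm could pick up might look like this (a sketch; the directive values and script names are illustrative, not Flow requirements):

```shell
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --gres=gpu:8        # 8 GPUs per node
#SBATCH --time=04:00:00

srun python train.py
```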


Key Features Summary

  • Distributed Training – Multi-node InfiniBand clusters auto-configured

  • Code Upload – Automatic with .flowignore (or .gitignore fallback)

  • Container Environments – Custom Docker images with caching (set image="...")

  • Live Debugging – SSH into running instances (flow ssh)

  • Cost Protection – Built-in max_price_per_hour safeguards

  • Google Colab Integration – Connect notebooks to GPU instances

  • Private Registries – ECR/GCR with auto-authentication

Repository: https://github.com/mithrilcompute/flow
