Overview
Python → Petaflops in 15 seconds. Flow procures GPUs through Mithril, spins up InfiniBand-connected instances, and runs your workloads with zero friction.
Background
There's a paradox in GPU infrastructure today: massive GPU capacity sits idle while AI teams wait in queues, starved for compute. Mithril, the AI-compute omnicloud, dynamically allocates GPU resources from a global pool (spanning Mithril's first-party resources and third-party partner cloud capacity) using efficient two-sided auctions that maximize surplus. It supports both reserved-in-advance and just-in-time workloads, which raises utilization, ensures availability, and significantly reduces costs.
flow run "python train.py" # -i 8xh100
⠋ Bidding for best‑price GPU node (8×H100) with $12.29/h100-hr limit_price…
✓ Launching on NVIDIA H100-80GB for $1/h100-hr
Why choose Flow
Status quo GPU provisioning involves quotas, complex setups, and queue delays, even as GPUs sit idle elsewhere or in recovery processes. Flow addresses this:
Dynamic Market Allocation – Efficient two-sided auctions ensure you pay the lowest market-driven prices rather than inflated rates.
Simplified Batch Execution – An intuitive interface designed for cost-effective, high-performance batch workloads without complex infrastructure management.
Provision from 1 to thousands of GPUs for long-term reservations, short-term "micro-reservations" (minutes to weeks), or spot/on-demand needs—all interconnected via InfiniBand. High-performance persistent storage and built-in Docker support further streamline workloads, ensuring rapid data access and reproducibility.
Why Flow + Mithril?
Iteration Velocity and Ease
Fresh containers in seconds; from idea to training or serving instantly.
flow dev for DevBox, or flow run to programmatically launch tasks
Best price-performance via market-based pricing
Preemptible secure jobs for $1/h100-hr
Blind two-sided second-price auction; client-side bid capping
Availability and Elasticity
GPUs always available, self-serve; no haggling, no calls.
Uncapped spot + overflow capacity from partner clouds
Abstraction and Simplification
InfiniBand VMs, CUDA drivers, auto-managed healing buffer—all pre-arranged.
Mithril virtualization and base images preconfigured + Mithril capacity management.
"The tremendous demand for AI compute and the large fraction of idle time makes sharing a perfect solution, and Mithril's innovative market is the right approach." — Paul Milgrom, Nobel Laureate (Auction Theory and Mechanism Design)
Pricing & Auctions
How Flow leverages Mithril's Second-Price Auction:
You set a limit price (or use Flow's defaults); GPUs provision instantly at the market clearing rate.
[Figure: auction example with price labels $3.00, $1.00, and $1.00]
Your billing price = highest losing bid.
Limit price protects from surprises.
Resell unused reservations into the auction to recoup costs.
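To make the "highest losing bid" rule concrete, here is a minimal Python sketch of a uniform-price auction for identical GPU units. It is a simplified, one-sided illustration, not Mithril's actual two-sided mechanism: the names Bid, clear_auction, and reserve_price are hypothetical, and the bid values are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    bidder: str
    limit_price: float  # maximum $/GPU-hr the bidder is willing to pay


def clear_auction(bids, capacity, reserve_price=0.0):
    """Return (winners, clearing_price) for `capacity` identical units.

    Hypothetical simplification: one unit per bidder, sellers modeled
    only through a reserve price.
    """
    # Only bids at or above the reserve can compete.
    eligible = sorted(
        (b for b in bids if b.limit_price >= reserve_price),
        key=lambda b: b.limit_price,
        reverse=True,
    )
    winners = eligible[:capacity]
    losers = eligible[capacity:]
    # Every winner pays the highest losing bid; if nobody lost,
    # the price falls back to the reserve.
    clearing_price = losers[0].limit_price if losers else reserve_price
    return [b.bidder for b in winners], clearing_price


# Illustrative bids only: one unit for sale, three bidders.
bids = [Bid("you", 3.00), Bid("team_b", 1.00), Bid("team_c", 0.75)]
winners, price = clear_auction(bids, capacity=1)
print(winners, price)  # ['you'] 1.0
```

Because the clearing price is taken from a losing bid, it can never exceed any winner's limit price, which is the sense in which the limit price protects against surprises.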
Key Concepts to Get Started
Auctions & Limit Prices
Flow uses Mithril spot instances via second-price auctions. See auction mechanics.
Core Workflows
flow dev → interactive loops in seconds.
flow run → reproducible batch jobs.
flow grab → instant GPU cluster (e.g., flow grab 256).
Python API → easy pipelines and orchestration.
Examples
# Grab a micro-cluster instantly
flow grab 256 # optionally name it: -n micro-cluster
# Launch a batch job on discounted H100s
flow run "python train.py" -i 8xh100
# Frictionlessly leverage an existing SLURM script
flow run job.slurm
# Serverless‑style decorator
@flow.function(gpu="a100")
Ideal Use Cases
Rapid Experimentation – Quick iterations for research sprints.
Instant Elasticity – Scale rapidly from one to thousands of GPUs.
Collaborative Research – Shared dev environments with per-task cost controls.
Flow is not yet ideal for: always‑on ≤100 ms inference, strictly on‑prem regulated data, or models that fit on a laptop or consumer-grade GPU.
Architecture (30‑s view)
Your intent ⟶ Flow Execution Layer ⟶ Global GPU Fabric
The Flow SDK abstracts GPU auctions, InfiniBand clusters, and multi-cloud management into a single unified developer interface.
Under the Hood (Advanced)
Bid Caps – Protect budgets automatically.
Self-Healing – Spot nodes dynamically migrate tasks.
Docker/Conda – Pre-built images or dynamic install.
Multi-cloud Ready – Mithril today (Oracle and Nebius integrations are handled inside Mithril), with more providers coming.
SLURM Compatible – Run #SBATCH scripts directly.
Key Features Summary
Distributed Training – Multi-node InfiniBand clusters auto-configured
Code Upload – Automatic with .flowignore (or .gitignore fallback)
Container Environments – Custom Docker images with caching (set image="...")
Live Debugging – SSH into running instances (flow ssh)
Cost Protection – Built-in max_price_per_hour safeguards
Google Colab Integration – Connect notebooks to GPU instances
Private Registries – ECR/GCR with auto-authentication
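The .flowignore / .gitignore fallback in the Code Upload item can be pictured with a short Python sketch. This is an assumption-laden illustration, not Flow's implementation: load_ignore_patterns and files_to_upload are hypothetical helpers, and the matching uses fnmatch on paths and path components rather than full gitignore semantics (no negation, no anchored patterns).

```python
import fnmatch
from pathlib import Path


def load_ignore_patterns(project_dir):
    """Read patterns from .flowignore, falling back to .gitignore.

    Hypothetical helper: returns an empty list if neither file exists.
    """
    for name in (".flowignore", ".gitignore"):
        path = Path(project_dir) / name
        if path.is_file():
            stripped = (ln.strip() for ln in path.read_text().splitlines())
            # Skip blank lines and comment lines.
            return [ln for ln in stripped if ln and not ln.startswith("#")]
    return []


def _is_ignored(rel_path, patterns):
    parts = rel_path.split("/")
    for pattern in patterns:
        name = pattern.rstrip("/")  # treat "data/" like "data"
        if fnmatch.fnmatch(rel_path, pattern):
            return True
        if any(fnmatch.fnmatch(part, name) for part in parts):
            return True
    return False


def files_to_upload(project_dir, patterns):
    """List project files (as relative POSIX paths) not matched by any pattern."""
    root = Path(project_dir)
    selected = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            if not _is_ignored(rel, patterns):
                selected.append(rel)
    return selected
```

Calling files_to_upload(".", load_ignore_patterns(".")) in a project directory would list the files such a scheme would send. The first ignore file found wins, so adding a .flowignore overrides an existing .gitignore in this sketch.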
Repository: https://github.com/mithrilcompute/flow