Inference overview

Inference is currently in Beta

Mithril Inference is an engine that supports your batch inference needs with cost efficiency and flexibility. It is ideal for workloads that do not require immediate responses, such as running evaluations, classifying large datasets, and performing bulk inference tasks.

By using Mithril Inference, you can:

  • Reduce costs with significantly lower pricing compared to synchronous requests

  • Unlock higher rate limits and throughput for large-scale operations

  • Simplify workflows by submitting requests asynchronously and retrieving results within a defined completion window

We offer an Async Batch Inference API to run a large number of model requests in a single job. With OpenAI-compatible bindings, our API provides an ergonomic and intuitive developer experience. If you're already using the OpenAI API spec, switching to Mithril Inference is as easy as swapping the base URL.
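For example, here is a minimal sketch of pointing the official OpenAI Python SDK at Mithril Inference. The base URL and environment variable name are assumptions for illustration; substitute the values from your Mithril account.

```python
import os

from openai import OpenAI

# Reuse the standard OpenAI client; only the base URL changes.
client = OpenAI(
    base_url="https://api.mithril.ai/v1",   # hypothetical Mithril endpoint
    api_key=os.environ["MITHRIL_API_KEY"],  # hypothetical env var name
)
```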

Getting started

To begin using the batch inference API, follow the instructions in the quickstart. You can submit your first job in less than 3 minutes! Currently, we support a target completion window of 24 hours.
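As a hedged sketch of the full flow, the example below builds a JSONL input file, uploads it, and creates a batch with the 24-hour completion window. The file format and client calls follow the standard OpenAI Batch API spec; the base URL and model identifier are illustrative assumptions.

```python
import json

from openai import OpenAI

client = OpenAI(base_url="https://api.mithril.ai/v1")  # hypothetical endpoint

# Each line of the input file is one request in the OpenAI batch format.
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model id
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Classify: 'great product'", "Classify: 'slow shipping'"])
]

with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

# Upload the file, then create the batch with the 24-hour completion window.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```

Once the job completes, results can be retrieved from the batch's output file and matched back to your inputs via each request's `custom_id`.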

Supported models

For the most up-to-date list of supported models, check the active-models endpoint (see the sketch after this list). Today, we support popular open-source LLMs such as:

  • Llama 4 Maverick 17Bx128E Instruct FP8

  • Llama 4 Scout 17Bx16E Instruct

  • Llama 3.1 8B Instruct

  • Qwen2.5 72B Instruct
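
Because the bindings are OpenAI-compatible, one plausible way to query the active-models endpoint is through the standard models listing; this is an assumption, so adjust the path if a dedicated route is documented.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.mithril.ai/v1")  # hypothetical endpoint

# List model ids currently available for batch inference.
for model in client.models.list():
    print(model.id)
```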

We plan to add more models. Please reach out to us directly if there are additional models you would like us to support!
