Inference API overview
Overview
The Mithril Inference API lets you execute a large number of requests asynchronously. It is ideal for tasks that do not require immediate responses, such as:
Running evaluations
Classifying large datasets
Bulk inference tasks
Using the Batch API provides:
Reduced Costs: Significantly lower pricing compared to synchronous requests.
Higher Throughput: Greater rate limits for large-scale operations.
Asynchronous Convenience: Submit requests and retrieve results later, within a clear completion window.
The API largely follows the structure of OpenAI's Batch API. It supports uploading batch jobs, checking job status, and retrieving results.
Input file format
The input for batch inference uses the OpenAI batch .jsonl file format, which consists of a series of JSON objects, each on its own line. Each line represents a separate request.
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
Implementation notes:
Currently, Mithril only supports the /v1/chat/completions endpoint.
Every request in the file should be a complete and valid request to the specified endpoint.
The custom_id field is used to track individual requests within the batch.
All requests in a file must be made to the same model.
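Because the API follows the structure of OpenAI's Batch API, a job can plausibly be submitted with an OpenAI-compatible client. The following is a sketch under that assumption; the base URL is a placeholder, not a documented endpoint:

from openai import OpenAI

# Placeholder base URL and key; substitute your actual Mithril endpoint and credentials.
client = OpenAI(base_url="https://api.mithril.example/v1", api_key="YOUR_API_KEY")

# Upload the .jsonl input file, then create a batch job that references it.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",  # the only endpoint currently supported
    completion_window="24h",
)
print(batch.id, batch.status)  # a new job typically starts in "validating"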
Retrieving results
Upon completion, results and any errors are provided as downloadable .jsonl files accessible via the Batch API endpoints.
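Continuing the sketch above (same assumed client), completed results can be downloaded and matched back to each request by its custom_id:

import json

batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    # Download the output file and index each response by its custom_id.
    output_text = client.files.content(batch.output_file_id).text
    results = {}
    for line in output_text.splitlines():
        record = json.loads(line)
        results[record["custom_id"]] = record["response"]
    # Any per-request failures arrive in a separate error file, if present.
    if batch.error_file_id:
        error_text = client.files.content(batch.error_file_id).text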
Important notes:
Output and input files are stored for a maximum of 30 days from the job completion date. After 30 days, files are permanently purged and cannot be retrieved.
Jobs that exceed computational limits or runtime thresholds will be automatically paused, checkpointed, and returned to you partially complete.
Batch lifecycle
validating: Your input file is being validated before the batch begins.
failed: Your batch failed validation or encountered an error.
in_progress: Your batch is actively being processed.
completed: Your batch completed successfully; results are ready.
expired: Your batch did not complete within the 24-hour completion window.
cancelling: Your batch cancellation request is currently being processed.
cancelled: Your batch was cancelled successfully.
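A simple polling sketch built on the lifecycle above (the client setup from the earlier example is assumed; the poll interval is arbitrary):

import time

TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(client, batch_id, poll_seconds=60):
    """Poll a batch until it leaves validating, in_progress, or cancelling."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        time.sleep(poll_seconds)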
Limits and constraints
We impose certain system limits by default. If you need higher limits, please reach out to our support team.
Maximum File Size: 150 MB. Input files larger than this limit will be rejected.
Maximum Requests per Job: 25,000. Each batch job can include up to 25,000 individual requests.
Completion Window: 24 hours. We maintain a target SLA of 24 hours and expect most jobs to complete with high confidence within 24 hours of submission.
Maximum Jobs per Day: 300 total jobs per day.
Maximum Requests per Day: 200,000 total requests per day.
Maximum Tokens per Day: 500,000,000 total input and output tokens per day.
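As an illustrative pre-flight check, the per-job limits above can be validated client-side before upload. This is a sketch; the exact byte definition of 150 MB is an assumption:

import os

MAX_FILE_BYTES = 150 * 1024 * 1024  # 150 MB file-size limit (byte definition assumed)
MAX_REQUESTS_PER_JOB = 25_000       # maximum requests per batch job

def validate_batch_input(path):
    """Check a batch input file against the documented limits before upload."""
    size = os.path.getsize(path)
    if size > MAX_FILE_BYTES:
        raise ValueError(f"{path} is {size} bytes; the maximum file size is 150 MB")
    with open(path) as f:
        count = sum(1 for line in f if line.strip())
    if count > MAX_REQUESTS_PER_JOB:
        raise ValueError(f"{path} has {count} requests; the maximum is 25,000 per job")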
Billing
Mithril Inference operates on a pay-as-you-go model: usage is charged per completed batch job at the input/output token prices specified. Learn more about how billing works at Billing.
Model support
The batch API currently supports a variety of popular language models, and we continuously expand this selection based on user needs and feedback.
To check which models are available for batch processing, visit our Active Models page.
If you're interested in a model not currently listed or have specific model-related requests, please reach out at [email protected].
Data retention and privacy
Batch input and output files are securely stored for 30 days, after which they're permanently deleted.
Ensure retrieval of your outputs within this timeframe to avoid data loss.