# Deadline routing

> The `start_within` field, the flex race, and automatic fallback.

`start_within` is the one field FlexInference adds, and every request must include it. It tells us how long you are willing to wait for a request to start. When you can wait a little, we try a cheaper tier first to cut your bill. When you need the answer now, we send the request straight through to your provider. The field takes one of four forms, each described below.

## The four forms

<Tabs>
  <Tab title="default">
    ```json theme={null}
    { "start_within": "default" }
    ```

    Routes to the provider's **standard** real-time tier at standard pricing. That means OpenAI's `default`, Gemini's `standard`, or Anthropic's `standard_only`. It works with any model.
  </Tab>

  <Tab title="priority">
    ```json theme={null}
    { "start_within": "priority" }
    ```

    Routes to OpenAI's **priority** tier for the fastest possible start. It is the most expensive tier, so it is best for interactive, latency-critical paths. It works with any model, but on Anthropic it is **best-effort**: Claude has no real priority tier, so we map it to Anthropic's `auto`.
  </Tab>

  <Tab title="auto">
    ```json theme={null}
    { "start_within": "auto" }
    ```

    Lets the provider choose the tier. On OpenAI this is `service_tier: auto`; on Anthropic it maps to Anthropic's `auto`. (This is the provider's own auto tier, not a FlexInference auto-routing feature.) This works on **OpenAI and Anthropic**. Gemini has no auto tier, so `auto` on a Gemini model returns `400 auto_unsupported_for_gemini`.
  </Tab>

  <Tab title="a duration">
    ```json theme={null}
    { "start_within": "00h-00m-30s" }
    ```

    Runs the **flex race** (below) on a flex-capable [model](/models). The flex race means we try the cheaper flex tier first and only move up to standard when flex cannot start in time. The duration is the longest you will wait for the request to start, written as `HHh-MMm-SSs` with two digits per field. It must be between **5 seconds and 10 minutes** (`00h-00m-05s` to `00h-10m-00s`).
  </Tab>
</Tabs>
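The duration form is easy to get wrong by hand, so a small client-side helper can build and check it. The sketch below is illustrative only: `format_start_within` and `parse_start_within` are hypothetical names, not part of any FlexInference SDK. The docs state the format and the 5-second to 10-minute bounds; rejecting minute or second fields above 59 is a conservative assumption of this sketch, not documented server behavior.

```python
import re

# Bounds stated in the docs: 5 seconds to 10 minutes.
MIN_SECONDS = 5
MAX_SECONDS = 10 * 60

# HHh-MMm-SSs with exactly two ASCII digits per field.
_DURATION_RE = re.compile(r"([0-9]{2})h-([0-9]{2})m-([0-9]{2})s")


def _check_bounds(seconds: int) -> None:
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration must be between {MIN_SECONDS} and {MAX_SECONDS} seconds"
        )


def format_start_within(seconds: int) -> str:
    """Render a whole number of seconds as an HHh-MMm-SSs string."""
    _check_bounds(seconds)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}h-{minutes:02d}m-{secs:02d}s"


def parse_start_within(value: str) -> int:
    """Return the duration in seconds, or raise ValueError."""
    match = _DURATION_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not an HHh-MMm-SSs duration: {value!r}")
    hours, minutes, secs = (int(group) for group in match.groups())
    # Assumption: minute and second fields above 59 are treated as invalid.
    if minutes > 59 or secs > 59:
        raise ValueError(f"minute and second fields must be 00-59: {value!r}")
    total = hours * 3600 + minutes * 60 + secs
    _check_bounds(total)
    return total
```

Calling `format_start_within(30)` yields the string used in the tab above, and `parse_start_within` rejects values outside the documented window before they reach the API.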

`default`, `priority`, and `auto` proxy **any** model straight to its provider (OpenAI, Gemini, or Anthropic). Only the duration form runs the flex race, and it needs a flex-capable [model](/models). A duration on any other model returns `400 model_not_flex_capable`.

<Note>
  The flex race is **not available on Claude** because Anthropic has no flex tier. A duration `start_within` on a `claude-*` model returns `400 flex_unsupported_for_anthropic`. Use `default`, `priority`, or `auto` instead.
</Note>
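The compatibility rules above can be checked before a request is sent. The following is a minimal sketch, not FlexInference code: `expected_rejection` is a hypothetical helper, the caller supplies the set of flex-capable model names (see the models page), and recognizing Gemini models by a `gemini-` prefix is an assumption. The order in which the server evaluates these rules is not documented, so the precedence here is only a guess.

```python
from typing import Collection, Optional

FIXED_FORMS = {"default", "priority", "auto"}


def expected_rejection(
    model: str,
    start_within: Optional[str],
    flex_capable: Collection[str],
) -> Optional[str]:
    """Return the documented error code this request should hit, or None."""
    if start_within is None:
        return "missing_start_within"
    if start_within in FIXED_FORMS:
        # Assumption: Gemini models are recognizable by a "gemini-" prefix.
        if start_within == "auto" and model.startswith("gemini-"):
            return "auto_unsupported_for_gemini"
        return None
    # Anything else is treated as a duration; format checks are out of scope.
    if model.startswith("claude-"):
        return "flex_unsupported_for_anthropic"
    if model not in flex_capable:
        return "model_not_flex_capable"
    return None
```

A check like this only saves a round trip. The API remains the source of truth for which models are flex-capable.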

## The flex race

When you give a duration, we try to serve your request from the provider's **flex** tier (OpenAI or Gemini), which is billed at half the standard rate but has no guaranteed capacity. We hold your response and watch the upstream:

<Steps>
  <Step title="Fire flex">
    We send your request to the flex tier and wait for the provider to confirm it is being fulfilled, all inside the time you set.
  </Step>

  <Step title="Commit to flex">
    If the provider starts the request in time, we commit. Your response streams back as normal, billed at flex rates. The response shows `"service_tier": "flex"`.
  </Step>

  <Step title="Fall back to standard">
    If flex can't start in time, we cancel it and send the request to your standard tier so you still get an answer. This runs the opposite way from the fallback you are used to. Most tools fall back down to a cheaper model, but here flex is already the cheaper path, so we fall back up to your standard tier. That standard tier is your safety net, not a downgrade. The response shows `"service_tier": "default"` (OpenAI) or `"standard"` (Gemini).
  </Step>
</Steps>

Fallback to standard happens when, inside the time you set, flex returns a `429` (no capacity), any `5xx`, a pre-start failure, or simply doesn't start. The flex-to-standard switch is the **only** tier change FlexInference makes.
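The race described above can be sketched as a timeout around the flex attempt. This is a simplified illustration, not the FlexInference implementation: `flex_race`, `start_flex`, `run_standard`, and `FlexUnavailable` are hypothetical names, and `start_flex` is assumed to resolve at the moment the provider confirms the request has started. The generic label `"standard"` stands in for whatever the provider reports (`default` on OpenAI, `standard` on Gemini).

```python
import asyncio
from typing import Awaitable, Callable, Tuple


class FlexUnavailable(Exception):
    """Stand-in for a 429, a 5xx, or a pre-start failure on the flex tier."""


async def flex_race(
    start_flex: Callable[[], Awaitable[str]],
    run_standard: Callable[[], Awaitable[str]],
    window_seconds: float,
) -> Tuple[str, str]:
    """Return (tier, result): flex if it starts in time, otherwise standard."""
    try:
        # wait_for cancels the flex attempt if the window elapses first.
        result = await asyncio.wait_for(start_flex(), timeout=window_seconds)
        return "flex", result
    except (asyncio.TimeoutError, FlexUnavailable):
        # Fall back up to the standard tier so the caller still gets an answer.
        return "standard", await run_standard()
```

Note that failures raised after `start_flex` has resolved are outside this function, which mirrors the rule that only pre-start problems trigger the fallback.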

<Info>
  Pick a duration as long as your UX can tolerate. A longer wait gives flex more room to win, which means more requests served at the cheaper rate.
</Info>

## When the flex pool is unavailable

The flex tier is shared, best-effort capacity, so it can miss your window in a few ways. It can return a `429` when there is no capacity, return any `5xx`, fail before it starts, or simply not start in time. When that happens we send the request to standard for you, and you still get an answer.

In rare cases the flex pool fails *after* the request has been accepted and started streaming. We surface that failure instead of hiding it: a streaming request gets a terminal `response.failed` event, and a non-streaming request gets a `502`. We do not retry silently, and we never report a request as complete when it was not. If you hit this, **retry**, or use `default`, `priority`, or `auto` to skip the flex pool entirely.
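One way a client might act on that advice for non-streaming requests is to retry the same duration a limited number of times and then switch to `default`. The sketch below is an assumption-laden illustration: `send_with_fallback` is a hypothetical helper, and `send` is a caller-supplied function that issues the request and returns the HTTP status code. Treating every `502` on a duration request as a post-start flex failure is also an assumption, since a provider `502` could look the same.

```python
from typing import Callable, Tuple

FIXED_FORMS = {"default", "priority", "auto"}


def send_with_fallback(
    send: Callable[[str], int],
    start_within: str,
    flex_retries: int = 1,
) -> Tuple[int, str]:
    """Return (status, start_within actually used for the final attempt)."""
    status = send(start_within)
    if start_within in FIXED_FORMS:
        # No flex pool involved, so the status is passed through as-is.
        return status, start_within
    attempts = 0
    while status == 502 and attempts < flex_retries:
        attempts += 1
        status = send(start_within)
    if status == 502:
        # Skip the flex pool entirely, as the docs suggest.
        start_within = "default"
        status = send(start_within)
    return status, start_within
```

Keeping the retry on the client side makes the extra attempts, and any extra cost, visible to the caller.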

## Billing

You bring your own key, so your provider bills your account at the rate of whichever tier served the request. You pay flex rates when flex commits and standard rates when it falls back. Sometimes a flex attempt burns a few tokens before it fails. Your provider bills that usage the same way it would if you called flex yourself, and we show it in your usage so the cost is never hidden. FlexInference adds no markup on your tokens. We charge 20% of the money we save you, and we charge nothing when we save you nothing.
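The arithmetic can be worked through with a small calculation. This sketch assumes that the saving is the difference between the standard-rate cost and the flex-rate cost of the same request, and that flex is billed at exactly half the standard rate as described in the flex race section. It ignores tokens burned by failed flex attempts. `flex_bill` is a hypothetical helper, not an official calculator.

```python
from decimal import Decimal

FLEX_DISCOUNT = Decimal("0.50")  # flex is billed at half the standard rate
FEE_SHARE = Decimal("0.20")      # FlexInference charges 20% of the saving


def flex_bill(standard_cost: Decimal, served_by_flex: bool) -> dict:
    """Break down what a request costs at the provider and in fees."""
    if served_by_flex:
        provider_cost = standard_cost * FLEX_DISCOUNT
    else:
        provider_cost = standard_cost
    savings = standard_cost - provider_cost
    fee = savings * FEE_SHARE
    return {
        "provider_cost": provider_cost,
        "savings": savings,
        "fee": fee,
        "total": provider_cost + fee,
    }
```

Under these assumptions, a request that would cost 10.00 at standard rates costs 5.00 at the provider when flex commits, plus a 1.00 fee, for 6.00 in total. When the request falls back to standard there is no saving, so the fee is zero and the total stays at 10.00.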

## What passes through unchanged

Every other status code and response body from your provider passes through untouched. A `400` for a bad parameter, a `401` for a bad key, or a rate limit on the standard attempt all reach you unchanged. You see exactly what your provider returned. FlexInference only changes the tier, and it never changes the contract.

<Note>
  `start_within` is **required**. If you leave it out, the request returns `400 missing_start_within`, and this happens even with a plain OpenAI SDK pointed at our base URL. If you send a value we cannot read, you get `invalid_start_within`. A valid value looks like `"default"`, `"priority"`, `"auto"`, or a duration such as `"00h-00m-30s"`. The bare word `standard` used to be allowed and no longer is, so send `default` instead. See [Errors](/errors) for the full list.
</Note>
