Deadline routing - FlexInference

start_within is the one field FlexInference adds, and every request must include it. It tells us how long you are willing to wait for a request to start. When you can wait a little, we try a cheaper tier first and cut your bill. When you need the answer now, we send it straight through to your provider. The field takes one of four forms, each described below.

The four forms

default
priority
auto
a duration

{ "start_within": "default" }

Routes to the provider’s standard real-time tier at standard pricing. That means OpenAI’s default, Gemini’s standard, or Anthropic’s standard_only. It works with any model.

{ "start_within": "priority" }

Routes to OpenAI’s priority tier for the fastest possible start. Best for interactive, latency-critical paths. Priority is the most expensive tier. It works with any model, but on Anthropic it is best-effort: Claude has no real priority tier, so we map it to Anthropic’s auto.

{ "start_within": "auto" }

Lets the provider choose the tier. On OpenAI this is service_tier: auto; on Anthropic it maps to Anthropic’s auto. (This is the provider’s own auto tier, not a FlexInference auto-routing feature.) This works on OpenAI and Anthropic. Gemini has no auto tier, so auto on a Gemini model returns 400 auto_unsupported_for_gemini.

{ "start_within": "00h-00m-30s" }

Runs the flex race (below) on a flex-capable model. The flex race means we try the cheaper flex tier first and only move up to standard when flex cannot start in time. The duration is the longest you will wait for the request to start, written as HHh-MMm-SSs with two digits per field. It must be between 5 seconds and 10 minutes (00h-00m-05s to 00h-10m-00s).
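As a rough illustration of the duration format, the sketch below turns a whole number of seconds into the HHh-MMm-SSs string and enforces the 5 second to 10 minute range described above. The helper name format_start_within is hypothetical and is not part of any FlexInference SDK.

```python
def format_start_within(seconds: int) -> str:
    """Format a wait in whole seconds as HHh-MMm-SSs (hypothetical helper)."""
    # The documented range is 5 seconds to 10 minutes inclusive.
    if not 5 <= seconds <= 600:
        raise ValueError("duration must be between 5 and 600 seconds")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    # Two digits per field, as the format requires.
    return f"{hours:02d}h-{minutes:02d}m-{secs:02d}s"


body_fragment = {"start_within": format_start_within(30)}
```

Here body_fragment is {"start_within": "00h-00m-30s"}, matching the example above.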

default, priority, and auto proxy any model straight to its provider (OpenAI, Gemini, or Anthropic). Only the duration form runs the flex race, and it needs a flex-capable model. A duration on a model that is not flex-capable returns 400 model_not_flex_capable; Claude models return their own error code, described next.

The flex race is not available on Claude because Anthropic has no flex tier. A duration start_within on a claude-* model returns 400 flex_unsupported_for_anthropic. Use default, priority, or auto instead.
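The compatibility rules above can be sketched as a client-side pre-check. This is a minimal sketch under several assumptions: the function name is hypothetical, the set of flex-capable models is passed in by the caller because this page does not list them, Gemini models are assumed to be recognisable by a "gemini-" prefix, and the Claude check is assumed to take precedence over the generic flex-capability check. The real service may order or decide these differently.

```python
from typing import Optional, Set

TIER_FORMS = {"default", "priority", "auto"}


def expected_rejection(start_within: str, model: str,
                       flex_capable: Set[str]) -> Optional[str]:
    """Return the 400 error code described above, or None if accepted."""
    if start_within in TIER_FORMS:
        # Assumption: a "gemini-" prefix identifies Gemini models.
        if start_within == "auto" and model.startswith("gemini-"):
            return "auto_unsupported_for_gemini"
        return None
    # Anything else is treated as a duration; format checks are out of scope.
    if model.startswith("claude-"):
        return "flex_unsupported_for_anthropic"
    if model not in flex_capable:
        return "model_not_flex_capable"
    return None
```

The model names you pass in are your own; the sketch only encodes the rules stated in the text.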

The flex race

When you give a duration, we try to serve your request from the provider’s flex tier (OpenAI or Gemini), which is billed at half the standard rate but has no guaranteed capacity. We hold your response and watch the upstream:

Fire flex

We send your request to the flex tier and wait for the provider to confirm it is being fulfilled, all inside the time you set.

Commit to flex

If the provider starts the request in time, we commit. Your response streams back as normal, billed at flex rates. The response shows "service_tier": "flex".

Fall back to standard

If flex can’t start in time, we cancel it and send the request to your standard tier so you still get an answer. This runs the opposite way from the fallback you are used to. Most tools fall back down to a cheaper model, but here flex is already the cheaper path, so we fall back up to your standard tier. That standard tier is your safety net, not a downgrade. The response shows "service_tier": "default" (OpenAI) or "standard" (Gemini).

Fallback to standard happens when, inside the time you set, flex returns a 429 (no capacity), any 5xx, a pre-start failure, or simply doesn’t start. The flex-to-standard switch is the only tier change FlexInference makes.
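The decision at the heart of the race can be sketched as a small pure function. This is an illustration of the rules above, not FlexInference’s implementation: the FlexAttempt type and the decide function are hypothetical, and treating a non-429 4xx from the flex attempt as a pass-through is an assumption drawn from the pass-through rule later on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlexAttempt:
    started: bool                       # provider confirmed the request is being fulfilled
    elapsed_s: float                    # seconds spent waiting for that confirmation
    error_status: Optional[int] = None  # HTTP error seen before start, if any


def decide(attempt: FlexAttempt, window_s: float) -> str:
    """Pick the outcome of one flex attempt for a given start_within window."""
    if attempt.started and attempt.elapsed_s <= window_s:
        return "commit_flex"
    status = attempt.error_status
    if status is not None and status != 429 and status < 500:
        # Assumption: other client errors (400, 401, ...) reach the caller as-is.
        return "pass_through"
    # 429, any 5xx, a pre-start failure, or no start inside the window.
    return "fall_back_to_standard"
```

For example, a 429 after one second of a 30 second window gives "fall_back_to_standard", while a start confirmed after three seconds gives "commit_flex".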

Pick a duration as long as your UX can tolerate. A longer wait gives flex more room to win, which means more requests served at the cheaper rate.

When the flex pool is unavailable

The flex tier is shared, best-effort capacity, so it can miss your window in a few ways: it can return a 429 when there is no capacity, return any 5xx, fail before it starts, or simply not start in time. When that happens we send the request to standard for you, and you still get an answer.

In rare cases the flex pool fails after the request has been accepted and has started streaming. We show that failure to you instead of hiding it: a streaming request gets a terminal response.failed event, and a non-streaming request gets a 502. We never retry it silently, and we never report a request as done when it was not. If you hit this, retry, or use default, priority, or auto to skip the flex pool entirely.
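On the client side, a post-start failure on a streaming request might be handled along these lines. The sketch assumes the stream has already been decoded into dictionaries with a "type" key, and that text arrives in "response.output_text.delta" events with a "delta" field, as in OpenAI’s Responses-style streaming; only the response.failed event name comes from this page. The exception and function names are hypothetical.

```python
from typing import Iterable, List, Mapping


class FlexPostStartFailure(Exception):
    """Raised when a stream ends with a terminal response.failed event."""


def collect_text(events: Iterable[Mapping]) -> str:
    """Join streamed text deltas, failing loudly on response.failed."""
    parts: List[str] = []
    for event in events:
        kind = event.get("type")
        if kind == "response.failed":
            # Partial output must not be treated as a finished answer.
            raise FlexPostStartFailure("flex failed after streaming started")
        if kind == "response.output_text.delta":
            parts.append(event.get("delta", ""))
    return "".join(parts)
```

A caller could catch FlexPostStartFailure, discard any partial text, and resend the request with "start_within": "default" to skip the flex pool.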

Billing

You bring your own key, so your provider bills your account at the rate of whichever tier served the request. You pay flex rates when flex commits and standard rates when it falls back. Sometimes a flex attempt burns a few tokens before it fails. Your provider bills that usage the same way it would if you called flex yourself, and we show it in your usage so the cost is never hidden. FlexInference adds no markup on your tokens. We charge 20% of the money we save you, and we charge nothing when we save you nothing.
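A worked example may make the fee easier to check. The sketch below assumes that savings are measured as the standard-rate cost of the request minus its flex-rate cost, uses the half-price flex rate stated earlier on this page, and ignores any tokens burned by a failed flex attempt. The function name is hypothetical, and the exact way FlexInference computes savings may differ.

```python
FLEX_DISCOUNT = 0.5   # flex is billed at half the standard rate
FEE_SHARE = 0.20      # FlexInference charges 20% of the savings


def flexinference_fee(standard_cost: float, served_tier: str) -> float:
    """Estimate the fee for one request, given its cost at standard rates."""
    if served_tier != "flex":
        # Fallback to standard saves nothing, so nothing is charged.
        return 0.0
    flex_cost = standard_cost * FLEX_DISCOUNT
    savings = standard_cost - flex_cost
    return FEE_SHARE * savings
```

Under these assumptions, a request that would cost $10.00 at standard rates costs $5.00 on flex, saving $5.00, so the fee is $1.00 and the net saving is $4.00. The same request served by standard after a fallback carries no fee.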

What passes through unchanged

Every other status code and response body from your provider passes through untouched. A 400 for a bad parameter, a 401 for a bad key, or a rate limit on the standard attempt all reach you unchanged. You see exactly what your provider returned. FlexInference only changes the tier, and it never changes the contract.

start_within is required. If you leave it out, the request returns 400 missing_start_within, and this happens even with a plain OpenAI SDK pointed at our base URL. If you send a value we cannot read, you get invalid_start_within. A valid value looks like "default", "priority", "auto", or a duration such as "00h-00m-30s". The bare word standard used to be allowed and no longer is, so send default instead. See Errors for the full list.
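A client can catch most of these mistakes before sending anything. The check below is a sketch, not the server’s validator: the function name is hypothetical, and it assumes that out-of-range durations and minute or second fields above 59 are reported as invalid_start_within, which this page does not state explicitly.

```python
import re
from typing import Optional

_DURATION = re.compile(r"(\d{2})h-(\d{2})m-(\d{2})s")


def check_start_within(value: object) -> Optional[str]:
    """Return None for an acceptable value, else the likely error code."""
    if value is None:
        return "missing_start_within"
    if value in ("default", "priority", "auto"):
        return None
    match = _DURATION.fullmatch(value) if isinstance(value, str) else None
    if match is None:
        # Covers the retired bare word "standard" and any other unreadable value.
        return "invalid_start_within"
    hours, minutes, seconds = (int(group) for group in match.groups())
    total = hours * 3600 + minutes * 60 + seconds
    # Assumption: fields above 59 and totals outside 5 s to 10 min are invalid.
    if minutes > 59 or seconds > 59 or not 5 <= total <= 600:
        return "invalid_start_within"
    return None
```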
