start_within, to tell us how long you can wait. A flex request costs half the standard rate for the same model and the same answer. FlexInference tries the cheaper flex tier first. If flex cannot start in the time you gave, FlexInference moves the request up to your normal standard tier so it still completes.
FlexInference does not sell inference. You bring your own provider key (BYOK) - OpenAI, Gemini, or Anthropic - and the provider bills your account directly at whatever tier served the request. We add cost routing on top that respects how long you can wait.
Why it exists
OpenAI sells the same models at different prices depending on how fast you need them. The flex (batch) tier runs at half the standard rate, but capacity is not guaranteed. Most apps do not have the time to manage that trade-off per request. We measured this across more than 10,000 real requests. Cost came down about 47 percent, token weighted, while p50 latency stayed about the same, up around 4 percent. You save real money and your users do not feel a slower app.start_within is how long you are willing to wait before the request starts running. You are not setting a hard project deadline. You are telling us your latency budget. Give it a short duration when a user is waiting on the answer, and a longer one when nothing is waiting and you would rather pay less.
With FlexInference you just say how long you can wait, and it does the rest:
Need it now
start_within: "priority" routes to OpenAI’s priority tier.Can wait a bit
start_within: "00h-00m-30s" tries the cheaper flex tier first. If flex cannot start within 30 seconds, FlexInference moves the request up to your standard tier so it still runs.Default
start_within: "default" uses OpenAI’s default tier and pricing. ("auto" lets OpenAI pick.)Drop-in compatible
Point the base URL at FlexInference and add the requiredstart_within field. Everything else works the same, including streaming, tool calling, structured outputs, vision, and reasoning. You get the same behavior you would get calling OpenAI, Gemini, or Anthropic directly.
Quickstart
Go from zero to your first request.
Deadline routing
How
start_within works, and when fallback kicks in.Authentication
FlexInference API keys, plus your own OpenAI, Gemini, or Anthropic key.
Supported models
The models that support flex routing.