Prerequisites
- An OpenAI API key with available credit. FlexInference uses your key (BYOK). OpenAI bills you directly.
- A terminal with
curl, or Python 3.9+, or Node.js 18+.
Get started
Create a FlexInference account
Go to the dashboard and sign in. Your account is its own organization. It owns your API keys and tracks your usage, so your keys and your bill stay separate from everyone else’s.
Create an API key
In the dashboard, create a key. It looks like
flex_live_... and you only see it once, so copy it somewhere safe right away. This is the key you send to FlexInference, not your OpenAI key.Add your OpenAI key (BYOK)
Paste your OpenAI key into the dashboard. It is encrypted at rest and used only to make requests on your behalf. See Authentication for how this works.
Make your first request
Point your client at
https://api.flexinference.com/v1 and add start_within. Here 00h-00m-30s means you are willing to wait up to 30 seconds for the request to start. FlexInference tries OpenAI’s flex tier first, because it costs less but can sit in a queue. If the flex tier will not start inside your 30 seconds, FlexInference runs your normal standard tier instead, so the request still completes. Standard is the full-price tier that always starts right away, so you never lose the answer.Here are two words to know. Flex is the cheaper tier, and it can wait in a queue before it starts. Standard is your normal full-price tier, and it starts right away. Most tools fall back to a cheaper model when something breaks. FlexInference does the opposite. It starts on the cheap flex tier and moves up to your standard tier only when flex would miss your time limit, so a slow flex tier never costs you the answer.start_within is the longest you are willing to wait for the request to start. You write it as a duration like 00h-00m-30s for 30 seconds. A bigger value gives the cheap flex tier more room to start, which saves you money. A smaller value moves you to the standard tier sooner. The Claude example passes default instead of a duration, because Anthropic has no flex tier to race. That sends the request straight to the standard tier. The value bounds when the work starts, not how long the model takes to answer. See the routing page for the full list of values.A
200 response with "service_tier": "flex" means the cheaper flex tier started in time, so you saved money on this call. "service_tier": "default" means the flex tier was too slow, so FlexInference ran your normal standard tier to finish inside your time limit. Either way you got your answer on time.If your first request fails, read the error body. Every FlexInference error tells you what went wrong, why it happened, how to fix it, and shows a working example. If you forgot to add your OpenAI key in the dashboard, the error says exactly that and points you to the page to fix it. Errors that come straight from the provider pass through with their original status and body, so nothing is hidden from you. See the errors page for the full list.
What to try next
Set your time limit
Learn the
start_within values and when FlexInference moves you from the flex tier up to your standard tier.Stream, tools, vision
Streaming, tools, and vision work unchanged across OpenAI, Gemini, and Anthropic.