SDKs and API usage - FlexInference

FlexInference speaks the OpenAI API. There is no new SDK to learn. Point the official OpenAI SDK (or curl) at our base URL, send your flex_live_ key, and add the required start_within field. Every request needs it. Omit it and you get 400 missing_start_within.

Base URL   https://api.flexinference.com/v1
Auth       Authorization: Bearer flex_live_...

Four endpoints are supported, and translated faithfully in both directions:

POST /v1/responses: the Responses API
POST /v1/chat/completions: the Chat Completions API
POST /v1/interactions: the Interactions API (works with OpenAI, Gemini, and Anthropic models)
POST /v1/messages: the Anthropic Messages API (works with any model)

Configure the client

from openai import OpenAI

client = OpenAI(
    base_url="https://api.flexinference.com/v1",
    api_key="flex_live_...",
)
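
If you are not using the SDK, the same configuration can be expressed as a raw HTTP request. The sketch below uses only the Python standard library and builds the request without sending it. The base URL, the Bearer header, and the top-level start_within field come from this page; the build_request helper name and the Content-Type header are assumptions made for illustration.

```python
import json
import urllib.request

BASE_URL = "https://api.flexinference.com/v1"


def build_request(path: str, body: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a FlexInference endpoint.

    Hypothetical helper: it only mirrors the base URL and auth header
    documented above. Content-Type is assumed to be JSON.
    """
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "/responses",
    {
        "model": "gpt-5.5",
        "input": "Summarize this contract.",
        # start_within is a plain top-level field in the JSON body.
        "start_within": "00h-01m-00s",
    },
    api_key="flex_live_...",
)
# To actually send it: urllib.request.urlopen(req)
```

The same helper should work for the other three endpoints by changing the path and body.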

Passing `start_within`

The start_within field is a normal request field, and every request needs it. It sets how long you are willing to wait for the request to start returning. You write it as a duration, like 00h-01m-00s for one minute. The OpenAI SDKs do not have a typed field for it, so you pass it through their escape hatch: extra_body in Python, or a plain field on the request object in Node.

Here is what flex does on every request. FlexInference tries a cheaper flex tier first, inside the time budget you set with start_within. If the flex tier cannot finish in time, FlexInference runs your standard model instead, so the request still completes on the model you trust. You never get a worse answer than your standard model would give.

resp = client.responses.create(
    model="gpt-5.5",
    input="Summarize this contract.",
    extra_body={"start_within": "00h-01m-00s"},
)

See Deadline routing for the full set of values.
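
If you compute the budget in code, a small formatter avoids hand-writing the duration string. This is a minimal sketch that assumes every field is zero-padded to two digits, as in the 00h-01m-00s example; format_start_within is a hypothetical helper, and the upper limit on the budget is not checked here.

```python
from datetime import timedelta


def format_start_within(budget: timedelta) -> str:
    """Render a timedelta in the HHh-MMm-SSs shape used by start_within.

    Assumption: two-digit, zero-padded fields. Fractions of a second are dropped.
    """
    total = int(budget.total_seconds())
    if total < 0:
        raise ValueError("start_within budget cannot be negative")
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}h-{minutes:02d}m-{seconds:02d}s"


one_minute = format_start_within(timedelta(minutes=1))
twenty_seconds = format_start_within(timedelta(seconds=20))
```

Pass the result through extra_body, for example extra_body={"start_within": one_minute}.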

Streaming

Set stream: true and read events as they arrive, the same way you do with OpenAI. Streaming does not change how flex works. FlexInference picks a tier first, and you start getting tokens once that tier begins to answer.

stream = client.responses.create(
    model="gpt-5-nano",
    input="Count to ten.",
    stream=True,
    extra_body={"start_within": "00h-00m-20s"},
)
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="")
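
If you want the full text after the stream ends instead of printing each delta, you can accumulate them. The sketch below runs the same loop against stand-in event objects so it works without a network call; collect_text and fake_events are illustrative names, not part of any SDK.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_text(events: Iterable) -> str:
    """Join the text deltas from a Responses-style event stream."""
    parts = []
    for event in events:
        # Only text delta events carry output text; other event types are skipped.
        if event.type == "response.output_text.delta":
            parts.append(event.delta)
    return "".join(parts)


# Stand-in events for illustration; a real stream comes from client.responses.create.
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="1, 2, "),
    SimpleNamespace(type="response.output_text.delta", delta="3"),
    SimpleNamespace(type="response.completed"),
]
text = collect_text(fake_events)
```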

Chat Completions

The Chat Completions endpoint works the same way. start_within applies identically.

resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"start_within": "default"},
)
print(resp.choices[0].message.content)

Interactions

The Interactions endpoint is a third caller format that works with both OpenAI and Gemini models. start_within applies identically. Send input as a string, a content-array of parts, or a multi-turn step list; system_instruction, tools, and response_format map to the same upstream features.

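import httpx

# Assumes the OpenAI SDK has no typed `interactions` resource, so this uses
# its generic `client.post` helper for endpoints the SDK does not model.
resp = client.post(
    "/interactions",
    cast_to=httpx.Response,
    body={"model": "gemini-3.5-flash", "input": "Summarize this contract.",
          "start_within": "00h-01m-00s"},
).json()
print(resp["steps"][0]["content"][0]["text"])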

The response is an interaction object. Your output lives in steps. The usage object splits the token counts across input, output, thought, cached, tool-use, and total.

{
  "id": "interaction_...",
  "object": "interaction",
  "status": "completed",
  "model": "gemini-3.5-flash",
  "created": 1750000000,
  "updated": 1750000000,
  "service_tier": "flex",
  "usage": {
    "total_input_tokens": 1200,
    "total_output_tokens": 180,
    "total_thought_tokens": 64,
    "total_cached_tokens": 0,
    "total_tokens": 1444,
    "total_tool_use_tokens": 0
  },
  "steps": [
    { "type": "model_output", "content": [{ "type": "text", "text": "..." }] }
  ]
}

status is requires_action when a step is a function_call, incomplete when output is truncated, failed on error, and completed otherwise. generation_config.thinking_level maps to reasoning.effort.

seed, stop_sequences, and a caller-supplied service_tier are refused. Every FlexInference error tells you what is wrong, why, and how to fix it, and it shows an example. Send a service_tier yourself and you get service_tier_not_allowed, because FlexInference owns tier selection. The fix is to drop the field and let flex choose. See Errors for the full list with example bodies.
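
The status values above can be handled with a small reader. This is a sketch that treats the interaction as a plain dict parsed from the JSON response; read_interaction is a hypothetical helper, and raising RuntimeError for failed and requires_action is just one possible policy.

```python
def read_interaction(interaction: dict) -> tuple[str, bool]:
    """Return (text, truncated) for a parsed interaction object.

    Assumption: text lives in model_output steps as parts of type "text",
    as in the example response above.
    """
    status = interaction["status"]
    if status == "failed":
        raise RuntimeError("interaction failed")
    if status == "requires_action":
        raise RuntimeError("interaction is waiting on a function_call step")
    text = "".join(
        part["text"]
        for step in interaction["steps"]
        if step["type"] == "model_output"
        for part in step["content"]
        if part["type"] == "text"
    )
    # "incomplete" means the output was truncated; the partial text is still returned.
    return text, status == "incomplete"


sample = {
    "status": "completed",
    "service_tier": "flex",
    "steps": [
        {"type": "model_output", "content": [{"type": "text", "text": "A short summary."}]}
    ],
}
text, truncated = read_interaction(sample)
```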

Messages

The Messages endpoint is a fourth caller format. It speaks the Anthropic Messages shape for both the request and the response, and it works with any model, because FlexInference translates it to and from the canonical form. start_within applies identically and is required. Anthropic requires max_tokens, so it is required here too; omit it and you get 400 missing_max_tokens. Send messages as {role, content} turns, with content a string or an array of content blocks; system, tools, tool_choice, and thinking map to the same upstream features.

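import httpx
# Assumes the OpenAI SDK has no typed `messages` resource, so this uses its
# generic `client.post` helper for endpoints the SDK does not model.
resp = client.post(
    "/messages",
    cast_to=httpx.Response,
    body={
        "model": "claude-opus-4-8",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Summarize this contract."}],
        "start_within": "default",
    },
).json()
print(resp["content"][0]["text"])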

The response is an Anthropic message object. Your output lives in content as text, thinking, and tool_use blocks. The usage.service_tier field tells you which tier actually served the request, flex or standard, and that field is the one to trust.

{
  "id": "msg_...",
  "type": "message",
  "role": "assistant",
  "model": "claude-opus-4-8",
  "content": [{ "type": "text", "text": "..." }],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 1200,
    "cache_read_input_tokens": 0,
    "cache_creation_input_tokens": 0,
    "output_tokens": 180,
    "output_tokens_details": { "thinking_tokens": 64 },
    "service_tier": "standard"
  }
}

stop_reason is end_turn on a normal finish, max_tokens when truncated, tool_use when the model calls a tool, and refusal when it declines. thinking maps to reasoning.effort per model. top_k, stop_sequences, cache_control blocks, and document/file blocks with citations are refused with unsupported_parameter. A caller-supplied service_tier is refused with service_tier_not_allowed. See Errors.
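
To act on stop_reason and the serving tier together, you can flatten a message into a small summary. The sketch below works on the response parsed as a plain dict; summarize_message and its output keys are illustrative names, not part of the API.

```python
def summarize_message(message: dict) -> dict:
    """Pull the commonly needed fields out of an Anthropic-shaped message dict."""
    blocks = message["content"]
    return {
        # Concatenate text blocks; thinking and tool_use blocks are kept out of the text.
        "text": "".join(b["text"] for b in blocks if b["type"] == "text"),
        "tool_calls": [b for b in blocks if b["type"] == "tool_use"],
        "truncated": message["stop_reason"] == "max_tokens",
        "refused": message["stop_reason"] == "refusal",
        # usage.service_tier reports which tier actually served the request.
        "served_by": message["usage"]["service_tier"],
    }


sample = {
    "type": "message",
    "role": "assistant",
    "content": [{"type": "text", "text": "A short summary."}],
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 1200, "output_tokens": 180, "service_tier": "standard"},
}
summary = summarize_message(sample)
```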

Everything else works unchanged

Tool calling, structured outputs (response_format), vision, and reasoning all pass straight through to the provider you picked. Use them exactly as you would when you call that provider directly.

A few Chat Completions parameters have no match on the Responses models FlexInference routes, so FlexInference rejects them up front with unsupported_parameter: presence_penalty, frequency_penalty, logit_bias, logprobs, top_logprobs, seed, stop, prediction, audio, non-text modalities, web_search_options, and n > 1. FlexInference only rejects them when you set them to a real value. Leaving them at their defaults is fine. For web search, use a Responses web_search tool.
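
A client-side pre-flight check can catch these parameters before the request is sent. This is a sketch under one loud assumption: a "real value" is taken to mean anything other than None or the parameter's usual OpenAI default (0 for the penalties, False for logprobs). FlexInference's exact rule may differ, and find_unsupported is a hypothetical helper.

```python
# Assumed defaults; a parameter is flagged only when set to something else.
DEFAULTS = {
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "logit_bias": None,
    "logprobs": False,
    "top_logprobs": None,
    "seed": None,
    "stop": None,
    "prediction": None,
    "audio": None,
    "web_search_options": None,
}


def find_unsupported(params: dict) -> list[str]:
    """List Chat Completions parameters likely to be rejected with unsupported_parameter."""
    flagged = [
        name
        for name, default in DEFAULTS.items()
        if params.get(name) is not None and params[name] != default
    ]
    if (params.get("n") or 1) > 1:
        flagged.append("n")
    # Non-text modalities are rejected; ["text"] alone is treated as fine.
    if any(m != "text" for m in params.get("modalities") or []):
        flagged.append("modalities")
    return flagged


flagged = find_unsupported(
    {"model": "gpt-5.5", "presence_penalty": 0, "seed": 42, "n": 2}
)
```

Here flagged contains seed and n, while presence_penalty is left alone because it sits at its default.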
