
# SDKs and API usage

> Use the OpenAI SDK you already have. Change the base URL and add `start_within`.

FlexInference speaks the OpenAI API. There is no new SDK to learn. Point the official OpenAI SDK (or `curl`) at our base URL, send your `flex_live_` key, and add the required `start_within` field. Every request needs it. Omit it and you get `400 missing_start_within`.

```
Base URL   https://api.flexinference.com/v1
Auth       Authorization: Bearer flex_live_...
```

Four endpoints are supported, and translated faithfully in both directions:

* `POST /v1/responses`: the Responses API
* `POST /v1/chat/completions`: the Chat Completions API
* `POST /v1/interactions`: the Interactions API (works with OpenAI, Gemini, and Anthropic models)
* `POST /v1/messages`: the Anthropic Messages API (works with **any** model)

## Configure the client

<CodeGroup>
  ```python Python theme={null}
  from openai import OpenAI

  client = OpenAI(
      base_url="https://api.flexinference.com/v1",
      api_key="flex_live_...",
  )
  ```

  ```typescript Node theme={null}
  import OpenAI from "openai";

  const client = new OpenAI({
    baseURL: "https://api.flexinference.com/v1",
    apiKey: "flex_live_...",
  });
  ```

  ```bash curl theme={null}
  export FLEX_API_KEY="flex_live_..."
  export FLEX_BASE_URL="https://api.flexinference.com/v1"
  ```
</CodeGroup>
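
If you would rather not depend on an SDK, the same configuration can be sketched with the Python standard library alone. The example below assumes only what this page states (the base URL, the bearer key, the four endpoint paths, and the `start_within` requirement). `build_flex_request` is a hypothetical helper name, not part of any SDK, and it builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://api.flexinference.com/v1"

# The four endpoint paths listed on this page, relative to the base URL.
ENDPOINTS = ("/responses", "/chat/completions", "/interactions", "/messages")


def build_flex_request(endpoint: str, body: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for one of the supported endpoints.

    Hypothetical helper for illustration; not part of any SDK.
    """
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unsupported endpoint: {endpoint}")
    if "start_within" not in body:
        # The API would answer 400 missing_start_within, so fail early instead.
        raise ValueError("start_within is required on every request")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_flex_request(
    "/responses",
    {"model": "gpt-5.5", "input": "Summarize this contract.", "start_within": "00h-01m-00s"},
    api_key="flex_live_...",
)
print(req.get_method(), req.full_url)
```

With a real key in place, `urllib.request.urlopen(req)` would send it. Checking for `start_within` on the client side is a design choice that saves a round trip; the server remains the source of truth.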

## Passing `start_within`

The `start_within` field is a normal request field, and every request needs it. It sets how long you are willing to wait for the request to start returning, written as a duration such as `00h-01m-00s` for one minute. The OpenAI SDKs have no typed parameter for it, so you pass it through their escape hatch: `extra_body` in Python, and a plain field on the request object in Node (cast to `any` so TypeScript accepts it).

Here is what flex does on every request. FlexInference tries a cheaper flex tier first, inside the time budget you set with `start_within`. If the flex tier cannot start answering within that budget, FlexInference runs your standard model instead, so the request still completes on the model you trust. You never get a worse answer than your standard model would give.

<CodeGroup>
  ```python Python theme={null}
  resp = client.responses.create(
      model="gpt-5.5",
      input="Summarize this contract.",
      extra_body={"start_within": "00h-01m-00s"},
  )
  ```

  ```typescript Node theme={null}
  const resp = await client.responses.create({
    model: "gpt-5.5",
    input: "Summarize this contract.",
    start_within: "00h-01m-00s",
  } as any);
  ```

  ```bash curl theme={null}
  curl "$FLEX_BASE_URL/responses" \
    -H "Authorization: Bearer $FLEX_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-5.5",
      "input": "Summarize this contract.",
      "start_within": "00h-01m-00s"
    }'
  ```
</CodeGroup>

See [Deadline routing](/deadline-routing) for the full set of values.
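
If your code tracks time budgets as seconds, a small formatter can produce the duration string. This is a minimal sketch: `format_start_within` is a hypothetical helper, the `HHh-MMm-SSs` layout is inferred from the examples on this page, and the 600 second ceiling comes from the 10 minute maximum mentioned under Timeouts. Named values such as `default` are not covered.

```python
# Assumption: 10 minutes is the longest allowed budget, as stated under Timeouts.
MAX_START_WITHIN_SECONDS = 10 * 60


def format_start_within(total_seconds: int) -> str:
    """Render whole seconds in the `00h-01m-00s` style used on this page."""
    if total_seconds < 0 or total_seconds > MAX_START_WITHIN_SECONDS:
        raise ValueError("start_within must be between 0 and 600 seconds")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}h-{minutes:02d}m-{seconds:02d}s"


print(format_start_within(60))  # 00h-01m-00s
print(format_start_within(20))  # 00h-00m-20s
```

The result goes straight into `extra_body={"start_within": ...}`. Whether very small budgets are accepted is not stated here, so check Deadline routing before relying on the lower bound.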

## Streaming

Set `stream: true` and read events as they arrive, the same way you do with OpenAI. Streaming does not change how flex works. FlexInference picks a tier first, and you start getting tokens once that tier begins to answer.

<CodeGroup>
  ```python Python theme={null}
  stream = client.responses.create(
      model="gpt-5-nano",
      input="Count to ten.",
      stream=True,
      extra_body={"start_within": "00h-00m-20s"},
  )
  for event in stream:
      if event.type == "response.output_text.delta":
          print(event.delta, end="")
  ```

  ```typescript Node theme={null}
  const stream = await client.responses.create({
    model: "gpt-5-nano",
    input: "Count to ten.",
    stream: true,
    start_within: "00h-00m-20s",
  } as any);

  for await (const event of stream) {
    if (event.type === "response.output_text.delta") process.stdout.write(event.delta);
  }
  ```

  ```bash curl theme={null}
  curl "$FLEX_BASE_URL/responses" \
    -H "Authorization: Bearer $FLEX_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-5-nano",
      "input": "Count to ten.",
      "stream": true,
      "start_within": "00h-00m-20s"
    }'
  ```
</CodeGroup>
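
When you want the whole answer as one string rather than printing deltas as they arrive, the loop above can be wrapped in a small collector. This is a sketch: `collect_output_text` is a hypothetical helper, and the stand-in events below only mimic the `type` and `delta` attributes used in the Python tab so the example runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_output_text(events: Iterable) -> str:
    """Join the text deltas from a stream of Responses events into one string."""
    parts = []
    for event in events:
        # Events of other types carry no text delta and are skipped.
        if getattr(event, "type", None) == "response.output_text.delta":
            parts.append(event.delta)
    return "".join(parts)


# Stand-in events; a real call would pass the `stream` object instead.
fake_stream = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="1, 2, "),
    SimpleNamespace(type="response.output_text.delta", delta="3"),
    SimpleNamespace(type="response.completed"),
]
print(collect_output_text(fake_stream))  # 1, 2, 3
```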

## Timeouts

`start_within` is a promise about when your request starts, not when it finishes. Once the cheaper flex tier begins to answer, it runs to the end at its own pace, and FlexInference never upgrades a committed request to standard for being slow. FlexInference also holds the HTTP response until a tier commits, so your client is waiting for that whole window. Set your client timeout to at least your `start_within` plus the time the model needs to answer.

The OpenAI SDK defaults to a 10 minute timeout. That covers typical `start_within` values plus a normal generation. Because the longest allowed `start_within` is also 10 minutes, a budget close to that maximum leaves little headroom under the default, so consider raising the timeout if you use one. If you set a shorter timeout, a long `start_within` can trip it, and you get a client-side cancel instead of a result. See [`client_closed_request`](/errors#client_closed_request).

<CodeGroup>
  ```python Python theme={null}
  client = OpenAI(
      base_url="https://api.flexinference.com/v1",
      api_key="flex_live_...",
      timeout=600,  # seconds; at least start_within plus the expected generation time
  )
  ```

  ```typescript Node theme={null}
  const client = new OpenAI({
    baseURL: "https://api.flexinference.com/v1",
    apiKey: "flex_live_...",
    timeout: 600_000, // ms; at least start_within plus the expected generation time
  });
  ```
</CodeGroup>

The FlexInference native SDKs ([Python](https://pypi.org/project/flexinference/), [TypeScript](https://www.npmjs.com/package/flexinference)) handle this for you. They size the wait for the first response from your `start_within` automatically, cap a stalled stream with an idle timeout, and keep no total cap on a stream so a long answer is never cut off. A non-streaming call keeps a total budget. Each timeout is configurable on the client and per request.
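
If you stay on the OpenAI SDK, the rule "timeout is at least `start_within` plus generation time" can be computed rather than guessed. The sketch below assumes the `HHh-MMm-SSs` layout shown in the examples on this page. Both function names are hypothetical, and named values such as `default` are deliberately rejected because their length is not stated here.

```python
import re

# Matches durations written like 00h-01m-00s.
_DURATION = re.compile(r"^(\d{2})h-(\d{2})m-(\d{2})s$")


def start_within_seconds(value: str) -> int:
    """Convert an HHh-MMm-SSs duration string to whole seconds."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"not an HHh-MMm-SSs duration: {value!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def client_timeout_seconds(start_within: str, generation_budget_s: int) -> int:
    """Client timeout = time allowed to start + time allowed to generate."""
    return start_within_seconds(start_within) + generation_budget_s


print(client_timeout_seconds("00h-01m-00s", 120))  # 180
print(client_timeout_seconds("00h-00m-20s", 45))   # 65
```

The result can be passed as `timeout=` when constructing the Python client. The generation budget is your own estimate for the model and prompt size.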

## Chat Completions

The Chat Completions endpoint works the same way. `start_within` applies identically.

<CodeGroup>
  ```python Python theme={null}
  resp = client.chat.completions.create(
      model="gpt-5.5",
      messages=[{"role": "user", "content": "Hello!"}],
      extra_body={"start_within": "default"},
  )
  print(resp.choices[0].message.content)
  ```

  ```typescript Node theme={null}
  const resp = await client.chat.completions.create({
    model: "gpt-5.5",
    messages: [{ role: "user", content: "Hello!" }],
    start_within: "default",
  } as any);

  console.log(resp.choices[0].message.content);
  ```
</CodeGroup>

## Interactions

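The Interactions endpoint is a third caller format that works with **both** OpenAI and Gemini models. `start_within` applies identically. Send `input` as a string, a content array of parts, or a multi-turn step list. `system_instruction`, `tools`, and `response_format` map to the same upstream features. The SDK tabs below assume a client that exposes an `interactions` resource; if yours does not, send the same JSON body as a plain HTTP POST, as the curl tab does.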

<CodeGroup>
  ```python Python theme={null}
  resp = client.interactions.create(
      model="gemini-3.5-flash",
      input="Summarize this contract.",
      extra_body={"start_within": "00h-01m-00s"},
  )
  print(resp.steps[0].content[0].text)
  ```

  ```typescript Node theme={null}
  const resp = await client.interactions.create({
    model: "gemini-3.5-flash",
    input: "Summarize this contract.",
    start_within: "00h-01m-00s",
  } as any);

  console.log(resp.steps[0].content[0].text);
  ```

  ```bash curl theme={null}
  curl "$FLEX_BASE_URL/interactions" \
    -H "Authorization: Bearer $FLEX_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gemini-3.5-flash",
      "input": "Summarize this contract.",
      "start_within": "00h-01m-00s"
    }'
  ```
</CodeGroup>

The response is an `interaction` object. Your output lives in `steps`. The `usage` object splits the token counts across input, output, thought, cached, tool-use, and total.

```json theme={null}
{
  "id": "interaction_...",
  "object": "interaction",
  "status": "completed",
  "model": "gemini-3.5-flash",
  "created": 1750000000,
  "updated": 1750000000,
  "service_tier": "flex",
  "usage": {
    "total_input_tokens": 1200,
    "total_output_tokens": 180,
    "total_thought_tokens": 64,
    "total_cached_tokens": 0,
    "total_tokens": 1444,
    "total_tool_use_tokens": 0
  },
  "steps": [
    { "type": "model_output", "content": [{ "type": "text", "text": "..." }] }
  ]
}
```

`status` is `requires_action` when a step is a `function_call`, `incomplete` when output is truncated, `failed` on error, and `completed` otherwise. `generation_config.thinking_level` maps to `reasoning.effort`. `seed`, `stop_sequences`, and a caller-supplied `service_tier` are refused; see [Errors](/errors).

Every FlexInference error tells you what is wrong, why, and how to fix it, and includes an example. Send a `service_tier` yourself and you get `service_tier_not_allowed`, because FlexInference owns tier selection. The fix is to drop the field and let flex choose. See [Errors](/errors) for the full list with example bodies.
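
Reading an `interaction` object, as described above, can be sketched as follows. The helper works on the parsed JSON body as a plain `dict`. `read_interaction` is a hypothetical name, and the sample payload is a trimmed copy of the response shape shown on this page with placeholder text filled in.

```python
def read_interaction(interaction: dict) -> tuple:
    """Return (status, text) where text joins every text part of model_output steps."""
    chunks = []
    for step in interaction.get("steps", []):
        if step.get("type") != "model_output":
            continue  # e.g. function_call steps carry no answer text
        for part in step.get("content", []):
            if part.get("type") == "text":
                chunks.append(part["text"])
    return interaction.get("status"), "".join(chunks)


sample = {
    "object": "interaction",
    "status": "completed",
    "service_tier": "flex",
    "steps": [
        {"type": "model_output", "content": [{"type": "text", "text": "A short summary."}]}
    ],
}

status, text = read_interaction(sample)
if status == "completed":
    print(text)  # A short summary.
```

Returning the status alongside the text lets the caller decide what to do with `incomplete` (truncated output) or `requires_action` (a pending function call) rather than hiding those cases.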

## Messages

The Messages endpoint is a fourth caller format. It speaks the **Anthropic Messages** shape for both the request and the response, and it works with **any** model, because FlexInference translates it to and from the canonical form. `start_within` applies identically and is required. Anthropic requires `max_tokens`, so it is required here too; omit it and you get `400 missing_max_tokens`. Send `messages` as `{role, content}` turns, with `content` a string or an array of content blocks. `system`, `tools`, `tool_choice`, and `thinking` map to the same upstream features. The SDK tabs below assume a client that exposes a `messages` resource; if yours does not, send the same JSON body as a plain HTTP POST, as the curl tab does.

<CodeGroup>
  ```python Python theme={null}
  resp = client.messages.create(
      model="claude-opus-4-8",
      max_tokens=1024,
      messages=[{"role": "user", "content": "Summarize this contract."}],
      extra_body={"start_within": "default"},
  )
  print(resp.content[0].text)
  ```

  ```typescript Node theme={null}
  const resp = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Summarize this contract." }],
    start_within: "default",
  } as any);

  console.log(resp.content[0].text);
  ```

  ```bash curl theme={null}
  curl "$FLEX_BASE_URL/messages" \
    -H "Authorization: Bearer $FLEX_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "claude-opus-4-8",
      "max_tokens": 1024,
      "messages": [{ "role": "user", "content": "Summarize this contract." }],
      "start_within": "default"
    }'
  ```
</CodeGroup>

The response is an Anthropic `message` object. Your output lives in `content` as `text`, `thinking`, and `tool_use` blocks. The `usage.service_tier` field tells you which tier actually served the request, flex or standard, and that field is the one to trust.

```json theme={null}
{
  "id": "msg_...",
  "type": "message",
  "role": "assistant",
  "model": "claude-opus-4-8",
  "content": [{ "type": "text", "text": "..." }],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 1200,
    "cache_read_input_tokens": 0,
    "cache_creation_input_tokens": 0,
    "output_tokens": 180,
    "output_tokens_details": { "thinking_tokens": 64 },
    "service_tier": "standard"
  }
}
```

`stop_reason` is `end_turn` on a normal finish, `max_tokens` when truncated, `tool_use` when the model calls a tool, and `refusal` when it declines. `thinking` maps to `reasoning.effort` per model. `top_k`, `stop_sequences`, `cache_control` blocks, and `document`/`file` blocks with `citations` are refused with `unsupported_parameter`. A caller-supplied `service_tier` is refused with `service_tier_not_allowed`. See [Errors](/errors).
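
The fields described above can be pulled out of a parsed `message` body with a few lines. This is a sketch over a plain `dict`: `read_message` is a hypothetical helper, and the sample mirrors the response shape shown on this page with placeholder text.

```python
def read_message(message: dict) -> dict:
    """Summarize an Anthropic-shaped message: text, tool calls, stop reason, serving tier."""
    content = message.get("content", [])
    stop_reason = message.get("stop_reason")
    return {
        # Only text blocks are joined; thinking blocks are left out of the answer.
        "text": "".join(b["text"] for b in content if b.get("type") == "text"),
        "tool_calls": [b for b in content if b.get("type") == "tool_use"],
        "stop_reason": stop_reason,
        "truncated": stop_reason == "max_tokens",
        # usage.service_tier reports which tier actually served the request.
        "served_tier": message.get("usage", {}).get("service_tier"),
    }


sample = {
    "type": "message",
    "role": "assistant",
    "content": [
        {"type": "thinking", "thinking": "..."},
        {"type": "text", "text": "Done."},
    ],
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 1200, "output_tokens": 180, "service_tier": "standard"},
}

summary = read_message(sample)
print(summary["text"], summary["served_tier"])  # Done. standard
```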

## Everything else works unchanged

Tool calling, structured outputs (`response_format`), vision, and reasoning all pass straight through to the provider you picked. Use them exactly as you would when you call that provider directly.

<Warning>
  A few Chat Completions parameters have no equivalent on the Responses models FlexInference routes, so FlexInference rejects them up front with `unsupported_parameter`: `presence_penalty`, `frequency_penalty`, `logit_bias`, `logprobs`, `top_logprobs`, `seed`, `stop`, `prediction`, `audio`, non-text `modalities`, `web_search_options`, and `n > 1`. FlexInference rejects them only when you set them to a real value. Leaving them at their defaults is fine. For web search, use a Responses `web_search` tool.
</Warning>
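
A client-side pre-flight check for the parameters in the warning above might look like the sketch below. It is an approximation, not the server's logic: `unsupported_chat_params` is a hypothetical helper, and "a real value" is assumed to mean anything other than unset (`None`) or the usual Chat Completions default (`0` for the penalties, `False` for `logprobs`, `1` for `n`, text-only `modalities`).

```python
# Assumed defaults; None means the parameter is unset by default.
REJECTED_DEFAULTS = {
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "logit_bias": None,
    "logprobs": False,
    "top_logprobs": None,
    "seed": None,
    "stop": None,
    "prediction": None,
    "audio": None,
    "web_search_options": None,
}


def unsupported_chat_params(params: dict) -> list:
    """List the parameter names that would likely be rejected as unsupported."""
    found = []
    for name, default in REJECTED_DEFAULTS.items():
        value = params.get(name)
        if value is not None and value != default:
            found.append(name)
    modalities = params.get("modalities")
    if modalities and any(m != "text" for m in modalities):
        found.append("modalities")
    n = params.get("n")
    if n is not None and n > 1:
        found.append("n")
    return found


print(unsupported_chat_params({"model": "gpt-5.5", "seed": 7, "presence_penalty": 0, "n": 1}))
# ['seed']
```

Running the check before the call turns a `400` into a local error you can log with the offending names. The server's response remains authoritative if the two ever disagree.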
