Replies: 4 comments 2 replies
The max_tokens parameter should work with AsyncOpenAI. Can you check a few things? First, verify that it is actually limiting the response: if max_tokens is working, the response's usage.completion_tokens should not exceed the cap, and finish_reason will be "length" when the cap cuts generation off. If that is not what you see, a couple of things might be happening. What model are you using, and what token count are you seeing in the response?
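A minimal sketch of that check (the demo() coroutine, model name, and prompt are illustrative assumptions; usage.completion_tokens and finish_reason are standard fields on chat.completions responses):

```python
import asyncio

def hit_limit(finish_reason: str, completion_tokens: int, cap: int) -> bool:
    """True if the response was cut off by the token cap ("length")
    rather than finishing naturally ("stop")."""
    return finish_reason == "length" and completion_tokens >= cap

async def demo() -> None:
    from openai import AsyncOpenAI  # requires the openai package and an API key

    client = AsyncOpenAI()
    cap = 50
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute the model you are using
        messages=[{"role": "user", "content": "Explain asyncio in detail."}],
        max_tokens=cap,
    )
    choice = response.choices[0]
    print(choice.finish_reason)              # "length" if the cap was hit
    print(response.usage.completion_tokens)  # should be <= cap
    print(hit_limit(choice.finish_reason, response.usage.completion_tokens, cap))

# asyncio.run(demo())
```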
Token limiting with AsyncOpenAI is straightforward. At RevolutionAI (https://revolutionai.io) we use this pattern:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=500,  # limits output tokens
)
```

Note that newer models expect max_completion_tokens instead of max_tokens, and the two should not be passed together in one request. For streaming, the limit still applies; you will just get fewer chunks, since the model stops generating when it hits the limit.
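To watch the cap being respected while streaming, one approach is a sketch like the following (stream_demo() and the model name are assumptions; stream_options={"include_usage": True} asks the API to attach token usage to the final chunk):

```python
import asyncio

def join_deltas(parts):
    """Join streamed delta contents, skipping the None values that
    appear on role-only or usage-only chunks."""
    return "".join(p for p in parts if p)

async def stream_demo() -> None:
    from openai import AsyncOpenAI  # requires the openai package and an API key

    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute the model you are using
        messages=[{"role": "user", "content": "Tell a long story."}],
        max_tokens=100,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries usage stats
    )
    parts = []
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
        if chunk.usage is not None:  # only set on the last chunk
            print(chunk.usage.completion_tokens)  # should be <= 100
    print(join_deltas(parts))

# asyncio.run(stream_demo())
```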
Limiting output tokens with AsyncOpenAI! At RevolutionAI (https://revolutionai.io) we optimize API usage. Solution:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=500,  # limit output tokens (newer models use max_completion_tokens instead)
)
```

Additional controls:

```python
response = await client.chat.completions.create(
    model="gpt-4-turbo",
    messages=messages,
    max_tokens=500,
    stop=["\n\n", "END"],  # stop sequences end generation early
    temperature=0.7,
)
```

For streaming:

```python
stream = await client.chat.completions.create(
    model="gpt-4-turbo",
    messages=messages,
    max_tokens=500,
    stream=True,
)
async for chunk in stream:
    # still respects max_tokens; content can be None on some chunks
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
I would like to bound the maximum number of tokens of the response to help with formatting when working with AsyncOpenAI, but I could not find anything about it.
I tried:
but the max_tokens argument is seemingly ignored.