AI, Software Development

OpenAI API Costs Explained

By James KillickFebruary 16, 2026

TL;DR: OpenAI charges per token, not per request. GPT-4o costs more than GPT-3.5 Turbo, and your bill depends on which model you pick, how long your prompts are, and how many calls you make. Getting a handle on those three things will stop most budget blowouts.

OpenAI charges per token, not per request. A token is roughly four characters of text. Send a long prompt, get a long reply, and you spend more. Pick the wrong model and you can spend ten times what you need to. That is the whole picture before the detail.

Most teams hit surprise bills because they chose a model based on capability without checking price, or they let prompt length creep without noticing. This post walks through how pricing actually works so you can build without blowouts.

How does OpenAI token pricing work?

OpenAI prices API access by the number of tokens processed. Input tokens (what you send) and output tokens (what the model returns) are priced separately. Output tokens generally cost more.

As of the current OpenAI API pricing page, GPT-4o input tokens cost significantly more than GPT-3.5 Turbo input tokens. For many tasks, the cheaper model is good enough.

The maths is straightforward:

Total cost = (input tokens x input rate) + (output tokens x output rate)
A typical chat message is 50-150 tokens
A detailed system prompt can run 300-600 tokens before the user says a word
Long documents fed into context can push a single call into thousands of tokens

Track both sides. Teams that only watch output tokens miss half their spend.

Which model should you use?

Pick the cheapest model that does the job well enough. That sounds obvious, but most teams default to GPT-4o for everything because it is the flagship.

Here is a practical guide:

GPT-3.5 Turbo works for classification, summarisation, simple extraction, and chatbots where the conversation is short. It is cheap.
GPT-4o mini sits in the middle. Good reasoning at a lower price than full GPT-4o.
GPT-4o is worth paying for when you need strong reasoning, complex instruction following, or structured output on messy data.
o3 and o4-mini are built for deep reasoning tasks. They are slower and cost more. Use them when standard models fail, not as a default.

For most production apps built at Devwiz, a Sydney team that has shipped 200+ apps since 2015, the right answer is to test GPT-4o mini first, then move to GPT-4o only where quality actually drops.

What drives your bill up?

Four things account for most surprise spend:

Long system prompts. A system prompt runs on every single call. If it is 800 tokens, you pay for 800 tokens every time a user sends one message. Trim it.

Conversation history in context. Many apps send the full chat history to maintain context. After 20 messages, that history is thousands of tokens per call. Use summarisation or a sliding window instead.

High call volume. Ten thousand users each sending five messages a day is 50,000 API calls. Small per-call costs compound fast.

Oversized RAG chunks. When you pull documents from a vector store and inject them into the prompt, chunk size matters a lot. Oversized chunks bloat every call.

When we help CTOs think through AI integration strategy, this is usually the first conversation we have. The architecture decisions made early set the cost ceiling for the whole product.

How do you estimate costs before building?

Before writing code, do a back-of-envelope calculation:

Estimate average input tokens per call (system prompt + history + user message)
Estimate average output tokens per call
Multiply by your expected call volume per month
Apply the model's per-1M-token rate

Example: 800 input tokens + 300 output tokens = 1,100 tokens per call. At 100,000 calls per month that is 110 million tokens. Run that through the pricing page for your chosen model and you have a ballpark.

Build that estimate before you pick a model. It often changes the decision.

How do you control costs in production?

Once you are live, a few patterns keep bills predictable:

Set hard limits. OpenAI lets you set monthly spend limits in the dashboard. Use them. A runaway loop or a bug that triggers thousands of calls should not drain your account overnight.

Cache common responses. If the same prompt gets asked repeatedly, cache the result. You only pay for the first call.

Log and monitor token usage. Every API response includes a usage object with token counts. Log it. Build a simple dashboard or pipe it to your observability tool. You cannot manage what you do not measure.

Rate limit per user. In consumer-facing apps, a small number of users often generate a disproportionate share of calls. Rate limiting protects both cost and availability.

For a deeper look at how this fits into a full integration, the guide on adding AI to an existing app covers the architecture decisions that shape long-term API spend.

What about enterprise pricing and Azure OpenAI?

If your usage is high enough, OpenAI offers enterprise agreements. These can include committed spend discounts, data privacy guarantees, and dedicated capacity.

Azure OpenAI Service runs the same models through Microsoft's infrastructure. Pricing differs slightly, and it comes with Azure's compliance and security posture. Useful for clients with strict data residency requirements.

For most early-stage products, the standard OpenAI API is the right starting point. Move to enterprise or Azure when volume or compliance requirements push you there.

Fine-tuned models and the batch API

Two more pricing options worth knowing:

Fine-tuning lets you train a smaller model on your data. The fine-tuned model can match GPT-4o quality on a specific task at GPT-3.5 Turbo prices. There is an upfront training cost, but if you have a high-volume, narrow use case, fine-tuning often pays for itself quickly.

Batch API processes requests asynchronously at roughly half the standard price. If your use case does not need real-time responses, like overnight document processing or bulk data extraction, batch is worth using.

OpenAI API cost is manageable once you know the levers. Pick the right model, control your prompt length, monitor token usage, and set spend limits. Most budget blowouts come from skipping one of those four steps.

If you are building an AI product and want a team that has done this across 200+ apps for clients including NSW Government, Briometrix, Vivid, and Huskee, talk to the team at Devwiz. We help you ship AI that works without the surprise bills.

James Killick writes about AI product strategy at jameskillick.co.

Frequently asked questions

How much does the OpenAI API cost per month?

It depends on usage. There is no monthly subscription for the standard API. You pay per token. A small internal tool making a few thousand calls per month might cost $10-50. A consumer app with tens of thousands of daily users can run into hundreds or thousands of dollars. Run a token estimate based on your expected call volume before you build.

What is a token in the OpenAI API?

A token is roughly four characters of English text. The word 'building' is about two tokens. The word 'AI' is one token. Every call to the API costs input tokens (what you send) and output tokens (what the model returns). Both are counted and billed separately.

Is GPT-4o worth the extra cost over GPT-3.5 Turbo?

For some tasks, yes. For many tasks, no. GPT-4o is significantly better at complex reasoning, structured output on messy data, and nuanced instruction following. For simple classification, summarisation, or short conversations, GPT-3.5 Turbo or GPT-4o mini will do the job at a fraction of the price. Test the cheaper model first.

How do I stop unexpected OpenAI API bills?

Set a monthly spend limit in your OpenAI account settings. Log token usage on every API call so you can see what is driving cost. Rate limit users in consumer apps. Cache repeated prompts. And trim your system prompt as short as it can go without losing quality.

What is the cheapest way to use the OpenAI API?

Use the cheapest model that meets your quality bar. Keep system prompts short. Use a sliding context window instead of full conversation history. For non-real-time workloads, use the Batch API at roughly half the standard price. Fine-tune a smaller model if you have a high-volume, narrow use case.

About James Killick

James is a co-founder of Devwiz and an AI product specialist. Since 2015 he has helped ship 200+ apps for founders, businesses and government, including work for NSW Government, Briometrix and Huskee. He builds AI-first platforms and writes about turning a proven program into software. He also hosts the Up in the AI podcast.

jameskillick.co · LinkedIn · AI Orchestrators

Tags: AI Integration