
Tokens, tokens everywhere

Picture this: You’ve just sent a seemingly innocent message to your AI assistant — “Hey, can you help me debug this code?” — and suddenly your wallet starts crying. You’re not imagining things. You’ve just been hit by the token tax.

Sound familiar? Welcome to the club. We’re all paying the token toll now.

Let’s talk about tokens — the fundamental currency of the AI world that determines how much you pay, how fast you get responses, and whether your AI assistant will finish its sentence or leave you hanging like a cliffhanger episode.


What exactly is a token?

Imagine you’re eating a really long string. Not a literal string (that would be weird), but a text string. Now imagine you have to break it into bite-sized pieces before your brain can process it. Those bite-sized pieces? That’s basically what a token is.

More technically, a token is the smallest unit of text that an LLM (Large Language Model) processes. It could be:

  1. A whole word ("hello")
  2. A piece of a word ("token" + "izer")
  3. A single character or punctuation mark ("!")
  4. A special marker, like the end-of-sequence (EOS) token in the diagram below
Here’s the fun part: tokens are not standardized across models. What counts as one token in GPT-4 might be two in Claude. It’s like how different countries have different currency denominations.

┌─────────────────────────────────────────────────────────────┐
│  Input: "The quick brown fox jumps over the lazy dog"       │
└─────────────────────────────────────────────────────────────┘


┌──────────┬──────────┬─────────┬──────┬───────┬─────┬──────┐
│   The    │  quick   │  brown  │ fox  │ jumps │over │ the  │
│  token   │  token   │  token  │token │ token │token│token │
└──────────┴──────────┴─────────┴──────┴───────┴─────┴──────┘


┌──────────┬──────────┬─────────┬──────┬───────┬─────┬──────┐
│   lazy   │   dog    │         │      │       │     │      │
│  token   │  token   │   EOS   │      │       │     │      │
└──────────┴──────────┴─────────┴──────┴───────┴─────┴──────┘


                    🤖 AI Brain Processes This 🤖

Pro tip: As a rough rule of thumb, 1 token ≈ 4 characters ≈ 0.75 words in English. So that 100-page novel? Somewhere around 30,000-50,000 tokens. Your API bill says “thank you” in advance.
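If you just need a back-of-the-envelope number before hitting an API, that rule of thumb is enough to code up. This is a sketch of the heuristic, not a real tokenizer (the ratios are the approximations above):

```javascript
// Rough token estimator based on the ~4 chars/token heuristic.
// Not a real tokenizer — just the rule of thumb, made callable.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function estimateTokensFromWords(text) {
  // ~0.75 words per token → tokens ≈ words / 0.75
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return Math.ceil(words / 0.75);
}

console.log(estimateTokens("The quick brown fox jumps over the lazy dog")); // → 11
```

Real tokenizers disagree with this by ±20% or so, but it's plenty for budgeting.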


How tokens are consumed in AI applications

Every time you interact with an AI, tokens flow in both directions:

  1. Input tokens — What you send (your prompt, context, history)
  2. Output tokens — What the AI generates (its response)

The total = input + output. Both cost money. Both have limits.

Here’s a simple API call example:

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Write a fibonacci function" },
  ],
  max_tokens: 500, // This limits output tokens!
});

console.log(`Usage:
  Prompt tokens: ${response.usage.prompt_tokens}
  Completion tokens: ${response.usage.completion_tokens}
  Total tokens: ${response.usage.total_tokens}
  Cost: $${((response.usage.total_tokens / 1000) * 0.03).toFixed(4)}`); // rough flat-rate estimate; real pricing splits input/output

Output:

Usage:
  Prompt tokens: 25
  Completion tokens: 156
  Total tokens: 181
  Cost: $0.0054

That’s not much. But scale that to a chatbot with conversation history, and suddenly you’re burning tokens faster than you burn calories at the gym.

The conversation context problem

Here’s where things get tricky. Chatbots remember conversation history. That history? Also tokens.

// This is what your conversation looks like to the AI
const conversationHistory = [
  { role: "system", content: "You are a helpful assistant." }, // ~10 tokens
  { role: "user", content: "Hello!" }, // ~3 tokens
  { role: "assistant", content: "Hello! How can I help you?" }, // ~8 tokens
  { role: "user", content: "What is React?" }, // ~6 tokens
  { role: "assistant", content: "React is a JavaScript library..." }, // ~50 tokens
  { role: "user", content: "How do I use useState?" }, // ~7 tokens
  // ... and this keeps growing
];

After 50 messages back and forth? You’ve easily got 2,000+ tokens just in history. That’s a whole Wikipedia article’s worth of “remember when you asked me that thing.”
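You can watch that bill grow by summing a rough per-message estimate over the history. A sketch, using the ~4 chars/token heuristic plus a few tokens of chat-format overhead per message (the 4-token overhead is a common approximation, not an exact spec):

```javascript
// Rough running total of history tokens.
const estimateTokens = (text) => Math.ceil(text.length / 4);

function historyTokens(messages) {
  // Chat-format models add a few tokens of role/format overhead per message;
  // 4 per message is a common rough figure.
  const PER_MESSAGE_OVERHEAD = 4;
  return messages.reduce(
    (sum, m) => sum + estimateTokens(m.content) + PER_MESSAGE_OVERHEAD,
    0
  );
}

const history = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is React?" },
  { role: "assistant", content: "React is a JavaScript library for building UIs." },
];
console.log(historyTokens(history)); // → 35 with these rough numbers
```

Run it after every turn and you can see exactly when your context starts costing real money.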


How different AI models interpret tokens

Not all tokens are created equal. Different models use different tokenizers, each with its own vocabulary and its own way of breaking text apart.

GPT-4 / GPT-4o

OpenAI uses Byte Pair Encoding (BPE) with a vocabulary of about 100,000 tokens. It’s pretty good at English, decent at code, and… let’s just say non-English languages get the short end of the stick.

// GPT-4 tokenizer approximation
// "Artificial Intelligence" → ["Artificial", " Intell", "igence"]
// But "AI" → ["AI"] (a single token!)

const gpt4Tokenize = text => {
  // Simplified - actual implementation is much more complex
  return text.match(/[a-zA-Z]+|[^a-zA-Z]/g) || [];
};

Claude (Anthropic)

Claude also uses a BPE tokenizer, but one trained on different data. The fun part? Claude has a much larger context window (up to 200K tokens for Claude 3.5!) and processes it differently.

// Claude's tokenization tends to be more efficient for certain text types
// "Hello" → ["Hello"]
// "hello" → ["hello"]
// (case-sensitive in interesting ways)

const claudeContext = {
  // maxTokens = context window; pricePer1K = input price in USD
  claude3Opus:   { maxTokens: 200000, pricePer1K: 0.015 },
  claude3Sonnet: { maxTokens: 200000, pricePer1K: 0.003 },
  claude3Haiku:  { maxTokens: 200000, pricePer1K: 0.00025 },
};

Gemini (Google)

Google’s Gemini uses its own tokenizer and has massive context windows. It’s got some interesting quirks in how it handles code vs. natural language.

const geminiPricing = {
  // USD per 1M tokens
  "gemini-1.5-pro":   { inputPer1M: 1.25,  outputPer1M: 5.0 },
  "gemini-1.5-flash": { inputPer1M: 0.075, outputPer1M: 0.3 },
};

The token efficiency comparison

┌────────────────┬──────────────┬─────────────────────┐
│ Model          │ Tokens/Word  │ Efficiency Rating   │
├────────────────┼──────────────┼─────────────────────┤
│ GPT-4          │ ~1.3         │ 🤔 Decent           │
│ Claude 3       │ ~1.1         │ 😎 Slightly better  │
│ Gemini 1.5     │ ~1.25        │ 🤔 Decent           │
│ Llama 3        │ ~1.2         │ 🤔 Decent           │
└────────────────┴──────────────┴─────────────────────┘

The difference seems small until you're processing 100K tokens.
Then it's the difference between $3 and $4. Still matters! 💸
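A quick sanity check on that claim, reusing the flat $0.03/1K rate from the earlier cost example and the tokens-per-word ratios from the table (all of which are rough illustrative figures):

```javascript
// How much a small tokens/word difference matters at scale.
function costForWords(words, tokensPerWord, pricePer1K) {
  const tokens = words * tokensPerWord;
  return (tokens / 1000) * pricePer1K;
}

const words = 100_000;
const pricePer1K = 0.03; // assumed flat rate, as in the earlier example

console.log(costForWords(words, 1.1, pricePer1K).toFixed(2)); // → 3.30 (Claude-ish ratio)
console.log(costForWords(words, 1.3, pricePer1K).toFixed(2)); // → 3.90 (GPT-4-ish ratio)
```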

Tools to reduce token usage

Alright, now for the money-saving section. Here’s what the leading AI tools offer:

OpenAI

// 1. Use max_tokens wisely
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Summarize this" }],
  max_tokens: 100, // Don't let it ramble!
  temperature: 0.3, // Lower = more focused, deterministic output
});

// 2. Count tokens before you send them
import { encoding_for_model } from "@dqbd/tiktoken";

const countTokens = (text, model = "gpt-4") => {
  const encoder = encoding_for_model(model);
  const count = encoder.encode(text).length;
  encoder.free(); // tiktoken encoders hold WASM memory
  return count;
};

Claude (Anthropic)

// Claude has a neat feature - you can ask it to be concise!
const messages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: "Explain tokens. Be extremely concise.",
      },
    ],
  },
];

// Also - use Claude's built-in token counting
const { countTokens } = require("@anthropic-ai/tokenizer");

GitHub Copilot

// Copilot uses context from your open files
// Keep files focused and relevant to reduce context confusion

// BAD: 5 huge files open, asking about CSS
// GOOD: Just the component file open, asking about that specific issue

Cursor / Windsurf (AI IDEs)

These tools have agent mode that can get token-happy. Here’s the pro tip:

// In .cursorrules (a plain-text file, not JSON), add rules like:
//
//   Keep responses concise and under 200 words.
//   Don't repeat code I've already shown you.
//   Use bullet points instead of paragraphs.

General best practices

// 1. System prompts that limit verbosity
const systemPrompt = `You are a concise coding assistant. 
- Keep responses under 100 words
- Provide code directly, minimize explanation
- If code is over 50 lines, summarize instead of showing all`;

// 2. Use few-shot prompting to reduce explanation needs
const fewShotPrompt = `Q: Fix this bug in useEffect
A: \`useEffect(() => {}, [])\`
Q: Fix this state issue
A: \`useState(initialValue)\``;

// 3. Truncate old conversation history
const MAX_HISTORY_TOKENS = 4000;

function pruneHistory(messages, maxTokens) {
  // estimateTokens: any counter works, e.g. the ~4 chars/token heuristic
  let tokenCount = 0;
  const pruned = [];

  // Start from most recent
  for (let i = messages.length - 1; i >= 0; i--) {
    tokenCount += estimateTokens(messages[i].content);
    if (tokenCount > maxTokens) break;
    pruned.unshift(messages[i]);
  }

  return pruned;
}
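The sketch above leans on an estimateTokens helper it never defines. A self-contained version, with a quick demo (the ~4 chars/token heuristic stands in for a real tokenizer; the repeated pruneHistory is the same function, copied so the example runs on its own):

```javascript
// Stand-in estimateTokens plus a runnable demo of pruneHistory.
const estimateTokens = (text) => Math.ceil(text.length / 4);

function pruneHistory(messages, maxTokens) {
  let tokenCount = 0;
  const pruned = [];
  // Walk backwards so the most recent messages survive
  for (let i = messages.length - 1; i >= 0; i--) {
    tokenCount += estimateTokens(messages[i].content);
    if (tokenCount > maxTokens) break;
    pruned.unshift(messages[i]);
  }
  return pruned;
}

const messages = [
  { role: "user", content: "x".repeat(400) },      // ~100 tokens
  { role: "assistant", content: "y".repeat(400) }, // ~100 tokens
  { role: "user", content: "z".repeat(40) },       // ~10 tokens
];
console.log(pruneHistory(messages, 150).length); // → 2 (the two most recent fit)
```

Swap the heuristic for a real tokenizer count when exact budgeting matters.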

What developers must be aware of

1. Context Window Limits

Every model has a maximum context window. Exceed it, and you get errors.

const MODEL_LIMITS = {
  "gpt-4o": 128000,
  "gpt-4-turbo": 128000,
  "gpt-3.5-turbo": 16385,
  "claude-3-5-sonnet": 200000,
  "gemini-1.5-pro": 2000000, // 2 million!
};

// Don't do this:
const hugeDoc = readFileSync("massive-book.txt", "utf8");
await openai.chat.completions.create({
  messages: [{ role: "user", content: hugeDoc }], // 💥 Error!
});

2. Token Counting Errors

Never trust user input to stay under limits. Always validate:

const MAX_TOKENS = 4000;

async function safeChat(prompt) {
  const promptTokens = countTokens(prompt);

  if (promptTokens > MAX_TOKENS) {
    // Truncate or summarize the prompt
    return { error: "Prompt too long", tokens: promptTokens };
  }

  return await callAI(prompt);
}

3. Hidden Token Sources

Watch out for these token drains:

// 1. System prompts add up!
const verboseSystem = `You are a helpful AI assistant that provides 
detailed explanations with code examples. You always consider edge cases
and provide comprehensive documentation...`; // This is 50+ tokens!

// 2. Function calling adds overhead
const functions = [
  {
    name: "getWeather",
    description: "Get weather for a location", // This counts!
    parameters: {
      /* ... */
    }, // This counts too!
  },
];

// 3. Metadata in responses
// The AI might add "Here's the code you asked for:" which adds tokens

4. Cost Monitoring

const COST_PER_1K = {
  gpt4o: { input: 0.005, output: 0.015 },
  gpt4Turbo: { input: 0.01, output: 0.03 },
  claude35: { input: 0.003, output: 0.015 },
};

function estimateCost(tokens, model = "gpt4o") {
  // Worst-case estimate: prices every token at the combined input + output rate
  const rates = COST_PER_1K[model];
  return ((tokens / 1000) * (rates.input + rates.output)).toFixed(4);
}

// Add to your app to track spending!

Tips to reduce token usage

1. Be concise in prompts

// ❌ Wasteful
// "Can you please help me with my code? I've been working on this for a while
//  and I'm having trouble figuring out what's going wrong. Here's the code:"

// ✅ Concise
// "Fix this bug:"

2. Use code snippets instead of files

// ❌ Send entire file
const code = readFileSync("App.js");

// ✅ Send relevant portion
const code = `function useAuth() {
  const [user, setUser] = useState(null);
  // how to persist across refresh?
}`;

3. Implement smart context management

class TokenManager {
  constructor(maxTokens = 8000) {
    this.maxTokens = maxTokens;
    this.history = [];
  }

  addMessage(role, content) {
    this.history.push({ role, content });
    this.prune();
  }

  prune() {
    let total = this.history.reduce(
      (sum, m) => sum + countTokens(m.content),
      0
    );

    while (total > this.maxTokens && this.history.length > 2) {
      const removed = this.history.shift(); // Remove oldest
      total -= countTokens(removed.content);
    }
  }

  getContext() {
    return this.history;
  }
}

4. Use lower temperature for focused tasks

// Creative writing = higher temperature
// Code/debugging = lower temperature

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  temperature: 0.2, // More deterministic, fewer "wandering" tokens
  max_tokens: 300, // Hard limit
});

5. Cache common contexts

// Cache system prompts
const cachedSystemPrompt = `You are a React expert. 
Answer in under 50 words. Use code blocks.`;

// Reuse across sessions when possible
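Caching can go beyond system prompts: if users send identical prompts, you can reuse whole responses. A minimal in-memory sketch (the Map-keyed-by-prompt approach and the TTL are illustrative choices on the client side; OpenAI and Anthropic also offer server-side prompt caching with its own discounts):

```javascript
// Minimal in-memory response cache keyed by the exact prompt string.
// A cache hit spends zero tokens; the TTL is an arbitrary choice.
const cache = new Map();
const TTL_MS = 5 * 60 * 1000;

async function cachedChat(prompt, callAI) {
  const hit = cache.get(prompt);
  if (hit && Date.now() - hit.time < TTL_MS) {
    return hit.response; // no API call, no tokens
  }
  const response = await callAI(prompt);
  cache.set(prompt, { response, time: Date.now() });
  return response;
}
```

Pass your actual API call in as callAI; for production you'd want a bounded cache (LRU) rather than an unbounded Map.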

The token economy: A summary

┌──────────────────────────────────────────────────────────────┐
│                      TOKEN ECONOMY 101                       │
├──────────────────────────────────────────────────────────────┤
│  Input tokens   → what you send     → costs money            │
│  Output tokens  → what the AI sends → costs money            │
│                                                              │
│  Context window → max tokens the model can see               │
│  max_tokens     → max tokens the AI can generate             │
│                                                              │
│  Rule of thumb: 1 token ≈ 0.75 words                         │
│  Rule of wallet: every token counts! 💰                      │
└──────────────────────────────────────────────────────────────┘

Wrapping up

Tokens are the lifeblood of AI interactions — they’re how the models understand, process, and respond to our requests. They’re also how OpenAI, Anthropic, and Google make their money.

Understanding tokens isn’t just about saving money (though that’s a nice bonus). It’s about:

  1. Better AI interactions — Knowing how to structure prompts
  2. Optimization — Building efficient AI applications
  3. Cost management — Not getting surprised by the bill
  4. Performance — Staying within context limits

So next time you send a prompt, take a moment to think: “Do I really need all those words?” Your wallet (and your AI response time) will thank you.

Now go forth and token responsibly! 🎯


Did this help? Got questions about tokens? Drop a comment below. Or don’t. I don’t know how comments work on this blog yet. 😅

