LLM Request Tokenization

👉 Learn LLM payload tokenization: core algorithms and cost optimization.

👉 Mastering BPE, WordPiece, and SentencePiece to manage context window limits and control inference costs.


What is Tokenization in LLMs?

Tokenization is the process of translating human language into a numerical format that a computer can operate on.

Since Large Language Models (LLMs) are essentially massive mathematical functions, they cannot “read” text like humans do; instead, they operate on sequences of numbers.

It acts as the “translator” that breaks the raw text into smaller, manageable chunks called tokens. These tokens are then mapped to unique numerical IDs from the model’s predefined vocabulary.

How it works

  1. Normalization:
    • The text is cleaned (e.g., converting to lowercase or normalizing accented characters).
  2. Splitting:
    • The tokenizer breaks the text into units. Modern models like GPT-4 or Llama use subword tokenization, meaning a word like “unbelievable” might be split into un, believ, and able.
  3. Mapping:
    • Each token is assigned a specific integer (e.g., “the” might be 464).
  4. Encoding:
    • These integers are passed into the model as an array of numbers.
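
As a minimal sketch of these four steps, consider the toy greedy longest-match tokenizer below. The vocabulary and integer IDs are invented for illustration; real vocabularies hold tens of thousands of entries.

# Toy vocabulary, invented for illustration only.
toy_vocab = {"un": 1, "believ": 2, "able": 3, "the": 4}

def tokenize(text):
    text = text.lower().strip()                      # 1. Normalization
    pieces, i = [], 0
    while i < len(text):                             # 2. Splitting: greedy longest match
        for j in range(len(text), i, -1):
            if text[i:j] in toy_vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Real tokenizers fall back to bytes or an <unk> token here.
            raise ValueError(f"no token for {text[i]!r}")
    return [toy_vocab[p] for p in pieces]            # 3. Mapping + 4. Encoding

print(tokenize("Unbelievable"))  # [1, 2, 3] -> un / believ / able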

Common Tokenization methods

The “granularity” of a token depends on the algorithm used:

Method | Unit | Pros | Cons
Word-level | Full words | Simple, easy to understand. | Huge vocabulary; fails on typos or new words (OOV).
Character-level | Individual letters | Small vocabulary; never “misses” a word. | Long sequences; loses the “meaning” of whole words.
Subword (BPE/WordPiece) | Parts of words | The standard. Balances efficiency and flexibility. | Can be counter-intuitive (e.g., spaces are often included).
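
The difference in granularity is easy to see on a single phrase. The subword split below is illustrative; the actual split depends on the trained vocabulary.

sentence = "unbelievable results"

print(sentence.split())  # word-level: ['unbelievable', 'results']
print(list(sentence))    # character-level: 20 single-character tokens
print(["un", "believ", "able", " results"])  # subword: illustrative split, note the leading space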

Why it matters

  • Cost & Limits:
    • Most AI services (like OpenAI or Anthropic) charge by the token, not the word.
    • Similarly, “Context Windows” (how much the AI can remember) are measured in tokens.
  • Model “Intelligence”:
    • If a tokenizer splits a word poorly, the model might struggle to understand its meaning.
    • This is why AI sometimes fails at simple tasks like counting letters in a word or solving anagrams: it sees the tokens, not the individual letters.
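
Both points are easy to verify in code. The sketch below uses the tiktoken library (introduced in the next section) with its cl100k_base encoding and a purely hypothetical price, counting billable tokens and showing the chunks the model actually sees:

import tiktoken

# cl100k_base is the encoding used by several recent OpenAI models.
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "How many letters are in strawberry?"
tokens = encoding.encode(prompt)

# Cost & Limits: providers bill per token, so the list length is what counts.
price_per_1k = 0.001  # hypothetical rate, for illustration only
print(f"{len(tokens)} tokens, approx. ${len(tokens) / 1000 * price_per_1k:.6f}")

# Model "Intelligence": the model sees these chunks, not individual letters.
print([encoding.decode([t]) for t in tokens])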

Tokenization process

A hands-on look at tokenization: the process of converting text (prompt messages) into the numerical tokens that language models operate on.

We can use the tiktoken library, OpenAI’s fast Byte Pair Encoding (BPE) tokenizer, to see the tokenization process in action.

App Setup

The core tool for this proof-of-concept demonstration is tiktoken.

  • It’s OpenAI’s fast BPE tokenizer, used with their models.
  • It can look up the tokenizer that corresponds to a specific OpenAI API model.
  • Check OpenAI tiktoken GitHub Repository to learn more about the library.
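
If the library isn’t installed yet, it’s available on PyPI:

pip install tiktoken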

Import the Library

import tiktoken

Tokenizing/Encoding the Text

Let’s initialize the encoder for a specific model and convert a sample text string into a list of token IDs.

encoding = tiktoken.encoding_for_model("gpt-4.1-mini")

input_text = "Hello, how are you doing today?"
tokens = encoding.encode(input_text)

print(tokens)

Output

[13225, 11, 1495, 553, 481, 5306, 4044, 30]
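
Because billing and context windows are measured in tokens, the length of this list is often the number you care about:

print(len(tokens))  # 8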

Inspecting Tokens and Text

We can decode the token IDs back into their original text segments to observe how the BPE algorithm splits the input string. Notice how common words like ‘ how’, ‘ are’, and ‘ you’ are treated as single tokens, often including the preceding space.

Printing Token IDs and Text

for token_id in tokens:
    token_text = encoding.decode([token_id])
    print(f"Token ID: {token_id}, Token Text: '{token_text}'")

Output

Token ID: 13225, Token Text: 'Hello'
Token ID: 11, Token Text: ','
Token ID: 1495, Token Text: ' how'
Token ID: 553, Token Text: ' are'
Token ID: 481, Token Text: ' you'
Token ID: 5306, Token Text: ' doing'
Token ID: 4044, Token Text: ' today'
Token ID: 30, Token Text: '?'
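
Decoding the full list at once reconstructs the original string, a handy round-trip sanity check:

decoded_text = encoding.decode(tokens)
print(decoded_text)  # Hello, how are you doing today?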

Tokenization Request Flow

The diagram below illustrates the sequence of steps for tokenizing and inspecting an input string using tiktoken.

sequenceDiagram
    participant User
    participant PythonScript as Python Script
    participant TiktokenLib as tiktoken Library
    participant OpenAIModel as GPT Model Encoding

    User->>PythonScript: Defines input_text
    PythonScript->>TiktokenLib: Load encoding for "gpt-4.1-mini"
    TiktokenLib->>OpenAIModel: Retrieves BPE vocabulary/merge rules
    TiktokenLib-->>PythonScript: Returns encoding object
    PythonScript->>TiktokenLib: encoding.encode(input_text)
    TiktokenLib-->>PythonScript: Returns list of Token IDs (e.g., [13225, 11, ...])
    loop For each Token ID
        PythonScript->>TiktokenLib: encoding.decode([token_id])
        TiktokenLib-->>PythonScript: Returns Token Text (e.g., 'Hello', ',')
        PythonScript->>User: Prints Token ID and Text
    end