LLM Request Tokenization
👉 Learn LLM Payload Tokenization, Core Algorithms and Cost Optimization.
👉 Mastering BPE, WordPiece, and SentencePiece to manage context window limits and control inference costs.
What is Tokenization in LLMs?
Tokenization is the process of translating human language into a numerical format that a computer can operate on.
Since Large Language Models (LLMs) are essentially massive mathematical functions, they cannot “read” text like humans do; instead, they operate on sequences of numbers.
Tokenization acts as the “translator” that breaks the raw text into smaller, manageable chunks called tokens. These tokens are then mapped to unique numerical IDs from the model’s predefined vocabulary.
How it works?
- Normalization:
  - The text is cleaned (e.g., converting to lowercase or fixing accents).
- Splitting:
  - The tokenizer breaks the text into units. Modern models like GPT-4 or Llama use subword tokenization, meaning a word like “unbelievable” might be split into un, believ, and able.
- Mapping:
  - Each token is assigned a specific integer (e.g., “the” might be 464).
- Encoding:
  - These integers are passed into the model as an array of numbers (a toy sketch of these steps follows below).
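To make these steps concrete, here is a minimal, illustrative sketch in Python. The vocabulary and the greedy longest-match splitting are invented for demonstration only; real tokenizers learn vocabularies of tens of thousands of subwords with algorithms such as BPE.

# Toy illustration of the normalization -> splitting -> mapping -> encoding pipeline.
# The vocabulary is made up for demonstration; real tokenizers learn theirs from data.
toy_vocab = {"un": 1, "believ": 2, "able": 3, "the": 464}

def toy_tokenize(text: str) -> list[int]:
    normalized = text.lower().strip()          # 1. Normalization
    pieces, i = [], 0
    while i < len(normalized):                 # 2. Splitting: greedy longest match against the vocabulary
        for j in range(len(normalized), i, -1):
            if normalized[i:j] in toy_vocab:
                pieces.append(normalized[i:j])
                i = j
                break
        else:                                  # no vocabulary entry matched: fall back to a single character
            pieces.append(normalized[i])
            i += 1
    return [toy_vocab.get(piece, -1) for piece in pieces]  # 3. Mapping + 4. Encoding as an integer array

print(toy_tokenize("Unbelievable"))  # -> [1, 2, 3]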
Common Tokenization methods
The “granularity” of a token depends on the algorithm used (a short comparison in code follows the table):
| Method | Unit | Pros | Cons |
|---|---|---|---|
| Word-level | Full words | Simple, easy to understand. | Huge vocabulary; fails on typos or new words (OOV). |
| Character-level | Individual letters | Small vocabulary; never “misses” a word. | Long sequences; loses the “meaning” of whole words. |
| Subword (BPE/WordPiece) | Parts of words | The Standard. Balances efficiency and flexibility. | Can be counter-intuitive (e.g., spaces are often included). |
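A quick, illustrative comparison of the three granularities on the same sentence. The subword split shown is hypothetical; real splits depend on the model’s learned vocabulary.

text = "Tokenization is unbelievable"

word_tokens = text.split()   # Word-level: ['Tokenization', 'is', 'unbelievable']
char_tokens = list(text)     # Character-level: one token per character
subword_tokens = ["Token", "ization", " is", " un", "believ", "able"]  # Subword (hypothetical split)

print(len(word_tokens), len(char_tokens), len(subword_tokens))  # 3 28 6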
Why it matters?
- Cost & Limits:
  - Most AI services (like OpenAI or Anthropic) charge by the token, not the word (see the token-counting sketch after this list).
  - Similarly, “Context Windows” (how much the AI can remember) are measured in tokens.
- Model “Intelligence”:
  - If a tokenizer splits a word poorly, the model might struggle to understand its meaning.
  - This is why AI sometimes fails at simple tasks like counting letters in a word or solving anagrams: it sees the tokens, not the individual letters.
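Because both billing and context windows are measured in tokens, it is common to count tokens before sending a request. Below is a minimal sketch using tiktoken; the per-1K-token price is a placeholder to be replaced with the current rate for your model and provider.

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4.1-mini")

prompt = "Summarize the following report in three bullet points."
num_tokens = len(encoding.encode(prompt))

PRICE_PER_1K_INPUT_TOKENS = 0.0004  # placeholder value; check your provider's pricing page
estimated_cost = (num_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS

print(f"Tokens: {num_tokens}, estimated input cost: ${estimated_cost:.6f}")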
Tokenization process
A hands-on look at tokenization, the process of converting text (prompt messages) into numerical tokens that language models can understand.
To see the tokenization process in action, we can use the tiktoken library, the fast Byte Pair Encoding (BPE) tokenizer used by OpenAI models.
App Setup
The core tool for this demonstration is tiktoken (installation command below).
- It’s OpenAI’s fast BPE tokenizer, used with their models.
- The encoding that corresponds to a specific OpenAI API model can be obtained through it.
- Check OpenAI tiktoken GitHub Repository to learn more about the library.
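If tiktoken is not already available in your environment, it can be installed from PyPI:

pip install tiktoken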
Import the Library
import tiktoken
Tokenizing/Encoding the text
Let’s initialize the encoder for a specific model and convert a sample text string into a list of token IDs.
encoding = tiktoken.encoding_for_model("gpt-4.1-mini")
input_text = "Hello, how are you doing today?"
tokens = encoding.encode(input_text)
print(tokens)
Output
[13225, 11, 1495, 553, 481, 5306, 4044, 30]
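As a quick sanity check, decoding the full list of token IDs reconstructs the original string, since BPE encoding is lossless:

decoded_text = encoding.decode(tokens)
print(decoded_text)  # Hello, how are you doing today?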
Inspecting Tokens and Text
We can decode the token IDs back into their original text segments to observe how the BPE algorithm splits the input string. Notice how common words like ‘ how’, ‘ are’, and ‘ you’ are treated as single tokens, often including the preceding space.
Printing Token IDs and Text
for token_id in tokens:
    token_text = encoding.decode([token_id])
    print(f"Token ID: {token_id}, Token Text: '{token_text}'")
Output
Token ID: 13225, Token Text: 'Hello'
Token ID: 11, Token Text: ','
Token ID: 1495, Token Text: ' how'
Token ID: 553, Token Text: ' are'
Token ID: 481, Token Text: ' you'
Token ID: 5306, Token Text: ' doing'
Token ID: 4044, Token Text: ' today'
Token ID: 30, Token Text: '?'
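tiktoken also exposes a decode_single_token_bytes method that returns the raw bytes behind a single token, which avoids building one-element lists when inspecting tokens. A small sketch, reusing the encoding and tokens objects from above:

# Inspect the raw bytes behind each token instead of decoding one-element lists.
for token_id in tokens:
    token_bytes = encoding.decode_single_token_bytes(token_id)
    print(f"Token ID: {token_id}, Token Bytes: {token_bytes!r}")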
Tokenization Request Flow
The diagram below illustrates the sequence of steps for tokenizing and inspecting an input string using tiktoken.
sequenceDiagram
participant User
participant PythonScript as Python Script
participant TiktokenLib as tiktoken Library
participant OpenAIModel as GPT Model Encoding
User->>PythonScript: Defines input_text
PythonScript->>TiktokenLib: Load encoding for "gpt-4.1-mini"
TiktokenLib->>OpenAIModel: Retrieves BPE vocabulary/merge rules
TiktokenLib-->>PythonScript: Returns encoding object
PythonScript->>TiktokenLib: encoding.encode(input_text)
TiktokenLib-->>PythonScript: Returns list of Token IDs (e.g., [13225, 11, ...])
loop For each Token ID
PythonScript->>TiktokenLib: encoding.decode([token_id])
TiktokenLib-->>PythonScript: Returns Token Text (e.g., 'Hello', ',')
PythonScript->>User: Prints Token ID and Text
end