LLM Memory Illusion
👉 Understand the stateless design of LLMs.
👉 How LLMs appear to maintain conversation context, creating the illusion of stateful memory.
👉 Deep dive into context passing, conversation management, and cost-aware memory strategies.
Table of Contents
- Table of Contents
- LLM “Memory”: The Stateless Illusion
- Connecting to a Local Ollama Model
- Demonstrating Statelessness in Conversation
- The Conversation Test and The Fix (Python)
- Recap: The Context Passing Trick
- Cost Implications of Conversation History
LLM “Memory”: The Stateless Illusion
It is a common misconception that Large Language Models (LLMs) inherently “remember” past interactions. This apparent recall of a conversation is, in fact, an illusion. LLMs are fundamentally stateless systems. This means that every single API call made to the model is treated as a completely new, independent request, with no memory of any preceding turn.
For a conversational application to maintain context, the developer must actively manage and provide the entire conversation history with every new prompt. This technique is known as Context Passing.
Connecting to a Local Ollama Model
To practically demonstrate this concept, we will set up an environment to interact with a local LLM instance running via Ollama. This method uses the standard OpenAI API structure, making the code easily transferable to other LLM providers (like OpenAI or Anthropic) by simply changing the base_url and API key configuration.
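For example, the only difference between pointing the client at a local Ollama server and at OpenAI's hosted service is the client configuration. The snippet below is a minimal illustration (the URL assumes Ollama's default OpenAI-compatible endpoint on port 11434, and the client variable names are placeholders, not part of the library):
from openai import OpenAI
# Local Ollama: the endpoint ignores the key, but the client requires a non-empty one
local_client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
# Hosted OpenAI: defaults to api.openai.com and reads OPENAI_API_KEY from the environment
hosted_client = OpenAI()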
Environment Setup (Python)
The following setup uses the openai library which is compatible with Ollama’s local API endpoint.
# Available in project home directory
# /pyproject.toml
[project]
name = "llm-engineering"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    # .. other dependencies
    "openai>=1.109.1",
]
# .. other settings
# Import necessary libraries
import os
from dotenv import load_dotenv
from openai import OpenAI
# Load environment variables from '.env'
load_dotenv(override=True)
# Define configurations
OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL')
MODEL = 'gemma3:1b'
# Get API key and set up client
openai = OpenAI(base_url=OLLAMA_BASE_URL)
# Verify configuration (Note: Ollama often uses a dummy key format)
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    print('No API key found. Please set the OPENAI_API_KEY environment variable.')
else:
    print('API key loaded successfully.')
    # Example output: API Key: sk-proj-dummy-k...
    print(f'API Key: {api_key[:15]}...')
Output:
API key loaded successfully.
API Key: sk-proj-dummy-k...
Initial Stateless Test
The goal of this initial test is to confirm the LLM’s stateless nature by asking it a question it cannot possibly answer, since no preceding conversation context has been provided.
# Create client
OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL')
MODEL = 'gemma3:1b'
openai = OpenAI(base_url=OLLAMA_BASE_URL)
# First call: The model has no context.
payload = [{'role': 'system', 'content': 'You are a helpful assistant.'},
           {'role': 'user', 'content': 'Who am I?'}]
response = openai.chat.completions.create(model=MODEL, messages=payload)
print(response.choices[0].message.content)
Output:
That’s a fun question! 😊
As a large language model, I don’t have a single, definitive answer. I was created by Google.
You are a user who has interacted with me.
Do you want to play a little game? We could try a question-based round!
Output observation
The model confirms its lack of awareness, responding generically:
“As a large language model, I don’t have a single, definitive answer… You are a user who has interacted with me.” This immediately demonstrates that, in isolation, the LLM has no inherent memory of the user.
Demonstrating Statelessness in Conversation
To further prove the stateless concept, we conduct a two-step conversation, ensuring the second call is made independently of the first.
Step 1: Feed the information
We will first provide the LLM with a piece of personal information: the user’s name.
# Conversation Turn 1: Providing the name
payload_turn_1 = [{'role': 'system', 'content': 'You are a helpful assistant.'},
                  {'role': 'user', 'content': 'Hello, I\'m Vivek?'}]
response_1 = openai.chat.completions.create(model=MODEL, messages=payload_turn_1)
print(response_1.choices[0].message.content)
Output:
Hello Vivek! It’s nice to meet you. How can I help you today? 😊
Step 2: Asking the LLM to recall the information
Crucially, the next API call is made independently and does not include payload_turn_1.
# Conversation Turn 2: Asking for the name without context
payload_turn_2 = [{'role': 'system', 'content': 'You are a helpful assistant.'},
                  {'role': 'user', 'content': 'what is my name?'}]
response_2 = openai.chat.completions.create(model=MODEL, messages=payload_turn_2)
print(response_2.choices[0].message.content)
Output:
As a large language model, I do not have access to your personally identifiable information. I cannot know your name.
It’s a fun question though! 😊
Output observation
The model fails to recall the name: “As a large language model, I do not have access to your personally identifiable information. I cannot know your name.”
The Developer's Responsibility:
All calls to LLMs are stateless. It is the programmer’s responsibility to capture and send the details from previous conversations along with the current prompt to provide the necessary context.
The Conversation Test and The Fix (Python)
To “fix” the memory problem, the application must construct a new payload that includes the full conversation history before the final query.
# Passing Full Context to a stronger model (llama3.2:3b)
MODEL_3B_PARAM = 'llama3.2:3b'
payload_fixed = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Hello, I\'m Vivek?'},  # <- Previous turn (User)
    {'role': 'assistant', 'content': 'Hello Vivek! How can I assist you today?'},  # <- Previous turn (Assistant)
    {'role': 'user', 'content': 'what is my name?'}  # <- Current query
]
response_fixed = openai.chat.completions.create(model=MODEL_3B_PARAM, messages=payload_fixed)
print(response_fixed.choices[0].message.content)
Output:
Vivek is the answer to that! You told me your name earlier, actually. Is everything okay? Was there something specific you wanted to chat about or ask for help with ?
Output Observation
The LLM now successfully recalls the name: “Vivek is the answer to that! You told me your name earlier, actually.”
This success confirms that LLM “memory” is simply the LLM processing a longer input sequence that contains the necessary context.
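In a real application you would not hard-code the payload. Instead, the code appends every user message and assistant reply to a running history list and sends the whole list on each call. The helper below is a minimal sketch reusing the client and MODEL defined earlier; chat_turn and history are illustrative names, not library functions.
# The application, not the LLM, owns the conversation state.
history = [{'role': 'system', 'content': 'You are a helpful assistant.'}]

def chat_turn(user_message: str) -> str:
    """Send one turn with the full history attached, then record the reply."""
    history.append({'role': 'user', 'content': user_message})
    response = openai.chat.completions.create(model=MODEL, messages=history)
    reply = response.choices[0].message.content
    history.append({'role': 'assistant', 'content': reply})
    return reply

print(chat_turn("Hello, I'm Vivek."))  # greets Vivek
print(chat_turn('What is my name?'))   # can answer, because turn 1 is replayed in the payload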
Recap: The Context Passing Trick
The LLM’s apparent memory is entirely a by-product of the application providing the entire conversation history in the input prompt for every new query.
| Key Concept | Description | Detail |
|---|---|---|
| Stateless Nature | Every individual call to an LLM is stateless. | This applies to all models, including local (Ollama) and third-party hosted APIs. |
| Mechanism | The entire conversation history is passed as part of the input prompt in the messages array. | This happens for every new query to provide context. |
| The Illusion | The LLM appears to have “memory” of past turns. | It is merely processing the complete, long input sequence provided in the current API call. |
| Prediction | The LLM predicts the most likely next tokens based on the current full input. | Example: The input sequence including "My name is Vivek." and "What's my name?" directs the LLM to predict "Vivek". |
Key Insight: Popular services like ChatGPT and its API products employ this exact technique. Every message sent by the user includes the full preceding conversation history to maintain context.
Context Passing Architecture
The following sequence diagram illustrates the architectural responsibility of the application code in managing and passing the conversation state.
sequenceDiagram
participant U as User
participant App as Application Code
participant LLM as LLM API (e.g., Ollama/OpenAI)
U->>App: Hello, I'm Vivek? (Msg 1)
App->>LLM: API Call with: [System, Msg 1]
LLM-->>App: Hello Vivek! (Response 1)
App-->>U: Response 1
U->>App: What is my name? (Msg 2)
App->>LLM: API Call with: [System, Msg 1, Response 1, Msg 2]
LLM-->>App: Vivek! (Response 2)
App-->>U: Response 2
Note over LLM: LLM processes the ENTIRE history<br>for EACH new request. The application must store and reconstruct the history.
Cost Implications of Conversation History
The context passing technique, while necessary for conversational continuity, has a significant cost implication.
| Aspect | Detail | Implication |
|---|---|---|
| Cost Driver | Token Count Increase | You are charged for all tokens sent in the prompt, which includes the conversation history. |
| Reason for Cost | Computation Cost | The LLM must re-process the entire previous conversation (the increased input token count) for every turn. |
| Desired Outcome | Contextually Relevant Answers | The extra cost is necessary to ensure the LLM predicts the most relevant next tokens, leading to a better user experience. |
| Result | Increased Total Cost | The overall cost per turn increases roughly linearly as the conversation length and total token count grow. |
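A rough back-of-the-envelope calculation makes this growth concrete. The figures below are purely illustrative assumptions (about 50 tokens per message and 20 for the system prompt), not measured values:
TOKENS_PER_MESSAGE = 50  # illustrative assumption, not a measured value
SYSTEM_TOKENS = 20       # illustrative assumption

def prompt_tokens(turn: int) -> int:
    """Approximate input tokens at turn `turn` when the full history is replayed:
    system prompt + all earlier user/assistant messages + the new user message."""
    earlier_messages = 2 * (turn - 1)  # each past turn adds a user and an assistant message
    return SYSTEM_TOKENS + (earlier_messages + 1) * TOKENS_PER_MESSAGE

for turn in (1, 5, 10, 20):
    print(f'Turn {turn:2d}: ~{prompt_tokens(turn)} input tokens')
# Turn  1: ~70   Turn  5: ~470   Turn 10: ~970   Turn 20: ~1970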
This necessity for context passing highlights the need for advanced memory management techniques in LLM applications, such as summarization, windowing, and retrieval-augmented generation (RAG), to keep the token count low and thus the cost manageable.
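As one illustration of windowing, the sketch below keeps the system prompt plus only the most recent messages before each call. It assumes the history list from the earlier sketch and that dropping older context is acceptable for the application; where it is not, summarization or RAG would be the better fit.
def windowed_history(history: list[dict], max_messages: int = 10) -> list[dict]:
    """Keep the system prompt plus the last `max_messages` messages.
    A crude cap on input tokens; older context is simply dropped."""
    system_messages = [m for m in history if m['role'] == 'system']
    recent_messages = [m for m in history if m['role'] != 'system'][-max_messages:]
    return system_messages + recent_messages

# Usage: inside the chat_turn helper above, replace `messages=history`
# with `messages=windowed_history(history)` to cap the tokens sent per call.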