LLM Memory Illusion

👉 Understand the stateless design of LLMs.

👉 How applications maintain conversation context for LLMs, creating the illusion of stateful memory.

👉 Deep dive into context passing, conversation management, and cost-aware memory strategies.


LLM “Memory”: The Stateless Illusion

It is a common misconception that Large Language Models (LLMs) inherently “remember” past interactions. This apparent recall of a conversation is, in fact, an illusion. LLMs are fundamentally stateless systems. This means that every single API call made to the model is treated as a completely new, independent request, with no memory of any preceding turn.

For a conversational application to maintain context, the developer must actively manage and provide the entire conversation history with every new prompt. This technique is known as Context Passing.
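
Conceptually, context passing means the application itself re-sends everything the model should “remember”. A minimal sketch of what the payloads look like across two turns (the contents are illustrative; the full walk-through follows below):

# Turn 1: the application sends only the system prompt and the first user message
turn_1 = [{'role': 'system', 'content': 'You are a helpful assistant.'},
          {'role': 'user', 'content': "Hello, I'm Vivek."}]

# Turn 2: the application must resend turn 1 plus the assistant's reply,
# followed by the new question - the model itself has kept nothing
turn_2 = turn_1 + [{'role': 'assistant', 'content': 'Hello Vivek!'},
                   {'role': 'user', 'content': 'What is my name?'}]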


Connecting to a Local Ollama Model

To practically demonstrate this concept, we will set up an environment to interact with a local LLM instance running via Ollama. This method uses the standard OpenAI API structure, making the code easily transferable to other LLM providers (like OpenAI or Anthropic) by simply changing the base_url and API key configuration.
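
For example, switching between a local Ollama endpoint and the hosted OpenAI API only changes how the client is constructed. A sketch of the two configurations (the localhost URL is Ollama's default OpenAI-compatible endpoint; adjust it to your setup):

import os
from openai import OpenAI

# Local Ollama instance: the key is required by the SDK but ignored by Ollama
local_client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Hosted OpenAI API: omit base_url and supply a real key instead
# hosted_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))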

Environment Setup (Python)

The following setup uses the openai library which is compatible with Ollama’s local API endpoint.

# Available in project home directory
# /pyproject.toml
[project]
name = "llm-engineering"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
  # .. other dependencies
  "openai>=1.109.1",
]
# .. other settings

# Import necessary libraries
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from '.env'
load_dotenv(override=True)

# Define configurations
OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL')
MODEL = 'gemma3:1b'

# Set up the client pointing at the local Ollama endpoint
# (the SDK reads OPENAI_API_KEY from the environment; Ollama ignores its value)
openai = OpenAI(base_url=OLLAMA_BASE_URL)

# Verify configuration (Note: Ollama often uses a dummy key format)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    print('No API key found. Please set the OPENAI_API_KEY environment variable.')
else:
    print('API key loaded successfully.')
    # Example output: API Key: sk-proj-dummy-k...
    print(f'API Key: {api_key[:15]}...')

Output:

API key loaded successfully.
API Key: sk-proj-dummy-k...

Initial Stateless Test

The goal of this initial test is to confirm the LLM’s stateless nature by asking it a question it cannot possibly know, as no preceding conversation context has been provided.

# Create client
OLLAMA_BASE_URL = os.getenv('OLLAMA_BASE_URL')
MODEL = 'gemma3:1b'

openai = OpenAI(base_url=OLLAMA_BASE_URL)

# First call: The model has no context.
payload = [{'role': 'system', 'content': 'You are a helpful assistant.'},
           {'role': 'user', 'content': 'Who am I?'}]

response = openai.chat.completions.create(model=MODEL, messages=payload)
print(response.choices[0].message.content)

Output:

That’s a fun question! 😊 
As a large language model, I don’t have a single, definitive answer. I was created by Google. 

You are a user who has interacted with me.
Do you want to play a little game? We could try a question-based round!

Output observation

The model confirms its lack of awareness, responding generically:

“As a large language model, I don’t have a single, definitive answer… You are a user who has interacted with me.” This immediately demonstrates that, in isolation, the LLM has no inherent memory of the user.


Demonstrating Statelessness in Conversation

To further prove the stateless concept, we conduct a two-step conversation, ensuring the second call is made independently of the first.

Step 1: Feed the information

We will first provide the LLM with a piece of personal information - the user’s name.

# Conversation Turn 1: Providing the name
payload_turn_1 = [{'role': 'system', 'content': 'You are a helpful assistant.'},
           {'role': 'user', 'content': 'Hello, I\'m Vivek?'}]

response_1 = openai.chat.completions.create(model=MODEL, messages=payload_turn_1)
print(response_1.choices[0].message.content)

Output:

Hello Vivek! It’s nice to meet you. How can I help you today? 😊

Step 2: Asking the LLM to recall the information

Crucially, the next API call is made independently and does not include payload_turn_1.

# Conversation Turn 2: Asking for the name without context
payload_turn_2 = [{'role': 'system', 'content': 'You are a helpful assistant.'},
           {'role': 'user', 'content': 'what is my name?'}]

response_2 = openai.chat.completions.create(model=MODEL, messages=payload_turn_2)
print(response_2.choices[0].message.content)

Output:

As a large language model, I do not have access to your personally identifiable information. I cannot know your name. 

It’s a fun question though! 😊 

Output observation

The model fails to recall the name, responding: “As a large language model, I do not have access to your personally identifiable information. I cannot know your name.”

The Developer's Responsibility:
All calls to LLMs are stateless. It is the programmer’s responsibility to capture and send the details from previous conversations along with the current prompt to provide the necessary context.


The Conversation Test and The Fix (Python)

To “fix” the memory problem, the application must construct a new payload that includes the full conversation history before the final query.

# Passing Full Context to a stronger model (llama3.2:3b)
MODEL_3B_PARAM = 'llama3.2:3b'
payload_fixed = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Hello, I\'m Vivek?'},                            # <- Previous turn (User)
    {'role': 'assistant', 'content': 'Hello Vivek! How can I assist you today?'}, # <- Previous turn (Assistant)
    {'role': 'user', 'content': 'what is my name?'}                               # <- Current query
]

response_fixed = openai.chat.completions.create(model=MODEL_3B_PARAM, messages=payload_fixed)
print(response_fixed.choices[0].message.content)

Output:

Vivek is the answer to that! You told me your name earlier, actually. Is everything okay? Was there something specific you wanted to chat about or ask for help with ?

Output Observation

The LLM now successfully recalls the name: “Vivek is the answer to that! You told me your name earlier, actually.”

This success confirms that LLM “memory” is simply the LLM processing a longer input sequence that contains the necessary context.


Recap: The Context Passing Trick

The LLM’s apparent memory is entirely a by-product of the application providing the entire conversation history in the input prompt for every new query.

| Key Concept | Description | Detail |
| --- | --- | --- |
| Stateless Nature | Every individual call to an LLM is stateless. | This applies to all models, including local (Ollama) and third-party hosted APIs. |
| Mechanism | The entire conversation history is passed as part of the input prompt in the messages array. | This happens for every new query to provide context. |
| The Illusion | The LLM appears to have “memory” of past turns. | It is merely processing the complete, long input sequence provided in the current API call. |
| Prediction | The LLM predicts the most likely next tokens based on the current full input. | Example: an input sequence containing "My name is Vivek." and "What's my name?" directs the LLM to predict "Vivek". |

Key Insight: Popular services like ChatGPT and its API products employ this exact technique. Every message sent by the user includes the full preceding conversation history to maintain context.

Context Passing Architecture

The following sequence diagram illustrates the architectural responsibility of the application code in managing and passing the conversation state.

sequenceDiagram
    participant U as User
    participant App as Application Code
    participant LLM as LLM API (e.g., Ollama/OpenAI)

    U->>App: Hello, I'm Vivek? (Msg 1)
    App->>LLM: API Call with: [System, Msg 1]
    LLM-->>App: Hello Vivek! (Response 1)
    App-->>U: Response 1

    U->>App: What is my name? (Msg 2)
    App->>LLM: API Call with: [System, Msg 1, Response 1, Msg 2]
    LLM-->>App: Vivek! (Response 2)
    App-->>U: Response 2

    Note over LLM: LLM processes the ENTIRE history<br>for EACH new request. The application must store and reconstruct the history.
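
In code, that responsibility amounts to storing the message list and rebuilding the payload on every turn. A minimal sketch using the openai client and MODEL configured earlier (the send_message helper and conversation list are illustrative, not part of any SDK):

# The application, not the model, owns the conversation state
conversation = [{'role': 'system', 'content': 'You are a helpful assistant.'}]

def send_message(user_text):
    # Append the new user message, then send the ENTIRE history
    conversation.append({'role': 'user', 'content': user_text})
    response = openai.chat.completions.create(model=MODEL, messages=conversation)
    reply = response.choices[0].message.content
    # Store the assistant's reply so the next turn can resend it
    conversation.append({'role': 'assistant', 'content': reply})
    return reply

print(send_message("Hello, I'm Vivek."))  # Msg 1 -> greeting
print(send_message('What is my name?'))   # Msg 2 -> should recall "Vivek"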

Cost Implications of Conversation History

The context passing technique, while necessary for conversational continuity, has a significant cost implication.

| Aspect | Detail | Implication |
| --- | --- | --- |
| Cost Driver | Token Count Increase | You are charged for all tokens sent in the prompt, which includes the conversation history. |
| Reason for Cost | Computation Cost | The LLM must re-process the entire previous conversation (the increased input token count) for every turn. |
| Desired Outcome | Contextually Relevant Answers | The extra cost is necessary to ensure the LLM predicts the most relevant next tokens, leading to a better user experience. |
| Result | Increased Total Cost | The overall cost per turn increases linearly as the conversation length and total token count grow. |
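
One way to observe this growth directly is the usage field returned with each chat completion (prompt_tokens and completion_tokens follow the OpenAI response schema; Ollama's compatible endpoint also reports them, though exact counts vary by model and tokenizer):

# Compare prompt-token counts for a bare query vs. the same query with history
short_payload = [{'role': 'system', 'content': 'You are a helpful assistant.'},
                 {'role': 'user', 'content': 'what is my name?'}]

long_payload = [{'role': 'system', 'content': 'You are a helpful assistant.'},
                {'role': 'user', 'content': "Hello, I'm Vivek."},
                {'role': 'assistant', 'content': 'Hello Vivek! How can I assist you today?'},
                {'role': 'user', 'content': 'what is my name?'}]

for label, payload in [('no history', short_payload), ('with history', long_payload)]:
    response = openai.chat.completions.create(model=MODEL, messages=payload)
    usage = response.usage  # prompt_tokens grows with every turn carried forward
    print(f'{label}: prompt_tokens={usage.prompt_tokens}, completion_tokens={usage.completion_tokens}')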

This necessity for context passing highlights the need for advanced memory management techniques in LLM applications, such as summarization, windowing, and retrieval-augmented generation (RAG) to keep the token count low and thus the cost manageable.
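
As a simple illustration of windowing, the stored history can be trimmed to the system prompt plus only the most recent messages before each call (the max_turns value is an arbitrary example, and the conversation list is the one from the earlier sketch; summarization and RAG require considerably more machinery):

def windowed(history, max_turns=6):
    # Keep every system message plus only the last `max_turns` non-system messages
    system_msgs = [m for m in history if m['role'] == 'system']
    recent = [m for m in history if m['role'] != 'system'][-max_turns:]
    return system_msgs + recent

# Trim the stored conversation before each API call to cap prompt tokens
trimmed = windowed(conversation)
response = openai.chat.completions.create(model=MODEL, messages=trimmed)
print(response.choices[0].message.content)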