Gradio + LLM Integration

👉 Step-by-step: Connecting Gradio ChatInterface to local and cloud LLMs.

👉 Engineering guide to prompt versioning and model interaction via Python.


Quickly build a UI for LLM Apps

Integrating a Large Language Model (LLM) into a user interface (UI) is crucial for testing, gathering feedback, and internal demos. This allows us to move from a raw API call to a deployable, shareable application quickly.

The application below details how to use Gradio to build a chat interface on top of an LLM client. It demonstrates a clear path for integrating an LLM, defining its persona via a system prompt, and exposing it for user interaction and quality assurance (QA).


Features

Using Gradio for LLM demos is about speed and simplicity, but it comes with limitations compared to full-stack frameworks.

| Feature | Gradio | Insight |
| --- | --- | --- |
| Development | Extremely fast | Ideal for prototyping and internal tools. |
| Customization | Low to Medium | Limited by pre-built components; less flexible than React/Vue. |
| State Management | Simple (Session-based) | Easy to maintain chat history within the session, as shown. |
| Deployment | Simple | Built-in methods for sharing links (.launch(share=True)). |
| Streaming | Supported | Requires generator functions (yield) or gr.ChatInterface for advanced chat flows. |
| Scalability | Poor | Not designed for high-traffic, production-level public APIs. |

Use Cases

Wrapping your LLM with a Gradio UI provides immediate value across the engineering lifecycle.

  • Rapid Prototyping: Instantly validate prompt engineering changes and model outputs without building a frontend.
  • Internal Demos & QA: Provide non-technical stakeholders (Product Managers, QA) an accessible way to test the model.
  • Feedback Loops: Collect user input and model responses easily for dataset refinement and debugging.
  • Model Agnosticism: The approach works with any model that exposes an OpenAI-compatible endpoint (OpenAI, Azure OpenAI, Ollama, vLLM), as sketched below.
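
As a minimal sketch of that agnosticism, the same client class can target any of these backends just by swapping base_url and api_key. The local URLs below are common defaults (they may differ in your setup), and the OPENAI_API_KEY variable is an assumed addition, not part of the setup that follows.

import os
from openai import OpenAI

# Local Ollama server exposing its OpenAI-compatible API (common default port)
ollama_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Local vLLM server exposing its OpenAI-compatible API (common default port)
vllm_client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hosted OpenAI: omit base_url to use the official endpoint with a real key
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))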

Application setup

We will use environment variables for secrets and configuration, which is standard engineering practice.

The openai library is used as the client, as its interface is a common standard across many local and remote LLM providers.

# .env file
# Dummy API Keys
TEST_API_KEY = "DUMMY-API-KEY-1234567890"

OLLAMA_BASE_URL = "http://localhost:11434/v1"
LLAMA3_3B = "llama3.2:3b"

# Python application code
import os
from dotenv import load_dotenv
from openai import OpenAI

import gradio as gr

# Load .env variables securely
load_dotenv()

# Configuration variables from environment
TEST_API_KEY = os.getenv("TEST_API_KEY")
OLLAMA_API_BASE_URL = os.getenv("OLLAMA_BASE_URL") # e.g., http://localhost:11434/v1
LLM_MODEL = os.getenv("LLAMA3_3B") # The specific model to use

# Basic check to ensure required configuration is present
if not all([TEST_API_KEY, OLLAMA_API_BASE_URL, LLM_MODEL]):
    raise ValueError("One or more required environment variables is not set.")
else:
    print(f"Config loaded. \nBASE_URL: {OLLAMA_API_BASE_URL}, \nMODEL: {LLM_MODEL} \nAPI Key: {TEST_API_KEY[:10]}")

# Initialize the OpenAI client. base_url allows connection to local/non-OpenAI providers.
client = OpenAI(base_url=OLLAMA_API_BASE_URL, api_key=TEST_API_KEY)

Defining the LLM Persona (Prompt)

The System Prompt defines the model’s behavior and constraints and controls the output quality and safety.

We initialize the PAYLOAD list with the System Prompt to ensure it is included in every API call to the LLM.

SYSTEM_PROMPT = """You are a helpful assistant that helps people find information.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
"""

# Initialize a list to hold the conversation history (context)
PAYLOAD = []

# The system message is the first, setting the context/persona
PAYLOAD.append({"role": "system", "content": SYSTEM_PROMPT})

The Chat API Function

This function is the central nervous system of our application: it handles all communication with the LLM.

The critical part here is context management, which ensures a natural conversation flow.

When a new request (input) is received, it is immediately appended to the global PAYLOAD list (which holds the entire conversation so far) with the role user.

The entire PAYLOAD (System Prompt plus all previous user and assistant messages) is then sent to the LLM, ensuring the model has full context.

Finally, the LLM's generated message is appended to the PAYLOAD list with the role assistant. This prepares the history for the user's next request, so the conversation flows naturally.

# The function Gradio will call
def chat(input: str) -> str:
    print(f"info: chat functions called. Input: {input}")
    
    # 1. Append user's message to conversation history
    PAYLOAD.append({"role": "user", "content": input})
    
    # 2. Call the LLM API
    response = client.chat.completions.create(
        model=LLM_MODEL,
        messages=PAYLOAD, 
        stream=False # We are not streaming the response in this simple example
    )
    
    # 3. Extract the response message
    msg = response.choices[0].message.content
    
    # 4. Append model's message to conversation history for next turn
    PAYLOAD.append({"role": "assistant", "content": msg})

    print(f"Current Conversation History: \n{PAYLOAD}")
    return msg
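
Before wiring this function into Gradio, it can be sanity-checked directly from a Python shell. A quick sketch, assuming the Ollama server is running and the configured model has been pulled:

# Manual smoke test of chat() and its context handling
print(chat("What is the capital of France?"))
print(chat("And roughly how many people live there?"))  # relies on the history kept in PAYLOAD
print(f"PAYLOAD now holds {len(PAYLOAD)} messages (1 system + 2 user + 2 assistant)")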

Building the Gradio Interface

We will define the UI components (e.g., Textbox) and then initialize gr.Interface, linking the chat function to it.

Gradio internally handles the web server, input/output serialization, and UI rendering automatically.

# Define the input and output components
input_box = gr.Textbox(label="Ask me anything", placeholder="Type your question here...", 
                       type="text", lines=3, max_lines=10, max_length=100, autofocus=True)
output_box = gr.Textbox(label="Response", placeholder="You haven't asked anything yet...", 
                        type="text", lines=10, autoscroll=False, show_copy_button=True)

# Create the Gradio Interface, mapping the function to inputs and outputs
llm_chatting = gr.Interface(fn=chat, 
                            inputs=input_box, 
                            outputs=output_box, 
                            title="LLM Chat Interface", 
                            description="A Gradio interface for Ollama/OpenAI LLM APIs", 
                            flagging_mode='never') # Disable built-in data logging

# Launch the Gradio interface
# To share externally for 72 hours, use: llm_chatting.launch(share=True)
llm_chatting.launch() 

Request Flow

This application follows a simple three-component architecture and maintains its state (the conversation history) in the Python process's memory.

  1. The User submits a query through the Gradio UI in their browser.
  2. The Gradio Server receives the HTTP request and calls the associated Python function, chat(input).
  3. The chat function appends the user message to the global PAYLOAD list, which acts as the conversation history.
  4. The OpenAI client sends the entire PAYLOAD (System + History) to the LLM API Server.
  5. The LLM API Server processes the request and returns the completion.
  6. The chat function appends the LLM’s response to PAYLOAD and returns the message string.
  7. The Gradio Server takes the returned string and renders it in the output text box for the User.
graph TD
    A[User Browser] -->|1. HTTP Request| B(Gradio Server);
    B -->|2. Calls Python Function 'chat'| C{Python Process};
    C -->|3. Appends User Message to History| C;
    C -->|4. Calls LLM API via OpenAI Client| D["LLM API Server (Ollama/OpenAI)"];
    D -->|5. LLM Response| C;
    C -->|6. Appends LLM Response to History| C;
    C -->|7. Returns Response String| B;
    B -->|8. HTTP Response| A;

Best Practices

If this application is meant for more than one user or needs to scale, it must be transitioned to a more robust setup.

Here are the initial transition steps to take.

| Practice | Rationale | Implementation Advice |
| --- | --- | --- |
| Manage Session State | The current global PAYLOAD list makes all users share one conversation, which is a severe flaw for multi-user apps. | Crucial: Migrate from the global list to a session-specific state manager like Gradio's gr.State or a separate cache (e.g., Redis) for persistence. |
| Use gr.ChatInterface | Manually using input/output boxes requires you to manage UI formatting and history logic; gr.ChatInterface streamlines this. | Use the dedicated gr.ChatInterface component. It automatically handles chat bubble styling, message order, and internal history management (see the sketch after this table). |
| Implement Robust Error Handling | External API calls can fail due to network issues, rate limits, or invalid parameters. A failure should not crash the user's session. | Wrap the core API call (client.chat.completions.create) in a try...except block. Catch exceptions like openai.APIError and return a clean, user-facing error message. |
| Enable Response Streaming | LLM latency is often high. Users hate waiting for a full, long response to appear all at once. | Set stream=True in the API call and use Gradio's generator functions (functions that yield partial results). This gives the user immediate visual feedback as tokens arrive, masking the underlying latency. |
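
Several of these practices fit together naturally. Below is a minimal sketch, not a drop-in replacement for the code above, assuming a recent Gradio release (gr.ChatInterface with type="messages") and reusing the client and LLM_MODEL defined earlier: per-session history, streaming via a generator, and a try/except around the API call.

import gradio as gr
from openai import APIError

SYSTEM_PROMPT = "You are a helpful assistant that helps people find information."

def chat_stream(message, history):
    # `history` is per-session and managed by gr.ChatInterface: no shared global list
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})

    try:
        # stream=True returns chunks; yielding partial text gives immediate feedback
        stream = client.chat.completions.create(
            model=LLM_MODEL, messages=messages, stream=True
        )
        partial = ""
        for chunk in stream:
            partial += chunk.choices[0].delta.content or ""
            yield partial
    except APIError as exc:
        # Return a clean, user-facing message instead of crashing the session
        yield f"Sorry, the model request failed: {exc}"

demo = gr.ChatInterface(fn=chat_stream, type="messages", title="LLM Chat Interface")
demo.launch()

Because the history is passed into the function on every request, each browser session keeps its own conversation, which removes the shared-state flaw of the global PAYLOAD list.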

Takeaways

You can rapidly deploy a powerful LLM interface using Python and Gradio. This approach prioritizes speed and iteration over heavy infrastructure.

  • Gradio is a Prototyping Accelerator: Use it to validate model output and prompt structure immediately.
  • System Prompt is the Persona: Define the model’s behavior explicitly in the SYSTEM_PROMPT to control quality.