Brochure Generator
👉 Step-by-step: Building an Intelligent Brochure Generator with LLMs.
👉 Engineering guide to web scraping, JSON mode, and structured generation.
Table of Contents
- The Business Use Case
- System Architecture
- Application Setup
- Intelligent Link Analysis
- Content Aggregation and Brochure Generation
- Key Takeaways
The Business Use Case
In the era of Generative AI, converting unstructured web data into structured marketing collateral is a high-value use case. The Brochure Generator application addresses a common enterprise challenge: synthesizing dispersed company information into a cohesive narrative.
The Core Requirement is to generate a polished brochure given only a company’s name and URL.
Unlike simple summarization tools, this application must exercise judgment:
- **Distinguish** between high-value pages (About, Careers, Products) and low-value pages (Privacy Policy, Terms).
- **Normalize** navigation links, converting relative paths to absolute URLs (see the sketch below).
- **Synthesize** diverse data points into a specific tone (humorous, professional, etc.).
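In this build we delegate the normalization step to the LLM, but it can also be done deterministically with the standard library. A minimal sketch using `urllib.parse.urljoin` (the domain and paths here are illustrative):

```python
from urllib.parse import urljoin

# Resolve relative paths against the site's base domain;
# already-absolute URLs pass through unchanged.
base = "https://companydomain.com"
links = ["/about", "/careers", "https://companydomain.com/pricing"]
absolute = [urljoin(base, link) for link in links]
print(absolute)
# ['https://companydomain.com/about', 'https://companydomain.com/careers',
#  'https://companydomain.com/pricing']
```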
System Architecture
The application follows a sequential “Scrape-Filter-Generate” pipeline. This separates the noise (irrelevant links) from the signal (content pages) before the final generation step, optimizing token usage and output quality.
| Stage | Task |
|---|---|
| Input | User provides a Company Name and URL. |
| Extraction | Scraper fetches all links from the landing page. |
| Filtering (LLM 1) | The model analyzes links for relevance and formats them as JSON. |
| Aggregation | Scraper fetches text content from the filtered list. |
| Generation (LLM 2) | A creative model compiles the aggregated text into a Markdown brochure. |
The Request Flow:
```mermaid
sequenceDiagram
    participant User
    participant App
    participant Scraper
    participant LLM_Link as LLM (Link Analyst)
    participant LLM_Brochure as LLM (Creative Writer)
    User->>App: Input (Company URL)
    App->>Scraper: fetch_website_links(URL)
    Scraper-->>App: Raw List of Links
    App->>LLM_Link: Prompt + Raw Links (Request JSON)
    LLM_Link-->>App: Structured "Relevant Links" JSON
    loop For each relevant link
        App->>Scraper: fetch_text_contents(Link)
    end
    Scraper-->>App: Aggregated Text Content
    App->>LLM_Brochure: Prompt + Aggregated Content
    LLM_Brochure-->>App: Final Brochure (Markdown)
    App->>User: Display Brochure
```
Application Setup
To ensure scalability and model agnosticism, this solution uses the openai Python client, which serves as a standard interface for various backends, including OpenAI’s GPT models, Google’s Gemini, or self-hosted Ollama instances.
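For example, switching backends is just a matter of changing the key and `base_url`; the Gemini and Ollama endpoints below are their published OpenAI-compatible URLs (shown here as an illustration, not part of the article's setup code):

```python
from openai import OpenAI

# One client class, three interchangeable backends.
openai_client = OpenAI(api_key="sk-...")  # defaults to api.openai.com

gemini_client = OpenAI(
    api_key="AIza...",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

ollama_client = OpenAI(
    api_key="ollama",  # Ollama ignores the key, but the client requires a value
    base_url="http://localhost:11434/v1",
)
```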
Required Libraries
We use dotenv to keep API keys out of the source code and IPython for rich Markdown display within Jupyter notebooks.
```python
import os
import json

from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display
from openai import OpenAI

# Local module import (assumes a scraper.py file exists in the directory)
import scraper
```
Configuration and Client Initialization
Securely managing API keys is critical. We load these from a .env file. The code below demonstrates a basic validation step to ensure keys are present and of plausible length before initializing the client.
```python
load_dotenv()

# Check API keys
api_key = os.getenv("OPENAI_API_KEY")
gemini_key = os.getenv("GEMINI_API_KEY")

if api_key and gemini_key and len(api_key) > 10 and len(gemini_key) > 10:
    print("API keys look good.")
    print(f"Gemini Key: {gemini_key[:10]}...,\nOpenAI Key: {api_key[:10]}...")
else:
    print("No API key found. Please set OPENAI_API_KEY and GEMINI_API_KEY in your .env file.")

# Construct client (configured for Ollama or compatible endpoints)
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL")
ollama_client = OpenAI(api_key=api_key, base_url=OLLAMA_BASE_URL)
```
Output:

```text
API keys look good.
Gemini Key: AIzaSyCdDv...,
OpenAI Key: sk-proj-qr...
```
Intelligent Link Analysis
The first engineering hurdle is noise reduction. A raw scrape of a modern website often yields dozens of links—navigational elements, social media footers, and legal disclaimers. Feeding all of these into a context window is inefficient and costly.
The Challenge of Relevance
We need an intelligent filter, so we use an LLM to inspect the scraped URLs and decide which are semantically relevant to a brochure.
Prompt Engineering: One-Shot and JSON Mode
We employ One-shot Prompting. By providing a single, concrete example of the desired JSON output within the system prompt, we significantly increase the model’s adherence to the schema.
System Prompt:
Note how the prompt explicitly asks for a JSON structure with specific keys (name, url, reason). Asking for a "reason" adds a light chain-of-thought element, forcing the model to justify each selection and often improving accuracy.
```python
LINK_ANALYSIS_SYSTEM_PROMPT = """
You are a helpful assistant that helps to analyse website links and convert relative links to absolute links based on the main domain.
You are able to decide which of the links would be most relevant to include in a brochure about the company, such as links to an About
page, or a Company page, or a Career/Jobs page.
You should respond in JSON format with the following structure:
{
    "relevant_links": [
        {"name": "about page", "url": "https://companydomain.com/about", "reason": "This page tells about the company mission and values."},
        {"name": "careers page", "url": "https://companydomain.com/careers", "reason": "This page provides information about job opportunities."}
    ]
}
"""
```
User Prompt Construction:
The user prompt dynamically inserts the scraped links and the domain URL. It also explicitly forbids specific categories (Privacy Policy, Terms) to save tokens.
```python
# Build the user prompt by including all the links extracted from the website
def get_links_user_prompt(url):
    user_prompt = f"""
Here is a list of links extracted from the website {url}:
Please analyse and decide which of these links are relevant web links for a brochure about the company.
Replace any relative links with absolute links based on the main domain {url}.
Do not include Terms of Service, Privacy Policy, Cookie Policy, or email links.
# Links (some might be relative links):
"""
    links = scraper.fetch_website_links(url)
    user_prompt += "\n".join(links)
    return user_prompt

# Example usage to see the raw input
print(get_links_user_prompt("https://edwarddonner.com"))
```
Output (Raw Links):
```text
Here is a list of links extracted from the website https://edwarddonner.com:
Please analyse and decide which of these links are relevant web links for a brochure about the company.
Replace any relative links with absolute links based on the main domain https://edwarddonner.com.
Do not include Terms of Service, Privacy Policy, Cookie Policy, or email links.
# Links (some might be relative links):
/
/about-me-and-about-nebula/
/connect-four/
/2025/09/15/ai-in-production-gen-ai-and-agentic-ai-on-aws-at-scale/
...
/privacy-policy
/terms-of-service
```
Implementation and Testing
The function select_relevant_links integrates the LLM. Crucially, we set response_format={"type": "json_object"}.
This is a best practice when working with newer models (like GPT-4o or Gemini Flash) to ensure the output is machine-parseable.
```python
# Integrate with the LLM to select relevant links
def select_relevant_links(base_url, model_name, api_key, website_url):
    client = OpenAI(api_key=api_key, base_url=base_url)
    print(f"Using model: {model_name}")
    payload = [
        {"role": "system", "content": LINK_ANALYSIS_SYSTEM_PROMPT},
        {"role": "user", "content": get_links_user_prompt(website_url)},
    ]
    # Instruct the model to respond in JSON format
    json_resp_format = {"type": "json_object"}
    response = client.chat.completions.create(model=model_name, messages=payload, response_format=json_resp_format)
    result = response.choices[0].message.content
    links = json.loads(result)
    return links
```
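Even with JSON mode enabled, defensive parsing is cheap insurance against a malformed reply or schema drift. A hardening sketch (the fallback policy here is an assumption, not part of the original code):

```python
import json

def parse_links_response(raw: str) -> dict:
    """Parse the model's JSON reply, falling back to an empty result on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"relevant_links": []}  # malformed reply: proceed with no sub-pages
    if not isinstance(data.get("relevant_links"), list):
        return {"relevant_links": []}  # schema drift: expected key missing or wrong type
    return data
```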
Testing with Google Gemini:
Let’s test our application logic so far against a real website, e.g. huggingface.co.
```python
base_url = os.getenv("GEMINI_BASE_URL")
gemini_model = os.getenv("GEMINI_FM")  # e.g., gemini-2.5-flash
gemini_api_key = os.getenv("GEMINI_API_KEY")
website_url = "https://huggingface.co"

select_relevant_links(base_url=base_url, model_name=gemini_model, api_key=gemini_api_key, website_url=website_url)
```
Model Output (JSON):
Note how the model correctly identified the "Enterprise", "Pricing", and "Careers" pages while ignoring irrelevant footer links.
```text
Using model: gemini-2.5-flash
{
  "relevant_links": [
    {
      "name": "Home page",
      "url": "https://huggingface.co/",
      "reason": "Serves as the main landing page and general overview of the company."
    },
    {
      "name": "Enterprise solutions page",
      "url": "https://huggingface.co/enterprise",
      "reason": "Provides information on solutions tailored for businesses and organizations."
    },
    {
      "name": "Pricing page",
      "url": "https://huggingface.co/pricing",
      "reason": "Details the costs and plans for various services and products offered by the company."
    },
    ...
    {
      "name": "Spaces page",
      "url": "https://huggingface.co/spaces",
      "reason": "Demonstrates the company's platform for building and sharing ML applications."
    }
  ]
}
```
Content Aggregation and Brochure Generation
With the relevant URLs identified, we move to the content generation phase. This involves scraping the actual text from the filtered links and feeding it into a “Writer” LLM.
Handling Context Windows
A common issue in RAG (Retrieval-Augmented Generation) and content summarization is the context window limit.
If we scrape too many pages, the prompt may exceed the model's input token limit.
In the implementation below, we truncate the final aggregated user prompt to 5,000 characters (roughly 1,250 tokens at the common ~4 characters-per-token heuristic). In a production environment, the truncation policy should follow the business requirements, and more sophisticated chunking or summarization strategies may be needed; one simple alternative is sketched after the implementation below.
```python
# Scrape the landing page plus all relevant sub-pages selected by the model
def fetch_page_and_all_relevant_links_fmodel(api_key, base_url, model_name, website_url):
    # 1. Scrape the main landing page
    content = scraper.fetch_text_contents(website_url)
    # 2. Get the list of relevant sub-pages via LLM
    relevant_links = select_relevant_links(base_url=base_url, model_name=model_name, api_key=api_key, website_url=website_url)
    # 3. Aggregate content
    result = f"## Landing Page: \n\n{content}\n---\n## Relevant Links:\n\n"
    for link in relevant_links["relevant_links"]:
        result += f"\n\n### Link: {link['name']}\n"
        result += scraper.fetch_text_contents(link["url"])
    return result
```
```python
# Build the full user prompt content (with safety truncation)
def get_brochure_user_prompt_frontier_models(company_name, website_url):
    user_prompt = f"""
You are looking at a company called: {company_name}
Here's the contents of its landing page and other relevant pages;
use this information to build a short brochure of the company in markdown format without code blocks.

"""
    # Fetch content (the Gemini key must match the Gemini base URL set earlier)
    user_prompt += fetch_page_and_all_relevant_links_fmodel(api_key=gemini_api_key, base_url=base_url, model_name=gemini_model, website_url=website_url)
    # Truncate to 5,000 characters to fit the context window
    user_prompt = user_prompt[:5_000]
    return user_prompt
```
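A slightly smarter alternative to the hard `[:5_000]` cut is a per-page character budget, so the landing page cannot crowd every sub-page out of the prompt. A sketch (the equal split is an assumption, not the article's implementation):

```python
def aggregate_with_budget(pages: dict[str, str], total_chars: int = 5_000) -> str:
    """Give each scraped page an equal slice of the character budget.

    `pages` maps a page name to its scraped text. The split is naive, but it
    guarantees every relevant page contributes something to the final prompt.
    """
    per_page = total_chars // max(len(pages), 1)
    sections = [f"### {name}\n{text[:per_page]}" for name, text in pages.items()]
    return "\n\n".join(sections)
```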
The Creative System Prompt
Let’s define a distinct persona for the brochure generation step. Here, we ask for a "humorous, entertaining, witty" tone. This demonstrates how changing the system prompt can drastically alter the “voice” of the application without changing the underlying code.
```python
BROCHURE_GENERATOR_SYSTEM_PROMPT = """
You are an assistant that analyzes the contents of several relevant pages from a company website
and creates a short, humorous, entertaining, witty brochure about the company for prospective customers, investors and recruits.
Respond in markdown without code blocks.
Include details of company culture, customers and careers/jobs if you have the information.
"""
```
Final Output Generation
Finally, we combine the persona (System Prompt) and the aggregated data (User Prompt) to generate the brochure.
```python
def create_brochure(company_name, website_url):
    payload = [
        {"role": "system", "content": BROCHURE_GENERATOR_SYSTEM_PROMPT},
        {"role": "user", "content": get_brochure_user_prompt_frontier_models(company_name, website_url)},
    ]
    # We want text output here, not JSON
    text_resp_format = {"type": "text"}
    # Use the Gemini key to match the Gemini base URL configured earlier
    client = OpenAI(api_key=gemini_api_key, base_url=base_url)
    response = client.chat.completions.create(model=gemini_model, messages=payload, response_format=text_resp_format)
    brochure = response.choices[0].message.content
    display(Markdown(brochure))

# Execute the application
create_brochure("HuggingFace", "https://huggingface.co")
```
Generated Brochure Output:
The model synthesizes the technical details of Hugging Face into a readable, structured marketing document.
Hugging Face: The Home of Machine Learning - Building the Future of AI, Together
Hugging Face is the leading platform and vibrant community where the world’s machine learning experts…
What We Offer
- Models: Explore and utilize over 1 million pre-trained models…
- Datasets: Access a vast collection of over 250,000 datasets…
- Spaces: Launch and experiment with over 400,000 interactive AI applications…
For Our Customers & Partners
- Team & Enterprise Solutions: Accelerate your organization’s AI initiatives…
Join Our Journey
While specific career opportunities are not detailed in this brochure, Hugging Face is continuously expanding…
Project Reference
Refer to the application source code on GitHub: Brochure Generator App.
Key Takeaways
This solution demonstrates a powerful pattern in LLM Engineering: Chaining. Instead of asking one model to “browse this site and make a brochure,” which is prone to hallucination and context-limit failures, we broke the problem down:
- LLM as a Router: We used the first LLM call solely for logic and decision-making (Link Analysis). By enforcing JSON output, we made this step machine-parseable and reliable.
- LLM as a Creator: We used the second LLM call for synthesis and creativity, providing it with the high-quality context filtered by step 1.
- Hybrid Approach: We combined traditional coding (Python web scraping) with AI reasoning to solve a problem that neither could solve easily on its own.
Next Steps:
- Enhance Scraper: Improve the `scraper` module to handle JavaScript-heavy sites (e.g., using Selenium or Playwright); see the sketch below.
- Chunking: Implement better text chunking instead of hard truncation (`[:5000]`) to ensure no critical info is lost.
- Multi-Modal: Incorporate image analysis to describe the visual style of the website in the brochure.
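For the scraper upgrade, a minimal Playwright sketch that returns fully rendered HTML where a plain HTTP fetch would see an empty shell (assumes `pip install playwright` followed by `playwright install chromium`):

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Fetch page HTML after JavaScript execution, for client-rendered sites."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for network activity to settle
        html = page.content()
        browser.close()
    return html
```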