AI-Powered Web Scraping
👉 Automate data extraction: Clean, process, and filter raw web content into structured data.
👉 Build production-ready data pipelines for LLM agents and RAG applications.
Table of Contents
- Table of Contents
- The Requirement
- Application Setup
- Website Content Scraping
- LLM Integration: Analyzing Website Info
- Summary Results
The Requirement
Let's build a project focused on fetching content from a website and then leveraging a Large Language Model (LLM), specifically a local LLM (frontier models such as Gemini or OpenAI GPT can be used as well), via the OpenAI client library to analyze and summarize that content.
FYI: This is a foundational step towards building LLM-powered applications.
Application Setup
Let’s begin by setting up the necessary Python libraries. Note the use of importlib.reload(scraper) to ensure that any changes made to the local scraper.py module are immediately reflected without needing to restart the notebook kernel.
Python Imports
import os
import importlib
# Scraper library is imported from the local scraper.py file
import scraper
# Reload the scraper module so local edits are picked up without restarting the kernel
importlib.reload(scraper)
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display
Environment and API Key Loading
We use the dotenv library to load environment variables, such as the OPENAI_API_KEY, from a local .env file. A basic check is included to confirm the API key is present and correctly formatted.
Note: If you are connecting to local Ollama models, set a dummy value for OPENAI_API_KEY in the .env file in the project root; for frontier models, use a real key from the provider's portal.
# .env file
# Dummy API key
OPENAI_API_KEY="sk-proj-dummy-key"
# Load environment variables from a .env file
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

# Check the key
if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")
Website Content Scraping
The local scraper module handles the task of retrieving website data. Two main functions are demonstrated: fetching the textual content and fetching all links.
Scraper Module (scraper.py)
from bs4 import BeautifulSoup
import requests

# Standard headers to mimic a real browser request
HEADERS = {
    # Windows
    # "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
    # Linux
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
}

# Cache URL
URL = None
# Store page content to avoid repeated fetches
SOUP = None
# Cache page title
PAGE_TITLE = None

def __fetch_website_contents(url):
    """
    Load the content of the website at the given url.
    """
    # Declare globals to update them
    global URL, SOUP, PAGE_TITLE
    # Retrieve the webpage content
    response = requests.get(url, headers=HEADERS)
    # Raise an error for bad responses
    response.raise_for_status()
    # Parse the HTML content using BeautifulSoup
    SOUP = BeautifulSoup(response.content, "html.parser")
    # Update the cached url
    URL = url
    # Cache the page title
    PAGE_TITLE = SOUP.title.string if SOUP.title else "No Title Found"

def fetch_text_contents(url):
    """
    Return the title and contents of the website at the given url,
    truncated to 2,000 characters as a sensible limit.
    """
    if SOUP is None or URL != url:
        __fetch_website_contents(url)
    # Read all text content from the webpage
    if SOUP and SOUP.body:
        # Remove unwanted contents
        for irrelevant in SOUP(['script', 'style', 'img', 'input']):
            irrelevant.decompose()
        content = SOUP.body.get_text(separator='\n', strip=True)
    else:
        content = "No Body Content Found"
    return (PAGE_TITLE + "\n\n" + content)[:2000]

def fetch_website_links(url):
    """
    Return all the links found on the webpage.
    """
    if SOUP is None or URL != url:
        __fetch_website_contents(url)
    if SOUP and SOUP.body:
        links = []
        for a_tag in SOUP.find_all('a', href=True):
            links.append(a_tag['href'])
        return links
    else:
        return []
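Because the module caches the last parsed page in URL, SOUP, and PAGE_TITLE, calling both public functions on the same URL triggers only one HTTP request. A minimal usage sketch (the URL here is just a placeholder):

```python
import scraper

# First call performs the HTTP GET and caches the parsed page
text = scraper.fetch_text_contents("https://example.com")
# Second call sees the same URL and reuses the cached SOUP - no extra request
links = scraper.fetch_website_links("https://example.com")

print(text[:200])
print(f"Found {len(links)} links")
```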
Fetch Text Content
The scraper.fetch_text_contents() function is used to retrieve the main text of the specified URL.
# Example for https://edwarddonner.com
response = scraper.fetch_text_contents("https://edwarddonner.com")
print(response)
Output Example (Truncated):
Home - Edward Donner
...
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production...
...
Fetch All Links
The scraper.fetch_website_links() function extracts all hyperlinks found on the given page.
# Example for https://edwarddonner.com
response = scraper.fetch_website_links("https://edwarddonner.com")
for i, link in enumerate(response, start=1):
    print(f"{i}. {link}")
Output Example (Truncated):
1. https://edwarddonner.com/
2. https://edwarddonner.com/connect-four/
3. https://edwarddonner.com/outsmart/
...
11. https://www.linkedin.com/in/eddonner/
...
LLM Integration: Analyzing Website Info
Let's dive into the core logic for integrating the scraped website content with an LLM for summarization.
LLM Architecture and Workflow
The overall process involves fetching data, constructing a payload with specific instructions (prompts), sending it to the LLM, and receiving a processed response.
sequenceDiagram
    participant User as User
    participant NB as Jupyter Notebook (Python)
    participant Scraper as Scraper Module
    participant Ollama as Ollama API (Local LLM)
    User->>NB: Call summarize(url)
    NB->>Scraper: fetch_text_contents(url)
    Scraper-->>NB: Website Content (Raw Text)
    NB->>NB: construct_payload(Content, Prompts)
    NB->>Ollama: POST /v1/chat/completions (Payload: System+User Prompt, Content)
    Ollama-->>NB: LLM Response (Summary)
    NB-->>User: Display Summary
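For reference, the client.chat.completions.create() call shown in the diagram boils down to a single POST against Ollama's OpenAI-compatible endpoint. A rough, hand-rolled equivalent using requests (model name and prompt text abbreviated; assumes Ollama is listening on its default port 11434):

```python
import requests

# What the OpenAI client effectively sends to the local Ollama server
payload = {
    "model": "llama3.2:3b",
    "messages": [
        {"role": "system", "content": "You are a snarky assistant..."},
        {"role": "user", "content": "Here are the contents of a website..."},
    ],
}

resp = requests.post("http://localhost:11434/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```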
Define Prompts
LLMs are best guided by a System Prompt (defining the role and tone) and a User Prompt (the specific task and input data).
| Prompt | Objective |
|---|---|
| System Prompt | Sets the LLM’s persona as a “snarky assistant” that provides a short, humorous summary in Markdown format, ignoring navigation text. |
| User Prompt | Provides the instructions to summarize the website contents and any news/announcements. |
# System Prompt
SYSTEM_PROMPT = """
You are a snarky assistant that analyses the contents of a website,
and provides a short, snarky, humorous summary, ignoring text that might be navigation related.
Respond in markdown format.
Do not wrap the markdown in code blocks - respond just with the markdown text.
"""

# User Prompt Prefix
user_prompt_prefix = """
Here are the contents of a website.
Provide a short summary of this website.
If it includes news or announcements, then summarize these too.
"""

# Function to construct the full message payload for the API call
def construct_payload(website_content):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt_prefix + website_content}
    ]
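A quick sanity check of the message structure produced by construct_payload (the website content here is a made-up placeholder):

```python
# Placeholder content, just to inspect the payload shape
messages = construct_payload("Example Site\n\nWelcome to the example homepage.")

for message in messages:
    print(f"{message['role']}: {message['content'][:60]}...")
# Expected: one 'system' message and one 'user' message ending with the content
```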
Setup OpenAI Client (Local LLM)
Let’s initialize the OpenAI client, but crucially, point it to a local Ollama server running on http://localhost:11434/v1 to utilize a local LLM like llama3.2:3b. This allows for local, private execution of LLM tasks.
ollama_base_url = "http://localhost:11434/v1"
# If this doesn't work, try changing it to "http://localhost:11434"

# Initialize the OpenAI client to connect to Ollama
client = OpenAI(base_url=ollama_base_url)

# Specify the local model to use
llama3 = "llama3.2:3b"
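Before summarizing anything, it can be worth confirming that the Ollama server is reachable and the model has been pulled. A small, optional check (assumes you have already run `ollama pull llama3.2:3b`):

```python
# List the models the local Ollama server exposes via its OpenAI-compatible API
available_models = [model.id for model in client.models.list().data]
print(available_models)

if llama3 not in available_models:
    print(f"Model '{llama3}' not found - pull it first with: ollama pull {llama3}")
```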
Core Summarization Function
The summarize function encapsulates the entire workflow: scrape, construct payload, and call the LLM endpoint.
def summarize(url):
    # 1. Fetch website content
    website_content = scraper.fetch_text_contents(url)
    # 2. Construct payload
    payload = construct_payload(website_content)
    # 3. Call the chat completions endpoint
    response = client.chat.completions.create(
        model=llama3,
        messages=payload
    )
    # 4. Extract and return the summary
    return response.choices[0].message.content

# Helper function to display the result nicely in a notebook (using IPython.display)
def display_summary(url):
    display(Markdown(summarize(url)))
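Scraping real websites fails for mundane reasons (timeouts, DNS errors, 403s from bot protection), so a thin guard around the call is useful in practice. A minimal sketch, not part of the original notebook:

```python
import requests

def safe_display_summary(url):
    """Display the summary, but fail gracefully if the site cannot be fetched."""
    try:
        display(Markdown(summarize(url)))
    except requests.exceptions.RequestException as error:
        print(f"Could not fetch {url}: {error}")
```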
Summary Results
The final step is running the summarization function on various websites to see the “snarky assistant” in action.
Summarize: https://edwarddonner.com
display_summary("https://edwarddonner.com")
LLM Output:
The website is like its founder: quirky, enthusiastic, and all about artificial intelligence. Ed Donner, the “well, hi there” guy behind Nebula.io, wants to share his passion for LLMs (Large Language Models) with you, from writing code to DJing (badly). This site discusses AI applications in HR, recruitment, and education, as well as Ed’s own projects and ventures. Major news announcements include upcoming conferences, like “AI in Production: Gen AI and Agentic AI on AWS at scale” scheduled for September 2025.
Summarize: https://cnn.com
display_summary("https://cnn.com")
LLM Output:
Summary of CNN Website
A never-ending parade of links to various news topics, articles, and videos. Because who doesn’t love the thrill of clicking on “World” only to find out it’s just a page full of other pages?
Breaking News/Announcements
- A non-descript section with a prompt asking users for feedback on ads that didn’t quite agree with them.
- The obligatory “Your effort and contribution is appreciated” message, because someone has to keep the interns happy.
- A gazillion links under “News”, “World”, “Politics”, etc., promising to bring you the latest updates on Ukraine-Russia War, Israel-Hamas War, and other similarly-exciting topics.
In Conclusion
CNN: because you deserve to know everything happening in the world, no matter how dry or uninteresting.
Project Reference
Refer to the application source code on GitHub - Web Scraper App.