AI-Powered Web Scraping

👉 Automate data extraction: Clean, process, and filter raw web content into structured data.

👉 Build production-ready data pipelines for LLM agents and RAG applications.


The Requirement

Let's build a project focused on fetching content from a website and then leveraging a Large Language Model (LLM), specifically a local LLM (frontier models such as Gemini or OpenAI GPT can be used as well), via the OpenAI client library to analyze and summarize that content.

FYI: This is a foundational step towards building LLM-powered applications.


Application Setup

Let’s begin by setting up the necessary Python libraries. Note the use of importlib.reload(scraper) to ensure that any changes made to the local scraper.py module are immediately reflected without needing to restart the notebook kernel.

Python Imports

import os
import importlib

# Scraper module is imported from the local scraper.py file
import scraper

# Reload to pick up any edits to scraper.py without restarting the kernel
importlib.reload(scraper)

from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

Environment and API Key Loading

We use the dotenv library to load environment variables, such as the OPENAI_API_KEY, from a local .env file. A basic check is included to confirm the API key is present and correctly formatted.

Note: If connecting to local Ollama models, set a dummy value for OPENAI_API_KEY in the .env file in the project root directory; for frontier models, get a real key from the provider's portal.

    # .env file

    # Dummy API Keys
    OPENAI_API_KEY = "sk-proj-dummy-key"

# Load environment variables from a .env file
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')


# Check the key
if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
elif api_key.strip() != api_key:
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

Website Content Scraping

The local scraper module handles the task of retrieving website data. Two main functions are demonstrated: fetching the textual content and fetching all links.

Scraper Function

from bs4 import BeautifulSoup
import requests


# Standard headers to mimic a real browser request
HEADERS = {
    # Windows
    # "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"

    # Linux
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
}

# Cache URL 
URL = None

# Store page content to avoid repeated fetches
SOUP = None

# Cache page title
PAGE_TITLE = None



def __fetch_website_contents(url):
    """
    Load the content of the website at the given url.
    """

    # Declare globals to update them
    global URL, SOUP, PAGE_TITLE


    # Retrieve the webpage content
    response = requests.get(url, headers=HEADERS)

    # Raise an error for bad responses
    response.raise_for_status()

    # Parse the HTML content using BeautifulSoup
    SOUP = BeautifulSoup(response.content, "html.parser")

    # Update new url
    URL = url

    # Cache the page title
    PAGE_TITLE = SOUP.title.string if SOUP.title else "No Title Found"

    

def fetch_text_contents(url):
    """
    Return the title and contents of the website at the given url,
    truncate to 2,000 characters as a sensible limit
    """

    if SOUP is None or URL != url:
        __fetch_website_contents(url)
    
    # Read all text content from the webpage
    if SOUP and SOUP.body:
        # Remove unwanted contents
        for irrelevant in SOUP(['script', 'style', 'img', 'input']):
            irrelevant.decompose()
        content = SOUP.body.get_text(separator='\n', strip=True)
    else:
        content = "No Body Content Found"

    return (PAGE_TITLE + "\n\n" + content)[:2000]



def fetch_website_links(url):
    """
    Return all the links found on the webpage.
    """

    if SOUP is None or URL != url:
        __fetch_website_contents(url)

    if SOUP and SOUP.body:
        links = []
        for a_tag in SOUP.find_all('a', href=True):
            links.append(a_tag['href'])
        
        return links
    else:
        return []
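
As a quick illustration of the module-level caching above (a sketch that assumes the code is saved as scraper.py), calling both public functions with the same URL triggers only one HTTP request; the second call reuses the cached SOUP and PAGE_TITLE:

import scraper

# First call performs the HTTP request and caches the parsed page
text = scraper.fetch_text_contents("https://edwarddonner.com")

# Same URL, so the cached SOUP is reused and no new request is made
links = scraper.fetch_website_links("https://edwarddonner.com")

# The page title is also cached at module level
print(scraper.PAGE_TITLE)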

Fetch Text Content

The scraper.fetch_text_contents() function is used to retrieve the main text of the specified URL.

# Example for https://edwarddonner.com
response = scraper.fetch_text_contents("https://edwarddonner.com")
print(response)

Output Example (Truncated):

Home - Edward Donner
...
Well, hi there.
I’m Ed. I like writing code and experimenting with LLMs, and hopefully you’re here because you do too. I also enjoy DJing (but I’m badly out of practice), amateur electronic music production...
...

Fetch Website Links

The scraper.fetch_website_links() function extracts all hyperlinks found on the given page.

# Example for https://edwarddonner.com
response = scraper.fetch_website_links("https://edwarddonner.com")

# Print the links with a running index
for i, link in enumerate(response, start=1):
    print(f"{i}. {link}")

Output Example (Truncated):

1. https://edwarddonner.com/
2. https://edwarddonner.com/connect-four/
3. https://edwarddonner.com/outsmart/
...
11. https://www.linkedin.com/in/eddonner/
...

LLM Integration: Analyzing Website Info

Let’s now dive into the core logic for integrating the scraped website content with an LLM for summarization.

LLM Architecture and Workflow

The overall process involves fetching data, constructing a payload with specific instructions (prompts), sending it to the LLM, and receiving a processed response.

sequenceDiagram
    participant User as User
    participant NB as Jupyter Notebook (Python)
    participant Scraper as Scraper Module
    participant Ollama as Ollama API (Local LLM)
    User->>NB: Call summarize(url)
    NB->>Scraper: fetch_text_contents(url)
    Scraper-->>NB: Website Content (Raw Text)
    NB->>NB: construct_payload(Content, Prompts)
    NB->>Ollama: POST /v1/chat/completions (Payload: System+User Prompt, Content)
    Ollama-->>NB: LLM Response (Summary)
    NB-->>User: Display Summary

Define Prompts

LLMs are best guided by a System Prompt (defining the role and tone) and a User Prompt (the specific task and input data).

  • System Prompt: Sets the LLM’s persona as a “snarky assistant” that provides a short, humorous summary in Markdown format, ignoring navigation text.
  • User Prompt: Provides the instructions to summarize the website contents and any news/announcements.

# System Prompt
SYSTEM_PROMPT = """
    You are a snarky assistant that analyses the contents of a website,
    and provides a short, snarky, humorous summary, ignoring text that might be navigation related.
    Respond in markdown format.
    Do not wrap the markdown in code blocks - respond just with the markdown text.
"""

# User Prompt Prefix
user_prompt_prefix = """
    Here are the contents of a website.
    Provide a short summary of this website.
    If it includes news or announcements, then summarize these too.
"""

# Function to construct the full payload for the API call
def construct_payload(website_content):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role":"user", "content": user_prompt_prefix + website_content}
    ]
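
To see the two-message structure that construct_payload() produces, it can be printed for a short placeholder string (the string below is purely illustrative, not real scraped content):

# Inspect the role/content structure sent to the chat completions endpoint
example_payload = construct_payload("Example website text goes here.")
for message in example_payload:
    print(message["role"], "->", message["content"].strip()[:60], "...")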

Setup OpenAI Client (Local LLM)

Let’s initialize the OpenAI client, but crucially, point it to a local Ollama server running on http://localhost:11434/v1 to utilize a local LLM like llama3.2:3b. This allows for local, private execution of LLM tasks.

ollama_base_url = "http://localhost:11434/v1"
# If this doesn't work, try changing it to "http://localhost:11434".

# Initialize OpenAI client to connect to Ollama
client = OpenAI(base_url = ollama_base_url)

# Specify the local model to use
llama3 = "llama3.2:3b" 
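
Before wiring up the full pipeline, a quick round-trip can confirm the local endpoint responds. This is a minimal check that assumes Ollama is running locally and the llama3.2:3b model has already been pulled (e.g. via ollama pull llama3.2:3b):

# Minimal test call against the local Ollama endpoint
test_response = client.chat.completions.create(
    model=llama3,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}]
)
print(test_response.choices[0].message.content)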

Core Summarization Function

The summarize function encapsulates the entire workflow: scrape, construct payload, and call the LLM endpoint.

def summarize(url):
    # 1. Fetch website content
    website_content = scraper.fetch_text_contents(url)

    # 2. Construct payload
    payload = construct_payload(website_content)

    # 3. Call the chat completions endpoint
    response = client.chat.completions.create(
        model= llama3,
        messages=payload
    )

    # 4. Extract and return the summary
    return response.choices[0].message.content

# Helper function to display the result nicely in a notebook (using IPython.display)
def display_summary(url):
    display(Markdown(summarize(url)))

Summary Results

The final step is running the summarization function on various websites to see the “snarky assistant” in action.

Summarize: https://edwarddonner.com

display_summary("https://edwarddonner.com")

LLM Output:

The website is like its founder: quirky, enthusiastic, and all about artificial intelligence. Ed Donner, the “well, hi there” guy behind Nebula.io, wants to share his passion for LLMs (Large Language Models) with you, from writing code to DJing (badly). This site discusses AI applications in HR, recruitment, and education, as well as Ed’s own projects and ventures. Major news announcements include upcoming conferences, like “AI in Production: Gen AI and Agentic AI on AWS at scale” scheduled for September 2025.

Summarize: https://cnn.com

display_summary("https://cnn.com")

LLM Output:

Summary of CNN Website

A never-ending parade of links to various news topics, articles, and videos. Because who doesn’t love the thrill of clicking on “World” only to find out it’s just a page full of other pages?

Breaking News/Announcements

  • A non-descript section with a prompt asking users for feedback on ads that didn’t quite agree with them.
  • The obligatory “Your effort and contribution is appreciated” message, because someone has to keep the interns happy.
  • A gazillion links under “News”, “World”, “Politics”, etc., promising to bring you the latest updates on Ukraine-Russia War, Israel-Hamas War, and other similarly-exciting topics.

In Conclusion

CNN: because you deserve to know everything happening in the world, no matter how dry or uninteresting.

Project Reference
Refer to the application source code on GitHub - Web Scraper App.