Generate Themes from a Website
Example workflow for scraping a website and generating themes using OLLM.
In this example, we build a simple workflow that analyzes a company’s website and generates high-level business themes using OLLM.
The workflow works as follows:
- Read the website’s `sitemap.xml`
- Identify relevant pages (such as `/services`, `/product`, or `/platform`)
- Scrape the textual content from those pages
- Send the combined content to an OLLM model
- Generate structured themes based on the site’s messaging
This type of workflow is commonly used for:
- Competitive analysis
- Market positioning research
- Automated website audits
- Internal strategy research
The only AI component in this pipeline is the theme generation step. All other steps are standard web data extraction.
Step 1: Read the Sitemap and Identify Relevant Pages
Most websites expose a sitemap.xml file that lists all indexable pages. Instead of scraping the entire domain blindly, we first read the sitemap and extract only pages relevant to product or service messaging.
```python
import requests
import xml.etree.ElementTree as ET

def get_relevant_urls(sitemap_url):
    response = requests.get(sitemap_url)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    urls = []
    for url in root.findall(".//{*}loc"):
        link = url.text
        # Guard against empty <loc> elements before substring matching
        if link and any(path in link for path in ["/services", "/product", "/platform"]):
            urls.append(link)
    return urls
```

This ensures we focus only on pages that describe what the company offers, rather than blog posts or legal pages.
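Because `get_relevant_urls` depends on a live network request, it can help to verify the filtering logic against an in-memory sitemap first. The sketch below (the sample URLs are invented for illustration) applies the same path filter to a small XML string:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample sitemap, for illustration only.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product/analytics</loc></url>
  <url><loc>https://example.com/blog/hiring</loc></url>
  <url><loc>https://example.com/services/consulting</loc></url>
</urlset>"""

def filter_urls(sitemap_xml, paths=("/services", "/product", "/platform")):
    # Same filtering logic as get_relevant_urls, but applied to an
    # in-memory XML string so it can be checked without a network call.
    root = ET.fromstring(sitemap_xml)
    return [
        url.text
        for url in root.findall(".//{*}loc")
        if url.text and any(p in url.text for p in paths)
    ]

print(filter_urls(SAMPLE_SITEMAP))
# ['https://example.com/product/analytics', 'https://example.com/services/consulting']
```

The `{*}` wildcard in `findall` matches any XML namespace, so the same code works whether or not the sitemap declares the standard `sitemaps.org` namespace.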
Step 2: Scrape Page Content
Once we have the relevant URLs, we extract the visible text from each page.
```python
from bs4 import BeautifulSoup

def scrape_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    return " ".join(text.split())
```

In a production environment, you may want to remove navigation menus, footers, or repeated elements. For simplicity, this example extracts full page text.
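As noted, production scrapers usually strip navigation and footer markup before analysis. With BeautifulSoup you could call `.decompose()` on those tags; the sketch below shows the same idea using only the standard library's `html.parser`, so it runs without extra dependencies (the tag list is an assumption you would tune per site):

```python
from html.parser import HTMLParser

# Elements whose text is usually boilerplate rather than messaging.
SKIP_TAGS = {"nav", "footer", "header", "script", "style"}

class VisibleTextParser(HTMLParser):
    """Collects page text while skipping boilerplate elements like <nav>."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_main_text(html_text):
    parser = VisibleTextParser()
    parser.feed(html_text)
    return " ".join(" ".join(parser.chunks).split())

page = "<nav>Home About</nav><main><h1>Our Platform</h1><p>We build tools.</p></main><footer>© 2024</footer>"
print(extract_main_text(page))
# Our Platform We build tools.
```

The depth counter handles nested skipped elements (for example a `<nav>` inside a `<header>`) so text only passes through when the parser is outside all of them.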
Step 3: Generate Themes Using OLLM
Now that we have the combined website content, we send it to OLLM for analysis.
OLLM is OpenAI-compatible, so we can use the official OpenAI SDK by setting the base_url to the OLLM endpoint.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ollm.com/v1",
    api_key="your-api-key"
)

def generate_themes(content):
    response = client.chat.completions.create(
        model="near/GLM-4.6",
        messages=[
            {
                "role": "system",
                "content": "You are an analyst extracting high-level business themes from website content."
            },
            {
                "role": "user",
                "content": f"""
Analyze the following website content and extract:
1. Core product or service themes
2. Target audience segments
3. Key value propositions
4. Repeated messaging patterns

Website content:
{content}
"""
            }
        ]
    )
    return response.choices[0].message.content
```

This call sends the scraped website data to the selected model (`near/GLM-4.6`). The model analyzes the text and returns structured thematic insights.
Step 4: Full Workflow Example
Below is a simplified end-to-end example combining all steps.
```python
def run_analysis():
    sitemap_url = "https://example.com/sitemap.xml"
    urls = get_relevant_urls(sitemap_url)

    combined_content = ""
    for url in urls:
        page_text = scrape_page(url)
        combined_content += "\n\n" + page_text

    themes = generate_themes(combined_content)
    print("Generated Themes:\n")
    print(themes)

run_analysis()
```

When executed, this script will:
- Identify relevant product/service pages
- Extract their textual content
- Send the content to OLLM
- Print the generated themes
Expected Output
The model will typically return structured insights such as:
- Primary product categories
- Core differentiators
- Messaging consistency
- Target industries or user personas
You can optionally post-process this output into JSON if you require structured downstream usage.
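One way to do this (a general prompting pattern, not an OLLM-specific feature) is to ask the model for JSON in the prompt and then parse the reply defensively, since models sometimes wrap JSON in Markdown code fences. A minimal sketch, with a hypothetical model reply for illustration:

```python
import json
import re

def parse_theme_json(model_output):
    # Models sometimes wrap JSON in ```json fences; strip them before parsing.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", model_output.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # fall back to treating the reply as free text

# Hypothetical model reply, for illustration.
reply = '```json\n{"themes": ["analytics", "automation"]}\n```'
print(parse_theme_json(reply))
# {'themes': ['analytics', 'automation']}
```

Returning `None` on parse failure lets the caller decide whether to retry the request or keep the raw text.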
Production Considerations
When applying this workflow in a real system:
- Limit or chunk large content to avoid excessive token usage
- Validate HTTP status before reading model output
- Record token usage (`response.usage.total_tokens`) for cost tracking
- Handle network failures and timeouts gracefully
- Avoid scraping websites that prohibit automated access
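The first point above can be sketched as a simple character-based chunker. The 8,000-character limit here is an arbitrary placeholder, not an OLLM limit; you would size it to your model's context window:

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most max_chars, breaking on the last
    whitespace before the limit so words are not cut in half."""
    chunks = []
    while len(text) > max_chars:
        split_at = text.rfind(" ", 0, max_chars)
        if split_at <= 0:
            split_at = max_chars  # no space found; hard cut
        chunks.append(text[:split_at])
        text = text[split_at:].lstrip()
    if text:
        chunks.append(text)
    return chunks

sample = "word " * 5000  # roughly 25,000 characters of content
parts = chunk_text(sample.strip(), max_chars=8000)
print(len(parts), max(len(p) for p in parts) <= 8000)
# 4 True
```

Each chunk can then be passed to `generate_themes` separately, with the per-chunk results merged in a final summarization call.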
Because OLLM processes all inference inside Trusted Execution Environments (TEEs), the scraped website content is analyzed within a confidential computing boundary.