Generate Themes from a Website
Example workflow for scraping a website and generating themes using OLLM.
In this example, we build a simple workflow that analyzes a company’s website and generates high-level business themes using OLLM.
The workflow works as follows:
- Read the website’s `sitemap.xml`
- Identify relevant pages (such as `/services`, `/product`, or `/platform`)
- Scrape the textual content from those pages
- Send the combined content to an OLLM model
- Generate structured themes based on the site’s messaging
This type of workflow is commonly used for:
- Competitive analysis
- Market positioning research
- Automated website audits
- Internal strategy research
The only AI component in this pipeline is the theme generation step. All other steps are standard web data extraction.
Step 1: Read the Sitemap and Identify Relevant Pages
Most websites expose a sitemap.xml file that lists all indexable pages. Instead of scraping the entire domain blindly, we first read the sitemap and extract only pages relevant to product or service messaging.
```python
import requests
import xml.etree.ElementTree as ET

def get_relevant_urls(sitemap_url):
    response = requests.get(sitemap_url)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    urls = []
    for url in root.findall(".//{*}loc"):
        link = url.text
        # Guard against empty <loc> elements before substring matching
        if link and any(path in link for path in ["/services", "/product", "/platform"]):
            urls.append(link)
    return urls
```

This ensures we focus only on pages that describe what the company offers, rather than blog posts or legal pages.
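Because `get_relevant_urls` depends on a live network request, it can help to verify the filtering logic against an in-memory sitemap first. The sketch below (the sample URLs are invented for illustration) applies the same path filter to a small XML string:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample sitemap, for illustration only.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product/analytics</loc></url>
  <url><loc>https://example.com/blog/hiring</loc></url>
  <url><loc>https://example.com/services/consulting</loc></url>
</urlset>"""

def filter_urls(sitemap_xml, paths=("/services", "/product", "/platform")):
    # Same filtering logic as get_relevant_urls, but applied to an
    # in-memory XML string so it can be checked without a network call.
    root = ET.fromstring(sitemap_xml)
    return [
        url.text
        for url in root.findall(".//{*}loc")
        if url.text and any(p in url.text for p in paths)
    ]

print(filter_urls(SAMPLE_SITEMAP))
# ['https://example.com/product/analytics', 'https://example.com/services/consulting']
```

The `{*}` wildcard in `findall` matches any XML namespace, so the same code works whether or not the sitemap declares the standard `sitemaps.org` namespace.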
Step 2: Scrape Page Content
Once we have the relevant URLs, we extract the visible text from each page.
```python
from bs4 import BeautifulSoup

def scrape_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    return " ".join(text.split())
```

In a production environment, you may want to remove navigation menus, footers, or repeated elements. For simplicity, this example extracts full page text.
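As noted, production scrapers usually strip navigation and footer markup before analysis. With BeautifulSoup you could call `.decompose()` on those tags; the sketch below shows the same idea using only the standard library's `html.parser`, so it runs without extra dependencies (the tag list is an assumption you would tune per site):

```python
from html.parser import HTMLParser

# Elements whose text is usually boilerplate rather than messaging.
SKIP_TAGS = {"nav", "footer", "header", "script", "style"}

class VisibleTextParser(HTMLParser):
    """Collects page text while skipping boilerplate elements like <nav>."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_main_text(html_text):
    parser = VisibleTextParser()
    parser.feed(html_text)
    return " ".join(" ".join(parser.chunks).split())

page = "<nav>Home About</nav><main><h1>Our Platform</h1><p>We build tools.</p></main><footer>© 2024</footer>"
print(extract_main_text(page))
# Our Platform We build tools.
```

The depth counter handles nested skipped elements (for example a `<nav>` inside a `<header>`) so text only passes through when the parser is outside all of them.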
Step 3: Generate Themes Using OLLM
Now that we have the combined website content, we send it to OLLM for analysis.
OLLM is OpenAI-compatible, so we can use the official OpenAI SDK by setting the base_url to the OLLM endpoint.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ollm.com/v1",
    api_key="your-api-key"
)

def generate_themes(content):
    response = client.chat.completions.create(
        model="near/GLM-4.6",
        messages=[
            {
                "role": "system",
                "content": "You are an analyst extracting high-level business themes from website content."
            },
            {
                "role": "user",
                "content": f"""
Analyze the following website content and extract:
1. Core product or service themes
2. Target audience segments
3. Key value propositions
4. Repeated messaging patterns

Website content:
{content}
"""
            }
        ]
    )
    return response.choices[0].message.content
```

This call sends the scraped website data to the selected model (`near/GLM-4.6`). The model analyzes the text and returns structured thematic insights.
Step 4: Full Workflow Example
Below is a simplified end-to-end example combining all steps.
```python
def run_analysis():
    sitemap_url = "https://example.com/sitemap.xml"
    urls = get_relevant_urls(sitemap_url)

    combined_content = ""
    for url in urls:
        page_text = scrape_page(url)
        combined_content += "\n\n" + page_text

    themes = generate_themes(combined_content)
    print("Generated Themes:\n")
    print(themes)

run_analysis()
```

When executed, this script will:
- Identify relevant product/service pages
- Extract their textual content
- Send the content to OLLM
- Print the generated themes
Expected Output
The model will typically return structured insights such as:
- Primary product categories
- Core differentiators
- Messaging consistency
- Target industries or user personas
You can optionally post-process this output into JSON if you require structured downstream usage.
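One way to do this (a general prompting pattern, not an OLLM-specific feature) is to ask the model for JSON in the prompt and then parse the reply defensively, since models sometimes wrap JSON in Markdown code fences. A minimal sketch, with a hypothetical model reply for illustration:

```python
import json
import re

def parse_theme_json(model_output):
    # Models sometimes wrap JSON in ```json fences; strip them before parsing.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", model_output.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # fall back to treating the reply as free text

# Hypothetical model reply, for illustration.
reply = '```json\n{"themes": ["analytics", "automation"]}\n```'
print(parse_theme_json(reply))
# {'themes': ['analytics', 'automation']}
```

Returning `None` on parse failure lets the caller decide whether to retry the request or keep the raw text.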
Production Considerations
When applying this workflow in a real system:
- Limit or chunk large content to avoid excessive token usage
- Validate HTTP status before reading model output
- Record token usage (`response.usage.total_tokens`) for cost tracking
- Handle network failures and timeouts gracefully
- Avoid scraping websites that prohibit automated access
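The first point above can be sketched as a simple character-based chunker. The 8,000-character limit here is an arbitrary placeholder, not an OLLM limit; you would size it to your model's context window:

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most max_chars, breaking on the last
    whitespace before the limit so words are not cut in half."""
    chunks = []
    while len(text) > max_chars:
        split_at = text.rfind(" ", 0, max_chars)
        if split_at <= 0:
            split_at = max_chars  # no space found; hard cut
        chunks.append(text[:split_at])
        text = text[split_at:].lstrip()
    if text:
        chunks.append(text)
    return chunks

sample = "word " * 5000  # roughly 25,000 characters of content
parts = chunk_text(sample.strip(), max_chars=8000)
print(len(parts), max(len(p) for p in parts) <= 8000)
# 4 True
```

Each chunk can then be passed to `generate_themes` separately, with the per-chunk results merged in a final summarization call.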
Because OLLM processes all inference inside Trusted Execution Environments (TEEs), the scraped website content is analyzed within a confidential computing boundary.