
Troubleshoot

Common issues and solutions when implementing the website theme generation workflow using OLLM.


The workflow involves:

  1. Fetching a sitemap
  2. Extracting relevant URLs
  3. Scraping page content
  4. Sending the combined content to OLLM
  5. Parsing the model response

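The steps above can be sketched as a pair of small helpers. This is an illustrative outline, not the workflow's actual code; the function names and the keyword list are assumptions:

```python
import requests
import xml.etree.ElementTree as ET

def fetch_sitemap_urls(sitemap_url: str, timeout: int = 10) -> list[str]:
    """Steps 1-2: fetch the sitemap and return every <loc> URL it lists."""
    response = requests.get(sitemap_url, timeout=timeout)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # {*} matches any XML namespace, so plain and namespaced sitemaps both work.
    return [loc.text for loc in root.findall(".//{*}loc") if loc.text]

def filter_relevant(urls: list[str],
                    keywords: tuple[str, ...] = ("about", "services", "products")) -> list[str]:
    """Step 2: keep only URLs whose path mentions a keyword of interest."""
    return [u for u in urls if any(k in u.lower() for k in keywords)]
```

The remaining steps (scraping, calling OLLM, parsing the response) are covered in their own sections below.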
If any step fails, the pipeline may break or produce incomplete results. The sections below help you isolate and resolve common issues.

Sitemap Issues

Sitemap Not Found (404)

If your request to sitemap.xml fails:

response = requests.get(sitemap_url, timeout=10)
print(response.status_code)

Possible causes:

  • The website does not expose a public sitemap
  • The sitemap is located at a different path
  • The site blocks automated requests

How to fix

  • Check https://example.com/robots.txt to find the correct sitemap location
  • Verify the URL manually in your browser
  • Add a User-Agent header to your request:
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(sitemap_url, headers=headers)
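Because robots.txt usually declares the sitemap location on `Sitemap:` lines, you can extract it programmatically. A minimal sketch (the function name is illustrative):

```python
def find_sitemaps(robots_txt: str) -> list[str]:
    """Return every sitemap URL declared in a robots.txt body."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.lower().startswith("sitemap:")
    ]
```

Feed it the body of the robots.txt response and try each returned URL in order.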

XML Parsing Errors

If you see errors such as:

xml.etree.ElementTree.ParseError

The sitemap may:

  • Be malformed
  • Contain namespaces not handled properly
  • Be compressed

How to fix

Ensure you handle namespaces correctly:

root.findall(".//{*}loc")

If the sitemap is compressed (.gz), download and decompress it before parsing.
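Both fixes can be combined in one parsing helper. This sketch detects gzip by its two magic bytes rather than by file extension (the function name is illustrative):

```python
import gzip
import xml.etree.ElementTree as ET

def parse_sitemap(raw: bytes) -> list[str]:
    """Decompress gzipped sitemaps (magic bytes 0x1f 0x8b), then read <loc> entries."""
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    # {*} matches any XML namespace.
    return [loc.text for loc in root.findall(".//{*}loc") if loc.text]
```

Pass it `response.content` (bytes), not `response.text`, so the gzip check works.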

Scraping Issues

Empty or Incomplete Content

If scrape_page() returns very little text:

  • The site may use client-side rendering (JavaScript)
  • Content may load dynamically

How to fix

For JavaScript-heavy sites, use a headless browser tool such as:

  • Playwright
  • Selenium

Basic requests + BeautifulSoup will not execute JavaScript.
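A minimal Playwright sketch, assuming `playwright` and its Chromium browser are installed (`pip install playwright && playwright install chromium`); the `looks_client_rendered` heuristic and both function names are illustrative:

```python
def looks_client_rendered(text: str, min_chars: int = 200) -> bool:
    """Heuristic: suspiciously little extracted text suggests client-side rendering."""
    return len(text.strip()) < min_chars

def render_page_text(url: str, timeout_ms: int = 15000) -> str:
    """Fetch a page with a headless browser so client-side JavaScript runs first."""
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, timeout=timeout_ms)
        page.wait_for_load_state("networkidle")  # wait for dynamic content
        text = page.inner_text("body")
        browser.close()
        return text
```

You could fall back to `render_page_text()` only when the plain-requests result triggers `looks_client_rendered()`, since headless browsing is much slower.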

Blocked Requests (403)

If you receive:

403 Forbidden

The website may be blocking automated scraping.

How to fix

  • Add a User-Agent header
  • Respect robots.txt
  • Avoid scraping at high request frequency
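One way to combine all three fixes, sketched with the standard-library `RobotFileParser` (function names and the one-second delay are illustrative):

```python
import time
import requests
from urllib.robotparser import RobotFileParser

HEADERS = {"User-Agent": "Mozilla/5.0"}

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check a URL against the rules in a robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

def polite_fetch(urls: list[str], robots_txt: str, delay_seconds: float = 1.0) -> dict[str, str]:
    """Fetch allowed pages sequentially, pausing between requests."""
    pages = {}
    for url in urls:
        if not allowed(robots_txt, HEADERS["User-Agent"], url):
            continue  # skip URLs disallowed by robots.txt
        pages[url] = requests.get(url, headers=HEADERS, timeout=10).text
        time.sleep(delay_seconds)  # avoid hammering the server
    return pages
```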

OLLM API Issues

401 Unauthorized

If you receive a 401 error from OLLM:

  • Verify your API key
  • Ensure base_url="https://api.ollm.com/v1" is set correctly
  • Confirm the key has not been revoked

Example client initialization:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ollm.com/v1",
    api_key="your-api-key"
)

Model Not Found

If you see an error indicating the model is unavailable:

  • Verify the model ID (near/GLM-4.6)
  • Confirm the model is available in your account
  • Ensure there are no typos in the model string

Empty Model Output

If response.choices[0].message.content is empty:

  • Confirm that the request succeeded (HTTP 200)
  • Check that response.choices exists
  • Print the full response object for debugging

Example guard:

if response and response.choices:
    print(response.choices[0].message.content)

Token and Input Size Issues

Content Too Large

If you receive token limit errors or unusually slow responses, the combined website content may exceed the model’s context window.

How to fix

  • Truncate large pages
  • Chunk content into smaller segments
  • Summarize sections individually before combining results

Example truncation:

MAX_CHARS = 20000
combined_content = combined_content[:MAX_CHARS]
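If truncation loses too much information, a simple chunking helper lets you summarize segments separately before combining the results (illustrative sketch; a real splitter might prefer paragraph boundaries):

```python
def chunk_text(text: str, max_chars: int = 20000) -> list[str]:
    """Split combined content into segments that each fit in the context window."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]
```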

Performance Issues

Slow Execution

The workflow may slow down due to:

  • Large sitemaps
  • Sequential page scraping
  • Large combined prompts

Improvements

  • Add concurrency when scraping
  • Cache scraped pages
  • Filter sitemap URLs more strictly
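Sequential scraping is usually the biggest cost. A thread pool is a low-effort speedup; in this sketch, `scrape_page` is assumed to be your existing single-page scraper:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls: list[str], scrape_page, max_workers: int = 8) -> dict[str, str]:
    """Scrape pages concurrently; results stay keyed by URL, in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(scrape_page, urls)))
```

Keep `max_workers` modest so concurrency does not turn into the high request frequency warned about above.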

Response Parsing Errors

If your application crashes while reading:

response.choices[0].message.content

The likely cause is that the response is an error envelope rather than a completion.

Always validate before accessing fields:

if hasattr(response, "choices") and response.choices:
    content = response.choices[0].message.content

Do not attempt to parse output if an error object is present.

Unexpected Output Quality

If the generated themes are:

  • Too generic
  • Not structured
  • Missing important insights

You can improve results by:

  • Adding stronger system instructions
  • Asking for structured output (e.g., JSON format)
  • Reducing noisy scraped content (menus, footers, boilerplate)

Example improved prompt:

{
    "role": "system",
    "content": "Extract structured business themes. Return a clear bullet list grouped by category."
}
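If you ask for JSON, parse it defensively: models sometimes wrap output in a Markdown code fence. This illustrative helper strips such a fence before parsing:

```python
import json

def parse_themes(raw: str) -> dict:
    """Parse model output as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = raw.strip().strip("`")  # drop any backtick fence characters
    if cleaned.startswith("json"):
        cleaned = cleaned[len("json"):]  # drop the fence's language tag
    return json.loads(cleaned)
```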

Verification & Security Context

All inference requests sent through OLLM are processed inside Trusted Execution Environments (TEEs). If you need to validate execution integrity:

  • Check verification metadata in the OLLM dashboard
  • Confirm request status is “Verified”

This does not affect the scraping workflow, but it may be relevant for audit or compliance requirements.

Debugging Checklist

Before escalating issues, verify:

  • Sitemap URL is correct
  • Relevant pages are being extracted
  • Scraped content is non-empty
  • API key is valid
  • Model ID is correct
  • Response status is 2xx
  • choices[0].message.content exists

This isolates failures quickly and helps determine whether the issue is in scraping logic, request construction, or response handling.
