OLLM Architecture and Confidential Inference Flow

A deep dive into the OLLM architecture, including the router control plane, hardware-backed Trusted Execution Environments, and the attestation layer. Learn how user-controlled model selection, Intel TDX, and NVIDIA GPU attestation deliver verifiable confidential LLM inference end to end.

OLLM’s architecture is designed to provide verifiable, confidential LLM inference while keeping control firmly with the user. It separates request orchestration, secure execution, and verification, ensuring that sensitive data is processed only inside hardware-enforced trust boundaries.

OLLM does not perform automatic model selection or dynamic routing. The model specified in the user's request is executed.

High-level components

Client application

Your application sends requests using an OpenAI-compatible API, explicitly specifying the model to use.

The client:

Selects the model in code or request parameters
Sends prompts and inference parameters
Receives model responses
Receives attestation and verification metadata for the same request

OLLM does not modify, override, or substitute the requested model.

OLLM router (control plane)

The OLLM router acts as a secure orchestration layer, responsible for:

Authenticating requests
Validating model availability and permissions
Enforcing security and execution constraints
Coordinating attestation and verification data

The router does not choose models, does not inspect prompt or response data, and does not perform inference.

Execution environments (data plane)

OLLM routes requests to one of two execution environments depending on the model selected.

TEE models (NEAR and Phala infrastructure) run inside hardware-backed Trusted Execution Environments:

Hardware-enforced memory isolation from host OS, hypervisor, and infrastructure
Encryption of data in use via Intel TDX confidential VMs and NVIDIA H100 GPU attestation
Cryptographic attestation receipt generated per request

ZDR models (Vercel infrastructure) run under zero data retention provider agreements:

No storage or logging of prompts and responses by Vercel or the underlying model provider
Broad access to frontier closed-weight and multimodal models
No hardware isolation or attestation receipt

The model identifier you specify in the request determines which environment is used. OLLM does not select or override the execution path.

Attestation and verification layer

For TEE model requests, the execution environment produces attestation artifacts that prove:

The specified model ran inside a valid TEE
The execution environment matched expected measurements
The response was generated within the trusted boundary

These artifacts are available per request in the OLLM Explorer, enabling independent verification of secure execution.

ZDR model requests do not produce attestation artifacts. Privacy is enforced through Vercel's zero data retention agreements with model providers.

Request lifecycle

Request submission

The client sends a request to OLLM, explicitly specifying the model to use.

Request validation

OLLM authenticates the request and verifies that the specified model is available and supported.

Secure inference execution

The request is forwarded to the selected model’s TEE-backed execution environment.

Attestation generation

Hardware attestation data is generated as part of the execution process.

Response and verification delivery

The model output and corresponding verification metadata are returned to the client.

At no point does OLLM alter the model choice or access plaintext prompt or response data outside the TEE.

Trust boundaries and guarantees

OLLM makes its trust model explicit:

Model choice is always user-controlled OLLM does not perform automatic routing or model substitution.
OLLM does not access inference data Prompts and outputs remain confined to hardware-isolated environments.
Security enforcement depends on model type TEE models: trust anchored in hardware TEE guarantees and cryptographic attestation. ZDR models: trust anchored in Vercel's zero data retention provider agreements.

This architecture allows teams to run sensitive LLM workloads with full control over model selection, while still gaining verifiable privacy and execution integrity.