Vision Models
Image-understanding models on OLLM that accept image input alongside text, available in both TEE and ZDR environments and used through the AI SDK chatModel() method.
Vision models are language models that also accept image input. You send images alongside text in the same request, and the model reasons over both.
When to Use
Use a vision model when your prompt includes images:
- Describing, captioning, or classifying images
- Extracting text or data from screenshots and documents
- Visual question answering
- Comparing or reasoning across multiple images
Vision models still return text, not images. To generate images, see Image & Video.
AI SDK Method
Vision models use the same chatModel() method as language models. Pass image bytes or a URL as a file content part with an image/* media type:
```typescript
import { readFile } from 'node:fs/promises';
import { createOLLM } from '@orgn/gateway';
import { generateText } from 'ai';

const ollm = createOLLM({ apiKey: process.env.OLLM_API_KEY });

// Read the image from disk as raw bytes.
const image = await readFile('photo.jpg');

const { text } = await generateText({
  model: ollm.chatModel('vercel_claude_sonnet_4_6'),
  messages: [{
    role: 'user',
    content: [
      // The text prompt and the image travel in the same user message.
      { type: 'text', text: 'Describe this image.' },
      { type: 'file', data: image, mediaType: 'image/jpeg' },
    ],
  }],
});
```

Supported media types include image/jpeg, image/png, image/webp, image/gif, and any other image/* value the underlying model accepts. See the Vercel AI SDK integration for PDF and document input.
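When several images come from files on disk, it can help to derive each media type from the file extension and assemble the content parts in one place. The following is a minimal sketch under stated assumptions: `mediaTypeFor`, `buildVisionMessage`, and `ImageInput` are hypothetical names used for illustration, not part of the OLLM gateway or the AI SDK, and the extension map covers only the four media types named above.

```typescript
import { extname } from 'node:path';

// Extension-to-media-type map covering only the four types named above.
const IMAGE_MEDIA_TYPES: Record<string, string> = {
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.png': 'image/png',
  '.webp': 'image/webp',
  '.gif': 'image/gif',
};

// Hypothetical helper: derive the media type from a file name.
function mediaTypeFor(path: string): string {
  const ext = extname(path).toLowerCase();
  const mediaType = IMAGE_MEDIA_TYPES[ext];
  if (!mediaType) {
    throw new Error(`Unsupported image extension: ${ext || '(none)'}`);
  }
  return mediaType;
}

// Hypothetical input type: the file name plus its raw bytes.
type ImageInput = { path: string; data: Uint8Array };

// Hypothetical helper: one text part followed by one file part per image,
// in the same shape as the user message in the example above.
function buildVisionMessage(prompt: string, images: ImageInput[]) {
  return {
    role: 'user' as const,
    content: [
      { type: 'text' as const, text: prompt },
      ...images.map(({ path, data }) => ({
        type: 'file' as const,
        data,
        mediaType: mediaTypeFor(path),
      })),
    ],
  };
}
```

The returned object is intended to be used as an element of the `messages` array passed to `generateText`, for example `messages: [buildVisionMessage('Compare these two charts.', images)]`. Rejecting unknown extensions up front is a design choice made here so that a mislabeled file fails locally rather than at the model.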
To confirm that a model accepts image input, call ollm.listModels({ inputModality: 'image' }) and check that the model is returned with 'image' in its input_modalities.
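The check itself can be sketched as a small pure function over the listed models. This is an illustration under assumptions: only the `input_modalities` field is named in the text above, while the `ModelInfo` type, its `id` field, and the `acceptsImageInput` helper are hypothetical, and the actual return shape of `ollm.listModels()` may differ.

```typescript
// Assumed shape of a catalog entry. Only `input_modalities` comes from the
// text above; the `id` field is an assumption for illustration.
type ModelInfo = { id: string; input_modalities: string[] };

// Hypothetical helper: true only if the model is listed and declares 'image' input.
function acceptsImageInput(models: ModelInfo[], modelId: string): boolean {
  const model = models.find((m) => m.id === modelId);
  return model !== undefined && model.input_modalities.includes('image');
}
```

A caller might run this once at startup against the listed models and fall back to a text-only prompt when it returns false.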
TEE Catalog
Vision models running in Trusted Execution Environments, on NEAR and Phala infrastructure with Intel TDX + NVIDIA H100 confidential compute.
| Model | Provider | Infrastructure | Context |
|---|---|---|---|
| Qwen3 VL 30B | Alibaba | near | 256K |
| Qwen3 VL 30B | Alibaba | phala | 262K |
| Qwen3 VL 30B A3B Instruct | Alibaba | phala | 128K |
| Qwen2.5 VL 72B | Alibaba | phala | 128K |
ZDR Catalog
Vision-capable models running on Vercel's AI infrastructure with zero data retention provider agreements.
| Model | Provider | Context |
|---|---|---|
| Llama 3.2 11B Vision Instruct | Meta | 128K |
| Llama 3.2 90B Vision Instruct | Meta | 128K |
| Pixtral 12B | Mistral | 128K |
| Pixtral Large | Mistral | 128K |
| Qwen3 VL Instruct | Alibaba | 262K |
| Nemotron Nano 12B v2 VL | NVIDIA | 131K |
Many frontier ZDR language models, including Claude 4.x, the Gemini 2.5 and 3 families, and GPT-4.1+ and GPT-5, also accept image input. Use ollm.listModels({ inputModality: 'image' }) for the authoritative list.