Vision Models

Vision models on OLLM accept image input alongside text. They are available in both TEE and ZDR environments and are called through the AI SDK chatModel() method.

Vision models are language models that also accept image input. You send images alongside text in the same request, and the model reasons over both.

When to Use

Use a vision model when your prompt includes images:

Describing, captioning, or classifying images
Extracting text or data from screenshots and documents
Visual question answering
Comparing or reasoning across multiple images

Vision models still return text, not images. To generate images, see Image & Video.

AI SDK Method

Vision models use the same chatModel() method as language models. Pass image bytes or a URL as a file content part with an image/* media type:

vision-model.ts

import { readFile } from 'node:fs/promises';
import { createOLLM } from '@orgn/gateway';
import { generateText } from 'ai';

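// Create the gateway client; the API key is read from the environment.
const ollm = createOLLM({ apiKey: process.env.OLLM_API_KEY });
// Load the image from disk as raw bytes.
const image = await readFile('photo.jpg');

const { text } = await generateText({
  model: ollm.chatModel('vercel_claude_sonnet_4_6'),
  messages: [{
    role: 'user',
    content: [
      // The text prompt and the image travel in the same user message.
      { type: 'text', text: 'Describe this image.' },
      { type: 'file', data: image, mediaType: 'image/jpeg' },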
    ],
  }],
});
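
The same message shape extends to several images, and to URLs in place of raw bytes. The following is a minimal sketch rather than part of the OLLM SDK: buildVisionMessage, ImageInput, VisionPart, and VisionMessage are hypothetical names introduced here, the example URL is a placeholder, and the sketch assumes a file part accepts either a Uint8Array or a URL object in its data field, in line with the "image bytes or a URL" wording above.

```typescript
// Hypothetical helper types; not exported by @orgn/gateway or the AI SDK.
type ImageInput = { data: Uint8Array | URL; mediaType: string };

type VisionPart =
  | { type: 'text'; text: string }
  | { type: 'file'; data: Uint8Array | URL; mediaType: string };

type VisionMessage = { role: 'user'; content: VisionPart[] };

// Build one user message: the text prompt first, then one file part per image.
function buildVisionMessage(prompt: string, images: ImageInput[]): VisionMessage {
  for (const image of images) {
    // Reject anything that is not declared as an image/* media type.
    if (!image.mediaType.startsWith('image/')) {
      throw new Error(`Expected an image/* media type, got "${image.mediaType}"`);
    }
  }
  return {
    role: 'user',
    content: [
      { type: 'text', text: prompt },
      ...images.map((image): VisionPart => ({
        type: 'file',
        data: image.data,
        mediaType: image.mediaType,
      })),
    ],
  };
}

// Example: one image given as bytes, one given as a (placeholder) URL.
const comparison = buildVisionMessage('What differs between these two images?', [
  { data: new Uint8Array([0xff, 0xd8, 0xff]), mediaType: 'image/jpeg' },
  { data: new URL('https://example.com/after.png'), mediaType: 'image/png' },
]);
```

Under these assumptions, the result would be passed as messages: [comparison] to generateText in place of the inline message shown earlier.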

Supported media types include image/jpeg, image/png, image/webp, image/gif, and any other image/* value the underlying model accepts. See the Vercel AI SDK integration for PDF and document input.
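
When images are read from disk, the media type usually has to be derived from the file name. One way to do that is sketched below. mediaTypeFor and IMAGE_MEDIA_TYPES are hypothetical names, not part of any SDK, and the lookup table covers only the four media types named above; a model may accept other image/* values that this sketch does not recognize.

```typescript
// Hypothetical lookup table: lowercase file extension -> image media type.
// Covers only the four media types named in the text above.
const IMAGE_MEDIA_TYPES = new Map<string, string>([
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['png', 'image/png'],
  ['webp', 'image/webp'],
  ['gif', 'image/gif'],
]);

// Return the media type for a file name, or undefined when the
// extension is missing or not in the table.
function mediaTypeFor(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf('.');
  if (dot < 0) return undefined; // no extension at all
  const extension = fileName.slice(dot + 1).toLowerCase();
  return IMAGE_MEDIA_TYPES.get(extension);
}
```

Returning undefined instead of guessing lets the caller decide whether to skip the file or fail early, before a request is sent.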

To confirm that a model accepts image input, call ollm.listModels({ inputModality: 'image' }) and check that the model appears in the result with 'image' in its input_modalities.
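
The check itself can be sketched as a small function over the returned list. This is an illustration under stated assumptions: only the input_modalities field is named above, so the id field, the ModelInfo type, and the acceptsImageInput helper are hypothetical, and the sample entries are made-up data standing in for a real listModels result.

```typescript
// Assumed shape of one catalog entry. Only `input_modalities` is named in
// the text above; `id` is a hypothetical field used for illustration.
interface ModelInfo {
  id: string;
  input_modalities: string[];
}

// True when a model with the given id exists and lists 'image' as an input.
function acceptsImageInput(models: ModelInfo[], modelId: string): boolean {
  const model = models.find((m) => m.id === modelId);
  return model !== undefined && model.input_modalities.includes('image');
}

// Illustrative data only; real entries would come from ollm.listModels().
const sampleModels: ModelInfo[] = [
  { id: 'example-vision-model', input_modalities: ['text', 'image'] },
  { id: 'example-text-model', input_modalities: ['text'] },
];

console.log(acceptsImageInput(sampleModels, 'example-vision-model')); // true
console.log(acceptsImageInput(sampleModels, 'example-text-model')); // false
```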

TEE Catalog

These vision models run in Trusted Execution Environments (TEEs) on NEAR and Phala infrastructure, using Intel TDX and NVIDIA H100 confidential compute.

Model	Provider	Infrastructure	Context
Qwen3 VL 30B	Alibaba	near	256K
Qwen3 VL 30B	Alibaba	phala	262K
Qwen3 VL 30B A3B Instruct	Alibaba	phala	128K
Qwen2.5 VL 72B	Alibaba	phala	128K

ZDR Catalog

These vision-capable models run on Vercel's AI infrastructure under zero data retention (ZDR) provider agreements.

Model	Provider	Context
Llama 3.2 11B Vision Instruct	Meta	128K
Llama 3.2 90B Vision Instruct	Meta	128K
Pixtral 12B	Mistral	128K
Pixtral Large	Mistral	128K
Qwen3 VL Instruct	Alibaba	262K
Nemotron Nano 12B v2 VL	NVIDIA	131K

Many frontier ZDR language models, including Claude 4.x, the Gemini 2.5 and 3 families, and GPT-4.1+ and GPT-5, also accept image input. Use ollm.listModels({ inputModality: 'image' }) for the authoritative list.
