
Vision Models

Image-understanding models on OLLM that accept image input alongside text, available in both TEE and ZDR environments and used through the AI SDK chatModel() method.

Vision models are language models that also accept image input. You send images alongside text in the same request, and the model reasons over both.

When to Use

Use a vision model when your prompt includes images:

  • Describing, captioning, or classifying images
  • Extracting text or data from screenshots and documents
  • Visual question answering
  • Comparing or reasoning across multiple images

Vision models still return text, not images. To generate images, see Image & Video.

AI SDK Method

Vision models use the same chatModel() method as language models. Pass image bytes or a URL as a file content part with an image/* media type:

vision-model.ts
import { readFile } from 'node:fs/promises';
import { createOLLM } from '@orgn/gateway';
import { generateText } from 'ai';

const ollm = createOLLM({ apiKey: process.env.OLLM_API_KEY });

// Read the image from disk as raw bytes.
const image = await readFile('photo.jpg');

const { text } = await generateText({
  model: ollm.chatModel('vercel_claude_sonnet_4_6'),
  messages: [{
    role: 'user',
    // One user message carries both the text prompt and the image.
    content: [
      { type: 'text', text: 'Describe this image.' },
      { type: 'file', data: image, mediaType: 'image/jpeg' },
    ],
  }],
});

// `text` holds the model's description of the image.
console.log(text);
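
The same message shape extends to image URLs and to several images in one request. The following is a minimal sketch rather than a helper shipped with the gateway: `buildVisionMessage` and the local `ImageSource`, `ContentPart`, and `UserMessage` types are hypothetical names introduced here, and it assumes the file content part accepts a `URL` object as `data` in the same way it accepts bytes.

```typescript
// Hypothetical local types that mirror the message shape used in vision-model.ts.
type ImageSource = { data: Uint8Array | URL; mediaType: string };

type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'file'; data: Uint8Array | URL; mediaType: string };

type UserMessage = { role: 'user'; content: ContentPart[] };

// Build one user message: a text prompt followed by any number of images.
function buildVisionMessage(prompt: string, images: ImageSource[]): UserMessage {
  // Reject anything that is not declared as an image before building the message.
  for (const { mediaType } of images) {
    if (!mediaType.startsWith('image/')) {
      throw new Error(`Expected an image/* media type, got "${mediaType}"`);
    }
  }
  return {
    role: 'user',
    content: [
      { type: 'text', text: prompt },
      ...images.map(({ data, mediaType }) => ({ type: 'file' as const, data, mediaType })),
    ],
  };
}

// Example: one image supplied as bytes and one supplied as a URL.
const message = buildVisionMessage('What differs between these two images?', [
  { data: new Uint8Array([0xff, 0xd8, 0xff]), mediaType: 'image/jpeg' }, // placeholder bytes
  { data: new URL('https://example.com/after.png'), mediaType: 'image/png' }, // placeholder URL
]);
```

The resulting object could then be passed as `messages: [message]` to `generateText` with a vision-capable `chatModel()`.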

Supported media types include image/jpeg, image/png, image/webp, image/gif, and any other image/* value the underlying model accepts. See the Vercel AI SDK integration for PDF and document input.
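
When images come from disk, the media type usually has to be derived from the file name. A minimal sketch of one way to do that for the four types named above; `imageMediaTypeFor` is a hypothetical helper, not part of the gateway or the AI SDK, and it only looks at the extension rather than inspecting the file contents.

```typescript
// Common image file extensions mapped to the media types listed above.
const IMAGE_MEDIA_TYPES = new Map<string, string>([
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['png', 'image/png'],
  ['webp', 'image/webp'],
  ['gif', 'image/gif'],
]);

// Return the image/* media type for a file path, or undefined when the
// path has no extension or the extension is not in the map.
function imageMediaTypeFor(path: string): string | undefined {
  const dot = path.lastIndexOf('.');
  if (dot === -1) return undefined;
  const extension = path.slice(dot + 1).toLowerCase();
  return IMAGE_MEDIA_TYPES.get(extension);
}
```

A caller might use `imageMediaTypeFor('photo.jpg')` to fill the `mediaType` field of the file content part, and fail fast when it returns `undefined`.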

To confirm that a model accepts image input, call ollm.listModels({ inputModality: 'image' }) and check that the model is returned with 'image' in its input_modalities.
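
The check can be wrapped in a small guard that runs before any image is sent. This is a sketch under stated assumptions: only the `input_modalities` field is named on this page, so the `id` field, the `ModelInfo` interface, the sample entries, and the idea that the model list is a plain array are all hypothetical.

```typescript
// Assumed shape of one model entry. `input_modalities` comes from the text
// above; `id` is a hypothetical field used here to look models up.
interface ModelInfo {
  id: string;
  input_modalities: string[];
}

// True when the model declares image input.
function acceptsImageInput(model: ModelInfo): boolean {
  return model.input_modalities.includes('image');
}

// Return the model entry, or throw if it is missing or does not accept images.
function assertVisionModel(models: ModelInfo[], modelId: string): ModelInfo {
  const model = models.find((m) => m.id === modelId);
  if (!model) {
    throw new Error(`Model "${modelId}" was not found in the model list`);
  }
  if (!acceptsImageInput(model)) {
    throw new Error(`Model "${modelId}" does not list 'image' in input_modalities`);
  }
  return model;
}

// Made-up sample data standing in for a real model list.
const sampleModels: ModelInfo[] = [
  { id: 'vercel_claude_sonnet_4_6', input_modalities: ['text', 'image'] },
  { id: 'example_text_only_model', input_modalities: ['text'] },
];

const visionModel = assertVisionModel(sampleModels, 'vercel_claude_sonnet_4_6');
```

In real use, `sampleModels` would be replaced by whatever `ollm.listModels()` resolves to, adapted to its actual return shape.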

TEE Catalog

Vision models running in Trusted Execution Environments, on NEAR and Phala infrastructure with Intel TDX + NVIDIA H100 confidential compute.

| Model | Provider | Infrastructure | Context |
|---|---|---|---|
| Qwen3 VL 30B | Alibaba | near | 256K |
| Qwen3 VL 30B | Alibaba | phala | 262K |
| Qwen3 VL 30B A3B Instruct | Alibaba | phala | 128K |
| Qwen2.5 VL 72B | Alibaba | phala | 128K |

ZDR Catalog

Vision-capable models running on Vercel's AI infrastructure with zero data retention provider agreements.

| Model | Provider | Context |
|---|---|---|
| Llama 3.2 11B Vision Instruct | Meta | 128K |
| Llama 3.2 90B Vision Instruct | Meta | 128K |
| Pixtral 12B | Mistral | 128K |
| Pixtral Large | Mistral | 128K |
| Qwen3 VL Instruct | Alibaba | 262K |
| Nemotron Nano 12B v2 VL | NVIDIA | 131K |

Many frontier ZDR language models, including Claude 4.x, the Gemini 2.5 and 3 families, and GPT-4.1+ and GPT-5, also accept image input. Use ollm.listModels({ inputModality: 'image' }) for the authoritative list.
