Vision Analysis
Analyze images with AI vision models to describe content, extract information, detect objects, and answer questions about visual data.
The Vision node lets you analyze images using multimodal AI models that understand both text and visual content. Pass an image URL and a text prompt, and the model returns a text description, analysis, or answer about the image. This is useful for processing file uploads (photos, screenshots, receipts, diagrams), verifying visual content, and extracting information from images that are not suitable for pure OCR.
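Conceptually, a vision request pairs a text prompt with an image reference in a single message. The sketch below uses the OpenAI-style chat message format as an illustration of what the node sends on your behalf; exact field names vary by provider, and the URLs here are placeholders.

```python
# Sketch of a vision-style chat message: one user turn containing both
# a text instruction and an image URL (OpenAI-style content parts).

def build_vision_message(prompt: str, image_url: str) -> dict:
    """Combine a text prompt and an image URL into a single user message."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_vision_message(
    "Describe the damage visible in this photo.",
    "https://example.com/uploads/claim-photo.jpg",
)
```

The model reads both parts together, which is why the same image can yield very different responses depending on the prompt.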
Supported Providers and Models
Vision analysis requires multimodal models that accept image inputs:
| Provider | Models | Notes |
|---|---|---|
| OpenAI | GPT-4.1, GPT-4.1 Mini, GPT-4o | GPT-4.1 for highest accuracy, Mini for cost savings |
| Anthropic | Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Haiku 4.5 | Sonnet 4.6 recommended for detailed analysis |
Not all models support vision. The models listed above are the ones that accept image inputs. Text-only models (like GPT-4.1 Nano or the o-series reasoning models) are not available for the Vision node.
Configuration
Provider
Select either OpenAI or Anthropic. The model dropdown updates accordingly.
Model
Choose the specific vision-capable model:
| Model | Provider | Speed | Quality | Best For |
|---|---|---|---|---|
| GPT-4.1 | OpenAI | Moderate | Excellent | Detailed analysis, complex scenes |
| GPT-4.1 Mini | OpenAI | Fast | Good | Quick descriptions, simple checks |
| GPT-4o | OpenAI | Fast | Good | General-purpose vision tasks |
| Claude Sonnet 4.6 | Anthropic | Moderate | Excellent | Nuanced descriptions, reasoning |
| Claude Sonnet 4.5 | Anthropic | Moderate | Good | General-purpose analysis |
| Claude Haiku 4.5 | Anthropic | Fast | Good | Simple checks, high-volume processing |
Credential
Select a saved API key for the chosen provider. See Credential Management for setup.
Image URL
The URL of the image to analyze. This field supports template variables, which is the most common way to use it: reference the URL from a file upload form field so the vision model analyzes whatever image the user uploaded.
Supported image formats:
- JPEG / JPG
- PNG
- GIF (first frame analyzed)
- WebP
The image must be accessible via a public URL or a signed URL. If the image is behind authentication, the API call will fail. Buildorado file uploads generate accessible URLs automatically.
Prompt
The text instruction telling the model what to analyze or describe about the image. Be specific about what you want:
Good prompts:

```
Describe the damage visible in this photo. Note the location,
severity (minor, moderate, severe), and type of damage.
```

```
Is this a valid government-issued ID? Check for:
1. Full name visible
2. Photo present
3. Expiration date visible and not expired
4. No obvious signs of tampering
Return your assessment as JSON.
```

```
List all text visible in this image, including any
numbers, dates, and labels.
```

Vague prompts to avoid:

```
What is this?
```

```
Analyze this image.
```

The more specific your prompt, the more useful the model's response.
Detail Level
Controls how much visual detail the model processes:
| Level | Description | Speed | Cost | Use When |
|---|---|---|---|---|
| Auto | Model decides based on image content | Varies | Varies | Default for most cases |
| Low | Processes a lower-resolution version | Fastest | Cheapest | Simple checks (is this a photo of a person?) |
| High | Processes at full resolution | Slowest | Most expensive | Fine details matter (small text, subtle defects) |
- Auto is the recommended default. The model intelligently chooses the resolution based on the task.
- Low costs fewer tokens and processes faster. Use it when you only need a general understanding of the image.
- High is necessary when the analysis depends on fine details like small text, specific colors, or subtle visual features.
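In the OpenAI API, this setting corresponds to the `detail` field on the image part of the message, which accepts `"auto"`, `"low"`, or `"high"`. A minimal sketch of building that part, assuming the OpenAI-style format:

```python
# Map the Detail Level setting onto the `detail` field of an
# OpenAI-style image content part ("auto", "low", or "high").

VALID_DETAIL = {"auto", "low", "high"}

def image_part(url: str, detail: str = "auto") -> dict:
    """Build an image content part, validating the detail level."""
    if detail not in VALID_DETAIL:
        raise ValueError(f"detail must be one of {sorted(VALID_DETAIL)}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}

part = image_part("https://example.com/receipt.png", detail="high")
```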
Max Tokens
Limits the length of the model's text response. Default is 1024. Maximum is 16384. Set this based on how detailed you need the analysis to be:
- 100-256 -- Brief descriptions, yes/no answers, single labels
- 256-512 -- Moderate descriptions, short lists
- 512-1024 -- Detailed analysis with multiple observations
- 1024+ -- Comprehensive reports
Output
The Vision node produces the following output:
| Field | Type | Description |
|---|---|---|
| `content` | string | The model's text analysis of the image |
| `inputTokens` | number | Number of input tokens used |
| `outputTokens` | number | Number of output tokens used |
| `model` | string | The model that was used |
| `provider` | string | The provider that was used |
The content field is the primary output, available to downstream nodes via template variables. You can feed it into Branch nodes for conditional routing, Email nodes for notifications, or Spreadsheet nodes for data logging.
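For reference, a typical output object has the following shape (all values below are illustrative, not real model output):

```python
# Illustrative shape of the Vision node's output object.
output = {
    "content": "The photo shows a dented rear bumper with scratched paint.",
    "inputTokens": 1130,
    "outputTokens": 42,
    "model": "gpt-4.1",
    "provider": "openai",
}

# Downstream logic reads individual fields; for example, a Branch node
# might route on whether the analysis mentions severe damage.
is_severe = "severe" in output["content"].lower()
```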
Use Cases
Insurance Claim Assessment
Process damage photos submitted through a claim form:
- Image URL: Reference the file upload field containing the damage photo.
- Prompt: "Assess the damage in this photo. Describe the type of damage, estimated severity (minor, moderate, severe), and affected areas. Return JSON with fields: damageType, severity, affectedAreas, description."
- Detail level: High (fine details matter for damage assessment)
- Route claims by severity: severe to a human adjuster, minor to auto-approval.
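The routing step above can be sketched as follows. The response string is a made-up example of the JSON the prompt requests, and the destination names are placeholders for whatever branches your workflow defines:

```python
import json

# Parse the JSON the damage-assessment prompt requests, then route by severity.
# The response text below is illustrative, not real model output.
response_content = (
    '{"damageType": "collision", "severity": "severe", '
    '"affectedAreas": ["rear bumper"], "description": "Deep dent and paint damage."}'
)
claim = json.loads(response_content)

def route_claim(severity: str) -> str:
    """Return the branch a claim should take based on assessed severity."""
    if severity == "severe":
        return "human_adjuster"
    if severity == "minor":
        return "auto_approval"
    return "standard_review"  # moderate or unrecognized values

destination = route_claim(claim["severity"])
```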
ID Verification
Verify uploaded identification documents:
- Image URL: Reference the ID photo upload field.
- Prompt: "Verify this identification document. Check: 1) Is a full name visible? 2) Is a photo present? 3) Is an expiration date visible? 4) Does the document appear authentic? Return JSON with isValid (boolean) and findings (array of strings)."
- Detail level: High
- Route valid IDs to the next step and flag invalid ones for manual review.
Product Photo Moderation
Validate product images submitted by sellers:
- Image URL: Reference the product photo upload field.
- Prompt: "Evaluate this product photo for marketplace listing. Check: appropriate content, clear product visibility, no watermarks, adequate lighting. Rate quality 1-5 and list any issues."
- Detail level: Auto
- Approve high-quality photos and flag low-quality ones for re-upload.
Receipt Processing
Extract information from uploaded receipts:
- Image URL: Reference the receipt upload field.
- Prompt: "Extract the following from this receipt: store name, date, total amount, tax amount, payment method, and list of items with prices. Return as JSON."
- Detail level: High (receipts have small text)
- Map extracted data to expense tracking spreadsheets or accounting integrations.
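Mapping the extracted JSON to a spreadsheet row might look like this sketch. The field names match what the prompt above asks for, but the column order and the sample values are assumptions to adapt to your sheet:

```python
import json

# Flatten the receipt JSON the prompt requests into a flat spreadsheet row.
# The response text below is illustrative, not real model output.
response_content = """{
    "storeName": "Corner Market",
    "date": "2024-05-14",
    "totalAmount": 23.47,
    "taxAmount": 1.72,
    "paymentMethod": "credit card",
    "items": [{"name": "Coffee", "price": 4.50}, {"name": "Bread", "price": 3.25}]
}"""
receipt = json.loads(response_content)

row = [
    receipt["storeName"],
    receipt["date"],
    receipt["totalAmount"],
    receipt["taxAmount"],
    receipt["paymentMethod"],
    len(receipt["items"]),  # item count; expand to one row per item if needed
]
```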
Real Estate Listing Analysis
Describe property photos for listings:
- Image URL: Reference the property photo upload.
- Prompt: "Describe this property photo for a real estate listing. Include room type, notable features, condition, natural lighting, and approximate room size category (small, medium, large)."
- Detail level: Auto
- Use the generated descriptions in automated listing creation workflows.
Vision vs. OCR: When to Use Each
Both Vision and OCR nodes process images, but they serve different purposes:
| Feature | Vision | OCR |
|---|---|---|
| Primary purpose | Understanding image content | Extracting text from images |
| Output | Descriptive text or analysis | Raw extracted text |
| Best for | Photos, scenes, diagrams, verification | Documents, forms, receipts, labels |
| Understands context | Yes (what is happening in the image) | No (just reads text) |
| Cost | Higher (complex analysis) | Lower (text extraction only) |
Use Vision when you need to understand what is in an image, verify visual content, or answer questions about a scene.
Use OCR when you primarily need to extract readable text from a document or image.
For receipts and documents that need both text extraction and contextual understanding, Vision often handles both tasks in a single call.
Best Practices
- Be specific in prompts. Tell the model exactly what to look for and what format to return. "Describe this image" produces generic output. "List all visible text and identify the document type" produces actionable data.
- Use Low detail for simple checks. If you just need to confirm an image contains a person or is not blank, Low detail is significantly cheaper and faster.
- Use High detail for text in images. When the image contains small text (receipts, documents, labels), High detail improves accuracy substantially.
- Combine with structured output. Ask the model to return JSON in its response for easier downstream processing.
- Handle missing images gracefully. If the image URL is empty (user did not upload a file), the Vision node will fail. Add conditional logic before the Vision node to check that the file upload field is not empty.
- Consider image size. Very large images are resized by the provider. If resolution matters, ensure the uploaded image is within the provider's supported dimensions.
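When you ask for JSON, models sometimes wrap it in a markdown code fence, so downstream parsing should tolerate both forms. A defensive parsing sketch (the fence string is built indirectly only to keep this example readable):

```python
import json
import re

FENCE = "`" * 3  # a literal triple-backtick markdown fence

def parse_model_json(text: str):
    """Parse JSON from model output, stripping a markdown code fence if present."""
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

fenced = FENCE + "json\n" + '{"isValid": true}' + "\n" + FENCE
parsed = parse_model_json(fenced)
```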
Limitations
- The model cannot modify images. For image editing, use the Image Edit node.
- Vision models may refuse to analyze images that violate content policies.
- The model analyzes a single image per execution. To process multiple images, use multiple Vision nodes or a Loop node.
- GIF analysis is limited to the first frame.
- The image must be accessible via URL. Local file paths are not supported. File uploads in Buildorado automatically generate accessible URLs.
- Execution is subject to a 120-second timeout.