Vision Analysis
Analyze images with AI vision models to describe content, extract information, detect objects, and answer questions about visual data.
The Vision node lets you analyze images using multimodal AI models that understand both text and visual content. Pass an image URL and a text prompt, and the model returns a text description, analysis, or answer about the image. This is useful for processing file uploads (photos, screenshots, receipts, diagrams), verifying visual content, and extracting information from images that are not suitable for pure OCR.
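Conceptually, a vision request pairs a text prompt with an image reference in a single message. The sketch below uses the OpenAI-style chat message format as an illustration of what the node sends on your behalf; exact field names vary by provider, and the URLs here are placeholders.

```python
# Sketch of a vision-style chat message: one user turn containing both
# a text instruction and an image URL (OpenAI-style content parts).

def build_vision_message(prompt: str, image_url: str) -> dict:
    """Combine a text prompt and an image URL into a single user message."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_vision_message(
    "Describe the damage visible in this photo.",
    "https://example.com/uploads/claim-photo.jpg",
)
```

The model reads both parts together, which is why the same image can yield very different responses depending on the prompt.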
Supported Providers and Models
Vision analysis requires multimodal models that accept image inputs:
| Provider | Models | Notes |
|---|---|---|
| OpenAI | GPT-4.1, GPT-4.1 Mini, GPT-4o | GPT-4.1 for highest accuracy, Mini for cost savings |
| Anthropic | Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Haiku 4.5 | Sonnet 4.6 recommended for detailed analysis |
Not all models support vision. The models listed above are the ones that accept image inputs. Text-only models (like GPT-4.1 Nano or the o-series reasoning models) are not available for the Vision node.
Configuration
Provider
Select either OpenAI or Anthropic. The model dropdown updates accordingly.
Model
Choose the specific vision-capable model:
| Model | Provider | Speed | Quality | Best For |
|---|---|---|---|---|
| GPT-4.1 | OpenAI | Moderate | Excellent | Detailed analysis, complex scenes |
| GPT-4.1 Mini | OpenAI | Fast | Good | Quick descriptions, simple checks |
| GPT-4o | OpenAI | Fast | Good | General-purpose vision tasks |
| Claude Sonnet 4.6 | Anthropic | Moderate | Excellent | Nuanced descriptions, reasoning |
| Claude Sonnet 4.5 | Anthropic | Moderate | Good | General-purpose analysis |
| Claude Haiku 4.5 | Anthropic | Fast | Good | Simple checks, high-volume processing |
Credential
Select a saved API key for the chosen provider. See Credential Management for setup.
Image URL
The URL of the image to analyze. This field supports template variables, which is the most common way to use it: reference the URL from a file upload form field so the vision model analyzes whatever image the user uploaded.
Supported image formats:
- JPEG / JPG
- PNG
- GIF (first frame analyzed)
- WebP
The image must be accessible via a public URL or a signed URL. If the image is behind authentication, the API call will fail. Buildorado file uploads generate accessible URLs automatically.
Prompt
The text instruction telling the model what to analyze or describe about the image. Be specific about what you want:
Good prompts:

```
Describe the damage visible in this photo. Note the location,
severity (minor, moderate, severe), and type of damage.
```

```
Is this a valid government-issued ID? Check for:
1. Full name visible
2. Photo present
3. Expiration date visible and not expired
4. No obvious signs of tampering
Return your assessment as JSON.
```

```
List all text visible in this image, including any
numbers, dates, and labels.
```

Vague prompts to avoid:

```
What is this?
```

```
Analyze this image.
```

The more specific your prompt, the more useful the model's response.
Detail Level
Controls how much visual detail the model processes:
| Level | Description | Speed | Cost | Use When |
|---|---|---|---|---|
| Auto | Model decides based on image content | Varies | Varies | Default for most cases |
| Low | Processes a lower-resolution version | Fastest | Cheapest | Simple checks (is this a photo of a person?) |
| High | Processes at full resolution | Slowest | Most expensive | Fine details matter (small text, subtle defects) |
- Auto is the recommended default. The model intelligently chooses the resolution based on the task.
- Low costs fewer tokens and processes faster. Use it when you only need a general understanding of the image.
- High is necessary when the analysis depends on fine details like small text, specific colors, or subtle visual features.
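In the OpenAI API, this setting corresponds to the `detail` field on the image part of the message, which accepts `"auto"`, `"low"`, or `"high"`. A minimal sketch of building that part, assuming the OpenAI-style format:

```python
# Map the Detail Level setting onto the `detail` field of an
# OpenAI-style image content part ("auto", "low", or "high").

VALID_DETAIL = {"auto", "low", "high"}

def image_part(url: str, detail: str = "auto") -> dict:
    """Build an image content part, validating the detail level."""
    if detail not in VALID_DETAIL:
        raise ValueError(f"detail must be one of {sorted(VALID_DETAIL)}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}

part = image_part("https://example.com/receipt.png", detail="high")
```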
Max Tokens
Limits the length of the model's text response. Default is 1024. Maximum is 16384. Set this based on how detailed you need the analysis to be:
- 100-256 -- Brief descriptions, yes/no answers, single labels
- 256-512 -- Moderate descriptions, short lists
- 512-1024 -- Detailed analysis with multiple observations
- 1024+ -- Comprehensive reports
Output
The Vision node produces the following output:
| Field | Type | Description |
|---|---|---|
| `content` | string | The model's text analysis of the image |
| `inputTokens` | number | Number of input tokens used |
| `outputTokens` | number | Number of output tokens used |
| `model` | string | The model that was used |
| `provider` | string | The provider that was used |
The content field is the primary output, available to downstream nodes via template variables. You can feed it into Branch nodes for conditional routing, Email nodes for notifications, or Spreadsheet nodes for data logging.
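For reference, a typical output object has the following shape (all values below are illustrative, not real model output):

```python
# Illustrative shape of the Vision node's output object.
output = {
    "content": "The photo shows a dented rear bumper with scratched paint.",
    "inputTokens": 1130,
    "outputTokens": 42,
    "model": "gpt-4.1",
    "provider": "openai",
}

# Downstream logic reads individual fields; for example, a Branch node
# might route on whether the analysis mentions severe damage.
is_severe = "severe" in output["content"].lower()
```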
Use Cases
Insurance Claim Assessment
Process damage photos submitted through a claim form:
- Image URL: Reference the file upload field containing the damage photo.
- Prompt: "Assess the damage in this photo. Describe the type of damage, estimated severity (minor, moderate, severe), and affected areas. Return JSON with fields: damageType, severity, affectedAreas, description."
- Detail level: High (fine details matter for damage assessment)
- Route claims by severity: severe to a human adjuster, minor to auto-approval.
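The routing step above can be sketched as follows. The response string is a made-up example of the JSON the prompt requests, and the destination names are placeholders for whatever branches your workflow defines:

```python
import json

# Parse the JSON the damage-assessment prompt requests, then route by severity.
# The response text below is illustrative, not real model output.
response_content = (
    '{"damageType": "collision", "severity": "severe", '
    '"affectedAreas": ["rear bumper"], "description": "Deep dent and paint damage."}'
)
claim = json.loads(response_content)

def route_claim(severity: str) -> str:
    """Return the branch a claim should take based on assessed severity."""
    if severity == "severe":
        return "human_adjuster"
    if severity == "minor":
        return "auto_approval"
    return "standard_review"  # moderate or unrecognized values

destination = route_claim(claim["severity"])
```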
ID Verification
Verify uploaded identification documents:
- Image URL: Reference the ID photo upload field.
- Prompt: "Verify this identification document. Check: 1) Is a full name visible? 2) Is a photo present? 3) Is an expiration date visible? 4) Does the document appear authentic? Return JSON with isValid (boolean) and findings (array of strings)."
- Detail level: High
- Route valid IDs to the next step and flag invalid ones for manual review.
Product Photo Moderation
Validate product images submitted by sellers:
- Image URL: Reference the product photo upload field.
- Prompt: "Evaluate this product photo for marketplace listing. Check: appropriate content, clear product visibility, no watermarks, adequate lighting. Rate quality 1-5 and list any issues."
- Detail level: Auto
- Approve high-quality photos and flag low-quality ones for re-upload.
Receipt Processing
Extract information from uploaded receipts:
- Image URL: Reference the receipt upload field.
- Prompt: "Extract the following from this receipt: store name, date, total amount, tax amount, payment method, and list of items with prices. Return as JSON."
- Detail level: High (receipts have small text)
- Map extracted data to expense tracking spreadsheets or accounting integrations.
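Mapping the extracted JSON to a spreadsheet row might look like this sketch. The field names match what the prompt above asks for, but the column order and the sample values are assumptions to adapt to your sheet:

```python
import json

# Flatten the receipt JSON the prompt requests into a flat spreadsheet row.
# The response text below is illustrative, not real model output.
response_content = """{
    "storeName": "Corner Market",
    "date": "2024-05-14",
    "totalAmount": 23.47,
    "taxAmount": 1.72,
    "paymentMethod": "credit card",
    "items": [{"name": "Coffee", "price": 4.50}, {"name": "Bread", "price": 3.25}]
}"""
receipt = json.loads(response_content)

row = [
    receipt["storeName"],
    receipt["date"],
    receipt["totalAmount"],
    receipt["taxAmount"],
    receipt["paymentMethod"],
    len(receipt["items"]),  # item count; expand to one row per item if needed
]
```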
Real Estate Listing Analysis
Describe property photos for listings:
- Image URL: Reference the property photo upload.
- Prompt: "Describe this property photo for a real estate listing. Include room type, notable features, condition, natural lighting, and approximate room size category (small, medium, large)."
- Detail level: Auto
- Use the generated descriptions in automated listing creation workflows.
Vision vs. OCR: When to Use Each
Both Vision and OCR nodes process images, but they serve different purposes:
| Feature | Vision | OCR |
|---|---|---|
| Primary purpose | Understanding image content | Extracting text from images |
| Output | Descriptive text or analysis | Raw extracted text |
| Best for | Photos, scenes, diagrams, verification | Documents, forms, receipts, labels |
| Understands context | Yes (what is happening in the image) | No (just reads text) |
| Cost | Higher (complex analysis) | Lower (text extraction only) |
Use Vision when you need to understand what is in an image, verify visual content, or answer questions about a scene.
Use OCR when you primarily need to extract readable text from a document or image.
For receipts and documents that need both text extraction and contextual understanding, Vision often handles both tasks in a single call.
Best Practices
- Be specific in prompts. Tell the model exactly what to look for and what format to return. "Describe this image" produces generic output. "List all visible text and identify the document type" produces actionable data.
- Use Low detail for simple checks. If you just need to confirm an image contains a person or is not blank, Low detail is significantly cheaper and faster.
- Use High detail for text in images. When the image contains small text (receipts, documents, labels), High detail improves accuracy substantially.
- Combine with structured output. Ask the model to return JSON in its response for easier downstream processing.
- Handle missing images gracefully. If the image URL is empty (user did not upload a file), the Vision node will fail. Add conditional logic before the Vision node to check that the file upload field is not empty.
- Consider image size. Very large images are resized by the provider. If resolution matters, ensure the uploaded image is within the provider's supported dimensions.
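When you ask for JSON, models sometimes wrap it in a markdown code fence, so downstream parsing should tolerate both forms. A defensive parsing sketch (the fence string is built indirectly only to keep this example readable):

```python
import json
import re

FENCE = "`" * 3  # a literal triple-backtick markdown fence

def parse_model_json(text: str):
    """Parse JSON from model output, stripping a markdown code fence if present."""
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

fenced = FENCE + "json\n" + '{"isValid": true}' + "\n" + FENCE
parsed = parse_model_json(fenced)
```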
Limitations
- The model cannot modify images. For image editing, use the Image Edit node.
- Vision models may refuse to analyze images that violate content policies.
- The model analyzes a single image per execution. To process multiple images, use multiple Vision nodes or a Loop node.
- GIF analysis is limited to the first frame.
- The image must be accessible via URL. Local file paths are not supported. File uploads in Buildorado automatically generate accessible URLs.
- Execution is subject to a 120-second timeout.