Vision Analysis

Analyze images with AI vision models to describe content, extract information, detect objects, and answer questions about visual data.

The Vision node lets you analyze images using multimodal AI models that understand both text and visual content. Pass an image URL and a text prompt, and the model returns a text description, analysis, or answer about the image. This is useful for processing file uploads (photos, screenshots, receipts, diagrams), verifying visual content, and extracting information from images that are not suitable for pure OCR.

Supported Providers and Models

Vision analysis requires multimodal models that accept image inputs:

| Provider | Models | Notes |
|---|---|---|
| OpenAI | GPT-4.1, GPT-4.1 Mini, GPT-4o | GPT-4.1 for highest accuracy, Mini for cost savings |
| Anthropic | Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Haiku 4.5 | Sonnet 4.6 recommended for detailed analysis |

Not all models support vision. The models listed above are the ones that accept image inputs. Text-only models (like GPT-4.1 Nano or the o-series reasoning models) are not available for the Vision node.

Configuration

Provider

Select either OpenAI or Anthropic. The model dropdown updates accordingly.

Model

Choose the specific vision-capable model:

| Model | Provider | Speed | Quality | Best For |
|---|---|---|---|---|
| GPT-4.1 | OpenAI | Moderate | Excellent | Detailed analysis, complex scenes |
| GPT-4.1 Mini | OpenAI | Fast | Good | Quick descriptions, simple checks |
| GPT-4o | OpenAI | Fast | Good | General-purpose vision tasks |
| Claude Sonnet 4.6 | Anthropic | Moderate | Excellent | Nuanced descriptions, reasoning |
| Claude Sonnet 4.5 | Anthropic | Moderate | Good | General-purpose analysis |
| Claude Haiku 4.5 | Anthropic | Fast | Good | Simple checks, high-volume processing |

Credential

Select a saved API key for the chosen provider. See Credential Management for setup.

Image URL

The URL of the image to analyze. This field supports template variables, which is the most common way to use it: reference the URL from a file upload form field so the vision model analyzes whatever image the user uploaded.

Supported image formats:

  • JPEG / JPG
  • PNG
  • GIF (first frame analyzed)
  • WebP

The image must be accessible via a public URL or a signed URL. If the image is behind authentication, the API call will fail. Buildorado file uploads generate accessible URLs automatically.
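
A quick pre-flight check can catch unsupported formats or non-HTTP URLs before the node runs. A minimal sketch (the helper name and the extension-based heuristic are illustrative, not part of Buildorado; URLs without a file extension would need a different check, such as inspecting the response's Content-Type header):

```python
from urllib.parse import urlparse

# Formats the Vision node accepts, per the list above
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def is_supported_image_url(url: str) -> bool:
    """Check that the URL is http(s) and its path ends in a supported extension."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # local paths and other schemes are rejected by the API
    path = parsed.path.lower()  # query strings (e.g. signed-URL params) are ignored
    return any(path.endswith(ext) for ext in SUPPORTED_EXTENSIONS)
```

Note that signed URLs keep their extension in the path, so their query parameters do not affect this check.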

Prompt

The text instruction telling the model what to analyze or describe about the image. Be specific about what you want:

Good prompts:

Describe the damage visible in this photo. Note the location,
severity (minor, moderate, severe), and type of damage.

Is this a valid government-issued ID? Check for:
1. Full name visible
2. Photo present
3. Expiration date visible and not expired
4. No obvious signs of tampering
Return your assessment as JSON.

List all text visible in this image, including any
numbers, dates, and labels.

Vague prompts to avoid:

What is this?
Analyze this image.

The more specific your prompt, the more useful the model's response.

Detail Level

Controls how much visual detail the model processes:

| Level | Description | Speed | Cost | Use When |
|---|---|---|---|---|
| Auto | Model decides based on image content | Varies | Varies | Default for most cases |
| Low | Processes a lower-resolution version | Fastest | Cheapest | Simple checks (is this a photo of a person?) |
| High | Processes at full resolution | Slowest | Most expensive | Fine details matter (small text, subtle defects) |

  • Auto is the recommended default. The model intelligently chooses the resolution based on the task.
  • Low costs fewer tokens and processes faster. Use it when you only need a general understanding of the image.
  • High is necessary when the analysis depends on fine details like small text, specific colors, or subtle visual features.
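
For OpenAI models, this setting corresponds to the `detail` field the Chat Completions API accepts on image inputs. A minimal sketch of the kind of message payload involved (the helper is hypothetical, not Buildorado's actual implementation):

```python
def build_vision_message(image_url: str, prompt: str, detail: str = "auto") -> dict:
    """Build an OpenAI-style chat message pairing a text prompt with an image.

    The `detail` value ("auto", "low", or "high") mirrors the node's
    Detail Level setting.
    """
    if detail not in ("auto", "low", "high"):
        raise ValueError(f"unsupported detail level: {detail}")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
        ],
    }
```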

Max Tokens

Limits the length of the model's text response. Default is 1024. Maximum is 16384. Set this based on how detailed you need the analysis:

  • 100-256 -- Brief descriptions, yes/no answers, single labels
  • 256-512 -- Moderate descriptions, short lists
  • 512-1024 -- Detailed analysis with multiple observations
  • 1024+ -- Comprehensive reports

Output

The Vision node produces the following output:

| Field | Type | Description |
|---|---|---|
| content | string | The model's text analysis of the image |
| inputTokens | number | Number of input tokens used |
| outputTokens | number | Number of output tokens used |
| model | string | The model that was used |
| provider | string | The provider that was used |

The content field is the primary output, available to downstream nodes via template variables. You can feed it into Branch nodes for conditional routing, Email nodes for notifications, or Spreadsheet nodes for data logging.

Use Cases

Insurance Claim Assessment

Process damage photos submitted through a claim form:

  • Image URL: Reference the file upload field containing the damage photo.
  • Prompt: "Assess the damage in this photo. Describe the type of damage, estimated severity (minor, moderate, severe), and affected areas. Return JSON with fields: damageType, severity, affectedAreas, description."
  • Detail level: High (fine details matter for damage assessment)
  • Route claims by severity: severe to a human adjuster, minor to auto-approval.
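
The routing step above could be sketched as a small function consuming the model's parsed JSON (a hypothetical helper; the `severity` field name comes from the example prompt, and the route names are assumptions):

```python
def route_claim(assessment: dict) -> str:
    """Route a claim based on the severity field in the model's JSON output.

    Assumes the prompt asked for JSON with a `severity` field of
    "minor", "moderate", or "severe".
    """
    severity = str(assessment.get("severity", "")).lower()
    if severity == "severe":
        return "human_adjuster"
    if severity == "minor":
        return "auto_approval"
    return "standard_review"  # moderate or unexpected values
```

Defaulting unexpected values to a review queue keeps malformed model output from silently auto-approving a claim.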

ID Verification

Verify uploaded identification documents:

  • Image URL: Reference the ID photo upload field.
  • Prompt: "Verify this identification document. Check: 1) Is a full name visible? 2) Is a photo present? 3) Is an expiration date visible? 4) Does the document appear authentic? Return JSON with isValid (boolean) and findings (array of strings)."
  • Detail level: High
  • Route valid IDs to the next step and flag invalid ones for manual review.

Product Photo Moderation

Validate product images submitted by sellers:

  • Image URL: Reference the product photo upload field.
  • Prompt: "Evaluate this product photo for marketplace listing. Check: appropriate content, clear product visibility, no watermarks, adequate lighting. Rate quality 1-5 and list any issues."
  • Detail level: Auto
  • Approve high-quality photos and flag low-quality ones for re-upload.

Receipt Processing

Extract information from uploaded receipts:

  • Image URL: Reference the receipt upload field.
  • Prompt: "Extract the following from this receipt: store name, date, total amount, tax amount, payment method, and list of items with prices. Return as JSON."
  • Detail level: High (receipts have small text)
  • Map extracted data to expense tracking spreadsheets or accounting integrations.
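
Mapping the extracted JSON onto a spreadsheet row might look like this sketch (field names such as `storeName` are assumptions based on the example prompt; adjust them to whatever keys your prompt actually requests):

```python
def receipt_to_row(receipt: dict) -> list:
    """Flatten extracted receipt JSON into a flat spreadsheet row."""
    items = receipt.get("items", [])
    return [
        receipt.get("storeName", ""),
        receipt.get("date", ""),
        receipt.get("totalAmount", ""),
        receipt.get("taxAmount", ""),
        receipt.get("paymentMethod", ""),
        # Collapse the item list into one cell
        "; ".join(f"{i.get('name', '')} ({i.get('price', '')})" for i in items),
    ]
```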

Real Estate Listing Analysis

Describe property photos for listings:

  • Image URL: Reference the property photo upload.
  • Prompt: "Describe this property photo for a real estate listing. Include room type, notable features, condition, natural lighting, and approximate room size category (small, medium, large)."
  • Detail level: Auto
  • Use the generated descriptions in automated listing creation workflows.

Vision vs. OCR: When to Use Each

Both Vision and OCR nodes process images, but they serve different purposes:

| Feature | Vision | OCR |
|---|---|---|
| Primary purpose | Understanding image content | Extracting text from images |
| Output | Descriptive text or analysis | Raw extracted text |
| Best for | Photos, scenes, diagrams, verification | Documents, forms, receipts, labels |
| Understands context | Yes (what is happening in the image) | No (just reads text) |
| Cost | Higher (complex analysis) | Lower (text extraction only) |

Use Vision when you need to understand what is in an image, verify visual content, or answer questions about a scene.

Use OCR when you primarily need to extract readable text from a document or image.

For receipts and documents that need both text extraction and contextual understanding, Vision often handles both tasks in a single call.

Best Practices

  • Be specific in prompts. Tell the model exactly what to look for and what format to return. "Describe this image" produces generic output. "List all visible text and identify the document type" produces actionable data.
  • Use Low detail for simple checks. If you just need to confirm an image contains a person or is not blank, Low detail is significantly cheaper and faster.
  • Use High detail for text in images. When the image contains small text (receipts, documents, labels), High detail improves accuracy substantially.
  • Combine with structured output. Ask the model to return JSON in its response for easier downstream processing.
  • Handle missing images gracefully. If the image URL is empty (user did not upload a file), the Vision node will fail. Add conditional logic before the Vision node to check that the file upload field is not empty.
  • Consider image size. Very large images are resized by the provider. If resolution matters, ensure the uploaded image is within the provider's supported dimensions.
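
When combining structured output with downstream parsing, note that models sometimes wrap JSON in markdown code fences even when asked for raw JSON. A tolerant parser sketch (assumes Python-side post-processing; the helper is illustrative):

```python
import json
import re

def parse_model_json(content: str) -> dict:
    """Parse JSON from a model response, tolerating markdown code fences.

    Strips a surrounding ```json ... ``` (or plain ```) fence if present,
    then parses the remainder with the standard json module.
    """
    text = content.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```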

Limitations

  • The model cannot modify images. For image editing, use the Image Edit node.
  • Vision models may refuse to analyze images that violate content policies.
  • The model analyzes a single image per execution. To process multiple images, use multiple Vision nodes or a Loop node.
  • GIF analysis is limited to the first frame.
  • The image must be accessible via URL. Local file paths are not supported. File uploads in Buildorado automatically generate accessible URLs.
  • Execution is subject to a 120-second timeout.
