Claude Vision: Analyse Images, PDFs, and Documents

TopicTrick
Text is only one of the ways humans communicate and work. An enormous proportion of business information lives in images — screenshots of systems, scanned documents, charts in presentations, photos of physical infrastructure, and pages from PDFs. Until AI models gained vision capabilities, processing this content required manual extraction into text before any AI could work with it.

Claude's vision capabilities change that. You can send Claude an image, a PDF page, or a screenshot and ask it to read, analyse, reason about, and extract information from what it sees. This post covers working with visual content in the Claude API, from the mechanics of sending images to practical extraction patterns for real-world business documents.


What Claude Can See and Understand

Claude's vision capabilities handle a broad range of visual content:

  • Photographs: People, objects, scenes, physical environments, and products
  • Screenshots: User interfaces, error messages, dashboards, web pages
  • Charts and graphs: Bar charts, line graphs, pie charts — Claude can read values and describe trends
  • Documents and forms: Scanned text, handwritten notes, form fields, table data
  • Diagrams: Architecture diagrams, flowcharts, network maps, org charts
  • PDFs: Multi-page documents including text, images, and tables within PDFs
  • Code screenshots: Claude can read and reason about code visible in an image

Claude cannot process video files or audio embedded in media, and for animated GIFs it sees only the first frame. For video, extract frames and send them as individual images.


Sending an Image to Claude

Images are passed as content blocks within the messages array. Claude supports both base64-encoded images and URL-referenced images.

From a URL

This is the simplest method. If your image is publicly accessible, pass the URL directly:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/architecture-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the components in this architecture diagram and identify any potential single points of failure."
                }
            ]
        }
    ]
)

print(response.content[0].text)
```

From a Local File (Base64 Encoding)

For private images or files that are not publicly hosted:

```python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

# Read and encode the image
image_path = Path("screenshot.png")
image_data = base64.standard_b64encode(image_path.read_bytes()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "What error is shown in this screenshot? What is the most likely cause?"
                }
            ]
        }
    ]
)
```

Supported Image Formats

Claude supports JPEG, PNG, GIF (first frame), and WebP image formats. For PDFs, you have two options: convert pages to images with a library like pdf2image and send them as base64 images, or use the Files API to upload the full PDF. The Files API approach is covered in the next post and is recommended for multi-page PDF processing.
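The base64 examples hard-code `media_type`. As a minimal sketch, the media type can instead be derived from the file extension, restricted to the four formats listed above. The helper names `media_type_for` and `image_block` are hypothetical and not part of the Anthropic SDK, and the lookup trusts the file name rather than inspecting the file's contents.

```python
import base64
from pathlib import Path

# Extensions for the four formats named above (JPEG, PNG, GIF, WebP).
SUPPORTED_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def media_type_for(path):
    """Return the media type for an image path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {suffix or '(none)'}")
    return SUPPORTED_MEDIA_TYPES[suffix]


def image_block(path):
    """Build a base64 image content block from a local file (hypothetical helper)."""
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type_for(path),
            "data": data,
        },
    }
```

The dictionary returned by `image_block("screenshot.png")` can be placed directly in the `content` list in place of the hand-written image block.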


    Supported Image Sizes and Limits

    • Maximum image file size: 5MB per image through the API at the time of writing (limits change, so check the current documentation)
    • Maximum images per request: 100
    • Very large images are automatically downscaled internally, so you do not strictly need to resize them before sending, but smaller images upload and process faster
    • For optimal performance, resize images to 1280px on the longest edge before sending. Beyond this size, the quality gain is negligible while token cost increases
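To make the resizing advice concrete, here is a small sketch that computes target dimensions for a 1280px longest edge, along with a rough token estimate. The estimate uses the approximation tokens ≈ (width × height) / 750 from Anthropic's vision documentation, so treat it as a ballpark figure. The function names `fit_longest_edge` and `estimate_image_tokens` are illustrative.

```python
def fit_longest_edge(width, height, max_edge=1280):
    """Scale (width, height) down so the longest edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # never upscale small images
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_image_tokens(width, height):
    """Rough estimate only: (width * height) / 750, rounded to an integer."""
    return round(width * height / 750)


# Example: a 4000x3000 photo scaled to fit a 1280px longest edge.
new_width, new_height = fit_longest_edge(4000, 3000)
approx_tokens = estimate_image_tokens(new_width, new_height)
```

With Pillow, `Image.thumbnail((1280, 1280))` performs an equivalent aspect-preserving resize in place. Because oversized images are downscaled before processing, the estimate is most meaningful when applied to the dimensions you actually send.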

    Practical Use Cases with Code

    Document Data Extraction

    Extract structured data from a scanned invoice, receipt, or form:

```python
import json

# Assumes `client` from the earlier examples and `invoice_base64`,
# a base64-encoded JPEG of the invoice.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/jpeg", "data": invoice_base64}
                },
                {
                    "type": "text",
                    "text": """Extract the following fields from this invoice and return as JSON:
{
  "vendor_name": "",
  "invoice_number": "",
  "invoice_date": "",
  "due_date": "",
  "total_amount": "",
  "line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}]
}
Return only the JSON. Do not include any explanation."""
                }
            ]
        }
    ]
)

# Parse the reply; this raises json.JSONDecodeError if the reply is not pure JSON
invoice_data = json.loads(response.content[0].text)
```
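Even with an instruction to return only JSON, a reply can occasionally arrive wrapped in a Markdown code fence, which makes `json.loads` fail. The following is a defensive sketch, not an SDK feature: `parse_json_reply` is a hypothetical helper that strips one surrounding fence, if present, before parsing.

```python
import json
import re

# Three backticks, built indirectly so the marker does not clash with this listing.
FENCE = "`" * 3

# Matches a reply that consists entirely of one fenced block, with or without "json".
_FENCED_REPLY = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE),
    re.DOTALL,
)


def parse_json_reply(text):
    """Parse JSON from model output, tolerating a surrounding code fence."""
    cleaned = text.strip()
    match = _FENCED_REPLY.fullmatch(cleaned)
    if match:
        cleaned = match.group(1)
    # Still raises json.JSONDecodeError if the content is not valid JSON.
    return json.loads(cleaned)
```

In the invoice example, `parse_json_reply(response.content[0].text)` could stand in for the direct `json.loads` call. Extracted values should still be validated before they reach downstream systems, since the model can misread a digit or a date.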

    Chart and Graph Analysis

```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "url", "url": "https://example.com/quarterly-sales.png"}
                },
                {
                    "type": "text",
                    "text": "Analyse this sales chart. What is the overall trend? Which quarter had the highest growth? Are there any anomalies worth investigating?"
                }
            ]
        }
    ]
)
```

    Multi-Image Comparison

    Claude can work with multiple images in a single request — useful for comparing before/after states, different designs, or multiple document pages:

```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Here are two versions of the same login page. Image 1 is the current version:"},
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": v1_base64}},
                {"type": "text", "text": "Image 2 is the proposed redesign:"},
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": v2_base64}},
                {"type": "text", "text": "From a UX perspective, what are the key differences? Which version would likely convert better and why?"}
            ]
        }
    ]
)
```

    Label Your Images in Multi-Image Prompts

    When sending multiple images in a single request, always label them explicitly in your text content — 'Image 1:', 'Image 2:', and so on. Claude can refer to images by the labels you provide, which makes the conversation much clearer and prevents confusion when Claude needs to distinguish between images in its response.
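One way to keep labels consistent is to generate them rather than type them by hand. The sketch below interleaves an "Image N:" text block before each image block. `labelled_image_blocks` is a hypothetical helper, and it assumes every image shares one media type.

```python
def labelled_image_blocks(images, media_type="image/png"):
    """Interleave 'Image N: <caption>' text blocks with base64 image blocks.

    `images` is a list of (caption, base64_data) pairs.
    """
    content = []
    for index, (caption, data) in enumerate(images, start=1):
        # The label comes first so the model reads it before the image it names.
        content.append({"type": "text", "text": f"Image {index}: {caption}"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        })
    return content
```

For the login page comparison, you might call `labelled_image_blocks([("current version", v1_base64), ("proposed redesign", v2_base64)])` and then append the final question as one more text block before sending.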


      Processing PDFs with Vision

      For PDF documents, you have two main approaches:

      Approach 1 — Convert Pages to Images

      Use a Python library to convert PDF pages to images, then send each page as an image:

```python
import base64
from io import BytesIO

# pdf2image needs the poppler utilities installed on the system
from pdf2image import convert_from_path

# Assumes `client = anthropic.Anthropic()` from the earlier examples.

# Convert PDF pages to PIL Images
pages = convert_from_path("contract.pdf", dpi=200)

# Encode each page
encoded_pages = []
for page in pages:
    buffer = BytesIO()
    page.save(buffer, format="JPEG", quality=85)
    encoded_pages.append(base64.standard_b64encode(buffer.getvalue()).decode("utf-8"))

# Build content blocks for the first 5 pages
content = [{"type": "text", "text": "Review this contract and identify all obligations, payment terms, and termination clauses."}]
for i, page_data in enumerate(encoded_pages[:5]):
    content.append({"type": "text", "text": f"Page {i+1}:"})
    content.append({
        "type": "image",
        "source": {"type": "base64", "media_type": "image/jpeg", "data": page_data}
    })

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{"role": "user", "content": content}]
)
```
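The example above sends only the first five pages. For longer documents, one option is to split the encoded pages into batches and send one request per batch, keeping page numbers continuous across batches. This is a sketch under assumptions: `batch_page_blocks` and the batch size of 5 are illustrative, and each batch is analysed independently, so a clause that spans a batch boundary may need a follow-up pass.

```python
def batch_page_blocks(encoded_pages, batch_size=5):
    """Yield one list of content blocks per batch of base64-encoded JPEG pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(encoded_pages), batch_size):
        blocks = []
        batch = encoded_pages[start:start + batch_size]
        for offset, page_data in enumerate(batch):
            # Page numbers continue across batches (1-based).
            blocks.append({"type": "text", "text": f"Page {start + offset + 1}:"})
            blocks.append({
                "type": "image",
                "source": {"type": "base64", "media_type": "image/jpeg", "data": page_data},
            })
        yield blocks
```

Each yielded list can be prefixed with the instruction text block and passed as the `content` of a separate `client.messages.create` call.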

      Approach 2 — Files API (Recommended for Multi-Page PDFs)

      The Files API (covered in the next post) lets you upload a PDF once and reference it by file ID, which is cleaner and more efficient for large documents.


      Accessibility and Responsible Use

      • Do not use Claude's vision to identify individuals by face — this raises significant privacy concerns and Anthropic's policies restrict this use case
      • Be transparent with users when their submitted images are being processed by AI
      • For healthcare documents, legal contracts, and other sensitive materials, ensure your data handling complies with applicable regulations before sending images to external APIs

      Image Data and Privacy

      Every image you send to Claude via the API leaves your environment. Treat image data with the same care you would apply to any sensitive text data. Do not send images containing personal identification information, medical records, or confidential business data unless you have assessed the data processing implications and your users have given appropriate consent.


        Summary

        Claude's vision capabilities unlock a category of automation that was previously impossible with text-only AI: working directly with the visual information that fills modern business workflows. From invoice extraction to infrastructure diagram analysis, Claude can be the reading layer that processes your visual content and converts it into actionable structured data.

        Core takeaways:

        • Use URL source for publicly accessible images — simpler and no encoding overhead
        • Use base64 encoding for private, local, or dynamically generated images
        • Use multiple images in one request for comparison and multi-page documents
        • Label images explicitly when sending more than one in a request
        • Resize large images to ~1280px longest edge for optimal cost efficiency

        Next up: Building with the Claude Files API: Upload Once, Use Many Times.


        This post is part of the Anthropic AI Tutorial Series. Previous post: Claude Web Search Tool: Real-Time Data in Your AI App.