Multimodal AI Guide: Vision, Audio, and Document Processing with GPT-4o and Gemini

Modern LLMs accept text, images, audio, and documents as input. Multimodal AI unlocks use cases that were impossible with text-only models: invoice extraction, screenshot debugging, voice assistants, and document QA.

Modality support matrix

Model	Text	Images	Audio	Video	Documents
GPT-4o	✅	✅	✅ (audio-preview and Realtime variants)	❌	✅ (via images)
GPT-4o-mini	✅	✅	❌	❌	✅ (via images)
Gemini 1.5 Pro	✅	✅	✅	✅	✅ (native PDF)
Claude 3.5 Sonnet	✅	✅	❌	❌	✅ (via images)
Whisper-1	❌	❌	✅ (STT)	❌	❌

Image analysis with GPT-4o

import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI();

// Option 1: URL-based image
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'image_url', image_url: { url: 'https://example.com/chart.png', detail: 'high' } },
      { type: 'text', text: 'Describe the trend in this chart and identify any anomalies.' },
    ],
  }],
  max_tokens: 500,
});

// Option 2: Base64-encoded local image
const imageBuffer = fs.readFileSync('screenshot.png');
const base64Image = imageBuffer.toString('base64');

const response2 = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      {
        type: 'image_url',
        image_url: {
          url: `data:image/png;base64,${base64Image}`,
          detail: 'high',  // 'low' for fast/cheap, 'high' for detailed analysis
        },
      },
      { type: 'text', text: 'What errors are visible in this UI screenshot?' },
    ],
  }],
});
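
Option 2 hardcodes image/png in the data URL, which is only right for PNG files. A minimal sketch of a helper that picks the MIME type from the file extension follows; toImageDataUrl and IMAGE_MIME_TYPES are hypothetical names, not part of the OpenAI SDK, and the extension list assumes the formats the vision docs describe as supported (PNG, JPEG, WEBP, non-animated GIF).

```typescript
// Extension-to-MIME map for the image formats assumed to be accepted.
const IMAGE_MIME_TYPES: Record<string, string> = {
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg',
  webp: 'image/webp',
  gif: 'image/gif',
};

// Hypothetical helper: build a base64 data URL for a local image file.
function toImageDataUrl(filename: string, bytes: Uint8Array): string {
  // Take the text after the last dot as the extension, case-insensitively.
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  const mime = IMAGE_MIME_TYPES[ext];
  if (!mime) {
    throw new Error(`Unsupported image extension: "${ext}"`);
  }
  return `data:${mime};base64,${Buffer.from(bytes).toString('base64')}`;
}
```

The result can be passed as image_url.url, for example toImageDataUrl('screenshot.png', fs.readFileSync('screenshot.png')).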

Image detail levels and token cost

Detail	Processing	Token cost	Use case
low	Fixed 512×512	~85 tokens	Classification, quick checks
high	Scaled, then split into 512×512 tiles	85 + 170 per tile (765 for 1024×1024)	OCR, detailed analysis
auto	Model decides	Varies	General purpose
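
To budget tokens before sending an image, the cost can be estimated locally. The sketch below assumes the tiling rules OpenAI documents for GPT-4o: 85 base tokens, plus 170 per 512×512 tile after the image is scaled to fit 2048×2048 and its shortest side is reduced to 768px. The function name is hypothetical, the "downscale only" behavior for small images is an assumption, and other models (for example gpt-4o-mini) bill differently.

```typescript
type Detail = 'low' | 'high';

// Hypothetical estimator for GPT-4o image input tokens.
function estimateImageTokens(width: number, height: number, detail: Detail): number {
  const BASE_TOKENS = 85;
  const TOKENS_PER_TILE = 170;
  if (detail === 'low') return BASE_TOKENS; // flat cost regardless of size

  // Step 1: shrink (never enlarge) to fit inside a 2048x2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = Math.round(width * fit);
  let h = Math.round(height * fit);

  // Step 2: shrink (assumption: never enlarge) so the shortest side is at most 768px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count the 512x512 tiles needed to cover the scaled image.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}

console.log(estimateImageTokens(1024, 1024, 'high')); // 765
console.log(estimateImageTokens(2048, 4096, 'high')); // 1105
console.log(estimateImageTokens(4000, 3000, 'low'));  // 85
```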

Document extraction (invoices, forms, PDFs)

import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
import { fromPath } from 'pdf2pic';  // pdf2pic exposes fromPath as a named export

// Convert PDF pages to images for GPT-4o (reuses the `openai` client created earlier)
async function extractInvoiceData(pdfPath: string) {
  const converter = fromPath(pdfPath, { format: 'png', width: 2000, height: 2600 });
  // -1 converts every page; each result carries the page image as a base64 string
  const pageImages = await converter.bulk(-1, { responseType: 'base64' });

  const InvoiceSchema = z.object({
    invoiceNumber: z.string(),
    date:          z.string(),
    vendor:        z.string(),
    totalAmount:   z.number(),
    currency:      z.string(),
    lineItems:     z.array(z.object({
      description: z.string(),
      quantity:    z.number(),
      unitPrice:   z.number(),
      total:       z.number(),
    })),
  });

  const schema = zodToJsonSchema(InvoiceSchema, 'Invoice');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: [
        ...pageImages.map(img => ({
          type: 'image_url' as const,
          image_url: { url: `data:image/png;base64,${img.base64}`, detail: 'high' as const },
        })),
        { type: 'text', text: 'Extract all invoice data from these pages.' },
      ],
    }],
    response_format: {
      type: 'json_schema',
      json_schema: { name: 'Invoice', strict: true, schema: schema.definitions!['Invoice'] },
    },
  });

  return InvoiceSchema.parse(JSON.parse(response.choices[0].message.content!));
}

Audio transcription with Whisper

import fs from 'fs';

// Transcribe audio file
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream('meeting.mp3'),
  model: 'whisper-1',
  language: 'en',           // optional — auto-detected if omitted
  response_format: 'verbose_json',  // required for timestamp_granularities
  timestamp_granularities: ['word', 'segment'],  // word- and segment-level timestamps
});

console.log(transcription.text);
// Access word-level timestamps
transcription.words?.forEach(w => {
  console.log(`[${w.start.toFixed(2)}s] ${w.word}`);
});
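
Word-level timestamps are commonly turned into captions. The sketch below groups words into fixed-size caption lines and formats times in SRT style. It assumes each word has the { word, start, end } shape used in the loop above, with times in seconds; TimedWord, Caption, groupWordsIntoCaptions, and toSrtTimestamp are hypothetical names, not SDK exports.

```typescript
interface TimedWord { word: string; start: number; end: number; }
interface Caption { start: number; end: number; text: string; }

// Group consecutive words into captions of at most maxWordsPerCaption words.
function groupWordsIntoCaptions(words: TimedWord[], maxWordsPerCaption = 8): Caption[] {
  if (maxWordsPerCaption < 1) throw new Error('maxWordsPerCaption must be at least 1');
  const captions: Caption[] = [];
  for (let i = 0; i < words.length; i += maxWordsPerCaption) {
    const chunk = words.slice(i, i + maxWordsPerCaption);
    captions.push({
      start: chunk[0].start,              // caption starts with its first word
      end: chunk[chunk.length - 1].end,   // and ends with its last word
      text: chunk.map(w => w.word.trim()).join(' '),
    });
  }
  return captions;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const pad = (n: number, width = 2) => String(n).padStart(width, '0');
  const h = Math.floor(totalMs / 3_600_000);
  const m = Math.floor(totalMs / 60_000) % 60;
  const s = Math.floor(totalMs / 1000) % 60;
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(totalMs % 1000, 3)}`;
}

console.log(toSrtTimestamp(3661.5)); // 01:01:01,500
```

Calling groupWordsIntoCaptions(transcription.words ?? []) would yield caption objects ready to print with toSrtTimestamp.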

Audio translation (any language → English)

const translation = await openai.audio.translations.create({
  file: fs.createReadStream('spanish-meeting.mp3'),
  model: 'whisper-1',
  response_format: 'text',
});
console.log(translation);  // English translation

Text-to-speech

import { writeFile } from 'fs/promises';

const mp3 = await openai.audio.speech.create({
  model: 'tts-1',         // tts-1-hd for higher quality
  voice: 'nova',          // alloy | echo | fable | onyx | nova | shimmer
  input: 'Hello, welcome to AskDB. How can I help you today?',
  speed: 1.0,             // 0.25–4.0
});

// Buffer the audio and write it to disk; this avoids casting mp3.body,
// whose stream type depends on the SDK version and runtime
await writeFile('response.mp3', Buffer.from(await mp3.arrayBuffer()));
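
Longer scripts need to be split before synthesis, because the speech endpoint caps the length of input. The sketch below packs whole sentences into chunks under a character limit. The 4096 default reflects the documented limit at the time of writing and should be treated as an assumption to verify; splitForTts is a hypothetical helper, and the sentence detection (punctuation followed by whitespace) is deliberately naive.

```typescript
// Split text into chunks of at most maxChars, breaking between sentences where possible.
function splitForTts(text: string, maxChars = 4096): string[] {
  if (maxChars < 1) throw new Error('maxChars must be at least 1');

  // Find sentence boundaries: a run of . ! ? followed by whitespace.
  const sentences: string[] = [];
  const boundary = /([.!?]+)\s+/g;
  let start = 0;
  let m: RegExpExecArray | null;
  while ((m = boundary.exec(text)) !== null) {
    sentences.push(text.slice(start, m.index + m[1].length));
    start = m.index + m[0].length;
  }
  if (start < text.length) sentences.push(text.slice(start));

  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences.map(s => s.trim()).filter(s => s.length > 0)) {
    // A single sentence longer than the limit is hard-split into pieces.
    for (let i = 0; i < sentence.length; i += maxChars) {
      const piece = sentence.slice(i, i + maxChars);
      if (current.length === 0) {
        current = piece;
      } else if (current.length + 1 + piece.length <= maxChars) {
        current += ' ' + piece; // still fits, keep packing
      } else {
        chunks.push(current);   // chunk is full, start a new one
        current = piece;
      }
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

console.log(splitForTts('One. Two. Three.', 9)); // [ 'One. Two.', 'Three.' ]
```

Each chunk can then be sent to openai.audio.speech.create in turn and the resulting audio files concatenated.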

Multi-image comparison

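// Placeholder URLs for illustration; replace with your own images
const designAUrl = 'https://example.com/design-a.png';
const designBUrl = 'https://example.com/design-b.png';

// Compare multiple images in a single request
const comparison = await openai.chat.completions.create({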
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Compare these two UI designs and identify UX differences:' },
      { type: 'text', text: 'Design A:' },
      { type: 'image_url', image_url: { url: designAUrl, detail: 'high' } },
      { type: 'text', text: 'Design B:' },
      { type: 'image_url', image_url: { url: designBUrl, detail: 'high' } },
      { type: 'text', text: 'Which has better usability and why?' },
    ],
  }],
  max_tokens: 1000,
});

Gemini 1.5 Pro: native PDF and video

import { GoogleGenerativeAI } from '@google/generative-ai';
import { GoogleAIFileManager } from '@google/generative-ai/server';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload PDF directly (no image conversion needed)
const uploadResponse = await fileManager.uploadFile('report.pdf', {
  mimeType: 'application/pdf',
  displayName: 'Q2 Report',
});

const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });

const result = await model.generateContent([
  { fileData: { mimeType: 'application/pdf', fileUri: uploadResponse.file.uri } },
  'Summarize the key financial metrics from this report.',
]);
console.log(result.response.text());
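
Video uploads follow the same upload-then-reference pattern, but large files are processed asynchronously and generally need to reach an active state before they can be used in a prompt. A minimal polling sketch follows. It is deliberately SDK-agnostic: waitUntilActive is a hypothetical helper, and the state strings 'ACTIVE' and 'FAILED' are assumed to match what the Files API reports.

```typescript
// Poll a state getter until the uploaded file is ready, fails, or we give up.
async function waitUntilActive(
  getState: () => Promise<string>,
  { intervalMs = 5000, maxAttempts = 60 }: { intervalMs?: number; maxAttempts?: number } = {},
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await getState();
    if (state === 'ACTIVE') return; // ready to use in a prompt
    if (state === 'FAILED') throw new Error('File processing failed');
    // Any other state (for example PROCESSING): wait, then check again.
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`File still not active after ${maxAttempts} checks`);
}
```

With the file manager above, getState might be () => fileManager.getFile(uploadResponse.file.name).then(f => f.state); treat that wiring as an assumption and check it against the SDK version in use.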

Vision use cases and patterns

Use case	Model	detail	Notes
Product photo classification	gpt-4o-mini	low	Fast, cheap at scale
Invoice / receipt OCR	gpt-4o	high	High accuracy on text
UI screenshot debugging	gpt-4o	high	Identify errors visually
Chart / graph analysis	gpt-4o	high	Extract data points
PDF document QA	Gemini 1.5 Pro	N/A	Native PDF, longer context
Video summarization	Gemini 1.5 Pro	N/A	Up to 1-hour videos
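
The table above can be encoded as a small routing map so that model and detail choices live in one place. This is only a sketch: the use-case keys, the VisionRoute shape, and routeFor are hypothetical names, and the values simply mirror the table.

```typescript
type UseCase =
  | 'product-classification'
  | 'invoice-ocr'
  | 'ui-debugging'
  | 'chart-analysis'
  | 'pdf-qa'
  | 'video-summarization';

interface VisionRoute {
  provider: 'openai' | 'google';
  model: string;
  detail?: 'low' | 'high'; // only meaningful for OpenAI image inputs
}

// One entry per row of the use-case table.
const VISION_ROUTES: Record<UseCase, VisionRoute> = {
  'product-classification': { provider: 'openai', model: 'gpt-4o-mini', detail: 'low' },
  'invoice-ocr':            { provider: 'openai', model: 'gpt-4o', detail: 'high' },
  'ui-debugging':           { provider: 'openai', model: 'gpt-4o', detail: 'high' },
  'chart-analysis':         { provider: 'openai', model: 'gpt-4o', detail: 'high' },
  'pdf-qa':                 { provider: 'google', model: 'gemini-1.5-pro' },
  'video-summarization':    { provider: 'google', model: 'gemini-1.5-pro' },
};

function routeFor(useCase: UseCase): VisionRoute {
  return VISION_ROUTES[useCase];
}
```

Typing the map as Record<UseCase, VisionRoute> makes the compiler flag any use case that is added to the union without a route.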

Cost optimization for vision

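Use detail: 'low' for classification: on GPT-4o it costs a flat 85 tokens per image, versus 765 for a 1024×1024 image at high (roughly 9× cheaper).
Resize images to max 2048px on the longest side before sending; high-detail images are downscaled to fit 2048×2048 anyway, so larger uploads only add latency.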
For repeated documents (e.g., same template invoice), cache extracted results.
Use Gemini 1.5 Pro for PDF QA — native PDF avoids expensive image conversion.
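
The caching tip above can be sketched as a content-addressed cache: hash the file bytes and reuse the stored result when the same bytes show up again. Note the assumption that "repeated" means byte-identical files; invoices that merely share a template would still miss. extractWithCache and the in-memory Map are hypothetical, and a production setup would more likely use a persistent store.

```typescript
import { createHash } from 'node:crypto';

// In-memory cache keyed by the SHA-256 of the document bytes.
const extractionCache = new Map<string, unknown>();

async function extractWithCache<T>(
  fileBytes: Uint8Array,
  extract: (bytes: Uint8Array) => Promise<T>, // e.g. a wrapper around a model call
): Promise<T> {
  const key = createHash('sha256').update(fileBytes).digest('hex');
  if (extractionCache.has(key)) {
    return extractionCache.get(key) as T; // cache hit: no model call
  }
  const result = await extract(fileBytes);
  extractionCache.set(key, result);
  return result;
}
```

Only successful extractions are stored, so a failed model call is retried the next time the same file is submitted.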

Takeaway

GPT-4o handles most image and audio tasks well. For large PDFs and video, Gemini 1.5 Pro's native file support is more cost-effective. Default to detail: 'low' and switch to 'high' only for tasks that need fine-grained text or detail extraction.