Multimodal AI Guide: Vision, Audio, and Document Processing with GPT-4o and Gemini

Modern LLMs accept text, images, audio, and documents as input. This opens up use cases that text-only models could not handle directly: invoice extraction, screenshot debugging, voice assistants, and document QA.

Modality support matrix

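| Model | Text | Images | Audio | Video | Documents |
|---|---|---|---|---|---|
| GPT-4o | | | | | ✅ (via images) |
| GPT-4o-mini | | | | | ✅ (via images) |
| Gemini 1.5 Pro | | | | | ✅ (native PDF) |
| Claude 3.5 Sonnet | | | | | ✅ (via images) |
| Whisper-1 | | | ✅ (STT) | | |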

Image analysis with GPT-4o

import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI();

// Option 1: URL-based image
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'image_url', image_url: { url: 'https://example.com/chart.png', detail: 'high' } },
      { type: 'text', text: 'Describe the trend in this chart and identify any anomalies.' },
    ],
  }],
  max_tokens: 500,
});

// Option 2: Base64-encoded local image
const imageBuffer = fs.readFileSync('screenshot.png');
const base64Image = imageBuffer.toString('base64');

const response2 = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      {
        type: 'image_url',
        image_url: {
          url: `data:image/png;base64,${base64Image}`,
          detail: 'high',  // 'low' for fast/cheap, 'high' for detailed analysis
        },
      },
      { type: 'text', text: 'What errors are visible in this UI screenshot?' },
    ],
  }],
});
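
The base64 example hardcodes `image/png` in the data URL. When the input may be a JPEG, WebP, or GIF, the MIME type should match the file. The following is a minimal sketch of one way to derive it from the file extension; `toImageDataUrl` and the extension map are hypothetical helpers for illustration, not part of the OpenAI SDK.

```typescript
import * as path from 'node:path';

// Hypothetical lookup table: file extension -> MIME type for the data URL.
const MIME_BY_EXT: Record<string, string> = {
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.gif': 'image/gif',
  '.webp': 'image/webp',
};

// Builds a data URL suitable for the `image_url.url` field.
// Throws for extensions that are not in the table above.
function toImageDataUrl(fileName: string, data: Buffer): string {
  const ext = path.extname(fileName).toLowerCase();
  const mime = MIME_BY_EXT[ext];
  if (!mime) {
    throw new Error(`Unsupported image extension: ${ext || '(none)'}`);
  }
  return `data:${mime};base64,${data.toString('base64')}`;
}
```

With this in place, the `url` field in Option 2 could be written as `toImageDataUrl('screenshot.png', imageBuffer)`.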

Image detail levels and token cost

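| Detail | Processing | Token cost (GPT-4o) | Use case |
|---|---|---|---|
| low | Fixed 512×512 | 85 tokens | Classification, quick checks |
| high | Full-resolution 512px tiles | 85 + 170 per tile (e.g. 765 for 1024×1024) | OCR, detailed analysis |
| auto | Model decides | Varies | General purpose |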
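
To budget for image inputs before sending a request, the cost per detail level can be estimated from the image dimensions. The sketch below follows the tiling rule OpenAI has documented for GPT-4o (85 base tokens plus 170 per 512×512 tile after downscaling). Treat the constants and the downscale-only behaviour as assumptions to verify against current pricing docs; `estimateImageTokens` is a hypothetical helper, and other models such as GPT-4o-mini use different numbers.

```typescript
type Detail = 'low' | 'high';

// Rough token estimate for one image, assuming GPT-4o's published tiling rule.
function estimateImageTokens(width: number, height: number, detail: Detail): number {
  const BASE_TOKENS = 85;      // assumed flat cost per image
  const TOKENS_PER_TILE = 170; // assumed cost per 512x512 tile
  if (detail === 'low') return BASE_TOKENS;

  // Step 1: fit inside a 2048x2048 square, keeping the aspect ratio.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: shrink so the shortest side is at most 768px (never upscale).
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count the 512px tiles needed to cover the scaled image.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}
```

Under these assumptions a 1024×1024 image costs 765 tokens at high detail and 85 at low, which is why low is a sensible default for classification.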

Document extraction (invoices, forms, PDFs)

import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
import pdf2pic from 'pdf2pic';

// Convert PDF pages to images for GPT-4o.
// Assumes the `openai` client created in the image analysis example above.
async function extractInvoiceData(pdfPath: string) {
  const converter = pdf2pic.fromPath(pdfPath, { format: 'png', width: 2000, height: 2600 });
  const pageImages = await converter.bulk(-1, { responseType: 'base64' });

  const InvoiceSchema = z.object({
    invoiceNumber: z.string(),
    date:          z.string(),
    vendor:        z.string(),
    totalAmount:   z.number(),
    currency:      z.string(),
    lineItems:     z.array(z.object({
      description: z.string(),
      quantity:    z.number(),
      unitPrice:   z.number(),
      total:       z.number(),
    })),
  });

  const schema = zodToJsonSchema(InvoiceSchema, 'Invoice');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: [
        ...pageImages.map(img => ({
          type: 'image_url' as const,
          image_url: { url: `data:image/png;base64,${img.base64}`, detail: 'high' as const },
        })),
        { type: 'text', text: 'Extract all invoice data from these pages.' },
      ],
    }],
    response_format: {
      type: 'json_schema',
      json_schema: { name: 'Invoice', strict: true, schema: schema.definitions!['Invoice'] },
    },
  });

  return InvoiceSchema.parse(JSON.parse(response.choices[0].message.content!));
}
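
Schema validation confirms the shape of the extracted invoice, not that the numbers are consistent. As a hedged sketch, a small arithmetic check can flag extractions worth a manual review. `checkInvoiceTotals` and the 0.01 tolerance are assumptions for illustration; real invoices may include tax or shipping outside the line items, so a mismatch is a signal rather than proof of an OCR error.

```typescript
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
  total: number;
}

interface Invoice {
  invoiceNumber: string;
  date: string;
  vendor: string;
  totalAmount: number;
  currency: string;
  lineItems: LineItem[];
}

// Returns a list of human-readable inconsistencies; an empty list means
// every line multiplies out and the line totals add up to totalAmount.
function checkInvoiceTotals(invoice: Invoice, tolerance = 0.01): string[] {
  const issues: string[] = [];
  let sum = 0;
  invoice.lineItems.forEach((item, i) => {
    const expected = item.quantity * item.unitPrice;
    if (Math.abs(expected - item.total) > tolerance) {
      issues.push(`line ${i + 1}: ${item.quantity} x ${item.unitPrice} != ${item.total}`);
    }
    sum += item.total;
  });
  if (Math.abs(sum - invoice.totalAmount) > tolerance) {
    issues.push(
      `line items sum to ${sum.toFixed(2)}, but totalAmount is ${invoice.totalAmount.toFixed(2)}`,
    );
  }
  return issues;
}
```

A caller could run this on the result of `extractInvoiceData` and route any invoice with a non-empty issue list to a human reviewer.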

Audio transcription with Whisper

import fs from 'fs';

// Transcribe audio file
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream('meeting.mp3'),
  model: 'whisper-1',
  language: 'en',           // optional — auto-detected if omitted
  response_format: 'verbose_json',  // includes timestamps
  timestamp_granularities: ['word', 'segment'],
});

console.log(transcription.text);
// Access word-level timestamps
transcription.words?.forEach(w => {
  console.log(`[${w.start.toFixed(2)}s] ${w.word}`);
});
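
Once segment timestamps are available, a common next step is turning them into subtitles. The API can return `srt` directly via `response_format`, so the sketch below is only useful when you already hold `verbose_json` output and want both the structured data and a subtitle file. The `Segment` type is a simplified local assumption covering the `start`, `end`, and `text` fields, and `toSrt` is a hypothetical helper.

```typescript
// Simplified view of a verbose_json segment (assumed subset of fields).
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function formatSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Joins segments into numbered SRT cues separated by blank lines.
function toSrt(segments: Segment[]): string {
  return (
    segments
      .map(
        (seg, i) =>
          `${i + 1}\n${formatSrtTime(seg.start)} --> ${formatSrtTime(seg.end)}\n${seg.text.trim()}`,
      )
      .join('\n\n') + '\n'
  );
}
```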

Audio translation (any language → English)

const translation = await openai.audio.translations.create({
  file: fs.createReadStream('spanish-meeting.mp3'),
  model: 'whisper-1',
  response_format: 'text',
});
console.log(translation);  // English translation

Text-to-speech

import { writeFile } from 'fs/promises';

const mp3 = await openai.audio.speech.create({
  model: 'tts-1',         // tts-1-hd for higher quality
  voice: 'nova',          // alloy | echo | fable | onyx | nova | shimmer
  input: 'Hello, welcome to AskDB. How can I help you today?',
  speed: 1.0,             // 0.25–4.0
});

// Buffer the audio and write it to disk. arrayBuffer() avoids casting mp3.body,
// whose stream type differs between SDK and Node versions.
await writeFile('response.mp3', Buffer.from(await mp3.arrayBuffer()));
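
The speech endpoint caps `input` at 4096 characters per request, so longer text has to be split and synthesized in several calls. The following is a rough sketch of sentence-aware chunking. `splitForTts` is a hypothetical helper, and the sentence regex is deliberately naive (it will split on abbreviations and decimals), so treat it as a starting point.

```typescript
// Splits text into chunks of at most maxChars, preferring sentence boundaries.
function splitForTts(text: string, maxChars = 4096): string[] {
  // Naive sentence split: runs ending in . ! or ?, plus any trailing remainder.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = '';

  for (const sentence of sentences) {
    // Flush the current chunk if adding this sentence would exceed the limit.
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = '';
    }
    // A single sentence longer than the limit is hard-split.
    let rest = sentence;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current += rest;
  }

  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}
```

Each chunk can then be passed as `input` to a separate `openai.audio.speech.create` call and the resulting audio files concatenated.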

Multi-image comparison

// Compare multiple images in a single request.
// designAUrl and designBUrl are assumed to be image URLs (or data URLs) defined elsewhere.
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Compare these two UI designs and identify UX differences:' },
      { type: 'text', text: 'Design A:' },
      { type: 'image_url', image_url: { url: designAUrl, detail: 'high' } },
      { type: 'text', text: 'Design B:' },
      { type: 'image_url', image_url: { url: designBUrl, detail: 'high' } },
      { type: 'text', text: 'Which has better usability and why?' },
    ],
  }],
  max_tokens: 1000,
});
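
The label-then-image pattern in this request generalizes to any number of images. As an illustrative sketch, the content array can be built from a list of labeled images. `buildComparisonContent`, `LabeledImage`, and the local `ContentPart` type are assumptions that mirror the shapes used in the example, not SDK exports.

```typescript
type ImageDetail = 'low' | 'high' | 'auto';

// Local stand-in for the two content part shapes used in this guide.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail: ImageDetail } };

interface LabeledImage {
  label: string;
  url: string;
}

// Interleaves a text label before each image so the model can refer to
// images by name, then appends the closing question.
function buildComparisonContent(
  instruction: string,
  images: LabeledImage[],
  question: string,
  detail: ImageDetail = 'high',
): ContentPart[] {
  const parts: ContentPart[] = [{ type: 'text', text: instruction }];
  for (const img of images) {
    parts.push({ type: 'text', text: `${img.label}:` });
    parts.push({ type: 'image_url', image_url: { url: img.url, detail } });
  }
  parts.push({ type: 'text', text: question });
  return parts;
}
```

The returned array can be passed as the `content` of a single user message.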

Gemini 1.5 Pro: native PDF and video

import { GoogleGenerativeAI } from '@google/generative-ai';
import { GoogleAIFileManager } from '@google/generative-ai/server';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload PDF directly (no image conversion needed)
const uploadResponse = await fileManager.uploadFile('report.pdf', {
  mimeType: 'application/pdf',
  displayName: 'Q2 Report',
});

const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });

const result = await model.generateContent([
  { fileData: { mimeType: 'application/pdf', fileUri: uploadResponse.file.uri } },
  'Summarize the key financial metrics from this report.',
]);
console.log(result.response.text());

Vision use cases and patterns

| Use case | Model | detail | Notes |
|---|---|---|---|
| Product photo classification | gpt-4o-mini | low | Fast, cheap at scale |
| Invoice / receipt OCR | gpt-4o | high | High accuracy on text |
| UI screenshot debugging | gpt-4o | high | Identify errors visually |
| Chart / graph analysis | gpt-4o | high | Extract data points |
| PDF document QA | Gemini 1.5 Pro | N/A | Native PDF, longer context |
| Video summarization | Gemini 1.5 Pro | N/A | Up to 1-hour videos |
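
In application code, the recommendations in this table can be captured as a lookup so that callers choose a use case rather than a model. This is a minimal sketch: the `UseCase` keys, `Route` type, and `pickRoute` function are hypothetical names, and the model identifiers are simply the ones used elsewhere in this guide.

```typescript
type UseCase =
  | 'classification'
  | 'ocr'
  | 'screenshot'
  | 'chart'
  | 'pdf-qa'
  | 'video-summary';

interface Route {
  model: string;
  detail?: 'low' | 'high'; // omitted for Gemini, which has no detail setting here
}

// One entry per row of the use-case table.
const ROUTES: Record<UseCase, Route> = {
  classification: { model: 'gpt-4o-mini', detail: 'low' },
  ocr: { model: 'gpt-4o', detail: 'high' },
  screenshot: { model: 'gpt-4o', detail: 'high' },
  chart: { model: 'gpt-4o', detail: 'high' },
  'pdf-qa': { model: 'gemini-1.5-pro' },
  'video-summary': { model: 'gemini-1.5-pro' },
};

function pickRoute(useCase: UseCase): Route {
  return ROUTES[useCase];
}
```

Keeping the mapping in one place makes it easier to change a model choice later without touching call sites.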

Cost optimization for vision

  • Use detail: 'low' for classification: a fixed 85 tokens per image on GPT-4o, versus several hundred to over a thousand at high.
  • Resize images to max 2048px on the longest side before sending.
  • For repeated documents (e.g., same template invoice), cache extracted results.
  • Use Gemini 1.5 Pro for PDF QA — native PDF avoids expensive image conversion.
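
The caching advice above can be sketched as an in-memory cache keyed by a hash of the file contents. Note the assumption: this only catches byte-identical files that are submitted more than once, not different invoices that happen to share a template. `contentKey` and `ExtractionCache` are hypothetical names, and a production setup would likely use a persistent store instead of a `Map`.

```typescript
import { createHash } from 'node:crypto';

// SHA-256 of the raw bytes, used as the cache key.
function contentKey(data: Buffer): string {
  return createHash('sha256').update(data).digest('hex');
}

// Minimal in-memory cache for extraction results, keyed by file content.
class ExtractionCache<T> {
  private readonly store = new Map<string, T>();

  get(data: Buffer): T | undefined {
    return this.store.get(contentKey(data));
  }

  set(data: Buffer, value: T): void {
    this.store.set(contentKey(data), value);
  }
}
```

A caller would check `cache.get(fileBytes)` before calling the vision model and call `cache.set(fileBytes, result)` after a successful extraction.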

Takeaway

GPT-4o handles most image tasks well, and Whisper and the TTS models cover speech-to-text and text-to-speech. For large PDFs and video, Gemini 1.5 Pro's native file support is more cost-effective than converting pages to images first. Default to detail: 'low' and upgrade to high only for tasks that need fine-grained text or detail extraction.