Multimodal AI Guide: Vision, Audio, and Document Processing with GPT-4o and Gemini
Modern LLMs accept text, images, audio, and documents as input. Multimodal AI unlocks use cases that were impossible with text-only models: invoice extraction, screenshot debugging, voice assistants, and document QA.
Modality support matrix
| Model | Text | Images | Audio | Video | Documents |
|---|---|---|---|---|---|
| GPT-4o | ✅ | ✅ | ✅ | ❌ | ✅ (via images) |
| GPT-4o-mini | ✅ | ✅ | ❌ | ❌ | ✅ (via images) |
| Gemini 1.5 Pro | ✅ | ✅ | ✅ | ✅ | ✅ (native PDF) |
| Claude 3.5 Sonnet | ✅ | ✅ | ❌ | ❌ | ✅ (via images) |
| Whisper-1 | ❌ | ❌ | ✅ (STT) | ❌ | ❌ |
Image analysis with GPT-4o
import OpenAI from 'openai';
import fs from 'fs';

const openai = new OpenAI();

// Option 1: URL-based image
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'image_url', image_url: { url: 'https://example.com/chart.png', detail: 'high' } },
      { type: 'text', text: 'Describe the trend in this chart and identify any anomalies.' },
    ],
  }],
  max_tokens: 500,
});

// Option 2: Base64-encoded local image
const imageBuffer = fs.readFileSync('screenshot.png');
const base64Image = imageBuffer.toString('base64');
const response2 = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      {
        type: 'image_url',
        image_url: {
          url: `data:image/png;base64,${base64Image}`,
          detail: 'high', // 'low' for fast/cheap, 'high' for detailed analysis
        },
      },
      { type: 'text', text: 'What errors are visible in this UI screenshot?' },
    ],
  }],
});

Image detail levels and token cost
| Detail | Processing | Token cost | Use case |
|---|---|---|---|
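| low | Downscaled to a single 512×512 view | 85 tokens (flat) | Classification, quick checks |
| high | Scaled to fit 2048×2048, shortest side reduced to 768px, then split into 512×512 tiles | 85 + 170 per tile (765 for a 1024×1024 image) | OCR, detailed analysis |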
| auto | Model decides | Varies | General purpose |
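The cost of a high-detail image depends on its dimensions, so it can be worth estimating before sending a batch. The sketch below reproduces the tiling arithmetic OpenAI has published for GPT-4o: a flat 85 tokens for low, and for high an 85-token base plus 170 tokens per 512×512 tile. The function name `estimateImageTokens` is hypothetical, and the choice never to upscale small images is an assumption; treat the output as an estimate and compare it with the `usage` field of a real response.

```typescript
type Detail = 'low' | 'high';

// Hypothetical helper: estimates GPT-4o image input tokens from pixel dimensions.
function estimateImageTokens(width: number, height: number, detail: Detail): number {
  if (detail === 'low') return 85; // flat cost regardless of image size

  // Step 1: scale down (never up) so the image fits inside a 2048x2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = Math.round(width * fit);
  let h = Math.round(height * fit);

  // Step 2: scale down so the shortest side is 768px.
  // Assumption: images already smaller than that are left as they are.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count 512x512 tiles; each adds 170 tokens to the 85-token base.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateImageTokens(1024, 1024, 'high'));
console.log(estimateImageTokens(2048, 4096, 'high'));
```

Under these assumptions a 1024×1024 image comes to 765 tokens and a 2048×4096 image to 1,105, against a flat 85 for low.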
Document extraction (invoices, forms, PDFs)
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
import { fromPath } from 'pdf2pic'; // pdf2pic exposes fromPath as a named export

// Convert PDF pages to images for GPT-4o (reuses the `openai` client created earlier)
async function extractInvoiceData(pdfPath: string) {
  const converter = fromPath(pdfPath, { format: 'png', width: 2000, height: 2600 });
  // -1 converts every page; each result carries the page as a base64 string
  const pageImages = await converter.bulk(-1, { responseType: 'base64' });

  const InvoiceSchema = z.object({
    invoiceNumber: z.string(),
    date: z.string(),
    vendor: z.string(),
    totalAmount: z.number(),
    currency: z.string(),
    lineItems: z.array(z.object({
      description: z.string(),
      quantity: z.number(),
      unitPrice: z.number(),
      total: z.number(),
    })),
  });
  const schema = zodToJsonSchema(InvoiceSchema, 'Invoice');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: [
        ...pageImages.map(img => ({
          type: 'image_url' as const,
          image_url: { url: `data:image/png;base64,${img.base64}`, detail: 'high' as const },
        })),
        { type: 'text', text: 'Extract all invoice data from these pages.' },
      ],
    }],
    response_format: {
      type: 'json_schema',
      json_schema: { name: 'Invoice', strict: true, schema: schema.definitions!['Invoice'] },
    },
  });

  // content is null when the model refuses, so fail with a clear message
  const content = response.choices[0].message.content;
  if (!content) throw new Error('Model returned no content');
  // Validate the JSON against the Zod schema before returning it
  return InvoiceSchema.parse(JSON.parse(content));
}

Audio transcription with Whisper
import fs from 'fs';

// Transcribe audio file
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream('meeting.mp3'),
  model: 'whisper-1',
  language: 'en', // optional; auto-detected if omitted
  response_format: 'verbose_json', // includes timestamps
  timestamp_granularities: ['word', 'segment'],
});
console.log(transcription.text);

// Access word-level timestamps
transcription.words?.forEach(w => {
  console.log(`[${w.start.toFixed(2)}s] ${w.word}`);
});

Audio translation (any language → English)
const translation = await openai.audio.translations.create({
  file: fs.createReadStream('spanish-meeting.mp3'),
  model: 'whisper-1',
  response_format: 'text',
});
console.log(translation); // English translation

Text-to-speech
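The speech endpoint caps `input` at 4096 characters per request, so longer text has to be split before it is synthesized. Here is a minimal sketch of such a splitter, assuming sentence boundaries are marked by `.`, `!`, or `?`. The name `chunkForSpeech` is hypothetical and not part of the SDK.

```typescript
// Hypothetical helper: splits text into chunks of at most maxChars characters,
// breaking at sentence boundaries where possible.
function chunkForSpeech(text: string, maxChars = 4096): string[] {
  // Each match is one sentence with its trailing whitespace, or the unterminated tail.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = '';

  const flush = () => {
    if (current.trim().length > 0) chunks.push(current.trim());
    current = '';
  };

  for (let sentence of sentences) {
    // Start a new chunk if this sentence would overflow the current one.
    if (current.length + sentence.length > maxChars) flush();
    // A single sentence longer than the limit is hard-split mid-sentence.
    while (sentence.length > maxChars) {
      chunks.push(sentence.slice(0, maxChars));
      sentence = sentence.slice(maxChars);
    }
    current += sentence;
  }
  flush();
  return chunks;
}

console.log(chunkForSpeech('One two three. Four five six. Seven.', 20));
```

Each chunk can then be sent as `input` in its own speech request, and the resulting audio files concatenated in order.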
import { createWriteStream } from 'fs';
import { pipeline } from 'stream/promises';

const mp3 = await openai.audio.speech.create({
  model: 'tts-1', // tts-1-hd for higher quality
  voice: 'nova', // alloy | echo | fable | onyx | nova | shimmer
  input: 'Hello, welcome to AskDB. How can I help you today?',
  speed: 1.0, // 0.25 to 4.0
});
await pipeline(mp3.body as NodeJS.ReadableStream, createWriteStream('response.mp3'));

Multi-image comparison
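A comparison request interleaves a short text label before each image so the model can refer to the images by name. For more than two images the content array can be generated instead of written by hand. The sketch below only builds that array, which would then be passed as `content` in a chat completion call. The names `buildComparisonContent`, `LabeledImage`, and `ContentPart` are hypothetical and simply mirror the shape of the content parts used in this guide.

```typescript
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail: 'low' | 'high' | 'auto' } };

interface LabeledImage {
  label: string; // e.g. 'Design A'
  url: string;   // https URL or data: URL
}

// Hypothetical helper: intro text, then a label + image pair per entry, then the question.
function buildComparisonContent(
  intro: string,
  images: LabeledImage[],
  question: string,
  detail: 'low' | 'high' | 'auto' = 'high',
): ContentPart[] {
  const parts: ContentPart[] = [{ type: 'text', text: intro }];
  for (const img of images) {
    parts.push({ type: 'text', text: `${img.label}:` });
    parts.push({ type: 'image_url', image_url: { url: img.url, detail } });
  }
  parts.push({ type: 'text', text: question });
  return parts;
}

const parts = buildComparisonContent(
  'Compare these UI designs and identify UX differences:',
  [
    { label: 'Design A', url: 'https://example.com/a.png' },
    { label: 'Design B', url: 'https://example.com/b.png' },
  ],
  'Which has better usability and why?',
);
console.log(parts.length);
```

The hand-written request that follows is the two-image case of the same layout.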
// Compare multiple images in a single request.
// designAUrl and designBUrl are image URLs (https or data: URLs) defined elsewhere.
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Compare these two UI designs and identify UX differences:' },
      { type: 'text', text: 'Design A:' },
      { type: 'image_url', image_url: { url: designAUrl, detail: 'high' } },
      { type: 'text', text: 'Design B:' },
      { type: 'image_url', image_url: { url: designBUrl, detail: 'high' } },
      { type: 'text', text: 'Which has better usability and why?' },
    ],
  }],
  max_tokens: 1000,
});

Gemini 1.5 Pro: native PDF and video
import { GoogleGenerativeAI } from '@google/generative-ai';
import { GoogleAIFileManager } from '@google/generative-ai/server';
import fs from 'fs';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload PDF directly (no image conversion needed)
const uploadResponse = await fileManager.uploadFile('report.pdf', {
  mimeType: 'application/pdf',
  displayName: 'Q2 Report',
});

const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });
const result = await model.generateContent([
  { fileData: { mimeType: 'application/pdf', fileUri: uploadResponse.file.uri } },
  'Summarize the key financial metrics from this report.',
]);
console.log(result.response.text());

Vision use cases and patterns
| Use case | Model | detail | Notes |
|---|---|---|---|
| Product photo classification | gpt-4o-mini | low | Fast, cheap at scale |
| Invoice / receipt OCR | gpt-4o | high | High accuracy on text |
| UI screenshot debugging | gpt-4o | high | Identify errors visually |
| Chart / graph analysis | gpt-4o | high | Extract data points |
| PDF document QA | Gemini 1.5 Pro | N/A | Native PDF, longer context |
| Video summarization | Gemini 1.5 Pro | N/A | Up to 1-hour videos |
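When several of these use cases live in one codebase, the table can be encoded as a typed lookup so the model and detail level are chosen in one place. This is only a sketch: the use-case keys and the names `VisionRoute`, `ROUTES`, and `routeFor` are hypothetical, and the values simply restate the rows above.

```typescript
type UseCase =
  | 'product-classification'
  | 'invoice-ocr'
  | 'screenshot-debugging'
  | 'chart-analysis'
  | 'pdf-qa'
  | 'video-summarization';

interface VisionRoute {
  model: 'gpt-4o' | 'gpt-4o-mini' | 'gemini-1.5-pro';
  detail?: 'low' | 'high'; // left out where the table lists N/A
}

// One entry per table row; the compiler flags any use case without a route.
const ROUTES: Record<UseCase, VisionRoute> = {
  'product-classification': { model: 'gpt-4o-mini', detail: 'low' },
  'invoice-ocr': { model: 'gpt-4o', detail: 'high' },
  'screenshot-debugging': { model: 'gpt-4o', detail: 'high' },
  'chart-analysis': { model: 'gpt-4o', detail: 'high' },
  'pdf-qa': { model: 'gemini-1.5-pro' },
  'video-summarization': { model: 'gemini-1.5-pro' },
};

function routeFor(useCase: UseCase): VisionRoute {
  return ROUTES[useCase];
}

console.log(routeFor('invoice-ocr'));
```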
Cost optimization for vision
- Use detail: 'low' for classification: it costs a flat ~85 tokens per image, versus 765 tokens for a 1024×1024 image at high.
- Resize images to max 2048px on the longest side before sending.
- For repeated documents (e.g., same template invoice), cache extracted results.
- Use Gemini 1.5 Pro for PDF QA — native PDF avoids expensive image conversion.
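The resize rule in the list above reduces to a small piece of arithmetic. The sketch below computes the target dimensions only, using a hypothetical helper named `fitLongestSide`. The actual pixel resize would be done with whatever image library the project already uses.

```typescript
interface Size {
  width: number;
  height: number;
}

// Hypothetical helper: returns the dimensions to resize to so that the longest
// side is at most maxSide, preserving aspect ratio and never upscaling.
function fitLongestSide(size: Size, maxSide = 2048): Size {
  const longest = Math.max(size.width, size.height);
  if (longest <= maxSide) return { ...size }; // already small enough
  const scale = maxSide / longest;
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}

console.log(fitLongestSide({ width: 6000, height: 4000 }));
console.log(fitLongestSide({ width: 800, height: 600 }));
```

A 6000×4000 photo maps to 2048×1365, while an 800×600 image is returned unchanged.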
Takeaway
GPT-4o handles most image tasks well, and Whisper and the TTS models cover speech input and output. For large PDFs and video, Gemini 1.5 Pro's native file support is more cost-effective. Default to detail: 'low' and upgrade to high only for tasks that require fine-grained text or detail extraction.