Context Window Management: Long Documents, Sliding Window, and LLM Memory Strategies
Context window limits are one of the most common sources of production failures in LLM applications. Stuffing too much into the context is expensive and degrades quality. This guide covers the strategies that keep applications fast, accurate, and cost-efficient.
Context window sizes (2026)
| Model | Context window | Practical limit |
|---|---|---|
| GPT-4o | 128k tokens | ~80k (quality degrades past this) |
| GPT-4o-mini | 128k tokens | ~60k |
| Claude 3.5 Sonnet | 200k tokens | ~150k |
| Gemini 1.5 Pro | 1M tokens | ~500k (cost-limited) |
| Llama 3.1 405B | 128k tokens | ~80k |
Rule of thumb: 1 token ≈ 0.75 English words. A 128k context holds roughly 96,000 words or ~300 pages.
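When no tokenizer is at hand, the rule of thumb above can serve as a quick budget check. The sketch below is only an approximation: `estimateTokens` and `fitsBudget` are illustrative names rather than library functions, and the 0.75 words-per-token ratio holds only loosely, and only for English prose.

```typescript
// Rough token estimate from a word count, using the 0.75 words-per-token rule of thumb.
// Hypothetical helpers; use a real tokenizer when exact counts matter.
const WORDS_PER_TOKEN = 0.75;

function estimateTokens(text: string): number {
  // Count whitespace-separated words, ignoring empty fragments
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return Math.ceil(words / WORDS_PER_TOKEN);
}

// True when the estimated prompt size leaves the requested room for the reply
function fitsBudget(text: string, contextWindow: number, reservedForOutput = 0): boolean {
  return estimateTokens(text) <= contextWindow - reservedForOutput;
}
```

By this estimate, a 96,000-word manuscript comes to about 128,000 tokens, which fills a 128k window before any room is left for the reply.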
Token counting before sending
import { encoding_for_model } from '@dqbd/tiktoken';
// Approximate prompt size for a chat request. The per-message and reply-priming
// overheads are estimates and vary slightly between models.
function countTokens(messages: { role: string; content: string }[], model = 'gpt-4o'): number {
  const enc = encoding_for_model(model as Parameters<typeof encoding_for_model>[0]);
  try {
    let total = 3; // every reply is primed with <|start|>assistant<|message|>
    for (const msg of messages) {
      total += 4; // per-message formatting overhead
      total += enc.encode(msg.content).length;
      total += enc.encode(msg.role).length;
    }
    return total;
  } finally {
    enc.free(); // release the WASM-backed encoder even if encoding throws
  }
}
const MODEL_LIMITS: Record<string, number> = {
  'gpt-4o': 128_000,
  'gpt-4o-mini': 128_000,
};
function isWithinLimit(messages: { role: string; content: string }[], model: string): boolean {
  const tokens = countTokens(messages, model);
  const limit = MODEL_LIMITS[model] ?? 128_000;
  return tokens < limit * 0.85; // stay under 85% of the window, leaving 15% headroom
}
Strategy 1: Sliding window
Keep the system prompt and recent messages; drop the oldest when the limit is approached:
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
type Message = { role: 'system' | 'user' | 'assistant'; content: string };
function slidingWindow(
  messages: Message[],
  systemPrompt: string,
  maxTokens = 80_000
): Message[] {
  const system: Message = { role: 'system', content: systemPrompt };
  const result = [system, ...messages];
  // Stop once only the system prompt and the most recent messages remain,
  // so the latest user turn is never dropped
  while (countTokens(result) > maxTokens && result.length > 3) {
    // Remove the oldest non-system pair (one user + assistant turn)
    result.splice(1, 2);
  }
  return result;
}
// Use in conversation loop
const trimmedMessages = slidingWindow(conversationHistory, systemPrompt);
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: trimmedMessages,
});
Strategy 2: Recursive summarization
Compress older turns into a summary, preserving key facts without the full verbatim history:
async function compressHistory(
  oldMessages: Message[],
  existingSummary = ''
): Promise<string> {
  const historyText = oldMessages
    .map(m => `${m.role.toUpperCase()}: ${m.content}`)
    .join('\n');
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Update this running conversation summary with the new exchanges below.
Keep all key facts, decisions, and user preferences. Be concise.
Previous summary:
${existingSummary || 'None yet.'}
New exchanges:
${historyText}
Updated summary:`,
    }],
    max_tokens: 500,
  });
  // Fall back to the previous summary if the model returns no content
  return response.choices[0].message.content ?? existingSummary;
}
// Inside the conversation loop: compress once the history reaches 10 messages.
// A >= check is used because the length can skip past an exact multiple of 10.
const SUMMARY_PREFIX = 'Conversation so far: ';
if (messages.length >= 10) {
  // Drop the previous summary message; its content is already held in `summary`
  if (messages[0].role === 'system' && messages[0].content.startsWith(SUMMARY_PREFIX)) {
    messages.shift();
  }
  const toCompress = messages.splice(0, 8); // compress oldest 8 messages
  summary = await compressHistory(toCompress, summary);
  // Prepend summary as a system message
  messages.unshift({ role: 'system', content: `${SUMMARY_PREFIX}${summary}` });
}
Strategy 3: Map-reduce for long documents
Process document sections in parallel, then combine the results:
async function mapReduceSummarize(document: string, chunkSize = 3000): Promise<string> {
  // splitIntoChunks is a helper defined elsewhere; chunkSize is passed straight through to it
  const chunks = splitIntoChunks(document, chunkSize);
  // Map: summarize each chunk in parallel
  // (very long documents may need a concurrency cap to stay within rate limits)
  const chunkSummaries = await Promise.all(
    chunks.map(chunk =>
      openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{
          role: 'user',
          content: `Summarize this section concisely, preserving key facts:
${chunk}`,
        }],
        max_tokens: 300,
      }).then(r => r.choices[0].message.content ?? '')
    )
  );
  // Reduce: combine summaries into a final answer
  const finalResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: `Based on these section summaries, provide a comprehensive final summary:
${chunkSummaries.map((s, i) => `Section ${i + 1}: ${s}`).join('\n')}`,
    }],
    max_tokens: 800,
  });
  return finalResponse.choices[0].message.content ?? '';
}
Strategy 4: Selective context (RAG)
Instead of loading the entire document, retrieve only the most relevant chunks:
async function answerWithSelectiveContext(
  question: string,
  documentChunks: string[]
): Promise<string> {
  // Embed question and all chunks (batchEmbed and cosineSimilarity are helpers defined elsewhere).
  // In production, chunk embeddings are usually computed once and stored, not recomputed per question.
  const [questionEmbed, ...chunkEmbeds] = await batchEmbed([question, ...documentChunks]);
  // Rank chunks by relevance
  const scored = chunkEmbeds
    .map((embed, i) => ({ chunk: documentChunks[i], score: cosineSimilarity(questionEmbed, embed) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5); // top 5 chunks only
  const context = scored.map(s => s.chunk).join('\n---\n');
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      // Template literal so the apostrophe in "don't" does not end the string
      { role: 'system', content: `Answer using only the provided context. Say "I don't know" if not found.` },
      { role: 'user', content: `Context:
${context}
Question: ${question}` },
    ],
  });
  return response.choices[0].message.content ?? '';
}
Strategy 5: Lost-in-the-middle mitigation
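Research on long-context retrieval (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts") shows that LLMs recall information at the start and end of the context better than information in the middle. Position critical information at the boundaries: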
// Place most important context at the start or end, not the middle
type ScoredChunk = { chunk: string; score: number };
function orderChunksForRecall(chunks: ScoredChunk[]): string[] {
  const sorted = [...chunks].sort((a, b) => b.score - a.score);
  // Alternate by rank: 1st goes to the front, 2nd to the back, 3rd next to the front,
  // and so on, so the least relevant chunks end up in the middle
  const front: string[] = [];
  const back: string[] = [];
  sorted.forEach((c, i) => {
    if (i % 2 === 0) front.push(c.chunk);
    else back.unshift(c.chunk);
  });
  return [...front, ...back]; // critical chunks at boundaries
}
Conversation memory architecture
| Strategy | Token cost | Information loss | Latency |
|---|---|---|---|
| Full history | High (grows unbounded) | None | Low |
| Sliding window | Fixed | Old turns lost | Low |
| Summarization | Medium | Minor | Medium (+1 LLM call) |
| RAG memory | Low (selective) | Low (semantic recall) | Medium (+embed+search) |
| Entity extraction | Very low | Low (structured facts) | Low (key-value lookup) |
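Entity extraction is the only row in the table that the strategies above do not illustrate. As a minimal sketch, assuming facts have already been pulled out of the conversation into key-value pairs by some other step (for example a small LLM call, not shown), the store itself could look like this. `EntityMemory`, `Fact`, and the wording of the rendered system message are illustrative choices, not a library API.

```typescript
// Minimal key-value entity memory (hypothetical names, not a library API).
// Assumes facts were already extracted from the conversation by another step.
type Fact = { key: string; value: string };

class EntityMemory {
  private facts = new Map<string, string>();

  // Insert or overwrite facts; the newest value for a key wins.
  upsert(newFacts: Fact[]): void {
    for (const { key, value } of newFacts) {
      this.facts.set(key.trim().toLowerCase(), value.trim());
    }
  }

  // Case-insensitive lookup of a single fact.
  get(key: string): string | undefined {
    return this.facts.get(key.trim().toLowerCase());
  }

  // Render all facts as one compact system message for the next request.
  toSystemMessage(): { role: 'system'; content: string } {
    const lines: string[] = [];
    this.facts.forEach((value, key) => lines.push(`- ${key}: ${value}`));
    return { role: 'system', content: `Known facts about the user:\n${lines.join('\n')}` };
  }
}

const memory = new EntityMemory();
memory.upsert([{ key: 'Name', value: 'Ada' }, { key: 'plan', value: 'free' }]);
memory.upsert([{ key: 'plan', value: 'pro' }]); // a later turn overwrites the old value
```

Because only the rendered facts are sent with each request, the token cost depends on how many distinct facts are stored, not on how long the conversation has run.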
Takeaway
Start with a sliding window — it is the simplest reliable strategy. Add recursive summarization once conversations regularly exceed 20 turns. Use RAG for document QA rather than stuffing full documents into context. Position critical information at context boundaries to mitigate lost-in-the-middle degradation.
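The takeaway can be read as a simple decision rule. The sketch below encodes it. The `pickStrategy` name, the `Workload` shape, and the treatment of 20 turns as a hard cutoff are illustrative assumptions based on the guidance above, not a tuned policy.

```typescript
// Hypothetical decision helper that mirrors the takeaway; thresholds are illustrative.
type Strategy = 'sliding-window' | 'sliding-window+summarization' | 'rag';

interface Workload {
  kind: 'conversation' | 'document-qa';
  typicalTurns?: number; // how many turns conversations usually reach
}

function pickStrategy(w: Workload): Strategy {
  // Document QA: retrieve relevant chunks instead of loading whole documents
  if (w.kind === 'document-qa') return 'rag';
  // Conversations: add summarization only once they regularly exceed 20 turns
  return (w.typicalTurns ?? 0) > 20 ? 'sliding-window+summarization' : 'sliding-window';
}
```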