Context Window Management: Long Documents, Sliding Window, and LLM Memory Strategies

Context window limits are one of the most common sources of production failures in LLM applications. Stuffing too much into the context is expensive and degrades quality. This guide covers the strategies that keep applications fast, accurate, and cost-efficient.

Context window sizes (2026)

Model	Context window	Practical limit
GPT-4o	128k tokens	~80k (quality degrades past this)
GPT-4o-mini	128k tokens	~60k
Claude 3.5 Sonnet	200k tokens	~150k
Gemini 1.5 Pro	1M tokens	~500k (cost-limited)
Llama 3.1 405B	128k tokens	~80k

Rule of thumb: 1 token ≈ 0.75 English words. A 128k context holds roughly 96,000 words or ~300 pages.
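
The rule of thumb can be turned into a quick pre-check before running a real tokenizer. This is a minimal sketch, assuming the 0.75 words-per-token ratio holds for your text; it is only an approximation for English prose, and code or non-English text usually tokenizes less efficiently. The names `estimateTokens` and `estimateWordCapacity` are illustrative, not part of any library.

```typescript
// Rough estimate only: assumes ~0.75 English words per token (the rule of thumb above).
const WORDS_PER_TOKEN = 0.75;

// Estimate how many tokens a piece of text will use, based on its word count.
function estimateTokens(text: string): number {
  const trimmed = text.trim();
  if (trimmed === '') return 0;
  const words = trimmed.split(/\s+/).length;
  return Math.ceil(words / WORDS_PER_TOKEN);
}

// Estimate how many English words fit in a context window of the given size.
function estimateWordCapacity(contextTokens: number): number {
  return Math.floor(contextTokens * WORDS_PER_TOKEN);
}

console.log(estimateWordCapacity(128_000));                 // 96000
console.log(estimateTokens('Context windows are finite.')); // 6 (4 words / 0.75, rounded up)
```

Use an estimate like this only to reject obviously oversized inputs early; the tokenizer-based count is what should gate the actual request.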

Token counting before sending

import { encoding_for_model } from '@dqbd/tiktoken';

function countTokens(messages: { role: string; content: string }[], model = 'gpt-4o'): number {
  const enc = encoding_for_model(model as Parameters<typeof encoding_for_model>[0]);
  try {
    let total = 3; // every reply is primed with <|start|>assistant<|message|>
    for (const msg of messages) {
      total += 4; // approximate per-message framing overhead (slightly conservative)
      total += enc.encode(msg.content).length;
      total += enc.encode(msg.role).length;
    }
    return total;
  } finally {
    enc.free(); // release the encoder even if encode() throws
  }
}

const MODEL_LIMITS: Record<string, number> = {
  'gpt-4o':       128_000,
  'gpt-4o-mini':  128_000,
};

function isWithinLimit(messages: { role: string; content: string }[], model: string): boolean {
  const tokens = countTokens(messages, model);
  const limit  = MODEL_LIMITS[model] ?? 128_000;
  return tokens < limit * 0.85;  // use at most 85% of the window; the remaining 15% is headroom for the reply and counting error
}
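
The fixed percentage above does not say how much of the headroom is meant for the model's reply. One way to make that explicit is to reserve output tokens up front. This is a sketch under assumptions: `TokenBudget`, `reservedOutputTokens`, and `safetyPercent` are hypothetical names, and the numbers in the usage are examples rather than recommendations.

```typescript
interface TokenBudget {
  contextLimit: number;         // model's total context window, in tokens
  reservedOutputTokens: number; // room kept for the reply (for example, your max_tokens value)
  safetyPercent: number;        // percent of the window held back for counting error
}

// Maximum number of prompt tokens allowed under the budget.
function maxPromptTokens(budget: TokenBudget): number {
  const safety = Math.ceil((budget.contextLimit * budget.safetyPercent) / 100);
  return Math.max(0, budget.contextLimit - budget.reservedOutputTokens - safety);
}

// True when a prompt of the given size leaves room for the reply and the safety margin.
function fitsBudget(promptTokens: number, budget: TokenBudget): boolean {
  return promptTokens <= maxPromptTokens(budget);
}

const budget: TokenBudget = { contextLimit: 128_000, reservedOutputTokens: 4_000, safetyPercent: 5 };
console.log(maxPromptTokens(budget));     // 117600
console.log(fitsBudget(120_000, budget)); // false
```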

Strategy 1: Sliding window

Keep the system prompt and recent messages; drop the oldest when the limit is approached:

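type Message = { role: 'system' | 'user' | 'assistant'; content: string };

function slidingWindow(
  messages: Message[],
  systemPrompt: string,
  maxTokens = 80_000
): Message[] {
  const system: Message = { role: 'system', content: systemPrompt };
  const result = [system, ...messages];

  // Drop the oldest turns first, but always keep the system prompt and the
  // most recent message so the model still sees the current request.
  while (countTokens(result) > maxTokens && result.length > 2) {
    // Remove a full user + assistant pair when possible, otherwise a single message
    result.splice(1, Math.min(2, result.length - 2));
  }

  return result;
}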

// Use in conversation loop
const trimmedMessages = slidingWindow(conversationHistory, systemPrompt);
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: trimmedMessages,
});
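
The loop above recounts the whole history on every iteration. A variant, sketched here under assumptions, counts each message once and walks backward from the newest message until the budget is used up. `trimToBudget` and `countOne` are hypothetical names; `countOne` is a per-message counter you would back with a real tokenizer, and the word-based counter in the usage is only a stand-in for demonstration.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Keep the newest messages that fit in maxTokens, always keeping the system prompt.
// countOne is injected so any tokenizer can be plugged in.
function trimToBudget(
  system: ChatMessage,
  history: ChatMessage[],
  maxTokens: number,
  countOne: (m: ChatMessage) => number
): ChatMessage[] {
  let used = countOne(system);
  const kept: ChatMessage[] = [];

  // Walk from newest to oldest; stop at the first message that does not fit.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = countOne(history[i]);
    // Always keep the newest message, even if it alone exceeds the budget.
    if (kept.length > 0 && used + cost > maxTokens) break;
    used += cost;
    kept.unshift(history[i]);
  }
  return [system, ...kept];
}

// Stand-in counter for the demo: one "token" per whitespace-separated word.
const wordCount = (m: ChatMessage): number => m.content.split(/\s+/).filter(Boolean).length;

const demo = trimToBudget(
  { role: 'system', content: 'Be brief.' },
  [
    { role: 'user', content: 'one two three four' },
    { role: 'assistant', content: 'five six' },
    { role: 'user', content: 'seven eight nine' },
  ],
  8,
  wordCount
);
console.log(demo.map(m => m.content)); // ['Be brief.', 'five six', 'seven eight nine']
```

Note that trimming by budget can leave the history starting with an assistant message, as in the demo. If your provider expects a user message first, drop one more message after trimming.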

Strategy 2: Recursive summarization

Compress older turns into a summary, preserving key facts without the full verbatim history:

async function compressHistory(
  oldMessages: Message[],
  existingSummary = ''
): Promise<string> {
  const historyText = oldMessages
    .map(m => `${m.role.toUpperCase()}: ${m.content}`)
    .join('\n\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Update this running conversation summary with the new exchanges below.
Keep all key facts, decisions, and user preferences. Be concise.

Previous summary:
${existingSummary || 'None yet.'}

New exchanges:
${historyText}

Updated summary:`,
    }],
    max_tokens: 500,
  });

  // Fall back to the previous summary if the model returns no content
  return response.choices[0].message.content ?? existingSummary;
}

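// Compress once the history reaches N messages
const MAX_MESSAGES = 10;
if (messages.length >= MAX_MESSAGES) {
  // Drop the previous summary message first so it is not summarized twice;
  // its text is already passed in via `summary`.
  if (messages[0]?.role === 'system' && messages[0].content.startsWith('Conversation so far:')) {
    messages.shift();
  }
  const toCompress = messages.splice(0, 8);  // compress oldest 8 messages
  summary = await compressHistory(toCompress, summary);
  // Prepend the updated summary as a system message
  messages.unshift({ role: 'system', content: `Conversation so far: ${summary}` });
}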
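
The snippet above mutates the message array in place. If you prefer to keep the stored history untouched and build each request separately, the split and assembly steps can be written as pure functions. This is a sketch under assumptions: `splitForCompression`, `assembleContext`, and the `keepRecent` parameter are hypothetical names, and the summary text would come from a call such as `compressHistory`.

```typescript
type Turn = { role: 'system' | 'user' | 'assistant'; content: string };

// Split history into the part to summarize and the part to keep verbatim.
// keepRecent is a tuning knob: how many of the newest messages stay untouched.
function splitForCompression(history: Turn[], keepRecent: number): { toCompress: Turn[]; recent: Turn[] } {
  const cut = Math.max(0, history.length - keepRecent);
  return { toCompress: history.slice(0, cut), recent: history.slice(cut) };
}

// Assemble the next request: system prompt, then the summary (if any), then recent turns.
function assembleContext(systemPrompt: string, summary: string, recent: Turn[]): Turn[] {
  const context: Turn[] = [{ role: 'system', content: systemPrompt }];
  if (summary.trim() !== '') {
    context.push({ role: 'system', content: `Conversation so far: ${summary}` });
  }
  return [...context, ...recent];
}

const history: Turn[] = [
  { role: 'user', content: 'My name is Ada.' },
  { role: 'assistant', content: 'Nice to meet you, Ada.' },
  { role: 'user', content: 'I prefer short answers.' },
  { role: 'assistant', content: 'Understood.' },
  { role: 'user', content: 'What is a token?' },
];

const { toCompress, recent } = splitForCompression(history, 2);
console.log(toCompress.length, recent.length); // 3 2

const context = assembleContext('You are a helpful assistant.', 'User is Ada; wants short answers.', recent);
console.log(context.length); // 4
```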

Strategy 3: Map-reduce for long documents

Process document sections in parallel, then combine the results:

async function mapReduceSummarize(document: string, chunkSize = 3000): Promise<string> {
  const chunks = splitIntoChunks(document, chunkSize);

  // Map: summarize each chunk in parallel
  const chunkSummaries = await Promise.all(
    chunks.map(chunk =>
      openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{
          role: 'user',
          content: `Summarize this section concisely, preserving key facts:

${chunk}`,
        }],
        max_tokens: 300,
      }).then(r => r.choices[0].message.content!)
    )
  );

  // Reduce: combine summaries into a final answer
  const finalResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: `Based on these section summaries, provide a comprehensive final summary:

${chunkSummaries.map((s, i) => `Section ${i + 1}: ${s}`).join('\n\n')}`,
    }],
    max_tokens: 800,
  });

  return finalResponse.choices[0].message.content!;
}
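
`splitIntoChunks` is called above but not defined. One possible implementation is sketched below, assuming `chunkSize` is a character count (the code above does not say whether it means characters or tokens) and that paragraph boundaries are acceptable split points. Paragraphs longer than `chunkSize` are cut at a fixed width, which can split a sentence.

```typescript
// Split text into chunks of at most chunkSize characters, preferring paragraph breaks.
// Assumption: chunkSize is a character count, not a token count.
function splitIntoChunks(text: string, chunkSize: number): string[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  const chunks: string[] = [];
  let current = '';

  for (const para of text.split(/\n\s*\n/)) {
    const p = para.trim();
    if (p === '') continue;

    if (p.length > chunkSize) {
      // Paragraph too long on its own: flush what we have, then hard-split it.
      if (current !== '') { chunks.push(current); current = ''; }
      for (let i = 0; i < p.length; i += chunkSize) chunks.push(p.slice(i, i + chunkSize));
      continue;
    }

    // Try to append the paragraph to the chunk being built.
    const candidate = current === '' ? p : `${current}\n\n${p}`;
    if (candidate.length > chunkSize) {
      chunks.push(current);
      current = p;
    } else {
      current = candidate;
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}

console.log(splitIntoChunks('aaaa\n\nbbbb\n\ncccccccccccc', 10));
// ['aaaa\n\nbbbb', 'cccccccccc', 'cc']
```

With many chunks, the unbounded `Promise.all` in the map step can run into provider rate limits, so you may want to cap concurrency in production.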

Strategy 4: Selective context (RAG)

Instead of loading the entire document, retrieve only the most relevant chunks:

async function answerWithSelectiveContext(
  question: string,
  documentChunks: string[]
): Promise<string> {
  // Embed question and all chunks
  const [questionEmbed, ...chunkEmbeds] = await batchEmbed([question, ...documentChunks]);

  // Rank chunks by relevance
  const scored = chunkEmbeds
    .map((embed, i) => ({ chunk: documentChunks[i], score: cosineSimilarity(questionEmbed, embed) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5);  // top 5 chunks only

  const context = scored.map(s => s.chunk).join('\n\n---\n\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'Answer using only the provided context. Say "I don\'t know" if not found.' },
      { role: 'user', content: `Context:
${context}

Question: ${question}` },
    ],
  });

  return response.choices[0].message.content!;
}
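
`cosineSimilarity` and `batchEmbed` are used above but not shown. The similarity and ranking step needs no API call and can be sketched as follows; `batchEmbed` would wrap whichever embedding provider you use and is left out here. `topK` is an illustrative name, and the 2-D vectors in the usage are toy stand-ins for real embeddings.

```typescript
// Cosine similarity between two equal-length vectors; returns 0 if either is a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks whose embeddings are most similar to the query embedding.
function topK(query: number[], embeds: number[][], chunks: string[], k: number): string[] {
  return embeds
    .map((e, i) => ({ chunk: chunks[i], score: cosineSimilarity(query, e) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(s => s.chunk);
}

// Toy vectors: 'b' points the same way as the query, 'c' is at 45 degrees, 'a' is orthogonal.
console.log(topK([1, 0], [[0, 1], [1, 0], [1, 1]], ['a', 'b', 'c'], 2)); // ['b', 'c']
```

For a handful of chunks, scoring every embedding in memory like this is fine. For large corpora, embed the chunks once, store the vectors, and embed only the question per request.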

Strategy 5: Lost-in-the-middle mitigation

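Research on long-context retrieval (the "lost in the middle" effect described by Liu et al., 2023) shows that LLMs recall information at the start and end of the context more reliably than information in the middle. Position critical information at the boundaries: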

type ScoredChunk = { chunk: string; score: number };

// Place the most important context at the start or end, not the middle
function orderChunksForRecall(chunks: ScoredChunk[]): string[] {
  const sorted = [...chunks].sort((a, b) => b.score - a.score);

  // Deal chunks alternately to the front and the back, best first,
  // so relevance falls off toward the middle of the context.
  const front: string[] = [];
  const back: string[] = [];
  sorted.forEach((c, i) => (i % 2 === 0 ? front : back).push(c.chunk));

  // Ranks 0, 2, 4, ... then ..., 3, 1: the weakest chunks end up in the middle
  return [...front, ...back.reverse()];
}

Conversation memory architecture

Strategy	Token cost	Information loss	Latency
Full history	High (grows unbounded)	None	Low
Sliding window	Fixed	Old turns lost	Low
Summarization	Medium	Minor	Medium (+1 LLM call)
RAG memory	Low (selective)	Low (semantic recall)	Medium (+embed+search)
Entity extraction	Very low	Low (structured facts)	Low (key-value lookup)
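
Entity extraction is the only row above without an example. The idea is to store durable facts as key-value pairs and inject only those, rather than the conversation text. Below is a minimal sketch of the storage and rendering side. How the facts are extracted (typically another LLM call that returns structured output) is out of scope here, and `EntityMemory` and its methods are hypothetical names.

```typescript
// Minimal key-value memory for facts extracted from a conversation.
class EntityMemory {
  private facts = new Map<string, string>();

  // Keys are normalized; later values overwrite earlier ones so the latest fact wins.
  remember(key: string, value: string): void {
    this.facts.set(key.trim().toLowerCase(), value.trim());
  }

  recall(key: string): string | undefined {
    return this.facts.get(key.trim().toLowerCase());
  }

  // Render the facts as a compact block suitable for a system message.
  toPrompt(): string {
    if (this.facts.size === 0) return '';
    const lines = Array.from(this.facts.entries()).map(([k, v]) => `- ${k}: ${v}`);
    return `Known facts about the user:\n${lines.join('\n')}`;
  }
}

const memory = new EntityMemory();
memory.remember('Name', 'Ada');
memory.remember('Preferred language', 'TypeScript');
memory.remember('name', 'Ada L.'); // overwrites the earlier value for "name"

console.log(memory.toPrompt());
// Known facts about the user:
// - name: Ada L.
// - preferred language: TypeScript
```

The token cost stays small because the rendered block grows with the number of distinct facts, not with the length of the conversation.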

Takeaway

Start with a sliding window — it is the simplest reliable strategy. Add recursive summarization once conversations regularly exceed 20 turns. Use RAG for document QA rather than stuffing full documents into context. Position critical information at context boundaries to mitigate lost-in-the-middle degradation.