Artificial IntelligenceAnthropicClaude AI

Context Management & Reliability: Claude Architect Exam Domain 5

Lost-in-the-middle effects, escalation triggers, error propagation, and confidence calibration — Domain 5 of the Claude Certified Architect exam.

TT
Emily Ross
9 min read
Context Management & Reliability: Claude Architect Exam Domain 5

Lesson 6 of the Claude Certified Architect – Foundations course. Domain 5 is 15% of the exam (~9 questions) but punches above its weight: its patterns appear inside questions formally scored under other domains. This is the domain of systems that stay correct over time — long conversations, failures, human handoffs, and multi-source truth.

Previous: Lesson 5 — Structured Output

Back to Course Overview


Preserving Critical Information in Long Conversations

Three degradation mechanisms to recognize in scenarios:

  1. Progressive summarization loss. Summaries silently blur exact numbers, percentages, dates, and customer-stated expectations. Fix: extract transactional facts (amounts, order numbers, statuses) into a persistent "case facts" block included in every prompt, outside the summarized history. For multi-issue sessions, persist structured issue data as a separate context layer.
  2. Lost in the middle. Models reliably process the beginning and end of long inputs; middle sections get dropped. Fix: put key findings summaries first, organize detail under explicit section headers.
  3. Tool-result bloat. An order lookup returns 40+ fields when 5 matter. Fix: trim tool outputs to relevant fields before they accumulate — and when a downstream agent has a tight context budget, modify upstream agents to emit structured key facts, citations, and relevance scores instead of verbose prose.

Also remember: multi-turn coherence requires passing complete conversation history on each API call — Claude has no server-side memory of prior requests.


Escalation: Triggers That Work and Proxies That Don't

Legitimate escalation triggers:

  • Customer explicitly requests a human → honor it immediately; don't investigate first. (One nuance: if the issue is clearly within capability, acknowledge frustration and offer to resolve — escalate if they reiterate.)
  • Policy gaps or exceptions — the request falls outside what policy addresses (e.g., competitor price-matching when policy only covers own-site adjustments)
  • Inability to make meaningful progress

Unreliable proxies (recurring distractors):

  • Sentiment-based escalation — frustration doesn't correlate with case complexity
  • Self-reported confidence scores — poorly calibrated; agents are confidently wrong on exactly the hard cases
  • Heuristic identity selection — multiple customer matches means ask for another identifier, never pick the closest match

The standard fix for miscalibrated escalation: explicit criteria in the system prompt plus few-shot examples demonstrating escalate-vs-resolve decisions.


Error Propagation Across Agents

The pattern from Lesson 3, now at system level. When a subagent fails, the coordinator can only recover intelligently if it receives structured error context: failure type, attempted query, partial results, and viable alternatives.

The three anti-patterns, all tested:

Anti-patternConsequence
Generic status ("search unavailable") after silent internal retriesCoordinator can't choose among retry / rephrase / proceed-partial
Returning empty results marked successfulFailure disguised as truth; research silently incomplete
Terminating the whole workflow on one failureDiscards recoverable work

Correct division of labor: subagents handle local recovery for transient errors; they propagate only what they can't fix, with partials and attempts. Synthesis output should carry coverage annotations — which findings are well-supported, which areas have gaps from unavailable sources.


Context in Large Codebase Exploration

Signs of context degradation in long sessions: inconsistent answers, and the model citing "typical patterns" instead of the specific classes it discovered an hour ago. Countermeasures:

  • Scratchpad files — agents record key findings to disk and reference them, surviving context boundaries
  • Subagent delegation — verbose exploration happens in an isolated context; the main agent keeps high-level coordination
  • Phase summarization — summarize findings before each new phase and inject the summary into fresh context
  • /compact — reduce context mid-session when discovery output piles up
  • Crash recovery manifests — each agent exports structured state to a known location; on resume, the coordinator loads the manifest and injects it into agent prompts

Human Review and Confidence Calibration

The trap: aggregate accuracy hides segment failures. 97% overall can coexist with 60% on one document type. Before reducing human review:

  • Analyze accuracy by document type and by field, not just overall
  • Calibrate field-level confidence scores against labeled validation sets — raw model confidence is not calibrated
  • Stratified random sampling of high-confidence extractions — measures the error rate automation would ship and catches novel error patterns
  • Route by risk: low confidence or ambiguous/contradictory sources → human review, spending scarce reviewer capacity where it matters

Provenance and Uncertainty in Multi-Source Synthesis

  • Claim–source mappings (URL, document name, excerpt) must be emitted by subagents and preserved through synthesis — attribution dies in summarization otherwise
  • Conflicting statistics from credible sources: annotate the conflict with attribution; never arbitrarily pick one. The analysis agent completes its work with both values flagged; the coordinator decides reconciliation
  • Temporal data: require publication/collection dates in structured outputs, so a 2023 figure vs a 2026 figure reads as a time series, not a contradiction
  • Report structure: separate well-established findings from contested ones, keeping original source characterizations and methodological context
  • Render by content type — financial data as tables, news as prose, technical findings as structured lists

Hands-On Exercise

  1. Run a 30-turn support conversation with progressive summarization; verify a refund amount survives verbatim via a case-facts block.
  2. Simulate a subagent timeout; confirm the coordinator receives failure type + partial results and annotates the final report's coverage gaps.
  3. Give an extractor two source documents with conflicting revenue figures and different publication dates; verify both values, both sources, and both dates survive to the final output.

Worked Exam Question

Your document-extraction service reports 97.2% accuracy overall. Leadership approves removing human review for all "high-confidence" extractions. Two months later, a compliance audit finds that lease agreements — 4% of volume — have a 22% error rate on the renewal_date field, all shipped without review. Which practice would have caught this before launch?

  • A. Segmenting accuracy by document type and field before automating, then keeping stratified random sampling of high-confidence extractions as an ongoing check.
  • B. Raising the confidence threshold for skipping review from 90% to 95%.
  • C. Fine-tuning the extraction model on more lease agreements.
  • D. Requiring two independent model passes to agree before skipping review.

Answer: A. The aggregate hid the segment: 97.2% overall coexisted with 78% on one document type's critical field. Segment-level validation catches it before launch; stratified sampling of the "safe" bucket catches drift after. Option B tweaks a threshold that was never the problem (the model was confidently wrong on leases), C jumps to remediation before diagnosis, and D doubles cost while two passes can share the same blind spot.


Reliability Design Checklist

A production readiness list distilled from every Domain 5 task statement — useful both for the exam and for shipping real systems:

  • Facts: exact amounts, dates, IDs, and statuses live in a persistent case-facts block, never only in summarized history
  • Position: key findings first, detailed results under explicit section headers, verbose tool outputs trimmed before they accumulate
  • Escalation: triggers are explicit criteria with few-shot examples; explicit human requests honored immediately; ambiguity resolved by asking, never by heuristics
  • Errors: subagents recover locally from transient failures; everything propagated upward carries failure type, attempted action, and partial results; empty results are never disguised as errors nor errors as empty results
  • Long sessions: scratchpad files persist findings; exploration is delegated to subagents; state manifests enable crash recovery
  • Review: accuracy validated per document type and field; confidence calibrated on labeled sets; the automated bucket sampled continuously
  • Provenance: claim-source mappings, publication dates, and conflict annotations survive every summarization and synthesis step

When an exam scenario describes a reliability failure, it is almost always one unchecked box from this list — name the box and the correct option names itself.


Key Takeaways for the Exam

  • Case-facts blocks preserve exact values; summaries don't. Key findings go first; trim tool bloat early.
  • Escalate on explicit requests, policy gaps, and no-progress — never on sentiment or self-confidence.
  • Structured error context in; silent suppression and full-workflow termination out.
  • Validate accuracy per segment and calibrate confidence before automating; stratified-sample the "safe" bucket.
  • Provenance, conflicts, and dates survive synthesis by design, not by luck.

Next: Lesson 7 — Exam Guide & Preparation Strategy


Frequently Asked Questions

What is a case-facts block and why use one?

It is a persistent, structured block of exact transactional facts — refund amounts, order numbers, dates, statuses — included in every prompt outside the summarized conversation history. Progressive summarization is a lossy compression that characteristically blurs precise values ("$52.40" becomes "around $50"), and in support or financial contexts that drift becomes a wrong action. The case-facts block gives critical values a compression-proof home while the rest of the history summarizes freely.

What are valid escalation triggers for a support agent?

Three hold up in production: the customer explicitly asks for a human (honor it immediately, without investigating first), the request falls into a policy gap or exception the agent has no authority to resolve, and the agent cannot make meaningful progress. The two seductive-but-wrong proxies are sentiment (frustrated customers often have simple problems) and the model's self-reported confidence (uncalibrated, and most wrong on exactly the hard cases). Fix miscalibrated escalation with explicit criteria plus few-shot examples, not with score thresholds.

What is the lost-in-the-middle effect?

Models process the beginning and end of long inputs more reliably than the middle — findings buried mid-document get skipped or blurred. Mitigate it structurally: lead with a key-findings summary, organize detail under explicit section headers, and require upstream agents to emit structured key facts rather than long prose. The effect is also why "use a bigger context window" is a recurring wrong answer: capacity does not fix attention distribution.