Context Management & Reliability: Claude Architect Exam Domain 5
Lost-in-the-middle effects, escalation triggers, error propagation, and confidence calibration — Domain 5 of the Claude Certified Architect exam.

Lesson 6 of the Claude Certified Architect – Foundations course. Domain 5 is 15% of the exam (~9 questions) but punches above its weight: its patterns appear inside questions formally scored under other domains. This is the domain of systems that stay correct over time — long conversations, failures, human handoffs, and multi-source truth.
Previous: Lesson 5 — Structured Output
Preserving Critical Information in Long Conversations
Three degradation mechanisms to recognize in scenarios:
- Progressive summarization loss. Summaries silently blur exact numbers, percentages, dates, and customer-stated expectations. Fix: extract transactional facts (amounts, order numbers, statuses) into a persistent "case facts" block included in every prompt, outside the summarized history. For multi-issue sessions, persist structured issue data as a separate context layer.
- Lost in the middle. Models reliably process the beginning and end of long inputs; middle sections get dropped. Fix: put key findings summaries first, organize detail under explicit section headers.
- Tool-result bloat. An order lookup returns 40+ fields when 5 matter. Fix: trim tool outputs to relevant fields before they accumulate — and when a downstream agent has a tight context budget, modify upstream agents to emit structured key facts, citations, and relevance scores instead of verbose prose.
Also remember: multi-turn coherence requires passing complete conversation history on each API call — Claude has no server-side memory of prior requests.
Escalation: Triggers That Work and Proxies That Don't
Legitimate escalation triggers:
- Customer explicitly requests a human → honor it immediately; don't investigate first. (One nuance: if the issue is clearly within capability, acknowledge frustration and offer to resolve — escalate if they reiterate.)
- Policy gaps or exceptions — the request falls outside what policy addresses (e.g., competitor price-matching when policy only covers own-site adjustments)
- Inability to make meaningful progress
Unreliable proxies (recurring distractors):
- Sentiment-based escalation — frustration doesn't correlate with case complexity
- Self-reported confidence scores — poorly calibrated; agents are confidently wrong on exactly the hard cases
- Heuristic identity selection — multiple customer matches means ask for another identifier, never pick the closest match
The standard fix for miscalibrated escalation: explicit criteria in the system prompt plus few-shot examples demonstrating escalate-vs-resolve decisions.
Error Propagation Across Agents
The pattern from Lesson 3, now at system level. When a subagent fails, the coordinator can only recover intelligently if it receives structured error context: failure type, attempted query, partial results, and viable alternatives.
The three anti-patterns, all tested:
| Anti-pattern | Consequence |
|---|---|
| Generic status ("search unavailable") after silent internal retries | Coordinator can't choose among retry / rephrase / proceed-partial |
| Returning empty results marked successful | Failure disguised as truth; research silently incomplete |
| Terminating the whole workflow on one failure | Discards recoverable work |
Correct division of labor: subagents handle local recovery for transient errors; they propagate only what they can't fix, with partials and attempts. Synthesis output should carry coverage annotations — which findings are well-supported, which areas have gaps from unavailable sources.
Context in Large Codebase Exploration
Signs of context degradation in long sessions: inconsistent answers, and the model citing "typical patterns" instead of the specific classes it discovered an hour ago. Countermeasures:
- Scratchpad files — agents record key findings to disk and reference them, surviving context boundaries
- Subagent delegation — verbose exploration happens in an isolated context; the main agent keeps high-level coordination
- Phase summarization — summarize findings before each new phase and inject the summary into fresh context
/compact— reduce context mid-session when discovery output piles up- Crash recovery manifests — each agent exports structured state to a known location; on resume, the coordinator loads the manifest and injects it into agent prompts
Human Review and Confidence Calibration
The trap: aggregate accuracy hides segment failures. 97% overall can coexist with 60% on one document type. Before reducing human review:
- Analyze accuracy by document type and by field, not just overall
- Calibrate field-level confidence scores against labeled validation sets — raw model confidence is not calibrated
- Stratified random sampling of high-confidence extractions — measures the error rate automation would ship and catches novel error patterns
- Route by risk: low confidence or ambiguous/contradictory sources → human review, spending scarce reviewer capacity where it matters
Provenance and Uncertainty in Multi-Source Synthesis
- Claim–source mappings (URL, document name, excerpt) must be emitted by subagents and preserved through synthesis — attribution dies in summarization otherwise
- Conflicting statistics from credible sources: annotate the conflict with attribution; never arbitrarily pick one. The analysis agent completes its work with both values flagged; the coordinator decides reconciliation
- Temporal data: require publication/collection dates in structured outputs, so a 2023 figure vs a 2026 figure reads as a time series, not a contradiction
- Report structure: separate well-established findings from contested ones, keeping original source characterizations and methodological context
- Render by content type — financial data as tables, news as prose, technical findings as structured lists
Hands-On Exercise
- Run a 30-turn support conversation with progressive summarization; verify a refund amount survives verbatim via a case-facts block.
- Simulate a subagent timeout; confirm the coordinator receives failure type + partial results and annotates the final report's coverage gaps.
- Give an extractor two source documents with conflicting revenue figures and different publication dates; verify both values, both sources, and both dates survive to the final output.
Worked Exam Question
Your document-extraction service reports 97.2% accuracy overall. Leadership approves removing human review for all "high-confidence" extractions. Two months later, a compliance audit finds that lease agreements — 4% of volume — have a 22% error rate on the renewal_date field, all shipped without review. Which practice would have caught this before launch?
- A. Segmenting accuracy by document type and field before automating, then keeping stratified random sampling of high-confidence extractions as an ongoing check.
- B. Raising the confidence threshold for skipping review from 90% to 95%.
- C. Fine-tuning the extraction model on more lease agreements.
- D. Requiring two independent model passes to agree before skipping review.
Answer: A. The aggregate hid the segment: 97.2% overall coexisted with 78% on one document type's critical field. Segment-level validation catches it before launch; stratified sampling of the "safe" bucket catches drift after. Option B tweaks a threshold that was never the problem (the model was confidently wrong on leases), C jumps to remediation before diagnosis, and D doubles cost while two passes can share the same blind spot.
Reliability Design Checklist
A production readiness list distilled from every Domain 5 task statement — useful both for the exam and for shipping real systems:
- Facts: exact amounts, dates, IDs, and statuses live in a persistent case-facts block, never only in summarized history
- Position: key findings first, detailed results under explicit section headers, verbose tool outputs trimmed before they accumulate
- Escalation: triggers are explicit criteria with few-shot examples; explicit human requests honored immediately; ambiguity resolved by asking, never by heuristics
- Errors: subagents recover locally from transient failures; everything propagated upward carries failure type, attempted action, and partial results; empty results are never disguised as errors nor errors as empty results
- Long sessions: scratchpad files persist findings; exploration is delegated to subagents; state manifests enable crash recovery
- Review: accuracy validated per document type and field; confidence calibrated on labeled sets; the automated bucket sampled continuously
- Provenance: claim-source mappings, publication dates, and conflict annotations survive every summarization and synthesis step
When an exam scenario describes a reliability failure, it is almost always one unchecked box from this list — name the box and the correct option names itself.
Key Takeaways for the Exam
- Case-facts blocks preserve exact values; summaries don't. Key findings go first; trim tool bloat early.
- Escalate on explicit requests, policy gaps, and no-progress — never on sentiment or self-confidence.
- Structured error context in; silent suppression and full-workflow termination out.
- Validate accuracy per segment and calibrate confidence before automating; stratified-sample the "safe" bucket.
- Provenance, conflicts, and dates survive synthesis by design, not by luck.
Next: Lesson 7 — Exam Guide & Preparation Strategy
Frequently Asked Questions
What is a case-facts block and why use one?
It is a persistent, structured block of exact transactional facts — refund amounts, order numbers, dates, statuses — included in every prompt outside the summarized conversation history. Progressive summarization is a lossy compression that characteristically blurs precise values ("$52.40" becomes "around $50"), and in support or financial contexts that drift becomes a wrong action. The case-facts block gives critical values a compression-proof home while the rest of the history summarizes freely.
What are valid escalation triggers for a support agent?
Three hold up in production: the customer explicitly asks for a human (honor it immediately, without investigating first), the request falls into a policy gap or exception the agent has no authority to resolve, and the agent cannot make meaningful progress. The two seductive-but-wrong proxies are sentiment (frustrated customers often have simple problems) and the model's self-reported confidence (uncalibrated, and most wrong on exactly the hard cases). Fix miscalibrated escalation with explicit criteria plus few-shot examples, not with score thresholds.
What is the lost-in-the-middle effect?
Models process the beginning and end of long inputs more reliably than the middle — findings buried mid-document get skipped or blurred. Mitigate it structurally: lead with a key-findings summary, organize detail under explicit section headers, and require upstream agents to emit structured key facts rather than long prose. The effect is also why "use a bigger context window" is a recurring wrong answer: capacity does not fix attention distribution.
