Context Management & Reliability: Claude Architect Exam Domain 5

Q: What is a case-facts block and why use one?

A persistent structured block of exact transactional facts - amounts, dates, order numbers, statuses - included in every prompt outside the summarized history. It protects precise values that progressive summarization would otherwise blur into vague phrases.

Q: What are valid escalation triggers for a support agent?

Explicit customer requests for a human (honor immediately), policy gaps or exceptions the agent cannot resolve, and inability to make meaningful progress. Sentiment scores and self-reported confidence are unreliable proxies and should not drive escalation.

Q: What is the lost-in-the-middle effect?

Models reliably process information at the beginning and end of long inputs but may omit findings from middle sections. Mitigate it by placing key findings summaries first and organizing detail under explicit section headers.

← Back to Claude Certified Architect Course

Lesson 6 of the Claude Certified Architect – Foundations course. Domain 5 is 15% of the exam (~9 questions) but punches above its weight: its patterns appear inside questions formally scored under other domains. This is the domain of systems that stay correct over time — long conversations, failures, human handoffs, and multi-source truth.

Previous: Lesson 5 — Structured Output

Back to Course Overview

Preserving Critical Information in Long Conversations

Three degradation mechanisms to recognize in scenarios:

Progressive summarization loss. Summaries silently blur exact numbers, percentages, dates, and customer-stated expectations. Fix: extract transactional facts (amounts, order numbers, statuses) into a persistent "case facts" block included in every prompt, outside the summarized history. For multi-issue sessions, persist structured issue data as a separate context layer.
Lost in the middle. Models reliably process the beginning and end of long inputs; middle sections get dropped. Fix: put key findings summaries first, organize detail under explicit section headers.
Tool-result bloat. An order lookup returns 40+ fields when 5 matter. Fix: trim tool outputs to relevant fields before they accumulate — and when a downstream agent has a tight context budget, modify upstream agents to emit structured key facts, citations, and relevance scores instead of verbose prose.

Also remember: multi-turn coherence requires passing complete conversation history on each API call — Claude has no server-side memory of prior requests.

Escalation: Triggers That Work and Proxies That Don't

Legitimate escalation triggers:

Customer explicitly requests a human → honor it immediately; don't investigate first. (One nuance: if the issue is clearly within capability, acknowledge frustration and offer to resolve — escalate if they reiterate.)
Policy gaps or exceptions — the request falls outside what policy addresses (e.g., competitor price-matching when policy only covers own-site adjustments)
Inability to make meaningful progress

Unreliable proxies (recurring distractors):

Sentiment-based escalation — frustration doesn't correlate with case complexity
Self-reported confidence scores — poorly calibrated; agents are confidently wrong on exactly the hard cases
Heuristic identity selection — multiple customer matches means ask for another identifier, never pick the closest match

The standard fix for miscalibrated escalation: explicit criteria in the system prompt plus few-shot examples demonstrating escalate-vs-resolve decisions.

Error Propagation Across Agents

The pattern from Lesson 3, now at system level. When a subagent fails, the coordinator can only recover intelligently if it receives structured error context: failure type, attempted query, partial results, and viable alternatives.

The three anti-patterns, all tested:

Anti-pattern	Consequence
Generic status ("search unavailable") after silent internal retries	Coordinator can't choose among retry / rephrase / proceed-partial
Returning empty results marked successful	Failure disguised as truth; research silently incomplete
Terminating the whole workflow on one failure	Discards recoverable work

Correct division of labor: subagents handle local recovery for transient errors; they propagate only what they can't fix, with partials and attempts. Synthesis output should carry coverage annotations — which findings are well-supported, which areas have gaps from unavailable sources.

Context in Large Codebase Exploration

Signs of context degradation in long sessions: inconsistent answers, and the model citing "typical patterns" instead of the specific classes it discovered an hour ago. Countermeasures:

Scratchpad files — agents record key findings to disk and reference them, surviving context boundaries
Subagent delegation — verbose exploration happens in an isolated context; the main agent keeps high-level coordination
Phase summarization — summarize findings before each new phase and inject the summary into fresh context
/compact — reduce context mid-session when discovery output piles up
Crash recovery manifests — each agent exports structured state to a known location; on resume, the coordinator loads the manifest and injects it into agent prompts

Human Review and Confidence Calibration

The trap: aggregate accuracy hides segment failures. 97% overall can coexist with 60% on one document type. Before reducing human review:

Analyze accuracy by document type and by field, not just overall
Calibrate field-level confidence scores against labeled validation sets — raw model confidence is not calibrated
Stratified random sampling of high-confidence extractions — measures the error rate automation would ship and catches novel error patterns
Route by risk: low confidence or ambiguous/contradictory sources → human review, spending scarce reviewer capacity where it matters

Provenance and Uncertainty in Multi-Source Synthesis

Claim–source mappings (URL, document name, excerpt) must be emitted by subagents and preserved through synthesis — attribution dies in summarization otherwise
Conflicting statistics from credible sources: annotate the conflict with attribution; never arbitrarily pick one. The analysis agent completes its work with both values flagged; the coordinator decides reconciliation
Temporal data: require publication/collection dates in structured outputs, so a 2023 figure vs a 2026 figure reads as a time series, not a contradiction
Report structure: separate well-established findings from contested ones, keeping original source characterizations and methodological context
Render by content type — financial data as tables, news as prose, technical findings as structured lists

Hands-On Exercise

Run a 30-turn support conversation with progressive summarization; verify a refund amount survives verbatim via a case-facts block.
Simulate a subagent timeout; confirm the coordinator receives failure type + partial results and annotates the final report's coverage gaps.
Give an extractor two source documents with conflicting revenue figures and different publication dates; verify both values, both sources, and both dates survive to the final output.

Worked Exam Question

Your document-extraction service reports 97.2% accuracy overall. Leadership approves removing human review for all "high-confidence" extractions. Two months later, a compliance audit finds that lease agreements — 4% of volume — have a 22% error rate on the renewal_date field, all shipped without review. Which practice would have caught this before launch?

A. Segmenting accuracy by document type and field before automating, then keeping stratified random sampling of high-confidence extractions as an ongoing check.
B. Raising the confidence threshold for skipping review from 90% to 95%.
C. Fine-tuning the extraction model on more lease agreements.
D. Requiring two independent model passes to agree before skipping review.

Answer: A. The aggregate hid the segment: 97.2% overall coexisted with 78% on one document type's critical field. Segment-level validation catches it before launch; stratified sampling of the "safe" bucket catches drift after. Option B tweaks a threshold that was never the problem (the model was confidently wrong on leases), C jumps to remediation before diagnosis, and D doubles cost while two passes can share the same blind spot.

Reliability Design Checklist

A production readiness list distilled from every Domain 5 task statement — useful both for the exam and for shipping real systems:

Facts: exact amounts, dates, IDs, and statuses live in a persistent case-facts block, never only in summarized history
Position: key findings first, detailed results under explicit section headers, verbose tool outputs trimmed before they accumulate
Escalation: triggers are explicit criteria with few-shot examples; explicit human requests honored immediately; ambiguity resolved by asking, never by heuristics
Errors: subagents recover locally from transient failures; everything propagated upward carries failure type, attempted action, and partial results; empty results are never disguised as errors nor errors as empty results
Long sessions: scratchpad files persist findings; exploration is delegated to subagents; state manifests enable crash recovery
Review: accuracy validated per document type and field; confidence calibrated on labeled sets; the automated bucket sampled continuously
Provenance: claim-source mappings, publication dates, and conflict annotations survive every summarization and synthesis step

When an exam scenario describes a reliability failure, it is almost always one unchecked box from this list — name the box and the correct option names itself.

Key Takeaways for the Exam

Case-facts blocks preserve exact values; summaries don't. Key findings go first; trim tool bloat early.
Escalate on explicit requests, policy gaps, and no-progress — never on sentiment or self-confidence.
Structured error context in; silent suppression and full-workflow termination out.
Validate accuracy per segment and calibrate confidence before automating; stratified-sample the "safe" bucket.
Provenance, conflicts, and dates survive synthesis by design, not by luck.

Next: Lesson 7 — Exam Guide & Preparation Strategy

Frequently Asked Questions

What is a case-facts block and why use one?

It is a persistent, structured block of exact transactional facts — refund amounts, order numbers, dates, statuses — included in every prompt outside the summarized conversation history. Progressive summarization is a lossy compression that characteristically blurs precise values ("$52.40" becomes "around $50"), and in support or financial contexts that drift becomes a wrong action. The case-facts block gives critical values a compression-proof home while the rest of the history summarizes freely.

What are valid escalation triggers for a support agent?

Three hold up in production: the customer explicitly asks for a human (honor it immediately, without investigating first), the request falls into a policy gap or exception the agent has no authority to resolve, and the agent cannot make meaningful progress. The two seductive-but-wrong proxies are sentiment (frustrated customers often have simple problems) and the model's self-reported confidence (uncalibrated, and most wrong on exactly the hard cases). Fix miscalibrated escalation with explicit criteria plus few-shot examples, not with score thresholds.

What is the lost-in-the-middle effect?

Models process the beginning and end of long inputs more reliably than the middle — findings buried mid-document get skipped or blurred. Mitigate it structurally: lead with a key-findings summary, organize detail under explicit section headers, and require upstream agents to emit structured key facts rather than long prose. The effect is also why "use a bigger context window" is a recurring wrong answer: capacity does not fix attention distribution.