Prompt Engineering & Structured Output: Claude Architect Exam Domain 4
Explicit criteria, few-shot prompting, tool_use JSON schemas, retry loops, and the Batches API — Domain 4 of the Claude Certified Architect exam.

Lesson 5 of the Claude Certified Architect – Foundations course. Domain 4 is 20% of the exam (~12 questions), anchored by the Structured Data Extraction and CI scenarios. The through-line: making Claude's output reliable enough to feed a downstream system — precise criteria in, schema-guaranteed data out, with validation loops for everything in between.
Previous: Lesson 4 — Claude Code Workflows
Explicit Criteria Beat Vague Instructions
The precision problem: automated reviews that flag too much noise destroy developer trust — a few bad categories undermine confidence in the accurate ones.
What doesn't work: "be conservative," "only report high-confidence findings." Confidence-based filtering fails because the model's stated confidence isn't calibrated to correctness.
What works:
- Specific categorical criteria — "flag comments only when claimed behavior contradicts actual code behavior," not "check comments are accurate"
- Define what to report vs skip — bugs and security issues in; minor style and local patterns out
- Severity levels with concrete code examples for each level — the only route to consistent classification
- Temporarily disable high false-positive categories while you improve their prompts — restoring trust beats completeness
Few-Shot Prompting, Done Deliberately
Few-shot examples are the most effective technique when detailed instructions alone produce inconsistent output. Exam-grade usage:
- 2–4 targeted examples for ambiguous scenarios, each showing why one action beat the plausible alternative — this is what lets the model generalize judgment to novel cases rather than pattern-match
- Format demonstrations (location, issue, severity, suggested fix) for output consistency
- Positive/negative pairs distinguishing acceptable code patterns from genuine issues, cutting false positives
- Structural variety examples for extraction — inline citations vs bibliographies, narrative vs tabular data — the standard fix for empty/null extractions on required fields and for hallucination on informal formats
Structured Output via tool_use
The most reliable way to get schema-compliant output: define an extraction tool whose input schema is your output schema, and read the data from the tool_use block. This eliminates JSON syntax errors — but, critically for the exam, not semantic errors: line items that don't sum to the stated total, or values landing in the wrong field, still happen.
tool_choice recap in extraction terms:
"auto"— model may reply with text instead; no guarantee"any"— model must call some tool; right when multiple extraction schemas exist and document type is unknown- forced
{"type": "tool", "name": "extract_metadata"}— guarantees a specific extraction runs first; subsequent steps happen in follow-up turns
Schema Design That Prevents Hallucination
- Make fields optional/nullable when documents may lack them. A required field on absent information forces the model to fabricate a value to satisfy the schema. This is the single most-tested schema-design fact.
- Enums with escape hatches: add
"unclear"for ambiguous cases, and"other"plus a free-text detail field for extensible categories. - Normalization rules live in the prompt alongside the strict schema, to handle inconsistent source formatting.
- Self-checking fields: extract
calculated_totalalongsidestated_totaland aconflict_detectedboolean — semantic validation designed into the schema itself.
Validation, Retry, and Feedback Loops
Retry-with-error-feedback: on validation failure, send a follow-up containing the original document, the failed extraction, and the specific validation errors. The model self-corrects format and structural problems well.
Know when retry is futile: if the information simply isn't in the source (it lives in an external document you didn't provide), no number of retries will conjure it. Format mismatch → retry works; absent information → retry wastes money. The exam tests this distinction directly.
Feedback instrumentation: add a detected_pattern field to findings so that when developers dismiss them, you can analyze which constructs trigger false positives systematically.
The Message Batches API
Facts to memorize:
| Property | Value |
|---|---|
| Cost | 50% savings vs synchronous |
| Latency | Up to 24 hours, no SLA |
| Correlation | custom_id per request/response pair |
| Limitation | No multi-turn tool calling within a request |
The decision rule: batch for non-blocking, latency-tolerant work (overnight tech-debt reports, weekly audits, nightly test generation); synchronous for anything a human is waiting on (pre-merge checks). "Batches are often faster than 24h" is never an acceptable basis for a blocking workflow.
Operational patterns: calculate submission frequency against your SLA (a 30-hour SLA with 24-hour processing means submitting at least every ~4 hours); resubmit only failed documents by custom_id, with fixes (e.g., chunking oversized ones); and refine prompts on a sample before burning a 10,000-document batch on a first draft.
Multi-Instance and Multi-Pass Review
Two architecture facts that surface across scenarios:
Self-review is structurally weak. A model reviewing code in the same session that generated it retains its generation reasoning and is unlikely to question its own decisions. An independent instance without that context catches subtle issues that self-review instructions and extended thinking miss.
Multi-pass beats single-pass for large reviews. A 14-file single pass produces attention dilution: uneven depth, missed bugs, contradictory findings on identical code. Restructure into per-file local passes + a cross-file integration pass. Bigger context windows do not fix attention quality — a named distractor.
Hands-On Exercise
- Build an extraction tool for invoices: required fields, nullable fields, an enum with
"other"+ detail. Feed it documents missing fields; verify nulls, not fabrications. - Break the schema deliberately and implement retry-with-error-feedback; log which errors retries fix.
- Add few-shot examples for two document layouts; measure extraction accuracy before/after.
- Submit 20 documents through the Batches API; simulate two failures and resubmit only those by
custom_id.
A Complete Extraction Tool, End to End
Exam questions assume you have seen a real extraction schema. Here is a compact but production-shaped one for invoices, annotated with the design decisions Domain 4 tests:
{
"name": "extract_invoice",
"description": "Extract structured data from a single invoice document.",
"input_schema": {
"type": "object",
"properties": {
"invoice_number": { "type": "string" },
"vendor_name": { "type": "string" },
"vendor_tax_id": { "type": ["string", "null"] },
"currency": { "type": "string", "enum": ["USD", "EUR", "GBP", "other"] },
"currency_other_detail": { "type": ["string", "null"] },
"line_items": { "type": "array", "items": { "type": "object",
"properties": { "description": {"type": "string"}, "amount": {"type": "number"} },
"required": ["description", "amount"] } },
"stated_total": { "type": "number" },
"calculated_total": { "type": "number" },
"conflict_detected":{ "type": "boolean" }
},
"required": ["invoice_number", "vendor_name", "currency", "line_items",
"stated_total", "calculated_total", "conflict_detected"]
}
}Walk the choices: vendor_tax_id is nullable because some invoices genuinely lack it — making it required would manufacture hallucinations. The currency enum has an "other" escape hatch with a detail field. And stated_total vs calculated_total with conflict_detected builds semantic validation into the schema itself — the model checks its own arithmetic and flags discrepancies for routing, which schema syntax alone can never do.
Call it with tool_choice {"type": "tool", "name": "extract_invoice"} when every document is an invoice, or "any" when invoices, receipts, and purchase orders share a pipeline and the model must pick the schema.
Worked Exam Question
Your extraction pipeline retries every validation failure up to five times with error feedback. Monitoring shows two failure clusters: (1) dates returned as "March 5th" instead of ISO format — these succeed by retry two; (2) missing purchase_order_number — these fail all five retries on documents where the PO number appears only in a separate approval email. What should you change?
- A. Keep retry-with-error-feedback for cluster 1, and stop retrying cluster 2 — the information is absent from the source, so either supply the approval email as input or make the field nullable.
- B. Increase the retry limit to ten for cluster 2, since more attempts eventually succeed.
- C. Add few-shot examples of correct PO extraction to fix cluster 2.
- D. Lower the temperature for both clusters to make extraction more deterministic.
Answer: A. Retries fix format and structural errors (cluster 1 proves it); they cannot conjure information that is not in the provided document. Cluster 2 needs a data fix — provide the source that contains the value, or let the schema express its absence. Options B, C, and D all spend money re-asking a question the document cannot answer.
Key Takeaways for the Exam
- Categorical criteria and per-severity examples, never "be conservative."
- Few-shot for ambiguity, format, and structural variety; show the why, not just the answer.
- tool_use schemas kill syntax errors, not semantic ones; nullable fields prevent fabrication.
- Retry fixes format problems, never missing information.
- Batch = 50% off, ≤24h, no SLA, no multi-turn tools,
custom_idcorrelation — never for blocking checks. - Independent instance for review; per-file + integration passes for large PRs.
Next: Lesson 6 — Context Management & Reliability
Frequently Asked Questions
What is the most reliable way to get schema-compliant JSON from Claude?
Define a tool whose input schema is the output structure you want, and read the data out of the model's tool_use block. This eliminates the entire class of JSON syntax errors — no trailing commas, no markdown fences, no unquoted keys. What it cannot do is guarantee semantic correctness: values can be schema-valid and still wrong, which is why designs like calculated_total alongside stated_total with a conflict flag exist. Prompt-level "return only valid JSON" instructions reduce but never eliminate failures.
Why should extraction schema fields be nullable?
Because a required field is an order the model will follow even when it shouldn't. If vendor_tax_id is required and the invoice has no tax ID, the model must put something there to satisfy the schema — so it fabricates a plausible value. Making fields that may be absent nullable (or optional) lets the model tell the truth. On the exam, any scenario describing fabricated values in specific fields almost always resolves to this schema fix, not to prompt warnings against hallucination.
When is the Message Batches API appropriate?
For workloads where nobody is waiting: overnight technical-debt reports, weekly audits, bulk document extraction with a next-day SLA. You get 50% cost savings in exchange for up to 24-hour processing with no latency guarantee, and batched requests cannot do multi-turn tool calling. That trade is never acceptable for blocking workflows like pre-merge checks — "batches usually finish faster" is explicitly a wrong answer. Use custom_id on every request so failures can be identified and resubmitted individually.
