LLM resilience
Prose-primary authoring puts an LLM call on the critical path. Three reinforcing layers keep the output trustworthy, and a 54-scenario corpus keeps it honest.
The product story rests on “the LLM does the right thing reliably.”
This page covers the layers that make that statement falsifiable —
the prompt, the retry loop, the post-parse semantic check — and the
corpus that exercises every LLM-touching surface in @klera/planner.
Three layers, one critical path
prose .flow.md
│
▼
┌────────────────────────────────────────────────────────┐
│ 1. Refined system prompt │
│ self-check rubric · positive + negative examples │
│ anti-hallucination rule · IR-projected snapshot │
└──────────────────────┬──────────────────────────────────┘
│
▼ candidate JSON
┌────────────────────────────────────────────────────────┐
│ 2. Schema-error-injecting retries │
│ strict-JSON preamble → Zod errors → semantic errors │
│ each retry carries more signal │
└──────────────────────┬──────────────────────────────────┘
│
▼ schema-validated SemanticPlan
┌────────────────────────────────────────────────────────┐
│ 3. Post-parse semantic check │
│ walks IR against snapshot using a light matcher │
│ nearestCandidates fed back into retry on miss │
└──────────────────────┬──────────────────────────────────┘
│
▼
committed cache
Each layer is independent. A failure at layer 1 (malformed JSON) is
caught by layer 2’s parser. A failure at layer 2 (schema violation
the retries can’t resolve) raises a PlannerHallucinationError
before any cache write. A failure at layer 3 (target not in
snapshot) triggers another retry with the nearest snapshot
candidates as feedback.
Layer 1 — the prompt
The system prompt is iteratively refined against the real-LLM probe. It carries:
- A self-check rubric. The model is asked to verify its own output before emitting: every step keyword maps to a known IR variant; every target either resolves against the snapshot or is wrapped in optional; every text value comes from the prose, not from the snapshot.
- Positive and negative examples. Real prose → real IR pairs for the common keywords; bait scenarios show the failure modes the prompt is hardened against.
- An anti-hallucination rule. Targets the model can’t see in the snapshot must be wrapped in optional. The runtime matcher cleanly skips an optional step that doesn’t resolve; an unwrapped target that isn’t in the snapshot is a hallucination.
- An IR-projected snapshot. The element graph passed to the model is projected to expose only IR-relevant fields: testID, accessibilityLabel, text, role. The runtime id field is not in front of the LLM. (An earlier prompt revision leaked runtime IDs and the model was deriving testID values from them, exactly the failure mode the projection now blocks.)
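The projection in the last bullet can be pictured as a field allow-list. The following is a minimal sketch, not klera's implementation: the RuntimeElement and ProjectedElement shapes, the children field, and the projectForPrompt name are all assumptions made for illustration. Only the four exposed field names and the hidden id field come from this page.

```typescript
// Hypothetical shapes: klera's real element-graph types are not shown on this page.
interface RuntimeElement {
  id: string; // runtime-only identifier, must not reach the model
  testID?: string;
  accessibilityLabel?: string;
  text?: string;
  role?: string;
  children?: RuntimeElement[]; // assumed tree structure
}

interface ProjectedElement {
  testID?: string;
  accessibilityLabel?: string;
  text?: string;
  role?: string;
  children?: ProjectedElement[];
}

// Copy only the IR-relevant fields; everything else (including `id`) is dropped.
function projectForPrompt(el: RuntimeElement): ProjectedElement {
  const out: ProjectedElement = {};
  if (el.testID !== undefined) out.testID = el.testID;
  if (el.accessibilityLabel !== undefined) out.accessibilityLabel = el.accessibilityLabel;
  if (el.text !== undefined) out.text = el.text;
  if (el.role !== undefined) out.role = el.role;
  if (el.children && el.children.length > 0) {
    out.children = el.children.map(projectForPrompt);
  }
  return out;
}

const projected = projectForPrompt({
  id: "rt-0x3f",
  role: "button",
  testID: "sign-in-btn",
  text: "Sign in",
  children: [{ id: "rt-0x40", text: "Sign in" }],
});
```

Building the output from an allow-list rather than deleting id from a copy means a field added to the runtime graph later stays hidden from the model by default.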
The prompt is versioned alongside the planner. Updating it bumps
PLANNER_VERSION, which in turn invalidates the cache key — so
prompt updates regenerate caches automatically on the next run.
Layer 2 — schema-error-injecting retries
When the model produces JSON that doesn’t validate, klera retries with progressively more diagnostic feedback in the prompt:
| Attempt | Adds |
|---|---|
| 1 | The prompt + the prose + the projected snapshot |
| 2 | A strict-JSON preamble emphasising “emit JSON, no commentary, no fences” |
| 3 | The previous attempt’s response + the Zod errors it triggered, formatted as a feedback list |
| 4+ | The semantic-check errors (layer 3) when those are what failed |
The retry budget is configurable; the default is 3. Each retry’s input is what the model needs to fix what it got wrong: a parser-level failure, then a schema-level failure, then a semantic-level failure. The retry loop stops as soon as one attempt’s output passes both schema validation (layer 2) and the semantic check (layer 3).
When the budget is exhausted klera throws PlannerHallucinationError
carrying every attempt’s response. No half-baked cache writes.
The CLI surfaces the error with the offending attempts; the prose
file is left alone; the next run retries from scratch.
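The escalation described in this section can be sketched as a loop that feeds each failure back into the next prompt and throws once the budget runs out. This is a hedged illustration only: Verdict, Failure, Attempt, feedbackFor, planWithRetries and toyValidate are hypothetical names, the error class is a local stand-in for klera's PlannerHallucinationError, the model call is synchronous for brevity, and the sketch assumes the budget of 3 counts attempts. The toy validator stands in for the real Zod schema and semantic check.

```typescript
// Hypothetical stand-ins: none of these shapes are taken from klera's source.
type Failure = { ok: false; kind: "parse" | "schema" | "semantic"; errors: string[] };
type Verdict = { ok: true } | Failure;
interface Attempt { response: string; verdict: Verdict }

class PlannerHallucinationError extends Error {
  readonly attempts: Attempt[];
  constructor(attempts: Attempt[]) {
    super(`planner exhausted retries after ${attempts.length} attempts`);
    this.name = "PlannerHallucinationError";
    this.attempts = attempts;
  }
}

// Build the extra feedback for the next attempt from what just failed.
function feedbackFor(response: string, failure: Failure): string {
  if (failure.kind === "parse") {
    return "Emit JSON only. No commentary, no fences.";
  }
  const list = failure.errors.map((e) => `- ${e}`).join("\n");
  return `Previous response:\n${response}\n\n${failure.kind} errors to fix:\n${list}`;
}

function planWithRetries(
  basePrompt: string,
  callModel: (prompt: string) => string, // synchronous here for brevity
  validate: (response: string) => Verdict,
  maxAttempts = 3, // assumption: the budget counts attempts
): string {
  const attempts: Attempt[] = [];
  let prompt = basePrompt;
  for (let i = 0; i < maxAttempts; i++) {
    const response = callModel(prompt);
    const verdict = validate(response);
    attempts.push({ response, verdict });
    if (verdict.ok) return response; // passed every check: safe to cache
    prompt = `${basePrompt}\n\n${feedbackFor(response, verdict)}`;
  }
  // Budget exhausted: fail loudly with the full trace, write nothing.
  throw new PlannerHallucinationError(attempts);
}

// Toy validator: a parse tier, then a one-rule "schema" tier.
function toyValidate(response: string): Verdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(response);
  } catch {
    return { ok: false, kind: "parse", errors: ["response is not JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, kind: "schema", errors: ["expected an object"] };
  }
  const steps = (parsed as { steps?: unknown }).steps;
  if (!Array.isArray(steps)) {
    return { ok: false, kind: "schema", errors: ["steps: expected array"] };
  }
  return { ok: true };
}

// Canned model output, in the spirit of the mock-driven suites.
const canned = ["Sure! Here is the plan:", '{"steps": "tap"}', '{"steps": []}'];
const prompts: string[] = [];
let calls = 0;
const fakeModel = (prompt: string): string => {
  prompts.push(prompt);
  return canned[calls++] ?? "";
};

const plan = planWithRetries("PROMPT", fakeModel, toyValidate);
```

In this run the first canned response fails to parse, the second fails the toy schema rule, and the third is returned. A model that never converges would surface as a thrown error whose attempts array holds every response.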
Layer 3 — post-parse semantic check
Schema validation is necessary but not sufficient. A plan can be
schema-valid and still target an element that does not exist in the
snapshot — e.g. the model invents a testID that looks plausible
but isn’t there.
checkPlanAgainstSnapshot walks the IR against the element-graph
snapshot using a light matcher ladder (the same four rungs as the
runtime matcher; see self-healing matcher). For
every step that carries a target:
- Resolve the target against the snapshot.
- If it resolves, the step is fine.
- If it doesn’t resolve and the step is wrapped in optional, the step is fine: optional is the explicit “may not exist at runtime” annotation.
- If it doesn’t resolve and the step is not optional, collect the nearest snapshot candidates by fuzzy score and feed them back into the next retry.
Two design choices worth flagging:
- optional is exempted end-to-end. The check used to walk the inner step inside optional and reject it; the corpus surfaced this as a real bug, since it made the prompt’s anti-hallucination rule unfollowable. Fixed by exempting the whole optional subtree.
- swipe is intentionally exempt. swipe resolves to coord tuples after sugar normalisation, not to element targets, so the semantic check skips it.
nearestCandidates is what makes the retry productive — the model
sees “you asked for testID: 'login-btn' but the closest match is
testID: 'sign-in-btn' (score 0.84)” and corrects on the next
attempt.
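A much-reduced sketch of this check follows, under loud assumptions: the Step union, the Miss shape, and the checkPlanSketch name are hypothetical; the matcher is collapsed to a single exact-testID rung rather than the four-rung ladder; and the fuzzy score is a normalised Levenshtein similarity chosen for illustration, since this page does not say which scoring function klera uses.

```typescript
// Hypothetical IR subset: just enough variants to show the exemptions.
type Step =
  | { kind: "tap"; target: { testID: string } }
  | { kind: "optional"; step: Step }
  | { kind: "swipe"; from: [number, number]; to: [number, number] };

interface Miss {
  stepIndex: number;
  wanted: string;
  nearest: { testID: string; score: number }[];
}

// Classic edit distance, keeping only the previous row of the DP table.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Illustrative score in [0, 1]; 1 means identical strings.
function similarity(a: string, b: string): number {
  return 1 - levenshtein(a, b) / Math.max(a.length, b.length, 1);
}

function checkPlanSketch(steps: Step[], snapshotTestIDs: string[], topN = 3): Miss[] {
  const misses: Miss[] = [];
  steps.forEach((step, stepIndex) => {
    // The whole optional subtree is exempt, and swipe carries no element target.
    if (step.kind === "optional" || step.kind === "swipe") return;
    const wanted = step.target.testID;
    if (snapshotTestIDs.includes(wanted)) return; // resolves: nothing to report
    const nearest = snapshotTestIDs
      .map((testID) => ({ testID, score: similarity(wanted, testID) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topN);
    misses.push({ stepIndex, wanted, nearest });
  });
  return misses;
}

const misses = checkPlanSketch(
  [
    { kind: "tap", target: { testID: "login-btn" } },
    { kind: "optional", step: { kind: "tap", target: { testID: "promo-banner-close" } } },
    { kind: "swipe", from: [0.5, 0.8], to: [0.5, 0.2] },
  ],
  ["sign-in-btn", "cart-total"],
);
```

Here only the first step is reported. Its candidate list ranks sign-in-btn ahead of cart-total, which is the kind of hint the next retry prompt would carry.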
The corpus
54 mock-driven scenarios across 6 suites run as part of the standard test gate, plus 44 real-LLM probe scenarios (manual / periodic, see below). Every prompt edit, every schema change, every matcher tweak runs through the mock corpus first.
| Suite | Scenarios | What it proves |
|---|---|---|
| Prose corpus | 17 | Real-shape prose flows survive validation; bait scenarios fail at the right tier |
| Flakiness simulator | 10 | Every retry escalation path works; loud failure carries a full attempt trace |
| Self-healing matcher | 7 | UI-drift recovery: testID rename, label rewrite, fuzzy text, scope, role |
| Golden end-to-end | 4 | Prose → planFlow → semantic-check → matcher resolution against fixtures |
| Recording corpus | 6 | In-app recorder pipeline: timeline + snapshot → prose + IR via replay LLM |
| Manual-mode corpus | 10 | Paste-back robustness: fenced JSON, leading prose, truncation, wrong shape |
| Total | 54 | |
The mock-driven suites use canned LLM output. The real-LLM probe
(44 scenarios under scripts/probe-llm-paid.mjs) uses an actual
local-CLI invocation; it’s gated behind KLERA_PROBE_LLM=1 and runs
manually or on a periodic CI job. The probe drove 11 distinct
findings during prompt iteration; every one is fixed in the prompt
or in the validator.
Loud failure on hallucination
The non-negotiable invariant: a hallucination never lands in the
cache. The retry loop either converges on valid IR or throws
PlannerHallucinationError. The error carries every attempt’s
response so the failure is reproducible offline, and the cache file
is not written.
CLI behaviour on PlannerHallucinationError:
✗ flows/checkout.flow.md — planner exhausted retries
attempt 1: schema invalid (3 Zod errors)
attempt 2: schema valid, semantic check failed (1 missing target)
attempt 3: schema valid, semantic check failed (1 missing target)
no cache file written. re-run to retry from scratch, or open
flows/checkout.flow.md and clarify the prose.
The flow is left exactly as the author wrote it. The fix is on the
prose side — clarify the screen, or wrap the uncertain target in
optional.
Cache invalidation on prompt updates
The cache key is hash(prose + element_graph + planner_version).
planner_version is PLANNER_VERSION, exported from the planner
package. Updating the prompt bumps the version; existing caches
mismatch on the next compile and regenerate against the new prompt.
This means prompt improvements ship transparently. Adopters don’t
have to remember to invalidate caches by hand. The CI job that
runs klera compile --check catches stale caches before they
become drift in production runs.
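The invalidation behaviour falls out of the key derivation. As a rough sketch, assuming SHA-256 over a JSON-encoded triple (the page only says hash(...); the algorithm, the encoding, the cacheKey name, and the placeholder version strings are all assumptions):

```typescript
import { createHash } from "node:crypto";

// Placeholder value; the real PLANNER_VERSION is exported by the planner package.
const PLANNER_VERSION_EXAMPLE = "example-1";

function cacheKey(prose: string, elementGraph: unknown, plannerVersion: string): string {
  // JSON-encoding the triple keeps the three parts unambiguously delimited.
  const payload = JSON.stringify([prose, elementGraph, plannerVersion]);
  return createHash("sha256").update(payload).digest("hex");
}

const graph = { role: "button", testID: "sign-in-btn" };
const before = cacheKey("Tap sign in", graph, PLANNER_VERSION_EXAMPLE);
const again = cacheKey("Tap sign in", graph, PLANNER_VERSION_EXAMPLE);
const afterBump = cacheKey("Tap sign in", graph, "example-2");
```

With identical inputs the key is stable, and changing only the version string yields a different key, so the old cache entry no longer matches. One caveat of this particular encoding: JSON.stringify is sensitive to object key order, so a real implementation would need a canonical serialisation of the element graph.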
What this corpus does NOT cover
Honest limits worth knowing:
- Real-LLM behaviour against API mode (Anthropic SDK). The real-LLM probe runs through claude -p only. The Anthropic API path uses the same prompt and validation but is not exercised against a live API.
- Real-LLM behaviour against codex/gemini. The adapter table supports them; the probe currently only invokes claude.
- Network failures, rate limits, timeouts. The flakiness simulator models output flakiness only; transport-level failures are tested separately.
- Concurrent compile invocations. klera compile --all runs sequentially; future parallel mode will need its own concurrency tests.
These are tracked as open follow-ups; this page gets updated as each lands.