LLM resilience

Prose-primary authoring puts an LLM call on the critical path. Three reinforcing layers keep the output trustworthy, and a 54-scenario corpus keeps it honest.

The product story rests on “the LLM does the right thing reliably.” This page covers the layers that make that statement falsifiable — the prompt, the retry loop, the post-parse semantic check — and the corpus that exercises every LLM-touching surface in @klera/planner.

Three layers, one critical path

                    prose .flow.md
┌──────────────────────────────────────────────────────────┐
│  1. Refined system prompt                                │
│     self-check rubric · positive + negative examples     │
│     anti-hallucination rule · IR-projected snapshot      │
└──────────────────────┬───────────────────────────────────┘
                       ▼  candidate JSON
┌──────────────────────────────────────────────────────────┐
│  2. Schema-error-injecting retries                       │
│     strict-JSON preamble → Zod errors → semantic errors  │
│     each retry carries more signal                       │
└──────────────────────┬───────────────────────────────────┘
                       ▼  schema-validated SemanticPlan
┌──────────────────────────────────────────────────────────┐
│  3. Post-parse semantic check                            │
│     walks IR against snapshot using a light matcher      │
│     nearestCandidates fed back into retry on miss        │
└──────────────────────┬───────────────────────────────────┘
                 committed cache

Each layer is independent. A failure at layer 1 (malformed JSON) is caught by layer 2’s parser. A failure at layer 2 (schema violation the retries can’t resolve) raises a PlannerHallucinationError before any cache write. A failure at layer 3 (target not in snapshot) triggers another retry with the nearest snapshot candidates as feedback.

Layer 1 — the prompt

The system prompt is iteratively refined against the real-LLM probe. It carries:

  • A self-check rubric. The model is asked to verify its own output before emitting — every step keyword maps to a known IR variant; every target either resolves against the snapshot or is wrapped in optional; every text value comes from the prose, not from the snapshot.
  • Positive and negative examples. Real prose → real IR pairs for the common keywords; bait scenarios show the failure modes the prompt is hardened against.
  • An anti-hallucination rule. Targets the model can’t see in the snapshot must be wrapped in optional. The runtime matcher cleanly skips an optional step that doesn’t resolve; an unwrapped target that isn’t in the snapshot is a hallucination.
  • An IR-projected snapshot. The element graph passed to the model is projected to expose only IR-relevant fields — testID, accessibilityLabel, text, role. The runtime id field is not in front of the LLM. (An earlier prompt revision leaked runtime IDs and the model was deriving testID values from them — exactly the failure mode the projection now blocks.)
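The projection described in the last bullet can be pictured as an allow-list copy. This is a minimal sketch under assumptions: the `RuntimeElement` and `ProjectedElement` shapes, the `children` field, and the `projectForPrompt` name are hypothetical and are not taken from @klera/planner.

```typescript
// Hypothetical shapes: the real element-graph types may differ.
interface RuntimeElement {
  id: string; // runtime-only identifier, must not reach the model
  testID?: string;
  accessibilityLabel?: string;
  text?: string;
  role?: string;
  children?: RuntimeElement[]; // assumed nesting, for illustration only
}

interface ProjectedElement {
  testID?: string;
  accessibilityLabel?: string;
  text?: string;
  role?: string;
  children?: ProjectedElement[];
}

// Copy only the IR-relevant fields; anything else (including `id`) is dropped.
function projectForPrompt(el: RuntimeElement): ProjectedElement {
  const out: ProjectedElement = {};
  if (el.testID !== undefined) out.testID = el.testID;
  if (el.accessibilityLabel !== undefined) out.accessibilityLabel = el.accessibilityLabel;
  if (el.text !== undefined) out.text = el.text;
  if (el.role !== undefined) out.role = el.role;
  if (el.children && el.children.length > 0) {
    out.children = el.children.map(projectForPrompt);
  }
  return out;
}

const projected = projectForPrompt({
  id: "rt-0042",
  role: "button",
  text: "Sign in",
  testID: "sign-in-btn",
  children: [{ id: "rt-0043", text: "Sign in" }],
});
console.log(JSON.stringify(projected));
```

An allow-list is used here rather than deleting `id`, so a field added to the runtime graph later stays hidden from the model unless it is opted in.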

The prompt is versioned alongside the planner. Updating it bumps PLANNER_VERSION, which in turn invalidates the cache key — so prompt updates regenerate caches automatically on the next run.

Layer 2 — schema-error-injecting retries

When the model produces JSON that doesn’t validate, klera retries with progressively more diagnostic feedback in the prompt:

| Attempt | Adds |
| --- | --- |
| 1 | The prompt + the prose + the projected snapshot |
| 2 | A strict-JSON preamble emphasising “emit JSON, no commentary, no fences” |
| 3 | The previous attempt’s response + the Zod errors it triggered, formatted as a feedback list |
| 4+ | The semantic-check errors (layer 3) when those are what failed |

The retry budget is configurable; the default is 3. Each retry carries the feedback the model needs to fix its previous mistake: first a parser-level failure, then a schema-level failure, then a semantic-level failure. The retry loop stops as soon as one attempt’s output passes both schema validation and the semantic check.

When the budget is exhausted, klera throws PlannerHallucinationError carrying every attempt’s response. Nothing partial is written to the cache. The CLI surfaces the error with the offending attempts; the prose file is left alone; the next run retries from scratch.
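The escalation and the loud failure can be sketched as a small loop. Everything here except the `PlannerHallucinationError` name is hypothetical: the model call is synchronous for brevity, the validator is hand-rolled rather than Zod, feedback is chosen from the kind of the previous failure, and `maxAttempts` is assumed to count total attempts.

```typescript
type Failure = { kind: "parse" | "schema" | "semantic"; errors: string[] };
type Verdict = { ok: true; plan: unknown } | { ok: false; failure: Failure };
interface Attempt { response: string; failure: Failure }

class PlannerHallucinationError extends Error {
  readonly attempts: Attempt[];
  constructor(attempts: Attempt[]) {
    super(`planner exhausted retries after ${attempts.length} attempts`);
    this.name = "PlannerHallucinationError";
    this.attempts = attempts; // every response is kept for offline reproduction
  }
}

// Build the next prompt from what the previous attempt got wrong.
function buildPrompt(base: string, previous?: Attempt): string {
  if (!previous) return base;
  if (previous.failure.kind === "parse") {
    return `Emit JSON only: no commentary, no fences.\n${base}`;
  }
  const feedback = previous.failure.errors.map((e) => `- ${e}`).join("\n");
  return `${base}\nPrevious response:\n${previous.response}\nErrors to fix:\n${feedback}`;
}

function planWithRetries(
  base: string,
  callModel: (prompt: string) => string,
  validate: (response: string) => Verdict,
  maxAttempts = 3,
): unknown {
  const attempts: Attempt[] = [];
  for (let i = 0; i < maxAttempts; i++) {
    const response = callModel(buildPrompt(base, attempts[attempts.length - 1]));
    const verdict = validate(response);
    if (verdict.ok) return verdict.plan; // the only path that may reach the cache
    attempts.push({ response, failure: verdict.failure });
  }
  throw new PlannerHallucinationError(attempts);
}

// Canned model output: unparseable, then schema-invalid, then valid.
const canned = ["not json", '{"steps": 1}', '{"steps": []}'];
const prompts: string[] = [];
const fakeModel = (prompt: string): string => {
  prompts.push(prompt);
  return canned[prompts.length - 1] ?? "";
};
const fakeValidate = (response: string): Verdict => {
  let parsed: unknown;
  try {
    parsed = JSON.parse(response);
  } catch {
    return { ok: false, failure: { kind: "parse", errors: ["response is not JSON"] } };
  }
  if (!Array.isArray((parsed as { steps?: unknown }).steps)) {
    return { ok: false, failure: { kind: "schema", errors: ["steps: expected array"] } };
  }
  return { ok: true, plan: parsed };
};

const plan = planWithRetries("PROMPT", fakeModel, fakeValidate);
console.log(JSON.stringify(plan), prompts.length);
```

Because the function either returns a validated plan or throws, a caller that writes the cache only from the return value cannot persist a rejected attempt.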

Layer 3 — post-parse semantic check

Schema validation is necessary but not sufficient. A plan can be schema-valid and still target an element that does not exist in the snapshot — e.g. the model invents a testID that looks plausible but isn’t there.

checkPlanAgainstSnapshot walks the IR against the element-graph snapshot using a light matcher ladder (the same four rungs as the runtime matcher; see self-healing matcher). For every step that carries a target:

  • Resolve the target against the snapshot.
  • If it resolves, the step is fine.
  • If it doesn’t resolve and the step is wrapped in optional, the step is fine — optional is the explicit “may not exist at runtime” annotation.
  • If it doesn’t resolve and the step is not optional, collect the nearest snapshot candidates by fuzzy score and feed them back into the next retry.

Two design choices worth flagging:

  • optional is exempted end-to-end. The check used to walk the inner step inside optional and reject it; the corpus surfaced this as a real bug, since it made the prompt’s anti-hallucination rule unfollowable. Fixed by exempting the whole optional subtree.
  • swipe is intentionally exempt. swipe resolves to coord tuples after sugar normalisation, not to element targets, so the semantic check skips it.

nearestCandidates is what makes the retry productive — the model sees “you asked for testID: 'login-btn' but the closest match is testID: 'sign-in-btn' (score 0.84)” and corrects on the next attempt.
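The walk, the two exemptions, and the candidate ranking can be sketched as follows. This is an illustration under assumptions, not the real `checkPlanAgainstSnapshot`: the `Step` shape is hypothetical, only `testID` targets are modelled instead of the full matcher ladder, and the fuzzy score is a plain bigram Dice coefficient standing in for whatever scoring the matcher uses.

```typescript
// Hypothetical IR shape; the real SemanticPlan has more variants.
type Step =
  | { kind: "tap"; target: { testID: string } }
  | { kind: "swipe"; direction: "up" | "down" }
  | { kind: "optional"; step: Step };

interface Candidate { testID: string; score: number }
interface Miss { wanted: string; nearestCandidates: Candidate[] }

function bigrams(s: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < s.length - 1; i++) out.push(s.slice(i, i + 2));
  return out;
}

// Dice coefficient over character bigrams, in [0, 1]. A stand-in score.
function similarity(a: string, b: string): number {
  if (a === b) return 1;
  const left = bigrams(a);
  const pool = bigrams(b);
  const total = left.length + pool.length;
  if (left.length === 0 || pool.length === 0) return 0;
  let shared = 0;
  for (const gram of left) {
    const at = pool.indexOf(gram);
    if (at >= 0) {
      shared++;
      pool.splice(at, 1); // each bigram in b can be matched once
    }
  }
  return (2 * shared) / total;
}

function checkPlan(steps: Step[], snapshotTestIDs: string[], topN = 3): Miss[] {
  const misses: Miss[] = [];
  for (const step of steps) {
    if (step.kind === "optional") continue; // whole optional subtree is exempt
    if (step.kind === "swipe") continue; // resolves to coordinates, not an element
    const wanted = step.target.testID;
    if (snapshotTestIDs.includes(wanted)) continue;
    const nearestCandidates = snapshotTestIDs
      .map((testID) => ({ testID, score: similarity(wanted, testID) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topN);
    misses.push({ wanted, nearestCandidates });
  }
  return misses;
}

const misses = checkPlan(
  [
    { kind: "tap", target: { testID: "login-btn" } },
    { kind: "optional", step: { kind: "tap", target: { testID: "promo-banner" } } },
    { kind: "swipe", direction: "up" },
  ],
  ["sign-in-btn", "email-input", "password-input"],
);
console.log(JSON.stringify(misses, null, 2));
```

An empty result means the plan passes; a non-empty one is what would be formatted into the next retry's feedback. The scores this stand-in produces will not match the ones quoted above.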

The corpus

54 mock-driven scenarios across 6 suites run as part of the standard test gate, plus 44 real-LLM probe scenarios (manual / periodic, see below). Every prompt edit, every schema change, every matcher tweak runs through the mock corpus first.

| Suite | Scenarios | What it proves |
| --- | --- | --- |
| Prose corpus | 17 | Real-shape prose flows survive validation; bait scenarios fail at the right tier |
| Flakiness simulator | 10 | Every retry escalation path works; loud failure carries a full attempt trace |
| Self-healing matcher | 7 | UI-drift recovery: testID rename, label rewrite, fuzzy text, scope, role |
| Golden end-to-end | 4 | Prose → planFlow → semantic-check → matcher resolution against fixtures |
| Recording corpus | 6 | In-app recorder pipeline: timeline + snapshot → prose + IR via replay LLM |
| Manual-mode corpus | 10 | Paste-back robustness: fenced JSON, leading prose, truncation, wrong shape |
| Total | 54 | |

The mock-driven suites use canned LLM output. The real-LLM probe (44 scenarios under scripts/probe-llm-paid.mjs) uses an actual local-CLI invocation; it’s gated behind KLERA_PROBE_LLM=1 and runs manually or on a periodic CI job. The probe drove 11 distinct findings during prompt iteration; every one is fixed in the prompt or in the validator.
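The paste-back cases in the manual-mode corpus (fenced JSON, leading prose, truncation) are easier to picture with a concrete extractor. The helper below is hypothetical and is not klera's parser: it assumes the response contains a single JSON object, and it would be fooled by a stray `{` in prose that precedes the object.

```typescript
type Extracted = { ok: true; value: unknown } | { ok: false; reason: string };

// Find the first balanced {...} span, ignoring braces inside JSON strings.
function extractJson(pasted: string): Extracted {
  const start = pasted.indexOf("{");
  if (start < 0) return { ok: false, reason: "no JSON object found" };
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < pasted.length; i++) {
    const ch = pasted[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) {
        try {
          return { ok: true, value: JSON.parse(pasted.slice(start, i + 1)) };
        } catch {
          return { ok: false, reason: "object is not valid JSON" };
        }
      }
    }
  }
  return { ok: false, reason: "truncated: object never closes" };
}

const fence = "`".repeat(3); // built at runtime to keep this example fence-safe
const fenced = extractJson(`${fence}json\n{"steps": []}\n${fence}`);
const chatty = extractJson('Here is the plan: {"steps": [{"text": "a } b"}]} Hope it helps.');
const cutOff = extractJson('{"steps": [{"kind": "tap"');
console.log(fenced.ok, chatty.ok, cutOff.ok);
```

A wrong-shape response still parses here; rejecting it is the job of schema validation, which keeps the extractor and the validator independently testable.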

Loud failure on hallucination

The non-negotiable invariant: a hallucination never lands in the cache. The retry loop either converges on valid IR or throws PlannerHallucinationError. The error carries every attempt’s response so the failure is reproducible offline, and the cache file is not written.

CLI behaviour on PlannerHallucinationError:

✗ flows/checkout.flow.md — planner exhausted retries
  attempt 1: schema invalid (3 Zod errors)
  attempt 2: schema valid, semantic check failed (1 missing target)
  attempt 3: schema valid, semantic check failed (1 missing target)
  no cache file written.
  re-run to retry from scratch, or open flows/checkout.flow.md and clarify the prose.

The flow is left exactly as the author wrote it. The fix is on the prose side — clarify the screen, or wrap the uncertain target in optional.

Cache invalidation on prompt updates

The cache key is hash(prose + element_graph + planner_version). planner_version is PLANNER_VERSION, exported from the planner package. Updating the prompt bumps the version; existing caches mismatch on the next compile and regenerate against the new prompt.
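A minimal sketch of such a key, assuming SHA-256 and a JSON-serialised element graph. The source does not name the hash function or the serialisation, and the `PLANNER_VERSION` value below is a placeholder. Each part is length-prefixed, an addition of this sketch, so that moving characters between adjacent parts cannot produce the same key.

```typescript
import { createHash } from "node:crypto";

// Placeholder: the real constant is exported from the planner package.
const PLANNER_VERSION = "example-1";

function cacheKey(prose: string, elementGraph: unknown, plannerVersion: string): string {
  const hash = createHash("sha256");
  // Assumes the graph serialises deterministically (stable key order).
  for (const part of [prose, JSON.stringify(elementGraph), plannerVersion]) {
    hash.update(`${part.length}:`); // length prefix keeps part boundaries unambiguous
    hash.update(part);
  }
  return hash.digest("hex");
}

const graph = { role: "button", testID: "sign-in-btn" };
const before = cacheKey("Tap sign in", graph, PLANNER_VERSION);
const after = cacheKey("Tap sign in", graph, "example-2"); // simulated prompt bump
console.log(before === after); // false: the old cache entry no longer matches
```

Bumping the version changes every key at once, which is why no per-file invalidation step is needed.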

This means prompt improvements ship transparently. Adopters don’t have to remember to invalidate caches by hand. The CI job that runs klera compile --check catches stale caches before they become drift in production runs.

What this corpus does NOT cover

Honest limits worth knowing:

  • Real-LLM behaviour against API mode (Anthropic SDK). The real-LLM probe runs through claude -p only. The Anthropic API path uses the same prompt and validation but is not exercised against a live API.
  • Real-LLM behaviour against codex / gemini. The adapter table supports them; the probe currently only invokes claude.
  • Network failures, rate limits, timeouts. The flakiness simulator models output flakiness only; transport-level failures are tested separately.
  • Concurrent compile invocations. klera compile --all runs sequentially; future parallel mode will need its own concurrency tests.

These are tracked as open follow-ups; this page gets updated as each lands.
