Failure evidence
On failure, klera flushes everything triage needs into one directory. On success, it flushes nothing.
The cost of a passing run should be zero artefacts on disk. The cost of a failed run should be every signal a human or an LLM needs to pick up the trail without re-running the flow. klera’s failure-evidence flush is built around that contract.
What lands on disk
When a step fails, klera writes a per-step directory:
__failure-evidence__/<flow-slug>/step-<n>/
├── entry.snapshot.json         element graph at step entry
├── postAttempt.snapshot.json   element graph after the matcher tried
├── timeout.snapshot.json       element graph at the timeout boundary
├── entry.png                   screenshot at step entry
├── postAttempt.png             screenshot after the attempt
└── timeout.png                 screenshot at timeout
The flush also includes the N preceding passing steps so reviewers
can see the screen state leading up to the failure. The default
context window is 3; configure it with --snapshot-context <n> on
klera run. Ring-buffer semantics: the executor keeps the rolling
window in memory and discards it on flow pass.
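The ring-buffer behaviour described above can be sketched as follows. This is a minimal illustration, not klera's executor: the class name `EvidenceRing`, the `StepEvidence` shape, and the method names are all hypothetical.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not klera internals.
interface StepEvidence {
  stepIndex: number;
  snapshot: unknown; // element graph captured at a step boundary
}

class EvidenceRing {
  private items: StepEvidence[] = [];
  private readonly contextWindow: number;

  constructor(contextWindow: number) {
    this.contextWindow = contextWindow;
  }

  // Record a passing step and evict the oldest entries beyond the window.
  pushPassing(step: StepEvidence): void {
    this.items.push(step);
    while (this.items.length > this.contextWindow) {
      this.items.shift();
    }
  }

  // On failure: hand back the preceding passing steps plus the failed step.
  onFailure(failed: StepEvidence): StepEvidence[] {
    const toFlush = [...this.items, failed];
    this.items = [];
    return toFlush;
  }

  // On flow pass: drop everything so nothing reaches disk.
  onFlowPass(): void {
    this.items = [];
  }

  get size(): number {
    return this.items.length;
  }
}
```

With a window of 3, a failure at step 6 would flush steps 3, 4, 5 and 6; a window of 0 would flush only the failed step.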
Two more blocks land on the failed step’s StepResult in the JSON
report (no separate sidecar files):
- matcherTrace: every ladder rung the matcher probed, with candidate counts and outcome (match / no-match / ambiguous). See self-healing matcher for the ladder.
- sourceLinks: when the dev build emits React’s __source prop, every element involved in the matcher trace is denormalised back to (fileName, lineNumber, columnNumber). This is the input to the triage suspect-file ranking; see auto-triage.
Reading a flushed directory
Open the JSON snapshots in any text editor — they are pretty-printed arrays of element descriptors with parent / child relationships, visibility, role, text, accessibility label, testID. The PNG triplet at each step boundary tells you what the screen looked like at three moments:
- entry.png: what the screen showed when the step started. The screen the matcher had to resolve against.
- postAttempt.png: what the screen showed right after the matcher’s attempt. For a tap that fired, this is the post-tap screen. For a step that resolved cleanly but failed an assertion, this is the same screen as entry.png.
- timeout.png: only emitted when the step timed out waiting for an assertion or for waitForIdle. The “still loading” frame.
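Because the snapshots are plain JSON arrays, they are also easy to query from a script. The sketch below assumes a descriptor shape; the field names `id`, `parentId`, `visible` and `accessibilityLabel` are guesses based on the properties listed above, and the real schema may differ.

```typescript
// Assumed descriptor shape; the real snapshot schema may differ.
interface ElementDescriptor {
  id: string;
  parentId: string | null;
  visible: boolean;
  role?: string;
  text?: string;
  accessibilityLabel?: string;
  testID?: string;
}

// Visible elements whose text contains the needle (case-insensitive).
function findVisibleByText(
  snapshot: ElementDescriptor[],
  needle: string,
): ElementDescriptor[] {
  const wanted = needle.toLowerCase();
  return snapshot.filter(
    (el) => el.visible && (el.text ?? "").toLowerCase().includes(wanted),
  );
}

// Ids present in the first snapshot but missing from the second,
// e.g. entry.snapshot.json versus postAttempt.snapshot.json.
function removedIds(
  before: ElementDescriptor[],
  after: ElementDescriptor[],
): string[] {
  const remaining = new Set(after.map((el) => el.id));
  return before.filter((el) => !remaining.has(el.id)).map((el) => el.id);
}
```

Feed either function the result of parsing a flushed snapshot file with `JSON.parse`.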
The matcher trace embedded in the report tells you exactly which rung
the ladder reached. If attempts ends with fuzzy-text → no-match,
the target text is genuinely not on screen at the timeout boundary —
look at timeout.png and timeout.snapshot.json together. If
attempts ends with exact-text → ambiguous, two elements carry
the same text — candidateIds lists their IDs and sourceLinks
points at the file that introduced the duplicate.
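The reading rules above can be written down as a small helper. This is a sketch under assumptions: `attempts` and `candidateIds` are named in the text, but the `rung` and `outcome` field names, the placement of `candidateIds` on each attempt, and the function `triageHint` are hypothetical.

```typescript
// Assumed trace shape; only `attempts` and `candidateIds` come from the docs.
type Outcome = "match" | "no-match" | "ambiguous";

interface MatcherAttempt {
  rung: string; // e.g. "exact-text" or "fuzzy-text"
  candidateIds: string[];
  outcome: Outcome;
}

interface MatcherTrace {
  attempts: MatcherAttempt[];
}

// Turn the last attempt of a trace into a hint about where to look next.
function triageHint(trace: MatcherTrace): string {
  const last = trace.attempts[trace.attempts.length - 1];
  if (last === undefined) {
    return "empty trace: no rung was probed";
  }
  if (last.outcome === "no-match") {
    return `${last.rung} found nothing: compare timeout.png with timeout.snapshot.json`;
  }
  if (last.outcome === "ambiguous") {
    const ids = last.candidateIds.join(", ");
    return `${last.rung} matched ${last.candidateIds.length} elements (${ids}): check sourceLinks for the duplicate`;
  }
  return `${last.rung} resolved the target: inspect the assertion, not the matcher`;
}
```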
Why passing flows produce zero artefacts
The flush is the only on-disk side-effect of the executor. A passing flow:
- Captures the boundary snapshots into the in-memory ring buffer.
- Captures the PNG frames into the same buffer.
- Drops the entire buffer on flow pass.
Nothing is written. __failure-evidence__/ does not exist after a
green run. This is by design: adopters should be able to list
the directory in .gitignore and trust that it only has content
worth investigating.
Recommended .gitignore
# klera failure evidence — fresh on every failed run, never on green.
__failure-evidence__/
The directory is a debugging artefact, not a fixture. CI uploads it as an artefact bundle so PR reviewers can download the full evidence locally; the JSON report and HTML viewer carry inlined copies of the critical bits, so the bundle is rarely the first stop.
Inlining via the HTML report
klera report <report.json> --html report.html produces a
self-contained HTML file with:
- Every step’s matcher trace rendered as a small ladder visualisation.
- Every PNG frame embedded as a data: URI.
- Every boundarySnapshots block linked via collapsible <details>.
- The triage card if the run was a failure.
The PR comment workflow drops the HTML file in as a CI artefact and links to it from the comment body. Reviewers see the failure inline without leaving GitHub. There is one file to share, not a directory tree.
Data-URI embedding inflates the file (base64 adds roughly a third on top of each PNG's size) but keeps everything self-contained: no broken-image links if the artefact bundle is moved. For very large element graphs the HTML can grow into the megabyte range; in that case fall back to the on-disk directory.
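A rough sizing aid for this trade-off, assuming the frames are embedded as base64 `data:` URIs (the usual encoding for binary payloads; the docs do not state which encoding klera uses). Base64 emits 4 characters for every 3 input bytes. The function names here are illustrative.

```typescript
// Base64 emits 4 output characters per 3 input bytes, padded up to a full group.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// Assumed URI prefix for an embedded PNG frame.
const DATA_URI_PREFIX = "data:image/png;base64,";

// Estimated number of characters the embedded frames add to the HTML report.
function embeddedSize(pngByteLengths: number[]): number {
  return pngByteLengths.reduce(
    (total, n) => total + DATA_URI_PREFIX.length + base64Length(n),
    0,
  );
}
```

For example, three 300 kB frames would add roughly 1.2 MB of text to the report, which is the point where the on-disk directory may be the more comfortable place to look.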
Source links — the dev-only piece
The __source denormalisation requires React Native to emit
__source on every element. That is on by default in development
builds (Babel’s @babel/plugin-transform-react-jsx-source is part of
the Expo preset) and stripped in production builds. The flush
gracefully handles the production case: sourceLinks is just absent
from the report.
If you ship a production build for E2E (a release-style configuration to test a hardened binary), expect the triage suspect-file ranking to be empty. The deterministic verdict still lands; the LLM narrator still runs; the matcher trace still pinpoints which element resolved. The dev-only field carries one extra layer of “and here is the file that owns that element” that does not survive minification.
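Consumers of the report therefore need to treat sourceLinks as optional. The sketch below shows one way to do that; the frequency count is a stand-in for illustration only, not klera's actual suspect-file ranking, and `rankSuspectFiles` is a hypothetical name.

```typescript
// Shape taken from the (fileName, lineNumber, columnNumber) triple in the docs.
interface SourceLink {
  fileName: string;
  lineNumber: number;
  columnNumber: number;
}

// Count how often each file appears; an absent field yields an empty ranking.
// Illustrative heuristic only: the real triage ranking may weigh files differently.
function rankSuspectFiles(sourceLinks?: SourceLink[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const link of sourceLinks ?? []) {
    counts.set(link.fileName, (counts.get(link.fileName) ?? 0) + 1);
  }
  // Most frequent first; ties broken alphabetically for stable output.
  return Array.from(counts.entries()).sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0]),
  );
}
```

On a production build the function receives `undefined` and returns an empty list, mirroring the empty ranking described above.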
Programmatic access
The flush logic lives in packages/engine/src/failure-evidence.ts.
The two public entry points:
- flushFailureEvidence(buffer, flowSlug, options): writes the boundary snapshot JSONs.
- flushStepFrames(capture, flowSlug, stepIndex, options): writes the PNG triplet for one step.
Both are async, both honour options.root (defaults to
<cwd>/__failure-evidence__), and both run every payload through
the per-run secret redactor before writing to disk. A typed
${secret:KEY} value will not appear in any flushed snapshot, even
if the runtime saw it during the run. See
fixtures and secrets for
how secret resolution composes with the flush.
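To illustrate what redacting a payload before it is written can look like, here is a minimal value-based sketch. It is not klera's redactor: the `redact` function, the `Json` type and the mask string are hypothetical, and `${secret:KEY}` resolution is not modelled.

```typescript
// JSON-like payload type for the sketch.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Replace every occurrence of a known secret value anywhere in the payload.
function redact(payload: Json, secrets: string[], mask = "[REDACTED]"): Json {
  if (typeof payload === "string") {
    let out = payload;
    for (const secret of secrets) {
      // Skip empty strings: splitting on "" would mask between every character.
      if (secret.length > 0) {
        out = out.split(secret).join(mask);
      }
    }
    return out;
  }
  if (Array.isArray(payload)) {
    return payload.map((item) => redact(item, secrets, mask));
  }
  if (payload !== null && typeof payload === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(payload)) {
      const value = payload[key];
      if (value !== undefined) {
        out[key] = redact(value, secrets, mask);
      }
    }
    return out;
  }
  // Numbers, booleans and null pass through unchanged.
  return payload;
}
```

The sketch returns a redacted copy rather than mutating its input, so the in-memory payload stays intact while only the masked copy would be written.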
Tuning the ring-buffer depth
The default context window is 3 preceding passing steps. Useful ranges:
- --snapshot-context 0: only the failed step. Smallest artefact bundle; fine for narrow failures where the immediate context is enough.
- --snapshot-context 5..10: for long flows where the failure is several screens deep and reviewers need a wider trail.
The ring buffer is bounded: a very large context window keeps that many snapshots resident in memory, but the buffer never grows beyond the configured depth.