Failure evidence

On failure, klera flushes everything triage needs into one directory. On success, it flushes nothing.

The cost of a passing run should be zero artefacts on disk. The cost of a failed run should be every signal a human or an LLM needs to pick up the trail without re-running the flow. klera’s failure-evidence flush is built around that contract.

What lands on disk

When a step fails, klera writes a per-step directory:


__failure-evidence__/<flow-slug>/step-<n>/
  ├── entry.snapshot.json           element graph at step entry
  ├── postAttempt.snapshot.json     element graph after the matcher tried
  ├── timeout.snapshot.json         element graph at the timeout boundary
  ├── entry.png                     screenshot at step entry
  ├── postAttempt.png               screenshot after the attempt
  └── timeout.png                   screenshot at timeout

The flush also includes the N preceding passing steps so reviewers can see the screen state leading up to the failure. The default context window is 3; configure it with --snapshot-context <n> on klera run. Ring-buffer semantics: the executor keeps the rolling window in memory and discards it on flow pass.
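
The rolling window can be pictured as a small bounded buffer. The sketch below is illustrative only and is not klera's implementation: the `StepCapture` shape and the `ContextWindow` class are hypothetical names chosen for the example.

```typescript
// Hypothetical shape of what the executor holds per step (not klera's real type).
interface StepCapture {
  stepIndex: number;
  snapshots: Record<string, unknown>; // e.g. entry / postAttempt / timeout graphs
}

// Keeps at most `context` passing steps in memory; older ones are evicted.
class ContextWindow {
  private steps: StepCapture[] = [];
  private readonly context: number;

  constructor(context: number) {
    this.context = context;
  }

  // Record a passing step, evicting the oldest once the window is full.
  pushPassing(step: StepCapture): void {
    this.steps.push(step);
    while (this.steps.length > this.context) {
      this.steps.shift();
    }
  }

  // On failure: hand back the retained passing steps followed by the failed one.
  takeForFlush(failed: StepCapture): StepCapture[] {
    const out = this.steps.concat([failed]);
    this.steps = [];
    return out;
  }

  // On flow pass: drop everything, so nothing reaches disk.
  discard(): void {
    this.steps = [];
  }

  get size(): number {
    return this.steps.length;
  }
}
```

With a window of 3, five passing steps followed by a failure would hand steps 2, 3, 4 and the failed step to the flush; with a window of 0, only the failed step survives.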

Two more blocks land on the failed step’s StepResult in the JSON report (no separate sidecar files):

matcherTrace — every ladder rung the matcher probed, with candidate counts and outcome (match / no-match / ambiguous). See self-healing matcher for the ladder.
sourceLinks — when the dev build emits React’s __source prop, every element involved in the matcher trace is denormalised back to (fileName, lineNumber, columnNumber). This is the input to the triage suspect-file ranking; see auto-triage.

Reading a flushed directory

Open the JSON snapshots in any text editor — they are pretty-printed arrays of element descriptors with parent / child relationships, visibility, role, text, accessibility label, testID. The PNG triplet at each step boundary tells you what the screen looked like at three moments:

entry.png — what the screen showed when the step started. The screen the matcher had to resolve against.
postAttempt.png — what the screen showed right after the matcher’s attempt. For a tap that fired, this is the post-tap screen. For a step that resolved cleanly but failed an assertion, this is the same screen as entry.png.
timeout.png — only emitted when the step timed out waiting for an assertion or for waitForIdle. The “still loading” frame.
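
When scripting against a flushed directory it can help to know which files to expect for a step. The helper below is a sketch built from the layout shown earlier; `expectedEvidence` is a hypothetical name, the step number is assumed to be unpadded, and it assumes `timeout.snapshot.json` is written under the same condition as `timeout.png`.

```typescript
import { posix as path } from "node:path";

// Lists the artefact paths a failed step is expected to produce.
// Assumption: both timeout.* files appear only when the step timed out.
function expectedEvidence(
  root: string,
  flowSlug: string,
  step: number,
  timedOut: boolean
): string[] {
  const dir = path.join(root, flowSlug, `step-${step}`);
  const phases = timedOut
    ? ["entry", "postAttempt", "timeout"]
    : ["entry", "postAttempt"];
  const files: string[] = [];
  for (const phase of phases) {
    files.push(path.join(dir, `${phase}.snapshot.json`));
    files.push(path.join(dir, `${phase}.png`));
  }
  return files;
}
```

A triage script could compare this list against the directory contents and flag a missing frame before anyone opens the images.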

The matcher trace embedded in the report tells you exactly which rung the ladder reached. If attempts ends with fuzzy-text → no-match, the target text is genuinely not on screen at the timeout boundary — look at timeout.png and timeout.snapshot.json together. If attempts ends with exact-text → ambiguous, two elements carry the same text — candidateIds lists their IDs and sourceLinks points at the file that introduced the duplicate.
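
The reading rules above can be expressed as a small lookup on the last attempt. This is a sketch under assumptions: `attempts` and `candidateIds` are named in this page, but the `rung` and `outcome` field names and the overall type shapes are guesses, so check them against a real report before relying on them.

```typescript
type Outcome = "match" | "no-match" | "ambiguous";

// Assumed shape of one ladder rung in the trace (field names are illustrative).
interface MatcherAttempt {
  rung: string; // e.g. "exact-text", "fuzzy-text"
  outcome: Outcome;
  candidateIds: string[];
}

interface MatcherTrace {
  attempts: MatcherAttempt[];
}

// Suggests where to look next, based on how the ladder ended.
function nextStop(trace: MatcherTrace): string {
  if (trace.attempts.length === 0) {
    return "no attempts recorded";
  }
  const last = trace.attempts[trace.attempts.length - 1];
  switch (last.outcome) {
    case "no-match":
      // Nothing matched by the final rung: inspect the timeout boundary.
      return `${last.rung} found nothing: open timeout.png and timeout.snapshot.json`;
    case "ambiguous":
      // Several elements matched: the IDs identify the duplicates.
      return `${last.rung} matched ${last.candidateIds.join(", ")}: check sourceLinks`;
    default:
      // The matcher resolved, so the failure is downstream of matching.
      return `${last.rung} resolved: compare entry.png with postAttempt.png`;
  }
}
```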

Why passing flows produce zero artefacts

The flush is the only on-disk side-effect of the executor. A passing flow:

Captures the boundary snapshots into the in-memory ring buffer.
Captures the PNG frames into the same buffer.
Drops the entire buffer on flow pass.

Nothing is written. __failure-evidence__/ does not exist after a green run. This is by design: adopters should be able to list the directory in .gitignore and trust that anything it contains is worth investigating.

Recommended `.gitignore`


# klera failure evidence — fresh on every failed run, never on green.
__failure-evidence__/

The directory is a debugging artefact, not a fixture. CI uploads it as an artefact bundle so PR reviewers can download the full evidence locally; the JSON report and HTML viewer carry inlined copies of the critical bits, so the bundle is rarely the first stop.

Inlining via the HTML report

klera report <report.json> --html report.html produces a self-contained HTML file with:

Every step’s matcher trace rendered as a small ladder visualisation.
Every PNG frame embedded as a data: URI.
Every boundarySnapshots block linked via collapsible <details>.
The triage card if the run was a failure.

The PR comment workflow drops the HTML file in as a CI artefact and links to it from the comment body. Reviewers see the failure inline without leaving GitHub. There is one file to share, not a directory tree.

Data-URI embedding inflates the file (base64 adds roughly a third to the size of every embedded PNG) but keeps everything self-contained: no broken-image links if the artefact bundle is moved. For very large element graphs the HTML can grow into the megabyte range; in that case fall back to the on-disk directory.
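
To estimate how much a frame adds to the HTML, the embedding can be reproduced in a few lines. This is a generic sketch of base64 data URIs in Node, not the report generator's code; `pngDataUri` and `base64Length` are hypothetical helper names.

```typescript
import { Buffer } from "node:buffer";

// Build a data: URI for a PNG held in memory.
function pngDataUri(png: Uint8Array): string {
  return `data:image/png;base64,${Buffer.from(png).toString("base64")}`;
}

// Padded base64 emits 4 characters for every 3 input bytes, rounded up.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}
```

Under that rule a 300,000-byte screenshot becomes 400,000 characters of markup, which is worth keeping in mind for steps that flush three frames each.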

Source links — the dev-only piece

The __source denormalisation requires React Native to emit __source on every element. That is on by default in development builds (Babel’s @babel/plugin-transform-react-jsx-source is part of the Expo preset) and stripped in production builds. The flush gracefully handles the production case: sourceLinks is just absent from the report.

If you ship a production build for E2E (a release-style configuration to test a hardened binary), expect the triage suspect-file ranking to be empty. The deterministic verdict still lands; the LLM narrator still runs; the matcher trace still pinpoints which element resolved. The dev-only field carries one extra layer of “and here is the file that owns that element” that does not survive minification.

Programmatic access

The flush logic lives in packages/engine/src/failure-evidence.ts. The two public entry points:

flushFailureEvidence(buffer, flowSlug, options) — writes the boundary snapshot JSONs.
flushStepFrames(capture, flowSlug, stepIndex, options) — writes the PNG triplet for one step.

Both are async, both honour options.root (defaults to <cwd>/__failure-evidence__), and both run every payload through the per-run secret redactor before writing to disk. A typed ${secret:KEY} value will not appear in any flushed snapshot, even if the runtime saw it during the run. See fixtures and secrets for how secret resolution composes with the flush.
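
The redaction guarantee can be illustrated with a recursive walk over a JSON payload. The sketch below is not the redactor that ships with klera: the `redact` function, the `Json` type, and the `[REDACTED:KEY]` placeholder are all assumptions made for the example.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Replaces every occurrence of a resolved secret value inside string leaves.
// `secrets` maps the secret's KEY to the value the runtime resolved for it.
function redact(value: Json, secrets: Record<string, string>): Json {
  if (typeof value === "string") {
    let out = value;
    for (const key of Object.keys(secrets)) {
      const secret = secrets[key];
      // Skip empty values: splitting on "" would corrupt the string.
      if (secret.length > 0) {
        out = out.split(secret).join(`[REDACTED:${key}]`);
      }
    }
    return out;
  }
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, secrets));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const k of Object.keys(value)) {
      out[k] = redact(value[k], secrets);
    }
    return out;
  }
  return value; // numbers, booleans, null pass through unchanged
}
```

Running the redactor on the payload immediately before the write, rather than at capture time, means one code path covers every snapshot regardless of which step produced it.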

Tuning the ring-buffer depth

The default context window is 3 preceding passing steps. Useful ranges:

--snapshot-context 0 — only the failed step. Smallest artefact bundle; fine for narrow failures where the immediate context is enough.
--snapshot-context 5..10 — for long flows where the failure is several screens deep and reviewers need a wider trail.

The ring buffer is bounded by the context window. Setting the window to a very large number keeps more snapshots resident in memory, but never more than that many steps, so growth is capped rather than unbounded.
