Visual snapshots

visualSnapshot captures a screenshot at a step boundary and diffs it against a stored baseline PNG. Pixel regression catches the bugs the element-graph matcher can’t — colour drift, font rendering changes, layout regressions inside opaque native views, the icon someone moved two pixels to the left.


- visualSnapshot: home-after-login
- visualSnapshot:
    id: settings-loaded
    tolerance: 1.0
    region: { testID: settings-list }

How it works

The runtime captures a PNG of the current screen, and the engine compares it pixel-for-pixel against the stored baseline. The step passes if the percentage of differing pixels is within tolerance (default 0.5, that is, 0.5% of pixels).
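As a rough illustration of that comparison, here is a minimal sketch over raw RGBA buffers. The names RgbaImage, diffPercent and withinTolerance are hypothetical, and the sketch assumes an exact per-channel match and that a diff equal to the tolerance still passes; the real comparator may differ on both points.

```typescript
// Sketch only: type and function names are hypothetical, not the engine's API.

/** A decoded screenshot: width * height pixels, 4 bytes (RGBA) per pixel. */
interface RgbaImage {
  width: number;
  height: number;
  data: Uint8Array;
}

/** Percentage (0-100) of pixels whose RGBA bytes differ between two same-sized images. */
function diffPercent(actual: RgbaImage, baseline: RgbaImage): number {
  if (actual.width !== baseline.width || actual.height !== baseline.height) {
    throw new Error("images must have identical dimensions");
  }
  const totalPixels = actual.width * actual.height;
  if (totalPixels === 0) return 0;

  let differing = 0;
  for (let pixel = 0; pixel < totalPixels; pixel++) {
    const offset = pixel * 4;
    // A pixel counts as different if any of its four channels differs.
    for (let channel = 0; channel < 4; channel++) {
      if (actual.data[offset + channel] !== baseline.data[offset + channel]) {
        differing++;
        break;
      }
    }
  }
  return (differing / totalPixels) * 100;
}

/** Assumption: a diff exactly equal to the tolerance still passes. */
function withinTolerance(percent: number, tolerance: number = 0.5): boolean {
  return percent <= tolerance;
}
```

One changed pixel in a 2x2 image gives 25%, which fails at the default tolerance.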

Three baseline states:

First run, no baseline — the captured PNG is written as the new baseline; the step passes; CI prints “baseline created”.
Subsequent run, match — captured matches baseline within tolerance; the step passes silently.
Subsequent run, mismatch — the engine writes a triplet (actual.png, baseline.png, diff.png) into the report artefact directory; the step fails with the observed diff percentage.
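The three states can be sketched as one small decision function. Everything named here (SnapshotOutcome, classifySnapshot, the written field) is hypothetical and only mirrors the list above; treating a diff equal to the tolerance as a match is an assumption.

```typescript
// Sketch only: names are hypothetical; behaviour mirrors the three states listed above.

type SnapshotOutcome =
  | { state: "baseline-created"; passed: true; written: string[] }
  | { state: "match"; passed: true; written: string[] }
  | { state: "mismatch"; passed: false; written: string[]; diffPercent: number };

/**
 * observedDiffPercent is undefined when no baseline exists yet;
 * otherwise it is the percentage of pixels that differ from the baseline.
 */
function classifySnapshot(
  observedDiffPercent: number | undefined,
  tolerance: number = 0.5
): SnapshotOutcome {
  if (observedDiffPercent === undefined) {
    // First run: the capture becomes the baseline and the step passes.
    return { state: "baseline-created", passed: true, written: ["<id>.png"] };
  }
  if (observedDiffPercent <= tolerance) {
    // Assumption: equality counts as a match. Nothing is written.
    return { state: "match", passed: true, written: [] };
  }
  // Mismatch: the triplet goes to the report artefact directory.
  return {
    state: "mismatch",
    passed: false,
    written: ["actual.png", "baseline.png", "diff.png"],
    diffPercent: observedDiffPercent,
  };
}
```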

The comparator is in-process (pngjs-backed) — no external image toolchain, no Docker, no flake from environment differences.

The IR step shape


- visualSnapshot: home-after-login # short form
- visualSnapshot:
    id: home-after-login
    tolerance: 0.5 # max % of pixels allowed to differ; default 0.5
    region: { testID: notification-list }

Field	Type	Default	Notes
`id`	string (required)	—	Unique within the flow. Filename-safe.
`tolerance`	number (0–100)	`0.5`	Maximum percent of pixels allowed to differ.
`region`	target selector	full screen	Narrows the comparison to one element’s frame.

The short form visualSnapshot: <string> is sugar for { id: <string>, tolerance: 0.5 }.
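A minimal sketch of that desugaring, with the table's constraints applied. The type names and normalizeVisualSnapshot are hypothetical, the region selector is reduced to the testID form used in the examples, and the reading of "filename-safe" as letters, digits, dot, underscore and hyphen is an assumption.

```typescript
// Sketch only: hypothetical names; not the engine's actual schema types.

interface RegionSelector {
  testID: string; // assumption: only the testID form from the examples is modelled
}

interface VisualSnapshotLongForm {
  id: string;
  tolerance?: number;
  region?: RegionSelector;
}

interface VisualSnapshotStep {
  id: string;
  tolerance: number;
  region?: RegionSelector;
}

const DEFAULT_TOLERANCE = 0.5;

/** Expands the short string form and fills in the default tolerance. */
function normalizeVisualSnapshot(input: string | VisualSnapshotLongForm): VisualSnapshotStep {
  const longForm: VisualSnapshotLongForm = typeof input === "string" ? { id: input } : input;
  const tolerance = longForm.tolerance ?? DEFAULT_TOLERANCE;

  // Assumption: "filename-safe" means letters, digits, dot, underscore, hyphen.
  if (!/^[A-Za-z0-9._-]+$/.test(longForm.id)) {
    throw new Error(`id is not filename-safe: ${longForm.id}`);
  }
  if (!(tolerance >= 0 && tolerance <= 100)) {
    throw new Error(`tolerance must be between 0 and 100, got ${tolerance}`);
  }

  const step: VisualSnapshotStep = { id: longForm.id, tolerance };
  if (longForm.region !== undefined) {
    step.region = longForm.region;
  }
  return step;
}
```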

Why `region`?

The surrounding chrome (clock, signal bar, battery indicator) drifts between runs and contaminates a full-screen diff. Setting region to a testID of the content area lets the comparator ignore everything outside that frame.


- visualSnapshot:
    id: notification-row-2
    region: { testID: notification-list }

The matcher resolves region once at step time. If the element is not visible the step fails immediately with region not visible — no PNGs are written.
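To make the effect of region concrete, here is a sketch of cropping a screenshot to an element's frame before comparison. RgbaImage, Frame and cropToRegion are hypothetical names, and the frame is assumed to be already expressed in screenshot pixels; how the matcher actually resolves and scales the frame is not covered here.

```typescript
// Sketch only: hypothetical names; frame is assumed to be in screenshot pixels.

interface RgbaImage {
  width: number;
  height: number;
  data: Uint8Array; // 4 bytes (RGBA) per pixel, row by row
}

interface Frame {
  x: number;
  y: number;
  width: number;
  height: number;
}

/** Returns only the pixels inside frame; an undefined frame means the element was not visible. */
function cropToRegion(image: RgbaImage, frame: Frame | undefined): RgbaImage {
  if (frame === undefined) {
    // Fail before anything is compared or written.
    throw new Error("region not visible");
  }
  const { x, y, width, height } = frame;
  if (x < 0 || y < 0 || width <= 0 || height <= 0 ||
      x + width > image.width || y + height > image.height) {
    throw new Error("region frame lies outside the screenshot");
  }

  const data = new Uint8Array(width * height * 4);
  for (let row = 0; row < height; row++) {
    // Copy one row of the frame at a time out of the full-screen buffer.
    const sourceStart = ((y + row) * image.width + x) * 4;
    data.set(image.data.subarray(sourceStart, sourceStart + width * 4), row * width * 4);
  }
  return { width, height, data };
}
```

Both the capture and the baseline would be reduced to the same frame, so status-bar pixels never enter the diff.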

Baselines

Baselines live under __baselines__/<flow>/<id>.png relative to the flow directory. Commit them.

- login.flow.md
- login.flow.json
- __baselines__/
  - login/
    - login-screen.png
    - home-after-login.png
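The layout above can be expressed as a small path helper. This is a sketch: baselinePath is a hypothetical name, and it assumes that <flow> is the flow file name with its .flow.md or .flow.json suffix removed, which matches the login example but is not confirmed elsewhere.

```typescript
// Sketch only: hypothetical helper; the <flow> naming rule is an assumption.

/** Builds __baselines__/<flow>/<id>.png next to the given flow file. */
function baselinePath(flowFile: string, id: string): string {
  const lastSlash = flowFile.lastIndexOf("/");
  // Directory containing the flow file, including the trailing slash (empty if none).
  const flowDir = lastSlash === -1 ? "" : flowFile.slice(0, lastSlash + 1);
  const fileName = flowFile.slice(lastSlash + 1);
  // Assumption: "login.flow.md" and "login.flow.json" both map to the flow name "login".
  const flowName = fileName.replace(/\.flow\.(md|json)$/, "");
  return `${flowDir}__baselines__/${flowName}/${id}.png`;
}
```

Under these assumptions, baselinePath("flows/login.flow.md", "home-after-login") yields flows/__baselines__/login/home-after-login.png.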

The first run writes a baseline; reviewers see the PNG land in the PR. Subsequent runs diff against it. Treat baselines as code: review them on every PR that adds or changes one.

Don’t commit baselines from your local machine without checking them on CI’s actual screen size and scale. Different host devices produce different PNGs even for “the same” screen — typically your baseline workflow is “run it once on CI, commit the artefact CI produced”.

On mismatch: the triplet

When a comparison fails, the engine writes three files into the report artefact directory:


artefacts/<run-id>/visual-diffs/login/home-after-login/
  actual.png
  baseline.png
  diff.png

actual.png is what the runtime captured on this run. baseline.png is the committed reference. diff.png highlights the changed pixels in red.
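A diff image of this kind can be sketched as follows. RgbaImage and renderDiff are hypothetical names; painting changed pixels opaque red follows the description above, while leaving unchanged pixels as they appear in the baseline is an assumption about the background.

```typescript
// Sketch only: hypothetical names; the treatment of unchanged pixels is an assumption.

interface RgbaImage {
  width: number;
  height: number;
  data: Uint8Array; // 4 bytes (RGBA) per pixel
}

/** Returns a copy of the baseline with every changed pixel painted opaque red. */
function renderDiff(actual: RgbaImage, baseline: RgbaImage): RgbaImage {
  if (actual.width !== baseline.width || actual.height !== baseline.height) {
    throw new Error("images must have identical dimensions");
  }
  // Copy the baseline so the input buffers are left untouched.
  const data = new Uint8Array(baseline.data);
  const totalPixels = baseline.width * baseline.height;

  for (let pixel = 0; pixel < totalPixels; pixel++) {
    const offset = pixel * 4;
    let changed = false;
    for (let channel = 0; channel < 4; channel++) {
      if (actual.data[offset + channel] !== baseline.data[offset + channel]) {
        changed = true;
        break;
      }
    }
    if (changed) {
      data[offset] = 255;     // red
      data[offset + 1] = 0;   // green
      data[offset + 2] = 0;   // blue
      data[offset + 3] = 255; // fully opaque
    }
  }
  return { width: baseline.width, height: baseline.height, data };
}
```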

The HTML report renders all three side-by-side under the failed step, and the JSON report carries paths to the files in step.visualDiff:


{
  "step": 5,
  "kind": "visualSnapshot",
  "id": "home-after-login",
  "passed": false,
  "visualDiff": {
    "diffPercent": 1.83,
    "tolerance": 0.5,
    "actualPath": "artefacts/.../actual.png",
    "baselinePath": "artefacts/.../baseline.png",
    "diffPath": "artefacts/.../diff.png"
  }
}
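For tooling that consumes the JSON report, the step shape above can be typed and summarised as in the sketch below. The interface and function names are hypothetical, and passing the steps in as an array is an assumption, since only a single step object is shown here.

```typescript
// Sketch only: hypothetical names; field names are taken from the JSON sample above.

interface VisualDiff {
  diffPercent: number;
  tolerance: number;
  actualPath: string;
  baselinePath: string;
  diffPath: string;
}

interface StepResult {
  step: number;
  kind: string;
  id?: string;
  passed: boolean;
  visualDiff?: VisualDiff;
}

/** One human-readable line per failed visualSnapshot step. */
function summarizeVisualFailures(steps: StepResult[]): string[] {
  const lines: string[] = [];
  for (const result of steps) {
    if (result.kind !== "visualSnapshot" || result.passed || result.visualDiff === undefined) {
      continue;
    }
    const diff = result.visualDiff;
    lines.push(
      `step ${result.step} (${result.id ?? "?"}): ${diff.diffPercent}% differs, ` +
        `tolerance ${diff.tolerance}%, see ${diff.diffPath}`
    );
  }
  return lines;
}
```

A CI job could print these lines next to the link to the HTML report so a reviewer knows which triplets to open first.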

klera report --html report.json --out report.html produces a self-contained HTML page with every triplet inlined as base64 — no broken-image links when you email it to a reviewer.

Updating baselines

When a UI change is intentional, update the committed baselines.


klera run flows/login.flow.md --update-baselines

This switches the comparator into “always pass, always overwrite” mode. Run the flow once with --update-baselines, check which PNGs changed with git status, and commit them alongside the prose or YAML change that caused the visual update. Reviewers see the design diff in the PR.

You can also regenerate a single baseline by deleting it and rerunning the flow:


rm flows/__baselines__/login/home-after-login.png
klera run flows/login.flow.md
# → step writes a fresh baseline; passes; commit the new PNG

Both approaches are equivalent. --update-baselines is the right choice when many baselines drifted at once (e.g. after a Tailwind upgrade); per-file deletion is more surgical.
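Why the two approaches end in the same place can be seen in a sketch of the per-step decision. BaselineAction and chooseAction are hypothetical names that only restate the behaviour described in this section.

```typescript
// Sketch only: hypothetical names restating the behaviour described above.

type BaselineAction = "write-baseline" | "compare";

/** Decides what a visualSnapshot step does with its capture. */
function chooseAction(baselineExists: boolean, updateBaselines: boolean): BaselineAction {
  // --update-baselines: always pass, always overwrite.
  if (updateBaselines) return "write-baseline";
  // A missing baseline (first run, or deleted by hand) is simply recreated.
  return baselineExists ? "compare" : "write-baseline";
}
```

Deleting one PNG flips baselineExists for that step only, while the flag forces the write branch for every snapshot in the flow.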

CI hygiene

Three rules keep visual regression honest:

Run on consistent hardware. Pin your CI runner to a single simulator + OS version + scale. Different combinations produce different baselines and you’ll chase phantom diffs forever.
Review every baseline change in the PR. Visual diffs are inherently subjective. The HTML report’s triplet view is designed to be skim-readable; treat it like the design-review tool it is.
Set tolerance per-step, not per-flow. Some screens are anti-aliasing-noisy (rich text, gradients); others should be pixel-perfect. Tune per snapshot.


# OK: 0.5% tolerance is plenty for the login screen
- visualSnapshot: { id: login-screen, tolerance: 0.5 }
 
# OK: gradient-heavy hero — bump to 1.5%
- visualSnapshot: { id: hero, tolerance: 1.5 }
 
# OK: pixel-perfect logo — drop to 0.05%
- visualSnapshot: { id: logo-pinned, tolerance: 0.05 }

Choosing what to snapshot

Visual snapshots are cheap to add but expensive to maintain — every one is a PNG that needs reviewing on every UI change. Pick load-bearing moments:

After login — the home screen is what every user sees first.
Around state transitions — empty state, populated state, loading state, error state of the same screen.
Right before a destructive action — confirmation modals, delete-account screens.
Hard-to-test composites — charts, maps, dynamic layouts where the element graph alone can’t express “looks right”.

Skip:

Per-row screenshots in long lists (snapshot one representative row, not every row).
Screens whose copy changes frequently — the diff will be noisy.
Loading states that depend on real network timing — mock the endpoint with delayMs if you must capture them.

A worked example


# yaml-language-server: $schema=../.klera/flow.schema.json
name: Login + home + settings visual coverage
steps:
  - assert: { visible: { testID: login-screen } }
  - visualSnapshot: { id: login-screen, tolerance: 0.5 }
 
  - type:
      into: { testID: login-email }
      value: ${env:E2E_LOGIN_EMAIL}
  - type:
      into: { testID: login-password }
      value: ${secret:E2E_LOGIN_PASSWORD}
  - tap: { testID: login-submit }
 
  - waitForIdle: { animations: true, network: true }
  - visualSnapshot: { id: home-after-login, tolerance: 0.5 }
 
  - tap: { testID: settings-tab }
  - waitForIdle: { animations: true }
  - visualSnapshot:
      id: settings-list
      region: { testID: settings-list }
      tolerance: 0.3

Three baselines, three diffs in the PR if anything moved. The region on the settings list ignores the tab bar (which sometimes animates a frame longer than the quiet window).

Next steps

IR reference — the full schema.
Network mocking — pair with delayMs to capture loading states deterministically.
Fixtures and secrets — ${secret:...} values are scrubbed from JSON and log output, but not from screenshots. Keeping secrets out of PNGs is up to whoever builds the input field (for example, secureTextEntry on password fields).