Performance — input-latency benchmark

The interesting performance question for a terminal isn’t “how fast does it render” — modern GPUs make that a non-question — it’s how much time sits between the user pressing a key and the corresponding glyph landing on screen. That’s a chain: window-server → app event pump → modal dispatch → chord registry → command → pty write → child echo → vt100 → cell grid mutation → atlas paint → wgpu queue → display refresh.

Tmnl ships an in-process bench that times one slice of that chain (the modal-dispatch-through-command part) so regressions show up as numbers instead of “feels slower.” This page covers the --bench input-latency subcommand: what it measures, what it doesn’t, the recorded Apple Silicon baseline, and how to add scenarios.

This is the user-facing distillation. The contributor-facing methodology doc lives at docs/latency-bench.md in the tmnl repo.

Running the bench

cargo build --release          # important — debug builds are 10-20x slower
./target/release/tmnl --bench input-latency

The bench is a binary subcommand, not a cargo bench target. Tmnl is a binary-only crate; running benches under benches/ would require a lib + bin split of main.rs (~6.5k LOC) to expose the private modules the bench reaches into. The --bench subcommand has the same private access as the rest of main.rs without the refactor.

stderr carries logger output; redirect it with 2>/dev/null if you only want the result table:

./target/release/tmnl --bench input-latency 2>/dev/null

The scenarios

v0.1 ships five scenarios. Each builds a fresh headless App (App::new_headless — no winit window, no wgpu surface), runs 200 warmup iterations to prime caches, then times 10k measured iterations:

| Scenario | What it exercises |
| --- | --- |
| synthetic_key/a | Cheapest fall-through path: no overlay match, no chord match, delivered to a focused-but-session-less shell pane. |
| synthetic_key/esc | A named non-printable key. Same dispatch shape; verifies the named-key parse path doesn’t allocate disproportionately. |
| synthetic_key/cmd+/ | Chord registry hit with no match (cmd+/ is unbound by default). Measures the chord lookup cost. |
| synthetic_key/cmd+shift+p | Chord registry hit + match: opens the command palette overlay. The cheapest non-trivial dispatch shape. |
| synthetic_type/13B | Per-char dispatch via App::synthetic_type("hello, world!"): 13 bytes through the same dispatcher, useful for measuring per-character cost. |

cmd+t is deliberately skipped: it opens a new tab via openpty, which fails in non-TTY contexts (piped runs, CI) and dominates the timing with the failed-spawn cost rather than the dispatch cost the bench is trying to measure.
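The warmup-then-measure shape described above can be sketched in a few lines. This is a minimal illustration, not the code in src/bench.rs: the names `measure`, `summarize`, and `Summary` are hypothetical, the workload is a stand-in counter rather than a headless App, and the nearest-rank percentile is one reasonable choice among several.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Columns of the result table, in nanoseconds (hypothetical type).
#[derive(Debug, PartialEq)]
struct Summary {
    min: u64,
    median: u64,
    p95: u64,
    max: u64,
}

/// Run `op` `warmup` times untimed, then time `iters` calls one by one.
fn measure<F: FnMut()>(warmup: usize, iters: usize, mut op: F) -> Vec<u64> {
    for _ in 0..warmup {
        op();
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        op();
        samples.push(start.elapsed().as_nanos() as u64);
    }
    samples
}

/// Nearest-rank percentiles over the sorted samples; panics on empty input.
fn summarize(mut samples: Vec<u64>) -> Summary {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_unstable();
    let n = samples.len();
    // Rank = ceil(pct * n / 100), 1-based, computed in integers.
    let at = |pct: usize| samples[((pct * n + 99) / 100).clamp(1, n) - 1];
    Summary {
        min: samples[0],
        median: at(50),
        p95: at(95),
        max: samples[n - 1],
    }
}

fn main() {
    // Stand-in workload; the real bench dispatches a key through a headless App.
    let mut counter = 0u64;
    let samples = measure(200, 10_000, || {
        counter = black_box(counter.wrapping_add(1));
    });
    println!("{:?}", summarize(samples));
}
```

Timing each call separately is what makes a per-iteration median and p95 possible; the cost is that very cheap operations sit close to the clock's resolution.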

Recorded baseline (Apple Silicon, 2026-06-13)

Numbers from an M-series Mac running macOS 25.4. Median is what to track over time:

| Scenario | min | median | p95 | max |
| --- | ---: | ---: | ---: | ---: |
| synthetic_key/a | 0 ns | 83 ns | 84 ns | 9.1 µs |
| synthetic_key/esc | 0 ns | 42 ns | 42 ns | 1.5 µs |
| synthetic_key/cmd+/ | 41 ns | 84 ns | 125 ns | 583 ns |
| synthetic_key/cmd+shift+p | 41 ns | 125 ns | 166 ns | 38 µs |
| synthetic_type/13B | 291 ns | 333 ns | 417 ns | 35 µs |

/a and /esc are the lower bound on any input event — both fall through to a no-session shell pane after the modal chain. cmd+shift+p at 125 ns median is a fair stand-in for “real” cost per keystroke: chord matched, palette overlay state mutated, returned. synthetic_type/13B divides to ~26 ns per char — useful when measuring text-entry-heavy flows like cmd+L paste.

The max column is always dominated by OS scheduler / page-fault noise. Ignore it unless every scenario regresses simultaneously (in which case the host is overloaded, not tmnl).
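Two pieces of arithmetic help when reading the table. The first is the per-character division quoted above. The second is an observation rather than something the bench reports: the sub-microsecond values all sit near multiples of about 41.7 ns, which would be consistent with a 24 MHz monotonic clock. The tick rate below is an assumption, and `per_char_ns` and `nearest_ticks` are hypothetical helper names.

```rust
/// Per-character cost from a whole-string median.
fn per_char_ns(total_ns: f64, chars: usize) -> f64 {
    total_ns / chars as f64
}

/// Nearest whole number of clock ticks for a reported duration.
fn nearest_ticks(ns: f64, tick_ns: f64) -> u64 {
    (ns / tick_ns).round() as u64
}

fn main() {
    // Assumption: a 24 MHz timebase, i.e. 1000 / 24 ns per tick.
    let tick_ns = 1000.0 / 24.0;

    let chars = "hello, world!".len(); // 13 bytes
    println!("per char: {:.1} ns", per_char_ns(333.0, chars));

    // Sub-microsecond values taken from the baseline table.
    for ns in [42.0, 83.0, 125.0, 166.0, 291.0, 333.0, 417.0, 583.0] {
        let ticks = nearest_ticks(ns, tick_ns);
        println!("{ns:>5} ns ~ {ticks:>2} ticks ({:.1} ns)", ticks as f64 * tick_ns);
    }
}
```

If that assumption holds, a median of 42 ns means "one tick" and a min of 0 ns means "finished within the same tick", so differences smaller than one tick are below what a single sample can resolve.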

What’s measured

App::synthetic_key(spec) runs through:

1. Parse the spec into a winit Key + ModifiersState. Character keys use SmolStr and allocate a small bag; that allocation is part of what the bench is measuring.
2. Walk the modal-dispatch chain: the welcome / settings / palette / tab-search / find / help overlays each get a shot at consuming the key. Each is an is_none() check until one matches.
3. If no overlay matches, run the chord registry against (key, mods). Match → dispatch command. No match → fall through.
4. If nothing else consumes it, deliver to the focused pane (or no-op when the pane has no shell session attached, as in the headless bench).
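The steps above can be modelled with a toy dispatcher. Everything here is hypothetical (`ToyApp`, `Outcome`, `Command`, and the string-keyed chord map are illustrations, not tmnl's types), and the spec-parsing step is skipped by using the spec string directly as the chord key.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Command {
    OpenPalette,
}

#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    ConsumedByOverlay(&'static str),
    Dispatched(Command),
    DeliveredToPane,
    NoOp,
}

struct ToyApp {
    /// Overlays in priority order; `None` means the overlay is closed.
    overlays: Vec<(&'static str, Option<()>)>,
    chords: HashMap<&'static str, Command>,
    pane_has_session: bool,
}

impl ToyApp {
    /// Mirrors the headless bench setup: no overlay open, no shell session.
    fn headless() -> Self {
        let names = ["welcome", "settings", "palette", "tab-search", "find", "help"];
        ToyApp {
            overlays: names.iter().map(|&n| (n, None)).collect(),
            chords: HashMap::from([("cmd+shift+p", Command::OpenPalette)]),
            pane_has_session: false,
        }
    }

    fn dispatch(&mut self, spec: &str) -> Outcome {
        // Modal chain: the first open overlay consumes the key.
        for &(name, state) in &self.overlays {
            if state.is_some() {
                return Outcome::ConsumedByOverlay(name);
            }
        }
        // Chord registry: a match dispatches a command.
        if let Some(&cmd) = self.chords.get(spec) {
            match cmd {
                Command::OpenPalette => {
                    if let Some(slot) = self.overlays.iter_mut().find(|s| s.0 == "palette") {
                        slot.1 = Some(());
                    }
                }
            }
            return Outcome::Dispatched(cmd);
        }
        // Fall-through: deliver to the pane, or no-op without a session.
        if self.pane_has_session {
            Outcome::DeliveredToPane
        } else {
            Outcome::NoOp
        }
    }
}

fn main() {
    let mut app = ToyApp::headless();
    for spec in ["a", "cmd+/", "cmd+shift+p", "a"] {
        println!("{spec:<12} -> {:?}", app.dispatch(spec));
    }
}
```

The last call illustrates why ordering matters: once the palette overlay is open, a plain key is consumed by the modal chain before the chord registry or the pane ever sees it.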

A regression in any modal check, the chord registry, or the command dispatch shows up as a change in the median column.

What’s NOT measured (yet)

- GPU render cost. No surface, no atlas paint, no wgpu::Queue flush. A real frame adds another ~80 µs (rough estimate) for the cell pipeline pass at 80×24.
- The window-server hop. Cmd-key chords from a real winit event pump arrive via WindowEvent::KeyboardInput after the OS routes them through CGEvent + the macOS run loop. That’s a measurable fixed overhead per event the bench can’t reach from a synthetic call.
- Pty round-trip. A focused Shell pane with a real session would write the byte to the pty master fd; the child shell echoes it back; vt100 parses; cell grid mutates. The synthetic pane has no session attached, so this chain is skipped.
- Comparison vs Terminal.app / iTerm / Alacritty. That’s the roadmap follow-up: it needs either a screen-recording timestamp comparison (frame-accurate; needs the Accessibility API or a video capture overlay) or an AppleScript / Accessibility hook that fires a known event and timestamps the next display refresh paint.

Adding a scenario

Edit src/bench.rs::run_input_latency. The pattern is:

let scenarios = [
    Scenario::new("synthetic_key/<NAME>", warmup, iters),
    // …
];
for s in &scenarios {
    let spec = s.name.strip_prefix("synthetic_key/").unwrap_or(s.name);
    run_scenario(s, |app| app.synthetic_key(spec));
}

synthetic_key/<spec> covers anything App::synthetic_key can parse — plain chars, named non-printables (esc, enter, tab), modifier chords (cmd+/, cmd+shift+p). For per-string typing flows, Scenario::new("synthetic_type/<NAME>", warmup, iters) with app.synthetic_type(string) does each char through the same dispatcher.
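The `strip_prefix(...).unwrap_or(...)` step in the pattern is easy to misread, so here is its behaviour in isolation. `spec_for` is a hypothetical helper name; the two standard-library calls are the same ones the pattern uses.

```rust
/// Derive the key spec from a scenario name, as the pattern above does.
fn spec_for(name: &str) -> &str {
    // `strip_prefix` returns None when the prefix is absent, so the
    // fallback hands back the whole name unchanged.
    name.strip_prefix("synthetic_key/").unwrap_or(name)
}

fn main() {
    for name in ["synthetic_key/a", "synthetic_key/cmd+shift+p", "synthetic_type/13B"] {
        println!("{name:<28} -> {:?}", spec_for(name));
    }
}
```

A name without the synthetic_key/ prefix falls through untouched, which is presumably why a synthetic_type scenario passes its string to app.synthetic_type explicitly rather than deriving it from the name.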

Pick a scenario name that’s descriptive — it’s printed verbatim in the output table.

Cross-terminal comparison

The --bench input-latency bench answers “did tmnl’s dispatch chain regress against itself.” It can’t answer “is tmnl visibly snappier than Terminal.app / iTerm2 / Alacritty / Ghostty” — that takes measuring the end-to-end input → frame latency, including the window-server hop and the surface-present timing the in-process bench deliberately excludes.

The --bench cross-terminal subcommand is the entry point. It’s manual today — the runner is a methodology print-out plus the required setup, not an automated comparison. Automating frame-by-frame counting needs a CGDisplayStream callback wired into tmnl + a Vision-framework glyph detector for non-tmnl terminals, neither of which is in tree yet. The pieces that are shipped produce a publishable number in roughly 30 minutes per terminal.

The three pieces

| Piece | What it does |
| --- | --- |
| tmnl --bench cross-terminal | Prints the methodology + the per-terminal protocol to stdout. Self-contained so you don’t have to context-switch to the docs site mid-measurement. |
| examples/latency_glyph.rs | The platform-agnostic test app. Pure stdin / stdout. Prints █ at row 5 / col 10 on every keystroke, alternating on / off so a frame-by-frame analyst can pair “the key landed THIS frame” with “the glyph appeared THIS frame” in one place. Runs inside whichever terminal you’re measuring, with no protocol coupling. |
| scripts/latency-bench-cross-terminal.sh | Interactive walk-through. Builds the release binary, loops over the comparison set, prompts for each run (start recording, launch terminal, press 50 keys, stop), then prompts for the measured median + p95. Writes TSV to /tmp/tmnl-latency-bench-<unix-secs>/results.tsv. |

The pieces decouple intentionally. The glyph test app is useful on its own — cargo run --release --example latency_glyph inside any terminal exercises the protocol-agnostic measurement target. The script is the “compare 5 terminals tonight” workflow that builds on top.
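To make the measurement target concrete, here is a sketch of the output side of such a glyph app. It is not the contents of examples/latency_glyph.rs: `glyph_frame` is a hypothetical name, the keystrokes are simulated by a loop, and a real version would read stdin in raw mode. The only protocol used is the standard ANSI cursor-position sequence (ESC [ row ; col H).

```rust
use std::io::Write;

/// Escape sequence that moves the cursor to (row, col), 1-based, and
/// draws either a full block or a space there.
fn glyph_frame(row: u16, col: u16, on: bool) -> String {
    let cell = if on { '█' } else { ' ' };
    format!("\x1b[{row};{col}H{cell}")
}

fn main() {
    let mut out = std::io::stdout().lock();
    let mut on = false;
    // Stand-in for "one iteration per keystroke".
    for _ in 0..4 {
        on = !on;
        write!(out, "{}", glyph_frame(5, 10, on)).unwrap();
        // Flush immediately so buffering adds no delay of its own.
        out.flush().unwrap();
    }
}
```

Alternating between the block and a space means every keystroke changes the cell, so each press produces a visible transition in the recording.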

Running the comparison

./scripts/latency-bench-cross-terminal.sh

The script’s default comparison set is Terminal.app, iTerm2, Alacritty, Ghostty, tmnl — edit the terminals array near the top to add or remove entries. For each terminal it:

1. Prints the launch command and lets you skip.
2. Prompts you to start a ≥60 fps screen recording (QuickTime → New Screen Recording, or screencapture -V).
3. Launches the test app in the target terminal.
4. Waits while you press 50 keys at ~1 Hz so each frame cycle has time to settle.
5. Asks you to stop the recording, step through it frame-by-frame (QuickTime: ←/→ arrow keys advance one frame), and enter the measured median + p95.
6. Writes the row to the TSV.

Frame-by-frame counting is the eyeball step that automation hasn’t replaced yet. The procedure: for each keystroke, find the frame where a key-down cue appears (an OS-level keypress overlay like KeyCastr, or a side-channel key-cam) and the frame where the █ appears at row 5 / col 10. The latency in milliseconds is (glyph_frame − key_frame) × (1000 / fps).
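As a worked version of that conversion, the sketch below turns a pair of frame indices into milliseconds, with the factor of 1000 made explicit because 1 / fps is a duration in seconds. The function names are hypothetical.

```rust
/// Latency in milliseconds between the key-down frame and the glyph frame.
fn latency_ms(key_frame: u32, glyph_frame: u32, fps: f64) -> f64 {
    assert!(glyph_frame >= key_frame, "glyph cannot precede the key press");
    f64::from(glyph_frame - key_frame) * 1000.0 / fps
}

/// Duration of one recorded frame, i.e. the best resolution the method has.
fn frame_ms(fps: f64) -> f64 {
    1000.0 / fps
}

fn main() {
    // Example: key cue on frame 100, glyph on frame 103, 60 fps recording.
    println!("latency: {:.1} ms", latency_ms(100, 103, 60.0));
    println!("resolution: {:.2} ms per frame", frame_ms(60.0));
}
```

At 60 fps each frame is about 16.7 ms, so individual readings are quantized to that step; the median over 50 keystrokes is what smooths it out.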

What this measures vs the in-process bench

The in-process bench measures one slice of the chain, modal dispatch through to command dispatch, in nanoseconds. The cross-terminal bench measures the whole user-visible chain: window server → app event pump → modal dispatch → chord registry → command → pty write → child echo → vt100 → cell grid → atlas paint → wgpu queue → display refresh. The two are complementary regression gates:

- A regression in the in-process bench median surfaces as the same scenario crossing a 2× threshold against the baseline table.
- A regression in cross-terminal latency surfaces as tmnl’s measured end-to-end median crossing a ~20% threshold against its own prior measurement on the same hardware.

The first is automatable today and runs under ./target/release/tmnl --bench input-latency. The second is the “every-major-release” gate; the procedure produces a one-off table that’s the v0.2 baseline until the automation lands.
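The two gates reduce to the same comparison with different factors. A minimal sketch, assuming "crossing" means reaching or exceeding the threshold; `crossed` and the two constants are hypothetical names, and the "new run" values are made up for illustration.

```rust
/// True when `current` has reached `factor` times `baseline`.
fn crossed(baseline: f64, current: f64, factor: f64) -> bool {
    current >= baseline * factor
}

/// In-process gate: median at 2x the baseline table.
const IN_PROCESS_FACTOR: f64 = 2.0;
/// Cross-terminal gate: median roughly 20% above the prior measurement.
const CROSS_TERMINAL_FACTOR: f64 = 1.2;

fn main() {
    // (scenario, baseline median in ns from the table, made-up new median in ns)
    let runs = [
        ("synthetic_key/a", 83.0, 84.0),
        ("synthetic_key/cmd+shift+p", 125.0, 292.0),
    ];
    for (name, base, now) in runs {
        let flag = if crossed(base, now, IN_PROCESS_FACTOR) { "REGRESSED" } else { "ok" };
        println!("{name:<28} {base:>6} -> {now:>6} ns  {flag}");
    }

    // End-to-end medians in ms; both numbers are invented for the example.
    let regressed = crossed(50.0, 58.0, CROSS_TERMINAL_FACTOR);
    println!("cross-terminal regressed: {regressed}");
}
```

The wide 2× factor on the nanosecond gate reflects how noisy individual medians are at that scale, while the millisecond gate can afford a tighter margin.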

The contributor-facing doc at docs/latency-bench.md covers the v0.2 procedure end-to-end plus the v0.3 automation roadmap (CGDisplayStream + AppleScript keystroke driver + Vision framework glyph detection). This manual page is the user-facing distillation; for “I want to add a v0.3 piece” you want the contributor doc.

Watching for regressions

Compare two runs by redirecting each to its own file and diffing the two:

./target/release/tmnl --bench input-latency 2>/dev/null > /tmp/before.txt
# … make a change …
cargo build --release
./target/release/tmnl --bench input-latency 2>/dev/null > /tmp/after.txt
diff -u /tmp/before.txt /tmp/after.txt

The numbers are noisy enough that a single-digit-nanosecond shift on any one scenario isn’t signal. Look for ≥ 2× shifts on the median column — those are the changes worth investigating.
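Because a textual diff flags every changed digit, it can help to turn the cells into numbers and look at the ratio instead. The sketch below assumes the bench prints durations in the same "value unit" form as the baseline table on this page (for example 83 ns or 9.1 µs); that format, and the name `parse_ns`, are assumptions.

```rust
/// Parse a cell such as "83 ns" or "1.5 µs" into nanoseconds.
/// Returns None for anything that does not look like "<number> <unit>".
fn parse_ns(cell: &str) -> Option<f64> {
    let (value, unit) = cell.trim().split_once(' ')?;
    let value: f64 = value.parse().ok()?;
    let scale = match unit.trim() {
        "ns" => 1.0,
        // Accept the micro sign, the Greek letter mu, and plain ASCII.
        "\u{b5}s" | "\u{3bc}s" | "us" => 1_000.0,
        "ms" => 1_000_000.0,
        _ => return None,
    };
    Some(value * scale)
}

fn main() {
    // Median cells as they might appear in before.txt and after.txt;
    // the second value is made up for illustration.
    let before = parse_ns("125 ns").expect("parsable cell");
    let after = parse_ns("1.5 \u{b5}s").expect("parsable cell");
    println!("ratio: {:.1}x", after / before);
}
```

A ratio near 1.0 is noise; a ratio of 2.0 or more on the median is the kind of shift worth investigating.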

Where to go next

- docs/latency-bench.md: the contributor-facing methodology doc; covers the deeper rationale, why certain scenarios were excluded, and the cross-terminal-comparison follow-up plan.
- src/bench.rs: the bench implementation. ~130 lines, including the Scenario / run_scenario / headless_app helpers and the actual run_input_latency entry point.
- SDK (building a backing app): backing apps see input events delivered through the same dispatcher being measured here, so the cost ceiling for “Hover events delivered to my native client” starts from these numbers (plus the SDK’s read_message + poll overhead).
- FEATURES.md: the shipped-feature inventory.