Performance — input-latency benchmark
The interesting performance question for a terminal isn’t “how fast does it render” — modern GPUs make that a non-question — it’s how much time sits between the user pressing a key and the corresponding glyph landing on screen. That’s a chain: window-server → app event pump → modal dispatch → chord registry → command → pty write → child echo → vt100 → cell grid mutation → atlas paint → wgpu queue → display refresh.
Tmnl ships an in-process bench that times one slice of that chain — the
modal-dispatch-through-command part — so regressions show up under
cargo bench instead of “feels slower.” This page covers the
--bench input-latency subcommand: what it measures, what it doesn’t, the
recorded Apple Silicon baseline, and how to add scenarios.
This is the user-facing distillation. The contributor-facing methodology
doc lives at
docs/latency-bench.md
in the tmnl repo.
Running the bench

    cargo build --release   # important: debug builds are 10-20x slower
    ./target/release/tmnl --bench input-latency

The bench is a binary subcommand, not a cargo bench target. Tmnl is a
binary-only crate; running benches under benches/ would require a
lib + bin split of main.rs (~6.5k LOC) to expose the private modules
the bench reaches into. The --bench subcommand has the same private
access as the rest of main.rs without the refactor.
stderr carries logger output; pipe through 2>/dev/null if you only want
the result table:

    ./target/release/tmnl --bench input-latency 2>/dev/null

The scenarios
v0.1 ships five scenarios. Each builds a fresh headless App
(App::new_headless — no winit window, no wgpu surface), runs 200 warmup
iterations to prime caches, then times 10k measured iterations:
| Scenario | What it exercises |
| --- | --- |
| synthetic_key/a | Cheapest fall-through path — no overlay match, no chord match, delivered to a focused-but-session-less shell pane. |
| synthetic_key/esc | A named non-printable key. Same dispatch shape; verifies the named-key parse path doesn’t allocate disproportionately. |
| synthetic_key/cmd+/ | Chord registry lookup with no match (cmd+/ is unbound by default). Measures the chord lookup cost. |
| synthetic_key/cmd+shift+p | Chord registry lookup with a match: opens the command palette overlay. The cheapest non-trivial dispatch shape. |
| synthetic_type/13B | Per-char dispatch via App::synthetic_type("hello, world!") — 13 bytes through the same dispatcher, useful for measuring per-character cost. |
cmd+t is deliberately skipped: it opens a new tab via openpty, which
fails in non-TTY contexts (piped runs, CI) and dominates the timing with
the failed-spawn cost rather than the dispatch cost the bench is trying
to measure.
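The warmup-then-measure loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/bench.rs: the names measure, summarize, and Stats are hypothetical, and the nearest-rank percentile rule is an assumption, since the page does not say how median and p95 are derived.

```rust
use std::time::{Duration, Instant};

/// Summary columns matching the result table (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Stats {
    pub min: Duration,
    pub median: Duration,
    pub p95: Duration,
    pub max: Duration,
}

/// Reduce raw samples to min / median / p95 / max.
/// Assumption: nearest-rank percentiles; the real bench may use another rule.
pub fn summarize(samples: &mut [Duration]) -> Stats {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_unstable();
    let sorted: &[Duration] = samples;
    let n = sorted.len();
    // Nearest rank: ceil(pct / 100 * n), 1-based, clamped into range.
    let rank = |pct: usize| sorted[((pct * n + 99) / 100).clamp(1, n) - 1];
    Stats {
        min: sorted[0],
        median: rank(50),
        p95: rank(95),
        max: sorted[n - 1],
    }
}

/// Run `warmup` untimed iterations, then time `iters` iterations one by one.
pub fn measure<F: FnMut()>(warmup: usize, iters: usize, mut body: F) -> Stats {
    for _ in 0..warmup {
        body(); // primes caches; result discarded
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        body();
        samples.push(start.elapsed());
    }
    summarize(&mut samples)
}
```

Calling measure(200, 10_000, || { /* dispatch one key */ }) would mirror the iteration counts quoted above. Timing each iteration separately is what makes per-sample percentiles possible, at the cost of including timer overhead in every sample.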
Recorded baseline (Apple Silicon, 2026-06-13)
Numbers from an M-series Mac running macOS 25.4. Median is what to track over time:
| Scenario | min | median | p95 | max |
| --- | ---: | ---: | ---: | ---: |
| synthetic_key/a | 0 ns | 83 ns | 84 ns | 9.1 µs |
| synthetic_key/esc | 0 ns | 42 ns | 42 ns | 1.5 µs |
| synthetic_key/cmd+/ | 41 ns | 84 ns | 125 ns | 583 ns |
| synthetic_key/cmd+shift+p | 41 ns | 125 ns | 166 ns | 38 µs |
| synthetic_type/13B | 291 ns | 333 ns | 417 ns | 35 µs |
/a and /esc are the lower bound on any input event — both fall
through to a no-session shell pane after the modal chain. cmd+shift+p
at 125 ns median is a fair stand-in for “real” cost per keystroke: chord
matched, palette overlay state mutated, returned. synthetic_type/13B
works out to ~26 ns per char (333 ns median / 13 chars), which is useful when
measuring text-entry-heavy flows like cmd+L paste.
The max column is always dominated by OS scheduler / page-fault noise.
Ignore it unless every scenario regresses simultaneously (in which case
the host is overloaded, not tmnl).
What’s measured
App::synthetic_key(spec) runs through:
- Parse the spec into a winit Key + ModifiersState. Character keys use SmolStr and allocate a small bag; that allocation is part of what the bench is measuring.
- Walk the modal-dispatch chain: the welcome / settings / palette / tab-search / find / help overlays each get a shot at consuming the key. Each is an is_none() check until one matches.
- If no overlay matches, run the chord registry against (key, mods). Match → dispatch command. No match → fall through.
- If nothing else consumes it, deliver to the focused pane (or no-op when the pane has no shell session attached, as in the headless bench).
A regression in any modal check, the chord registry, or the command dispatch shows up as a change in the median column.
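The four steps above can be sketched as a toy dispatcher. Everything here is hypothetical and heavily simplified: SketchApp, Key, Mods, Command, and Outcome are stand-ins rather than tmnl's or winit's real types, only two overlays and two modifiers are modelled, and the spec grammar is a guess based on the scenario names.

```rust
use std::collections::HashMap;

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Mods { pub cmd: bool, pub shift: bool }

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Key { Char(char), Named(&'static str) }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Command { OpenPalette }

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome { Overlay(&'static str), Command(Command), Pane, NoOp }

/// Step 1: parse "a", "esc", "cmd+shift+p" into a key plus modifiers.
pub fn parse_spec(spec: &str) -> Option<(Key, Mods)> {
    let mut parts: Vec<&str> = spec.split('+').collect();
    let key_part = parts.pop()?; // last segment is the key itself
    let mut mods = Mods::default();
    for m in parts {
        match m {
            "cmd" => mods.cmd = true,
            "shift" => mods.shift = true,
            _ => return None,
        }
    }
    let key = match key_part {
        "esc" => Key::Named("esc"),
        "enter" => Key::Named("enter"),
        "tab" => Key::Named("tab"),
        s => {
            let mut chars = s.chars();
            let c = chars.next()?;
            if chars.next().is_some() { return None; }
            Key::Char(c)
        }
    };
    Some((key, mods))
}

#[derive(Default)]
pub struct SketchApp {
    pub palette: Option<()>, // Some(..) only while the overlay is open
    pub find: Option<()>,
    pub chords: HashMap<(Key, Mods), Command>,
    pub pane_has_session: bool,
}

impl SketchApp {
    pub fn dispatch(&mut self, key: Key, mods: Mods) -> Outcome {
        // Step 2: each open overlay gets a chance to consume the key.
        if self.palette.is_some() { return Outcome::Overlay("palette"); }
        if self.find.is_some() { return Outcome::Overlay("find"); }
        // Step 3: chord registry lookup.
        if let Some(&cmd) = self.chords.get(&(key, mods)) {
            match cmd { Command::OpenPalette => self.palette = Some(()) }
            return Outcome::Command(cmd);
        }
        // Step 4: deliver to the pane, or no-op when no session is attached.
        if self.pane_has_session { Outcome::Pane } else { Outcome::NoOp }
    }
}
```

In this model the fall-through scenarios pay for every overlay check plus one failed map lookup, which matches the page's description of them as the cheapest path.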
What’s NOT measured (yet)
- GPU render cost. No surface, no atlas paint, no wgpu::Queue flush. A real frame adds another ~80 µs (rough estimate) for the cell pipeline pass at 80×24.
- The window-server hop. Cmd-key chords from a real winit event pump arrive via WindowEvent::KeyboardInput after the OS routes them through CGEvent + the macOS run loop. That’s a measurable fixed overhead per event that the bench can’t reach from a synthetic call.
- Pty round-trip. A focused Shell pane with a real session would write the byte to the pty master fd; the child shell echoes it back; vt100 parses; cell grid mutates. The synthetic pane has no session attached, so this chain is skipped.
- Comparison vs Terminal.app / iTerm2 / Alacritty. That’s the roadmap follow-up. It needs either a screen-recording timestamp comparison (frame-accurate; needs the Accessibility API or a video capture overlay) or an AppleScript / Accessibility hook that fires a known event and timestamps the next display refresh paint.
Adding a scenario
Edit src/bench.rs::run_input_latency. The pattern is:

    let scenarios = [
        Scenario::new("synthetic_key/<NAME>", warmup, iters),
        // …
    ];
    for s in &scenarios {
        let spec = s.name.strip_prefix("synthetic_key/").unwrap_or(s.name);
        run_scenario(s, |app| app.synthetic_key(spec));
    }

synthetic_key/<spec> covers anything App::synthetic_key can parse:
plain chars, named non-printables (esc, enter, tab), modifier
chords (cmd+/, cmd+shift+p). For per-string typing flows,
Scenario::new("synthetic_type/<NAME>", warmup, iters) with
app.synthetic_type(string) sends each char through the same dispatcher.
Pick a scenario name that’s descriptive — it’s printed verbatim in the output table.
Cross-terminal comparison
The --bench input-latency bench answers “did tmnl’s dispatch chain
regress against itself.” It can’t answer “is tmnl visibly snappier
than Terminal.app / iTerm2 / Alacritty / Ghostty” — that takes
measuring the end-to-end input → frame latency, including the
window-server hop and the surface-present timing the in-process bench
deliberately excludes.
The --bench cross-terminal subcommand is the entry point. It’s
manual today — the runner is a methodology print-out plus the
required setup, not an automated comparison. Automating
frame-by-frame counting needs a CGDisplayStream callback wired into
tmnl + a Vision-framework glyph detector for non-tmnl terminals,
neither of which is in tree yet. The pieces that are shipped
produce a publishable number in roughly 30 minutes per terminal.
The three pieces
| Piece | What it does |
| --- | --- |
| tmnl --bench cross-terminal | Prints the methodology + the per-terminal protocol to stdout. Self-contained so you don’t have to context-switch to the docs site mid-measurement. |
| examples/latency_glyph.rs | The platform-agnostic test app. Pure stdin / stdout. Prints █ at row 5 / col 10 on every keystroke, alternating on / off so a frame-by-frame analyst can pair “the key landed THIS frame” with “the glyph appeared THIS frame” in one place. Runs inside whichever terminal you’re measuring — no protocol coupling. |
| scripts/latency-bench-cross-terminal.sh | Interactive walk-through. Builds the release binary, loops over the comparison set, prompts for each run (start recording, launch terminal, press 50 keys, stop), then prompts for the measured median + p95. Writes TSV to /tmp/tmnl-latency-bench-<unix-secs>/results.tsv. |
The pieces are decoupled intentionally. The glyph test app is useful on
its own — cargo run --release --example latency_glyph inside any
terminal exercises the protocol-agnostic measurement target. The
script is the “compare 5 terminals tonight” workflow that builds on
top.
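To make the glyph target concrete, here is a minimal sketch of the bytes such a test app could write on each keystroke. It is not the source of examples/latency_glyph.rs: glyph_frame and Toggler are hypothetical names, and the sketch leaves out the terminal raw-mode setup that a real app needs to receive individual key presses from stdin. It relies only on the standard ANSI cursor-position sequence (ESC [ row ; col H).

```rust
/// Bytes that move the cursor to row 5, column 10 and draw or erase the marker.
pub fn glyph_frame(on: bool) -> String {
    // U+2588 FULL BLOCK when on, a plain space when off.
    let cell = if on { '\u{2588}' } else { ' ' };
    format!("\x1b[5;10H{cell}")
}

/// Alternates the marker on / off, one flip per keystroke.
#[derive(Debug, Default)]
pub struct Toggler {
    on: bool,
}

impl Toggler {
    /// Call once per key press; returns the bytes to write to stdout.
    pub fn on_key(&mut self) -> String {
        self.on = !self.on;
        glyph_frame(self.on)
    }
}
```

A caller would write the returned string to stdout and flush after every key. Because the output is plain escape sequences, the same target works in any terminal under test.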
Running the comparison

    ./scripts/latency-bench-cross-terminal.sh

The script’s default comparison set is Terminal.app, iTerm2,
Alacritty, Ghostty, tmnl — edit the terminals array near the top
to add or remove entries. For each terminal it:
- Prints the launch command and lets you skip.
- Prompts you to start a ≥60 fps screen recording (QuickTime → New Screen Recording, or screencapture -V).
- Launches the test app in the target terminal.
- Waits while you press 50 keys at ~1 Hz so each frame cycle has time to settle.
- Asks you to stop the recording, step through it frame-by-frame (QuickTime: ←/→ arrow keys advance one frame), and enter the measured median + p95.
- Writes the row to the TSV.
The frame-by-frame counting is the eyeball step that automation hasn’t
replaced yet. The procedure: for each keystroke, find the frame
where a key-down cue appears (an OS-level keypress overlay like
KeyCastr, or a side-channel key-cam) and the frame where the █
appears at row 5 / col 10. The latency is
(glyph_frame − key_frame) × (1000 / fps) in ms.
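As a worked version of that conversion, the sketch below turns a pair of frame indices into milliseconds (frame difference × 1000 / fps) and takes the median over a run. The function names are hypothetical, and the sketch assumes the recording has a constant frame rate.

```rust
/// Latency in milliseconds between the key-down frame and the glyph frame.
/// Returns None for a non-positive fps or a glyph that precedes its key.
pub fn latency_ms(key_frame: u64, glyph_frame: u64, fps: f64) -> Option<f64> {
    if fps <= 0.0 || glyph_frame < key_frame {
        return None;
    }
    Some((glyph_frame - key_frame) as f64 * 1000.0 / fps)
}

/// Median of the per-keystroke latencies (mean of the middle pair when even).
pub fn median_ms(mut xs: Vec<f64>) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    xs.sort_by(|a, b| a.total_cmp(b));
    let mid = xs.len() / 2;
    Some(if xs.len() % 2 == 1 {
        xs[mid]
    } else {
        (xs[mid - 1] + xs[mid]) / 2.0
    })
}
```

At 60 fps each frame is about 16.7 ms, so a three-frame gap reads as 50 ms. That per-frame step is also the resolution floor of the method, so a higher-fps recording gives finer-grained numbers.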
What this measures vs the in-process bench
The in-process bench measures one slice of the chain — modal dispatch through to command dispatch — in nanoseconds. The cross-terminal bench measures the whole user-visible chain: window server → app event pump → modal dispatch → command → pty write → child echo → vt100 → cell grid → atlas paint → wgpu queue → display refresh. The two are complementary regression gates:
- A regression in the in-process bench median surfaces as the same scenario crossing a 2× threshold against the baseline table.
- A regression in cross-terminal latency surfaces as tmnl’s measured end-to-end median crossing a ~20% threshold against its own prior measurement on the same hardware.
The first is automatable today and runs under
./target/release/tmnl --bench input-latency. The second is the
“every-major-release” gate; the procedure produces a one-off table
that’s the v0.2 baseline until the automation lands.
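The two gates can be written down as simple predicates. This is only a sketch of the thresholds quoted above (2× on the in-process median, roughly 20% on the end-to-end median); the function names are hypothetical, and whether the boundary value itself counts as a regression is an assumption.

```rust
/// In-process gate: the scenario's median has at least doubled.
/// Assumption: exactly 2x counts as crossing the threshold.
pub fn in_process_regressed(baseline_ns: u64, current_ns: u64) -> bool {
    baseline_ns > 0 && current_ns >= baseline_ns.saturating_mul(2)
}

/// Cross-terminal gate: the end-to-end median grew by more than ~20%
/// against tmnl's own prior measurement on the same hardware.
pub fn cross_terminal_regressed(baseline_ms: f64, current_ms: f64) -> bool {
    baseline_ms > 0.0 && current_ms > baseline_ms * 1.20
}
```

For example, against the recorded 83 ns baseline for synthetic_key/a, a 125 ns median would not trip the first gate while 166 ns would.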
The contributor-facing doc at
docs/latency-bench.md
covers the v0.2 procedure end-to-end plus the v0.3 automation
roadmap (CGDisplayStream + AppleScript keystroke driver + Vision
framework glyph detection). The manual page above is the user-facing
distillation; for “I want to add a v0.3 piece” you want the
contributor doc.
Watching for regressions
Compare two runs by redirecting each to its own file and diff-ing:

    ./target/release/tmnl --bench input-latency 2>/dev/null > /tmp/before.txt
    # … make a change …
    cargo build --release
    ./target/release/tmnl --bench input-latency 2>/dev/null > /tmp/after.txt
    diff -u /tmp/before.txt /tmp/after.txt

The numbers are noisy enough that a single-digit-nanosecond shift on any one scenario isn’t signal. Look for ≥ 2× shifts on the median column; those are the changes worth investigating.
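Since diff -u flags every changed digit, a small helper that compares only the medians can be handy. The sketch below assumes the bench prints rows shaped like the baseline table on this page (pipe-separated, five columns, values such as 83 ns or 9.1 µs); that output format is an assumption, and all function names are hypothetical.

```rust
use std::collections::HashMap;

/// Convert a cell such as "83 ns" or "1.5 µs" into nanoseconds.
pub fn parse_ns(cell: &str) -> Option<f64> {
    let (value, unit) = cell.trim().split_once(' ')?;
    let value: f64 = value.parse().ok()?;
    let scale = match unit.trim() {
        "ns" => 1.0,
        // Micro sign (U+00B5), Greek mu (U+03BC), or ASCII fallback.
        "\u{b5}s" | "\u{3bc}s" | "us" => 1_000.0,
        "ms" => 1_000_000.0,
        _ => return None,
    };
    Some(value * scale)
}

/// Extract (scenario, median in ns) from one table row.
/// Header and separator rows yield None because their cells do not parse.
pub fn median_of_row(row: &str) -> Option<(String, f64)> {
    let cells: Vec<&str> = row
        .trim()
        .trim_matches('|')
        .split('|')
        .map(str::trim)
        .collect();
    if cells.len() != 5 {
        return None;
    }
    Some((cells[0].to_string(), parse_ns(cells[2])?))
}

/// For each scenario present in both runs, return after / before for the median.
pub fn median_ratios(before: &str, after: &str) -> Vec<(String, f64)> {
    let old: HashMap<String, f64> = before.lines().filter_map(median_of_row).collect();
    after
        .lines()
        .filter_map(median_of_row)
        .filter_map(|(name, new)| {
            let prev = *old.get(&name)?;
            if prev > 0.0 { Some((name, new / prev)) } else { None }
        })
        .collect()
}
```

Feeding it the contents of /tmp/before.txt and /tmp/after.txt and keeping the entries with a ratio of 2.0 or more reproduces the rule of thumb above without eyeballing the diff.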
Where to go next
- docs/latency-bench.md — the contributor-facing methodology doc; covers the deeper rationale, why certain scenarios were excluded, and the cross-terminal-comparison follow-up plan.
- src/bench.rs — the bench implementation. ~130 lines, including the Scenario / run_scenario / headless_app helpers and the actual run_input_latency entry point.
- SDK — building a backing app — backing apps see input events delivered through the same dispatcher being measured here, so the cost ceiling for “Hover events delivered to my native client” starts from these numbers (plus the SDK’s read_message + poll overhead).
- FEATURES.md — the shipped-feature inventory.