docs(testing): session 13 plan/inventory + rotate session 14 prompt

- runner-implementation-plan.md: session 13 status section
  (lib/ax.ts primitive shipped, no new spec, coverage stays at 74/76
  = 97% since primitive-only sessions don't move the spec count;
  Phase 0 found debugger detached on dev box which blocked Categories
  A/B/C; pivoted to the PRIORITY DOM unification primitive). Updated
  the "Primitive gaps to flag" entry — DOM/AX loading + traversal
  primitive moved from FLAGGED to LANDED with the consumer list and
  the deliberately-deferred shapes (waitForRenderedSurface registry,
  CSS-querySelector primitive).
- README.md: lib/ax.ts entry in the substrate-primitives note;
  session 13 consumer list (claudeai.ts page-objects + T26).
  Spec count unchanged at 74.
- runner-implementation-followup-prompt.md: rotated for session 14.
  Adds new Category D (call-site migration to waitForAxNode for
  flake reduction) as the PRIORITY shape — doesn't need the
  debugger, builds on session 13's primitive. Carries forward
  Categories A / B / C (still need debugger). Phase 0 must check
  port 9229 BEFORE picking a category. Reading order updated:
  session 13 first.

Co-Authored-By: Claude <claude@anthropic.com>
aaddrick
2026-05-03 23:57:00 -04:00
parent 3d47f33ccb
commit 113329f91f
3 changed files with 454 additions and 407 deletions

@@ -1,178 +1,215 @@
# test-harness runner implementation — session 13 prompt
# test-harness runner implementation — session 14 prompt
This file is meant to be **copied verbatim into a fresh Claude Code
session** as the initial user message. Don't paraphrase it; the
orchestration depends on the exact directives below.
You're picking up after a runner-implementation session that landed 1
new spec (T11_runtime) by way of registering five install-flow
suffixes plus invoking BOTH case-doc-anchored read-side getters across
TWO distinct impl objects (CustomPlugins + LocalPlugins). First cross-
impl-object dual invocation. No primitive change. Coverage 73/76 (96%)
→ 74/76 (97%). Two commits on `docs/compat-matrix` expected (SHAs
new primitive (`lib/ax.ts`) and NO new spec. Session 13 was a pivot:
Phase 0 calibration found the debugger detached on the dev box (port
9229 not listening — Claude was running but Developer → Enable Main
Process Debugger had not been clicked), which blocked Categories A
(operon-mode navigation probe) and C (schema-rev for
`listRemotePluginsPage` / `listSkillFiles`) — both need runtime
probing against a debugger-attached running Claude. Category B (Tier
3 read-only reframes) ALSO effectively needed the debugger for the
smoke-test investigation phase. Session 13 pivoted to the
PRIORITY-flagged DOM unification primitive, which was tractable
without the debugger because both consumer signals existed
statically: `claudeai.ts` had a private `snapshotAx`, T26 had a
duplicate inline copy explicitly noted as "premature abstraction at 1
consumer", plus the user reported recurring AX-query flake. Coverage
unchanged at 74/76 (97%) — primitive-only sessions don't move the
spec count. Two commits on `docs/compat-matrix` expected (SHAs
inserted after the test-harness commit lands — the user reviews and
commits at the end of every session):
- TBD — `test(harness): session 12 T11 plugin install runtime`
(Tier 2 reframe; multi-suffix `waitForEipcChannels` over the
install-flow suffixes — `CustomPlugins/installPlugin` (case-doc
:507181) / `uninstallPlugin` / `updatePlugin` /
`listInstalledPlugins` / `LocalPlugins/getPlugins` — plus dual
`invokeEipcChannel` across TWO impl objects:
`CustomPlugins_$_listInstalledPlugins` with `args = [[]]` (empty
`egressAllowedDomains`, T33c pattern) and `LocalPlugins_$_getPlugins`
with `args = []`; passes on KDE-W in 28.8s cold).
- TBD — `test(harness): session 13 lib/ax.ts AX substrate primitive`
(extracts `snapshotAx` from `claudeai.ts` private + T26 inlined
duplicate; adds `waitForAxNode` / `waitForAxNodes` predicate-based
polling helpers; re-exports `RawElement` / `AxNode` /
`axTreeToSnapshot` / `waitForAxTreeStable` from `explore/walker.ts`
so consumers stay inside `lib/`; refactors `claudeai.ts` and T26
to consume the shared substrate).
The plan doc at
[`docs/testing/runner-implementation-plan.md`](runner-implementation-plan.md)
captures the tier classification and execution-time reclassifications.
Its "Status (post-execution)" section is the source of truth for
what's done and what's deferred — read **session 12** first, then
**session 11**, then **session 10**, then **session 9**, then **session
8**, then **session 7**, then **session 6**, then **session 5**, then
**session 4**, then **session 3**, then **session 2**, then **session
1** sub-sections.
what's done and what's deferred — read **session 13** first, then
**session 12**, then **session 11**, then **session 10**, then
**session 9**, then **session 8**, then **session 7**, then **session
6**, then **session 5**, then **session 4**, then **session 3**, then
**session 2**, then **session 1** sub-sections.
This session is a continuation, not a restart. Start by reading the
plan doc's status sections.
### Big new findings from session 12
### Big new findings from session 13
1. **`LocalPlugins` registers 15 methods, `CustomPlugins` 16.**
Smoke-test against the user's debugger-attached running Claude
surfaced the full method list. Cleanly invocable read-sides:
`LocalPlugins.getPlugins()` → array (length 0 on dev box),
`LocalPlugins.getDownloadedRemotePlugins()` → array,
`CustomPlugins.listInstalledPlugins([[]])` → array,
`CustomPlugins.listMarketplaces([[]])` → array (also T33c),
`CustomPlugins.listAvailablePlugins([[]])` → array (also T33c),
`CustomPlugins.getCachedCommands()` → array,
`CustomPlugins.getInstallCounts()` → null,
`CustomPlugins.getAndClearMigrationIssues()` → null,
`CustomPlugins.listLocalOrgPlugins()` → array. Three methods need
pluginId at position 0 but accept any string (not just real plugin
IDs): `getPluginOAuthStatus`, `getPluginCliStatus`,
`getPluginShimOps`. **Two methods need extra args not derivable
from a fresh isolation:** `LocalPlugins.listSkillFiles` (positional
`pluginId` + `skillName`: `[]` rejects, `[cwd]` rejects too,
needs both); `CustomPlugins.listRemotePluginsPage` (positional
`limit: number` at 0 — every smoke-tested arg shape rejected;
schema-rev would resolve this via grep on the `Argument "limit" at
position 0` literal).
2. **Cross-impl-object dual invocation is the strongest Tier 2
pattern** when the case-doc surface spans two interfaces. T11's
install flow involves both `CustomPlugins.*` (the API/marketplace
side that drives install) and `LocalPlugins.*` (the local-fs side
where plugins land). T11_runtime invokes one read-side from each
rather than picking one. Strictly stronger than single-interface
coverage — proves the install plumbing crosses both impls intact.
Mixed-arg-shape fine (one needs `[[]]`, another `[]`); same as
T21's mixed-shape (one returns array, another returns boolean).
3. **The Tier 2 reframe pool is essentially exhausted.** Every Tier 1
fingerprint with a tractable runtime sibling has been promoted.
The remaining deferred items are Tier 3 (login-required write-side
flows), Tier 4 (out of scope), or schema-rev work to unblock the
still-rejecting read-sides surfaced this session
(`listRemotePluginsPage`, `listSkillFiles`).
1. **Pre-existing T16 / T17 / T07 / S25 / S29-S31 flake confirmed
on KDE-W against the unchanged baseline.** Running the full suite
surfaced 12 failures, including T16 (CodeTab.activate: no AX-tree
button with accessibleName="Code" found) and T17. Verified
pre-existing by stashing the session-13 changes and re-running
T16 — same failure. Session 13's primitive doesn't fix the existing
flake; it lays groundwork. Future sessions can build flake-
reduction patches against `lib/ax.ts`'s `waitForAxNode` (e.g.
promote `activateTab`'s one-shot snapshot to a proper retry, or
give T07's CSS-querySelector poll a more durable wait shape if
that abstraction emerges).
2. **`lib/ax.ts` is the new shared AX-tree substrate.** Surface:
- `snapshotAx(inspector, opts)` — single AX read with the
stability gate. `opts.fast` skips the gate for inside-poll
callers (matches the existing `claudeai.ts`/T26 contract).
- `waitForAxNode(inspector, predicate, opts)` — repeatedly
snapshot the tree and return the first matching `RawElement`,
null on timeout. Gates on stability once at the start
(configurable), then iterates with `fast: true`. Built against
the inline polling loops in `CodeTab.activate`, `openPill`,
`clickMenuItem`, T26 pre/post-click anchor scans — but the
existing call-sites are NOT migrated this session (their per-
spec retry budgets are tuned and changing them speculatively
risks flake). Future call-site migrations are tractable.
- `waitForAxNodes(inspector, predicate, opts)` — same shape,
returns every match. For consumers that want to enumerate.
- Re-exports: `RawElement`, `AxNode`, `axTreeToSnapshot`,
`waitForAxTreeStable` — so consumers stay inside `lib/`
instead of reaching into `explore/walker.ts` directly.
3. **The debugger-attachment precondition is binding.** Sessions 9
through 12 did extensive runtime probing of the per-wc IPC
registry against the user's debugger-attached Claude. Without
that probing, Categories A / B / C in this prompt are blocked at
the smoke-test phase. If the user hasn't clicked Developer →
Enable Main Process Debugger before the session starts, port 9229
is closed and the categories pivot to either documentation work
or the call-site-migration shape that doesn't need runtime
probing. Phase 0 must check `ss -tln | grep ':9229'` (or `curl
--max-time 2 http://127.0.0.1:9229/json`) before fanning out.
4. **The reframe pool remains essentially exhausted.** Same status
as session 12 — every Tier 1 fingerprint with a tractable runtime
sibling has been promoted. The remaining options are now: (a)
call-site migration to `waitForAxNode` for flake reduction, (b)
operon-mode navigation probe (still needs debugger), (c) schema-
rev for `listRemotePluginsPage` / `listSkillFiles` (still needs
debugger), (d) Tier 3 read-only reframes (most need user-account
state). The natural next-session shape is (a) — flake reduction
builds on session 13's primitive and doesn't need the debugger.
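
The polling shape described for `waitForAxNode` in finding 2 (gate on stability once, then iterate with `fast: true`, null on timeout) can be sketched as below. This is a minimal illustration, not the code in `lib/ax.ts`: the `SnapshotFn` callback stands in for `snapshotAx(inspector, opts)`, the flat-array tree shape is an assumption, and the `timeoutMs` / `intervalMs` / `gateOnStability` option names are hypothetical.

```typescript
// Hypothetical stand-ins: the real RawElement and inspector types live in
// explore/walker.ts and lib/ax.ts and are not reproduced in this document.
interface RawElement {
	role: string;
	accessibleName: string;
}

interface SnapshotOpts {
	fast?: boolean; // skip the stability gate (inside-poll callers)
}

// Stand-in for snapshotAx(inspector, opts); a flat array is assumed.
type SnapshotFn = (opts?: SnapshotOpts) => Promise<RawElement[]>;

interface WaitOpts {
	timeoutMs?: number; // assumed option name
	intervalMs?: number; // assumed option name
	gateOnStability?: boolean; // assumed option name
}

const sleep = (ms: number) =>
	new Promise<void>((resolve) => setTimeout(resolve, ms));

// Gate on stability for the first read only, then poll with fast
// snapshots until the predicate matches or the budget runs out.
async function waitForAxNodeSketch(
	snapshot: SnapshotFn,
	predicate: (el: RawElement) => boolean,
	opts: WaitOpts = {},
): Promise<RawElement | null> {
	const { timeoutMs = 5000, intervalMs = 100, gateOnStability = true } = opts;
	const deadline = Date.now() + timeoutMs;
	let fast = !gateOnStability;
	for (;;) {
		const tree = await snapshot({ fast });
		fast = true; // later reads skip the gate
		const hit = tree.find(predicate);
		if (hit) return hit;
		if (Date.now() >= deadline) return null; // null on timeout
		await sleep(intervalMs);
	}
}
```

A `waitForAxNodes` variant would presumably swap `find` for `filter` and return once the filtered list is non-empty; the source only says it has the "same shape, returns every match".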
### Authoritative reference
Read these in order before fanning out:
- [`docs/testing/runner-implementation-plan.md`](runner-implementation-plan.md)
— tier classification + status section. Read **session 12**, then
**session 11**, **session 10**, **session 9**, **session 8**,
**session 7**, **session 6**, **session 5**, **session 4**, **session
3**, **session 2**, then **session 1** "Status (post-execution)"
sub-sections. The Tier-3 list (search for "## Tier 3") is the
candidate pool for any further reframes.
— tier classification + status section. Read **session 13**, then
**session 12**, then **session 11**, **session 10**, **session 9**,
**session 8**, **session 7**, **session 6**, **session 5**, **session
4**, **session 3**, **session 2**, then **session 1** "Status (post-
execution)" sub-sections. The Tier-3 list (search for "## Tier 3")
is the candidate pool for any further reframes.
- [`tools/test-harness/README.md`](../../tools/test-harness/README.md)
— runner conventions, the now-74-spec inventory, primitives in
`lib/`, isolation defaults, the CDP-gate workaround, the eipc
note (covers registry walk, renderer-wrapper invocation, the
schema-rev pattern from session 9, the foundational-getAll
pattern from session 10, the dual-case-doc-anchored-read-side
pattern from session 11, and the cross-impl-object dual
invocation pattern from session 12).
note, and the new `lib/ax.ts` substrate (session 13 addition;
consumer list is `claudeai.ts` page-objects + T26).
- [`docs/testing/cases/README.md`](cases/README.md) — case-doc
structure and the four anchor scopes.
- [`tools/test-harness/src/lib/`](../../tools/test-harness/src/lib/)
— the existing primitives. No session 12 additions; surface remains
the session 8 shape (`getEipcChannels` / `findEipcChannel` /
`findEipcChannels` / `waitForEipcChannel` / `waitForEipcChannels` /
`invokeEipcChannel` on `lib/eipc.ts`).
— the existing primitives. Session 13 added `lib/ax.ts`; surface
is `snapshotAx` / `waitForAxNode` / `waitForAxNodes` plus re-
exports of `RawElement` / `AxNode` / `axTreeToSnapshot` /
`waitForAxTreeStable`. The session 8 eipc surface
(`getEipcChannels` / `findEipcChannel` / `findEipcChannels` /
`waitForEipcChannel` / `waitForEipcChannels` / `invokeEipcChannel`
on `lib/eipc.ts`) is unchanged.
- [`tools/test-harness/eipc-registry-probe.ts`](../../tools/test-harness/eipc-registry-probe.ts)
— the session 7 read-only registry probe. Re-run against a
debugger-attached Claude (`Developer → Enable Main Process
Debugger` from the menu) to capture the current registry shape.
Session 12 used a small one-off smoke-test in the test-harness
dir (`localplugins-smoke.ts` clones the InspectorClient
connection pattern from eipc-registry-probe.ts, dumps full
method lists for plugin-related interfaces, runs N candidate
read-sides through M arg shapes, reports `[OK]` / `[REJ]` per
probe; deleted after).
Sessions 11 / 12 used small one-off smoke-tests in the test-
harness dir that clone the InspectorClient connection pattern
and run N candidate read-sides through M arg shapes; deleted
after.
- [`tools/test-harness/src/runners/`](../../tools/test-harness/src/runners/)
— every existing spec is a template. Notable session 12 templates:
- `T11_runtime.spec.ts` — multi-suffix `waitForEipcChannels` over
install-flow suffixes + dual `invokeEipcChannel` across TWO impl
objects (CustomPlugins + LocalPlugins). Pattern for any case-doc
test whose surface spans two interfaces — invoke a read-side from
each rather than picking one.
— every existing spec is a template. Notable session 13
candidates for follow-up:
- `T26_routines_page_renders.spec.ts` — first consumer of
`lib/ax.ts`'s exported `snapshotAx` (refactored from inline).
Other AX-using specs (T16, T17, H05) still call through
`claudeai.ts` page-objects which use the shared substrate
transparently.
- [`docs/testing/cases/*.md`](cases/) — the spec each runner
asserts. The **Code anchors:** field tells you exactly where
upstream implements the feature.
### Tests in scope this session
**Realistic ceiling: ~1 new spec OR one investigation + maybe a
narrowly-scoped Tier 2 / schema-rev landing.** Sessions 9-12 each
landed 1-2 specs. With coverage at 74/76, the test budget naturally
shifts toward investigation, schema-rev for still-rejecting read-
sides, or operon-mode probing. Session 13's main bet should aim for
1 spec OR one substantive investigation deliverable.
**Realistic ceiling: ~1 new spec OR one substantive flake-reduction
deliverable OR one investigation.** Sessions 9-12 each landed 1-2
specs; session 13 landed only a primitive (debugger blocked).
Coverage at 74/76 means the test budget naturally shifts toward
either (a) flake reduction against `lib/ax.ts`'s primitive, (b)
investigation that requires the debugger and was deferred from
sessions 12-13, or (c) Tier 3 read-only reframes that the harness
can construct from existing `seedFromHost` state.
#### **PRIORITY: Unify DOM loading + traversal primitives.** Take
this on first if budget allows — the user is reporting a real,
recurring flake: tests fail because they aren't waiting long enough
for the DOM to render, AX-tree queries fire before the relevant
subtree is mounted, and each spec picks its own `retryUntil` budget.
Existing wait primitives are scattered: `electron.ts:waitForReady('userLoaded')`
(post-login URL transition), `claudeai.ts` page-objects (each rolls
its own `retryUntil` for AX lookups), `eipc.ts:waitForEipcChannel`
(handler registration). No unified "wait for surface rendered"
primitive exists. Proposed shape is **`lib/dom-ready.ts`** with
`waitForAxNode` / `waitForAxTreeStable` / `waitForRenderedSurface`
helpers — see plan-doc "Primitive gaps to flag" → "Unified DOM/AX
loading + traversal primitive" for the full proposal. Pre-work:
audit per-spec `retryUntil` budgets and AX-query sites in
`claudeai.ts` + flaky test runners to identify the 3-5 most-flaky
callsites; build the primitive against those specifically (not
speculatively). Threshold-driven extraction, same way `eipc.ts` /
`input.ts` / `electron-mocks.ts` came out of consumer pressure
rather than design-up-front. **If this primitive is what session
13 ships, that's a strictly higher-impact outcome than another
Tier 2 / Tier 3 reframe — flake reduction touches every existing
AX-using spec (T07, T16, T17, T26, H05) and unblocks future
Code-tab AX work.**
**Phase 0 MUST check the debugger BEFORE picking a category.** Run
`ss -tln 2>/dev/null | grep ':9229'` (or
`curl --max-time 2 http://127.0.0.1:9229/json`). If port 9229 is not
listening, Categories A and C are hard-blocked. Pivot to D, or to B
with the caveat that B's smoke-test verification also needs the
debugger.
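
A sketch of the Phase 0 gate, assuming the `ss -tln` output has already been captured as a string. The helper names (`isPortListening`, `triageCategories`) are hypothetical, and the column layout (`State Recv-Q Send-Q Local Peer`) is the usual `ss` default rather than something this doc specifies. The category split follows the table notes: A and C need the debugger, B needs it for smoke-test verification, D does not.

```typescript
type Category = "A" | "B" | "C" | "D";

interface Triage {
	tractable: Category[]; // can be taken on as the main bet
	degraded: Category[]; // partially workable without the debugger
	blocked: Category[]; // hard-blocked without the debugger
}

// True when some LISTEN row has a local address ending in :<port>.
// Stricter than `grep ':9229'`, which would also match the peer column.
function isPortListening(ssOutput: string, port: number): boolean {
	return ssOutput.split("\n").some((line) => {
		const cols = line.trim().split(/\s+/);
		const local = cols[3];
		return (
			cols[0] === "LISTEN" &&
			local !== undefined &&
			local.endsWith(`:${port}`)
		);
	});
}

// Map debugger availability to the category split described above.
function triageCategories(debuggerListening: boolean): Triage {
	if (debuggerListening) {
		return { tractable: ["D", "A", "B", "C"], degraded: [], blocked: [] };
	}
	return { tractable: ["D"], degraded: ["B"], blocked: ["A", "C"] };
}
```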
**Category A (operon-mode navigation probe)** is the natural next
step. The other 21 wrapper-exposed operon interfaces remain registry-
unconfirmed; if any URL form recovered from the bundle surfaces
additional handlers, that's a tractable Tier 2 reframe. **Category B
(Tier 3 read-only reframes)** picks the lowest-hanging Tier 3 spec
where a non-destructive read-side might be invocable from a fresh
isolation. **Category C (schema-rev for the rejecting read-sides)**
unblocks `listRemotePluginsPage` or `listSkillFiles` via grep on
the rejection literal — small-scope, useful as a fallback.
#### **PRIORITY: Call-site migration to `lib/ax.ts`'s
`waitForAxNode` for flake reduction.** Session 13 landed the
substrate; this session can promote the inline retry loops in
`claudeai.ts` (`activateTab` is the strongest candidate — it does a
one-shot snapshot with no retry, which is exactly the failure mode
T16 hits). Smaller-scope candidates: `findCompactPills` (one-shot
snapshot, no retry — same shape as `activateTab`), `openPill`'s
post-click while-loop, `clickMenuItem`'s while-loop. Each migration
is a localized refactor; verify by running the affected specs
(T16/T17/T26/H05) and checking pass rate. Don't speculatively
change the budget defaults — match the existing per-spec retry
budgets so the migration is shape-only. **If this is what session
14 ships, that's a strictly higher-impact outcome than another Tier
2 / Tier 3 reframe — flake reduction touches every existing AX-
using spec.** Doesn't need the debugger.
Three categories — pick ONE as the main bet, treat the others as
fallback if the main bet hits an early blocker:
| # | Tests | Source | Notes |
|---|---|---|---|
| **A** operon-mode navigation probe | n/a (investigation) + maybe small Tier 2 reframe | new probe + bundle grep for operon URL routes | Session 10 confirmed `OperonBootstrap.ensure` registers eagerly but the other 21 wrapper-exposed operon interfaces remain registry-unconfirmed. Outputs: either an operon-mode URL form recovered from the bundle (search for `operon`-keyed routes in `claude.ai/...` paths) plus a registry re-probe after navigation, OR a deferral note explaining why operon scope can't be reached without an operon-mode entry. |
| **B** Tier 3 read-only reframes | Pick from the Tier 3 list | T33c / T35b / T37b template + bundle grep | The Tier 3 list is full of login-required flows; some have read-only entry points that the harness CAN construct. Candidates: T22's `getPrChecks` read-side might accept a non-existent PR number / dry-run mode; T15's OAuth surface has read-only state queries. Most need the user-account-scoped state to fail-fast with a clean error rather than a real network roundtrip — investigate first. |
| **C** Schema-rev for `listRemotePluginsPage` / `listSkillFiles` | Bundle grep | session 9 schema-rev pattern | Both methods rejected every smoke-tested arg shape during session 12's investigation. `listRemotePluginsPage` needs `limit: number` at position 0 (rejection: `Argument "limit" at position 0 ...`); `listSkillFiles` needs both `pluginId` and `skillName` (rejection: `Argument "skillName" at position 1 ...`). Bundle-grep on the rejection literals → resolve the schema → ship a narrowly-scoped Tier 2 invocation if it unblocks a case-doc claim. Smaller scope than A or B; useful as a fallback. |
| **D** call-site migration to `waitForAxNode` | `claudeai.ts` page-objects + T26 + future Code-tab AX work | `lib/ax.ts` (session 13 primitive) | The PRIORITY shape this session. Promote `activateTab`'s one-shot snapshot to use `waitForAxNode`; same for `findCompactPills`. Validate by re-running T16 / T17 / T26 / H05 against the migrated form. Doesn't need the debugger. Risk: changing the retry shape can introduce new flake if the budget defaults don't match the existing per-spec tuning — keep migrations shape-only, no budget changes. |
| **A** operon-mode navigation probe | n/a (investigation) + maybe small Tier 2 reframe | new probe + bundle grep for operon URL routes | Session 10 confirmed `OperonBootstrap.ensure` registers eagerly but the other 21 wrapper-exposed operon interfaces remain registry-unconfirmed. Outputs: either an operon-mode URL form recovered from the bundle (search for `operon`-keyed routes in `claude.ai/...` paths) plus a registry re-probe after navigation, OR a deferral note explaining why operon scope can't be reached without an operon-mode entry. **Needs debugger-attached Claude on port 9229.** |
| **B** Tier 3 read-only reframes | Pick from the Tier 3 list | T33c / T35b / T37b template + bundle grep | The Tier 3 list is full of login-required flows; some have read-only entry points that the harness CAN construct. Candidates: T22's `getPrChecks` read-side might accept a non-existent PR number / dry-run mode; T15's OAuth surface has read-only state queries. Most need the user-account-scoped state to fail-fast with a clean error rather than a real network roundtrip — investigate first. **Needs debugger for smoke-test verification.** |
| **C** Schema-rev for `listRemotePluginsPage` / `listSkillFiles` | Bundle grep | session 9 schema-rev pattern | Both methods rejected every smoke-tested arg shape during session 12's investigation. `listRemotePluginsPage` needs `limit: number` at position 0 (rejection: `Argument "limit" at position 0 ...`); `listSkillFiles` needs both `pluginId` and `skillName` (rejection: `Argument "skillName" at position 1 ...`). Bundle-grep on the rejection literals → resolve the schema → ship a narrowly-scoped Tier 2 invocation if it unblocks a case-doc claim. **Needs debugger to verify the recovered schema.** |
If port 9229 is closed, only D is fully tractable. A documentation-
only session that audits the existing AX call-sites and proposes a
migration plan (without shipping) is also acceptable — pre-work for
a future session that DOES land the migration.
#### Category D — call-site migration to `waitForAxNode`
The plan: promote inline AX retry loops in `claudeai.ts` to use
`waitForAxNode` from `lib/ax.ts`.
1. **Audit the call-sites.** `activateTab` does one-shot snapshot,
no retry — direct candidate. `findCompactPills` same. `openPill`
post-click while-loop and `clickMenuItem` while-loop both do
snapshot+filter+sleep — convert to `waitForAxNode` /
`waitForAxNodes` with the existing budget. T26's pre/post-click
`retryUntil` blocks are also direct candidates.
2. **Migrate one call-site at a time.** Run the affected specs after
each migration (T16 / T17 / T26 / H05). Don't migrate all at
once — one bad budget change can cascade across multiple specs.
3. **Don't change the retry budgets.** The existing per-spec timeouts
are tuned (CodeTab.activate uses 5s default but T16 passes 15s);
match them when migrating.
4. **Don't add new functionality.** This is a shape-only refactor.
If a migration reveals a budget that's clearly wrong (e.g.
`activateTab` has NO retry today, which is the T16 failure mode),
that's a small bug-fix the migration corrects — but document it.
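
As an illustration of steps 1, 3 and 4, a shape-only migration of a one-shot lookup might look like the sketch below. Everything here is hypothetical: the real `activateTab` in `claudeai.ts` is not shown in this doc, `pollForNode` is a local stand-in for `waitForAxNode`, and the `Snapshot` callback replaces the inspector plumbing. The 5s default mirrors the `CodeTab.activate` default mentioned in step 3; a caller like T16 would pass its own 15s.

```typescript
// Hypothetical shapes: the real RawElement and page-object code live in
// explore/walker.ts and claudeai.ts and are not reproduced here.
interface RawElement {
	role: string;
	accessibleName: string;
}
type Snapshot = () => Promise<RawElement[]>;

const NOT_FOUND = 'no AX-tree button with accessibleName="Code" found';
const isCodeTab = (el: RawElement): boolean =>
	el.role === "button" && el.accessibleName === "Code";

// Before: one snapshot, no retry. Throws if the tab has not mounted yet.
async function activateTabOneShot(snapshot: Snapshot): Promise<RawElement> {
	const hit = (await snapshot()).find(isCodeTab);
	if (!hit) throw new Error(NOT_FOUND);
	return hit;
}

// Local stand-in for waitForAxNode: poll until match or deadline.
async function pollForNode(
	snapshot: Snapshot,
	predicate: (el: RawElement) => boolean,
	timeoutMs: number,
	intervalMs = 100,
): Promise<RawElement | null> {
	const deadline = Date.now() + timeoutMs;
	for (;;) {
		const hit = (await snapshot()).find(predicate);
		if (hit) return hit;
		if (Date.now() >= deadline) return null;
		await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
	}
}

// After: same predicate and same error text; only the wait shape changes.
// The budget is caller-supplied so per-spec tuning carries over.
async function activateTabMigrated(
	snapshot: Snapshot,
	timeoutMs = 5000,
): Promise<RawElement> {
	const hit = await pollForNode(snapshot, isCodeTab, timeoutMs);
	if (!hit) throw new Error(NOT_FOUND);
	return hit;
}
```

Keeping the error message and predicate identical makes the diff reviewable as a pure wait-shape change, which is the "document it" case from step 4.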
#### Category A — operon-mode navigation probe
@@ -188,14 +225,12 @@ The plan: find an operon-mode URL form and verify whether the other
"window.location.href = '<URL>'")`. After each navigation, re-run
the registry probe and check the operon scope's interface count.
3. **If any URL surfaces additional operon handlers**, ship a small
Tier 2 reframe spec (e.g. probe `OperonBootstrap.ensure` invocation
shape, or assert the lazy-registration count).
Tier 2 reframe spec.
4. **If none of the candidate URLs surface additional handlers**,
document as "operon scope handlers register lazily on a navigation
we can't easily construct from the harness" and defer.
This is the smaller-scope category — investigation + maybe one
spec landing.
**Needs debugger-attached Claude on port 9229.**
#### Category B — Tier 3 read-only reframes
@@ -209,14 +244,12 @@ is invocable from a fresh `seedFromHost` isolation.
scope. The exceptions are read-side anchors that just need
user-account-scoped data to assert against.
2. **Smoke-test the candidate read-side** with various arg shapes.
For example, T22's `LocalSessions.getPrChecks(prUrl)` might accept
a fake URL string and return an empty/error array shape that
asserts the impl is wired without making a real GitHub call —
investigate.
3. **Ship a Tier 2 reframe** if the read-side resolves cleanly.
4. **Defer** if every candidate requires real account state to assert
meaningfully.
**Needs debugger for smoke-test verification.**
#### Category C — Schema-rev for rejecting read-sides
The plan: resolve the validator schema for `listRemotePluginsPage` /
@@ -228,15 +261,10 @@ a case-doc claim.
9 finding). Read ~2KB around the hit to surface the full schema.
2. **Smoke-test the recovered schema** against the user's debugger-
attached running Claude.
3. **Connect the resolved invocation to a case-doc claim.** If
neither method connects to an existing case-doc test, the schema
knowledge is a finding for the plan-doc but not a spec to ship.
3. **Connect the resolved invocation to a case-doc claim.**
4. **Ship a Tier 2 invocation** if a case-doc claim is unblocked.
`listRemotePluginsPage` could potentially extend T33's plugin
browser coverage with a paginated listing assertion.
This is the smallest-scope category — best fallback if A and B are
blocked.
**Needs debugger to verify the recovered schema.**
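
Step 1 (grep the rejection literal, then read ~2KB around the hit) can be sketched as a small pure function. `contextAround` is a hypothetical helper, not something in the harness, and it assumes the bundle text has already been loaded into a string; the bundle path is not given in this doc.

```typescript
// Return roughly `windowChars` of bundle text centred on the first
// occurrence of `literal`, or null when the literal is absent.
function contextAround(
	bundle: string,
	literal: string,
	windowChars = 2048,
): string | null {
	const at = bundle.indexOf(literal);
	if (at === -1) return null;
	const half = Math.floor(windowChars / 2);
	const start = Math.max(0, at - half);
	const end = Math.min(bundle.length, at + literal.length + half);
	return bundle.slice(start, end);
}

// Example literal taken from the session 12 rejection message.
const LIMIT_LITERAL = 'Argument "limit" at position 0';
```

Calling `contextAround(bundleText, LIMIT_LITERAL)` would yield the slice to read by hand for the validator schema; the recovered schema still needs the debugger-attached smoke-test from step 2.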
#### Cross-compositor focus-shifter expansion (NOT recommended this session)
@@ -247,27 +275,32 @@ consumer.
#### Main-side `invokeEipcChannel` fallback (NOT recommended this session)
If a future spec needs to invoke a `claude.settings/*` handler that
only registers on the find_in_page or main_window webContents (where
the renderer is at `file://` and the wrapper isn't exposed), the
main-side direct-call path is documented in session 8's Status
section. Don't add it speculatively — wait for a real consumer.
Same status as sessions 8-13 — wait for a real consumer.
#### Launch event-subscription primitive (NOT recommended this session)
Session 11 noted that `window['claude.web'].Launch` exposes 5 `on*`
event subscribers + `activeServersStore` not visible in
`_invokeHandlers`. No consumer asks for an event-probe primitive
yet — wait for one.
Same status as sessions 11-13 — wait for a real consumer.
#### `waitForRenderedSurface` registry (NOT recommended this session)
Session 13's `lib/ax.ts` deliberately did NOT ship a named-surface
registry; promote when a third consumer crystallizes with a specific
surface name in mind.
#### CSS-querySelector primitive (NOT recommended this session)
Session 13's `lib/ax.ts` covers AX-tree consumers only. T07's CSS-
querySelector poll for the topbar is a different abstraction (DOM,
not AX). Wait for a second consumer before extracting.
### Constraints to respect (don't violate)
These are unchanged from sessions 1-12 and still load-bearing:
These are unchanged from sessions 1-13 and still load-bearing:
- **Default isolation** unless the spec needs otherwise. Use
`seedFromHost: true` for any test that depends on authenticated
renderer state — never assume default isolation gets past
`/login`. T11_runtime/T16/T19/T20/T21/T26/T22b/T27/T31b/T33b/T33c/T35b/T37b/T38b
`/login`. T07/T11_runtime/T16/T17/T19/T20/T21/T26/T22b/T27/T31b/T33b/T33c/T35b/T37b/T38b
are the templates.
- **eipc handlers register on `webContents.ipc._invokeHandlers`,
NOT global `ipcMain._invokeHandlers`.** Session 7 finding. Use
@@ -278,57 +311,28 @@ These are unchanged from sessions 1-12 and still load-bearing:
- **eipc invocation goes through the renderer-side wrapper at
`window['claude.<scope>'].<Iface>.<method>`.** Session 8 finding.
Use `lib/eipc.ts`'s `invokeEipcChannel` rather than rolling
main-side direct calls — the wrapper honors the per-handler origin
gate honestly. Main-side direct calls work but require spoofing
`senderFrame.url`; reserved as a fallback for non-claude.ai
webContents (no current consumer).
main-side direct calls.
- **For arg validator schema-rev: try smoke-test first, then grep
the rejection message literal.** Session 9 finding. When
`invokeEipcChannel` rejects with `Argument "<name>" at position N
... failed to pass validation`, that exact string lives inline in
the validator block. One grep on the literal resolves the
location; reading ~2KB around it surfaces the full schema. Cheaper
than runtime closure inspection in most cases. Session 11 finding:
for trivial `typeof === 'string'` validators, the smoke-test
resolves the shape in one round-trip — bundle-grep is unnecessary
overhead for simple validators. Session 12: most plugin-side
validators were resolvable by smoke-test alone (15-method
enumeration with 3-5 arg shapes per method costs ~5 minutes).
the rejection message literal.** Session 9 finding. Trivial
validators (`typeof === 'string'` / similar) resolve in one
round-trip. Elaborate validators get the bundle-grep treatment.
- **For session-scoped Tier 2 reframes: `LocalSessions/getAll` is
the foundational read-side surrogate.** Session 10 finding. When
a case-doc test's anchors are write-side LocalSessions handlers
with no read-side equivalent, ship a registration probe over the
case-doc-anchored suffixes PLUS a single
`invokeEipcChannel('LocalSessions_$_getAll', [])` array-shape
assertion as the read-side surrogate.
the foundational read-side surrogate.** Session 10 finding.
- **For Tier 2 reframes with case-doc-anchored read-side handlers:
invoke the case-doc-anchored handlers directly.** Session 11
finding. When the case-doc has read-side anchors with resolvable
arg shapes (like T21's `getConfiguredServices(cwd)` /
`getAutoVerify(cwd)`), prefer invoking those over a foundational
surrogate. Mixed-shape dual invocation (one returns array, another
returns boolean) is fine — assert each shape independently.
finding. Mixed-shape dual invocation is fine.
- **For Tier 2 reframes spanning two interfaces: invoke a read-side
from each.** Session 12 finding. When the case-doc surface spans
two impl objects (T11's CustomPlugins + LocalPlugins), invoke one
read-side from each rather than picking one. Cross-impl-object
dual invocation proves the plumbing crosses both impls intact —
strictly stronger than single-interface coverage. Mixed-arg-shape
fine (one needs `[[]]`, another `[]`).
- **`lib/input.ts` is X11-only.** Strict `XDG_SESSION_TYPE ===
'x11'` gate. Wayland consumers must skip — don't try to bolt
Wayland into the file.
- **`lib/input-niri.ts` is Niri-only.** Strict
`XDG_CURRENT_DESKTOP === 'niri'` gate. Sway / Hyprland / River
consumers must skip or live in their own per-compositor files.
from each.** Session 12 finding (T11_runtime template).
- **For AX-tree consumers: use `lib/ax.ts`.** Session 13 finding.
`snapshotAx` for one-shot reads, `waitForAxNode` /
`waitForAxNodes` for predicate-based polling. Don't reach into
`explore/walker.ts` directly — re-exports go through `lib/ax.ts`.
Consumers in session 13: `lib/claudeai.ts` page-objects + T26.
- **`lib/input.ts` is X11-only.** Strict gate.
- **`lib/input-niri.ts` is Niri-only.** Strict gate.
- **Don't speculate on `lib/input-wayland.ts` dispatcher.**
Per-compositor files until a second Wayland consumer (Sway /
Hyprland / River) lands. With only S14 on Niri, a dispatcher
is ceremony.
- **Code-tab AX anchors stay in plan-doc until a consumer needs
them.** Don't preemptively add `CodeTab.activateTopTab()` to
`claudeai.ts` — session 5's anchors block out the work for
whenever a future consumer surfaces.
them.**
- **CDP auth gate is alive** — runtime SIGUSR1 attach via
`app.attachInspector()`, never Playwright's `_electron.launch()`
or `chromium.connectOverCDP()`.
@@ -336,61 +340,49 @@ These are unchanged from sessions 1-12 and still load-bearing:
`webContents.getAllWebContents()` not
`BrowserWindow.getAllWindows()`. Constructor-level wraps don't
work; use prototype-method hooks.
- **`skipUnlessRow()` always first.** First line of every `test()`
body when the test is row-gated.
- **`skipUnlessRow()` always first.**
- **No fixed sleeps.** `retryUntil` from `lib/retry.ts`, or
Playwright auto-wait. Fixed `sleep(N)` is a smell. (Exception:
short sleeps inside hand-rolled retry loops that catch typed
errors and short-circuit; see S11 / S14 for the pattern.)
Playwright auto-wait, or `waitForAxNode` from `lib/ax.ts`.
(Exception: short sleeps inside hand-rolled retry loops that
catch typed errors and short-circuit; see S11 / S14.)
- **Diagnostics on every run.** `testInfo.attach()` the artefacts.
Single-shot JSON dumps for multi-state tests (S11, S14, S31,
T11_runtime, T19, T20, T21, T22b, T27, T31b, T33b, T33c, T35b,
T37b, T38b pattern) are cleaner than 5+ separate attachments.
- **Tag with annotations.** `severity:` and `surface:` on every
test so JUnit carries them through to matrix-regen.
- **Tabs in TS, ~80-char wrap as the existing files do.** Match
surrounding style.
- **Tabs in TS, ~80-char wrap as the existing files do.**
- **Don't break existing runners.** `npm run typecheck` must stay
clean. H01-H05 are the canaries; `npm test` must still pass them
after every commit.
after every commit. Note that T16/T17/T07/S25/S29-S31/S04 etc.
are pre-existing-flaky on KDE-W per session 13's full-suite run
— they're NOT canaries; baseline failures don't block work.
- **Always grep the installed asar** to verify a fingerprint
string is present (and how often) BEFORE shipping. Build-
reference is beautified — strings differ from the minified
bundle.
string is present.
- **For mock-then-call: the helper goes in
`lib/electron-mocks.ts`,** not `lib/claudeai.ts`.
`lib/electron-mocks.ts`.**
- **Marker windows / sacrificial host processes always die in
`finally`.** S11 / S14 are the templates — `marker.kill()` runs
before `app.close()` so the kill happens even if the spec throws.
- **Never log handler response BODIES into JUnit.** T37b's pattern
(response type + length only, never the body) is correct for any
invocation that returns user-account-scoped content. Memory bodies
may contain personal or sensitive content; MCP server tokens may
contain credentials; scheduled-task instructions may reference
internal projects; marketplace `pluginContext`-filtered listings
may surface internal-org marketplace pointers. T11_runtime's
defensive default extends the pattern: installed-plugin entries may
include workspace paths and plugin IDs that reveal org-internal
marketplace pointers when the user is in an org; configured dev
service entries (T21) may include workspace paths from auto-detect.
`finally`.**
- **Never log handler response BODIES into JUnit.**
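
The response-body rule above can be sketched as a small helper. This is a minimal illustration, not T37b's actual code: `summarizeResponse` and `ResponseSummary` are hypothetical names, and only the "type + length, never the body" idea comes from the rule itself.

```typescript
// Hypothetical helper (not part of the harness): reduce a handler
// response to its type and length so the body never reaches JUnit.
interface ResponseSummary {
	type: string;
	length: number | null;
}

function summarizeResponse(value: unknown): ResponseSummary {
	if (Array.isArray(value)) {
		return { type: 'array', length: value.length };
	}
	if (typeof value === 'string') {
		return { type: 'string', length: value.length };
	}
	if (value === null) {
		return { type: 'null', length: null };
	}
	if (typeof value === 'object') {
		// Key count only; keys and values stay out of the summary.
		return { type: 'object', length: Object.keys(value).length };
	}
	return { type: typeof value, length: null };
}

// The summary, not the raw value, is what a spec would attach.
const summary = summarizeResponse([{ secret: 'token' }, { secret: 'x' }]);
console.log(JSON.stringify(summary));
```

A spec would pass the summary to `testInfo.attach()` in place of the raw handler response.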
### Phases
#### Phase 0 — calibration
1. `cd tools/test-harness && npm run typecheck` — should pass.
2. Read the plan doc's "Status (post-execution)" session 12 section,
then read `lib/eipc.ts`'s `invokeEipcChannel` API +
`T11_runtime.spec.ts` leading comments. Confirm you understand the
cross-impl-object dual invocation pattern.
3. Pick ONE Category as the main bet. Each has a different shape:
2. **Check debugger:** `ss -tln 2>/dev/null | grep ':9229'` (or
`curl --max-time 2 http://127.0.0.1:9229/json`). If port 9229 is
open, A / B / C are tractable; if closed, pivot to D or
documentation-only.
3. Read the plan doc's "Status (post-execution)" session 13 section,
then read `lib/ax.ts`'s API + `T26` and `claudeai.ts`'s
integration. Confirm you understand the `snapshotAx` /
`waitForAxNode` / `waitForAxNodes` surface.
4. Pick ONE Category as the main bet:
- **D** (PRIORITY when debugger is closed): pick 1-2 call-sites
in `claudeai.ts` to migrate, list which.
- **A**: bundle grep + per-URL navigation + registry re-probe.
- **B**: pick a Tier 3 candidate, smoke-test the read-side, decide
ship or defer.
- **C**: bundle grep on rejection literals, schema-rev, smoke-test
the resolved shape, decide ship or defer.
List which approaches you'll try in what order, with the cap at
2-3 distinct approaches before STOP AND REPORT.
If Phase 0 surfaces a problem (typecheck failing, primitives unclear,
the chosen Category's prerequisites don't hold), stop and report.
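
The debugger check in step 2 can also be expressed in TypeScript, as a rough equivalent of the `curl --max-time 2` probe. This is a sketch under assumptions: `isDebuggerAttached` and `tractableCategories` are hypothetical names that do not exist in the harness, and the global `fetch` assumes Node 18 or later.

```typescript
type Category = 'A' | 'B' | 'C' | 'D';

// Hypothetical helper: probe the inspector's HTTP endpoint, mirroring
// `curl --max-time 2 http://127.0.0.1:9229/json`.
async function isDebuggerAttached(
	port = 9229,
	timeoutMs = 2000,
): Promise<boolean> {
	const controller = new AbortController();
	const timer = setTimeout(() => controller.abort(), timeoutMs);
	try {
		const res = await fetch(`http://127.0.0.1:${port}/json`, {
			signal: controller.signal,
		});
		return res.ok;
	} catch {
		// Connection refused or timed out: treat as detached.
		return false;
	} finally {
		clearTimeout(timer);
	}
}

// Port open: A / B / C become tractable alongside D.
// Port closed: D (or a documentation-only session) is what remains.
function tractableCategories(debuggerOpen: boolean): Category[] {
	return debuggerOpen ? ['A', 'B', 'C', 'D'] : ['D'];
}
```

Calling `isDebuggerAttached()` and feeding the result to `tractableCategories` gives the candidate list for step 4. The choice of main bet still follows the priorities stated above.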
@@ -398,6 +390,10 @@ Don't fan out.
#### Phase 1 — fan-out batch
For Category D (call-site migration):
- Single subagent migrates 1-2 call-sites in `claudeai.ts` to use
`waitForAxNode`. Verify by running T16 / T17 / T26 / H05.
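
A Category D migration might look roughly like the sketch below. It is self-contained and does not import `lib/ax.ts`: `AxNodeLike`, `findOnce`, and `pollForNode` are hypothetical stand-ins, and the `role` / `name` fields are assumptions about the node shape, not the real `RawElement` definition. The point is the shape change from a one-shot read to a predicate-based poll, with the call-site's existing budget passed through unchanged.

```typescript
// Hypothetical node shape; the real RawElement fields may differ.
interface AxNodeLike {
	role: string;
	name: string;
}

type Snapshot = () => Promise<AxNodeLike[]>;

// Predicate for the anchor that Code-tab activation looks for.
const isCodeTabButton = (node: AxNodeLike): boolean =>
	node.role === 'button' && node.name === 'Code';

// Before: one-shot read. A miss on a cold render fails immediately.
async function findOnce(snapshot: Snapshot): Promise<AxNodeLike | null> {
	const nodes = await snapshot();
	return nodes.filter(isCodeTabButton)[0] ?? null;
}

// After: local stand-in for a waitForAxNode-style poll. Timeout and
// interval are parameters so the call-site keeps its tuned budget.
async function pollForNode(
	snapshot: Snapshot,
	predicate: (node: AxNodeLike) => boolean,
	opts: { timeoutMs: number; intervalMs: number },
): Promise<AxNodeLike | null> {
	const deadline = Date.now() + opts.timeoutMs;
	for (;;) {
		const match = (await snapshot()).filter(predicate)[0];
		if (match) return match;
		if (Date.now() >= deadline) return null;
		// Short sleep inside the retry loop, not a fixed wait.
		await new Promise((resolve) => setTimeout(resolve, opts.intervalMs));
	}
}
```

In the real migration the stand-in would be replaced by `waitForAxNode(inspector, predicate, opts)`, which returns the first match or null on timeout.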
For Category A (operon investigation):
- Single subagent does bundle-grep for operon URL routes + per-URL
registry re-probe. Report findings; if a Tier 2 reframe is
@@ -405,23 +401,20 @@ For Category A (operon investigation):
For Category B (Tier 3 read-only reframes):
- Spawn ONE subagent for the candidate read-side investigation
(smoke-test + bundle-grep if needed). Treat as exploratory; report
findings before committing to a spec shape.
- Cap re-spawns at 2-3 distinct approaches; if no read-side resolves
cleanly, STOP AND REPORT.
(smoke-test + bundle-grep if needed).
For Category C (schema-rev):
- Single subagent does bundle-grep on the rejection literals, surfaces
the validator schemas, smoke-tests the recovered shapes against the
user's debugger-attached running Claude. If a recovered schema
unblocks a case-doc claim, ship; otherwise document and defer.
- Single subagent does bundle-grep on the rejection literals,
surfaces the validator schemas, smoke-tests the recovered shapes
against the user's debugger-attached running Claude.
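
The smoke-test step for Category C can be sketched as a loop over candidate arg shapes. This is an illustration under assumptions: `smokeTestShapes`, `formatLine`, and the injected `invoke` callback are hypothetical and are not the session 11 / 12 smoke scripts. The `[OK]` / `[REJ]` tags follow the reporting convention this prompt describes for those scripts.

```typescript
// Sketch only: `invoke` is injected so the block runs without the
// harness; in practice it would wrap a READ-SIDE handler call.
type Invoke = (args: unknown[]) => Promise<unknown>;

interface ShapeResult {
	args: unknown[];
	ok: boolean;
	detail: string;
}

// One report line per candidate shape.
function formatLine(result: ShapeResult): string {
	const tag = result.ok ? '[OK]' : '[REJ]';
	return `${tag} ${JSON.stringify(result.args)} ${result.detail}`;
}

async function smokeTestShapes(
	invoke: Invoke,
	candidates: unknown[][],
): Promise<ShapeResult[]> {
	const results: ShapeResult[] = [];
	for (const args of candidates) {
		try {
			const value = await invoke(args);
			// Record the response type only, never the body.
			const detail = Array.isArray(value) ? 'array' : typeof value;
			results.push({ args, ok: true, detail });
		} catch (err) {
			// A validator rejection surfaces here as a thrown error.
			const detail = err instanceof Error ? err.message : String(err);
			results.push({ args, ok: false, detail });
		}
	}
	return results;
}
```

Candidates such as `[]` and `[[]]` would go in the list, and each result line shows which shape the validator accepted.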
Cap at ~1 spec total — same scope as session 12's T11_runtime.
Cap at ~1 spec OR ~1 primitive migration total — same scope as
sessions 9-13.
#### Per-subagent prompt shape
```
You're implementing ONE [test-harness runner | primitive |
You're implementing ONE [test-harness runner | primitive migration |
investigation] for <TARGET>.
Read in order:
@@ -429,7 +422,8 @@ Read in order:
- tools/test-harness/README.md (conventions; status section names
the most-recent-template that fits)
- tools/test-harness/src/runners/<closest-template>.spec.ts
- tools/test-harness/src/lib/ (the primitives you'll reuse)
- tools/test-harness/src/lib/ (the primitives you'll reuse
including session 13's `lib/ax.ts`)
- CLAUDE.md (project conventions)
Write tools/test-harness/src/runners/<TARGET>_short_name.spec.ts
@@ -437,8 +431,8 @@ Write tools/test-harness/src/runners/<TARGET>_short_name.spec.ts
[per-task specifics: pattern (seedFromHost / mock-then-call /
asar fingerprint / shared isolation / new-primitive-build /
investigation), assertion shape, skip rules, key constraint
warnings]
investigation / call-site migration), assertion shape, skip rules,
key constraint warnings]
Constraints:
- Tabs, ~80-char wrap.
@@ -446,7 +440,8 @@ Constraints:
- testInfo.attach() the diagnostics from the spec's "Diagnostics
on failure" block.
- Tag with severity + surface annotations.
- No fixed sleeps. retryUntil or Playwright auto-wait.
- No fixed sleeps. retryUntil, Playwright auto-wait, or
waitForAxNode.
- npm run typecheck must stay clean after your edits.
- Don't commit. The user reviews and commits.
@@ -454,17 +449,17 @@ If the target isn't reasonable to implement (anchors don't resolve
to anything assertable, the test depends on state you can't
construct, the existing primitives don't cover the surface), DO
NOT write a stub. Report under Open questions and stop. Sessions
1-12 had cumulative ~17 "stop and report" outcomes that were the
1-13 had cumulative ~17 "stop and report" outcomes that were the
right call.
Report shape (~150 words):
## <TARGET> [runner | primitive | investigation]
## <TARGET> [runner | primitive | investigation | migration]
- File written: tools/test-harness/src/runners/<filename>.spec.ts
[or lib/<newfile>.ts]
[or lib/<newfile>.ts or modified lib/<existing>.ts]
- Layer: file probe | argv probe | L1 | L2 (xprop) | L2 (DBus) |
pgrep | new-primitive | investigation
- Assertion shape: <one sentence>
pgrep | new-primitive | investigation | migration
- Assertion shape (or migration shape): <one sentence>
- Skip rules: <which rows + why>
- Verification path: <typecheck + run result>
- Open questions: <caveats>
@@ -475,49 +470,49 @@ Report shape (~150 words):
After fan-out returns:
1. `cd tools/test-harness && npm run typecheck` — must stay clean.
2. Run the new runners against KDE-W (the dev box) — but flag the
user first if any are destructive (seedFromHost kills running
Claude). Capture pass/skip/fail per spec for the matrix.
2. Run the new / migrated runners against KDE-W (the dev box) — but
flag the user first if any are destructive (seedFromHost kills
running Claude). Capture pass/skip/fail per spec for the matrix.
3. Update [`docs/testing/runner-implementation-plan.md`](runner-implementation-plan.md)
"Status (post-execution)" section to reflect newly-shipped
specs and any reclassifications discovered mid-flight.
specs / primitive migrations and any reclassifications.
4. Update [`tools/test-harness/README.md`](../../tools/test-harness/README.md)
inventory table.
5. Write a final report listing:
- Specs landed (pass / skip / needs-tuning per row)
- Specs landed / migrations completed (pass / skip / needs-tuning per row)
- Primitives landed (with API shape)
- Specs deferred (with the per-test rationale)
- Specs reclassified (Tier 3 → Tier 2, Tier 2 → Tier 1, etc.)
- Updated coverage stat (was 74/76 = 97%, now N/76 = M%)
6. Don't commit. The user reviews and commits.
6. Commit and push to `docs/compat-matrix` (the orchestration
directive at the top of the followup supersedes "don't commit").
7. Rotate this prompt: rewrite
`docs/testing/runner-implementation-followup-prompt.md` for
the NEXT session's deferred items.
### Self-correction loop
Same as sessions 1-12:
Same as sessions 1-13:
1. Subagent typecheck failure → re-spawn with explicit fix
instruction.
2. Subagent claims a runner exists but `git status` shows no new
file → re-spawn with explicit "use the Write tool" instruction.
2. Subagent claims a runner / migration exists but `git status`
shows no new file → re-spawn with explicit "use the Write tool"
instruction.
3. Two subagents wrote runners that share a primitive but with
different shapes → factor into `lib/<topic>.ts` BEFORE
shipping.
4. Spec passes locally but the assertion is actually trivial (e.g.
an unauthenticated launch where the handler check vacuously
passes because no handlers are registered) → re-examine the
assertion shape.
5. **Carry-over from session 5/6/7/8/9/10/11/12:** If the chosen
different shapes → factor into `lib/<topic>.ts` BEFORE shipping.
4. Spec passes locally but the assertion is actually trivial → re-
examine the assertion shape.
5. Migration breaks an existing spec → roll back the migration; the
per-spec retry budget was load-bearing and the primitive
defaults didn't match. Document the budget mismatch in plan-doc.
6. **Carry-over from session 5/6/7/8/9/10/11/12/13:** If the chosen
Category's investigation doesn't resolve / requires schema-rev
that exceeds budget after 2-3 approaches, STOP. Don't keep
digging — pivot to a fallback Category. Document what was tried.
6. **Carry-over from session 10:** If a registration probe surfaces
"registered but uninvocable" (handler is on the registry but the
renderer-side wrapper isn't exposed for the relevant scope or the
validator rejects every smoke-test arg shape), document and
defer rather than building the main-side fallback speculatively.
7. **Carry-over from session 10:** If a registration probe surfaces
"registered but uninvocable", document and defer rather than
building the main-side fallback speculatively.
Cap re-spawns at 2 per file. Past that, mark as needing human
review and move on.
@@ -534,76 +529,61 @@ Stop and write the final report when one of:
tests.** Stop, propose where the new primitive should live in
`lib/`. Future session adds the primitive first, then resumes.
4. **Session budget hits ~1 new spec OR one new primitive
landing.** Stop, synthesize, leave the rest for the next
session.
landing OR one substantive call-site migration.** Stop,
synthesize, leave the rest for the next session.
5. **All categories blocked after 2-3 attempts each.** Document the
findings as plan-doc additions and stop — coverage is at 97%, a
no-spec session that surfaces deferral notes is fine.
### What you should NOT do
- **Don't try to land Category A + B + C in one batch.** Pick
- **Don't try to land Category D + A + B + C in one batch.** Pick
ONE as the main bet.
- **Don't ship stubs.** If a runner can't actually assert what the
spec says, mark it as Tier 3 / blocked / primitive-gap and
don't write a placeholder. The cumulative seventeen "stop and
report" outcomes from sessions 1-12 were the right call — every
one revealed a real constraint.
don't write a placeholder.
- **Don't break existing runners.** H01-H05 are the canaries.
T16 / T17 / T07 / S25 / S29-S31 are pre-existing-flaky on KDE-W
per session 13's full-suite run — those are NOT canaries.
- **Don't restructure `lib/`** beyond targeted additions.
Premature abstractions are wrong abstractions.
`electron-mocks.ts` (session 3), `input.ts` (session 4),
`input-niri.ts` (session 6), and `eipc.ts` registry walker
(session 7) + invocation surface (session 8) were threshold-
driven extractions, not speculative.
- **Don't run destructive Tier 3 tests** that write to the user's
real claude.ai account (T22 PR write, T27 scheduling write, T29
worktree creation, T34 OAuth, T36 hooks-fire-on-prompt-submit).
Only the *read-only reframes* of those are in scope.
- **Don't introspect `ipcMain._invokeHandlers` for `claude.web`
eipc channels.** Session 7 confirmed those use the per-wc IPC
scope. Use `lib/eipc.ts`'s primitive (which targets the per-wc
scope) instead.
- **Don't call `invokeEipcChannel` for write-side handlers** —
`start*`, `set*`, `write*`, `run*`, `openIn*`, `delete*`,
`cancel*`, `reset*`, `installPlugin`, `uninstallPlugin`,
`updatePlugin`, `enablePlugin`, `uploadPlugin`, `syncRemotePlugins`.
The primitive doesn't enforce a read-only allowlist; the safety
property is that case-doc-anchored suffixes are read-side OR
case-doc-anchored write-side suffixes are tested via REGISTRATION
ONLY (`waitForEipcChannels`), never invoked. T11_runtime / T19 /
T20 / T21 ship registration probes over write-side suffixes — that's
the safe pattern.
eipc channels.** Use `lib/eipc.ts`.
- **Don't call `invokeEipcChannel` for write-side handlers.**
- **Don't bolt other compositors into `lib/input-niri.ts`.**
Sway / Hyprland / River each get their own per-compositor file
if a consumer surfaces.
- **Don't bolt Wayland into `lib/input.ts`.** X11-strict gate is
load-bearing.
- **Don't bolt Wayland into `lib/input.ts`.**
- **Don't speculate on a `lib/input-wayland.ts` dispatcher.**
Per-compositor files until a second Wayland consumer lands.
- **Don't preemptively build `CodeTab.activateTopTab()` /
`startNewSession()`.** Session 5 captured the AX anchors but
T36 Phase 2 (the only known consumer) was reclassified out.
`startNewSession()`.**
- **Don't add a main-side `invokeEipcChannel` fallback
speculatively.** Build it only if a concrete consumer needs to
invoke through a non-claude.ai webContents. Premature primitives
leak design debt.
speculatively.**
- **Don't speculate on a Launch event-subscription primitive.**
Session 11 noted that `window['claude.web'].Launch` exposes 5
`on*` event subscribers + `activeServersStore` not visible in
`_invokeHandlers`. No consumer asks for an event-probe primitive
yet. Wait for one.
- **Don't extract T07's CSS-querySelector poll into `lib/ax.ts`.**
That's a different abstraction (DOM, not AX). Wait for a second
CSS-poll consumer before extracting.
- **Don't add a `waitForRenderedSurface(client, surfaceKey)`
registry to `lib/ax.ts`.** Session 13 deliberately deferred
this — wait for a third consumer with a specific named surface.
- **Don't change the existing per-spec retry budgets when migrating
to `waitForAxNode`.** The budgets are tuned. Migration is shape-
only.
- **Don't reach into `explore/walker.ts` for AX types/helpers.**
`lib/ax.ts` re-exports `RawElement` / `AxNode` /
`axTreeToSnapshot` / `waitForAxTreeStable` — use those.
- **Don't implement the #569 power-inhibit patch in this
session.** That's a separate workstream.
- **Don't commit.** The user reviews and commits.
### Final report format
```markdown
## Runner implementation summary (session 13)
## Runner implementation summary (session 14)
- Main-bet category: A | B | C
- Main-bet category: D | A | B | C
- Specs landed: N
- Migrations completed: N
- Primitives landed: N
- Reclassified mid-flight: N (with reasons)
- Coverage: was 74/76 (97%), now <NEW>/76 (<PCT>%)
@@ -614,7 +594,7 @@ Stop and write the final report when one of:
| Cat | Test ID | File | Assertion shape | Status |
|---|---|---|---|---|
| A | <test_id> | <file>.spec.ts | … | ✓ pass / skip / fail |
| D | <call-site> | <file>.ts | … | ✓ pass / skip / fail |
| ... |
## Notable findings
@@ -624,9 +604,7 @@ Stop and write the final report when one of:
- ...
## Files touched
git status output (tools/test-harness/src/runners/*.spec.ts +
maybe lib/* primitives if extraction was needed; possibly plan-doc /
README updates).
git status output.
## Diff summary
git diff --stat
@@ -639,79 +617,44 @@ git diff --stat
- Each subagent's Write calls land directly in the working tree.
- The grounding probe (`tools/test-harness/grounding-probe.ts`)
can help when implementing a runner that asserts runtime API
state — capture once with `npm run grounding-probe -- --launch
--include-synthetic`, grep the output for the IPC channel /
accelerator / API your runner needs to assert against.
state.
- The eipc-registry probe (`tools/test-harness/eipc-registry-probe.ts`)
is the dedicated tool for inspecting per-wc IPC handler state.
Useful when designing new probes or auditing for upstream drift.
Connects to a debugger-attached running Claude on port 9229.
- For seedFromHost specs, the host MUST have a signed-in Claude
Desktop. The primitive throws with a clear message if not.
Document the prerequisite in your runner's leading comment if
it's the first one to add seedFromHost coverage to a new
surface.
- For tests that touch the AX tree, `claudeai.ts` page-objects
are the right substrate — see `T17_folder_picker.spec.ts` for
the end-to-end example. Don't query DOM by CSS selector unless
`claudeai.ts` doesn't already cover the surface. Code-tab
session-opener anchors are documented in plan-doc session 5;
don't add them to `claudeai.ts` unless a consumer surfaces.
- For mock-then-call: helpers live in `lib/electron-mocks.ts`
(extracted in session 3). See T24's leading comment for the
`Promise<boolean>` variant + T25's for the void variant.
- For tests that touch the AX tree, **`lib/ax.ts`** is the new
shared substrate. `claudeai.ts` page-objects are still the
right substrate for renderer-UI domain operations (CodeTab,
compact pills, menu items) — they consume `lib/ax.ts`
internally. Don't query DOM by CSS selector unless `claudeai.ts`
doesn't already cover the surface.
- For mock-then-call: helpers live in `lib/electron-mocks.ts`.
- For focus-shifting (X11 only): `lib/input.ts` exports
`focusOtherWindow` + `spawnMarkerWindow`. See S11 for the
end-to-end consumer pattern.
- For Wayland-native focus-shifting (Niri only): `lib/input-niri.ts`
exports the same shape with `niri msg --json` IPC + `foot`
marker. See S14 for the end-to-end consumer pattern.
`focusOtherWindow` + `spawnMarkerWindow`.
- For Wayland-native focus-shifting (Niri only): `lib/input-niri.ts`.
- For eipc registry walking: `lib/eipc.ts` exports
`getEipcChannels` / `findEipcChannel` / `findEipcChannels` /
`waitForEipcChannel` / `waitForEipcChannels` against
`webContents.ipc._invokeHandlers`. See T11_runtime / T19 / T20 /
T21 / T22b / T31b / T33b / T38b for end-to-end consumer patterns.
- For eipc invocation: `lib/eipc.ts` exports `invokeEipcChannel`
(renderer-side wrapper at
`window['claude.<scope>'].<Iface>.<method>`). See T11_runtime / T19 /
T20 / T21 / T27 / T33c / T35b / T37b for end-to-end consumer patterns.
`waitForEipcChannel` / `waitForEipcChannels`.
- For eipc invocation: `lib/eipc.ts` exports `invokeEipcChannel`.
Only call read-side suffixes; the primitive doesn't enforce a
read-only allowlist. Cross-impl-object dual invocation pattern is
T11_runtime; single-interface dual is T21 / T33c.
read-only allowlist.
- **For arg validator schema-rev (sessions 9 / 11 / 12 findings):**
when invocation rejects with `Argument "<name>" at position N ...
failed to pass validation`, FIRST try smoke-testing common arg
shapes against the user's debugger-attached Claude (session 11's
`launch-cwd-smoke.ts` / session 12's `localplugins-smoke.ts`
pattern — clone the InspectorClient connection, iterate over arg
shape candidates, report `[OK]` / `[REJ]` per shape). For trivial
validators (`typeof === 'string'` / similar), this resolves the
schema in one round-trip and avoids needing bundle-grep. For more
elaborate validators, fall back to grep on the bundled `index.js`
for the literal rejection string; validator block sits ~50-200
chars before the throw site. See plan-doc session 9 status section
for the byte offsets of the two CustomPlugins validators (5013601
/ 5018821) as worked examples.
smoke-test first, bundle-grep on rejection literal as fallback.
- **For session-scoped Tier 2 reframes (session 10 finding):**
`LocalSessions/getAll` is the foundational read-side surrogate
when case-doc anchors are write-side. Pattern: `args = []`,
returns `Array<Session>`. T19 and T20 are the templates.
`LocalSessions/getAll` foundational read-side surrogate.
- **For Tier 2 reframes with case-doc-anchored read-side handlers
(session 11 finding):** invoke the case-doc-anchored handlers
directly rather than using a foundational surrogate. Mixed-shape
dual invocation is fine. T21 is the template (one returns array,
another returns boolean — assert each shape independently).
(session 11 finding):** invoke directly. Mixed-shape OK.
- **For Tier 2 reframes spanning two interfaces (session 12
finding):** invoke a read-side from each impl object. T11_runtime
is the template (CustomPlugins/listInstalledPlugins array +
LocalPlugins/getPlugins array — proves the install plumbing
crosses both impls intact). Mixed-arg-shape fine.
finding):** invoke a read-side from each impl object.
- **For AX-tree polling (session 13 finding):** `lib/ax.ts`'s
`waitForAxNode` / `waitForAxNodes` for predicate-based polling.
`snapshotAx` for one-shot reads. Re-exports keep
`explore/walker.ts` types accessible without crossing the
lib/explore boundary.
- **For asar fingerprints: ALWAYS grep the installed asar
first.** Build-reference is beautified; the bundle is
minified. Case-doc text may be the user-facing form, not the
bundle form (e.g. `~/.claude.json` vs `.claude.json`). T18
reads `mainView.js`, not `index.js` — `lib/asar.ts`'s
`readAsarFile(filename, asarPath)` already handles this.
minified.
```bash
cd tools/test-harness && node -e "
const {extractFile} = require('@electron/asar');


@@ -18,6 +18,116 @@ work begins.
## Status (post-execution)
**Shipped session 13 (1 new primitive, no new spec):** `lib/ax.ts`
shared AX-tree loading + traversal substrate, threshold-driven
extraction. The plan-doc had flagged "Unified DOM/AX loading +
traversal primitive" in session 12 as the natural priority for
session 13 if the operon / Tier 3 / schema-rev categories were
blocked. Phase 0 of session 13 found the debugger detached on the
dev box (port 9229 not listening), which blocked Categories A and C
(operon-mode navigation probe + schema-rev for `listRemotePluginsPage`
/ `listSkillFiles` — both need runtime probing against the user's
debugger-attached running Claude). Category B (Tier 3 read-only
reframes) ALSO effectively required the debugger for the smoke-test
investigation phase. The PRIORITY (DOM unification) primitive
landed as the strongly-supported alternative — two threshold-
driven extraction signals (T26 had duplicated `snapshotAx` from
claudeai.ts, plus user-reported flake in AX-tree queries).
Coverage stays at 74/76 (97%) — primitive-only session, no spec
landed. The matrix coverage doesn't reflect primitive landings;
those show up in the `lib/` surface and are picked up by future
spec consumers.
Two commits on `docs/compat-matrix` expected (SHAs inserted after
the test-harness commit lands — the user reviews and commits at the
end of every session):
- TBD — `test(harness): session 13 lib/ax.ts AX substrate primitive`
(extracts `snapshotAx` + adds `waitForAxNode` / `waitForAxNodes`;
refactors `claudeai.ts` and `T26_routines_page_renders.spec.ts` to
consume the shared substrate instead of carrying duplicate
implementations; passes typecheck + H01-H03 canaries + T26 +
T11_runtime spot-checks on KDE-W).
Session 13 findings + reclassifications:
- **`lib/ax.ts` primitive surface.** Threshold-driven extraction
hitting 2 consumers (the formerly-private `snapshotAx` in
`claudeai.ts` + the explicit duplicate in T26 noted as
"premature abstraction at 1 consumer"). Surface:
- `snapshotAx(inspector, opts)` — single AX read with a stability
gate. `opts.fast` skips the gate for inside-poll callers
(matches the existing internal contract).
- `waitForAxNode(inspector, predicate, opts)` — repeatedly
snapshot the tree and return the first matching `RawElement`,
or null on timeout. Gates on stability once at the start
(configurable), then iterates with `fast: true`. Built against
the inline polling loops in `CodeTab.activate`, `openPill`,
`clickMenuItem`, T26 pre/post-click anchor scans.
- `waitForAxNodes(inspector, predicate, opts)` — same shape,
returns every match. For consumers that want to enumerate.
- Re-exports: `RawElement`, `AxNode`, `axTreeToSnapshot`,
`waitForAxTreeStable` — so consumers don't have to reach into
`explore/walker.ts` themselves. Walker stays the source of
truth for AX-snapshot construction; this file is the runner-
facing alias.
- **Refactor scope was minimal.** `claudeai.ts` swaps its private
`snapshotAx` for the shared one (5-line import change). T26
drops its inlined helper and imports from `lib/ax.ts`. No
call-site rewrites — the predicate-based polling in
`CodeTab.activate` / `openPill` / `clickMenuItem` is unchanged
this session. Future sessions can opportunistically migrate
hand-rolled retry loops to `waitForAxNode` when re-touching
those code paths; not forced this session because the call-site
retry patterns each carry per-spec budget tuning that the
primitive's defaults need to validate against real flake data.
- **Why no spec landed.** Phase 0 calibration found port 9229
detached (Claude was running but debugger wasn't attached via
Developer → Enable Main Process Debugger). Categories A and C
strictly need runtime probing against the debugger; Category B
needs the debugger for the smoke-test verification phase (per
session-12 pattern). The PRIORITY primitive build was the
highest-impact deliverable that didn't require the debugger —
pure static-analysis-driven extraction with two existing
consumers as the threshold signal. Primitive-only sessions are
in scope per the followup prompt's termination criteria
("Session budget hits ~1 new spec OR one new primitive
landing").
- **What's NOT in `lib/ax.ts`.** Did NOT add a
`waitForRenderedSurface(client, surfaceKey)` registry — the
plan-doc flag mentioned it but no consumer asks for a named
surface anchor today; promote when a third consumer crystallizes
with a specific surface name in mind. Did NOT extract T07's
CSS-querySelector poll loop — that's a different abstraction
(DOM, not AX) with no second consumer signal yet. Did NOT
rewrite call-site retry budgets in `claudeai.ts` — the budgets
are tuned per-spec and changing them speculatively risks
introducing flake rather than removing it.
- **Pre-existing T16 / T17 flake confirmed unchanged.** Running
the full suite found T16 / T17 / T07 / S25 / S29-S31 / etc.
failing on KDE-W — these failures are pre-existing on the
baseline (verified by stashing the session-13 changes and re-
running T16, which still failed with the same
`CodeTab.activate: no AX-tree button with accessibleName="Code"
found` error). Session 13's primitive doesn't fix the existing
flake; it lays groundwork that future sessions can build
flake-reduction patches against (e.g. promoting `activateTab`
to use `waitForAxNode` with a longer budget instead of a one-
shot snapshot would be the next session's natural follow-up).
Tier 2 → Tier 2 candidates remaining for next session: the same
list as session 12 — operon-mode navigation probe (still needs a
debugger-attached Claude), schema-rev for `listRemotePluginsPage`
/ `listSkillFiles` (same), Tier 3 read-only reframes (same). The
new option for next session is **call-site migration to
`waitForAxNode`** — promote `activateTab`'s one-shot snapshot to a
proper retry, give T07's CSS poll a more durable wait shape, etc.
That's a flake-reduction session shape rather than a coverage-
expansion shape; the session 13 primitive made it tractable.
---
**Shipped session 12 (1 new spec, no primitive change):** T11_runtime
(Tier 2 reframe — `seedFromHost` + multi-suffix registration probe
over five install-flow handlers + dual-handler invocation across two
@@ -1642,35 +1752,22 @@ a primitive that needs a small extension:
dependent), but if it ever becomes tractable, a
`lib/displays.ts` mocking `screen.getAllDisplays()` would be
the entry.
- **Unified DOM/AX loading + traversal primitive (FLAGGED session
12).** Existing wait/traversal primitives are scattered:
`electron.ts:waitForReady('userLoaded')` covers the post-login
webContents URL transition; `claudeai.ts` page-objects roll their
own `retryUntil` for AX-tree node lookups; `eipc.ts:waitForEipcChannel`
covers handler registration. The user reports lots of failures
because tests aren't waiting long enough for the DOM to render —
AX-tree queries fire before the relevant subtree is mounted, and
individual specs each pick their own `retryUntil` budget. Symptoms:
flaky AX-anchor lookups under cold-cache or slow-machine conditions;
premature `waitForReady('userLoaded')` resolution before claude.ai's
client-side router has hydrated the surface the test wants to query.
Proposed shape: **`lib/dom-ready.ts`** exporting one or more
composable wait helpers — e.g. `waitForAxNode(client, selector,
opts)` (retryUntil over the AX walker with a sensible default
budget, ~15-30s, plus a per-call override), `waitForAxTreeStable(client,
opts)` (no node count change for N consecutive ticks — proxy for
"render finished"), and `waitForRenderedSurface(client, surfaceKey)`
(case-doc-anchored surface markers — a small registry of known
anchors per surface so consumers don't roll their own AX selectors).
Should also unify the existing `claudeai.ts` activation methods
around the new helpers rather than each rolling its own retryUntil.
Touches enough specs that a session 13 primitive build would
reduce flake across T16/T17/T26/T07/H05 plus any future Code-tab
AX work — flag as the main bet for session 13 if the operon /
Tier 3 / schema-rev categories are blocked. Pre-work: audit
per-spec `retryUntil` budgets and AX-query sites to identify the
3-5 most-flaky callsites; build the primitive against those
specifically rather than speculatively.
- **Unified DOM/AX loading + traversal primitive (LANDED session
13 as `lib/ax.ts`).** Threshold-driven extraction once T26 had to
redefine `snapshotAx` inline (after `claudeai.ts`'s private copy
was the only consumer for sessions 1-12). The primitive surface
exports `snapshotAx`, `waitForAxNode`, `waitForAxNodes`, plus
re-exports of `RawElement` / `AxNode` / `axTreeToSnapshot` /
`waitForAxTreeStable` so consumers don't reach into
`explore/walker.ts` directly. `claudeai.ts` and T26 both consume
the shared substrate; future call-site migrations (e.g.
`activateTab` → `waitForAxNode`) are tractable now. The
speculative `waitForRenderedSurface(client, surfaceKey)` shape
was deliberately NOT shipped — no consumer asks for a named-
surface registry today; promote when a third consumer
crystallizes with a specific surface name. The CSS-querySelector
poll in T07 was deliberately NOT extracted — different
abstraction (DOM, not AX), no second consumer signal yet.
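
A minimal sketch of the predicate-based polling shape that `waitForAxNode` is described as providing, for readers planning call-site migrations. Everything here is hypothetical: the `AxNodeSketch` shape, the `snapshot` callback, the option names, and the 15s default are assumptions for illustration, not the actual `lib/ax.ts` signatures.

```typescript
// Hypothetical node shape; the real AxNode exported by lib/ax.ts may differ.
interface AxNodeSketch {
  role: string;
  name: string;
  children: AxNodeSketch[];
}

// Hypothetical options; the names and defaults are assumptions.
interface WaitOptsSketch {
  timeoutMs?: number;  // total budget before giving up
  intervalMs?: number; // delay between snapshots
}

// Depth-first search for the first node matching the predicate.
function findAxNode(
  root: AxNodeSketch,
  predicate: (n: AxNodeSketch) => boolean,
): AxNodeSketch | undefined {
  if (predicate(root)) return root;
  for (const child of root.children) {
    const hit = findAxNode(child, predicate);
    if (hit) return hit;
  }
  return undefined;
}

// Re-snapshot the tree until the predicate matches or the budget runs out.
// `snapshot` stands in for whatever one-shot read the caller has.
async function waitForAxNodeSketch(
  snapshot: () => Promise<AxNodeSketch>,
  predicate: (n: AxNodeSketch) => boolean,
  opts: WaitOptsSketch = {},
): Promise<AxNodeSketch> {
  const timeoutMs = opts.timeoutMs ?? 15000;
  const intervalMs = opts.intervalMs ?? 250;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const hit = findAxNode(await snapshot(), predicate);
    if (hit) return hit;
    if (Date.now() >= deadline) {
      throw new Error(`AX node not found within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

One shared default budget with a per-call override is the point of the shape: individual specs stop choosing their own retry budgets, while slow surfaces can still ask for more time.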
## Open questions for the parent agent

@@ -120,11 +120,18 @@ against case-doc anchors; consumed by T19 / T20 / T22b / T31b / T33b /
T38b) plus its session 8 invoke surface (`invokeEipcChannel` — calls
a registered handler through the renderer-side wrapper at
`window['claude.<scope>'].<Iface>.<method>`; consumed by T19 / T20 /
T27 / T33c / T35b / T37b) — and the `createIsolation({ seedFromHost:
true })` primitive that lets login-required tests run hermetically
against a copy of the host's signed-in auth state (T07, T11_runtime,
T16, T19, T20, T21, T22b, T26, T27, T31b, T33b, T33c, T35b, T37b,
T38b).
T27 / T33c / T35b / T37b), the `lib/ax.ts` AX-tree substrate
(`snapshotAx` for one-shot reads + `waitForAxNode` / `waitForAxNodes`
for predicate-based polling, plus re-exports of `RawElement` /
`AxNode` / `axTreeToSnapshot` / `waitForAxTreeStable` from
`explore/walker.ts` so consumers stay inside `lib/`; threshold-
driven extraction in session 13 once T26 had to duplicate the
formerly-private `snapshotAx` from `claudeai.ts`; consumed by
`claudeai.ts` page-objects + T26) — and the
`createIsolation({ seedFromHost: true })` primitive that lets login-
required tests run hermetically against a copy of the host's signed-
in auth state (T07, T11_runtime, T16, T19, T20, T21, T22b, T26, T27,
T31b, T33b, T33c, T35b, T37b, T38b).
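
For orientation, a stability wait of the kind `waitForAxTreeStable` is named for can be sketched as "node count unchanged for N consecutive ticks". This is an illustrative assumption only: the `TreeNodeSketch` type, the `snapshot` callback, and the option names are hypothetical, and the real helper in `explore/walker.ts` may use a different signal or signature.

```typescript
// Hypothetical minimal tree shape; only `children` matters for counting.
interface TreeNodeSketch {
  children: TreeNodeSketch[];
}

// Count every node in the tree, including the root.
function countNodes(root: TreeNodeSketch): number {
  let total = 1;
  for (const child of root.children) total += countNodes(child);
  return total;
}

// Poll until the node count equals the previous tick's count for
// `stableTicks` consecutive ticks, then return that count.
async function waitForStableSketch(
  snapshot: () => Promise<TreeNodeSketch>,
  opts: { stableTicks?: number; intervalMs?: number; timeoutMs?: number } = {},
): Promise<number> {
  const stableTicks = opts.stableTicks ?? 3;
  const intervalMs = opts.intervalMs ?? 200;
  const timeoutMs = opts.timeoutMs ?? 15000;
  const deadline = Date.now() + timeoutMs;
  let last = -1;     // count seen on the previous tick (-1 = none yet)
  let unchanged = 0; // consecutive ticks with no change
  for (;;) {
    const count = countNodes(await snapshot());
    unchanged = count === last ? unchanged + 1 : 0;
    last = count;
    if (unchanged >= stableTicks) return count;
    if (Date.now() >= deadline) {
      throw new Error(`AX tree did not stabilize within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Node count is only a proxy for "render finished": a subtree that swaps nodes one-for-one would look stable under this sketch, so a predicate-based wait is still the safer choice when a specific anchor is known.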
Note on eipc channels: the `LocalSessions_$_*` and `CustomPlugins_$_*`
channel names referenced in the case-doc Code anchors don't register