mirror of
https://github.com/aaddrick/claude-desktop-debian.git
synced 2026-05-17 00:26:21 +03:00
docs(testing): session 15 plan/inventory + rotate session 16 prompt
Plan-doc Status (post-execution): adds session 15 entry capturing the T17 structural fix (legacy `CLAUDE_TEST_USE_HOST_CONFIG=1` → `seedFromHost: true`), the RawElement import prune, the debugger-attached-to-leaked-test-isolation discovery, the `openPill` / `clickMenuItem` migration park decision, and the "productivity signal is dimming — 3 consecutive sessions without coverage gain" note for the orchestrator. Followup prompt rotation: rewrites for session 16 with the new PRIORITY (run T17 to verify the seedFromHost migration), the upgraded Phase 0 calibration check (port-9229 attachment quality, not just port status — must distinguish auth-bearing Claude from leaked /login isolations via `evalInMain` webContents probe), the narrowed category list (D-verify + C + STOP recommendation), and the explicit STOP termination criterion if both D-verify and C turn up empty. Co-Authored-By: Claude <claude@anthropic.com>
This commit is contained in:
@@ -1,118 +1,139 @@
|
||||
# test-harness runner implementation — session 15 prompt
|
||||
# test-harness runner implementation — session 16 prompt
|
||||
|
||||
This file is meant to be **copied verbatim into a fresh Claude Code
|
||||
session** as the initial user message. Don't paraphrase it; the
|
||||
orchestration depends on the exact directives below.
|
||||
|
||||
You're picking up after a runner-implementation session that landed 1
|
||||
call-site migration (no new spec, no new primitive). Session 14 was
|
||||
a flake-reduction session: Phase 0 calibration found the debugger
|
||||
detached on the dev box (port 9229 not listening — Claude was not
|
||||
running, or running but Developer → Enable Main Process Debugger had
|
||||
not been clicked), which blocked Categories A (operon-mode
|
||||
navigation probe), B (Tier 3 read-only reframes), and C (schema-rev
|
||||
for `listRemotePluginsPage` / `listSkillFiles`) — all needing runtime
|
||||
probing against debugger-attached Claude. Session 14 pivoted to the
|
||||
PRIORITY Category D (call-site migration to `waitForAxNode`), which
|
||||
was tractable without the debugger because the migration is pure
|
||||
shape-only refactor against existing `lib/ax.ts` substrate. Coverage
|
||||
unchanged at 74/76 (97%) — migration sessions don't move the spec
|
||||
count, but T16's pre-existing failure mode (`no AX-tree button with
|
||||
accessibleName="Code" found`) is fixed by the migration. Two commits
|
||||
on `docs/compat-matrix` expected (autonomous orchestration commits +
|
||||
pushes — the user reviews after the session):
|
||||
structural fix (T17 migrated from legacy `CLAUDE_TEST_USE_HOST_CONFIG=1`
|
||||
auth path to `seedFromHost: true`, no new spec, no AX migration).
|
||||
Session 15 was an investigation session: Phase 0 calibration found
|
||||
port 9229 listening BUT the attached process was a leaked test
|
||||
isolation at `claude.ai/login` rather than the user's auth-bearing
|
||||
Claude — every webContents URL on that process was either `find_in_page`,
|
||||
`/login`, or `main_window/index.html`, and the user-data-dir was
|
||||
`/tmp/claude-test-*`. That made Categories A (operon-mode probe) / B
|
||||
(Tier 3 read-only reframes) / C (schema-rev) all soft-blocked: the
|
||||
debugger was technically attached, but to the wrong process for any
|
||||
auth-required investigation. Session 15 pivoted to investigating T17's
|
||||
pre-existing flake (the PRIORITY directive) and discovered the failure
|
||||
was structural rather than AX-polling-related — the spec was using the
|
||||
legacy `CLAUDE_TEST_USE_HOST_CONFIG=1` / `isolation: null` shape, and
|
||||
when run without that env var fell through to a fresh isolation with no
|
||||
auth, where `waitForUserLoaded`'s 90s default budget gets preempted by
|
||||
Playwright's 60s spec timeout. Coverage unchanged at 74/76 (97%) —
|
||||
structural fixes don't move the spec count, but T17 should now succeed
|
||||
when host is signed in (rather than auto-failing with a bare 60s
|
||||
timeout). Two commits on `docs/compat-matrix` expected (autonomous
|
||||
orchestration commits + pushes — the user reviews after the session):
|
||||
|
||||
- TBD — `test(harness): session 14 migrate activateTab to
|
||||
waitForAxNode (no spec, coverage unchanged at 97%)`
|
||||
(migrates `activateTab` from one-shot snapshot to `waitForAxNode`
|
||||
with a configurable pre-click timeout; migrates
|
||||
`CodeTab.activate`'s post-click `retryUntil`-around-
|
||||
`findCompactPills` loop to `waitForAxNodes`; T16 passes 3/3 on
|
||||
KDE-W against the migrated form, was pre-existing-flaky on the
|
||||
baseline; T26 passes; T17 still pre-existing-flaky — verified by
|
||||
stash + retry).
|
||||
- TBD — `test(harness): session 15 migrate T17 to seedFromHost +
|
||||
prune unused RawElement import (no spec, coverage unchanged at 97%)`
|
||||
(T17 spec rewrite swapping the `CLAUDE_TEST_USE_HOST_CONFIG=1` +
|
||||
`isolation: null` branch for the canonical `seedFromHost: true`
|
||||
pattern; prunes unused `RawElement` re-export import in
|
||||
`lib/claudeai.ts` per session 14's leftover hint; typecheck clean;
|
||||
T17 not actually run this session — see below).
|
||||
|
||||
The plan doc at
|
||||
[`docs/testing/runner-implementation-plan.md`](runner-implementation-plan.md)
|
||||
captures the tier classification and execution-time reclassifications.
|
||||
Its "Status (post-execution)" section is the source of truth for
|
||||
what's done and what's deferred — read **session 14** first, then
|
||||
**session 13**, then **session 12**, then **session 11**, then
|
||||
**session 10**, then **session 9**, then **session 8**, then **session
|
||||
7**, then **session 6**, then **session 5**, then **session 4**, then
|
||||
**session 3**, then **session 2**, then **session 1** sub-sections.
|
||||
what's done and what's deferred — read **session 15** first, then
|
||||
**session 14**, then **session 13**, then **session 12**, then
|
||||
**session 11**, then **session 10**, then **session 9**, then **session
|
||||
8**, then **session 7**, then **session 6**, then **session 5**, then
|
||||
**session 4**, then **session 3**, then **session 2**, then **session
|
||||
1** sub-sections.
|
||||
|
||||
This session is a continuation, not a restart. Start by reading the
|
||||
plan doc's status sections.
|
||||
|
||||
### Big new findings from session 14
|
||||
### Big new findings from session 15
|
||||
|
||||
1. **`activateTab` no-retry was the T16 failure mode.** Verified by
|
||||
stashing the migration and re-running T16 against the baseline —
|
||||
same `CodeTab.activate: no AX-tree button with accessibleName="Code"
|
||||
found` failure. The migration converts the pre-click snapshot from
|
||||
one-shot to a `waitForAxNode` poll, with the existing T16 budget
|
||||
(15s through `CodeTab.activate({ timeout })`) covering both the
|
||||
pre-click click-budget and the post-click pill poll. T16 passed
|
||||
3/3 in succession against the migrated form. Strong signal that
|
||||
"convert one-shot AX snapshots to `waitForAxNode` polling" is a
|
||||
high-leverage flake-reduction shape — this is the first migration
|
||||
that demonstrably fixed an existing failure.
|
||||
2. **T17 stays pre-existing-flaky.** T17 exercises the env-pill →
|
||||
Local → Select-folder → Open-folder chain via `openEnvPill` /
|
||||
`selectLocal` / `openFolderPicker`, which use `openPill` and
|
||||
`clickMenuItem` internally. Those weren't migrated this session
|
||||
(their post-click stability gates plus per-spec sleep budgets
|
||||
carry tuning the prompt explicitly cautioned against changing).
|
||||
T17's flake mode is unchanged-by-migration; future sessions can
|
||||
take it if budget tuning data warrants. The `openPill` while-loop
|
||||
on a successful menu render takes 100ms-per-poll-iteration; if the
|
||||
menu hasn't rendered within 5s, it returns `{ opened: false,
|
||||
items: [] }`. Migrating to `waitForAxNode` would flatten the loop
|
||||
shape but doesn't obviously change the outcome, so the migration
|
||||
wasn't worth the budget-tuning risk this session.
|
||||
3. **The debugger-attachment precondition is still binding.**
|
||||
Sessions 9-12 did extensive runtime probing of the per-wc IPC
|
||||
registry against the user's debugger-attached Claude. Without
|
||||
that probing, Categories A / B / C in this prompt are blocked at
|
||||
the smoke-test phase. If the user hasn't clicked Developer →
|
||||
Enable Main Process Debugger before the session starts, port 9229
|
||||
is closed and the categories pivot to either documentation work
|
||||
or further call-site migration. Phase 0 must check `ss -tln |
|
||||
grep ':9229'` (or `curl --max-time 2 http://127.0.0.1:9229/json`)
|
||||
before fanning out.
|
||||
4. **The reframe pool remains essentially exhausted.** Same status
|
||||
as sessions 12-13 — every Tier 1 fingerprint with a tractable
|
||||
runtime sibling has been promoted. The remaining options are now:
|
||||
(a) further call-site migration to `waitForAxNode` for flake
|
||||
reduction (`openPill` / `clickMenuItem` / T26's pre-click
|
||||
`retryUntil` — though T26's needs a `context-was-destroyed`
|
||||
exception swallow), (b) operon-mode navigation probe (still needs
|
||||
debugger), (c) schema-rev for `listRemotePluginsPage` /
|
||||
`listSkillFiles` (still needs debugger), (d) Tier 3 read-only
|
||||
reframes (most need user-account state). Session 14 demonstrated
|
||||
migration can deliver a measurable bug-fix outcome; that
|
||||
continues to be the highest-leverage shape when the debugger is
|
||||
closed.
|
||||
1. **T17 flake was structural, not AX-polling.** The trace showed
|
||||
bare 60s Playwright timeout with NO `renderer-url` attachment —
|
||||
meaning the test never reached line 49's attach call, which
|
||||
means it never resolved `waitForReady('userLoaded')` at line 40.
|
||||
Root cause: T17 was the last spec on the legacy
|
||||
`CLAUDE_TEST_USE_HOST_CONFIG=1` / `isolation: null` shape — every
|
||||
other auth-required spec (T07, T16, T19, T20, T21, T22b, T26,
|
||||
T27, T31b, T33b/c, T35b, T37b, T38b) had moved to `seedFromHost:
|
||||
true`. Without that env var (which CI / orchestration didn't
|
||||
set), T17 fell through to a fresh isolation with no auth, hit
|
||||
`/login`, and `waitForUserLoaded`'s 90s budget got preempted by
|
||||
the 60s spec timeout. **Session 14's hypothesis was wrong** —
|
||||
the AX click chain in `openPill` / `clickMenuItem` was never
|
||||
reached, so migrating those wouldn't have fixed anything.
|
||||
2. **`openPill` / `clickMenuItem` migration parked.** With T17's
|
||||
actual flake explained by the auth-path mismatch, there's no
|
||||
remaining flake-evidence pulling for the AX migration that
|
||||
sessions 14-15 considered. `openPill`'s while-loop and
|
||||
`clickMenuItem`'s while-loop work fine when the auth path is
|
||||
correct. Don't migrate speculatively — wait for a third
|
||||
consumer to surface with budget-tuning evidence.
|
||||
3. **Phase 0 must distinguish "port open" from "port attached to
|
||||
user's signed-in Claude".** Session 14 saw port 9229 closed and
|
||||
correctly classified as debugger-detached. Session 15 saw port
|
||||
9229 OPEN but attached to a leaked test isolation at /login —
|
||||
Categories A/B/C still soft-blocked. The right Phase 0 probe:
|
||||
`evalInMain` listing webContents and checking that AT LEAST one
|
||||
URL is `https://claude.ai/<not /login>`. If every webContents is
|
||||
`/login` or `find_in_page` or `main_window`, treat it the same
|
||||
as port-closed for auth-required investigations. Session 15's
|
||||
one-off probe shape (kept inline in the report, deleted after):
|
||||
|
||||
```ts
|
||||
const wcs = await client.evalInMain(`
|
||||
const { webContents } = process.mainModule.require('electron');
|
||||
return webContents.getAllWebContents().map((w) => ({
|
||||
id: w.id, url: w.getURL(), title: w.getTitle(),
|
||||
}));
|
||||
`);
|
||||
```
|
||||
|
||||
4. **Leaked `/tmp/claude-test-*` dirs accumulating on dev box.**
|
||||
Multiple test isolations from prior sessions have leaked their
|
||||
tmpdirs and (in some cases) their Electron child processes.
|
||||
`ls /tmp/ | grep claude-test` showed several. The session 15
|
||||
T17 spec wasn't run because killing those leaked Electron
|
||||
processes might also kill the user's real running Claude (PID
|
||||
ambiguity from `ps`). A future session can either (a) verify
|
||||
no real Claude is running before invoking T17, or (b) just
|
||||
accept the seedFromHost kill side effect and let the user
|
||||
re-launch Claude after the session.
|
||||
5. **Productivity signal is dimming.** Sessions 13-15 collectively
|
||||
produced one new primitive (`lib/ax.ts`), one substantive AX
|
||||
migration (`activateTab` + `CodeTab.activate`), and one
|
||||
structural fix (T17 seedFromHost). NO coverage gain in those
|
||||
three sessions. The remaining categories without an
|
||||
auth-bearing debugger-attached Claude are mostly exhausted.
|
||||
Next session should prioritise (a) running T17 to verify the
|
||||
seedFromHost fix actually resolves the timeout, and (b) checking
|
||||
whether a Category C schema-rev probe against the leaked /login
|
||||
isolation is tractable (validators don't need auth, only
|
||||
invocation does — worth a 15-min investigation). If both turn
|
||||
up empty, the orchestrator should seriously consider stopping —
|
||||
at 97% coverage with no clear high-leverage shapes left,
|
||||
further sessions are likely to produce documentation-only or
|
||||
marginal-improvement deliverables.
|
||||
|
||||
### Authoritative reference
|
||||
|
||||
Read these in order before fanning out:
|
||||
|
||||
- [`docs/testing/runner-implementation-plan.md`](runner-implementation-plan.md)
|
||||
— tier classification + status section. Read **session 14**, then
|
||||
**session 13**, **session 12**, **session 11**, **session 10**,
|
||||
**session 9**, **session 8**, **session 7**, **session 6**,
|
||||
**session 5**, **session 4**, **session 3**, **session 2**, then
|
||||
**session 1** "Status (post-execution)" sub-sections. The Tier-3
|
||||
list (search for "## Tier 3") is the candidate pool for any further
|
||||
reframes.
|
||||
— tier classification + status section. Read **session 15**, then
|
||||
**session 14**, **session 13**, **session 12**, **session 11**,
|
||||
**session 10**, **session 9**, **session 8**, **session 7**,
|
||||
**session 6**, **session 5**, **session 4**, **session 3**,
|
||||
**session 2**, then **session 1** "Status (post-execution)"
|
||||
sub-sections. The Tier-3 list (search for "## Tier 3") is the
|
||||
candidate pool for any further reframes.
|
||||
- [`tools/test-harness/README.md`](../../tools/test-harness/README.md)
|
||||
— runner conventions, the now-74-spec inventory, primitives in
|
||||
`lib/`, isolation defaults, the CDP-gate workaround, the eipc
|
||||
note, and `lib/ax.ts` substrate (session 13 addition; session 14
|
||||
migrated `activateTab` + `CodeTab.activate`'s post-click pill
|
||||
poll to use it).
|
||||
`lib/`, isolation defaults (T17 now seedFromHost per session 15),
|
||||
the CDP-gate workaround, the eipc note, and `lib/ax.ts` substrate.
|
||||
- [`docs/testing/cases/README.md`](cases/README.md) — case-doc
|
||||
structure and the four anchor scopes.
|
||||
- [`tools/test-harness/src/lib/`](../../tools/test-harness/src/lib/)
|
||||
@@ -123,297 +144,236 @@ Read these in order before fanning out:
|
||||
`waitForEipcChannels` / `invokeEipcChannel` on `lib/eipc.ts`) is
|
||||
unchanged.
|
||||
- [`tools/test-harness/eipc-registry-probe.ts`](../../tools/test-harness/eipc-registry-probe.ts)
|
||||
— the session 7 read-only registry probe. Re-run against a
|
||||
debugger-attached Claude (`Developer → Enable Main Process
|
||||
Debugger` from the menu) to capture the current registry shape.
|
||||
Sessions 11 / 12 used small one-off smoke-tests in the test-
|
||||
harness dir that clone the InspectorClient connection pattern
|
||||
and run N candidate read-sides through M arg shapes; deleted
|
||||
after.
|
||||
— the session 7 read-only registry probe. Re-run against an
|
||||
auth-bearing debugger-attached Claude (`Developer → Enable Main
|
||||
Process Debugger` from the menu, signed-in) to capture the
|
||||
current registry shape.
|
||||
- [`tools/test-harness/src/runners/`](../../tools/test-harness/src/runners/)
|
||||
— every existing spec is a template. Notable session 14
|
||||
— every existing spec is a template. Notable session 15
|
||||
candidates for follow-up:
|
||||
- `T17_folder_picker.spec.ts` — the next test that would benefit
|
||||
from `openPill` / `clickMenuItem` migration. Pre-existing
|
||||
flake; current failure is a 60s timeout in the
|
||||
openEnvPill/selectLocal/openFolderPicker chain.
|
||||
- `T26_routines_page_renders.spec.ts` — has a pre-click
|
||||
`retryUntil` block with `context-was-destroyed` exception
|
||||
handling that could become a `waitForAxNode` call once the
|
||||
primitive grows error-class options.
|
||||
- `T17_folder_picker.spec.ts` — newly migrated to seedFromHost.
|
||||
Run to verify the 60s timeout is gone. If T17 now passes, the
|
||||
structural fix shipped session 15 is verified.
|
||||
- Schema-rev for `listRemotePluginsPage` / `listSkillFiles` —
|
||||
rejection literals can be bundle-grepped without auth, and the
|
||||
validator runs auth-independent if /login state lets us
|
||||
invoke through the renderer-side wrapper. Session 12 found
|
||||
`listRemotePluginsPage` needs `limit: number` at position 0
|
||||
and `listSkillFiles` needs both `pluginId` and `skillName`.
|
||||
- [`docs/testing/cases/*.md`](cases/) — the spec each runner
|
||||
asserts. The **Code anchors:** field tells you exactly where
|
||||
upstream implements the feature.
|
||||
|
||||
### Tests in scope this session
|
||||
|
||||
**Realistic ceiling: ~1 new spec OR one substantive flake-reduction
|
||||
deliverable OR one investigation.** Sessions 9-12 each landed 1-2
|
||||
specs; session 13 landed only a primitive (debugger blocked); session
|
||||
14 landed only a migration (debugger blocked). Coverage at 74/76
|
||||
means the test budget naturally shifts toward either (a) further flake
|
||||
reduction by extending the migration shape, (b) investigation that
|
||||
requires the debugger and was deferred from sessions 12-14, or (c)
|
||||
Tier 3 read-only reframes that the harness can construct from
|
||||
existing `seedFromHost` state.
|
||||
**Realistic ceiling: ~1 verification run OR ~1 schema-rev investigation
|
||||
OR a "stop the orchestration" recommendation.** Sessions 9-12 each
|
||||
landed 1-2 specs; session 13 landed only a primitive (debugger
|
||||
blocked); session 14 landed only a migration (debugger blocked);
|
||||
session 15 landed only a structural fix (debugger soft-blocked).
|
||||
Coverage at 74/76 means the test budget naturally shifts toward
|
||||
verification, low-stakes investigation, or the orchestration
|
||||
termination decision.
|
||||
|
||||
**Phase 0 MUST check the debugger BEFORE picking a category.** Run
|
||||
`ss -tln 2>/dev/null | grep ':9229'` (or
|
||||
`curl --max-time 2 http://127.0.0.1:9229/json`). If port 9229 is not
|
||||
listening, Categories A and C are hard-blocked. Pivot to D or B.
|
||||
**Phase 0 MUST check the debugger-attachment quality, not just port
|
||||
status.** Run `ss -tln 2>/dev/null | grep ':9229'` for port. If open,
|
||||
also run an `evalInMain` probe to enumerate webContents URLs — if no
|
||||
URL is `https://claude.ai/<not /login>`, treat as soft-blocked for
|
||||
auth-required categories. Probe shape (kept inline; delete after):
|
||||
|
||||
#### **PRIORITY: Investigate why T17 stays flaky and decide on a
|
||||
migration-or-fix path.** Session 14's migration fixed T16's pre-
|
||||
existing failure mode. T17 is the next-clearest pre-existing-flaky
|
||||
spec on KDE-W; it shares plumbing with T16 (`CodeTab` → AX-driven
|
||||
clicks) but goes deeper through `openEnvPill` / `selectLocal` /
|
||||
`openFolderPicker`. The session 14 migration does NOT reach into
|
||||
those (they use `openPill` + `clickMenuItem`, both of which carry
|
||||
post-click stability gates and per-iteration sleep loops). The
|
||||
investigation: (1) read T17's failure trace from the most recent
|
||||
session-14 stashed run (under `tools/test-harness/results/local/
|
||||
test-output/T17_folder_picker-T17-—-Folder-picker-opens/`), (2)
|
||||
classify the failure (env-pill probe? Local item? Select-folder
|
||||
pill? Open-folder click?), (3) decide if (a) `openPill` migration
|
||||
to `waitForAxNode` would reach it, or (b) the budget defaults need
|
||||
tuning, or (c) the failure is from something orthogonal to AX
|
||||
polling. If (a), ship the migration. If (b), document the budget
|
||||
mismatch in plan-doc. If (c), defer to a future session with a
|
||||
clearer signal. **If this is what session 15 ships, that's a
|
||||
strictly higher-impact outcome than another Tier 2 / Tier 3 reframe
|
||||
— flake reduction touches every existing AX-using spec.** Doesn't
|
||||
need the debugger.
|
||||
```ts
|
||||
import { InspectorClient } from './src/lib/inspector.js';
|
||||
const client = await InspectorClient.connect(9229);
|
||||
const wcs = await client.evalInMain<unknown>(`
|
||||
const { webContents } = process.mainModule.require('electron');
|
||||
return webContents.getAllWebContents().map((w) => ({
|
||||
id: w.id, url: w.getURL(), title: w.getTitle(),
|
||||
}));
|
||||
`);
|
||||
console.log(wcs); client.close();
|
||||
```
|
||||
|
||||
Three categories — pick ONE as the main bet, treat the others as
|
||||
fallback if the main bet hits an early blocker:
|
||||
If every URL is `/login` or `find_in_page` or `main_window/index.html`,
|
||||
the debugger is attached to a leaked test isolation, not the user's
|
||||
Claude. Categories A and most of B are blocked. Category C may still
|
||||
be tractable since validators run auth-independent — try the schema-
|
||||
rev probe against the /login wrapper.
|
||||
|
||||
#### **PRIORITY: Verify T17's session 15 seedFromHost migration
|
||||
actually resolves the 60s timeout.** Session 15 didn't run T17 because
|
||||
the dev box had ambiguous Electron processes (some leaked test
|
||||
isolations, possibly the user's real Claude — `ps` couldn't
|
||||
disambiguate cleanly). Session 16's first action:
|
||||
|
||||
1. Check `pgrep -af "ozone-platform=x11.*app.asar"` and
|
||||
`ps -o pid,user-data-dir` to identify whether any real-Claude
|
||||
process is running (real Claude has a non-`/tmp/claude-test-*`
|
||||
user-data-dir, typically nothing or `~/.config/Claude`).
|
||||
2. If only test cruft is running, run T17 (`npx playwright test
|
||||
T17 --reporter=list`). The test will kill those leaked
|
||||
processes via `seedFromHost`'s host-Claude-kill semantics —
|
||||
that's actually a desirable cleanup side effect.
|
||||
3. If a real Claude IS running, **flag clearly in the report
|
||||
before running**, then run T17. The user accepted the
|
||||
`seedFromHost` kill side effect when authorising autonomous
|
||||
orchestration; just be transparent about it.
|
||||
4. Capture pass/skip/fail. Update the matrix coverage doc if
|
||||
T17 now passes.
|
||||
5. If T17 still fails, classify the new failure mode (is it now
|
||||
AX-polling? Folder picker chain? Mock not installing?) and
|
||||
decide whether to fix or defer.
|
||||
|
||||
This is **strictly higher-impact than session 14/15's
|
||||
spec-implementation work** because it produces a concrete
|
||||
pass/fail data point that resolves a 2-session-old hypothesis.
|
||||
Doesn't need the debugger.
|
||||
|
||||
Three categories — pick the verification run as the main bet, treat
|
||||
the others as fallback if the main bet hits an early blocker:
|
||||
|
||||
| # | Tests | Source | Notes |
|
||||
|---|---|---|---|
|
||||
| **D** further call-site migration / T17 investigation | T17 / `claudeai.ts` `openPill` + `clickMenuItem` | `lib/ax.ts` (session 13 primitive) | The PRIORITY shape this session. Read T17's failure trace, decide if `openPill` migration would fix it, ship the migration if so. Same shape-only refactor risk as session 14: keep the per-spec retry budgets matching the existing tuning. Doesn't need the debugger. **Risk:** `openPill` and `clickMenuItem` carry post-click stability gates that `waitForAxNode` already covers via `stabilityGate: true`, so the migration shape should slot in cleanly — but each spec's overall budget needs verification. |
|
||||
| **A** operon-mode navigation probe | n/a (investigation) + maybe small Tier 2 reframe | new probe + bundle grep for operon URL routes | Session 10 confirmed `OperonBootstrap.ensure` registers eagerly but the other 21 wrapper-exposed operon interfaces remain registry-unconfirmed. Outputs: either an operon-mode URL form recovered from the bundle (search for `operon`-keyed routes in `claude.ai/...` paths) plus a registry re-probe after navigation, OR a deferral note explaining why operon scope can't be reached without an operon-mode entry. **Needs debugger-attached Claude on port 9229.** |
|
||||
| **B** Tier 3 read-only reframes | Pick from the Tier 3 list | T33c / T35b / T37b template + bundle grep | The Tier 3 list is full of login-required flows; some have read-only entry points that the harness CAN construct. Candidates: T22's `getPrChecks` read-side might accept a non-existent PR number / dry-run mode; T15's OAuth surface has read-only state queries. Most need the user-account-scoped state to fail-fast with a clean error rather than a real network roundtrip — investigate first. **Needs debugger for smoke-test verification.** |
|
||||
| **C** Schema-rev for `listRemotePluginsPage` / `listSkillFiles` | Bundle grep | session 9 schema-rev pattern | Both methods rejected every smoke-tested arg shape during session 12's investigation. `listRemotePluginsPage` needs `limit: number` at position 0 (rejection: `Argument "limit" at position 0 ...`); `listSkillFiles` needs both `pluginId` and `skillName` (rejection: `Argument "skillName" at position 1 ...`). Bundle-grep on the rejection literals → resolve the schema → ship a narrowly-scoped Tier 2 invocation if it unblocks a case-doc claim. **Needs debugger to verify the recovered schema.** |
|
||||
| **D-verify** T17 verification run (PRIORITY) | T17 | session 15 migration | Run T17 against the dev box. If pass, log it. If fail, classify the new failure mode. **Side effect: kills any running Claude (the user's, or leaked test cruft). Flag in the report.** Doesn't need the debugger. |
|
||||
| **C** Schema-rev for `listRemotePluginsPage` / `listSkillFiles` | Bundle grep | session 9 schema-rev pattern | Both methods rejected every smoke-tested arg shape during session 12's investigation. `listRemotePluginsPage` needs `limit: number` at position 0 (rejection: `Argument "limit" at position 0 ...`); `listSkillFiles` needs both `pluginId` and `skillName` (rejection: `Argument "skillName" at position 1 ...`). Bundle-grep on the rejection literals → resolve the schema → ship a narrowly-scoped Tier 2 invocation if it unblocks a case-doc claim. **Tractable against a /login isolation since validators run auth-independent.** |
|
||||
| **STOP** Orchestrator stop recommendation | n/a | session 15 productivity signal | Coverage at 97%, three consecutive non-coverage sessions, remaining categories soft- or hard-blocked. If D-verify and C both produce nothing tractable, formally recommend the orchestrator stop. Documentation-only sessions are still acceptable per the followup termination criteria, but consecutive ones with no improvement signal are noise. |
|
||||
|
||||
If port 9229 is closed, only D is fully tractable. A documentation-
|
||||
only session that audits the existing AX call-sites and proposes a
|
||||
migration plan (without shipping) is also acceptable — pre-work for
|
||||
a future session that DOES land the migration.
|
||||
#### Category D-verify — T17 verification run
|
||||
|
||||
#### Category D — further call-site migration / T17 investigation
|
||||
The plan: run the post-session-15 T17 against the dev box and capture
|
||||
the result. Pass = the structural fix landed correctly. Fail = the
|
||||
hypothesis was incomplete; classify and decide.
|
||||
|
||||
The plan: investigate T17's pre-existing flake, decide on a fix path,
|
||||
ship if a `waitForAxNode`-shaped migration of `openPill` /
|
||||
`clickMenuItem` would reach it.
|
||||
|
||||
1. **Read T17's most recent failure trace.** Either the session-14
|
||||
stashed-baseline trace (under `tools/test-harness/results/local/
|
||||
test-output/T17_folder_picker-T17-—-Folder-picker-opens/`) or run
|
||||
T17 fresh against the post-session-14 form. Classify the failure:
|
||||
- openEnvPill timeout? (would suggest `openPill` migration)
|
||||
- selectLocal timeout? (would suggest `clickMenuItem` migration)
|
||||
- openFolderPicker chain timeout? (suggests deeper issue)
|
||||
- Some other failure?
|
||||
2. **If `openPill` migration would reach the failure**, migrate it.
|
||||
The shape: replace the post-click while-loop with
|
||||
`waitForAxNodes` filtered to MENU_ITEM_ROLES, with the existing
|
||||
`timeout` parameter as `timeoutMs`. Keep the upfront
|
||||
`waitForAxTreeStable` gate or pass `stabilityGate: true` to
|
||||
`waitForAxNodes`. Verify with T17 (or the originally-affected
|
||||
spec).
|
||||
3. **If `clickMenuItem` migration would reach the failure**, same
|
||||
shape. Replace the while-loop with `waitForAxNode` filtered on
|
||||
role + textPattern, with the existing `timeout` as `timeoutMs`.
|
||||
4. **If the failure is orthogonal to AX polling** (e.g. environmental,
|
||||
timing race outside the AX surface, dialog mock not installing),
|
||||
document and defer.
|
||||
1. **Disambiguate running Claude processes.** `pgrep -af
|
||||
"ozone-platform=x11.*app.asar"`; for each, `cat
|
||||
/proc/<pid>/cmdline | tr '\0' '\n' | grep user-data-dir` (or
|
||||
inspect via `ps` cmdline). If only `/tmp/claude-test-*`
|
||||
user-data-dirs, no real Claude is running.
|
||||
2. **Run T17.** `cd tools/test-harness && npx playwright test
|
||||
T17_folder_picker --reporter=list 2>&1 | tee
|
||||
/tmp/t17-session16.log`.
|
||||
3. **Classify.**
|
||||
- Pass: structural fix verified. Update plan-doc / matrix.
|
||||
- Skip with "seedFromHost unavailable": means host has no
|
||||
`~/.config/Claude/Local State`. Should be rare on the dev
|
||||
box but possible if config was wiped between sessions.
|
||||
- Skip with "seeded auth did not reach post-login URL":
|
||||
auth was seeded but stale. User needs to re-sign-in
|
||||
manually. Don't try to reseed automatically.
|
||||
- Fail with NEW failure mode: classify the failure (AX
|
||||
click? openFolderPicker chain? dialog mock?). If it's
|
||||
now in `openPill` / `clickMenuItem`, sessions 14/15's
|
||||
speculation has finally hit; ship the AX migration.
|
||||
Otherwise document and defer.
|
||||
4. **Don't restructure T17's body** unless step 3 surfaces a
|
||||
real new bug. Keep changes scoped to whatever the verification
|
||||
surfaces.
|
||||
|
||||
Doesn't need the debugger.
|
||||
|
||||
#### Category A — operon-mode navigation probe
|
||||
|
||||
The plan: find an operon-mode URL form and verify whether the other
|
||||
21 operon interfaces register lazily.
|
||||
|
||||
1. **Bundle grep for operon URL routes.** Search the bundled
|
||||
`index.js` and `mainView.js` for `operon`-keyed paths (e.g.
|
||||
`/operon/...`, `claude.ai/operon`, etc.). Compile a candidate URL
|
||||
list.
|
||||
2. **Navigate the user's debugger-attached running Claude** to each
|
||||
candidate URL via `inspector.evalInRenderer('claude.ai',
|
||||
"window.location.href = '<URL>'")`. After each navigation, re-run
|
||||
the registry probe and check the operon scope's interface count.
|
||||
3. **If any URL surfaces additional operon handlers**, ship a small
|
||||
Tier 2 reframe spec.
|
||||
4. **If none of the candidate URLs surface additional handlers**,
|
||||
document as "operon scope handlers register lazily on a navigation
|
||||
we can't easily construct from the harness" and defer.
|
||||
|
||||
**Needs debugger-attached Claude on port 9229.**
|
||||
|
||||
#### Category B — Tier 3 read-only reframes
|
||||
|
||||
The plan: identify a Tier 3 spec where a non-destructive read-side
|
||||
is invocable from a fresh `seedFromHost` isolation.
|
||||
|
||||
1. **Read the Tier 3 list** in plan-doc and pick 1-2 candidates with
|
||||
read-side anchors. Most Tier 3 specs are write-side flows (T15
|
||||
OAuth, T22 PR write, T27 scheduling write, T29 worktree creation,
|
||||
T34 OAuth, T36 hooks-fire-on-prompt-submit) — those are out of
|
||||
scope. The exceptions are read-side anchors that just need
|
||||
user-account-scoped data to assert against.
|
||||
2. **Smoke-test the candidate read-side** with various arg shapes.
|
||||
3. **Ship a Tier 2 reframe** if the read-side resolves cleanly.
|
||||
4. **Defer** if every candidate requires real account state to assert
|
||||
meaningfully.
|
||||
|
||||
**Needs debugger for smoke-test verification.**
|
||||
|
||||
#### Category C — Schema-rev for rejecting read-sides
|
||||
|
||||
The plan: resolve the validator schema for `listRemotePluginsPage` /
|
||||
`listSkillFiles` via bundle grep, ship invocations if either unblocks
|
||||
a case-doc claim.
|
||||
a case-doc claim. Tractable against a /login isolation since
|
||||
validators run auth-independent.
|
||||
|
||||
1. **Grep on the rejection literal** in the bundled `index.js`.
|
||||
Validator block sits ~50-200 chars before the throw site (session
|
||||
9 finding). Read ~2KB around the hit to surface the full schema.
|
||||
2. **Smoke-test the recovered schema** against the user's debugger-
|
||||
attached running Claude.
|
||||
attached running Claude (or, if auth-soft-blocked as in session 15,
|
||||
against the /login isolation — validators run regardless of auth).
|
||||
3. **Connect the resolved invocation to a case-doc claim.**
|
||||
4. **Ship a Tier 2 invocation** if a case-doc claim is unblocked.
|
||||
|
||||
**Needs debugger to verify the recovered schema.**
|
||||
Auth-independent for the validator; auth-bearing for any handler that
|
||||
actually returns plugin / skill data. If the validator resolves but
|
||||
the handler fails on auth, document the schema in plan-doc as a
|
||||
deferred reframe and move on.
|
||||
|
||||
#### Cross-compositor focus-shifter expansion (NOT recommended this session)
|
||||
#### STOP recommendation
|
||||
|
||||
Building `lib/input-sway.ts` / `lib/input-hypr.ts` would mirror
|
||||
`lib/input-niri.ts`'s shape but no consumer is asking for them.
|
||||
Premature abstractions are wrong abstractions. Wait for a real
|
||||
consumer.
|
||||
|
||||
#### Main-side `invokeEipcChannel` fallback (NOT recommended this session)
|
||||
|
||||
Same status as sessions 8-14 — wait for a real consumer.
|
||||
|
||||
#### Launch event-subscription primitive (NOT recommended this session)
|
||||
|
||||
Same status as sessions 11-14 — wait for a real consumer.
|
||||
|
||||
#### `waitForRenderedSurface` registry (NOT recommended this session)
|
||||
|
||||
Session 13's `lib/ax.ts` deliberately did NOT ship a named-surface
|
||||
registry; promote when a third consumer crystallizes with a specific
|
||||
surface name in mind.
|
||||
|
||||
#### CSS-querySelector primitive (NOT recommended this session)
|
||||
|
||||
Session 13's `lib/ax.ts` covers AX-tree consumers only. T07's CSS-
|
||||
querySelector poll for the topbar is a different abstraction (DOM,
|
||||
not AX). Wait for a second consumer before extracting.
|
||||
If D-verify resolves cleanly (pass or stable skip) and C produces no
|
||||
shippable spec after the schema-rev investigation, the productivity
|
||||
signal for further sessions is squarely "documentation-only with no
|
||||
clear next-step deliverable." The orchestrator should stop. State
|
||||
this plainly in the final report; don't keep cycling.
|
||||
|
||||
### Constraints to respect (don't violate)
|
||||
|
||||
These are unchanged from sessions 1-14 and still load-bearing:
|
||||
These are unchanged from sessions 1-15 and still load-bearing:
|
||||
|
||||
- **Default isolation** unless the spec needs otherwise. Use
|
||||
`seedFromHost: true` for any test that depends on authenticated
|
||||
renderer state — never assume default isolation gets past
|
||||
`/login`. T07/T11_runtime/T16/T17/T19/T20/T21/T26/T22b/T27/T31b/T33b/T33c/T35b/T37b/T38b
|
||||
are the templates.
|
||||
are the templates. **T17 was migrated to this shape in session 15.**
|
||||
- **eipc handlers register on `webContents.ipc._invokeHandlers`,
|
||||
NOT global `ipcMain._invokeHandlers`.** Session 7 finding. Use
|
||||
`lib/eipc.ts` rather than rolling a new walker. The framing
|
||||
prefix `$eipc_message$_<UUID>_$_` should stay opaque to consumers
|
||||
(UUID has been stable but `lib/eipc.ts` doesn't pin it — match
|
||||
by case-doc-anchored suffix).
|
||||
`lib/eipc.ts` rather than rolling a new walker.
|
||||
- **eipc invocation goes through the renderer-side wrapper at
|
||||
`window['claude.<scope>'].<Iface>.<method>`.** Session 8 finding.
|
||||
Use `lib/eipc.ts`'s `invokeEipcChannel` rather than rolling
|
||||
main-side direct calls.
|
||||
- **For arg validator schema-rev: try smoke-test first, then grep
|
||||
the rejection message literal.** Session 9 finding. Trivial
|
||||
validators (`typeof === 'string'` / similar) resolve in one
|
||||
round-trip. Elaborate validators get the bundle-grep treatment.
|
||||
- **For session-scoped Tier 2 reframes: `LocalSessions/getAll` is
|
||||
the foundational read-side surrogate.** Session 10 finding.
|
||||
- **For Tier 2 reframes with case-doc-anchored read-side handlers:
|
||||
invoke the case-doc-anchored handlers directly.** Session 11
|
||||
finding. Mixed-shape dual invocation is fine.
|
||||
- **For Tier 2 reframes spanning two interfaces: invoke a read-side
|
||||
from each.** Session 12 finding (T11_runtime template).
|
||||
the rejection message literal.** Session 9 finding.
|
||||
- **For AX-tree consumers: use `lib/ax.ts`.** Session 13 finding.
|
||||
`snapshotAx` for one-shot reads, `waitForAxNode` /
|
||||
`waitForAxNodes` for predicate-based polling. Don't reach into
|
||||
`explore/walker.ts` directly — re-exports go through `lib/ax.ts`.
|
||||
Consumers in session 14: `lib/claudeai.ts`'s `activateTab` +
|
||||
`CodeTab.activate` post-click pill poll (migrated from one-shot
|
||||
/ hand-rolled retryUntil), plus T26.
|
||||
`waitForAxNodes` for predicate-based polling.
|
||||
- **For call-site migrations to `waitForAxNode`: keep the per-spec
|
||||
retry budgets matching the existing tuning.** Session 14
|
||||
finding. The defaults in `lib/ax.ts` (`timeoutMs: 5000`,
|
||||
`intervalMs: 200`) are reasonable starting values, but any
|
||||
caller with a known per-spec budget should pass it through. The
|
||||
one acceptable bug-fix during migration is when the existing
|
||||
call-site had NO retry at all (e.g. `activateTab`'s pre-click
|
||||
one-shot snapshot) — adding a budget is the fix the migration
|
||||
delivers, and the prompt explicitly authorized it.
|
||||
finding. Migration is shape-only EXCEPT when the call-site has
|
||||
NO retry at all — adding a budget is the bug-fix the migration
|
||||
delivers.
|
||||
- **For test specs that depend on host auth: use `seedFromHost:
|
||||
true`.** Session 15 finding. The legacy `CLAUDE_TEST_USE_HOST_CONFIG=1`
|
||||
/ `isolation: null` shape collides with Playwright's 60s spec
|
||||
timeout when the env var isn't set; `seedFromHost` gives a clean
|
||||
skip-or-pass shape. T17 was the last spec on the legacy shape.
|
||||
- **`lib/input.ts` is X11-only.** Strict gate.
|
||||
- **`lib/input-niri.ts` is Niri-only.** Strict gate.
|
||||
- **Don't speculate on `lib/input-wayland.ts` dispatcher.**
|
||||
- **Code-tab AX anchors stay in plan-doc until a consumer needs
|
||||
them.**
|
||||
- **CDP auth gate is alive** — runtime SIGUSR1 attach via
|
||||
`app.attachInspector()`, never Playwright's `_electron.launch()`
|
||||
or `chromium.connectOverCDP()`.
|
||||
- **BrowserWindow Proxy gotcha** — use
|
||||
`webContents.getAllWebContents()` not
|
||||
`BrowserWindow.getAllWindows()`. Constructor-level wraps don't
|
||||
work; use prototype-method hooks.
|
||||
`BrowserWindow.getAllWindows()`.
|
||||
- **`skipUnlessRow()` always first.**
|
||||
- **No fixed sleeps.** `retryUntil` from `lib/retry.ts`, or
|
||||
Playwright auto-wait, or `waitForAxNode` from `lib/ax.ts`.
|
||||
(Exception: short sleeps inside hand-rolled retry loops that
|
||||
catch typed errors and short-circuit; see S11 / S14.)
|
||||
- **Diagnostics on every run.** `testInfo.attach()` the artefacts.
|
||||
- **Tag with annotations.** `severity:` and `surface:` on every
|
||||
test so JUnit carries them through to matrix-regen.
|
||||
- **Tabs in TS, ~80-char wrap as the existing files do.**
|
||||
- **Don't break existing runners.** `npm run typecheck` must stay
|
||||
clean. H01-H05 are the canaries; `npm test` must still pass them
|
||||
after every commit. Note that T17/T07/S25/S29-S31/S04 etc.
|
||||
are pre-existing-flaky on KDE-W per session 13's full-suite run
|
||||
(T16 fixed by session 14) — they're NOT canaries; baseline
|
||||
failures don't block work.
|
||||
after every commit. Note that T07 / S25 / S29-S31 / S04 etc.
|
||||
may be pre-existing-flaky on KDE-W — they're NOT canaries;
|
||||
baseline failures don't block work.
|
||||
- **Always grep the installed asar** to verify a fingerprint
|
||||
string is present.
|
||||
- **For mock-then-call: the helper goes in
|
||||
`lib/electron-mocks.ts`.**
|
||||
- **Marker windows / sacrificial host processes always die in
|
||||
`finally`.**
|
||||
- **Never log handler response BODIES into JUnit.**
|
||||
|
||||
### Phases
|
||||
|
||||
#### Phase 0 — calibration
|
||||
|
||||
1. `cd tools/test-harness && npm run typecheck` — should pass.
|
||||
2. **Check debugger:** `ss -tln 2>/dev/null | grep ':9229'` (or
|
||||
`curl --max-time 2 http://127.0.0.1:9229/json`). If port 9229 is
|
||||
open, A / B / C are tractable; if closed, pivot to D or
|
||||
documentation-only.
|
||||
3. Read the plan doc's "Status (post-execution)" session 14 section,
|
||||
then read `lib/ax.ts`'s API + `lib/claudeai.ts`'s post-session-14
|
||||
migration shape. Confirm you understand the `waitForAxNode` /
|
||||
`waitForAxNodes` consumer pattern.
|
||||
4. Pick ONE Category as the main bet:
|
||||
- **D** (PRIORITY when debugger is closed): read T17's failure
|
||||
trace; classify the failure; decide if `openPill` /
|
||||
`clickMenuItem` migration would reach it.
|
||||
- **A**: bundle grep + per-URL navigation + registry re-probe.
|
||||
- **B**: pick a Tier 3 candidate, smoke-test the read-side, decide
|
||||
ship or defer.
|
||||
- **C**: bundle grep on rejection literals, schema-rev, smoke-test
|
||||
the resolved shape, decide ship or defer.
|
||||
2. **Check debugger ATTACHMENT QUALITY (not just port).** First
|
||||
`ss -tln 2>/dev/null | grep ':9229'`. If port open, also probe
|
||||
webContents via `evalInMain` (see "Big new findings" §3 for
|
||||
the probe shape). If every URL is `/login` /
|
||||
`find_in_page` / `main_window`, treat as soft-blocked.
|
||||
3. **Disambiguate running Claude processes.** Required before any
|
||||
`seedFromHost` spec. `pgrep -af "ozone-platform=x11.*app.asar"`
|
||||
+ cmdline inspection for user-data-dir.
|
||||
4. Read the plan doc's "Status (post-execution)" session 15 section,
|
||||
then read T17's session-15 form and the seedFromHost convention.
|
||||
5. Pick the main bet:
|
||||
- **D-verify** (PRIORITY): run T17, classify the result.
|
||||
- **C**: bundle grep on rejection literals, schema-rev,
|
||||
smoke-test the resolved shape against the /login isolation.
|
||||
- **STOP**: if both above produce nothing tractable, recommend
|
||||
stopping the orchestration.
|
||||
|
||||
If Phase 0 surfaces a problem (typecheck failing, primitives unclear,
|
||||
the chosen Category's prerequisites don't hold), stop and report.
|
||||
@@ -421,31 +381,24 @@ Don't fan out.
|
||||
|
||||
#### Phase 1 — fan-out batch
|
||||
|
||||
For Category D (further migration / T17 investigation):
|
||||
- Single subagent reads T17's trace, classifies, ships the migration
|
||||
if applicable. Verify by running T16 / T17 / T26 / H05.
|
||||
|
||||
For Category A (operon investigation):
|
||||
- Single subagent does bundle-grep for operon URL routes + per-URL
|
||||
registry re-probe. Report findings; if a Tier 2 reframe is
|
||||
tractable, ship one spec.
|
||||
|
||||
For Category B (Tier 3 read-only reframes):
|
||||
- Spawn ONE subagent for the candidate read-side investigation
|
||||
(smoke-test + bundle-grep if needed).
|
||||
For Category D-verify (T17 run):
|
||||
- Single subagent (or do directly — it's a single-command run +
|
||||
trace inspection) runs T17 and classifies. Verify by checking
|
||||
pass/skip/fail and any new failure-mode trace.
|
||||
|
||||
For Category C (schema-rev):
|
||||
- Single subagent does bundle-grep on the rejection literals,
|
||||
surfaces the validator schemas, smoke-tests the recovered shapes
|
||||
against the user's debugger-attached running Claude.
|
||||
against the user's debugger-attached running Claude (or /login
|
||||
isolation if soft-blocked).
|
||||
|
||||
Cap at ~1 spec OR ~1 primitive migration total — same scope as
|
||||
sessions 9-14.
|
||||
Cap at ~1 spec OR ~1 verification + 1 schema-rev — same scope as
|
||||
sessions 9-15.
|
||||
|
||||
#### Per-subagent prompt shape
|
||||
|
||||
```
|
||||
You're implementing ONE [test-harness runner | primitive migration |
|
||||
You're implementing ONE [verification run | primitive migration |
|
||||
investigation] for <TARGET>.
|
||||
|
||||
Read in order:
|
||||
@@ -454,15 +407,11 @@ Read in order:
|
||||
the most-recent-template that fits)
|
||||
- tools/test-harness/src/runners/<closest-template>.spec.ts
|
||||
- tools/test-harness/src/lib/ (the primitives you'll reuse —
|
||||
including session 13's `lib/ax.ts` and session 14's migration
|
||||
examples in `lib/claudeai.ts`)
|
||||
including session 13's `lib/ax.ts` and session 15's seedFromHost
|
||||
T17 migration)
|
||||
- CLAUDE.md (project conventions)
|
||||
|
||||
Write tools/test-harness/src/runners/<TARGET>_short_name.spec.ts
|
||||
[ AND/OR tools/test-harness/src/lib/<NEW-PRIMITIVE>.ts
|
||||
AND/OR edits to tools/test-harness/src/lib/claudeai.ts ].
|
||||
|
||||
[per-task specifics: pattern (seedFromHost / mock-then-call /
|
||||
[per-task specifics: pattern (verification run / mock-then-call /
|
||||
asar fingerprint / shared isolation / new-primitive-build /
|
||||
investigation / call-site migration), assertion shape, skip rules,
|
||||
key constraint warnings]
|
||||
@@ -481,17 +430,15 @@ Constraints:
|
||||
If the target isn't reasonable to implement (anchors don't resolve
|
||||
to anything assertable, the test depends on state you can't
|
||||
construct, the existing primitives don't cover the surface), DO
|
||||
NOT write a stub. Report under Open questions and stop. Sessions
|
||||
1-14 had cumulative ~17 "stop and report" outcomes that were the
|
||||
right call.
|
||||
NOT write a stub. Report under Open questions and stop.
|
||||
|
||||
Report shape (~150 words):
|
||||
## <TARGET> [runner | primitive | investigation | migration]
|
||||
## <TARGET> [verification | primitive | investigation | migration]
|
||||
|
||||
- File written: tools/test-harness/src/runners/<filename>.spec.ts
|
||||
[or lib/<newfile>.ts or modified lib/<existing>.ts]
|
||||
- Layer: file probe | argv probe | L1 | L2 (xprop) | L2 (DBus) |
|
||||
pgrep | new-primitive | investigation | migration
|
||||
pgrep | new-primitive | investigation | migration | verification
|
||||
- Assertion shape (or migration shape): <one sentence>
|
||||
- Skip rules: <which rows + why>
|
||||
- Verification path: <typecheck + run result>
|
||||
@@ -525,7 +472,7 @@ After fan-out returns:
|
||||
|
||||
### Self-correction loop
|
||||
|
||||
Same as sessions 1-14:
|
||||
Same as sessions 1-15:
|
||||
|
||||
1. Subagent typecheck failure → re-spawn with explicit fix
|
||||
instruction.
|
||||
@@ -538,12 +485,11 @@ Same as sessions 1-14:
|
||||
examine the assertion shape.
|
||||
5. Migration breaks an existing spec → roll back the migration; the
|
||||
per-spec retry budget was load-bearing and the primitive
|
||||
defaults didn't match. Document the budget mismatch in plan-doc.
|
||||
6. **Carry-over from session 5/6/7/8/9/10/11/12/13/14:** If the
|
||||
chosen Category's investigation doesn't resolve / requires
|
||||
schema-rev that exceeds budget after 2-3 approaches, STOP. Don't
|
||||
keep digging — pivot to a fallback Category. Document what was
|
||||
tried.
|
||||
defaults didn't match.
|
||||
6. **Carry-over from sessions 5-15:** If the chosen Category's
|
||||
investigation doesn't resolve / requires schema-rev that exceeds
|
||||
budget after 2-3 approaches, STOP. Don't keep digging — pivot
|
||||
to a fallback Category. Document what was tried.
|
||||
7. **Carry-over from session 10:** If a registration probe surfaces
|
||||
"registered but uninvocable", document and defer rather than
|
||||
building the main-side fallback speculatively.
|
||||
@@ -562,29 +508,26 @@ Stop and write the final report when one of:
|
||||
3. **Discovered a primitive gap that breaks 5+ Tier 2/Tier 3
|
||||
tests.** Stop, propose where the new primitive should live in
|
||||
`lib/`. Future session adds the primitive first, then resumes.
|
||||
4. **Session budget hits ~1 new spec OR one new primitive
|
||||
landing OR one substantive call-site migration.** Stop,
|
||||
synthesize, leave the rest for the next session.
|
||||
5. **All categories blocked after 2-3 attempts each.** Document the
|
||||
findings as plan-doc additions and stop — coverage is at 97%, a
|
||||
no-spec session that surfaces deferral notes is fine.
|
||||
4. **Session budget hits ~1 verification + 1 schema-rev landing.**
|
||||
Stop, synthesize, leave the rest for the next session.
|
||||
5. **All categories blocked / unproductive after 2-3 attempts
|
||||
each.** Document the findings as plan-doc additions, **and
|
||||
recommend the orchestrator stop the campaign** — coverage at
|
||||
97%, three+ consecutive non-coverage sessions, dimming
|
||||
productivity signal.
|
||||
|
||||
### What you should NOT do
|
||||
|
||||
- **Don't try to land Category D + A + B + C in one batch.** Pick
|
||||
ONE as the main bet.
|
||||
- **Don't try to land D-verify + C in one batch.** Pick D-verify
|
||||
first; if that resolves cleanly, take C as a stretch goal.
|
||||
- **Don't ship stubs.** If a runner can't actually assert what the
|
||||
spec says, mark it as Tier 3 / blocked / primitive-gap and
|
||||
don't write a placeholder.
|
||||
- **Don't break existing runners.** H01-H05 are the canaries.
|
||||
T17 / T07 / S25 / S29-S31 are pre-existing-flaky on KDE-W
|
||||
per session 13's full-suite run (T16 fixed by session 14) —
|
||||
those are NOT canaries.
|
||||
- **Don't restructure `lib/`** beyond targeted additions.
|
||||
Premature abstractions are wrong abstractions.
|
||||
- **Don't run destructive Tier 3 tests** that write to the user's
|
||||
real claude.ai account (T22 PR write, T27 scheduling write, T29
|
||||
worktree creation, T34 OAuth, T36 hooks-fire-on-prompt-submit).
|
||||
real claude.ai account.
|
||||
- **Don't introspect `ipcMain._invokeHandlers` for `claude.web`
|
||||
eipc channels.** Use `lib/eipc.ts`.
|
||||
- **Don't call `invokeEipcChannel` for write-side handlers.**
|
||||
@@ -602,25 +545,28 @@ Stop and write the final report when one of:
|
||||
- **Don't add a `waitForRenderedSurface(client, surfaceKey)`
|
||||
registry to `lib/ax.ts`.** Session 13 deliberately deferred
|
||||
this — wait for a third consumer with a specific named surface.
|
||||
- **Don't change the existing per-spec retry budgets when migrating
|
||||
to `waitForAxNode`.** The budgets are tuned. Migration is shape-
|
||||
only — except when the call-site has NO retry at all (the
|
||||
session-14-authorized bug-fix shape).
|
||||
- **Don't migrate `openPill` / `clickMenuItem` to `waitForAxNode`
|
||||
speculatively.** Session 15 confirmed T17's flake didn't need
|
||||
it; without a third consumer signal, it's premature optimisation.
|
||||
- **Don't reach into `explore/walker.ts` for AX types/helpers.**
|
||||
`lib/ax.ts` re-exports `RawElement` / `AxNode` /
|
||||
`axTreeToSnapshot` / `waitForAxTreeStable` — use those.
|
||||
`lib/ax.ts` re-exports — use those.
|
||||
- **Don't implement the #569 power-inhibit patch in this
|
||||
session.** That's a separate workstream.
|
||||
- **Don't keep cycling on documentation-only sessions.** If
|
||||
D-verify and C both turn up empty, formally recommend the
|
||||
orchestrator stop the campaign rather than burning another
|
||||
session of compute on marginal output.
|
||||
|
||||
### Final report format
|
||||
|
||||
```markdown
|
||||
## Runner implementation summary (session 15)
|
||||
## Runner implementation summary (session 16)
|
||||
|
||||
- Main-bet category: D | A | B | C
|
||||
- Main-bet category: D-verify | C | STOP
|
||||
- Specs landed: N
|
||||
- Migrations completed: N
|
||||
- Primitives landed: N
|
||||
- Verifications run: N
|
||||
- Reclassified mid-flight: N (with reasons)
|
||||
- Coverage: was 74/76 (97%), now <NEW>/76 (<PCT>%)
|
||||
- Typecheck: clean | <errors>
|
||||
@@ -630,7 +576,7 @@ Stop and write the final report when one of:
|
||||
|
||||
| Cat | Test ID | File | Assertion shape | Status |
|
||||
|---|---|---|---|---|
|
||||
| D | <call-site> | <file>.ts | … | ✓ pass / skip / fail |
|
||||
| D-verify | T17 | T17_folder_picker.spec.ts | … | ✓ pass / skip / fail |
|
||||
| ... |
|
||||
|
||||
## Notable findings
|
||||
@@ -639,6 +585,9 @@ Stop and write the final report when one of:
|
||||
## Open questions
|
||||
- ...
|
||||
|
||||
## Stop recommendation
|
||||
- Yes / no, with rationale.
|
||||
|
||||
## Files touched
|
||||
git status output.
|
||||
|
||||
@@ -659,12 +608,8 @@ git diff --stat
|
||||
Connects to a debugger-attached running Claude on port 9229.
|
||||
- For seedFromHost specs, the host MUST have a signed-in Claude
|
||||
Desktop. The primitive throws with a clear message if not.
|
||||
- For tests that touch the AX tree, **`lib/ax.ts`** is the new
|
||||
shared substrate. `claudeai.ts` page-objects are still the
|
||||
right substrate for renderer-UI domain operations (CodeTab,
|
||||
compact pills, menu items) — they consume `lib/ax.ts`
|
||||
internally. Don't query DOM by CSS selector unless `claudeai.ts`
|
||||
doesn't already cover the surface.
|
||||
- For tests that touch the AX tree, **`lib/ax.ts`** is the shared
|
||||
substrate.
|
||||
- For mock-then-call: helpers live in `lib/electron-mocks.ts`.
|
||||
- For focus-shifting (X11 only): `lib/input.ts` exports
|
||||
`focusOtherWindow` + `spawnMarkerWindow`.
|
||||
@@ -685,14 +630,13 @@ git diff --stat
|
||||
finding):** invoke a read-side from each impl object.
|
||||
- **For AX-tree polling (session 13 finding):** `lib/ax.ts`'s
|
||||
`waitForAxNode` / `waitForAxNodes` for predicate-based polling.
|
||||
`snapshotAx` for one-shot reads. Re-exports keep
|
||||
`explore/walker.ts` types accessible without crossing the
|
||||
lib/explore boundary.
|
||||
- **For call-site migrations to `waitForAxNode` (session 14
|
||||
finding):** keep per-spec retry budgets matching the existing
|
||||
tuning. Migration is shape-only EXCEPT when the call-site had
|
||||
NO retry at all — adding a budget is the bug-fix the migration
|
||||
delivers.
|
||||
tuning.
|
||||
- **For auth-required spec migrations (session 15 finding):**
|
||||
use `seedFromHost: true`, NOT `CLAUDE_TEST_USE_HOST_CONFIG=1` /
|
||||
`isolation: null`. The legacy shape collides with Playwright's
|
||||
60s spec timeout.
|
||||
- **For asar fingerprints: ALWAYS grep the installed asar
|
||||
first.** Build-reference is beautified; the bundle is
|
||||
minified.
|
||||
|
||||
@@ -18,6 +18,135 @@ work begins.
|
||||
|
||||
## Status (post-execution)
|
||||
|
||||
**Shipped session 15 (1 structural fix, no new spec, no AX migration):**
|
||||
T17 migrated from the legacy `CLAUDE_TEST_USE_HOST_CONFIG=1` /
|
||||
`isolation: null` auth path to the canonical `seedFromHost: true`
|
||||
pattern (mirroring T16 / T26). Phase 0 calibration found port 9229
|
||||
listening BUT the attached process was a leaked test isolation
|
||||
(claude.ai loaded at `/login`, NOT the user's auth-bearing Claude),
|
||||
which made Categories A (operon-mode probe) / B (Tier 3 read-only
|
||||
reframes) / C (schema-rev) all soft-blocked: the debugger was
|
||||
technically attached, but to the wrong process for any auth-required
|
||||
investigation. Session 15 pivoted to investigating T17's pre-existing
|
||||
flake (the PRIORITY directive in the followup) and discovered the
|
||||
failure was structural rather than AX-polling-related.
|
||||
|
||||
**T17 flake root cause (session 15 finding):** The trace shows a
|
||||
bare 60s Playwright spec timeout with NO `renderer-url` attachment
|
||||
fired. That attachment lives at line 49 of the pre-migration spec —
|
||||
which means the test never reached line 40's `waitForReady(
|
||||
'userLoaded')` resolution. Session 14's hypothesis that T17's flake
|
||||
was an `openPill` / `clickMenuItem` issue was wrong: the failure is
|
||||
upstream of the AX click chain. The spec was running with
|
||||
`isolation: undefined` (the no-`CLAUDE_TEST_USE_HOST_CONFIG` branch),
|
||||
which produces a fresh isolation with no auth tokens, claude.ai
|
||||
redirects to `/login`, and `waitForUserLoaded` polls for its full 90s
|
||||
budget — but Playwright's spec timeout is 60s (per
|
||||
`playwright.config.ts`). The 30s incompatibility produces the bare
|
||||
"Test timeout of 60000ms exceeded" with no test-body trace events.
|
||||
The fix is to align T17 with T16 / T26's shape: `seedFromHost: true`
|
||||
copies the host's auth into the per-test isolation, hits a clean
|
||||
`postLoginUrl` resolution, and skips cleanly when no signed-in host is
|
||||
available (rather than hanging until the spec timeout preempts).
|
||||
|
||||
Coverage stays at 74/76 (97%) — structural fix, no spec landed. The
|
||||
matrix coverage doesn't reflect spec-shape migrations; this shows up
|
||||
as a real productivity gain (T17 should now succeed when host is
|
||||
signed in, rather than auto-failing with a 60s timeout regardless).
|
||||
|
||||
Two commits on `docs/compat-matrix` expected (the orchestration
|
||||
directive supersedes "the user reviews and commits" — autonomous
|
||||
commit + push at end of session):
|
||||
|
||||
- TBD — `test(harness): session 15 migrate T17 to seedFromHost +
|
||||
prune unused RawElement import (no spec, coverage unchanged at
|
||||
97%)` (T17 spec rewrite swapping the `CLAUDE_TEST_USE_HOST_CONFIG`
|
||||
+ `isolation: null` branch for the canonical `seedFromHost: true`
|
||||
pattern; prunes unused `RawElement` re-export import in
|
||||
`lib/claudeai.ts` per session 14's leftover hint; typecheck clean;
|
||||
T17 not run this session because the dev box's running processes
|
||||
ambiguously include leaked test isolations and possibly the user's
|
||||
real Claude — `seedFromHost` would kill both, deferred to next
|
||||
session for verification).
|
||||
- TBD — `docs(testing): session 15 plan/inventory + rotate session 16
|
||||
prompt`.
|
||||
|
||||
Session 15 findings + reclassifications:
|
||||
|
||||
- **T17 flake reclassified from "AX-polling tuning" to "auth path
|
||||
not seeded".** Session 14's followup hypothesised the flake lived
|
||||
in `openPill` / `clickMenuItem` post-click loops; the trace
|
||||
evidence rules that out. The Playwright spec timeout (60s) is
|
||||
shorter than `waitForReady('userLoaded')`'s default budget (90s),
|
||||
so any unauth'd test that polls userLoaded will fail with a bare
|
||||
timeout regardless of what the AX code does. T17 was the last
|
||||
spec on the legacy `CLAUDE_TEST_USE_HOST_CONFIG=1` / `isolation:
|
||||
null` shape — every other auth-required spec (T07, T16, T19,
|
||||
T20, T21, T22b, T26, T27, T31b, T33b/c, T35b, T37b, T38b) had
|
||||
already moved to `seedFromHost: true`. T17 was an outlier, and
|
||||
the outlier-ness was the flake.
|
||||
- **`openPill` / `clickMenuItem` migration NOT shipped.** Session
|
||||
14's followup proposed migrating these to `waitForAxNode` /
|
||||
`waitForAxNodes`. With T17's actual failure mode resolved by
|
||||
the structural fix, there's no remaining flake-evidence pulling
|
||||
for that migration. `openPill`'s while-loop and
|
||||
`clickMenuItem`'s while-loop both work fine when the auth path
|
||||
is correct; speculatively migrating them now would be premature
|
||||
optimisation. Future sessions can take it if a third consumer
|
||||
surfaces with budget-tuning evidence.
|
||||
- **Unused `RawElement` import pruned.** Session 14 left
|
||||
`import type { RawElement }` in `lib/claudeai.ts`'s
|
||||
destructured `./ax.js` import after the migration didn't end up
|
||||
needing the type re-export. Pruned in session 15 alongside the
|
||||
T17 migration (one commit, two related shape fixes).
|
||||
- **Debugger-attached process is a leaked test isolation.** The
|
||||
port-9229 listener pointed at a process whose webContents listed
|
||||
three URLs: `find_in_page.html`, `https://claude.ai/login`, and
|
||||
`main_window/index.html`. NOT the user's signed-in Claude. The
|
||||
user-data-dir on those processes was `/tmp/claude-test-*`,
|
||||
confirming they're leaked from prior test runs. There are
|
||||
multiple `/tmp/claude-test-*` dirs accumulating on the dev box
|
||||
(visible via `ls /tmp/`). Future sessions: Phase 0 calibration
|
||||
should distinguish "port 9229 is open" from "port 9229 is open
|
||||
AND attached to the user's auth-bearing Claude". Probe via
|
||||
`evalInMain` listing webContents — if every URL is `/login`,
|
||||
the auth-required investigations (Categories A/B/C) are blocked
|
||||
same as if the debugger were closed.
|
||||
- **No primitive change, no AX migration.** `lib/ax.ts` and the
|
||||
session 14 migration shape are unchanged. The change was a
|
||||
spec-level structural fix, not a substrate or page-object
|
||||
change.
|
||||
|
||||
Tier 2 → Tier 2 candidates remaining for next session: same as
|
||||
sessions 12-14 — operon-mode navigation probe (still needs an
|
||||
auth-bearing debugger-attached Claude), schema-rev for
|
||||
`listRemotePluginsPage` / `listSkillFiles` (might be tractable
|
||||
against the leaked-isolation /login process since validators run
|
||||
auth-independent — investigate), Tier 3 read-only reframes
|
||||
(login-required). The `openPill` / `clickMenuItem` migration is
|
||||
parked: session 15 confirmed T17's flake didn't need it, and no
|
||||
other consumer is signalling for it. Coverage at 74/76 (97%) with
|
||||
the test budget naturally cycling through low-impact deliverables
|
||||
unless a true coverage opportunity surfaces.
|
||||
|
||||
**Productivity signal for next session.** Session 15 fixed a
|
||||
real T17 failure mode (structural). Sessions 13-15 collectively
|
||||
have produced one new primitive (`lib/ax.ts`), one substantive
|
||||
migration (`activateTab` + `CodeTab.activate`), one structural
|
||||
fix (T17 seedFromHost). NO coverage gain in those three sessions.
|
||||
The remaining categories without a debugger that hits the user's
|
||||
signed-in process are mostly exhausted. Next session should
|
||||
prioritise (a) running T17 to verify the seedFromHost fix actually
|
||||
resolves the 60s timeout, and (b) checking whether a Category C
|
||||
schema-rev probe against the leaked /login isolation is tractable
|
||||
(validators don't need auth, only invocation does — worth a 15-min
|
||||
investigation). If both turn up empty, the orchestrator should
|
||||
seriously consider stopping — at 97% coverage with no clear
|
||||
high-leverage shapes left, further sessions are likely to produce
|
||||
documentation-only or marginal-improvement deliverables.
|
||||
|
||||
---
|
||||
|
||||
**Shipped session 14 (1 call-site migration, no new spec):**
|
||||
`activateTab` and `CodeTab.activate` in `lib/claudeai.ts` migrated
|
||||
from hand-rolled retry loops to session 13's `lib/ax.ts` substrate.
|
||||
|
||||
Reference in New Issue
Block a user