mirror of
https://github.com/aaddrick/claude-desktop-debian.git
synced 2026-05-17 08:36:35 +03:00
docs: add cowork-vm-daemon learnings
Capture the architecture and failure modes of the Linux cowork-vm daemon — respawn logic, crash diagnosis, and the one-shot-guard / preserved-image pitfalls that caused issue #408. Intended for future contributors (human or AI) who need to navigate this area without re-deriving it from minified JS. Co-Authored-By: Claude <claude@anthropic.com>
@@ -9,6 +9,7 @@ This project repackages Claude Desktop (Electron app) for Debian/Ubuntu Linux, a
The [`docs/learnings/`](docs/learnings/) directory contains hard-won technical knowledge from debugging and fixing issues: things that aren't obvious from reading the code or docs alone. Consult these before working on related areas. Add new entries when you discover something non-obvious that would save future contributors (human or AI) significant time.

- [`nix.md`](docs/learnings/nix.md): NixOS packaging, Electron resource path resolution, testing without NixOS
- [`cowork-vm-daemon.md`](docs/learnings/cowork-vm-daemon.md): Cowork VM daemon lifecycle, respawn logic, crash diagnosis

## Code Style
174
docs/learnings/cowork-vm-daemon.md
Normal file
@@ -0,0 +1,174 @@
# Cowork VM Daemon: Learnings

## Architecture Overview

Cowork mode on Linux uses a custom Node.js daemon
([`scripts/cowork-vm-service.js`](../../scripts/cowork-vm-service.js))
that replaces the Windows cowork-vm-service. The Electron app talks to
it over a Unix domain socket at
`$XDG_RUNTIME_DIR/cowork-vm-service.sock` using length-prefixed JSON,
the same wire format as the Windows named pipe.

The daemon is forked at runtime by code that **Patch 6** injects. The
patch lives in `build.sh`'s `patch_cowork_linux()` function and adds
auto-launch code to the Electron app's retry loop for the VM-service
connection.
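To make "length-prefixed JSON" concrete, here is a minimal framing sketch. The 4-byte big-endian header is an assumption (this note does not state the prefix width or byte order), and `encodeFrame` / `FrameDecoder` are hypothetical names, not identifiers from the daemon.

```javascript
// Sketch of length-prefixed JSON framing.
// ASSUMPTION: a 4-byte big-endian length header; the real prefix width
// and byte order are not stated in this document.
const HEADER_BYTES = 4;

// Serialize one message as [length][UTF-8 JSON payload].
function encodeFrame(message) {
  const payload = Buffer.from(JSON.stringify(message), 'utf8');
  const header = Buffer.alloc(HEADER_BYTES);
  header.writeUInt32BE(payload.length, 0);
  return Buffer.concat([header, payload]);
}

// Accumulates socket chunks and returns every message completed so far.
class FrameDecoder {
  constructor() {
    this.buffer = Buffer.alloc(0);
  }

  push(chunk) {
    this.buffer = Buffer.concat([this.buffer, chunk]);
    const messages = [];
    while (this.buffer.length >= HEADER_BYTES) {
      const length = this.buffer.readUInt32BE(0);
      if (this.buffer.length < HEADER_BYTES + length) break; // partial frame
      const payload = this.buffer.subarray(HEADER_BYTES, HEADER_BYTES + length);
      messages.push(JSON.parse(payload.toString('utf8')));
      this.buffer = this.buffer.subarray(HEADER_BYTES + length);
    }
    return messages;
  }
}
```

A decoder of this shape is needed because a stream socket can deliver half a frame, or several frames, in one `data` event; the length prefix is what lets the reader find message boundaries.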
## Daemon Lifecycle

1. First connect attempt: the app tries `$XDG_RUNTIME_DIR/cowork-vm-service.sock`.
2. `ENOENT` / `ECONNREFUSED`: the retry loop catches the error (the
   `ECONNREFUSED` branch is Linux-only, added by Patch 6 step 1 so
   stale sockets don't bypass retry).
3. Auto-launch (Patch 6 step 2): the injected code forks the daemon
   via `child_process.fork()` with `detached:true`, stdio redirected
   to `~/.config/Claude/logs/cowork_vm_daemon.log`.
4. Spawn cooldown: the injected code records
   `FUNC._lastSpawn = Date.now()`, and subsequent iterations only
   re-fork after 10 s have elapsed. This replaces the old one-shot
   `_svcLaunched` boolean so the retry loop can recover after
   mid-session daemon death (issue #408).
5. Retry: the loop waits and reconnects, which now succeeds.
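Steps 2 and 3 above can be sketched in readable form as follows. This is an illustration rather than the injected code, which is minified and may differ; `isRetryable`, `daemonForkOptions`, and `launchDaemon` are hypothetical names, and the script and log paths are supplied by the caller.

```javascript
const fs = require('fs');
const path = require('path');
const { fork } = require('child_process');

// Error codes the lifecycle above treats as "daemon not up yet, retry".
const RETRYABLE = new Set(['ENOENT', 'ECONNREFUSED']);

function isRetryable(err) {
  return Boolean(err) && RETRYABLE.has(err.code);
}

// Build fork() options that send the child's stdout and stderr to logFd.
// fork() requires exactly one 'ipc' entry when stdio is given as an array.
function daemonForkOptions(logFd) {
  return { detached: true, stdio: ['ignore', logFd, logFd, 'ipc'] };
}

// HYPOTHETICAL helper: fork the daemon script with its output appended
// to logPath, creating the log directory first if needed.
function launchDaemon(scriptPath, logPath) {
  fs.mkdirSync(path.dirname(logPath), { recursive: true });
  const logFd = fs.openSync(logPath, 'a'); // append, keep earlier runs
  const child = fork(scriptPath, [], daemonForkOptions(logFd));
  fs.closeSync(logFd); // the child has inherited its own copy
  return child;
}
```

Opening the log in append mode keeps output from earlier daemon runs, which matters when diagnosing a crash loop.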
## Issue #408: Daemon Recovery

### Root cause (one-shot guard)

Before the fix, Patch 6 injected:

```javascript
process.platform==="linux" && !FUNC._svcLaunched && (
  FUNC._svcLaunched = true,
  /* fork daemon */
)
```

`FUNC._svcLaunched` was set on the first spawn attempt and never
cleared, so when the daemon died mid-session the retry loop saw the
guard already set and skipped the re-fork. The client looped forever
on `connect ENOENT`.
### Fix (rate-limited respawn)

Timestamp-based cooldown replaces the boolean:

```javascript
process.platform==="linux" &&
  (!FUNC._lastSpawn || Date.now() - FUNC._lastSpawn > 1e4) &&
  (FUNC._lastSpawn = Date.now(), /* fork daemon */)
```

10 s (`1e4` ms) is short enough that the retry loop (which sleeps on
the order of seconds between iterations) recovers promptly after a
crash, and long enough that a crash-looping daemon can't turn into a
fork bomb.
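The difference between the two guards is easier to see when they are written as plain functions. The sketch below mirrors the two injected conditions; `oneShotGuard`, `cooldownGuard`, and the `state` object are hypothetical stand-ins for the properties stored on `FUNC`.

```javascript
const COOLDOWN_MS = 1e4; // 10 s, matching the injected check

// Old behaviour: a boolean that is set once and never cleared.
function oneShotGuard(state) {
  if (state.launched) return false;
  state.launched = true;
  return true;
}

// New behaviour: allow a spawn when none has happened yet, or when the
// last one is more than COOLDOWN_MS in the past.
function cooldownGuard(state, now = Date.now()) {
  if (state.lastSpawn && now - state.lastSpawn <= COOLDOWN_MS) return false;
  state.lastSpawn = now;
  return true;
}
```

With `oneShotGuard`, every call after the first returns `false`, which is the stuck state from issue #408. With `cooldownGuard`, a call made more than 10 s after the previous spawn returns `true` again, so a dead daemon gets re-forked.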
### Secondary cause (preserved images block recovery)

The app's `_ue()` / `deleteVMBundle()` function deletes a whitelist of
reinstall files on auto-reinstall. Upstream deliberately preserves
`sessiondata.img` and `rootfs.img.zst` to avoid re-download.

On 1.2773.0 those preserved files put the daemon into an unstartable
state that persists across app restart and OS reboot. The client's
symptom is `connect ENOENT` (daemon never got far enough to create the
socket) rather than `ECONNREFUSED` (daemon started, crashed, socket
stayed). RayCharlizard (2026-04-16) confirmed that manually wiping
`~/.config/Claude/vm_bundles/claudevm.bundle/` is required to recover,
even after rolling back the AppImage to a known-good version.
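The `ENOENT` versus `ECONNREFUSED` distinction above can be checked by hand with a small probe. This is a sketch only: `describeConnectFailure` and `probeSocket` are hypothetical helpers, and the returned strings simply restate the interpretations given in this section.

```javascript
const net = require('net');

// Map a connect() error code to the daemon state this section
// associates with it. The wording is illustrative only.
function describeConnectFailure(code) {
  switch (code) {
    case 'ENOENT':
      return 'no socket file: daemon never got far enough to create it';
    case 'ECONNREFUSED':
      return 'stale socket: daemon started, crashed, socket stayed';
    default:
      return `not covered by this note: ${code}`;
  }
}

// Try to connect to a Unix domain socket and resolve with a description.
function probeSocket(socketPath) {
  return new Promise((resolve) => {
    const socket = net.createConnection(socketPath);
    socket.once('connect', () => {
      socket.end();
      resolve('daemon is accepting connections');
    });
    socket.once('error', (err) => resolve(describeConnectFailure(err.code)));
  });
}
```

For example, `probeSocket(process.env.XDG_RUNTIME_DIR + '/cowork-vm-service.sock').then(console.log)` prints one of the descriptions above.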
### Fix (extend delete list, Patch 6b)

`build.sh` now matches the `const NAME=["rootfs.img",...]` array at
module level and appends `"sessiondata.img"` and `"rootfs.img.zst"` if
they're not already present. The auto-reinstall path now wipes these
too. Trade-off: the next successful startup re-downloads/re-extracts
these files. This is acceptable because auto-reinstall only runs after
startup has already failed, so biasing toward recovery over
re-download avoidance is correct.

Not included in the delete list: `~/.config/Claude/claude-code-vm/`.
That's CLI-binary storage (`2.1.x/claude`), unrelated to the VM
daemon, and has its own version-check logic at `this.vmStorageDir`
inside the app. Wiping it would just force a slow re-download of the
CLI on every auto-reinstall.
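The array extension described for Patch 6b can be illustrated with a small transform. This is not the code in `build.sh` (its implementation is not shown here); `extendDeleteList` is a hypothetical name, and the regex assumes the array holds only double-quoted string literals with `"rootfs.img"` first.

```javascript
const EXTRA_FILES = ['sessiondata.img', 'rootfs.img.zst'];

// Find `const NAME=["rootfs.img",...]` and append any missing entries.
// ASSUMPTION: the array contains only double-quoted string literals.
function extendDeleteList(source) {
  const pattern = /const ([\w$]+)=\[("rootfs\.img"(?:,"[^"]*")*)\]/;
  return source.replace(pattern, (match, name, items) => {
    const present = new Set(JSON.parse(`[${items}]`));
    const missing = EXTRA_FILES.filter((file) => !present.has(file));
    if (missing.length === 0) return match; // already patched
    const added = missing.map((file) => JSON.stringify(file)).join(',');
    return `const ${name}=[${items},${added}]`;
  });
}
```

Checking for existing entries before appending keeps the transform idempotent, so running the patch against an already-patched bundle leaves it unchanged.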
## Silent Death: Now Logged

Before the fix the daemon was forked with `stdio:"ignore"`, and its
internal `log()` function was gated by `COWORK_VM_DEBUG=1`, so a crash
left no trace anywhere.

Two changes together make crashes visible:

1. **Patch 6 (client side)** redirects the forked daemon's stdout and
   stderr to `~/.config/Claude/logs/cowork_vm_daemon.log`. Any
   Node-level crash dump (uncaught exception pre-handler, native
   assertion, etc.) now lands in that file.
2. **`cowork-vm-service.js` (daemon side)** adds `logLifecycle()`,
   an always-on writer that bypasses `DEBUG` for startup, SIGTERM,
   SIGINT, `uncaughtException`, `unhandledRejection`, and `exit`
   events. It also proactively `mkdirSync`'s the log directory so the
   first write doesn't get swallowed if the daemon is the first thing
   writing under `~/.config/Claude/logs/`.

Interpreting the log after a failure:

| Last log entries | Diagnosis |
|------------------|-----------|
| `lifecycle startup ...` + gap + no further entries | SIGKILL'd (OOM killer, `kill -9`, etc.); no handler fires |
| `lifecycle startup` + `lifecycle listening` + nothing else | Daemon was running normally, then died from a signal with no handler (rare; check `dmesg`) |
| `lifecycle uncaughtException ...` | JS-level crash; the stack is in the log entry |
| `lifecycle SIGTERM received` + `lifecycle exit code=0` | Clean app-initiated shutdown |
| No `startup` entry at all | `fork()` didn't complete; check `launcher.log` for `[cowork-autolaunch]` errors |
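The table above can be turned into a rough triage helper. The sketch assumes every lifecycle entry contains the literal text `lifecycle <event>`, as the table suggests; `diagnose` is a hypothetical function, and it collapses the two signal-kill rows into one result.

```javascript
// Classify a daemon log by its lifecycle entries, following the table.
// ASSUMPTION: each entry contains the literal text "lifecycle <event>".
function diagnose(logText) {
  const events = logText
    .split('\n')
    .map((line) => /lifecycle (\w+)/.exec(line))
    .filter(Boolean)
    .map((match) => match[1]);

  const lastStart = events.lastIndexOf('startup');
  if (lastStart === -1) return 'fork() did not complete';

  const run = events.slice(lastStart); // entries from the most recent run
  if (run.includes('uncaughtException')) return 'JS-level crash';
  if (run.includes('SIGTERM') && run.includes('exit')) return 'clean shutdown';
  if (run.every((e) => e === 'startup' || e === 'listening')) {
    return 'killed by a signal with no handler';
  }
  return 'not covered by the table';
}
```

Only entries after the most recent `startup` are considered, so an old crash earlier in the file does not mask the state of the current run.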
## Key Files

- [`build.sh`](../../build.sh) lines ~1274-1390: Patch 6 (auto-launch +
  stdio redirect + rate limiter) and Patch 6b (reinstall array extension).
- [`scripts/cowork-vm-service.js`](../../scripts/cowork-vm-service.js)
  lines ~49-86: log infrastructure, including `logLifecycle()`.
- [`scripts/cowork-vm-service.js`](../../scripts/cowork-vm-service.js)
  lines ~2399-2440: signal handlers and entry point.
- [`scripts/launcher-common.sh`](../../scripts/launcher-common.sh): `--doctor` checks.
- [`docs/cowork-linux-handover.md`](../cowork-linux-handover.md): architecture reference.
## Diagnostic Commands

```bash
# Is the daemon running?
pgrep -af cowork-vm-service

# Socket present?
ls -la "${XDG_RUNTIME_DIR:-/tmp}/cowork-vm-service.sock"

# Watch lifecycle events as they happen
tail -f ~/.config/Claude/logs/cowork_vm_daemon.log

# Look for the last startup / exit pair
grep -E 'lifecycle (startup|exit|SIGTERM|SIGINT|uncaughtException|unhandledRejection)' \
    ~/.config/Claude/logs/cowork_vm_daemon.log | tail -20

# Find any orphan sockets
lsof -U 2>/dev/null | grep -iE 'cowork|claude'

# Force a respawn test: kill daemon, watch client log for reconnect
pkill -9 -f cowork-vm-service.js
tail -f ~/.cache/claude-desktop-debian/launcher.log

# Find the daemon script inside a mounted AppImage
find /tmp -path '*claude*cowork-vm-service*' 2>/dev/null
```
## Testing Notes

- **Host-direct** (`COWORK_VM_BACKEND=host`): no isolation, direct
  execution. Matches the `--doctor` "host-direct (no isolation, via
  override)" line. This is what issue #408 was reported against.
- **Bwrap** (`COWORK_VM_BACKEND=bwrap`): Bubblewrap sandbox; requires
  `bwrap` installed.
- **KVM** (`COWORK_VM_BACKEND=kvm`): full VM; requires QEMU, KVM,
  rootfs image.
- **Debug** (`COWORK_VM_DEBUG=1` or `CLAUDE_LINUX_DEBUG=1`): verbose
  logging via the existing `log()` path. `logLifecycle()` is always
  on regardless of this flag.
- **Force-cooldown test**: kill the daemon, then relaunch a Cowork
  session within 10 s of the last spawn; the guard should block that
  single retry. Once 10 s have passed since that spawn, retry: the
  re-fork should succeed. This confirms the cooldown boundary, which
  is measured from `FUNC._lastSpawn`, not from the kill.
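The relationship between the debug-gated `log()` and the always-on `logLifecycle()` can be sketched as below. The line format, the `makeLoggers` factory, and the exact `=== '1'` comparison are assumptions for illustration; only the two environment variable names and the `mkdirSync` step come from this document.

```javascript
const fs = require('fs');
const path = require('path');

// Verbose logging is opt-in through either environment variable.
// ASSUMPTION: the flag is considered set only when it equals "1".
function debugEnabled(env = process.env) {
  return env.COWORK_VM_DEBUG === '1' || env.CLAUDE_LINUX_DEBUG === '1';
}

// ASSUMPTION: line format and the makeLoggers name are illustrative.
function makeLoggers(logFile, env = process.env) {
  const write = (kind, message) => {
    // Create the log directory first so the first write is not lost.
    fs.mkdirSync(path.dirname(logFile), { recursive: true });
    const stamp = new Date().toISOString();
    fs.appendFileSync(logFile, `${stamp} ${kind} ${message}\n`);
  };
  return {
    // Gated: writes only when a debug flag is set.
    log: (message) => {
      if (debugEnabled(env)) write('debug', message);
    },
    // Always on: lifecycle events are written regardless of the flag.
    logLifecycle: (message) => write('lifecycle', message),
  };
}
```

With no debug flag set, `log('x')` writes nothing while `logLifecycle('startup')` still appends a line, which is what makes a crash visible in a default configuration.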