Revert "docs: add issue triage pipeline design document"

This reverts commit 1d020aa. Moving the change to a branch for review
instead of shipping directly to main.

Co-Authored-By: Claude <claude@anthropic.com>
# Issue Triage Pipeline
Automated first-pass triage for GitHub issues on this repository. Runs on
`issues: [opened, reopened]` and on manual `workflow_dispatch`. Classifies the
issue, investigates likely root cause against the repo and the upstream
beautified source, validates every factual claim mechanically and with a
fresh-context LLM reviewer, and — when the validated findings clear hard
gates — posts an **explicitly non-authoritative draft comment** plus triage
labels.
The design is constrained by three simultaneous goals:
- **Useful**: give the maintainer a head start on orientation, candidate sites,
and related issues.
- **Safe**: never mislead a reporter or reviewer with fabricated identifiers,
non-matching patch code, or authoritative voice on unverified claims.
- **Cheap**: run in under three minutes per issue for a two-digit dollar
monthly spend.
Everything in this document follows from those three goals being in tension.
---
## Design principles
> [!IMPORTANT]
> These seven principles are load-bearing. Every stage in the pipeline exists
> to serve one of them. If a future change breaks a principle, the stage it
> lives in should be removed, not weakened.
### 1. Mechanical checks before LLM checks
Grep, `gh api`, file stat, regex matching — deterministic, cheap, and
complementary to LLM reasoning. The class of error an LLM reviewer misses most
often is the one an LLM drafter made: fabricated identifiers, non-matching
anchors, misremembered issue numbers. A second LLM pass that only sees the
first pass's output can rubber-stamp the fabrication. A `grep -P` against the
actual source cannot.
LLM review is reserved for questions a grep cannot answer — semantic
entailment, intent, whether two issues describe the same failure mode.
### 2. Structured output, not prose
Every claim the pipeline produces has a typed slot: `file`, `line_start`,
`line_end`, `evidence_quote`, `claim_type`, `confidence`. Prose is generated
last, from already-validated structure. Free-form investigation output is
banned because it allows unverifiable assertions to hide inside narrative.
Anthropic's own Claude Code Security Review action uses structured tool output
for exactly this reason: it lets the pipeline drop individual findings without
rewriting prose around them.[^anthropic-security-review]
### 3. Writer/Reviewer with fresh context on source
The reviewer reads the **source** and the **claim**. It does *not* read the
drafter's reasoning or the draft comment. This is Anthropic's recommended
pattern for bias prevention in Claude Code[^anthropic-best-practices] — a
fresh context avoids the reviewer rationalizing the drafter's framing.
The reviewer is also **adversarial by construction**: it must produce the
strongest counter-reading of each evidence quote *before* emitting a verdict.
Rubber-stamping is the base rate for reviewer agents otherwise.
### 4. Category exclusion over classification
Whole classes of issue skip investigation entirely. Hardware-specific GPU
driver crashes, kernel-level behavior, non-reproducible reports, and
upstream-only bugs do not benefit from automated investigation against our
patch surface — the bot ends up inventing "launcher flag workarounds" for
problems outside our control.
This is the pattern Anthropic's security-review action uses to keep
signal-to-noise high: whole finding categories (DoS, open-redirects,
rate-limiting) are excluded upfront rather than triaged with
disclaimers.[^anthropic-security-review]
### 5. Confidence gates suppress, not hedge
Low-confidence findings do not get posted with a cautionary footer. They do
not get posted. Labels apply; comment is skipped. Research consistently shows
that posting hedged but confident-looking output is worse than posting
nothing — specificity reads as authority regardless of preamble
phrasing.[^diffray-hallucinations][^lakera-hallucinations]
### 6. Non-authoritative framing is structural, not textual
The comment template signals tentativeness through structure:
- Upfront "won't-do" boundary statement (what the bot *will not* claim),
modeled on Anthropic's "won't approve PRs — that's still a human
call"[^anthropic-code-review]
- Required file:line citations on every claim (enforced by post-processor —
claims without citations are dropped)
- Hypothesis phrasing ("Looks like X", "Likely path is Y") — prompt-enforced
and post-processor-checked
- Patch code in a collapsed `<details>` block, labeled as unverified draft
- No voice replication of the maintainer
A disclaimer sentence alone cannot neutralize the amplifying effect of
authoritative voice; the shape of the artifact has to do the work.
### 7. Feedback is an explicit signal, not a pattern-matched one
Keyword-triggered "bot got this wrong" capture is theater — corrections arrive
as natural language, curation never happens, the file grows stale. Real
feedback uses explicit signals the pipeline can act on deterministically:
- 👎 reaction on a bot comment auto-opens a correction issue
- `/triage-wrong` slash command in any reply opens the same
- A human-curated `.claude/triage-corrections.md` file is loaded as context
into future runs
No free-form NLP classification of collaborator replies.
---
## Pipeline overview
```mermaid
flowchart TD
A[Issue opened/reopened] --> B[1. Gate]
B -->|needs-human or<br/>already triaged| Z[exit]
B -->|proceed| C[2. Classify + double-check]
C -->|category excluded| L[9. Label + archive]
C -->|investigable| D[3. Fetch reference]
D --> E[4. Investigate<br/>structured output]
E --> F[5. Mechanical validation<br/>no LLM]
F --> G[6. Adversarial review<br/>fresh context]
G --> H[7. Decision gate]
H -->|version drift| L
H -->|no surviving findings<br/>or low confidence| L
H -->|pass| I[8. Comment generation]
I --> L
L --> M[Post + archive]
N[Sideband:<br/>triage-feedback.yml] -.->|corrections.md| B
style C fill:#e1f5ff
style E fill:#e1f5ff
style G fill:#e1f5ff
style I fill:#e1f5ff
style F fill:#fff4e1
style H fill:#fff4e1
style L fill:#fff4e1
```
Stages shaded blue are LLM calls; amber are deterministic bash steps.
| Stage | Runtime | Tool | Purpose |
|-------|---------|------|---------|
| 1. Gate | ~5s | bash | Skip already-triaged, load corrections file |
| 2. Classify | ~30s | Sonnet (×2) | Categorize + double-check exclusions |
| 3. Fetch reference | ~10s | bash | Download `reference-source.tar.gz` |
| 4. Investigate | ~90s | Sonnet | Structured findings + sweeps + anchors |
| 5. Mechanical validation | ~15s | bash | Grep, `gh`, closed-world extraction |
| 6. Adversarial review | ~40s | Sonnet | Counter-reading + verdict, fresh context |
| 7. Decision gate | ~1s | bash | Enforce hard gates |
| 8. Comment generation | ~40s | Sonnet | Template-enforced draft |
| 9. Label + archive | ~5s | bash | Apply labels, post, upload artifacts |
Total: ~3 minutes per investigable issue. Excluded-category issues exit at
~35s (stages 1–2 plus archive).
---
## Stage-by-stage detail
### 1. Gate
Deterministic filter. Runs before any paid API call.
**Skip conditions:**
- Issue labeled `triage: needs-human` (unless manually dispatched)
- Issue already has a terminal triage label (`investigated`, `duplicate`,
`not-actionable`)
- Title similarity >0.9 with an open issue (likely duplicate, defers to
human)
**Side effects:**
- Loads `.claude/triage-corrections.md` as context for downstream stages.
This file contains human-curated past errors, classified by failure class
(identifier-hallucination, false-duplicate, missed-site, version-drift,
etc.). Future runs see them and are instructed to avoid the same shapes.
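
A minimal sketch of this stage as a workflow step, assuming the `gh` CLI and an `ISSUE_NUMBER` variable supplied by the workflow. The title-similarity check is elided, and the dispatch exception is collapsed onto all skip labels for brevity.
```bash
#!/usr/bin/env bash
# Stage 1 gate: deterministic, runs before any paid API call.
set -euo pipefail

# Current labels, one per line.
labels=$(gh issue view "$ISSUE_NUMBER" --json labels --jq '.labels[].name')

# Skip if a triage label already marks the issue as handled (unless manually dispatched).
if grep -qE '^triage: (needs-human|investigated|duplicate|not-actionable)$' <<<"$labels" \
   && [ "${GITHUB_EVENT_NAME:-}" != "workflow_dispatch" ]; then
  echo "skip=true" >>"$GITHUB_OUTPUT"
  exit 0
fi

# Expose the corrections file path to downstream stages, if it exists.
if [ -f .claude/triage-corrections.md ]; then
  echo "corrections_path=.claude/triage-corrections.md" >>"$GITHUB_OUTPUT"
fi
echo "skip=false" >>"$GITHUB_OUTPUT"
```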
### 2. Classify
First Sonnet call. Structured JSON output only.
<details>
<summary><b>Classify output schema</b></summary>
```json
{
"classification": "bug|feature|question|duplicate|needs-info|not-actionable|needs-human",
"confidence": "high|medium|low",
"claimed_version": "1.3109.0 | null",
"category": "bug|feature|question|driver|hardware|kernel|upstream-only|container|...",
"exclusion_reason": "null | string — filled iff category excluded",
"category_evidence": "verbatim excerpt from the issue body supporting the category",
"suggested_labels": ["priority: high", "format: rpm", ...],
"duplicate_of": "null | integer"
}
```
</details>
The `claimed_version` field is parsed from the issue body — `--doctor`
output, `claude-desktop (X.Y.Z)` references, AppImage filenames. Consumed
by stage 7 to gate against version drift.
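
For illustration only, a crude mechanical approximation of that extraction; the real field comes from the classifier's JSON output, and `issue_body.txt` is a hypothetical dump of the issue body.
```bash
# Take the first semver-looking string in the body as a rough guess.
# The classifier is additionally told to prefer `--doctor` output,
# "claude-desktop (X.Y.Z)" mentions, and AppImage filenames over stray numbers.
claimed_version=$(grep -oEm1 '[0-9]+\.[0-9]+\.[0-9]+' issue_body.txt || true)
echo "claimed_version=${claimed_version:-null}" >>"$GITHUB_OUTPUT"
```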
> [!WARNING]
> **Category exclusion is verified by a second Sonnet pass.** If the first
> pass flags the issue as excluded, a second call sees only the issue body,
> the `category_evidence` excerpt, and the exclusion list — no other context.
> Only if both agree does the exclusion apply. Classify is otherwise the
> least-validated stage in the pipeline; this double-check is how we budget
> for its miscalibration.
**Excluded categories** route to stage 9 with a `triage: needs-human` label
and no comment. Current exclusions:
- `driver` (NVIDIA, AMD, Intel GPU driver issues)
- `hardware` (specific-chipset crashes, sensor/thermal issues)
- `kernel` (namespace restrictions, cgroup behavior, distro kernel
differences)
- `upstream-only` (reproduces on unmodified Claude Desktop on Windows/macOS)
- `container` (Distrobox/Flatpak/Docker without display passthrough)
### 3. Fetch reference
Downloads `reference-source.tar.gz` from the GitHub release matching
`CLAUDE_DESKTOP_VERSION`. This tarball is produced by `ci.yml` on every
release: `app.asar` extracted, `.vite/build/*.js` beautified with Prettier,
tarred. No re-extraction happens in the triage pipeline — that work was
already done once at release time.
If `claimed_version` from stage 2 differs from `CLAUDE_DESKTOP_VERSION`,
`VERSION_DRIFT=true` is exported and consumed by the stage 7 gate. The
reference still downloads, investigation still runs (the artifacts are useful
internally), but stage 7 will prevent a public comment.
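
A sketch of the fetch step, assuming the tarball is published as a release asset on a `v${CLAUDE_DESKTOP_VERSION}` tag and that the claimed version from stage 2 is exported as `CLAIMED_VERSION`.
```bash
# Stage 3: download the pre-beautified reference source produced by ci.yml.
set -euo pipefail

gh release download "v${CLAUDE_DESKTOP_VERSION}" \
  --pattern 'reference-source.tar.gz' --dir /tmp/reference
mkdir -p /tmp/reference/src
tar -xzf /tmp/reference/reference-source.tar.gz -C /tmp/reference/src

# Export version drift for the stage 7 gate; investigation still runs either way.
if [ -n "${CLAIMED_VERSION:-}" ] && [ "$CLAIMED_VERSION" != "$CLAUDE_DESKTOP_VERSION" ]; then
  echo "VERSION_DRIFT=true" >>"$GITHUB_ENV"
else
  echo "VERSION_DRIFT=false" >>"$GITHUB_ENV"
fi
```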
### 4. Investigate
Sonnet call with repo + reference source + corrections file + issue
context. **Output is schema-enforced — no free prose permitted.**
<details>
<summary><b>Investigation output schema</b></summary>
```json
{
"findings": [
{
"claim_type": "identifier|behavior|flow|absence",
"claim": "string — the factual assertion being made",
"file": "path/to/file.js",
"line_start": 1234,
"line_end": 1240,
"evidence_quote": "verbatim source excerpt supporting the claim",
"confidence": "high|medium|low",
"enclosing_construct": "for identifier claims only — the enum/switch/literal containing the identifier"
}
],
"pattern_sweep": [
{
"pattern": "regex pattern used to sweep the repo",
"match_count": 17,
"matches": [
{ "file": "...", "line": 42, "snippet": "..." }
]
}
],
"proposed_anchors": [
{
"description": "what this regex targets",
"regex": "pattern",
"expected_match_count": 1,
"target_file": "path/to/file",
"word_boundary_required": true
}
],
"related_issues": [
{
"number": 288,
"why_related": "one-sentence rationale",
"quoted_excerpt": "relevant snippet from the cited issue"
}
]
}
```
</details>
**Hard schema bans** (validator rejects output if any present):
| Banned | Why |
|--------|-----|
| Negative per-site assertions ("X should stay as-is") | Bad historical track record; these block fixes instead of enabling them |
| "Already fixed in #N" without a diff/PR link | Same failure class — an unverified negative claim that blocks scope |
| Substring regex on identifier claims | Substring matches pass `grep` but don't prove identifier identity |
| `expected_match_count: ">=1"` | Must be exact — ≥1 is what lets fabricated anchors slip through |
| Prescriptive patch text without a backing finding | Detached prescriptions are how unverified `sed` patterns get posted |
**Pattern-sweep caps:** each sweep is limited to 20 match rows. Additional
matches are summarized as `match_count: N (showing first 20)`. This prevents
investigation token bloat on common identifiers.
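
Enforcing the cap mechanically is a one-liner; a sketch with a hypothetical sweep pattern and `src/` standing in for the checked-out sources:
```bash
# Re-run one sweep, keep the first 20 rows, and record the true total.
pattern='createMainWindow'   # hypothetical identifier being swept
total=$(grep -rnoP "$pattern" src/ | wc -l)
grep -rnP "$pattern" src/ | head -n 20 > sweep_rows.txt
echo "match_count: ${total} (showing first 20)"
```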
### 5. Mechanical validation
Pure bash. No LLM call. Produces `validation.json` with pass/fail per item.
**Per finding:**
- [x] `file` exists and `line_end` is within the file's length
- [x] `evidence_quote` grep-matches at the cited `file:line_start`
- [x] If `claim_type == "identifier"`, extract `closed_world_options` — the
full enclosing enum/switch/case-block/object-literal — verbatim, using
`ast-grep` (fast, language-aware). The options list is attached to the
finding for the reviewer to use in stage 6.
**Per proposed anchor:**
- [x] `grep -P` against the reference source with `\b` word boundaries
enforced for identifier anchors
- [x] Match count **exactly equal** to `expected_match_count` (not ≥)
- [x] No substring hits on identifier-type anchors
**Per related_issue:**
- [x] `gh issue view NNN` — actual title, state, first 500 chars of body
captured and attached to the finding. The bot's `why_related` claim is
not trusted — the reviewer in stage 6 reads the real body.
**Per pattern_sweep match:**
- [x] Re-grep to confirm the match still exists (catches investigation
hallucinating file paths or line numbers)
> [!NOTE]
> **Why the closed-world extraction matters.** A bot that fabricates an
> identifier (e.g., claiming the VM backend values are `qemu`/`virt` when
> they're actually `kvm`/`bwrap`/`host`) can pick a nearby real line
> containing the substring "virt" as its `evidence_quote`. Grep validation
> alone would pass this — the quote exists, the file exists, the line
> matches. Closed-world extraction pulls the full enum the claim is *about*
> and hands it to the reviewer as a bounded option list. "Is the claimed
> identifier in this list?" is a closed question the reviewer cannot
> rationalize around.
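
A condensed sketch of the per-finding and per-anchor checks, assuming each item has been split into its own JSON file and `jq` is available; the `ast-grep` closed-world extraction and multi-line evidence quotes are left out.
```bash
# Validate one finding: file exists, line range is sane, quote matches at the cited line.
validate_finding() {
  local f="$1" file line quote
  file=$(jq -r '.file' "$f")
  line=$(jq -r '.line_start' "$f")
  quote=$(jq -r '.evidence_quote' "$f")

  [ -f "$file" ] || { echo "FAIL: file missing"; return 1; }
  [ "$(jq -r '.line_end' "$f")" -le "$(wc -l <"$file")" ] || { echo "FAIL: line_end past EOF"; return 1; }
  sed -n "${line}p" "$file" | grep -qF -- "$quote" || { echo "FAIL: quote not at cited line"; return 1; }
  echo "PASS"
}

# Validate one proposed anchor: word boundaries enforced, match count exactly equal (never >=).
validate_anchor() {
  local f="$1" regex target expected actual
  regex=$(jq -r '.regex' "$f")
  target=$(jq -r '.target_file' "$f")
  expected=$(jq -r '.expected_match_count' "$f")
  actual=$(grep -oP "\b${regex}\b" "$target" | wc -l)
  [ "$actual" -eq "$expected" ] && echo "PASS" || echo "FAIL: ${actual} matches, expected ${expected}"
}
```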
### 6. Adversarial review
Sonnet call with **fresh context**. Does NOT see the draft comment, the
investigation's free-form reasoning, or any voice instructions. Sees only:
- The original issue
- `validation.json` with findings that passed mechanical
- `closed_world_options` for each identifier-type finding
- The actual fetched body of each cited related issue
- Source excerpts at claim sites
**Prompt is adversarial per finding, in this order:**
1. State the strongest counter-reading of this evidence quote. What would
make this claim wrong, given the actual code shown?
2. For identifier claims: list every option in `closed_world_options`. Is
the claimed identifier verbatim in that list? (yes/no — exact match only)
3. For related_issues: does the fetched body describe the same failure mode?
(exact / related / unrelated)
4. **Only after steps 1–3**, give a verdict on each finding. No new claims.
Output is per-item verdicts. The reviewer cannot propose new findings,
cannot rewrite the drafter's claims, and cannot insert prose. Its only
powers are: approve, reject, or downgrade confidence.
> [!WARNING]
> **Reviewer approval rate is a health metric.** Rolling approval rate is
> logged to telemetry. If it exceeds **70%**, a GitHub issue is auto-opened
> warning the reviewer is rubber-stamping — the prompt needs re-tuning, or
> the drafter is producing findings too weak for the reviewer to
> meaningfully discriminate. See [Health monitoring](#health-monitoring).
### 7. Decision gate
Deterministic. Evaluates all hard gates; any failure routes to
"labels-only, no post."
| Gate | Trigger | Action |
|------|---------|--------|
| Version drift | `claimed_version != CLAUDE_DESKTOP_VERSION` | Labels only |
| No surviving findings | Zero items passed mechanical + review | `triage: needs-human`, labels only |
| Low average confidence | Avg confidence of survivors < medium | `triage: needs-human`, labels only |
| Excluded category | Already caught at stage 2, defensive check | Labels only |
All gates are fail-closed. If any condition is ambiguous, the gate denies
the comment.
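
A sketch of the gate, assuming reviewer verdicts were collected into `review.json` and that `VERSION_DRIFT` plus a hypothetical `CATEGORY_EXCLUDED` flag were exported by earlier stages; unset flags default to the denying value so the gate stays fail-closed.
```bash
# Stage 7: every condition must explicitly pass; anything ambiguous denies the comment.
post_comment=false

surviving=$(jq '[.findings[] | select(.verdict == "approve")] | length' review.json)
avg_ok=$(jq '[.findings[] | select(.verdict == "approve") | .confidence]
             | map(if . == "high" then 2 elif . == "medium" then 1 else 0 end)
             | (if length == 0 then 0 else (add / length) end) >= 1' review.json)

if [ "${VERSION_DRIFT:-true}" = "false" ] \
   && [ "$surviving" -gt 0 ] \
   && [ "$avg_ok" = "true" ] \
   && [ "${CATEGORY_EXCLUDED:-true}" = "false" ]; then
  post_comment=true
fi
echo "post_comment=${post_comment}" >>"$GITHUB_OUTPUT"
```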
### 8. Comment generation
Sonnet call with only the findings that survived every prior stage. Prompt
mandates hypothesis framing ("Looks like", "Likely", "Worth checking first")
and requires a file:line citation on every factual claim.
**Template structure, enforced by post-processor:**
```markdown
**Automated draft — AI analysis, not maintainer judgment.** This bot won't
close issues, apply labels beyond triage routing, or claim fixes are
shipped. Findings below are starting points; the code citations are what
to verify first.
[one hypothesis line]
- [finding with required `file:line` citation]
- [finding with required `file:line` citation]
<details>
<summary>Unverified patch sketch (draft, not applied)</summary>
[patch code here, if any — runs only if confidence high AND proposed_anchor
verified match-count exactly 1]
</details>
Related: #NNN — exact-match | related | unrelated (per reviewer rating)
---
React 👎 on this comment if a finding is wrong, or run `/triage-wrong` in
a reply to open a correction issue.
```
**Post-processor enforcement:**
- [x] Drop any claim sentence without a `file:line` or `#NNN` citation
- [x] Strip LLM preambles — regex removal of lines like
`^(Here'?s|I'?ve got|I have everything|Sure|Certainly|Got it|OK)`
- [x] Enforce 400-word cap (truncate the `<details>` patch section if
needed, never the findings)
- [x] Refuse to post if the enforced template leaves fewer than two
  findings standing
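
A sketch of the preamble strip, the citation requirement, and the two-finding floor (the 400-word cap follows the same pattern); `comment.md` is the hypothetical draft file.
```bash
# Stage 8 post-processing: deterministic cleanup of the generated draft.
draft=comment.md

# Strip common LLM preamble lines.
sed -i -E '/^(Here'\''?s|I'\''?ve got|I have everything|Sure|Certainly|Got it|OK)\b/d' "$draft"

# Keep non-bullet lines as-is; keep a bullet only if it cites file:line or an issue number.
awk '!/^- / || $0 ~ ("[A-Za-z0-9_./-]+:[0-9]+") || $0 ~ ("#[0-9]+")' "$draft" > kept.md && mv kept.md "$draft"

# Refuse to post if fewer than two findings survived enforcement.
if [ "$(grep -cE '^- ' "$draft" || true)" -lt 2 ]; then
  echo "post=false" >>"$GITHUB_OUTPUT"
fi
```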
### 9. Label + post + archive
Deterministic. Always applies triage labels (classification maps to
`triage: investigated` | `duplicate` | `needs-info` | `not-actionable` |
`needs-human`). Always applies `suggested_labels`.
Posts the comment only if stage 7 passed and the template enforcer produced
non-empty output.
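
A sketch of this step, assuming `TRIAGE_LABEL`, `SUGGESTED_LABELS`, and `POST_COMMENT` were exported by earlier stages and the enforced draft sits in `comment.md`.
```bash
# Stage 9: labels always apply; the comment is conditional on the stage 7 gate.
set -euo pipefail

gh issue edit "$ISSUE_NUMBER" --add-label "triage: ${TRIAGE_LABEL}"
if [ -n "${SUGGESTED_LABELS:-}" ]; then
  gh issue edit "$ISSUE_NUMBER" --add-label "$SUGGESTED_LABELS"   # comma-separated list
fi

# Post only if the gate passed and the template enforcer left non-empty output.
if [ "${POST_COMMENT:-false}" = "true" ] && [ -s comment.md ]; then
  gh issue comment "$ISSUE_NUMBER" --body-file comment.md
fi
```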
Uploads three artifacts (14-day retention):
- `investigation.json` — raw investigation output
- `validation.json` — per-item mechanical + review verdicts
- `review.json` — counter-readings and closed-world answers
Writes a structured summary to `$GITHUB_STEP_SUMMARY`:
| Metric | Value |
|--------|-------|
| Classification | bug |
| Confidence | medium |
| Category | bug (investigable) |
| Findings proposed | 4 |
| Findings passed mechanical | 3 |
| Findings passed review | 2 |
| Comment posted | yes |
---
## Feedback loop
Separate workflow: `.github/workflows/triage-feedback.yml`.
### Trigger A — slash command
`issue_comment` event. If the comment body contains `/triage-wrong`, parse
the surrounding context (bot comment being replied to, the correction text)
and open a correction issue labeled `triage-correction`. Assign to
`@aaddrick`. Link back to both the bot comment and the source issue.
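
A sketch of the handler's bash step, assuming the workflow runs on `issue_comment` and reads the event payload from the runner's `GITHUB_EVENT_PATH`.
```bash
# Trigger A: open a correction issue when a reply contains /triage-wrong.
body=$(jq -r '.comment.body' "$GITHUB_EVENT_PATH")
grep -q '/triage-wrong' <<<"$body" || exit 0

issue_url=$(jq -r '.issue.html_url' "$GITHUB_EVENT_PATH")
comment_url=$(jq -r '.comment.html_url' "$GITHUB_EVENT_PATH")
title=$(jq -r '.issue.title' "$GITHUB_EVENT_PATH")

gh issue create \
  --title "Triage correction: ${title}" \
  --label "triage-correction" \
  --assignee "aaddrick" \
  --body "Correction reported in ${comment_url} on ${issue_url}.

${body}"
```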
### Trigger B — reaction
`reaction` event for 👎 on a comment by `github-actions[bot]` in the triage
flow. Fetch the bot comment plus the issue body, open a correction issue
with both, assign to `@aaddrick`.
### Trigger C — human curation
Maintainer consolidates correction issues into
`.claude/triage-corrections.md`. Entries are classified by failure class:
```markdown
## Identifier hallucination — issue #442 (2026-04-19)
Bot invented `qemu` / `virt` as VM backend values. Actual values are
`kvm` / `bwrap` / `host` — see `scripts/cowork-vm-service.js:2226-2232`.
Closed-world extraction should have caught this; reviewer approved
because `evidence_quote` was from a nearby real line.
## False duplicate — issue #382 (2026-04-12)
Bot asserted fix landed in #373; it had not. Negative assertions about
issue state need diff/PR link citations.
```
This file is loaded as context by every subsequent triage run (stage 1).
The drafter and reviewer prompts instruct: "Before making a claim, check
whether a similar claim appears in the corrections file. If so, be
especially strict about evidence."
---
## Health monitoring
Telemetry emitted per run, aggregated on a rolling window:
| Metric | Alert threshold |
|--------|----------------|
| Classify confidence distribution | — (logged, not alerted) |
| Mechanical validation pass rate | > 95% rolling → validation too lax |
| Reviewer approval rate | > 70% rolling → reviewer is rubber-stamping |
| Version drift rate | > 30% → fetch-reference needs stronger pinning |
| Exclusion rate | (monitored for classify-miscalibration) |
| Comment-post rate vs. labels-only rate | (expected ~60/40; skew signals gate miscalibration) |
| Time-to-correction (bot comment → 👎) | (low = high error rate) |
Alerts open GitHub issues in the repo labeled `triage-health`.
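
A sketch of one such alert, the reviewer approval rate, assuming each run appends a row with `approved` and `total` counts to a hypothetical `telemetry.jsonl`.
```bash
# Rolling reviewer approval rate over the last 50 runs; alert above 70%.
rate=$(tail -n 50 telemetry.jsonl | jq -s \
  '(map(.approved) | add) as $a | (map(.total) | add) as $t
   | if $t == 0 then 0 else ($a / $t) end')

if (( $(echo "$rate > 0.70" | bc -l) )); then
  gh issue create --label "triage-health" \
    --title "Reviewer approval rate ${rate} exceeds 0.70 (possible rubber-stamping)" \
    --body "Rolling approval rate over the last 50 runs is ${rate}. Re-tune the reviewer prompt or tighten drafter findings before trusting new comments."
fi
```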
---
## What is explicitly out of scope
- **Voice replication.** The bot speaks as bot. No prior-art fetching of
writing-style profiles. The disclaimer banner does not try to mimic the
maintainer.
- **Closing issues, merging patches, assigning priority beyond label
routing.** The bot's label scope is `triage: *` and `suggested_labels`
from the classification schema. Priority, assignee, and milestone changes
are manual.
- **Speculative fixes for out-of-scope categories.** Driver/hardware/kernel
issues route to `needs-human` without investigation; no launcher-flag
workarounds get prescribed.
- **Posting hedged low-confidence output.** If the confidence or validation
gates fail, the comment is not posted. Labels are still applied.
---
## References
### Anthropic — agent design and review patterns
[^anthropic-framework]: [Our framework for developing safe and trustworthy
agents](https://www.anthropic.com/news/our-framework-for-developing-safe-and-trustworthy-agents).
Five principles for agent design; emphasizes process transparency and
human-in-the-loop over output-level disclaimers.
[^anthropic-code-review]: [Code Review for Claude
Code](https://claude.com/blog/code-review). Source of the "won't approve
PRs — that's still a human call" framing pattern. Documents the
Writer/Reviewer pattern and confidence-based surface gating.
[^anthropic-security-review]: [claude-code-security-review (GitHub
Action)](https://github.com/anthropics/claude-code-security-review).
Anthropic's own automated review bot for PRs. Source of
category-exclusion and upfront limitation-disclosure patterns.
[^anthropic-best-practices]: [Best Practices for Claude
Code](https://code.claude.com/docs/en/best-practices). Documents the
fresh-context Writer/Reviewer pattern and confidence-scoring multi-agent
review.
[^anthropic-autonomy]: [Measuring AI agent autonomy in
practice](https://www.anthropic.com/research/measuring-agent-autonomy).
User trust is earned and measurable (~20% auto-approve for novices
rising to ~40% with experience). Motivates the conservative-framing
choice: trust-erosion from wrong output is slow to recover.
### LLM hallucination research
[^diffray-hallucinations]: [LLM Hallucinations in AI Code Review
(diffray)](https://diffray.ai/blog/llm-hallucinations-code-review/).
29–45% of AI-generated code contains security vulnerabilities; ~20% of
package recommendations reference non-existent libraries. Motivates the
"validate proposed patches against actual source" gate.
[^lakera-hallucinations]: [LLM Hallucinations in 2026: How to Understand
and Tackle AI's Most Persistent
Quirk](https://www.lakera.ai/blog/guide-to-hallucinations-in-large-language-models).
Hallucinations originate from training incentives where confident
guessing outperforms acknowledging uncertainty. Motivates
structural-tentativeness framing over prose hedges.
### GitHub — automated triage in production
[^github-taskflow]: [AI-supported vulnerability triage with the GitHub
Security Lab Taskflow
Agent](https://github.blog/security/ai-supported-vulnerability-triage-with-the-github-security-lab-taskflow-agent/).
Production LLM-triage system. Source of the "require precise file and
line references" and "staged verification with intermediate artifacts"
patterns.
[^github-copilot-review]: [Responsible use of GitHub Copilot code
review](https://docs.github.com/en/copilot/responsible-use/code-review).
Documents the structural-tentativeness approach (suggestions require
manual approval rather than explicit uncertainty signals) and the
"missed issues / false positives / unreliable suggestions" disclosure
triad.
### Other
[^triage-project]: [trIAge — LLM-powered triage bot for open
source](https://github.com/trIAgelab/trIAge). Reference implementation of
an open-source triage bot; useful as a comparative architecture.