feat(triage): Phase 4 sub-PR 2 — suspicious-input tells (#471)

* feat(triage): Phase 4 sub-PR 2 — suspicious-input tells Adds a conservative Stage 2a tripwire that scans the raw issue body and title for prompt-injection tells before any LLM call. A match short-circuits routing to 8b with reason `suspicious-input — manual review`, no Sonnet invocation. The scan is the front-line filter; the actual injection mitigations (wrap-as-data, fresh-context reviewer, schema-constrained output) remain in place for everything that doesn't trip. The two layers are complementary: the scan catches the obvious attempts cheaply, the downstream defenses protect against the clever ones. Taxonomy - taxonomies/suspicious-input-tells.json — eight tells with regex patterns and rationale: - ignore-prior-instructions: classic opener - system-prompt-leak: exfiltration attempts - role-override: "you are now a different…" - forget-instructions: variation of ignore-prior - developer-mode: named jailbreaks (DAN, etc.) - instruction-injection-sysrole: chat-template tokens - long-base64-block: 200+ contiguous base64 chars - unicode-tag-sequence: U+E0000-E007F invisibles Scanner - scripts/triage/suspicious-input-scan.sh — pure bash, PCRE via grep -Pzi, writes suspicious-input.json with matched_tells[]. Uses the same taxonomy-as-data pattern as reasons.json and label-blocklist.json. Workflow - Stage 2a step runs between input snapshot and classify, outputs `suspicious` boolean - Classify + doublecheck both `if:`-gated so they skip on a hit - Decide route takes suspicious first, before the doublecheck disagreement check — a tripped tell defers deterministically - Step summary shows the suspicious flag Co-Authored-By: Claude <claude@anthropic.com> * refactor(triage): drop dead null-string guards in suspicious-input scan jq -r '.body // ""' already returns an empty string for JSON null or a missing field, so the subsequent `[[ "${body}" == "null" ]]` guards only fire when a reporter's body is the literal four-character string "null" — which isn't an injection signal and matches no tell. The comment describing the guards was also wrong about jq's behavior. Remove both guards and correct the comment. Also fix a misleading comment about `|| true` (which isn't in the code) and collapse the 4-line `suspicious` boolean derivation into a single `jq 'length > 0'`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 08:36:35 +03:00 · 2026-04-20 23:34:46 -04:00
parent b9fe8e3c14
commit 9fc49bd260
3 changed files with 166 additions and 7 deletions
--- a/.github/workflows/issue-triage-v2.yml
+++ b/.github/workflows/issue-triage-v2.yml
@@ -131,9 +131,34 @@ jobs:
            --json name --jq '[.[].name]' \
            > /tmp/triage/repo-labels.json

+      # Stage 2a — suspicious-input scan. Runs before the classifier
+      # Sonnet call so a matched tell short-circuits to 8b without
+      # sending the issue body through an LLM. The scan is a
+      # conservative tripwire; the actual injection mitigations (wrap-
+      # as-data, fresh-context reviewer, schema-constrained output)
+      # remain in place downstream. Any match routes to 8b with reason
+      # `suspicious-input — manual review` via the Decide route step.
+      - name: Suspicious-input scan
+        id: suspicious
+        run: |
+          bash .claude/scripts/triage/suspicious-input-scan.sh \
+            /tmp/triage/issue.json \
+            .claude/scripts/taxonomies/suspicious-input-tells.json \
+            /tmp/triage/suspicious-input.json
+
+          hit=$(jq -r '.suspicious' /tmp/triage/suspicious-input.json)
+          echo "suspicious=${hit}" >> "$GITHUB_OUTPUT"
+
+          if [[ "${hit}" == "true" ]]; then
+            matched=$(jq -r '.matched_tells | join(",")' \
+              /tmp/triage/suspicious-input.json)
+            echo "::warning::suspicious-input tells matched: ${matched}"
+          fi
+
      # Stage 2 — classify.
      - name: Classify issue
        id: classify
+        if: steps.suspicious.outputs.suspicious != 'true'
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
@@ -232,17 +257,23 @@ jobs:
      # `bug` so Stage 5 + Stage 6 can rate the `duplicate_of` target.
      # Phase 4 adds `enhancement` to the investigate route — needed
      # for the 8c variant, which renders existing-surface citations
-      # from reviewer-kept findings. All three classifications run
-      # through the same Stage 3-7 pipeline; Stage 7 selects between
-      # 8a (bug findings), 8c (enhancement-design), and 8b (deferral)
-      # based on classification + reviewer state.
+      # from reviewer-kept findings. Phase 4 also adds a
+      # suspicious-input short-circuit: when Stage 2a matched any
+      # prompt-injection tell, route straight to 8b with reason
+      # `suspicious-input — manual review`, bypassing the LLM
+      # classifier entirely.
      - name: Decide route
        id: route
        run: |
+          suspicious="${{ steps.suspicious.outputs.suspicious }}"
          classification="${{ steps.classify.outputs.classification }}"
          disagreed="${{ steps.doublecheck.outputs.disagreed }}"

-          if [[ "${disagreed}" == "true" ]]; then
+          if [[ "${suspicious}" == "true" ]]; then
+            echo "route=deferral" >> "$GITHUB_OUTPUT"
+            echo "deferral_reason_id=suspicious-input" \
+              >> "$GITHUB_OUTPUT"
+          elif [[ "${disagreed}" == "true" ]]; then
            echo "route=deferral" >> "$GITHUB_OUTPUT"
            echo "deferral_reason_id=ambiguous" >> "$GITHUB_OUTPUT"
          elif [[ "${classification}" == "bug" \
@@ -1530,6 +1561,7 @@ jobs:

      - name: Write step summary
        env:
+          SUSPICIOUS: ${{ steps.suspicious.outputs.suspicious }}
          CLASSIFICATION: ${{ steps.classify.outputs.classification }}
          CONFIDENCE: ${{ steps.classify.outputs.confidence }}
          DISAGREED: ${{ steps.doublecheck.outputs.disagreed }}
@@ -1558,8 +1590,9 @@ jobs:
            echo "|---|---|"
            echo "| Issue | #${ISSUE_NUMBER} |"
            echo "| Dry run | ${DRY_RUN:-false} |"
-            echo "| Classification | ${CLASSIFICATION} |"
-            echo "| Confidence | ${CONFIDENCE} |"
+            echo "| Suspicious-input tripped | ${SUSPICIOUS:-false} |"
+            echo "| Classification | ${CLASSIFICATION:-n/a (scan blocked)} |"
+            echo "| Confidence | ${CONFIDENCE:-n/a} |"
            echo "| Doublecheck disagreed | ${DISAGREED:-n/a} |"
            echo "| Version drift | ${DRIFT_DETECTED:-n/a} |"
            echo "| Findings proposed | ${FINDINGS_TOTAL:-0} |"