feat(triage): Phase 4 sub-PR 2 — suspicious-input tells (#471)

* feat(triage): Phase 4 sub-PR 2 — suspicious-input tells Adds a conservative Stage 2a tripwire that scans the raw issue body and title for prompt-injection tells before any LLM call. A match short-circuits routing to 8b with reason `suspicious-input — manual review`, no Sonnet invocation. The scan is the front-line filter; the actual injection mitigations (wrap-as-data, fresh-context reviewer, schema-constrained output) remain in place for everything that doesn't trip. The two layers are complementary: the scan catches the obvious attempts cheaply, the downstream defenses protect against the clever ones. Taxonomy - taxonomies/suspicious-input-tells.json — eight tells with regex patterns and rationale: - ignore-prior-instructions: classic opener - system-prompt-leak: exfiltration attempts - role-override: "you are now a different…" - forget-instructions: variation of ignore-prior - developer-mode: named jailbreaks (DAN, etc.) - instruction-injection-sysrole: chat-template tokens - long-base64-block: 200+ contiguous base64 chars - unicode-tag-sequence: U+E0000-E007F invisibles Scanner - scripts/triage/suspicious-input-scan.sh — pure bash, PCRE via grep -Pzi, writes suspicious-input.json with matched_tells[]. Uses the same taxonomy-as-data pattern as reasons.json and label-blocklist.json. Workflow - Stage 2a step runs between input snapshot and classify, outputs `suspicious` boolean - Classify + doublecheck both `if:`-gated so they skip on a hit - Decide route takes suspicious first, before the doublecheck disagreement check — a tripped tell defers deterministically - Step summary shows the suspicious flag Co-Authored-By: Claude <claude@anthropic.com> * refactor(triage): drop dead null-string guards in suspicious-input scan jq -r '.body // ""' already returns an empty string for JSON null or a missing field, so the subsequent `[[ "${body}" == "null" ]]` guards only fire when a reporter's body is the literal four-character string "null" — which isn't an injection signal and matches no tell. The comment describing the guards was also wrong about jq's behavior. Remove both guards and correct the comment. Also fix a misleading comment about `|| true` (which isn't in the code) and collapse the 4-line `suspicious` boolean derivation into a single `jq 'length > 0'`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 08:36:35 +03:00 · 2026-04-20 23:34:46 -04:00
parent b9fe8e3c14
commit 9fc49bd260
3 changed files with 166 additions and 7 deletions
--- a/.claude/scripts/taxonomies/suspicious-input-tells.json
+++ b/.claude/scripts/taxonomies/suspicious-input-tells.json
@@ -0,0 +1,46 @@
+{
+  "comment": "Fixed list of prompt-injection tells scanned against the raw issue body at Stage 2 before any LLM call. A hit routes the issue to 8b with reason 'suspicious-input — manual review'; no investigation, no labels beyond triage routing. The goal is a conservative, easy-to-audit front-line filter — not to replace the structured prompt-injection defenses downstream (wrap-as-data, fresh-context reviewer, schema-constrained output), which are the actual mitigation. Stage 2 is a tripwire; if it fires the maintainer reads the issue themselves rather than asking an LLM to.",
+  "rationale": "Regex patterns are case-insensitive (ripgrep -i semantics). Each pattern targets a specific tactic documented in the prompt-injection literature or observed in real spam/abuse attempts. Keep the list narrow — over-broad patterns block legitimate reports. Any hit defers to a human; there is no 'this is fine, investigate anyway' fallback.",
+  "tells": [
+    {
+      "id": "ignore-prior-instructions",
+      "pattern": "ignore (all )?(prior|previous|above) (instructions|prompts|directives)",
+      "description": "Classic prompt-injection opener. Seen verbatim in indirect-injection research (Willison, Greshake et al.)."
+    },
+    {
+      "id": "system-prompt-leak",
+      "pattern": "(reveal|print|show|output|disclose) (your )?(system|initial|original) (prompt|instructions|directive)",
+      "description": "Attempts to exfiltrate the surrounding prompt context. Legitimate reports don't need the system prompt."
+    },
+    {
+      "id": "role-override",
+      "pattern": "you are (now|actually|really) (a |an )?(different|new|evil|jailbroken|unrestricted|developer-mode)",
+      "description": "Role-reassignment attack. Legitimate issues don't redefine the bot's role."
+    },
+    {
+      "id": "forget-instructions",
+      "pattern": "(forget|disregard|override) (everything|all|your|the) (above|prior|previous|instructions|training)",
+      "description": "Variation of ignore-prior-instructions with different verb."
+    },
+    {
+      "id": "developer-mode",
+      "pattern": "(enter|activate|enable) (developer|dan|jailbreak|unrestricted|admin|root) mode",
+      "description": "Named jailbreak tactic. No legitimate reporter asks for this."
+    },
+    {
+      "id": "instruction-injection-sysrole",
+      "pattern": "<\\|?(system|im_start|assistant)\\|?>",
+      "description": "Chat-template tokens. A legitimate Markdown issue body would not contain these; they exist to try to forge conversation turns."
+    },
+    {
+      "id": "long-base64-block",
+      "pattern": "[A-Za-z0-9+/]{200,}={0,2}",
+      "description": "A contiguous base64-looking run of 200+ characters is almost always an attempt to smuggle encoded instructions past visible scanning. Legitimate logs with base64 payloads (certificate fingerprints, compressed traces) should be uploaded as files or quoted in short snippets."
+    },
+    {
+      "id": "unicode-tag-sequence",
+      "pattern": "[\\x{E0000}-\\x{E007F}]{3,}",
+      "description": "Unicode Tag block (U+E0000-E007F) is invisible in most renderers and used to smuggle hidden instructions. Three or more consecutive tag characters is a deliberate signal, not accidental."
+    }
+  ]
+}
--- a/.claude/scripts/triage/suspicious-input-scan.sh
+++ b/.claude/scripts/triage/suspicious-input-scan.sh
@@ -0,0 +1,80 @@
+#!/usr/bin/env bash
+# Stage 2 suspicious-input scan for issue triage v2.
+#
+# Reads the raw issue body + title from a JSON file and scans for
+# prompt-injection tells listed in
+# taxonomies/suspicious-input-tells.json. Any match routes the issue
+# to 8b human-deferral with reason `suspicious-input — manual review`,
+# bypassing the LLM classifier entirely. The scanner is conservative
+# by design — the structured defenses downstream (wrap-as-data, fresh
+# reviewer context, schema-constrained output) remain the actual
+# mitigation; Stage 2 is the front-line tripwire.
+#
+# Usage: suspicious-input-scan.sh <issue.json> <tells.json> <output.json>
+#
+# Reads `.title` and `.body` from <issue.json>, each tell's `pattern`
+# from <tells.json>, writes
+#   { "suspicious": <bool>, "matched_tells": [<id>, ...] }
+# to <output.json>.
+#
+# Patterns are PCRE (grep -P); case-insensitive; multi-line DOTALL
+# where the pattern spans lines (grep -z handles the body as one
+# blob). Empty body or title scanning is a no-op — the scan ignores
+# absent fields rather than treating them as matches.
+
+set -o errexit
+set -o nounset
+set -o pipefail
+
+issue_json="${1:?issue.json required}"
+tells_json="${2:?tells.json required}"
+output="${3:?output path required}"
+
+# ─── Read fields ──────────────────────────────────────────────────
+# `// ""` turns a JSON null into an empty string. `-r` strips the
+# quotes so a legitimately-empty field is "" rather than the literal
+# four-char string "null".
+
+title=$(jq -r '.title // ""' "${issue_json}")
+body=$(jq -r '.body // ""' "${issue_json}")
+
+# ─── Scan ─────────────────────────────────────────────────────────
+# Each tell's regex runs against the concatenated title + body. Using
+# printf '%s\n%s' keeps them on separate lines so patterns that
+# require line-anchored match (none do today) stay line-aware.
+#
+# grep -P is PCRE for `\x{...}` unicode escapes. -i is case-
+# insensitive for verbal tells. -z treats the input as one record
+# separated by NUL so patterns can span lines (relevant for the
+# long-base64-block tell).
+
+combined=$(printf '%s\n%s' "${title}" "${body}")
+
+matched='[]'
+
+while IFS= read -r tell; do
+	tell_id=$(jq -r '.id' <<<"${tell}")
+	pattern=$(jq -r '.pattern' <<<"${tell}")
+
+	# grep -zP reads the whole input as one record so patterns can
+	# span lines; -q because we only need the exit status. `if`
+	# consumes grep's exit code, so the non-match exit 1 doesn't trip
+	# pipefail + errexit.
+	if printf '%s' "${combined}" \
+			| grep -qziP -- "${pattern}" 2>/dev/null; then
+		matched=$(jq --arg id "${tell_id}" \
+			'. + [$id]' <<<"${matched}")
+	fi
+done < <(jq -c '.tells[]' "${tells_json}")
+
+# ─── Output ───────────────────────────────────────────────────────
+
+suspicious=$(jq 'length > 0' <<<"${matched}")
+
+jq -n \
+	--argjson suspicious "${suspicious}" \
+	--argjson matched "${matched}" \
+	'{
+		suspicious: $suspicious,
+		matched_tells: $matched
+	}' > "${output}"
--- a/.github/workflows/issue-triage-v2.yml
+++ b/.github/workflows/issue-triage-v2.yml
@@ -131,9 +131,34 @@ jobs:
            --json name --jq '[.[].name]' \
            > /tmp/triage/repo-labels.json

+      # Stage 2a — suspicious-input scan. Runs before the classifier
+      # Sonnet call so a matched tell short-circuits to 8b without
+      # sending the issue body through an LLM. The scan is a
+      # conservative tripwire; the actual injection mitigations (wrap-
+      # as-data, fresh-context reviewer, schema-constrained output)
+      # remain in place downstream. Any match routes to 8b with reason
+      # `suspicious-input — manual review` via the Decide route step.
+      - name: Suspicious-input scan
+        id: suspicious
+        run: |
+          bash .claude/scripts/triage/suspicious-input-scan.sh \
+            /tmp/triage/issue.json \
+            .claude/scripts/taxonomies/suspicious-input-tells.json \
+            /tmp/triage/suspicious-input.json
+
+          hit=$(jq -r '.suspicious' /tmp/triage/suspicious-input.json)
+          echo "suspicious=${hit}" >> "$GITHUB_OUTPUT"
+
+          if [[ "${hit}" == "true" ]]; then
+            matched=$(jq -r '.matched_tells | join(",")' \
+              /tmp/triage/suspicious-input.json)
+            echo "::warning::suspicious-input tells matched: ${matched}"
+          fi
+
      # Stage 2 — classify.
      - name: Classify issue
        id: classify
+        if: steps.suspicious.outputs.suspicious != 'true'
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
@@ -232,17 +257,23 @@ jobs:
      # `bug` so Stage 5 + Stage 6 can rate the `duplicate_of` target.
      # Phase 4 adds `enhancement` to the investigate route — needed
      # for the 8c variant, which renders existing-surface citations
-      # from reviewer-kept findings. All three classifications run
-      # through the same Stage 3-7 pipeline; Stage 7 selects between
-      # 8a (bug findings), 8c (enhancement-design), and 8b (deferral)
-      # based on classification + reviewer state.
+      # from reviewer-kept findings. Phase 4 also adds a
+      # suspicious-input short-circuit: when Stage 2a matched any
+      # prompt-injection tell, route straight to 8b with reason
+      # `suspicious-input — manual review`, bypassing the LLM
+      # classifier entirely.
      - name: Decide route
        id: route
        run: |
+          suspicious="${{ steps.suspicious.outputs.suspicious }}"
          classification="${{ steps.classify.outputs.classification }}"
          disagreed="${{ steps.doublecheck.outputs.disagreed }}"

-          if [[ "${disagreed}" == "true" ]]; then
+          if [[ "${suspicious}" == "true" ]]; then
+            echo "route=deferral" >> "$GITHUB_OUTPUT"
+            echo "deferral_reason_id=suspicious-input" \
+              >> "$GITHUB_OUTPUT"
+          elif [[ "${disagreed}" == "true" ]]; then
            echo "route=deferral" >> "$GITHUB_OUTPUT"
            echo "deferral_reason_id=ambiguous" >> "$GITHUB_OUTPUT"
          elif [[ "${classification}" == "bug" \
@@ -1530,6 +1561,7 @@ jobs:

      - name: Write step summary
        env:
+          SUSPICIOUS: ${{ steps.suspicious.outputs.suspicious }}
          CLASSIFICATION: ${{ steps.classify.outputs.classification }}
          CONFIDENCE: ${{ steps.classify.outputs.confidence }}
          DISAGREED: ${{ steps.doublecheck.outputs.disagreed }}
@@ -1558,8 +1590,9 @@ jobs:
            echo "|---|---|"
            echo "| Issue | #${ISSUE_NUMBER} |"
            echo "| Dry run | ${DRY_RUN:-false} |"
-            echo "| Classification | ${CLASSIFICATION} |"
-            echo "| Confidence | ${CONFIDENCE} |"
+            echo "| Suspicious-input tripped | ${SUSPICIOUS:-false} |"
+            echo "| Classification | ${CLASSIFICATION:-n/a (scan blocked)} |"
+            echo "| Confidence | ${CONFIDENCE:-n/a} |"
            echo "| Doublecheck disagreed | ${DISAGREED:-n/a} |"
            echo "| Version drift | ${DRIFT_DETECTED:-n/a} |"
            echo "| Findings proposed | ${FINDINGS_TOTAL:-0} |"