Mirror of https://github.com/aaddrick/claude-desktop-debian.git, synced 2026-05-17 08:36:35 +03:00
feat(triage): Phase 4 sub-PR 2 — suspicious-input tells (#471)
* feat(triage): Phase 4 sub-PR 2 — suspicious-input tells
Adds a conservative Stage 2a tripwire that scans the raw issue title
and body for prompt-injection tells before any LLM call. A match
short-circuits routing to 8b with reason
`suspicious-input — manual review` and skips the Sonnet invocation.

The scan is the front-line filter; the actual injection mitigations
(wrap-as-data, fresh-context reviewer, schema-constrained output)
remain in place for everything that doesn't trip it. The two layers are
complementary: the scan catches the obvious attempts cheaply, and the
downstream defenses protect against the clever ones.
Taxonomy
- taxonomies/suspicious-input-tells.json defines eight tells, each
  with a regex pattern and a rationale:
  - ignore-prior-instructions: classic opener
  - system-prompt-leak: exfiltration attempts
  - role-override: "you are now a different…"
  - forget-instructions: variation of ignore-prior
  - developer-mode: named jailbreaks (DAN, etc.)
  - instruction-injection-sysrole: chat-template tokens
  - long-base64-block: 200+ contiguous base64 chars
  - unicode-tag-sequence: invisible tag characters, U+E0000-U+E007F
Scanner
- scripts/triage/suspicious-input-scan.sh: pure bash, PCRE matching
  via grep -Pzi; writes suspicious-input.json with matched_tells[].
  Uses the same taxonomy-as-data pattern as reasons.json and
  label-blocklist.json.
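A minimal sketch of the matching loop described above, not the actual suspicious-input-scan.sh. To stay dependency-free it inlines two tells in a hypothetical `tells` table instead of loading taxonomies/suspicious-input-tells.json, and both patterns are illustrative guesses rather than the shipped regexes. `scan_text` is a hypothetical helper name. It assumes GNU grep with PCRE support.

```shell
# Hypothetical tell table, one "<id> <pcre>" pair per line.
# The real ids and patterns live in suspicious-input-tells.json;
# these two patterns are assumptions made for illustration only.
tells='ignore-prior-instructions ignore\s+(all\s+)?(prior|previous)\s+instructions
long-base64-block [A-Za-z0-9+/]{200,}'

# scan_text TEXT
# Prints the id of every tell whose pattern matches TEXT, one per line.
scan_text() {
  printf '%s\n' "$tells" | while read -r id pattern; do
    # -P: PCRE, -i: case-insensitive, -q: exit status only.
    # -z: NUL-terminated records, so the whole text is matched as one
    #     record and \s+ can span line breaks.
    if printf '%s' "$1" | grep -Pzqi -- "$pattern"; then
      printf '%s\n' "$id"
    fi
  done
}
```

Calling `scan_text "$body"` on an issue body prints nothing when no tell matches, so an empty result corresponds to an empty matched_tells[] list.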
Workflow
- Stage 2a step runs between input snapshot and classify and outputs
  a `suspicious` boolean
- Classify and doublecheck are both `if:`-gated, so they are skipped
  on a hit
- Decide route checks `suspicious` first, ahead of the doublecheck
  disagreement check, so a tripped tell defers deterministically
- Step summary shows the suspicious flag
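The branch ordering in the Decide route bullet can be sketched as a small function. This is a simplified illustration, not the workflow step: `decide_route` is a hypothetical name, it prints the outputs instead of appending them to "$GITHUB_OUTPUT", and the final branch with its `by-classification` value is a placeholder for the classification-based routing that this commit does not change. Because classify and doublecheck are skipped on a hit, their outputs arrive as empty strings, so the `suspicious` test is the only branch that can route such an issue deliberately.

```shell
# decide_route SUSPICIOUS DISAGREED
# Prints the route outputs as key=value lines.
decide_route() {
  if [ "$1" = "true" ]; then
    # A tripped tell wins over every other signal.
    printf 'route=deferral\ndeferral_reason_id=suspicious-input\n'
  elif [ "$2" = "true" ]; then
    printf 'route=deferral\ndeferral_reason_id=ambiguous\n'
  else
    # Placeholder (assumption): the real step goes on to inspect the
    # classification here.
    printf 'route=by-classification\n'
  fi
}
```

For example, `decide_route true ""` models a run where the scan tripped and the gated steps never produced output.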
Co-Authored-By: Claude <claude@anthropic.com>
* refactor(triage): drop dead null-string guards in suspicious-input scan
jq -r '.body // ""' already returns an empty string for JSON null or a
missing field, so the subsequent `[[ "${body}" == "null" ]]` guards only
fire when a reporter's body is the literal four-character string "null",
which isn't an injection signal and matches no tell. The comment
describing the guards was also wrong about jq's behavior. Remove both
guards and correct the comment.

Also fix a misleading comment about `|| true` (which isn't in the code)
and collapse the 4-line `suspicious` boolean derivation into a single
`jq 'length > 0'`.
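The jq behavior this refactor relies on can be checked directly. The sketch below assumes jq is installed; `read_body` and `is_suspicious` are hypothetical wrapper names, and the `.matched_tells` path is taken from the workflow step rather than from the scanner source, which this page does not show.

```shell
# Reads a JSON object on stdin and prints .body as raw text.
# The alternative operator // substitutes "" when .body is null or
# the key is missing, so no literal "null" string is ever produced
# for those two cases.
read_body() {
  jq -r '.body // ""'
}

# Reads the scan result on stdin and prints true when at least one
# tell matched, false otherwise.
is_suspicious() {
  jq '.matched_tells | length > 0'
}
```

Feeding `{"body":null}` or `{}` to `read_body` prints an empty line, while `{"body":"null"}` prints the four characters `null`, which is the only case the removed guards could ever have caught.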
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
.github/workflows/issue-triage-v2.yml (vendored, 47 lines changed)
@@ -131,9 +131,34 @@ jobs:
             --json name --jq '[.[].name]' \
             > /tmp/triage/repo-labels.json

+      # Stage 2a — suspicious-input scan. Runs before the classifier
+      # Sonnet call so a matched tell short-circuits to 8b without
+      # sending the issue body through an LLM. The scan is a
+      # conservative tripwire; the actual injection mitigations (wrap-
+      # as-data, fresh-context reviewer, schema-constrained output)
+      # remain in place downstream. Any match routes to 8b with reason
+      # `suspicious-input — manual review` via the Decide route step.
+      - name: Suspicious-input scan
+        id: suspicious
+        run: |
+          bash .claude/scripts/triage/suspicious-input-scan.sh \
+            /tmp/triage/issue.json \
+            .claude/scripts/taxonomies/suspicious-input-tells.json \
+            /tmp/triage/suspicious-input.json
+
+          hit=$(jq -r '.suspicious' /tmp/triage/suspicious-input.json)
+          echo "suspicious=${hit}" >> "$GITHUB_OUTPUT"
+
+          if [[ "${hit}" == "true" ]]; then
+            matched=$(jq -r '.matched_tells | join(",")' \
+              /tmp/triage/suspicious-input.json)
+            echo "::warning::suspicious-input tells matched: ${matched}"
+          fi
+
       # Stage 2 — classify.
       - name: Classify issue
         id: classify
+        if: steps.suspicious.outputs.suspicious != 'true'
         env:
           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
         run: |
@@ -232,17 +257,23 @@ jobs:
       # `bug` so Stage 5 + Stage 6 can rate the `duplicate_of` target.
       # Phase 4 adds `enhancement` to the investigate route — needed
       # for the 8c variant, which renders existing-surface citations
-      # from reviewer-kept findings. All three classifications run
-      # through the same Stage 3-7 pipeline; Stage 7 selects between
-      # 8a (bug findings), 8c (enhancement-design), and 8b (deferral)
-      # based on classification + reviewer state.
+      # from reviewer-kept findings. Phase 4 also adds a
+      # suspicious-input short-circuit: when Stage 2a matched any
+      # prompt-injection tell, route straight to 8b with reason
+      # `suspicious-input — manual review`, bypassing the LLM
+      # classifier entirely.
       - name: Decide route
         id: route
         run: |
+          suspicious="${{ steps.suspicious.outputs.suspicious }}"
           classification="${{ steps.classify.outputs.classification }}"
           disagreed="${{ steps.doublecheck.outputs.disagreed }}"

-          if [[ "${disagreed}" == "true" ]]; then
+          if [[ "${suspicious}" == "true" ]]; then
+            echo "route=deferral" >> "$GITHUB_OUTPUT"
+            echo "deferral_reason_id=suspicious-input" \
+              >> "$GITHUB_OUTPUT"
+          elif [[ "${disagreed}" == "true" ]]; then
             echo "route=deferral" >> "$GITHUB_OUTPUT"
             echo "deferral_reason_id=ambiguous" >> "$GITHUB_OUTPUT"
           elif [[ "${classification}" == "bug" \
@@ -1530,6 +1561,7 @@ jobs:

       - name: Write step summary
         env:
+          SUSPICIOUS: ${{ steps.suspicious.outputs.suspicious }}
           CLASSIFICATION: ${{ steps.classify.outputs.classification }}
           CONFIDENCE: ${{ steps.classify.outputs.confidence }}
           DISAGREED: ${{ steps.doublecheck.outputs.disagreed }}
@@ -1558,8 +1590,9 @@ jobs:
             echo "|---|---|"
             echo "| Issue | #${ISSUE_NUMBER} |"
             echo "| Dry run | ${DRY_RUN:-false} |"
-            echo "| Classification | ${CLASSIFICATION} |"
-            echo "| Confidence | ${CONFIDENCE} |"
+            echo "| Suspicious-input tripped | ${SUSPICIOUS:-false} |"
+            echo "| Classification | ${CLASSIFICATION:-n/a (scan blocked)} |"
+            echo "| Confidence | ${CONFIDENCE:-n/a} |"
             echo "| Doublecheck disagreed | ${DISAGREED:-n/a} |"
             echo "| Version drift | ${DRIFT_DETECTED:-n/a} |"
             echo "| Findings proposed | ${FINDINGS_TOTAL:-0} |"
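The switch from `${CLASSIFICATION}` to `${CLASSIFICATION:-n/a (scan blocked)}` matters because a skipped step leaves its outputs empty, and the `:-` form substitutes the default for an empty value as well as an unset one. A minimal sketch, with `summary_row` as a hypothetical helper that is not part of the workflow:

```shell
# summary_row CLASSIFICATION
# Prints the Classification row of the step summary table, falling
# back to a placeholder when the classify step produced no output.
summary_row() {
  CLASSIFICATION="$1"
  echo "| Classification | ${CLASSIFICATION:-n/a (scan blocked)} |"
}
```

With an empty argument the row reads `n/a (scan blocked)`; with `bug` it shows the classification unchanged.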