Compare commits

...

9 Commits

Author SHA1 Message Date
Xavier Roche
f987083f14 Add an opt-in pre-commit hook that auto-formats changed C lines
Enable with: git config core.hooksPath .githooks

The hook runs git-clang-format (clang-format 19, repo .clang-format) on the
staged C lines only and re-stages the result, so commits stay
clang-format-clean and the CI format check passes without a round-trip. It never
reformats the whole tree, only the lines a commit changes.

Safe by construction: if clang-format 19 is absent it skips (CI still enforces);
and if a file has both staged and unstaged changes it does not auto-mutate
(which would commit the unstaged part), it reports and asks the author to
stage/stash. HTTRACK_NO_AUTOFORMAT=1 skips it for one commit. README covers the
noexec-working-tree case (point core.hooksPath at an exec-fs copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 12:55:17 +02:00
Xavier Roche
eb565f0bd8 Merge pull request #333 from xroche/cleanup/clang-format-setup
Add a .clang-format and a changed-lines CI format check
2026-06-14 12:38:20 +02:00
Xavier Roche
71398d510e Add a .clang-format and a changed-lines CI format check
The engine predates clang-format (it was shaped by an old Visual Studio
formatter) and does not round-trip through it: a whole-tree reformat is ~25k
lines of churn, so we never do one. Instead we format only the lines a change
touches, via git-clang-format, and enforce that in CI diff-scoped.

.clang-format is reverse-engineered from src/*.c (2-space, no tabs, 80 cols,
char *x pointers, attached braces, un-indented case labels, space after C-style
casts). That is mostly LLVM defaults; the deliberate deviations are
SpaceAfterCStyleCast (the dominant "(int) x" form) and SortIncludes: false
(C include order can be significant, so never reorder).

The CI "format" job pins clang-format-19 from apt.llvm.org's noble channel
(ubuntu-24.04's native is 18) to match local dev, and fails only if a PR's
changed C lines are not clang-format-clean. Existing untouched code is left
alone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 12:26:49 +02:00
Xavier Roche
75fc040f06 Merge pull request #331 from xroche/cleanup/htsbuff-builder
Add htsbuff: a bounded string builder over a fixed buffer
2026-06-14 10:40:23 +02:00
Xavier Roche
c4ef18f5a5 Add htsbuff: a bounded string builder over a fixed buffer
Many pointer-destination buff() sites are cursors walking a buffer of known
capacity, with a manual "p += strlen(p)" after each write (the url_savename
renderer does this ~40 times). That hand-rolled pointer math is where several
of the off-by-one hazards live.

htsbuff captures the pattern: a non-owning builder (buf/cap/len) built from an
in-scope array (htsbuff_array, capacity via sizeof) or a pointer of known size
(htsbuff_ptr). htsbuff_cat/catn/cpy bound every write against the real capacity
and abort on overflow, same contract as the *_safe_ helpers, so the pointer
math goes away.

Extend the -#8 self-test and tests/01_engine-strsafe.test with builder
correctness (append, truncating append, reset, length) and an overflow-abort
case. No call sites are converted yet; that follows per file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 10:38:22 +02:00
Xavier Roche
d76dad47f7 Merge pull request #330 from xroche/cleanup/htssafe-pointer-diagnostics
Flag unchecked pointer-destination uses of the buff() string macros
2026-06-14 08:49:26 +02:00
Xavier Roche
055e17b057 Merge pull request #328 from xroche/cli/header-ua-length-152
Raise the user-agent and custom-header length limits
2026-06-14 01:43:31 +02:00
Xavier Roche
d7bb97d697 Merge pull request #329 from xroche/parser/lock-background-image-237
Lock CSS background-image url() rewriting in the parser test
2026-06-14 01:37:51 +02:00
Xavier Roche
ca810ef7e3 Lock CSS background-image url() rewriting in the parser test
background-image is already captured and rewritten through the style/CSS
url() path, in both an external <style> block and an inline style attribute,
with the URL unquoted, double-quoted or single-quoted. Extend the offline
parser test to cover all of these so the behavior stays locked.

closes #237
2026-06-14 01:07:42 +02:00
8 changed files with 352 additions and 13 deletions

27
.clang-format Normal file
View File

@@ -0,0 +1,27 @@
# clang-format 19 config for the HTTrack C engine.
#
# IMPORTANT: this is applied to TOUCHED LINES ONLY (via git-clang-format / the
# CI format check). The engine was originally formatted by GNU indent / by hand
# and does NOT round-trip through clang-format, so a whole-tree reformat is
# intentionally never done. Format the lines you change; leave the rest.
#
# Reverse-engineered from src/*.c: 2-space indent, no tabs, 80 columns, pointers
# bound to the name (char *x), attached braces, un-indented case labels, and a
# space after C-style casts ((int) x). Most of that is LLVM's defaults; the
# lines below are the deliberate deviations.
BasedOnStyle: LLVM
# Engine specifics / deviations from LLVM:
SpaceAfterCStyleCast: true # "(int) x", overwhelmingly dominant (542 vs 7)
SortIncludes: false # C include order can be significant; never reorder
IncludeBlocks: Preserve # do not merge/reflow include groups
# Stated explicitly for robustness against base-style drift (these match LLVM):
IndentWidth: 2
UseTab: Never
ColumnLimit: 80
PointerAlignment: Right
IndentCaseLabels: false
SpaceBeforeParens: ControlStatements
AllowShortIfStatementsOnASingleLine: Never

35
.githooks/README.md Normal file
View File

@@ -0,0 +1,35 @@
# Git hooks
Versioned hooks for this repo. Enable them once per clone:
```sh
git config core.hooksPath .githooks
```
## pre-commit: auto-format changed C lines
Runs `git-clang-format` (clang-format 19, using the repo `.clang-format`) on the
**staged lines only** and re-stages the result, so every commit is
clang-format-clean and the CI `format` check passes. It never reformats the
whole tree, only the lines you changed.
- Disable for a single commit: `HTTRACK_NO_AUTOFORMAT=1 git commit ...`
- If clang-format 19 isn't installed, the hook skips silently (CI still
enforces). Install it with your distro's `clang-format-19`, or from
apt.llvm.org.
- If a file has *both* staged and unstaged changes, the hook does not
auto-mutate it (that would commit the unstaged part); it instead reports
whether its staged lines need formatting and asks you to stage/stash the rest.
### noexec working trees
Git executes the hook directly, so if your working tree is on a `noexec` mount
git cannot run `.githooks/pre-commit`. Point `core.hooksPath` at a copy on an
exec filesystem instead:
```sh
mkdir -p ~/.httrack-hooks && cp .githooks/pre-commit ~/.httrack-hooks/
chmod +x ~/.httrack-hooks/pre-commit
git config core.hooksPath ~/.httrack-hooks
```
</content>

71
.githooks/pre-commit Executable file
View File

@@ -0,0 +1,71 @@
#!/usr/bin/env bash
#
# Auto-format the staged C lines with clang-format (touched lines only), then
# re-stage them, so commits stay clang-format-clean and CI's format check passes.
#
# Enable once per clone: git config core.hooksPath .githooks
# Skip for one commit: HTTRACK_NO_AUTOFORMAT=1 git commit ...
#
# Matches the CI gate (.clang-format, clang-format 19). It only ever touches the
# lines a commit changes; it never reformats the whole tree.
set -euo pipefail
[ "${HTTRACK_NO_AUTOFORMAT:-}" = "1" ] && exit 0
# Staged C/H files (added/copied/modified/renamed).
mapfile -t files < <(git diff --cached --name-only --diff-filter=ACMR -- '*.c' '*.h')
[ "${#files[@]}" -eq 0 ] && exit 0
# Locate clang-format 19 and the git driver; if absent, skip (CI is the backstop).
cf=""
for c in clang-format-19 clang-format; do
if command -v "$c" >/dev/null 2>&1; then
case "$("$c" --version)" in *"version 19."*)
cf="$c"
break
;;
esac
fi
done
gcf=""
for g in git-clang-format-19 git-clang-format; do
command -v "$g" >/dev/null 2>&1 && {
gcf="$g"
break
}
done
if [ -z "$cf" ] || [ -z "$gcf" ]; then
echo "pre-commit: clang-format 19 not found; skipping auto-format (CI still checks)." >&2
exit 0
fi
# Files that are staged AND also have unstaged changes: re-staging them would
# pull in the unstaged work, so don't auto-mutate. Check instead and let the
# author resolve it.
partial=()
for f in "${files[@]}"; do
if ! git diff --quiet -- "$f"; then partial+=("$f"); fi
done
if [ "${#partial[@]}" -ne 0 ]; then
d="$("$gcf" --binary "$cf" --style=file --staged --diff --extensions c,h || true)"
case "$d" in
"" | "no modified files to format" | *"did not modify any files"*)
exit 0
;; # staged lines already clean
*)
echo "pre-commit: these files have both staged and unstaged changes, so" >&2
echo "auto-format was skipped to avoid committing unstaged work:" >&2
printf ' %s\n' "${partial[@]}" >&2
echo "Their staged lines need formatting. Stage the rest (or stash it)," >&2
echo "or run: $gcf --binary $cf --staged" >&2
exit 1
;;
esac
fi
# Clean-staged files: format the staged lines in the working tree, then re-stage.
"$gcf" --binary "$cf" --style=file --staged --extensions c,h >/dev/null || true
git add -- "${files[@]}"
exit 0

View File

@@ -81,7 +81,65 @@ jobs:
# Lint the scripts we maintain; the legacy scripts are a separate cleanup. # Lint the scripts we maintain; the legacy scripts are a separate cleanup.
- name: shellcheck - name: shellcheck
run: shellcheck man/makeman.sh tools/mkdeb.sh tests/*.test tests/check-network.sh run: shellcheck man/makeman.sh tools/mkdeb.sh .githooks/pre-commit tests/*.test tests/check-network.sh
- name: shfmt - name: shfmt
run: shfmt -d -i 4 man/makeman.sh tools/mkdeb.sh run: shfmt -d -i 4 man/makeman.sh tools/mkdeb.sh .githooks/pre-commit
# Check clang-format on CHANGED LINES ONLY. The engine predates clang-format
# (it was shaped by an old Visual Studio formatter) and does not round-trip,
# so we never reformat the whole tree -- only the lines a PR touches.
format:
name: format (clang-format-19, changed lines)
if: github.event_name == 'pull_request'
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Install clang-format 19 (pinned, from apt.llvm.org)
run: |
set -euo pipefail
# ubuntu-24.04's native clang-format is 18; pin 19 to match local dev.
wget -qO- https://apt.llvm.org/llvm-snapshot.gpg.key \
| sudo tee /etc/apt/trusted.gpg.d/apt.llvm.org.asc >/dev/null
echo "deb http://apt.llvm.org/noble/ llvm-toolchain-noble-19 main" \
| sudo tee /etc/apt/sources.list.d/llvm-19.list >/dev/null
sudo apt-get update
sudo apt-get install -y --no-install-recommends clang-format-19
# git-clang-format driver, pinned to an immutable release tag (not a
# moving branch) since we curl and then execute it.
sudo curl -fsSL -o /usr/local/bin/git-clang-format \
https://raw.githubusercontent.com/llvm/llvm-project/llvmorg-19.1.7/clang/tools/clang-format/git-clang-format
sudo chmod 0755 /usr/local/bin/git-clang-format
clang-format-19 --version
- name: Check formatting of changed lines
run: |
set -euo pipefail
git fetch --no-tags origin \
"+refs/heads/${{ github.base_ref }}:refs/remotes/origin/${{ github.base_ref }}"
base="origin/${{ github.base_ref }}"
set +e
diff="$(git clang-format --binary clang-format-19 --style=file \
--diff --extensions c,h "$base")"
rc=$?
set -e
# Classify by output first: a non-empty diff means "not clean",
# regardless of the driver's exit convention (the release-tag driver
# exits 0 and signals via stdout; some packaged drivers exit 1 on a
# diff). A nonzero exit with clean output is a real checker error.
case "$diff" in
"" | "no modified files to format" | *"did not modify any files"*)
if [ "$rc" -ne 0 ]; then
echo "::error::git clang-format failed (exit $rc): checker error."
exit 1
fi
echo "Formatting OK: changed C lines are clang-format-clean." ;;
*)
echo "$diff"
echo "::error::Changed C lines are not clang-format-clean."
echo "Fix locally with: git clang-format --binary clang-format-19 $base"
exit 1 ;;
esac

View File

@@ -193,6 +193,37 @@ static int string_safety_selftests(void) {
#pragma GCC diagnostic pop #pragma GCC diagnostic pop
#endif #endif
/* htsbuff: bounded builder over a fixed array (append, truncating append,
reset, and length tracking) */
{
char dst[8];
htsbuff b = htsbuff_array(dst);
htsbuff_cat(&b, "ab");
htsbuff_cat(&b, "cd");
if (strcmp(htsbuff_str(&b), "abcd") != 0 || b.len != 4)
return 1;
htsbuff_catn(&b, "efghij", 2); /* append at most 2 */
if (strcmp(htsbuff_str(&b), "abcdef") != 0)
return 1;
htsbuff_cpy(&b, "xyz"); /* reset */
if (strcmp(htsbuff_str(&b), "xyz") != 0 || b.len != 3)
return 1;
}
/* boundary: filling to exactly cap-1 must succeed (one more aborts, which the
-#8 overflow-buff mode checks) */
{
char d2[4];
htsbuff c = htsbuff_array(d2);
htsbuff_cat(&c, "abc");
if (strcmp(htsbuff_str(&c), "abc") != 0 || c.len != 3)
return 1;
}
return 0; return 0;
} }
@@ -2494,16 +2525,23 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 0; return 0;
break; break;
case '8': /* string-safety selftest: httrack -#8 [overflow <bigstr>] */ case '8': /* string-safety selftest: httrack -#8 [overflow <bigstr>] */
if (na + 1 < argc && strcmp(argv[na + 1], "overflow") == 0) { if (na + 1 < argc
/* Deliberately exceed a sized buffer: the bounded macro must && strncmp(argv[na + 1], "overflow", 8) == 0) {
/* Deliberately exceed a sized buffer: the bounded op must
abort. The source comes from argv so its length is opaque abort. The source comes from argv so its length is opaque
to the compiler (no static -Wstringop-overflow, genuine to the compiler (no static -Wstringop-overflow, genuine
runtime check). */ runtime check). "overflow-buff" exercises htsbuff. */
char small[4]; char small[4];
const char *const src = const char *const src =
(na + 2 < argc) ? argv[na + 2] : "overflowing"; (na + 2 < argc) ? argv[na + 2] : "overflowing";
strcpybuff(small, src); if (strcmp(argv[na + 1], "overflow-buff") == 0) {
htsbuff b = htsbuff_array(small);
htsbuff_cat(&b, src);
} else {
strcpybuff(small, src);
}
printf("strsafe: NOT aborted\n"); /* must be unreachable */ printf("strsafe: NOT aborted\n"); /* must be unreachable */
htsmain_free(); htsmain_free();
return 1; return 1;

View File

@@ -287,6 +287,81 @@ static HTS_INLINE HTS_UNUSED char* strcpy_safe_(char *const dest, const size_t s
return strncat_safe_(dest, sizeof_dest, source, sizeof_source, (size_t) -1, exp, file, line); return strncat_safe_(dest, sizeof_dest, source, sizeof_source, (size_t) -1, exp, file, line);
} }
/**
* htsbuff: a non-owning bounded string builder over a fixed buffer.
*
* Companion to the strcpybuff()/strcatbuff() macros for the common case of a
* cursor walking a buffer of known capacity (building a name into a fixed
* array, assembling a status line, etc.). It tracks the write position, bounds
* every write against the real capacity, and aborts on overflow (same contract
* as the *_safe_ helpers), so the error-prone manual "p += strlen(p)" dance
* goes away.
*
* Build one from an in-scope array with htsbuff_array() (capacity via sizeof,
* so pass an array, not a pointer), or from a pointer of known capacity with
* htsbuff_ptr(). The buffer is kept NUL-terminated; htsbuff_str() returns it.
*/
typedef struct {
char *buf; /* backing buffer (kept NUL-terminated) */
size_t cap; /* total capacity of buf, including the NUL */
size_t len; /* current length, excluding the NUL */
} htsbuff;
static HTS_INLINE HTS_UNUSED htsbuff htsbuff_ptr_(char *buf, size_t cap) {
htsbuff b;
b.buf = buf;
b.cap = cap;
b.len = 0;
assertf(cap != 0);
buf[0] = '\0';
return b;
}
/**
* Builder over the in-scope array ARR (capacity = sizeof(ARR)).
* On GCC/Clang this rejects a non-array (e.g. a char* pointer), whose sizeof
* would be the pointer size and silently wrong; use htsbuff_ptr() for pointers.
* On other compilers there is no such guard, so pass only true arrays there.
*/
#if (defined(__GNUC__) && !defined(__cplusplus))
/* 0 for an array, a -1 array-size compile error for a pointer. */
#define htsbuff_must_be_array_(A) \
(sizeof(char[1 - 2 * !!__builtin_types_compatible_p(typeof(A), typeof(&(A)[0]))]) - 1)
#define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR) + htsbuff_must_be_array_(ARR))
#else
#define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR))
#endif
/** Builder over pointer P of known capacity N (N includes the NUL). */
#define htsbuff_ptr(P, N) htsbuff_ptr_((P), (N))
/** Append at most n characters of s (stopping at its NUL). Aborts on overflow. */
static HTS_INLINE HTS_UNUSED void htsbuff_catn(htsbuff *b, const char *s, size_t n) {
const size_t add = strnlen(s, n);
/* Overflow-safe: keep the (potentially huge) 'add' alone on one side. The
maintained invariant len < cap makes 'cap - len' >= 1 (no underflow), so
'add < cap - len' cannot wrap the way 'len + add < cap' could. */
assertf__(add < b->cap - b->len, "htsbuff append overflow", __FILE__, __LINE__);
memcpy(b->buf + b->len, s, add);
b->len += add;
b->buf[b->len] = '\0';
}
/** Append s. Aborts on overflow. */
static HTS_INLINE HTS_UNUSED void htsbuff_cat(htsbuff *b, const char *s) {
htsbuff_catn(b, s, (size_t) -1);
}
/** Reset content to s. Aborts on overflow. */
static HTS_INLINE HTS_UNUSED void htsbuff_cpy(htsbuff *b, const char *s) {
b->len = 0;
htsbuff_catn(b, s, (size_t) -1);
}
/** Current NUL-terminated content. */
static HTS_INLINE HTS_UNUSED const char *htsbuff_str(const htsbuff *b) {
return b->buf;
}
#define malloct(A) malloc(A) #define malloct(A) malloc(A)
#define calloct(A,B) calloc((A), (B)) #define calloct(A,B) calloc((A), (B))
#define freet(A) do { if ((A) != NULL) { free(A); (A) = NULL; } } while(0) #define freet(A) do { if ((A) != NULL) { free(A); (A) = NULL; } } while(0)

View File

@@ -99,17 +99,25 @@ grep -Eq 'srcset="j\.gif 2x"' "$saved" ||
! grep -Eq 'srcset="[^"]*file://' "$saved" || ! grep -Eq 'srcset="[^"]*file://' "$saved" ||
! echo "FAIL: a file:// URL survived inside a rewritten srcset attribute" || exit 1 ! echo "FAIL: a file:// URL survived inside a rewritten srcset attribute" || exit 1
# xlink:href (#298) and inline background-image (#237): detected and rewritten # xlink:href (#298) and CSS background-image (#237): detected and rewritten to
# to local; no-detect attributes (title, alt, ...) left untouched. Asserted by # local. background-image is covered in both an external <style> block and an
# rewrite (deterministic), not download. data-* (#201/#203) is omitted: its # inline style attribute, with the URL unquoted, double-quoted and single-quoted
# detection is currently nondeterministic and can't be locked yet. # (the quote style is preserved on rewrite). No-detect attributes (title, alt,
# ...) are left untouched. Asserted by rewrite (deterministic), not download.
# data-* (#201/#203) is omitted: its detection is currently nondeterministic and
# can't be locked yet.
site2="$tmp/attrs" site2="$tmp/attrs"
mkdir -p "$site2" mkdir -p "$site2"
for f in xl ibg tt; do gif "$site2/$f.gif"; done for f in xl ibg ibgs cex cexd cexs tt; do gif "$site2/$f.gif"; done
cat >"$site2/index.html" <<EOF cat >"$site2/index.html" <<EOF
<html><body> <html><head><style>
.a { background-image: url(file://$site2/cex.gif); }
.b { background-image: url("file://$site2/cexd.gif"); }
.c { background-image: url('file://$site2/cexs.gif'); }
</style></head><body>
<a xlink:href="file://$site2/xl.gif">xlink:href (#298)</a> <a xlink:href="file://$site2/xl.gif">xlink:href (#298)</a>
<div style="background-image:url(file://$site2/ibg.gif)"></div> <div style="background-image:url(file://$site2/ibg.gif)"></div>
<div style="background-image:url('file://$site2/ibgs.gif')"></div>
<span title="file://$site2/tt.gif">excluded attribute</span> <span title="file://$site2/tt.gif">excluded attribute</span>
</body></html> </body></html>
EOF EOF
@@ -121,8 +129,24 @@ test -n "$saved2" || ! echo "FAIL: saved attrs page not found" || exit 1
# detected attributes: the absolute URL is rewritten to a local link # detected attributes: the absolute URL is rewritten to a local link
grep -Eq 'xlink:href="xl\.gif"' "$saved2" || grep -Eq 'xlink:href="xl\.gif"' "$saved2" ||
! echo "FAIL #298: xlink:href not detected/rewritten" || exit 1 ! echo "FAIL #298: xlink:href not detected/rewritten" || exit 1
# #237 external <style> block, each quoting form, quote style preserved
grep -Eq 'url\(cex\.gif\)' "$saved2" ||
! echo "FAIL #237: unquoted background-image in <style> not rewritten" || exit 1
grep -Eq 'url\("cexd\.gif"\)' "$saved2" ||
! echo "FAIL #237: double-quoted background-image in <style> not rewritten" || exit 1
grep -Eq "url\('cexs\.gif'\)" "$saved2" ||
! echo "FAIL #237: single-quoted background-image in <style> not rewritten" || exit 1
# #237 inline style attribute, unquoted and single-quoted url()
grep -Eq 'style="background-image:url\(ibg\.gif\)"' "$saved2" || grep -Eq 'style="background-image:url\(ibg\.gif\)"' "$saved2" ||
! echo "FAIL #237: inline background-image url() not detected/rewritten" || exit 1 ! echo "FAIL #237: inline unquoted background-image not rewritten" || exit 1
grep -Eq "style=\"background-image:url\('ibgs\.gif'\)\"" "$saved2" ||
! echo "FAIL #237: inline single-quoted background-image not rewritten" || exit 1
# no file:// URL survived inside any rewritten background-image
! grep -Eq 'background-image:[^;"]*file://' "$saved2" ||
! echo "FAIL #237: a file:// URL survived inside a rewritten background-image" || exit 1
# excluded attribute: title is on the no-detect list, so its value is left as-is # excluded attribute: title is on the no-detect list, so its value is left as-is
grep -q 'title="file://' "$saved2" || grep -q 'title="file://' "$saved2" ||

View File

@@ -21,3 +21,14 @@ case "$err" in
*"overflow while copying"*) ;; *"overflow while copying"*) ;;
*) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;; *) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;;
esac esac
# Same guarantee for the htsbuff builder. The source is exactly the buffer
# capacity (4 bytes into a 4-byte buffer), so this also pins the boundary: a
# '<=' off-by-one in the capacity check would let it through (and print "NOT
# aborted"). Match the specific htsbuff abort message, not just any assert.
err=$(httrack -#8 overflow-buff "abcd" 2>&1)
case "$err" in
*"strsafe: NOT aborted"*) echo "htsbuff over-capacity write was NOT caught" >&2; exit 1 ;;
*"htsbuff append overflow"*) ;;
*) echo "expected htsbuff overflow abort, got: $err" >&2; exit 1 ;;
esac