Compare commits

...

6 Commits

Author SHA1 Message Date
Xavier Roche
594820d3eb Add AGENTS.md operational checklist for AI-assisted contributions
LLM-assisted PRs are arriving; give agents one compact, tool-neutral file
covering the repo's toolchain rules and invariants so contributions arrive
review-ready instead of needing the conventions reconstructed each time.

AGENTS.md is the operational checklist (build/test, autotools regen, touched-
lines-only formatting, byte-safe Latin-1 edits, overflow-safe bounds,
adversarial self-review, commit/PR discipline). CLAUDE.md imports it via
@AGENTS.md so Claude Code auto-loads the same source. CONTRIBUTING.md keeps the
policy and gains a Co-Authored-By attribution rule plus a PR-conciseness line.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 04:01:29 +02:00
Xavier Roche
a6fc0e9dab Merge pull request #361 from xroche/chore/bump-coucal-shift-ub
Bump src/coucal to fadf29b (MurmurHash3 signed-shift UB fix)
2026-06-15 17:04:09 +02:00
Xavier Roche
f227135d16 Bump src/coucal to fadf29b (MurmurHash3 signed-shift UB fix)
Picks up coucal PR #6: the MurmurHash3 tail mixing shifted a byte
promoted to int left by 24, overflowing signed int once the byte had
its high bit set (UBSan). A sanitized live crawl hashing arbitrary URL
keys aborted on it.

Verified: the ASan+UBSan www.edf.fr crawl that previously aborted at
murmurhash3.h:123 now completes clean (100 pages, no findings).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 14:46:04 +02:00
Xavier Roche
223564eaca Merge pull request #360 from xroche/cleanup/htscore-bounds
Bound htscore.c pointer-destination buffer writes (batch 8)
2026-06-15 10:28:29 +02:00
Xavier Roche
7db49a64b6 Bound htscore.c pointer-destination buffer writes (batch 8)
Convert htscore.c's 18 pointer-destination strcpybuff/strcatbuff sites (which
silently degrade to unchecked strcpy/strcat per the htssafe.h diagnostic) to
bounded forms:

- httpmirror(): one htsbuff over the malloc'd primary buffer drives the whole
  link accumulation, replacing the manual "primary_ptr += strlen" cursor in the
  filelist loop; the +/- filter slots build through htsbuff over their known
  HTS_URLMAXSIZE*2 capacity.
- host_ban(): the "-host/*" filter slot builds through htsbuff.
- htsAddLink(): str->localLink builds through htsbuff / strlcpybuff bounded by
  str->localLinkSize.
- next_token(): the in-place unquote/unescape copied the (always shorter) result
  back through an 8KB temp buffer, which both relied on an unchecked pointer copy
  and aborted on tokens over 8KB. Replace with memmove left-shift compaction: no
  capacity guess, no size cap.

Add a next_token() regression test to basic_selftests (httrack -#7) covering
plain tokens, quote stripping, and \" / \\ unescaping; teeth verified.

htscore.c pointer-destination sites 18 -> 0.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 10:16:06 +02:00
Xavier Roche
f1c04c10eb Merge pull request #359 from xroche/fix/malloc-size-plus4
Allocate exactly one extra byte for cache-buffer NUL terminators
2026-06-15 09:33:26 +02:00
6 changed files with 156 additions and 45 deletions

67
AGENTS.md Normal file
View File

@@ -0,0 +1,67 @@
# AGENTS.md — working in the HTTrack tree
Policy and PR etiquette live in [CONTRIBUTING.md](CONTRIBUTING.md). This file is
the operational checklist: toolchain, invariants, and how to ship a change.
## Build & test
- Fresh clone first: `git submodule update --init src/coucal`
- `bash configure && make && make check`
## Hard invariants
- **Toolchain edit** (`configure.ac`, any `Makefile.am`, `m4/`) → run
`autoreconf -fi` and commit the regenerated tracked files. The repo ships the
generated `configure`/`Makefile.in` so users build without autotools; CI does
**not** catch staleness.
- **Format only changed lines** with `git clang-format` (clang-format 19). Never
reformat untouched code: the engine was formatted by an old tool and won't
round-trip.
- **Byte-safe edits.** Files with raw high bytes are ISO-8859-1 (French
comments). Edit them byte-wise (`perl -0pi`, `sed`), not through a tool that
re-encodes to UTF-8 and corrupts them.
## Security (HTTrack parses hostile input off the network)
- Bounds-check every copy. Overflow-safe form: put the untrusted value alone,
`untrusted < limit - controlled` — never `controlled + untrusted < limit`,
which can wrap and pass.
## Code & prose
- Be terse. Comment the why, in English; translate French comments you touch.
- Strip AI tells from prose (em-dash overuse, rule-of-three, filler, vague
attributions). Ref: Wikipedia "Signs of AI writing". Claude Code: `/humanizer`.
- Behavior change → add a test. Fast path: a hidden `httrack -#N` debug
subcommand (`htscoremain.c`) driven by a `tests/NN_*.test`, over a slow crawl.
## Review your change adversarially (strongly suggested)
Before pushing, and when reviewing others, don't skim for bugs:
- **One invariant at a time.** Name a property the diff must preserve (bounds
hold, cache/wire format unchanged, no use-after-free, ABI stable), then
construct inputs that would break it. "General correctness" is not a charter.
- **Audit tests against the spec, not the code.** For each new test ask: "what
buggy path would still pass this?" If you can build one, the test is
confirmation-biased: assertions copied from observed output lock bugs in.
- **Risk areas need runtime probes.** Touching hostile-input parsing, struct
layout/ABI, cache/wire format, or a security path? A static or unit check
isn't enough; exercise the wrong behavior at runtime. Claude Code:
`/review-recipe`.
## Commits
- **Sign-off is mandatory.** Every commit carries a `Signed-off-by` trailer:
`git commit -s` (DCO, CI-enforced — unsigned commits are rejected).
- **Co-Authored-By is mandatory for AI-assisted commits.** Carry a
`Co-Authored-By:` trailer naming the assistant. Attribute there, never in a
PR-body footer.
- PRs land as a merge commit; every commit on the branch goes onto master, so
keep each commit message clean and meaningful.
## PR descriptions
- Plain concise prose; lead with what changed and why. No What/Why/How template.
- Title names the problem, not the implementation.
- Don't restate the diff — give what it can't show: motivation, context,
tradeoffs, risk.
- Length tracks the change: a typo is one sentence; a security fix earns a writeup.
- Verify claims against the code before you write them; flag drift, don't repeat it.
- Don't hard-wrap (GitHub reflows). No "Generated with Claude" footer. Run the
prose through `/humanizer`.
## Toolchain
C · clang-format-19 · autoreconf · shfmt + shellcheck (shell) · black + flake8 (Python)

1
CLAUDE.md Normal file
View File

@@ -0,0 +1 @@
@AGENTS.md

View File

@@ -1,12 +1,15 @@
# Contributing to HTTrack
HTTrack is small and old. Keep changes easy to review and safe to merge.
HTTrack is small and old. Keep changes easy to review and safe to merge. Working
with an AI assistant? The operational checklist is [AGENTS.md](AGENTS.md).
## Pull requests
- One change per PR. Small diffs merge fast.
- PRs are squash-merged: the title and description become the commit message, so
explain *why*.
- PRs land as a merge commit, so the branch's commits go onto master as-is: keep
each commit message clean and explain *why*.
- Be terse in the PR title and description: name the problem, not the fix, don't
restate the diff, and calibrate length to the change.
- Add or update tests for engine changes (`tests/`), and keep CI green.
## Style
@@ -30,6 +33,9 @@ Welcome, and nothing to disclose. Two rules:
- **Own every line** as if you wrote it. Can't explain it in review? Not ready.
- **Don't push your work onto reviewers.** A raw generated patch a maintainer has
to vet from scratch will be closed.
- **Attribution is mandatory.** AI-assisted commits must carry a
`Co-Authored-By:` trailer naming the assistant, not a footer in the PR
description.
The sign-off covers AI-assisted code too.

View File

@@ -633,13 +633,12 @@ int httpmirror(char *url1, httrackp * opt) {
// c'est plus propre et plus logique que d'entrer à la main les liens dans la pile
// on bénéficie ainsi des vérifications et des tests du robot pour les liens "primaires"
primary = (char *) malloct(primary_len);
if (primary) {
primary[0] = '\0';
} else {
if (!primary) {
printf("PANIC! : Not enough memory [%d]\n", __LINE__);
XH_extuninit;
return 0;
}
htsbuff primarybuff = htsbuff_ptr(primary, primary_len);
while(*a) {
int i;
@@ -687,11 +686,11 @@ int httpmirror(char *url1, httrackp * opt) {
strcatbuff(tempo, "*"); // ajouter un *
}
}
if (type)
strcpybuff(filters[filptr], "+");
else
strcpybuff(filters[filptr], "-");
strcatbuff(filters[filptr], tempo);
{
htsbuff fb = htsbuff_ptr(filters[filptr], HTS_URLMAXSIZE * 2);
htsbuff_cpy(&fb, type ? "+" : "-");
htsbuff_cat(&fb, tempo);
}
filptr++;
/* sanity check */
@@ -726,12 +725,10 @@ int httpmirror(char *url1, httrackp * opt) {
}
url[i++] = '\0';
//strcatbuff(primary,"<PRIMARY=\"");
if (strstr(url, ":/") == NULL)
strcatbuff(primary, "http://");
strcatbuff(primary, url);
//strcatbuff(primary,"\">");
strcatbuff(primary, "\n");
htsbuff_cat(&primarybuff, "http://");
htsbuff_cat(&primarybuff, url);
htsbuff_cat(&primarybuff, "\n");
}
} // while
@@ -762,7 +759,6 @@ int httpmirror(char *url1, httrackp * opt) {
int filelist_ptr = 0;
int n = 0;
char BIGSTK line[HTS_URLMAXSIZE * 2];
char *primary_ptr = primary + strlen(primary);
while(filelist_ptr < filelist_sz) {
int count =
@@ -771,13 +767,10 @@ int httpmirror(char *url1, httrackp * opt) {
if (count && line[0]) {
n++;
if (strstr(line, ":/") == NULL) {
strcpybuff(primary_ptr, "http://");
primary_ptr += strlen(primary_ptr);
htsbuff_cat(&primarybuff, "http://");
}
strcpybuff(primary_ptr, line);
primary_ptr += strlen(primary_ptr);
strcpybuff(primary_ptr, "\n");
primary_ptr += 1;
htsbuff_cat(&primarybuff, line);
htsbuff_cat(&primarybuff, "\n");
}
}
// fclose(fp);
@@ -2453,9 +2446,10 @@ void host_ban(httrackp * opt, int ptr,
// interdire host
assertf((*_FILTERS_PTR) < opt->maxfilter);
if (*_FILTERS_PTR < opt->maxfilter) {
strcpybuff(_FILTERS[*_FILTERS_PTR], "-");
strcatbuff(_FILTERS[*_FILTERS_PTR], host);
strcatbuff(_FILTERS[*_FILTERS_PTR], "/*"); // host/ * interdit
htsbuff fb = htsbuff_ptr(_FILTERS[*_FILTERS_PTR], HTS_URLMAXSIZE * 2);
htsbuff_cpy(&fb, "-");
htsbuff_cat(&fb, host);
htsbuff_cat(&fb, "/*"); // forbid host/*
(*_FILTERS_PTR)++;
}
// oups
@@ -3518,7 +3512,7 @@ char *next_token(char *p, int flag) {
p--;
do {
p++;
if (flag && (*p == '\\')) { // sauter \x ou \"
if (flag && (*p == '\\')) { // skip \x or \"
if (quote) {
char c = '\0';
@@ -3527,20 +3521,14 @@ char *next_token(char *p, int flag) {
else if (*(p + 1) == '"')
c = '"';
if (c) {
char BIGSTK tempo[8192];
tempo[0] = c;
tempo[1] = '\0';
strcatbuff(tempo, p + 2);
strcpybuff(p, tempo);
/* unescape the 2 chars to one, shifting left in place */
*p = c;
memmove(p + 1, p + 2, strlen(p + 2) + 1);
}
}
} else if (*p == 34) { // guillemets (de fin)
char BIGSTK tempo[8192];
tempo[0] = '\0';
strcatbuff(tempo, p + 1);
strcpybuff(p, tempo); /* wipe "" */
} else if (*p == 34) { // closing quote
/* drop the quote, shifting the rest left in place */
memmove(p, p + 1, strlen(p + 1) + 1);
p--;
/* */
quote = !quote;
@@ -3880,7 +3868,7 @@ int htsAddLink(htsmoduleStruct * str, char *link) {
afs.af.adr, afs.save, savename(), tempo);
if (str->localLink
&& str->localLinkSize > (int) strlen(tempo) + 1) {
strcpybuff(str->localLink, tempo);
strlcpybuff(str->localLink, tempo, str->localLinkSize);
}
}
}
@@ -3892,11 +3880,11 @@ int htsAddLink(htsmoduleStruct * str, char *link) {
lien);
if (str->localLink
&& str->localLinkSize > (int) (strlen(afs.af.adr) + strlen(afs.af.fil) + 8)) {
str->localLink[0] = '\0';
htsbuff lb = htsbuff_ptr(str->localLink, str->localLinkSize);
if (!link_has_authority(afs.af.adr))
strcpybuff(str->localLink, "http://");
strcatbuff(str->localLink, afs.af.adr);
strcatbuff(str->localLink, afs.af.fil);
htsbuff_cat(&lb, "http://");
htsbuff_cat(&lb, afs.af.adr);
htsbuff_cat(&lb, afs.af.fil);
}
r = -1;
}

View File

@@ -236,6 +236,55 @@ static void basic_selftests(void) {
}
freet(slots);
}
// next_token(): in-place token scanner. Strips surrounding quotes, unescapes
// \" and \\ when flag is set, and returns the token terminator (the space, or
// NULL at end of string). The unquote/unescape rewrites the string in place
// by shifting left, so the result is always shorter -- regression for that
// compaction.
{
char tok[64];
// plain token: unchanged, returns a pointer AT the separating space (exact
// position, not just any space -- a strchr-style impl would land elsewhere
// once quotes shift the content)
strcpybuff(tok, "abc def");
{
char *const end = next_token(tok, 0);
assertf(end == tok + 3 && *end == ' ' && strcmp(tok, "abc def") == 0);
}
// surrounding quotes stripped, returns the (post-shift) trailing space
strcpybuff(tok, "\"ab\" cd");
{
char *const end = next_token(tok, 1);
assertf(end == tok + 2 && *end == ' ' && strcmp(tok, "ab cd") == 0);
}
// a space inside quotes does not end the token; end of string returns NULL
strcpybuff(tok, "\"a b\"c");
{
char *const end = next_token(tok, 1);
assertf(end == NULL && strcmp(tok, "a bc") == 0);
}
// \" and \\ are unescaped to literal " and \ in place
strcpybuff(tok, "\"a\\\"b\\\\c\"");
{
char *const end = next_token(tok, 1);
assertf(end == NULL && strcmp(tok, "a\"b\\c") == 0);
}
// unterminated quote: the opening quote is dropped, the rest survives, and
// the scan runs to the NUL (returns NULL)
strcpybuff(tok, "\"ab");
{
char *const end = next_token(tok, 1);
assertf(end == NULL && strcmp(tok, "ab") == 0);
}
// trailing lone backslash in a quote: *(p+1) is the NUL, not an escape, so
// the backslash is kept intact (and there is no over-read past the NUL)
strcpybuff(tok, "\"a\\");
{
char *const end = next_token(tok, 1);
assertf(end == NULL && strcmp(tok, "a\\") == 0);
}
}
}
/* Self-tests for the htssafe.h bounded string ops (driven by httrack -#8).