Compare commits

40 Commits

Author SHA1 Message Date
Xavier Roche
5a716a0e30 Bound htsparse.c pointer-destination buffer writes (batch 15)
The makeindex_firstlink_, base, codebase and loc_ aliases in the HTML
parser are bare char* views onto HTS_URLMAXSIZE*2 caller arrays, so
strcpybuff degraded to a raw strcpy (htssafe.h pointer-dest branch).
Bound all five with strlcpybuff(..., HTS_URLMAXSIZE*2), the documented
capacity of every target (makeindex_firstlink/base/codebase/loc in
htscore.c, r->location aliasing loc).

Behavior-preserving: each source (tempo, lien, back[].r.location) is
itself an HTS_URLMAXSIZE*2 buffer, so its NUL-terminated contents are
<= cap-1 and copy identically; no truncation is reachable. htsparse.c
now has zero pointer-destination warnings; htsserver.c (5) is the last
file before the stub can be flipped to an error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 21:20:01 +02:00
Xavier Roche
4bc6855213 Merge pull request #371 from xroche/cleanup/htsalias-bounds
Bound htsalias.c config-file alias buffer writes (batch 14)
2026-06-16 20:45:31 +02:00
Xavier Roche
fe8bd59d19 Bound htsalias.c pointer-destination buffer writes (batch 14)
htsalias.c keeps its own copy of htscoremain.c's cmdl_ins macro (config-file
alias expansion in optinclude_file). The copy still wrote alias-expanded tokens
into the argv block with an unbounded strcpybuff on a bare char*. Thread the
block capacity (x_argvblk_size) through optinclude_file and bound the insert
with strlcpybuff + cmdl_room, the same guard batch 13 applied to the original:
cmdl_room yields 0 instead of size_t-wrapping when the offset outruns the block,
so an alias/doit.log expansion bomb aborts cleanly rather than overflowing.

Adds 01_engine-rcfile.test, which had no coverage before: it drops a .httrackrc
with a long user-agent alias in the working directory, runs httrack with no -O
(the only way the rc files load), and checks the alias-expanded -F <value> token
reaches hts-cache/doit.log intact. user-agent expands to two tokens, exercising
both cmdl_ins insertions; a truncating bound is caught (verified by injecting
one).

htsalias.c pointer-destination warnings 2->0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 20:41:08 +02:00
Xavier Roche
83d813eb7f Merge pull request #370 from xroche/cleanup/htscoremain-bounds
Bound htscoremain.c pointer-destination buffer writes (batch 13)
2026-06-16 19:37:06 +02:00
Xavier Roche
31eead95df Bound htscoremain.c pointer-destination buffer writes (batch 13)
Continues the htssafe.h pointer-destination migration in the CLI parser
(hts_main_internal). All sites write into a bare char*.

* The cmdl_add()/cmdl_ins() macros build argv entries into the x_argvblk block
  (malloc'd as the command-line size + 32768). Thread the block's total size
  (recorded in a new x_argvblk_size) and bound the copy with strlcpybuff. The
  remaining room is computed by a cmdl_room() helper that yields 0 once the block
  is exhausted (alias expansion or doit.log insertion can outrun the 32768 slack)
  so the copy aborts cleanly instead of the size_t subtraction wrapping to a huge
  unbounded value.
* The in-place argv rewrites each write no more than the slot already holds, so
  they are bounded by strlen(dest)+1 (provably sufficient): the "(none)" ->
  "\"\"" replacement, the two quote-strip copies (tempo is argv[na] minus its
  surrounding quotes), and the "--catchurl" -> "-#P" rewrite. The "--clean"/
  "--tide" empty rewrite becomes a direct argv[i][1]='\0'.
* Guard the quote-strip's tempo[strlen(tempo)-1] read: a lone '"' argument left
  tempo empty and read tempo[-1] (out of bounds). It now takes the existing
  missing-quote error path.
* The URL accumulator append uses strlcatbuff against the tracked url_sz.

These are macros/locals inside hts_main_internal, so not -#7 unit-testable;
cmdl_add runs on every invocation (covered by the whole suite). New
01_engine-cmdline.test cases exercise the quote-strip rewrite as the sole URL (a
quoted URL is mirrored; dangling- and lone-quote arguments are refused cleanly,
never a crash).

htscoremain.c pointer-destination warnings: 10 -> 0.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 19:29:30 +02:00
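The cmdl_room() guard described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from htscoremain.c: the names room_left() and block_add() and their signatures are hypothetical, and the sketch reports failure with a return value where the real macros abort.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for cmdl_room(): bytes still free in a block of
 * `size` bytes once `offset` bytes are used. Yields 0 when the offset has
 * reached or passed the end, instead of letting size - offset wrap around. */
static size_t room_left(size_t size, size_t offset) {
  return offset < size ? size - offset : 0;
}

/* Hypothetical insert: copy `token` (with its NUL) at `offset` inside the
 * block. Returns the offset just past the NUL, or 0 if it does not fit. */
static size_t block_add(char *block, size_t size, size_t offset,
                        const char *token) {
  size_t need;

  assert(block != NULL && token != NULL);
  need = strlen(token) + 1;
  if (need > room_left(size, offset))
    return 0; /* no room: the caller stops instead of overflowing */
  memcpy(block + offset, token, need);
  return offset + need;
}
```

Because a successful insert always returns at least 1, a return of 0 is an unambiguous "block exhausted" signal in this sketch.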
Xavier Roche
1f29ed41db Bound htscoremain.c pointer-destination buffer writes (batch 13)
Continues the htssafe.h pointer-destination migration in the CLI parser
(hts_main_internal). All sites write into a bare char*.

* The cmdl_add()/cmdl_ins() macros build argv entries into the x_argvblk block
  (malloc'd as the command-line size + 32768). Thread the block's total size and
  bound the copy with strlcpybuff(argv[i], token, bufsize - ptr); record the size
  in a new x_argvblk_size alongside x_argvblk.
* The in-place argv rewrites each write no more than the slot already holds, so
  they are bounded by strlen(dest)+1 (provably sufficient): the "(none)" ->
  "\"\"" replacement, the two quote-strip copies (tempo is argv[na] minus its
  surrounding quotes), and the "--catchurl" -> "-#P" rewrite. The "--clean"/
  "--tide" empty rewrite becomes a direct argv[i][1]='\0'.
* The URL accumulator append uses strlcatbuff against the tracked url_sz.

These are macros/locals inside hts_main_internal, so they are not -#7
unit-testable; cmdl_add runs on every invocation (covered by the whole suite),
and a new 01_engine-cmdline.test case exercises the quote-strip rewrite (a quoted
URL is mirrored; a dangling quote is refused cleanly, never a crash).

htscoremain.c pointer-destination warnings: 10 -> 0.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 18:57:19 +02:00
Xavier Roche
9db360e5fd Merge pull request #369 from xroche/cleanup/htstools-bounds
Bound htstools.c pointer-destination buffer writes (batch 12)
2026-06-16 18:25:07 +02:00
Xavier Roche
88bfcff10c Bound htstools.c pointer-destination buffer writes (batch 12)
Continues the htssafe.h pointer-destination migration: the strcpybuff/strcatbuff
macros silently fall back to a raw strcpy/strcat when the destination is a bare
char* rather than a sized array.

All four functions are internal (hidden, not HTSEXT_API), so they take explicit
destination sizes:
* lienrelatif() builds a relative link into a char* caller buffer; threads a
  size_t and bounds the "../"/path appends with strlcatbuff (the local _curr
  copy uses sizeof(_curr)).
* long_to_83() / longfile_to_83() build an 8-3 / ISO9660 name into a caller
  buffer; thread a size_t and use strl(n)catbuff.
* ident_url_relatif()'s in-place IDNA host rewrite bounds the copy by the
  remaining capacity of adrfil->adr (a pointer into that array).

Callers in htscore.c, htswizard.c, htsparse.c and htsname.c pass sizeof(dest)
(all the destinations are HTS_URLMAXSIZE*2 arrays).

Add -#7 basic_selftests for longfile_to_83 (8-3 and ISO9660), long_to_83
(per-segment path conversion) and lienrelatif (same-dir basename, parent "../").

htstools.c pointer-destination warnings: 10 -> 0.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 18:01:47 +02:00
Xavier Roche
1df45fc231 Merge pull request #368 from xroche/cleanup/htsname-bounds
Bound htsname.c pointer-destination buffer writes (batch 11)
2026-06-16 17:25:12 +02:00
Xavier Roche
3a0f5779dd Bound htsname.c pointer-destination buffer writes (batch 11)
Continues the htssafe.h pointer-destination migration: the strcpybuff/strcatbuff
macros silently fall back to a raw strcpy/strcat when the destination is a bare
char* rather than a sized array.

In htsname.c:
* standard_name() builds the md5-based name into a caller buffer it received as
  char* (size lost), via a chain of strncatbuff/strcatbuff. It is internal
  (hidden, not HTSEXT_API), so it now takes an explicit destination size and
  builds through an htsbuff bounded builder; its one caller (the
  ADD_STANDARD_NAME macro) passes sizeof(buff).
* url_savename()'s delayed-extension append into lastDot (a pointer into the
  afs->save[HTS_URLMAXSIZE*2] array) is bounded with strlcatbuff against the
  remaining capacity.

Add a -#7 basic_selftests case for standard_name covering the no-query (no md5),
query (4-char md5) and short-name (clamped extension) paths.

htsname.c pointer-destination warnings: 12 -> 0.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 17:23:22 +02:00
Xavier Roche
46fd973e0b Merge pull request #366 from xroche/docs/agents-md
Add AGENTS.md operational checklist for AI-assisted contributions
2026-06-16 16:59:33 +02:00
Xavier Roche
ddc39b7dc0 Merge pull request #367 from xroche/cleanup/htslib-mime-bounds
Fix get_httptype contenttype overflow; bound the mime/normalize APIs
2026-06-16 16:59:11 +02:00
Xavier Roche
085937b305 Fix get_httptype contenttype overflow; bound the mime/normalize APIs
get_httptype() took the caller buffer as a bare char* and raw-strcpy'd the MIME
string into it, so crawling a URL ending in .docx/.pptx/.xlsx (whose table MIME
types reach 73 chars) overflowed the 64-byte htsblk.contenttype that the htsback
and htslib callers pass, corrupting the adjacent struct fields. Remotely
triggerable.

* Widen htsblk contenttype/charset/contentencoding to HTS_MIMETYPE_SIZE (128, a
  new named constant sized to hold the longest registered MIME type). This changes the
  installed htsblk layout, so bump the library soname (VERSION_INFO 2:49:0 ->
  3:0:0).
* Add bounded get_httptype_sized(), guess_httptype_sized() and
  adr_normalized_sized() that take the destination size and use
  strlcpybuff/snprintf. The old get_httptype(), guess_httptype() and
  adr_normalized() stay as wrappers, now marked HTS_DEPRECATED (portable:
  GCC/Clang attribute, MSVC __declspec, nothing elsewhere). Internal callers
  pass the real buffer size; the deprecated wrappers bound to the implicit
  contract their old callers relied on (HTS_MIMETYPE_SIZE for the mime buffer,
  HTS_URLMAXSIZE*2 for the URL buffer) rather than staying unbounded, so they
  abort on overflow instead of silently corrupting memory.
* get_httptype_sized(), guess_httptype_sized() and give_mimext() now report
  whether a type/extension was written; callers check the result and bail
  rather than use a possibly-empty buffer (e.g. the is_hypertext_mime helpers).
  A user "--assume cgi=" rule (empty value) matches but writes nothing, so
  get_httptype_sized() returns the buffer's emptiness, matching the old callers'
  strnotempty(s) test rather than reporting a bogus recognized type.
* -#7 basic_selftests: a .pptx MIME (73 chars) is stored whole into a real
  htsblk.contenttype (a [64] field makes the bounded copy abort); give_mimext
  and get_httptype_sized return values; the octet-stream fallback; the empty
  --assume rule; plus fil_normalized "//"-in-query preservation and cut_path
  trailing-slash / single-char branches.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 11:10:49 +02:00
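The sized-copy contract from the commit above (take the destination size, report whether a type was written) can be sketched like this. store_mime() is a hypothetical helper, not get_httptype_sized() itself, and it returns 0 on overflow where the commit says the real code aborts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy `mime` into dst[size]. Returns 1 when a
 * non-empty type was stored whole; otherwise returns 0 and leaves dst as an
 * empty string, so callers never use a truncated or stale type. */
static int store_mime(char *dst, size_t size, const char *mime) {
  int n;

  assert(dst != NULL && size > 0 && mime != NULL);
  n = snprintf(dst, size, "%s", mime);
  if (n < 0 || (size_t) n >= size) {
    dst[0] = '\0'; /* would not fit: store nothing */
    return 0;
  }
  return n > 0; /* an empty value matches but writes nothing */
}
```

With a 64-byte destination the 73-character .pptx type is rejected; with a 128-byte one it is stored whole.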
Xavier Roche
594820d3eb Add AGENTS.md operational checklist for AI-assisted contributions
LLM-assisted PRs are arriving; give agents one compact, tool-neutral file
covering the repo's toolchain rules and invariants so contributions arrive
review-ready instead of needing the conventions reconstructed each time.

AGENTS.md is the operational checklist (build/test, autotools regen, touched-
lines-only formatting, byte-safe Latin-1 edits, overflow-safe bounds,
adversarial self-review, commit/PR discipline). CLAUDE.md imports it via
@AGENTS.md so Claude Code auto-loads the same source. CONTRIBUTING.md keeps the
policy and gains a Co-Authored-By attribution rule plus a PR-conciseness line.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 04:01:29 +02:00
Xavier Roche
36a9f5a827 Merge pull request #365 from xroche/cleanup/htslib-bounds
Bound htslib.c pointer-destination buffer writes (batch 9)
2026-06-16 03:54:38 +02:00
Xavier Roche
20880c1a4d Bound htslib.c pointer-destination buffer writes (batch 9)
Continues the htssafe.h pointer-destination migration (X1), where the
strcpybuff/strcatbuff macros silently fall back to a raw strcpy/strcat
when the destination is a bare char* rather than a sized array.

In htslib.c:
* fil_normalized() rebuilds the sorted query through an htsbuff bounded
  builder over the malloc'd copyBuff, then copies it back with strlcpybuff
  (capacity is the known qLen + 1).
* treathead() bounds the Location: copy with strlcpybuff against the
  location_buffer[HTS_URLMAXSIZE*2] contract.
* give_mimext(), convtolower() and cut_path() are internal (hidden, not
  HTSEXT_API), so they take an explicit destination size and the callers
  pass it: give_mimext in htsname.c/htscoremain.c/htslib.c, convtolower in
  htshash.c. cut_path has no callers.

Add strlncatbuff(dst, src, size, n) to htssafe.h: a bounded n-limited
append with explicit capacity, the missing parallel to strlcatbuff.

Cover fil_normalized query-sort, give_mimext, convtolower and cut_path with
the -#7 basic_selftests.

get_httptype() and adr_normalized() are left for a follow-up: both are
exported (HTSEXT_API), and get_httptype() exposes a real latent overflow
(a .docx/.pptx/.xlsx URL writes a 65-73 char mime type into 64-byte
contenttype callers) whose fix is a public-ABI decision.

htslib.c pointer-destination warnings: 14 -> 4.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-16 03:48:52 +02:00
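A bounded, n-limited append with explicit capacity, as strlncatbuff(dst, src, size, n) is described above, might look like the sketch below. append_n() is a hypothetical function written for illustration; the actual htssafe.h macro and its failure behavior are not shown in this log, so the return-value convention here is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: append at most `n` bytes of `src` to the string in
 * dst[size]. Returns 1 on success, 0 if dst is not terminated within `size`
 * or the result (plus its NUL) would not fit. dst is unchanged on failure. */
static int append_n(char *dst, const char *src, size_t size, size_t n) {
  const char *dst_end;
  const char *src_end;
  size_t used, len;

  assert(dst != NULL && src != NULL);
  dst_end = memchr(dst, '\0', size);
  if (dst_end == NULL)
    return 0; /* destination is not a valid string within its capacity */
  used = (size_t) (dst_end - dst);
  src_end = memchr(src, '\0', n);
  len = src_end != NULL ? (size_t) (src_end - src) : n;
  if (len >= size - used)
    return 0; /* no room for len bytes plus the terminator */
  memcpy(dst + used, src, len);
  dst[used + len] = '\0';
  return 1;
}
```

The comparison keeps the caller-supplied length alone on one side (len >= size - used), so it cannot wrap: used is always below size once the terminator has been found.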
Xavier Roche
a6fc0e9dab Merge pull request #361 from xroche/chore/bump-coucal-shift-ub
Bump src/coucal to fadf29b (MurmurHash3 signed-shift UB fix)
2026-06-15 17:04:09 +02:00
Xavier Roche
f227135d16 Bump src/coucal to fadf29b (MurmurHash3 signed-shift UB fix)
Picks up coucal PR #6: the MurmurHash3 tail mixing shifted a byte
promoted to int left by 24, overflowing signed int once the byte had
its high bit set (UBSan). A sanitized live crawl hashing arbitrary URL
keys aborted on it.

Verified: the ASan+UBSan www.edf.fr crawl that previously aborted at
murmurhash3.h:123 now completes clean (100 pages, no findings).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 14:46:04 +02:00
Xavier Roche
223564eaca Merge pull request #360 from xroche/cleanup/htscore-bounds
Bound htscore.c pointer-destination buffer writes (batch 8)
2026-06-15 10:28:29 +02:00
Xavier Roche
7db49a64b6 Bound htscore.c pointer-destination buffer writes (batch 8)
Convert htscore.c's 18 pointer-destination strcpybuff/strcatbuff sites (which
silently degrade to unchecked strcpy/strcat per the htssafe.h diagnostic) to
bounded forms:

- httpmirror(): one htsbuff over the malloc'd primary buffer drives the whole
  link accumulation, replacing the manual "primary_ptr += strlen" cursor in the
  filelist loop; the +/- filter slots build through htsbuff over their known
  HTS_URLMAXSIZE*2 capacity.
- host_ban(): the "-host/*" filter slot builds through htsbuff.
- htsAddLink(): str->localLink builds through htsbuff / strlcpybuff bounded by
  str->localLinkSize.
- next_token(): the in-place unquote/unescape copied the (always shorter) result
  back through an 8KB temp buffer, which both relied on an unchecked pointer copy
  and aborted on tokens over 8KB. Replace with memmove left-shift compaction: no
  capacity guess, no size cap.

Add a next_token() regression test to basic_selftests (httrack -#7) covering
plain tokens, quote stripping, and \" / \\ unescaping; teeth verified.

htscore.c pointer-destination sites 18 -> 0.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 10:16:06 +02:00
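The memmove left-shift compaction described for next_token() can be illustrated with the sketch below. unquote_in_place() is a hypothetical function, and its exact rules (drop every double quote, unescape \" and \\) are assumptions made for the example rather than the real tokenizer's grammar.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: strip double quotes and unescape \" and \\ in place.
 * The result is never longer than the input, so each removal shifts the
 * tail (including its NUL) one byte left; no temporary buffer is needed. */
static void unquote_in_place(char *s) {
  char *p;

  assert(s != NULL);
  p = s;
  while (*p != '\0') {
    if (*p == '"') {
      /* drop the quote; stay on the byte that moved into its place */
      memmove(p, p + 1, strlen(p + 1) + 1);
    } else if (*p == '\\' && (p[1] == '"' || p[1] == '\\')) {
      /* drop the backslash, then step over the escaped character */
      memmove(p, p + 1, strlen(p + 1) + 1);
      p++;
    } else {
      p++;
    }
  }
}
```

This version is quadratic in the worst case because every removal shifts the whole tail; a read/write cursor pair would do the same compaction in one pass.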
Xavier Roche
f1c04c10eb Merge pull request #359 from xroche/fix/malloc-size-plus4
Allocate exactly one extra byte for cache-buffer NUL terminators
2026-06-15 09:33:26 +02:00
Xavier Roche
17fc54869d Allocate exactly one extra byte for cache-buffer NUL terminators
These fread buffers were over-allocated as size+4, a superstitious margin
that never bought anything: every site writes a single trailing NUL at
[size], so size+1 is exactly right. Trim them all to size+1.

The proxytrack disk-fallback read in PT_ReadCache__New_u never wrote that
NUL at all, unlike its sibling read paths in the same file; add the missing
r->adr[r->size] = '\0' so the spare byte is actually used and the buffer is
a valid C string.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 09:30:34 +02:00
Xavier Roche
d2e43549d8 Merge pull request #358 from xroche/ci/asan-poison-fill
ci: poison the ASan allocator to surface missing-NUL bugs
2026-06-15 09:19:04 +02:00
Xavier Roche
a9b16d96ea ci: poison the ASan allocator to surface missing-NUL bugs
Fill malloc'd and freed memory with 0xCA in the sanitize job so a buffer
fread into without NUL termination, then used as a C string, runs off into
the redzone instead of stopping at an accidental zero byte. ASan caps its
malloc fill at the first 4096 bytes by default, which lets large cache
buffers escape; max_malloc_fill_size lifts the cap. No rebuild, no source
change -- purely the test environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 09:16:48 +02:00
Xavier Roche
4ed828ff78 Merge pull request #357 from xroche/audit/fread-nul-termination
Fix more un-NUL-terminated fread buffers used as C strings
2026-06-15 09:07:37 +02:00
Xavier Roche
82ace34c4d Add a cache disk-fallback self-test for the NUL-termination invariant
The disk-fallback read (cache_readex with X-In-Cache: 0, body on disk) had no
runtime coverage: the crawl tests never re-read such a body into memory, which
is why the missing terminator there went unnoticed until the audit. Extend the
-#A cache self-test:

- check_entry now asserts every read-back body is NUL-terminated at [size],
  covering the in-zip read paths.
- A new pass stores a non-hypertext record (X-In-Cache: 0), creates the body at
  the exact fconv()-resolved path the reader uses, reads it back through the
  disk-fallback branch, and asserts it round-trips and is terminated.

Verified by reverting the fix: with the terminator removed the new pass fails
("body not NUL-terminated"); with it in place the pass is clean. Runs under the
ASan/UBSan CI job, so it now guards the disk-fallback path that had none.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 09:02:37 +02:00
Xavier Roche
3970eb3706 Fix more un-NUL-terminated fread buffers used as C strings
Follow-up audit after the cache strstr() overflow in #356: same pattern of
reading a file or record into a malloc'd buffer and then treating it as a C
string without a terminator.

- cache_readex disk-fallback paths (htscache.c, "previous_save"/"return_save")
  read a record body into malloc(size+4) but, unlike their zip and .dat
  siblings, never set the trailing NUL. The body is later strlen'd
  (htscache.c:923, htscore.c:1046), so an un-terminated one over-reads.
  Terminate it like the siblings do, but only for r.size >= 0: these two paths
  guard the read with `r.size > 0 &&`, so a crafted cache with a negative
  X-Size would otherwise fall through to write *(r.adr + r.size) one byte
  before the allocation (heap underflow). The sibling paths read
  unconditionally and fail the read for a negative size, so they never hit it.
- cache_readdata (HTS_FAST_CACHE) reads the record into malloc(len+4) whose
  comment already reserves the "Plus byte 0" but never set it. Set it (the
  enclosing `len > 0` keeps the write in bounds).
- index_finish (htsindex.c) ran strchr() over a malloc(size+4) buffer read raw
  from the temp index file; a final line without a newline would over-read.
  NUL-terminate before scanning.

All four are exercised under the ASan/UBSan CI job. proxytrack's store.c has the
same structural pattern but never strlen()s the body (it is served as binary),
so it is left as is.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 07:23:19 +02:00
Xavier Roche
d3c41b31e8 Merge pull request #356 from xroche/ci/hardening-sanitize-nossl-distcheck
ci: ASan/UBSan, no-openssl, and distcheck jobs (plus the bugs they found)
2026-06-15 06:57:02 +02:00
Xavier Roche
f8367eeac7 Fix heap-buffer-overflow reading the update cache
httpmirror() read hts-cache/new.lst into a malloc(sz) buffer and then ran
strstr() over it to decide which old files to purge. fread() does not
NUL-terminate, so strstr() scanned past the end of the allocation; with the
wrong heap layout it ran into the redzone. ASan caught it as a
heap-buffer-overflow on the cache-read (update) crawl. Whether it tripped
depended on the byte just past the buffer, which is why it surfaced only
intermittently on cold CI runners and never reproduced locally.

Allocate sz + 1 and NUL-terminate after the read, matching the existing
filelist_buff pattern in the same file. Both strstr() calls in the block are
covered.

Found by the new ASan/UBSan CI job.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-15 06:51:17 +02:00
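The fix pattern from the commit above (allocate sz + 1, terminate after the read, then scan) can be sketched as follows. read_as_string() and list_mentions() are hypothetical helpers, not functions from htscore.c, and the sketch terminates at the number of bytes actually read.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read up to `size` bytes from `fp` into a fresh
 * buffer and NUL-terminate it, since fread() never does. The caller frees
 * the result. Returns NULL on allocation failure. */
static char *read_as_string(FILE *fp, size_t size) {
  char *buf;
  size_t got;

  assert(fp != NULL);
  if (size == SIZE_MAX)
    return NULL; /* size + 1 would wrap to 0 */
  buf = malloc(size + 1); /* one extra byte for the terminator */
  if (buf == NULL)
    return NULL;
  got = fread(buf, 1, size, fp);
  buf[got] = '\0'; /* got <= size, so this stays in bounds */
  return buf;
}

/* Safe only because `list` is terminated: strstr() stops at the NUL. */
static int list_mentions(const char *list, const char *name) {
  return strstr(list, name) != NULL;
}
```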
Xavier Roche
9279a4b349 ci: add ASan/UBSan, no-openssl, and distcheck jobs
sanitize: build and run the suite under AddressSanitizer + UndefinedBehavior
Sanitizer, driving the parsers that handle untrusted crawled input. This
surfaced the use-after-free, the numeric-entity overflow, and the coucal
alignment fix in this branch; leak detection is off so the job reports
memory-safety errors rather than exit-time leaks.

no-ssl: build and test with --disable-https (and no libssl installed) so the
#if HTS_USEOPENSSL branches, never compiled by the libssl-equipped matrix, do
not rot.

distcheck: roll the release tarball and build/test it out-of-tree, guarding
against a source missing from *_SOURCES or EXTRA_DIST.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 23:37:59 +02:00
Xavier Roche
b52e8c4c0f Drop EXTRA_DIST wildcards so the dist tarball builds
automake does not expand wildcards in EXTRA_DIST, so "coucal/*" and the
"*.dsp/*.dsw/*.vcproj" globs were left as literal targets that broke
"make dist" (and distcheck) out-of-tree with "No rule to make target
'coucal/*'". List the files explicitly; coucal's .c/.h ship via *_SOURCES
already, so only its aux files (LICENSE, Makefile, README.md, sample.c,
tests.c) plus the Windows project files needed listing. Regenerated
src/Makefile.in.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 23:37:28 +02:00
Xavier Roche
665f51d1a0 Bump coucal: fix misaligned 32-bit loads in MurmurHash3
Picks up the coucal fix that reads each hash block with memcpy instead of
dereferencing an unaligned uint32_t*, clearing a UBSan alignment finding that
fired on nearly every hashtable insert during a crawl.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 23:37:27 +02:00
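Reading a 32-bit block with memcpy instead of dereferencing a possibly unaligned uint32_t pointer, as the commit above describes, looks like this in isolation. load_u32() is a hypothetical name and the sketch is not the coucal source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 4 bytes in native byte order from any address.
 * memcpy has no alignment requirement, unlike *(const uint32_t *) p, and
 * compilers typically reduce it to a single load where that is safe. */
static uint32_t load_u32(const unsigned char *p) {
  uint32_t v;

  assert(p != NULL);
  memcpy(&v, p, sizeof v);
  return v;
}
```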
Xavier Roche
e4e5d4699a Fix signed overflow when decoding large numeric HTML entities
A numeric entity such as &#9999999999; was accumulated digit by digit into an
int with no bound, overflowing once past INT_MAX (undefined behavior). Guard
before each multiply: a value beyond the Unicode maximum (0x10FFFF) is invalid
anyway, so stop and keep the entity literal instead of overflowing. The input
comes straight from crawled pages.

Found by the new ASan/UBSan CI job.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 23:37:27 +02:00
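The "guard before each multiply" approach from the commit above can be sketched as below. parse_numeric_entity() is a hypothetical function working on the digits only; returning -1 stands in for "keep the entity literal", which is an assumption about how a caller would use it.

```c
#include <assert.h>
#include <stddef.h>

#define UNICODE_MAX 0x10FFFFL

/* Hypothetical sketch: parse the decimal digits of a numeric entity.
 * Returns the code point, or -1 if the text is empty, contains a non-digit,
 * or names a value above the Unicode maximum. The check runs before the
 * multiply, so the accumulator never exceeds UNICODE_MAX. */
static long parse_numeric_entity(const char *digits) {
  long value = 0;

  assert(digits != NULL);
  if (*digits == '\0')
    return -1;
  for (; *digits != '\0'; digits++) {
    long d;

    if (*digits < '0' || *digits > '9')
      return -1;
    d = *digits - '0';
    if (value > (UNICODE_MAX - d) / 10)
      return -1; /* value * 10 + d would pass the maximum */
    value = value * 10 + d;
  }
  return value;
}
```

The guard is written with the accumulated value alone on the left, so the comparison itself cannot overflow.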
Xavier Roche
a50691c0f8 Fix use-after-free in the HTML post-process path
The post-process step captured a pointer into output_buffer's own storage,
reset the array size to zero, then re-appended that pointer. The append's
realloc (TypedArrayEnsureRoom reallocs unconditionally) could move the block,
leaving the copy reading freed memory. The default callback returns "modified"
without touching the data, so this hit on every crawl; ASan flagged the
use-after-free. glibc usually returns the same pointer on a same-size realloc,
which is why a plain build never crashed.

Only copy when the callback handed back a different buffer. When it edited
output_buffer in place, just adopt the new length.

Found by the new ASan/UBSan CI job.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 23:37:27 +02:00
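The rule from the commit above (copy only when the callback handed back a different buffer, otherwise adopt the new length) can be sketched with a small growable buffer. The buf_t type and buf_adopt() are hypothetical and far simpler than the TypedArray code; the sketch assumes the result either is exactly the buffer's own storage or does not alias it at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, for illustration only. */
typedef struct {
  char *data;
  size_t len;
  size_t cap;
} buf_t;

/* Adopt the callback's result. Returns 1 on success, 0 on failure. */
static int buf_adopt(buf_t *out, const char *result, size_t result_len) {
  assert(out != NULL && result != NULL);
  if (result == out->data) {
    /* Edited in place: no copy, so no realloc can move the block while it
     * is still being read from. */
    if (result_len > out->cap)
      return 0;
    out->len = result_len;
    return 1;
  }
  /* Different buffer: growing out->data cannot invalidate `result`. */
  if (result_len > out->cap) {
    char *grown = realloc(out->data, result_len);

    if (grown == NULL)
      return 0;
    out->data = grown;
    out->cap = result_len;
  }
  if (result_len > 0)
    memcpy(out->data, result, result_len);
  out->len = result_len;
  return 1;
}
```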
Xavier Roche
5f96e86818 Merge pull request #355 from xroche/ci/bump-checkout-v5
ci: bump actions/checkout to v6
2026-06-14 23:15:01 +02:00
Xavier Roche
6002bc20ca ci: bump actions/checkout from v4 to v6
Keeps the checkout action on a supported major; v4 runs on the
end-of-life Node 20 runtime, v6 moves to Node 24.

Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 23:13:06 +02:00
Xavier Roche
bdbc741597 Merge pull request #354 from xroche/ci/mkdeb-single-test
mkdeb: drop the redundant pre-build test pass
2026-06-14 22:22:49 +02:00
Xavier Roche
d0a1b957cd ci: let the deb job run debuild's test pass
The deb job set DEB_BUILD_OPTIONS=nocheck to skip a redundant second test run.
With mkdeb.sh no longer running its own pre-build check, debuild's is the only
test pass, so nocheck would suppress it entirely and CI would never exercise the
packaged build's tests. Drop nocheck; keep noautodbgsym and parallel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 22:15:51 +02:00
Xavier Roche
6c329744e7 mkdeb: drop the redundant pre-build test pass
mkdeb.sh built and tested the sources twice: once in its own export-tree
pre-build (make check, offline), then again under debuild, whose dh_auto_test
runs the suite with the online tests enabled (debian/rules configures with
--enable-online-unit-tests=auto). The first run was a slower, offline-only
subset of the second.

Drop mkdeb's own make check. The export-tree build stays, since regen-man needs
the compiled binaries, but the suite now runs once, under debuild, as the
superset. This is the same redundancy CI #352 removed via DEB_BUILD_OPTIONS=nocheck;
fixing it in mkdeb.sh applies it to release builds too instead of per-environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 22:13:23 +02:00
Xavier Roche
1375ef97d7 Merge pull request #353 from xroche/ci/macos-i386
ci: add macOS and 32-bit (i386) build jobs
2026-06-14 22:09:11 +02:00
38 changed files with 1098 additions and 294 deletions


@@ -31,7 +31,7 @@ jobs:
env:
CC: ${{ matrix.cc }}
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6
with:
submodules: recursive
@@ -69,7 +69,7 @@ jobs:
name: build (macOS arm64, clang)
runs-on: macos-14
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6
with:
submodules: recursive
@@ -104,7 +104,7 @@ jobs:
name: build (linux i386, gcc -m32)
runs-on: ubuntu-24.04
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6
with:
submodules: recursive
@@ -133,6 +133,97 @@ jobs:
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Memory safety: build and run the suite under AddressSanitizer +
# UndefinedBehaviorSanitizer. The offline engine self-tests drive the parsers
# that chew on untrusted crawled input (charset, mime, HTML, entities, IDNA,
# filters, cache) straight through the sanitizers, so a buffer overrun,
# use-after-free, or signed overflow there fails the build instead of slipping
# past a plain -O2 build. gcc's runtimes; one job is enough (the bug class is
# arch-independent and the matrix already covers compile portability).
sanitize:
name: sanitize (ASan+UBSan, gcc)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev
- name: Configure (sanitized)
run: |
set -euo pipefail
autoreconf -fi
./configure CC=gcc \
CFLAGS="-fsanitize=address,undefined -fno-sanitize-recover=all -g -O1 -fno-omit-frame-pointer" \
LDFLAGS="-fsanitize=address,undefined"
- name: Build
run: make -j"$(nproc)"
- name: Test (sanitized)
# Leaks at exit are out of scope (the CLI frees little on the way out);
# we want memory-safety errors, so turn leak detection off and make every
# other finding abort the run.
#
# Poison fresh allocations with 0xCA and freed blocks with 0xCB (decimal
# 202/203) so memory never reads back as accidental zeros: a missing-NUL
# fread buffer then runs strlen off into the redzone instead of stopping
# at a lucky zero. Distinct bytes tell the two apart in a dump (0xCA =
# uninitialized, 0xCB = use-after-free). ASan caps its malloc fill at 4096
# bytes by default, so max_malloc_fill_size lifts it to cover large cache
# buffers; free_fill flags use-after-free reads.
env:
ASAN_OPTIONS: detect_leaks=0:abort_on_error=1:halt_on_error=1:strict_string_checks=1:malloc_fill_byte=202:max_malloc_fill_size=2147483647:free_fill_byte=203:max_free_fill_size=2147483647
UBSAN_OPTIONS: print_stacktrace=1:halt_on_error=1
run: make check
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Optional-dependency build: compile and test with HTTPS/OpenSSL disabled --
# the configuration users on minimal systems build. libssl is not even
# installed here, so configure cannot silently re-enable it. The matrix above
# always has libssl, so the #if HTS_USEOPENSSL branches would otherwise never
# be compiled and could rot unnoticed.
no-ssl:
name: build (no openssl, --disable-https)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies (no libssl)
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential autoconf automake libtool autoconf-archive zlib1g-dev
- name: Configure (https disabled)
run: |
set -euo pipefail
autoreconf -fi
./configure --disable-https
- name: Build
run: make -j"$(nproc)"
- name: Test
run: make check
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Validate the Debian packaging via the same script maintainers release with.
# One amd64/gcc run is enough: packaging (control/rules/manifest/lintian/quilt
# source build) is arch- and compiler-independent, and the build matrix above
@@ -141,7 +232,7 @@ jobs:
name: deb package (lintian)
runs-on: ubuntu-24.04
steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v6
with:
submodules: recursive
@@ -158,22 +249,50 @@ jobs:
# debuild builds every package, then lintian gates on errors.
#
# DEB_BUILD_OPTIONS trims work CI does not need (release builds via
- # mkdeb.sh are untouched): nocheck skips debuild's make check, redundant
- # here since the build matrix and mkdeb's own pre-build already run the
- # suite; noautodbgsym drops the -dbgsym packages whose LTO payloads are
- # slow to compress and that CI never ships; parallel uses every core.
+ # mkdeb.sh are untouched): noautodbgsym drops the -dbgsym packages whose
+ # LTO payloads are slow to compress and that CI never ships; parallel uses
+ # every core. We let debuild run its test pass -- the only one now that
+ # mkdeb no longer runs its own -- so CI exercises the packaged tests.
- name: Build Debian packages
run: |
- export DEB_BUILD_OPTIONS="nocheck noautodbgsym parallel=$(nproc)"
+ export DEB_BUILD_OPTIONS="noautodbgsym parallel=$(nproc)"
bash tools/mkdeb.sh --unsigned --no-release-artifacts
# Release-tarball integrity: `make distcheck` rolls the dist tarball, then
# configures, builds and tests it out-of-tree from a read-only source tree and
# checks nothing is left behind. Catches a file referenced in *_SOURCES or
# EXTRA_DIST but missing from the tarball -- the same "ships broken to users"
# class as a stale committed Makefile.in.
distcheck:
name: distcheck (release tarball)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev
- name: distcheck
run: |
set -euo pipefail
autoreconf -fi
./configure
make -j"$(nproc)" distcheck
dco:
name: DCO sign-off
# Only checkable on a PR, where we have the base..head commit range.
if: github.event_name == 'pull_request'
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
@@ -202,7 +321,7 @@ jobs:
name: lint (shellcheck, shfmt)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
- name: Install linters
env:
@@ -231,7 +350,7 @@ jobs:
if: github.event_name == 'pull_request'
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0

AGENTS.md Normal file

@@ -0,0 +1,67 @@
# AGENTS.md — working in the HTTrack tree
Policy and PR etiquette live in [CONTRIBUTING.md](CONTRIBUTING.md). This file is
the operational checklist: toolchain, invariants, and how to ship a change.
## Build & test
- Fresh clone first: `git submodule update --init src/coucal`
- `bash configure && make && make check`
## Hard invariants
- **Toolchain edit** (`configure.ac`, any `Makefile.am`, `m4/`) → run
`autoreconf -fi` and commit the regenerated tracked files. The repo ships the
generated `configure`/`Makefile.in` so users build without autotools; CI does
**not** catch staleness.
- **Format only changed lines** with `git clang-format` (clang-format 19). Never
reformat untouched code: the engine was formatted by an old tool and won't
round-trip.
- **Byte-safe edits.** Files with raw high bytes are ISO-8859-1 (French
comments). Edit them byte-wise (`perl -0pi`, `sed`), not through a tool that
re-encodes to UTF-8 and corrupts them.
## Security (HTTrack parses hostile input off the network)
- Bounds-check every copy. Overflow-safe form: put the untrusted value alone,
`untrusted < limit - controlled` — never `controlled + untrusted < limit`,
which can wrap and pass.
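A minimal C sketch of the two forms, assuming `size_t` arithmetic and that the caller already guarantees `controlled <= limit`. The names `fits` and `fits_unsafe` are hypothetical and do not exist in the HTTrack tree:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 when `untrusted` more bytes (plus a
   terminator) fit in a buffer of `limit` bytes that already holds
   `controlled` bytes, 0 otherwise. */
int fits(size_t limit, size_t controlled, size_t untrusted) {
  assert(controlled <= limit); /* assumption: the caller owns this invariant */
  /* The untrusted value stands alone, so the subtraction cannot wrap. */
  return untrusted < limit - controlled;
}

/* The unsafe form, for contrast only: the sum can wrap around and
   compare as a small number. */
int fits_unsafe(size_t limit, size_t controlled, size_t untrusted) {
  return controlled + untrusted < limit;
}
```

With `limit = 16`, `controlled = 4` and `untrusted = (size_t) -1`, `fits` rejects the copy while `fits_unsafe` accepts it, because the sum wraps to 3.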
## Code & prose
- Be terse. Comment the why, in English; translate French comments you touch.
- Strip AI tells from prose (em-dash overuse, rule-of-three, filler, vague
attributions). Ref: Wikipedia "Signs of AI writing". Claude Code: `/humanizer`.
- Behavior change → add a test. Fast path: a hidden `httrack -#N` debug
subcommand (`htscoremain.c`) driven by a `tests/NN_*.test`, over a slow crawl.
## Review your change adversarially (strongly suggested)
Before pushing, and when reviewing others, don't skim for bugs:
- **One invariant at a time.** Name a property the diff must preserve (bounds
hold, cache/wire format unchanged, no use-after-free, ABI stable), then
construct inputs that would break it. "General correctness" is not a charter.
- **Audit tests against the spec, not the code.** For each new test ask: "what
buggy path would still pass this?" If you can build one, the test is
confirmation-biased: assertions copied from observed output lock bugs in.
- **Risk areas need runtime probes.** Touching hostile-input parsing, struct
layout/ABI, cache/wire format, or a security path? A static or unit check
isn't enough; exercise the wrong behavior at runtime. Claude Code:
`/review-recipe`.
## Commits
- **Sign-off is mandatory.** Every commit carries a `Signed-off-by` trailer:
`git commit -s` (DCO, CI-enforced — unsigned commits are rejected).
- **Co-Authored-By is mandatory for AI-assisted commits.** Carry a
`Co-Authored-By:` trailer naming the assistant. Attribute there, never in a
PR-body footer.
- PRs land as a merge commit; every commit on the branch goes onto master, so
keep each commit message clean and meaningful.
## PR descriptions
- Plain concise prose; lead with what changed and why. No What/Why/How template.
- Title names the problem, not the implementation.
- Don't restate the diff — give what it can't show: motivation, context,
tradeoffs, risk.
- Length tracks the change: a typo is one sentence; a security fix earns a writeup.
- Verify claims against the code before you write them; flag drift, don't repeat it.
- Don't hard-wrap (GitHub reflows). No "Generated with Claude" footer. Run the
prose through `/humanizer`.
## Toolchain
C · clang-format-19 · autoreconf · shfmt + shellcheck (shell) · black + flake8 (Python)

CLAUDE.md Normal file

@@ -0,0 +1 @@
@AGENTS.md

@@ -1,12 +1,15 @@
# Contributing to HTTrack
HTTrack is small and old. Keep changes easy to review and safe to merge.
HTTrack is small and old. Keep changes easy to review and safe to merge. Working
with an AI assistant? The operational checklist is [AGENTS.md](AGENTS.md).
## Pull requests
- One change per PR. Small diffs merge fast.
- PRs are squash-merged: the title and description become the commit message, so
explain *why*.
- PRs land as a merge commit, so the branch's commits go onto master as-is: keep
each commit message clean and explain *why*.
- Be terse in the PR title and description: name the problem, not the fix, don't
restate the diff, and calibrate length to the change.
- Add or update tests for engine changes (`tests/`), and keep CI green.
## Style
@@ -30,6 +33,9 @@ Welcome, and nothing to disclose. Two rules:
- **Own every line** as if you wrote it. Can't explain it in review? Not ready.
- **Don't push your work onto reviewers.** A raw generated patch a maintainer has
to vet from scratch will be closed.
- **Attribution is mandatory.** AI-assisted commits must carry a
`Co-Authored-By:` trailer naming the assistant, not a footer in the PR
description.
The sign-off covers AI-assisted code too.

configure vendored

@@ -3685,7 +3685,9 @@ fi
VERSION_INFO="2:49:0"
# 3:0:0: htsblk layout changed (contenttype/charset/contentencoding widened to
# 128), an incompatible ABI break, so bump current and reset revision/age.
VERSION_INFO="3:0:0"
{ printf "%s\n" "$as_me:${as_lineno-$LINENO}: checking whether to enable maintainer-specific portions of Makefiles" >&5
printf %s "checking whether to enable maintainer-specific portions of Makefiles... " >&6; }

@@ -29,7 +29,9 @@ AC_CONFIG_SRCDIR(src/httrack.c)
AC_CONFIG_MACRO_DIR([m4])
AC_CONFIG_HEADERS(config.h)
AM_INIT_AUTOMAKE([subdir-objects])
VERSION_INFO="2:49:0"
# 3:0:0: htsblk layout changed (contenttype/charset/contentencoding widened to
# 128), an incompatible ABI break, so bump current and reset revision/age.
VERSION_INFO="3:0:0"
AM_MAINTAINER_MODE
AC_USE_SYSTEM_EXTENSIONS

@@ -114,5 +114,12 @@ EXTRA_DIST = httrack.h webhttrack \
proxy/proxytrack.h \
proxy/store.h \
proxy/proxytrack.vcproj \
coucal/* \
*.dsw *.dsp *.vcproj
coucal/LICENSE \
coucal/Makefile \
coucal/README.md \
coucal/sample.c \
coucal/tests.c \
htsjava.vcproj \
httrack.dsp httrack.dsw httrack.vcproj \
libhttrack.dsp libhttrack.dsw libhttrack.vcproj \
webhttrack.dsp webhttrack.dsw webhttrack.vcproj

@@ -565,8 +565,15 @@ EXTRA_DIST = httrack.h webhttrack \
proxy/proxytrack.h \
proxy/store.h \
proxy/proxytrack.vcproj \
coucal/* \
*.dsw *.dsp *.vcproj
coucal/LICENSE \
coucal/Makefile \
coucal/README.md \
coucal/sample.c \
coucal/tests.c \
htsjava.vcproj \
httrack.dsp httrack.dsw httrack.vcproj \
libhttrack.dsp libhttrack.dsw libhttrack.vcproj \
webhttrack.dsp webhttrack.dsw webhttrack.vcproj
all: all-am

@@ -41,19 +41,24 @@ Please visit our Website: http://www.httrack.com
#define _NOT_NULL(a) ( (a!=NULL) ? (a) : "" )
// COPY OF cmdl_ins in htsmain.c
// Insert a command in the argc/argv
#define cmdl_ins(token,argc,argv,buff,ptr) \
{ \
int i; \
for(i=argc;i>0;i--)\
argv[i]=argv[i-1];\
} \
argv[0]=(buff+ptr); \
strcpybuff(argv[0],token); \
ptr += (int) (strlen(argv[0])+1); \
// COPY OF cmdl_ins in htscoremain.c
/* Bytes left in x_argvblk from offset ptr. The offset can in principle outrun
the block (alias/doit.log expansion), so the copy aborts cleanly instead of
the subtraction wrapping to a huge unbounded size. */
#define cmdl_room(bufsize, ptr) \
((ptr) < (size_t) (bufsize) ? (size_t) (bufsize) - (ptr) : 0)
// Insert a command in the argc/argv (buff has total capacity bufsize)
#define cmdl_ins(token, argc, argv, buff, bufsize, ptr) \
{ \
int i; \
for (i = argc; i > 0; i--) \
argv[i] = argv[i - 1]; \
} \
argv[0] = (buff + ptr); \
strlcpybuff(argv[0], token, cmdl_room(bufsize, ptr)); \
ptr += (int) (strlen(argv[0]) + 1); \
argc++
// END OF COPY OF cmdl_ins in htsmain.c
// END OF COPY OF cmdl_ins in htscoremain.c
/*
Aliases for command-line and config file definitions
@@ -468,7 +473,7 @@ const char *optalias_help(const char *token) {
*/
/* Note: NOT utf-8 */
int optinclude_file(const char *name, int *argc, char **argv, char *x_argvblk,
int *x_ptr) {
size_t x_argvblk_size, int *x_ptr) {
FILE *fp;
fp = fopen(name, "rb");
@@ -542,14 +547,15 @@ int optinclude_file(const char *name, int *argc, char **argv, char *x_argvblk,
/* temporary argc: Number of parameters after minus insert_after_argc */
insert_after_argc = (*argc) - insert_after;
cmdl_ins((tmp_argv[2]), insert_after_argc, (argv + insert_after),
x_argvblk, (*x_ptr));
x_argvblk, x_argvblk_size, (*x_ptr));
*argc = insert_after_argc + insert_after;
insert_after++;
/* Second one */
if (return_argc > 1) {
insert_after_argc = (*argc) - insert_after;
cmdl_ins((tmp_argv[3]), insert_after_argc,
(argv + insert_after), x_argvblk, (*x_ptr));
(argv + insert_after), x_argvblk, x_argvblk_size,
(*x_ptr));
*argc = insert_after_argc + insert_after;
insert_after++;
}
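
The clamp in `cmdl_room` can be tried in isolation. The sketch below is not HTTrack code: `room_left` copies the macro's expression, while `append_token` is a hypothetical stand-in for the bounded insert that returns -1 where the comment in the diff describes a clean abort.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same expression as cmdl_room: bytes left from offset ptr, and 0 (not a
   wrapped huge value) once the offset reaches or passes the end. */
size_t room_left(size_t bufsize, size_t ptr) {
  return ptr < bufsize ? bufsize - ptr : 0;
}

/* Hypothetical stand-in for the bounded insert: copy token (with its NUL)
   at offset *ptr and advance the offset. Returns 0 on success, -1 when the
   token does not fit. */
int append_token(char *buf, size_t bufsize, size_t *ptr, const char *token) {
  size_t room, len;

  assert(buf != NULL && ptr != NULL && token != NULL);
  room = room_left(bufsize, *ptr);
  len = strlen(token);
  if (len >= room) /* no space for the token plus its terminator */
    return -1;
  memcpy(buf + *ptr, token, len + 1);
  *ptr += len + 1;
  return 0;
}
```

An offset that has already run past the block yields a room of 0, so every further insert is refused instead of being sized by a wrapped subtraction.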

@@ -45,7 +45,7 @@ int optalias_find(const char *token);
const char *optalias_help(const char *token);
int optreal_find(const char *token);
int optinclude_file(const char *name, int *argc, char **argv, char *x_argvblk,
int *x_ptr);
size_t x_argvblk_size, int *x_ptr);
const char *optreal_value(int p);
const char *optalias_value(int p);
const char *opttype_value(int p);

@@ -3584,8 +3584,9 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
back[i].r.is_file = 1;
back[i].r.totalsize = back[i].r.size =
fsize_utf8(back[i].url_sav);
get_httptype(opt, back[i].r.contenttype,
back[i].url_sav, 1);
get_httptype_sized(opt, back[i].r.contenttype,
sizeof(back[i].r.contenttype),
back[i].url_sav, 1);
hts_log_print(opt, LOG_DEBUG,
"Not-modified status without cache guessed: %s%s",
back[i].url_adr, back[i].url_fil);

@@ -939,7 +939,7 @@ static htsblk cache_readex_new(httrackp * opt, cache_back * cache,
FILE *const fp = FOPEN(fconv(catbuff, sizeof(catbuff), previous_save), "rb");
if (fp != NULL) {
r.adr = (char *) malloct((int) r.size + 4);
r.adr = (char *) malloct((int) r.size + 1);
if (r.adr != NULL) {
if (r.size > 0
&& fread(r.adr, 1, (int) r.size, fp) != r.size) {
@@ -948,7 +948,8 @@ static htsblk cache_readex_new(httrackp * opt, cache_back * cache,
r.statuscode = STATUSCODE_INVALID;
sprintf(r.msg, "Read error in cache disk data: %s",
strerror(last_errno));
}
} else if (r.size >= 0)
*(r.adr + r.size) = '\0';
} else {
r.statuscode = STATUSCODE_INVALID;
strcpybuff(r.msg,
@@ -965,7 +966,7 @@ static htsblk cache_readex_new(httrackp * opt, cache_back * cache,
// Data in cache.
else {
// read the file (in one go)
r.adr = (char *) malloct((int) r.size + 4);
r.adr = (char *) malloct((int) r.size + 1);
if (r.adr != NULL) {
if (unzReadCurrentFile((unzFile) cache->zipInput, r.adr, (int) r.size) != r.size) { // error
freet(r.adr);
@@ -1245,13 +1246,14 @@ static htsblk cache_readex_old(httrackp * opt, cache_back * cache,
FILE *fp = FOPEN(fconv(catbuff, sizeof(catbuff), return_save), "rb");
if (fp != NULL) {
r.adr = (char *) malloct((size_t) r.size + 4);
r.adr = (char *) malloct((size_t) r.size + 1);
if (r.adr != NULL) {
if (r.size > 0
&& fread(r.adr, 1, (size_t) r.size, fp) != r.size) {
r.statuscode = STATUSCODE_INVALID;
strcpybuff(r.msg, "Read error in cache disk data");
}
} else if (r.size >= 0)
*(r.adr + r.size) = '\0';
} else {
r.statuscode = STATUSCODE_INVALID;
strcpybuff(r.msg,
@@ -1266,7 +1268,7 @@ static htsblk cache_readex_old(httrackp * opt, cache_back * cache,
}
} else {
// read the file (in one go)
r.adr = (char *) malloct((size_t) r.size + 4);
r.adr = (char *) malloct((size_t) r.size + 1);
if (r.adr != NULL) {
if (fread(r.adr, 1, (size_t) r.size, cache->olddat) != r.size) { // error
freet(r.adr);
@@ -1369,10 +1371,11 @@ int cache_readdata(cache_back * cache, const char *str1, const char *str2,
cache_rint(cache->olddat, &len);
if (len > 0) {
char *mem_buff = (char *) malloct(len + 4); /* Plus byte 0 */
char *mem_buff = (char *) malloct(len + 1); /* trailing \0 */
if (mem_buff) {
if (fread(mem_buff, 1, len, cache->olddat) == len) { // read everything (including statuscode etc)*/
mem_buff[len] = '\0';
*inbuff = mem_buff;
*inlen = len;
return 1;
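
The `+ 1` and explicit terminator above follow a common pattern for loading raw bytes that later reach `strlen()` or `strstr()`. A minimal sketch, assuming a plain `FILE *` source; `read_terminated` is a hypothetical name, and plain `malloc`/`free` stand in for HTTrack's `malloct`/`freet`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read exactly len bytes from fp into a fresh buffer of
   len + 1 bytes and NUL-terminate it at [len]. Returns NULL on allocation
   failure or short read; the caller frees the result. */
char *read_terminated(FILE *fp, size_t len) {
  char *buf;

  assert(fp != NULL);
  if (len == (size_t) -1) /* len + 1 would wrap to 0 */
    return NULL;
  buf = malloc(len + 1);
  if (buf == NULL)
    return NULL;
  if (len > 0 && fread(buf, 1, len, fp) != len) {
    free(buf);
    return NULL;
  }
  buf[len] = '\0'; /* in bounds: the buffer holds len + 1 bytes */
  return buf;
}
```

Writing the terminator on every successful path, including `len == 0`, is what lets later string functions stop inside the allocation.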

@@ -182,6 +182,16 @@ static int check_entry(httrackp *opt, cache_back *cache, const char *adr,
fail++;
}
/* The loaded body must be NUL-terminated at [size]: cache_readex's strlen()
consumers (htscore.c:1046, htscache.c) rely on it, and a missing
terminator is a heap over-read. The buffer is malloc(size + slack), so
reading [size] is in bounds. */
if (r.adr != NULL && r.adr[r.size] != '\0') {
fprintf(stderr, "cache-selftest: %s%s: body not NUL-terminated at [size]\n",
adr, fil);
fail++;
}
#undef CHECK_STR
if (r.adr != NULL) {
@@ -208,6 +218,107 @@ static void gen_body(char *buf, size_t len, int kind) {
}
}
/* Exercise the disk-fallback read path: a record stored with X-In-Cache: 0
keeps its body on disk (not in the ZIP), and cache_readex must load it from
there. The one-shot crawl tests never re-read such a body into memory, so
this path otherwise has no runtime coverage. We store the header with
all_in_cache=0 and a non-hypertext content-type (-> X-In-Cache: 0), create
the body at the exact fconv()-resolved path the reader uses, then read it
back and assert it round-trips and is NUL-terminated. */
static int disk_fallback_selftest(httrackp *opt) {
int fail = 0;
cache_back cache;
htsblk r;
char catbuff[HTS_URLMAXSIZE * 2];
char *path;
char *locbuf;
FILE *fp;
const char *const adr = "example.com";
const char *const fil = "/blob.bin";
char save[HTS_URLMAXSIZE * 2];
/* no embedded NUL: were the read to leave this un-terminated, a later
strlen() would run off the end (the bug this guards) */
static const char body[] = "BINARY-on-disk-body-0123456789-no-trailing-nul";
const size_t body_len = sizeof(body) - 1;
/* X-Save must start with path_html_utf8 so the reader resolves it verbatim
(otherwise it re-roots it as a pre-3.40 relative path); then the body we
create at fconv(save) is exactly where cache_readex looks for it. */
fconcat(save, sizeof(save), StringBuff(opt->path_html_utf8),
"example.com/blob.bin");
/* write only the header (X-In-Cache: 0); the body stays on disk */
selftest_open_for_write(&cache, opt);
{
htsblk w;
char locw[4];
char *bodycopy = malloct(body_len);
hts_init_htsblk(&w);
w.statuscode = 200;
w.size = (LLint) body_len;
strcpybuff(w.msg, "OK");
strcpybuff(w.contenttype, "application/octet-stream");
locw[0] = '\0';
w.location = locw;
w.is_write = 0;
memcpy(bodycopy, body, body_len);
w.adr = bodycopy;
cache_add(opt, &cache, &w, adr, fil, save, 0 /* all_in_cache */, NULL);
freet(bodycopy);
}
selftest_close(&cache);
/* create the on-disk body where the reader will look for it */
path = fconv(catbuff, sizeof(catbuff), save);
(void) structcheck(path);
fp = FOPEN(path, "wb");
if (fp == NULL) {
fprintf(stderr, "cache-selftest: disk-fallback: cannot create '%s'\n",
path);
return 1;
}
if (fwrite(body, 1, body_len, fp) != body_len) {
fprintf(stderr, "cache-selftest: disk-fallback: short write to '%s'\n",
path);
fail++;
}
fclose(fp);
/* read it back: takes the X-In-Cache: 0 disk-fallback branch */
selftest_open_for_read(&cache, opt);
locbuf = malloct(HTS_URLMAXSIZE * 2);
locbuf[0] = '\0';
r = cache_readex(opt, &cache, adr, fil, "", locbuf, NULL, 1);
if (r.statuscode != 200) {
fprintf(stderr,
"cache-selftest: disk-fallback: statuscode %d, expected 200"
" (path not taken or read failed)\n",
r.statuscode);
fail++;
}
if (r.size != (LLint) body_len) {
fprintf(stderr,
"cache-selftest: disk-fallback: size " LLintP ", expected %d\n",
(LLint) r.size, (int) body_len);
fail++;
} else if (r.adr == NULL || memcmp(r.adr, body, body_len) != 0) {
fprintf(stderr, "cache-selftest: disk-fallback: body mismatch\n");
fail++;
}
/* the loaded body must be NUL-terminated at [size] */
if (r.adr != NULL && r.adr[r.size] != '\0') {
fprintf(stderr, "cache-selftest: disk-fallback: body not NUL-terminated\n");
fail++;
}
if (r.adr != NULL) {
freet(r.adr);
}
freet(locbuf);
selftest_close(&cache);
return fail;
}
int cache_selftests(httrackp *opt, const char *dir) {
int failures = 0;
cache_back cache;
@@ -257,6 +368,10 @@ int cache_selftests(httrackp *opt, const char *dir) {
strcatbuff(base, "/");
}
StringCopy(opt->path_log, base);
/* the disk-fallback pass resolves on-disk body paths through fconv(), which
is rooted at path_html; keep it inside the test directory too */
StringCopy(opt->path_html, base);
StringCopy(opt->path_html_utf8, base);
}
opt->cache = 1;
@@ -366,6 +481,9 @@ int cache_selftests(httrackp *opt, const char *dir) {
"", body_updated, strlen(body_updated));
selftest_close(&cache);
/* pass 5: the disk-fallback read path (X-In-Cache: 0, body on disk) */
failures += disk_fallback_selftest(opt);
for (i = 0; i < large_count; i++) {
freet(large_body[i]);
}

@@ -633,13 +633,12 @@ int httpmirror(char *url1, httrackp * opt) {
// this is cleaner and more logical than pushing the links onto the stack by hand
// we thus benefit from the robot's checks and tests for the "primary" links
primary = (char *) malloct(primary_len);
if (primary) {
primary[0] = '\0';
} else {
if (!primary) {
printf("PANIC! : Not enough memory [%d]\n", __LINE__);
XH_extuninit;
return 0;
}
htsbuff primarybuff = htsbuff_ptr(primary, primary_len);
while(*a) {
int i;
@@ -687,11 +686,11 @@ int httpmirror(char *url1, httrackp * opt) {
strcatbuff(tempo, "*"); // append a *
}
}
if (type)
strcpybuff(filters[filptr], "+");
else
strcpybuff(filters[filptr], "-");
strcatbuff(filters[filptr], tempo);
{
htsbuff fb = htsbuff_ptr(filters[filptr], HTS_URLMAXSIZE * 2);
htsbuff_cpy(&fb, type ? "+" : "-");
htsbuff_cat(&fb, tempo);
}
filptr++;
/* sanity check */
@@ -726,12 +725,10 @@ int httpmirror(char *url1, httrackp * opt) {
}
url[i++] = '\0';
//strcatbuff(primary,"<PRIMARY=\"");
if (strstr(url, ":/") == NULL)
strcatbuff(primary, "http://");
strcatbuff(primary, url);
//strcatbuff(primary,"\">");
strcatbuff(primary, "\n");
htsbuff_cat(&primarybuff, "http://");
htsbuff_cat(&primarybuff, url);
htsbuff_cat(&primarybuff, "\n");
}
} // while
@@ -762,7 +759,6 @@ int httpmirror(char *url1, httrackp * opt) {
int filelist_ptr = 0;
int n = 0;
char BIGSTK line[HTS_URLMAXSIZE * 2];
char *primary_ptr = primary + strlen(primary);
while(filelist_ptr < filelist_sz) {
int count =
@@ -771,13 +767,10 @@ int httpmirror(char *url1, httrackp * opt) {
if (count && line[0]) {
n++;
if (strstr(line, ":/") == NULL) {
strcpybuff(primary_ptr, "http://");
primary_ptr += strlen(primary_ptr);
htsbuff_cat(&primarybuff, "http://");
}
strcpybuff(primary_ptr, line);
primary_ptr += strlen(primary_ptr);
strcpybuff(primary_ptr, "\n");
primary_ptr += 1;
htsbuff_cat(&primarybuff, line);
htsbuff_cat(&primarybuff, "\n");
}
}
// fclose(fp);
@@ -1741,7 +1734,7 @@ int httpmirror(char *url1, httrackp * opt) {
{
char buff[256];
guess_httptype(opt, buff, urlfil());
guess_httptype_sized(opt, buff, sizeof(buff), urlfil());
if (strcmp(buff, "image/gif") == 0)
create_gif_warning = 1;
}
@@ -2193,16 +2186,19 @@ int httpmirror(char *url1, httrackp * opt) {
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.lst"), "rb");
if (new_lst != NULL && sz != (size_t) -1) {
char *adr = (char *) malloct(sz);
/* +1 for the NUL below: new.lst is read raw, and the strstr()
that follows needs a terminated C string. */
char *adr = (char *) malloct(sz + 1);
if (adr) {
if (fread(adr, 1, sz, new_lst) == sz) {
adr[sz] = '\0';
char line[1100];
int purge = 0;
while(!feof(old_lst)) {
linput(old_lst, line, 1000);
if (!strstr(adr, line)) { // fichier non trouvé dans le nouveau?
if (!strstr(adr, line)) { // not found in the new list?
char BIGSTK file[HTS_URLMAXSIZE * 2];
strcpybuff(file, StringBuff(opt->path_html));
@@ -2450,9 +2446,10 @@ void host_ban(httrackp * opt, int ptr,
// ban the host
assertf((*_FILTERS_PTR) < opt->maxfilter);
if (*_FILTERS_PTR < opt->maxfilter) {
strcpybuff(_FILTERS[*_FILTERS_PTR], "-");
strcatbuff(_FILTERS[*_FILTERS_PTR], host);
strcatbuff(_FILTERS[*_FILTERS_PTR], "/*"); // host/ * interdit
htsbuff fb = htsbuff_ptr(_FILTERS[*_FILTERS_PTR], HTS_URLMAXSIZE * 2);
htsbuff_cpy(&fb, "-");
htsbuff_cat(&fb, host);
htsbuff_cat(&fb, "/*"); // forbid host/*
(*_FILTERS_PTR)++;
}
// oops
@@ -3153,7 +3150,7 @@ static void postprocess_file(httrackp * opt, const char *save, const char *adr,
/* CID */
make_content_id(adr, fil, cid, sizeof(cid));
guess_httptype(opt, mimebuff, save);
guess_httptype_sized(opt, mimebuff, sizeof(mimebuff), save);
fprintf(opt->state.mimefp, "--%s\r\n",
StringBuff(opt->state.mimemid));
/*if (first)
@@ -3515,7 +3512,7 @@ char *next_token(char *p, int flag) {
p--;
do {
p++;
if (flag && (*p == '\\')) { // sauter \x ou \"
if (flag && (*p == '\\')) { // skip \x or \"
if (quote) {
char c = '\0';
@@ -3524,20 +3521,14 @@ char *next_token(char *p, int flag) {
else if (*(p + 1) == '"')
c = '"';
if (c) {
char BIGSTK tempo[8192];
tempo[0] = c;
tempo[1] = '\0';
strcatbuff(tempo, p + 2);
strcpybuff(p, tempo);
/* unescape the 2 chars to one, shifting left in place */
*p = c;
memmove(p + 1, p + 2, strlen(p + 2) + 1);
}
}
} else if (*p == 34) { // guillemets (de fin)
char BIGSTK tempo[8192];
tempo[0] = '\0';
strcatbuff(tempo, p + 1);
strcpybuff(p, tempo); /* wipe "" */
} else if (*p == 34) { // closing quote
/* drop the quote, shifting the rest left in place */
memmove(p, p + 1, strlen(p + 1) + 1);
p--;
/* */
quote = !quote;
@@ -3871,13 +3862,14 @@ int htsAddLink(htsmoduleStruct * str, char *link) {
opt->savename_83 = b;
if (r != -1 && !forbidden_url) {
if (savename()) {
if (lienrelatif(tempo, afs.save, savename()) == 0) {
if (lienrelatif(tempo, sizeof(tempo), afs.save, savename()) ==
0) {
hts_log_print(opt, LOG_DEBUG,
"(module): relative link at %s build with %s and %s: %s",
afs.af.adr, afs.save, savename(), tempo);
if (str->localLink
&& str->localLinkSize > (int) strlen(tempo) + 1) {
strcpybuff(str->localLink, tempo);
strlcpybuff(str->localLink, tempo, str->localLinkSize);
}
}
}
@@ -3889,11 +3881,11 @@ int htsAddLink(htsmoduleStruct * str, char *link) {
lien);
if (str->localLink
&& str->localLinkSize > (int) (strlen(afs.af.adr) + strlen(afs.af.fil) + 8)) {
str->localLink[0] = '\0';
htsbuff lb = htsbuff_ptr(str->localLink, str->localLinkSize);
if (!link_has_authority(afs.af.adr))
strcpybuff(str->localLink, "http://");
strcatbuff(str->localLink, afs.af.adr);
strcatbuff(str->localLink, afs.af.fil);
htsbuff_cat(&lb, "http://");
htsbuff_cat(&lb, afs.af.adr);
htsbuff_cat(&lb, afs.af.fil);
}
r = -1;
}
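
The `next_token()` change above replaces the 8 KB temporary buffer with an in-place shift. A standalone sketch of that shift follows; `drop_char` and `collapse_escape` are hypothetical names, not functions in the HTTrack sources. `memmove` is used because source and destination overlap, which `strcpy` and `memcpy` do not permit.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: remove the single character at p by shifting the
   rest of the string (terminator included) one position to the left. */
void drop_char(char *p) {
  assert(*p != '\0'); /* p + 1 must still be inside the string */
  memmove(p, p + 1, strlen(p + 1) + 1);
}

/* Hypothetical helper: replace the two-character escape starting at p with
   the single character c, shifting the tail left by one. */
void collapse_escape(char *p, char c) {
  assert(p[0] != '\0' && p[1] != '\0'); /* both escape chars must exist */
  *p = c;
  memmove(p + 1, p + 2, strlen(p + 2) + 1);
}
```

Both operations only ever shorten the string, so no capacity argument is needed and no input length can overflow a scratch buffer.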

@@ -69,23 +69,29 @@ Please visit our Website: http://www.httrack.com
/* Resolver */
extern int IPV6_resolver;
// Add a command in the argc/argv
#define cmdl_add(token,argc,argv,buff,ptr) \
argv[argc]=(buff+ptr); \
strcpybuff(argv[argc],token); \
ptr += (int) (strlen(argv[argc])+2); \
/* Remaining room in the argv block; 0 once it is exhausted (alias expansion or
doit.log insertion can outrun the +32768 slack), so the copy aborts cleanly
instead of the subtraction wrapping to a huge unbounded size. */
#define cmdl_room(bufsize, ptr) \
((ptr) < (size_t) (bufsize) ? (size_t) (bufsize) - (ptr) : 0)
// Add a command in the argc/argv (buff has total capacity bufsize)
#define cmdl_add(token, argc, argv, buff, bufsize, ptr) \
argv[argc] = (buff + ptr); \
strlcpybuff(argv[argc], token, cmdl_room(bufsize, ptr)); \
ptr += (int) (strlen(argv[argc]) + 2); \
argc++
// Insert a command in the argc/argv
#define cmdl_ins(token,argc,argv,buff,ptr) \
{ \
int i; \
for(i=argc;i>0;i--)\
argv[i]=argv[i-1];\
} \
argv[0]=(buff+ptr); \
strcpybuff(argv[0],token); \
ptr += (int) (strlen(argv[0])+2); \
// Insert a command in the argc/argv (buff has total capacity bufsize)
#define cmdl_ins(token, argc, argv, buff, bufsize, ptr) \
{ \
int i; \
for (i = argc; i > 0; i--) \
argv[i] = argv[i - 1]; \
} \
argv[0] = (buff + ptr); \
strlcpybuff(argv[0], token, cmdl_room(bufsize, ptr)); \
ptr += (int) (strlen(argv[0]) + 2); \
argc++
#define htsmain_free() do { \
@@ -236,6 +242,245 @@ static void basic_selftests(void) {
}
freet(slots);
}
// next_token(): in-place token scanner. Strips surrounding quotes, unescapes
// \" and \\ when flag is set, and returns the token terminator (the space, or
// NULL at end of string). The unquote/unescape rewrites the string in place
// by shifting left, so the result is always shorter -- regression for that
// compaction.
{
char tok[64];
// plain token: unchanged, returns a pointer AT the separating space (exact
// position, not just any space -- a strchr-style impl would land elsewhere
// once quotes shift the content)
strcpybuff(tok, "abc def");
{
char *const end = next_token(tok, 0);
assertf(end == tok + 3 && *end == ' ' && strcmp(tok, "abc def") == 0);
}
// surrounding quotes stripped, returns the (post-shift) trailing space
strcpybuff(tok, "\"ab\" cd");
{
char *const end = next_token(tok, 1);
assertf(end == tok + 2 && *end == ' ' && strcmp(tok, "ab cd") == 0);
}
// a space inside quotes does not end the token; end of string returns NULL
strcpybuff(tok, "\"a b\"c");
{
char *const end = next_token(tok, 1);
assertf(end == NULL && strcmp(tok, "a bc") == 0);
}
// \" and \\ are unescaped to literal " and \ in place
strcpybuff(tok, "\"a\\\"b\\\\c\"");
{
char *const end = next_token(tok, 1);
assertf(end == NULL && strcmp(tok, "a\"b\\c") == 0);
}
// unterminated quote: the opening quote is dropped, the rest survives, and
// the scan runs to the NUL (returns NULL)
strcpybuff(tok, "\"ab");
{
char *const end = next_token(tok, 1);
assertf(end == NULL && strcmp(tok, "ab") == 0);
}
// trailing lone backslash in a quote: *(p+1) is the NUL, not an escape, so
// the backslash is kept intact (and there is no over-read past the NUL)
strcpybuff(tok, "\"a\\");
{
char *const end = next_token(tok, 1);
assertf(end == NULL && strcmp(tok, "a\\") == 0);
}
}
// fil_normalized(): canonicalizes a URL path. Query arguments are sorted
// alphabetically (by the text after each '?'/'&') and the query is rebuilt
// through a bounded builder; outside the query, "//" collapses to "/".
// Regression for that builder.
{
char norm[256];
assertf(strcmp(fil_normalized("/p?b=2&a=1&c=3", norm), "/p?a=1&b=2&c=3") ==
0);
assertf(strcmp(fil_normalized("/a//b", norm), "/a/b") == 0);
// "//" is collapsed only before the query; inside the query it is kept
assertf(strcmp(fil_normalized("/a//b?x=c//d", norm), "/a/b?x=c//d") == 0);
}
// give_mimext(): mime type -> file extension, bounded into the caller buffer.
// Returns 1 when an extension was written, 0 otherwise.
{
char ext[16];
assertf(give_mimext(ext, sizeof(ext), "image/gif") == 1);
assertf(strcmp(ext, "gif") == 0);
assertf(give_mimext(ext, sizeof(ext), "text/html") == 1);
assertf(strcmp(ext, "html") == 0);
assertf(give_mimext(ext, sizeof(ext), "no/such-mime-type") == 0);
assertf(ext[0] == '\0');
}
// convtolower(): lower-cases into the caller buffer (bounded by its size).
{
char low[64];
assertf(strcmp(convtolower(low, sizeof(low), "ABC/Def.HTML"),
"abc/def.html") == 0);
}
// cut_path(): splits a path into directory (with trailing '/') and basename,
// each bounded by its buffer size.
{
char path[256];
char pname[256];
{
char full[] = "/dir/sub/file.html";
cut_path(full, path, sizeof(path), pname, sizeof(pname));
assertf(strcmp(path, "/dir/sub/") == 0);
assertf(strcmp(pname, "file.html") == 0);
}
{ // a trailing slash is trimmed before the split
char full[] = "/dir/sub/";
cut_path(full, path, sizeof(path), pname, sizeof(pname));
assertf(strcmp(path, "/dir/") == 0);
assertf(strcmp(pname, "sub") == 0);
}
{ // a path of length <= 1 yields empty results
char full[] = "/";
cut_path(full, path, sizeof(path), pname, sizeof(pname));
assertf(path[0] == '\0' && pname[0] == '\0');
}
}
// get_httptype_sized(): a long MIME type (Office OOXML reaches 73 chars) is
// written whole into a contenttype-sized buffer; returns 1 on a match, 0 when
// flag==0 and nothing matched. Regression for the old contenttype[64]
// overflow.
{
httrackp *opt = hts_create_opt();
htsblk r; // write into the real struct field, not a stand-in
assertf(opt != NULL);
// a long MIME (Office OOXML reaches 73 chars) must fit htsblk.contenttype
// whole: a [64] field would make this bounded copy abort.
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"deck.pptx", 0) == 1);
assertf(strcmp(r.contenttype,
"application/vnd.openxmlformats-officedocument."
"presentationml.presentation") == 0);
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"x.gif", 0) == 1);
assertf(strcmp(r.contenttype, "image/gif") == 0);
// no extension and flag==0: nothing written, returns 0
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"noextfile", 0) == 0);
assertf(r.contenttype[0] == '\0');
// no extension and flag==1: octet-stream fallback, returns 1
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"noextfile", 1) == 1);
assertf(strcmp(r.contenttype, "application/octet-stream") == 0);
// a user --assume rule with an empty value matches but writes nothing:
// get_userhttptype returns 1 with the buffer empty, so get_httptype_sized
// must still report 0 (callers test the return like the old
// strnotempty(s)).
StringCopy(opt->mimedefs, "\ncgi=\n");
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"/x.cgi", 0) == 0);
assertf(r.contenttype[0] == '\0');
StringCopy(opt->mimedefs, "\ncgi=text/html\n");
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"/x.cgi", 0) == 1);
assertf(strcmp(r.contenttype, "text/html") == 0);
hts_free_opt(opt);
}
// adr_normalized_sized(): bounded host normalization (passthrough when
// already normal).
{
char n[HTS_URLMAXSIZE];
assertf(strcmp(adr_normalized_sized("example.com", n, sizeof(n)),
"example.com") == 0);
}
// standard_name(): builds "<name><md5?>.<ext>" into a bounded buffer. The md5
// is appended (4 chars) only when the URL has a query string (see url_md5),
// so test both; pin the structure (name + ext, lengths), not the md5 chars.
{
char b[HTS_URLMAXSIZE * 2];
const char *nom = "index.html"; // name part
const char *dot = nom + 5; // points at ".html"
size_t len;
// no query -> no md5: "index" + ".html"
standard_name(b, sizeof(b), dot, nom, "http://example.com/index.html", 0);
assertf(strcmp(b, "index.html") == 0);
// query -> 4 md5 chars between name and ext: "index" + md5(4) + ".html"
standard_name(b, sizeof(b), dot, nom, "http://example.com/index.html?v=1",
0);
len = strlen(b);
assertf(len == 5 + 4 + 5);
assertf(strncmp(b, "index", 5) == 0);
assertf(strcmp(b + len - 5, ".html") == 0);
// short names: name kept (<=8), the extension is clamped to 3 -> ".htm"
standard_name(b, sizeof(b), dot, nom, "http://example.com/index.html?v=1",
1);
len = strlen(b);
assertf(len == 5 + 4 + 4);
assertf(strcmp(b + len - 4, ".htm") == 0);
// short names with a >8-char name: the name is clamped to 8 ("indexpag")
{
const char *lnom = "indexpage.html";
const char *ldot = lnom + 9; // points at ".html"
standard_name(b, sizeof(b), ldot, lnom,
"http://example.com/indexpage.html?v=1", 1);
len = strlen(b);
assertf(len == 8 + 4 + 4);
assertf(strncmp(b, "indexpag", 8) == 0);
assertf(strcmp(b + len - 4, ".htm") == 0);
}
}
// longfile_to_83(): single-name 8-3 (mode 1) / ISO9660 (mode 2) conversion;
// uppercases, clamps the name (8 / 31) and the extension (3). It rewrites
// 'save' in place, so pass a mutable array.
{
char n83[256];
{
char save[] = "longfilename.html";
longfile_to_83(1, n83, sizeof(n83), save); // 8-3: name->8, ext->3
assertf(strcmp(n83, "LONGFILE.HTM") == 0);
}
{
char save[] = "longfilename.html";
longfile_to_83(2, n83, sizeof(n83), save); // ISO9660: name->31, ext->3
assertf(strcmp(n83, "LONGFILENAME.HTM") == 0);
}
{ // sanitization: a leading '.' and interior dots become '_' (only the
// last dot stays as the separator); spaces/specials -> '_'
char save[] = ".a b.c.d e";
longfile_to_83(1, n83, sizeof(n83), save);
assertf(strcmp(n83, "_A_B_C.D_E") == 0);
}
}
// long_to_83(): per-segment 8-3 conversion of a whole path.
{
char n83[HTS_URLMAXSIZE * 2];
char save[] = "dir/longfilename.html";
long_to_83(1, n83, sizeof(n83), save);
assertf(strcmp(n83, "DIR/LONGFILE.HTM") == 0);
}
// lienrelatif(): relative path from the directory of curr_fil to link.
{
char s[HTS_URLMAXSIZE * 2];
// same directory -> just the basename
assertf(lienrelatif(s, sizeof(s), "dir/page.html", "dir/index.html") == 0);
assertf(strcmp(s, "page.html") == 0);
// link one level up -> a "../" prefix
assertf(lienrelatif(s, sizeof(s), "a.html", "dir/index.html") == 0);
assertf(strcmp(s, "../a.html") == 0);
}
}
/* Self-tests for the htssafe.h bounded string ops (driven by httrack -#8).
@@ -353,6 +598,7 @@ HTSEXT_API int hts_main2(int argc, char **argv, httrackp * opt) {
static int hts_main_internal(int argc, char **argv, httrackp * opt) {
char **x_argv = NULL; // Patch pour argv et argc: en cas de récupération de ligne de commande
char *x_argvblk = NULL; // (reprise ou update)
size_t x_argvblk_size = 0; // total capacity of x_argvblk
int x_ptr = 0; // offset
//
@@ -430,7 +676,8 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
*a = ' ';
/* equivalent to "empty parameter" */
if ((strcmp(argv[na], HTS_NOPARAM) == 0) || (strcmp(argv[na], HTS_NOPARAM2) == 0)) // (none)
strcpybuff(argv[na], "\"\"");
/* replacing "(none)"/"\"(none)\"" with "\"\"" always fits in place */
strlcpybuff(argv[na], "\"\"", strlen(argv[na]) + 1);
if (strncmp(argv[na], "-&", 2) == 0)
argv[na][1] = '%';
}
@@ -452,6 +699,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
htsmain_free();
return -1;
}
x_argvblk_size = (size_t) (current_size + 32768);
x_argvblk[0] = '\0';
x_ptr = 0;
@@ -473,7 +721,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
//
argv_url = 0; /* pour comptage */
//
cmdl_add(argv[0], x_argc, x_argv, x_argvblk, x_ptr);
cmdl_add(argv[0], x_argc, x_argv, x_argvblk, x_argvblk_size, x_ptr);
na = 1; /* commencer après nom_prg */
while(na < argc) {
int result = 1;
@@ -494,9 +742,10 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
}
/* Copier */
cmdl_add(tmp_argv[0], x_argc, x_argv, x_argvblk, x_ptr);
cmdl_add(tmp_argv[0], x_argc, x_argv, x_argvblk, x_argvblk_size, x_ptr);
if (tmp_argc > 1) {
cmdl_add(tmp_argv[1], x_argc, x_argv, x_argvblk, x_ptr);
cmdl_add(tmp_argv[1], x_argc, x_argv, x_argvblk, x_argvblk_size,
x_ptr);
}
/* Compter URLs et détecter -i,-q.. */
@@ -568,7 +817,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
char BIGSTK tempo[HTS_CDLMAXSIZE];
strcpybuff(tempo, argv[na] + 1);
if (tempo[strlen(tempo) - 1] != '"') {
if (tempo[0] == '\0' || tempo[strlen(tempo) - 1] != '"') {
char BIGSTK s[HTS_CDLMAXSIZE];
sprintf(s, "Missing quote in %s", argv[na]);
@@ -577,7 +826,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return -1;
}
tempo[strlen(tempo) - 1] = '\0';
strcpybuff(argv[na], tempo);
/* tempo is argv[na] minus its surrounding quotes, so it fits in place
*/
strlcpybuff(argv[na], tempo, strlen(argv[na]) + 1);
}
if (cmdl_opt(argv[na])) { // option
@@ -678,18 +929,19 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/doit.log"))) || (argv_url > 0)) {
if (!optinclude_file
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log), HTS_HTTRACKRC),
&argc, argv, x_argvblk, &x_ptr))
if (!optinclude_file(HTS_HTTRACKRC, &argc, argv, x_argvblk, &x_ptr)) {
if (!optinclude_file
(fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
hts_gethome(), "/" HTS_HTTRACKRC),
&argc, argv, x_argvblk, &x_ptr)) {
if (!optinclude_file(
fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log), HTS_HTTRACKRC),
&argc, argv, x_argvblk, x_argvblk_size, &x_ptr))
if (!optinclude_file(HTS_HTTRACKRC, &argc, argv, x_argvblk,
x_argvblk_size, &x_ptr)) {
if (!optinclude_file(
fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
hts_gethome(), "/" HTS_HTTRACKRC),
&argc, argv, x_argvblk, x_argvblk_size, &x_ptr)) {
#ifdef HTS_HTTRACKCNF
optinclude_file(HTS_HTTRACKCNF, &argc, argv, x_argvblk, &x_ptr);
optinclude_file(HTS_HTTRACKCNF, &argc, argv, x_argvblk,
x_argvblk_size, &x_ptr);
#endif
}
}
@@ -742,7 +994,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
if (strnotempty(lastp)) {
insert_after_argc = argc - insert_after;
cmdl_ins(lastp, insert_after_argc, (argv + insert_after), x_argvblk,
x_ptr);
x_argvblk_size, x_ptr);
argc = insert_after_argc + insert_after;
insert_after++;
}
@@ -862,7 +1114,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
if (argv[i][0] == '-') {
if (argv[i][1] == '-') { // --xxx
if ((strfield2(argv[i] + 2, "clean")) || (strfield2(argv[i] + 2, "tide"))) { // nettoyer
strcpybuff(argv[i] + 1, "");
argv[i][1] = '\0';
if (fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log), "hts-log.txt")))
@@ -971,7 +1223,8 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
//
} else if (strfield2(argv[i] + 2, "catchurl")) { // capture d'URL via proxy temporaire!
argv_url = 1; // forcer a passer les parametres
strcpybuff(argv[i] + 1, "#P");
/* argv[i] is "--catchurl"; "#P" fits after its first char */
strlcpybuff(argv[i] + 1, "#P", strlen(argv[i] + 1) + 1);
//
} else if (strfield2(argv[i] + 2, "updatehttrack")) {
#ifdef _WIN32
@@ -1299,7 +1552,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
char BIGSTK tempo[HTS_CDLMAXSIZE + 256];
strcpybuff(tempo, argv[na] + 1);
if (tempo[strlen(tempo) - 1] != '"') {
if (tempo[0] == '\0' || tempo[strlen(tempo) - 1] != '"') {
char s[HTS_CDLMAXSIZE + 256];
sprintf(s, "Missing quote in %s", argv[na]);
@@ -1308,7 +1561,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return -1;
}
tempo[strlen(tempo) - 1] = '\0';
strcpybuff(argv[na], tempo);
/* tempo is argv[na] minus its surrounding quotes, so it fits in place
*/
strlcpybuff(argv[na], tempo, strlen(argv[na]) + 1);
}
if (cmdl_opt(argv[na])) { // option
@@ -2549,15 +2804,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
// initialiser mimedefs
//get_userhttptype(opt,1,opt->mimedefs,NULL);
// check
mime[0] = '\0';
get_httptype(opt, mime, argv[na + 1], 0);
if (mime[0] != '\0') {
if (get_httptype_sized(opt, mime, sizeof(mime), argv[na + 1],
0)) {
char ext[256];
printf("%s is '%s'\n", argv[na + 1], mime);
ext[0] = '\0';
give_mimext(ext, mime);
if (ext[0]) {
if (give_mimext(ext, sizeof(ext), mime)) {
printf("and its local type is '.%s'\n", ext);
}
} else {
@@ -2970,7 +3222,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
if (urlSize < HTS_URLMAXSIZE) {
ensureUrlCapacity(url, url_sz, capa);
if (strnotempty(url))
strcatbuff(url, " "); // espace de séparation
strlcatbuff(url, " ", url_sz); // separator space
append_escape_spc_url(unescape_http_unharm(catbuff, sizeof(catbuff), argv[na], 1), url, url_sz);
}
} // if argv=- etc.
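
The quote-stripping hunks above add a `tempo[0] == '\0'` guard before indexing `tempo[strlen(tempo) - 1]`, and bound the copy back into `argv[na]` by `strlen(argv[na]) + 1`. A minimal standalone sketch of the same idea follows. It assumes a hypothetical helper `demo_unquote` (not part of HTTrack) that strips the quotes directly in place instead of going through a temporary buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Strips one pair of surrounding double quotes from arg, in place.
   Returns 1 on success (or if arg is not quoted), 0 if the closing
   quote is missing. The result is always shorter than the input, so it
   fits in the original storage. */
int demo_unquote(char *arg) {
  size_t len;

  assert(arg != NULL);
  if (arg[0] != '"')
    return 1; /* not quoted: nothing to do */
  len = strlen(arg + 1); /* length of everything after the opening quote */
  /* Empty-tail guard first: without it, indexing the "last" character of
     an empty string would read before the start of the tail. */
  if (len == 0 || arg[len] != '"')
    return 0; /* missing closing quote */
  memmove(arg, arg + 1, len - 1); /* shift the payload left by one */
  arg[len - 1] = '\0';
  return 1;
}
```

A lone `"` argument is the case the new guard covers: the tail after the opening quote is empty, so there is no last character to test.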


@@ -145,8 +145,13 @@ int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t ma
if (!hex) {
if (src[i] >= '0' && src[i] <= '9') {
const int h = src[i] - '0';
uc *= 10;
uc += h;
/* Guard before multiplying: a codepoint past the Unicode max
(0x10FFFF) is invalid anyway, so stop rather than overflow uc. */
if (uc > (0x10FFFF - h) / 10) {
ampStart = (size_t) -1;
} else {
uc = uc * 10 + h;
}
} else {
/* abandon */
ampStart = (size_t) -1;
@@ -156,8 +161,11 @@ int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t ma
else {
const int h = get_hex_value(src[i]);
if (h != -1) {
uc *= 16;
uc += h;
if (uc > (0x10FFFF - h) / 16) {
ampStart = (size_t) -1;
} else {
uc = uc * 16 + h;
}
} else {
/* abandon */
ampStart = (size_t) -1;
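
The guard in this hunk rejects a digit before multiplying, relying on the integer identity `uc * base + h <= MAX` if and only if `uc <= (MAX - h) / base`. The sketch below isolates that check in a hypothetical function `demo_parse_codepoint` (the name and signature are assumptions for illustration, not HTTrack API).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CODEPOINT 0x10FFFFu /* Unicode maximum */

/* Hypothetical helper, for illustration only.
   Parses the digits of a numeric entity body in the given base (10 or 16).
   Returns 1 and stores the value in *out on success; returns 0 on an
   invalid digit, an empty string, or a value above 0x10FFFF. */
int demo_parse_codepoint(const char *s, unsigned base, unsigned *out) {
  unsigned uc = 0;
  size_t i;

  assert(base == 10 || base == 16);
  if (s[0] == '\0')
    return 0;
  for (i = 0; s[i] != '\0'; i++) {
    const char c = s[i];
    unsigned h;

    if (c >= '0' && c <= '9')
      h = (unsigned) (c - '0');
    else if (base == 16 && c >= 'a' && c <= 'f')
      h = (unsigned) (c - 'a') + 10;
    else if (base == 16 && c >= 'A' && c <= 'F')
      h = (unsigned) (c - 'A') + 10;
    else
      return 0; /* not a digit in this base */
    /* Check before multiplying, so uc can never wrap around. */
    if (uc > (DEMO_MAX_CODEPOINT - h) / base)
      return 0;
    uc = uc * base + h;
  }
  *out = uc;
  return 1;
}
```

Because the comparison happens on the pre-multiplication value, an arbitrarily long run of digits is rejected at the first digit that would exceed the limit.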


@@ -197,10 +197,13 @@ Please visit our Website: http://www.httrack.com
#endif
/* Taille max d'une URL */
/* Max URL length */
#define HTS_URLMAXSIZE 1024
/* Taille max ligne de commande (>=HTS_URLMAXSIZE*2) */
/* Max command-line length (>=HTS_URLMAXSIZE*2) */
#define HTS_CDLMAXSIZE 1024
/* MIME-type buffer contract (htsblk.contenttype/charset/contentencoding); holds
the longest registered MIME type, the Office OOXML ones reaching 73 chars */
#define HTS_MIMETYPE_SIZE 128
/* Copyright (C) 1998 Xavier Roche and other contributors */
#define HTTRACK_AFF_AUTHORS "[XR&CO'2014]"
@@ -250,6 +253,22 @@ Please visit our Website: http://www.httrack.com
#endif
#endif
/**
* Mark a function deprecated, with a message pointing at the replacement.
* Placed before the declaration so both the GCC/Clang attribute and the MSVC
* __declspec sit in a position both accept. Degrades to nothing elsewhere.
*/
#if defined(__GNUC__) && \
(__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5))
#define HTS_DEPRECATED(msg) __attribute__((deprecated(msg)))
#elif defined(__GNUC__)
#define HTS_DEPRECATED(msg) __attribute__((deprecated))
#elif defined(_MSC_VER) && (_MSC_VER >= 1400)
#define HTS_DEPRECATED(msg) __declspec(deprecated(msg))
#else
#define HTS_DEPRECATED(msg)
#endif
#ifndef HTS_LONGLONG
#ifdef HTS_NO_64_BIT
#define HTS_LONGLONG 0
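
The `HTS_DEPRECATED` macro above pairs with the `*_sized` replacements introduced elsewhere in this change. Here is a reduced sketch of that pairing, assuming a recent GCC/Clang or MSVC. `DEMO_DEPRECATED`, `demo_copy_sized`, `demo_copy` and `DEMO_LEGACY_SIZE` are hypothetical names, and the copy truncates for simplicity, which is not necessarily what HTTrack's `strlcpybuff` does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified variant of the macro: no fallback for very old compilers. */
#if defined(__GNUC__)
#define DEMO_DEPRECATED(msg) __attribute__((deprecated(msg)))
#elif defined(_MSC_VER)
#define DEMO_DEPRECATED(msg) __declspec(deprecated(msg))
#else
#define DEMO_DEPRECATED(msg)
#endif

#define DEMO_LEGACY_SIZE 128 /* implicit capacity old callers relied on */

/* New entry point: the caller states the destination capacity. */
char *demo_copy_sized(char *dest, size_t destsize, const char *src) {
  size_t n;

  assert(dest != NULL && destsize > 0);
  n = strlen(src);
  if (n >= destsize)
    n = destsize - 1; /* truncate (assumption of this sketch) */
  memcpy(dest, src, n);
  dest[n] = '\0';
  return dest;
}

/* Old entry point: the attribute goes before the declaration, so callers
   get a compile-time warning that names the replacement. */
DEMO_DEPRECATED("use demo_copy_sized")
char *demo_copy(char *dest, const char *src);

char *demo_copy(char *dest, const char *src) {
  return demo_copy_sized(dest, DEMO_LEGACY_SIZE, src);
}
```

Defining the deprecated function does not itself warn. Only call sites do, which is what lets the old symbol stay exported while callers migrate.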


@@ -76,7 +76,7 @@ static coucal_key key_duphandler(void *arg, coucal_key_const name) {
/* Key sav hashes are using case-insensitive version */
static coucal_hashkeys key_sav_hashes(void *arg, coucal_key_const key) {
hash_struct *const hash = (hash_struct*) arg;
convtolower(hash->catbuff, (const char*) key);
convtolower(hash->catbuff, sizeof(hash->catbuff), (const char *) key);
return coucal_hash_string(hash->catbuff);
}


@@ -334,7 +334,7 @@ void index_finish(const char *indexpath, int mode) {
if (fp_tmpproject) {
tab = (char **) malloct(sizeof(char *) * (hts_primindex_size + 2));
if (tab) {
blk = malloct(size + 4);
blk = malloct(size + 1);
if (blk) {
fseek(fp_tmpproject, 0, SEEK_SET);
if ((INTsys) fread(blk, 1, size, fp_tmpproject) == size) {
@@ -343,6 +343,7 @@ void index_finish(const char *indexpath, int mode) {
int i;
FILE *fp;
blk[size] = '\0';
while((b = strchr(a, '\n')) && (index < hts_primindex_size)) {
tab[index++] = a;
*b = '\0';
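
The `index_finish` hunk allocates `size + 1` bytes and writes `blk[size] = '\0'` before scanning with `strchr`, which needs a terminated string. A memory-only sketch of that pattern is shown here. `demo_dup_terminated` and `demo_split_lines` are hypothetical helpers standing in for the `fread` and line-splitting steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies size raw bytes into a fresh block with room
   for one extra byte, and terminates it. Returns NULL on allocation
   failure. The caller frees the result. */
char *demo_dup_terminated(const void *data, size_t size) {
  char *blk = malloc(size + 1);

  if (blk == NULL)
    return NULL;
  if (size > 0)
    memcpy(blk, data, size);
  blk[size] = '\0'; /* without this, strchr could run past the block */
  return blk;
}

/* Hypothetical helper: cuts blk at each '\n' and records line starts in
   tab, up to max entries. Returns the number of lines recorded. */
size_t demo_split_lines(char *blk, char **tab, size_t max) {
  size_t n = 0;
  char *a = blk;
  char *b;

  assert(blk != NULL && tab != NULL);
  while (n < max && (b = strchr(a, '\n')) != NULL) {
    tab[n++] = a;
    *b = '\0';
    a = b + 1;
  }
  return n;
}
```

Data that does not end in a newline is the risky input: the last `strchr` call finds no `'\n'` and stops only at the terminator.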


@@ -472,9 +472,8 @@ static int tris(httrackp * opt, char *buffer) {
{
char type[256];
type[0] = '\0';
get_httptype(opt, type, buffer, 0);
if (strnotempty(type)) // type reconnu!
if (get_httptype_sized(opt, type, sizeof(type), buffer,
0)) // recognized type
return 1;
// ajout RX 05/2001
else if (is_dyntype(get_ext(catbuff, sizeof(catbuff), buffer))) // asp,cgi...


@@ -754,7 +754,8 @@ T_SOC http_xfopen(httrackp * opt, int mode, int treat, int waitconnect,
if (soc != INVALID_SOCKET) {
retour->statuscode = HTTP_OK; // OK
strcpybuff(retour->msg, "OK");
guess_httptype(opt, retour->contenttype, fil);
guess_httptype_sized(opt, retour->contenttype,
sizeof(retour->contenttype), fil);
} else if (strnotempty(retour->msg) == 0)
strcpybuff(retour->msg, "Unable to open local file");
return soc; // renvoyer
@@ -1530,8 +1531,9 @@ void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * ret
if (retour->location) {
while(is_realspace(*(rcvd + p)))
p++; // sauter espaces
if ((int) strlen(rcvd + p) < HTS_URLMAXSIZE) // pas trop long?
strcpybuff(retour->location, rcvd + p);
if ((int) strlen(rcvd + p) < HTS_URLMAXSIZE) // not too long?
/* location aliases location_buffer[HTS_URLMAXSIZE * 2] */
strlcpybuff(retour->location, rcvd + p, HTS_URLMAXSIZE * 2);
else // erreur.. ignorer
retour->location[0] = '\0';
}
@@ -3444,16 +3446,17 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
/* Replace query by sorted query */
copyBuff = malloct(qLen + 1);
assertf(copyBuff != NULL);
copyBuff[0] = '\0';
for(i = 0; i < ampargs; i++) {
if (i == 0)
strcatbuff(copyBuff, "?");
else
strcatbuff(copyBuff, "&");
strcatbuff(copyBuff, amps[i] + 1);
{
htsbuff cb = htsbuff_ptr(copyBuff, qLen + 1);
for (i = 0; i < ampargs; i++) {
htsbuff_cat(&cb, i == 0 ? "?" : "&");
htsbuff_cat(&cb, amps[i] + 1);
}
assertf(cb.len == qLen);
}
assertf(strlen(copyBuff) == qLen);
strcpybuff(query, copyBuff);
/* query points into dest where the original qLen-byte query was */
strlcpybuff(query, copyBuff, qLen + 1);
/* Cleanup */
freet(amps);
@@ -3464,12 +3467,19 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
}
#define endwith(a) ( (len >= (sizeof(a)-1)) ? ( strncmp(dest, a+len-(sizeof(a)-1), sizeof(a)-1) == 0 ) : 0 );
HTSEXT_API char *adr_normalized(const char *source, char *dest) {
HTSEXT_API char *adr_normalized_sized(const char *source, char *dest,
size_t destsize) {
/* not yet too aggressive (no com<->net<->org checkings) */
strcpybuff(dest, jump_normalized_const(source));
strlcpybuff(dest, jump_normalized_const(source), destsize);
return dest;
}
// deprecated variant; kept for ABI compatibility. Bounds to the implicit
// contract the old callers relied on (an HTS_URLMAXSIZE*2 URL buffer).
HTSEXT_API char *adr_normalized(const char *source, char *dest) {
return adr_normalized_sized(source, dest, HTS_URLMAXSIZE * 2);
}
#undef endwith
// find port (:80) or NULL if not found
@@ -3894,9 +3904,9 @@ HTSEXT_API size_t escape_for_html_print_full(const char *const s, char *const de
#undef ADD_CHAR
// conversion minuscules, avec buffer
char *convtolower(char *catbuff, const char *a) {
strcpybuff(catbuff, a);
// lower-case conversion into caller buffer (capacity catbuffsize)
char *convtolower(char *catbuff, size_t catbuffsize, const char *a) {
strlcpybuff(catbuff, a, catbuffsize);
hts_lowcase(catbuff); // lower case
return catbuff;
}
@@ -3919,22 +3929,34 @@ void hts_replace(char *s, char from, char to) {
}
}
// deviner type d'un fichier local..
// ex: fil="toto.gif" -> s="image/gif"
void guess_httptype(httrackp * opt, char *s, const char *fil) {
get_httptype(opt, s, fil, 1);
// guess a local file's mime type (e.g. fil="toto.gif" -> s="image/gif")
// returns 1 if a type was written to s, 0 otherwise
int guess_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil) {
return get_httptype_sized(opt, s, ssize, fil, 1);
}
// idem
// flag: 1 si toujours renvoyer un type
HTSEXT_API void get_httptype(httrackp * opt, char *s, const char *fil, int flag) {
// userdef overrides get_httptype
// deprecated variant; kept for ABI compatibility. Bounds to the implicit
// contract the old callers relied on (a contenttype-sized buffer).
void guess_httptype(httrackp * opt, char *s, const char *fil) {
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, 1);
}
// write the mime type for fil into s (capacity ssize)
// flag: 1 to always return a type (the "application/..." / octet-stream
// fallback)
// returns 1 if a type was written to s, 0 otherwise
HTSEXT_API int get_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil, int flag) {
// userdef overrides get_httptype (a rule with an empty value, e.g. "--assume
// cgi=", matches but writes nothing: report it as "no type" like the old
// code, whose callers tested strnotempty(s))
if (get_userhttptype(opt, s, fil)) {
return;
return s[0] != '\0';
}
// regular tests
if (ishtml(opt, fil) == 1) {
strcpybuff(s, "text/html");
strlcpybuff(s, "text/html", ssize);
return 1;
} else {
/* Check html -> text/html */
const char *a = fil + strlen(fil) - 1;
@@ -3947,21 +3969,33 @@ HTSEXT_API void get_httptype(httrackp * opt, char *s, const char *fil, int flag)
a++;
while(strnotempty(hts_mime[j][1])) {
if (strfield2(hts_mime[j][1], a)) {
if (hts_mime[j][0][0] != '*') { // Une correspondance existe
strcpybuff(s, hts_mime[j][0]);
return;
if (hts_mime[j][0][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][0], ssize);
return 1;
}
}
j++;
}
if (flag)
sprintf(s, "application/%s", a);
if (flag) {
snprintf(s, ssize, "application/%s", a);
return 1;
}
} else {
if (flag)
strcpybuff(s, "application/octet-stream");
if (flag) {
strlcpybuff(s, "application/octet-stream", ssize);
return 1;
}
}
}
return 0;
}
// deprecated variant; kept for ABI compatibility. Bounds to the implicit
// contract the old callers relied on (a contenttype-sized buffer).
HTSEXT_API void get_httptype(httrackp *opt, char *s, const char *fil,
int flag) {
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, flag);
}
// get type of fil (php)
@@ -4071,17 +4105,17 @@ int get_userhttptype(httrackp * opt, char *s, const char *fil) {
return 0;
}
// renvoyer extesion d'un type mime..
// ex: "image/gif" -> gif
void give_mimext(char *s, const char *st) {
// give the file extension for a mime type (e.g. "image/gif" -> "gif")
// returns 1 if an extension was found (and written to s), 0 otherwise
int give_mimext(char *s, size_t ssize, const char *st) {
int ok = 0;
int j = 0;
s[0] = '\0';
while((!ok) && (strnotempty(hts_mime[j][1]))) {
if (strfield2(hts_mime[j][0], st)) {
if (hts_mime[j][1][0] != '*') { // Une correspondance existe
strcpybuff(s, hts_mime[j][1]);
if (hts_mime[j][1][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][1], ssize);
ok = 1;
}
}
@@ -4102,12 +4136,13 @@ void give_mimext(char *s, const char *st) {
if (a) {
if ((int) strlen(a) >= 1) {
if ((int) strlen(a) <= 4) {
strcpybuff(s, a);
strlcpybuff(s, a, ssize);
ok = 1;
}
}
}
}
return ok;
}
// extension connue?..
@@ -4205,9 +4240,8 @@ int may_bogus_multiple(httrackp * opt, const char *mime, const char *filename) {
if (strfield2(hts_mime_bogus_multiple[j], mime)) { /* found mime type in suspicious list */
char ext[64];
ext[0] = '\0';
give_mimext(ext, mime);
if (ext[0] != 0) { /* we have an extension for that */
if (give_mimext(ext, sizeof(ext),
mime)) { /* we have an extension for that */
const size_t ext_size = strlen(ext);
const char *file = strrchr(filename, '/'); /* fetch terminal filename */
@@ -4930,7 +4964,8 @@ void hts_freeall(void) {
// cut path and project name
// patch also initial path
void cut_path(char *fullpath, char *path, char *pname) {
void cut_path(char *fullpath, char *path, size_t path_size, char *pname,
size_t pname_size) {
path[0] = pname[0] = '\0';
if (strnotempty(fullpath)) {
if ((fullpath[strlen(fullpath) - 1] == '/')
@@ -4946,8 +4981,8 @@ void cut_path(char *fullpath, char *path, char *pname) {
a--;
if (*a == '/')
a++;
strcpybuff(pname, a);
strncatbuff(path, fullpath, (int) (a - fullpath));
strlcpybuff(pname, a, pname_size);
strlncatbuff(path, fullpath, path_size, (size_t) (a - fullpath));
}
}
}
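
Several functions in this file now take the destination capacity and return 1/0, so callers test the return value instead of pre-clearing the buffer and checking it afterwards. A reduced sketch of that contract follows, using a hypothetical three-entry table and a hypothetical `demo_mimext`. Matching is case-sensitive here for brevity, which may differ from the real lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table, for illustration only. */
static const char *const demo_mime[][2] = {
  { "image/gif", "gif" },
  { "text/html", "html" },
  { "image/png", "png" }
};

/* Writes the extension for mime into s (capacity ssize).
   Returns 1 if an extension was found and written, 0 otherwise.
   s is always left as a valid (possibly empty) string. */
int demo_mimext(char *s, size_t ssize, const char *mime) {
  size_t j;

  assert(s != NULL && ssize > 0);
  s[0] = '\0';
  for (j = 0; j < sizeof(demo_mime) / sizeof(demo_mime[0]); j++) {
    if (strcmp(demo_mime[j][0], mime) == 0) {
      const size_t n = strlen(demo_mime[j][1]);

      if (n >= ssize)
        return 0; /* does not fit: report "no extension" */
      memcpy(s, demo_mime[j][1], n + 1);
      return 1;
    }
  }
  return 0;
}
```

A caller can then write `if (demo_mimext(ext, sizeof(ext), mime)) { ... }` with no `ext[0] = '\0'` line before the call and no `strnotempty(ext)` check after it.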


@@ -252,7 +252,7 @@ int ishtml_ext(const char *a);
int ishttperror(int err);
int get_userhttptype(httrackp * opt, char *s, const char *fil);
void give_mimext(char *s, const char *st);
int give_mimext(char *s, size_t ssize, const char *st);
int may_bogus_multiple(httrackp * opt, const char *mime, const char *filename);
int may_unknown2(httrackp * opt, const char *mime, const char *filename);
@@ -264,7 +264,7 @@ void code64(unsigned char *a, int size_a, unsigned char *b, int crlf);
#define copychar(catbuff,a) concat(catbuff,(a),NULL)
char *convtolower(char *catbuff, const char *a);
char *convtolower(char *catbuff, size_t catbuffsize, const char *a);
void hts_lowcase(char *s);
void hts_replace(char *s, char from, char to);
int multipleStringMatch(const char *s, const char *match);
@@ -276,7 +276,8 @@ void fprintfio(FILE * fp, const char *buff, const char *prefix);
int sig_ignore_flag(int setflag); // flag ignore
#endif
void cut_path(char *fullpath, char *path, char *pname);
void cut_path(char *fullpath, char *path, size_t path_size, char *pname,
size_t pname_size);
int fexist(const char *s);
int fexist_utf8(const char *s);
@@ -499,7 +500,8 @@ HTS_STATIC int is_hypertext_mime(httrackp * opt, const char *mime,
char guessed[256];
guessed[0] = '\0';
guess_httptype(opt, guessed, file);
if (!guess_httptype_sized(opt, guessed, sizeof(guessed), file))
return 0;
return is_hypertext_mime__(guessed);
}
return 0;
@@ -514,7 +516,8 @@ HTS_STATIC int may_be_hypertext_mime(httrackp * opt, const char *mime,
char guessed[256];
guessed[0] = '\0';
guess_httptype(opt, guessed, file);
if (!guess_httptype_sized(opt, guessed, sizeof(guessed), file))
return 0;
return may_be_hypertext_mime__(guessed);
}
return 0;
@@ -529,7 +532,8 @@ HTS_STATIC int compare_mime(httrackp * opt, const char *mime, const char *file,
char guessed[256];
guessed[0] = '\0';
guess_httptype(opt, guessed, file);
if (!guess_httptype_sized(opt, guessed, sizeof(guessed), file))
return 0;
return strfield2(guessed, reference);
}
return 0;


@@ -51,12 +51,13 @@ Please visit our Website: http://www.httrack.com
url_savename_addstr(afs->save, buff);\
}
#define ADD_STANDARD_NAME(shortname) \
{ /* ajout nom */\
char BIGSTK buff[HTS_URLMAXSIZE*2];\
standard_name(buff,dot_pos,nom_pos,fil_complete,(shortname));\
url_savename_addstr(afs->save, buff);\
}
#define ADD_STANDARD_NAME(shortname) \
{ /* add name */ \
char BIGSTK buff[HTS_URLMAXSIZE * 2]; \
standard_name(buff, sizeof(buff), dot_pos, nom_pos, fil_complete, \
(shortname)); \
url_savename_addstr(afs->save, buff); \
}
/* Avoid stupid DOS system folders/file such as 'nul' */
/* Based on linux/fs/umsdos/mangle.c */
@@ -200,7 +201,7 @@ int url_savename(lien_adrfilsave *const afs,
// foo.com/bar//foobar -> foo.com/bar/foobar
if (opt->urlhack) {
// copy of adr (without protocol), used for lookups (see urlhack)
normadr = adr_normalized(adr, normadr_);
normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
normfil = fil_normalized(fil_complete, normfil_);
} else {
if (link_has_authority(adr_complete)) { // https or other protocols : in "http/" subfolder
@@ -344,8 +345,7 @@ int url_savename(lien_adrfilsave *const afs,
mime[0] = ext[0] = '\0';
get_userhttptype(opt, mime, fil);
if (strnotempty(mime)) {
give_mimext(ext, mime);
if (strnotempty(ext)) {
if (give_mimext(ext, sizeof(ext), mime)) {
ext_chg = 1;
}
}
@@ -378,8 +378,8 @@ int url_savename(lien_adrfilsave *const afs,
ext_chg = 2; /* change filename */
strcpybuff(ext, r.cdispo);
} else if (!may_unknown2(opt, r.contenttype, fil)) { // on peut patcher à priori?
give_mimext(s, r.contenttype); // obtenir extension
if (strnotempty(s) > 0) { // on a reconnu l'extension
if (give_mimext(s, sizeof(s),
r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
@@ -403,8 +403,7 @@ int url_savename(lien_adrfilsave *const afs,
mime[0] = ext[0] = '\0';
get_userhttptype(opt, mime, fil);
if (strnotempty(mime)) {
give_mimext(ext, mime);
if (strnotempty(ext)) {
if (give_mimext(ext, sizeof(ext), mime)) {
ext_chg = 1;
}
}
@@ -420,9 +419,9 @@ int url_savename(lien_adrfilsave *const afs,
strcpybuff(ext, headers->r.cdispo);
} else if (!may_unknown2(opt, headers->r.contenttype, headers->url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type)
char s[16];
s[0] = '\0';
give_mimext(s, headers->r.contenttype); // obtenir extension
if (strnotempty(s) > 0) { // on a reconnu l'extension
if (give_mimext(
s, sizeof(s),
headers->r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
@@ -431,13 +430,14 @@ int url_savename(lien_adrfilsave *const afs,
else if (mime_type != NULL) {
ext[0] = '\0';
if (*mime_type) {
give_mimext(ext, mime_type);
give_mimext(ext, sizeof(ext), mime_type);
}
if (strnotempty(ext)) {
char mime_from_file[128];
mime_from_file[0] = 0;
get_httptype(opt, mime_from_file, fil, 1);
get_httptype_sized(opt, mime_from_file, sizeof(mime_from_file),
fil, 1);
if (!strnotempty(mime_from_file) || strcasecmp(mime_type, mime_from_file) != 0) { /* different mime for this type */
/* type change not forbidden (or no extension at all) */
if (!may_unknown2(opt, mime_type, fil)) {
@@ -646,8 +646,9 @@ int url_savename(lien_adrfilsave *const afs,
ext_chg = 2; /* change filename */
strcpybuff(ext, back[b].r.cdispo);
} else if (!may_unknown2(opt, back[b].r.contenttype, back[b].url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type)
give_mimext(s, back[b].r.contenttype); // obtenir extension
if (strnotempty(s) > 0) { // on a reconnu l'extension
if (give_mimext(
s, sizeof(s),
back[b].r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
@@ -924,7 +925,7 @@ int url_savename(lien_adrfilsave *const afs,
pth[0] = n83[0] = '\0';
strncatbuff(pth, fil, (int) (nom_pos - fil) - 1);
long_to_83(opt->savename_83, n83, pth);
long_to_83(opt->savename_83, n83, sizeof(n83), pth);
htsbuff_cat(&sb, n83);
}
}
@@ -1306,7 +1307,7 @@ int url_savename(lien_adrfilsave *const afs,
if (opt->savename_83) {
char BIGSTK n83[HTS_URLMAXSIZE * 2];
long_to_83(opt->savename_83, n83, afs->save);
long_to_83(opt->savename_83, n83, sizeof(n83), afs->save);
strcpybuff(afs->save, n83);
}
// enforce stricter ISO9660 compliance (bug reported by Steffo Carlsson)
@@ -1377,7 +1378,9 @@ int url_savename(lien_adrfilsave *const afs,
if (lastDot == NULL) {
strcatbuff(afs->save, "." DELAYED_EXT);
} else if (!IS_DELAYED_EXT(afs->save)) {
strcatbuff(lastDot, "." DELAYED_EXT);
/* lastDot points within afs->save; bound by the remaining capacity */
strlcatbuff(lastDot, "." DELAYED_EXT,
sizeof(afs->save) - (size_t) (lastDot - afs->save));
}
}
// enforce 260-character path limit before inserting destination path
@@ -1582,41 +1585,41 @@ int url_savename(lien_adrfilsave *const afs,
return 0;
}
/* nom avec md5 urilisé partout */
void standard_name(char *b, const char *dot_pos, const char *nom_pos, const char *fil,
int short_ver) {
/* md5-based name used everywhere; builds into b (capacity bsize) */
void standard_name(char *b, size_t bsize, const char *dot_pos,
const char *nom_pos, const char *fil, int short_ver) {
char md5[32 + 2];
htsbuff bb = htsbuff_ptr(b, bsize);
b[0] = '\0';
/* Nom */
/* Name */
if (dot_pos) {
if (!short_ver) // Noms longs
strncatbuff(b, nom_pos, (dot_pos - nom_pos));
if (!short_ver) // long names
htsbuff_catn(&bb, nom_pos, (size_t) (dot_pos - nom_pos));
else
strncatbuff(b, nom_pos, min(dot_pos - nom_pos, 8));
htsbuff_catn(&bb, nom_pos, (size_t) min(dot_pos - nom_pos, 8));
} else {
if (!short_ver) // Noms longs
strcatbuff(b, nom_pos);
if (!short_ver) // long names
htsbuff_cat(&bb, nom_pos);
else
strncatbuff(b, nom_pos, 8);
htsbuff_catn(&bb, nom_pos, 8);
}
/* MD5 - 16 bits */
strncatbuff(b, url_md5(md5, fil), 4);
htsbuff_catn(&bb, url_md5(md5, fil), 4);
/* Ext */
if (dot_pos) {
strcatbuff(b, ".");
if (!short_ver) // Noms longs
strcatbuff(b, dot_pos + 1);
htsbuff_catc(&bb, '.');
if (!short_ver) // long names
htsbuff_cat(&bb, dot_pos + 1);
else
strncatbuff(b, dot_pos + 1, 3);
htsbuff_catn(&bb, dot_pos + 1, 3);
}
// Allow extensionless
#ifdef DO_NOT_ALLOW_EXTENSIONLESS
else {
if (!short_ver) // Noms longs
strcatbuff(b, DEFAULT_EXT);
if (!short_ver) // long names
htsbuff_cat(&bb, DEFAULT_EXT);
else
strcatbuff(b, DEFAULT_EXT_SHORT);
htsbuff_cat(&bb, DEFAULT_EXT_SHORT);
}
#endif
}
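
`standard_name()` now builds its result through a bounded builder rather than a chain of unbounded `strcat`-style calls. The sketch below shows the general shape with a hypothetical stand-in type `demo_buff`. It is not HTTrack's `htsbuff`, whose definition is not shown in this diff, and it truncates silently on overflow as a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a bounded string builder. */
typedef struct {
  char *buf;  /* destination array */
  size_t cap; /* total capacity, including the terminating NUL */
  size_t len; /* current string length */
} demo_buff;

demo_buff demo_buff_init(char *buf, size_t cap) {
  demo_buff b;

  assert(buf != NULL && cap > 0);
  buf[0] = '\0';
  b.buf = buf;
  b.cap = cap;
  b.len = 0;
  return b;
}

/* Appends at most n characters of s, truncating to the room left. */
void demo_buff_catn(demo_buff *b, const char *s, size_t n) {
  const size_t room = b->cap - 1 - b->len;
  size_t k = 0;

  while (k < n && s[k] != '\0')
    k++;
  if (k > room)
    k = room;
  memcpy(b->buf + b->len, s, k);
  b->len += k;
  b->buf[b->len] = '\0';
}

/* Short-name layout: name clamped to 8, a 4-char tag, '.', ext clamped
   to 3. The tag plays the role of the md5 fragment. */
void demo_short_name(char *out, size_t outsize, const char *name,
                     const char *tag, const char *ext) {
  demo_buff b = demo_buff_init(out, outsize);

  demo_buff_catn(&b, name, 8);
  demo_buff_catn(&b, tag, 4);
  demo_buff_catn(&b, ".", 1);
  demo_buff_catn(&b, ext, 3);
}
```

Tracking `len` in the builder avoids rescanning the destination on every append, and keeps the capacity check in one place.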


@@ -96,8 +96,8 @@ int url_savename(lien_adrfilsave *const afs,
httrackp * opt, struct_back * sback, cache_back * cache,
hash_struct * hash, int ptr, int numero_passe,
const lien_back * headers);
void standard_name(char *b, const char *dot_pos, const char *nom_pos,
const char *fil_complete,
void standard_name(char *b, size_t bsize, const char *dot_pos,
const char *nom_pos, const char *fil_complete,
int short_ver);
void url_savename_addstr(char *d, const char *s);
char *url_md5(char *digest_buffer, const char *fil_complete);
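
When the destination is a pointer into the middle of an array (as with `lastDot` inside `afs->save`), the usable capacity is the array size minus the pointer's offset. The sketch below spells out that arithmetic with a hypothetical helper `demo_replace_tail`, which refuses the write when the tail does not fit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Overwrites the string starting at p (which must point inside buf) with
   tail. Returns 1 on success, 0 if tail does not fit in what remains. */
int demo_replace_tail(char *buf, size_t bufsize, char *p, const char *tail) {
  size_t room;
  size_t n;

  assert(buf != NULL && p >= buf && p < buf + bufsize);
  room = bufsize - (size_t) (p - buf); /* bytes from p to the end of buf */
  n = strlen(tail);
  if (n >= room)
    return 0; /* no space for tail plus its terminator */
  memcpy(p, tail, n + 1);
  return 1;
}
```

Passing `sizeof(buf)` unchanged for an interior pointer would overstate the room by exactly the offset, which is the mistake the subtraction avoids.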


@@ -499,9 +499,9 @@ struct htsblk {
FILE *out; // écriture directe sur disque (si is_write=1)
LLint size; // taille fichier
char msg[80]; // message éventuel si échec ("\0"=non précisé)
char contenttype[64]; // content-type ("text/html" par exemple)
char charset[64]; // charset ("iso-8859-1" par exemple)
char contentencoding[64]; // content-encoding ("gzip" par exemple)
char contenttype[HTS_MIMETYPE_SIZE]; // content-type (e.g. "text/html")
char charset[HTS_MIMETYPE_SIZE]; // charset (e.g. "iso-8859-1")
char contentencoding[HTS_MIMETYPE_SIZE]; // content-encoding (e.g. "gzip")
char *location; // on copie dedans éventuellement la véritable 'location'
LLint totalsize; // taille totale à télécharger (-1=inconnue)
short int is_file; // ce n'est pas une socket mais un descripteur de fichier si 1


@@ -610,20 +610,22 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
b = strchr(a, '<'); // prochain tag
}
}
if (lienrelatif
(tempo, heap(ptr)->sav,
concat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_html_utf8),
"index.html")) == 0) {
if (lienrelatif(tempo, sizeof(tempo), heap(ptr)->sav,
concat(OPT_GET_BUFF(opt),
OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_html_utf8),
"index.html")) == 0) {
detect_title = 1; // ok détecté pour cette page!
makeindex_links++; // un de plus
strcpybuff(makeindex_firstlink, tempo);
strlcpybuff(makeindex_firstlink, tempo,
HTS_URLMAXSIZE * 2);
//
/* Hack */
if (opt->mimehtml) {
strcpybuff(makeindex_firstlink,
"cid:primary/primary");
strlcpybuff(makeindex_firstlink,
"cid:primary/primary",
HTS_URLMAXSIZE * 2);
}
if ((b == a) || (a == NULL) || (b == NULL)) { // pas de titre
@@ -1649,8 +1651,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
// Prendre si extension reconnue
if (!url_ok) {
get_httptype(opt, type, tempo, 0);
if (strnotempty(type)) // type reconnu!
if (get_httptype_sized(opt, type,
sizeof(type), tempo,
0)) // recognized type
url_ok = 1;
else if (is_dyntype(get_ext(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), tempo))) // reconnu php,cgi,asp..
url_ok = 1;
@@ -2318,12 +2321,12 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
switch (p_type) {
case 2:{
//if (*lien!='/') strcatbuff(base,"/");
strcpybuff(base, lien);
strlcpybuff(base, lien, HTS_URLMAXSIZE * 2);
}
break; // base
case -2:{
//if (*lien!='/') strcatbuff(codebase,"/");
strcpybuff(codebase, lien);
strlcpybuff(codebase, lien, HTS_URLMAXSIZE * 2);
}
break; // base
}
@@ -2719,7 +2722,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
strcpybuff(save, StringBuff(opt->path_html_utf8));
strcatbuff(save, cat_name);
if (lienrelatif(tempo, save, relativesavename()) == 0) {
if (lienrelatif(tempo, sizeof(tempo), save,
relativesavename()) == 0) {
/* Never escape high-chars (we don't know the encoding!!) */
inplace_escape_uri_utf(tempo, sizeof(tempo)); // escape with %xx
//if (!no_esc_utf)
@@ -2949,7 +2953,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
tempo[0] = '\0';
// calculer le lien relatif
if (lienrelatif(tempo, afs.save, relativesavename()) == 0) {
if (lienrelatif(tempo, sizeof(tempo), afs.save,
relativesavename()) == 0) {
if (!in_media) { // In media (such as real audio): don't patch
/* Never escape high-chars (we don't know the encoding!!) */
inplace_escape_uri_utf(tempo, sizeof(tempo));
@@ -3416,8 +3421,17 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (RUN_CALLBACK4(opt, postprocess, &cAddr, &cSize, urladr(), urlfil()) == 1) {
hts_log_print(opt, LOG_DEBUG,
"engine: postprocess-html: callback modified data, applying %d bytes", cSize);
TypedArraySize(output_buffer) = 0;
TypedArrayAppend(output_buffer, cAddr, cSize);
/* The callback either edits output_buffer in place (cAddr
unchanged) or hands back its own buffer (cAddr changed). Only
the latter needs a copy: re-appending output_buffer onto itself
would read freed memory, as the append's realloc can relocate
the block out from under cAddr. */
if (cAddr != TypedArrayElts(output_buffer)) {
TypedArraySize(output_buffer) = 0;
TypedArrayAppend(output_buffer, cAddr, cSize);
} else {
TypedArraySize(output_buffer) = (size_t) cSize;
}
}
}
@@ -3498,9 +3512,9 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
char BIGSTK pn_adr[HTS_URLMAXSIZE * 2], pn_fil[HTS_URLMAXSIZE * 2];
n_adr[0] = n_fil[0] = '\0';
(void) adr_normalized(moved->adr, n_adr);
(void) adr_normalized_sized(moved->adr, n_adr, sizeof(n_adr));
(void) fil_normalized(moved->fil, n_fil);
(void) adr_normalized(urladr(), pn_adr);
(void) adr_normalized_sized(urladr(), pn_adr, sizeof(pn_adr));
(void) fil_normalized(urlfil(), pn_fil);
if (strcasecmp(n_adr, pn_adr) == 0
&& strcasecmp(n_fil, pn_fil) == 0) {
@@ -4385,7 +4399,7 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
memcpy(r, &(back[b].r), sizeof(htsblk));
r->location = stre->loc_; // do NOT copy location!! it is an address, not a buffer
if (back[b].r.location)
strcpybuff(r->location, back[b].r.location);
strlcpybuff(r->location, back[b].r.location, HTS_URLMAXSIZE * 2);
back[b].r.adr = NULL; // do not free it afterwards
// release the backing slot

@@ -237,6 +237,15 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__)
/**
* Append at most "N" characters of "B" to "A", "A" having a maximum capacity
* of "S".
*/
#define strlncatbuff(A, B, S, N) \
strncat_safe_(A, S, B, HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
N, "overflow while appending '" #B "' to '" #A "'", __FILE__, \
__LINE__)
/**
* Copy characters of "B" to "A", "A" having a maximum capacity of "S".
*/
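Review note on `strlncatbuff`: a minimal sketch of the contract the new macro documents (append at most N characters of B to A, where A holds S bytes). The helper name `bounded_ncat` and its error-return convention are assumptions for illustration; the diff shows the real macro delegating to `strncat_safe_` with file and line diagnostics, and the test script in this comparison expects an overflow to abort rather than return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: append at most n characters of src to dest, where
   dest holds cap bytes in total (terminator included).
   Returns 0 on success, or -1 if dest is not terminated within cap or the
   result would not fit; dest is left unchanged on failure. */
int bounded_ncat(char *dest, size_t cap, const char *src, size_t n) {
  const char *end;
  size_t dlen, slen;

  assert(dest != NULL && src != NULL);

  /* Locate the existing terminator without reading past cap. */
  end = memchr(dest, '\0', cap);
  if (end == NULL)
    return -1;
  dlen = (size_t) (end - dest);

  /* Length of src, capped at n. */
  for (slen = 0; slen < n && src[slen] != '\0'; slen++)
    ;

  /* Need dlen + slen + 1 <= cap; cap - dlen is at least 1 here. */
  if (slen >= cap - dlen)
    return -1;

  memcpy(dest + dlen, src, slen);
  dest[dlen + slen] = '\0';
  return 0;
}
```

The 8-3 conversion in this diff follows the same shape: a name part capped at `max`, a dot, then an extension capped at 3 characters, all bounded by `n83size`.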

@@ -274,7 +274,9 @@ int ident_url_relatif(const char *lien, const char *origin_adr,
char *const idna = hts_convertStringUTF8ToIDNA(a, strlen(a));
if (idna != NULL) {
if (strlen(idna) < HTS_URLMAXSIZE) {
strcpybuff(a, idna);
/* a points within adrfil->adr; bound by the remaining capacity */
strlcpybuff(a, idna,
sizeof(adrfil->adr) - (size_t) (a - adrfil->adr));
}
free(idna);
}
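Review note on the IDNA hunk above: the copy goes through `a`, which points inside `adrfil->adr`, so the bound is the array size minus the offset of `a`. A minimal sketch of that arithmetic, written with offsets; `room_at` and `copy_at` are hypothetical names, and the clamp to zero is an extra guard in this sketch (the batch 14 commit message describes `cmdl_room` behaving that way, but this hunk does not use it).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: bytes available from offset to the end of a cap-byte array.
   Yields 0 instead of wrapping around when offset is at or past the end. */
size_t room_at(size_t cap, size_t offset) {
  return offset < cap ? cap - offset : 0;
}

/* Hypothetical: copy src (with its terminator) into buf at offset, only if
   it fits in the remaining room. Returns 0 on success, -1 otherwise;
   buf is untouched on failure. */
int copy_at(char *buf, size_t cap, size_t offset, const char *src) {
  size_t room, len;

  assert(buf != NULL && src != NULL);
  room = room_at(cap, offset);
  len = strlen(src);
  if (len >= room) /* need len + 1 bytes for the terminator */
    return -1;
  memcpy(buf + offset, src, len + 1);
  return 0;
}
```

Passing the full array size as the bound for an interior pointer would overstate the capacity by exactly the offset, which is the mistake the subtraction avoids.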
@@ -286,7 +288,7 @@ int ident_url_relatif(const char *lien, const char *origin_adr,
// build in s, from the current path curr_fil, the link to link (absolute)
// ident_url_relatif has already been run beforehand, so link is not a relative path
int lienrelatif(char *s, const char *link, const char *curr_fil) {
int lienrelatif(char *s, size_t ssize, const char *link, const char *curr_fil) {
char BIGSTK _curr[HTS_URLMAXSIZE * 2];
char BIGSTK newcurr_fil[HTS_URLMAXSIZE * 2], newlink[HTS_URLMAXSIZE * 2];
char *curr;
@@ -314,9 +316,9 @@ int lienrelatif(char *s, const char *link, const char *curr_fil) {
}
}
// recopier uniquement le chemin courant
// copy only the current path
curr = _curr;
strcpybuff(curr, curr_fil);
strlcpybuff(curr, curr_fil, sizeof(_curr));
if ((a = strchr(curr, '?')) == NULL) // cut at the ? (params)
a = curr + strlen(curr) - 1; // no params: go to the end
while((*a != '/') && (a > curr))
@@ -359,14 +361,14 @@ int lienrelatif(char *s, const char *link, const char *curr_fil) {
a++;
while(*a)
if (*(a++) == '/')
strcatbuff(s, "../");
strlcatbuff(s, "../", ssize);
//if (strlen(s)==0) strcatbuff(s,"/");
if (slash)
strcatbuff(s, "/"); // garder absolu!!
strlcatbuff(s, "/", ssize); // keep it absolute!
// on est dans le répertoire de départ, copier
strcatbuff(s, link + ((*link == '/') ? 1 : 0));
// we are in the starting directory, copy
strlcatbuff(s, link + ((*link == '/') ? 1 : 0), ssize);
/* Security check */
if (strlen(s) >= HTS_URLMAXSIZE)
@@ -410,7 +412,7 @@ int link_has_authorization(const char *lien) {
}
// convert a file/directory path to 8-3 or ISO9660
void long_to_83(int mode, char *n83, char *save) {
void long_to_83(int mode, char *n83, size_t n83size, char *save) {
n83[0] = '\0';
while(*save) {
@@ -425,19 +427,19 @@ void long_to_83(int mode, char *n83, char *save) {
}
fnl[j] = '\0';
// conversion
longfile_to_83(mode, fn83, fnl);
strcatbuff(n83, fn83);
longfile_to_83(mode, fn83, sizeof(fn83), fnl);
strlcatbuff(n83, fn83, n83size);
save += i;
if (*save == '/') {
strcatbuff(n83, "/");
strlcatbuff(n83, "/", n83size);
save++;
}
}
}
// convert a single file/directory name to 8-3 or ISO9660
void longfile_to_83(int mode, char *n83, char *save) {
void longfile_to_83(int mode, char *n83, size_t n83size, char *save) {
int j = 0, max = 0;
int i = 0;
char nom[256];
@@ -526,10 +528,10 @@ void longfile_to_83(int mode, char *n83, char *save) {
}
// fix up to 8-3
n83[0] = '\0';
strncatbuff(n83, nom, max);
strlncatbuff(n83, nom, n83size, max);
if (strnotempty(ext)) {
strcatbuff(n83, ".");
strncatbuff(n83, ext, 3);
strlcatbuff(n83, ".", n83size);
strlncatbuff(n83, ext, n83size, 3);
}
}

@@ -61,11 +61,11 @@ typedef struct lien_adrfilsave lien_adrfilsave;
int ident_url_relatif(const char *lien, const char *origin_adr,
const char *origin_fil,
lien_adrfil* const adrfil);
int lienrelatif(char *s, const char *link, const char *curr);
int lienrelatif(char *s, size_t ssize, const char *link, const char *curr);
int link_has_authority(const char *lien);
int link_has_authorization(const char *lien);
void long_to_83(int mode, char *n83, char *save);
void longfile_to_83(int mode, char *n83, char *save);
void long_to_83(int mode, char *n83, size_t n83size, char *save);
void longfile_to_83(int mode, char *n83, size_t n83size, char *save);
HTS_INLINE int __rech_tageq(const char *adr, const char *s);
HTS_INLINE int __rech_tageqbegdigits(const char *adr, const char *s);
HTS_INLINE int rech_tageq_all(const char *adr, const char *s);

@@ -223,8 +223,9 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// note (up/down): computed from the primary link AND the previous link.
// e.g. after going down twice we may go up once
if (lienrelatif(tempo, fil, heap(heap(ptr)->premier)->fil) == 0) {
if (lienrelatif(tempo2, fil, heap(ptr)->fil) == 0) {
if (lienrelatif(tempo, sizeof(tempo), fil,
heap(heap(ptr)->premier)->fil) == 0) {
if (lienrelatif(tempo2, sizeof(tempo2), fil, heap(ptr)->fil) == 0) {
hts_log_print(opt, LOG_DEBUG,
"build relative links to test: %s %s (with %s and %s)",
tempo, tempo2, heap(heap(ptr)->premier)->fil,
@@ -326,8 +327,9 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
char BIGSTK tempo[HTS_URLMAXSIZE * 2];
char BIGSTK tempo2[HTS_URLMAXSIZE * 2];
if (lienrelatif(tempo, fil, heap(heap(ptr)->premier)->fil) == 0) {
if (lienrelatif(tempo2, fil, heap(ptr)->fil) == 0) {
if (lienrelatif(tempo, sizeof(tempo), fil,
heap(heap(ptr)->premier)->fil) == 0) {
if (lienrelatif(tempo2, sizeof(tempo2), fil, heap(ptr)->fil) == 0) {
} else {
hts_log_print(opt, LOG_ERROR,
"Error building relative link %s and %s", fil,
@@ -336,7 +338,6 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
} else {
hts_log_print(opt, LOG_ERROR, "Error building relative link %s and %s",
fil, heap(heap(ptr)->premier)->fil);
}
} // end of the no-going-up check

@@ -207,6 +207,9 @@ HTSEXT_API const char *jump_normalized_const(const char *);
HTSEXT_API char *jump_toport(char *);
HTSEXT_API const char *jump_toport_const(const char *);
HTSEXT_API char *fil_normalized(const char *source, char *dest);
HTSEXT_API char *adr_normalized_sized(const char *source, char *dest,
size_t destsize);
HTS_DEPRECATED("use adr_normalized_sized(source, dest, destsize)")
HTSEXT_API char *adr_normalized(const char *source, char *dest);
HTSEXT_API const char *hts_rootdir(char *file);
@@ -244,6 +247,9 @@ HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size, co
HTSEXT_API char *antislash_unescaped(char *catbuff, const char *s);
HTSEXT_API void escape_remove_control(char *s);
HTSEXT_API int get_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil, int flag);
HTS_DEPRECATED("use get_httptype_sized(opt, s, ssize, fil, flag)")
HTSEXT_API void get_httptype(httrackp * opt, char *s, const char *fil,
int flag);
HTSEXT_API int is_knowntype(httrackp * opt, const char *fil);
@@ -251,6 +257,9 @@ HTSEXT_API int is_userknowntype(httrackp * opt, const char *fil);
HTSEXT_API int is_dyntype(const char *fil);
HTSEXT_API const char *get_ext(char *catbuff, size_t size, const char *fil);
HTSEXT_API int may_unknown(httrackp * opt, const char *st);
HTSEXT_API int guess_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil);
HTS_DEPRECATED("use guess_httptype_sized(opt, s, ssize, fil)")
HTSEXT_API void guess_httptype(httrackp * opt, char *s, const char *fil);
/* Ugly string tools */

@@ -1162,7 +1162,7 @@ static PT_Element PT_ReadCache__New_u(PT_Index index_, const char *url,
FILE *fp = fopen(file_convert(catbuff, sizeof(catbuff), previous_save), "rb");
if (fp != NULL) {
r->adr = (char *) malloc(r->size + 4);
r->adr = (char *) malloc(r->size + 1);
if (r->adr != NULL) {
if (r->size > 0
&& fread(r->adr, 1, r->size, fp) != r->size) {
@@ -1172,6 +1172,7 @@ static PT_Element PT_ReadCache__New_u(PT_Index index_, const char *url,
sprintf(r->msg, "Read error in cache disk data: %s",
strerror(last_errno));
}
r->adr[r->size] = '\0';
} else {
r->statuscode = STATUSCODE_INVALID;
strcpy(r->msg,

@@ -30,6 +30,17 @@ run() {
RC=$?
}
# crawl using exactly the given args as the only URL(s), no implicit primary URL;
# leaves the exit status in RC
run_only() {
local out="$1"
shift
rm -rf "$out"
mkdir -p "$out"
httrack -O "$out" --quiet -n "$@" >"$out/.log" 2>&1
RC=$?
}
# assert the value was accepted: clean exit and the fixture was mirrored
accepted() {
{ test "$RC" -eq 0 && test -n "$(find "$1" -type f -path '*/index.html' -print -quit)"; } ||
@@ -68,4 +79,15 @@ refused "#152: over-cap -F not refused cleanly"
run "$tmp/ov-l" --user-agent "$over"
refused "#152: over-cap --user-agent not refused cleanly"
# Quote handling on the sole URL (run_only, so the quoted arg is the only URL and
# can't be masked by an implicit one). A fully "-quoted URL has its surrounding
# quotes stripped in place and is mirrored; a dangling opening quote, and a lone
# quote (empty after the opening "), are refused cleanly and never crash.
run_only "$tmp/q-ok" "\"file://$tmp/index.html\""
accepted "$tmp/q-ok" "quoted URL not stripped/mirrored"
run_only "$tmp/q-bad" '"foo'
refused "dangling-quote argument not refused cleanly"
run_only "$tmp/q-lone" '"'
refused "lone-quote argument not refused cleanly"
exit 0

tests/01_engine-rcfile.test Executable file
@@ -0,0 +1,91 @@
#!/bin/bash
#
# Config-file alias loading (no network). A .httrackrc in the working directory
# is read by optinclude_file(), whose cmdl_ins macro inserts each alias-expanded
# token into the x_argvblk block. That macro used to copy with an unbounded
# strcpy on a bare char*; it is now bounded (strlcpybuff + cmdl_room over the
# block capacity). Two properties are checked:
# 1. The bound does not truncate: a long user-agent alias reaches doit.log
# intact. user-agent expands to two tokens (-F <value>), so it exercises
# both cmdl_ins insertions.
# 2. The bound holds under exhaustion: a pathological .httrackrc whose alias
# expansions overflow the block aborts cleanly through the htssafe bounds
# check (a message naming htsalias.c) instead of overrunning the heap. The
# unbounded version segfaulted here.
# set -e with the intentional-nonzero httrack runs guarded explicitly (the
# crawls below are expected to fail/abort and their status is inspected by hand).
set -euo pipefail
# Resolve httrack to an absolute path before we cd: PATH may hold a build-relative
# entry that would not resolve from the temp directory.
bin=$(command -v httrack) || {
echo "FAIL: httrack not found on PATH"
exit 1
}
case "$bin" in
/*) ;;
*) bin="$(cd "$(dirname "$bin")" && pwd)/$(basename "$bin")" ;;
esac
tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_rcfile.XXXXXX") || exit 1
trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM
# --- 1. alias token survives the bound intact -------------------------------
d1="$tmp/intact"
mkdir -p "$d1"
echo '<html><body>hello</body></html>' >"$d1/index.html"
# optinclude_file() lowercases each config line, so the marker is lowercase to
# survive the comparison verbatim.
marker='zzz_rcfile_marker_0123456789_abcdefghijklmnopqrstuvwxyz_intact'
printf 'user-agent=%s\n' "$marker" >"$d1/.httrackrc"
# Run with no -O so the working-directory .httrackrc is loaded (an -O path makes
# the engine skip the rc files). Output lands in the temp dir. Guard the run so a
# nonzero exit is captured for the assertion instead of tripping set -e.
rc=0
(cd "$d1" && "$bin" "file://$d1/index.html" --quiet -n >.log 2>&1) || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: rc-file crawl exited $rc"
exit 1
}
test -f "$d1/hts-cache/doit.log" || {
echo "FAIL: doit.log not written (rc file not processed)"
exit 1
}
# A truncated copy would cut the token; require the full -F value.
grep -q -- "-F $marker" "$d1/hts-cache/doit.log" || {
echo "FAIL: user-agent alias missing or truncated in doit.log"
head -1 "$d1/hts-cache/doit.log"
exit 1
}
# --- 2. block exhaustion aborts through the bound, not the heap -------------
d2="$tmp/exhaust"
mkdir -p "$d2"
echo '<html><body>hi</body></html>' >"$d2/index.html"
# Each line inserts ~two tokens of ~200 bytes; 400 lines overflow the block's
# fixed slack (current_size + 32768) many times over, deterministically.
val=$(printf 'a%.0s' $(seq 1 200))
for _ in $(seq 1 400); do
printf 'user-agent=%s\n' "$val"
done >"$d2/.httrackrc"
# The process aborts (httrack turns the fatal signal into exit 134 either way),
# so the exit code does not distinguish the bounded abort from a heap overflow;
# the stderr diagnostic does. The htssafe bounds check names the offending file.
# Expected to fail, so the nonzero exit is swallowed; only the log is inspected.
(cd "$d2" && "$bin" "file://$d2/index.html" --quiet -n >.log 2>&1) || true
grep -Eq "overflow while copying.*htsalias\.c" "$d2/.log" || {
echo "FAIL: exhausted rc file did not abort through the htsalias.c bound"
echo "(an unbounded copy would overrun the heap here)"
tail -3 "$d2/.log"
exit 1
}
exit 0

@@ -25,6 +25,7 @@ TESTS = \
01_engine-idna.test \
01_engine-mime.test \
01_engine-parse.test \
01_engine-rcfile.test \
01_engine-simplify.test \
01_engine-strsafe.test \
02_manpage-regen.test \

@@ -499,6 +499,7 @@ TESTS = \
01_engine-idna.test \
01_engine-mime.test \
01_engine-parse.test \
01_engine-rcfile.test \
01_engine-simplify.test \
01_engine-strsafe.test \
02_manpage-regen.test \

@@ -118,7 +118,10 @@ main() {
git -C "$repo/src/coucal" archive --format=tar --prefix=src/coucal/ HEAD |
tar -x -C "$export_dir"
# Refresh build system and man page, then build and validate the tarball.
# Refresh build system and man page, then build the tarball. We build here
# only because regen-man needs the compiled binaries; the test suite is not
# run in this pass. debuild (below) runs the full suite once, with the online
# tests enabled, so a check here would just be a slower, offline-only repeat.
info "regenerating build system and man page"
(
cd "$export_dir"
@@ -126,8 +129,6 @@ main() {
./configure --quiet
make -s -j"$(nproc)"
make -s -C man regen-man
info "running test suite"
make -s check
# Build the tarball from a clean tree so no object files leak into it.
make -s clean
make -s dist