treathead() parses the Content-Type value with sscanf("%s") into a local
`tempo` buffer, then calls strlen(tempo) and stores the result. A response
whose Content-Type header has an empty or whitespace-only value yields no
token: sscanf leaves `tempo` uninitialized, so strlen reads uninitialized
stack and can over-read past the buffer. A hostile server triggers this with
a bare `Content-Type:` line.
Guard on sscanf's return: adopt the value, and mark the type as server-given,
only when a token was actually read. An empty value now falls back to the
default type with contenttype_given left false, i.e. it is treated like a
missing header and the URL extension is kept -- which is also the correct
naming behavior.
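The guard can be sketched as follows. This is a minimal illustration, not the httrack code: `parse_content_type`, `CT_MAX` and the bounded `%63s` conversion are assumptions made for the example; only the check on sscanf's return value mirrors the fix described above.

```c
#include <stdio.h>
#include <string.h>

#define CT_MAX 64 /* illustrative buffer size, not httrack's */

/* Hypothetical helper. Writes the Content-Type token into `out` and returns 1
   when the header value holds one; otherwise keeps the default type and
   returns 0, so the caller can leave contenttype_given false. */
int parse_content_type(const char *value, char out[CT_MAX]) {
  char token[CT_MAX];

  strcpy(out, "text/html"); /* default type, as for a missing header */
  /* sscanf returns EOF for an empty or blank value: `token` is then left
     untouched and must not be read. */
  if (sscanf(value, "%63s", token) != 1)
    return 0;
  strcpy(out, token);
  return 1;
}
```

With this shape, a bare `Content-Type:` value ("") returns 0 and never reaches strlen on an unwritten buffer.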
Found while reviewing #409, which added contenttype_given right beside this
parse; the bug itself predates it. tests/17_local-empty-ct.test exercises the
empty-Content-Type path, and the ASan/UBSan CI job is what catches the
uninitialized read.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
PR #408 stopped a bogus or missing html-ish wire type from clobbering a URL
extension that maps to a specific non-HTML type (the #267 mangle). But it
treated an explicitly declared text/html the same as a missing type, so a
binary-looking URL that legitimately serves HTML, such as a login or error
interstitial or a soft-404 at a .pdf or .jpg link, was saved under the binary
extension with HTML inside and would not render locally.
The response body is the only true discriminator, but under the default delayed
type check the save name is committed from the headers while the body is still
downloading, so it cannot be sniffed at naming time. Instead, keep the URL
extension only when the server sent no Content-Type at all (a missing header is
defaulted to text/html upstream and must not be trusted); an explicitly declared
type, even text/html, now wins. This trades the rare case of a real binary
explicitly mislabeled text/html (now named .html) for the common interstitial
and soft-404 case.
Whether a Content-Type header was actually received cannot be recovered after
parsing, since treatfirstline defaults a missing header to text/html, so it is
recorded as a new hts_boolean contenttype_given on htsblk. That grows the
installed struct, an incompatible ABI change: soname bumped 3 -> 4, and the
Debian runtime package renamed libhttrack3 -> libhttrack4 to match.
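Why a new field forces the soname bump can be shown with two stand-in structs. The layouts below are invented for the illustration (the real htsblk is much larger, and where the field sits in it is not stated here); the point is only that a caller compiled against the old header allocates the old size.

```c
#include <stddef.h>

typedef enum { HTS_FALSE, HTS_TRUE } hts_boolean;

/* Illustrative "before" layout, not the real htsblk. */
typedef struct {
  int statuscode;
  char contenttype[64];
} blk_v3;

/* Illustrative "after" layout: same fields plus the new flag. */
typedef struct {
  int statuscode;
  char contenttype[64];
  hts_boolean contenttype_given;
} blk_v4;

/* Bytes a caller built against blk_v3 would be missing. */
size_t blk_growth(void) {
  return sizeof(blk_v4) - sizeof(blk_v3);
}
```

A library built with the larger layout would write the flag past the end of a block allocated by an old caller, which is why the two versions cannot share a soname.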
Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
The deb CI job and mkdeb.sh ran lintian via debuild with
--fail-on=error,warning and were believed to gate on it. They did not:
debuild only reports lintian, it does not propagate lintian's exit status,
so a package that lintian flags with errors or warnings still built green.
This was demonstrated by a SONAME bump landing without the matching
libhttrackN package rename: lintian emitted shared-library-is-multi-arch-foreign
and package-name-doesnt-match-sonames, yet the job passed.
Disable debuild's lintian run and run lintian ourselves on the produced
.changes, under set -e, so any error or warning fails the build. Two CI-only
adjustments keep a clean package green: --profile debian, because the Ubuntu
runners' vendor data would otherwise reject the Debian "unstable" distribution,
and --suppress-tags newer-standards-version, which only reflects the runner's
lintian being older than the buildds'. The long-standing script-not-executable
hint on the sample search.sh gets an override.
Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Add two assertions surfaced by review of the override path: control.php
must not survive its rename to control.html (a dual-write regression
would leave both), and gen.php?id=5 (a query/extension-less URL served
image/png) must keep its .png and not be mangled to .html. Both exercise
the "override still fires" direction that the suppression cases don't.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Under HARD savename-delayed (the default), url_savename() forced
is_html=-1 before consulting the user's --assume rules, so a type the
user pinned was lost to the delayed name and never applied (#56). Skip
the forced delay when is_userknowntype() matches: ishtml() already
consults the user type, so the immediate naming path applies it. Files
with no --assume rule are unaffected -- is_userknowntype() is false and
the delay still fires.
tests/16_local-assume.test crawls a .png served as image/png but assumed
text/html and checks it is saved .html; it fails without this change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Under the default delayed type check (-%N2), url_savename() rewrote a
saved file's extension from the wire Content-Type, gated only by
!may_unknown2(). text/html is not in the keep-list, so a response
labeled text/html -- or a typeless one, which is coerced to text/html --
clobbered the URL's own extension: a PNG served as text/html or with no
Content-Type was saved as .html, and .htm was normalized to .html (#29).
The bytes stayed intact; only the name was silently wrong.
wire_patches_ext() now lets the wire type override the extension only
when the type is patchable and doing so would not clobber a URL
extension that already maps to a specific, non-HTML type. A generator or
extension-less URL still becomes .html; a .png stays .png.
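A rough sketch of that decision, under stated assumptions: the function names, the extension table and the "only an html wire type is suspect" rule are hypothetical simplifications for illustration, not the actual wire_patches_ext() body. `wire_type` is assumed non-NULL; `url_ext` may be NULL for an extension-less URL.

```c
#include <string.h>

/* Hypothetical mini table of URL extensions that map to a specific,
   non-HTML type. */
static const char *const specific_exts[] = {"png", "jpg", "gif", "pdf", NULL};

int ext_is_specific_non_html(const char *ext) {
  size_t i;

  if (ext == NULL)
    return 0; /* extension-less URL */
  for (i = 0; specific_exts[i] != NULL; i++)
    if (strcmp(ext, specific_exts[i]) == 0)
      return 1;
  return 0;
}

/* Returns 1 when the wire type may rewrite the saved extension. */
int wire_may_patch_ext(int type_is_patchable, const char *wire_type,
                       const char *url_ext) {
  if (!type_is_patchable)
    return 0;
  /* An html wire type must not clobber an extension such as ".png". */
  if (strcmp(wire_type, "text/html") == 0 && ext_is_specific_non_html(url_ext))
    return 0;
  return 1;
}
```

Under these assumptions a `.png` URL labeled text/html keeps its name, while a `.php` generator labeled text/html is still renamed.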
tests/15_local-types.test locks this with a deterministic offline crawl
of a content-type/extension matrix (tests/local-server.py); it fails on
the unfixed engine. Addresses the #267 mangle family (incl. #29).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Replace the network dependency for crawl tests with a self-contained Python
stdlib server (http.server + ssl) that httrack crawls over loopback. The server
binds an ephemeral port and prints it on stdout; local-crawl.sh discovers the
port, substitutes the BASEURL token into the httrack arguments, runs the crawl,
and audits the mirror under the discovered host-root directory.
This prototype migrates two cases off ut.httrack.com:
- 13_local-cookies.test drives the cookie chain (entrance/second/third)
  reimplemented as Python handlers from the old ut/cookies/*.php fixtures. A
  missing or wrong cookie answers 500, so a clean 3-files/0-errors run proves
  the cookie jar is replayed across links.
- 14_local-https.test crawls over HTTPS using a shipped long-dated self-signed
  cert. httrack does not verify certs, so the cert is accepted as-is and the
  real TLS path runs offline.
The group skips (exit 77) when python3 is missing, mirroring check-network.sh.
Fixtures and the cert are listed explicitly in EXTRA_DIST (automake does not
expand globs); make distcheck passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
mkdeb.sh regenerated the upstream orig from a fresh `git archive HEAD | make
dist` on every run. That is right for a -1 release, but a Debian revision >= 2
reuses the orig frozen in the archive at -1: the .dsc pins it by checksum, and
a regenerated orig (different mtimes, and content drift whenever the release
tooling shipped in EXTRA_DIST changes) gets rejected by dak. The -2 upload had
to bypass mkdeb.sh and stitch the package by hand.
Derive the upstream version and Debian revision from debian/changelog and let
the revision pick the orig: revision 1 builds a fresh tarball as before;
revision >= 2 reuses the one passed with --orig FILE, untouched. The --orig
requirement is enforced only for a signed (upload-bound) build: an unsigned
build is a throwaway (CI, local lintian) that can never reach the archive, so
it still regenerates the orig as before rather than demanding a frozen one.
Two guards close the gap the old code left implicit: the regenerate path
asserts the built tarball matches the changelog version (catching a
configure.ac/changelog skew), and the overlay step confirms the orig unpacks
to httrack-<ver>/ before dropping debian/ on top.
Validated end to end by reusing the official 3.49.8 orig to build 3.49.8-2:
the resulting .dsc pins the frozen orig's checksum byte for byte.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The release-artifacts step signs and checksums httrack_<ver>.orig.tar.gz in
$outdir, but $outdir is populated by `dcmd cp` from the .changes, which lists
only the files in the upload. dpkg-genchanges omits the orig from a revision
>= 2 .changes (it is already in the archive), so the orig never reached
$outdir and `gpg --detach-sign` failed with "No such file or directory",
aborting a -2 (or later) release after the source package was already built.
Copy the orig from the build tree into $outdir before signing so the website
artifacts are produced regardless of the Debian revision. The upload is
unaffected: dput uploads the .changes-referenced files, not the extra orig.
CI didn't catch this because the deb job builds unsigned and the artifact
block is gated on a signing key.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Two packaging nits surfaced while reviewing the libhttrack3 rename, both
debian/-only:
- debian/libhttrack-swf1.files listed libhtsswf.so.1* but there is no
  libhttrack-swf1 package in debian/control and the swf module is no longer
  built (lib_LTLIBRARIES is just libhttrack/libhtsjava). dh_movefiles only
  consults built packages, so the list was dead. Remove it.
- libhttrack3.lintian-overrides claimed the ABI is tracked via "a strict
  =version dependency", but dh_makeshlibs --version-info emits the
  conservative (>= upstream-version) form, which is the correct choice for a
  soname-versioned library; a = ${binary:Version} shlibs dependency draws
  lintian's distant-prerequisite-in-shlibs. Correct the comment to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The 3.49.8 ABI bump moved the soname to libhttrack.so.3, but the packaging
still globbed .so.2 in debian/libhttrack2.files, so the runtime libraries
matched nothing there and fell through into the catch-all httrack package;
libhttrack2 shipped no library (lintian package-name-doesnt-match-sonames).
Rename the binary package to libhttrack3, take over the misplaced libraries
from httrack and the old libhttrack2 via Breaks/Replaces, and switch the
.files globs to a .so.3* wildcard so a future soname bump no longer silently
misplaces the libraries. Ships as 3.49.8-2; new binary name goes through NEW.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Compute a start past every range already in /etc/subuid+subgid and print the
canonical sudo usermod --add-subuids/--add-subgids command, instead of a raw
file append the user has to adjust by hand.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The unshare backend maps a whole UID range, not just the caller's, because the
base install creates system users. Without an /etc/subuid+subgid entry the
install crashes (dpkg SIGSEGV) instead of failing cleanly. Check for the range
before bootstrapping and point at the one-line fix; skip the check for root,
which uses mode=root.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The idempotency guard matched chroot_mode.*unshare anywhere in ~/.sbuildrc,
including a commented-out line, so --write-sbuildrc would silently skip the
append and leave the unshare backend unconfigured. Anchor the match to an
active assignment.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The --sbuild gate needs an sbuild chroot, which was only documented as loose
commands. This adds a companion script that bootstraps one with the rootless
unshare backend (mmdebstrap into ~/.cache/sbuild/<dist>-<arch>.tar.zst, where
sbuild finds it by name), idempotent unless --force, optionally writing the
unshare mode into ~/.sbuildrc. mkdeb.sh's --sbuild help now points at it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
With source-only uploads the archive's buildds are the first place the package
is built in a clean environment, so an undeclared Build-Depends or any FTBFS
only shows up after the upload. --sbuild rebuilds the freshly produced .dsc in a
minimal chroot holding only the declared Build-Depends, reproducing the buildd
environment; a failure aborts the release before the upload. It runs after the
source package is built and before the upstream-tarball release artifacts are
signed. Logs and the clean-built debs land in <outdir>/sbuild.
The distribution comes from the changelog (UNRELEASED falls back to unstable),
and the flag fails fast if sbuild isn't installed. Off by default; needs an
sbuild chroot for the target suite.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Round out the 3.49-8 entry in history.txt and the debian changelog with the
user-facing work landed since 3.49-7: the HTTPS-proxy CONNECT tunnel, wider
srcset parsing, the crawler and parser fixes (CSS @import, xmlns, relative
paths, RFC 6265 cookies, doit.log reload), the parser and engine buffer-copy
security hardening, and brief summary lines for the API, build, CI and test
work.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Set SeparateDefinitionBlocks: Always in .clang-format so clang-format keeps
a blank line between adjacent definitions, then reformat the installed
(DevIncludes) headers in full. Several of them packed struct/typedef/macro
definitions with no separation and carried non-canonical spacing (char*,
__attribute__ ((x)), padded inner parens), which made them hard to read;
this brings them to the repo's clang-format-19 canonical form and inserts
the separating blank lines.
Headers only, no semantic change: out-of-tree build is clean and make check
passes (21 pass, 7 network skip, 0 fail). htsconfig.h is UTF-8 and the text of
its French comments survives byte-for-byte (clang-format only rewrapped them to
80 columns). The new option also governs future touched-line formatting of the
engine sources.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Review of the array refactor flagged one behaviour divergence: splitting
PATH with `IFS=: read -ra` keeps empty fields (from doubled or leading
colons) as "" elements, where the old `echo $PATH | tr : ' '` word-split
dropped them, so the search loop would probe /htsserver. Skip the empty
fields to restore exact parity.
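The same skip-empty-fields rule, sketched in C rather than bash so it can be checked in isolation. `for_each_path_dir` and its callback signature are hypothetical names for this illustration only; the scripts themselves stay shell.

```c
#include <string.h>

/* Calls visit(dir, len, ctx) for every non-empty ':'-separated field of
   `path` and returns how many fields were visited. `visit` may be NULL to
   just count. */
size_t for_each_path_dir(const char *path,
                         void (*visit)(const char *dir, size_t len, void *ctx),
                         void *ctx) {
  size_t visited = 0;
  const char *p = path;

  for (;;) {
    size_t len = strcspn(p, ":");

    if (len > 0) { /* skip "" left by leading, doubled or trailing colons */
      if (visit != NULL)
        visit(p, len, ctx);
      visited++;
    }
    if (p[len] == '\0')
      break;
    p += len + 1;
  }
  return visited;
}
```

For `":/usr/bin::/bin:"` only `/usr/bin` and `/bin` are visited, so no probe is built from an empty directory name.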
Also reflow the CI SHELL_SCRIPTS list as a folded block scalar, one
entry per line and sorted, so it reads cleanly; the folded value is the
same space-separated string.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The lint job only covered a handful of scripts; bootstrap, build.sh, the
generators, webhttrack, the CGI search helper and the crawl/run-all test
harnesses went unchecked, and shfmt ran on three files. Now both linters
run over the whole tracked shell tree, listed once in a job-level env var
so the two steps stay in sync.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Fix the shellcheck findings the shfmt pass left behind, all proven
behaviour-preserving:
- Quote single-value expansions, drop the redundant ${} in arithmetic,
  add read -r, and use printf '%s' instead of variables in format
  strings, across the generators, crawl-test.sh, run-all-tests.sh and
  search.sh.
- crawl-test.sh / webhttrack: turn the deliberately word-split search
  lists into bash arrays (space-safe, no scattered disables) and replace
  the numeric trap signal lists with names, dropping the un-trappable
  KILL/STOP that bash silently ignored anyway.
- search.sh: drop the bogus \" escapes that made grep search for a
  literal-quoted pattern.
The generators are exercised by hand and ship their committed output
(htscodepages.h, htsentities.h); a differential run on synthetic input
confirms byte-identical output before and after. crawl-test.sh and
webhttrack were run end to end against a local server / a faked install,
the latter also proving the array search now survives spaces in paths.
SC2153/SC2120 false positives carry a scoped disable with a reason.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Mechanical pass: run shfmt -i 4 over the whole tracked shell tree (the
test harness .test files, the regen generators, webhttrack, the CGI
search helper, and the build/dist scripts) so they share one style.
shfmt also normalised backticks to $(...) and $[..] to $((..)).
No behaviour change: arithmetic is preserved exactly, non-ASCII bytes
are untouched, and the full make check suite still passes. The tab-indented
.test files become 4-space indented, hence the wide diff.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The ../-handling tickets #137 (embedded ../ in a URL) and #162 (cross-host
"too many ../") do not reproduce on master or the released 3.49.x: the engine
has resolved embedded, cross-host, out-of-scope and above-root ../ correctly
since the 2012 import, and the released binary behaves identically. #137's
actual breakage was a JS-generated iframe URL (httrack can't rewrite
dynamically-built links); #162 is a long-gone Windows path quirk.
The area was nearly untested, though, despite feeding both link rewriting and
crawl-scope decisions: two trivial lienrelatif asserts, none for
ident_url_relatif. Add a wide regression net via two hidden debug probes
(-#l lienrelatif, -#i ident_url_relatif, mirroring -#1 fil_simplifie) driving
tens of cases in tests/01_engine-relative.test (embedded/cross-host/sibling/
ancestor/above-root ../, query stripping, scheme handling), plus the missing
fil_simplifie edge cases (absolute paths, root clamp, query freeze) in
01_engine-simplify.test. Expected values are computed by hand, not echoed.
While covering it, fixed one real gap: the file:// branch of
ident_url_absolute skipped the fil_simplifie its http sibling runs, so file://
URLs kept their ../ in adrfil->fil while the save path was already collapsed
(htsname.c:1343). Collapsing it matches the other schemes, contains traversal
at the file:// root, and dedups a/../b against b.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
A source url(...) whose target encodes '(' ')' as %28/%29 was rewritten
with literal parens, because they are RFC2396 "mark" characters that the
URI escaper (escape_uri_utf, mode 30) leaves alone. In an unquoted CSS
url(...) the literal ')' closes the token early, so the browser mis-parses
the value and drops the background image.
Re-escape '(' and ')' back to %28/%29 when emitting the link, gated on the
url() context (ending_p == ')'). The UA decodes them to the saved-on-disk
name, so the reference still resolves. Quoted url("...") and ordinary HTML
attributes keep their parens, matching prior behavior.
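A minimal sketch of the re-escaping step, assuming a plain output buffer. `escape_parens` and its `in_css_url` flag are hypothetical stand-ins for the emit path and the `ending_p == ')'` test; the real code works on httrack's own buffers.

```c
#include <stddef.h>
#include <string.h>

/* Copies `src` to `dst`, writing '(' and ')' as %28 / %29 when in_css_url is
   nonzero. Returns 0 on success, -1 if `dst` is too small. */
int escape_parens(const char *src, int in_css_url, char *dst, size_t dstsz) {
  size_t o = 0;

  for (; *src != '\0'; src++) {
    const char *rep = NULL;

    if (in_css_url) {
      if (*src == '(')
        rep = "%28";
      else if (*src == ')')
        rep = "%29";
    }
    if (rep != NULL) {
      if (o + 3 >= dstsz) /* three bytes plus the final NUL */
        return -1;
      memcpy(dst + o, rep, 3);
      o += 3;
    } else {
      if (o + 1 >= dstsz)
        return -1;
      dst[o++] = *src;
    }
  }
  if (o >= dstsz)
    return -1;
  dst[o] = '\0';
  return 0;
}
```

With the flag clear the string is copied unchanged, which corresponds to the quoted url("...") and HTML attribute cases.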
Test in 01_engine-parse.test crawls a CSS fixture whose url() references a
%20%28...%29 name and asserts the rewrite keeps the parens encoded;
negative control confirmed (literal-paren output fails it).
Closes #163
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The JavaScript URL detector matched `.open(` for window.open("url",...)
and captured the first argument as a link. XMLHttpRequest.open(method,
url) puts the HTTP method first, so `xhr.open("GET", "ajax_info.txt")`
turned "GET" into a bogus link, rewritten to "GET.html" on a live server.
Reject a first argument that is exactly an HTTP method, mirroring the
existing ensure_not_mime guard. window.open(url) is unaffected; the real
XHR url (the second argument) is still picked up by the dirty parser.
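The rejection test might look like the sketch below. The function name, the method list and the case-sensitive exact match are assumptions for illustration; the engine's actual list is not reproduced here.

```c
#include <string.h>

/* Returns 1 when `arg` is exactly one of the listed HTTP method names.
   The list and the case-sensitive comparison are illustrative choices. */
int is_http_method(const char *arg) {
  static const char *const methods[] = {"GET",     "POST",  "PUT",
                                        "DELETE",  "HEAD",  "OPTIONS",
                                        "PATCH",   "TRACE", "CONNECT",
                                        NULL};
  size_t i;

  for (i = 0; methods[i] != NULL; i++)
    if (strcmp(arg, methods[i]) == 0)
      return 1;
  return 0;
}
```

Because the match is exact, a first argument such as `GET.html` or `ajax_info.txt` is still treated as a candidate link.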
Closes #218
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The link detector's word-boundary guards dereference *(html-1) to check
the byte preceding a matched token. When the token sits at the very start
of the parse buffer (html == r->adr), that reads one byte before the
allocation: a heap-buffer-overflow under ASan, silent on a normal build.
A stylesheet beginning with a url() token is enough to hit it.
Route the three reachable guards (url(), location=, the makeindex /title
check) through html_prevc(), which returns a space sentinel at the buffer
start. Space is the right value for these tests: a token at offset 0 is at
a word boundary, so it stays a valid match. The other *(html-1) sites only
run after html has advanced past an opening tag or quote.
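The sentinel idea in isolation, as a sketch: `prev_char_or_space` is a hypothetical equivalent of html_prevc(), and the alphanumeric boundary test is an assumption standing in for the engine's own guards.

```c
#include <ctype.h>

/* Returns the byte before `p`, or a space when `p` is at the buffer start,
   so callers never read buf_start[-1]. */
char prev_char_or_space(const char *buf_start, const char *p) {
  return (p > buf_start) ? p[-1] : ' ';
}

/* Illustrative word-boundary test: a token is at a boundary when the
   preceding byte is not alphanumeric. */
int at_word_boundary(const char *buf_start, const char *p) {
  unsigned char c = (unsigned char) prev_char_or_space(buf_start, p);

  return !isalnum(c);
}
```

A url() token at offset 0 therefore reports a boundary and stays a valid match, without touching memory before the allocation.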
Covered by an offset-0 url() fixture in 01_engine-parse.test; without
the fix it aborts at htsparse.c:1386 under the CI sanitizer job.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Reviewing the @import change, ASan flagged a pre-existing heap overflow:
when a quoted/parenthesized link token has no closing delimiter before the
buffer ends (truncated input such as `@import "x`, `@import "`, `url("x`),
the scan stops at the terminating NUL, then `c += ndelim` steps past it and
`while (*c == ' ')` / the terminator test read out of bounds. Such input
aborts under ASan on master.
Skip the URL-end scan and capture when no closing delimiter was found
(`*c == '\0'` right after the scan); c never advances past the NUL.
Well-formed tokens are unaffected.
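The shape of the fix can be sketched with a small scanner. `quoted_len` is a hypothetical helper, not the parser's code: it only shows that the "no closing delimiter" case is reported before any pointer arithmetic can step over the NUL.

```c
#include <stddef.h>

/* Scans from `s` (just past the opening delimiter) for the closing `delim`.
   Returns the length of the enclosed content, or -1 when the terminating NUL
   is reached first, in which case the caller skips the capture. */
long quoted_len(const char *s, char delim) {
  const char *c = s;

  while (*c != '\0' && *c != delim)
    c++;
  if (*c == '\0')
    return -1; /* unterminated: never advance past the NUL */
  return (long) (c - s);
}
```

For the truncated input `@import "x` the scan over `x` returns -1, while `x" screen;` yields a length of 1 as before.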
01_engine-parse.test gains a truncated-@import fixture (the valid sibling
import is still captured, the unterminated one is not) that trips the
overflow under the CI ASan job, plus a check that an @import's trailing
media/supports/layer condition survives the rewrite verbatim.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
A bare-string @import carrying a media/supports/layer condition, e.g.
`@import "theme.css" screen;`, was dropped. The detector required the closing
quote to be immediately followed by the statement terminator, so the trailing
condition aborted the capture. The `url(...)` form already worked because it
terminates at the paren.
Two coupled defects in the inscript/CSS detector are fixed:
- accept a whitespace-separated trailing condition after a quoted @import URL;
- bound the captured URL at its last content char (b) instead of recomputing
  from the terminator. The old `c -= (ndelim + 1)` mishandled spaces skipped
  before the terminator, leaving the closing quote inside the range so the
  bogus-link guard aborted. That also silently broke `foo="url" ;` (a space
  before the semicolon) for every quoted detection, not only @import.
01_engine-parse.test gains a CSS @import section that crawls a .css directly;
the conditioned cases are negative controls that fail without the fix.
Closes #94
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The "dirty parsing" heuristic accepts any tag attribute whose value looks
like a URL unless the attribute is on the no-detect list. xmlns and
xmlns:prefix declarations carry namespace URIs (xmlns:og="http://ogp.me/ns#",
etc.) that are not resources, so httrack queued and fetched them, stalling
the crawl on unrelated spec URLs. Reject xmlns/xmlns:prefix where the
no-detect list is already consulted.
01_engine-parse.test grows a fixture with each form (default and prefixed) as
the last attribute of its element, since the heuristic only inspects an
attribute whose value is immediately followed by '>'; the targets are local
file:// gifs so a regression actually downloads them (verified: reverting the
guard fetches all three).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Follow-up to the CONNECT-tunnel change, from an adversarial review (the proxy
response is hostile input: a malicious or MITM proxy controls every byte).
- Bound the response read so a proxy cannot stall the single-threaded back_wait
  crawl: proxy_getline now fails on an over-long line instead of consuming it
  forever, the header drain is capped at 64 lines, and the send loop gives up
  rather than spin against a socket that reports writable but never accepts.
- Size `authority` to hold any url_adr host (HTS_URLMAXSIZE*2) so an oversized
  hostname can't trip the abort-on-overflow buff helpers; grow `req` to match.
- Reject control bytes in the CONNECT authority as a local backstop; today the
  CR/LF defense lives entirely upstream (escape_remove_control / header-line
  splitting).
- Test: the origin now records the headers it receives, and the test asserts
  Proxy-Authorization never reaches the origin through the tunnel (the previous
  assertions couldn't see a leak). Added a flooding-proxy scenario that proves
  the crawl terminates instead of hanging on an unbounded response.
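The two read bounds from the first item can be sketched over an in-memory buffer. Everything here is illustrative: `bounded_line`, `drain_headers` and the 256-byte line buffer are hypothetical, and the real proxy_getline reads from a socket; only the 64-line cap is taken from the description above.

```c
#include <stddef.h>
#include <string.h>

#define MAX_HEADER_LINES 64 /* cap named in the change description */

/* Copies one LF-terminated line (CR stripped) from in[0..inlen) into `out`.
   Returns the number of input bytes consumed, or 0 when no complete line
   fits in `out` (over-long or unterminated), which the caller treats as
   failure. */
size_t bounded_line(const char *in, size_t inlen, char *out, size_t outsz) {
  size_t i;

  if (outsz == 0)
    return 0;
  for (i = 0; i < inlen; i++) {
    if (in[i] == '\n') {
      size_t n = i;

      if (n > 0 && in[n - 1] == '\r')
        n--;
      memcpy(out, in, n);
      out[n] = '\0';
      return i + 1;
    }
    if (i + 1 >= outsz)
      return 0; /* line too long: fail instead of consuming forever */
  }
  return 0;
}

/* Returns 0 when a blank line ends the header block within the cap,
   -1 otherwise. */
int drain_headers(const char *in, size_t inlen) {
  char line[256]; /* illustrative size */
  int lines;

  for (lines = 0; lines < MAX_HEADER_LINES; lines++) {
    size_t used = bounded_line(in, inlen, line, sizeof line);

    if (used == 0)
      return -1;
    if (line[0] == '\0')
      return 0;
    in += used;
    inlen -= used;
  }
  return -1;
}
```

Both failure paths return promptly, so a response that never ends its headers cannot hold the caller indefinitely.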
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
httrack opened https connections straight to the origin even when a proxy
was configured, so --proxy was silently ignored for https and the crawler
used the real IP. http_xfopen bypassed the proxy for any https:// URL,
because the absolute-URI proxy form it uses for http cannot carry https.
Connect to the proxy instead and, once the TCP connection is up, open an
HTTP CONNECT tunnel (http_proxy_tunnel) before the TLS handshake, so TLS
runs end-to-end with the origin. Proxy credentials now ride the CONNECT
request rather than the tunneled GET, where they would leak to the origin.
The exchange is a bounded blocking read inside the back_wait connect path:
no new async state, no struct/ABI change (the helpers stay visibility-hidden).
Verified end-to-end by 13_crawl_proxy_https.test: it crawls a local
self-signed https origin through a logging CONNECT proxy and asserts the
proxy saw the CONNECT and that credentials ride it. The assertion fails on
the pre-fix bypass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
An empty footer (-%F "") is written to hts-cache/doit.log correctly as the
two-character token "", and next_token() unquotes it back to an empty string.
But the doit.log reload loop only re-inserted a token when strnotempty(lastp),
which dropped the empty one. With its argument gone, -%F absorbed the following
token (or had none), so a no-url --continue/--update reprise misparsed and
failed.
Track whether the token started with a quote (before next_token() strips it in
place) and keep it even when empty, so "" survives the round-trip. Whitespace
gaps still produce no token, so spacing behavior is unchanged.
01_engine-doitlog.test gains a scenario that mirrors with -%F "" -r2, then on
the no-url reprise checks the regenerated doit.log still round-trips the empty
token -- probing the reader's rebuilt argv, not just that the reprise didn't
crash. The trailing -r2 makes a dropped-token bug visible (it shifts into -%F's
slot and panics) rather than a harmless run off the end of argv. Reverting only
the guard makes the scenario fail (reprise exits 255).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The request "Cookie:" header was built in the obsolete RFC 2965 style,
emitting "$Version=1" before the first cookie and a "$Path=..." attribute
after every value:
    Cookie: $Version=1; name=value; $Path=/; has_js=1; $Path=/
Servers expecting RFC 6265 treat $Version and $Path as stray cookies and
reject or misread the request. Emit bare name=value pairs joined by "; ":
    Cookie: name=value; has_js=1
The cookie loop is factored out of http_sendhead into append_cookie_header
(same logic, same buffer), with a thin http_cookie_header_selftest wrapper
so the exact code path can be unit-tested. A new hidden "-#Q" subcommand
builds the header for two same-domain cookies plus one on a different
domain (which must be filtered out) and checks the output is the clean
RFC 6265 form with no $Version/$Path and no cross-domain leak; driven by
tests/01_engine-cookies.test.
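A self-contained sketch of the RFC 6265 header form, for orientation only. The `cookie` struct, `build_cookie_header` and the exact-equality domain test are assumptions for the example; append_cookie_header works on httrack's own cookie jar and matching rules.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical cookie record for this example. */
typedef struct {
  const char *domain;
  const char *name;
  const char *value;
} cookie;

/* Writes "Cookie: n1=v1; n2=v2" for the cookies whose domain equals `host`
   (a simplification of real domain matching). Returns the number of cookies
   emitted, or -1 if `out` is too small. `out` is empty when none match. */
int build_cookie_header(const cookie *jar, size_t n, const char *host,
                        char *out, size_t outsz) {
  size_t used = 0, i;
  int emitted = 0;

  if (outsz == 0)
    return -1;
  out[0] = '\0';
  for (i = 0; i < n; i++) {
    int w;

    if (strcmp(jar[i].domain, host) != 0)
      continue; /* different domain: filtered out */
    w = snprintf(out + used, outsz - used, "%s%s=%s",
                 emitted ? "; " : "Cookie: ", jar[i].name, jar[i].value);
    if (w < 0 || (size_t) w >= outsz - used)
      return -1;
    used += (size_t) w;
    emitted++;
  }
  return emitted;
}
```

No `$Version` or `$Path` attribute is ever written, and a cookie for another domain contributes nothing to the header.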
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Type opt->savename_83 as a new hts_savename_83 enum (LONG/DOS/ISO9660 =
0/1/2) and replace the remaining magic-number literals for the already-
typed verbosedisplay and savename_delayed fields with their named enum
constants across the engine.
Behavior-preserving: every constant equals the literal it replaces, and a
C enum is int-sized, so struct layout is unchanged (sizeof(httrackp) and
offsetof(savename_83) are identical to origin/master, no soname bump). The
-L option block is deliberately reflowed to clang-format style, which is
what made the savename_83 retype tractable. Bitmask fields (travel/seeker/
getmode/parsejava/hostcontrol) intentionally stay int with named bit enums,
per the existing flags-as-enum split.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
copy_htsopt() copies each field only when it is not the "-1 means unset"
sentinel, written as `if (from->X > -1)`. The boolean/enum option
migrations turned nearlink, errpage and parseall into hts_boolean, which
GCC backs with unsigned int. `unsigned > -1` is always false, so those
three fields silently stopped being copied.
Cast to int at the guard to restore the signed sentinel test. Add a
hidden `httrack -#9` self-test that drives copy_htsopt over distinct
boolean values plus an int positive control (tests/01_engine-copyopt.test);
it fails on the unfixed guard.
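The comparison pitfall, reduced to two functions. This sketch uses a plain `unsigned int` parameter to stand for the unsigned-backed enum, and writes out the conversion the compiler applies implicitly; the function names are hypothetical.

```c
/* What `field > -1` means for an unsigned operand: -1 is converted to
   UINT_MAX, and no unsigned value is greater than that. Always returns 0. */
int guard_unsigned(unsigned int field) {
  return field > (unsigned int) -1;
}

/* The fixed guard: compare as int so the "-1 means unset" sentinel test is
   signed again. Returns 1 for 0 and 1 (HTS_FALSE / HTS_TRUE). */
int guard_signed(unsigned int field) {
  return (int) field > -1;
}
```

With the first form the copy is skipped for every value, which is how the three fields silently stopped being copied.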
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The -t "test all" flag was a stray #define sitting next to the scope
enum; make it an enum constant so the named travel values live in one
place. The mask (HTS_TRAVEL_SCOPE_MASK) stays a #define: it selects the
scope out of opt->travel, it is not a member of the value set.
Name and value (1 << 8) are unchanged, so every use site compiles
identically and opt->travel stays plain int. No ABI change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
debug becomes hts_log_type (it already stored LOG_* values; the int
declaration was a latent type hole), savename_delayed becomes a new
hts_savename_delayed { NONE, SOFT, HARD }, and verbosedisplay becomes a
new hts_verbosedisplay { NONE, SIMPLE, FULL }. hostcontrol stays int but
its bits are now named by a new hts_hostcontrol flags enum, matching the
existing getmode/seeker/travel/htsparsejava_flags pattern.
A C enum is int-sized, so struct layout, field offsets and
sizeof(httrackp) are unchanged: no ABI break, no soname bump. The three
sscanf("%d", ...) sites that fill these fields now write through an int*
(size-identical) to keep the format type exact.
These enums are unsigned-backed (all enumerators non-negative), so the
non-negative debug comparisons (debug < level, debug > LOG_INFO, etc.)
now compile to unsigned jumps. debug is never negative, never sscanf'd
and never tested against a negative bound, so the result is unchanged;
disassembly is otherwise byte-identical bar instruction scheduling.
savename_83 is left as int on purpose: its sscanf sits in the -L parser
block whose old indentation does not round-trip through clang-format.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The exported API had many functions returning int where the int is really a
yes/no answer. Type the 14 genuinely-boolean ones as hts_boolean
(catch_url, dir_exists, is_dyntype, may_unknown, hts_findnext,
hts_findisdir/isfile/issystem, hts_has_stopped, hts_addurl, hts_resetaddurl,
hts_log, get_httptype_sized, guess_httptype_sized) and the three boolean int
parameters likewise (get_httptype_sized's flag, unescape_http_unharm's no_high,
hts_request_stop's force).
hts_boolean moves from htsopt.h to htsglobal.h so the library header, which only
forward-declares httrackp and does not include htsopt.h, can see the type.
The audit deliberately left alone the functions whose name suggests a boolean
but whose value is not 0/1: hts_is_testing returns 0..5, hts_is_exiting and
is_knowntype/is_userknowntype are tri-state, structcheck and the *_utf8 wrappers
are POSIX 0/-1, hts_findgetsize is a size, hts_main is an exit code, and
copy_htsopt returns 0 for success (a bool would read backwards). hts_setpause
and hts_is_parsing keep int params because they gate on '>= 0', not 0/1.
Not an ABI break: int -> int-sized enum is the same calling convention for both
return values (eax) and parameters, and enum<->int is implicit for callers, so
already-compiled consumers keep working. Verified by comparing per-object
disassembly against master: 39 of 45 objects byte-identical, htslib differs only
in __LINE__ immediates, and the five caller/definer objects differ only in
register allocation and return-block merging (no control-flow or value change).
make check passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The httrackp option fields that are pure on/off toggles were declared as bare
int. Introduce a two-value enum, hts_boolean { HTS_FALSE, HTS_TRUE }, and use it
as the type of the 38 boolean fields so each one documents its nature at the
declaration. The hts_create_opt() defaults block now reads HTS_TRUE/HTS_FALSE.
An enum is used rather than C bool on purpose: a C enum is int-sized and
represented like int, so the struct layout, every field offset and
sizeof(httrackp) are unchanged (verified: 141648 bytes before and after). The
size_httrackp guard value still holds and there is no soname bump. A bool field
would be one byte and would repack the whole struct.
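The layout argument can be checked at compile time. A minimal sketch, assuming
a compiler that gives this enum the size of int (GCC and Clang do by default;
the C standard leaves the choice to the implementation). The type, struct and
field names are illustrative, not the real httrackp fields, and the parser uses
a temporary int rather than the int* cast described above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative two-value enum in the spirit of hts_boolean. */
typedef enum { DEMO_FALSE, DEMO_TRUE } demo_boolean;

/* The same three toggles declared three ways (hypothetical fields). */
struct opts_int { int a; int b; int c; };
struct opts_enum { demo_boolean a; demo_boolean b; demo_boolean c; };
struct opts_bool { bool a; bool b; bool c; };

/* Holds where the enum is int-sized; fails to compile otherwise. */
_Static_assert(sizeof(struct opts_enum) == sizeof(struct opts_int),
               "enum-typed fields must not change the struct size");

/* Read a "%d" value into an enum field through a temporary int, so the
   sscanf format and the pointed-to type always match.
   Returns 1 when a number was parsed, 0 otherwise. */
static int parse_toggle(const char *text, demo_boolean *field) {
  int value;

  assert(text != NULL && field != NULL);
  if (sscanf(text, "%d", &value) != 1)
    return 0;
  *field = value ? DEMO_TRUE : DEMO_FALSE;
  return 1;
}
```

On such a compiler the bool variant is the only one of the three whose size
differs, which is the repacking the message refers to.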
Scope is httrackp only; fields that look boolean but are not were left as int
(savename_delayed is tri-state, hostcontrol is a bitmask), as was is_update in
the separate lien_back struct. The four CLI sites that sscanf("%d") into a
boolean field now cast to int* to keep the read well-defined.
Value-preserving: built against origin/master and compared per-object
disassembly. 40 of 45 objects are byte-identical; the five that differ
(htscore/htslib/htsname/htsparse/htswizard) differ only in instruction selection
from the int->enum field types, with every hts_create_opt default confirmed
unchanged. make check passes. Runtime assignments and tests on these fields are
left as plain 0/1, which compile identically.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The per-mirror option fields in the installed htsopt.h carried bare ints whose
values were scattered magic numbers, decoded only by reading the parser. Type
the four single-valued fields as enums (urlmode -> hts_urlmode, cache ->
hts_cachemode, wizard -> hts_wizard, robots -> hts_robots) and name the bitmask
bits as enums too (hts_getmode, hts_seeker, hts_travel_scope, plus
HTS_TRAVEL_SCOPE_MASK / HTS_TRAVEL_TEST_ALL), following the existing
htsparsejava_flags pattern where the flag bits are an enum but the field stays
int. Replace the magic numbers at every use site with the named values.
This is not an ABI break: a C enum is int-sized and represented identically, so
the struct layout, field offsets and sizeof(httrackp) are unchanged and the
size_httrackp guard value still holds. No soname bump.
The substitution is value-preserving and was verified by comparing per-object
disassembly between this branch and origin/master: 98 of 103 objects are
byte-identical, the htscore/htscoremain/htsparse objects have identical opcode
sequences (the only deltas are __LINE__ immediates moved by clang-format
wrapping long lines), and htslib/htswizard differ only in instruction selection
from the int->enum field types, with every hts_create_opt default confirmed
unchanged. make check passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Four declarations named functions that have no definition anywhere, so
they were never exported (absent from libhttrack.so) and any caller
would fail to link: htswrap_set_userdef and htswrap_get_userdef (the
live path is the CHAIN_FUNCTION ARGUMENT with CALLBACKARG_USERDEF),
antislash_unescaped, and the internal liens_record. escape_remove_control
was additionally declared twice in httrack-library.h; the documented
declaration stays, the bare duplicate goes.
Header-only cleanup. The exported symbol set is unchanged (verified with
nm -D), so this is not an ABI break and needs no soname bump.
Found while documenting the public API (#382).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
mtime_local() returns milliseconds since the epoch, but the POSIX
branch divided tv_usec (microseconds) by 1000000 instead of 1000,
dropping the entire millisecond term. The clock only advanced at
whole-second boundaries, so every sub-second delta the callers compute
(request/connect timing, transfer-rate smoothing) read as zero. The
Windows ftime() branch was already correct.
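The arithmetic can be sketched as follows. Both helpers are hypothetical and
take the two struct timeval members as plain integers; the exact shape of the
faulty expression is an assumption based on the description above:

```c
#include <assert.h>
#include <stdint.h>

/* Milliseconds from a seconds/microseconds pair: 1 ms = 1000 us. */
static int64_t ms_from_parts(int64_t sec, int64_t usec) {
  assert(usec >= 0 && usec < 1000000);
  return sec * 1000 + usec / 1000;
}

/* Assumed faulty form: usec is always below 1000000, so the second
   term is always 0 and the result only moves on whole seconds. */
static int64_t ms_from_parts_buggy(int64_t sec, int64_t usec) {
  return sec * 1000 + usec / 1000000;
}
```

For sec = 2 and usec = 345678 the first helper yields 2345 and the second
2000, so two calls made within the same second differ by zero.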
Found while documenting the public API (#382).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The public C API was largely undocumented: most exported functions and the
installed structs had no contract, and htsopt.h documented its fields only in
terse, implementation-flavored French.
Add concise Doxygen comments across the public surface, stating the contract a
caller needs (ownership of returned and passed pointers, return and error
sentinels, buffer-size semantics, static/scratch-buffer lifetimes, and
thread-safety) rather than narrating the implementation. Covered: the 12
installed headers (DevIncludes_DATA) plus htsbase.h and htscore.h, which the
Windows and Android consumers include directly. All French comments are
translated to English; the touched files are now pure ASCII. A blank line now
separates each top-level definition for readability.
The change is comment and whitespace only, except for removing three accidental
duplicate declarations in httrack-library.h (hts_get_stats, hts_cancel_test,
hts_cancel_parsing were each declared twice). Verified by comparing the
comment-stripped preprocessor output against the previous version (no other
code token changes) and by a clean build.
Defects surfaced while reading the implementations (dead exported decls, an
mtime_local precision bug, the hts_get_stats global-aliasing hazard, and
several ABI-fragile or vestigial struct members) are left for separate fixes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The -#A cache self-test writes the cache with the same build it reads back,
a round-trip that by construction cannot catch a read-path or on-disk-ZIP
format regression: a writer and reader that drift together still agree. The
batch-6 read-path bounds work (ZIP_READFIELD_STRING, the r.adr/r.location
NUL invariant) had no guard against exactly that.
Add a hidden -#B subcommand that reads a committed, frozen new.zip
(tests/fixtures/cache-golden) and asserts a fixed set of entries -- normal
HTML, an empty redirect with a Location, JSON, a binary body with embedded
NUL and high bytes plus a Content-Disposition, a 404 -- still decodes field
for field and byte for byte. The fixture is a witness written once by an
earlier build; the table in htscache_selftest.c that defines the
expectations also regenerates it via `-#B <dir> regen`, used only when the
format changes on purpose. Every body stays in the ZIP (all_in_cache=1), so
reading needs only new.zip with no on-disk body, timestamp, or path
dependency -- portable across machines and through make distcheck's
read-only srcdir.
check_entry now also asserts Content-Disposition (cdispo) and tags its
diagnostics with the running mode (cache-selftest vs cache-golden).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
cmdl_room/cmdl_add/cmdl_ins were copy-pasted between htscoremain.c (the CLI
parser) and htsalias.c (config-file alias expansion), tagged "COPY OF cmdl_ins
in htscoremain.c". The copies had already drifted: htscoremain advanced the
pack offset by strlen+2, htsalias by strlen+1. Both are correct (a token plus
its NUL is L+1 bytes; +2 just leaves a one-byte gap), so the argv content was
identical either way, but two definitions of the same thing is one too many.
Move all three into htsalias.h (internal, gated by HTS_INTERNAL_BYTECODE,
already included by both translation units) and unify on the tight +1. This
only shrinks the inter-token gap in htscoremain's x_argvblk; every argv[] entry
is still an independently NUL-terminated string read through its own pointer,
so behavior is unchanged and the +32768 slack is untouched.
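A minimal sketch of the tight +1 packing, assuming a flat block and a running
offset. pack_token is a hypothetical function, not the real cmdl_add/cmdl_ins
macros, and it returns NULL when the token does not fit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `token` into `block` at *off and advance *off past its NUL.
   Returns the start of the copy, or NULL when there is no room. */
static char *pack_token(char *block, size_t cap, size_t *off,
                        const char *token) {
  size_t need;
  char *slot;

  assert(block != NULL && off != NULL && token != NULL);
  need = strlen(token) + 1; /* the token plus its NUL: L+1 bytes */
  if (*off > cap || need > cap - *off)
    return NULL;
  slot = block + *off;
  memcpy(slot, token, need);
  *off += need; /* the next token starts right after this NUL */
  return slot;
}
```

Each returned pointer is its own NUL-terminated string, so a reader that goes
through the pointers never observes whether a gap byte follows the NUL.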
Adds 01_engine-doitlog.test for the doit.log reprise path, which drives
htscoremain's cmdl_ins (re-running httrack with no url re-inserts each recorded
argument) and had no coverage: 02_update-cache always passes a url, and
01_engine-rcfile exercises only the htsalias.c side. The test mirrors a file://
fixture, re-runs with no url, and asserts the reprise re-mirrors cleanly and
re-crawls the inserted url after a source change. Teeth-checked: dropping the
+1 makes the inserted tokens run together and the test fails on the resulting
crawl error.
make check: 16 PASS, 7 SKIP (offline). shellcheck clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The test scripts mostly ran with no error flags, so a failing command in
the middle would be ignored and the script would limp on to a misleading
result. Turn on strict mode everywhere, guarding the spots that legitimately
expect a non-zero exit:
- the htssafe overflow probes (-#8) deliberately abort, and the strsafe/
cmdline crawls capture an exit code to assert on, so those are run with
`|| true` / `|| rc=$?` rather than letting set -e kill the script first;
- the parser fixture crawl ignores httrack's own exit (it checks the mirrored
files), so it keeps `|| true`;
- 02_update-cache replaced `find ... | grep -q .` with a `-print -quit`
command substitution: under pipefail grep -q can close the pipe early and
leave find killed by SIGPIPE, which would spuriously fail an existing file;
- 12_crawl_https guards $HTTPS_SUPPORT with `${...:-}` for set -u.
02_manpage-regen and 01_engine-cache stay on `set -eu` (no pipefail): both are
run via $(BASH), which can be a plain POSIX /bin/sh where `set -o pipefail`
does not exist.
shellcheck clean; make check: 15 PASS, 7 SKIP (offline).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Both linters fetched a tool over the network. The format job pulled the
git-clang-format driver from raw.githubusercontent.com, which 429 rate-limits
the shared runner egress IPs; a 429 failed the job and left the cache empty, so
every later run cold-missed and 429'd again. The lint job similarly fetched the
shfmt release binary from github.com.
Both are unnecessary. The clang-format-19 package already installed ships the
matching git-clang-format driver (/usr/bin/git-clang-format-19); symlink it to
the unsuffixed name. And ubuntu-24.04 (noble) ships shfmt 3.8.0 in universe,
exactly the pinned version, so install it from apt too. This drops both fetches,
both actions/cache steps, and the LLVM_TAG / SHFMT_VERSION env: no network call,
nothing to rate-limit. Each tool's version now tracks its apt package, same as
clang-format itself.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The generated build system (configure, every Makefile.in, config.h.in,
ltmain.sh, config.guess/sub, the aux scripts) was committed so a bare git
clone could build without autotools. Nothing downstream relied on the
committed copies: CI runs autoreconf -fi, Debian regenerates via
dh_autoreconf, and the release tarball is built by make dist, which
regenerates them regardless. The only cost was a recurring footgun: a stale
Makefile.in after a *_SOURCES edit silently broke the plain build (undefined
reference to cache_selftests), and CI could not catch it.
Treat them as build products. They are now .gitignored and regenerated from
configure.ac/Makefile.am by the new ./bootstrap (autoreconf -fi), and shipped
only inside make dist tarballs so tarball users still need no autotools.
build.sh is a one-shot wrapper (bootstrap + configure + make) that runs
configure via /bin/sh, so it survives a noexec source tree. Both scripts join
EXTRA_DIST. INSTALL.Linux, README.md and AGENTS.md document the git flow:
./bootstrap before ./configure, autotools required for a git build.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The strcpybuff/strcatbuff/strncatbuff family in htssafe.h silently degrades
to an unchecked strcpy/strcat/strncat when the destination is a bare char*
(capacity unknown); the pointer-destination stubs carry a 'warning' function
attribute to flag every such call. The migration that converted those 241
sites is complete, so any new char* destination is a regression.
Promote that attribute to a hard error in our own build via
-Werror=attribute-warning (gcc) / -Werror=user-defined-warnings (clang),
probed with AX_CHECK_COMPILE_FLAG so each compiler picks up only the spelling
it accepts. htssafe.h is unchanged, so downstream consumers of the installed
header still see a plain warning rather than a build break.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
LANG_LIST bounded its fixed "LANGUAGE_NAME" copy into lang_str[1024] by
buffer_size — the capacity of the *output* buffer, not lang_str's. Harmless
today (the source is a 13-byte literal), but it's the wrong size for that
destination and would become a real overflow if the source ever grew. Bound by
sizeof(lang_str) like the sibling htslang_load call just below it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Clears htsserver.c's five remaining unbounded strcpybuff/strcatbuff/
strncatbuff pointer-destination sites, the last in the tree, completing the
htssafe pointer-destination migration.
Four are behavior-preserving: each destination's capacity is known at the
call site, so the explicit-size form bounds by the same allocation the raw
copy already relied on (smallserver's POST buffer over buffer_size, the
template name_[1026] scratch over its own size with n already < 1024, the
exact-fit malloc(len+1) lang-key copy).
htslang_load's two writes into its caller buffer were raw strcpy of a
language-name string read from the lang files; a name longer than the
caller's lang_str[1024] would have overflowed. Thread a limit_size through
the (static, internal) signature and bound both writes; the NULL-limit
callers pass 0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The git-clang-format cache key and fetch URL each hardcoded llvmorg-19.1.7.
A future one-sided version bump would leave the cache serving the old driver
under a stale key. Pull the tag into an LLVM_TAG job env, mirroring how the
lint job already single-sources SHFMT_VERSION.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Print an explicit HIT/MISS line in each install step so the job log shows
whether the binary came from the cache or was downloaded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
shfmt has the same shape as the git-clang-format driver: a pinned, immutable
release binary curled from github.com on every lint run. Cache it keyed on the
pinned version (and arch) so it is fetched at most once, and retry the cold
fetch through transient errors -- so the lint job stops depending on a
github.com download succeeding on every PR run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The format job downloaded git-clang-format from raw.githubusercontent.com on
every run. The URL is pinned to an immutable tag (llvmorg-19.1.7), so the file
never changes, yet the repeated fetch was the one thing in the job that could
hit raw.githubusercontent.com's per-IP rate limit -- and on shared runner
egress IPs it did, failing the job with curl 429 (apt.llvm.org was fine; only
the step name suggested otherwise).
Cache the driver keyed on the tag so it is fetched at most once, and retry the
cold fetch through transient 429s with curl --retry --retry-all-errors.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The makeindex_firstlink_, base, codebase and loc_ aliases in the HTML
parser are bare char* views onto HTS_URLMAXSIZE*2 caller arrays, so
strcpybuff degraded to a raw strcpy (htssafe.h pointer-dest branch).
Bound all five with strlcpybuff(..., HTS_URLMAXSIZE*2), the documented
capacity of every target (makeindex_firstlink/base/codebase/loc in
htscore.c, r->location aliasing loc).
Behavior-preserving: each source (tempo, lien, back[].r.location) is
itself an HTS_URLMAXSIZE*2 buffer, so its NUL-terminated contents are
<= cap-1 and copy identically; no truncation is reachable. htsparse.c
now has zero pointer-destination warnings; htsserver.c (5) is the last
file before the stub can be flipped to an error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
htsalias.c keeps its own copy of htscoremain.c's cmdl_ins macro (config-file
alias expansion in optinclude_file). The copy still wrote alias-expanded tokens
into the argv block with an unbounded strcpybuff on a bare char*. Thread the
block capacity (x_argvblk_size) through optinclude_file and bound the insert
with strlcpybuff + cmdl_room, the same guard batch 13 applied to the original:
cmdl_room yields 0 instead of size_t-wrapping when the offset outruns the block,
so an alias/doit.log expansion bomb aborts cleanly rather than overflowing.
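The wrap that cmdl_room avoids can be sketched as follows. room_left and
room_naive are hypothetical stand-ins, not the real helper:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes still available in a block of `capacity` bytes once `offset`
   bytes are used; 0 when the offset has reached or passed the end. */
static size_t room_left(size_t capacity, size_t offset) {
  assert(capacity != 0); /* a zero-sized block is a caller bug here */
  return offset < capacity ? capacity - offset : 0;
}

/* The unguarded subtraction: size_t is unsigned, so an offset past the
   end wraps to a huge value that a bounded copy would accept. */
static size_t room_naive(size_t capacity, size_t offset) {
  return capacity - offset;
}
```

With a capacity of 10 and an offset of 11, room_left yields 0 while room_naive
yields the largest size_t value.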
Adds 01_engine-rcfile.test, which had no coverage before: it drops a .httrackrc
with a long user-agent alias in the working directory, runs httrack with no -O
(the only way the rc files load), and checks the alias-expanded -F <value> token
reaches hts-cache/doit.log intact. user-agent expands to two tokens, exercising
both cmdl_ins insertions; a truncating bound is caught (verified by injecting
one).
htsalias.c pointer-destination warnings 2->0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Continues the htssafe.h pointer-destination migration in the CLI parser
(hts_main_internal). All sites write into a bare char*.
* The cmdl_add()/cmdl_ins() macros build argv entries into the x_argvblk block
(malloc'd as the command-line size + 32768). Thread the block's total size
(recorded in a new x_argvblk_size) and bound the copy with strlcpybuff. The
remaining room is computed by a cmdl_room() helper that yields 0 once the block
is exhausted (alias expansion or doit.log insertion can outrun the 32768 slack)
so the copy aborts cleanly instead of the size_t subtraction wrapping to a huge
unbounded value.
* The in-place argv rewrites each write no more than the slot already holds, so
they are bounded by strlen(dest)+1 (provably sufficient): the "(none)" ->
"\"\"" replacement, the two quote-strip copies (tempo is argv[na] minus its
surrounding quotes), and the "--catchurl" -> "-#P" rewrite. The "--clean"/
"--tide" empty rewrite becomes a direct argv[i][1]='\0'.
* Guard the quote-strip's tempo[strlen(tempo)-1] read: a lone '"' argument left
tempo empty and read tempo[-1] (out of bounds). It now takes the existing
missing-quote error path.
* The URL accumulator append uses strlcatbuff against the tracked url_sz.
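The lone-quote guard from the list above can be sketched as follows.
strip_quotes is a hypothetical helper that works on the whole argument, whereas
the real code works on tempo after the opening quote has been skipped; the
length check before indexing the last character is the point:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `arg` without its surrounding double quotes into `dst`.
   Returns 1 on success, 0 when a quote is missing or `dst` is too small. */
static int strip_quotes(const char *arg, char *dst, size_t cap) {
  size_t len;

  assert(arg != NULL && dst != NULL);
  len = strlen(arg);
  /* len < 2 covers the lone '"' argument: there is no closing quote,
     and nothing is ever read at index -1. */
  if (len < 2 || arg[0] != '"' || arg[len - 1] != '"')
    return 0;
  if (len - 2 >= cap)
    return 0;
  memcpy(dst, arg + 1, len - 2);
  dst[len - 2] = '\0';
  return 1;
}
```

A lone '"' and a dangling-quote argument both return 0, the analogue of taking
the missing-quote error path.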
These are macros/locals inside hts_main_internal, so they are not -#7 unit-testable;
cmdl_add runs on every invocation (covered by the whole suite). New
01_engine-cmdline.test cases exercise the quote-strip rewrite as the sole URL (a
quoted URL is mirrored; dangling- and lone-quote arguments are refused cleanly,
never a crash).
htscoremain.c pointer-destination warnings: 10 -> 0.
Signed-off-by: Xavier Roche <roche@httrack.com>
Continues the htssafe.h pointer-destination migration in the CLI parser
(hts_main_internal). All sites write into a bare char*.
* The cmdl_add()/cmdl_ins() macros build argv entries into the x_argvblk block
(malloc'd as the command-line size + 32768). Thread the block's total size and
bound the copy with strlcpybuff(argv[i], token, bufsize - ptr); record the size
in a new x_argvblk_size alongside x_argvblk.
* The in-place argv rewrites each write no more than the slot already holds, so
they are bounded by strlen(dest)+1 (provably sufficient): the "(none)" ->
"\"\"" replacement, the two quote-strip copies (tempo is argv[na] minus its
surrounding quotes), and the "--catchurl" -> "-#P" rewrite. The "--clean"/
"--tide" empty rewrite becomes a direct argv[i][1]='\0'.
* The URL accumulator append uses strlcatbuff against the tracked url_sz.
These are macros/locals inside hts_main_internal, so they are not -#7
unit-testable; cmdl_add runs on every invocation (covered by the whole suite),
and a new 01_engine-cmdline.test case exercises the quote-strip rewrite (a quoted
URL is mirrored; a dangling quote is refused cleanly, never a crash).
htscoremain.c pointer-destination warnings: 10 -> 0.
Signed-off-by: Xavier Roche <roche@httrack.com>
Continues the htssafe.h pointer-destination migration: the strcpybuff/strcatbuff
macros silently fall back to a raw strcpy/strcat when the destination is a bare
char* rather than a sized array.
All four functions are internal (hidden, not HTSEXT_API), so they take explicit
destination sizes:
* lienrelatif() builds a relative link into a char* caller buffer; threads a
size_t and bounds the "../"/path appends with strlcatbuff (the local _curr
copy uses sizeof(_curr)).
* long_to_83() / longfile_to_83() build an 8-3 / ISO9660 name into a caller
buffer; thread a size_t and use strl(n)catbuff.
* ident_url_relatif()'s in-place IDNA host rewrite bounds the copy by the
remaining capacity of adrfil->adr (a pointer into that array).
Callers in htscore.c, htswizard.c, htsparse.c and htsname.c pass sizeof(dest)
(all the destinations are HTS_URLMAXSIZE*2 arrays).
Add -#7 basic_selftests for longfile_to_83 (8-3 and ISO9660), long_to_83
(per-segment path conversion) and lienrelatif (same-dir basename, parent "../").
htstools.c pointer-destination warnings: 10 -> 0.
Signed-off-by: Xavier Roche <roche@httrack.com>
Continues the htssafe.h pointer-destination migration: the strcpybuff/strcatbuff
macros silently fall back to a raw strcpy/strcat when the destination is a bare
char* rather than a sized array.
In htsname.c:
* standard_name() builds the md5-based name into a caller buffer it received as
char* (size lost), via a chain of strncatbuff/strcatbuff. It is internal
(hidden, not HTSEXT_API), so it now takes an explicit destination size and
builds through an htsbuff bounded builder; its one caller (the
ADD_STANDARD_NAME macro) passes sizeof(buff).
* url_savename()'s delayed-extension append into lastDot (a pointer into the
afs->save[HTS_URLMAXSIZE*2] array) is bounded with strlcatbuff against the
remaining capacity.
Add a -#7 basic_selftests case for standard_name covering the no-query (no md5),
query (4-char md5) and short-name (clamped extension) paths.
htsname.c pointer-destination warnings: 12 -> 0.
Signed-off-by: Xavier Roche <roche@httrack.com>
get_httptype() took the caller buffer as a bare char* and raw-strcpy'd the MIME
string into it, so crawling a URL ending in .docx/.pptx/.xlsx (whose table MIME
types reach 73 chars) overflowed the 64-byte htsblk.contenttype that the htsback
and htslib callers pass, corrupting the adjacent struct fields. Remotely
triggerable.
* Widen htsblk contenttype/charset/contentencoding to HTS_MIMETYPE_SIZE (128, a
new named constant holding the longest registered MIME type). This changes the
installed htsblk layout, so bump the library soname (VERSION_INFO 2:49:0 ->
3:0:0).
* Add bounded get_httptype_sized(), guess_httptype_sized() and
adr_normalized_sized() that take the destination size and use
strlcpybuff/snprintf. The old get_httptype(), guess_httptype() and
adr_normalized() stay as wrappers, now marked HTS_DEPRECATED (portable:
GCC/Clang attribute, MSVC __declspec, nothing elsewhere). Internal callers
pass the real buffer size; the deprecated wrappers bound to the implicit
contract their old callers relied on (HTS_MIMETYPE_SIZE for the mime buffer,
HTS_URLMAXSIZE*2 for the URL buffer) rather than staying unbounded, so they
abort on overflow instead of silently corrupting memory.
* get_httptype_sized(), guess_httptype_sized() and give_mimext() now report
whether a type/extension was written; callers check the result and bail
rather than use a possibly-empty buffer (e.g. the is_hypertext_mime helpers).
A user "--assume cgi=" rule (empty value) matches but writes nothing, so
get_httptype_sized() returns the buffer's emptiness, matching the old callers'
strnotempty(s) test rather than reporting a bogus recognized type.
* -#7 basic_selftests: a .pptx MIME (73 chars) is stored whole into a real
htsblk.contenttype (a [64] field makes the bounded copy abort); give_mimext
and get_httptype_sized return values; the octet-stream fallback; the empty
--assume rule; plus fil_normalized "//"-in-query preservation and cut_path
trailing-slash / single-char branches.
Signed-off-by: Xavier Roche <roche@httrack.com>
LLM-assisted PRs are arriving; give agents one compact, tool-neutral file
covering the repo's toolchain rules and invariants so contributions arrive
review-ready instead of needing the conventions reconstructed each time.
AGENTS.md is the operational checklist (build/test, autotools regen, touched-
lines-only formatting, byte-safe Latin-1 edits, overflow-safe bounds,
adversarial self-review, commit/PR discipline). CLAUDE.md imports it via
@AGENTS.md so Claude Code auto-loads the same source. CONTRIBUTING.md keeps the
policy and gains a Co-Authored-By attribution rule plus a PR-conciseness line.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Continues the htssafe.h pointer-destination migration (X1), where the
strcpybuff/strcatbuff macros silently fall back to a raw strcpy/strcat
when the destination is a bare char* rather than a sized array.
In htslib.c:
* fil_normalized() rebuilds the sorted query through an htsbuff bounded
builder over the malloc'd copyBuff, then copies it back with strlcpybuff
(capacity is the known qLen + 1).
* treathead() bounds the Location: copy with strlcpybuff against the
location_buffer[HTS_URLMAXSIZE*2] contract.
* give_mimext(), convtolower() and cut_path() are internal (hidden, not
HTSEXT_API), so they take an explicit destination size and the callers
pass it: give_mimext in htsname.c/htscoremain.c/htslib.c, convtolower in
htshash.c. cut_path has no callers.
Add strlncatbuff(dst, src, size, n) to htssafe.h: a bounded n-limited
append with explicit capacity, the missing parallel to strlcatbuff.
Cover fil_normalized query-sort, give_mimext, convtolower and cut_path with
the -#7 basic_selftests.
get_httptype() and adr_normalized() are left for a follow-up: both are
exported (HTSEXT_API), and get_httptype() exposes a real latent overflow
(a .docx/.pptx/.xlsx URL writes a 65-73 char mime type into 64-byte
contenttype callers) whose fix is a public-ABI decision.
htslib.c pointer-destination warnings: 14 -> 4.
Signed-off-by: Xavier Roche <roche@httrack.com>
Picks up coucal PR #6: the MurmurHash3 tail mixing shifted a byte
promoted to int left by 24, overflowing signed int once the byte had
its high bit set (UBSan). A sanitized live crawl hashing arbitrary URL
keys aborted on it.
Verified: the ASan+UBSan www.edf.fr crawl that previously aborted at
murmurhash3.h:123 now completes clean (100 pages, no findings).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Convert htscore.c's 18 pointer-destination strcpybuff/strcatbuff sites (which
silently degrade to unchecked strcpy/strcat per the htssafe.h diagnostic) to
bounded forms:
- httpmirror(): one htsbuff over the malloc'd primary buffer drives the whole
link accumulation, replacing the manual "primary_ptr += strlen" cursor in the
filelist loop; the +/- filter slots build through htsbuff over their known
HTS_URLMAXSIZE*2 capacity.
- host_ban(): the "-host/*" filter slot builds through htsbuff.
- htsAddLink(): str->localLink builds through htsbuff / strlcpybuff bounded by
str->localLinkSize.
- next_token(): the in-place unquote/unescape copied the (always shorter) result
back through an 8KB temp buffer, which both relied on an unchecked pointer copy
and aborted on tokens over 8KB. Replace with memmove left-shift compaction: no
capacity guess, no size cap.
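The left-shift compaction can be sketched as follows. unquote_in_place is a
hypothetical function and the exact quoting rules of next_token are assumed
here: quotes are dropped, and \" and \\ collapse to the bare character:

```c
#include <assert.h>
#include <string.h>

/* Remove quotes and backslash escapes in place. Every step only makes
   the string shorter, so no temporary buffer or size cap is needed. */
static void unquote_in_place(char *s) {
  char *p;

  assert(s != NULL);
  p = s;
  while (*p != '\0') {
    if (*p == '\\' && (p[1] == '"' || p[1] == '\\')) {
      memmove(p, p + 1, strlen(p + 1) + 1); /* drop the backslash */
      p++;                                  /* keep the escaped char */
    } else if (*p == '"') {
      memmove(p, p + 1, strlen(p + 1) + 1); /* drop the quote */
    } else {
      p++;
    }
  }
}
```

Shifting the tail once per removed character is quadratic in the worst case; a
read/write cursor pair would do the same compaction in one pass.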
Add a next_token() regression test to basic_selftests (httrack -#7) covering
plain tokens, quote stripping, and \" / \\ unescaping; teeth verified.
htscore.c pointer-destination sites 18 -> 0.
Signed-off-by: Xavier Roche <roche@httrack.com>
These fread buffers were over-allocated as size+4, a superstitious margin
that never bought anything: every site writes a single trailing NUL at
[size], so size+1 is exactly right. Trim them all to size+1.
The proxytrack disk-fallback read in PT_ReadCache__New_u never wrote that
NUL at all, unlike its sibling read paths in the same file; add the missing
r->adr[r->size] = '\0' so the spare byte is actually used and the buffer is
a valid C string.
Signed-off-by: Xavier Roche <roche@httrack.com>
Fill malloc'd and freed memory with 0xCA in the sanitize job so a buffer
fread into without NUL termination, then used as a C string, runs off into
the redzone instead of stopping at an accidental zero byte. ASan caps its
malloc fill at the first 4096 bytes by default, which lets large cache
buffers escape; max_malloc_fill_size lifts the cap. No rebuild, no source
change -- purely the test environment.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The disk-fallback read (cache_readex with X-In-Cache: 0, body on disk) had no
runtime coverage: the crawl tests never re-read such a body into memory, which
is why the missing terminator there went unnoticed until the audit. Extend the
-#A cache self-test:
- check_entry now asserts every read-back body is NUL-terminated at [size],
covering the in-zip read paths.
- A new pass stores a non-hypertext record (X-In-Cache: 0), creates the body at
the exact fconv()-resolved path the reader uses, reads it back through the
disk-fallback branch, and asserts it round-trips and is terminated.
Verified by reverting the fix: with the terminator removed the new pass fails
("body not NUL-terminated"); with it in place the pass is clean. Runs under the
ASan/UBSan CI job, so it now guards the disk-fallback path that had none.
Signed-off-by: Xavier Roche <roche@httrack.com>
Follow-up audit after the cache strstr() overflow in #356: same pattern of
reading a file or record into a malloc'd buffer and then treating it as a C
string without a terminator.
- cache_readex disk-fallback paths (htscache.c, "previous_save"/"return_save")
read a record body into malloc(size+4) but, unlike their zip and .dat
siblings, never set the trailing NUL. The body is later strlen'd
(htscache.c:923, htscore.c:1046), so an un-terminated one over-reads.
Terminate it like the siblings do, but only for r.size >= 0: these two paths
guard the read with `r.size > 0 &&`, so a crafted cache with a negative
X-Size would otherwise fall through to write *(r.adr + r.size) one byte
before the allocation (heap underflow). The sibling paths read
unconditionally and fail the read for a negative size, so they never hit it.
- cache_readdata (HTS_FAST_CACHE) reads the record into malloc(len+4) whose
comment already reserves the "Plus byte 0" but never set it. Set it (the
enclosing `len > 0` keeps the write in bounds).
- index_finish (htsindex.c) ran strchr() over a malloc(size+4) buffer read raw
from the temp index file; a final line without a newline would over-read.
NUL-terminate before scanning.
All four are exercised under the ASan/UBSan CI job. proxytrack's store.c has the
same structural pattern but never strlen()s the body (it is served as binary),
so it is left as is.
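The size guard described above can be sketched as follows. This is an illustration only, not the cache_readex() code: read_body() is a hypothetical helper, and it assumes the size comes from an untrusted header and may be negative.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not HTTrack's cache_readex): read `size` bytes of a
   record body from fp into a fresh buffer and NUL-terminate it at [size]. */
char *read_body(FILE *fp, long size) {
  char *adr;

  assert(fp != NULL);
  /* Reject a negative size first: adr[size] would land before the block. */
  if (size < 0)
    return NULL;
  adr = malloc((size_t) size + 1); /* one spare byte for the terminator */
  if (adr == NULL)
    return NULL;
  if (size > 0 && fread(adr, 1, (size_t) size, fp) != (size_t) size) {
    free(adr);
    return NULL;
  }
  adr[size] = '\0'; /* in bounds: 0 <= size < size + 1 */
  return adr;
}
```

The point of the ordering is that the terminator write is only reached once the size is known to be non-negative, whether or not any bytes were read.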
Signed-off-by: Xavier Roche <roche@httrack.com>
httpmirror() read hts-cache/new.lst into a malloc(sz) buffer and then ran
strstr() over it to decide which old files to purge. fread() does not
NUL-terminate, so strstr() scanned past the end of the allocation; with the
wrong heap layout it ran into the redzone. ASan caught it as a
heap-buffer-overflow on the cache-read (update) crawl. Whether it tripped
depended on the byte just past the buffer, which is why it surfaced only
intermittently on cold CI runners and never reproduced locally.
Allocate sz + 1 and NUL-terminate after the read, matching the existing
filelist_buff pattern in the same file. Both strstr() calls in the block are
covered.
Found by the new ASan/UBSan CI job.
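A minimal sketch of the allocate-sz+1-and-terminate pattern, assuming a hypothetical helper name (slurp_as_string) rather than the actual httpmirror() code:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read up to sz bytes from fp and return them as a C
   string, so strstr()/strlen() cannot scan past the allocation. */
char *slurp_as_string(FILE *fp, size_t sz) {
  char *buf;
  size_t got;

  assert(fp != NULL);
  if (sz == (size_t) -1) /* sz + 1 would wrap to 0 */
    return NULL;
  buf = malloc(sz + 1); /* one spare byte for the terminator */
  if (buf == NULL)
    return NULL;
  got = fread(buf, 1, sz, fp);
  buf[got] = '\0'; /* fread() never terminates; do it at the bytes read */
  return buf;
}
```

Terminating at the count fread() returned, rather than at sz, also keeps a short read from exposing uninitialized bytes to a later string scan.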
Signed-off-by: Xavier Roche <roche@httrack.com>
sanitize: build and run the suite under AddressSanitizer + UndefinedBehavior
Sanitizer, driving the parsers that handle untrusted crawled input. This
surfaced the use-after-free, the numeric-entity overflow, and the coucal
alignment fix in this branch; leak detection is off so the job reports
memory-safety errors rather than exit-time leaks.
no-ssl: build and test with --disable-https (and no libssl installed) so the
#if HTS_USEOPENSSL branches, never compiled by the libssl-equipped matrix, do
not rot.
distcheck: roll the release tarball and build/test it out-of-tree, guarding
against a source missing from *_SOURCES or EXTRA_DIST.
Signed-off-by: Xavier Roche <roche@httrack.com>
automake does not expand wildcards in EXTRA_DIST, so "coucal/*" and the
"*.dsp/*.dsw/*.vcproj" globs were left as literal targets that broke
"make dist" (and distcheck) out-of-tree with "No rule to make target
'coucal/*'". List the files explicitly; coucal's .c/.h ship via *_SOURCES
already, so only its aux files (LICENSE, Makefile, README.md, sample.c,
tests.c) plus the Windows project files needed listing. Regenerated
src/Makefile.in.
Signed-off-by: Xavier Roche <roche@httrack.com>
Picks up the coucal fix that reads each hash block with memcpy instead of
dereferencing an unaligned uint32_t*, clearing a UBSan alignment finding that
fired on nearly every hashtable insert during a crawl.
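The memcpy-based read can be illustrated with a short sketch. load_u32() is a hypothetical name, not coucal's; the technique itself is the standard way to read a possibly unaligned 32-bit value in C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes from any address, aligned or not.
   Casting p to uint32_t* and dereferencing it is undefined behavior when p
   is misaligned; memcpy has no alignment requirement. */
uint32_t load_u32(const void *p) {
  uint32_t v;

  assert(p != NULL);
  memcpy(&v, p, sizeof(v));
  return v;
}
```

Compilers typically lower the fixed-size memcpy to a single load on targets that allow unaligned access, so the fix usually costs nothing at runtime.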
Signed-off-by: Xavier Roche <roche@httrack.com>
A numeric entity (&#NNN...;) was accumulated digit by digit into an
int with no bound, overflowing once past INT_MAX (undefined behavior). Guard
before each multiply: a value beyond the Unicode maximum (0x10FFFF) is invalid
anyway, so stop and keep the entity literal instead of overflowing. The input
comes straight from crawled pages.
Found by the new ASan/UBSan CI job.
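A sketch of the guard-before-multiply idea, assuming a hypothetical function (parse_decimal_codepoint) that receives only the digit run; the real decoder's structure may differ.

```c
#include <assert.h>
#include <stddef.h>

#define UNICODE_MAX 0x10FFFFL

/* Hypothetical helper: parse a decimal digit run into a code point.
   Returns -1 for an empty run, a non-digit, or a value above UNICODE_MAX. */
long parse_decimal_codepoint(const char *digits) {
  long value = 0;
  size_t i;

  assert(digits != NULL);
  if (digits[0] == '\0')
    return -1;
  for (i = 0; digits[i] != '\0'; i++) {
    long d;

    if (digits[i] < '0' || digits[i] > '9')
      return -1;
    d = digits[i] - '0';
    /* Check before multiplying: value * 10 + d must stay <= UNICODE_MAX,
       so the accumulator can never approach the integer limit. */
    if (value > (UNICODE_MAX - d) / 10)
      return -1;
    value = value * 10 + d;
  }
  return value;
}
```

A caller that gets -1 would leave the entity text untouched, matching the "keep the entity literal" behavior described above.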
Signed-off-by: Xavier Roche <roche@httrack.com>
The post-process step captured a pointer into output_buffer's own storage,
reset the array size to zero, then re-appended that pointer. The append's
realloc (TypedArrayEnsureRoom reallocs unconditionally) could move the block,
leaving the copy reading freed memory. The default callback returns "modified"
without touching the data, so this hit on every crawl; ASan flagged the
use-after-free. glibc usually returns the same pointer on a same-size realloc,
which is why a plain build never crashed.
Only copy when the callback handed back a different buffer. When it edited
output_buffer in place, just adopt the new length.
Found by the new ASan/UBSan CI job.
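The "copy only when the buffer differs" rule can be sketched like this. The outbuf type and outbuf_adopt() are hypothetical stand-ins for the typed array, and the sketch assumes the callback returns either the array's own storage or an unrelated buffer, never a pointer into the middle of it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer standing in for the engine's typed array. */
typedef struct {
  char *data;
  size_t len;
  size_t cap;
} outbuf;

/* Adopt the callback's result. Returns 0 on success, -1 on failure. */
int outbuf_adopt(outbuf *out, const char *result, size_t result_len) {
  assert(out != NULL && result != NULL);
  if (result == out->data) {
    /* Edited in place: no copy, no realloc, just take the new length. */
    if (result_len > out->cap)
      return -1;
    out->len = result_len;
    return 0;
  }
  /* Different buffer: growing is safe because result does not alias data. */
  if (result_len > out->cap) {
    char *p = realloc(out->data, result_len);

    if (p == NULL)
      return -1;
    out->data = p;
    out->cap = result_len;
  }
  if (result_len > 0)
    memcpy(out->data, result, result_len);
  out->len = result_len;
  return 0;
}
```

The in-place branch never calls realloc, so there is no window in which the source pointer can be invalidated before it is read.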
Signed-off-by: Xavier Roche <roche@httrack.com>
Keeps the checkout action on a supported major; v4 runs on the
end-of-life Node 20 runtime, v6 moves to Node 24.
Signed-off-by: Xavier Roche <roche@httrack.com>
The deb job set DEB_BUILD_OPTIONS=nocheck to skip a redundant second test run.
With mkdeb.sh no longer running its own pre-build check, debuild's is the only
test pass, so nocheck would suppress it entirely and CI would never exercise the
packaged build's tests. Drop nocheck; keep noautodbgsym and parallel.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
mkdeb.sh built and tested the sources twice: once in its own export-tree
pre-build (make check, offline), then again under debuild, whose dh_auto_test
runs the suite with the online tests enabled (debian/rules configures with
--enable-online-unit-tests=auto). The first run was a slower, offline-only
subset of the second.
Drop mkdeb's own make check. The export-tree build stays, since regen-man needs
the compiled binaries, but the suite now runs once, under debuild, as the
superset. This is the same redundancy CI #352 removed via DEB_BUILD_OPTIONS=nocheck;
fixing it in mkdeb.sh applies it to release builds too instead of per-environment.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Running the suite on macOS surfaced two GNU/Linux assumptions. The test
harness there resolves $(BASH) to /bin/sh (POSIX mode), and macOS ships
BSD userland, so:
- 01_engine-cache used "du -sb"; the -b (apparent bytes) flag is GNU-only
and BSD/macOS du rejects it, leaving an empty size and an "integer
expression expected" error. Switch to portable "du -sk" (1024-byte
units); block-allocated size is an upper bound, fine for a ceiling.
- 02_manpage-regen used diff with process substitution, which a POSIX
/bin/sh does not parse. Stage the stripped inputs in temp files instead.
Both now pass under dash as well as bash, on Linux and macOS.
Signed-off-by: Xavier Roche <roche@httrack.com>
-Wl,--push-state,--no-as-needed,-lc,--pop-state forces libc back into
DT_NEEDED for libraries that reach it only through libhttrack: the
libhtsjava JNI wrapper and the libtest callback examples. The flag is
GNU-ld-specific; Apple's ld rejects it ("ld: unknown options:
--push-state --no-as-needed --pop-state"), breaking the macOS build, and
doesn't need it (every dylib links libSystem anyway).
Probe it once with AX_CHECK_LINK_FLAG and emit it via LIBC_FORCE_LINK
only where the linker accepts it. On GNU/Linux the flag is still applied
and libc.so.6 stays in DT_NEEDED, so behavior is unchanged there.
Signed-off-by: Xavier Roche <roche@httrack.com>
Two cheap portability targets that need no VM or second CI provider:
- macOS (Darwin/clang) on a native macos-14 runner. The tree has no
__APPLE__ branches, so Darwin runs the generic-Unix path against a
second libc and kernel. brew's openssl@3 is keg-only, so configure is
pointed at it via CPPFLAGS/LDFLAGS.
- 32-bit i386 via multilib on the existing x86-64 runner. Exercises the
32-bit size_t/pointer ABI, where size and bounds math can truncate or
wrap in ways 64-bit never shows. --build (not --host) keeps configure
out of cross mode so the i386 binary still runs the test suite.
Signed-off-by: Xavier Roche <roche@httrack.com>
The deb job spent ~3m19s in the build step, half of it on work CI does not
need. The package build (via mkdeb.sh) ran the full test suite a second time
with online/network unit tests enabled (~54s), and compressed the large LTO
-dbgsym packages that CI throws away (~48s).
Set DEB_BUILD_OPTIONS=nocheck,noautodbgsym,parallel=N on the CI step only.
nocheck skips debuild's make check, which is redundant here: the build matrix
already runs the suite on every config and mkdeb.sh's own pre-build runs the
offline tests. noautodbgsym drops the -dbgsym packages. parallel uses every
runner core. mkdeb.sh is unchanged, so release builds still build with LTO,
full tests, and debug symbols; only the CI environment differs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The Debian AppStream generator flagged both webhttrack desktop entries as
no-metainfo: with no MetaInfo file, the catalog entry is synthesized from
the .desktop file and the package description, which is deprecated and risks
the app being dropped from the metadata catalog.
Add com.httrack.WebHTTrack.metainfo.xml (installed to share/metainfo) for the
main app, launching WebHTTrack.desktop. Mark the secondary "Browse Mirrored
Websites" launcher with X-AppStream-Ignore=true so it doesn't produce a
duplicate, metadata-less catalog entry.
Validated with appstreamcli validate and desktop-file-validate.
Signed-off-by: Xavier Roche <roche@httrack.com>
The webhttrack Depends listed iceape-browser, iceweasel, icecat, mozilla,
firefox and mozilla-firefox as browser alternatives. All six have been
removed from Debian (iceweasel was only ever a firefox-esr stub, firefox is
not in Debian main), so qa.debian.org/debcheck flagged them as half-broken
relationships. The OR-chain still resolved via the trailing www-browser
virtual package, so it was noise rather than a real installability failure.
Replace them with firefox-esr | chromium | www-browser: two real browsers
that exist in Debian today plus the virtual fallback. google-chrome is left
out deliberately since it is not in the Debian archive and would reintroduce
the same half-broken relationship.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Add a CI job that builds the Debian packages on every push/PR through the
same tools/mkdeb.sh maintainers release with, so packaging regressions
(control, rules, file manifests, lintian) surface in CI instead of at
release time. One amd64/gcc run is enough: packaging is arch- and
compiler-independent and the existing matrix already covers compile
portability. The job is unsigned and uploads nothing; its value is the
pass/fail and the lintian gate.
Make mkdeb.sh fail the build on any lintian error or warning, and refresh
the lintian overrides so the package is clean at that level:
- Drop dead overrides whose tags lintian no longer emits (breakout-link,
the libhttrack spelling-error-in-binary).
- Rewrite the pointed-hint overrides (extra-license-file,
package-contains-documentation-outside-usr-share-doc,
hardening-no-fortify-functions): their match context is now empty and the
path shows only as a display pointer, so a path context never matches.
Match with '*' as the working webhttrack-common override already does.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
The tracked generated files (configure, */Makefile.in, ltmain.sh, aux
scripts) had drifted stale: src/Makefile.in predated #343, which added
htscache_selftest.c to the library sources, so a plain
`bash configure && make` link-failed with "undefined reference to
cache_selftests". CI regenerates with `autoreconf -fi` on every run, so it
never saw the staleness.
Regenerate with `autoreconf -fi` so a from-checkout build needs no autotools
installed, as intended. This also bumps the generators (autoconf 2.71->2.72,
automake 1.17, libtool 2.4.7), hence the large but purely generated diff.
Also commit the automake test-driver aux script (its siblings compile,
depcomp, missing, install-sh are already committed); without it `make check`
on a fresh checkout could not find $(top_srcdir)/test-driver. A clean
checkout now builds and passes `make check` with no autotools installed.
Signed-off-by: Xavier Roche <roche@httrack.com>
Continue the htssafe.h pointer-destination migration in htsback.c.
back_infostr() wrote into a bare char* through the unchecked strcatbuff()
path. Thread the destination capacity through and use strlcatbuff(), and
fix a latent bug while here: the size/totalsize trailer was sprintf'd
straight into the destination, wiping the URL the function had just
assembled, instead of being built in the scratch buffer and appended.
The fixed-size sprintf() calls become snprintf().
Enlarge back_info()'s status buffer to HTS_URLMAXSIZE*4+1024 so it can
hold both url_adr and url_fil (each HTS_URLMAXSIZE*2) plus framing. The
old HTS_URLMAXSIZE*2+1024 buffer was too small for two full-length URL
fields, so the now-bounded appends would abort on a long URL.
In back_add()'s fast-header cache path, copy the cached location into its
backing array (location_buffer) rather than through the r.location alias,
so the bounded macro sees the real capacity.
Add a back_infostr()/back_info() self-test under -#7: it formats 2000
in-memory slots across every status-code arm with exact-match assertions
(no sockets needed), plus a near-maximal URL driven through back_info()
to guard the buffer sizing. It fails on the clobber bug and on an
undersized status buffer.
htsback.c is now free of pointer-destination buff() warnings.
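A sketch of the scratch-then-append fix for the trailer. append_progress() and its " (size/total)" format are hypothetical; unlike strlcatbuff(), which aborts on overflow, this version returns -1 so the behavior is easy to observe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append a size/total trailer to dst without wiping
   what dst already holds. Returns 0 on success, -1 if it does not fit. */
int append_progress(char *dst, size_t dst_cap, long size, long total) {
  char scratch[64];
  size_t used, need;
  int n;

  assert(dst != NULL && dst_cap > 0);
  /* Format into scratch, never into dst: formatting straight into dst is
     what overwrote the URL assembled so far. */
  n = snprintf(scratch, sizeof(scratch), " (%ld/%ld)", size, total);
  if (n < 0 || (size_t) n >= sizeof(scratch))
    return -1;
  used = strlen(dst);
  need = (size_t) n;
  if (used >= dst_cap || need >= dst_cap - used)
    return -1; /* no room for the trailer plus its terminator */
  memcpy(dst + used, scratch, need + 1);
  return 0;
}
```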
Signed-off-by: Xavier Roche <roche@httrack.com>
The old license-header note ("We hereby ask people using this source NOT to
use it in purpose of grabbing emails addresses...") read awkwardly: "in
purpose of", "emails addresses", "on persons", and a one-item bullet list
under an "Important notes:" heading.
Replace it across all source headers, configure/configure.ac, and README with
a single clean sentence, and align the wording everywhere:
Ethical use: we kindly ask that you NOT use this software to harvest email
addresses or to collect any other private information about people. Doing so
would dishonor our work and waste the many hours we have spent on it.
Pure text change, no behavior impact.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Out-of-tree builds were broken in two ways, and "make check" could not run
when the source tree sits on a noexec filesystem. Fix both so a plain
`mkdir build && cd build && bash <srcdir>/configure && make && make check`
works without copying the tree.
libtest: -I../src is relative to the build dir, so out-of-tree it pointed at
build/src (generated files only) and missed the source header
httrack-library.h. Use -I$(top_srcdir)/src.
Wildcard DATA lists (libtest, lang, html, m4) used bare globs like "*.html".
Make expands a wildcard prerequisite against the build dir, so out-of-tree the
glob matches nothing and stays literal ("No rule to make target '*.html'").
Glob against $(srcdir) instead. Explicit filenames (e.g. ../history.txt) are
left as-is; they resolve through VPATH.
make check: automake's driver execve()s each tests/*.test, which fails with
"Permission denied" when the source tree is on a noexec mount. Run them through
bash via TEST_LOG_COMPILER = $(BASH) (detected by configure); this also drops
any reliance on the scripts' executable bit and works on a normal tree too.
Verified end to end from a noexec source tree: out-of-tree make builds the full
tree including libtest, and "make check" runs (14 pass, online crawl tests skip
offline, 0 fail).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Collapse the stale "1998-2017" copyright ranges (and a handful of other
ranges) in HTTrack-authored source to a single earliest year taken from
git history: 1998 for files tracing back to the 2012 release-history
import, and the real first-commit year for later additions (2013 for
htsencoding, 2014 for htsarrays/htssafe/htsconcat, 2026 for the cache
self-test). Each header also gains an SPDX-License-Identifier:
GPL-3.0-or-later line.
The runtime "about" banners (httrack, proxytrack) and the man pages keep
a range, but now end in the current year computed at build time: via
__DATE__ for the C banners and makeman.sh for the generated httrack.1,
so they no longer freeze at a stale year.
Third-party notices (Even Rouault, Mathias Svensson, Info-ZIP, Eric
Young) and the BSD-licensed coucal submodule are left untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Wire a new `httrack -#A <dir>` debug option that exercises the ZIP cache
end to end through the public API (cache_init / cache_add / cache_readex),
in a dedicated source file (htscache_selftest.c).
It stores, then reads back asserting every header field and the body
round-trip exactly:
- hand-crafted edge cases: a normal HTML page, an empty redirect with a
near-limit location, a non-HTML body kept in cache via all-in-cache, and
a binary body with embedded NUL and high bytes (compared with memcmp);
- a few thousand small entries, to stress the index/lookup at scale;
- a few large compressible and incompressible bodies, to exercise zlib
deflate/inflate and large-buffer handling.
It then updates one entry and confirms the new value is read back. The
driver returns the number of mismatches so failures are observable. The
whole cache weighs ~1-2 MB and the run takes a fraction of a second.
The location case is sized to the cache's real per-header-line round-trip
limit: cached headers are parsed through a HTS_URLMAXSIZE-sized line
buffer, so a value longer than that is truncated on read regardless of
the larger r.location buffer; 1000 bytes stays safely under it.
A dedicated test (tests/01_engine-cache.test) drives the option, asserts
the success line, that a ZIP cache was written, and that its footprint
stays under a sane ceiling.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
cache_rstr() read an attacker-controlled length (clamped only to 32768) from a
CACHE-1.x .dat and fread() it straight into fixed htsblk fields (r.msg[80],
r.contenttype[64], ...) with no destination bound -- a heap/stack overflow from
a crafted/old cache (the audit's S1). cache_brstr() (the in-memory variant) had
the same shape and, worse, no length cap at all.
Thread a destination size into both:
- cache_rstr stores at most s_size-1 bytes and fseek()s past the remainder so
the next field stays aligned (the field may be longer than the destination in
a tampered cache).
- cache_brstr caps the length and bounds the copy.
Update every caller (htscache.c and htscoremain.c) to pass sizeof(field) /
HTS_URLMAXSIZE*2. cache_rstr_addr already malloc()s to the read size, so it is
left as is. Remove the dead cache_quickbrstr (no callers).
A dedicated cache self-test (create/read/update) follows separately.
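The store-then-skip behavior can be sketched as below. read_field() is a hypothetical helper that takes the already-parsed field length as a parameter; how the length is encoded in the .dat file is not shown here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read a field of field_len bytes from fp, keeping at
   most dst_size - 1 of them, and leave fp at the start of the next field.
   Returns 0 on success, -1 on a bad length or I/O error. */
int read_field(FILE *fp, long field_len, char *dst, size_t dst_size) {
  size_t keep;

  assert(fp != NULL && dst != NULL && dst_size > 0);
  dst[0] = '\0';
  if (field_len < 0)
    return -1;
  keep = (size_t) field_len < dst_size - 1 ? (size_t) field_len
                                           : dst_size - 1;
  if (keep > 0 && fread(dst, 1, keep, fp) != keep) {
    dst[0] = '\0';
    return -1;
  }
  dst[keep] = '\0';
  /* Skip what did not fit so the next field stays aligned. */
  if ((size_t) field_len > keep &&
      fseek(fp, field_len - (long) keep, SEEK_CUR) != 0)
    return -1;
  return 0;
}
```

A caller would pass sizeof(field) for array destinations, as the commit does, so the bound always matches the real storage.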
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Every crawl test runs httrack exactly once (crawl-test.sh), so the cache read /
update path (cache_readex) -- recently touched by the buffer-bounding work -- had
zero regression coverage: the cache was written but never read back.
Add tests/02_update-cache.test, a self-contained file:// two-pass test (no
network, always runs): mirror a local site, re-mirror it unchanged (the cache-
read pass must complete with no errors -- guards a crash/abort in cache_readex),
then change a source file and re-mirror (the update must pick up the new content
-- guards the update decision that reads the cached metadata).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
ZIP_READFIELD_STRING (the cached ZIP-header field reader) copied
attacker-influenced cache-file values into fixed htsblk fields with an unchecked
strcpybuff -- benign for the char[] fields, but r.location is a char* (degrades
to raw strcpy). Thread the destination size into the macro: sizeof(field) for
the array fields, HTS_URLMAXSIZE*2 for r.location (it points into a buffer of
that size, in both the caller-supplied and the location_default case).
Also bound cache_readex's return_save copy (its one non-NULL caller passes a
HTS_URLMAXSIZE*2 buffer), the exact-sized malloc copy in cache_rstr's default
path (strlen(defaultdata)+1), and replace the two strcpybuff(r.location, "")
clears with a direct r.location[0] = '\0'.
htscache.c pointer-destination warnings 6 -> 0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
cookie_get(), bauth_prefix(), cookie_insert() and cookie_delete() all wrote into
caller-provided char* buffers via unchecked strcpybuff/strcatbuff/strncatbuff
(the pointer-destination case). Bound them:
- cookie_get: write the extracted field with htsbuff over the buffer's 8192-byte
contract (all callers use char[8192]).
- bauth_prefix: copy host+path with strlcpybuff/strlcatbuff bounded to the
caller's HTS_URLMAXSIZE*2 buffer.
- cookie_insert/cookie_delete: thread the destination capacity (the cookie
store's max_len minus the cursor offset) and use strlcpybuff/strlcatbuff;
update cookie_add/cookie_del to pass it.
Add cookie_get field-extraction asserts to basic_selftests (run via -#7) rather
than a new -# digit. Translated the touched French comments.
htsbauth.c pointer-destination warnings 9 -> 0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
hts_acceptlink_()'s auto-generated allow/deny rules built _FILTERS[0] -- a
filter slot of HTS_URLMAXSIZE*2 bytes -- via unchecked strcpybuff/strcatbuff/
strncatbuff on the char* slot, and HT_INSERT_FILTERS0 shifted slots with an
unchecked strcpybuff. Convert each rule builder to an htsbuff over the slot
(new local HTS_FILTER_SLOT_SIZE, matching the stride allocated by
filters_init()), and bound the slot-shift copy with strlcpybuff.
Behavior preserved: old vs new produce byte-identical mirrors across four crawl
configurations on a local multi-directory site (the auto-rules fire for primary
links on normal crawls). Touched French comments translated.
htswizard.c pointer-destination warnings 30 -> 0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
optalias_check() wrote into caller-provided char* buffers with unchecked ops:
the param0 case did strcpybuff/strcatbuff of command+param into return_argv[0],
which can exceed the buffer, and the syntax-error paths sprintf()'d an option
name into return_error -- which is only 256 bytes in the config-file caller, so
a long option overflows it. Both are the overflow the audit flagged.
Thread return_argv_size and return_error_size through the (internal,
non-exported) signature; copy with strlcpybuff/strlcatbuff and format with
snprintf, so an over-long value aborts/truncates instead of overrunning. Update
both callers to pass their real sizes.
The shared cmdl_ins macro is left unchanged: the cmdl_* family needs its block
size threaded through too, which is a separate cleanup.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
httrack had no community-health files. Add a short CONTRIBUTING (PR/style
basics, security-sensitivity, an outcome-only AI-assistance policy), the
Contributor Covenant 2.1 as CODE_OF_CONDUCT, and a SECURITY policy with a
verified-reproduction bar for AI-assisted reports.
Require a Signed-off-by (DCO) on every commit and enforce it in CI via a new
pull_request-only job.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
infostatuscode() was a ~60-case switch, each arm strcpybuff()-ing a literal into
the caller's char* msg: 42 unchecked pointer-destination copies of static data.
Keep the same O(1) switch dispatch but have it return the phrase instead of
copying -- new public infostatuscode_const(int) -> const char* (or NULL) -- and
do the copy in a thin wrapper.
infostatuscode() preserves exact behavior: a known code overwrites msg; an
unknown code keeps any caller-provided message, else writes "Unknown error".
The single remaining copy uses strlcpybuff with the documented 64-byte minimum
(longest phrase is 31; all callers pass >= 80).
Drops 42 pointer-destination warnings (htslib.c 56 -> 14; tree 179 -> 137).
No dispatch regression: it stays a switch (jump table), no allocation, no
per-call scan.
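The lookup/wrapper split can be sketched as follows. The names status_phrase() and status_message() are hypothetical, only four codes are shown, and the phrases are the standard HTTP reason phrases rather than HTTrack's exact strings. The sketch truncates with strncpy where the real wrapper uses strlcpybuff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lookup only: returns a static phrase, or NULL for an unknown code. */
const char *status_phrase(int code) {
  switch (code) {
  case 200:
    return "OK";
  case 301:
    return "Moved Permanently";
  case 404:
    return "Not Found";
  case 500:
    return "Internal Server Error";
  default:
    return NULL;
  }
}

/* Thin wrapper holding the single copy. A known code overwrites msg; an
   unknown code keeps a non-empty caller message, else "Unknown error". */
void status_message(int code, char *msg, size_t msg_size) {
  const char *phrase = status_phrase(code);

  assert(msg != NULL && msg_size > 0);
  if (phrase == NULL) {
    if (msg[0] != '\0')
      return; /* keep the caller-provided message */
    phrase = "Unknown error";
  }
  strncpy(msg, phrase, msg_size - 1);
  msg[msg_size - 1] = '\0';
}
```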
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The savename_type == -1 userdef renderer walked afs->save with a raw char*
cursor, doing "b += strlen(b)" after each write, and strcpybuff(b, ...) on that
char* was unchecked (the pointer-destination case). That manual pointer math is
where the function's off-by-one / strlen-based hazards lived.
Convert the cursor to an htsbuff over afs->save (capacity sizeof = the full
HTS_URLMAXSIZE*2 buffer): every append is now bounds-checked and the pointer
math is gone. The loop's truncation guard becomes "sb.len < HTS_URLMAXSIZE",
preserving the existing cap-at-1024 behavior; the 2x buffer means a write only
aborts where it would previously have overrun. Add htsbuff_catc for the
single-character appends ('%', '.', literal copy).
Removes 35 pointer-destination warnings (htsname.c 51 -> 9; the renderer is now
warning-free). Behavior verified identical: the pre-change and new binaries
produce byte-identical output across 14 -N templates (%n %N %t %p %h %H %M %q %r
%% %[param], the short %s variants, and literals) crawling a local site.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Enable with: git config core.hooksPath .githooks
The hook runs git-clang-format (clang-format 19, repo .clang-format) on the
staged C lines only and re-stages the result, so commits stay
clang-format-clean and the CI format check passes without a round-trip. It never
reformats the whole tree, only the lines a commit changes.
Safe by construction: if clang-format 19 is absent it skips (CI still enforces);
and if a file has both staged and unstaged changes it does not auto-mutate
(which would commit the unstaged part), it reports and asks the author to
stage/stash. HTTRACK_NO_AUTOFORMAT=1 skips it for one commit. README covers the
noexec-working-tree case (point core.hooksPath at an exec-fs copy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The engine predates clang-format (it was shaped by an old Visual Studio
formatter) and does not round-trip through it: a whole-tree reformat is ~25k
lines of churn, so we never do one. Instead we format only the lines a change
touches, via git-clang-format, and CI enforces the same check scoped to the diff.
.clang-format is reverse-engineered from src/*.c (2-space, no tabs, 80 cols,
char *x pointers, attached braces, un-indented case labels, space after C-style
casts). That is mostly LLVM defaults; the deliberate deviations are
SpaceAfterCStyleCast (the dominant "(int) x" form) and SortIncludes: false
(C include order can be significant, so never reorder).
The CI "format" job pins clang-format-19 from apt.llvm.org's noble channel
(ubuntu-24.04's native is 18) to match local dev, and fails only if a PR's
changed C lines are not clang-format-clean. Existing untouched code is left
alone.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Many pointer-destination buff() sites are cursors walking a buffer of known
capacity, with a manual "p += strlen(p)" after each write (the url_savename
renderer does this ~40 times). That hand-rolled pointer math is where several
of the off-by-one hazards live.
htsbuff captures the pattern: a non-owning builder (buf/cap/len) built from an
in-scope array (htsbuff_array, capacity via sizeof) or a pointer of known size
(htsbuff_ptr). htsbuff_cat/catn/cpy bound every write against the real capacity
and abort on overflow, same contract as the *_safe_ helpers, so the pointer
math goes away.
Extend the -#8 self-test and tests/01_engine-strsafe.test with builder
correctness (append, truncating append, reset, length) and an overflow-abort
case. No call sites are converted yet; that follows per file.
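A minimal sketch of a non-owning builder in the spirit described above. The sbuf type and its functions are hypothetical names, not the htsbuff API; only the buf/cap/len shape and the abort-on-overflow contract are taken from the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical non-owning builder: it borrows storage and tracks length. */
typedef struct {
  char *buf;  /* borrowed storage, never freed by the builder */
  size_t cap; /* total capacity, terminator included */
  size_t len; /* current string length, always < cap */
} sbuf;

/* Start an empty string over caller-owned storage. */
sbuf sbuf_over(char *storage, size_t cap) {
  sbuf b;

  assert(storage != NULL && cap > 0);
  b.buf = storage;
  b.cap = cap;
  b.len = 0;
  storage[0] = '\0';
  return b;
}

/* Append s; abort rather than write past the real capacity. */
void sbuf_cat(sbuf *b, const char *s) {
  size_t n;

  assert(b != NULL && s != NULL);
  n = strlen(s);
  if (n >= b->cap - b->len)
    abort();
  memcpy(b->buf + b->len, s, n + 1);
  b->len += n; /* replaces the manual "p += strlen(p)" */
}
```

For an in-scope array the capacity would come from sizeof, e.g. sbuf_over(name, sizeof(name)), which is what removes the hand-maintained cursor.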
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First consumer of the new buff() pointer-destination diagnostic. catch_url()
appended response headers into the caller's 'data' buffer with strcatbuff on
a char* destination, which is unchecked: a long header stream could overrun
the 32 KB buffer.
Make the capacity contract explicit (CATCH_URL_DATA_SIZE in htscatchurl.h,
used by the caller too) and append with strlcatbuff, which enforces the bound
and aborts rather than overflowing. htscatchurl.c now compiles warning-free
under the diagnostic.
The remaining raw sprintf/sscanf into the same buffer are separate items for
a later pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
strcpybuff/strcatbuff/strncatbuff only bounds-check when the destination
is a sized char[] array. For a bare char* the capacity is unknown, so the
macro silently falls back to plain strcpy/strcat/strncat while still
looking like a checked call.
On GCC/Clang, route the pointer case through __builtin_choose_expr() to a
stub carrying the 'warning' function attribute, so a compile-time warning
fires only at pointer-destination sites and points at the explicit-size
replacement (strlcpybuff/strlcatbuff). Array sites keep using the bounded
_safe_ helpers and stay quiet. The change is diagnostic only: no runtime
or ABI change, and other compilers keep the previous behavior.
Add a runtime self-test for the bounded ops behind a new -#8 debug mode,
plus tests/01_engine-strsafe.test covering both correct copies and the
abort-on-overflow guarantee.
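The compile-time dispatch can be sketched roughly as below. All names (COPY_STR, copy_checked, copy_unchecked, DST_IS_ARRAY) are hypothetical, and this is a reduced illustration of the technique on GCC/Clang, not the htssafe.h macros.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Bounded copy for destinations of known capacity: abort, never overflow. */
char *copy_checked(char *dst, size_t cap, const char *src) {
  size_t n = strlen(src);

  assert(dst != NULL);
  if (n >= cap)
    abort();
  memcpy(dst, src, n + 1);
  return dst;
}

#if defined(__GNUC__)
/* Any call that survives to code generation emits a compile-time warning. */
char *copy_unchecked(char *dst, const char *src)
    __attribute__((warning("pointer destination: capacity unknown")));
#endif
char *copy_unchecked(char *dst, const char *src) {
  return strcpy(dst, src);
}

#if defined(__GNUC__)
/* True for char[N], false for char*: an array type is not compatible with
   the pointer type of its first element. */
#define DST_IS_ARRAY(d) \
  (!__builtin_types_compatible_p(__typeof__(d), __typeof__(&(d)[0])))
/* Only the selected branch is kept, so array sites never reach the stub. */
#define COPY_STR(dst, src)                                               \
  __builtin_choose_expr(DST_IS_ARRAY(dst),                               \
                        copy_checked((dst), sizeof(dst), (src)),         \
                        copy_unchecked((dst), (src)))
#else
/* Other compilers: previous behavior, no diagnostic. */
#define COPY_STR(dst, src) copy_unchecked((dst), (src))
#endif
```

With a `char name[8]` destination, COPY_STR(name, "hi") compiles quietly and is bounded; with a `char *` destination the same call still works but should produce the warning on compilers that support the attribute.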
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The -F user-agent value was rejected past 126 bytes and the -%X header line
past 256. Both are stored in dynamically grown String buffers, so the caps were
arbitrary. Drop them; every argument is still bounded by the general
per-argument check in htscoremain.c (HTS_CDLMAXSIZE), which lifts the usable
limit to just under 1 KB.
optalias_check copied a long-form option value (--user-agent, --headers, ...)
into a fixed 1000-byte scratch buffer, smaller than that general cap, so a value
of 1000..1023 bytes aborted the process through the guarded-copy overflow check.
Size command and param to HTS_CDLMAXSIZE so the long form matches the cap; an
over-cap value is now refused with the normal "argument too long" message
instead of crashing. Grow the request-head buffer to 16384 for the larger
aggregate header set.
closes #152
background-image is already captured and rewritten through the style/CSS
url() path, in both an external <style> block and an inline style attribute,
with the URL unquoted, double-quoted or single-quoted. Extend the offline
parser test to cover all of these so the behavior stays locked.
closes #237
A srcset value is a comma-separated list of "URL descriptor" entries
(480w, 2x). HTTrack only had "data-srcset" in the link-detection table and
left the plain "srcset" attribute untouched, so responsive images were never
mirrored. The parser now captures and rewrites each candidate URL in turn,
preserving the descriptors and the commas between entries verbatim, and bounds
every new buffer scan against the page end.
Candidate splitting follows the WHATWG srcset algorithm: the URL is a run of
non-whitespace characters, so a comma inside a URL (a data: URI, a CDN
transform path like w_300,c_fill) stays part of the URL and is not mis-split;
only a trailing comma or a comma after the descriptor separates candidates.
Adds tests/01_engine-parse.test, an offline file:// parser test that asserts
each candidate is queued and rewritten (including the comma-in-URL cases), and
also locks the existing xlink:href (#298) and inline background-image (#237)
handling.
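The candidate-splitting rule can be sketched as follows. next_srcset_url() is a hypothetical helper, not the parser's code, and it is simplified: it ignores the parenthesis tracking that the WHATWG algorithm applies inside descriptors.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII whitespace as used by the HTML srcset rules. */
static int is_space(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Hypothetical helper: find the next candidate URL in a srcset value.
   On success returns 1, sets *url/*len (not NUL-terminated), and advances
   *cursor past the candidate. Returns 0 when no candidate is left. */
int next_srcset_url(const char **cursor, const char **url, size_t *len) {
  const char *p, *start, *end;

  assert(cursor != NULL && *cursor != NULL && url != NULL && len != NULL);
  p = *cursor;
  /* Skip whitespace and stray commas between candidates. */
  while (is_space(*p) || *p == ',')
    p++;
  if (*p == '\0') {
    *cursor = p;
    return 0;
  }
  /* The URL is a run of non-whitespace, so inner commas stay part of it. */
  start = p;
  while (*p != '\0' && !is_space(*p))
    p++;
  end = p;
  if (end[-1] == ',') {
    /* Trailing comma(s) end the candidate: no descriptor follows. */
    while (end > start && end[-1] == ',')
      end--;
  } else {
    /* Skip the descriptor (e.g. "480w", "2x") up to the separating comma. */
    while (*p != '\0' && *p != ',')
      p++;
    if (*p == ',')
      p++;
  }
  *url = start;
  *len = (size_t) (end - start);
  *cursor = p;
  return 1;
}
```

Because the function reports the URL as a pointer and length into the original value, a rewriter can replace just that span and leave descriptors and commas byte-for-byte intact.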
closes #235, closes #236
The rendered HTML manual had no regeneration path. Add regen-man-html,
which runs groff's html device over httrack.1, alongside the existing
regen-man target.
someweb.com is a real registrable domain; example.com is reserved for
documentation (RFC 2606). Replace it across the HTML guides, the CLI
--help text (htshelp.c) and code comments, then regenerate man/httrack.1
and the rendered html/httrack.man.html. Other placeholder domains are
left alone: they appear inside filter/wildcard examples where the host
interacts with the pattern.
Add -#0 self-test cases for backslash escapes inside a '*[...]' class.
They pin two quirks of the current decoder: '\X' matches both X and the
backslash itself, and a literal ']' cannot be a class member because the
parser stops at the first ']' (escaped or not). The latter is why the
filter guide's '*[\[\]]' = "the [ or ] character" claim is wrong (#148):
it parses as the class {[,\} plus a trailing literal ']'. These tests
lock the behavior down so a later matcher fix is a deliberate change.
refs #148
Escape the literal <URLs>, <FILTERs>, <param>, <filter>, <file> and
related placeholders in fcguide.html so they render instead of being
swallowed as unknown HTML tags; several were also missing their closing
'>'. Use --recurse-submodules in the README clone command. Relabel
lang/Ukrainian.txt as windows-1251, which is what its bytes actually
are (ISO-8859-5 decodes them to garbage).
closes #132, closes #103, closes #167
Add filter (-#0) and MIME (-#2) tests, and broaden the charset, entity,
IDNA, and path-simplify cases that previously had one or two assertions
each.
Cover the punycode, charset, and entity parsers (areas with a CVE
history) with malformed-input probes that check the hardened build exits
cleanly rather than overflowing. The IDNA and path-simplify edge cases
are pinned to RFC 3492 and RFC 3986 semantics.
The entity case documents the known U+00A0 -> space behavior in
htsencoding.c instead of asserting the spec byte, so a future fix is not
blocked by a stale test.
Two threads locking the same mutex for the first time could both run the
unsynchronized lazy init, corrupting the underlying pthread mutex and aborting
or deadlocking. Build the object and publish it with a single atomic
compare-and-swap; threads that lose the race free the object they built. This
needs no statically-initializable guard, so it stays valid on Windows 2000.
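The build-then-publish pattern described above might look roughly like this. The sketch assumes C11 `<stdatomic.h>` and POSIX threads purely for illustration; the real code may use a different compare-and-swap primitive (it has to work on Windows as well), and `lazy_mutex` and `lazy_mutex_get` are hypothetical names.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical lazily-initialised mutex handle: NULL until first use. */
typedef _Atomic(pthread_mutex_t *) lazy_mutex;

/* Return the initialised mutex, creating it on first use.
   Returns NULL on allocation or init failure. */
static pthread_mutex_t *lazy_mutex_get(lazy_mutex *slot) {
  pthread_mutex_t *m = atomic_load(slot);
  if (m != NULL)
    return m;
  /* Build a private object first; no other thread can see it yet. */
  pthread_mutex_t *fresh = malloc(sizeof *fresh);
  if (fresh == NULL)
    return NULL;
  if (pthread_mutex_init(fresh, NULL) != 0) {
    free(fresh);
    return NULL;
  }
  /* Publish it only if the slot is still empty. */
  pthread_mutex_t *expected = NULL;
  if (atomic_compare_exchange_strong(slot, &expected, fresh))
    return fresh;
  /* Lost the race: discard ours and use the winner's, now in expected. */
  pthread_mutex_destroy(fresh);
  free(fresh);
  return expected;
}
```

Every caller ends up with the same fully initialised mutex, and no thread ever observes a half-built one.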
A long log path made the lock-file path overflow the fixed 256-byte n_lock
buffer, tripping the guarded copy and aborting with signal 6. Size n_lock to
the concat-buffer capacity so it holds any path fconcat can produce.
(cherry picked from commit 15144ffd24667712cca2ac0fee96bd355239eff6)
The default footer embeds the page URL inside an HTML comment. A URL
containing "-->" closed the comment and let an attacker inject script into
the mirrored page. Percent-encode < and > before the URL reaches the footer.
(cherry picked from commit 606883229244dc233d16915678e63cfa62000ff0)
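A minimal sketch of the escaping step described above, assuming a caller-supplied output buffer. `escape_angle_brackets` is a hypothetical helper name, not the function used in HTTrack.

```c
#include <stddef.h>

/* Copy url into dst, replacing '<' with "%3C" and '>' with "%3E".
   Returns 0 on success, -1 if dst (of dst_size bytes) is too small. */
static int escape_angle_brackets(const char *url, char *dst, size_t dst_size) {
  size_t j = 0;
  for (size_t i = 0; url[i] != '\0'; i++) {
    const char *rep = NULL;
    if (url[i] == '<')
      rep = "%3C";
    else if (url[i] == '>')
      rep = "%3E";
    size_t need = rep != NULL ? 3 : 1;
    if (j + need >= dst_size) /* keep room for the terminating NUL */
      return -1;
    if (rep != NULL) {
      dst[j++] = rep[0];
      dst[j++] = rep[1];
      dst[j++] = rep[2];
    } else {
      dst[j++] = url[i];
    }
  }
  if (j >= dst_size)
    return -1;
  dst[j] = '\0';
  return 0;
}
```

Once `>` is encoded, the sequence `-->` can no longer appear in the escaped URL, so the footer comment cannot be closed early.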
The bundled html/ and templates/ pages are the genuine upstream
documentation from the httrack.com website. lintian's long-line
heuristic flags them as missing source; they are the actual source.
- webhttrack: depend firmly on sensible-utils (it calls sensible-browser),
drop the missing-depends-on-sensible-utils override.
- copyright: point to /usr/share/common-licenses/GPL-3, not the GPL symlink.
- watch: use https and version=4.
- control: add Rules-Requires-Root: no and Vcs-Browser.
- strip trailing whitespace in control, rules and changelog.
libhtsjava and the libtest callback examples reach libc only through
libhttrack, so the linker drops the direct libc edge from DT_NEEDED.
lintian flags this as library-not-linked-against-libc. Force libc to be
recorded as a dependency and drop the now-redundant override.
cookie_cmp_wildcard_domain used an unsigned loop counter, so i >= 0 was always
true (infinite loop and out-of-bounds reads) and an empty domain underflowed
l - 1. Use a signed counter. Found and fixed by greenrd in #172. closes #171
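The pitfall above is easiest to see in a small backward scan. The helper below is hypothetical (`last_dot_index` is not the real cookie_cmp_wildcard_domain); it only illustrates why the counter has to be signed.

```c
#include <string.h>

/* Hypothetical helper: index of the last '.' in domain, or -1 if none. */
static int last_dot_index(const char *domain) {
  /* With a size_t counter, "i >= 0" is always true and "l - 1" wraps to
     SIZE_MAX for an empty string; a signed counter avoids both. */
  int l = (int) strlen(domain);
  for (int i = l - 1; i >= 0; i--) {
    if (domain[i] == '.')
      return i;
  }
  return -1;
}
```

For an empty string the loop body never runs, because `l - 1` is -1 rather than a huge unsigned value.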
They were empty automake stubs (GNU strictness requires the files to exist).
Pointing them at history.txt satisfies automake, drops the confusing empty
files, and ships a real changelog in the dist tarball without duplicating
content in git.
Build and test (autoreconf, configure, make, make check) on x86-64 and arm64
with gcc and clang. A lint job runs shellcheck and shfmt -i 4 on the maintained
scripts.
OpenSSL 3.0+ is Apache-2.0 (GPL-compatible) and LibreSSL is BSD, so the GPL
linking exception is no longer needed; httrack is now plain GPL-3.0-or-later.
license.txt now carries the verbatim GPLv3 (matching COPYING); the ethical-use
request moves to README. debian/copyright updated to match.
Replace the debian/compat file and the unversioned debhelper build-dep with
debhelper-compat (= 13), and drop the now-redundant dh-autoreconf and obsolete
autotools-dev build deps. Compat level is unchanged (13). Clears the
no-versioned-debhelper-prerequisite and useless-autoreconf-build-depends lintian
tags. Folded into the not-yet-uploaded 3.49.7-2 stanza.
No packaging changes required. The 4.7.0 normative items do not apply to
httrack: it ships no maintainer scripts (so the systemd config
diversion/alternatives rule is moot), no services or init scripts (so the
systemd-unit requirement is moot), and it is in main (so the contrib/non-free
no-network rules target rule is moot).
dcmd expands the .changes to its full file set (orig, dsc, debs, dbgsym
ddebs, buildinfo), replacing the hand-rolled copy loop that silently
dropped the dbgsym packages. need() now takes several tools at once;
drop the unused dpkg-parsechangelog check and require dcmd.
roche@proliant.localnet was a local hostname that leaked into a released entry;
lintian flags it as bogus-mail-host. Use xavier@debian.org like the other
entries.
Replaces an external workstation script. mkdeb.sh exports committed HEAD plus
the coucal submodule to a scratch dir, refreshes the build system and man page
(reusing make -C man regen-man), builds a clean upstream tarball, overlays
debian/, and runs debuild (build + lintian + signing). It takes the GPG key and
options as arguments and writes nothing in the working tree. 'make deb
DEB_FLAGS=...' is a thin wrapper. Honors SOURCE_DATE_EPOCH.
The external makeman.sh turned the first token of every indented --help line
into an option, so prose like the -%! warning rendered as bogus -IMPORTANT and
-USE options (Debian #1061053). man/makeman.sh classifies lines by indentation,
reads README from the source tree, and honors SOURCE_DATE_EPOCH.
'make -C man regen-man' refreshes the page; tests/02_manpage-regen.test fails
if the committed page drifts from --help.
Use TLS_client_method() and OpenSSL_version() on OpenSSL 1.1.0+ / LibreSSL
2.7.0+; the deprecated SSLv23/SSLeay init may be removed in OpenSSL 4.0.
Legacy path kept for older OpenSSL.
The IMPORTANT NOTE / USE IT lines used .IP \-... tags, so groff showed
them as -IMPORTANT and -USE options. Render them as continuation text of
the -%! description instead.
SIZEOF_LONG was the only config.h macro differing across architectures
(8 vs 4), which broke libhttrack-dev Multi-Arch: same co-installation.
md5.h was its only non-Windows user and now uses uint32_t from <stdint.h>.
Regenerated configure and config.h.in.
This checks the lengths of the file name, extra field, and comment
that would be put in the zip headers, and rejects them if they are
too long. They are each limited to 65535 bytes in length by the zip
format. This also avoids possible buffer overflows if the provided
fields are too long.
(cherry picked from commit 73331a6a0481067628f065ffe87bb1d8f787d10c)
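The length check described above amounts to comparing each field against the 16-bit limit before anything is written. A rough sketch, with a hypothetical function name and signature (`zip_fields_fit` is not the actual zip-library entry point):

```c
#include <stddef.h>
#include <string.h>

#define ZIP_FIELD_MAX 65535u /* header length fields are 16 bits wide */

/* Hypothetical check: return 1 if the name, extra field and comment all fit
   in the 16-bit length fields of a zip header, 0 otherwise. */
static int zip_fields_fit(const char *name, size_t extra_len,
                          const char *comment) {
  if (name != NULL && strlen(name) > ZIP_FIELD_MAX)
    return 0;
  if (extra_len > ZIP_FIELD_MAX)
    return 0;
  if (comment != NULL && strlen(comment) > ZIP_FIELD_MAX)
    return 0;
  return 1;
}
```

Rejecting oversized fields up front means no later code has to cope with a length that was silently truncated to 16 bits.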
Future compilers will not support implicit function declarations by
default, so add the additional #include directives for the appropriate
function prototypes.
Increasingly HTML5 sites use a number of mechanisms for responsive
image loading. Many of these mechanisms revolve around the use of
`data-src` and `data-srcset` attributes within an img tag to provide a
list of valid images to display, along with their relevant sizes (e.g.
https://github.com/aFarkas/lazysizes, https://github.com/malchata/yall.js).
This change adds the two previously mentioned attributes to the list in
`hts_detect[]`, resulting in httrack detecting them correctly.
Initially an attempt was made at using `hts_detectbeg[]` with the
potential of resolving #203. Unfortunately it became apparent that the
implementation of `hts_detectbeg[]` only supports a suffix of integers
(see `src/htstools.h#rech_tageqbegdigits` for more information).
The NDK headers nowadays declare timezone in time.h, so trying to redefine it
causes the build to fail with:
proxy/store.c:34:18: error: static declaration of 'timezone' follows non-static declaration
static long int timezone = 0;
^
include/time.h:42:17: note: previous declaration is here
extern long int timezone;
* closes: #53
Also fixed HTML-escaping issues inside webhttrack
Rationale: The webhttrack script wrongly assumed that once the "browse" command returned, the user had closed the browser window and the script had to kill the server itself. However, modern browsers tend to "attach" to an existing session (opening a new tab in an existing window, for example), so the browse command returns immediately and the script killed the server right away. I have rewritten this logic: the server now shuts itself down if the parent script dies, and also if the browsing client shows no activity for two minutes. "Activity" can be any browsed or refreshed page, or the internal "ping" iframe (which pings the server every 30 seconds). With this model, we *should* be compatible with both old and modern browsers.
Rationale: strncat(..., ..., (size_t) -1) misbehaves on Linux and is not equivalent to strcat(..., ...) when optimizations are enabled (possibly a corner-case bug).
Removed unused CXX
added the following default compiler flags:
-Wdeclaration-after-statement
-Wsequence-point
-Wparentheses
-Winit-self
-Wuninitialized
-Wformat
-fstrict-aliasing -Wstrict-aliasing=2
added the following default linker flags:
-Wl,--discard-all
-Wl,--no-undefined
This depends on autoconf-archive, because it uses AX_CHECK_COMPILE_FLAG and AX_CHECK_LINK_FLAG.
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at <roche@httrack.com>. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series of actions.
**Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.1, available at [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder][Mozilla CoC].
For answers to common questions about this code of conduct, see the FAQ at [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at [https://www.contributor-covenant.org/translations][translations].
*HTTrack* is an _offline browser_ utility that lets you download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer.
*HTTrack* arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online.
HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system.
*WinHTTrack* is the Windows 2000/XP/Vista/Seven release of HTTrack, and *WebHTTrack* the Linux/Unix/BSD release.
## Website
*Main Website:*
http://www.httrack.com/
## Compile trunk release
A git checkout ships only the autotools sources, so `./bootstrap` (which runs
`autoreconf`) regenerates `configure` first; this needs autoconf, automake and
libtool. Released tarballs already include `configure`, so building from a
for dir in "$(DESTDIR)$(HelpHtmldir)" "$(DESTDIR)$(HelpHtmlTxtdir)" "$(DESTDIR)$(HelpHtmldivdir)" "$(DESTDIR)$(HelpHtmlimagesdir)" "$(DESTDIR)$(HelpHtmlimgdir)" "$(DESTDIR)$(HelpHtmlrootdir)" "$(DESTDIR)$(VFolderEntrydir)" "$(DESTDIR)$(WebHtmldir)" "$(DESTDIR)$(WebHtmlimagesdir)" "$(DESTDIR)$(WebHtmlsfxdir)" "$(DESTDIR)$(WebIcon16x16dir)" "$(DESTDIR)$(WebIcon32x32dir)" "$(DESTDIR)$(WebIcon48x48dir)" "$(DESTDIR)$(WebPixmapdir)"; do \
It will not follow links to other websites, because this behaviour might cause to capture the Web entirely!<br>
It will not follow links located in higher directories, too (for example, <tt>www.someweb.com/gallery/flowers/</tt> itself) because this
It will not follow links located in higher directories, too (for example, <tt>www.example.com/gallery/flowers/</tt> itself) because this
might cause to capture too much data.<br>
<br>
This is the <b><u>default behaviour</b></u> of HTTrack, BUT, of course, if you want, you can tell HTTrack to capture other directorie(s), website(s)!..
<br>
In our example, we might want also to capture all links in <tt>www.someweb.com/gallery/trees/</tt>, and in <tt>www.someweb.com/photos/</tt><br>
In our example, we might want also to capture all links in <tt>www.example.com/gallery/trees/</tt>, and in <tt>www.example.com/photos/</tt><br>
<br>
This can easily done by using filters: go to the Option panel, select the 'Scan rules' tab, and enter this line:
(you can leave a blank space between each rules, instead of entering a carriage return)<br>
<tt>+www.someweb.com/gallery/trees/*<br>
+www.someweb.com/photos/*</tt><br>
<tt>+www.example.com/gallery/trees/*<br>
+www.example.com/photos/*</tt><br>
<br>
This means "accept all links begining with <tt>www.someweb.com/gallery/trees/</tt> and <tt>www.someweb.com/photos/</tt>"
This means "accept all links begining with <tt>www.example.com/gallery/trees/</tt> and <tt>www.example.com/photos/</tt>"
- the <tt>+</tt> means "accept" and the final <tt>*</tt> means "any character will match after the previous ones".
Remember the <tt>*.doc</tt> or <tt>*.zip</tt> encountered when you want to select all files from a certain type on your computer:
it is almost the same here, except the begining "+"<br>
<br>
Now, we might want to exclude all links in <tt>www.someweb.com/gallery/trees/hugetrees/</tt>, because with the previous filter,
Now, we might want to exclude all links in <tt>www.example.com/gallery/trees/hugetrees/</tt>, because with the previous filter,
we accepted too many files. Here again, you can add a filter rule to refuse these links. Modify the previous filters to:<br>
You have noticed the <tt>-</tt> in the begining of the third rule: this means "refuse links matching the rule"
; and the rule is "any files begining with <tt>www.someweb.com/gallery/trees/hugetrees/</tt><br>
; and the rule is "any files begining with <tt>www.example.com/gallery/trees/hugetrees/</tt><br>
Voila! With these three rules, you have precisely defined what you wanted to capture.<br>
<br>
A more complex example?<br>
<br>
Imagine that you want to accept all jpg files (files with .jpg type) that have "blue" in the name and located in www.someweb.com<br>
<tt>+www.someweb.com/*blue*.jpg</tt><br>
Imagine that you want to accept all jpg files (files with .jpg type) that have "blue" in the name and located in www.example.com<br>
<tt>+www.example.com/*blue*.jpg</tt><br>
<br>
More detailed information can be found <a href="filters.html">here</a>!<br>
<br>
@@ -411,7 +411,7 @@ A: <em>Yes. It is called WebHTTrack. See the download section at <a href="http:/
<a NAME="Q0">Q: <strong>Some sites are captured very well, other aren't. Why?</strong><br>
A: <em>
There are several reasons (and solutions) for a mirror to fail. Reading the log files (ans this FAQ!) is generally a VERY good idea to figure out what occured.
There are several reasons (and solutions) for a mirror to fail. Reading the log files (ans this FAQ!) is generally a VERY good idea to figure out what occurred.
<ul>
<li>Links within the site refers to external links, or links located in another (or upper) directories, not captured by default - the use of filters is generally THE solution, as this is one of the powerful option in HTTrack. <u>See the above questions/answers</u>.</li>
@@ -440,7 +440,7 @@ This will cause a performance loss, but will increase the compatibility with som
<a NAME="QT1">Q: <strong>Only the first page is caught. What's wrong?</a></strong></br>
A: <em>First, check the <tt>hts-log.txt</tt> file (and/or <tt>hts-err.txt</tt> error log file) - this can give you precious information.<br>
The problem can be a website that redirects you to another site (for example, <tt>www.someweb.com</tt> to <tt>public.someweb.com</tt>) :
The problem can be a website that redirects you to another site (for example, <tt>www.example.com</tt> to <tt>public.example.com</tt>) :
in this case, use filters to accept this site<br>
This can be, also, a problem in the HTTrack options (link depth too low, for example)</em>
@@ -485,10 +485,10 @@ You may also want to capture files that are forbidden by default by the <a href=
In these cases, HTTrack does not capture these links automatically, you have to tell it to do so.
<br><br>
<ul><li>Either use the <a href="filters.html">filters</a>.<br>
Example: You are downloading <tt>http://www.someweb.com/foo/</tt> and can not get .jpg images located
in <tt>http://www.someweb.com/bar/</tt> (for example, http://www.someweb.com/bar/blue.jpg)<br>
Then, add the filter rule <tt>+www.someweb.com/bar/*.jpg</tt> to accept all .jpg files from this location<br>
You can, also, accept all files from the /bar folder with <tt>+www.someweb.com/bar/*</tt>, or only html files with <tt>+www.someweb.com/bar/*.html</tt> and so on..<br><br>
Example: You are downloading <tt>http://www.example.com/foo/</tt> and can not get .jpg images located
in <tt>http://www.example.com/bar/</tt> (for example, http://www.example.com/bar/blue.jpg)<br>
Then, add the filter rule <tt>+www.example.com/bar/*.jpg</tt> to accept all .jpg files from this location<br>
You can, also, accept all files from the /bar folder with <tt>+www.example.com/bar/*</tt>, or only html files with <tt>+www.example.com/bar/*.html</tt> and so on..<br><br>
</li><li>
If the problems are related to robots.txt rules, that do not let you access some folders (check in the logs if you are not sure),
you may want to disable the default robots.txt rules in the options. (but only disable this option with great care,
@@ -509,8 +509,8 @@ and rescan the website as described before. HTTrack will be obliged to recatch t
<a NAME="Q1bb">Q: <strong>FTP links are not caught! What's happening?</strong><br>
A: <em>FTP files might be seen as external links, especially if they are located in outside domain. You have either to accept all external links (See the links options, -n option) or
only specific files (see <a href="filters.html">filters</a> section). <br>
Example: You are downloading <tt>http://www.someweb.com/foo/</tt> and can not get ftp://ftp.someweb.com files<br>
Then, add the filter rule <tt>+ftp.someweb.com/*</tt> to accept all files from this (ftp) location<br>
Example: You are downloading <tt>http://www.example.com/foo/</tt> and can not get ftp://ftp.example.com files<br>
Then, add the filter rule <tt>+ftp.example.com/*</tt> to accept all files from this (ftp) location<br>
</em>
<br>
@@ -551,10 +551,10 @@ Note: In some rare cases, duplicate data files can be found when the website red
<a NAME="Q1b2">Q: <strong>I'm downloading too many files! What can I do?</strong><br>
A: <em>This is often the case when you use too large a filter, for example <tt>+*.html</tt>, which asks the
engine to catch all .html pages (even ones on other sites!). In this case, try to use more specific filters, like <tt>+www.someweb.com/specificfolder/*.html</tt><br>
If you still have too many files, use filters to avoid somes files. For example, if you have too many files from www.someweb.com/big/,
use <tt>-www.someweb.com/big/*</tt> to avoid all files from this folder. Remember that the default behaviour of the engine, when
mirroring http://www.someweb.com/big/index.html, is to catch everything in http://www.someweb.com/big/. Filters are your friends,
engine to catch all .html pages (even ones on other sites!). In this case, try to use more specific filters, like <tt>+www.example.com/specificfolder/*.html</tt><br>
If you still have too many files, use filters to avoid somes files. For example, if you have too many files from www.example.com/big/,
use <tt>-www.example.com/big/*</tt> to avoid all files from this folder. Remember that the default behaviour of the engine, when
mirroring http://www.example.com/big/index.html, is to catch everything in http://www.example.com/big/. Filters are your friends,
@@ -207,7 +207,7 @@ Below the list of callbacks, and associated external wrappers.
<tr><td background="img/fade.gif"><i>receivehead</i></td><td background="img/fade.gif">Called when HTTP headers are recevived from the remote server. The <tt>buff</tt> buffer contains text headers, <tt>adr</tt> and <tt>fil</tt> the URL, and <tt>referer_adr</tt> and <tt>referer_fil</tt> the referer URL. The <tt>incoming</tt> structure contains all information related to the current slot.<br>return value: 1 if the mirror can continue, 0 if the mirror must be aborted</td><td background="img/fade.gif"><tt>int mycallback(t_hts_callbackarg *carg, httrackp* opt, char* buff, const char* adr, const char* fil, const char* referer_adr, const char* referer_fil, htsblk* incoming);</tt></td></tr>
<tr><td background="img/fade.gif"><i>detect</i></td><td background="img/fade.gif">Called when an unknown document is to be parsed. The <tt>str</tt> structure contains all information related to the document.<br>return value: 1 if the type is known and can be parsed, 0 if the document type is unknown</td><td background="img/fade.gif"><tt>int mycallback(t_hts_callbackarg *carg, httrackp* opt, htsmoduleStruct* str);</tt></td></tr>
<tr><td background="img/fade.gif"><i>parse</i></td><td background="img/fade.gif">The <tt>str</tt> structure contains all information related to the document.<br>return value: 1 if the document was successfully parsed, 0 if an error occured</td><td background="img/fade.gif"><tt>int mycallback(t_hts_callbackarg *carg, httrackp* opt, htsmoduleStruct* str);</tt></td></tr>
<tr><td background="img/fade.gif"><i>parse</i></td><td background="img/fade.gif">The <tt>str</tt> structure contains all information related to the document.<br>return value: 1 if the document was successfully parsed, 0 if an error occurred</td><td background="img/fade.gif"><tt>int mycallback(t_hts_callbackarg *carg, httrackp* opt, htsmoduleStruct* str);</tt></td></tr>
Some files were not shown because too many files have changed in this diff