Compare commits

...

8 Commits

Author SHA1 Message Date
Xavier Roche
02180549f6 Don't read an uninitialized buffer on an empty Content-Type
treathead() parses the Content-Type value with sscanf("%s") into a local
`tempo` buffer, then calls strlen(tempo) and stores the result. A response
whose Content-Type header has an empty or whitespace-only value yields no
token: sscanf leaves `tempo` uninitialized, so strlen reads uninitialized
stack and can over-read past the buffer. A hostile server triggers this with
a bare `Content-Type:` line.

Guard on sscanf's return: adopt the value, and mark the type as server-given,
only when a token was actually read. An empty value now falls back to the
default type with contenttype_given left false, i.e. it is treated like a
missing header and the URL extension is kept -- which is also the correct
naming behavior.

Found while reviewing #409, which added contenttype_given right beside this
parse; the bug itself predates it. tests/17_local-empty-ct.test exercises the
empty-Content-Type path, and the ASan/UBSan CI job is what catches the
uninitialized read.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 20:48:55 +02:00
Xavier Roche
1611dbcabf Trust a declared Content-Type over a binary URL extension (#409)
PR #408 stopped a bogus or missing html-ish wire type from clobbering a URL
extension that maps to a specific non-HTML type (the #267 mangle). But it
treated an explicitly declared text/html the same as a missing type, so a
binary-looking URL that legitimately serves HTML, such as a login or error
interstitial or a soft-404 at a .pdf or .jpg link, was saved under the binary
extension with HTML inside and would not render locally.

The response body is the only true discriminator, but under the default delayed
type check the save name is committed from the headers while the body is still
downloading, so it cannot be sniffed at naming time. Instead, keep the URL
extension only when the server sent no Content-Type at all (a missing header is
defaulted to text/html upstream and must not be trusted); an explicitly declared
type, even text/html, now wins. This trades the rare case of a real binary
explicitly mislabeled text/html (now named .html) for the common interstitial
and soft-404 case.

Whether a Content-Type header was actually received cannot be recovered after
parsing, since treatfirstline defaults a missing header to text/html, so it is
recorded as a new hts_boolean contenttype_given on htsblk. That grows the
installed struct, an incompatible ABI change: soname bumped 3 -> 4, and the
Debian runtime package renamed libhttrack3 -> libhttrack4 to match.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:18:16 +02:00
Xavier Roche
099501ee50 Make lintian actually gate the Debian package build (#410)
The deb CI job and mkdeb.sh ran lintian via debuild with
--fail-on=error,warning and were believed to gate on it. They did not:
debuild only reports lintian, it does not propagate lintian's exit status,
so a package that lintian flags with errors or warnings still built green.
This was demonstrated by a SONAME bump landing without the matching
libhttrackN package rename: lintian emitted shared-library-is-multi-arch-foreign
and package-name-doesnt-match-sonames, yet the job passed.

Disable debuild's lintian run and run lintian ourselves on the produced
.changes, under set -e, so any error or warning fails the build. Two CI-only
adjustments keep a clean package green: --profile debian, because the Ubuntu
runners' vendor data would otherwise reject the Debian "unstable" distribution,
and --suppress-tags newer-standards-version, which only reflects the runner's
lintian being older than the buildds'. The long-standing script-not-executable
hint on the sample search.sh gets an override.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:13:12 +02:00
Xavier Roche
1b9eefa3b4 Merge pull request #408 from xroche/fix/267-delayed-ext-mangle
Stop mangling saved-file extensions under the default delayed type check
2026-06-20 18:28:27 +02:00
Xavier Roche
9c8d3a41eb tests: tighten the type-matrix guards
Add two assertions surfaced by review of the override path: control.php
must not survive its rename to control.html (a dual-write regression
would leave both), and gen.php?id=5 (a query/extension-less URL served
image/png) must keep its .png and not be mangled to .html. Both exercise
the "override still fires" direction that the suppression cases don't.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:25:45 +02:00
Xavier Roche
ae77cd9d6d Honor --assume under the default delayed type check (-%N2)
Under HARD savename-delayed (the default), url_savename() forced
is_html=-1 before consulting the user's --assume rules, so a type the
user pinned was lost to the delayed name and never applied (#56). Skip
the forced delay when is_userknowntype() matches: ishtml() already
consults the user type, so the immediate naming path applies it. Files
with no --assume rule are unaffected -- is_userknowntype() is false and
the delay still fires.

tests/16_local-assume.test crawls a .png served as image/png but assumed
text/html and checks it is saved .html; it fails without this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:12:01 +02:00
Xavier Roche
51b8dcd81c Keep a known URL extension against a bogus html/empty Content-Type
Under the default delayed type check (-%N2), url_savename() rewrote a
saved file's extension from the wire Content-Type, gated only by
!may_unknown2(). text/html is not in the keep-list, so a response
labeled text/html -- or a typeless one, which is coerced to text/html --
clobbered the URL's own extension: a PNG served as text/html or with no
Content-Type was saved as .html, and .htm was normalized to .html (#29).
The bytes stayed intact; only the name was silently wrong.

wire_patches_ext() now lets the wire type override the extension only
when the type is patchable and doing so would not clobber a URL
extension that already maps to a specific, non-HTML type. A generator or
extension-less URL still becomes .html; a .png stays .png.

tests/15_local-types.test locks this with a deterministic offline crawl
of a content-type/extension matrix (tests/local-server.py); it fails on
the unfixed engine. Addresses the #267 mangle family (incl. #29).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:07:08 +02:00
Xavier Roche
bcce664143 Merge pull request #364 from xroche/feature/local-test-server
tests: offline local test server prototype (cookies + HTTPS)
2026-06-20 16:41:26 +02:00
18 changed files with 221 additions and 26 deletions

View File

@@ -227,7 +227,8 @@ jobs:
# Validate the Debian packaging via the same script maintainers release with.
# One amd64/gcc run is enough: packaging (control/rules/manifest/lintian/quilt
# source build) is arch- and compiler-independent, and the build matrix above
# already covers compile portability. lintian runs with --fail-on=error.
# already covers compile portability. mkdeb.sh runs lintian as an explicit gate
# (debuild does not propagate lintian's exit) with --fail-on=error,warning.
deb:
name: deb package (lintian)
runs-on: ubuntu-24.04

View File

@@ -29,9 +29,9 @@ AC_CONFIG_SRCDIR(src/httrack.c)
AC_CONFIG_MACRO_DIR([m4])
AC_CONFIG_HEADERS(config.h)
AM_INIT_AUTOMAKE([subdir-objects])
# 3:0:0: htsblk layout changed (contenttype/charset/contentencoding widened to
# 128), an incompatible ABI break, so bump current and reset revision/age.
VERSION_INFO="3:0:0"
# 4:0:0: htsblk gained the contenttype_given field, an incompatible ABI break,
# so bump current and reset revision/age.
VERSION_INFO="4:0:0"
AM_MAINTAINER_MODE
AC_USE_SYSTEM_EXTENSIONS

10
debian/changelog vendored
View File

@@ -1,3 +1,13 @@
httrack (3.49.8-3) unstable; urgency=medium
* Rename libhttrack3 to libhttrack4 to follow the SONAME bump to
libhttrack.so.4: htsblk gained a contenttype_given field, an
incompatible ABI change (VERSION_INFO 3 -> 4). The .files wildcard
now tracks .so.4* so the runtime libraries land in the right
package. New binary package, via NEW.
-- Xavier Roche <xavier@debian.org> Sat, 20 Jun 2026 19:46:16 +0200
httrack (3.49.8-2) unstable; urgency=medium
* Rename libhttrack2 to libhttrack3 to follow the SONAME, which the 3.49.8

6
debian/control vendored
View File

@@ -58,13 +58,13 @@ Description: webhttrack common files
This package is the common files of webhttrack, website copier and
mirroring utility
Package: libhttrack3
Package: libhttrack4
Architecture: any
Multi-Arch: same
Section: libs
Depends: ${misc:Depends}, ${shlibs:Depends}
Replaces: libhttrack2, httrack (<< 3.49.8-2~)
Breaks: libhttrack2, httrack (<< 3.49.8-2~)
Replaces: libhttrack3, httrack (<< 3.49.8-3~)
Breaks: libhttrack3, httrack (<< 3.49.8-3~)
Description: Httrack website copier library
This package is the library part of httrack, website copier and mirroring
utility

View File

@@ -4,3 +4,6 @@
# so the path lives in the display pointer, not the override -- match with '*'.
httrack-doc: extra-license-file *
httrack-doc: package-contains-documentation-outside-usr-share-doc *
# search.sh is a sample CGI shipped alongside the HTML manual, not meant to be
# run from the package tree; it stays non-executable by design.
httrack-doc: script-not-executable *

View File

@@ -1,3 +0,0 @@
usr/lib/*/libhttrack.so.3*
usr/lib/*/libhtsjava.so.3*
usr/share/httrack/templates

3
debian/libhttrack4.files vendored Normal file
View File

@@ -0,0 +1,3 @@
usr/lib/*/libhttrack.so.4*
usr/lib/*/libhtsjava.so.4*
usr/share/httrack/templates

View File

@@ -1,3 +1,3 @@
# The shared libraries ship without a versioned symbols control file (ABI is
# tracked via the SONAME plus a >= upstream-version dependency, see debian/rules).
libhttrack3: no-symbols-control-file usr/lib/*
libhttrack4: no-symbols-control-file usr/lib/*

2
debian/rules vendored
View File

@@ -135,7 +135,7 @@ binary-arch: build install
dh_makeshlibs -a -X/usr/lib/$(DEB_HOST_MULTIARCH)/httrack/libtest --version-info
dh_installdeb -a
# we depend on the current version (ABI may change)
dh_shlibdeps -a -ldebian/libhttrack3/usr/lib/$(DEB_HOST_MULTIARCH)
dh_shlibdeps -a -ldebian/libhttrack4/usr/lib/$(DEB_HOST_MULTIARCH)
dh_gencontrol -a
dh_md5sums -a
dh_builddeb -a

View File

@@ -1396,6 +1396,8 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
void treatfirstline(htsblk * retour, const char *rcvd) {
const char *a = rcvd;
retour->contenttype_given = HTS_FALSE; /* set when a Content-Type is seen */
// exemple:
// HTTP/1.0 200 OK
if (*a) {
@@ -1589,11 +1591,17 @@ void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * ret
}
}
}
sscanf(rcvd + p, "%s", tempo);
if (strlen(tempo) < sizeof(retour->contenttype) - 2) // pas trop long!!
strcpybuff(retour->contenttype, tempo);
else
strcpybuff(retour->contenttype, "application/octet-stream-unknown"); // erreur
// An empty/whitespace Content-Type value yields no token; keep the
// default type and the "not given" flag instead of reading uninit tempo.
if (sscanf(rcvd + p, "%s", tempo) == 1) {
if (strlen(tempo) < sizeof(retour->contenttype) - 2) // pas trop long!!
strcpybuff(retour->contenttype, tempo);
else
strcpybuff(retour->contenttype,
"application/octet-stream-unknown"); // erreur
retour->contenttype_given =
HTS_TRUE; /* server declared a usable type */
}
}
} else if ((p = strfield(rcvd, "Content-Range:")) != 0) {
// Content-Range: bytes 0-70870/70871

View File

@@ -138,6 +138,34 @@ static void cleanEndingSpaceOrDot(char *s) {
}
}
/* Should the wire Content-Type override the URL's own extension when naming the
saved file? True when the type is patchable (may_unknown2) and either the URL
extension implies no specific type or the server declared a disagreeing one.
A URL extension mapping to a specific non-HTML type is kept only when the
server sent NO Content-Type (the #267 mangle guard): a typeless .png stays
.png, but a .pdf explicitly served as text/html is named .html. */
static int wire_patches_ext(httrackp *opt, const char *wiremime,
const char *file, int contenttype_given) {
char urlmime[256];
if (may_unknown2(opt, wiremime, file))
return 0; /* type kept verbatim (keep-list / bogus-multiple) */
urlmime[0] = '\0';
/* type implied by the URL extension, only when confidently known (flag 0) */
if (!get_httptype_sized(opt, urlmime, sizeof(urlmime), file, 0))
return 1; /* URL ext implies no known type: trust the wire type */
if (strfield2(wiremime, urlmime))
return 0; /* wire agrees with the ext: keep it (no .htm->.html churn) */
/* wire disagrees with a specific non-HTML URL ext. Keep the ext only when
the server sent NO Content-Type: a missing type is defaulted to text/html
upstream and must not clobber e.g. a .png. An explicitly declared type is
trusted, so a binary-looking URL that really serves HTML (login/error
interstitial, soft-404) is named .html instead of kept as .pdf/.jpg. */
if (!is_hypertext_mime(opt, urlmime, file) && !contenttype_given)
return 0;
return 1;
}
// forme le nom du fichier à sauver (save) à partir de fil et adr
// système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html)
int url_savename(lien_adrfilsave *const afs,
@@ -325,7 +353,10 @@ int url_savename(lien_adrfilsave *const afs,
}
/* replace shtml to html.. */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD)
/* HARD delays every type, except one the user pinned with --assume: honor it
immediately (ishtml() consults the user type), no delayed name (#56) */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
!is_userknowntype(opt, fil))
is_html = -1; /* ALWAYS delay type */
else
is_html = ishtml(opt, fil);
@@ -380,7 +411,8 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, r.cdispo);
} else if (!may_unknown2(opt, r.contenttype, fil)) { // on peut patcher à priori?
} else if (wire_patches_ext(opt, r.contenttype, fil,
r.contenttype_given)) {
if (give_mimext(s, sizeof(s),
r.contenttype)) { // recognized extension
ext_chg = 1;
@@ -425,7 +457,9 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(headers->r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, headers->r.cdispo);
} else if (!may_unknown2(opt, headers->r.contenttype, headers->url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type)
} else if (wire_patches_ext(opt, headers->r.contenttype,
headers->url_fil,
headers->r.contenttype_given)) {
char s[16];
if (give_mimext(
s, sizeof(s),
@@ -653,7 +687,9 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(back[b].r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, back[b].r.cdispo);
} else if (!may_unknown2(opt, back[b].r.contenttype, back[b].url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type)
} else if (wire_patches_ext(opt, back[b].r.contenttype,
back[b].url_fil,
back[b].r.contenttype_given)) {
if (give_mimext(
s, sizeof(s),
back[b].r.contenttype)) { // recognized extension

View File

@@ -651,6 +651,8 @@ struct htsblk {
int debugid; /**< connection debug id */
/* */
htsrequest req; /**< parameters used for the request */
/* a Content-Type header was received (else contenttype holds a default) */
hts_boolean contenttype_given;
/*char digest[32+2]; // md5 digest generated by the engine ("" if none) */
};

25
tests/15_local-types.test Normal file
View File

@@ -0,0 +1,25 @@
#!/bin/bash
#
# Content-Type vs URL-extension naming (issue #267 family) under the default
# delayed type check (-%N2). Policy: a MISSING Content-Type must not clobber a
# URL extension that maps to a specific non-HTML type (.png/.pdf stay as-is);
# an explicitly DECLARED type is trusted, so a binary-looking URL that really
# serves HTML (text/html on .pdf/.jpg) is named .html. The "wrong" names are
# asserted absent so a regression in either direction fails here.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/notype.png' \
--found 'types/notype.pdf' --not-found 'types/notype.html' \
--found 'types/photo.png' \
--found 'types/doc.pdf' \
--found 'types/lie.html' --not-found 'types/lie.png' \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/page.htm' --not-found 'types/page.html' \
--found 'types/script.js' \
--found 'types/style.css' \
--found 'types/data.json' \
--found 'types/control.html' --not-found 'types/control.php' \
--found 'types/gend61c.png' --not-found 'types/gend61c.html' \
httrack 'BASEURL/types/index.html'

View File

@@ -0,0 +1,11 @@
#!/bin/bash
#
# --assume under the default delayed type check (-%N2), issue #56. A user type
# pinned with --assume must be honored immediately, not lost to the delayed
# name: photo.png served as image/png but assumed text/html is saved as .html.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/photo.html' --not-found 'types/photo.png' \
httrack 'BASEURL/types/photo.png' --assume png=text/html

View File

@@ -0,0 +1,12 @@
#!/bin/bash
#
# An empty "Content-Type:" header value must be treated as "no usable type"
# (keep the URL extension), not parsed from an uninitialized buffer. The crawl
# also runs under ASan/UBSan in CI, which catches the uninitialized read this
# guards against.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/emptyct.png' --not-found 'types/emptyct.html' \
httrack 'BASEURL/types/index.html'

View File

@@ -51,6 +51,9 @@ TESTS = \
12_crawl_https.test \
13_crawl_proxy_https.test \
13_local-cookies.test \
14_local-https.test
14_local-https.test \
15_local-types.test \
16_local-assume.test \
17_local-empty-ct.test
CLEANFILES = check-network_sh.cache

View File

@@ -118,11 +118,83 @@ class Handler(SimpleHTTPRequestHandler):
if self.command != "HEAD":
self.wfile.write(body)
# --- type/extension matrix (issue #267 family) -------------------------
def send_raw(self, body, content_type):
"""Send a raw body with an explicit Content-Type, or none at all when
content_type is None (to observe httrack's typeless-file naming)."""
self.send_response(200)
if content_type is not None:
self.send_header("Content-Type", content_type)
self.send_header("Content-Length", str(len(body)))
self.end_headers()
if self.command != "HEAD":
self.wfile.write(body)
# Fake-binary blobs for the image/pdf/typeless cases.
FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64
FAKE_PDF = b"%PDF-1.4\n" + b"\x00" * 64
# path -> (body, content_type); None sends no header, "" sends an empty
# Content-Type value (no usable type, must be treated like None).
TYPE_MATRIX = {
"/types/control.php": (b"<html><body>control</body></html>", "text/html"),
"/types/photo.png": (FAKE_PNG, "image/png"),
"/types/doc.pdf": (FAKE_PDF, "application/pdf"),
"/types/notype.png": (FAKE_PNG, None),
"/types/notype.pdf": (FAKE_PDF, None),
"/types/emptyct.png": (FAKE_PNG, ""),
"/types/lie.png": (FAKE_PNG, "text/html"),
"/types/report.pdf": (b"<html><body>real page</body></html>", "text/html"),
"/types/page.htm": (b"<html><body>htm page</body></html>", "text/html"),
"/types/script.js": (b"var x = 1;\n", "application/javascript"),
"/types/style.css": (b"body { color: red; }\n", "text/css"),
"/types/data.json": (b'{"k": "v"}\n', "application/json"),
"/types/gen.php": (FAKE_PNG, "image/png"),
}
def route_types_index(self):
body = (
'\t<a href="control.php">control</a>\n'
'\t<img src="photo.png" />\n'
'\t<a href="doc.pdf">doc</a>\n'
'\t<img src="notype.png" />\n'
'\t<a href="notype.pdf">notypepdf</a>\n'
'\t<img src="emptyct.png" />\n'
'\t<img src="lie.png" />\n'
'\t<a href="report.pdf">report</a>\n'
'\t<a href="page.htm">htm</a>\n'
'\t<script src="script.js"></script>\n'
'\t<link rel="stylesheet" href="style.css" />\n'
'\t<a href="data.json">json</a>\n'
'\t<img src="gen.php?id=5" />\n'
)
self.send_html(body)
def route_types(self):
path = urlsplit(self.path).path
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
"/cookies/third.php": route_third,
"/robots.txt": route_robots,
"/types/index.html": route_types_index,
"/types/control.php": route_types,
"/types/photo.png": route_types,
"/types/doc.pdf": route_types,
"/types/notype.png": route_types,
"/types/notype.pdf": route_types,
"/types/emptyct.png": route_types,
"/types/lie.png": route_types,
"/types/report.pdf": route_types,
"/types/page.htm": route_types,
"/types/script.js": route_types,
"/types/style.css": route_types,
"/types/data.json": route_types,
"/types/gen.php": route_types,
}
# --- dispatch ----------------------------------------------------------

View File

@@ -206,9 +206,10 @@ main() {
cp -a "$export_dir/debian" "httrack-$ver/debian"
)
# Build (debuild also runs lintian and signs). --fail-on aborts on a lintian
# error or warning, so neither a release nor CI produces an unclean package.
local -a debuild_opts=(--lintian-opts -I -i "--fail-on=error,warning")
# Build and sign. debuild runs lintian too but does NOT propagate its exit
# status, so a broken package would pass unnoticed; disable it here and run
# lintian ourselves below as the real gate.
local -a debuild_opts=(--no-lintian)
local -a build_opts=()
[[ $source_only -eq 1 ]] && build_opts+=(-S)
if [[ $unsigned -eq 1 ]]; then
@@ -219,7 +220,8 @@ main() {
info "building packages with debuild"
(
cd "$scratch/httrack-$ver"
debuild "${build_opts[@]}" "${debuild_opts[@]}"
# debuild options (--no-lintian) must precede the dpkg-buildpackage ones
debuild "${debuild_opts[@]}" "${build_opts[@]}"
)
# Collect every file the .changes references (orig, dsc, debs, ddebs, buildinfo).
@@ -229,6 +231,16 @@ main() {
changes=("$scratch"/*.changes)
shopt -u nullglob
[[ ${#changes[@]} -ge 1 ]] || die "debuild produced no .changes file"
# The real lintian gate (debuild only reports, it does not fail on tags).
# --profile debian: CI runners are Ubuntu, whose vendor data would wrongly
# reject the Debian "unstable" distribution. newer-standards-version only
# means the local lintian is older than the buildds', not a package
# defect, so suppress it. set -e turns any error/warning tag into a failure.
info "running lintian gate (--fail-on=error,warning)"
lintian --profile debian -I -i --fail-on=error,warning \
--suppress-tags newer-standards-version "${changes[@]}"
dcmd cp -- "${changes[@]}" "$outdir/"
# Clean-room build gate: rebuild the source package in a minimal chroot that