Compare commits

1 Commits

Author: Xavier Roche
SHA1: fc50b4a0e6
Message: Use an unknown/unknown sentinel for an absent Content-Type

#409 distinguished "the server declared text/html" from "no Content-Type,
defaulted to text/html" with a new htsblk.contenttype_given flag, so a
binary-looking URL that really serves HTML is saved .html while a typeless
response keeps its URL extension. That worked on a fresh crawl but had two
costs: the flag was never persisted, so on --update the cache read it as
unset and the names reverted (report.html became report.pdf again, and the
two passes disagreed); and it was an installed-struct ABI break (soname 4,
libhttrack4).

Replace the flag with a sentinel: when no Content-Type is received, store
"unknown/unknown" as the type instead of text/html. The sentinel is treated
as html for every type test (added to is_html_mime_type), so parsing,
storage and filtering of a typeless response are unchanged; only the naming
code (wire_patches_ext) reads it as "no declared type" and keeps the URL
extension. Because the type string rides the cache, an update reads the same
sentinel and names consistently -- the revert is fixed at the source.

The sentinel never reaches a consumer as a real type: a single helper,
hts_effective_mime(), maps it back to text/html wherever a stored type is
derived (give_mimext) or emitted/persisted -- the httrack stdout serve, the
ProxyTrack live serve, and the ProxyTrack .arc export (both the replayed
response header and the index record). The .arc export was caught by an
adversarial spill audit; without the map a typeless page archived via
proxytrack would carry Content-Type: unknown/unknown.
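
The mapping described above can be sketched as follows. This is a minimal illustration, not the HTTrack source: is_html_mime_type and hts_effective_mime are named in this message, but the signatures and bodies shown here are assumptions, and HTS_MIME_UNKNOWN and keeps_url_extension are hypothetical names introduced for the example.

```c
#include <string.h>

/* Hypothetical constant name; the sentinel string is the one named above. */
#define HTS_MIME_UNKNOWN "unknown/unknown"

/* Assumed shape of the type test: the sentinel counts as html.
   (Simplified: the real test may accept more types.) */
static int is_html_mime_type(const char *mime) {
  return strcmp(mime, "text/html") == 0 || strcmp(mime, HTS_MIME_UNKNOWN) == 0;
}

/* Assumed shape of the helper: map the sentinel back to text/html before a
   type is derived or handed to a consumer; other types pass through. */
static const char *hts_effective_mime(const char *stored) {
  return strcmp(stored, HTS_MIME_UNKNOWN) == 0 ? "text/html" : stored;
}

/* Hypothetical naming-side check: only here does the sentinel mean
   "no declared type", so the URL extension is kept. */
static int keeps_url_extension(const char *stored) {
  return strcmp(stored, HTS_MIME_UNKNOWN) == 0;
}
```

In this sketch a typeless response is parsed as html, is served as text/html, and still keeps its URL extension when named, which is the split of responsibilities the message describes.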

Since the sentinel makes contenttype_given unnecessary, #409's ABI break is
undone: the field is removed, soname returns to 3, and the Debian package
reverts libhttrack4 -> libhttrack3. soname 4 was never released (Debian NEW
carries libhttrack3), so this re-aligns master with the archive rather than
flip-flopping anything downstream.

Tests: 18_local-update re-mirrors and asserts the names survive the update
pass; 15_local-types gains a notype.html negative control; 17_local-empty-ct
stays green. Full make check: 27 pass, 0 fail.

One accepted behavior change: a mime filter matching exactly text/html no
longer matches a typeless response (its type is the sentinel, html-ish but
not literally text/html); the response is still parsed and crawled as html.
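
A small sketch of why the exact-match filter stops matching, assuming the filter compares against the stored type string. Plain string equality stands in for the real filter matcher, and both function names are hypothetical.

```c
#include <string.h>

/* Hypothetical stand-in for a mime filter that matches one exact type. */
static int mime_filter_matches_exact(const char *pattern, const char *stored_type) {
  return strcmp(pattern, stored_type) == 0;
}

/* Hypothetical html test that also accepts the sentinel, as described above. */
static int treated_as_html(const char *stored_type) {
  return strcmp(stored_type, "text/html") == 0
      || strcmp(stored_type, "unknown/unknown") == 0;
}
```

With a typeless response stored as unknown/unknown, the exact text/html filter no longer matches, while the html test still succeeds, so the response is still parsed and crawled.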

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Date: 2026-06-21 10:38:50 +02:00
7 changed files with 20 additions and 58 deletions


@@ -1,6 +1,6 @@
AC_PREREQ([2.71])
-AC_INIT([httrack], [3.49.9], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
+AC_INIT([httrack], [3.49.8], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
AC_COPYRIGHT([
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 1998-2015 Xavier Roche and other contributors
@@ -29,10 +29,9 @@ AC_CONFIG_SRCDIR(src/httrack.c)
AC_CONFIG_MACRO_DIR([m4])
AC_CONFIG_HEADERS(config.h)
AM_INIT_AUTOMAKE([subdir-objects])
-# 3:1:0: 3.49.9 changed code but not the exported interface vs 3.49.8 (same 164
-# symbols, no struct-layout change), so bump revision only. (3:0:0 was the htsblk
-# mime-buffer widening, an ABI break that moved the soname .so.2 -> .so.3.)
-VERSION_INFO="3:1:0"
+# 3:0:0: htsblk layout changed (contenttype/charset/contentencoding widened to
+# 128), an incompatible ABI break, so bump current and reset revision/age.
+VERSION_INFO="3:0:0"
AM_MAINTAINER_MODE
AC_USE_SYSTEM_EXTENSIONS

debian/changelog

@@ -1,14 +1,3 @@
-httrack (3.49.9-1) unstable; urgency=medium
-* New upstream release: Content-Type and file-type detection fixes (trust a
-declared Content-Type over a binary URL extension, honor --assume under the
-delayed type check, keep a known extension against a bogus or empty
-Content-Type, and avoid an uninitialised read on an empty Content-Type), and
-restored C++ source-compatibility of the installed headers so reverse
-dependencies (httraqt) build again.
--- Xavier Roche <xavier@debian.org> Sun, 21 Jun 2026 17:59:38 +0200
httrack (3.49.8-2) unstable; urgency=medium
* Rename libhttrack2 to libhttrack3 to follow the SONAME, which the 3.49.8


@@ -4,12 +4,6 @@ HTTrack Website Copier release history:
This file lists all changes and fixes that have been made for HTTrack
-3.49-9
-+ Fixed: file-type detection from the Content-Type header: trust a declared type over a binary URL extension, honor --assume under the delayed type check, and keep a known extension against a bogus or empty Content-Type (#267, #29, #56)
-+ Fixed: an uninitialized-buffer read when the Content-Type is empty (#411)
-+ Fixed: restored C++ source-compatibility of the installed headers so reverse dependencies (httraqt) build again (#413)
-+ Changed: multiple internal build, packaging and test-harness improvements
3.49-8
+ New: tunnel HTTPS downloads through the configured HTTP proxy via CONNECT (#85)
+ New: parse every candidate URL in <img> and <source> srcset lists (#326)


@@ -3703,9 +3703,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->maxsoc > 0)
to->maxsoc = from->maxsoc;
-/* hts_tristate fields use HTS_DEFAULT (-1) for "unspecified": copy_htsopt
-skips them so the target keeps its value. */
-if (from->nearlink > -1)
+/* hts_boolean/enum fields are unsigned (GCC), so a bare `> -1` unset-guard
+is always false; cast to int to keep the -1 "unset" sentinel test. */
+if ((int) from->nearlink > -1)
to->nearlink = from->nearlink;
if (from->timeout > -1)
@@ -3732,10 +3732,10 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->hostcontrol > -1)
to->hostcontrol = from->hostcontrol;
-if (from->errpage > -1)
+if ((int) from->errpage > -1)
to->errpage = from->errpage;
-if (from->parseall > -1)
+if ((int) from->parseall > -1)
to->parseall = from->parseall;
// test all: bit 8 de travel
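
The comparison pitfall named in the comment above can be reproduced in isolation. In this sketch an explicit unsigned parameter stands in for an enum field that the compiler backs with an unsigned type; guard_bare and guard_cast are hypothetical names, not HTTrack functions.

```c
/* Bare guard: -1 is converted to UINT_MAX by the usual arithmetic
   conversions, so the comparison is false for every field value. */
static int guard_bare(unsigned field) {
  return field > -1;
}

/* Cast guard: the comparison is done on int, so 0 and 1 pass the
   "not unset" test as intended. */
static int guard_cast(unsigned field) {
  return (int) field > -1;
}
```

Whether an enum with only non-negative enumerators is unsigned is implementation-defined, which is why the comment above names GCC specifically.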


@@ -3166,16 +3166,6 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
if (to->parseall != HTS_FALSE)
err = 1;
-/* HTS_DEFAULT (-1) is "unspecified": copy_htsopt must skip it,
-leaving the target intact. Only a signed (int-backed) field
-can hold -1, so this also guards the type against regressing
-to an unsigned hts_boolean. */
-from->parseall = HTS_DEFAULT;
-to->parseall = HTS_TRUE;
-copy_htsopt(from, to);
-if (to->parseall != HTS_TRUE)
-err = 1;
hts_free_opt(from);
hts_free_opt(to);
printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");


@@ -43,8 +43,8 @@ Please visit our Website: http://www.httrack.com
configure.ac, decoupled from these). VERSION is the display form, VERSIONID
the dotted numeric form, AFF_VERSION the short form shown in footers,
LIB_VERSION the data/cache format generation. */
-#define HTTRACK_VERSION "3.49-9"
-#define HTTRACK_VERSIONID "3.49.9"
+#define HTTRACK_VERSION "3.49-8"
+#define HTTRACK_VERSIONID "3.49.8"
#define HTTRACK_AFF_VERSION "3.x"
#define HTTRACK_LIB_VERSION "2.0"
@@ -247,23 +247,13 @@ Please visit our Website: http://www.httrack.com
#define HTS_NOPARAM "(none)"
#define HTS_NOPARAM2 "\"(none)\""
-/* Boolean flag for option fields and API yes/no returns. Int-backed, not an
-enum: an enum makes C++ reject `field = 1` / `f(0)` on the exported fields
-and params. Int-sized, so the httrackp layout and the ABI are unchanged. */
+/* Boolean flag for option fields and API yes/no returns. An enum (not C bool)
+so it stays int-sized: option fields keep the httrackp layout/ABI, and a
+return type stays compatible with the int it replaces. */
#ifndef HTS_DEF_DEFSTRUCT_hts_boolean
#define HTS_DEF_DEFSTRUCT_hts_boolean
-typedef int hts_boolean;
-#define HTS_FALSE 0
-#define HTS_TRUE 1
-#endif
-#ifndef HTS_DEF_DEFSTRUCT_hts_tristate
-#define HTS_DEF_DEFSTRUCT_hts_tristate
-/* Tri-state hts_boolean: HTS_DEFAULT (-1) = "unspecified" (copy_htsopt leaves
-the target untouched); HTS_FALSE/HTS_TRUE = off/on. */
-typedef int hts_tristate;
-#define HTS_DEFAULT (-1)
+typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean;
#endif
/* Larger/smaller of two values. Macros: arguments are evaluated twice. */


@@ -428,11 +428,11 @@ struct httrackp {
LLint maxfile_html; /**< max bytes per HTML file */
int maxsoc; /**< max simultaneous sockets (-cN) */
LLint fragment; /**< split site after this many bytes */
-hts_tristate
+hts_boolean
nearlink; /**< also fetch images/data adjacent to a page but off-site */
hts_boolean makeindex; /**< build a top-level index.html */
hts_boolean kindex; /**< build a keyword index */
-hts_tristate delete_old; /**< delete locally obsolete files after update */
+hts_boolean delete_old; /**< delete locally obsolete files after update */
int timeout; /**< connection timeout in seconds */
int rateout; /**< minimum transfer rate (bytes/s) before abort */
int maxtime; /**< max total mirror duration in seconds */
@@ -465,13 +465,13 @@ struct httrackp {
hts_boolean maketrack; /**< maintain an operations-statistics log */
int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */
int hostcontrol; /**< ban slow/timing-out hosts; see hts_hostcontrol bits */
-hts_tristate errpage; /**< generate an error page on 404 and similar */
+hts_boolean errpage; /**< generate an error page on 404 and similar */
hts_boolean
check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves
*/
hts_boolean all_in_cache; /**< keep all retrieved data in the cache */
hts_robots robots; /**< robots.txt handling level */
-hts_tristate external; /**< render external links as error pages */
+hts_boolean external; /**< render external links as error pages */
hts_boolean passprivacy; /**< strip passwords from external links */
hts_boolean includequery; /**< include the query string in saved names */
hts_boolean mirror_first_page; /**< only mirror the links of the first page */
@@ -485,7 +485,7 @@ struct httrackp {
hts_boolean sizehack; /**< treat same-size response as "updated" */
hts_boolean urlhack; // force "url normalization" to avoid loops
hts_boolean tolerant; /**< accept an incorrect Content-Length */
-hts_tristate
+hts_boolean
parseall; /**< parse aggressively, including unknown tags with links */
hts_boolean parsedebug; /**< parser debug mode */
hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */