Compare commits

...

10 Commits

Author SHA1 Message Date
Xavier Roche
3e600cff67 htsglobal: int-back hts_boolean (and add hts_tristate) for C++ source-compat
hts_boolean was a real enum {HTS_FALSE=0, HTS_TRUE=1}. C++ forbids assigning a
plain int to an enum, so every C++ consumer of the public API broke on
`opt->field = 1` or `f(opt, 0)`. The exported surface that trips this is wide:
33 hts_boolean fields in struct httrackp, plus two exported parameters
(unescape_http_unharm's no_high, get_httptype_sized's flag). httraqt, the one
reverse-dependency, fails to build against the installed headers, which blocks
the libhttrack.so.2 -> .so.3 binNMU.

Make hts_boolean an int-backed typedef (HTS_FALSE/HTS_TRUE become macros),
fixing the whole exported boolean surface at once: an int is assignable from an
int in C and C++. It is safe here -- hts_boolean has a single definition,
nothing uses the `enum hts_boolean` form, and HTS_FALSE/HTS_TRUE are only ever
0/1 values. ABI is unchanged: an int-sized enum and an int have identical size
and alignment, so sizeof(struct httrackp) stays 141648 and the soname is
untouched.

Five of those fields are not booleans but tri-state {-1 unspecified, 0 off,
1 on}: the -1 is copy_htsopt()'s "leave the target's value" sentinel for merging
a partial options struct onto a fully-initialised one. The old unsigned enum
could not hold -1, so the copy_htsopt `> -1` guard was always true and the field
was always overwritten. Give them a named hts_tristate (int-backed, HTS_DEFAULT
= -1) so the sentinel works and reads honestly, and drop the now-pointless
`(int)` casts. Extend the -#9 copy_htsopt self-test to assert an HTS_DEFAULT
source is skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-21 17:46:04 +02:00
Xavier Roche
addbd3136b Use an unknown/unknown sentinel for an absent Content-Type (#412)
#409 distinguished "the server declared text/html" from "no Content-Type,
defaulted to text/html" with a new htsblk.contenttype_given flag, so a
binary-looking URL that really serves HTML is saved .html while a typeless
response keeps its URL extension. That worked on a fresh crawl but had two
costs: the flag was never persisted, so on --update the cache read it as
unset and the names reverted (report.html became report.pdf again, and the
two passes disagreed); and it was an installed-struct ABI break (soname 4,
libhttrack4).

Replace the flag with a sentinel: when no Content-Type is received, store
"unknown/unknown" as the type instead of text/html. The sentinel is treated
as html for every type test (added to is_html_mime_type), so parsing,
storage and filtering of a typeless response are unchanged; only the naming
code (wire_patches_ext) reads it as "no declared type" and keeps the URL
extension. Because the type string rides the cache, an update reads the same
sentinel and names consistently -- the revert is fixed at the source.

The sentinel never reaches a consumer as a real type: a single helper,
hts_effective_mime(), maps it back to text/html wherever a stored type is
derived (give_mimext) or emitted/persisted -- the httrack stdout serve, the
ProxyTrack live serve, and the ProxyTrack .arc export (both the replayed
response header and the index record). The .arc export was caught by an
adversarial spill audit; without the map a typeless page archived via
proxytrack would carry Content-Type: unknown/unknown.

Since the sentinel makes contenttype_given unnecessary, #409's ABI break is
undone: the field is removed, soname returns to 3, and the Debian package
reverts libhttrack4 -> libhttrack3. soname 4 was never released (Debian NEW
carries libhttrack3), so this re-aligns master with the archive rather than
flip-flopping anything downstream.

Tests: 18_local-update re-mirrors and asserts the names survive the update
pass; 15_local-types gains a notype.html negative control; 17_local-empty-ct
stays green. Full make check: 27 pass, 0 fail.

One accepted behavior change: a mime filter matching exactly text/html no
longer matches a typeless response (its type is the sentinel, html-ish but
not literally text/html); the response is still parsed and crawled as html.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:44:12 +02:00
Xavier Roche
a64c4cd160 Don't read an uninitialized buffer on an empty Content-Type (#411)
treathead() parses the Content-Type value with sscanf("%s") into a local
`tempo` buffer, then calls strlen(tempo) and stores the result. A response
whose Content-Type header has an empty or whitespace-only value yields no
token: sscanf leaves `tempo` uninitialized, so strlen reads uninitialized
stack and can over-read past the buffer. A hostile server triggers this with
a bare `Content-Type:` line.

Guard on sscanf's return: adopt the value, and mark the type as server-given,
only when a token was actually read. An empty value now falls back to the
default type with contenttype_given left false, i.e. it is treated like a
missing header and the URL extension is kept -- which is also the correct
naming behavior.

Found while reviewing #409, which added contenttype_given right beside this
parse; the bug itself predates it. tests/17_local-empty-ct.test exercises the
empty-Content-Type path, and the ASan/UBSan CI job is what catches the
uninitialized read.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:20:08 +02:00
Xavier Roche
1611dbcabf Trust a declared Content-Type over a binary URL extension (#409)
PR #408 stopped a bogus or missing html-ish wire type from clobbering a URL
extension that maps to a specific non-HTML type (the #267 mangle). But it
treated an explicitly declared text/html the same as a missing type, so a
binary-looking URL that legitimately serves HTML, such as a login or error
interstitial or a soft-404 at a .pdf or .jpg link, was saved under the binary
extension with HTML inside and would not render locally.

The response body is the only true discriminator, but under the default delayed
type check the save name is committed from the headers while the body is still
downloading, so it cannot be sniffed at naming time. Instead, keep the URL
extension only when the server sent no Content-Type at all (a missing header is
defaulted to text/html upstream and must not be trusted); an explicitly declared
type, even text/html, now wins. This trades the rare case of a real binary
explicitly mislabeled text/html (now named .html) for the common interstitial
and soft-404 case.

Whether a Content-Type header was actually received cannot be recovered after
parsing, since treatfirstline defaults a missing header to text/html, so it is
recorded as a new hts_boolean contenttype_given on htsblk. That grows the
installed struct, an incompatible ABI change: soname bumped 3 -> 4, and the
Debian runtime package renamed libhttrack3 -> libhttrack4 to match.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:18:16 +02:00
Xavier Roche
099501ee50 Make lintian actually gate the Debian package build (#410)
The deb CI job and mkdeb.sh ran lintian via debuild with
--fail-on=error,warning and were believed to gate on it. They did not:
debuild only reports lintian, it does not propagate lintian's exit status,
so a package that lintian flags with errors or warnings still built green.
This was demonstrated by a SONAME bump landing without the matching
libhttrackN package rename: lintian emitted shared-library-is-multi-arch-foreign
and package-name-doesnt-match-sonames, yet the job passed.

Disable debuild's lintian run and run lintian ourselves on the produced
.changes, under set -e, so any error or warning fails the build. Two CI-only
adjustments keep a clean package green: --profile debian, because the Ubuntu
runners' vendor data would otherwise reject the Debian "unstable" distribution,
and --suppress-tags newer-standards-version, which only reflects the runner's
lintian being older than the buildds'. The long-standing script-not-executable
hint on the sample search.sh gets an override.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:13:12 +02:00
Xavier Roche
1b9eefa3b4 Merge pull request #408 from xroche/fix/267-delayed-ext-mangle
Stop mangling saved-file extensions under the default delayed type check
2026-06-20 18:28:27 +02:00
Xavier Roche
9c8d3a41eb tests: tighten the type-matrix guards
Add two assertions surfaced by review of the override path: control.php
must not survive its rename to control.html (a dual-write regression
would leave both), and gen.php?id=5 (a query/extension-less URL served
image/png) must keep its .png and not be mangled to .html. Both exercise
the "override still fires" direction that the suppression cases don't.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:25:45 +02:00
Xavier Roche
ae77cd9d6d Honor --assume under the default delayed type check (-%N2)
Under HARD savename-delayed (the default), url_savename() forced
is_html=-1 before consulting the user's --assume rules, so a type the
user pinned was lost to the delayed name and never applied (#56). Skip
the forced delay when is_userknowntype() matches: ishtml() already
consults the user type, so the immediate naming path applies it. Files
with no --assume rule are unaffected -- is_userknowntype() is false and
the delay still fires.

tests/16_local-assume.test crawls a .png served as image/png but assumed
text/html and checks it is saved .html; it fails without this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:12:01 +02:00
Xavier Roche
51b8dcd81c Keep a known URL extension against a bogus html/empty Content-Type
Under the default delayed type check (-%N2), url_savename() rewrote a
saved file's extension from the wire Content-Type, gated only by
!may_unknown2(). text/html is not in the keep-list, so a response
labeled text/html -- or a typeless one, which is coerced to text/html --
clobbered the URL's own extension: a PNG served as text/html or with no
Content-Type was saved as .html, and .htm was normalized to .html (#29).
The bytes stayed intact; only the name was silently wrong.

wire_patches_ext() now lets the wire type override the extension only
when the type is patchable and doing so would not clobber a URL
extension that already maps to a specific, non-HTML type. A generator or
extension-less URL still becomes .html; a .png stays .png.

tests/15_local-types.test locks this with a deterministic offline crawl
of a content-type/extension matrix (tests/local-server.py); it fails on
the unfixed engine. Addresses the #267 mangle family (incl. #29).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:07:08 +02:00
Xavier Roche
bcce664143 Merge pull request #364 from xroche/feature/local-test-server
tests: offline local test server prototype (cookies + HTTPS)
2026-06-20 16:41:26 +02:00
19 changed files with 315 additions and 57 deletions

View File

@@ -227,7 +227,8 @@ jobs:
# Validate the Debian packaging via the same script maintainers release with.
# One amd64/gcc run is enough: packaging (control/rules/manifest/lintian/quilt
# source build) is arch- and compiler-independent, and the build matrix above
# already covers compile portability. lintian runs with --fail-on=error.
# already covers compile portability. mkdeb.sh runs lintian as an explicit gate
# (debuild does not propagate lintian's exit) with --fail-on=error,warning.
deb:
name: deb package (lintian)
runs-on: ubuntu-24.04

View File

@@ -4,3 +4,6 @@
# so the path lives in the display pointer, not the override -- match with '*'.
httrack-doc: extra-license-file *
httrack-doc: package-contains-documentation-outside-usr-share-doc *
# search.sh is a sample CGI shipped alongside the HTML manual, not meant to be
# run from the package tree; it stays non-executable by design.
httrack-doc: script-not-executable *

View File

@@ -3703,9 +3703,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->maxsoc > 0)
to->maxsoc = from->maxsoc;
/* hts_boolean/enum fields are unsigned (GCC), so a bare `> -1` unset-guard
is always false; cast to int to keep the -1 "unset" sentinel test. */
if ((int) from->nearlink > -1)
/* hts_tristate fields use HTS_DEFAULT (-1) for "unspecified": copy_htsopt
skips them so the target keeps its value. */
if (from->nearlink > -1)
to->nearlink = from->nearlink;
if (from->timeout > -1)
@@ -3732,10 +3732,10 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->hostcontrol > -1)
to->hostcontrol = from->hostcontrol;
if ((int) from->errpage > -1)
if (from->errpage > -1)
to->errpage = from->errpage;
if ((int) from->parseall > -1)
if (from->parseall > -1)
to->parseall = from->parseall;
// test all: bit 8 de travel

View File

@@ -2579,7 +2579,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
(r.size >= 0) ? r.size : (-r.size));
if (r.contenttype >= 0) {
fprintf(stdout, "Content-Type: %s\r\n",
r.contenttype);
hts_effective_mime(r.contenttype));
}
if (r.cdispo[0]) {
fprintf(stdout, "Content-Disposition: %s\r\n",
@@ -3166,6 +3166,16 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
if (to->parseall != HTS_FALSE)
err = 1;
/* HTS_DEFAULT (-1) is "unspecified": copy_htsopt must skip it,
leaving the target intact. Only a signed (int-backed) field
can hold -1, so this also guards the type against regressing
to an unsigned hts_boolean. */
from->parseall = HTS_DEFAULT;
to->parseall = HTS_TRUE;
copy_htsopt(from, to);
if (to->parseall != HTS_TRUE)
err = 1;
hts_free_opt(from);
hts_free_opt(to);
printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");

View File

@@ -247,13 +247,23 @@ Please visit our Website: http://www.httrack.com
#define HTS_NOPARAM "(none)"
#define HTS_NOPARAM2 "\"(none)\""
/* Boolean flag for option fields and API yes/no returns. An enum (not C bool)
so it stays int-sized: option fields keep the httrackp layout/ABI, and a
return type stays compatible with the int it replaces. */
/* Boolean flag for option fields and API yes/no returns. Int-backed, not an
enum: an enum makes C++ reject `field = 1` / `f(0)` on the exported fields
and params. Int-sized, so the httrackp layout and the ABI are unchanged. */
#ifndef HTS_DEF_DEFSTRUCT_hts_boolean
#define HTS_DEF_DEFSTRUCT_hts_boolean
typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean;
typedef int hts_boolean;
#define HTS_FALSE 0
#define HTS_TRUE 1
#endif
#ifndef HTS_DEF_DEFSTRUCT_hts_tristate
#define HTS_DEF_DEFSTRUCT_hts_tristate
/* Tri-state hts_boolean: HTS_DEFAULT (-1) = "unspecified" (copy_htsopt leaves
the target untouched); HTS_FALSE/HTS_TRUE = off/on. */
typedef int hts_tristate;
#define HTS_DEFAULT (-1)
#endif
/* Larger/smaller of two values. Macros: arguments are evaluated twice. */

View File

@@ -1423,7 +1423,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
else
infostatuscode(retour->msg, retour->statuscode);
// type MIME par défaut2
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME);
strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
} else { // pas de code!
retour->statuscode = STATUSCODE_INVALID;
strcpybuff(retour->msg, "Unknown response structure");
@@ -1438,7 +1438,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
retour->statuscode = HTTP_OK;
retour->keep_alive = 0;
strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME);
strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
} else if (strnotempty(a)) {
retour->statuscode = STATUSCODE_INVALID;
strcpybuff(retour->msg, "Unknown (not HTTP/xx) response structure");
@@ -1447,7 +1447,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
retour->statuscode = HTTP_OK;
retour->keep_alive = 0;
strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME);
strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
}
}
} else { // vide!
@@ -1458,7 +1458,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
/* This is dirty .. */
retour->statuscode = HTTP_OK;
strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME);
strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
}
}
@@ -1589,11 +1589,15 @@ void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * ret
}
}
}
sscanf(rcvd + p, "%s", tempo);
if (strlen(tempo) < sizeof(retour->contenttype) - 2) // pas trop long!!
strcpybuff(retour->contenttype, tempo);
else
strcpybuff(retour->contenttype, "application/octet-stream-unknown"); // erreur
// An empty/whitespace Content-Type value yields no token: keep the
// sentinel default rather than reading an uninitialized tempo.
if (sscanf(rcvd + p, "%s", tempo) == 1) {
if (strlen(tempo) < sizeof(retour->contenttype) - 2) // pas trop long!!
strcpybuff(retour->contenttype, tempo);
else
strcpybuff(retour->contenttype,
"application/octet-stream-unknown"); // erreur
}
}
} else if ((p = strfield(rcvd, "Content-Range:")) != 0) {
// Content-Range: bytes 0-70870/70871
@@ -4310,6 +4314,7 @@ int give_mimext(char *s, size_t ssize, const char *st) {
int ok = 0;
int j = 0;
st = hts_effective_mime(st); /* no declared type: derive an html ext */
s[0] = '\0';
while((!ok) && (strnotempty(hts_mime[j][1]))) {
if (strfield2(hts_mime[j][0], st)) {

View File

@@ -481,10 +481,22 @@ HTS_STATIC int strcmpnocase(const char *a, const char *b) {
// is this MIME an hypertext MIME (text/html), html/js-style or other script/text type?
#define HTS_HYPERTEXT_DEFAULT_MIME "text/html"
/* Sentinel stored when the server declared no Content-Type. It is html-ish
for every type test (so a typeless response still parses/stores as today),
but the naming code (wire_patches_ext) treats it as "no declared type" and
keeps the URL extension. It rides the cache, so updates name consistently. */
#define HTS_UNKNOWN_MIME "unknown/unknown"
/* Map the no-declared-type sentinel back to a real type for any header or
record we EMIT or PERSIST, so "unknown/unknown" never reaches a consumer
(a served Content-Type, a ProxyTrack .arc record, ...). */
#define hts_effective_mime(m) \
(strfield2((m), HTS_UNKNOWN_MIME) ? HTS_HYPERTEXT_DEFAULT_MIME : (m))
#define is_html_mime_type(a) \
( (strfield2((a),"text/html")!=0)\
|| (strfield2((a),"application/xhtml+xml")!=0) \
#define is_html_mime_type(a) \
((strfield2((a), "text/html") != 0) || \
(strfield2((a), "application/xhtml+xml") != 0) || \
(strfield2((a), HTS_UNKNOWN_MIME) != \
0) /* no declared type: treat as html */ \
)
#define is_hypertext_mime__(a) \
( \

View File

@@ -138,6 +138,35 @@ static void cleanEndingSpaceOrDot(char *s) {
}
}
/* Should the wire Content-Type override the URL's own extension when naming the
saved file? True when the type is patchable (may_unknown2) and either the URL
extension implies no specific type or the server declared a disagreeing one.
A URL extension mapping to a specific non-HTML type is kept only when the
server declared NO type (the HTS_UNKNOWN_MIME sentinel; the #267 mangle
guard): a typeless .png stays .png, but a .pdf explicitly served as text/html
is named .html. The sentinel rides the cache, so updates stay consistent. */
static int wire_patches_ext(httrackp *opt, const char *wiremime,
const char *file) {
char urlmime[256];
if (may_unknown2(opt, wiremime, file))
return 0; /* type kept verbatim (keep-list / bogus-multiple) */
urlmime[0] = '\0';
/* type implied by the URL extension, only when confidently known (flag 0) */
if (!get_httptype_sized(opt, urlmime, sizeof(urlmime), file, 0))
return 1; /* URL ext implies no known type: trust the wire type */
if (strfield2(wiremime, urlmime))
return 0; /* wire agrees with the ext: keep it (no .htm->.html churn) */
/* wire disagrees with a specific non-HTML URL ext. Keep the ext only when
the server declared no type (the sentinel); an explicitly declared type,
even text/html, is trusted, so a binary-looking URL that really serves
HTML (login/error interstitial, soft-404) is named .html. */
if (!is_hypertext_mime(opt, urlmime, file) &&
strfield2(wiremime, HTS_UNKNOWN_MIME))
return 0;
return 1;
}
// forme le nom du fichier à sauver (save) à partir de fil et adr
// système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html)
int url_savename(lien_adrfilsave *const afs,
@@ -325,7 +354,10 @@ int url_savename(lien_adrfilsave *const afs,
}
/* replace shtml to html.. */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD)
/* HARD delays every type, except one the user pinned with --assume: honor it
immediately (ishtml() consults the user type), no delayed name (#56) */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
!is_userknowntype(opt, fil))
is_html = -1; /* ALWAYS delay type */
else
is_html = ishtml(opt, fil);
@@ -380,7 +412,7 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, r.cdispo);
} else if (!may_unknown2(opt, r.contenttype, fil)) { // on peut patcher à priori?
} else if (wire_patches_ext(opt, r.contenttype, fil)) {
if (give_mimext(s, sizeof(s),
r.contenttype)) { // recognized extension
ext_chg = 1;
@@ -425,7 +457,8 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(headers->r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, headers->r.cdispo);
} else if (!may_unknown2(opt, headers->r.contenttype, headers->url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type)
} else if (wire_patches_ext(opt, headers->r.contenttype,
headers->url_fil)) {
char s[16];
if (give_mimext(
s, sizeof(s),
@@ -641,7 +674,8 @@ int url_savename(lien_adrfilsave *const afs,
if (!has_been_moved) {
if (back[b].r.statuscode != -10) { // erreur
if (strnotempty(back[b].r.contenttype) == 0)
strcpybuff(back[b].r.contenttype, "text/html"); // message d'erreur en html
strcpybuff(back[b].r.contenttype,
HTS_UNKNOWN_MIME); // no declared type
// Finalement on, renvoie un erreur, pour ne toucher à rien dans le code
// libérer emplacement backing
}
@@ -653,7 +687,8 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(back[b].r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, back[b].r.cdispo);
} else if (!may_unknown2(opt, back[b].r.contenttype, back[b].url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type)
} else if (wire_patches_ext(opt, back[b].r.contenttype,
back[b].url_fil)) {
if (give_mimext(
s, sizeof(s),
back[b].r.contenttype)) { // recognized extension

View File

@@ -428,11 +428,11 @@ struct httrackp {
LLint maxfile_html; /**< max bytes per HTML file */
int maxsoc; /**< max simultaneous sockets (-cN) */
LLint fragment; /**< split site after this many bytes */
hts_boolean
hts_tristate
nearlink; /**< also fetch images/data adjacent to a page but off-site */
hts_boolean makeindex; /**< build a top-level index.html */
hts_boolean kindex; /**< build a keyword index */
hts_boolean delete_old; /**< delete locally obsolete files after update */
hts_tristate delete_old; /**< delete locally obsolete files after update */
int timeout; /**< connection timeout in seconds */
int rateout; /**< minimum transfer rate (bytes/s) before abort */
int maxtime; /**< max total mirror duration in seconds */
@@ -465,13 +465,13 @@ struct httrackp {
hts_boolean maketrack; /**< maintain an operations-statistics log */
int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */
int hostcontrol; /**< ban slow/timing-out hosts; see hts_hostcontrol bits */
hts_boolean errpage; /**< generate an error page on 404 and similar */
hts_tristate errpage; /**< generate an error page on 404 and similar */
hts_boolean
check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves
*/
hts_boolean all_in_cache; /**< keep all retrieved data in the cache */
hts_robots robots; /**< robots.txt handling level */
hts_boolean external; /**< render external links as error pages */
hts_tristate external; /**< render external links as error pages */
hts_boolean passprivacy; /**< strip passwords from external links */
hts_boolean includequery; /**< include the query string in saved names */
hts_boolean mirror_first_page; /**< only mirror the links of the first page */
@@ -485,7 +485,7 @@ struct httrackp {
hts_boolean sizehack; /**< treat same-size response as "updated" */
hts_boolean urlhack; // force "url normalization" to avoid loops
hts_boolean tolerant; /**< accept an incorrect Content-Length */
hts_boolean
hts_tristate
parseall; /**< parse aggressively, including unknown tags with links */
hts_boolean parsedebug; /**< parser debug mode */
hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */

View File

@@ -1176,11 +1176,15 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
if (element != NULL) {
msgCode = element->statuscode;
StringRoom(headers, 8192);
sprintf(StringBuffRW(headers), "HTTP/1.1 %d %s\r\n"
sprintf(StringBuffRW(headers),
"HTTP/1.1 %d %s\r\n"
#ifndef NO_WEBDAV
"%s"
#endif
"Content-Type: %s%s%s%s\r\n" "%s%s%s" "%s%s%s" "%s%s%s",
"Content-Type: %s%s%s%s\r\n"
"%s%s%s"
"%s%s%s"
"%s%s%s",
/* */
msgCode, element->msg,
#ifndef NO_WEBDAV
@@ -1188,16 +1192,18 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
StringBuff(davHeaders),
#endif
/* Content-type: foo; [ charset=bar ] */
element->contenttype,
hts_effective_mime(element->contenttype),
((element->charset[0]) ? "; charset=\"" : ""),
element->charset, ((element->charset[0]) ? "\"" : ""),
/* location */
((element->location != NULL
&& element->location[0]) ? "Location: " : ""),
((element->location != NULL
&& element->location[0]) ? element->location : ""),
((element->location != NULL
&& element->location[0]) ? "\r\n" : ""),
((element->location != NULL && element->location[0])
? "Location: "
: ""),
((element->location != NULL && element->location[0])
? element->location
: ""),
((element->location != NULL && element->location[0]) ? "\r\n"
: ""),
/* last-modified */
((element->lastmodified[0]) ? "Last-Modified: " : ""),
((element->lastmodified[0]) ? element->lastmodified : ""),
@@ -1205,8 +1211,7 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
/* etag */
((element->etag[0]) ? "ETag: " : ""),
((element->etag[0]) ? element->etag : ""),
((element->etag[0]) ? "\r\n" : "")
);
((element->etag[0]) ? "\r\n" : ""));
StringLength(headers) = (int) strlen(StringBuff(headers));
} else {
/* No query string, no ending / : check the the <url>/ page */

View File

@@ -52,6 +52,7 @@ Please visit our Website: http://www.httrack.com
#include "htscore.h"
#include "htsback.h"
#include "htslib.h" /* hts_effective_mime */
#include "store.h"
#include "proxystrings.h"
@@ -2289,10 +2290,17 @@ static int PT_SaveCache__Arc_Fun(void *arg, const char *url, PT_Element element)
int size_headers;
sprintf(st->headers,
"HTTP/1.0 %d %s" "\r\n" "X-Server: ProxyTrack " PROXYTRACK_VERSION
"\r\n" "Content-type: %s%s%s%s" "\r\n" "Last-modified: %s" "\r\n"
"Content-length: %d" "\r\n", element->statuscode, element->msg,
/**/ element->contenttype,
"HTTP/1.0 %d %s"
"\r\n"
"X-Server: ProxyTrack " PROXYTRACK_VERSION "\r\n"
"Content-type: %s%s%s%s"
"\r\n"
"Last-modified: %s"
"\r\n"
"Content-length: %d"
"\r\n",
element->statuscode, element->msg,
/**/ hts_effective_mime(element->contenttype),
(element->charset[0] ? "; charset=\"" : ""),
(element->charset[0] ? element->charset : ""),
(element->charset[0] ? "\"" : ""), /**/ element->lastmodified,
@@ -2328,10 +2336,10 @@ static int PT_SaveCache__Arc_Fun(void *arg, const char *url, PT_Element element)
/* args */
(link_has_authority(url) ? "" : "http://"), url, "0.0.0.0",
tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, tm->tm_hour,
tm->tm_min, tm->tm_sec, element->contenttype, element->statuscode,
st->md5, (element->location ? element->location : "-"),
(long int) ftell(fp), st->filename,
(long int) (size_headers + element->size));
tm->tm_min, tm->tm_sec, hts_effective_mime(element->contenttype),
element->statuscode, st->md5,
(element->location ? element->location : "-"), (long int) ftell(fp),
st->filename, (long int) (size_headers + element->size));
/* network_doc */
if (fwrite(st->headers, 1, size_headers, fp) != size_headers
|| (element->size > 0

25
tests/15_local-types.test Normal file
View File

@@ -0,0 +1,25 @@
#!/bin/bash
#
# Content-Type vs URL-extension naming (issue #267 family) under the default
# delayed type check (-%N2). Policy: a MISSING Content-Type must not clobber a
# URL extension that maps to a specific non-HTML type (.png/.pdf stay as-is);
# an explicitly DECLARED type is trusted, so a binary-looking URL that really
# serves HTML (text/html on .pdf/.jpg) is named .html. The "wrong" names are
# asserted absent so a regression in either direction fails here.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/notype.png' --not-found 'types/notype.html' \
--found 'types/notype.pdf' --not-found 'types/notype.html' \
--found 'types/photo.png' \
--found 'types/doc.pdf' \
--found 'types/lie.html' --not-found 'types/lie.png' \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/page.htm' --not-found 'types/page.html' \
--found 'types/script.js' \
--found 'types/style.css' \
--found 'types/data.json' \
--found 'types/control.html' --not-found 'types/control.php' \
--found 'types/gend61c.png' --not-found 'types/gend61c.html' \
httrack 'BASEURL/types/index.html'

View File

@@ -0,0 +1,11 @@
#!/bin/bash
#
# --assume under the default delayed type check (-%N2), issue #56. A user type
# pinned with --assume must be honored immediately, not lost to the delayed
# name: photo.png served as image/png but assumed text/html is saved as .html.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/photo.html' --not-found 'types/photo.png' \
httrack 'BASEURL/types/photo.png' --assume png=text/html

View File

@@ -0,0 +1,12 @@
#!/bin/bash
#
# An empty "Content-Type:" header value must be treated as "no usable type"
# (keep the URL extension), not parsed from an uninitialized buffer. The crawl
# also runs under ASan/UBSan in CI, which catches the uninitialized read this
# guards against.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/emptyct.png' --not-found 'types/emptyct.html' \
httrack 'BASEURL/types/index.html'

View File

@@ -0,0 +1,15 @@
#!/bin/bash
#
# A second (update) pass must keep the names the first crawl chose. The stored
# Content-Type rides the cache, so the update reads back the same value -- the
# unknown/unknown sentinel for a typeless response, the declared type otherwise
# -- and names consistently: a declared-text/html .pdf stays .html and a
# typeless .png stays .png across the update rather than reverting.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/notype.png' --not-found 'types/notype.html' \
--found 'types/lie.html' \
httrack 'BASEURL/types/index.html'

View File

@@ -51,6 +51,10 @@ TESTS = \
12_crawl_https.test \
13_crawl_proxy_https.test \
13_local-cookies.test \
14_local-https.test
14_local-https.test \
15_local-types.test \
16_local-assume.test \
17_local-empty-ct.test \
18_local-update.test
CLEANFILES = check-network_sh.cache

View File

@@ -26,6 +26,7 @@ key="${testdir}/server.key"
tls=
verbose=
rerun=
tmpdir=
serverpid=
crawlpid=
@@ -89,6 +90,7 @@ nargs=$#
while test "$pos" -lt "$nargs"; do
case "${args[$pos]}" in
--debug) verbose=1 ;;
--rerun) rerun=1 ;; # run httrack a second time (update pass) before auditing
--no-purge)
nopurge=1
audit+=("--no-purge")
@@ -180,6 +182,22 @@ test "$crawlres" -eq 0 || ! result "httrack exited $crawlres" || {
result "OK"
grep -iE "^[0-9:]*[[:space:]]Error:" "${out}/hts-log.txt" >&2
# --- optional second pass: re-mirror into the same dir (cache/update path) ----
if test -n "$rerun"; then
info "re-running httrack (update pass)"
httrack -O "$out" --user-agent="httrack $ver local ($(uname -omrs))" \
"${moreargs[@]}" "${hts[@]}" >"${log}.2" 2>&1 &
crawlpid=$!
wait "$crawlpid"
crawlres=$?
crawlpid=
test "$crawlres" -eq 0 || ! result "update pass exited $crawlres" || {
cat "${log}.2" >&2
exit 1
}
result "OK (update)"
fi
# --- discover the single host root (127.0.0.1_<port> or 127.0.0.1) -----------
hostroot=
for cand in "${out}/127.0.0.1_${port}" "${out}/127.0.0.1"; do

View File

@@ -118,11 +118,83 @@ class Handler(SimpleHTTPRequestHandler):
if self.command != "HEAD":
self.wfile.write(body)
# --- type/extension matrix (issue #267 family) -------------------------
def send_raw(self, body, content_type):
"""Send a raw body with an explicit Content-Type, or none at all when
content_type is None (to observe httrack's typeless-file naming)."""
self.send_response(200)
if content_type is not None:
self.send_header("Content-Type", content_type)
self.send_header("Content-Length", str(len(body)))
self.end_headers()
if self.command != "HEAD":
self.wfile.write(body)
# Fake-binary blobs for the image/pdf/typeless cases.
FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64
FAKE_PDF = b"%PDF-1.4\n" + b"\x00" * 64
# path -> (body, content_type); None sends no header, "" sends an empty
# Content-Type value (no usable type, must be treated like None).
TYPE_MATRIX = {
"/types/control.php": (b"<html><body>control</body></html>", "text/html"),
"/types/photo.png": (FAKE_PNG, "image/png"),
"/types/doc.pdf": (FAKE_PDF, "application/pdf"),
"/types/notype.png": (FAKE_PNG, None),
"/types/notype.pdf": (FAKE_PDF, None),
"/types/emptyct.png": (FAKE_PNG, ""),
"/types/lie.png": (FAKE_PNG, "text/html"),
"/types/report.pdf": (b"<html><body>real page</body></html>", "text/html"),
"/types/page.htm": (b"<html><body>htm page</body></html>", "text/html"),
"/types/script.js": (b"var x = 1;\n", "application/javascript"),
"/types/style.css": (b"body { color: red; }\n", "text/css"),
"/types/data.json": (b'{"k": "v"}\n', "application/json"),
"/types/gen.php": (FAKE_PNG, "image/png"),
}
def route_types_index(self):
body = (
'\t<a href="control.php">control</a>\n'
'\t<img src="photo.png" />\n'
'\t<a href="doc.pdf">doc</a>\n'
'\t<img src="notype.png" />\n'
'\t<a href="notype.pdf">notypepdf</a>\n'
'\t<img src="emptyct.png" />\n'
'\t<img src="lie.png" />\n'
'\t<a href="report.pdf">report</a>\n'
'\t<a href="page.htm">htm</a>\n'
'\t<script src="script.js"></script>\n'
'\t<link rel="stylesheet" href="style.css" />\n'
'\t<a href="data.json">json</a>\n'
'\t<img src="gen.php?id=5" />\n'
)
self.send_html(body)
def route_types(self):
path = urlsplit(self.path).path
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
"/cookies/third.php": route_third,
"/robots.txt": route_robots,
"/types/index.html": route_types_index,
"/types/control.php": route_types,
"/types/photo.png": route_types,
"/types/doc.pdf": route_types,
"/types/notype.png": route_types,
"/types/notype.pdf": route_types,
"/types/emptyct.png": route_types,
"/types/lie.png": route_types,
"/types/report.pdf": route_types,
"/types/page.htm": route_types,
"/types/script.js": route_types,
"/types/style.css": route_types,
"/types/data.json": route_types,
"/types/gen.php": route_types,
}
# --- dispatch ----------------------------------------------------------

View File

@@ -206,9 +206,10 @@ main() {
cp -a "$export_dir/debian" "httrack-$ver/debian"
)
# Build (debuild also runs lintian and signs). --fail-on aborts on a lintian
# error or warning, so neither a release nor CI produces an unclean package.
local -a debuild_opts=(--lintian-opts -I -i "--fail-on=error,warning")
# Build and sign. debuild runs lintian too but does NOT propagate its exit
# status, so a broken package would pass unnoticed; disable it here and run
# lintian ourselves below as the real gate.
local -a debuild_opts=(--no-lintian)
local -a build_opts=()
[[ $source_only -eq 1 ]] && build_opts+=(-S)
if [[ $unsigned -eq 1 ]]; then
@@ -219,7 +220,8 @@ main() {
info "building packages with debuild"
(
cd "$scratch/httrack-$ver"
debuild "${build_opts[@]}" "${debuild_opts[@]}"
# debuild options (--no-lintian) must precede the dpkg-buildpackage ones
debuild "${debuild_opts[@]}" "${build_opts[@]}"
)
# Collect every file the .changes references (orig, dsc, debs, ddebs, buildinfo).
@@ -229,6 +231,16 @@ main() {
changes=("$scratch"/*.changes)
shopt -u nullglob
[[ ${#changes[@]} -ge 1 ]] || die "debuild produced no .changes file"
# The real lintian gate (debuild only reports, it does not fail on tags).
# --profile debian: CI runners are Ubuntu, whose vendor data would wrongly
# reject the Debian "unstable" distribution. newer-standards-version only
# means the local lintian is older than the buildds', not a package
# defect, so suppress it. set -e turns any error/warning tag into a failure.
info "running lintian gate (--fail-on=error,warning)"
lintian --profile debian -I -i --fail-on=error,warning \
--suppress-tags newer-standards-version "${changes[@]}"
dcmd cp -- "${changes[@]}" "$outdir/"
# Clean-room build gate: rebuild the source package in a minimal chroot that