Compare commits

..

17 Commits

Author SHA1 Message Date
Xavier Roche
addbd3136b Use an unknown/unknown sentinel for an absent Content-Type (#412)
#409 distinguished "the server declared text/html" from "no Content-Type,
defaulted to text/html" with a new htsblk.contenttype_given flag, so a
binary-looking URL that really serves HTML is saved .html while a typeless
response keeps its URL extension. That worked on a fresh crawl but had two
costs: the flag was never persisted, so on --update the cache read it as
unset and the names reverted (report.html became report.pdf again, and the
two passes disagreed); and it was an installed-struct ABI break (soname 4,
libhttrack4).

Replace the flag with a sentinel: when no Content-Type is received, store
"unknown/unknown" as the type instead of text/html. The sentinel is treated
as html for every type test (added to is_html_mime_type), so parsing,
storage and filtering of a typeless response are unchanged; only the naming
code (wire_patches_ext) reads it as "no declared type" and keeps the URL
extension. Because the type string rides the cache, an update reads the same
sentinel and names consistently -- the revert is fixed at the source.

The sentinel never reaches a consumer as a real type: a single helper,
hts_effective_mime(), maps it back to text/html wherever a stored type is
derived (give_mimext) or emitted/persisted -- the httrack stdout serve, the
ProxyTrack live serve, and the ProxyTrack .arc export (both the replayed
response header and the index record). The .arc export was caught by an
adversarial spill audit; without the map a typeless page archived via
proxytrack would carry Content-Type: unknown/unknown.

Since the sentinel makes contenttype_given unnecessary, #409's ABI break is
undone: the field is removed, soname returns to 3, and the Debian package
reverts libhttrack4 -> libhttrack3. soname 4 was never released (Debian NEW
carries libhttrack3), so this re-aligns master with the archive rather than
flip-flopping anything downstream.

Tests: 18_local-update re-mirrors and asserts the names survive the update
pass; 15_local-types gains a notype.html negative control; 17_local-empty-ct
stays green. Full make check: 27 pass, 0 fail.

One accepted behavior change: a mime filter matching exactly text/html no
longer matches a typeless response (its type is the sentinel, html-ish but
not literally text/html); the response is still parsed and crawled as html.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:44:12 +02:00
Xavier Roche
a64c4cd160 Don't read an uninitialized buffer on an empty Content-Type (#411)
treathead() parses the Content-Type value with sscanf("%s") into a local
`tempo` buffer, then calls strlen(tempo) and stores the result. A response
whose Content-Type header has an empty or whitespace-only value yields no
token: sscanf leaves `tempo` uninitialized, so strlen reads uninitialized
stack and can over-read past the buffer. A hostile server triggers this with
a bare `Content-Type:` line.

Guard on sscanf's return: adopt the value, and mark the type as server-given,
only when a token was actually read. An empty value now falls back to the
default type with contenttype_given left false, i.e. it is treated like a
missing header and the URL extension is kept -- which is also the correct
naming behavior.

Found while reviewing #409, which added contenttype_given right beside this
parse; the bug itself predates it. tests/17_local-empty-ct.test exercises the
empty-Content-Type path, and the ASan/UBSan CI job is what catches the
uninitialized read.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:20:08 +02:00
Xavier Roche
1611dbcabf Trust a declared Content-Type over a binary URL extension (#409)
PR #408 stopped a bogus or missing html-ish wire type from clobbering a URL
extension that maps to a specific non-HTML type (the #267 mangle). But it
treated an explicitly declared text/html the same as a missing type, so a
binary-looking URL that legitimately serves HTML, such as a login or error
interstitial or a soft-404 at a .pdf or .jpg link, was saved under the binary
extension with HTML inside and would not render locally.

The response body is the only true discriminator, but under the default delayed
type check the save name is committed from the headers while the body is still
downloading, so it cannot be sniffed at naming time. Instead, keep the URL
extension only when the server sent no Content-Type at all (a missing header is
defaulted to text/html upstream and must not be trusted); an explicitly declared
type, even text/html, now wins. This trades the rare case of a real binary
explicitly mislabeled text/html (now named .html) for the common interstitial
and soft-404 case.

Whether a Content-Type header was actually received cannot be recovered after
parsing, since treatfirstline defaults a missing header to text/html, so it is
recorded as a new hts_boolean contenttype_given on htsblk. That grows the
installed struct, an incompatible ABI change: soname bumped 3 -> 4, and the
Debian runtime package renamed libhttrack3 -> libhttrack4 to match.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:18:16 +02:00
Xavier Roche
099501ee50 Make lintian actually gate the Debian package build (#410)
The deb CI job and mkdeb.sh ran lintian via debuild with
--fail-on=error,warning and were believed to gate on it. They did not:
debuild only reports lintian, it does not propagate lintian's exit status,
so a package that lintian flags with errors or warnings still built green.
This was demonstrated by a SONAME bump landing without the matching
libhttrackN package rename: lintian emitted shared-library-is-multi-arch-foreign
and package-name-doesnt-match-sonames, yet the job passed.

Disable debuild's lintian run and run lintian ourselves on the produced
.changes, under set -e, so any error or warning fails the build. Two CI-only
adjustments keep a clean package green: --profile debian, because the Ubuntu
runners' vendor data would otherwise reject the Debian "unstable" distribution,
and --suppress-tags newer-standards-version, which only reflects the runner's
lintian being older than the buildds'. The long-standing script-not-executable
hint on the sample search.sh gets an override.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:13:12 +02:00
Xavier Roche
1b9eefa3b4 Merge pull request #408 from xroche/fix/267-delayed-ext-mangle
Stop mangling saved-file extensions under the default delayed type check
2026-06-20 18:28:27 +02:00
Xavier Roche
9c8d3a41eb tests: tighten the type-matrix guards
Add two assertions surfaced by review of the override path: control.php
must not survive its rename to control.html (a dual-write regression
would leave both), and gen.php?id=5 (a query/extension-less URL served
image/png) must keep its .png and not be mangled to .html. Both exercise
the "override still fires" direction that the suppression cases don't.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:25:45 +02:00
Xavier Roche
ae77cd9d6d Honor --assume under the default delayed type check (-%N2)
Under HARD savename-delayed (the default), url_savename() forced
is_html=-1 before consulting the user's --assume rules, so a type the
user pinned was lost to the delayed name and never applied (#56). Skip
the forced delay when is_userknowntype() matches: ishtml() already
consults the user type, so the immediate naming path applies it. Files
with no --assume rule are unaffected -- is_userknowntype() is false and
the delay still fires.

tests/16_local-assume.test crawls a .png served as image/png but assumed
text/html and checks it is saved .html; it fails without this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:12:01 +02:00
Xavier Roche
51b8dcd81c Keep a known URL extension against a bogus html/empty Content-Type
Under the default delayed type check (-%N2), url_savename() rewrote a
saved file's extension from the wire Content-Type, gated only by
!may_unknown2(). text/html is not in the keep-list, so a response
labeled text/html -- or a typeless one, which is coerced to text/html --
clobbered the URL's own extension: a PNG served as text/html or with no
Content-Type was saved as .html, and .htm was normalized to .html (#29).
The bytes stayed intact; only the name was silently wrong.

wire_patches_ext() now lets the wire type override the extension only
when the type is patchable and doing so would not clobber a URL
extension that already maps to a specific, non-HTML type. A generator or
extension-less URL still becomes .html; a .png stays .png.

tests/15_local-types.test locks this with a deterministic offline crawl
of a content-type/extension matrix (tests/local-server.py); it fails on
the unfixed engine. Addresses the #267 mangle family (incl. #29).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:07:08 +02:00
Xavier Roche
bcce664143 Merge pull request #364 from xroche/feature/local-test-server
tests: offline local test server prototype (cookies + HTTPS)
2026-06-20 16:41:26 +02:00
Xavier Roche
7a24add87c tests: add offline local test server prototype (cookies + HTTPS)
Replace the network dependency for crawl tests with a self-contained Python
stdlib server (http.server + ssl) that httrack crawls over loopback. The server
binds an ephemeral port and prints it on stdout; local-crawl.sh discovers the
port, substitutes the BASEURL token into the httrack arguments, runs the crawl,
and audits the mirror under the discovered host-root directory.

This prototype migrates two cases off ut.httrack.com:

- 13_local-cookies.test drives the cookie chain (entrance/second/third)
  reimplemented as Python handlers from the old ut/cookies/*.php fixtures. A
  missing or wrong cookie answers 500, so a clean 3-files/0-errors run proves
  the cookie jar is replayed across links.
- 14_local-https.test crawls over HTTPS using a shipped long-dated self-signed
  cert. httrack does not verify certs, so the cert is accepted as-is and the
  real TLS path runs offline.

The group skips (exit 77) when python3 is missing, mirroring check-network.sh.
Fixtures and the cert are listed explicitly in EXTRA_DIST (automake does not
expand globs); make distcheck passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 16:35:13 +02:00
Xavier Roche
2308e7bafd Merge pull request #407 from xroche/fix/mkdeb-orig-artifact-rev2
mkdeb: cut a Debian revision >= 2 without bypassing the tool
2026-06-20 15:46:57 +02:00
Xavier Roche
ef5691fc47 mkdeb: reuse a frozen orig tarball for a Debian revision >= 2
mkdeb.sh regenerated the upstream orig from a fresh `git archive HEAD | make
dist` on every run. That is right for a -1 release, but a Debian revision >= 2
reuses the orig frozen in the archive at -1: the .dsc pins it by checksum, and
a regenerated orig (different mtimes, and content drift whenever the release
tooling shipped in EXTRA_DIST changes) gets rejected by dak. The -2 upload had
to bypass mkdeb.sh and stitch the package by hand.

Derive the upstream version and Debian revision from debian/changelog and let
the revision pick the orig: revision 1 builds a fresh tarball as before;
revision >= 2 reuses the one passed with --orig FILE, untouched. The --orig
requirement is enforced only for a signed (upload-bound) build: an unsigned
build is a throwaway (CI, local lintian) that can never reach the archive, so
it still regenerates the orig as before rather than demanding a frozen one.

Two guards close the gap the old code left implicit: the regenerate path
asserts the built tarball matches the changelog version (catching a
configure.ac/changelog skew), and the overlay step confirms the orig unpacks
to httrack-<ver>/ before dropping debian/ on top.

Validated end to end by reusing the official 3.49.8 orig to build 3.49.8-2:
the resulting .dsc pins the frozen orig's checksum byte for byte.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 15:44:12 +02:00
Xavier Roche
0a6eb73903 mkdeb: emit the orig website artifact on a Debian revision >= 2
The release-artifacts step signs and checksums httrack_<ver>.orig.tar.gz in
$outdir, but $outdir is populated by `dcmd cp` from the .changes, which lists
only the files in the upload. dpkg-genchanges omits the orig from a revision
>= 2 .changes (it is already in the archive), so the orig never reached
$outdir and `gpg --detach-sign` failed with "No such file or directory",
aborting a -2 (or later) release after the source package was already built.

Copy the orig from the build tree into $outdir before signing so the website
artifacts are produced regardless of the Debian revision. The upload is
unaffected: dput uploads the .changes-referenced files, not the extra orig.

CI didn't catch this because the deb job builds unsigned and the artifact
block is gated on a signing key.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 15:12:03 +02:00
Xavier Roche
fdb243e5a2 Merge pull request #406 from xroche/debian/libhttrack3-rename
debian: rename libhttrack2 to libhttrack3 to follow the SONAME
2026-06-20 15:04:12 +02:00
Xavier Roche
f8546e146d debian: drop the dead libhttrack-swf1.files and fix the overrides comment
Two packaging nits surfaced while reviewing the libhttrack3 rename, both
debian/-only:

- debian/libhttrack-swf1.files listed libhtsswf.so.1* but there is no
  libhttrack-swf1 package in debian/control and the swf module is no longer
  built (lib_LTLIBRARIES is just libhttrack/libhtsjava). dh_movefiles only
  consults built packages, so the list was dead. Remove it.

- libhttrack3.lintian-overrides claimed the ABI is tracked via "a strict
  =version dependency", but dh_makeshlibs --version-info emits the
  conservative (>= upstream-version) form, which is the correct choice for a
  soname-versioned library; a = ${binary:Version} shlibs dependency draws
  lintian's distant-prerequisite-in-shlibs. Correct the comment to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:59:00 +02:00
Xavier Roche
b7f602f2eb debian: rename libhttrack2 to libhttrack3 to follow the SONAME
The 3.49.8 ABI bump moved the soname to libhttrack.so.3, but the packaging
still globbed .so.2 in debian/libhttrack2.files, so the runtime libraries
matched nothing there and fell through into the catch-all httrack package;
libhttrack2 shipped no library (lintian package-name-doesnt-match-sonames).

Rename the binary package to libhttrack3, take over the misplaced libraries
from httrack and the old libhttrack2 via Breaks/Replaces, and switch the
.files globs to a .so.3* wildcard so a future soname bump no longer silently
misplaces the libraries. Ships as 3.49.8-2; new binary name goes through NEW.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:46:14 +02:00
Xavier Roche
550100b56a Merge pull request #405 from xroche/feature/mkdeb-sbuild
mkdeb: optional --sbuild clean-room build gate
2026-06-20 14:43:43 +02:00
31 changed files with 903 additions and 86 deletions

5
.flake8 Normal file
View File

@@ -0,0 +1,5 @@
[flake8]
# Match black's formatting so the two tools don't fight.
max-line-length = 88
# E203/W503 conflict with black's slice and line-break style.
extend-ignore = E203, W503

View File

@@ -227,7 +227,8 @@ jobs:
# Validate the Debian packaging via the same script maintainers release with.
# One amd64/gcc run is enough: packaging (control/rules/manifest/lintian/quilt
# source build) is arch- and compiler-independent, and the build matrix above
# already covers compile portability. lintian runs with --fail-on=error.
# already covers compile portability. mkdeb.sh runs lintian as an explicit gate
# (debuild does not propagate lintian's exit) with --fail-on=error,warning.
deb:
name: deb package (lintian)
runs-on: ubuntu-24.04

13
debian/changelog vendored
View File

@@ -1,3 +1,16 @@
httrack (3.49.8-2) unstable; urgency=medium
* Rename libhttrack2 to libhttrack3 to follow the SONAME, which the 3.49.8
ABI bump moved to libhttrack.so.3 (package-name-doesnt-match-sonames). In
3.49.8-1 the libhttrack2.files glob still matched .so.2, so the runtime
libraries fell through into the httrack package and libhttrack2 shipped no
library. The new .files uses a .so.3* wildcard so a future SONAME bump no
longer silently misplaces the libraries. New binary package, via NEW.
* Drop the stale debian/libhttrack-swf1.files: the swf module is no longer
built and no libhttrack-swf1 package exists.
-- Xavier Roche <xavier@debian.org> Sat, 20 Jun 2026 14:42:13 +0200
httrack (3.49.8-1) unstable; urgency=medium
* New upstream release: HTTPS-proxy CONNECT tunnelling and wider srcset

6
debian/control vendored
View File

@@ -58,13 +58,13 @@ Description: webhttrack common files
This package is the common files of webhttrack, website copier and
mirroring utility
Package: libhttrack2
Package: libhttrack3
Architecture: any
Multi-Arch: same
Section: libs
Replaces: libhttrack1
Conflicts: libhttrack1
Depends: ${misc:Depends}, ${shlibs:Depends}
Replaces: libhttrack2, httrack (<< 3.49.8-2~)
Breaks: libhttrack2, httrack (<< 3.49.8-2~)
Description: Httrack website copier library
This package is the library part of httrack, website copier and mirroring
utility

View File

@@ -4,3 +4,6 @@
# so the path lives in the display pointer, not the override -- match with '*'.
httrack-doc: extra-license-file *
httrack-doc: package-contains-documentation-outside-usr-share-doc *
# search.sh is a sample CGI shipped alongside the HTML manual, not meant to be
# run from the package tree; it stays non-executable by design.
httrack-doc: script-not-executable *

View File

@@ -1,2 +0,0 @@
usr/lib/*/libhtsswf.so.1.0.0
usr/lib/*/libhtsswf.so.1

View File

@@ -1,5 +0,0 @@
usr/lib/*/libhttrack.so.2.0.49
usr/lib/*/libhttrack.so.2
usr/lib/*/libhtsjava.so.2.0.49
usr/lib/*/libhtsjava.so.2
usr/share/httrack/templates

View File

@@ -1,3 +0,0 @@
# The shared libraries ship without a versioned symbols control file (ABI is
# tracked via the SONAME and a strict =version dependency, see debian/rules).
libhttrack2: no-symbols-control-file usr/lib/*

3
debian/libhttrack3.files vendored Normal file
View File

@@ -0,0 +1,3 @@
usr/lib/*/libhttrack.so.3*
usr/lib/*/libhtsjava.so.3*
usr/share/httrack/templates

3
debian/libhttrack3.lintian-overrides vendored Normal file
View File

@@ -0,0 +1,3 @@
# The shared libraries ship without a versioned symbols control file (ABI is
# tracked via the SONAME plus a >= upstream-version dependency, see debian/rules).
libhttrack3: no-symbols-control-file usr/lib/*

2
debian/rules vendored
View File

@@ -135,7 +135,7 @@ binary-arch: build install
dh_makeshlibs -a -X/usr/lib/$(DEB_HOST_MULTIARCH)/httrack/libtest --version-info
dh_installdeb -a
# we depend on the current version (ABI may change)
dh_shlibdeps -a -ldebian/libhttrack2/usr/lib/$(DEB_HOST_MULTIARCH)
dh_shlibdeps -a -ldebian/libhttrack3/usr/lib/$(DEB_HOST_MULTIARCH)
dh_gencontrol -a
dh_md5sums -a
dh_builddeb -a

View File

@@ -2579,7 +2579,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
(r.size >= 0) ? r.size : (-r.size));
if (r.contenttype >= 0) {
fprintf(stdout, "Content-Type: %s\r\n",
r.contenttype);
hts_effective_mime(r.contenttype));
}
if (r.cdispo[0]) {
fprintf(stdout, "Content-Disposition: %s\r\n",

View File

@@ -1423,7 +1423,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
else
infostatuscode(retour->msg, retour->statuscode);
// type MIME par défaut2
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME);
strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
} else { // pas de code!
retour->statuscode = STATUSCODE_INVALID;
strcpybuff(retour->msg, "Unknown response structure");
@@ -1438,7 +1438,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
retour->statuscode = HTTP_OK;
retour->keep_alive = 0;
strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME);
strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
} else if (strnotempty(a)) {
retour->statuscode = STATUSCODE_INVALID;
strcpybuff(retour->msg, "Unknown (not HTTP/xx) response structure");
@@ -1447,7 +1447,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
retour->statuscode = HTTP_OK;
retour->keep_alive = 0;
strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME);
strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
}
}
} else { // vide!
@@ -1458,7 +1458,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
/* This is dirty .. */
retour->statuscode = HTTP_OK;
strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME);
strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
}
}
@@ -1589,11 +1589,15 @@ void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * ret
}
}
}
sscanf(rcvd + p, "%s", tempo);
if (strlen(tempo) < sizeof(retour->contenttype) - 2) // pas trop long!!
strcpybuff(retour->contenttype, tempo);
else
strcpybuff(retour->contenttype, "application/octet-stream-unknown"); // erreur
// An empty/whitespace Content-Type value yields no token: keep the
// sentinel default rather than reading an uninitialized tempo.
if (sscanf(rcvd + p, "%s", tempo) == 1) {
if (strlen(tempo) < sizeof(retour->contenttype) - 2) // pas trop long!!
strcpybuff(retour->contenttype, tempo);
else
strcpybuff(retour->contenttype,
"application/octet-stream-unknown"); // erreur
}
}
} else if ((p = strfield(rcvd, "Content-Range:")) != 0) {
// Content-Range: bytes 0-70870/70871
@@ -4310,6 +4314,7 @@ int give_mimext(char *s, size_t ssize, const char *st) {
int ok = 0;
int j = 0;
st = hts_effective_mime(st); /* no declared type: derive an html ext */
s[0] = '\0';
while((!ok) && (strnotempty(hts_mime[j][1]))) {
if (strfield2(hts_mime[j][0], st)) {

View File

@@ -481,10 +481,22 @@ HTS_STATIC int strcmpnocase(const char *a, const char *b) {
// is this MIME an hypertext MIME (text/html), html/js-style or other script/text type?
#define HTS_HYPERTEXT_DEFAULT_MIME "text/html"
/* Sentinel stored when the server declared no Content-Type. It is html-ish
for every type test (so a typeless response still parses/stores as today),
but the naming code (wire_patches_ext) treats it as "no declared type" and
keeps the URL extension. It rides the cache, so updates name consistently. */
#define HTS_UNKNOWN_MIME "unknown/unknown"
/* Map the no-declared-type sentinel back to a real type for any header or
record we EMIT or PERSIST, so "unknown/unknown" never reaches a consumer
(a served Content-Type, a ProxyTrack .arc record, ...). */
#define hts_effective_mime(m) \
(strfield2((m), HTS_UNKNOWN_MIME) ? HTS_HYPERTEXT_DEFAULT_MIME : (m))
#define is_html_mime_type(a) \
( (strfield2((a),"text/html")!=0)\
|| (strfield2((a),"application/xhtml+xml")!=0) \
#define is_html_mime_type(a) \
((strfield2((a), "text/html") != 0) || \
(strfield2((a), "application/xhtml+xml") != 0) || \
(strfield2((a), HTS_UNKNOWN_MIME) != \
0) /* no declared type: treat as html */ \
)
#define is_hypertext_mime__(a) \
( \

View File

@@ -138,6 +138,35 @@ static void cleanEndingSpaceOrDot(char *s) {
}
}
/* Should the wire Content-Type override the URL's own extension when naming the
saved file? True when the type is patchable (may_unknown2) and either the URL
extension implies no specific type or the server declared a disagreeing one.
A URL extension mapping to a specific non-HTML type is kept only when the
server declared NO type (the HTS_UNKNOWN_MIME sentinel; the #267 mangle
guard): a typeless .png stays .png, but a .pdf explicitly served as text/html
is named .html. The sentinel rides the cache, so updates stay consistent. */
static int wire_patches_ext(httrackp *opt, const char *wiremime,
const char *file) {
char urlmime[256];
if (may_unknown2(opt, wiremime, file))
return 0; /* type kept verbatim (keep-list / bogus-multiple) */
urlmime[0] = '\0';
/* type implied by the URL extension, only when confidently known (flag 0) */
if (!get_httptype_sized(opt, urlmime, sizeof(urlmime), file, 0))
return 1; /* URL ext implies no known type: trust the wire type */
if (strfield2(wiremime, urlmime))
return 0; /* wire agrees with the ext: keep it (no .htm->.html churn) */
/* wire disagrees with a specific non-HTML URL ext. Keep the ext only when
the server declared no type (the sentinel); an explicitly declared type,
even text/html, is trusted, so a binary-looking URL that really serves
HTML (login/error interstitial, soft-404) is named .html. */
if (!is_hypertext_mime(opt, urlmime, file) &&
strfield2(wiremime, HTS_UNKNOWN_MIME))
return 0;
return 1;
}
// forme le nom du fichier à sauver (save) à partir de fil et adr
// système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html)
int url_savename(lien_adrfilsave *const afs,
@@ -325,7 +354,10 @@ int url_savename(lien_adrfilsave *const afs,
}
/* replace shtml to html.. */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD)
/* HARD delays every type, except one the user pinned with --assume: honor it
immediately (ishtml() consults the user type), no delayed name (#56) */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
!is_userknowntype(opt, fil))
is_html = -1; /* ALWAYS delay type */
else
is_html = ishtml(opt, fil);
@@ -380,7 +412,7 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, r.cdispo);
} else if (!may_unknown2(opt, r.contenttype, fil)) { // on peut patcher à priori?
} else if (wire_patches_ext(opt, r.contenttype, fil)) {
if (give_mimext(s, sizeof(s),
r.contenttype)) { // recognized extension
ext_chg = 1;
@@ -425,7 +457,8 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(headers->r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, headers->r.cdispo);
} else if (!may_unknown2(opt, headers->r.contenttype, headers->url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type)
} else if (wire_patches_ext(opt, headers->r.contenttype,
headers->url_fil)) {
char s[16];
if (give_mimext(
s, sizeof(s),
@@ -641,7 +674,8 @@ int url_savename(lien_adrfilsave *const afs,
if (!has_been_moved) {
if (back[b].r.statuscode != -10) { // erreur
if (strnotempty(back[b].r.contenttype) == 0)
strcpybuff(back[b].r.contenttype, "text/html"); // message d'erreur en html
strcpybuff(back[b].r.contenttype,
HTS_UNKNOWN_MIME); // no declared type
// Finalement on, renvoie un erreur, pour ne toucher à rien dans le code
// libérer emplacement backing
}
@@ -653,7 +687,8 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(back[b].r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, back[b].r.cdispo);
} else if (!may_unknown2(opt, back[b].r.contenttype, back[b].url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type)
} else if (wire_patches_ext(opt, back[b].r.contenttype,
back[b].url_fil)) {
if (give_mimext(
s, sizeof(s),
back[b].r.contenttype)) { // recognized extension

View File

@@ -1176,11 +1176,15 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
if (element != NULL) {
msgCode = element->statuscode;
StringRoom(headers, 8192);
sprintf(StringBuffRW(headers), "HTTP/1.1 %d %s\r\n"
sprintf(StringBuffRW(headers),
"HTTP/1.1 %d %s\r\n"
#ifndef NO_WEBDAV
"%s"
#endif
"Content-Type: %s%s%s%s\r\n" "%s%s%s" "%s%s%s" "%s%s%s",
"Content-Type: %s%s%s%s\r\n"
"%s%s%s"
"%s%s%s"
"%s%s%s",
/* */
msgCode, element->msg,
#ifndef NO_WEBDAV
@@ -1188,16 +1192,18 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
StringBuff(davHeaders),
#endif
/* Content-type: foo; [ charset=bar ] */
element->contenttype,
hts_effective_mime(element->contenttype),
((element->charset[0]) ? "; charset=\"" : ""),
element->charset, ((element->charset[0]) ? "\"" : ""),
/* location */
((element->location != NULL
&& element->location[0]) ? "Location: " : ""),
((element->location != NULL
&& element->location[0]) ? element->location : ""),
((element->location != NULL
&& element->location[0]) ? "\r\n" : ""),
((element->location != NULL && element->location[0])
? "Location: "
: ""),
((element->location != NULL && element->location[0])
? element->location
: ""),
((element->location != NULL && element->location[0]) ? "\r\n"
: ""),
/* last-modified */
((element->lastmodified[0]) ? "Last-Modified: " : ""),
((element->lastmodified[0]) ? element->lastmodified : ""),
@@ -1205,8 +1211,7 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
/* etag */
((element->etag[0]) ? "ETag: " : ""),
((element->etag[0]) ? element->etag : ""),
((element->etag[0]) ? "\r\n" : "")
);
((element->etag[0]) ? "\r\n" : ""));
StringLength(headers) = (int) strlen(StringBuff(headers));
} else {
/* No query string, no ending / : check the the <url>/ page */

View File

@@ -52,6 +52,7 @@ Please visit our Website: http://www.httrack.com
#include "htscore.h"
#include "htsback.h"
#include "htslib.h" /* hts_effective_mime */
#include "store.h"
#include "proxystrings.h"
@@ -2289,10 +2290,17 @@ static int PT_SaveCache__Arc_Fun(void *arg, const char *url, PT_Element element)
int size_headers;
sprintf(st->headers,
"HTTP/1.0 %d %s" "\r\n" "X-Server: ProxyTrack " PROXYTRACK_VERSION
"\r\n" "Content-type: %s%s%s%s" "\r\n" "Last-modified: %s" "\r\n"
"Content-length: %d" "\r\n", element->statuscode, element->msg,
/**/ element->contenttype,
"HTTP/1.0 %d %s"
"\r\n"
"X-Server: ProxyTrack " PROXYTRACK_VERSION "\r\n"
"Content-type: %s%s%s%s"
"\r\n"
"Last-modified: %s"
"\r\n"
"Content-length: %d"
"\r\n",
element->statuscode, element->msg,
/**/ hts_effective_mime(element->contenttype),
(element->charset[0] ? "; charset=\"" : ""),
(element->charset[0] ? element->charset : ""),
(element->charset[0] ? "\"" : ""), /**/ element->lastmodified,
@@ -2328,10 +2336,10 @@ static int PT_SaveCache__Arc_Fun(void *arg, const char *url, PT_Element element)
/* args */
(link_has_authority(url) ? "" : "http://"), url, "0.0.0.0",
tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, tm->tm_hour,
tm->tm_min, tm->tm_sec, element->contenttype, element->statuscode,
st->md5, (element->location ? element->location : "-"),
(long int) ftell(fp), st->filename,
(long int) (size_headers + element->size));
tm->tm_min, tm->tm_sec, hts_effective_mime(element->contenttype),
element->statuscode, st->md5,
(element->location ? element->location : "-"), (long int) ftell(fp),
st->filename, (long int) (size_headers + element->size));
/* network_doc */
if (fwrite(st->headers, 1, size_headers, fp) != size_headers
|| (element->size > 0

15
tests/13_local-cookies.test Executable file
View File

@@ -0,0 +1,15 @@
#!/bin/bash
#
# Cookie chain against the local test server (replaces the old online
# ut/cookies/*.php fixtures). entrance.php sets cat/cake; second.php checks
# them and sets badger; third.php checks all three. A missing or wrong cookie
# returns 500, which would surface as an httrack error and a missing file, so a
# clean 3-files/0-errors run proves the cookie jar is replayed across links.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
--found 'cookies/entrance.html' \
--found 'cookies/second.html' \
--found 'cookies/third.html' \
httrack 'BASEURL/cookies/entrance.php'

18
tests/14_local-https.test Executable file
View File

@@ -0,0 +1,18 @@
#!/bin/bash
#
# HTTPS crawl against the local test server, using the shipped self-signed
# cert. httrack does not verify certs (htslib.c: SSL_CTX_new with no
# SSL_CTX_set_verify), so the self-signed cert is accepted as-is and this
# exercises the real TLS path offline. basic.html links to link.html with four
# distinct query strings, each saved under a hashed name -> 5 files.
: "${top_srcdir:=..}"
if test "$HTTPS_SUPPORT" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
bash "$top_srcdir/tests/local-crawl.sh" --tls --errors 0 --files 5 \
--found 'simple/basic.html' \
httrack 'BASEURL/simple/basic.html'

25
tests/15_local-types.test Normal file
View File

@@ -0,0 +1,25 @@
#!/bin/bash
#
# Content-Type vs URL-extension naming (issue #267 family) under the default
# delayed type check (-%N2). Policy: a MISSING Content-Type must not clobber a
# URL extension that maps to a specific non-HTML type (.png/.pdf stay as-is);
# an explicitly DECLARED type is trusted, so a binary-looking URL that really
# serves HTML (text/html on .pdf/.jpg) is named .html. The "wrong" names are
# asserted absent so a regression in either direction fails here.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/notype.png' --not-found 'types/notype.html' \
--found 'types/notype.pdf' --not-found 'types/notype.html' \
--found 'types/photo.png' \
--found 'types/doc.pdf' \
--found 'types/lie.html' --not-found 'types/lie.png' \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/page.htm' --not-found 'types/page.html' \
--found 'types/script.js' \
--found 'types/style.css' \
--found 'types/data.json' \
--found 'types/control.html' --not-found 'types/control.php' \
--found 'types/gend61c.png' --not-found 'types/gend61c.html' \
httrack 'BASEURL/types/index.html'

View File

@@ -0,0 +1,11 @@
#!/bin/bash
#
# --assume under the default delayed type check (-%N2), issue #56. A user type
# pinned with --assume must be honored immediately, not lost to the delayed
# name: photo.png served as image/png but assumed text/html is saved as .html.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/photo.html' --not-found 'types/photo.png' \
httrack 'BASEURL/types/photo.png' --assume png=text/html

View File

@@ -0,0 +1,12 @@
#!/bin/bash
#
# An empty "Content-Type:" header value must be treated as "no usable type"
# (keep the URL extension), not parsed from an uninitialized buffer. The crawl
# also runs under ASan/UBSan in CI, which catches the uninitialized read this
# guards against.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/emptyct.png' --not-found 'types/emptyct.html' \
httrack 'BASEURL/types/index.html'

View File

@@ -0,0 +1,15 @@
#!/bin/bash
#
# A second (update) pass must keep the names the first crawl chose. The stored
# Content-Type rides the cache, so the update reads back the same value -- the
# unknown/unknown sentinel for a typeless response, the declared type otherwise
# -- and names consistently: a declared-text/html .pdf stays .html and a
# typeless .png stays .png across the update rather than reverting.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/notype.png' --not-found 'types/notype.html' \
--found 'types/lie.html' \
httrack 'BASEURL/types/index.html'

View File

@@ -3,6 +3,8 @@
# silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
local-crawl.sh local-server.py server.crt server.key \
server-root/simple/basic.html server-root/simple/link.html \
fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT =
@@ -47,6 +49,12 @@ TESTS = \
11_crawl-longurl.test \
11_crawl-parsing.test \
12_crawl_https.test \
13_crawl_proxy_https.test
13_crawl_proxy_https.test \
13_local-cookies.test \
14_local-https.test \
15_local-types.test \
16_local-assume.test \
17_local-empty-ct.test \
18_local-update.test
CLEANFILES = check-network_sh.cache

253
tests/local-crawl.sh Executable file
View File

@@ -0,0 +1,253 @@
#!/bin/bash
#
# Launcher for httrack crawl tests against the local Python test server.
#
# Starts tests/local-server.py on an ephemeral port, discovers the port from
# the server's stdout, then runs httrack against http(s)://127.0.0.1:$PORT and
# audits the mirror. The server is always killed and the tmpdir removed on exit.
#
# The token BASEURL in any httrack argument is replaced with the discovered
# http(s)://127.0.0.1:$PORT base. --found/--directory paths are relative to the
# discovered host root (127.0.0.1_<port>/), since the random port leaks into
# the mirror directory name.
#
# Usage:
# bash local-crawl.sh [--tls] [--root DIR] \
# --errors N --files N --found PATH ... --directory PATH ... \
# httrack BASEURL/some/path [httrack-args...]
set -u
testdir=$(cd "$(dirname "$0")" && pwd)
server="${testdir}/local-server.py"
root="${LOCAL_SERVER_ROOT:-${testdir}/server-root}"
cert="${testdir}/server.crt"
key="${testdir}/server.key"
tls=
verbose=
rerun=
tmpdir=
serverpid=
crawlpid=
function warning {
echo "** $*" >&2
return 0
}
function die {
warning "$*"
exit 1
}
function debug {
test -n "$verbose" && echo "$*" >&2
return 0
}
function info { printf "[%s] ..\t" "$*" >&2; }
function result { echo "$*" >&2; }
function cleanup {
if test -n "$crawlpid"; then
kill -9 "$crawlpid" 2>/dev/null
crawlpid=
fi
if test -n "$serverpid"; then
kill "$serverpid" 2>/dev/null
# Reap it so the port is released before we rm the tmpdir/log.
wait "$serverpid" 2>/dev/null
serverpid=
fi
if test -n "$tmpdir" && test -d "$tmpdir"; then
test -n "$nopurge" || rm -rf "$tmpdir"
fi
}
function assert_equals {
info "$1"
if test ! "$2" == "$3"; then
result "expected '$2', got '$3'"
exit 1
fi
result "OK ($2)"
}
nopurge=
trap cleanup EXIT HUP INT QUIT PIPE TERM
# python3 is required; mirror check-network.sh's skip-with-77 convention.
command -v python3 >/dev/null || ! echo "python3 not found; skipping local crawl tests" || exit 77
tmptopdir=${TMPDIR:-/tmp}
test -d "$tmptopdir" || mkdir -p "$tmptopdir" || die "no temporary directory; set TMPDIR"
tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create tmpdir"
# --- parse leading control flags --------------------------------------------
declare -a audit=()
scheme=http
pos=0
args=("$@")
nargs=$#
while test "$pos" -lt "$nargs"; do
case "${args[$pos]}" in
--debug) verbose=1 ;;
--rerun) rerun=1 ;; # run httrack a second time (update pass) before auditing
--no-purge)
nopurge=1
audit+=("--no-purge")
;;
--tls)
tls=1
scheme=https
;;
--root)
pos=$((pos + 1))
root="${args[$pos]}"
;;
--errors | --files)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
--found | --not-found | --directory)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
httrack)
pos=$((pos + 1))
break
;;
*) die "unrecognized option ${args[$pos]}" ;;
esac
pos=$((pos + 1))
done
# --- start the server --------------------------------------------------------
test -r "$server" || die "cannot read $server"
serverlog="${tmpdir}/server.log"
serverargs=(--root "$root")
if test -n "$tls"; then
serverargs+=(--tls --cert "$cert" --key "$key")
fi
debug "starting python3 $server ${serverargs[*]}"
python3 "$server" "${serverargs[@]}" >"$serverlog" 2>&1 &
serverpid=$!
# Wait for the "PORT <n>" line (server prints it once bound).
port=
for _ in $(seq 1 50); do
if test -s "$serverlog"; then
line=$(head -n1 "$serverlog")
if test "${line%% *}" == "PORT"; then
port="${line#PORT }"
break
fi
fi
kill -0 "$serverpid" 2>/dev/null || die "server exited early: $(cat "$serverlog")"
sleep 0.1
done
test -n "$port" || die "could not discover server port: $(cat "$serverlog")"
debug "server listening on ${scheme}://127.0.0.1:${port}"
baseurl="${scheme}://127.0.0.1:${port}"
# --- substitute BASEURL in the remaining (httrack) args ----------------------
declare -a hts=()
while test "$pos" -lt "$nargs"; do
hts+=("${args[$pos]//BASEURL/$baseurl}")
pos=$((pos + 1))
done
# --- run httrack -------------------------------------------------------------
which httrack >/dev/null || die "could not find httrack"
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || die "could not run httrack"
out="${tmpdir}/crawl"
mkdir "$out" || die "could not create $out"
# Localhost is fast; disable the rate/bandwidth safety limits but keep a
# max-time backstop so a hang cannot wedge the suite.
declare -a moreargs=(--quiet --max-time=120 --timeout=30 --disable-security-limits --robots=0)
log="${tmpdir}/log"
info "running httrack ${hts[*]}"
httrack -O "$out" --user-agent="httrack $ver local ($(uname -omrs))" "${moreargs[@]}" "${hts[@]}" >"$log" 2>&1 &
crawlpid=$!
wait "$crawlpid"
crawlres=$?
crawlpid=
# httrack exits 0 even on hard connect/DNS errors, so this is a backstop only;
# the real guard is the audit below (--errors 0 plus the host-root existence check).
test "$crawlres" -eq 0 || ! result "httrack exited $crawlres" || {
cat "$log" >&2
exit 1
}
result "OK"
grep -iE "^[0-9:]*[[:space:]]Error:" "${out}/hts-log.txt" >&2
# --- optional second pass: re-mirror into the same dir (cache/update path) ----
if test -n "$rerun"; then
info "re-running httrack (update pass)"
httrack -O "$out" --user-agent="httrack $ver local ($(uname -omrs))" \
"${moreargs[@]}" "${hts[@]}" >"${log}.2" 2>&1 &
crawlpid=$!
wait "$crawlpid"
crawlres=$?
crawlpid=
test "$crawlres" -eq 0 || ! result "update pass exited $crawlres" || {
cat "${log}.2" >&2
exit 1
}
result "OK (update)"
fi
# --- discover the single host root (127.0.0.1_<port> or 127.0.0.1) -----------
hostroot=
for cand in "${out}/127.0.0.1_${port}" "${out}/127.0.0.1"; do
if test -d "$cand"; then
hostroot="$cand"
break
fi
done
test -n "$hostroot" || die "could not find host root under $out"
debug "host root: $hostroot"
# --- audit -------------------------------------------------------------------
i=0
while test "$i" -lt "${#audit[@]}"; do
case "${audit[$i]}" in
--errors)
i=$((i + 1))
assert_equals "checking errors" "${audit[$i]}" \
"$(grep -iEc "^[0-9:]*[[:space:]]Error:" "${out}/hts-log.txt")"
;;
--files)
i=$((i + 1))
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${out}/hts-log.txt" |
sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "${audit[$i]}" "$nFiles"
;;
--found)
i=$((i + 1))
info "checking for ${audit[$i]}"
if test -f "${hostroot}/${audit[$i]}"; then result "OK"; else
result "not found"
exit 1
fi
;;
--not-found)
i=$((i + 1))
info "checking absence of ${audit[$i]}"
if test ! -f "${hostroot}/${audit[$i]}"; then result "OK"; else
result "present"
exit 1
fi
;;
--directory)
i=$((i + 1))
info "checking for dir ${audit[$i]}"
if test -d "${hostroot}/${audit[$i]}"; then result "OK"; else
result "not found"
exit 1
fi
;;
esac
i=$((i + 1))
done

254
tests/local-server.py Executable file
View File

@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""Self-contained local web server for httrack's crawl tests.
Serves static fixtures from a docroot plus a handful of dynamic endpoints
(cookies, ...) so httrack can be exercised over loopback, deterministically and
offline, instead of crawling the live ut.httrack.com.
Binds to an ephemeral port (port 0) and prints the chosen port to stdout as
"PORT <n>\n" so a launcher can discover it. Pass --tls to wrap the socket with
the shipped self-signed test cert; httrack does not verify certs, so no CA
trust plumbing is needed.
stdlib only (http.server + ssl) -- no new build or runtime dependency.
"""
import argparse
import os
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote, unquote, urlsplit
# Cookie chain replicated from the old ut/cookies/*.php fixtures.
COOKIE_PATH = "/cookies/"
COOKIES = {
"cat": "dog",
"cake": "is a lie!",
"badger": "mushroom, with 'ants'",
}
PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
\t<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
\t<title>Sample test</title>
</head>
<body>
{body}
</body>
</html>
"""
class Handler(SimpleHTTPRequestHandler):
# Quieter logging; the launcher captures httrack's own log anyway.
def log_message(self, fmt, *args):
if os.environ.get("LOCAL_SERVER_VERBOSE"):
super().log_message(fmt, *args)
# --- helpers -----------------------------------------------------------
def request_cookies(self):
"""Parse the Cookie header into {name: decoded-value}.
Mirrors PHP's $_COOKIE: values are url-decoded, matching the encoding
applied when the cookie was set (see set_cookie)."""
jar = {}
raw = self.headers.get("Cookie", "")
for pair in raw.split(";"):
pair = pair.strip()
if "=" in pair:
name, value = pair.split("=", 1)
jar[name.strip()] = unquote(value.strip())
return jar
def set_cookie(self, name, value):
"""Queue a Set-Cookie header, url-encoding the value like PHP's
setcookie() so spaces/quotes/commas stay a single token that httrack
can store and replay verbatim."""
self._set_cookies.append(f"{name}={quote(value)}; Path={COOKIE_PATH}")
def send_html(self, body, status=200, extra_status=None):
encoded = PAGE.format(body=body).encode("utf-8")
self.send_response(status, extra_status)
self.send_header("Content-Type", "text/html; charset=utf-8")
self.send_header("Content-Length", str(len(encoded)))
for cookie in self._set_cookies:
self.send_header("Set-Cookie", cookie)
self.end_headers()
if self.command != "HEAD":
self.wfile.write(encoded)
def fail_cookie(self, what):
# The old PHPs answered 500 with the reason in the status line.
self.send_html("", status=500, extra_status=f"The {what} is missing or invalid")
# --- dynamic routes ----------------------------------------------------
def route_entrance(self):
self.set_cookie("cat", COOKIES["cat"])
self.set_cookie("cake", COOKIES["cake"])
self.send_html('\tThis is a <a href="second.php">link</a>')
def route_second(self):
jar = self.request_cookies()
if jar.get("cat") != COOKIES["cat"]:
return self.fail_cookie("cat")
if jar.get("cake") != COOKIES["cake"]:
return self.fail_cookie("cake")
self.set_cookie("badger", COOKIES["badger"])
self.send_html('\tThis is a <a href="third.php">link</a>')
def route_third(self):
jar = self.request_cookies()
if jar.get("cat") != COOKIES["cat"]:
return self.fail_cookie("cat")
if jar.get("cake") != COOKIES["cake"]:
return self.fail_cookie("cake")
if jar.get("badger") != COOKIES["badger"]:
return self.fail_cookie("badger")
self.send_html("\tThis is a test.")
def route_robots(self):
body = b"User-agent: *\nDisallow:\n"
self.send_response(200)
self.send_header("Content-Type", "text/plain")
self.send_header("Content-Length", str(len(body)))
self.end_headers()
if self.command != "HEAD":
self.wfile.write(body)
# --- type/extension matrix (issue #267 family) -------------------------
def send_raw(self, body, content_type):
"""Send a raw body with an explicit Content-Type, or none at all when
content_type is None (to observe httrack's typeless-file naming)."""
self.send_response(200)
if content_type is not None:
self.send_header("Content-Type", content_type)
self.send_header("Content-Length", str(len(body)))
self.end_headers()
if self.command != "HEAD":
self.wfile.write(body)
# Fake-binary blobs for the image/pdf/typeless cases.
FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64
FAKE_PDF = b"%PDF-1.4\n" + b"\x00" * 64
# path -> (body, content_type); None sends no header, "" sends an empty
# Content-Type value (no usable type, must be treated like None).
TYPE_MATRIX = {
"/types/control.php": (b"<html><body>control</body></html>", "text/html"),
"/types/photo.png": (FAKE_PNG, "image/png"),
"/types/doc.pdf": (FAKE_PDF, "application/pdf"),
"/types/notype.png": (FAKE_PNG, None),
"/types/notype.pdf": (FAKE_PDF, None),
"/types/emptyct.png": (FAKE_PNG, ""),
"/types/lie.png": (FAKE_PNG, "text/html"),
"/types/report.pdf": (b"<html><body>real page</body></html>", "text/html"),
"/types/page.htm": (b"<html><body>htm page</body></html>", "text/html"),
"/types/script.js": (b"var x = 1;\n", "application/javascript"),
"/types/style.css": (b"body { color: red; }\n", "text/css"),
"/types/data.json": (b'{"k": "v"}\n', "application/json"),
"/types/gen.php": (FAKE_PNG, "image/png"),
}
def route_types_index(self):
body = (
'\t<a href="control.php">control</a>\n'
'\t<img src="photo.png" />\n'
'\t<a href="doc.pdf">doc</a>\n'
'\t<img src="notype.png" />\n'
'\t<a href="notype.pdf">notypepdf</a>\n'
'\t<img src="emptyct.png" />\n'
'\t<img src="lie.png" />\n'
'\t<a href="report.pdf">report</a>\n'
'\t<a href="page.htm">htm</a>\n'
'\t<script src="script.js"></script>\n'
'\t<link rel="stylesheet" href="style.css" />\n'
'\t<a href="data.json">json</a>\n'
'\t<img src="gen.php?id=5" />\n'
)
self.send_html(body)
def route_types(self):
path = urlsplit(self.path).path
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
"/cookies/third.php": route_third,
"/robots.txt": route_robots,
"/types/index.html": route_types_index,
"/types/control.php": route_types,
"/types/photo.png": route_types,
"/types/doc.pdf": route_types,
"/types/notype.png": route_types,
"/types/notype.pdf": route_types,
"/types/emptyct.png": route_types,
"/types/lie.png": route_types,
"/types/report.pdf": route_types,
"/types/page.htm": route_types,
"/types/script.js": route_types,
"/types/style.css": route_types,
"/types/data.json": route_types,
"/types/gen.php": route_types,
}
# --- dispatch ----------------------------------------------------------
def dispatch(self):
self._set_cookies = []
path = urlsplit(self.path).path
handler = self.ROUTES.get(path)
if handler is not None:
handler(self)
return True
return False
def do_GET(self):
if not self.dispatch():
super().do_GET()
def do_HEAD(self):
if not self.dispatch():
super().do_HEAD()
def main():
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--root", required=True, help="docroot for static files")
parser.add_argument("--bind", default="127.0.0.1", help="bind address")
parser.add_argument("--tls", action="store_true", help="serve HTTPS")
parser.add_argument("--cert", help="TLS certificate (PEM)")
parser.add_argument("--key", help="TLS private key (PEM)")
args = parser.parse_args()
root = os.path.abspath(args.root)
def factory(*a, **kw):
return Handler(*a, directory=root, **kw)
httpd = ThreadingHTTPServer((args.bind, 0), factory)
if args.tls:
import ssl
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile=args.cert, keyfile=args.key)
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
port = httpd.socket.getsockname()[1]
# The launcher reads this line to discover the ephemeral port.
print(f"PORT {port}", flush=True)
try:
httpd.serve_forever()
except KeyboardInterrupt:
pass
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,18 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Sample test</title>
</head>
<body>
This is a <a href="link.html?v=1">link</a>
This is a <a href='link.html?v=2'>link</a>
This is a <a href="./link.html?v=3">link</a>
This is a <a href=link.html?v=4>link</a>
</body>

View File

@@ -0,0 +1,3 @@
This is a link.
Go back to <a href="basic.html">home</a>.

21
tests/server.crt Normal file
View File

@@ -0,0 +1,21 @@
-----BEGIN CERTIFICATE-----
MIIDbzCCAlegAwIBAgIUdWkDDomnY3WW95UqJ+UOASuR/i0wDQYJKoZIhvcNAQEL
BQAwODESMBAGA1UEAwwJMTI3LjAuMC4xMSIwIAYDVQQKDBlIVFRyYWNrIGxvY2Fs
IHRlc3Qgc2VydmVyMCAXDTI2MDYxNTE0NDQxMFoYDzIwNTYwNjA3MTQ0NDEwWjA4
MRIwEAYDVQQDDAkxMjcuMC4wLjExIjAgBgNVBAoMGUhUVHJhY2sgbG9jYWwgdGVz
dCBzZXJ2ZXIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDx78mogNhT
noWwRa51NeGtapQ1PfTYLlIMUzuloFXOsR1/ozRkFucqHNftF22wf0gg4VQJSBSf
3rwj79vsnt3nyaD03bTAafpHXkd+IJxQowiG8TfOJF0R/Qg9g7DCE66R9agQpMJC
SGxIin9p/4ld4Hn6869d4hNq4fHxNf/qkj2cnf8DYxrldz2FGsi6yMed4tzz2Am4
ZbPgwep+fy843ZdYrVIms9vJluNa9E+6Vpw9FwdjzQ/IBBMLvGaC2pDkc95YelaE
nQrAlTO/0l5vjc8XuTQFlo3DbUg+WEld/pxvCqsd/q1mqjL0WbxtXl2zCwGzAoJx
rjVEPfA8QSbtAgMBAAGjbzBtMB0GA1UdDgQWBBTHE0KKW8REV4HxajzVsIBxz3iL
9zAfBgNVHSMEGDAWgBTHE0KKW8REV4HxajzVsIBxz3iL9zAPBgNVHRMBAf8EBTAD
AQH/MBoGA1UdEQQTMBGHBH8AAAGCCWxvY2FsaG9zdDANBgkqhkiG9w0BAQsFAAOC
AQEAYlTEftrwGJBXuPmtxhmtw2HO/VTC4TGnq67hH5H+ptwgZJuuxCQ5KW6flTyp
FTyMhha33WD4EBL3wqqJsWr9Y4BXqi4G0lRqXBcC1oIUa2VYIDMER7kaY1qTSqE8
ARpwdB2BhvngAzDLc+4Jt4jQMRGr8fHAwxpDBoIZ1knbyzYNP73Bajse6/8YtxUu
nB2BsldjZnLvyHvRxUpWp92OyQih4jYSrlN6olDFlKDg7++kMhkHtJQW9a1t54VN
0ZXrB1ZRuHUUvGBq26x71riTWor7HNOSQaGeCMQjZNQkh5tfshNygUGSZVXTEwhG
xSrOL7NqBt2+EkVwf7LjGzjmBw==
-----END CERTIFICATE-----

28
tests/server.key Normal file
View File

@@ -0,0 +1,28 @@
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQDx78mogNhTnoWw
Ra51NeGtapQ1PfTYLlIMUzuloFXOsR1/ozRkFucqHNftF22wf0gg4VQJSBSf3rwj
79vsnt3nyaD03bTAafpHXkd+IJxQowiG8TfOJF0R/Qg9g7DCE66R9agQpMJCSGxI
in9p/4ld4Hn6869d4hNq4fHxNf/qkj2cnf8DYxrldz2FGsi6yMed4tzz2Am4ZbPg
wep+fy843ZdYrVIms9vJluNa9E+6Vpw9FwdjzQ/IBBMLvGaC2pDkc95YelaEnQrA
lTO/0l5vjc8XuTQFlo3DbUg+WEld/pxvCqsd/q1mqjL0WbxtXl2zCwGzAoJxrjVE
PfA8QSbtAgMBAAECggEACgNK4klq1T3IpKdNoBY5yoE7CbUQZBNkBpSPRxHgBezj
SVFfgrZGnOySrIJSt4JHtuynG2Hl+0ku74HRep/ck+eOsh5W3mZvGvMLnGxhwR3u
Or99osTIgU0VQTkpC0SLQ16FCnih0uJycNIikdLR7uuya1tt1OyIBzK7XlNGIywT
p85zJc7/6TfTC9eM7lqh7JGR7KplBxSvgZL1pUr7y4rNpKms6uzOvPND79CcKnbU
BBA9Tu4qdOkoOljsZKkvh3pihxyG9X6d8QTZ/uX3pkvliwSFBc+Sz9EootA3/4r5
gVWpQ2t/AY7fY4hqzLIX/HivVaPj3cWk1G+SHm0XNQKBgQD5I9rijqFvV/p6FmUl
FbnjJFFHHgZLivlGxAC5vOyJNQQaqdeDzg7yMotNmQTggVGjT6sjdosQb3n+ctuk
EhQnZSU5VkNKv1+PTR35WrRkaECCaqz3Pv79pV9GVcX3it7UuYjNiOeSPqINWe+X
49JwnJFz+qQ1BchAwOis4zkENwKBgQD4mShDaYLOO97VpgZj4cGxHHWyEK9CRQvp
I7HxRmfaWS3JHwb88lOmALEU6pAj5cYJPAznv8BnUWcVHalZbkQ1JWYtUJRqj6OI
Ym7rw/nm4Ay5ijbdEism173dSk3IjOe+PdAlxzsOuVzYdBTqElmeQWtBzhY9aHvX
r+A02C2j+wKBgHHDo6Gsi57yR5gUPd9vSlCkNtEIrss0DJv5yHMIB+KnaNZcE+NF
5qFF30Jxyz5RDtxJ9tXcvaeln8lG3XDQKI/MqfDCqTuqo5ImHrfMaW8oA70JxS2p
gHqGVzkg1aMxsIrmpcdk6olnPExocvWivGdbtzeEjhMALu8Sp6y6nUCFAoGBAK5h
KLgYw/OMVaQCIMthaa+l6f0s7PMMYe1453H6VBD6qz4/8HPwO7LfG1gzrUYxADgs
ElVh0UHn/On383nS+i9Ze5Hfyyvwc+LQQURKJPrJQMPJavCptPE7NmiKnYNHK6vr
yh0l4oxShAklbCJBGvICq4zuVfVfXDeQnDIVTfaPAoGBAMCrZqYdOUhUu+aUqxZq
qO/TTQxrxftU63jGUg+o042TdgI4KWLn07wvHJ8/E2OqF35eXenvcuKbNLI1l72J
4cp+3cUv8iAXThTRYEztr5CS/wta4o4CNN8zfjn5dV9AI4Hmt4V7EaGWpBcViGbj
n0Mhag+dO8DHuenqi1yfMrAt
-----END PRIVATE KEY-----

View File

@@ -20,6 +20,9 @@
# Options:
# -k, --key KEYID GPG key for signing (default: $DEBSIGN_KEYID)
# -o, --outdir DIR output directory (default: <repo>/dist)
# --orig FILE reuse this upstream orig tarball instead of
# regenerating it (required for a Debian revision
# >= 2, whose orig is frozen in the archive)
# -s, --source-only build only the source package
# -u, --unsigned do not sign anything (implies no release sigs)
# --no-release-artifacts skip the orig tarball .asc/.md5/.sha1
@@ -34,6 +37,10 @@
# chroot for the changelog's distribution; create one once with the companion
# tools/mk-sbuild-chroot.sh (rootless unshare backend).
#
# The Debian revision in debian/changelog decides the orig: revision 1 builds a
# fresh upstream tarball; revision >= 2 must reuse the orig frozen at revision 1
# (the .dsc references it by checksum), so pass it with --orig.
#
# SOURCE_DATE_EPOCH is honored for reproducible output.
set -euo pipefail
@@ -66,6 +73,7 @@ need() {
main() {
local key=${DEBSIGN_KEYID:-}
local outdir=""
local orig_in=""
local source_only=0
local unsigned=0
local release_artifacts=1
@@ -83,6 +91,11 @@ main() {
outdir=$2
shift 2
;;
--orig)
[[ $# -ge 2 ]] || die "missing argument for $1"
orig_in=$2
shift 2
;;
-s | --source-only)
source_only=1
shift
@@ -109,8 +122,8 @@ main() {
esac
done
need git autoreconf debuild dcmd
[[ $sbuild -eq 1 ]] && need sbuild dpkg-parsechangelog
need git autoreconf debuild dcmd dpkg-parsechangelog
[[ $sbuild -eq 1 ]] && need sbuild
if [[ $unsigned -eq 0 ]]; then
need gpg
[[ -n $key ]] || die "no signing key (pass --key or set DEBSIGN_KEYID, or use --unsigned)"
@@ -122,6 +135,11 @@ main() {
mkdir -p "$outdir"
outdir=$(cd "$outdir" && pwd)
if [[ -n $orig_in ]]; then
[[ -r $orig_in ]] || die "--orig file not readable: $orig_in"
orig_in=$(cd "$(dirname "$orig_in")" && pwd)/$(basename "$orig_in")
fi
scratch=$(mktemp -d "${TMPDIR:-/tmp}/httrack-mkdeb.XXXXXX")
trap 'rm -rf -- "$scratch"' EXIT
@@ -133,45 +151,65 @@ main() {
git -C "$repo/src/coucal" archive --format=tar --prefix=src/coucal/ HEAD |
tar -x -C "$export_dir"
# Refresh build system and man page, then build the tarball. We build here
# only because regen-man needs the compiled binaries; the test suite is not
# run in this pass. debuild (below) runs the full suite once, with the online
# tests enabled, so a check here would just be a slower, offline-only repeat.
info "regenerating build system and man page"
(
cd "$export_dir"
autoreconf -fi
./configure --quiet
make -s -j"$(nproc)"
make -s -C man regen-man
# Build the tarball from a clean tree so no object files leak into it.
make -s clean
make -s dist
)
# Upstream version and Debian revision drive the orig: revision 1 builds a
# fresh tarball, revision >= 2 reuses the one frozen at -1 (the .dsc pins it
# by checksum, so a regenerated orig with new mtimes would be rejected).
local fullver ver rev
fullver=$(cd "$export_dir" && dpkg-parsechangelog -S Version)
ver=${fullver%-*}
rev=${fullver##*-}
local orig=httrack_${ver}.orig.tar.gz
info "version $ver (Debian revision $rev)"
local tarball ver
local -a tarballs
shopt -s nullglob
tarballs=("$export_dir"/httrack-*.tar.gz)
shopt -u nullglob
[[ ${#tarballs[@]} -ge 1 ]] || die "make dist produced no tarball"
tarball=${tarballs[0]##*/}
ver=${tarball#httrack-}
ver=${ver%.tar.gz}
info "version $ver"
# A signed build is upload-bound, so a revision >= 2 must reuse the frozen
# orig (--orig); an unsigned build is a throwaway (CI, local) and may
# regenerate it, since it can never reach the archive.
if [[ -z $orig_in && $rev != 1 && $unsigned -eq 0 ]]; then
die "Debian revision $rev needs --orig FILE (the orig is frozen from revision 1)"
fi
if [[ -n $orig_in ]]; then
info "reusing upstream tarball $orig_in"
cp -- "$orig_in" "$scratch/$orig"
else
# Refresh build system and man page, then build the tarball. We build
# here only because regen-man needs the compiled binaries; the test
# suite is not run in this pass. debuild (below) runs the full suite
# once, online tests enabled, so a check here would just repeat it.
info "regenerating build system and man page"
(
cd "$export_dir"
autoreconf -fi
./configure --quiet
make -s -j"$(nproc)"
make -s -C man regen-man
# Build the tarball from a clean tree so no object files leak in.
make -s clean
make -s dist
)
local -a tarballs
shopt -s nullglob
tarballs=("$export_dir"/httrack-*.tar.gz)
shopt -u nullglob
[[ ${#tarballs[@]} -ge 1 ]] || die "make dist produced no tarball"
local tarball=${tarballs[0]##*/}
[[ $tarball == "httrack-$ver.tar.gz" ]] ||
die "changelog version $ver disagrees with built tarball $tarball (configure.ac mismatch?)"
cp -- "$export_dir/$tarball" "$scratch/$orig"
fi
# 3.0 (quilt): orig tarball is upstream-only; debian/ is overlaid on top.
local orig=httrack_${ver}.orig.tar.gz
cp -- "$export_dir/$tarball" "$scratch/$orig"
(
cd "$scratch"
tar -xf "$orig"
[[ -d httrack-$ver ]] || die "orig tarball does not unpack to httrack-$ver/"
cp -a "$export_dir/debian" "httrack-$ver/debian"
)
# Build (debuild also runs lintian and signs). --fail-on aborts on a lintian
# error or warning, so neither a release nor CI produces an unclean package.
local -a debuild_opts=(--lintian-opts -I -i "--fail-on=error,warning")
# Build and sign. debuild runs lintian too but does NOT propagate its exit
# status, so a broken package would pass unnoticed; disable it here and run
# lintian ourselves below as the real gate.
local -a debuild_opts=(--no-lintian)
local -a build_opts=()
[[ $source_only -eq 1 ]] && build_opts+=(-S)
if [[ $unsigned -eq 1 ]]; then
@@ -182,7 +220,8 @@ main() {
info "building packages with debuild"
(
cd "$scratch/httrack-$ver"
debuild "${build_opts[@]}" "${debuild_opts[@]}"
# debuild options (--no-lintian) must precede the dpkg-buildpackage ones
debuild "${debuild_opts[@]}" "${build_opts[@]}"
)
# Collect every file the .changes references (orig, dsc, debs, ddebs, buildinfo).
@@ -192,6 +231,16 @@ main() {
changes=("$scratch"/*.changes)
shopt -u nullglob
[[ ${#changes[@]} -ge 1 ]] || die "debuild produced no .changes file"
# The real lintian gate (debuild only reports, it does not fail on tags).
# --profile debian: CI runners are Ubuntu, whose vendor data would wrongly
# reject the Debian "unstable" distribution. newer-standards-version only
# means the local lintian is older than the buildds', not a package
# defect, so suppress it. set -e turns any error/warning tag into a failure.
info "running lintian gate (--fail-on=error,warning)"
lintian --profile debian -I -i --fail-on=error,warning \
--suppress-tags newer-standards-version "${changes[@]}"
dcmd cp -- "${changes[@]}" "$outdir/"
# Clean-room build gate: rebuild the source package in a minimal chroot that
@@ -219,8 +268,12 @@ main() {
fi
# Release artifacts for the upstream tarball (detached sig + checksums).
# A Debian revision >= 2 .changes omits the orig (it is already in the
# archive), so dcmd above won't have copied it; place it from the build tree
# so the website artifacts are produced regardless of the revision.
if [[ $release_artifacts -eq 1 && $unsigned -eq 0 ]]; then
info "signing upstream tarball"
cp -- "$scratch/$orig" "$outdir/$orig"
(
cd "$outdir"
gpg --armor --detach-sign --yes -u "$key" -- "$orig"