Compare commits

50 Commits

Author SHA1 Message Date
Xavier Roche
71bece09fd Add a mockable resolver backend and a DNS resolver/cache self-test
Route the getaddrinfo resolver through a swappable backend pair
(getaddrinfo/freeaddrinfo) that defaults to the libc resolver, so the
self-test can script DNS answers in-process with no network. The pair is
needed because a fake allocates its own addrinfo chain and must free it
with a matching deallocator.
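
A minimal sketch of such a backend pair, assuming a POSIX system. Every name here (resolver_backend, set_resolver_backend, host_resolves, fake_resolve, fake_release) is hypothetical and is not HTTrack's internal seam; it only illustrates why the resolver and its deallocator are swapped together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical types: same shapes as getaddrinfo()/freeaddrinfo(). */
typedef int (*resolve_fn)(const char *node, const char *service,
                          const struct addrinfo *hints,
                          struct addrinfo **res);
typedef void (*release_fn)(struct addrinfo *res);

/* The two halves travel together, so a chain is always freed by the
   deallocator that matches the allocator which built it. */
typedef struct {
    resolve_fn resolve;
    release_fn release;
} resolver_backend;

/* Default backend: the libc resolver. */
static resolver_backend backend = {getaddrinfo, freeaddrinfo};

/* Swap both halves at once; NULL restores the libc default. */
static void set_resolver_backend(const resolver_backend *b) {
    if (b != NULL) {
        assert(b->resolve != NULL && b->release != NULL);
        backend = *b;
    } else {
        backend.resolve = getaddrinfo;
        backend.release = freeaddrinfo;
    }
}

/* Resolve through the current backend; 1 if an answer came back. */
static int host_resolves(const char *host) {
    struct addrinfo *res = NULL;
    if (backend.resolve(host, NULL, NULL, &res) != 0 || res == NULL)
        return 0;
    backend.release(res);
    return 1;
}

/* Scripted fake: hands out its own calloc'd one-node chain (no address
   attached) and counts calls, so no network is touched. */
static int fake_calls = 0;

static int fake_resolve(const char *node, const char *service,
                        const struct addrinfo *hints,
                        struct addrinfo **res) {
    (void) node;
    (void) service;
    (void) hints;
    fake_calls++;
    *res = calloc(1, sizeof(**res));
    return *res != NULL ? 0 : EAI_MEMORY;
}

/* Matching deallocator for the fake's chain. */
static void fake_release(struct addrinfo *res) {
    free(res);
}
```

A test would install {fake_resolve, fake_release} with set_resolver_backend(), call host_resolves(), check fake_calls, then pass NULL to restore the libc pair.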

Drive it from a new 'httrack -#D' debug subcommand backed by
htsdns_selftest.c: a scripted getaddrinfo checks address family,
single-address selection, the -@i4/-@i6 family filter, negative caching,
and that a cached host is resolved only once. tests/01_engine-dns.test
runs it.

No behavior change: the default backend is the libc resolver, one
indirect call on the cold resolve path. The seam stays internal (no
HTSEXT_API), so the exported ABI is unchanged. This is groundwork for the
multi-address record and connect fallback that fix the dead-IPv6 connect
stall; the dual-stack assertion pins today's single-address behavior and
will widen with that change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-22 18:58:19 +02:00
Xavier Roche
54f5717057 Silence coucal hashtable stats on the default log handler (#416)
coucal_delete() logs a per-table stats summary at info level. For tables
without their own handler (webhttrack's NewLangStr/NewLangStrKeys), these
went through default_coucal_loghandler, which printed every level, so
plain `webhttrack` startup dumped two "hashtable ... summary:" lines to
the console.

Drop info-and-below messages there unless debugging is on (hts_dgb_init,
i.e. HTS_LOG / hts_debug); warnings and critical errors still always
print.
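
The filtering rule can be sketched as a single predicate. The level names and their ordering below are assumptions made for illustration (most severe first), not coucal's actual enum.

```c
#include <assert.h>

/* Hypothetical levels, most severe first (assumed ordering). */
typedef enum {
    LVL_CRITICAL,
    LVL_WARNING,
    LVL_INFO,
    LVL_DEBUG,
    LVL_TRACE
} log_level;

/* Warnings and critical errors always print; info and below print only
   when debugging is enabled. */
static int should_print(log_level level, int debug_enabled) {
    assert(level <= LVL_TRACE);
    if (level <= LVL_WARNING)
        return 1;
    return debug_enabled != 0;
}
```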

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:03:59 +02:00
Xavier Roche
40fc9de360 debian: rewrite copyright in DEP-5 format, credit all authors (#415)
The free-form debian/copyright credited only Xavier Roche, but the source
tree bundles code from other authors under additional licenses that went
unlisted: Yann Philippot (src/htsjava.*, GPL-3+), the minizip code by
Gilles Vollant, Mathias Svensson, Even Rouault and Info-ZIP (Zlib),
Colin Plumb's md5.c (public domain), and the coucal library (BSD-3-clause)
with Austin Appleby's murmurhash3.h (public domain).

Convert to machine-readable DEP-5 and add a Files stanza per component, as
requested in the Debian NEW review of 3.49.9-1.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:04:19 +02:00
Xavier Roche
4614eefefe Release 3.49.9 (#414)
Content-Type and file-type detection fixes, plus the C++ source-compat
restoration (int-backed hts_boolean/hts_tristate), since 3.49.8, with a batch of
internal build, packaging and test-harness improvements folded into one line.

The exported interface is unchanged versus 3.49.8 (same 164 symbols, no
struct-layout change), so VERSION_INFO is a revision bump (3:0:0 -> 3:1:0) and
the soname stays libhttrack.so.3.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 18:12:07 +02:00
Xavier Roche
b0e8262db0 htsglobal: int-back hts_boolean (and add hts_tristate) for C++ source-compat (#413)
hts_boolean was a real enum {HTS_FALSE=0, HTS_TRUE=1}. C++ forbids assigning a
plain int to an enum, so every C++ consumer of the public API broke on
`opt->field = 1` or `f(opt, 0)`. The exported surface that trips this is wide:
33 hts_boolean fields in struct httrackp, plus two exported parameters
(unescape_http_unharm's no_high, get_httptype_sized's flag). httraqt, the one
reverse-dependency, fails to build against the installed headers, which blocks
the libhttrack.so.2 -> .so.3 binNMU.

Make hts_boolean an int-backed typedef (HTS_FALSE/HTS_TRUE become macros),
fixing the whole exported boolean surface at once: an int is assignable from an
int in C and C++. It is safe here -- hts_boolean has a single definition,
nothing uses the `enum hts_boolean` form, and HTS_FALSE/HTS_TRUE are only ever
0/1 values. ABI is unchanged: an int-sized enum and an int have identical size
and alignment, so sizeof(struct httrackp) stays 141648 and the soname is
untouched.

Five of those fields are not booleans but tri-state {-1 unspecified, 0 off,
1 on}: the -1 is copy_htsopt()'s "leave the target's value" sentinel for merging
a partial options struct onto a fully-initialised one. The old unsigned enum
could not hold -1, so the copy_htsopt `> -1` guard was always true and the field
was always overwritten. Give them a named hts_tristate (int-backed, HTS_DEFAULT
= -1) so the sentinel works and reads honestly, and drop the now-pointless
`(int)` casts. Extend the -#9 copy_htsopt self-test to assert an HTS_DEFAULT
source is skipped.
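
A minimal sketch of the sentinel merge, assuming int-backed types as described above. The struct demo_options, its fields and merge_tristates() are hypothetical; this is not copy_htsopt() itself.

```c
#include <assert.h>
#include <stddef.h>

/* Int-backed, so `field = 1` compiles in both C and C++. */
typedef int hts_boolean;
#define HTS_FALSE 0
#define HTS_TRUE 1

/* Tri-state: -1 means "unspecified, leave the target alone". */
typedef int hts_tristate;
#define HTS_DEFAULT (-1)

/* Hypothetical options struct with two tri-state fields. */
typedef struct {
    hts_tristate first;
    hts_tristate second;
} demo_options;

/* Merge a partial struct onto a fully initialised one: a field is
   copied only when the source actually specifies it. */
static void merge_tristates(const demo_options *from, demo_options *to) {
    assert(from != NULL && to != NULL);
    if (from->first > HTS_DEFAULT)
        to->first = from->first;
    if (from->second > HTS_DEFAULT)
        to->second = from->second;
}
```

With a signed int behind the type, the `> HTS_DEFAULT` comparison can be false, which is what lets an unspecified source field be skipped.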

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:51:09 +02:00
Xavier Roche
addbd3136b Use an unknown/unknown sentinel for an absent Content-Type (#412)
#409 distinguished "the server declared text/html" from "no Content-Type,
defaulted to text/html" with a new htsblk.contenttype_given flag, so a
binary-looking URL that really serves HTML is saved .html while a typeless
response keeps its URL extension. That worked on a fresh crawl but had two
costs: the flag was never persisted, so on --update the cache read it as
unset and the names reverted (report.html became report.pdf again, and the
two passes disagreed); and it was an installed-struct ABI break (soname 4,
libhttrack4).

Replace the flag with a sentinel: when no Content-Type is received, store
"unknown/unknown" as the type instead of text/html. The sentinel is treated
as html for every type test (added to is_html_mime_type), so parsing,
storage and filtering of a typeless response are unchanged; only the naming
code (wire_patches_ext) reads it as "no declared type" and keeps the URL
extension. Because the type string rides the cache, an update reads the same
sentinel and names consistently -- the revert is fixed at the source.

The sentinel never reaches a consumer as a real type: a single helper,
hts_effective_mime(), maps it back to text/html wherever a stored type is
derived (give_mimext) or emitted/persisted -- the httrack stdout serve, the
ProxyTrack live serve, and the ProxyTrack .arc export (both the replayed
response header and the index record). The .arc export was caught by an
adversarial spill audit; without the map a typeless page archived via
proxytrack would carry Content-Type: unknown/unknown.
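
The mapping can be sketched as follows. effective_mime() and has_declared_type() are hypothetical stand-ins written for illustration; the real hts_effective_mime() may differ in signature and matching rules (an exact, case-sensitive comparison is assumed here).

```c
#include <assert.h>
#include <string.h>

/* Stored in place of a type when no Content-Type header was received. */
#define MIME_SENTINEL "unknown/unknown"

/* What consumers see: the sentinel maps back to text/html, any real
   type passes through untouched. */
static const char *effective_mime(const char *stored) {
    assert(stored != NULL);
    return strcmp(stored, MIME_SENTINEL) == 0 ? "text/html" : stored;
}

/* What naming code asks: did the server declare a type at all? */
static int has_declared_type(const char *stored) {
    assert(stored != NULL);
    return strcmp(stored, MIME_SENTINEL) != 0;
}
```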

Since the sentinel makes contenttype_given unnecessary, #409's ABI break is
undone: the field is removed, soname returns to 3, and the Debian package
reverts libhttrack4 -> libhttrack3. soname 4 was never released (Debian NEW
carries libhttrack3), so this re-aligns master with the archive rather than
flip-flopping anything downstream.

Tests: 18_local-update re-mirrors and asserts the names survive the update
pass; 15_local-types gains a notype.html negative control; 17_local-empty-ct
stays green. Full make check: 27 pass, 0 fail.

One accepted behavior change: a mime filter matching exactly text/html no
longer matches a typeless response (its type is the sentinel, html-ish but
not literally text/html); the response is still parsed and crawled as html.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:44:12 +02:00
Xavier Roche
a64c4cd160 Don't read an uninitialized buffer on an empty Content-Type (#411)
treathead() parses the Content-Type value with sscanf("%s") into a local
`tempo` buffer, then calls strlen(tempo) and stores the result. A response
whose Content-Type header has an empty or whitespace-only value yields no
token: sscanf leaves `tempo` uninitialized, so strlen reads uninitialized
stack and can over-read past the buffer. A hostile server triggers this with
a bare `Content-Type:` line.

Guard on sscanf's return: adopt the value, and mark the type as server-given,
only when a token was actually read. An empty value now falls back to the
default type with contenttype_given left false, i.e. it is treated like a
missing header and the URL extension is kept -- which is also the correct
naming behavior.
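
A minimal sketch of the guard, assuming a fixed 64-byte destination. read_content_type() is a hypothetical helper, not the code in treathead(); it shows the buffer being adopted only when sscanf reports one converted token.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 and fills `out` only when a token was actually read.
   On an empty or whitespace-only value sscanf converts nothing, so
   the result is 0 and `out` holds an empty string, never stale stack
   contents. */
static int read_content_type(const char *value, char out[64]) {
    assert(value != NULL && out != NULL);
    out[0] = '\0';
    /* The width bounds the write to 63 bytes plus the terminator. */
    return sscanf(value, "%63s", out) == 1;
}
```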

Found while reviewing #409, which added contenttype_given right beside this
parse; the bug itself predates it. tests/17_local-empty-ct.test exercises the
empty-Content-Type path, and the ASan/UBSan CI job is what catches the
uninitialized read.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:20:08 +02:00
Xavier Roche
1611dbcabf Trust a declared Content-Type over a binary URL extension (#409)
PR #408 stopped a bogus or missing html-ish wire type from clobbering a URL
extension that maps to a specific non-HTML type (the #267 mangle). But it
treated an explicitly declared text/html the same as a missing type, so a
binary-looking URL that legitimately serves HTML, such as a login or error
interstitial or a soft-404 at a .pdf or .jpg link, was saved under the binary
extension with HTML inside and would not render locally.

The response body is the only true discriminator, but under the default delayed
type check the save name is committed from the headers while the body is still
downloading, so it cannot be sniffed at naming time. Instead, keep the URL
extension only when the server sent no Content-Type at all (a missing header is
defaulted to text/html upstream and must not be trusted); an explicitly declared
type, even text/html, now wins. This trades the rare case of a real binary
explicitly mislabeled text/html (now named .html) for the common interstitial
and soft-404 case.

Whether a Content-Type header was actually received cannot be recovered after
parsing, since treatfirstline defaults a missing header to text/html, so it is
recorded as a new hts_boolean contenttype_given on htsblk. That grows the
installed struct, an incompatible ABI change: soname bumped 3 -> 4, and the
Debian runtime package renamed libhttrack3 -> libhttrack4 to match.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:18:16 +02:00
Xavier Roche
099501ee50 Make lintian actually gate the Debian package build (#410)
The deb CI job and mkdeb.sh ran lintian via debuild with
--fail-on=error,warning and were believed to gate on it. They did not:
debuild only reports lintian, it does not propagate lintian's exit status,
so a package that lintian flags with errors or warnings still built green.
This was demonstrated by a SONAME bump landing without the matching
libhttrackN package rename: lintian emitted shared-library-is-multi-arch-foreign
and package-name-doesnt-match-sonames, yet the job passed.

Disable debuild's lintian run and run lintian ourselves on the produced
.changes, under set -e, so any error or warning fails the build. Two CI-only
adjustments keep a clean package green: --profile debian, because the Ubuntu
runners' vendor data would otherwise reject the Debian "unstable" distribution,
and --suppress-tags newer-standards-version, which only reflects the runner's
lintian being older than the buildds'. The long-standing script-not-executable
hint on the sample search.sh gets an override.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:13:12 +02:00
Xavier Roche
1b9eefa3b4 Merge pull request #408 from xroche/fix/267-delayed-ext-mangle
Stop mangling saved-file extensions under the default delayed type check
2026-06-20 18:28:27 +02:00
Xavier Roche
9c8d3a41eb tests: tighten the type-matrix guards
Add two assertions surfaced by review of the override path: control.php
must not survive its rename to control.html (a dual-write regression
would leave both), and gen.php?id=5 (a query/extension-less URL served
image/png) must keep its .png and not be mangled to .html. Both exercise
the "override still fires" direction that the suppression cases don't.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:25:45 +02:00
Xavier Roche
ae77cd9d6d Honor --assume under the default delayed type check (-%N2)
Under HARD savename-delayed (the default), url_savename() forced
is_html=-1 before consulting the user's --assume rules, so a type the
user pinned was lost to the delayed name and never applied (#56). Skip
the forced delay when is_userknowntype() matches: ishtml() already
consults the user type, so the immediate naming path applies it. Files
with no --assume rule are unaffected -- is_userknowntype() is false and
the delay still fires.

tests/16_local-assume.test crawls a .png served as image/png but assumed
text/html and checks it is saved .html; it fails without this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:12:01 +02:00
Xavier Roche
51b8dcd81c Keep a known URL extension against a bogus html/empty Content-Type
Under the default delayed type check (-%N2), url_savename() rewrote a
saved file's extension from the wire Content-Type, gated only by
!may_unknown2(). text/html is not in the keep-list, so a response
labeled text/html -- or a typeless one, which is coerced to text/html --
clobbered the URL's own extension: a PNG served as text/html or with no
Content-Type was saved as .html, and .htm was normalized to .html (#29).
The bytes stayed intact; only the name was silently wrong.

wire_patches_ext() now lets the wire type override the extension only
when the type is patchable and doing so would not clobber a URL
extension that already maps to a specific, non-HTML type. A generator or
extension-less URL still becomes .html; a .png stays .png.

tests/15_local-types.test locks this with a deterministic offline crawl
of a content-type/extension matrix (tests/local-server.py); it fails on
the unfixed engine. Addresses the #267 mangle family (incl. #29).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:07:08 +02:00
Xavier Roche
bcce664143 Merge pull request #364 from xroche/feature/local-test-server
tests: offline local test server prototype (cookies + HTTPS)
2026-06-20 16:41:26 +02:00
Xavier Roche
7a24add87c tests: add offline local test server prototype (cookies + HTTPS)
Replace the network dependency for crawl tests with a self-contained Python
stdlib server (http.server + ssl) that httrack crawls over loopback. The server
binds an ephemeral port and prints it on stdout; local-crawl.sh discovers the
port, substitutes the BASEURL token into the httrack arguments, runs the crawl,
and audits the mirror under the discovered host-root directory.

This prototype migrates two cases off ut.httrack.com:

- 13_local-cookies.test drives the cookie chain (entrance/second/third)
  reimplemented as Python handlers from the old ut/cookies/*.php fixtures. A
  missing or wrong cookie answers 500, so a clean 3-files/0-errors run proves
  the cookie jar is replayed across links.
- 14_local-https.test crawls over HTTPS using a shipped long-dated self-signed
  cert. httrack does not verify certs, so the cert is accepted as-is and the
  real TLS path runs offline.

The group skips (exit 77) when python3 is missing, mirroring check-network.sh.
Fixtures and the cert are listed explicitly in EXTRA_DIST (automake does not
expand globs); make distcheck passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 16:35:13 +02:00
Xavier Roche
2308e7bafd Merge pull request #407 from xroche/fix/mkdeb-orig-artifact-rev2
mkdeb: cut a Debian revision >= 2 without bypassing the tool
2026-06-20 15:46:57 +02:00
Xavier Roche
ef5691fc47 mkdeb: reuse a frozen orig tarball for a Debian revision >= 2
mkdeb.sh regenerated the upstream orig from a fresh `git archive HEAD | make
dist` on every run. That is right for a -1 release, but a Debian revision >= 2
reuses the orig frozen in the archive at -1: the .dsc pins it by checksum, and
a regenerated orig (different mtimes, and content drift whenever the release
tooling shipped in EXTRA_DIST changes) gets rejected by dak. The -2 upload had
to bypass mkdeb.sh and stitch the package by hand.

Derive the upstream version and Debian revision from debian/changelog and let
the revision pick the orig: revision 1 builds a fresh tarball as before;
revision >= 2 reuses the one passed with --orig FILE, untouched. The --orig
requirement is enforced only for a signed (upload-bound) build: an unsigned
build is a throwaway (CI, local lintian) that can never reach the archive, so
it still regenerates the orig as before rather than demanding a frozen one.
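
The selection rule can be sketched in C for illustration (mkdeb.sh itself is a shell script). debian_revision() and must_reuse_orig() are hypothetical names; the sketch assumes a version of the form upstream-revision, split at the last hyphen, and ignores epochs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Split "upstream-revision" at the last hyphen. Returns the leading
   integer of the revision (0 when there is no hyphen) and stores the
   length of the upstream part in *upstream_len. */
static long debian_revision(const char *version, size_t *upstream_len) {
    assert(version != NULL && upstream_len != NULL);
    const char *dash = strrchr(version, '-');
    if (dash == NULL) {
        *upstream_len = strlen(version);
        return 0;
    }
    *upstream_len = (size_t) (dash - version);
    return strtol(dash + 1, NULL, 10);
}

/* A frozen orig is required only for a signed build at revision >= 2;
   unsigned builds regenerate the orig. */
static int must_reuse_orig(const char *version, int signed_build) {
    size_t n;
    return signed_build && debian_revision(version, &n) >= 2;
}
```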

Two guards close the gap the old code left implicit: the regenerate path
asserts the built tarball matches the changelog version (catching a
configure.ac/changelog skew), and the overlay step confirms the orig unpacks
to httrack-<ver>/ before dropping debian/ on top.

Validated end to end by reusing the official 3.49.8 orig to build 3.49.8-2:
the resulting .dsc pins the frozen orig's checksum byte for byte.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 15:44:12 +02:00
Xavier Roche
0a6eb73903 mkdeb: emit the orig website artifact on a Debian revision >= 2
The release-artifacts step signs and checksums httrack_<ver>.orig.tar.gz in
$outdir, but $outdir is populated by `dcmd cp` from the .changes, which lists
only the files in the upload. dpkg-genchanges omits the orig from a revision
>= 2 .changes (it is already in the archive), so the orig never reached
$outdir and `gpg --detach-sign` failed with "No such file or directory",
aborting a -2 (or later) release after the source package was already built.

Copy the orig from the build tree into $outdir before signing so the website
artifacts are produced regardless of the Debian revision. The upload is
unaffected: dput uploads the .changes-referenced files, not the extra orig.

CI didn't catch this because the deb job builds unsigned and the artifact
block is gated on a signing key.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 15:12:03 +02:00
Xavier Roche
fdb243e5a2 Merge pull request #406 from xroche/debian/libhttrack3-rename
debian: rename libhttrack2 to libhttrack3 to follow the SONAME
2026-06-20 15:04:12 +02:00
Xavier Roche
f8546e146d debian: drop the dead libhttrack-swf1.files and fix the overrides comment
Two packaging nits surfaced while reviewing the libhttrack3 rename, both
debian/-only:

- debian/libhttrack-swf1.files listed libhtsswf.so.1* but there is no
  libhttrack-swf1 package in debian/control and the swf module is no longer
  built (lib_LTLIBRARIES is just libhttrack/libhtsjava). dh_movefiles only
  consults built packages, so the list was dead. Remove it.

- libhttrack3.lintian-overrides claimed the ABI is tracked via "a strict
  =version dependency", but dh_makeshlibs --version-info emits the
  conservative (>= upstream-version) form, which is the correct choice for a
  soname-versioned library; a = ${binary:Version} shlibs dependency draws
  lintian's distant-prerequisite-in-shlibs. Correct the comment to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:59:00 +02:00
Xavier Roche
b7f602f2eb debian: rename libhttrack2 to libhttrack3 to follow the SONAME
The 3.49.8 ABI bump moved the soname to libhttrack.so.3, but the packaging
still globbed .so.2 in debian/libhttrack2.files, so the runtime libraries
matched nothing there and fell through into the catch-all httrack package;
libhttrack2 shipped no library (lintian package-name-doesnt-match-sonames).

Rename the binary package to libhttrack3, take over the misplaced libraries
from httrack and the old libhttrack2 via Breaks/Replaces, and switch the
.files globs to a .so.3* wildcard so a future soname bump no longer silently
misplaces the libraries. Ships as 3.49.8-2; new binary name goes through NEW.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:46:14 +02:00
Xavier Roche
550100b56a Merge pull request #405 from xroche/feature/mkdeb-sbuild
mkdeb: optional --sbuild clean-room build gate
2026-06-20 14:43:43 +02:00
Xavier Roche
33ddb27243 mk-sbuild-chroot: suggest a concrete usermod for the subuid range
Compute a start past every range already in /etc/subuid+subgid and print the
canonical sudo usermod --add-subuids/--add-subgids command, instead of a raw
file append the user has to adjust by hand.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:13:06 +02:00
Xavier Roche
4606dfbf66 mk-sbuild-chroot: require a subuid/subgid range up front
The unshare backend maps a whole UID range, not just the caller's, because the
base install creates system users. Without an /etc/subuid+subgid entry the
install crashes (dpkg SIGSEGV) instead of failing cleanly. Check for the range
before bootstrapping and point at the one-line fix; skip the check for root,
which uses mode=root.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:07:10 +02:00
Xavier Roche
a6f1b9a3dd mk-sbuild-chroot: only treat an active $chroot_mode line as configured
The idempotency guard matched chroot_mode.*unshare anywhere in ~/.sbuildrc,
including a commented-out line, so --write-sbuildrc would silently skip the
append and leave the unshare backend unconfigured. Anchor the match to an
active assignment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:02:42 +02:00
Xavier Roche
fb35d6a0f1 tools: add mk-sbuild-chroot.sh to set up the --sbuild gate
The --sbuild gate needs an sbuild chroot, which was only documented as loose
commands. This adds a companion script that bootstraps one with the rootless
unshare backend (mmdebstrap into ~/.cache/sbuild/<dist>-<arch>.tar.zst, where
sbuild finds it by name), idempotent unless --force, optionally writing the
unshare mode into ~/.sbuildrc. mkdeb.sh's --sbuild help now points at it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 13:43:34 +02:00
Xavier Roche
8a270fec03 mkdeb: add an optional --sbuild clean-room build gate
With source-only uploads the archive's buildds are the first place the package
is built in a clean environment, so an undeclared Build-Depends or any FTBFS
only shows up after the upload. --sbuild rebuilds the freshly produced .dsc in a
minimal chroot holding only the declared Build-Depends, reproducing the buildd
environment; a failure aborts the release before the upload. It runs after the
source package is built and before the upstream-tarball release artifacts are
signed. Logs and the clean-built debs land in <outdir>/sbuild.

The distribution comes from the changelog (UNRELEASED falls back to unstable),
and the flag fails fast if sbuild isn't installed. Off by default; needs an
sbuild chroot for the target suite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 13:37:20 +02:00
Xavier Roche
0cbd5279f2 Merge pull request #404 from xroche/release/3.49.8
Curate the 3.49-8 release notes
2026-06-20 13:06:13 +02:00
Xavier Roche
05306ee4fd Curate the 3.49-8 release notes
Round out the 3.49-8 entry in history.txt and the debian changelog with the
user-facing work landed since 3.49-7: the HTTPS-proxy CONNECT tunnel, wider
srcset parsing, the crawler and parser fixes (CSS @import, xmlns, relative
paths, RFC 6265 cookies, doit.log reload), the parser and engine buffer-copy
security hardening, and brief summary lines for the API, build, CI and test
work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 13:02:51 +02:00
Xavier Roche
1d0fc0a566 Merge pull request #403 from xroche/chore/clang-format-separate-defs
Separate definition blocks in the public headers
2026-06-20 12:56:23 +02:00
Xavier Roche
a4452592b4 Separate definition blocks and canonicalize the public headers
Set SeparateDefinitionBlocks: Always in .clang-format so clang-format keeps
a blank line between adjacent definitions, then reformat the installed
(DevIncludes) headers in full. Several of them packed struct/typedef/macro
definitions with no separation and carried non-canonical spacing (char*,
__attribute__ ((x)), padded inner parens), which made them hard to read;
this brings them to the repo's clang-format-19 canonical form and inserts
the separating blank lines.

Headers only, no semantic change: out-of-tree build is clean and make check
passes (21 pass, 7 network skip, 0 fail). htsconfig.h is UTF-8 and its
French comments survive byte-for-byte (clang-format only reflowed them to 80
columns). The new option also governs future touched-line formatting of the
engine sources.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 12:52:19 +02:00
Xavier Roche
62c2364b59 Merge pull request #402 from xroche/chore/lint-all-shell-scripts
Lint every shell script with shfmt and shellcheck
2026-06-20 12:42:19 +02:00
Xavier Roche
fe7041ddbf Address review: keep empty-PATH parity, fold the CI script list
Review of the array refactor flagged one behaviour divergence: splitting
PATH with `IFS=: read -ra` keeps empty fields (from doubled or leading
colons) as "" elements, where the old `echo $PATH | tr : ' '` word-split
dropped them, so the search loop would probe /htsserver. Skip the empty
fields to restore exact parity.
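
The same rule, sketched in C for illustration only (the scripts are bash): walk the colon-separated fields and skip the empty ones. for_each_path_dir() and field_visitor are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callback receiving one non-empty field (not NUL-terminated). */
typedef void (*field_visitor)(const char *field, size_t len, void *ctx);

/* Visit each non-empty colon-separated field of `path`; empty fields
   from leading, trailing or doubled colons are skipped. Returns the
   number of fields visited. `visit` may be NULL to only count. */
static size_t for_each_path_dir(const char *path, field_visitor visit,
                                void *ctx) {
    assert(path != NULL);
    size_t count = 0;
    const char *p = path;
    for (;;) {
        size_t len = strcspn(p, ":");
        if (len > 0) {
            if (visit != NULL)
                visit(p, len, ctx);
            count++;
        }
        if (p[len] == '\0')
            break;
        p += len + 1; /* step over the colon */
    }
    return count;
}
```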

Also reflow the CI SHELL_SCRIPTS list as a folded block scalar, one
entry per line and sorted, so it reads cleanly; the folded value is the
same space-separated string.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 12:39:31 +02:00
Xavier Roche
f5543df1af ci: lint every shell script with shellcheck and shfmt
The lint job only covered a handful of scripts; bootstrap, build.sh, the
generators, webhttrack, the CGI search helper and the crawl/run-all test
harnesses went unchecked, and shfmt ran on three files. Now both linters
run over the whole tracked shell tree, listed once in a job-level env var
so the two steps stay in sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:37:09 +02:00
Xavier Roche
fee30aa95d Make every shell script shellcheck-clean
Fix the shellcheck findings the shfmt pass left behind, all proven
behaviour-preserving:

- Quote single-value expansions, drop the redundant ${} in arithmetic,
  add read -r, and use printf '%s' instead of variables in format
  strings, across the generators, crawl-test.sh, run-all-tests.sh and
  search.sh.
- crawl-test.sh / webhttrack: turn the deliberately word-split search
  lists into bash arrays (space-safe, no scattered disables) and replace
  the numeric trap signal lists with names, dropping the un-trappable
  KILL/STOP that bash silently ignored anyway.
- search.sh: drop the bogus \" escapes that made grep search for a
  literal-quoted pattern.

The generators are exercised by hand and ship their committed output
(htscodepages.h, htsentities.h); a differential run on synthetic input
confirms byte-identical output before and after. crawl-test.sh and
webhttrack were run end to end against a local server / a faked install,
the latter also proving the array search now survives spaces in paths.
SC2153/SC2120 false positives carry a scoped disable with a reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:35:55 +02:00
Xavier Roche
f9f4700ee1 Reformat every shell script with shfmt -i 4
Mechanical pass: run shfmt -i 4 over the whole tracked shell tree (the
test harness .test files, the regen generators, webhttrack, the CGI
search helper, and the build/dist scripts) so they share one style.
shfmt also normalised backticks to $(...) and $[..] to $((..)).

No behaviour change: arithmetic is preserved exactly, non-ASCII bytes
are untouched, and the full make check suite still passes. The tab
indented .test files become 4-space indented, hence the wide diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:24:01 +02:00
Xavier Roche
f030fa21e3 Merge pull request #401 from xroche/fix/relative-path-dotdot-137-162
Test the relative-link engine; collapse ../ in file:// URLs
2026-06-20 11:15:53 +02:00
Xavier Roche
bdd1c1bc2c Test the relative-link engine; collapse ../ in file:// URLs
The ../-handling tickets #137 (embedded ../ in a URL) and #162 (cross-host
"too many ../") do not reproduce on master or the released 3.49.x: the engine
has resolved embedded, cross-host, out-of-scope and above-root ../ correctly
since the 2012 import, and the released binary behaves identically. #137's
actual breakage was a JS-generated iframe URL (httrack can't rewrite
dynamically-built links); #162 is a long-gone Windows path quirk.

The area was nearly untested, though, despite feeding both link rewriting and
crawl-scope decisions: two trivial lienrelatif asserts, none for
ident_url_relatif. Add a wide regression net via two hidden debug probes
(-#l lienrelatif, -#i ident_url_relatif, mirroring -#1 fil_simplifie) driving
tens of cases in tests/01_engine-relative.test (embedded/cross-host/sibling/
ancestor/above-root ../, query stripping, scheme handling), plus the missing
fil_simplifie edge cases (absolute paths, root clamp, query freeze) in
01_engine-simplify.test. Expected values are computed by hand, not echoed.

Covering it exposed one real gap, fixed here: the file:// branch of
ident_url_absolute skipped the fil_simplifie its http sibling runs, so file://
URLs kept their ../ in adrfil->fil while the save path was already collapsed
(htsname.c:1343). Collapsing it matches the other schemes, contains traversal
at the file:// root, and dedups a/../b against b.
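
A minimal sketch of this kind of collapse, assuming an absolute path with no query string. collapse_dotdot() is a hypothetical helper, not fil_simplifie; it drops "." segments, pops one segment per "..", clamps at the root, and leaves empty segments and trailing slashes as they are.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Collapse "." and ".." segments of an absolute path in place. */
static void collapse_dotdot(char *path) {
    assert(path != NULL && path[0] == '/');
    char *out = path;      /* write cursor, never ahead of `in` */
    const char *in = path; /* read cursor, always at '/' or NUL */
    while (*in != '\0') {
        const char *seg = in + 1;
        size_t len = strcspn(seg, "/");
        if (len == 1 && seg[0] == '.') {
            /* "." adds nothing: drop it */
        } else if (len == 2 && seg[0] == '.' && seg[1] == '.') {
            /* "..": pop the previous segment; at the root this loop
               does not run, so traversal is clamped there */
            while (out > path && *--out != '/') {
            }
        } else {
            memmove(out, in, len + 1); /* copy "/segment" */
            out += len + 1;
        }
        in = seg + len;
    }
    if (out == path)
        *out++ = '/'; /* everything collapsed: keep the root */
    *out = '\0';
}
```

With this, a/../b and b under the same root reduce to the same string, which is the dedup effect described above.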

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:14:28 +02:00
Xavier Roche
56665a268f Merge pull request #400 from xroche/fix/css-url-paren-163
Encode parens in rewritten CSS url() so the value isn't truncated (#163)
2026-06-20 10:02:32 +02:00
Xavier Roche
2e948b9acd htsparse: percent-encode parens in rewritten CSS url() (#163)
A source url(...) whose target encodes '(' ')' as %28/%29 was rewritten
with literal parens, because they are RFC2396 "mark" characters that the
URI escaper (escape_uri_utf, mode 30) leaves alone. In an unquoted CSS
url(...) the literal ')' closes the token early, so the browser mis-parses
the value and drops the background image.

Re-escape '(' and ')' back to %28/%29 when emitting the link, gated on the
url() context (ending_p == ')'). The UA decodes them to the saved-on-disk
name, so the reference still resolves. Quoted url("...") and ordinary HTML
attributes keep their parens, matching prior behavior.
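
The re-escaping step can be sketched as below. escape_css_url_parens() is a hypothetical helper, not the code in htsparse; the caller is assumed to have already decided that it is emitting into an unquoted url() context.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `src` into `dst`, writing '(' as %28 and ')' as %29.
   Returns 0 on success, -1 if `dst` is too small. */
static int escape_css_url_parens(const char *src, char *dst,
                                 size_t dst_size) {
    assert(src != NULL && dst != NULL);
    size_t o = 0;
    for (; *src != '\0'; src++) {
        const char *rep =
            *src == '(' ? "%28" : *src == ')' ? "%29" : NULL;
        size_t need = rep != NULL ? 3 : 1;
        if (o + need >= dst_size)
            return -1; /* keep room for the terminator */
        if (rep != NULL)
            memcpy(dst + o, rep, 3);
        else
            dst[o] = *src;
        o += need;
    }
    if (o >= dst_size)
        return -1;
    dst[o] = '\0';
    return 0;
}
```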

Test in 01_engine-parse.test crawls a CSS fixture whose url() references a
%20%28...%29 name and asserts the rewrite keeps the parens encoded;
negative control confirmed (literal-paren output fails it).

Closes #163

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 10:01:17 +02:00
Xavier Roche
cae11499f1 Merge pull request #399 from xroche/fix/js-string-falsepos-218
htsparse: don't treat XHR.open's method argument as a URL (#218)
2026-06-19 20:36:26 +02:00
Xavier Roche
02c7f4ebf6 htsparse: don't treat XHR.open's method argument as a URL (#218)
The JavaScript URL detector matched `.open(` for window.open("url",...)
and captured the first argument as a link. XMLHttpRequest.open(method,
url) puts the HTTP method first, so `xhr.open("GET", "ajax_info.txt")`
turned "GET" into a bogus link, rewritten to "GET.html" on a live server.

Reject a first argument that is exactly an HTTP method, mirroring the
existing ensure_not_mime guard. window.open(url) is unaffected; the real
XHR url (the second argument) is still picked up by the dirty parser.

Closes #218

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 20:27:04 +02:00
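
Reviewer note: the guard in this commit amounts to a whole-string comparison against the HTTP method names. The sketch below is an illustration only; the function name, the exact method list, and the case-sensitive match are assumptions, since the commit does not show them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical guard: return 1 when the first argument captured after
   ".open(" is exactly an HTTP method name (assumed list, case-sensitive). */
int sketch_is_http_method(const char *arg) {
  static const char *const methods[] = {"GET",     "HEAD",    "POST",
                                        "PUT",     "DELETE",  "CONNECT",
                                        "OPTIONS", "TRACE",   "PATCH"};
  size_t i;

  assert(arg != NULL);
  for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++) {
    if (strcmp(arg, methods[i]) == 0)
      return 1; /* whole-string match only, so "GETS" is not rejected */
  }
  return 0;
}
```

A detector using such a guard would skip "GET" in xhr.open("GET", "ajax_info.txt") while still accepting the URL passed to window.open(url).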
Xavier Roche
9070b44a70 Merge pull request #398 from xroche/fix/html-underflow-396
htsparse: fix buffer underflow reading *(html-1) at offset 0 (#396)
2026-06-19 19:55:40 +02:00
Xavier Roche
799c045061 htsparse: don't read *(html-1) before the parse buffer (#396)
The link detector's word-boundary guards dereference *(html-1) to check
the byte preceding a matched token. When the token sits at the very start
of the parse buffer (html == r->adr), that reads one byte before the
allocation: a heap-buffer-overflow under ASan, silent on a normal build.
A stylesheet beginning with a url() token is enough to hit it.

Route the three reachable guards (url(), location=, the makeindex /title
check) through html_prevc(), which returns a space sentinel at the buffer
start. Space is the right value for these tests: a token at offset 0 is at
a word boundary, so it stays a valid match. The other *(html-1) sites only
run after html has advanced past an opening tag or quote.

Covers it with an offset-0 url() fixture in 01_engine-parse.test; without
the fix it aborts at htsparse.c:1386 under the CI sanitizer job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:44:25 +02:00
Xavier Roche
fb1ee3bf2e Merge pull request #397 from xroche/fix/css-import-94
CSS @import: capture URLs that carry a media/supports/layer condition (#94)
2026-06-19 19:30:21 +02:00
Xavier Roche
6a08ca7d39 htsparse: bound the URL-end scan against a missing closing delimiter
Reviewing the @import change, ASan flagged a pre-existing heap overflow:
when a quoted/parenthesized link token has no closing delimiter before the
buffer ends (truncated input such as `@import "x`, `@import "`, `url("x`),
the scan stops at the terminating NUL, then `c += ndelim` steps past it and
`while (*c == ' ')` / the terminator test read out of bounds. Such input
aborts under ASan on master.

Skip the URL-end scan and capture when no closing delimiter was found
(`*c == '\0'` right after the scan); c never advances past the NUL.
Well-formed tokens are unaffected.

01_engine-parse.test gains a truncated-@import fixture (the valid sibling
import is still captured, the unterminated one is not) that trips the
overflow under the CI ASan job, plus a check that an @import's trailing
media/supports/layer condition survives the rewrite verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:25:39 +02:00
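
Reviewer note: the bounded scan described in this commit, reduced to a sketch. The function name and return convention are hypothetical; only the idea (check for the NUL right after the scan, before stepping over the delimiter) comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: c points just after an opening delimiter. Return a
   pointer to the first non-space byte after the closing delimiter, or NULL
   when the input ends before a closing delimiter is found. */
const char *sketch_url_end(const char *c, char delim) {
  assert(c != NULL && delim != '\0');
  while (*c != '\0' && *c != delim)
    c++;
  if (*c == '\0')
    return NULL; /* truncated token: never advance past the NUL */
  c++;           /* safe: *c was the delimiter, so c + 1 is in bounds */
  while (*c == ' ')
    c++; /* stops at the terminating NUL at the latest */
  return c;
}
```

A caller would skip the capture entirely on a NULL return, which is how truncated input such as `@import "x` stays out of the link list.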
Xavier Roche
a8b491e509 htsparse: capture conditional CSS @import URLs (#94)
A bare-string @import carrying a media/supports/layer condition, e.g.
`@import "theme.css" screen;`, was dropped. The detector required the closing
quote to be immediately followed by the statement terminator, so the trailing
condition aborted the capture. The `url(...)` form already worked because it
terminates at the paren.

Two coupled defects in the inscript/CSS detector:
- accept a whitespace-separated trailing condition after a quoted @import URL;
- bound the captured URL at its last content char (b) instead of recomputing
  from the terminator. The old `c -= (ndelim + 1)` mishandled spaces skipped
  before the terminator, leaving the closing quote inside the range so the
  bogus-link guard aborted. That also silently broke `foo="url" ;` (a space
  before the semicolon) for every quoted detection, not only @import.

01_engine-parse.test gains a CSS @import section that crawls a .css directly;
the conditioned cases are negative controls that fail without the fix.

Closes #94

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:46:31 +02:00
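
Reviewer note: a sketch of the two points in this commit, namely accepting a trailing condition after a quoted @import URL and bounding the URL at its own closing quote rather than recomputing from the terminator. The parser below is hypothetical and much simpler than the inscript/CSS detector in htsparse.c; it handles only the bare-string form.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser: extract the URL of `@import "url" [condition];`.
   Returns 1 and fills url on success, 0 otherwise. */
int sketch_parse_import(const char *stmt, char *url, size_t url_size) {
  static const char kw[] = "@import";
  const char *p = stmt;
  const char *start, *end;
  char quote;
  size_t len;

  assert(stmt != NULL && url != NULL);
  if (strncmp(p, kw, sizeof(kw) - 1) != 0)
    return 0;
  p += sizeof(kw) - 1;
  while (isspace((unsigned char) *p))
    p++;
  quote = *p;
  if (quote != '"' && quote != '\'')
    return 0;
  start = p + 1;
  end = strchr(start, quote); /* the URL ends at its own closing quote */
  if (end == NULL)
    return 0; /* unterminated string */
  if (strchr(end + 1, ';') == NULL)
    return 0; /* anything before ';' is treated as the condition */
  len = (size_t) (end - start);
  if (len == 0 || len >= url_size)
    return 0;
  memcpy(url, start, len);
  url[len] = '\0';
  return 1;
}
```

Because the URL range is taken from the quotes, spaces or a media/supports/layer condition before the semicolon no longer affect what is captured.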
Xavier Roche
a8e4bb3b81 Merge pull request #395 from xroche/fix/xmlns-false-links-191
Don't crawl xmlns namespace declarations
2026-06-19 18:28:23 +02:00
Xavier Roche
0145ec37a3 htsparse: don't crawl xmlns namespace declarations (#191)
The "dirty parsing" heuristic accepts any tag attribute whose value looks
like a URL unless the attribute is on the no-detect list. xmlns and
xmlns:prefix declarations carry namespace URIs (xmlns:og="http://ogp.me/ns#",
etc.) that are not resources, so httrack queued and fetched them, stalling
the crawl on unrelated spec URLs. Reject xmlns/xmlns:prefix where the
no-detect list is already consulted.

01_engine-parse.test grows a fixture with each form (default and prefixed) as
the last attribute of its element, since the heuristic only inspects an
attribute whose value is immediately followed by '>'; the targets are local
file:// gifs so a regression actually downloads them (verified: reverting the
guard fetches all three).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:24:55 +02:00
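
Reviewer note: the xmlns rejection described in this commit boils down to an attribute-name test. The sketch below is an assumption about its shape (hypothetical name, case-insensitive match), not the code added next to the no-detect list.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical check: return 1 for "xmlns" and "xmlns:prefix" attribute
   names (compared case-insensitively), 0 for anything else. */
int sketch_is_xmlns_attr(const char *name) {
  static const char prefix[] = "xmlns";
  size_t i;

  assert(name != NULL);
  for (i = 0; prefix[i] != '\0'; i++) {
    /* A shorter name hits its NUL here and fails before any overread. */
    if (tolower((unsigned char) name[i]) != prefix[i])
      return 0;
  }
  return name[i] == '\0' || name[i] == ':';
}
```

An attribute such as xmlns:og="http://ogp.me/ns#" would then be skipped by the dirty parser instead of being queued as a link.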
Xavier Roche
a80fab38ba Merge pull request #394 from xroche/fix/proxy-https-connect-85
Tunnel https through the proxy via CONNECT (#85)
2026-06-19 18:03:31 +02:00
77 changed files with 3320 additions and 1269 deletions


@@ -16,6 +16,7 @@ BasedOnStyle: LLVM
 SpaceAfterCStyleCast: true # "(int) x", overwhelmingly dominant (542 vs 7)
 SortIncludes: false # C include order can be significant; never reorder
 IncludeBlocks: Preserve # do not merge/reflow include groups
+SeparateDefinitionBlocks: Always # blank line between definitions (readability)
 # Stated explicitly for robustness against base-style drift (these match LLVM):
 IndentWidth: 2

5
.flake8 Normal file

@@ -0,0 +1,5 @@
[flake8]
# Match black's formatting so the two tools don't fight.
max-line-length = 88
# E203/W503 conflict with black's slice and line-break style.
extend-ignore = E203, W503


@@ -227,7 +227,8 @@ jobs:
 # Validate the Debian packaging via the same script maintainers release with.
 # One amd64/gcc run is enough: packaging (control/rules/manifest/lintian/quilt
 # source build) is arch- and compiler-independent, and the build matrix above
-# already covers compile portability. lintian runs with --fail-on=error.
+# already covers compile portability. mkdeb.sh runs lintian as an explicit gate
+# (debuild does not propagate lintian's exit) with --fail-on=error,warning.
 deb:
 name: deb package (lintian)
 runs-on: ubuntu-24.04
@@ -320,6 +321,21 @@ jobs:
 lint:
 name: lint (shellcheck, shfmt)
 runs-on: ubuntu-24.04
+# Every tracked shell script; the globs expand at run time. Kept here so the
+# shellcheck and shfmt steps below cannot drift apart.
+env:
+SHELL_SCRIPTS: >-
+.githooks/pre-commit
+bootstrap
+build.sh
+html/div/search.sh
+man/makeman.sh
+src/htsbasiccharsets.sh
+src/htsentities.sh
+src/webhttrack
+tests/*.sh
+tests/*.test
+tools/mkdeb.sh
 steps:
 - uses: actions/checkout@v6
@@ -332,12 +348,11 @@ jobs:
 sudo apt-get install -y --no-install-recommends shellcheck shfmt
 shfmt --version
-# Lint the scripts we maintain; the legacy scripts are a separate cleanup.
 - name: shellcheck
-run: shellcheck man/makeman.sh tools/mkdeb.sh .githooks/pre-commit tests/*.test tests/check-network.sh
+run: shellcheck $SHELL_SCRIPTS
 - name: shfmt
-run: shfmt -d -i 4 man/makeman.sh tools/mkdeb.sh .githooks/pre-commit
+run: shfmt -d -i 4 $SHELL_SCRIPTS
 # Check clang-format on CHANGED LINES ONLY. The engine predates clang-format
 # (it was shaped by an old Visual Studio formatter) and does not round-trip,


@@ -1,6 +1,6 @@
 AC_PREREQ([2.71])
-AC_INIT([httrack], [3.49.8], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
+AC_INIT([httrack], [3.49.9], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
 AC_COPYRIGHT([
 HTTrack Website Copier, Offline Browser for Windows and Unix
 Copyright (C) 1998-2015 Xavier Roche and other contributors
@@ -29,9 +29,10 @@ AC_CONFIG_SRCDIR(src/httrack.c)
 AC_CONFIG_MACRO_DIR([m4])
 AC_CONFIG_HEADERS(config.h)
 AM_INIT_AUTOMAKE([subdir-objects])
-# 3:0:0: htsblk layout changed (contenttype/charset/contentencoding widened to
-# 128), an incompatible ABI break, so bump current and reset revision/age.
-VERSION_INFO="3:0:0"
+# 3:1:0: 3.49.9 changed code but not the exported interface vs 3.49.8 (same 164
+# symbols, no struct-layout change), so bump revision only. (3:0:0 was the htsblk
+# mime-buffer widening, an ABI break that moved the soname .so.2 -> .so.3.)
+VERSION_INFO="3:1:0"
 AM_MAINTAINER_MODE
 AC_USE_SYSTEM_EXTENSIONS
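
Reviewer note: to see why bumping only the revision keeps the soname at .so.3, the libtool current:revision:age triple can be mapped to the file suffix used on Linux/ELF, where the major number is current minus age. The helper below is a hypothetical sketch for checking the arithmetic, not part of the build system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format the Linux/ELF library suffix that libtool
   derives from -version-info current:revision:age, i.e.
   (current - age).age.revision. Returns 0 on success, -1 on bad input or a
   too-small buffer. */
int sketch_libtool_suffix(unsigned current, unsigned revision, unsigned age,
                          char *out, size_t out_size) {
  int n;

  assert(out != NULL);
  if (age > current)
    return -1; /* age counts older interfaces still supported */
  n = snprintf(out, out_size, "%u.%u.%u", current - age, age, revision);
  return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}
```

Under this mapping 3:0:0 gives 3.0.0 and 3:1:0 gives 3.0.1, so the soname major stays 3; the older 2.0.49 suffix seen in the removed libhttrack2.files corresponds to 2:49:0.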

31
debian/changelog vendored

@@ -1,6 +1,33 @@
+httrack (3.49.9-1) unstable; urgency=medium
+  * New upstream release: Content-Type and file-type detection fixes (trust a
+    declared Content-Type over a binary URL extension, honor --assume under the
+    delayed type check, keep a known extension against a bogus or empty
+    Content-Type, and avoid an uninitialised read on an empty Content-Type), and
+    restored C++ source-compatibility of the installed headers so reverse
+    dependencies (httraqt) build again.
+ -- Xavier Roche <xavier@debian.org> Sun, 21 Jun 2026 17:59:38 +0200
+httrack (3.49.8-2) unstable; urgency=medium
+  * Rename libhttrack2 to libhttrack3 to follow the SONAME, which the 3.49.8
+    ABI bump moved to libhttrack.so.3 (package-name-doesnt-match-sonames). In
+    3.49.8-1 the libhttrack2.files glob still matched .so.2, so the runtime
+    libraries fell through into the httrack package and libhttrack2 shipped no
+    library. The new .files uses a .so.3* wildcard so a future SONAME bump no
+    longer silently misplaces the libraries. New binary package, via NEW.
+  * Drop the stale debian/libhttrack-swf1.files: the swf module is no longer
+    built and no libhttrack-swf1 package exists.
+ -- Xavier Roche <xavier@debian.org> Sat, 20 Jun 2026 14:42:13 +0200
 httrack (3.49.8-1) unstable; urgency=medium
-  * New upstream release.
+  * New upstream release: HTTPS-proxy CONNECT tunnelling and wider srcset
+    parsing, a batch of crawler and parser fixes (CSS @import, xmlns
+    namespaces, relative paths, RFC 6265 cookies), and security hardening of
+    the parser and of buffer copies throughout the engine.
   * Drop the OpenSSL linking exception from the license: OpenSSL 3.0+ is
     Apache-2.0 and GPL-compatible, so it is no longer needed. httrack is now
     plain GPL-3.0-or-later. Updated debian/copyright accordingly.
@@ -14,7 +41,7 @@ httrack (3.49.8-1) unstable; urgency=medium
     the QA debcheck page. Depend on firefox-esr | chromium | www-browser
     instead.
- -- Xavier Roche <xavier@debian.org> Sun, 07 Jun 2026 14:29:24 +0200
+ -- Xavier Roche <xavier@debian.org> Sat, 20 Jun 2026 13:02:08 +0200
 httrack (3.49.7-2) unstable; urgency=medium

6
debian/control vendored

@@ -58,13 +58,13 @@ Description: webhttrack common files
 This package is the common files of webhttrack, website copier and
 mirroring utility
-Package: libhttrack2
+Package: libhttrack3
 Architecture: any
 Multi-Arch: same
 Section: libs
-Replaces: libhttrack1
-Conflicts: libhttrack1
 Depends: ${misc:Depends}, ${shlibs:Depends}
+Replaces: libhttrack2, httrack (<< 3.49.8-2~)
+Breaks: libhttrack2, httrack (<< 3.49.8-2~)
 Description: Httrack website copier library
 This package is the library part of httrack, website copier and mirroring
 utility

106
debian/copyright vendored

@@ -1,21 +1,109 @@
-This package was debianized by Xavier Roche <roche@httrack.com> on
-Fri, 27 Sep 2002 16:42:26 +0200
+Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
+Upstream-Name: httrack
+Upstream-Contact: Xavier Roche <roche@httrack.com>
+Source: https://www.httrack.com/
-The current Debian maintainer is Xavier Roche <xavier@debian.org>
+Files: *
+Copyright: 1998-2026 Xavier Roche and other contributors
+License: GPL-3+
+Comment:
+The engine includes contributions from Yann Philippot (src/htsjava.c,
+src/htsjava.h). htsbasenet.h links against the system OpenSSL library
+(originally by Eric Young); no OpenSSL/SSLeay code is bundled here.
-Upstream author: Xavier Roche <roche@httrack.com>
+Files: src/minizip/*
+Copyright: 1998-2010 Gilles Vollant
+2007-2008 Even Rouault
+2009-2010 Mathias Svensson
+1990-2000 Info-ZIP
+License: Zlib
+Comment:
+The decryption code in src/minizip/crypt.h and src/minizip/unzip.c derives
+from the Info-ZIP distribution, distributed under the same terms.
-Copyright: 1998-2014 Xavier Roche and other contributors
+Files: src/md5.c
+Copyright: 1993 Colin Plumb
+License: public-domain-md5
+This code implements the MD5 message-digest algorithm, due to Ron Rivest.
+It was written by Colin Plumb in 1993, no copyright is claimed. This code
+is in the public domain; do with it what you wish.
+Files: src/coucal/*
+Copyright: 2013-2014 Xavier Roche
+License: BSD-3-clause
+Files: src/coucal/murmurhash3.h*
+Copyright: Austin Appleby
+License: public-domain-murmurhash3
+MurmurHash3 was written by Austin Appleby, and is placed in the public
+domain. The author hereby disclaims copyright to this source code.
+Files: html/server/div/com.httrack.WebHTTrack.metainfo.xml
+Copyright: 1998-2026 Xavier Roche and other contributors
+License: FSFAP
+Copying and distribution of this file, with or without modification, are
+permitted in any medium without royalty provided the copyright notice and
+this notice are preserved. This file is offered as-is, without any warranty.
+Files: debian/*
+Copyright: 2002-2026 Xavier Roche <xavier@debian.org>
+License: GPL-3+
+License: GPL-3+
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation, either version 3 of the License, or
 (at your option) any later version.
-On Debian systems, the complete text of the GNU General Public
-License version 3 can be found in /usr/share/common-licenses/GPL-3 file.
+.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 GNU General Public License for more details.
+.
+You should have received a copy of the GNU General Public License
+along with this program. If not, see <https://www.gnu.org/licenses/>.
+.
+On Debian systems, the complete text of the GNU General Public License
+version 3 can be found in /usr/share/common-licenses/GPL-3.
+License: Zlib
+This software is provided 'as-is', without any express or implied warranty.
+In no event will the authors be held liable for any damages arising from the
+use of this software.
+.
+Permission is granted to anyone to use this software for any purpose,
+including commercial applications, and to alter it and redistribute it
+freely, subject to the following restrictions:
+.
+1. The origin of this software must not be misrepresented; you must not claim
+that you wrote the original software. If you use this software in a product,
+an acknowledgment in the product documentation would be appreciated but is
+not required.
+2. Altered source versions must be plainly marked as such, and must not be
+misrepresented as being the original software.
+3. This notice may not be removed or altered from any source distribution.
+License: BSD-3-clause
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+.
+1. Redistributions of source code must retain the above copyright notice,
+this list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation
+and/or other materials provided with the distribution.
+3. Neither the name of the copyright holder nor the names of its contributors
+may be used to endorse or promote products derived from this software
+without specific prior written permission.
+.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.


@@ -4,3 +4,6 @@
 # so the path lives in the display pointer, not the override -- match with '*'.
 httrack-doc: extra-license-file *
 httrack-doc: package-contains-documentation-outside-usr-share-doc *
+# search.sh is a sample CGI shipped alongside the HTML manual, not meant to be
+# run from the package tree; it stays non-executable by design.
+httrack-doc: script-not-executable *


@@ -1,2 +0,0 @@
usr/lib/*/libhtsswf.so.1.0.0
usr/lib/*/libhtsswf.so.1


@@ -1,5 +0,0 @@
usr/lib/*/libhttrack.so.2.0.49
usr/lib/*/libhttrack.so.2
usr/lib/*/libhtsjava.so.2.0.49
usr/lib/*/libhtsjava.so.2
usr/share/httrack/templates


@@ -1,3 +0,0 @@
# The shared libraries ship without a versioned symbols control file (ABI is
# tracked via the SONAME and a strict =version dependency, see debian/rules).
libhttrack2: no-symbols-control-file usr/lib/*

3
debian/libhttrack3.files vendored Normal file

@@ -0,0 +1,3 @@
usr/lib/*/libhttrack.so.3*
usr/lib/*/libhtsjava.so.3*
usr/share/httrack/templates

3
debian/libhttrack3.lintian-overrides vendored Normal file

@@ -0,0 +1,3 @@
# The shared libraries ship without a versioned symbols control file (ABI is
# tracked via the SONAME plus a >= upstream-version dependency, see debian/rules).
libhttrack3: no-symbols-control-file usr/lib/*

2
debian/rules vendored

@@ -135,7 +135,7 @@ binary-arch: build install
 dh_makeshlibs -a -X/usr/lib/$(DEB_HOST_MULTIARCH)/httrack/libtest --version-info
 dh_installdeb -a
 # we depend on the current version (ABI may change)
-dh_shlibdeps -a -ldebian/libhttrack2/usr/lib/$(DEB_HOST_MULTIARCH)
+dh_shlibdeps -a -ldebian/libhttrack3/usr/lib/$(DEB_HOST_MULTIARCH)
 dh_gencontrol -a
 dh_md5sums -a
 dh_builddeb -a


@@ -4,13 +4,38 @@ HTTrack Website Copier release history:
 This file lists all changes and fixes that have been made for HTTrack
+3.49-9
++ Fixed: file-type detection from the Content-Type header: trust a declared type over a binary URL extension, honor --assume under the delayed type check, and keep a known extension against a bogus or empty Content-Type (#267, #29, #56)
++ Fixed: an uninitialized-buffer read when the Content-Type is empty (#411)
++ Fixed: restored C++ source-compatibility of the installed headers so reverse dependencies (httraqt) build again (#413)
++ Changed: multiple internal build, packaging and test-harness improvements
 3.49-8
++ New: tunnel HTTPS downloads through the configured HTTP proxy via CONNECT (#85)
++ New: parse every candidate URL in <img> and <source> srcset lists (#326)
 + Changed: dropped the obsolete OpenSSL linking exception (OpenSSL 3.0+ is Apache-2.0 and GPL-compatible); httrack is now plain GPLv3-or-later
-+ Fixed: link libhtsjava and the libtest examples directly against libc
++ Fixed: several out-of-bounds reads in the HTML/CSS parser on hostile input (#94, #396)
++ Fixed: stored XSS via an unescaped URL in the generated page footer (#165)
++ Fixed: hardened buffer copies throughout the engine against overflow
++ Fixed: capture conditional CSS @import URLs (#94)
++ Fixed: don't crawl xmlns namespace declarations as links (#191)
++ Fixed: don't mistake the method argument of XMLHttpRequest.open for a URL (#218)
++ Fixed: percent-encode parentheses when rewriting CSS url() targets (#163)
++ Fixed: collapse ../ in file:// URLs and widen relative-link handling (#137, #162)
++ Fixed: drop the obsolete $Version/$Path attributes from the request Cookie header, per RFC 6265 (#151)
++ Fixed: keep empty quoted arguments when reloading doit.log for --update/--continue (#106)
++ Fixed: raise the User-Agent and custom-header length limits (#152)
++ Fixed: abort on a long log path (lock-file buffer too small) (#183)
++ Fixed: race in lazy mutex initialization (#297)
++ Fixed: sub-second mtime precision when comparing local files on POSIX (#383)
++ Fixed: modernize OpenSSL TLS initialization for the 3.x to 4.x transition (#308)
 + Fixed: in-place changes made by the postprocess callback were not applied (Roman Sęk)
 + Fixed: "preffered" typo in the help text and man page (yosinn1-blip)
 + Fixed: corrections and updates of the Russian translation (German Aizek)
 + Fixed: corrections and updates of the Danish translation (scootergrisen)
++ Fixed: link libhtsjava and the libtest examples directly against libc
++ New: documented the public library API headers and typed the option fields as named enums
++ Fixed: numerous build, packaging, CI and test-coverage improvements (out-of-tree builds, sanitizer/distcheck CI, shell and Python linting, AppStream metainfo)
 3.49-7
 + Fixed: keep generated config.h architecture-independent (Debian #1133728)


@@ -1,4 +1,3 @@
 #!/bin/sh
 # Simple indexing test using HTTrack
@@ -18,22 +17,22 @@ if ! test -f "index.txt"; then
 fi
 # Convert crlf to lf
-if test "`head index.txt -n 1 | tr '\r' '#' | grep -c '#'`" = "1"; then
+if test "$(head index.txt -n 1 | tr '\r' '#' | grep -c '#')" = "1"; then
 echo "Converting index to Unix LF style (not CR/LF) .."
 mv -f index.txt index.txt.old
-cat index.txt.old|tr -d '\r' > index.txt
+tr -d '\r' <index.txt.old >index.txt
 fi
 keyword=-
 while test -n "$keyword"; do
 printf "Enter a keyword: "
-read keyword
+read -r keyword
 if test -n "$keyword"; then
-FOUNDK="`grep -niE \"^$keyword\" index.txt`"
+FOUNDK="$(grep -niE "^$keyword" index.txt)"
 if test -n "$FOUNDK"; then
-if ! test `echo "$FOUNDK"|wc -l` = "1"; then
+if ! test "$(echo "$FOUNDK" | wc -l)" = "1"; then
 # Multiple matches
 printf "Found multiple keywords: "
 echo "$FOUNDK" | cut -f2 -d':' | tr '\n' ' '
@@ -41,12 +40,12 @@ while test -n "$keyword"; do
 echo "Use keyword$ to find only one"
 else
 # One match
-N=`echo "$FOUNDK"|cut -f1 -d':'`
-PM=`tail +$N index.txt|grep -nE "\("|head -n 1`
+N=$(echo "$FOUNDK" | cut -f1 -d':')
+PM=$(tail "+$N" index.txt | grep -nE "\(" | head -n 1)
 if ! echo "$PM" | grep "ignored" >/dev/null; then
-M=`echo $PM|cut -f1 -d':'`
+M=$(echo "$PM" | cut -f1 -d':')
 echo "Found in:"
-cat index.txt | tail "+$N" | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
+tail "+$N" index.txt | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
 else
 echo "keyword ignored (too many hits)"
 fi
@@ -57,4 +56,3 @@ while test -n "$keyword"; do
 fi
 done


@@ -56,7 +56,7 @@ whttrackrundir = $(bindir)
 whttrackrun_SCRIPTS = webhttrack
 libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
-htscache_selftest.c \
+htscache_selftest.c htsdns_selftest.c \
 htscatchurl.c htsfilters.c htsftp.c htshash.c coucal/coucal.c \
 htshelp.c htslib.c htscoremain.c \
 htsname.c htsrobots.c htstools.c htswizard.c \
@@ -66,7 +66,7 @@ libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
 md5.c \
 minizip/ioapi.c minizip/mztools.c minizip/unzip.c minizip/zip.c \
 hts-indextmpl.h htsalias.h htsback.h htsbase.h htssafe.h \
-htsbasenet.h htsbauth.h htscache.h htscache_selftest.h htscatchurl.h \
+htsbasenet.h htsbauth.h htscache.h htscache_selftest.h htsdns_selftest.h htscatchurl.h \
 htsconfig.h htscore.h htsparse.h htscoremain.h htsdefines.h \
 htsfilters.h htsftp.h htsglobal.h htshash.h coucal/coucal.h \
 htshelp.h htsindex.h htslib.h htsmd5.h \


@@ -48,9 +48,8 @@ Please visit our Website: http://www.httrack.com
 /* Abort (with the failed byte count) when a growth allocation fails. The
 array macros never return an out-of-memory error; they assert and abort. */
 static void hts_record_assert_memory_failed(const size_t size) {
-fprintf(stderr, "memory allocation failed (%lu bytes)", \
-(long int) size); \
-assertf(! "memory allocation failed"); \
+fprintf(stderr, "memory allocation failed (%lu bytes)", (long int) size);
+assertf(!"memory allocation failed");
 }
 /** Dynamic array of T elements. **/
@@ -109,20 +108,22 @@ static void hts_record_assert_memory_failed(const size_t size) {
 * After a call to this macro, TypedArrayRoom(A) is guaranteed to be at
 * least equal to 'ROOM'.
 **/
-#define TypedArrayEnsureRoom(A, ROOM) do { \
+#define TypedArrayEnsureRoom(A, ROOM) \
+do { \
 const size_t room_ = (ROOM); \
 while (TypedArrayRoom(A) < room_) { \
 TypedArrayCapa(A) = TypedArrayCapa(A) < 16 ? 16 : TypedArrayCapa(A) * 2; \
 } \
-TypedArrayPtr(A) = realloc(TypedArrayPtr(A), \
-TypedArrayCapa(A)*TypedArrayWidth(A)); \
+TypedArrayPtr(A) = \
+realloc(TypedArrayPtr(A), TypedArrayCapa(A) * TypedArrayWidth(A)); \
 if (TypedArrayPtr(A) == NULL) { \
 hts_record_assert_memory_failed(TypedArrayCapa(A) * TypedArrayWidth(A)); \
 } \
 } while (0)
 /** Add an element. Macro, first element evaluated multiple times. **/
-#define TypedArrayAdd(A, E) do { \
+#define TypedArrayAdd(A, E) \
+do { \
 TypedArrayEnsureRoom(A, 1); \
 assertf(TypedArraySize(A) < TypedArrayCapa(A)); \
 TypedArrayTail(A) = (E); \
@@ -133,7 +134,8 @@ static void hts_record_assert_memory_failed(const size_t size) {
 * Add 'COUNT' elements from 'PTR'.
 * Macro, first element evaluated multiple times.
 **/
-#define TypedArrayAppend(A, PTR, COUNT) do { \
+#define TypedArrayAppend(A, PTR, COUNT) \
+do { \
 const size_t count_ = (COUNT); \
 /* This 1-case is to benefit from type safety. */ \
 if (count_ == 1) { \
@@ -148,7 +150,8 @@ static void hts_record_assert_memory_failed(const size_t size) {
 } while (0)
 /** Clear an array, freeing memory and clearing size and capacity. **/
-#define TypedArrayFree(A) do { \
+#define TypedArrayFree(A) \
+do { \
 if (TypedArrayPtr(A) != NULL) { \
 TypedArrayCapa(A) = TypedArraySize(A) = 0; \
 free(TypedArrayPtr(A)); \


@@ -49,9 +49,10 @@ Please visit our Website: http://www.httrack.com
#define WIN32_LEAN_AND_MEAN #define WIN32_LEAN_AND_MEAN
// KB955045 (http://support.microsoft.com/kb/955045) // KB955045 (http://support.microsoft.com/kb/955045)
// To execute an application using this function on earlier versions of Windows // To execute an application using this function on earlier versions of Windows
// (Windows 2000, Windows NT, and Windows Me/98/95), then it is mandatory to #include Ws2tcpip.h // (Windows 2000, Windows NT, and Windows Me/98/95), then it is mandatory to
// and also Wspiapi.h. When the Wspiapi.h header file is included, the 'getaddrinfo' function is // #include Ws2tcpip.h and also Wspiapi.h. When the Wspiapi.h header file is
// #defined to the 'WspiapiGetAddrInfo' inline function in Wspiapi.h. // included, the 'getaddrinfo' function is #defined to the 'WspiapiGetAddrInfo'
// inline function in Wspiapi.h.
#include <ws2tcpip.h> #include <ws2tcpip.h>
#include <Wspiapi.h> #include <Wspiapi.h>
// #include <winsock2.h> // #include <winsock2.h>


@@ -13,14 +13,14 @@ rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
fi fi
# Produce code # Produce code
printf "/** GENERATED FILE ($0), DO NOT EDIT **/\n\n" printf '/** GENERATED FILE (%s), DO NOT EDIT **/\n\n' "$0"
for i in *.TXT; do for i in *.TXT; do
echo "processing $i" >&2 echo "processing $i" >&2
grep -vE "^(#|$)" $i | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' | \ grep -vE "^(#|$)" "$i" | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' |
( (
unset arr unset arr
while read LINE ; do while read -r LINE; do
from=$[$(echo $LINE | cut -f1 -d' ')] from=$(($(echo "$LINE" | cut -f1 -d' ')))
if ! test -n "$from"; then if ! test -n "$from"; then
echo "error with $i" >&2 echo "error with $i" >&2
exit 1 exit 1
@@ -28,22 +28,23 @@ for i in *.TXT ; do
echo "out-of-range ($LINE) with $i" >&2 echo "out-of-range ($LINE) with $i" >&2
exit 1 exit 1
fi fi
to=$(echo $LINE | cut -f2 -d' ') to=$(echo "$LINE" | cut -f2 -d' ')
arr[$from]=$to arr[from]=$to
done done
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/') # shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
printf "/* Table for $i */\nstatic const hts_UCS4 table_${name}[256] = {\n " name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
i=0 printf '/* Table for %s */\nstatic const hts_UCS4 table_%s[256] = {\n ' "$i" "$name"
while test "$i" -lt 256; do idx=0
if test "$i" -gt 0; then while test "$idx" -lt 256; do
if test "$idx" -gt 0; then
printf ", " printf ", "
if test $[${i}%8] -eq 0; then if test $((idx % 8)) -eq 0; then
printf "\n " printf "\n "
fi fi
fi fi
value=${arr[$i]:-0} value=${arr[$idx]:-0}
printf "0x%04x" $value printf "0x%04x" "$value"
i=$[${i}+1] idx=$((idx + 1))
done done
printf " };\n\n" printf " };\n\n"
) )
@@ -53,7 +54,8 @@ done
# Indexes # Indexes
printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n" printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n"
for i in *.TXT; do for i in *.TXT; do
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/') # shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
printf " { \"$(echo $name | tr -d '_')\", table_${name} },\n" name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf ' { "%s", table_%s },\n' "$(echo "$name" | tr -d '_')" "$name"
done done
printf " { NULL, NULL }\n};\n" printf " { NULL, NULL }\n};\n"


@@ -71,7 +71,8 @@ struct t_cookie {
int cookie_add(t_cookie *cookie, const char *cook_name, const char *cook_value, int cookie_add(t_cookie *cookie, const char *cook_name, const char *cook_value,
const char *domain, const char *path); const char *domain, const char *path);
int cookie_del(t_cookie * cookie, const char *cook_name, const char *domain, const char *path); int cookie_del(t_cookie *cookie, const char *cook_name, const char *domain,
const char *path);
int cookie_load(t_cookie *cookie, const char *path, const char *name); int cookie_load(t_cookie *cookie, const char *path, const char *name);
@@ -83,7 +84,8 @@ void cookie_delete(char *s, size_t s_size, size_t pos);
const char *cookie_get(char *buffer, const char *cookie_base, int param); const char *cookie_get(char *buffer, const char *cookie_base, int param);
char *cookie_find(char *s, const char *cook_name, const char *domain, const char *path); char *cookie_find(char *s, const char *cook_name, const char *domain,
const char *path);
char *cookie_nextfield(char *a); char *cookie_nextfield(char *a);
@@ -92,7 +94,8 @@ char *cookie_nextfield(char *a);
/** Register credentials (auth = base-64 user:pass) for the prefix derived from /** Register credentials (auth = base-64 user:pass) for the prefix derived from
adr (host) and fil (path). No-op returning 0 if cookie is NULL, allocation adr (host) and fil (path). No-op returning 0 if cookie is NULL, allocation
fails, or a matching prefix is already stored; returns 1 on insertion. */ fails, or a matching prefix is already stored; returns 1 on insertion. */
int bauth_add(t_cookie * cookie, const char *adr, const char *fil, const char *auth); int bauth_add(t_cookie *cookie, const char *adr, const char *fil,
const char *auth);
/** Return the stored base-64 credentials whose prefix matches adr+fil, or NULL /** Return the stored base-64 credentials whose prefix matches adr+fil, or NULL
if none (or cookie is NULL). Returned pointer aliases the jar's bauth_chain; if none (or cookie is NULL). Returned pointer aliases the jar's bauth_chain;


@@ -87,7 +87,8 @@ Please visit our Website: http://www.httrack.com
// fast cache (build hash table) // fast cache (build hash table)
#define HTS_FAST_CACHE 1 #define HTS_FAST_CACHE 1
// a '>' may be treated as a comment-closing tag (<!-- > is valid) // a '>' may be treated as a comment-closing tag (<!-- > is
// valid)
#define GT_ENDS_COMMENT 1 #define GT_ENDS_COMMENT 1
// always adds a '/' at the end if a '~' is encountered (/~smith -> /~smith/) // always adds a '/' at the end if a '~' is encountered (/~smith -> /~smith/)
@@ -97,7 +98,8 @@ Please visit our Website: http://www.httrack.com
#define HTS_STRIP_DOUBLE_SLASH 0 #define HTS_STRIP_DOUBLE_SLASH 0
// case-sensitive for folders and files (0/1) // case-sensitive for folders and files (0/1)
// [normally 1, but causes problems (malformed URLs, for example) and is not very useful.. // [normally 1, but causes problems (malformed URLs, for example) and is
// not very useful..
// ..and not widely respected] // ..and not widely respected]
// REMOVED // REMOVED
// #define HTS_CASSE 0 // #define HTS_CASSE 0


@@ -3703,9 +3703,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->maxsoc > 0) if (from->maxsoc > 0)
to->maxsoc = from->maxsoc; to->maxsoc = from->maxsoc;
/* hts_boolean/enum fields are unsigned (GCC), so a bare `> -1` unset-guard /* hts_tristate fields use HTS_DEFAULT (-1) for "unspecified": copy_htsopt
is always false; cast to int to keep the -1 "unset" sentinel test. */ skips them so the target keeps its value. */
if ((int) from->nearlink > -1) if (from->nearlink > -1)
to->nearlink = from->nearlink; to->nearlink = from->nearlink;
if (from->timeout > -1) if (from->timeout > -1)
@@ -3732,10 +3732,10 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->hostcontrol > -1) if (from->hostcontrol > -1)
to->hostcontrol = from->hostcontrol; to->hostcontrol = from->hostcontrol;
if ((int) from->errpage > -1) if (from->errpage > -1)
to->errpage = from->errpage; to->errpage = from->errpage;
if ((int) from->parseall > -1) if (from->parseall > -1)
to->parseall = from->parseall; to->parseall = from->parseall;
// test all: bit 8 of travel // test all: bit 8 of travel
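The unset-guard in the hunk above relies on the field being signed: with an unsigned backing type, `-1` is converted to a large unsigned value and a bare `> -1` test can never be true. A minimal sketch of the "skip when unspecified" copy, assuming an int-backed tri-state; the `sk_`/`SK_` names and the two-field struct are hypothetical stand-ins, not HTTrack's `httrackp`:

```c
#include <assert.h>

/* Hypothetical stand-ins for hts_tristate / HTS_DEFAULT / HTS_FALSE / HTS_TRUE. */
typedef int sk_tristate;
#define SK_DEFAULT (-1)
#define SK_FALSE 0
#define SK_TRUE 1

/* Hypothetical two-field option block (the real httrackp is far larger). */
typedef struct sk_opts {
  sk_tristate parseall;
  int maxsoc;
} sk_opts;

/* Copy only the fields the source actually specifies. */
static void sk_copy_opts(const sk_opts *from, sk_opts *to) {
  if (from->maxsoc > 0)
    to->maxsoc = from->maxsoc;
  /* Signed comparison: SK_DEFAULT (-1) fails the test, so 'to' is untouched. */
  if (from->parseall > SK_DEFAULT)
    to->parseall = from->parseall;
}
```

Copying from a block whose `parseall` is `SK_DEFAULT` leaves the target's value in place, while `SK_FALSE` or `SK_TRUE` overwrites it.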


@@ -47,6 +47,7 @@ Please visit our Website: http://www.httrack.com
#include "htscharset.h" #include "htscharset.h"
#include "htsencoding.h" #include "htsencoding.h"
#include "htscache_selftest.h" #include "htscache_selftest.h"
#include "htsdns_selftest.h"
#include "htsmd5.h" #include "htsmd5.h"
#include <ctype.h> #include <ctype.h>
@@ -2460,6 +2461,13 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 1; return 1;
} }
break; break;
case 'D': { // DNS resolver/cache self-test (mock getaddrinfo)
const int err = dns_selftests(opt);
printf("dns-selftest: %s\n", err ? "FAIL" : "OK");
htsmain_free();
return err;
} break;
case 'C': // list cache files : httrack -#C '*spid*.gif' will attempt to find the matching file case 'C': // list cache files : httrack -#C '*spid*.gif' will attempt to find the matching file
{ {
int hasFilter = 0; int hasFilter = 0;
@@ -2579,7 +2587,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
(r.size >= 0) ? r.size : (-r.size)); (r.size >= 0) ? r.size : (-r.size));
if (r.contenttype >= 0) { if (r.contenttype >= 0) {
fprintf(stdout, "Content-Type: %s\r\n", fprintf(stdout, "Content-Type: %s\r\n",
r.contenttype); hts_effective_mime(r.contenttype));
} }
if (r.cdispo[0]) { if (r.cdispo[0]) {
fprintf(stdout, "Content-Disposition: %s\r\n", fprintf(stdout, "Content-Disposition: %s\r\n",
@@ -2787,6 +2795,47 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 0; return 0;
} }
break; break;
case 'l': /* lienrelatif: relative link from curr_fil to link */
if (na + 2 >= argc) {
HTS_PANIC_PRINTF(
"Option #l needs a link and a current-file path");
printf(
"Example: '-#l' 'host/dir/img.gif' 'host/dir/p.html'\n");
htsmain_free();
return -1;
} else {
char s[HTS_URLMAXSIZE * 2];
if (lienrelatif(s, sizeof(s), argv[na + 1], argv[na + 2]) ==
0)
printf("relative=%s\n", s);
else
printf("relative=<ERROR>\n");
htsmain_free();
return 0;
}
break;
case 'i': /* ident_url_relatif: resolve a link -> adr/fil */
if (na + 3 >= argc) {
HTS_PANIC_PRINTF(
"Option #i needs a link, an origin address and file");
printf("Example: '-#i' '../img.gif' 'www.foo.com' "
"'/d/p.html'\n");
htsmain_free();
return -1;
} else {
lien_adrfil af;
const int r = ident_url_relatif(argv[na + 1], argv[na + 2],
argv[na + 3], &af);
if (r == 0)
printf("adr=%s fil=%s\n", af.adr, af.fil);
else
printf("error=%d\n", r);
htsmain_free();
return 0;
}
break;
case '2': // mimedefs case '2': // mimedefs
if (na + 1 >= argc) { if (na + 1 >= argc) {
HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL"); HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL");
@@ -3125,6 +3174,16 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
if (to->parseall != HTS_FALSE) if (to->parseall != HTS_FALSE)
err = 1; err = 1;
/* HTS_DEFAULT (-1) is "unspecified": copy_htsopt must skip it,
leaving the target intact. Only a signed (int-backed) field
can hold -1, so this also guards the type against regressing
to an unsigned hts_boolean. */
from->parseall = HTS_DEFAULT;
to->parseall = HTS_TRUE;
copy_htsopt(from, to);
if (to->parseall != HTS_TRUE)
err = 1;
hts_free_opt(from); hts_free_opt(from);
hts_free_opt(to); hts_free_opt(to);
printf("copy-htsopt: %s\n", err ? "FAIL" : "OK"); printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");


@@ -109,8 +109,8 @@ typedef int (*t_hts_htmlcheck_chopt) (t_hts_callbackarg * carg, httrackp * opt);
/* Rewrite hook over an in-memory page: the html and len arguments point at the /* Rewrite hook over an in-memory page: the html and len arguments point at the
buffer and its length (the callback may reallocate and resize it), buffer and its length (the callback may reallocate and resize it),
url_adresse and url_fichier name it. */ url_adresse and url_fichier name it. */
typedef int (*t_hts_htmlcheck_process) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_process)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, char **html, int *len, char **html, int *len,
const char *url_adresse, const char *url_adresse,
const char *url_fichier); const char *url_fichier);
@@ -147,9 +147,8 @@ typedef const char *(*t_hts_htmlcheck_query3) (t_hts_callbackarg * carg,
queue size and running totals, stat_time the elapsed time. */ queue size and running totals, stat_time the elapsed time. */
typedef int (*t_hts_htmlcheck_loop)(t_hts_callbackarg *carg, httrackp *opt, typedef int (*t_hts_htmlcheck_loop)(t_hts_callbackarg *carg, httrackp *opt,
lien_back *back, int back_max, lien_back *back, int back_max,
int back_index, int lien_tot, int back_index, int lien_tot, int lien_ntot,
int lien_ntot, int stat_time, int stat_time, hts_stat_struct *stats);
hts_stat_struct * stats);
/* Veto a link (adr host, fil path) after its transfer; status is the result. /* Veto a link (adr host, fil path) after its transfer; status is the result.
Return 0 to drop the link. */ Return 0 to drop the link. */
@@ -168,8 +167,8 @@ typedef void (*t_hts_htmlcheck_pause) (t_hts_callbackarg * carg, httrackp * opt,
const char *lockfile); const char *lockfile);
/* Fired after a file is written to disk; 'file' is the local path. */ /* Fired after a file is written to disk; 'file' is the local path. */
typedef void (*t_hts_htmlcheck_filesave) (t_hts_callbackarg * carg, typedef void (*t_hts_htmlcheck_filesave)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, const char *file); const char *file);
/* Richer file-saved notification: source host/filename, local path, and flags /* Richer file-saved notification: source host/filename, local path, and flags
telling whether the file is new, modified, or left unchanged. */ telling whether the file is new, modified, or left unchanged. */
@@ -189,13 +188,12 @@ typedef int (*t_hts_htmlcheck_linkdetected2) (t_hts_callbackarg * carg,
const char *tag_start); const char *tag_start);
/* Fired on each transfer-status change of slot 'back'. */ /* Fired on each transfer-status change of slot 'back'. */
typedef int (*t_hts_htmlcheck_xfrstatus) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_xfrstatus)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, lien_back * back); lien_back *back);
/* Choose the local save path for a URL; write it into 'save'. adr/fil name the /* Choose the local save path for a URL; write it into 'save'. adr/fil name the
target, referer_adr/referer_fil the page that linked it. */ target, referer_adr/referer_fil the page that linked it. */
typedef int (*t_hts_htmlcheck_savename) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_savename)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt,
const char *adr_complete, const char *adr_complete,
const char *fil_complete, const char *fil_complete,
const char *referer_adr, const char *referer_adr,
@@ -206,9 +204,9 @@ typedef t_hts_htmlcheck_savename t_hts_htmlcheck_extsavename;
/* Inspect or edit the outgoing request headers in 'buff' before they are sent. /* Inspect or edit the outgoing request headers in 'buff' before they are sent.
*/ */
typedef int (*t_hts_htmlcheck_sendhead) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_sendhead)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, char *buff, char *buff, const char *adr,
const char *adr, const char *fil, const char *fil,
const char *referer_adr, const char *referer_adr,
const char *referer_fil, const char *referer_fil,
htsblk *outgoing); htsblk *outgoing);

254
src/htsdns_selftest.c Normal file

@@ -0,0 +1,254 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 2026 Xavier Roche and other contributors
SPDX-License-Identifier: GPL-3.0-or-later
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Ethical use: we kindly ask that you NOT use this software to harvest email
addresses or to collect any other private information about people. Doing so
would dishonor our work and waste the many hours we have spent on it.
Please visit our Website: http://www.httrack.com
*/
/* ------------------------------------------------------------ */
/* File: htsdns_selftest.c subroutines: */
/* in-process self-test for the DNS resolver and cache */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
/* Routes the resolver through a scripted getaddrinfo (hts_resolver_backend)
instead of the network, so resolution and the DNS cache are testable for a
fixed set of scenarios (IPv4/IPv6/dual-stack, errors, family filter,
cache reuse) with no live DNS. */
#define HTS_INTERNAL_BYTECODE
#include "htsdns_selftest.h"
#include "htscore.h"
#include "htslib.h"
#include "htsnet.h"
#include <stdio.h>
#include <string.h>
#if HTS_INET6 != 0
/* IPV6_resolver: 0 = v4+v6, 1 = v4 only, 2 = v6 only (htscoremain -@i). */
extern int IPV6_resolver;
/* One scripted host: either a getaddrinfo error, or an ordered address list. */
typedef struct mock_addr {
int family; /* AF_INET / AF_INET6 */
unsigned char addr[16]; /* 4 (v4) or 16 (v6) meaningful bytes */
} mock_addr;
typedef struct mock_host {
const char *name;
int gai_err; /* non-zero: getaddrinfo returns this */
int naddr;
mock_addr addr[3];
int calls; /* times the backend resolved this host */
} mock_host;
static mock_host mock_hosts[] = {
{"v4only.test", 0, 1, {{AF_INET, {1, 2, 3, 4}}}, 0},
{"v6only.test", 0, 1, {{AF_INET6, {0x20, 0x01, 0x0d, 0xb8, [15] = 1}}}, 0},
/* dual stack, IPv6 first (RFC 6724 order) then IPv4 */
{"dual.test",
0,
2,
{{AF_INET6, {0x20, 0x01, 0x0d, 0xb8, [15] = 2}}, {AF_INET, {5, 6, 7, 8}}},
0},
/* dual stack, IPv4 first: distinguishes "keep the first address" from
"prefer a family", so the selection contract is actually pinned. */
{"dual4.test",
0,
2,
{{AF_INET, {9, 10, 11, 12}},
{AF_INET6, {0x20, 0x01, 0x0d, 0xb8, [15] = 3}}},
0},
{"nodns.test", EAI_NONAME, 0, {{0}}, 0},
};
static mock_host *mock_find(const char *name) {
for (size_t i = 0; i < sizeof(mock_hosts) / sizeof(mock_hosts[0]); i++) {
if (strcmp(mock_hosts[i].name, name) == 0)
return &mock_hosts[i];
}
return NULL;
}
static void mock_reset_calls(void) {
for (size_t i = 0; i < sizeof(mock_hosts) / sizeof(mock_hosts[0]); i++)
mock_hosts[i].calls = 0;
}
/* Build one addrinfo node owning its sockaddr (freed by mock_freeaddrinfo). */
static struct addrinfo *mock_mkai(const mock_addr *a) {
struct addrinfo *ai = calloct(1, sizeof(*ai));
ai->ai_family = a->family;
if (a->family == AF_INET) {
struct sockaddr_in *sin = calloct(1, sizeof(*sin));
sin->sin_family = AF_INET;
memcpy(&sin->sin_addr, a->addr, 4);
ai->ai_addr = (struct sockaddr *) sin;
ai->ai_addrlen = sizeof(*sin);
} else {
struct sockaddr_in6 *sin6 = calloct(1, sizeof(*sin6));
sin6->sin6_family = AF_INET6;
memcpy(&sin6->sin6_addr, a->addr, 16);
ai->ai_addr = (struct sockaddr *) sin6;
ai->ai_addrlen = sizeof(*sin6);
}
return ai;
}
static int mock_getaddrinfo(const char *node, const char *service,
const struct addrinfo *hints,
struct addrinfo **res) {
mock_host *const h = mock_find(node);
const int want = (hints != NULL) ? hints->ai_family : PF_UNSPEC;
struct addrinfo *head = NULL, *tail = NULL;
(void) service;
*res = NULL;
if (h == NULL)
return EAI_NONAME;
h->calls++; /* a real backend hit; a cached host skips this */
if (h->gai_err != 0)
return h->gai_err;
for (int i = 0; i < h->naddr; i++) {
if (want != PF_UNSPEC && want != h->addr[i].family)
continue; /* honor the requested family (v4/v6 only) */
struct addrinfo *const ai = mock_mkai(&h->addr[i]);
if (head == NULL)
head = ai;
else
tail->ai_next = ai;
tail = ai;
}
if (head == NULL)
return EAI_NONAME; /* filtered to empty, as the libc resolver does */
*res = head;
return 0;
}
static void mock_freeaddrinfo(struct addrinfo *res) {
while (res != NULL) {
struct addrinfo *const next = res->ai_next;
freet(res->ai_addr);
freet(res);
res = next;
}
}
static const hts_resolver_backend mock_backend = {mock_getaddrinfo,
mock_freeaddrinfo};
static int failures = 0;
#define CHECK(cond) \
do { \
if (!(cond)) { \
failures++; \
fprintf(stderr, "dns-selftest: FAIL at %s:%d: %s\n", __FILE__, __LINE__, \
#cond); \
} \
} while (0)
/* Resolve via the uncached entry point; return the address family, or
AF_UNSPEC if the host did not resolve. */
static int resolve_family_nocache(const char *host) {
SOCaddr addr;
const char *err = NULL;
if (hts_dns_resolve_nocache2(host, &addr, &err) == NULL)
return AF_UNSPEC;
return SOCaddr_sinfamily(addr);
}
int dns_selftests(httrackp *opt) {
failures = 0;
hts_dns_set_resolver_backend(&mock_backend);
/* IPv4-only / IPv6-only hosts map to the right family. */
IPV6_resolver = 0;
CHECK(resolve_family_nocache("v4only.test") == AF_INET);
CHECK(resolve_family_nocache("v6only.test") == AF_INET6);
/* Dual-stack: the current resolver keeps only the *first* address. Both
orderings pin that (not a family preference); PR2 (multi-address) widens
it. */
CHECK(resolve_family_nocache("dual.test") == AF_INET6); /* v6 listed first */
CHECK(resolve_family_nocache("dual4.test") == AF_INET); /* v4 listed first */
/* Unknown host does not resolve. */
CHECK(resolve_family_nocache("nodns.test") == AF_UNSPEC);
/* Family filter (-@i4 / -@i6) selects v4 / v6 out of the dual-stack host. */
IPV6_resolver = 1;
CHECK(resolve_family_nocache("dual.test") == AF_INET);
IPV6_resolver = 2;
CHECK(resolve_family_nocache("dual.test") == AF_INET6);
IPV6_resolver = 0;
/* Cached driver resolves a host once and reuses the *same* address. */
mock_reset_calls();
{
SOCaddr a1, a2;
char ip1[64], ip2[64];
const char *err = NULL;
CHECK(hts_dns_resolve2(opt, "v4only.test", &a1, &err) != NULL);
CHECK(hts_dns_resolve2(opt, "v4only.test", &a2, &err) != NULL);
CHECK(mock_find("v4only.test")->calls == 1);
/* the cache returns the right address, not merely a hit for the key */
SOCaddr_inetntoa(ip1, sizeof(ip1), a1);
SOCaddr_inetntoa(ip2, sizeof(ip2), a2);
CHECK(strcmp(ip1, "1.2.3.4") == 0);
CHECK(strcmp(ip1, ip2) == 0);
}
/* A negative result is cached too: a second lookup does not re-resolve. */
{
SOCaddr a1, a2;
const char *err = NULL;
CHECK(hts_dns_resolve2(opt, "nodns.test", &a1, &err) == NULL);
CHECK(hts_dns_resolve2(opt, "nodns.test", &a2, &err) == NULL);
CHECK(mock_find("nodns.test")->calls == 1); /* resolved once, then cached */
}
hts_dns_set_resolver_backend(NULL);
return failures;
}
#else
int dns_selftests(httrackp *opt) {
(void) opt;
return 0; /* resolver seam only exists in the IPv6 build */
}
#endif
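The seam this self-test drives can be sketched in isolation: a pair of function pointers that defaults to the libc resolver and can be swapped for a scripted one, with the matching deallocator travelling alongside the allocator. This is a hedged illustration only; `sketch_backend`, `sketch_set_backend` and the other `sketch_` names are hypothetical and do not reproduce HTTrack's `hts_resolver_backend` or `hts_dns_set_resolver_backend`, whose definitions are not shown in this diff.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical backend pair: the resolver and its matching deallocator. */
typedef struct sketch_backend {
  int (*getaddrinfo_fn)(const char *node, const char *service,
                        const struct addrinfo *hints, struct addrinfo **res);
  void (*freeaddrinfo_fn)(struct addrinfo *res);
} sketch_backend;

static const sketch_backend sketch_libc_backend = {getaddrinfo, freeaddrinfo};
static const sketch_backend *sketch_current = &sketch_libc_backend;

/* NULL restores the default (libc) backend. */
static void sketch_set_backend(const sketch_backend *b) {
  sketch_current = (b != NULL) ? b : &sketch_libc_backend;
}

static int sketch_mock_calls = 0;

/* Scripted resolver: knows a single IPv4 host, everything else fails. */
static int sketch_mock_gai(const char *node, const char *service,
                           const struct addrinfo *hints,
                           struct addrinfo **res) {
  (void) service;
  (void) hints;
  *res = NULL;
  sketch_mock_calls++;
  if (strcmp(node, "v4only.test") != 0)
    return EAI_NONAME;
  struct addrinfo *ai = calloc(1, sizeof(*ai));
  struct sockaddr_in *sin = calloc(1, sizeof(*sin));
  if (ai == NULL || sin == NULL) {
    free(ai);
    free(sin);
    return EAI_MEMORY;
  }
  sin->sin_family = AF_INET;
  ai->ai_family = AF_INET;
  ai->ai_addr = (struct sockaddr *) sin;
  ai->ai_addrlen = sizeof(*sin);
  *res = ai;
  return 0;
}

/* The mock allocated the chain with calloc, so it frees it with free. */
static void sketch_mock_free(struct addrinfo *res) {
  while (res != NULL) {
    struct addrinfo *const next = res->ai_next;
    free(res->ai_addr);
    free(res);
    res = next;
  }
}

static const sketch_backend sketch_mock_backend = {sketch_mock_gai,
                                                   sketch_mock_free};

/* Resolve through whichever backend is installed; AF_UNSPEC on failure. */
static int sketch_resolve_family(const char *host) {
  struct addrinfo *res = NULL;
  int family = AF_UNSPEC;
  if (sketch_current->getaddrinfo_fn(host, NULL, NULL, &res) == 0 &&
      res != NULL) {
    family = res->ai_family;
    sketch_current->freeaddrinfo_fn(res);
  }
  return family;
}
```

Keeping the deallocator in the same struct is what lets a fake hand out its own `addrinfo` chain safely: passing that chain to the libc `freeaddrinfo` would not be guaranteed to work.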

51
src/htsdns_selftest.h Normal file

@@ -0,0 +1,51 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 2026 Xavier Roche and other contributors
SPDX-License-Identifier: GPL-3.0-or-later
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Ethical use: we kindly ask that you NOT use this software to harvest email
addresses or to collect any other private information about people. Doing so
would dishonor our work and waste the many hours we have spent on it.
Please visit our Website: http://www.httrack.com
*/
/* ------------------------------------------------------------ */
/* File: htsdns_selftest.h */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#ifndef HTSDNS_SELFTEST_DEFH
#define HTSDNS_SELFTEST_DEFH
#ifdef HTS_INTERNAL_BYTECODE
#ifndef HTS_DEF_FWSTRUCT_httrackp
#define HTS_DEF_FWSTRUCT_httrackp
typedef struct httrackp httrackp;
#endif
/* Drive the DNS resolver and cache through a scripted (mock) getaddrinfo,
asserting address family, single-address selection, negative caching, the
IPv4/IPv6 family filter, and that a cached host is resolved only once.
Returns the number of failed checks (0 == success). */
int dns_selftests(httrackp *opt);
#endif
#endif
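The "resolved only once" and negative-caching properties named in the header comment can be illustrated with a toy cache that counts backend hits. This is a sketch under stated assumptions: the fixed-size table, the `sketch_` names and the integer address are hypothetical, and HTTrack's real DNS cache is not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CACHE_MAX 8

/* Hypothetical cache entry: failures are stored too (negative caching). */
typedef struct sketch_entry {
  char host[64];
  int resolved;  /* 1 = address known, 0 = cached failure */
  unsigned addr; /* toy IPv4 address packed in an integer */
} sketch_entry;

static sketch_entry sketch_cache[SKETCH_CACHE_MAX];
static size_t sketch_cache_len = 0;
static int sketch_lookup_calls = 0;

/* Hypothetical backend: knows one host, counts every call. */
static int sketch_backend_lookup(const char *host, unsigned *addr) {
  sketch_lookup_calls++;
  if (strcmp(host, "v4only.test") == 0) {
    *addr = 0x01020304u;
    return 1;
  }
  return 0;
}

/* Return 1 and fill *addr on success, 0 on failure; hit the backend at most
   once per host, whether the answer was positive or negative. */
static int sketch_cached_lookup(const char *host, unsigned *addr) {
  for (size_t i = 0; i < sketch_cache_len; i++) {
    if (strcmp(sketch_cache[i].host, host) == 0) {
      if (sketch_cache[i].resolved)
        *addr = sketch_cache[i].addr;
      return sketch_cache[i].resolved;
    }
  }
  unsigned a = 0;
  const int ok = sketch_backend_lookup(host, &a);
  if (sketch_cache_len < SKETCH_CACHE_MAX &&
      strlen(host) < sizeof(sketch_cache[0].host)) {
    sketch_entry *const e = &sketch_cache[sketch_cache_len++];
    strcpy(e->host, host);
    e->resolved = ok;
    e->addr = a;
  }
  if (ok)
    *addr = a;
  return ok;
}
```

Counting calls in the backend, rather than inspecting the cache, is the same observation point the self-test uses: a second lookup that does not bump the counter was served from the cache.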


@@ -33,14 +33,14 @@ EOF
else else
GET "${url}" GET "${url}"
fi fi
) \ ) |
| grep -E '^<!ENTITY [a-zA-Z0-9_]' \ grep -E '^<!ENTITY [a-zA-Z0-9_]' |
| sed \ sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \ -e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \ -e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/'\ -e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
| ( \ (
read A read -r A
while test -n "$A"; do while test -n "$A"; do
ent="${A%% *}" ent="${A%% *}"
code=$(echo "$A" | cut -f2 -d' ') code=$(echo "$A" | cut -f2 -d' ')
@@ -49,11 +49,11 @@ EOF
i=0 i=0
a=1664525 a=1664525
c=1013904223 c=1013904223
m="$[1 << 32]" m="$((1 << 32))"
while test "$i" -lt ${#ent}; do while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')" d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
hash="$[((${hash}*${a})%(${m})+${d}+${c})%(${m})]" hash="$((((hash * a) % (m) + d + c) % (m)))"
i=$[${i}+1] i=$((i + 1))
done done
echo -e " /* $A */" echo -e " /* $A */"
echo -e " case ${hash}u:" echo -e " case ${hash}u:"
@@ -63,7 +63,7 @@ EOF
echo -e " break;" echo -e " break;"
# next # next
read A read -r A
done done
) )
cat <<EOF cat <<EOF


@@ -43,8 +43,8 @@ Please visit our Website: http://www.httrack.com
configure.ac, decoupled from these). VERSION is the display form, VERSIONID configure.ac, decoupled from these). VERSION is the display form, VERSIONID
the dotted numeric form, AFF_VERSION the short form shown in footers, the dotted numeric form, AFF_VERSION the short form shown in footers,
LIB_VERSION the data/cache format generation. */ LIB_VERSION the data/cache format generation. */
#define HTTRACK_VERSION "3.49-8" #define HTTRACK_VERSION "3.49-9"
#define HTTRACK_VERSIONID "3.49.8" #define HTTRACK_VERSIONID "3.49.9"
#define HTTRACK_AFF_VERSION "3.x" #define HTTRACK_AFF_VERSION "3.x"
#define HTTRACK_LIB_VERSION "2.0" #define HTTRACK_LIB_VERSION "2.0"
@@ -226,9 +226,14 @@ Please visit our Website: http://www.httrack.com
/* Copyright (C) 1998 Xavier Roche and other contributors */ /* Copyright (C) 1998 Xavier Roche and other contributors */
#define HTTRACK_AFF_AUTHORS "[XR&CO'2014]" #define HTTRACK_AFF_AUTHORS "[XR&CO'2014]"
#define HTS_DEFAULT_FOOTER "<!-- Mirrored from %s%s by HTTrack Website Copier/" HTTRACK_AFF_VERSION " " HTTRACK_AFF_AUTHORS ", %s -->" #define HTS_DEFAULT_FOOTER \
"<!-- Mirrored from %s%s by HTTrack Website Copier/" HTTRACK_AFF_VERSION \
" " HTTRACK_AFF_AUTHORS ", %s -->"
#define HTTRACK_WEB "http://www.httrack.com" #define HTTRACK_WEB "http://www.httrack.com"
#define HTS_UPDATE_WEBSITE "http://www.httrack.com/update.php3?Product=HTTrack&Version=" HTTRACK_VERSIONID "&VersionStr=" HTTRACK_VERSION "&Platform=%d&Language=%s" #define HTS_UPDATE_WEBSITE \
"http://www.httrack.com/" \
"update.php3?Product=HTTrack&Version=" HTTRACK_VERSIONID \
"&VersionStr=" HTTRACK_VERSION "&Platform=%d&Language=%s"
#define H_CRLF "\x0d\x0a" #define H_CRLF "\x0d\x0a"
#define CRLF "\x0d\x0a" #define CRLF "\x0d\x0a"
@@ -242,12 +247,23 @@ Please visit our Website: http://www.httrack.com
#define HTS_NOPARAM "(none)" #define HTS_NOPARAM "(none)"
#define HTS_NOPARAM2 "\"(none)\"" #define HTS_NOPARAM2 "\"(none)\""
/* Boolean flag for option fields and API yes/no returns. An enum (not C bool) /* Boolean flag for option fields and API yes/no returns. Int-backed, not an
so it stays int-sized: option fields keep the httrackp layout/ABI, and a enum: an enum makes C++ reject `field = 1` / `f(0)` on the exported fields
return type stays compatible with the int it replaces. */ and params. Int-sized, so the httrackp layout and the ABI are unchanged. */
#ifndef HTS_DEF_DEFSTRUCT_hts_boolean #ifndef HTS_DEF_DEFSTRUCT_hts_boolean
#define HTS_DEF_DEFSTRUCT_hts_boolean #define HTS_DEF_DEFSTRUCT_hts_boolean
typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean;
typedef int hts_boolean;
#define HTS_FALSE 0
#define HTS_TRUE 1
#endif
#ifndef HTS_DEF_DEFSTRUCT_hts_tristate
#define HTS_DEF_DEFSTRUCT_hts_tristate
/* Tri-state hts_boolean: HTS_DEFAULT (-1) = "unspecified" (copy_htsopt leaves
the target untouched); HTS_FALSE/HTS_TRUE = off/on. */
typedef int hts_tristate;
#define HTS_DEFAULT (-1)
#endif #endif
/* Larger/smaller of two values. Macros: arguments are evaluated twice. */ /* Larger/smaller of two values. Macros: arguments are evaluated twice. */
@@ -278,8 +294,8 @@ typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean;
#endif #endif
#else #else
/* See <http://gcc.gnu.org/wiki/Visibility> */ /* See <http://gcc.gnu.org/wiki/Visibility> */
#if ( ( defined(__GNUC__) && ( __GNUC__ >= 4 ) ) \ #if ((defined(__GNUC__) && (__GNUC__ >= 4)) || \
|| ( defined(HAVE_VISIBILITY) && HAVE_VISIBILITY ) ) (defined(HAVE_VISIBILITY) && HAVE_VISIBILITY))
#define HTSEXT_API __attribute__((visibility("default"))) #define HTSEXT_API __attribute__((visibility("default")))
#else #else
@@ -335,8 +351,8 @@ typedef __int64 LLint;
typedef __int64 TStamp; typedef __int64 TStamp;
#define LLintP "%I64d" #define LLintP "%I64d"
#elif (defined(_LP64) || defined(__x86_64__) \ #elif (defined(_LP64) || defined(__x86_64__) || defined(__powerpc64__) || \
|| defined(__powerpc64__) || defined(__64BIT__)) defined(__64BIT__))
typedef long int LLint; typedef long int LLint;
@@ -405,7 +421,8 @@ typedef int T_SOC;
#if HTS_ACCESS #if HTS_ACCESS
#define HTS_ACCESS_FILE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH) #define HTS_ACCESS_FILE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)
#define HTS_ACCESS_FOLDER (S_IRUSR|S_IWUSR|S_IXUSR|S_IRGRP|S_IXGRP|S_IROTH|S_IXOTH) #define HTS_ACCESS_FOLDER \
(S_IRUSR | S_IWUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH)
#else #else
#define HTS_ACCESS_FILE (S_IRUSR | S_IWUSR) #define HTS_ACCESS_FILE (S_IRUSR | S_IWUSR)
@@ -427,7 +444,11 @@ typedef int T_SOC;
#endif #endif
/* fflush on stdout */ /* fflush on stdout */
#define io_flush { fflush(stdout); fflush(stdin); } #define io_flush \
{ \
fflush(stdout); \
fflush(stdin); \
}
/* HTSLib */ /* HTSLib */
@@ -524,7 +545,13 @@ static const t_htsboundary htsboundary = 0xDEADBEEF;
#if _HTS_WIDE #if _HTS_WIDE
extern FILE *DEBUG_fp; extern FILE *DEBUG_fp;
#define DEBUG_W(A) { if (DEBUG_fp==NULL) DEBUG_fp=fopen("bug.out","wb"); fprintf(DEBUG_fp,":>"A); fflush(DEBUG_fp); } #define DEBUG_W(A) \
{ \
if (DEBUG_fp == NULL) \
DEBUG_fp = fopen("bug.out", "wb"); \
fprintf(DEBUG_fp, ":>" A); \
fflush(DEBUG_fp); \
}
#undef _ #undef _
#define _ , #define _ ,
#endif #endif


@@ -1423,7 +1423,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
else else
infostatuscode(retour->msg, retour->statuscode); infostatuscode(retour->msg, retour->statuscode);
// default MIME type 2 // default MIME type 2
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME); strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
} else { // no code! } else { // no code!
retour->statuscode = STATUSCODE_INVALID; retour->statuscode = STATUSCODE_INVALID;
strcpybuff(retour->msg, "Unknown response structure"); strcpybuff(retour->msg, "Unknown response structure");
@@ -1438,7 +1438,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
retour->statuscode = HTTP_OK; retour->statuscode = HTTP_OK;
retour->keep_alive = 0; retour->keep_alive = 0;
strcpybuff(retour->msg, "Unknown, assuming junky server"); strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME); strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
} else if (strnotempty(a)) { } else if (strnotempty(a)) {
retour->statuscode = STATUSCODE_INVALID; retour->statuscode = STATUSCODE_INVALID;
strcpybuff(retour->msg, "Unknown (not HTTP/xx) response structure"); strcpybuff(retour->msg, "Unknown (not HTTP/xx) response structure");
@@ -1447,7 +1447,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
retour->statuscode = HTTP_OK; retour->statuscode = HTTP_OK;
retour->keep_alive = 0; retour->keep_alive = 0;
strcpybuff(retour->msg, "Unknown, assuming junky server"); strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME); strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
} }
} }
} else { // vide! } else { // vide!
@@ -1458,7 +1458,7 @@ void treatfirstline(htsblk * retour, const char *rcvd) {
/* This is dirty .. */ /* This is dirty .. */
retour->statuscode = HTTP_OK; retour->statuscode = HTTP_OK;
strcpybuff(retour->msg, "Unknown, assuming junky server"); strcpybuff(retour->msg, "Unknown, assuming junky server");
strcpybuff(retour->contenttype, HTS_HYPERTEXT_DEFAULT_MIME); strcpybuff(retour->contenttype, HTS_UNKNOWN_MIME);
} }
} }
@@ -1589,11 +1589,15 @@ void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * ret
} }
} }
} }
sscanf(rcvd + p, "%s", tempo); // An empty/whitespace Content-Type value yields no token: keep the
// sentinel default rather than reading an uninitialized tempo.
if (sscanf(rcvd + p, "%s", tempo) == 1) {
if (strlen(tempo) < sizeof(retour->contenttype) - 2) // pas trop long!! if (strlen(tempo) < sizeof(retour->contenttype) - 2) // pas trop long!!
strcpybuff(retour->contenttype, tempo); strcpybuff(retour->contenttype, tempo);
else else
strcpybuff(retour->contenttype, "application/octet-stream-unknown"); // erreur strcpybuff(retour->contenttype,
"application/octet-stream-unknown"); // erreur
}
} }
} else if ((p = strfield(rcvd, "Content-Range:")) != 0) { } else if ((p = strfield(rcvd, "Content-Range:")) != 0) {
// Content-Range: bytes 0-70870/70871 // Content-Range: bytes 0-70870/70871
@@ -2605,6 +2609,8 @@ int ident_url_absolute(const char *url, lien_adrfil *adrfil) {
for(i = 0; adrfil->fil[i] != '\0'; i++) for(i = 0; adrfil->fil[i] != '\0'; i++)
if (adrfil->fil[i] == '\\') if (adrfil->fil[i] == '\\')
adrfil->fil[i] = '/'; adrfil->fil[i] = '/';
// collapse ../ like the http branch above (path-traversal safety)
fil_simplifie(adrfil->fil);
} }
// no hostname // no hostname
@@ -4308,6 +4314,7 @@ int give_mimext(char *s, size_t ssize, const char *st) {
int ok = 0; int ok = 0;
int j = 0; int j = 0;
st = hts_effective_mime(st); /* no declared type: derive an html ext */
s[0] = '\0'; s[0] = '\0';
while((!ok) && (strnotempty(hts_mime[j][1]))) { while((!ok) && (strnotempty(hts_mime[j][1]))) {
if (strfield2(hts_mime[j][0], st)) { if (strfield2(hts_mime[j][0], st)) {
@@ -4808,6 +4815,19 @@ static SOCaddr* hts_ghbn(const t_dnscache *cache, const char *const iadr, SOCadd
return NULL; return NULL;
} }
#if HTS_INET6 != 0
/* Active resolver backend; defaults to the libc resolver. The self-test
reroutes it to script DNS answers in-process (see
hts_dns_set_resolver_backend). */
static const hts_resolver_backend hts_resolver_libc = {getaddrinfo,
freeaddrinfo};
static const hts_resolver_backend *hts_resolver = &hts_resolver_libc;
void hts_dns_set_resolver_backend(const hts_resolver_backend *backend) {
hts_resolver = (backend != NULL) ? backend : &hts_resolver_libc;
}
#endif
static SOCaddr *hts_dns_resolve_nocache2_(const char *const hostname, static SOCaddr *hts_dns_resolve_nocache2_(const char *const hostname,
SOCaddr *const addr, SOCaddr *const addr,
const char **error) { const char **error) {
@@ -4838,7 +4858,7 @@ static SOCaddr* hts_dns_resolve_nocache2_(const char *const hostname,
hints.ai_family = PF_UNSPEC; hints.ai_family = PF_UNSPEC;
hints.ai_socktype = SOCK_STREAM; hints.ai_socktype = SOCK_STREAM;
hints.ai_protocol = IPPROTO_TCP; hints.ai_protocol = IPPROTO_TCP;
if ( ( gerr = getaddrinfo(hostname, NULL, &hints, &res) ) == 0) { if ((gerr = hts_resolver->getaddrinfo(hostname, NULL, &hints, &res)) == 0) {
if (res != NULL) { if (res != NULL) {
if (res->ai_addr != NULL && res->ai_addrlen != 0) { if (res->ai_addr != NULL && res->ai_addrlen != 0) {
SOCaddr_copyaddr2(*addr, res->ai_addr, res->ai_addrlen); SOCaddr_copyaddr2(*addr, res->ai_addr, res->ai_addrlen);
@@ -4850,7 +4870,7 @@ static SOCaddr* hts_dns_resolve_nocache2_(const char *const hostname,
} }
} }
if (res) { if (res) {
freeaddrinfo(res); hts_resolver->freeaddrinfo(res);
} }
#endif #endif
} }
@@ -5298,6 +5318,11 @@ static int get_loglevel_from_coucal(coucal_loglevel level) {
static void default_coucal_loghandler(void *arg, coucal_loglevel level, static void default_coucal_loghandler(void *arg, coucal_loglevel level,
const char* format, va_list args) { const char* format, va_list args) {
/* informational chatter (hashtable stats on delete, etc.) only when
debugging; keep warnings and critical errors always visible. */
if (level > coucal_log_warning && hts_dgb_init <= 0) {
return;
}
if (level <= coucal_log_warning) { if (level <= coucal_log_warning) {
fprintf(stderr, "** warning: "); fprintf(stderr, "** warning: ");
} }
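
Reviewer sketch (not part of the commit): the new early return in default_coucal_loghandler reduces to a single predicate, shown below as a standalone function. The enum and the name should_print are hypothetical stand-ins; the sketch only assumes, as the `level <= coucal_log_warning` test in the diff suggests, that lower values are more severe.

```c
#include <assert.h>

/* Hypothetical stand-in for coucal_loglevel, ordered most to least severe. */
typedef enum {
  SKETCH_LOG_CRITICAL,
  SKETCH_LOG_WARNING,
  SKETCH_LOG_INFO,
  SKETCH_LOG_DEBUG
} sketch_loglevel;

/* Mirrors the gate in the diff: warnings and worse always print, anything
   chattier prints only when debugging is enabled (debug_init > 0). */
static int should_print(sketch_loglevel level, int debug_init) {
  assert(level >= SKETCH_LOG_CRITICAL && level <= SKETCH_LOG_DEBUG);
  return level <= SKETCH_LOG_WARNING || debug_init > 0;
}
```

With debugging off, should_print(SKETCH_LOG_INFO, 0) is 0, which corresponds to the hashtable summary lines no longer reaching the console.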


@@ -481,10 +481,22 @@ HTS_STATIC int strcmpnocase(const char *a, const char *b) {
// is this MIME an hypertext MIME (text/html), html/js-style or other script/text type? // is this MIME an hypertext MIME (text/html), html/js-style or other script/text type?
#define HTS_HYPERTEXT_DEFAULT_MIME "text/html" #define HTS_HYPERTEXT_DEFAULT_MIME "text/html"
/* Sentinel stored when the server declared no Content-Type. It is html-ish
for every type test (so a typeless response still parses/stores as today),
but the naming code (wire_patches_ext) treats it as "no declared type" and
keeps the URL extension. It rides the cache, so updates name consistently. */
#define HTS_UNKNOWN_MIME "unknown/unknown"
/* Map the no-declared-type sentinel back to a real type for any header or
record we EMIT or PERSIST, so "unknown/unknown" never reaches a consumer
(a served Content-Type, a ProxyTrack .arc record, ...). */
#define hts_effective_mime(m) \
(strfield2((m), HTS_UNKNOWN_MIME) ? HTS_HYPERTEXT_DEFAULT_MIME : (m))
#define is_html_mime_type(a) \ #define is_html_mime_type(a) \
( (strfield2((a),"text/html")!=0)\ ((strfield2((a), "text/html") != 0) || \
|| (strfield2((a),"application/xhtml+xml")!=0) \ (strfield2((a), "application/xhtml+xml") != 0) || \
(strfield2((a), HTS_UNKNOWN_MIME) != \
0) /* no declared type: treat as html */ \
) )
#define is_hypertext_mime__(a) \ #define is_hypertext_mime__(a) \
( \ ( \
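
Reviewer sketch (not part of the commit): the two roles of the sentinel can be modelled in isolation. The helper mime_equals is a hypothetical stand-in for strfield2, assumed here to be a case-insensitive whole-string comparison; the two string constants carry the same values as the macros in the diff.

```c
#include <assert.h>
#include <string.h>
#include <strings.h> /* strcasecmp (POSIX) */

#define SKETCH_UNKNOWN_MIME "unknown/unknown" /* value of HTS_UNKNOWN_MIME */
#define SKETCH_DEFAULT_MIME "text/html" /* HTS_HYPERTEXT_DEFAULT_MIME */

/* Hypothetical stand-in for strfield2(): case-insensitive full match. */
static int mime_equals(const char *a, const char *b) {
  assert(a != NULL && b != NULL);
  return strcasecmp(a, b) == 0;
}

/* Role 1: anything emitted or persisted sees a real type, never the
   sentinel. */
static const char *effective_mime(const char *m) {
  return mime_equals(m, SKETCH_UNKNOWN_MIME) ? SKETCH_DEFAULT_MIME : m;
}

/* Role 2: for type tests the sentinel still counts as html, so a typeless
   response keeps being parsed and stored as before. */
static int is_html_mime(const char *m) {
  return mime_equals(m, "text/html") ||
         mime_equals(m, "application/xhtml+xml") ||
         mime_equals(m, SKETCH_UNKNOWN_MIME);
}
```

Only the naming code is meant to tell the sentinel apart from a declared text/html; every other consumer goes through one of these two views.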


@@ -92,8 +92,8 @@ struct htsmoduleStruct {
/* Callbacks */ /* Callbacks */
t_htsAddLink addLink; /* call this function when links are t_htsAddLink addLink; /* call this function when links are
being detected. it if not your responsability to decide being detected. it if not your responsability to
if the engine will keep them, or not. */ decide if the engine will keep them, or not. */
/* Optional */ /* Optional */
char *localLink; /* if non null, the engine will write there the local char *localLink; /* if non null, the engine will write there the local
@@ -117,7 +117,6 @@ struct htsmoduleStruct {
int *ptr_; int *ptr_;
const char *page_charset_; const char *page_charset_;
/* Internal use - please don't touch */ /* Internal use - please don't touch */
}; };
#ifdef __cplusplus #ifdef __cplusplus


@@ -138,6 +138,35 @@ static void cleanEndingSpaceOrDot(char *s) {
} }
} }
/* Should the wire Content-Type override the URL's own extension when naming the
saved file? True when the type is patchable (may_unknown2) and either the URL
extension implies no specific type or the server declared a disagreeing one.
A URL extension mapping to a specific non-HTML type is kept only when the
server declared NO type (the HTS_UNKNOWN_MIME sentinel; the #267 mangle
guard): a typeless .png stays .png, but a .pdf explicitly served as text/html
is named .html. The sentinel rides the cache, so updates stay consistent. */
static int wire_patches_ext(httrackp *opt, const char *wiremime,
const char *file) {
char urlmime[256];
if (may_unknown2(opt, wiremime, file))
return 0; /* type kept verbatim (keep-list / bogus-multiple) */
urlmime[0] = '\0';
/* type implied by the URL extension, only when confidently known (flag 0) */
if (!get_httptype_sized(opt, urlmime, sizeof(urlmime), file, 0))
return 1; /* URL ext implies no known type: trust the wire type */
if (strfield2(wiremime, urlmime))
return 0; /* wire agrees with the ext: keep it (no .htm->.html churn) */
/* wire disagrees with a specific non-HTML URL ext. Keep the ext only when
the server declared no type (the sentinel); an explicitly declared type,
even text/html, is trusted, so a binary-looking URL that really serves
HTML (login/error interstitial, soft-404) is named .html. */
if (!is_hypertext_mime(opt, urlmime, file) &&
strfield2(wiremime, HTS_UNKNOWN_MIME))
return 0;
return 1;
}
// forme le nom du fichier à sauver (save) à partir de fil et adr // forme le nom du fichier à sauver (save) à partir de fil et adr
// système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html) // système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html)
int url_savename(lien_adrfilsave *const afs, int url_savename(lien_adrfilsave *const afs,
@@ -325,7 +354,10 @@ int url_savename(lien_adrfilsave *const afs,
} }
/* replace shtml to html.. */ /* replace shtml to html.. */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD) /* HARD delays every type, except one the user pinned with --assume: honor it
immediately (ishtml() consults the user type), no delayed name (#56) */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
!is_userknowntype(opt, fil))
is_html = -1; /* ALWAYS delay type */ is_html = -1; /* ALWAYS delay type */
else else
is_html = ishtml(opt, fil); is_html = ishtml(opt, fil);
@@ -380,7 +412,7 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(r.cdispo)) { /* filename given */ if (strnotempty(r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */ ext_chg = 2; /* change filename */
strcpybuff(ext, r.cdispo); strcpybuff(ext, r.cdispo);
} else if (!may_unknown2(opt, r.contenttype, fil)) { // on peut patcher à priori? } else if (wire_patches_ext(opt, r.contenttype, fil)) {
if (give_mimext(s, sizeof(s), if (give_mimext(s, sizeof(s),
r.contenttype)) { // recognized extension r.contenttype)) { // recognized extension
ext_chg = 1; ext_chg = 1;
@@ -425,7 +457,8 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(headers->r.cdispo)) { /* filename given */ if (strnotempty(headers->r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */ ext_chg = 2; /* change filename */
strcpybuff(ext, headers->r.cdispo); strcpybuff(ext, headers->r.cdispo);
} else if (!may_unknown2(opt, headers->r.contenttype, headers->url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type) } else if (wire_patches_ext(opt, headers->r.contenttype,
headers->url_fil)) {
char s[16]; char s[16];
if (give_mimext( if (give_mimext(
s, sizeof(s), s, sizeof(s),
@@ -641,7 +674,8 @@ int url_savename(lien_adrfilsave *const afs,
if (!has_been_moved) { if (!has_been_moved) {
if (back[b].r.statuscode != -10) { // erreur if (back[b].r.statuscode != -10) { // erreur
if (strnotempty(back[b].r.contenttype) == 0) if (strnotempty(back[b].r.contenttype) == 0)
strcpybuff(back[b].r.contenttype, "text/html"); // message d'erreur en html strcpybuff(back[b].r.contenttype,
HTS_UNKNOWN_MIME); // no declared type
// Finalement on, renvoie un erreur, pour ne toucher à rien dans le code // Finalement on, renvoie un erreur, pour ne toucher à rien dans le code
// libérer emplacement backing // libérer emplacement backing
} }
@@ -653,7 +687,8 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(back[b].r.cdispo)) { /* filename given */ if (strnotempty(back[b].r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */ ext_chg = 2; /* change filename */
strcpybuff(ext, back[b].r.cdispo); strcpybuff(ext, back[b].r.cdispo);
} else if (!may_unknown2(opt, back[b].r.contenttype, back[b].url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type) } else if (wire_patches_ext(opt, back[b].r.contenttype,
back[b].url_fil)) {
if (give_mimext( if (give_mimext(
s, sizeof(s), s, sizeof(s),
back[b].r.contenttype)) { // recognized extension back[b].r.contenttype)) { // recognized extension
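
Reviewer sketch (not part of the commit): the decision order of wire_patches_ext can be checked with a reduced model. Everything here is hypothetical scaffolding: the engine lookups (may_unknown2, get_httptype_sized, is_hypertext_mime) are replaced by plain parameters, and the comparison is assumed to be a case-insensitive full match.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h> /* strcasecmp (POSIX) */

#define SKETCH_UNKNOWN_MIME "unknown/unknown" /* value of HTS_UNKNOWN_MIME */

/* Reduced model of wire_patches_ext(). Returns 1 when the wire Content-Type
   should rename the saved file's extension.
   kept_verbatim:  stands in for may_unknown2() being true
   url_mime:       type implied by the URL extension, NULL if none is known
   url_mime_is_ht: stands in for is_hypertext_mime() on url_mime
   wire_mime:      type stored for the response, possibly the sentinel */
static int wire_patches_ext_model(int kept_verbatim, const char *url_mime,
                                  int url_mime_is_ht, const char *wire_mime) {
  assert(wire_mime != NULL);
  if (kept_verbatim)
    return 0; /* type kept verbatim */
  if (url_mime == NULL)
    return 1; /* URL extension implies nothing: trust the wire type */
  if (strcasecmp(wire_mime, url_mime) == 0)
    return 0; /* wire agrees with the extension: keep it */
  if (!url_mime_is_ht && strcasecmp(wire_mime, SKETCH_UNKNOWN_MIME) == 0)
    return 0; /* typeless response on a specific non-HTML extension */
  return 1; /* explicitly declared, disagreeing type wins */
}
```

The two cases named in the commit's comment map to wire_patches_ext_model(0, "image/png", 0, "unknown/unknown") for the typeless .png and wire_patches_ext_model(0, "application/pdf", 0, "text/html") for the .pdf served as HTML.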


@@ -112,8 +112,8 @@ struct SOCaddr {
/** Pointer to the port field (network byte order) for the active family. /** Pointer to the port field (network byte order) for the active family.
Asserts on NULL or an unset/unknown family. */ Asserts on NULL or an unset/unknown family. */
static HTS_INLINE HTS_UNUSED in_port_t* SOCaddr_sinport_(SOCaddr *const addr, static HTS_INLINE HTS_UNUSED in_port_t *
const char *file, const int line) { SOCaddr_sinport_(SOCaddr *const addr, const char *file, const int line) {
assertf_(addr != NULL, file, line); assertf_(addr != NULL, file, line);
switch (addr->m_addr.sa.sa_family) { switch (addr->m_addr.sa.sa_family) {
case AF_INET: case AF_INET:
@@ -134,7 +134,8 @@ static HTS_INLINE HTS_UNUSED in_port_t* SOCaddr_sinport_(SOCaddr *const addr,
/** Length of the active sockaddr (sockaddr_in or sockaddr_in6), or 0 if the /** Length of the active sockaddr (sockaddr_in or sockaddr_in6), or 0 if the
family is unset/unknown. The 0 case doubles as the "not valid" test. */ family is unset/unknown. The 0 case doubles as the "not valid" test. */
static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_size_(const SOCaddr *const addr, static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_size_(const SOCaddr *const addr,
const char *file, const int line) { const char *file,
const int line) {
assertf_(addr != NULL, file, line); assertf_(addr != NULL, file, line);
switch (addr->m_addr.sa.sa_family) { switch (addr->m_addr.sa.sa_family) {
case AF_INET: case AF_INET:
@@ -152,8 +153,8 @@ static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_size_(const SOCaddr*const addr,
} }
/** Reset to the unset state (family AF_UNSPEC), making the address invalid. */ /** Reset to the unset state (family AF_UNSPEC), making the address invalid. */
static HTS_INLINE HTS_UNUSED void SOCaddr_clear_(SOCaddr*const addr, static HTS_INLINE HTS_UNUSED void
const char *file, const int line) { SOCaddr_clear_(SOCaddr *const addr, const char *file, const int line) {
assertf_(addr != NULL, file, line); assertf_(addr != NULL, file, line);
addr->m_addr.sa.sa_family = AF_UNSPEC; addr->m_addr.sa.sa_family = AF_UNSPEC;
} }
@@ -191,14 +192,16 @@ static HTS_INLINE HTS_UNUSED void SOCaddr_clear_(SOCaddr*const addr,
/** Set the port (host-order argument, stored network-order) on the active /** Set the port (host-order argument, stored network-order) on the active
* family. */ * family. */
#define SOCaddr_initport(server, port) do { \ #define SOCaddr_initport(server, port) \
do { \
SOCaddr_sinport(server) = htons((in_port_t) (port)); \ SOCaddr_sinport(server) = htons((in_port_t) (port)); \
} while (0) } while (0)
/** Initialize as an all-zero IPv4 wildcard (INADDR_ANY) address; returns its /** Initialize as an all-zero IPv4 wildcard (INADDR_ANY) address; returns its
sockaddr length. */ sockaddr length. */
static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_initany_(SOCaddr *const addr, static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_initany_(SOCaddr *const addr,
const char *file, const int line) { const char *file,
const int line) {
assertf_(addr != NULL, file, line); assertf_(addr != NULL, file, line);
memset(&addr->m_addr.in, 0, sizeof(addr->m_addr.in)); memset(&addr->m_addr.in, 0, sizeof(addr->m_addr.in));
addr->m_addr.in.sin_family = AF_INET; addr->m_addr.in.sin_family = AF_INET;
@@ -206,7 +209,8 @@ static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_initany_(SOCaddr*const addr,
} }
/** Initialize server as an IPv4 wildcard (INADDR_ANY) address. */ /** Initialize server as an IPv4 wildcard (INADDR_ANY) address. */
#define SOCaddr_initany(server) do { \ #define SOCaddr_initany(server) \
do { \
SOCaddr_initany_(&(server), __FILE__, __LINE__); \ SOCaddr_initany_(&(server), __FILE__, __LINE__); \
} while (0) } while (0)
@@ -215,8 +219,10 @@ static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_initany_(SOCaddr*const addr,
with port zeroed. Any other size leaves an AF_INET shell. Returns the with port zeroed. Any other size leaves an AF_INET shell. Returns the
resulting sockaddr length. */ resulting sockaddr length. */
static HTS_UNUSED socklen_t SOCaddr_copyaddr_(SOCaddr *const server, static HTS_UNUSED socklen_t SOCaddr_copyaddr_(SOCaddr *const server,
const void *data, const size_t data_size, const void *data,
const char *file, const int line) { const size_t data_size,
const char *file,
const int line) {
assertf_(server != NULL, file, line); assertf_(server != NULL, file, line);
assertf_(data != NULL, file, line); assertf_(data != NULL, file, line);
@@ -248,32 +254,35 @@ static HTS_UNUSED socklen_t SOCaddr_copyaddr_(SOCaddr*const server,
/** Copy hpaddr (length hpsize) into server, writing the result length into the /** Copy hpaddr (length hpsize) into server, writing the result length into the
lvalue server_len (int). See SOCaddr_copyaddr_ for accepted forms. */ lvalue server_len (int). See SOCaddr_copyaddr_ for accepted forms. */
#define SOCaddr_copyaddr(server, server_len, hpaddr, hpsize) do { \ #define SOCaddr_copyaddr(server, server_len, hpaddr, hpsize) \
server_len = (int) SOCaddr_copyaddr_(&(server), hpaddr, hpsize, __FILE__, __LINE__); \ do { \
server_len = (int) SOCaddr_copyaddr_(&(server), hpaddr, hpsize, __FILE__, \
__LINE__); \
} while (0) } while (0)
/** Like SOCaddr_copyaddr but discards the result length. */ /** Like SOCaddr_copyaddr but discards the result length. */
#define SOCaddr_copyaddr2(server, hpaddr, hpsize) do { \ #define SOCaddr_copyaddr2(server, hpaddr, hpsize) \
do { \
(void) SOCaddr_copyaddr_(&(server), hpaddr, hpsize, __FILE__, __LINE__); \ (void) SOCaddr_copyaddr_(&(server), hpaddr, hpsize, __FILE__, __LINE__); \
} while (0) } while (0)
/** Copy one SOCaddr (src) into another (dest), preserving family and port. */ /** Copy one SOCaddr (src) into another (dest), preserving family and port. */
#define SOCaddr_copy_SOCaddr(dest, src) do { \ #define SOCaddr_copy_SOCaddr(dest, src) \
SOCaddr_copyaddr_(&(dest), &(src).m_addr.sa, SOCaddr_size(src), __FILE__, __LINE__); \ do { \
SOCaddr_copyaddr_(&(dest), &(src).m_addr.sa, SOCaddr_size(src), __FILE__, \
__LINE__); \
} while (0) } while (0)
/** Write the numeric (dotted/colon) host of ss into namebuf (capacity /** Write the numeric (dotted/colon) host of ss into namebuf (capacity
namebuflen), scope id stripped. On failure namebuf becomes "". */ namebuflen), scope id stripped. On failure namebuf becomes "". */
static HTS_UNUSED void SOCaddr_inetntoa_(char *namebuf, size_t namebuflen, static HTS_UNUSED void SOCaddr_inetntoa_(char *namebuf, size_t namebuflen,
SOCaddr *const ss, SOCaddr *const ss, const char *file,
const char *file, const int line) { const int line) {
assertf_(namebuf != NULL, file, line); assertf_(namebuf != NULL, file, line);
assertf_(ss != NULL, file, line); assertf_(ss != NULL, file, line);
if (getnameinfo(&ss->m_addr.sa, sizeof(ss->m_addr), if (getnameinfo(&ss->m_addr.sa, sizeof(ss->m_addr), namebuf, namebuflen, NULL,
namebuf, namebuflen, 0, NI_NUMERICHOST) == 0) {
NULL, 0,
NI_NUMERICHOST) == 0) {
/* remove scope id(s) */ /* remove scope id(s) */
char *const pos = strchr(namebuf, '%'); char *const pos = strchr(namebuf, '%');
if (pos != NULL) { if (pos != NULL) {
@@ -289,11 +298,28 @@ static HTS_UNUSED void SOCaddr_inetntoa_(char *namebuf, size_t namebuflen,
SOCaddr_inetntoa_(namebuf, namebuflen, &(ss), __FILE__, __LINE__) SOCaddr_inetntoa_(namebuf, namebuflen, &(ss), __FILE__, __LINE__)
/** Single-char family tag: '1' for IPv4, '2' otherwise (used in the cache). */ /** Single-char family tag: '1' for IPv4, '2' otherwise (used in the cache). */
#define SOCaddr_getproto(ss) ( SOCaddr_size(ss) == sizeof(struct sockaddr_in) ? '1' : '2') #define SOCaddr_getproto(ss) \
(SOCaddr_size(ss) == sizeof(struct sockaddr_in) ? '1' : '2')
/** Length type for socket APIs (getsockname, accept, ...). */ /** Length type for socket APIs (getsockname, accept, ...). */
typedef socklen_t SOClen; typedef socklen_t SOClen;
#if HTS_INET6 != 0
/** Resolver backend: getaddrinfo/freeaddrinfo as a swappable pair, so the
self-test can script DNS answers (families, multiplicity, errors)
in-process. The free function must match its getaddrinfo (a fake allocates
its own chain), hence the pair. */
typedef struct hts_resolver_backend {
int (*getaddrinfo)(const char *node, const char *service,
const struct addrinfo *hints, struct addrinfo **res);
void (*freeaddrinfo)(struct addrinfo *res);
} hts_resolver_backend;
/** Install a resolver backend for the process; NULL restores the libc default.
Test-only seam, not thread-safe; callers must serialize against resolves. */
void hts_dns_set_resolver_backend(const hts_resolver_backend *backend);
#endif
#ifdef __cplusplus #ifdef __cplusplus
} }
#endif #endif
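
Reviewer sketch (not part of the commit): a fake backend for this seam might look like the following. The struct mirrors the pair declared in the diff under a different name; fake_getaddrinfo, fake_freeaddrinfo, fake_calls and the host name "ok.test" are hypothetical, and the real self-test in htsdns_selftest.c may be organised quite differently.

```c
#define _POSIX_C_SOURCE 200112L /* expose struct addrinfo under strict C */
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Same shape as hts_resolver_backend in the diff (renamed for the sketch). */
typedef struct sketch_resolver_backend {
  int (*getaddrinfo)(const char *node, const char *service,
                     const struct addrinfo *hints, struct addrinfo **res);
  void (*freeaddrinfo)(struct addrinfo *res);
} sketch_resolver_backend;

static int fake_calls = 0; /* number of resolve requests seen by the fake */

/* Scripted answer: "ok.test" resolves to 127.0.0.1, anything else fails. */
static int fake_getaddrinfo(const char *node, const char *service,
                            const struct addrinfo *hints,
                            struct addrinfo **res) {
  struct addrinfo *ai;
  struct sockaddr_in *sin;

  (void) service;
  (void) hints;
  assert(node != NULL && res != NULL);
  fake_calls++;
  if (strcmp(node, "ok.test") != 0)
    return EAI_NONAME;
  ai = calloc(1, sizeof(*ai));
  sin = calloc(1, sizeof(*sin));
  if (ai == NULL || sin == NULL) {
    free(ai);
    free(sin);
    return EAI_MEMORY;
  }
  sin->sin_family = AF_INET;
  sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  ai->ai_family = AF_INET;
  ai->ai_socktype = SOCK_STREAM;
  ai->ai_addr = (struct sockaddr *) sin;
  ai->ai_addrlen = sizeof(*sin);
  *res = ai;
  return 0;
}

/* Matching deallocator: the chain came from calloc(), so libc's
   freeaddrinfo() must not be used on it. */
static void fake_freeaddrinfo(struct addrinfo *res) {
  while (res != NULL) {
    struct addrinfo *next = res->ai_next;

    free(res->ai_addr);
    free(res);
    res = next;
  }
}

static const sketch_resolver_backend fake_backend = {fake_getaddrinfo,
                                                     fake_freeaddrinfo};
```

This illustrates why the seam is a pair: a chain built by the fake has to be released by the fake's own free function. A counter such as fake_calls is one plausible way to assert that a cached host is resolved only once.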


@@ -72,6 +72,7 @@ typedef struct String String;
#endif #endif
#ifndef HTS_DEF_STRUCT_String #ifndef HTS_DEF_STRUCT_String
#define HTS_DEF_STRUCT_String #define HTS_DEF_STRUCT_String
struct String { struct String {
char *buffer_; char *buffer_;
size_t length_; size_t length_;
@@ -179,6 +180,7 @@ typedef struct lien_url lien_url;
#ifndef HTS_DEF_DEFSTRUCT_hts_log_type #ifndef HTS_DEF_DEFSTRUCT_hts_log_type
#define HTS_DEF_DEFSTRUCT_hts_log_type #define HTS_DEF_DEFSTRUCT_hts_log_type
typedef enum hts_log_type { typedef enum hts_log_type {
LOG_PANIC, LOG_PANIC,
LOG_ERROR, LOG_ERROR,
@@ -288,6 +290,7 @@ typedef enum htsparsejava_flags {
/* Link-rewriting style for saved pages (opt->urlmode). */ /* Link-rewriting style for saved pages (opt->urlmode). */
#ifndef HTS_DEF_DEFSTRUCT_hts_urlmode #ifndef HTS_DEF_DEFSTRUCT_hts_urlmode
#define HTS_DEF_DEFSTRUCT_hts_urlmode #define HTS_DEF_DEFSTRUCT_hts_urlmode
typedef enum hts_urlmode { typedef enum hts_urlmode {
HTS_URLMODE_ABSOLUTE = 0, /**< absolute URL (http://host/path) everywhere */ HTS_URLMODE_ABSOLUTE = 0, /**< absolute URL (http://host/path) everywhere */
HTS_URLMODE_ABSOLUTE_FILE = 1, /**< legacy file: form, unused */ HTS_URLMODE_ABSOLUTE_FILE = 1, /**< legacy file: form, unused */
@@ -301,6 +304,7 @@ typedef enum hts_urlmode {
/* Cache policy for updates and retries (opt->cache). */ /* Cache policy for updates and retries (opt->cache). */
#ifndef HTS_DEF_DEFSTRUCT_hts_cachemode #ifndef HTS_DEF_DEFSTRUCT_hts_cachemode
#define HTS_DEF_DEFSTRUCT_hts_cachemode #define HTS_DEF_DEFSTRUCT_hts_cachemode
typedef enum hts_cachemode { typedef enum hts_cachemode {
HTS_CACHE_NONE = 0, /**< no cache */ HTS_CACHE_NONE = 0, /**< no cache */
HTS_CACHE_PRIORITY = 1, /**< cache takes priority over the network */ HTS_CACHE_PRIORITY = 1, /**< cache takes priority over the network */
@@ -311,6 +315,7 @@ typedef enum hts_cachemode {
/* Interactive wizard level (opt->wizard). */ /* Interactive wizard level (opt->wizard). */
#ifndef HTS_DEF_DEFSTRUCT_hts_wizard #ifndef HTS_DEF_DEFSTRUCT_hts_wizard
#define HTS_DEF_DEFSTRUCT_hts_wizard #define HTS_DEF_DEFSTRUCT_hts_wizard
typedef enum hts_wizard { typedef enum hts_wizard {
HTS_WIZARD_NONE = 0, /**< no wizard */ HTS_WIZARD_NONE = 0, /**< no wizard */
HTS_WIZARD_ASK = 1, /**< wizard asks questions */ HTS_WIZARD_ASK = 1, /**< wizard asks questions */
@@ -321,6 +326,7 @@ typedef enum hts_wizard {
/* robots.txt / meta-robots obedience level (opt->robots). */ /* robots.txt / meta-robots obedience level (opt->robots). */
#ifndef HTS_DEF_DEFSTRUCT_hts_robots #ifndef HTS_DEF_DEFSTRUCT_hts_robots
#define HTS_DEF_DEFSTRUCT_hts_robots #define HTS_DEF_DEFSTRUCT_hts_robots
typedef enum hts_robots { typedef enum hts_robots {
HTS_ROBOTS_NEVER = 0, /**< ignore robots rules */ HTS_ROBOTS_NEVER = 0, /**< ignore robots rules */
HTS_ROBOTS_SOMETIMES = 1, /**< partial obedience (default) */ HTS_ROBOTS_SOMETIMES = 1, /**< partial obedience (default) */
@@ -422,11 +428,11 @@ struct httrackp {
LLint maxfile_html; /**< max bytes per HTML file */ LLint maxfile_html; /**< max bytes per HTML file */
int maxsoc; /**< max simultaneous sockets (-cN) */ int maxsoc; /**< max simultaneous sockets (-cN) */
LLint fragment; /**< split site after this many bytes */ LLint fragment; /**< split site after this many bytes */
hts_boolean hts_tristate
nearlink; /**< also fetch images/data adjacent to a page but off-site */ nearlink; /**< also fetch images/data adjacent to a page but off-site */
hts_boolean makeindex; /**< build a top-level index.html */ hts_boolean makeindex; /**< build a top-level index.html */
hts_boolean kindex; /**< build a keyword index */ hts_boolean kindex; /**< build a keyword index */
hts_boolean delete_old; /**< delete locally obsolete files after update */ hts_tristate delete_old; /**< delete locally obsolete files after update */
int timeout; /**< connection timeout in seconds */ int timeout; /**< connection timeout in seconds */
int rateout; /**< minimum transfer rate (bytes/s) before abort */ int rateout; /**< minimum transfer rate (bytes/s) before abort */
int maxtime; /**< max total mirror duration in seconds */ int maxtime; /**< max total mirror duration in seconds */
@@ -459,13 +465,13 @@ struct httrackp {
hts_boolean maketrack; /**< maintain an operations-statistics log */ hts_boolean maketrack; /**< maintain an operations-statistics log */
int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */ int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */
int hostcontrol; /**< ban slow/timing-out hosts; see hts_hostcontrol bits */ int hostcontrol; /**< ban slow/timing-out hosts; see hts_hostcontrol bits */
hts_boolean errpage; /**< generate an error page on 404 and similar */ hts_tristate errpage; /**< generate an error page on 404 and similar */
hts_boolean hts_boolean
check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves
*/ */
hts_boolean all_in_cache; /**< keep all retrieved data in the cache */ hts_boolean all_in_cache; /**< keep all retrieved data in the cache */
hts_robots robots; /**< robots.txt handling level */ hts_robots robots; /**< robots.txt handling level */
hts_boolean external; /**< render external links as error pages */ hts_tristate external; /**< render external links as error pages */
hts_boolean passprivacy; /**< strip passwords from external links */ hts_boolean passprivacy; /**< strip passwords from external links */
hts_boolean includequery; /**< include the query string in saved names */ hts_boolean includequery; /**< include the query string in saved names */
hts_boolean mirror_first_page; /**< only mirror the links of the first page */ hts_boolean mirror_first_page; /**< only mirror the links of the first page */
@@ -479,7 +485,7 @@ struct httrackp {
hts_boolean sizehack; /**< treat same-size response as "updated" */ hts_boolean sizehack; /**< treat same-size response as "updated" */
hts_boolean urlhack; // force "url normalization" to avoid loops hts_boolean urlhack; // force "url normalization" to avoid loops
hts_boolean tolerant; /**< accept an incorrect Content-Length */ hts_boolean tolerant; /**< accept an incorrect Content-Length */
hts_boolean hts_tristate
parseall; /**< parse aggressively, including unknown tags with links */ parseall; /**< parse aggressively, including unknown tags with links */
hts_boolean parsedebug; /**< parser debug mode */ hts_boolean parsedebug; /**< parser debug mode */
hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */ hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */


@@ -296,6 +296,48 @@ static const char *html_inline_safe(const char *src, char *dst, size_t size) {
return dst; return dst;
} }
/* Byte before html, or a space sentinel at the buffer start where html[-1]
would underflow; space reads as the word boundary the guards want there. */
static HTS_INLINE char html_prevc(const char *html, const char *start) {
return html > start ? html[-1] : ' ';
}
/* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
argument is a method, not a URL: #218). Case-insensitive. */
static int is_http_method(const char *s, size_t len) {
static const char *const methods[] = {"GET", "POST", "PUT",
"DELETE", "HEAD", "OPTIONS",
"PATCH", "TRACE", NULL};
int i;
for (i = 0; methods[i] != NULL; i++) {
if (strlen(methods[i]) == len && strfield(s, methods[i]) == (int) len)
return 1;
}
return 0;
}
/* Percent-encode '(' and ')' in a link emitted into an unquoted url(...) (CSS
or JS): a literal ')' closes the token early and the UA mis-parses the value
(#163). The UA decodes %28/%29 back to the saved-on-disk name. */
static void escape_url_parens(char *const s, const size_t size) {
char BIGSTK buff[HTS_URLMAXSIZE * 2];
size_t i, j;
for (i = 0, j = 0; s[i] != '\0' && j + 3 < size && j + 3 < sizeof(buff);
i++) {
if (s[i] == '(' || s[i] == ')') {
buff[j++] = '%';
buff[j++] = '2';
buff[j++] = s[i] == '(' ? '8' : '9';
} else {
buff[j++] = s[i];
}
}
buff[j] = '\0';
strlcpybuff(s, buff, size);
}
/* Main parser */ /* Main parser */
int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) { int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
char catbuff[CATBUFF_SIZE]; char catbuff[CATBUFF_SIZE];
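
Reviewer sketch (not part of the commit): the paren escaping can be exercised on its own with a standalone variant. The name escape_parens and the fixed 1024-byte scratch buffer are hypothetical. Unlike escape_url_parens in the diff, which stops copying when space runs out, this variant rejects input that would not fit and leaves the string untouched; that is a deliberate simplification for the sketch, not the commit's behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Percent-encode '(' and ')' in place. s has capacity size, NUL included.
   Returns 1 on success, 0 (s unchanged) if the encoded form would not fit. */
static int escape_parens(char *s, size_t size) {
  char buff[1024];
  size_t i, j;

  assert(s != NULL);
  for (i = 0, j = 0; s[i] != '\0'; i++) {
    const int paren = (s[i] == '(' || s[i] == ')');
    const size_t need = paren ? 3 : 1;

    /* keep one byte free for the terminating NUL */
    if (j + need >= size || j + need >= sizeof(buff))
      return 0;
    if (paren) {
      buff[j++] = '%';
      buff[j++] = '2';
      buff[j++] = (s[i] == '(') ? '8' : '9';
    } else {
      buff[j++] = s[i];
    }
  }
  buff[j] = '\0';
  memcpy(s, buff, j + 1);
  return 1;
}
```

For a buffer holding "img(1).png" with enough room, the call rewrites it to "img%281%29.png", the form that is safe inside an unquoted url(...) token.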
@@ -556,7 +598,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (opt->getmode & HTS_GETMODE_HTML) { if (opt->getmode & HTS_GETMODE_HTML) {
p = strfield(html, "title"); p = strfield(html, "title");
if (p) { if (p) {
if (*(html - 1) == '/') if (html_prevc(html, r->adr) == '/')
p = 0; // /title p = 0; // /title
} else { } else {
if (strfield(html, "/html")) if (strfield(html, "/html"))
@@ -1341,6 +1383,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int can_avoid_quotes = 0; int can_avoid_quotes = 0;
char quotes_replacement = '\0'; char quotes_replacement = '\0';
int ensure_not_mime = 0; int ensure_not_mime = 0;
// .open(method,url): reject an HTTP-method first arg (#218)
int ensure_not_method = 0;
// @import: the quoted token is the URL; a trailing
// media/supports/layer condition is not part of it
int is_import = 0;
if (inscript_tag) if (inscript_tag)
expected_end = ";\"\'"; // voir a href="javascript:doc.location='foo'" expected_end = ";\"\'"; // voir a href="javascript:doc.location='foo'"
@@ -1357,9 +1404,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (!nc) if (!nc)
nc = strfield(html, ":location"); // javascript:location="doc" nc = strfield(html, ":location"); // javascript:location="doc"
if (!nc) { // location="doc" if (!nc) { // location="doc"
if ((nc = strfield(html, "location")) if ((nc = strfield(html, "location")) &&
&& !isspace(*(html - 1)) !isspace(html_prevc(html, r->adr)))
)
nc = 0; nc = 0;
} }
if (!nc) if (!nc)
@@ -1369,6 +1415,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
expected = '('; // parenthèse expected = '('; // parenthèse
expected_end = "),"; // fin: virgule ou parenthèse expected_end = "),"; // fin: virgule ou parenthèse
ensure_not_mime = 1; //* ensure the url is not a mime type */ ensure_not_mime = 1; //* ensure the url is not a mime type */
ensure_not_method = 1; // xhr.open: don't grab method
} }
if (!nc) if (!nc)
if ((nc = strfield(html, ".replace"))) { // window.replace("url") if ((nc = strfield(html, ".replace"))) { // window.replace("url")
@@ -1380,7 +1427,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
expected = '('; // parenthèse expected = '('; // parenthèse
expected_end = ")"; // fin: parenthèse expected_end = ")"; // fin: parenthèse
} }
if (!nc && (nc = strfield(html, "url")) && (!isalnum(*(html - 1))) && *(html - 1) != '_') { // url(url) if (!nc && (nc = strfield(html, "url")) &&
(!isalnum(html_prevc(html, r->adr))) &&
html_prevc(html, r->adr) != '_') { // url(url)
expected = '('; // parenthèse expected = '('; // parenthèse
expected_end = ")"; // fin: parenthèse expected_end = ")"; // fin: parenthèse
can_avoid_quotes = 1; can_avoid_quotes = 1;
@@ -1390,6 +1439,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((nc = strfield(html, "import"))) { // import "url" if ((nc = strfield(html, "import"))) { // import "url"
if (is_space(*(html + nc))) { if (is_space(*(html + nc))) {
expected = 0; // no char expected expected = 0; // no char expected
is_import = 1;
} else } else
nc = 0; nc = 0;
} }
@@ -1407,6 +1457,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) { if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) {
const char *b, *c; const char *b, *c;
int ndelim = 1; int ndelim = 1;
int valid_url = 0;
if ((*a == 34) || (*a == '\'')) if ((*a == 34) || (*a == '\''))
a++; a++;
@@ -1421,12 +1472,20 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
b++; b++;
} }
c = b--; c = b--;
// no closing delimiter here (truncated input):
// Don't scan past the buffer NUL or capture it.
if (*c != '\0') {
c += ndelim; c += ndelim;
while (*c == ' ') while (*c == ' ')
c++; c++;
if ((strchr(expected_end, *c)) || (*c == '\n') valid_url =
|| (*c == '\r')) { (strchr(expected_end, *c)) || (*c == '\n') ||
c -= (ndelim + 1); (*c == '\r') ||
(is_import && *(b + 1 + ndelim) == ' ');
}
if (valid_url) {
// URL end = last char (b), not the delimiter
c = b;
if ((int) (c - a + 1)) { if ((int) (c - a + 1)) {
if (ensure_not_mime) { if (ensure_not_mime) {
int i = 0; int i = 0;
@@ -1442,6 +1501,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
i++; i++;
} }
} }
// XHR.open's "GET" etc. is a method, not a URL
if (a != NULL && ensure_not_method &&
is_http_method(a, (size_t) (c - a + 1))) {
a = NULL;
}
// Check for bogus links (Vasiliy) // Check for bogus links (Vasiliy)
if (a != NULL) { if (a != NULL) {
const size_t size = c - a + 1; const size_t size = c - a + 1;
@@ -1485,7 +1549,6 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
} }
} }
} }
} }
} }
} }
@@ -1692,6 +1755,24 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
hts_nodetect[i - hts_nodetect[i -
1]); 1]);
} }
// xmlns / xmlns:prefix declare
// XML namespaces, not resources
// (#191)
else {
const int xl = strfield(
intag_startattr, "xmlns");
const char xc =
intag_startattr[xl];
if (xl &&
(xc == ':' || xc == '=' ||
is_space(xc))) {
url_ok = 0;
hts_log_print(
opt, LOG_DEBUG,
"dirty parsing: xmlns "
"namespace avoided");
}
}
} }
} }
@@ -2967,6 +3048,10 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
/* Never escape high-chars (we don't know the encoding!!) */ /* Never escape high-chars (we don't know the encoding!!) */
inplace_escape_uri_utf(tempo, sizeof(tempo)); inplace_escape_uri_utf(tempo, sizeof(tempo));
// unquoted url() (CSS/JS): keep parens escaped
if (ending_p == ')')
escape_url_parens(tempo, sizeof(tempo));
//if (!no_esc_utf) //if (!no_esc_utf)
// escape_uri(tempo); // escape with %xx // escape_uri(tempo); // escape with %xx
//else { //else {
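
Note on the `html_prevc(html, r->adr)` replacements in the hunks above: the old code read `*(html - 1)` directly, which looks one byte before the buffer when `html` is at its start. The definition of `html_prevc` is not shown in this part of the diff, so the following is only a sketch of what a bounds-checked "previous character" helper could look like. The name `prev_char_or_nul` and its NUL-at-start convention are assumptions for illustration, not the project's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not HTTrack's html_prevc): return the character just
   before p, or '\0' when p is at the start of the buffer, so the caller
   never reads start[-1]. Both pointers must point into the same buffer. */
static char prev_char_or_nul(const char *p, const char *start) {
  assert(p != NULL && start != NULL);
  return (p > start) ? p[-1] : '\0';
}
```

With such a helper, a test like `prev_char_or_nul(html, start) == '/'` is simply false at offset 0 instead of reading out of bounds.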


@@ -58,7 +58,8 @@ HTSEXT_API htsErrorCallback hts_get_error_callback(void);
#endif #endif
#endif #endif
#define HTSSAFE_ABORT_FUNCTION(A,B,C) do { \ #define HTSSAFE_ABORT_FUNCTION(A, B, C) \
do { \
htsErrorCallback callback = hts_get_error_callback(); \ htsErrorCallback callback = hts_get_error_callback(); \
if (callback != NULL) { \ if (callback != NULL) { \
callback(A, B, C); \ callback(A, B, C); \
@@ -75,7 +76,8 @@ HTSEXT_API htsErrorCallback hts_get_error_callback(void);
/** /**
* Fatal assertion check. * Fatal assertion check.
*/ */
#define assertf__(exp, sexp, file, line) (void) ( (exp) || (abortf_(sexp, file, line), 0) ) #define assertf__(exp, sexp, file, line) \
(void) ((exp) || (abortf_(sexp, file, line), 0))
/** /**
* Fatal assertion check. * Fatal assertion check.
@@ -106,7 +108,8 @@ static HTS_UNUSED void abortf_(const char *exp, const char *file, int line) {
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
/* Note: char[] and const char[] are compatible */ /* Note: char[] and const char[] are compatible */
#define HTS_IS_CHAR_BUFFER(VAR) ( __builtin_types_compatible_p ( typeof (VAR), char[] ) ) #define HTS_IS_CHAR_BUFFER(VAR) \
(__builtin_types_compatible_p(typeof(VAR), char[]))
#else #else
/* Note: a bit lame as char[8] won't be seen. */ /* Note: a bit lame as char[8] won't be seen. */
#define HTS_IS_CHAR_BUFFER(VAR) (sizeof(VAR) != sizeof(char *)) #define HTS_IS_CHAR_BUFFER(VAR) (sizeof(VAR) != sizeof(char *))
@@ -201,10 +204,13 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
*/ */
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
#define strncatbuff(A, B, N) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \ #define strncatbuff(A, B, N) \
__builtin_choose_expr( \
HTS_IS_CHAR_BUFFER(A), \
strncat_safe_(A, sizeof(A), B, \ strncat_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__), \ "overflow while appending '" #B "' to '" #A "'", __FILE__, \
__LINE__), \
strncatbuff_ptr_((A), (B), (N))) strncatbuff_ptr_((A), (B), (N)))
#else #else
#define strncatbuff(A, B, N) \ #define strncatbuff(A, B, N) \
@@ -212,7 +218,8 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
? strncat(A, B, N) \ ? strncat(A, B, N) \
: strncat_safe_(A, sizeof(A), B, \ : strncat_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) ) "overflow while appending '" #B "' to '" #A "'", \
__FILE__, __LINE__))
#endif #endif
/** /**
@@ -222,18 +229,24 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
*/ */
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
#define strcatbuff(A, B) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \ #define strcatbuff(A, B) \
__builtin_choose_expr( \
HTS_IS_CHAR_BUFFER(A), \
strncat_safe_(A, sizeof(A), B, \ strncat_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__), \ (size_t) -1, \
"overflow while appending '" #B "' to '" #A "'", __FILE__, \
__LINE__), \
strcatbuff_ptr_((A), (B))) strcatbuff_ptr_((A), (B)))
#else #else
#define strcatbuff(A, B) \ #define strcatbuff(A, B) \
(HTS_IS_NOT_CHAR_BUFFER(A) \ (HTS_IS_NOT_CHAR_BUFFER(A) \
? strcat(A, B) \ ? strcat(A, B) \
: strncat_safe_(A, sizeof(A), B, \ : strncat_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) ) (size_t) -1, \
"overflow while appending '" #B "' to '" #A "'", \
__FILE__, __LINE__))
#endif #endif
/** /**
@@ -243,10 +256,13 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
*/ */
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
#define strcpybuff(A, B) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \ #define strcpybuff(A, B) \
__builtin_choose_expr( \
HTS_IS_CHAR_BUFFER(A), \
strcpy_safe_(A, sizeof(A), B, \ strcpy_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
"overflow while copying '" #B "' to '"#A"'", __FILE__, __LINE__), \ "overflow while copying '" #B "' to '" #A "'", __FILE__, \
__LINE__), \
strcpybuff_ptr_((A), (B))) strcpybuff_ptr_((A), (B)))
#else #else
#define strcpybuff(A, B) \ #define strcpybuff(A, B) \
@@ -254,7 +270,8 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
? strcpy(A, B) \ ? strcpy(A, B) \
: strcpy_safe_(A, sizeof(A), B, \ : strcpy_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
"overflow while copying '" #B "' to '"#A"'", __FILE__, __LINE__) ) "overflow while copying '" #B "' to '" #A "'", __FILE__, \
__LINE__))
#endif #endif
/* /*
@@ -269,9 +286,9 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
* Append characters of "B" to "A", "A" having a maximum capacity of "S". * Append characters of "B" to "A", "A" having a maximum capacity of "S".
*/ */
#define strlcatbuff(A, B, S) \ #define strlcatbuff(A, B, S) \
strncat_safe_(A, S, B, \ strncat_safe_(A, S, B, HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \ (size_t) -1, "overflow while appending '" #B "' to '" #A "'", \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) __FILE__, __LINE__)
/** /**
* Append at most "N" characters of "B" to "A", "A" having a maximum capacity * Append at most "N" characters of "B" to "A", "A" having a maximum capacity
@@ -286,16 +303,17 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
* Copy characters of "B" to "A", "A" having a maximum capacity of "S". * Copy characters of "B" to "A", "A" having a maximum capacity of "S".
*/ */
#define strlcpybuff(A, B, S) \ #define strlcpybuff(A, B, S) \
strcpy_safe_(A, S, B, \ strcpy_safe_(A, S, B, HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \ "overflow while copying '" #B "' to '" #A "'", __FILE__, \
"overflow while copying '" #B "' to '"#A"'", __FILE__, __LINE__) __LINE__)
/** strnlen replacement (autotools). **/ /** strnlen replacement (autotools). **/
#if (!defined(_WIN32) && !defined(HAVE_STRNLEN)) #if (!defined(_WIN32) && !defined(HAVE_STRNLEN))
static HTS_UNUSED size_t strnlen(const char *s, size_t maxlen) { static HTS_UNUSED size_t strnlen(const char *s, size_t maxlen) {
size_t i; size_t i;
for(i = 0 ; i < maxlen && s[i] != '\0' ; i++) ; for (i = 0; i < maxlen && s[i] != '\0'; i++)
;
return i; return i;
} }
#endif #endif
@@ -304,12 +322,13 @@ static HTS_UNUSED size_t strnlen(const char *s, size_t maxlen) {
Aborts if source is NULL or has no NUL within that capacity. The sentinel Aborts if source is NULL or has no NUL within that capacity. The sentinel
sizeof_source == (size_t)-1 means "capacity unknown", and falls back to the sizeof_source == (size_t)-1 means "capacity unknown", and falls back to the
unbounded strlen (used when the source is a pointer rather than an array). */ unbounded strlen (used when the source is a pointer rather than an array). */
static HTS_INLINE HTS_UNUSED size_t strlen_safe_(const char *source, const size_t sizeof_source, static HTS_INLINE HTS_UNUSED size_t strlen_safe_(const char *source,
const size_t sizeof_source,
const char *file, int line) { const char *file, int line) {
size_t size; size_t size;
assertf_(source != NULL, file, line); assertf_(source != NULL, file, line);
size = sizeof_source != (size_t) -1 size = sizeof_source != (size_t) -1 ? strnlen(source, sizeof_source)
? strnlen(source, sizeof_source) : strlen(source); : strlen(source);
assertf_(size < sizeof_source, file, line); assertf_(size < sizeof_source, file, line);
return size; return size;
} }
@@ -319,10 +338,10 @@ static HTS_INLINE HTS_UNUSED size_t strlen_safe_(const char *source, const size_
source's capacity or (size_t)-1 if unknown. Aborts if the result (existing source's capacity or (size_t)-1 if unknown. Aborts if the result (existing
dest length + appended bytes + NUL) would not fit sizeof_dest: this NEVER dest length + appended bytes + NUL) would not fit sizeof_dest: this NEVER
truncates. Always NUL-terminates on success. */ truncates. Always NUL-terminates on success. */
static HTS_INLINE HTS_UNUSED char* strncat_safe_(char *const dest, const size_t sizeof_dest, static HTS_INLINE HTS_UNUSED char *
strncat_safe_(char *const dest, const size_t sizeof_dest,
const char *const source, const size_t sizeof_source, const char *const source, const size_t sizeof_source,
const size_t n, const size_t n, const char *exp, const char *file, int line) {
const char *exp, const char *file, int line) {
const size_t source_len = strlen_safe_(source, sizeof_source, file, line); const size_t source_len = strlen_safe_(source, sizeof_source, file, line);
const size_t dest_len = strlen_safe_(dest, sizeof_dest, file, line); const size_t dest_len = strlen_safe_(dest, sizeof_dest, file, line);
/* note: "size_t is an unsigned integral type" ((size_t) -1 is positive) */ /* note: "size_t is an unsigned integral type" ((size_t) -1 is positive) */
@@ -337,12 +356,14 @@ static HTS_INLINE HTS_UNUSED char* strncat_safe_(char *const dest, const size_t
/* Core bounded copy: empties dest then appends all of source via /* Core bounded copy: empties dest then appends all of source via
strncat_safe_. sizeof_dest is dest's total capacity (NUL included). Aborts strncat_safe_. sizeof_dest is dest's total capacity (NUL included). Aborts
(no truncation) if source plus its NUL would not fit. */ (no truncation) if source plus its NUL would not fit. */
static HTS_INLINE HTS_UNUSED char* strcpy_safe_(char *const dest, const size_t sizeof_dest, static HTS_INLINE HTS_UNUSED char *
strcpy_safe_(char *const dest, const size_t sizeof_dest,
const char *const source, const size_t sizeof_source, const char *const source, const size_t sizeof_source,
const char *exp, const char *file, int line) { const char *exp, const char *file, int line) {
assertf_(sizeof_dest != 0, file, line); assertf_(sizeof_dest != 0, file, line);
dest[0] = '\0'; dest[0] = '\0';
return strncat_safe_(dest, sizeof_dest, source, sizeof_source, (size_t) -1, exp, file, line); return strncat_safe_(dest, sizeof_dest, source, sizeof_source, (size_t) -1,
exp, file, line);
} }
/** /**
@@ -385,22 +406,28 @@ static HTS_INLINE HTS_UNUSED htsbuff htsbuff_ptr_(char *buf, size_t cap) {
/* 0 for an array, a -1 array-size compile error for a pointer. */ /* 0 for an array, a -1 array-size compile error for a pointer. */
#define htsbuff_must_be_array_(A) \ #define htsbuff_must_be_array_(A) \
(sizeof(char[1 - 2 * !!__builtin_types_compatible_p(typeof(A), typeof(&(A)[0]))]) - 1) (sizeof(char[1 - 2 * !!__builtin_types_compatible_p(typeof(A), \
typeof(&(A)[0]))]) - \
1)
#define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR) + htsbuff_must_be_array_(ARR)) #define htsbuff_array(ARR) \
htsbuff_ptr_((ARR), sizeof(ARR) + htsbuff_must_be_array_(ARR))
#else #else
#define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR)) #define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR))
#endif #endif
/** Builder over pointer P of known capacity N (N includes the NUL). */ /** Builder over pointer P of known capacity N (N includes the NUL). */
#define htsbuff_ptr(P, N) htsbuff_ptr_((P), (N)) #define htsbuff_ptr(P, N) htsbuff_ptr_((P), (N))
/** Append at most n characters of s (stopping at its NUL). Aborts on overflow. */ /** Append at most n characters of s (stopping at its NUL). Aborts on overflow.
static HTS_INLINE HTS_UNUSED void htsbuff_catn(htsbuff *b, const char *s, size_t n) { */
static HTS_INLINE HTS_UNUSED void htsbuff_catn(htsbuff *b, const char *s,
size_t n) {
const size_t add = strnlen(s, n); const size_t add = strnlen(s, n);
/* Overflow-safe: keep the (potentially huge) 'add' alone on one side. The /* Overflow-safe: keep the (potentially huge) 'add' alone on one side. The
maintained invariant len < cap makes 'cap - len' >= 1 (no underflow), so maintained invariant len < cap makes 'cap - len' >= 1 (no underflow), so
'add < cap - len' cannot wrap the way 'len + add < cap' could. */ 'add < cap - len' cannot wrap the way 'len + add < cap' could. */
assertf__(add < b->cap - b->len, "htsbuff append overflow", __FILE__, __LINE__); assertf__(add < b->cap - b->len, "htsbuff append overflow", __FILE__,
__LINE__);
memcpy(b->buf + b->len, s, add); memcpy(b->buf + b->len, s, add);
b->len += add; b->len += add;
b->buf[b->len] = '\0'; b->buf[b->len] = '\0';
@@ -437,7 +464,13 @@ static HTS_INLINE HTS_UNUSED const char *htsbuff_str(const htsbuff *b) {
#define calloct(A, B) calloc((A), (B)) #define calloct(A, B) calloc((A), (B))
#define freet(A) do { if ((A) != NULL) { free(A); (A) = NULL; } } while(0) #define freet(A) \
do { \
if ((A) != NULL) { \
free(A); \
(A) = NULL; \
} \
} while (0)
#define strdupt(A) strdup(A) #define strdupt(A) strdup(A)
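
Note on the overflow check in `htsbuff_catn` above: its comment explains why the test is written `add < cap - len` rather than `len + add < cap`. The sketch below illustrates that comparison on a stand-in type. `demo_buff`, `demo_fits` and `demo_catn` are hypothetical names, and unlike the real function (which aborts on overflow) this version returns 0 so the behavior is easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a bounded string builder.
   Invariant: len < cap (there is always room for the NUL). */
typedef struct {
  char *buf;
  size_t cap; /* total capacity, NUL included */
  size_t len; /* current length, NUL excluded */
} demo_buff;

/* Return 1 if 'add' more bytes plus the NUL fit. Because len < cap,
   'cap - len' is at least 1 and cannot underflow, and a huge 'add' is
   compared as-is instead of being added to len (which could wrap). */
static int demo_fits(const demo_buff *b, size_t add) {
  assert(b->len < b->cap);
  return add < b->cap - b->len;
}

/* Append at most n characters of s; return 0 (no change) if it won't fit. */
static int demo_catn(demo_buff *b, const char *s, size_t n) {
  size_t add = 0;
  while (add < n && s[add] != '\0') /* portable strnlen */
    add++;
  if (!demo_fits(b, add))
    return 0;
  memcpy(b->buf + b->len, s, add);
  b->len += add;
  b->buf[b->len] = '\0';
  return 1;
}
```

Passing `(size_t) -1` as `n` means "no limit", matching the sentinel used elsewhere in this header. The naive `len + add < cap` form would wrap around for such a value and could wrongly report that the data fits.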


@@ -60,6 +60,7 @@ typedef struct String String;
#endif #endif
#ifndef HTS_DEF_STRUCT_String #ifndef HTS_DEF_STRUCT_String
#define HTS_DEF_STRUCT_String #define HTS_DEF_STRUCT_String
/** /**
* Growable owned string. * Growable owned string.
* *
@@ -131,14 +132,16 @@ struct String {
/** Drop the last byte and re-terminate. Undefined if the String is empty /** Drop the last byte and re-terminate. Undefined if the String is empty
(no length check; would underflow). **/ (no length check; would underflow). **/
#define StringPopRight(BLK) do { \ #define StringPopRight(BLK) \
do { \
StringBuffRW(BLK)[--StringLength(BLK)] = '\0'; \ StringBuffRW(BLK)[--StringLength(BLK)] = '\0'; \
} while (0) } while (0)
/** Grow so capacity_ >= CAPACITY (total bytes, including the NUL). May realloc /** Grow so capacity_ >= CAPACITY (total bytes, including the NUL). May realloc
(invalidating prior buffer pointers); aborts via STRING_ASSERT on OOM. (invalidating prior buffer pointers); aborts via STRING_ASSERT on OOM.
Never shrinks. **/ Never shrinks. **/
#define StringRoomTotal(BLK, CAPACITY) do { \ #define StringRoomTotal(BLK, CAPACITY) \
do { \
const size_t capacity_ = (size_t) (CAPACITY); \ const size_t capacity_ = (size_t) (CAPACITY); \
while ((BLK).capacity_ < capacity_) { \ while ((BLK).capacity_ < capacity_) { \
if ((BLK).capacity_ < 16) { \ if ((BLK).capacity_ < 16) { \
@@ -153,11 +156,13 @@ struct String {
/** Reserve room for SIZE more bytes beyond the current length (plus the NUL). /** Reserve room for SIZE more bytes beyond the current length (plus the NUL).
May realloc, invalidating prior buffer pointers. **/ May realloc, invalidating prior buffer pointers. **/
#define StringRoom(BLK, SIZE) StringRoomTotal(BLK, StringLength(BLK) + (SIZE) + 1) #define StringRoom(BLK, SIZE) \
StringRoomTotal(BLK, StringLength(BLK) + (SIZE) + 1)
/** Reserve room for SIZE more bytes and return the (post-realloc) RW buffer, /** Reserve room for SIZE more bytes and return the (post-realloc) RW buffer,
for appending in place. Does not update length_; the caller must. **/ for appending in place. Does not update length_; the caller must. **/
#define StringBuffN(BLK, SIZE) StringBuffN_(&(BLK), SIZE) #define StringBuffN(BLK, SIZE) StringBuffN_(&(BLK), SIZE)
HTS_STATIC char *StringBuffN_(String *blk, int size) { HTS_STATIC char *StringBuffN_(String *blk, int size) {
StringRoom(*blk, size); StringRoom(*blk, size);
return StringBuffRW(*blk); return StringBuffRW(*blk);
@@ -166,7 +171,8 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
/** Zero the fields (NULL buffer, no allocation). Use on an uninitialized /** Zero the fields (NULL buffer, no allocation). Use on an uninitialized
String only; does NOT free an existing buffer (use StringFree to reset String only; does NOT free an existing buffer (use StringFree to reset
an owned one), so calling it on a live String leaks. **/ an owned one), so calling it on a live String leaks. **/
#define StringInit(BLK) do { \ #define StringInit(BLK) \
do { \
(BLK).buffer_ = NULL; \ (BLK).buffer_ = NULL; \
(BLK).capacity_ = 0; \ (BLK).capacity_ = 0; \
(BLK).length_ = 0; \ (BLK).length_ = 0; \
@@ -174,7 +180,8 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
/** Truncate to length 0, keeping the allocation. Forces a non-NULL buffer /** Truncate to length 0, keeping the allocation. Forces a non-NULL buffer
(allocates if empty) and writes the leading NUL, so StringBuff is "". **/ (allocates if empty) and writes the leading NUL, so StringBuff is "". **/
#define StringClear(BLK) do { \ #define StringClear(BLK) \
do { \
(BLK).length_ = 0; \ (BLK).length_ = 0; \
StringRoom(BLK, 0); \ StringRoom(BLK, 0); \
(BLK).buffer_[0] = '\0'; \ (BLK).buffer_[0] = '\0'; \
@@ -182,7 +189,8 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
/** Set length_ to SIZE, or to strlen(buffer_) if SIZE is negative. Caller /** Set length_ to SIZE, or to strlen(buffer_) if SIZE is negative. Caller
asserts SIZE fits the existing content; does not (re)allocate. **/ asserts SIZE fits the existing content; does not (re)allocate. **/
#define StringSetLength(BLK, SIZE) do { \ #define StringSetLength(BLK, SIZE) \
do { \
if (SIZE >= 0) { \ if (SIZE >= 0) { \
(BLK).length_ = SIZE; \ (BLK).length_ = SIZE; \
} else { \ } else { \
@@ -192,7 +200,8 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
/** Release the owned buffer and reset to the empty state (NULL buffer). /** Release the owned buffer and reset to the empty state (NULL buffer).
Idempotent; safe on an already-empty String. **/ Idempotent; safe on an already-empty String. **/
#define StringFree(BLK) do { \ #define StringFree(BLK) \
do { \
if ((BLK).buffer_ != NULL) { \ if ((BLK).buffer_ != NULL) { \
STRING_FREE((BLK).buffer_); \ STRING_FREE((BLK).buffer_); \
(BLK).buffer_ = NULL; \ (BLK).buffer_ = NULL; \
@@ -207,7 +216,8 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
freed or used by the caller afterwards. length_/capacity_ are set to freed or used by the caller afterwards. length_/capacity_ are set to
strlen(STR) (capacity_ here excludes the NUL, so the next append reallocs). strlen(STR) (capacity_ here excludes the NUL, so the next append reallocs).
**/ **/
#define StringSetBuffer(BLK, STR) do { \ #define StringSetBuffer(BLK, STR) \
do { \
size_t len__ = strlen(STR); \ size_t len__ = strlen(STR); \
StringFree(BLK); \ StringFree(BLK); \
(BLK).buffer_ = (STR); \ (BLK).buffer_ = (STR); \
@@ -218,7 +228,8 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
/** Append SIZE raw bytes from STR (NULs allowed as data). Grows as needed and /** Append SIZE raw bytes from STR (NULs allowed as data). Grows as needed and
re-terminates with a NUL after the appended bytes. STR must not alias re-terminates with a NUL after the appended bytes. STR must not alias
BLK's buffer (a realloc would invalidate it). **/ BLK's buffer (a realloc would invalidate it). **/
#define StringMemcat(BLK, STR, SIZE) do { \ #define StringMemcat(BLK, STR, SIZE) \
do { \
const char *str_mc_ = (STR); \ const char *str_mc_ = (STR); \
const size_t size_mc_ = (size_t) (SIZE); \ const size_t size_mc_ = (size_t) (SIZE); \
StringRoom(BLK, size_mc_); \ StringRoom(BLK, size_mc_); \
@@ -231,13 +242,15 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
/** Replace content with SIZE raw bytes from STR (NULs allowed as data). /** Replace content with SIZE raw bytes from STR (NULs allowed as data).
Same non-aliasing requirement as StringMemcat. **/ Same non-aliasing requirement as StringMemcat. **/
#define StringMemcpy(BLK, STR, SIZE) do { \ #define StringMemcpy(BLK, STR, SIZE) \
do { \
(BLK).length_ = 0; \ (BLK).length_ = 0; \
StringMemcat(BLK, STR, SIZE); \ StringMemcat(BLK, STR, SIZE); \
} while (0) } while (0)
/** Append one byte and re-terminate. Grows as needed. **/ /** Append one byte and re-terminate. Grows as needed. **/
#define StringAddchar(BLK, c) do { \ #define StringAddchar(BLK, c) \
do { \
String *const s__ = &(BLK); \ String *const s__ = &(BLK); \
char c__ = (c); \ char c__ = (c); \
StringRoom(*s__, 1); \ StringRoom(*s__, 1); \
@@ -281,7 +294,8 @@ HTS_STATIC void StringAttach(String * blk, char **str) {
/** Append the C string STR (up to its NUL). No-op if STR is NULL. STR must not /** Append the C string STR (up to its NUL). No-op if STR is NULL. STR must not
alias BLK's buffer. **/ alias BLK's buffer. **/
#define StringCat(BLK, STR) do { \ #define StringCat(BLK, STR) \
do { \
const char *const str__ = (STR); \ const char *const str__ = (STR); \
if (str__ != NULL) { \ if (str__ != NULL) { \
const size_t size__ = strlen(str__); \ const size_t size__ = strlen(str__); \
@@ -291,7 +305,8 @@ HTS_STATIC void StringAttach(String * blk, char **str) {
/** Append at most SIZE leading bytes of the C string STR. No-op if STR is /** Append at most SIZE leading bytes of the C string STR. No-op if STR is
NULL. STR must not alias BLK's buffer. **/ NULL. STR must not alias BLK's buffer. **/
#define StringCatN(BLK, STR, SIZE) do { \ #define StringCatN(BLK, STR, SIZE) \
do { \
const char *str__ = (STR); \ const char *str__ = (STR); \
if (str__ != NULL) { \ if (str__ != NULL) { \
size_t size__ = strlen(str__); \ size_t size__ = strlen(str__); \
@@ -304,7 +319,8 @@ HTS_STATIC void StringAttach(String * blk, char **str) {
/** Replace content with at most SIZE leading bytes of the C string STR. /** Replace content with at most SIZE leading bytes of the C string STR.
If STR is NULL, clears to "". STR must not alias BLK's buffer. **/ If STR is NULL, clears to "". STR must not alias BLK's buffer. **/
#define StringCopyN(BLK, STR, SIZE) do { \ #define StringCopyN(BLK, STR, SIZE) \
do { \
const char *str__ = (STR); \ const char *str__ = (STR); \
const size_t usize__ = (SIZE); \ const size_t usize__ = (SIZE); \
(BLK).length_ = 0; \ (BLK).length_ = 0; \
@@ -326,7 +342,8 @@ HTS_STATIC void StringAttach(String * blk, char **str) {
/** Replace content with a copy of the C string STR. If STR is NULL, clears to /** Replace content with a copy of the C string STR. If STR is NULL, clears to
"". STR must not alias BLK's buffer (use StringCopyOverlapped if it might). "". STR must not alias BLK's buffer (use StringCopyOverlapped if it might).
**/ **/
#define StringCopy(BLK, STR) do { \ #define StringCopy(BLK, STR) \
do { \
const char *str__ = (STR); \ const char *str__ = (STR); \
if (str__ != NULL) { \ if (str__ != NULL) { \
size_t size__ = strlen(str__); \ size_t size__ = strlen(str__); \
@@ -338,7 +355,8 @@ HTS_STATIC void StringAttach(String * blk, char **str) {
/** Like StringCopy but safe when STR aliases BLK's own buffer: copies via a /** Like StringCopy but safe when STR aliases BLK's own buffer: copies via a
temporary, so a self-copy or overlap is well-defined. **/ temporary, so a self-copy or overlap is well-defined. **/
#define StringCopyOverlapped(BLK, STR) do { \ #define StringCopyOverlapped(BLK, STR) \
do { \
String s__ = STRING_EMPTY; \ String s__ = STRING_EMPTY; \
StringCopy(s__, STR); \ StringCopy(s__, STR); \
StringCopyS(BLK, s__); \ StringCopyS(BLK, s__); \
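
Note on `StringCopyOverlapped` above: its comment says it copies via a temporary so that a source pointing into the destination's own buffer stays valid. A minimal sketch of that idea, using a hypothetical `demo_str` type rather than the real `String` macros (and returning 0 on allocation failure instead of aborting), might look like this:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical owned string, for illustration only. */
typedef struct {
  char *buf;  /* heap buffer, or NULL when empty */
  size_t len; /* length, NUL excluded */
} demo_str;

/* Replace dst's content with src, even when src points inside dst->buf.
   The source is snapshotted into a fresh buffer before the old one is
   released, so the aliasing case never reads freed memory. */
static int demo_copy_overlapped(demo_str *dst, const char *src) {
  size_t n;
  char *tmp;
  assert(dst != NULL && src != NULL);
  n = strlen(src);
  tmp = malloc(n + 1);
  if (tmp == NULL)
    return 0;
  memcpy(tmp, src, n + 1); /* copy first, including the NUL */
  free(dst->buf);          /* only now drop the old buffer */
  dst->buf = tmp;
  dst->len = n;
  return 1;
}
```

The non-overlapped variants skip the temporary, which is why their comments require that STR must not alias BLK's buffer.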


@@ -73,6 +73,7 @@ typedef struct strc_int2bytes2 strc_int2bytes2;
#endif #endif
#ifndef HTS_DEF_DEFSTRUCT_hts_log_type #ifndef HTS_DEF_DEFSTRUCT_hts_log_type
#define HTS_DEF_DEFSTRUCT_hts_log_type #define HTS_DEF_DEFSTRUCT_hts_log_type
/** Log severity levels, most to least severe. A message is emitted only if its /** Log severity levels, most to least severe. A message is emitted only if its
level is <= opt->debug. LOG_ERRNO is a flag OR'd into the level to append level is <= opt->debug. LOG_ERRNO is a flag OR'd into the level to append
": <strerror(errno)>" to the message. */ ": <strerror(errno)>" to the message. */
@@ -111,8 +112,10 @@ requires: htsdefines.h */
* CALLBACKARG_USERDEF(). Allocates a t_hts_callbackarg with hts_malloc (not * CALLBACKARG_USERDEF(). Allocates a t_hts_callbackarg with hts_malloc (not
* checked for OOM); it is freed by hts_free_opt(). * checked for OOM); it is freed by hts_free_opt().
*/ */
#define CHAIN_FUNCTION(OPT, MEMBER, FUNCTION, ARGUMENT) do { \ #define CHAIN_FUNCTION(OPT, MEMBER, FUNCTION, ARGUMENT) \
t_hts_callbackarg *carg = (t_hts_callbackarg*) hts_malloc(sizeof(t_hts_callbackarg)); \ do { \
t_hts_callbackarg *carg = \
(t_hts_callbackarg *) hts_malloc(sizeof(t_hts_callbackarg)); \
carg->userdef = (ARGUMENT); \ carg->userdef = (ARGUMENT); \
carg->prev.fun = (void *) (OPT)->callbacks_fun->MEMBER.fun; \ carg->prev.fun = (void *) (OPT)->callbacks_fun->MEMBER.fun; \
carg->prev.carg = (OPT)->callbacks_fun->MEMBER.carg; \ carg->prev.carg = (OPT)->callbacks_fun->MEMBER.carg; \
@@ -120,8 +123,10 @@ requires: htsdefines.h */
(OPT)->callbacks_fun->MEMBER.carg = carg; \ (OPT)->callbacks_fun->MEMBER.carg = carg; \
} while (0) } while (0)
/* The following helpers are useful only if you know that an existing callback migh be existing before before the call to CHAIN_FUNCTION() /* The following helpers are useful only if you know that an existing callback
If your functions were added just after hts_create_opt(), no need to make the previous function check */ migh be existing before before the call to CHAIN_FUNCTION() If your functions
were added just after hts_create_opt(), no need to make the previous function
check */
/** Inside a chained callback, return the ARGUMENT pointer originally passed to /** Inside a chained callback, return the ARGUMENT pointer originally passed to
CHAIN_FUNCTION(), or NULL when CARG is NULL. */ CHAIN_FUNCTION(), or NULL when CARG is NULL. */
@@ -129,11 +134,13 @@ If your functions were added just after hts_create_opt(), no need to make the pr
/** Return the callback of type NAME that this one chained over, cast to its /** Return the callback of type NAME that this one chained over, cast to its
function-pointer type, or NULL. Call it to forward to the prior handler. */ function-pointer type, or NULL. Call it to forward to the prior handler. */
#define CALLBACKARG_PREV_FUN(CARG, NAME) ( (t_hts_htmlcheck_ ##NAME) ( ( (CARG) != NULL ) ? (CARG)->prev.fun : NULL ) ) #define CALLBACKARG_PREV_FUN(CARG, NAME) \
((t_hts_htmlcheck_##NAME)(((CARG) != NULL) ? (CARG)->prev.fun : NULL))
/** Return the carg of the callback this one chained over (pass it when /** Return the carg of the callback this one chained over (pass it when
forwarding to the CALLBACKARG_PREV_FUN result), or NULL. */ forwarding to the CALLBACKARG_PREV_FUN result), or NULL. */
#define CALLBACKARG_PREV_CARG(CARG) ( ( (CARG) != NULL ) ? (CARG)->prev.carg : NULL ) #define CALLBACKARG_PREV_CARG(CARG) \
(((CARG) != NULL) ? (CARG)->prev.carg : NULL)
/* Functions */ /* Functions */
@@ -212,8 +219,8 @@ HTSEXT_API hts_boolean hts_log(httrackp *opt, const char *prefix,
/** printf-style log at level @p type (an hts_log_type, optionally |LOG_ERRNO). /** printf-style log at level @p type (an hts_log_type, optionally |LOG_ERRNO).
Forwards to the registered log callback, and when the level is <= opt->debug Forwards to the registered log callback, and when the level is <= opt->debug
also to opt->log. @p format must be non-NULL. */ also to opt->log. @p format must be non-NULL. */
HTSEXT_API void hts_log_print(httrackp * opt, int type, const char *format, HTSEXT_API void hts_log_print(httrackp *opt, int type, const char *format, ...)
...) HTS_PRINTF_FUN(3, 4); HTS_PRINTF_FUN(3, 4);
/** va_list form of hts_log_print(). @p opt may be NULL (only the callback /** va_list form of hts_log_print(). @p opt may be NULL (only the callback
runs). Preserves errno. @p format must be non-NULL. */ runs). Preserves errno. @p format must be non-NULL. */
@@ -255,7 +262,8 @@ HTSEXT_API int htswrap_add(httrackp * opt, const char *name, void *fct);
 or 0 if none or unknown. */
 HTSEXT_API uintptr_t htswrap_read(httrackp *opt, const char *name);
-/* Internal library allocators, if a different libc is being used by the client */
+/* Internal library allocators, if a different libc is being used by the client
+*/
 /** strdup() through the library allocator. Returns a heap copy freed with
 hts_free(), or NULL on failure. */
 HTSEXT_API char *hts_strdup(const char *string);
@@ -490,40 +498,50 @@ HTSEXT_API void unescape_amp(char *s);
 /** Percent-escape only spaces (' ' becomes "%20"); copy everything else
 * verbatim. */
-HTSEXT_API size_t escape_spc_url(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t escape_spc_url(const char *const src, char *const dest,
+const size_t size);
 /** Aggressively percent-escape @p src for use as a single URL path segment
 (reserved, delimiter, unwise, special, avoid and mark characters). */
-HTSEXT_API size_t escape_in_url(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t escape_in_url(const char *const src, char *const dest,
+const size_t size);
 /** Percent-escape @p src as a URI, escaping only what is necessary and keeping
 '/' and other reserved characters. */
-HTSEXT_API size_t escape_uri(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t escape_uri(const char *const src, char *const dest,
+const size_t size);
 /** Like escape_uri() for a UTF-8 URI: also escapes reserved characters other
 than '/'. */
-HTSEXT_API size_t escape_uri_utf(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t escape_uri_utf(const char *const src, char *const dest,
+const size_t size);
 /** Minimal "make safe" escape: percent-escapes only '"', ' ' and control
 characters, leaving an already-formed URL otherwise intact. */
-HTSEXT_API size_t escape_check_url(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t escape_check_url(const char *const src, char *const dest,
+const size_t size);
 /** Append-variant of escape_spc_url(): escapes @p src after the existing
 NUL-terminated content of @p dest. Returns the bytes appended (excluding the
 NUL). */
-HTSEXT_API size_t append_escape_spc_url(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t append_escape_spc_url(const char *const src, char *const dest,
+const size_t size);
 /** Append-variant of escape_in_url(). See append_escape_spc_url(). */
-HTSEXT_API size_t append_escape_in_url(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t append_escape_in_url(const char *const src, char *const dest,
+const size_t size);
 /** Append-variant of escape_uri(). See append_escape_spc_url(). */
-HTSEXT_API size_t append_escape_uri(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t append_escape_uri(const char *const src, char *const dest,
+const size_t size);
 /** Append-variant of escape_uri_utf(). See append_escape_spc_url(). */
-HTSEXT_API size_t append_escape_uri_utf(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t append_escape_uri_utf(const char *const src, char *const dest,
+const size_t size);
 /** Append-variant of escape_check_url(). See append_escape_spc_url(). */
-HTSEXT_API size_t append_escape_check_url(const char *const src, char *const dest, const size_t size);
+HTSEXT_API size_t append_escape_check_url(const char *const src,
+char *const dest, const size_t size);
 /** In-place variant of escape_spc_url(): escapes the NUL-terminated string in
 @p dest back into @p dest. */
@@ -543,32 +561,39 @@ HTSEXT_API size_t inplace_escape_check_url(char *const dest, const size_t size);
 /** Same escaping as escape_check_url() but returns @p dest instead of the byte
 count. */
-HTSEXT_API char *escape_check_url_addr(const char *const src, char *const dest, const size_t size);
+HTSEXT_API char *escape_check_url_addr(const char *const src, char *const dest,
+const size_t size);
 /** Build a MIME/MHTML content-id token in @p dest from @p adr and @p fil:
 escape_in_url() both, then replace every '%' with 'X' so the result is one
 opaque token. */
-HTSEXT_API size_t make_content_id(const char *const adr, const char *const fil, char *const dest, const size_t size);
+HTSEXT_API size_t make_content_id(const char *const adr, const char *const fil,
+char *const dest, const size_t size);
 /** Low-level percent-escaper backing the escape_* family. @p mode selects the
 character class to escape: 0 check_url, 1 in_url, 2 spc_url, 3 uri,
 30 uri_utf. @p max_size is the dest capacity including the NUL. */
-HTSEXT_API size_t x_escape_http(const char *const s, char *const dest, const size_t max_size, const int mode);
+HTSEXT_API size_t x_escape_http(const char *const s, char *const dest,
+const size_t max_size, const int mode);
 /** Strip all control characters (byte value < 32) from @p s in place. */
 HTSEXT_API void escape_remove_control(char *const s);
 /** HTML-escape for text output: rewrite '&' to "&amp;" and pass every other
 byte through unchanged. */
-HTSEXT_API size_t escape_for_html_print(const char *const s, char *const dest, const size_t size);
+HTSEXT_API size_t escape_for_html_print(const char *const s, char *const dest,
+const size_t size);
 /** Like escape_for_html_print() but also convert every high byte (>= 128) to a
 numeric entity "&#xNN;". */
-HTSEXT_API size_t escape_for_html_print_full(const char *const s, char *const dest, const size_t size);
+HTSEXT_API size_t escape_for_html_print_full(const char *const s,
+char *const dest,
+const size_t size);
 /** Percent-decode @p s into @p catbuff (capacity @p size) and return @p
 catbuff. Decodes every "%xx" hex escape. */
-HTSEXT_API char *unescape_http(char *const catbuff, const size_t size, const char *const s);
+HTSEXT_API char *unescape_http(char *const catbuff, const size_t size,
+const char *const s);
 /** Percent-decode @p s into @p catbuff, but only the escapes that are safe to
 decode while keeping a valid URI (reserved, delimiter, unwise, control and
@@ -589,8 +614,7 @@ HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
 HTS_MIMETYPE_SIZE capacity. */
 HTS_DEPRECATED("use get_httptype_sized(opt, s, ssize, fil, flag)")
-HTSEXT_API void get_httptype(httrackp * opt, char *s, const char *fil,
-int flag);
+HTSEXT_API void get_httptype(httrackp *opt, char *s, const char *fil, int flag);
 /** Classify @p fil by its extension: 0 unknown, 1 known non-HTML, 2 known HTML.
 Consults the built-in table then user --assume rules. 0 for a NULL @p fil.
@@ -633,11 +657,13 @@ HTSEXT_API void guess_httptype(httrackp * opt, char *s, const char *fil);
 time), not a pointer. */
 /** Concatenate @p a and @p b into @p catbuff (NULL or empty operands are
 * skipped). */
-HTSEXT_API char *concat(char *catbuff, size_t size, const char *a, const char *b);
+HTSEXT_API char *concat(char *catbuff, size_t size, const char *a,
+const char *b);
 /** Like concat(a, b) but convert '/' to the platform path separator (Windows).
 */
-HTSEXT_API char *fconcat(char *catbuff, size_t size, const char *a, const char *b);
+HTSEXT_API char *fconcat(char *catbuff, size_t size, const char *a,
+const char *b);
 /** Copy @p a into @p catbuff, converting '/' to the platform path separator
 (Windows). */
@@ -756,7 +782,8 @@ typedef struct utimbuf STRUCT_UTIMBUF;
 /** Macro aimed to break at build-time if a size is not a sizeof() strictly
 * greater than sizeof(char*). **/
 #undef COMPILE_TIME_CHECK_SIZE
-#define COMPILE_TIME_CHECK_SIZE(A) (void) ((void (*)(char[A - sizeof(char*) - 1])) NULL)
+#define COMPILE_TIME_CHECK_SIZE(A) \
+(void) ((void (*)(char[A - sizeof(char *) - 1])) NULL)
 /** Macro aimed to break at compile-time if a size is not a sizeof() strictly
 * greater than sizeof(char*). **/


@@ -1176,11 +1176,15 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
 if (element != NULL) {
 msgCode = element->statuscode;
 StringRoom(headers, 8192);
-sprintf(StringBuffRW(headers), "HTTP/1.1 %d %s\r\n"
+sprintf(StringBuffRW(headers),
+"HTTP/1.1 %d %s\r\n"
 #ifndef NO_WEBDAV
 "%s"
 #endif
-"Content-Type: %s%s%s%s\r\n" "%s%s%s" "%s%s%s" "%s%s%s",
+"Content-Type: %s%s%s%s\r\n"
+"%s%s%s"
+"%s%s%s"
+"%s%s%s",
 /* */
 msgCode, element->msg,
 #ifndef NO_WEBDAV
@@ -1188,16 +1192,18 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
 StringBuff(davHeaders),
 #endif
 /* Content-type: foo; [ charset=bar ] */
-element->contenttype,
+hts_effective_mime(element->contenttype),
 ((element->charset[0]) ? "; charset=\"" : ""),
 element->charset, ((element->charset[0]) ? "\"" : ""),
 /* location */
-((element->location != NULL
-&& element->location[0]) ? "Location: " : ""),
-((element->location != NULL
-&& element->location[0]) ? element->location : ""),
-((element->location != NULL
-&& element->location[0]) ? "\r\n" : ""),
+((element->location != NULL && element->location[0])
+? "Location: "
+: ""),
+((element->location != NULL && element->location[0])
+? element->location
+: ""),
+((element->location != NULL && element->location[0]) ? "\r\n"
+: ""),
 /* last-modified */
 ((element->lastmodified[0]) ? "Last-Modified: " : ""),
 ((element->lastmodified[0]) ? element->lastmodified : ""),
@@ -1205,8 +1211,7 @@ static void proxytrack_process_HTTP(PT_Indexes indexes, T_SOC soc_c) {
 /* etag */
 ((element->etag[0]) ? "ETag: " : ""),
 ((element->etag[0]) ? element->etag : ""),
-((element->etag[0]) ? "\r\n" : "")
-);
+((element->etag[0]) ? "\r\n" : ""));
 StringLength(headers) = (int) strlen(StringBuff(headers));
 } else {
 /* No query string, no ending / : check the the <url>/ page */


@@ -52,6 +52,7 @@ Please visit our Website: http://www.httrack.com
 #include "htscore.h"
 #include "htsback.h"
+#include "htslib.h" /* hts_effective_mime */
 #include "store.h"
 #include "proxystrings.h"
@@ -2289,10 +2290,17 @@ static int PT_SaveCache__Arc_Fun(void *arg, const char *url, PT_Element element)
 int size_headers;
 sprintf(st->headers,
-"HTTP/1.0 %d %s" "\r\n" "X-Server: ProxyTrack " PROXYTRACK_VERSION
-"\r\n" "Content-type: %s%s%s%s" "\r\n" "Last-modified: %s" "\r\n"
-"Content-length: %d" "\r\n", element->statuscode, element->msg,
-/**/ element->contenttype,
+"HTTP/1.0 %d %s"
+"\r\n"
+"X-Server: ProxyTrack " PROXYTRACK_VERSION "\r\n"
+"Content-type: %s%s%s%s"
+"\r\n"
+"Last-modified: %s"
+"\r\n"
+"Content-length: %d"
+"\r\n",
+element->statuscode, element->msg,
+/**/ hts_effective_mime(element->contenttype),
 (element->charset[0] ? "; charset=\"" : ""),
 (element->charset[0] ? element->charset : ""),
 (element->charset[0] ? "\"" : ""), /**/ element->lastmodified,
@@ -2328,10 +2336,10 @@ static int PT_SaveCache__Arc_Fun(void *arg, const char *url, PT_Element element)
 /* args */
 (link_has_authority(url) ? "" : "http://"), url, "0.0.0.0",
 tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, tm->tm_hour,
-tm->tm_min, tm->tm_sec, element->contenttype, element->statuscode,
-st->md5, (element->location ? element->location : "-"),
-(long int) ftell(fp), st->filename,
-(long int) (size_headers + element->size));
+tm->tm_min, tm->tm_sec, hts_effective_mime(element->contenttype),
+element->statuscode, st->md5,
+(element->location ? element->location : "-"), (long int) ftell(fp),
+st->filename, (long int) (size_headers + element->size));
 /* network_doc */
 if (fwrite(st->headers, 1, size_headers, fp) != size_headers
 || (element->size > 0


@@ -4,28 +4,33 @@
 # Initializes the htsserver GUI frontend and launch the default browser
 BROWSEREXE=
-SRCHBROWSEREXE="x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition"
+SRCHBROWSEREXE=(x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition)
+# shellcheck disable=SC2153 # BROWSER is the standard freedesktop env var, not a typo
 if test -n "${BROWSER}"; then
 # sensible-browser will f up if BROWSER is not set
-SRCHBROWSEREXE="xdg-open sensible-browser ${SRCHBROWSEREXE}"
+SRCHBROWSEREXE=(xdg-open sensible-browser "${SRCHBROWSEREXE[@]}")
 fi
 # Patch for Darwin/Mac by Ross Williams
-if test "`uname -s`" == "Darwin"; then
+if test "$(uname -s)" == "Darwin"; then
 # Darwin/Mac OS X uses a system 'open' command to find
 # the default browser. The -W flag causes it to wait for
 # the browser to exit
 BROWSEREXE="/usr/bin/open -W"
 fi
-BINWD=`dirname "$0"`
-SRCHPATH="$BINWD /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin ${HOME}/usr/bin ${HOME}/bin"
-SRCHPATH="$SRCHPATH "`echo $PATH | tr ":" " "`
-SRCHDISTPATH="$BINWD/../share $BINWD/.. /usr/share /usr/local /usr /local /usr/local/share ${HOME}/usr ${HOME}/usr/share /opt/local/share /sw ${HOME}/usr/local ${HOME}/usr/share"
+BINWD=$(dirname "$0")
+SRCHPATH=("$BINWD" /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin "${HOME}/usr/bin" "${HOME}/bin")
+IFS=':' read -ra pathdirs <<<"$PATH"
+for d in "${pathdirs[@]}"; do
+# drop empty PATH fields, matching the old echo|tr word-split
+test -n "$d" && SRCHPATH+=("$d")
+done
+SRCHDISTPATH=("$BINWD/../share" "$BINWD/.." /usr/share /usr/local /usr /local /usr/local/share "${HOME}/usr" "${HOME}/usr/share" /opt/local/share /sw "${HOME}/usr/local" "${HOME}/usr/share")
 ###
 # And now some famous cuisine
 function log {
-echo "$0($$): $@" >&2
+echo "$0($$): $*" >&2
 return 0
 }
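The PATH handling in the hunk above swaps an unquoted `echo $PATH | tr` word-split for an array, so that directories containing spaces survive intact. A minimal sketch of that technique follows, assuming bash (arrays and `read -a` are not POSIX); `split_path` and `dirs` are hypothetical names used only for illustration and are not part of webhttrack.

```shell
#!/usr/bin/env bash
# Split a colon-separated list into the global array "dirs",
# skipping empty fields (hypothetical helper, for illustration only).
split_path() {
  local fields d
  dirs=()
  # IFS is changed for this one "read" only; -a fills an array, -r keeps backslashes.
  IFS=':' read -ra fields <<<"$1"
  for d in "${fields[@]}"; do
    # "if" rather than "test && ..." so the function never ends on a failure status
    if test -n "$d"; then
      dirs+=("$d")
    fi
  done
}

# The empty field and the trailing colon are dropped; the entry with a space stays whole.
split_path "/usr/bin::/opt/my tools/bin:"
printf '%s\n' "${dirs[@]}"
```

Note that `read` stops at the first newline, so a PATH entry containing a newline would be cut short; this sketch does not try to handle that case.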
@@ -42,35 +47,35 @@ log "Browser (or helper) exited"
 # First ensure that we can launch the server
 BINPATH=
-for i in ${SRCHPATH}; do
-! test -n "${BINPATH}" && test -x ${i}/htsserver && BINPATH=${i}
+for i in "${SRCHPATH[@]}"; do
+! test -n "${BINPATH}" && test -x "${i}/htsserver" && BINPATH="${i}"
 done
-for i in ${SRCHDISTPATH}; do
+for i in "${SRCHDISTPATH[@]}"; do
 ! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack"
 done
 test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1
 test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1
-test -f ${DISTPATH}/lang.def || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
-test -f ${DISTPATH}/lang.indexes || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
-test -d ${DISTPATH}/lang || ! log "Could not find ${DISTPATH}/lang" || exit 1
-test -d ${DISTPATH}/html || ! log "Could not find ${DISTPATH}/html" || exit 1
+test -f "${DISTPATH}/lang.def" || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
+test -f "${DISTPATH}/lang.indexes" || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
+test -d "${DISTPATH}/lang" || ! log "Could not find ${DISTPATH}/lang" || exit 1
+test -d "${DISTPATH}/html" || ! log "Could not find ${DISTPATH}/html" || exit 1
 # Locale
 HTSLANG="${LC_MESSAGES}"
 ! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}"
 ! test -n "${HTSLANG}" && HTSLANG="${LANG}"
-HTSLANG="`echo $LANG | cut -f1 -d'.' | cut -f1 -d'_'`"
-LANGN=`grep -E "^${HTSLANG}:" ${DISTPATH}/lang.indexes | cut -f2 -d':'`
+HTSLANG="$(echo "$LANG" | cut -f1 -d'.' | cut -f1 -d'_')"
+LANGN=$(grep -E "^${HTSLANG}:" "${DISTPATH}/lang.indexes" | cut -f2 -d':')
 ! test -n "${LANGN}" && LANGN=1
 # Find the browser
 # note: not all systems have sensible-browser or www-browser alternative
 # thefeore, we have to find a bit more if sensible-browser could not be found
-for i in ${SRCHBROWSEREXE}; do
-for j in ${SRCHPATH}; do
-if test -x ${j}/${i}; then
-BROWSEREXE=${j}/${i}
+for i in "${SRCHBROWSEREXE[@]}"; do
+for j in "${SRCHPATH[@]}"; do
+if test -x "${j}/${i}"; then
+BROWSEREXE="${j}/${i}"
 fi
 test -n "$BROWSEREXE" && break
 done
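The browser lookup above walks candidate names in priority order and, for each name, every search directory, keeping the first executable hit. The sketch below shows the same nested search over quoted arrays, assuming bash. `find_first_exe`, `NAMES` and `DIRS` are hypothetical names, and the sketch returns from a function on the first match where the script sets `BROWSEREXE` and breaks out of the loop.

```shell
#!/usr/bin/env bash
# Print the first executable found, trying each name in NAMES (priority order)
# against each directory in DIRS. Hypothetical helper, for illustration only.
find_first_exe() {
  local name dir
  for name in "${NAMES[@]}"; do
    for dir in "${DIRS[@]}"; do
      # Quoting keeps a directory such as "first dir" as a single word.
      if test -x "${dir}/${name}"; then
        printf '%s\n' "${dir}/${name}"
        return 0
      fi
    done
  done
  return 1
}

# Demo in a throwaway directory: only the lower-priority name exists.
demo=$(mktemp -d)
mkdir -p "$demo/first dir" "$demo/second"
printf '#!/bin/sh\n' >"$demo/second/browser-b"
chmod +x "$demo/second/browser-b"
NAMES=(browser-a browser-b)
DIRS=("$demo/first dir" "$demo/second")
find_first_exe || echo "nothing found"
rm -rf "$demo"
```

With the unquoted `for j in ${SRCHPATH}` form, the directory named `first dir` would have been split into two bogus entries; the array form is what makes the quoting possible.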
@@ -81,7 +86,7 @@ test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1
 # "browse" command
 if test "$1" = "browse"; then
 if test -f "${HOME}/.httrack.ini"; then
-INDEXF=`cat ${HOME}/.httrack.ini | tr '\r' '\n' | grep -E "^path=" | cut -f2- -d'='`
+INDEXF=$(tr '\r' '\n' <"${HOME}/.httrack.ini" | grep -E "^path=" | cut -f2- -d'=')
 if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then
 INDEXF="${INDEXF}/index.html"
 else
@@ -96,39 +101,43 @@ exit $?
 fi
 # Create a temporary filename
-TMPSRVFILE="$(mktemp ${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX)" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
+TMPSRVFILE="$(mktemp "${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX")" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
 # Launch htsserver binary and setup the server
-(${BINPATH}/htsserver "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" $@; echo SRVURL=error) > ${TMPSRVFILE}&
+(
+"${BINPATH}/htsserver" "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" "$@"
+echo SRVURL=error
+) >"${TMPSRVFILE}" &
 # Find the generated SRVURL
 SRVURL=
 MAXCOUNT=60
 while ! test -n "$SRVURL"; do
-MAXCOUNT=$[$MAXCOUNT - 1]
+MAXCOUNT=$((MAXCOUNT - 1))
 test $MAXCOUNT -gt 0 || exit 1
 test $MAXCOUNT -lt 50 && echo "waiting for server to reply.."
-SRVURL=`grep -E URL= ${TMPSRVFILE} | cut -f2- -d=`
+SRVURL=$(grep -E URL= "${TMPSRVFILE}" | cut -f2- -d=)
 test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1
 test -n "$SRVURL" || sleep 1
 done
 # Cleanup function
+# shellcheck disable=SC2120 # $1 is an optional "signal caught" marker; bare calls are intentional
 function cleanup {
 test -n "$1" && log "Nasty signal caught, cleaning up.."
 # Do not kill if browser exited (chrome bug issue) ; server will die itself
-test -n "$1" && test -f ${TMPSRVFILE} && SRVPID=`grep -E PID= ${TMPSRVFILE} | cut -f2- -d=`
-test -n "${SRVPID}" && kill -9 ${SRVPID}
-test -f ${TMPSRVFILE} && rm ${TMPSRVFILE}
+test -n "$1" && test -f "${TMPSRVFILE}" && SRVPID=$(grep -E PID= "${TMPSRVFILE}" | cut -f2- -d=)
+test -n "${SRVPID}" && kill -9 "${SRVPID}"
+test -f "${TMPSRVFILE}" && rm "${TMPSRVFILE}"
 test -n "$1" && log "..Done"
 return 0
 }
 # Cleanup in case of emergency
-trap "cleanup now; exit" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
+trap "cleanup now; exit" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
 # Got SRVURL, launch browser
 launch_browser "${BROWSEREXE}" "${SRVURL}"
 # That's all, folks!
-trap "" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
+trap "" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
 cleanup
 exit 0
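The launcher waits for the server by polling a temporary file for a `KEY=value` line, counting attempts down with `$((...))` arithmetic instead of the obsolete `$[...]` form. The sketch below isolates that loop under stated assumptions: `wait_for_key` and its parameters are hypothetical, the key is anchored at the start of the line (the script greps an unanchored `URL=`), and the `error` sentinel check is left to the caller.

```shell
#!/usr/bin/env bash
# Poll FILE until a line "KEY=value" appears, then print the value.
# Gives up after TRIES attempts, sleeping DELAY seconds between them.
# Hypothetical helper, for illustration only.
wait_for_key() {
  local file=$1 key=$2 tries=$3 delay=${4:-1} value=
  while test -z "$value"; do
    tries=$((tries - 1)) # arithmetic expansion; no "$" needed on the name inside
    test "$tries" -ge 0 || return 1
    value=$(grep -E "^${key}=" "$file" 2>/dev/null | head -n 1 | cut -f2- -d=)
    test -n "$value" || sleep "$delay"
  done
  printf '%s\n' "$value"
}

# Demo: the line is already present, so the first attempt succeeds.
tmp=$(mktemp)
echo "SRVURL=http://localhost:8080/" >"$tmp"
wait_for_key "$tmp" SRVURL 5 0
rm -f "$tmp"
```

`cut -f2- -d=` keeps everything after the first `=`, so a value that itself contains `=` (a query string, for instance) is returned whole.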

tests/01_engine-dns.test (new file, normal mode, 15 lines)

@@ -0,0 +1,15 @@
#!/bin/bash
#
set -euo pipefail
# DNS resolver/cache self-test: a mock getaddrinfo (no network) checks address
# family, single-address selection, the -@i4/-@i6 family filter, and cache reuse.
# The trailing token is required, like the other -# selftests, so a bare command
# line isn't treated as "no arguments" and routed to the usage screen.
out=$(httrack -#D run)
test "$out" = "dns-selftest: OK" || {
echo "expected 'dns-selftest: OK', got: $out" >&2
exit 1
}
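The test above captures the self-test's standard output and requires an exact match, printing the mismatch to stderr before failing. A small sketch of that capture-and-compare pattern follows; `expect_output` is a hypothetical helper, and `echo` stands in for the real `httrack -#D run` invocation so the example runs without httrack installed.

```shell
#!/usr/bin/env bash
# Run a command and require its stdout to equal WANT exactly.
# Hypothetical helper, for illustration only.
expect_output() {
  local want=$1 got
  shift
  # "local" is declared separately so the status tested here is the command's own.
  got=$("$@") || {
    echo "command failed: $*" >&2
    return 1
  }
  test "$got" = "$want" || {
    echo "expected '$want', got: $got" >&2
    return 1
  }
}

# Stand-in command; a real test would pass the httrack command line instead.
expect_output "dns-selftest: OK" echo "dns-selftest: OK" && echo "match"
```

Under `set -euo pipefail`, as in the test file, a failing command inside a plain `out=$(...)` assignment already aborts the script, which is why the original needs no explicit status check.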


@@ -154,4 +154,173 @@ grep -Eq "style=\"background-image:url\('ibgs\.gif'\)\"" "$saved2" ||
grep -q 'title="file://' "$saved2" ||
! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
# xmlns / xmlns:prefix decls must not be crawled (#191). Local file:// targets so a
# regression downloads them; each is the LAST attr (heuristic only scans a value before '>').
site3="$tmp/xmlns"
mkdir -p "$site3"
for f in ns og rdfs real; do gif "$site3/$f.gif"; done
cat >"$site3/index.html" <<EOF
<html xmlns="file://$site3/ns.gif"><body>
<svg xmlns:og="file://$site3/og.gif"></svg>
<div class="c" xmlns:rdfs="file://$site3/rdfs.gif"></div>
<a href="file://$site3/real.gif">real link</a>
</body></html>
EOF
out3="$tmp/xmlns-out"
crawl "$site3/index.html" "$out3"
# the real link is still captured
found "real.gif" "$out3"
# namespace-declaration targets must not be fetched (default + prefixed forms)
notfound "ns.gif" "$out3"
notfound "og.gif" "$out3"
notfound "rdfs.gif" "$out3"
# CSS @import (#94): every form's target is captured, crawling the .css directly.
# The "cond"/"sup"/"spc" cases carry a trailing media/supports/layer condition (or
# a space before ';'); they are the negative controls: without the parser fix the
# URL is dropped, so a regression fails these found() checks.
site4="$tmp/cssimport"
mkdir -p "$site4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do printf 'body{}\n' >"$site4/$f.css"; done
cat >"$site4/main.css" <<'EOF'
@import url(nq.css);
@import url("dqu.css");
@import url('squ.css');
@import "dqs.css";
@import 'sqs.css';
@import url(med.css) screen and (min-width: 400px);
@import "cond.css" screen;
@import "sup.css" supports(display: flex);
@import url(lay.css) layer(base);
@import "spc.css" ;
EOF
out4="$tmp/cssimport-out"
crawl "$site4/main.css" "$out4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do found "$f.css" "$out4"; done
# Over-capture guard: the trailing condition is not part of the URL, so it must
# survive the rewrite verbatim. A regression that grabs it would mangle these.
m4=$(find "$out4" -type f -path '*/file/*' -name main.css -print -quit)
test -n "$m4" || ! echo "FAIL: saved main.css not found" || exit 1
for cond in '@import "cond.css" screen;' 'supports(display: flex)' 'layer(base)'; do
grep -Fq "$cond" "$m4" ||
! echo "FAIL #94: '$cond' altered on rewrite (condition captured as URL?)" || exit 1
done
# Malformed input: an unterminated @import quote (truncated CSS) must not crash or
# capture a bogus link; a valid sibling import is still captured. Guards a heap
# overflow on the URL-end scan that aborts under ASan (CI sanitizer job).
site5="$tmp/cssimport-trunc"
mkdir -p "$site5"
printf 'body{}\n' >"$site5/good.css"
printf '@import "good.css";\n@import "trunc' >"$site5/main.css"
out5="$tmp/cssimport-trunc-out"
crawl "$site5/main.css" "$out5"
found "good.css" "$out5"
notfound "trunc" "$out5"
# Offset-0 underflow (#396): a token at the buffer start makes the detector's
# word-boundary guard read *(html-1) one byte early (aborts under ASan). The
# url() target is still captured; here it just must not underflow.
site6="$tmp/parse-off0"
mkdir -p "$site6"
printf 'body{}\n' >"$site6/off0.css"
printf 'url(off0.css)\n' >"$site6/main.css"
out6="$tmp/parse-off0-out"
crawl "$site6/main.css" "$out6"
found "off0.css" "$out6"
# XMLHttpRequest.open(method, url) (#218): the first argument is an HTTP method,
# not a URL. Without the fix "GET" is captured as a link and fetched (the offline
# fixture saves a bare file named GET; a live server mangles it to GET.html).
# window.open(url) detection must be unaffected.
site7="$tmp/xhropen"
mkdir -p "$site7"
gif "$site7/winopen.gif"
cat >"$site7/index.html" <<EOF
<html><body><script>
var x = new XMLHttpRequest();
x.open("GET", "ajax_info.txt");
var y = new XMLHttpRequest();
y.open("Post", "submit.cgi");
window.open("file://$site7/winopen.gif");
</script></body></html>
EOF
out7="$tmp/xhropen-out"
crawl "$site7/index.html" "$out7"
# negative control: without the fix a file named exactly GET is downloaded
notfound "GET" "$out7"
# methods are matched case-insensitively (XHR spec normalizes them): a mixed-case
# method is rejected too, so a file named Post must not appear either
notfound "Post" "$out7"
# regression guard: window.open(url) is still detected, so its absolute URL is
# rewritten to a local link. The rewrite only happens if the parser saw it, so
# these two assertions fail if .open detection broke (not a trivial --near save).
saved7=$(savedhtml "$out7")
test -n "$saved7" || ! echo "FAIL: saved xhr page not found" || exit 1
grep -Fq 'window.open("winopen.gif")' "$saved7" ||
! echo "FAIL #218: window.open(url) no longer detected/rewritten" || exit 1
! grep -Fq 'window.open("file://' "$saved7" ||
! echo "FAIL #218: window.open URL left absolute (not rewritten)" || exit 1
# Parens in an unquoted url(...) (#163): the source %28/%29 decode to literal
# '(' ')' in the saved name, but a literal ')' in the rewritten url() closes the
# token early, so they must stay encoded. Negative control: without the fix the
# %281%29 greps fail (parens are RFC2396 "mark" chars the escaper leaves alone).
site8="$tmp/cssparens"
mkdir -p "$site8"
for f in 'img (1).gif' 'a(b)c(1).gif' 'q (4).gif'; do gif "$site8/$f"; done
cat >"$site8/style.css" <<'EOF'
.a { background: url(img%20%281%29.gif); }
.b { background: url(a%28b%29c%281%29.gif); }
.c { background: url("q%20%284%29.gif"); }
EOF
out8="$tmp/cssparens-out"
crawl "$site8/style.css" "$out8"
found "img (1).gif" "$out8"
found "a(b)c(1).gif" "$out8"
found "q (4).gif" "$out8"
css8=$(find "$out8" -type f -path '*/file/*' -name style.css -print -quit)
test -n "$css8" || ! echo "FAIL: saved style.css not found" || exit 1
grep -Fq 'url(img%20%281%29.gif)' "$css8" ||
! echo "FAIL #163: parens in unquoted url() not percent-encoded on rewrite" || exit 1
grep -Fq 'url(a%28b%29c%281%29.gif)' "$css8" ||
! echo "FAIL #163: not every paren in a url() was percent-encoded" || exit 1
grep -Fq 'url("q%20%284%29.gif")' "$css8" ||
! echo "FAIL #163: quoted url() altered or parens left literal on rewrite" || exit 1
# The url() detector is not CSS-specific: <script> and inline style= get the
# same encoding, but ordinary href/src (ending_p is the quote, not ')') keep
# literal parens -- the attribute checks guard the gate against over-firing.
site9="$tmp/urlparens"
mkdir -p "$site9"
for f in 'js (1).gif' 'inl (2).gif' 'asrc (3).gif' 'ahref (4).gif'; do gif "$site9/$f"; done
cat >"$site9/index.html" <<EOF
<html><body>
<script>var bg = "url(js%20%281%29.gif)";</script>
<div style="background-image:url(inl%20%282%29.gif)"></div>
<img src="asrc%20%283%29.gif">
<a href="ahref%20%284%29.gif">link</a>
</body></html>
EOF
out9="$tmp/urlparens-out"
crawl "$site9/index.html" "$out9"
saved9=$(savedhtml "$out9")
test -n "$saved9" || ! echo "FAIL: saved urlparens page not found" || exit 1
# rewrite-only: the JS-string asset is not queued for download
grep -Fq 'url(js%20%281%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in <script> url() not percent-encoded" || exit 1
found "inl (2).gif" "$out9"
grep -Fq 'url(inl%20%282%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in inline style url() not percent-encoded" || exit 1
found "asrc (3).gif" "$out9"
found "ahref (4).gif" "$out9"
grep -Fq 'src="asrc%20(3).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain src attribute were wrongly encoded" || exit 1
grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain href attribute were wrongly encoded" || exit 1
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
exit 0

tests/01_engine-relative.test Executable file

@@ -0,0 +1,68 @@
#!/bin/bash
#
# lienrelatif (build relative path) + ident_url_relatif (resolve a link, collapse
# ./ and ../). Regression net for #137/#162; expected values hand-computed.
set -euo pipefail
# relative path from <curr>'s directory to <link>
rel() {
local got
got=$(httrack -O /dev/null -#l "$1" "$2")
test "$got" == "relative=$3" ||
{
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
exit 1
}
}
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
ident() {
local got
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
test "$got" == "$4" ||
{
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"
exit 1
}
}
### lienrelatif
rel 'dir/page.html' 'dir/index.html' 'page.html'
rel 'dir/page.html' 'dir/page.html' 'page.html' # self-link
rel 'a.html' 'dir/index.html' '../a.html'
rel 'x.html' 'a/b/c/index.html' '../../../x.html'
rel 'h/a/x.jpg' 'h/a/sub/page.html' '../x.jpg'
rel 'a/b/c/x.html' 'index.html' 'a/b/c/x.html'
rel 'h/sub/x.jpg' 'h/page.html' 'sub/x.jpg'
rel 'h/dir2/x.jpg' 'h/dir1/page.html' '../dir2/x.jpg' # sibling dir
rel 'h/bc/x.jpg' 'h/b/page.html' '../bc/x.jpg' # b/bc prefix trap
rel 'h/b/x.jpg' 'h/bc/page.html' '../b/x.jpg'
rel 'h2/img/x.jpg' 'h1/p/page.html' '../../h2/img/x.jpg' # cross-host
rel 'img.cdn/photo.jpg' 'www.site/articles/2020/post.html' '../../../img.cdn/photo.jpg'
rel 'h/a/' 'h/a/sub/page.html' '../' # link is ancestor dir
rel 'x.html' 'page.html' 'x.html'
rel 'dir/page.html?x=1' 'dir/index.html?y=2' 'page.html' # ? stripped
### ident_url_relatif
ident 'img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/sub/img.gif'
ident '/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/img.gif'
# embedded ../ collapses (#137)
ident '../img.gif' 'www.foo.com' '/dir/sub/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/articles/2020/logo.png'
ident '../../pix/sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/pix/logo.png'
ident '../../../../x.gif' 'www.foo.com' '/a/b/page.html' 'adr=www.foo.com fil=/x.gif' # above-root clamp
ident '?page=2' 'www.foo.com' '/dir/index.html?old=1' 'adr=www.foo.com fil=/dir/index.html?page=2'
ident 'http://other.com/a/b/../c/index.html' 'www.foo.com' '/p.html' 'adr=other.com fil=/a/c/index.html'
# file:// collapses ../ like the other schemes; traversal contained, // authority kept
ident 'file:///var/data/pix/sub/../logo.png' 'www.foo.com' '/p.html' 'adr=file:// fil=/var/data/pix/logo.png'
ident 'file:///a/b/c/../../d/e.gif' 'www.foo.com' '/p.html' 'adr=file:// fil=/a/d/e.gif'
ident 'file:///a/../../b' 'www.foo.com' '/p.html' 'adr=file:// fil=/b'
ident 'file://srv/share/../x' 'www.foo.com' '/p.html' 'adr=file:// fil=//srv/x'
ident 'mailto:foo@bar.com' 'www.foo.com' '/p.html' 'error=-1' # unsupported scheme
ident 'javascript:void(0)' 'www.foo.com' '/p.html' 'error=-1'
echo "OK"
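
The `rel` cases above can be modelled in a few lines. This is a minimal Python sketch, not httrack's `lienrelatif`: `relative_link` is a hypothetical helper built on the standard `posixpath.relpath`, and it assumes only the behaviors the test lists (query strings are stripped, and a link to a directory keeps its trailing slash).

```python
import posixpath


def relative_link(link: str, curr: str) -> str:
    """Relative path from the directory of `curr` to `link` (illustrative only)."""
    # The test expects '?...' to be ignored on both sides.
    link = link.split("?", 1)[0]
    curr = curr.split("?", 1)[0]
    # An empty dirname means `curr` sits at the top level.
    base = posixpath.dirname(curr) or "."
    rel = posixpath.relpath(link, base)
    # relpath drops a trailing slash; restore it for directory links.
    if link.endswith("/") and not rel.endswith("/"):
        rel += "/"
    return rel
```

For example, `relative_link('h/bc/x.jpg', 'h/b/page.html')` yields `'../bc/x.jpg'`, the "b/bc prefix trap" case: the comparison works on whole path segments, so `b` is not mistaken for a prefix of `bc`.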


@@ -26,3 +26,17 @@ simp './a/../../b' 'b'
 # empty segments ('//') are not dot-segments and are preserved, per RFC 3986
 simp 'a//b' 'a//b'
+simp 'a//b/../c' 'a//c'
+# absolute paths keep the leading '/'; above-root '..' is clamped to it
+simp '/a/../b' '/b'
+simp '/a/../../b' '/b'
+simp '/../x' '/x'
+# collapses to nothing -> './' (relative) or '/' (absolute)
+simp '..' './'
+simp 'a/..' './'
+simp '/' '/'
+simp 'a/b/..' 'a/' # trailing bare '..'
+simp 'a/../b?x=../y' 'b?x=../y' # '?' freezes simplification
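
The expectations in this hunk describe a dot-segment collapse with three twists: empty segments survive, `..` above the root is clamped, and nothing after `?` is touched. The following is a minimal Python sketch of those rules as read from the test cases; `simplify` is a hypothetical name and this is not httrack's C implementation.

```python
def simplify(path: str) -> str:
    """Collapse './' and '../' segments (illustrative model of the test cases)."""
    # '?' freezes simplification: only the part before it is rewritten.
    head, sep, query = path.partition("?")
    absolute = head.startswith("/")
    segments = (head[1:] if absolute else head).split("/")

    stack = []
    trailing_dir = False  # True when the last segment seen was '.' or '..'
    for seg in segments:
        if seg == ".":
            trailing_dir = True
        elif seg == "..":
            if stack:  # above-root '..' is clamped: nothing left to pop
                stack.pop()
            trailing_dir = True
        else:
            stack.append(seg)  # empty segments ('//') are kept as-is
            trailing_dir = False

    body = "/".join(stack)
    if trailing_dir and body:
        body += "/"  # 'a/b/..' names the directory 'a/'
    if absolute:
        body = "/" + body
    elif not body:
        body = "./"  # a relative path that collapses to nothing
    return body + sep + query
```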


@@ -21,9 +21,15 @@ test "$out" == "strsafe: OK" || exit 1
 # the bounded macro aborts (non-zero exit), so don't let set -e trip on it
 err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
 case "$err" in
-*"strsafe: NOT aborted"*) echo "over-capacity write was NOT caught" >&2; exit 1 ;;
+*"strsafe: NOT aborted"*)
+echo "over-capacity write was NOT caught" >&2
+exit 1
+;;
 *"overflow while copying"*) ;;
-*) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;;
+*)
+echo "expected htssafe overflow abort, got: $err" >&2
+exit 1
+;;
 esac
 # Same guarantee for the htsbuff builder. The source is exactly the buffer
@@ -32,7 +38,13 @@ esac
 # aborted"). Match the specific htsbuff abort message, not just any assert.
 err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
 case "$err" in
-*"strsafe: NOT aborted"*) echo "htsbuff over-capacity write was NOT caught" >&2; exit 1 ;;
+*"strsafe: NOT aborted"*)
+echo "htsbuff over-capacity write was NOT caught" >&2
+exit 1
+;;
 *"htsbuff append overflow"*) ;;
-*) echo "expected htsbuff overflow abort, got: $err" >&2; exit 1 ;;
+*)
+echo "expected htsbuff overflow abort, got: $err" >&2
+exit 1
+;;
 esac

tests/13_local-cookies.test Executable file

@@ -0,0 +1,15 @@
#!/bin/bash
#
# Cookie chain against the local test server (replaces the old online
# ut/cookies/*.php fixtures). entrance.php sets cat/cake; second.php checks
# them and sets badger; third.php checks all three. A missing or wrong cookie
# returns 500, which would surface as an httrack error and a missing file, so a
# clean 3-files/0-errors run proves the cookie jar is replayed across links.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
--found 'cookies/entrance.html' \
--found 'cookies/second.html' \
--found 'cookies/third.html' \
httrack 'BASEURL/cookies/entrance.php'

tests/14_local-https.test Executable file

@@ -0,0 +1,18 @@
#!/bin/bash
#
# HTTPS crawl against the local test server, using the shipped self-signed
# cert. httrack does not verify certs (htslib.c: SSL_CTX_new with no
# SSL_CTX_set_verify), so the self-signed cert is accepted as-is and this
# exercises the real TLS path offline. basic.html links to link.html with four
# distinct query strings, each saved under a hashed name -> 5 files.
: "${top_srcdir:=..}"
if test "$HTTPS_SUPPORT" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
bash "$top_srcdir/tests/local-crawl.sh" --tls --errors 0 --files 5 \
--found 'simple/basic.html' \
httrack 'BASEURL/simple/basic.html'

tests/15_local-types.test Normal file

@@ -0,0 +1,25 @@
#!/bin/bash
#
# Content-Type vs URL-extension naming (issue #267 family) under the default
# delayed type check (-%N2). Policy: a MISSING Content-Type must not clobber a
# URL extension that maps to a specific non-HTML type (.png/.pdf stay as-is);
# an explicitly DECLARED type is trusted, so a binary-looking URL that really
# serves HTML (text/html on .pdf/.jpg) is named .html. The "wrong" names are
# asserted absent so a regression in either direction fails here.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/notype.png' --not-found 'types/notype.html' \
--found 'types/notype.pdf' --not-found 'types/notype.html' \
--found 'types/photo.png' \
--found 'types/doc.pdf' \
--found 'types/lie.html' --not-found 'types/lie.png' \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/page.htm' --not-found 'types/page.html' \
--found 'types/script.js' \
--found 'types/style.css' \
--found 'types/data.json' \
--found 'types/control.html' --not-found 'types/control.php' \
--found 'types/gend61c.png' --not-found 'types/gend61c.html' \
httrack 'BASEURL/types/index.html'
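
The naming policy stated in the header comment can be read as a small decision function. The sketch below is a toy model in Python, assuming only what the `--found`/`--not-found` pairs assert; `choose_extension` and `KNOWN_EXTENSIONS` are hypothetical names, the table is deliberately tiny, and httrack's real logic (including hashed names such as `gend61c.png`) is more involved.

```python
# Hypothetical table for illustration only; not httrack's MIME table.
KNOWN_EXTENSIONS = {
    "text/html": ("html", "htm"),
    "image/png": ("png",),
    "application/pdf": ("pdf",),
}


def choose_extension(url_ext, content_type):
    """Pick the saved extension from the URL extension and the declared type."""
    # Drop parameters such as '; charset=utf-8' and normalise case.
    declared = (content_type or "").split(";", 1)[0].strip().lower()
    # A missing or empty Content-Type must not clobber the URL extension.
    if not declared:
        return url_ext
    accepted = KNOWN_EXTENSIONS.get(declared)
    # Unknown type, or the URL extension already fits the declared type.
    if accepted is None or url_ext in accepted:
        return url_ext
    # A declared type is trusted over a mismatching URL extension.
    return accepted[0]
```

Under this model `choose_extension("pdf", "text/html")` gives `"html"` (the `report.pdf` case) while `choose_extension("png", None)` stays `"png"` (the `notype.png` case).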


@@ -0,0 +1,11 @@
#!/bin/bash
#
# --assume under the default delayed type check (-%N2), issue #56. A user type
# pinned with --assume must be honored immediately, not lost to the delayed
# name: photo.png served as image/png but assumed text/html is saved as .html.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/photo.html' --not-found 'types/photo.png' \
httrack 'BASEURL/types/photo.png' --assume png=text/html


@@ -0,0 +1,12 @@
#!/bin/bash
#
# An empty "Content-Type:" header value must be treated as "no usable type"
# (keep the URL extension), not parsed from an uninitialized buffer. The crawl
# also runs under ASan/UBSan in CI, which catches the uninitialized read this
# guards against.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/emptyct.png' --not-found 'types/emptyct.html' \
httrack 'BASEURL/types/index.html'


@@ -0,0 +1,15 @@
#!/bin/bash
#
# A second (update) pass must keep the names the first crawl chose. The stored
# Content-Type rides the cache, so the update reads back the same value -- the
# unknown/unknown sentinel for a typeless response, the declared type otherwise
# -- and names consistently: a declared-text/html .pdf stays .html and a
# typeless .png stays .png across the update rather than reverting.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/notype.png' --not-found 'types/notype.html' \
--found 'types/lie.html' \
httrack 'BASEURL/types/index.html'


@@ -3,6 +3,8 @@
 # silently drop it from the dist tarball and break "make distcheck".
 EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
 proxy-https-server.py \
+local-crawl.sh local-server.py server.crt server.key \
+server-root/simple/basic.html server-root/simple/link.html \
 fixtures/cache-golden/hts-cache/new.zip
 TESTS_ENVIRONMENT =
@@ -27,6 +29,7 @@ TESTS = \
 01_engine-cmdline.test \
 01_engine-cookies.test \
 01_engine-copyopt.test \
+01_engine-dns.test \
 01_engine-doitlog.test \
 01_engine-entities.test \
 01_engine-filter.test \
@@ -35,6 +38,7 @@ TESTS = \
 01_engine-mime.test \
 01_engine-parse.test \
 01_engine-rcfile.test \
+01_engine-relative.test \
 01_engine-simplify.test \
 01_engine-strsafe.test \
 02_manpage-regen.test \
@@ -46,6 +50,12 @@ TESTS = \
 11_crawl-longurl.test \
 11_crawl-parsing.test \
 12_crawl_https.test \
-13_crawl_proxy_https.test
+13_crawl_proxy_https.test \
+13_local-cookies.test \
+14_local-https.test \
+15_local-types.test \
+16_local-assume.test \
+17_local-empty-ct.test \
+18_local-update.test
 CLEANFILES = check-network_sh.cache


@@ -18,7 +18,7 @@ function debug {
 }
 function info {
-printf "[$*] ..\t" >&2
+printf '[%s] ..\t' "$*" >&2
 }
 function result {
@@ -66,31 +66,30 @@ function start-crawl {
 --debug)
 verbose=1
 ;;
---no-purge|--summary|--print-files)
-;;
+--no-purge | --summary | --print-files) ;;
 --errors | --files | --found | --not-found | --directory)
-pos=$[${pos}+1]
+pos=$((pos + 1))
 test "$#" -ge "$pos" || warning "missing argument" || return 1
 ;;
 httrack)
-pos=$[${pos}+1]
-break;
+pos=$((pos + 1))
+break
 ;;
 *)
 warning "unrecognized option ${!pos}"
 return 1
 ;;
 esac
-pos=$[${pos}+1]
+pos=$((pos + 1))
 done
-debug "remaining args: ${@:${pos}}"
+debug "remaining args: ${*:pos}"
 # ut/ won't exceed 2 minutes
-moreargs="--quiet --max-time=120 --timeout=30 --connection-per-second=5"
+moreargs=(--quiet --max-time=120 --timeout=30 --connection-per-second=5)
 # proxy environment ?
-if test -n "$http_proxy"; then
-moreargs="$moreargs --proxy $http_proxy"
+if test -n "${http_proxy:-}"; then
+moreargs+=(--proxy "$http_proxy")
 fi
 test -n "$tmpdir" || ! warning "no tmpdir" || return 1
@@ -104,9 +103,9 @@ function start-crawl {
 # start crawl
 log="${tmp}/log"
-debug starting httrack -O "${tmp}" ${moreargs} ${@:${pos}}
-info "running httrack ${@:${pos}}"
-httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" ${moreargs} ${@:${pos}} >"${log}" 2>&1 &
+debug starting httrack -O "${tmp}" "${moreargs[@]}" "${@:pos}"
+info "running httrack ${*:pos}"
+httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" "${moreargs[@]}" "${@:pos}" >"${log}" 2>&1 &
 crawlpid="$!"
 debug "started cralwer on pid $crawlpid"
 wait "$crawlpid"
@@ -164,12 +163,12 @@ function start-crawl {
 ;;
 --files)
 shift
-nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" \
-| sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
+nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" |
+sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
 assert_equals "checking files" "$1" "$nFiles"
 ;;
 httrack)
-break;
+break
 ;;
 esac
 shift
@@ -195,7 +194,7 @@ tmpdir=
 crawlpid=
 nopurge=
 verbose=
-trap "cleanup" 0 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
+trap cleanup EXIT HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
 # working directory
 tmpdir="${tmptopdir}/httrack_ut.$$"

tests/local-crawl.sh Executable file

@@ -0,0 +1,253 @@
#!/bin/bash
#
# Launcher for httrack crawl tests against the local Python test server.
#
# Starts tests/local-server.py on an ephemeral port, discovers the port from
# the server's stdout, then runs httrack against http(s)://127.0.0.1:$PORT and
# audits the mirror. The server is always killed and the tmpdir removed on exit.
#
# The token BASEURL in any httrack argument is replaced with the discovered
# http(s)://127.0.0.1:$PORT base. --found/--directory paths are relative to the
# discovered host root (127.0.0.1_<port>/), since the random port leaks into
# the mirror directory name.
#
# Usage:
# bash local-crawl.sh [--tls] [--root DIR] \
# --errors N --files N --found PATH ... --directory PATH ... \
# httrack BASEURL/some/path [httrack-args...]
set -u
testdir=$(cd "$(dirname "$0")" && pwd)
server="${testdir}/local-server.py"
root="${LOCAL_SERVER_ROOT:-${testdir}/server-root}"
cert="${testdir}/server.crt"
key="${testdir}/server.key"
tls=
verbose=
rerun=
tmpdir=
serverpid=
crawlpid=
function warning {
echo "** $*" >&2
return 0
}
function die {
warning "$*"
exit 1
}
function debug {
test -n "$verbose" && echo "$*" >&2
return 0
}
function info { printf "[%s] ..\t" "$*" >&2; }
function result { echo "$*" >&2; }
function cleanup {
if test -n "$crawlpid"; then
kill -9 "$crawlpid" 2>/dev/null
crawlpid=
fi
if test -n "$serverpid"; then
kill "$serverpid" 2>/dev/null
# Reap it so the port is released before we rm the tmpdir/log.
wait "$serverpid" 2>/dev/null
serverpid=
fi
if test -n "$tmpdir" && test -d "$tmpdir"; then
test -n "$nopurge" || rm -rf "$tmpdir"
fi
}
function assert_equals {
info "$1"
if test ! "$2" == "$3"; then
result "expected '$2', got '$3'"
exit 1
fi
result "OK ($2)"
}
nopurge=
trap cleanup EXIT HUP INT QUIT PIPE TERM
# python3 is required; mirror check-network.sh's skip-with-77 convention.
command -v python3 >/dev/null || ! echo "python3 not found; skipping local crawl tests" || exit 77
tmptopdir=${TMPDIR:-/tmp}
test -d "$tmptopdir" || mkdir -p "$tmptopdir" || die "no temporary directory; set TMPDIR"
tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create tmpdir"
# --- parse leading control flags --------------------------------------------
declare -a audit=()
scheme=http
pos=0
args=("$@")
nargs=$#
while test "$pos" -lt "$nargs"; do
case "${args[$pos]}" in
--debug) verbose=1 ;;
--rerun) rerun=1 ;; # run httrack a second time (update pass) before auditing
--no-purge)
nopurge=1
audit+=("--no-purge")
;;
--tls)
tls=1
scheme=https
;;
--root)
pos=$((pos + 1))
root="${args[$pos]}"
;;
--errors | --files)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
--found | --not-found | --directory)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
httrack)
pos=$((pos + 1))
break
;;
*) die "unrecognized option ${args[$pos]}" ;;
esac
pos=$((pos + 1))
done
# --- start the server --------------------------------------------------------
test -r "$server" || die "cannot read $server"
serverlog="${tmpdir}/server.log"
serverargs=(--root "$root")
if test -n "$tls"; then
serverargs+=(--tls --cert "$cert" --key "$key")
fi
debug "starting python3 $server ${serverargs[*]}"
python3 "$server" "${serverargs[@]}" >"$serverlog" 2>&1 &
serverpid=$!
# Wait for the "PORT <n>" line (server prints it once bound).
port=
for _ in $(seq 1 50); do
if test -s "$serverlog"; then
line=$(head -n1 "$serverlog")
if test "${line%% *}" == "PORT"; then
port="${line#PORT }"
break
fi
fi
kill -0 "$serverpid" 2>/dev/null || die "server exited early: $(cat "$serverlog")"
sleep 0.1
done
test -n "$port" || die "could not discover server port: $(cat "$serverlog")"
debug "server listening on ${scheme}://127.0.0.1:${port}"
baseurl="${scheme}://127.0.0.1:${port}"
# --- substitute BASEURL in the remaining (httrack) args ----------------------
declare -a hts=()
while test "$pos" -lt "$nargs"; do
hts+=("${args[$pos]//BASEURL/$baseurl}")
pos=$((pos + 1))
done
# --- run httrack -------------------------------------------------------------
which httrack >/dev/null || die "could not find httrack"
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || die "could not run httrack"
out="${tmpdir}/crawl"
mkdir "$out" || die "could not create $out"
# Localhost is fast; disable the rate/bandwidth safety limits but keep a
# max-time backstop so a hang cannot wedge the suite.
declare -a moreargs=(--quiet --max-time=120 --timeout=30 --disable-security-limits --robots=0)
log="${tmpdir}/log"
info "running httrack ${hts[*]}"
httrack -O "$out" --user-agent="httrack $ver local ($(uname -omrs))" "${moreargs[@]}" "${hts[@]}" >"$log" 2>&1 &
crawlpid=$!
wait "$crawlpid"
crawlres=$?
crawlpid=
# httrack exits 0 even on hard connect/DNS errors, so this is a backstop only;
# the real guard is the audit below (--errors 0 plus the host-root existence check).
test "$crawlres" -eq 0 || ! result "httrack exited $crawlres" || {
cat "$log" >&2
exit 1
}
result "OK"
grep -iE "^[0-9:]*[[:space:]]Error:" "${out}/hts-log.txt" >&2
# --- optional second pass: re-mirror into the same dir (cache/update path) ----
if test -n "$rerun"; then
info "re-running httrack (update pass)"
httrack -O "$out" --user-agent="httrack $ver local ($(uname -omrs))" \
"${moreargs[@]}" "${hts[@]}" >"${log}.2" 2>&1 &
crawlpid=$!
wait "$crawlpid"
crawlres=$?
crawlpid=
test "$crawlres" -eq 0 || ! result "update pass exited $crawlres" || {
cat "${log}.2" >&2
exit 1
}
result "OK (update)"
fi
# --- discover the single host root (127.0.0.1_<port> or 127.0.0.1) -----------
hostroot=
for cand in "${out}/127.0.0.1_${port}" "${out}/127.0.0.1"; do
if test -d "$cand"; then
hostroot="$cand"
break
fi
done
test -n "$hostroot" || die "could not find host root under $out"
debug "host root: $hostroot"
# --- audit -------------------------------------------------------------------
i=0
while test "$i" -lt "${#audit[@]}"; do
case "${audit[$i]}" in
--errors)
i=$((i + 1))
assert_equals "checking errors" "${audit[$i]}" \
"$(grep -iEc "^[0-9:]*[[:space:]]Error:" "${out}/hts-log.txt")"
;;
--files)
i=$((i + 1))
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${out}/hts-log.txt" |
sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "${audit[$i]}" "$nFiles"
;;
--found)
i=$((i + 1))
info "checking for ${audit[$i]}"
if test -f "${hostroot}/${audit[$i]}"; then result "OK"; else
result "not found"
exit 1
fi
;;
--not-found)
i=$((i + 1))
info "checking absence of ${audit[$i]}"
if test ! -f "${hostroot}/${audit[$i]}"; then result "OK"; else
result "present"
exit 1
fi
;;
--directory)
i=$((i + 1))
info "checking for dir ${audit[$i]}"
if test -d "${hostroot}/${audit[$i]}"; then result "OK"; else
result "not found"
exit 1
fi
;;
esac
i=$((i + 1))
done
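
The launcher's handshake with the server reduces to two string operations: recognising the `PORT <n>` announcement and substituting the `BASEURL` token. The following Python sketch restates them outside the shell so they can be checked in isolation; `parse_port_line` and `substitute_baseurl` are hypothetical helpers, not part of the test suite.

```python
def parse_port_line(line):
    """Return the port from a 'PORT <n>' announcement, or None if absent."""
    key, _, value = line.strip().partition(" ")
    if key != "PORT" or not value.isdigit():
        return None  # e.g. a traceback printed before the server bound
    return int(value)


def substitute_baseurl(args, scheme, port):
    """Replace the BASEURL token in every httrack argument."""
    base = f"{scheme}://127.0.0.1:{port}"
    return [arg.replace("BASEURL", base) for arg in args]
```

Treating any other first line as "not ready" mirrors the shell loop, which keeps polling until the line starts with `PORT` or the server process has exited.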

tests/local-server.py Executable file

@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""Self-contained local web server for httrack's crawl tests.

Serves static fixtures from a docroot plus a handful of dynamic endpoints
(cookies, ...) so httrack can be exercised over loopback, deterministically and
offline, instead of crawling the live ut.httrack.com.

Binds to an ephemeral port (port 0) and prints the chosen port to stdout as
"PORT <n>\n" so a launcher can discover it. Pass --tls to wrap the socket with
the shipped self-signed test cert; httrack does not verify certs, so no CA
trust plumbing is needed.

stdlib only (http.server + ssl) -- no new build or runtime dependency.
"""

import argparse
import os
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote, unquote, urlsplit

# Cookie chain replicated from the old ut/cookies/*.php fixtures.
COOKIE_PATH = "/cookies/"
COOKIES = {
    "cat": "dog",
    "cake": "is a lie!",
    "badger": "mushroom, with 'ants'",
}

PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
\t<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
\t<title>Sample test</title>
</head>
<body>
{body}
</body>
</html>
"""


class Handler(SimpleHTTPRequestHandler):
    # Quieter logging; the launcher captures httrack's own log anyway.
    def log_message(self, fmt, *args):
        if os.environ.get("LOCAL_SERVER_VERBOSE"):
            super().log_message(fmt, *args)

    # --- helpers -----------------------------------------------------------

    def request_cookies(self):
        """Parse the Cookie header into {name: decoded-value}.

        Mirrors PHP's $_COOKIE: values are url-decoded, matching the encoding
        applied when the cookie was set (see set_cookie)."""
        jar = {}
        raw = self.headers.get("Cookie", "")
        for pair in raw.split(";"):
            pair = pair.strip()
            if "=" in pair:
                name, value = pair.split("=", 1)
                jar[name.strip()] = unquote(value.strip())
        return jar

    def set_cookie(self, name, value):
        """Queue a Set-Cookie header, url-encoding the value like PHP's
        setcookie() so spaces/quotes/commas stay a single token that httrack
        can store and replay verbatim."""
        self._set_cookies.append(f"{name}={quote(value)}; Path={COOKIE_PATH}")

    def send_html(self, body, status=200, extra_status=None):
        encoded = PAGE.format(body=body).encode("utf-8")
        self.send_response(status, extra_status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(encoded)))
        for cookie in self._set_cookies:
            self.send_header("Set-Cookie", cookie)
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(encoded)

    def fail_cookie(self, what):
        # The old PHPs answered 500 with the reason in the status line.
        self.send_html("", status=500, extra_status=f"The {what} is missing or invalid")

    # --- dynamic routes ----------------------------------------------------

    def route_entrance(self):
        self.set_cookie("cat", COOKIES["cat"])
        self.set_cookie("cake", COOKIES["cake"])
        self.send_html('\tThis is a <a href="second.php">link</a>')

    def route_second(self):
        jar = self.request_cookies()
        if jar.get("cat") != COOKIES["cat"]:
            return self.fail_cookie("cat")
        if jar.get("cake") != COOKIES["cake"]:
            return self.fail_cookie("cake")
        self.set_cookie("badger", COOKIES["badger"])
        self.send_html('\tThis is a <a href="third.php">link</a>')

    def route_third(self):
        jar = self.request_cookies()
        if jar.get("cat") != COOKIES["cat"]:
            return self.fail_cookie("cat")
        if jar.get("cake") != COOKIES["cake"]:
            return self.fail_cookie("cake")
        if jar.get("badger") != COOKIES["badger"]:
            return self.fail_cookie("badger")
        self.send_html("\tThis is a test.")

    def route_robots(self):
        body = b"User-agent: *\nDisallow:\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(body)

    # --- type/extension matrix (issue #267 family) -------------------------

    def send_raw(self, body, content_type):
        """Send a raw body with an explicit Content-Type, or none at all when
        content_type is None (to observe httrack's typeless-file naming)."""
        self.send_response(200)
        if content_type is not None:
            self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(body)

    # Fake-binary blobs for the image/pdf/typeless cases.
    FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64
    FAKE_PDF = b"%PDF-1.4\n" + b"\x00" * 64

    # path -> (body, content_type); None sends no header, "" sends an empty
    # Content-Type value (no usable type, must be treated like None).
    TYPE_MATRIX = {
        "/types/control.php": (b"<html><body>control</body></html>", "text/html"),
        "/types/photo.png": (FAKE_PNG, "image/png"),
        "/types/doc.pdf": (FAKE_PDF, "application/pdf"),
        "/types/notype.png": (FAKE_PNG, None),
        "/types/notype.pdf": (FAKE_PDF, None),
        "/types/emptyct.png": (FAKE_PNG, ""),
        "/types/lie.png": (FAKE_PNG, "text/html"),
        "/types/report.pdf": (b"<html><body>real page</body></html>", "text/html"),
        "/types/page.htm": (b"<html><body>htm page</body></html>", "text/html"),
        "/types/script.js": (b"var x = 1;\n", "application/javascript"),
        "/types/style.css": (b"body { color: red; }\n", "text/css"),
        "/types/data.json": (b'{"k": "v"}\n', "application/json"),
        "/types/gen.php": (FAKE_PNG, "image/png"),
    }

    def route_types_index(self):
        body = (
            '\t<a href="control.php">control</a>\n'
            '\t<img src="photo.png" />\n'
            '\t<a href="doc.pdf">doc</a>\n'
            '\t<img src="notype.png" />\n'
            '\t<a href="notype.pdf">notypepdf</a>\n'
            '\t<img src="emptyct.png" />\n'
            '\t<img src="lie.png" />\n'
            '\t<a href="report.pdf">report</a>\n'
            '\t<a href="page.htm">htm</a>\n'
            '\t<script src="script.js"></script>\n'
            '\t<link rel="stylesheet" href="style.css" />\n'
            '\t<a href="data.json">json</a>\n'
            '\t<img src="gen.php?id=5" />\n'
        )
        self.send_html(body)

    def route_types(self):
        path = urlsplit(self.path).path
        body, ctype = self.TYPE_MATRIX[path]
        self.send_raw(body, ctype)

    ROUTES = {
        "/cookies/entrance.php": route_entrance,
        "/cookies/second.php": route_second,
        "/cookies/third.php": route_third,
        "/robots.txt": route_robots,
        "/types/index.html": route_types_index,
        "/types/control.php": route_types,
        "/types/photo.png": route_types,
        "/types/doc.pdf": route_types,
        "/types/notype.png": route_types,
        "/types/notype.pdf": route_types,
        "/types/emptyct.png": route_types,
        "/types/lie.png": route_types,
        "/types/report.pdf": route_types,
        "/types/page.htm": route_types,
        "/types/script.js": route_types,
        "/types/style.css": route_types,
        "/types/data.json": route_types,
        "/types/gen.php": route_types,
    }

    # --- dispatch ----------------------------------------------------------

    def dispatch(self):
        self._set_cookies = []
        path = urlsplit(self.path).path
        handler = self.ROUTES.get(path)
        if handler is not None:
            handler(self)
            return True
        return False

    def do_GET(self):
        if not self.dispatch():
            super().do_GET()

    def do_HEAD(self):
        if not self.dispatch():
            super().do_HEAD()


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--root", required=True, help="docroot for static files")
    parser.add_argument("--bind", default="127.0.0.1", help="bind address")
    parser.add_argument("--tls", action="store_true", help="serve HTTPS")
    parser.add_argument("--cert", help="TLS certificate (PEM)")
    parser.add_argument("--key", help="TLS private key (PEM)")
    args = parser.parse_args()

    root = os.path.abspath(args.root)

    def factory(*a, **kw):
        return Handler(*a, directory=root, **kw)

    httpd = ThreadingHTTPServer((args.bind, 0), factory)
    if args.tls:
        import ssl

        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.load_cert_chain(certfile=args.cert, keyfile=args.key)
        httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)

    port = httpd.socket.getsockname()[1]
    # The launcher reads this line to discover the ephemeral port.
    print(f"PORT {port}", flush=True)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    main()
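
The cookie chain depends on values surviving a percent-encode/decode round trip, which matters most for the `badger` value with its comma, spaces and quotes. The sketch below isolates that round trip using the same standard-library calls as the handler (`urllib.parse.quote` and `unquote`); `encode_cookie` and `parse_cookie_header` are hypothetical stand-alone versions of `set_cookie` and `request_cookies`, written without the HTTP handler around them.

```python
from urllib.parse import quote, unquote


def encode_cookie(name, value, path="/cookies/"):
    """Build a Set-Cookie value with the cookie value percent-encoded."""
    return f"{name}={quote(value)}; Path={path}"


def parse_cookie_header(raw):
    """Parse a Cookie request header into {name: decoded-value}."""
    jar = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if "=" in pair:
            name, value = pair.split("=", 1)
            jar[name.strip()] = unquote(value.strip())
    return jar
```

Because the encoded value contains no `;`, `,` or space, splitting the header on `;` cannot cut a value in half, which is why the server encodes before sending rather than quoting the raw string.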


@@ -3,11 +3,11 @@
 error=0
 for i in *.test; do
-if bash $i ; then
+if bash "$i"; then
 echo "$i: passed" >&2
 else
 echo "$i: ERROR" >&2
-error=$[${error}+1]
+error=$((error + 1))
 fi
 done


@@ -0,0 +1,18 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Sample test</title>
</head>
<body>
This is a <a href="link.html?v=1">link</a>
This is a <a href='link.html?v=2'>link</a>
This is a <a href="./link.html?v=3">link</a>
This is a <a href=link.html?v=4>link</a>
</body>


@@ -0,0 +1,3 @@
This is a link.
Go back to <a href="basic.html">home</a>.

tests/server.crt Normal file

@@ -0,0 +1,21 @@
-----BEGIN CERTIFICATE-----
MIIDbzCCAlegAwIBAgIUdWkDDomnY3WW95UqJ+UOASuR/i0wDQYJKoZIhvcNAQEL
BQAwODESMBAGA1UEAwwJMTI3LjAuMC4xMSIwIAYDVQQKDBlIVFRyYWNrIGxvY2Fs
IHRlc3Qgc2VydmVyMCAXDTI2MDYxNTE0NDQxMFoYDzIwNTYwNjA3MTQ0NDEwWjA4
MRIwEAYDVQQDDAkxMjcuMC4wLjExIjAgBgNVBAoMGUhUVHJhY2sgbG9jYWwgdGVz
dCBzZXJ2ZXIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDx78mogNhT
noWwRa51NeGtapQ1PfTYLlIMUzuloFXOsR1/ozRkFucqHNftF22wf0gg4VQJSBSf
3rwj79vsnt3nyaD03bTAafpHXkd+IJxQowiG8TfOJF0R/Qg9g7DCE66R9agQpMJC
SGxIin9p/4ld4Hn6869d4hNq4fHxNf/qkj2cnf8DYxrldz2FGsi6yMed4tzz2Am4
ZbPgwep+fy843ZdYrVIms9vJluNa9E+6Vpw9FwdjzQ/IBBMLvGaC2pDkc95YelaE
nQrAlTO/0l5vjc8XuTQFlo3DbUg+WEld/pxvCqsd/q1mqjL0WbxtXl2zCwGzAoJx
rjVEPfA8QSbtAgMBAAGjbzBtMB0GA1UdDgQWBBTHE0KKW8REV4HxajzVsIBxz3iL
9zAfBgNVHSMEGDAWgBTHE0KKW8REV4HxajzVsIBxz3iL9zAPBgNVHRMBAf8EBTAD
AQH/MBoGA1UdEQQTMBGHBH8AAAGCCWxvY2FsaG9zdDANBgkqhkiG9w0BAQsFAAOC
AQEAYlTEftrwGJBXuPmtxhmtw2HO/VTC4TGnq67hH5H+ptwgZJuuxCQ5KW6flTyp
FTyMhha33WD4EBL3wqqJsWr9Y4BXqi4G0lRqXBcC1oIUa2VYIDMER7kaY1qTSqE8
ARpwdB2BhvngAzDLc+4Jt4jQMRGr8fHAwxpDBoIZ1knbyzYNP73Bajse6/8YtxUu
nB2BsldjZnLvyHvRxUpWp92OyQih4jYSrlN6olDFlKDg7++kMhkHtJQW9a1t54VN
0ZXrB1ZRuHUUvGBq26x71riTWor7HNOSQaGeCMQjZNQkh5tfshNygUGSZVXTEwhG
xSrOL7NqBt2+EkVwf7LjGzjmBw==
-----END CERTIFICATE-----

28
tests/server.key Normal file

@@ -0,0 +1,28 @@
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQDx78mogNhTnoWw
Ra51NeGtapQ1PfTYLlIMUzuloFXOsR1/ozRkFucqHNftF22wf0gg4VQJSBSf3rwj
79vsnt3nyaD03bTAafpHXkd+IJxQowiG8TfOJF0R/Qg9g7DCE66R9agQpMJCSGxI
in9p/4ld4Hn6869d4hNq4fHxNf/qkj2cnf8DYxrldz2FGsi6yMed4tzz2Am4ZbPg
wep+fy843ZdYrVIms9vJluNa9E+6Vpw9FwdjzQ/IBBMLvGaC2pDkc95YelaEnQrA
lTO/0l5vjc8XuTQFlo3DbUg+WEld/pxvCqsd/q1mqjL0WbxtXl2zCwGzAoJxrjVE
PfA8QSbtAgMBAAECggEACgNK4klq1T3IpKdNoBY5yoE7CbUQZBNkBpSPRxHgBezj
SVFfgrZGnOySrIJSt4JHtuynG2Hl+0ku74HRep/ck+eOsh5W3mZvGvMLnGxhwR3u
Or99osTIgU0VQTkpC0SLQ16FCnih0uJycNIikdLR7uuya1tt1OyIBzK7XlNGIywT
p85zJc7/6TfTC9eM7lqh7JGR7KplBxSvgZL1pUr7y4rNpKms6uzOvPND79CcKnbU
BBA9Tu4qdOkoOljsZKkvh3pihxyG9X6d8QTZ/uX3pkvliwSFBc+Sz9EootA3/4r5
gVWpQ2t/AY7fY4hqzLIX/HivVaPj3cWk1G+SHm0XNQKBgQD5I9rijqFvV/p6FmUl
FbnjJFFHHgZLivlGxAC5vOyJNQQaqdeDzg7yMotNmQTggVGjT6sjdosQb3n+ctuk
EhQnZSU5VkNKv1+PTR35WrRkaECCaqz3Pv79pV9GVcX3it7UuYjNiOeSPqINWe+X
49JwnJFz+qQ1BchAwOis4zkENwKBgQD4mShDaYLOO97VpgZj4cGxHHWyEK9CRQvp
I7HxRmfaWS3JHwb88lOmALEU6pAj5cYJPAznv8BnUWcVHalZbkQ1JWYtUJRqj6OI
Ym7rw/nm4Ay5ijbdEism173dSk3IjOe+PdAlxzsOuVzYdBTqElmeQWtBzhY9aHvX
r+A02C2j+wKBgHHDo6Gsi57yR5gUPd9vSlCkNtEIrss0DJv5yHMIB+KnaNZcE+NF
5qFF30Jxyz5RDtxJ9tXcvaeln8lG3XDQKI/MqfDCqTuqo5ImHrfMaW8oA70JxS2p
gHqGVzkg1aMxsIrmpcdk6olnPExocvWivGdbtzeEjhMALu8Sp6y6nUCFAoGBAK5h
KLgYw/OMVaQCIMthaa+l6f0s7PMMYe1453H6VBD6qz4/8HPwO7LfG1gzrUYxADgs
ElVh0UHn/On383nS+i9Ze5Hfyyvwc+LQQURKJPrJQMPJavCptPE7NmiKnYNHK6vr
yh0l4oxShAklbCJBGvICq4zuVfVfXDeQnDIVTfaPAoGBAMCrZqYdOUhUu+aUqxZq
qO/TTQxrxftU63jGUg+o042TdgI4KWLn07wvHJ8/E2OqF35eXenvcuKbNLI1l72J
4cp+3cUv8iAXThTRYEztr5CS/wta4o4CNN8zfjn5dV9AI4Hmt4V7EaGWpBcViGbj
n0Mhag+dO8DHuenqi1yfMrAt
-----END PRIVATE KEY-----

152
tools/mk-sbuild-chroot.sh Executable file

@@ -0,0 +1,152 @@
#!/usr/bin/env bash
#
# Bootstrap an sbuild chroot for the clean-room build gate (mkdeb.sh --sbuild).
#
# Uses the rootless unshare backend: no root, no schroot daemon. It builds a
# minimal buildd chroot tarball into ~/.cache/sbuild/<dist>-<arch>.tar.zst, where
# sbuild --dist=<dist> finds it automatically in unshare mode.
#
# Usage:
#   tools/mk-sbuild-chroot.sh [options]
#
# Options:
#   -d, --dist DIST        suite to bootstrap (default: unstable)
#   -a, --arch ARCH        architecture (default: dpkg --print-architecture)
#   -m, --mirror URL       apt mirror (default: http://deb.debian.org/debian)
#       --components LIST  comma-separated components (default: main)
#   -f, --force            rebuild even if the tarball already exists
#       --write-sbuildrc   add "$chroot_mode = 'unshare';" to ~/.sbuildrc if absent
#   -h, --help             show this help
#
# One-time setup; refresh later with sbuild-update or by rerunning with --force.
# Requires mmdebstrap and the uidmap tools (newuidmap) for the unshare backend.
set -euo pipefail

readonly PROGNAME=${0##*/}

die() {
    printf '%s: error: %s\n' "$PROGNAME" "$*" >&2
    exit 1
}

info() {
    printf '==> %s\n' "$*" >&2
}

usage() {
    sed -n '2,/^set -euo/{/^set -euo/!p}' "$0" | sed 's/^# \{0,1\}//'
}

need() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || die "required tool not found: $tool"
    done
}

main() {
    local dist=unstable
    local arch=""
    local mirror=http://deb.debian.org/debian
    local components=main
    local force=0
    local write_sbuildrc=0

    while [[ $# -gt 0 ]]; do
        case $1 in
        -d | --dist)
            [[ $# -ge 2 ]] || die "missing argument for $1"
            dist=$2
            shift 2
            ;;
        -a | --arch)
            [[ $# -ge 2 ]] || die "missing argument for $1"
            arch=$2
            shift 2
            ;;
        -m | --mirror)
            [[ $# -ge 2 ]] || die "missing argument for $1"
            mirror=$2
            shift 2
            ;;
        --components)
            [[ $# -ge 2 ]] || die "missing argument for $1"
            components=$2
            shift 2
            ;;
        -f | --force)
            force=1
            shift
            ;;
        --write-sbuildrc)
            write_sbuildrc=1
            shift
            ;;
        -h | --help)
            usage
            exit 0
            ;;
        *)
            die "unknown option: $1 (try --help)"
            ;;
        esac
    done

    need mmdebstrap dpkg
    # Unshare needs the setuid uid/gid mappers; mmdebstrap fails cryptically without.
    command -v newuidmap >/dev/null 2>&1 ||
        die "newuidmap not found; install the uidmap package for the unshare backend"

    # Unshare maps a whole UID range, not just the caller's: the base install
    # creates system users, and without an /etc/subuid+subgid range the install
    # crashes (dpkg SIGSEGV) instead of erroring cleanly. Root uses mode=root and
    # needs no range.
    if [[ $(id -u) -ne 0 ]]; then
        local me
        me=$(id -un)
        if ! grep -qs "^$me:" /etc/subuid || ! grep -qs "^$me:" /etc/subgid; then
            # Suggest a range starting past every allocation in either file.
            local start
            start=$(awk -F: '{e = $2 + $3; if (e > m) m = e} END {print (m ? m : 100000)}' \
                /etc/subuid /etc/subgid 2>/dev/null)
            die "no /etc/subuid+subgid range for $me; the unshare backend needs one:
sudo usermod --add-subuids $start-$((start + 65535)) --add-subgids $start-$((start + 65535)) $me"
        fi
    fi

    : "${arch:=$(dpkg --print-architecture)}"
    local cache=$HOME/.cache/sbuild
    local tarball=$cache/${dist}-${arch}.tar.zst

    if [[ -e $tarball && $force -eq 0 ]]; then
        info "chroot already exists: $tarball (use --force to rebuild)"
    else
        info "bootstrapping $dist/$arch chroot into $tarball"
        mkdir -p "$cache"
        mmdebstrap --variant=buildd --arch="$arch" --components="$components" \
            "$dist" "$tarball" "$mirror"
        info "chroot ready: $tarball"
    fi

    local rc=$HOME/.sbuildrc
    local mode_line="\$chroot_mode = 'unshare';"
    # shellcheck disable=SC2016 # $chroot_mode is literal regex text, not a shell var.
    if grep -qsE '^[[:space:]]*\$chroot_mode[[:space:]]*=.*unshare' "$rc"; then
        : # already configured (active, non-commented line)
    elif [[ $write_sbuildrc -eq 1 ]]; then
        info "enabling the unshare backend in $rc"
        printf '%s\n' "$mode_line" >>"$rc"
    else
        cat >&2 <<EOF
==> To use this chroot without passing --chroot-mode each time, add to $rc:
$mode_line
(or rerun with --write-sbuildrc). Then verify with:
sbuild --dist=$dist path/to/package.dsc
and build the release gate with:
tools/mkdeb.sh --source-only --sbuild
EOF
    fi
}

main "$@"
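
The subordinate-ID check in the script suggests a range that starts past every existing allocation in /etc/subuid and /etc/subgid. A minimal sketch of that awk computation on its own, assuming a hypothetical `next_subid_start` wrapper and sample files standing in for the real system files:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not in the repository) around the awk one-liner used
# by the script. Each line of a subuid/subgid file is "name:start:count", so
# start + count is the first free ID after that allocation. Print the highest
# such value across all given files, or 100000 when nothing is allocated yet.
next_subid_start() {
    awk -F: '{e = $2 + $3; if (e > m) m = e} END {print (m ? m : 100000)}' \
        "$@" 2>/dev/null
}

# Sample files instead of /etc/subuid and /etc/subgid (assumed contents).
demo=$(mktemp -d)
printf 'alice:100000:65536\n' >"$demo/subuid"
printf 'alice:100000:65536\nbob:165536:65536\n' >"$demo/subgid"
: >"$demo/empty"

start=$(next_subid_start "$demo/subuid" "$demo/subgid")
echo "suggested range: $start-$((start + 65535))"   # 231072-296607
next_subid_start "$demo/empty"                      # 100000
rm -rf -- "$demo"
```

Scanning both files together matters because the script passes the same start to `--add-subuids` and `--add-subgids`, so the range has to be free in both.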


@@ -20,11 +20,27 @@
 # Options:
 #   -k, --key KEYID             GPG key for signing (default: $DEBSIGN_KEYID)
 #   -o, --outdir DIR            output directory (default: <repo>/dist)
+#       --orig FILE             reuse this upstream orig tarball instead of
+#                               regenerating it (required for a Debian revision
+#                               >= 2, whose orig is frozen in the archive)
 #   -s, --source-only           build only the source package
 #   -u, --unsigned              do not sign anything (implies no release sigs)
 #       --no-release-artifacts  skip the orig tarball .asc/.md5/.sha1
+#       --sbuild                additionally build the .dsc in a clean sbuild
+#                               chroot as a from-scratch verification gate
 #   -h, --help                  show this help
 #
+# --sbuild reproduces the buildd environment: it builds the source package in a
+# minimal chroot holding only the declared Build-Depends, so an FTBFS or a
+# missing dependency fails here instead of on the archive's buildds (which, with
+# a source-only upload, are otherwise the first clean build). It needs an sbuild
+# chroot for the changelog's distribution; create one once with the companion
+# tools/mk-sbuild-chroot.sh (rootless unshare backend).
+#
+# The Debian revision in debian/changelog decides the orig: revision 1 builds a
+# fresh upstream tarball; revision >= 2 must reuse the orig frozen at revision 1
+# (the .dsc references it by checksum), so pass it with --orig.
+#
 # SOURCE_DATE_EPOCH is honored for reproducible output.
 set -euo pipefail
@@ -57,9 +73,11 @@ need() {
 main() {
     local key=${DEBSIGN_KEYID:-}
     local outdir=""
+    local orig_in=""
     local source_only=0
     local unsigned=0
     local release_artifacts=1
+    local sbuild=0
     while [[ $# -gt 0 ]]; do
         case $1 in
@@ -73,6 +91,11 @@ main() {
             outdir=$2
             shift 2
             ;;
+        --orig)
+            [[ $# -ge 2 ]] || die "missing argument for $1"
+            orig_in=$2
+            shift 2
+            ;;
         -s | --source-only)
             source_only=1
             shift
@@ -85,6 +108,10 @@ main() {
             release_artifacts=0
             shift
             ;;
+        --sbuild)
+            sbuild=1
+            shift
+            ;;
         -h | --help)
             usage
             exit 0
@@ -95,7 +122,8 @@ main() {
         esac
     done
-    need git autoreconf debuild dcmd
+    need git autoreconf debuild dcmd dpkg-parsechangelog
+    [[ $sbuild -eq 1 ]] && need sbuild
     if [[ $unsigned -eq 0 ]]; then
         need gpg
         [[ -n $key ]] || die "no signing key (pass --key or set DEBSIGN_KEYID, or use --unsigned)"
@@ -107,6 +135,11 @@ main() {
     mkdir -p "$outdir"
     outdir=$(cd "$outdir" && pwd)
+    if [[ -n $orig_in ]]; then
+        [[ -r $orig_in ]] || die "--orig file not readable: $orig_in"
+        orig_in=$(cd "$(dirname "$orig_in")" && pwd)/$(basename "$orig_in")
+    fi
     scratch=$(mktemp -d "${TMPDIR:-/tmp}/httrack-mkdeb.XXXXXX")
     trap 'rm -rf -- "$scratch"' EXIT
@@ -118,10 +151,31 @@ main() {
     git -C "$repo/src/coucal" archive --format=tar --prefix=src/coucal/ HEAD |
         tar -x -C "$export_dir"
-    # Refresh build system and man page, then build the tarball. We build here
-    # only because regen-man needs the compiled binaries; the test suite is not
-    # run in this pass. debuild (below) runs the full suite once, with the online
-    # tests enabled, so a check here would just be a slower, offline-only repeat.
+    # Upstream version and Debian revision drive the orig: revision 1 builds a
+    # fresh tarball, revision >= 2 reuses the one frozen at -1 (the .dsc pins it
+    # by checksum, so a regenerated orig with new mtimes would be rejected).
+    local fullver ver rev
+    fullver=$(cd "$export_dir" && dpkg-parsechangelog -S Version)
+    ver=${fullver%-*}
+    rev=${fullver##*-}
+    local orig=httrack_${ver}.orig.tar.gz
+    info "version $ver (Debian revision $rev)"
+    # A signed build is upload-bound, so a revision >= 2 must reuse the frozen
+    # orig (--orig); an unsigned build is a throwaway (CI, local) and may
+    # regenerate it, since it can never reach the archive.
+    if [[ -z $orig_in && $rev != 1 && $unsigned -eq 0 ]]; then
+        die "Debian revision $rev needs --orig FILE (the orig is frozen from revision 1)"
+    fi
+    if [[ -n $orig_in ]]; then
+        info "reusing upstream tarball $orig_in"
+        cp -- "$orig_in" "$scratch/$orig"
+    else
+        # Refresh build system and man page, then build the tarball. We build
+        # here only because regen-man needs the compiled binaries; the test
+        # suite is not run in this pass. debuild (below) runs the full suite
+        # once, online tests enabled, so a check here would just repeat it.
         info "regenerating build system and man page"
         (
             cd "$export_dir"
@@ -129,34 +183,33 @@ main() {
             ./configure --quiet
             make -s -j"$(nproc)"
             make -s -C man regen-man
-        # Build the tarball from a clean tree so no object files leak into it.
+            # Build the tarball from a clean tree so no object files leak in.
             make -s clean
             make -s dist
         )
-    local tarball ver
         local -a tarballs
         shopt -s nullglob
         tarballs=("$export_dir"/httrack-*.tar.gz)
         shopt -u nullglob
         [[ ${#tarballs[@]} -ge 1 ]] || die "make dist produced no tarball"
-    tarball=${tarballs[0]##*/}
-    ver=${tarball#httrack-}
-    ver=${ver%.tar.gz}
-    info "version $ver"
+        local tarball=${tarballs[0]##*/}
+        [[ $tarball == "httrack-$ver.tar.gz" ]] ||
+            die "changelog version $ver disagrees with built tarball $tarball (configure.ac mismatch?)"
+        cp -- "$export_dir/$tarball" "$scratch/$orig"
+    fi
     # 3.0 (quilt): orig tarball is upstream-only; debian/ is overlaid on top.
-    local orig=httrack_${ver}.orig.tar.gz
-    cp -- "$export_dir/$tarball" "$scratch/$orig"
     (
         cd "$scratch"
         tar -xf "$orig"
+        [[ -d httrack-$ver ]] || die "orig tarball does not unpack to httrack-$ver/"
         cp -a "$export_dir/debian" "httrack-$ver/debian"
     )
-    # Build (debuild also runs lintian and signs). --fail-on aborts on a lintian
-    # error or warning, so neither a release nor CI produces an unclean package.
-    local -a debuild_opts=(--lintian-opts -I -i "--fail-on=error,warning")
+    # Build and sign. debuild runs lintian too but does NOT propagate its exit
+    # status, so a broken package would pass unnoticed; disable it here and run
+    # lintian ourselves below as the real gate.
+    local -a debuild_opts=(--no-lintian)
     local -a build_opts=()
     [[ $source_only -eq 1 ]] && build_opts+=(-S)
     if [[ $unsigned -eq 1 ]]; then
@@ -167,7 +220,8 @@ main() {
     info "building packages with debuild"
     (
         cd "$scratch/httrack-$ver"
-        debuild "${build_opts[@]}" "${debuild_opts[@]}"
+        # debuild options (--no-lintian) must precede the dpkg-buildpackage ones
+        debuild "${debuild_opts[@]}" "${build_opts[@]}"
     )
     # Collect every file the .changes references (orig, dsc, debs, ddebs, buildinfo).
@@ -177,11 +231,49 @@ main() {
     changes=("$scratch"/*.changes)
     shopt -u nullglob
     [[ ${#changes[@]} -ge 1 ]] || die "debuild produced no .changes file"
+    # The real lintian gate (debuild only reports, it does not fail on tags).
+    # --profile debian: CI runners are Ubuntu, whose vendor data would wrongly
+    # reject the Debian "unstable" distribution. newer-standards-version only
+    # means the local lintian is older than the buildds', not a package
+    # defect, so suppress it. set -e turns any error/warning tag into a failure.
+    info "running lintian gate (--fail-on=error,warning)"
+    lintian --profile debian -I -i --fail-on=error,warning \
+        --suppress-tags newer-standards-version "${changes[@]}"
     dcmd cp -- "${changes[@]}" "$outdir/"
+    # Clean-room build gate: rebuild the source package in a minimal chroot that
+    # holds only the declared Build-Depends, the same way the buildds will. An
+    # undeclared dependency or any FTBFS aborts the release here instead of
+    # surfacing after a source-only upload. Logs and clean-built debs land in
+    # $outdir/sbuild for inspection.
+    if [[ $sbuild -eq 1 ]]; then
+        local -a dscs
+        shopt -s nullglob
+        dscs=("$scratch"/*.dsc)
+        shopt -u nullglob
+        [[ ${#dscs[@]} -ge 1 ]] || die "no .dsc to sbuild"
+        local dist
+        dist=$(cd "$scratch/httrack-$ver" && dpkg-parsechangelog -S Distribution)
+        [[ $dist == UNRELEASED ]] && dist=unstable
+        info "clean-room build with sbuild (dist $dist)"
+        local sbdir=$outdir/sbuild
+        rm -rf -- "$sbdir"
+        mkdir -p "$sbdir"
+        (cd "$sbdir" && sbuild --dist="$dist" -- "${dscs[0]}")
+        info "sbuild clean-room build passed; logs in $sbdir"
+    fi
     # Release artifacts for the upstream tarball (detached sig + checksums).
+    # A Debian revision >= 2 .changes omits the orig (it is already in the
+    # archive), so dcmd above won't have copied it; place it from the build tree
+    # so the website artifacts are produced regardless of the revision.
     if [[ $release_artifacts -eq 1 && $unsigned -eq 0 ]]; then
         info "signing upstream tarball"
+        cp -- "$scratch/$orig" "$outdir/$orig"
         (
             cd "$outdir"
             gpg --armor --detach-sign --yes -u "$key" -- "$orig"