Compare commits

...

13 Commits

Author SHA1 Message Date
Xavier Roche
f578bede0c ci: add a no-python3 job reproducing the Debian buildd chroot
GitHub runners ship python3, so every make-check job exercised the
python3-present path and the local-server tests never skipped. The Debian
buildds build in a minimal chroot with no python3, where those tests must
SKIP (exit 77) -- and 28_local-pause failed there instead, FTBFS on every
arch for 3.49.10-1, invisible to CI.

Add a job that builds, removes python3, and runs make check, so the skip
path is exercised on every PR. Verified to fail on the pre-fix tree and pass
after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-28 20:02:47 +02:00
Xavier Roche
45279d7357 tests: skip 28_local-pause when python3 is absent (Debian buildd)
The local-server tests all skip with exit 77 when python3 is missing, but
28_local-pause runs local-crawl.sh inside a command-substitution with output
redirected to /dev/null, swallowing that exit-77 skip signal. On a host with
no python3 (every Debian buildd) both crawls then run serverless, finish in
0s, and the test reports FAIL instead of SKIP, turning 3.49.10-1 red on all
architectures. Add the same python3 guard the other tests get via
local-crawl.sh, up front, so it skips cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-28 19:49:47 +02:00
Xavier Roche
cca83e5f4a Modernize HTML entity decoding to the WHATWG named character references (#444)
* Modernize HTML entity decoding to the WHATWG named character references

Regenerate htsentities.h from the WHATWG entities.json (2032 single-codepoint
names) instead of the 1998 HTML 4.0 set (252 names). The dispatch hash moves
from a 32-bit LCG to 64-bit FNV-1a; the generator now aborts on any (hash,len)
collision, so the hash-only switch stays correct without a runtime name compare.
Bump the consumer name-length cap from 10 to 31, the longest name
(CounterClockwiseContourIntegral), or long names would be rejected outright.
Multi-codepoint references (~93 obscure math entities) can't fit the
single-codepoint return and are skipped, left verbatim as before.

Also fix the dead ftp://ftp.unicode.org URLs in htsbasiccharsets.sh.

Closes #443

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* entities: harden the generator collision guard and widen test coverage

Review follow-up. The switch keys on the hash alone, so check hash-alone
uniqueness among emitted names (a same-hash/different-len pair would otherwise
slip the old (hash,len) check and surface only as a cryptic duplicate-case
compile error). Also hash the ~93 skipped multi-codepoint names and abort if any
aliases an emitted hash, so "skipped means verbatim" is enforced rather than
assumed on future regens.

Add a runtime sweep of common HTML4 names (copy/reg/trade/mdash/ndash/alpha/beta)
to 01_engine-entities.test: a regression guard against accidental drops and a
generator-vs-consumer hash cross-check on names beyond the handful already
probed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:29:03 +02:00
Xavier Roche
97f398e508 Release 3.49.10 (#442)
Bump the package version to 3.49.10 and curate the release notes.

VERSION_INFO goes 3:1:0 -> 3:2:0: the cycle only appended tail fields to the
installed options struct (--cookies-file, --pause, --strip-query, the -%u
split), no existing symbol or offset changed, so the soname stays .so.3.

history.txt gets the 3.49-10 block; debian/changelog gets 3.49.10-1 with the
Debian-specific items (DEP-5 copyright, chromium-first browser dep, minizip
embedded-library override). Standards-Version 4.7.0 -> 4.7.4: the intervening
Policy changes (usr-merge, /usr/games, Priority recommendation, -dev linker
scripts, non-free-firmware) need no package change.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 14:26:50 +02:00
Xavier Roche
a62f93a107 Strip the #fragment from a redirect Location before fetching (#204) (#441)
* Strip the #fragment from a redirect Location before fetching (#204)

A 302/30x Location is dereferenced, not displayed, so its #fragment is a
client-side anchor that must be dropped before the target is requested.
httrack kept it: the redirect followers copied r.location verbatim, so the
re-request carried `GET /page.html#frag` (strict servers answer 400) and the
mirror was saved under a fragment-polluted name. HTML links were already
stripped at parse time; only the two Location followers were not.

Drop the fragment in a small helper called at both follow sites, covering
the live and cached-redirect paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* test(#204): strict-server guard so a leaked fragment is a wire-level failure

The first cut of 29_local-redirect-fragment only checked the saved filename.
Python's urlsplit() drops the fragment before routing, so a `#` leaked into
the GET line still routed to the target and the crawl passed: the assertion
was a proxy, not the wire behavior the fix targets. Make the server strict
(400 on any `#` in the request-target, like the real servers in #204), so a
leaked fragment now logs an error and the target is never saved. Neutering the
fix makes the test fail with the exact "400 Bad Request" from the issue.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 13:52:21 +02:00
Xavier Roche
799ec88dc7 filters: fix escaped brackets inside *[...] character classes (#440)
* filters: decode escaped chars correctly inside *[...] classes

The escape branch in strjoker probed joker[i+2] instead of the current
char, so a backslash escape only worked as the first class member:
'*[\[\]]' (documented as "the [ or ] character") matched only ']', and
'*[a,\[]' dropped the 'a'. The loop also treated any ']' as the class
terminator, so an escaped ']' could never be a member.

Decode the escape first in the loop body: a backslash takes the next char
as the literal member (only that char, not also the backslash the old code
added), and an escaped ']' is consumed before the terminator check. So
'*[\[\]]' now matches both brackets, and escape precedes the range/size
checks ('\-' '\,' '\<' become literal members). The self-test previously
pinned the buggy output as expected; it now asserts the documented
behavior and fails against the old matcher.

Closes #148

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* filters: fix a 1-byte over-read on a truncated range *[a-

The *[...] class parser's range arm does i += 3 unconditionally, so a
pattern ending in a dangling '-' (e.g. *[a-) read one byte past the NUL:
joker[i+2] is the NUL, i jumps to len+1, and the separator skip and loop
guard then read joker[len+1]. Guard the range arm on joker[i+2] != '\0'
so a truncated range falls through to the literal-member path instead of
overshooting.

The filter self-test now copies the pattern and string into exact-size
heap buffers so a sanitizer traps such over-reads; the pattern previously
came straight from argv (no redzone), which is why this stayed invisible.
A *[a- test case exercises it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:56:11 +02:00
Xavier Roche
71af4a24f0 lang: add a translation guide and fix the English colon spacing (#439)
Document the lang/*.txt format and contribution flow in lang/README.md
(linked from CONTRIBUTING.md); the project had no written instructions for
translators. Also drop the stray space before the colon in the English
"Filters (refuse/accept links):" label so it matches the other labels. Only
the English.txt value is changed, not the msgid key, so existing
translations still resolve.

Closes #74
Closes #75

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:45:07 +02:00
Xavier Roche
e17f4f12a0 Add --pause to space out file downloads by a random delay (#185) (#438)
A new --pause MIN[:MAX] (seconds, -%G) waits a random MIN..MAX between
files so a crawl looks less like a bot and is gentler on the server; a
single value is a fixed delay. Disabled by default.

It reuses the existing non-blocking launch gate
(back_pluggable_sockets_strict): rather than Sleep() -- which would freeze
the single select() pump and stall the other in-flight transfers -- the
gate just withholds new launches until the delay elapses, one file per
gap. The per-gap target is derived from the last-request timestamp so it
stays stable across the many gate evaluations within a gap yet rerolls on
each launch; sampling rand() per evaluation would instead bias the
realized delay toward MIN.

Two int fields appended at the httrackp tail (ABI-stable, no soname bump).
Covered by a pure-function self-test (range + spread, with teeth against
the min-bias bug) and a local-server crawl that asserts the pause slows a
multi-file mirror.

Closes #185

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:34:56 +02:00
Xavier Roche
5be8ba4bbd Add --cookies-file to preload a Netscape cookies.txt (#215) (#437)
Mirroring a site behind a login meant either re-implementing the auth
flow or dropping a file literally named cookies.txt into the output or
working directory, the only two places the engine looked. This adds a
CLI option to point at an arbitrary Netscape/Mozilla cookies.txt, so a
session exported from a browser (the "Get cookies.txt" extensions write
exactly this format) is replayed on the crawl and authenticated pages
come down.

The plumbing already existed: cookie_load parses the format into the
shared jar and the request path sends every matching cookie. The new
opt->cookies_file is loaded last, after the mirror/CWD defaults, so a
user-supplied value wins on a name/domain/path conflict. The field is
appended at the tail of httrackp, so the exported ABI is unchanged.

Cookies key on host[:port], so a bare-domain file matches a normal crawl
of a default-port site; only an explicit-port URL needs the port in the
cookie domain. Covered by 27_local-cookies-file.test: a gated page that
500s without a cookie no page ever sets, reachable only once the file
preloads it (with -o0 so the absence of a 500 error page is meaningful),
plus a no-cookie control. The local-crawl harness grows a --cookie helper
that writes a port-scoped jar. The copyopt self-test also gains a String
round-trip so the exported copy_htsopt path for the new field is covered.

Closes #215

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 22:57:05 +02:00
Xavier Roche
247a46068e debian: lead webhttrack browser dep with chromium, not firefox-esr (#436)
webhttrack's "firefox-esr | chromium | www-browser" made the httrack
source inherit firefox-esr's autoremoval clock: britney keys off the
first alternative of a disjunction, so every firefox-esr RC bug
(currently #1127569) dragged httrack toward removal even though
webhttrack stays installable via the other alternatives.

Lead with chromium instead. lintian requires a real package as the
first alternative (virtual-package-depends-without-real-package-depends),
so the virtual www-browser cannot go first; chromium is real, keeps the
dep lintian-clean, and makes britney track chromium rather than the
RC-bug-prone firefox-esr.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 20:44:30 +02:00
Xavier Roche
669947cd23 Split -%u URL Hacks into independent www/slash/query toggles (#271) (#435)
-%u (--urlhack) bundled three dedup normalizations under one switch:
www.host == host, redundant // collapse, and query-argument reordering.
A mirror that needed one but not another (e.g. keep www. distinct) had to
turn the whole umbrella off. Add three opt-out sub-options, defaulting to
the umbrella so existing -%u/-%u0 behavior is unchanged:

  --keep-www-prefix      keep www.foo.com distinct from foo.com   (-%j)
  --keep-double-slashes  keep redundant // in the path            (-%o)
  --keep-query-order     keep query-argument order significant    (-%y)

The split is resolved once in hash_init() into norm_host/norm_slash/
norm_query and threaded through the dedup hash (htshash.c), the savename
lookup key (htsname.c) and the redirect-loop diagnostic (htsparse.c) so
all three stay consistent. fil_normalized() gains an internal
fil_normalized_ex(do_slash, do_query) core; the public
fil_normalized()/fil_normalized_filtered() keep their signatures.

Normalization (slash/query) now follows urlhack and its sub-flags
uniformly, while --strip-query stays orthogonal. So with urlhack off,
strip-query strips keys without sorting the remainder; the url_savename
urlhack-off branch is moved to the same do_slash=0/do_query=0 normalizer
the hash uses, so a URL is always looked up under the key it was stored
with (a self-lookup mismatch this otherwise introduced).

http/https are always merged in the dedup key (the scheme is stripped
regardless of -%u), so that part of the request needs no toggle.

The opt-outs are spelled positively (--keep-*) because httrack's generic
--no<opt> prefix only appends the disabling "0" for parametered options,
not "single" booleans, so --nowww-dedup would silently no-op.

opt grows three hts_boolean fields appended at the struct tail (offsets
stable, no soname bump, matching the strip_query addition in #112).

Tested by a -#test=urlhack engine self-test (hash_url_equals over each
flag combination) plus a -%u0 + --strip-query crawl case exercising the
urlhack-off savename branch.

Closes #271

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 20:26:28 +02:00
Xavier Roche
40a66600ff Add --strip-query to drop query keys from dedup naming (#112) (#434)
Two URLs that differ only in tracking or session query parameters
(?utm_source=x versus ?utm_source=y) were saved as separate files, and a
single CGI could fan out into thousands of near-duplicate pages.
fil_normalized already sorted query args, so reordered parameters dedup,
but there was no way to drop a named key.

--strip-query "[host/pattern=]key1,key2,..." (repeatable) removes the
listed keys when computing the dedup key and the saved name. The fetched
URL is untouched, so a required sid= is still sent on the wire; only the
local namespace collapses. Patterns match the normalized host/path with
the +/- filter glob (strjoker), last match wins as in the filter list,
and stripping is decoupled from urlhack (-%u) so it never silently
no-ops with -%u0.

It all funnels through one chokepoint, fil_normalized: an internal
fil_normalized_filtered() strips then delegates, and hts_query_strip_keys
resolves the per-URL key list. The strip pass walks every query field,
including empty and trailing ones, so its output is a fixpoint under the
read path's second normalization (otherwise dedup silently misses).
Exported ABI is unchanged; the strip_query field is appended at the tail
of httrackp. Covered by a -#test=stripquery self-test (degenerate queries
like a=&b&c== and a 50-case idempotency fixpoint) and an end-to-end dedup
crawl test.

Closes #112

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 11:13:16 +02:00
Xavier Roche
768756e231 ci: add a MemorySanitizer job for the offline engine self-tests (#433)
MSan is the only sanitizer that catches a read of uninitialized memory --
the class of #143, where the size filter tested an uninitialized stack
LLint and forbade files at random. ASan and UBSan let that through.

MSan reports any byte produced by an uninstrumented library as
uninitialized, so the job stays inside our own code: clang, a static link
(the MSan runtime is not injected into shared objects), --disable-https to
drop openssl, and only the offline 01_engine-* self-tests minus the
zlib-backed cache trio. Those self-tests drive the hostile-input parsers
(charset, mime, html, entities, idna, filters) straight through MSan.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:40:22 +02:00
43 changed files with 13412 additions and 1631 deletions

View File

@@ -61,6 +61,50 @@ jobs:
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Reproduce the Debian buildds: they build in a minimal chroot with no
# python3, so the local-server tests must SKIP (exit 77), not fail. GitHub
# runners ship python3, so every other job hides this path; here we remove it
# before `make check`. This is the guard that would have caught the 3.49.10-1
# FTBFS (28_local-pause failed instead of skipping when python3 was absent).
buildd-no-python3:
name: build (no python3, Debian buildd)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev
- name: Configure
run: |
set -euo pipefail
autoreconf -fi
./configure
- name: Build
run: make -j"$(nproc)"
- name: Test without python3
run: |
set -euo pipefail
# Hide every python3* so `command -v python3` fails like it does in the
# buildd chroot; masking with /bin/false would still resolve.
sudo find /usr/bin /usr/local/bin -maxdepth 1 -name 'python3*' \
-exec mv {} {}.hidden \;
! command -v python3
make check
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Portability: build and test on macOS (Darwin/clang) on a native runner --
# no VM. The tree has no __APPLE__ branches, so Darwin exercises the
# generic-Unix path on a second libc and kernel. brew's openssl@3 is keg-only,
@@ -188,6 +232,51 @@ jobs:
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# MemorySanitizer catches reads of uninitialized memory (#143's stack-garbage
# size filter) that ASan/UBSan miss. It flags any byte an uninstrumented lib
# wrote, so the job stays in our own code: offline self-tests only, no openssl
# (--disable-https), no zlib cache tests, static (the runtime is not in .so's).
msan:
name: msan (MemorySanitizer, clang)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential clang autoconf automake libtool autoconf-archive \
zlib1g-dev
- name: Configure (MSan, static, no https)
run: |
set -euo pipefail
autoreconf -fi
./configure CC=clang \
CFLAGS="-fsanitize=memory -fsanitize-memory-track-origins=2 -fno-sanitize-recover=all -g -O1 -fno-omit-frame-pointer" \
LDFLAGS="-fsanitize=memory" \
--disable-https --disable-shared --enable-static
- name: Build
run: make -j"$(nproc)"
- name: Test (offline self-tests under MSan)
env:
MSAN_OPTIONS: abort_on_error=1:halt_on_error=1
run: |
set -euo pipefail
# Engine self-tests only; the cache trio pulls in uninstrumented zlib.
tests="$(cd tests && ls 01_engine-*.test | grep -v -- '-cache' | tr '\n' ' ')"
make check TESTS="$tests"
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Optional-dependency build: compile and test with HTTPS/OpenSSL disabled --
# the configuration users on minimal systems build, and one libssl is not even
# installed here so configure cannot silently re-enable it. The matrix above

View File

@@ -39,6 +39,10 @@ Welcome, and nothing to disclose. Two rules:
The sign-off covers AI-assisted code too.
## Translations
Interface strings live in [`lang/`](lang/). See [lang/README.md](lang/README.md) for the file format and how to add or update a language.
## Bugs
Open an issue with the version, OS, command used, and expected vs actual result.

View File

@@ -1,6 +1,6 @@
AC_PREREQ([2.71])
AC_INIT([httrack], [3.49.9], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
AC_INIT([httrack], [3.49.10], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
AC_COPYRIGHT([
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 1998-2015 Xavier Roche and other contributors
@@ -29,10 +29,10 @@ AC_CONFIG_SRCDIR(src/httrack.c)
AC_CONFIG_MACRO_DIR([m4])
AC_CONFIG_HEADERS(config.h)
AM_INIT_AUTOMAKE([subdir-objects])
# 3:1:0: 3.49.9 changed code but not the exported interface vs 3.49.8 (same 164
# symbols, no struct-layout change), so bump revision only. (3:0:0 was the htsblk
# mime-buffer widening, an ABI break that moved the soname .so.2 -> .so.3.)
VERSION_INFO="3:1:0"
# 3:2:0: 3.49.10 only appends tail fields to the options struct (no existing
# symbol or offset changed vs 3.49.9), so it stays soname .so.3; bump revision.
# (3:0:0 was the htsblk mime-buffer widening, the ABI break that moved .so.2 -> .so.3.)
VERSION_INFO="3:2:0"
AM_MAINTAINER_MODE
AC_USE_SYSTEM_EXTENSIONS

13
debian/changelog vendored
View File

@@ -1,3 +1,16 @@
httrack (3.49.10-1) unstable; urgency=medium
* New upstream release: new download-pacing and URL-handling options plus a
batch of crawl and robustness fixes (full list in history.txt).
* Rewrite debian/copyright in machine-readable DEP-5 format, crediting the
bundled minizip, md5 and coucal sources (#415).
* Lead the webhttrack browser dependency with chromium so httrack is not
dragged into the firefox-esr autoremoval cascade (#436).
* Override the embedded-library lint for the bundled minizip (#419).
* Bump Standards-Version to 4.7.4 (no changes required).
-- Xavier Roche <xavier@debian.org> Sun, 28 Jun 2026 14:01:53 +0200
httrack (3.49.9-1) unstable; urgency=medium
* New upstream release: Content-Type and file-type detection fixes (trust a

4
debian/control vendored
View File

@@ -2,7 +2,7 @@ Source: httrack
Section: web
Priority: optional
Maintainer: Xavier Roche <roche@httrack.com>
Standards-Version: 4.7.0
Standards-Version: 4.7.4
Build-Depends: debhelper-compat (= 13), autoconf, autoconf-archive, automake, libtool, zlib1g-dev, libssl-dev
Rules-Requires-Root: no
Homepage: http://www.httrack.com
@@ -30,7 +30,7 @@ Description: Copy websites to your computer (Offline browser)
Package: webhttrack
Architecture: any
Multi-Arch: foreign
Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, firefox-esr | chromium | www-browser
Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, chromium | firefox-esr | www-browser
Replaces: webhttrack-common (<< 3.43.9-2)
Breaks: webhttrack-common (<< 3.43.9-2)
Suggests: httrack, httrack-doc

View File

@@ -4,7 +4,25 @@ HTTrack Website Copier release history:
This file lists all changes and fixes that have been made for HTTrack
3.49-9
3.49-10
+ New: --cookies-file to preload a Netscape cookies.txt before crawling (#215)
+ New: --pause to space out file downloads by a random delay (#185)
+ New: --strip-query to drop selected query keys from the dedup naming (#112)
+ Changed: split the -%u URL hacks into independent --keep-www-prefix, --keep-double-slashes and --keep-query-order toggles (#271)
+ Fixed: follow a redirect Location after dropping its #fragment, instead of requesting the fragment and polluting the saved name (#204)
+ Fixed: escaped brackets inside a *[...] filter character class (#148)
+ Fixed: honor the server's Content-Range when resuming a partial download, instead of appending overlapping bytes (#198)
+ Fixed: abort the download as soon as the response type is excluded by -mime:, instead of fetching then discarding the body (#58)
+ Fixed: keep size-based filter rules neutral until the file size is known (#143)
+ Fixed: stop the mirror with a clean fatal error on a cache write failure, instead of crashing (#174, #219)
+ Fixed: stop the 412/416 partial re-get loop on --continue and --update (#206)
+ Fixed: keep an unrecognized URL tail instead of mangling it to .html (#115)
+ Fixed: honor --tolerant (-%B) on a broken Content-Length, and fix an out-of-bounds read it exposed (#32, #41)
+ Fixed: fall back to the next resolved address when a connection fails or stalls, instead of hanging on a dead IPv6 address
+ Fixed: report why a -%L URL list could not be loaded (#49)
+ Changed: multiple internal hardening, build and CI improvements
.49-9
+ Fixed: file-type detection from the Content-Type header: trust a declared type over a binary URL extension, honor --assume under the delayed type check, and keep a known extension against a bogus or empty Content-Type (#267, #29, #56)
+ Fixed: an uninitialized-buffer read when the Content-Type is empty (#411)
+ Fixed: restored C++ source-compatibility of the installed headers so reverse dependencies (httraqt) build again (#413)

View File

@@ -247,7 +247,7 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
<td>the \ character</td>
</tr>
<tr>
<td nowrap><tt>*[\[\]]</tt></td>
<td nowrap><tt>*[\[,\]]</tt></td>
<td>the [ or ] character</td>
</tr>
<tr>

View File

@@ -295,7 +295,7 @@ Max Depth
Maximum external depth:
Maximum external depth:
Filters (refuse/accept links) :
Filters (refuse/accept links) :
Filters (refuse/accept links):
Paths
Paths
Save prefs

37
lang/README.md Normal file
View File

@@ -0,0 +1,37 @@
# Translating HTTrack
Interface strings live here, one `.txt` file per language. `English.txt` is the reference: every other file maps each English string to its translation.
## File format
Plain text, entries in consecutive pairs of lines:
```
<English string>
<translation>
```
The first line of a pair is the lookup key and must stay identical to the one in `English.txt`; translate only the second line. Missing entries fall back to the English text at runtime, so a partial translation works.
Preserve any `\r\n`, `\t` and `printf` placeholders (`%s`, `%d`, ...) in the translation.
A few `LANGUAGE_*` entries at the top describe the file itself:
| Key | Meaning |
| --- | --- |
| `LANGUAGE_NAME` | Name shown in the language picker, in its own language (`Deutsch`, not `German`) |
| `LANGUAGE_ISO` | ISO 639 code, with region if needed (`de`, `pt_BR`) |
| `LANGUAGE_CHARSET` | Encoding the file is saved in (`ISO-8859-1`, `windows-1251`, `UTF-8`, ...) |
| `LANGUAGE_AUTHOR` | Your name and contact |
| `LANGUAGE_WINDOWSID` | Windows locale name used by WinHTTrack (`German (Standard)`) |
Save the file in exactly its declared `LANGUAGE_CHARSET`; an editor that rewrites it as UTF-8 will corrupt the non-ASCII bytes.
## Adding or updating a language
1. Copy `English.txt` to `<Language>.txt`, or edit the existing file.
2. Translate each second line; leave the English keys untouched.
3. Fill in the `LANGUAGE_*` header for a new file.
4. Open a pull request, or attach the file to a GitHub issue.
When new strings land in `English.txt` they show up untranslated (as English) until a translator fills them in.

View File

@@ -3,7 +3,7 @@
.\"
.\" This file is generated by man/makeman.sh; do not edit by hand.
.\" SPDX-License-Identifier: GPL-3.0-or-later
.TH httrack 1 "26 June 2026" "httrack website copier"
.TH httrack 1 "27 June 2026" "httrack website copier"
.SH NAME
httrack \- offline browser : copy websites to a local directory
.SH SYNOPSIS
@@ -24,6 +24,7 @@ httrack \- offline browser : copy websites to a local directory
[ \fB\-EN, \-\-max\-time[=N]\fR ]
[ \fB\-AN, \-\-max\-rate[=N]\fR ]
[ \fB\-%cN, \-\-connection\-per\-second[=N]\fR ]
[ \fB\-%G, \-\-pause\fR ]
[ \fB\-GN, \-\-max\-pause[=N]\fR ]
[ \fB\-cN, \-\-sockets[=N]\fR ]
[ \fB\-TN, \-\-timeout[=N]\fR ]
@@ -43,11 +44,13 @@ httrack \- offline browser : copy websites to a local directory
[ \fB\-x, \-\-replace\-external\fR ]
[ \fB\-%x, \-\-disable\-passwords\fR ]
[ \fB\-%q, \-\-include\-query\-string\fR ]
[ \fB\-%g, \-\-strip\-query\fR ]
[ \fB\-o, \-\-generate\-errors\fR ]
[ \fB\-X, \-\-purge\-old[=N]\fR ]
[ \fB\-%p, \-\-preserve\fR ]
[ \fB\-%T, \-\-utf8\-conversion\fR ]
[ \fB\-bN, \-\-cookies[=N]\fR ]
[ \fB\-%K, \-\-cookies\-file\fR ]
[ \fB\-u, \-\-check\-type[=N]\fR ]
[ \fB\-j, \-\-parse\-java[=N]\fR ]
[ \fB\-sN, \-\-robots[=N]\fR ]
@@ -153,6 +156,8 @@ maximum mirror time in seconds (60=1 minute, 3600=1 hour) (\-\-max\-time[=N])
maximum transfer rate in bytes/seconds (1000=1KB/s max) (\-\-max\-rate[=N])
.IP \-%cN
maximum number of connections/seconds (*%c10) (\-\-connection\-per\-second[=N])
.IP \-%G
random pause of MIN[:MAX] seconds between files (e.g. %G5:10) (\-\-pause <param>)
.IP \-GN
pause transfer if N bytes reached, and wait until lock file is deleted (\-\-max\-pause[=N])
.SS Flow control:
@@ -198,6 +203,8 @@ replace external html links by error pages (\-\-replace\-external)
do not include any password for external password protected websites (%x0 include) (\-\-disable\-passwords)
.IP \-%q
*include query string for local files (useless, for information purpose only) (%q0 don't include) (\-\-include\-query\-string)
.IP \-%g
strip query keys for dedup ([host/pattern=]key1,key2,...) (\-\-strip\-query <param>)
.IP \-o
*generate output html file in case of error (404..) (o0 don't generate) (\-\-generate\-errors)
.IP \-X
@@ -209,6 +216,8 @@ links conversion to UTF\-8 (\-\-utf8\-conversion)
.SS Spider options:
.IP \-bN
accept cookies in cookies.txt (0=do not accept,* 1=accept) (\-\-cookies[=N])
.IP \-%K
load extra cookies from a Netscape cookies.txt (\-\-cookies\-file <param>)
.IP \-u
check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (\-\-check\-type[=N])
.IP \-j
@@ -225,6 +234,8 @@ tolerant requests (accept bogus responses on some servers, but not standard!) (\
update hacks: various hacks to limit re\-transfers when updating (identical size, bogus response..) (\-\-updatehack)
.IP \-%u
url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (\-\-urlhack)
.br
opt out of one url\-hack part: \-\-keep\-www\-prefix (www.foo.com<>foo.com), \-\-keep\-double\-slashes (//), \-\-keep\-query\-order (?b&a)
.IP \-%A
assume that a type (cgi,asp..) is always linked with a mime type (\-%A php3,cgi=text/html;dat,bin=application/x\-zip) (\-\-assume <param>)
.br

View File

@@ -60,6 +60,9 @@ Please visit our Website: http://www.httrack.com
param1 : this option must be alone, and needs one distinct parameter (-P <path>)
param0 : this option must be alone, but the parameter should be put together (+*.gif)
*/
/* clang-format off: hand-aligned table; clang-format reflows the whole
initializer (2->4 space) on any edit, churning every untouched row. */
/* clang-format off */
const char *hts_optalias[][4] = {
/* {"","","",""}, */
{"path", "-O", "param1", "output path"},
@@ -107,6 +110,12 @@ const char *hts_optalias[][4] = {
{"disable-passwords", "-%x", "single", ""}, {"disable-password", "-%x",
"single", ""},
{"include-query-string", "-%q", "single", ""},
{"strip-query", "-%g", "param1",
"strip [host/pattern=]key1,key2,... from URLs"},
{"cookies-file", "-%K", "param1",
"load extra cookies from a Netscape cookies.txt"},
{"pause", "-%G", "param1",
"random pause of MIN[:MAX] seconds between files"},
{"generate-errors", "-o", "single", ""},
{"do-not-generate-errors", "-o0", "single", ""},
{"purge-old", "-X", "param", ""},
@@ -123,6 +132,9 @@ const char *hts_optalias[][4] = {
{"tolerant", "-%B", "single", ""},
{"updatehack", "-%s", "single", ""}, {"sizehack", "-%s", "single", ""},
{"urlhack", "-%u", "single", ""},
{"keep-www-prefix", "-%j", "single", ""},
{"keep-double-slashes", "-%o", "single", ""},
{"keep-query-order", "-%y", "single", ""},
{"user-agent", "-F", "param1", "user-agent identity"},
{"referer", "-%R", "param1", "default referer URL"},
{"from", "-%E", "param1", "from email address"},
@@ -241,6 +253,7 @@ const char *hts_optalias[][4] = {
{"", "", "", ""}
};
/* clang-format on */
/*
Check for alias in command-line

View File

@@ -3,12 +3,12 @@
# Change this to download files
if false; then
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
fi

View File

@@ -35,6 +35,7 @@ Please visit our Website: http://www.httrack.com
#include <fcntl.h>
#include <ctype.h>
#include <stdint.h> /* uint64_t for the pause mixer (already a hard dep via md5.h) */
/* File defs */
#include "htscore.h"
@@ -523,9 +524,12 @@ int httpmirror(char *url1, httrackp * opt) {
opt->cookie = &cookie;
cookie.max_len = 30000; // max len
strcpybuff(cookie.data, "");
// Charger cookies.txt par défaut ou cookies.txt du miroir
// Load the mirror's cookies.txt, then the one in the current directory
cookie_load(opt->cookie, StringBuff(opt->path_log), "cookies.txt");
cookie_load(opt->cookie, "", "cookies.txt");
// A user-supplied cookie file is merged last so it wins on conflicts
if (strnotempty(StringBuff(opt->cookies_file)))
cookie_load(opt->cookie, "", StringBuff(opt->cookies_file));
} else
opt->cookie = NULL;
@@ -3311,6 +3315,21 @@ HTS_INLINE int back_fillmax(struct_back * sback, httrackp * opt,
return -1; /* plus de place */
}
/* Seed-derived: stable within a gap, rerolls per launch; a per-call rand()
would bias the delay toward min_ms (see header). Jitter, not crypto. */
int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms) {
uint64_t z = (uint64_t) seed;
if (max_ms <= min_ms)
return min_ms;
/* SplitMix64 finalizer: scrambles the low-entropy ms timestamp. */
z += 0x9E3779B97F4A7C15ULL;
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
z ^= z >> 31;
return min_ms + (int) (z % (uint64_t) (max_ms - min_ms + 1));
}
int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
int n = opt->maxsoc - back_nsoc(sback);
@@ -3331,6 +3350,18 @@ int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
}
}
// #185 randomized inter-file pause: non-blocking, one launch per gap
if (n > 0 && opt->pause_max_ms > 0 && HTS_STAT.last_connect > 0) {
TStamp opTime =
HTS_STAT.last_request ? HTS_STAT.last_request : HTS_STAT.last_connect;
TStamp lap = mtime_local() - opTime;
if (lap < hts_pause_target_ms(opTime, opt->pause_min_ms, opt->pause_max_ms))
n = 0;
else
n = 1;
}
return n;
}
@@ -3739,6 +3770,17 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (StringNotEmpty(from->user_agent))
StringCopyS(to->user_agent, from->user_agent);
if (StringNotEmpty(from->strip_query))
StringCopyS(to->strip_query, from->strip_query);
if (StringNotEmpty(from->cookies_file))
StringCopyS(to->cookies_file, from->cookies_file);
if (from->pause_max_ms > 0) {
to->pause_min_ms = from->pause_min_ms;
to->pause_max_ms = from->pause_max_ms;
}
if (from->retry > -1)
to->retry = from->retry;

View File

@@ -234,8 +234,12 @@ struct hash_struct {
coucal adrfil;
/* former address+path -> link index (renamed/moved entries) */
coucal former_adrfil;
/* scratch buffers reused across lookups (not reentrant) */
int normalized;
/* effective urlhack sub-flags: www.==host / // collapse / query-arg sort */
hts_boolean norm_host;
hts_boolean norm_slash;
hts_boolean norm_query;
/* query-strip keys (not owned); set from opt->strip_query at hash_init */
const char *strip_query;
char normfil[HTS_URLMAXSIZE * 2];
char normfil2[HTS_URLMAXSIZE * 2];
char catbuff[CATBUFF_SIZE];
@@ -364,6 +368,22 @@ int fspc(httrackp * opt, FILE * fp, const char *type);
char *next_token(char *p, int flag);
/* Like fil_normalized(), but first drops query keys in STRIP (comma-separated,
"*" = all); STRIP NULL/empty behaves exactly like fil_normalized(). */
char *fil_normalized_filtered(const char *source, char *dest,
const char *strip);
/* As fil_normalized_filtered(), but DO_SLASH/DO_QUERY gate the // collapse and
the query-argument sort independently (the urlhack sub-flags). */
char *fil_normalized_filtered_ex(const char *source, char *dest,
const char *strip, int do_slash, int do_query);
/* For URL ADR/FIL, return (in DEST) the comma keylist to strip from the
'\n'-separated "[pattern=]keys" RULES (patterns matched on host/path via
strjoker, last wins); NULL if none match. Feeds fil_normalized_filtered(). */
const char *hts_query_strip_keys(const char *rules, const char *adr,
const char *fil, char *dest, size_t destsize);
/* Read a whole file into a freshly malloc'd, NUL-terminated buffer; the caller
owns it and must release it with freet(). Return NULL on missing/unreadable
file (readfile_or substitutes defaultdata instead). The byte content is NOT
@@ -398,6 +418,10 @@ int back_pluggable_sockets(struct_back * sback, httrackp * opt);
int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt);
/* Randomized inter-file pause target in [min_ms,max_ms] (#185), derived from a
timestamp seed so it is stable within one gap and rerolls per launch. */
int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms);
/* Schedule more links from the heap into free slots. Returns the number queued,
or <=0 if none could be added (no free slot / paused / stopped). */
int back_fill(struct_back * sback, httrackp * opt, cache_back * cache,

View File

@@ -1570,6 +1570,30 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
com++;
}
break; // url hack
case 'j':
opt->no_www_dedup =
HTS_TRUE; // --keep-www-prefix: keep www.X != X
if (*(com + 1) == '0') {
opt->no_www_dedup = HTS_FALSE;
com++;
}
break;
case 'o':
opt->no_slash_dedup =
HTS_TRUE; // --keep-double-slashes: keep //
if (*(com + 1) == '0') {
opt->no_slash_dedup = HTS_FALSE;
com++;
}
break;
case 'y':
opt->no_query_dedup =
HTS_TRUE; // --keep-query-order: keep ?b&a order
if (*(com + 1) == '0') {
opt->no_query_dedup = HTS_FALSE;
com++;
}
break;
case 'v':
opt->verbosedisplay = HTS_VERBOSE_FULL;
if (isdigit((unsigned char) *(com + 1))) {
@@ -1937,6 +1961,66 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
}
break;
case 'g': // strip-query: accumulate "[pattern=]keys" entries
if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
HTS_PANIC_PRINTF("Option strip-query needs a blank space and "
"[host/pattern=]key1,key2,...");
printf("Example: --strip-query "
"\"www.example.com/*=utm_source,sid\"\n");
htsmain_free();
return -1;
} else {
na++;
if (StringNotEmpty(opt->strip_query))
StringCat(opt->strip_query, "\n");
StringCat(opt->strip_query, argv[na]);
}
break;
case 'K': // cookies-file: extra Netscape cookies.txt to preload
if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
HTS_PANIC_PRINTF(
"Option cookies-file needs a blank space and "
"a cookies.txt path");
printf("Example: --cookies-file \"/home/me/cookies.txt\"\n");
htsmain_free();
return -1;
} else {
na++;
if (strlen(argv[na]) >= 1024) {
HTS_PANIC_PRINTF("Cookie file path too long");
htsmain_free();
return -1;
}
StringCopy(opt->cookies_file, argv[na]);
}
break;
case 'G': // pause: randomized inter-file delay MIN[:MAX] seconds
if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
HTS_PANIC_PRINTF("Option pause needs a blank space and a "
"delay in seconds (MIN[:MAX])");
printf("Example: --pause 5:10\n");
htsmain_free();
return -1;
} else {
double pmin = 0, pmax = 0;
int nf;
na++;
nf = sscanf(argv[na], "%lf:%lf", &pmin, &pmax);
if (nf < 2)
pmax = pmin; /* a single value means a fixed delay */
/* positive-form bounds: NaN fails every comparison, so this
rejects it before the undefined (int)(NaN*1000) cast */
if (nf < 1 || !(pmin >= 0 && pmax >= pmin && pmax <= 86400)) {
HTS_PANIC_PRINTF("Invalid --pause range (expected "
"MIN[:MAX] seconds, 0<=MIN<=MAX<=86400)");
htsmain_free();
return -1;
}
opt->pause_min_ms = (int) (pmin * 1000.0);
opt->pause_max_ms = (int) (pmax * 1000.0);
}
break;
case 't': /* do not change type (ending) of filenames according to the MIME type */
opt->no_type_change = 1;
if (*(com+1)=='0') { opt->no_type_change = 0; com++; }

View File

@@ -30,12 +30,14 @@ Please visit our Website: http://www.httrack.com
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#include <stdint.h>
#include "htscharset.h"
#include "htsencoding.h"
#include "htssafe.h"
/* static int decode_entity(const unsigned int hash, const size_t len);
*/
/* static int decode_entity(const uint64_t hash, const size_t len);
*/
#include "htsentities.h"
/* hexadecimal conversion */
@@ -50,30 +52,31 @@ static int get_hex_value(char c) {
return -1;
}
/* Numerical Recipes,
see <http://en.wikipedia.org/wiki/Linear_congruential_generator> */
#define HASH_PRIME ( 1664525 )
#define HASH_CONST ( 1013904223 )
#define HASH_ADD(HASH, C) do { \
(HASH) *= HASH_PRIME; \
(HASH) += HASH_CONST; \
(HASH) += (C); \
} while(0)
/* 64-bit FNV-1a; must match htsentities.sh, which keys the entity table on it.
*/
#define HASH_INIT 0xcbf29ce484222325ULL
#define HASH_PRIME 0x100000001b3ULL
#define HASH_ADD(HASH, C) \
do { \
(HASH) ^= (unsigned char) (C); \
(HASH) *= HASH_PRIME; \
} while (0)
int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t max, const char *charset) {
size_t i, j, ampStart, ampStartDest;
int uc;
int hex;
unsigned int hash;
uint64_t hash;
assertf(max != 0);
for(i = 0, j = 0, ampStart = (size_t) -1, ampStartDest = 0,
uc = -1, hex = 0, hash = 0 ; src[i] != '\0' ; i++) {
for (i = 0, j = 0, ampStart = (size_t) -1, ampStartDest = 0, uc = -1, hex = 0,
hash = HASH_INIT;
src[i] != '\0'; i++) {
/* start of entity */
if (src[i] == '&') {
ampStart = i;
ampStartDest = j;
hash = 0;
hash = HASH_INIT;
uc = -1;
}
/* inside a potential entity */
@@ -174,14 +177,11 @@ int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t ma
}
/* alphanumerical entity */
else {
/* alphanum and not too far ('&thetasym;' is the longest) */
if (i <= ampStart + 10 &&
(
(src[i] >= '0' && src[i] <= '9')
|| (src[i] >= 'A' && src[i] <= 'Z')
|| (src[i] >= 'a' && src[i] <= 'z')
)
) {
/* alphanum, capped at the longest name
* '&CounterClockwiseContourIntegral;' (31) */
if (i <= ampStart + 31 && ((src[i] >= '0' && src[i] <= '9') ||
(src[i] >= 'A' && src[i] <= 'Z') ||
(src[i] >= 'a' && src[i] <= 'z'))) {
/* compute hash */
HASH_ADD(hash, (unsigned char) src[i]);
} else {

File diff suppressed because it is too large Load Diff

View File

@@ -1,75 +1,92 @@
#!/bin/bash
#
# Regenerate htsentities.h from the WHATWG named character references.
src=html40.txt
url=http://www.w3.org/TR/1998/REC-html40-19980424/html40.txt
set -euo pipefail
src=entities.json
url=https://html.spec.whatwg.org/entities.json
dest=htsentities.h
(
cat <<EOF
/*
-- ${dest} --
FILE GENERATED BY $0, DO NOT MODIFY
# 64-bit FNV-1a of $1, printed as a C constant. Must match the hash in
# htsencoding.c. The offset basis is stored as its wrapped (signed) bit pattern;
# bash arithmetic is 64-bit two's complement, so the result is bit-exact.
fnv1a() {
local s=$1 i c h=$((0xcbf29ce484222325))
for ((i = 0; i < ${#s}; i++)); do
printf -v c '%d' "'${s:i:1}"
h=$(((h ^ (c & 0xff)) * 0x100000001b3))
done
printf '0x%016xULL' "$h"
}
We compute the LCG hash
(see <http://en.wikipedia.org/wiki/Linear_congruential_generator>)
for each entity. We should in theory check using strncmp() that we
actually have the correct entity, but this is actually statistically
not needed.
if [ ! -f "$src" ]; then
curl -fsS "$url" -o "$src"
fi
We may want to do better, but we expect the hash function to be uniform, and
let the compiler be smart enough to optimize the switch (for example by
checking in log2() intervals)
This code has been generated using the evil $0 script.
*/
# Keep ';'-terminated single-codepoint names; the ~93 multi-codepoint refs can't
# fit decode_entity's single-codepoint return and are skipped (left verbatim).
pairs=$(jq -r '
to_entries
| map(select((.key | endswith(";")) and (.value.codepoints | length == 1)))
| sort_by(.key)
| .[] | "\(.key | ltrimstr("&") | rtrimstr(";"))\t\(.value.codepoints[0])"' "$src")
static int decode_entity(const unsigned int hash, const size_t len) {
# Skipped multi-codepoint names, kept to prove none aliases an emitted hash.
skipped=$(jq -r '
to_entries
| map(select((.key | endswith(";")) and (.value.codepoints | length > 1)))
| .[] | .key | ltrimstr("&") | rtrimstr(";")' "$src")
cases=""
emit_hashes=""
while IFS=$'\t' read -r name cp; do
hash=$(fnv1a "$name")
cases+=" /* $name */"$'\n'
cases+=" case $hash:"$'\n'
cases+=" if (len == ${#name}) {"$'\n'
cases+=" return $cp;"$'\n'
cases+=" }"$'\n'
cases+=" break;"$'\n'
emit_hashes+="$hash"$'\n'
done <<<"$pairs"
skip_hashes=""
while IFS= read -r name; do
[ -n "$name" ] && skip_hashes+="$(fnv1a "$name")"$'\n'
done <<<"$skipped"
# The switch keys on the hash alone, so the dispatch is correct only while every
# emitted name hashes uniquely; prove it here, no runtime name compare needed.
dups=$(printf '%s' "$emit_hashes" | sort | uniq -d || true)
if [ -n "$dups" ]; then
echo "FATAL: two entity names share a hash (duplicate switch case); change the hash:" >&2
echo "$dups" >&2
exit 1
fi
# A skipped name colliding with an emitted hash would mis-decode instead of
# staying verbatim; forbid that too.
aliased=$(comm -12 <(printf '%s' "$emit_hashes" | sort -u) <(printf '%s' "$skip_hashes" | sort -u) || true)
if [ -n "$aliased" ]; then
echo "FATAL: a skipped multi-codepoint name aliases an emitted hash:" >&2
echo "$aliased" >&2
exit 1
fi
cat >"$dest" <<EOF
/* GENERATED by $0 from the WHATWG named character references
(${url}). DO NOT EDIT.
Dispatch keys on a 64-bit FNV-1a hash of the entity name; the generator
aborts on any hash collision, so no runtime name compare is needed. */
#include <stdint.h>
static int decode_entity(const uint64_t hash, const size_t len) {
switch(hash) {
EOF
(
if test -f ${src}; then
cat ${src}
else
GET "${url}"
fi
) |
grep -E '^<!ENTITY [a-zA-Z0-9_]' |
sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
(
read -r A
while test -n "$A"; do
ent="${A%% *}"
code=$(echo "$A" | cut -f2 -d' ')
# compute hash
hash=0
i=0
a=1664525
c=1013904223
m="$((1 << 32))"
while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
hash="$((((hash * a) % (m) + d + c) % (m)))"
i=$((i + 1))
done
echo -e " /* $A */"
echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};"
echo -e " }"
echo -e " break;"
# next
read -r A
done
)
cat <<EOF
}
${cases} }
/* unknown */
return -1;
}
EOF
) >${dest}
echo "wrote $dest ($(grep -c '^ case ' "$dest") entities)" >&2

View File

@@ -193,7 +193,12 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
int len = (int) strlen(joker);
while((joker[i] != RIGHT) && (joker[i]) && (i < len)) {
if ((joker[i] == '<') || (joker[i] == '>')) { // *[<10]
// '\' escapes the next char as a literal member, e.g. *[\[\]]
if (joker[i] == '\\' && joker[i + 1] != '\0') {
i++;
pass[(int) (unsigned char) joker[i]] = 1;
i++;
} else if ((joker[i] == '<') || (joker[i] == '>')) { // *[<10]
int lsize = 0;
int lverdict;
@@ -221,7 +226,9 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
while(isdigit((unsigned char) joker[i]))
i++;
}
} else if (joker[i + 1] == '-') { // 2 car, ex: *[A-Z]
} else if (joker[i + 1] == '-' && joker[i + 2] != '\0') {
// range *[A-Z]; the '\0' guard rejects a truncated *[a- (else
// i+=3 overshoots the NUL)
if ((int) (unsigned char) joker[i + 2] >
(int) (unsigned char) joker[i]) {
int j;
@@ -233,10 +240,7 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
}
// else err=1;
i += 3;
} else { // 1 car, ex: *[ ]
if (joker[i + 2] == '\\' && joker[i + 3] != 0) { // escaped char, such as *[\[] or *[\]]
i++;
}
} else { // 1 car, ex: *[ ]
pass[(int) (unsigned char) joker[i]] = 1;
i++;
}

View File

@@ -43,8 +43,8 @@ Please visit our Website: http://www.httrack.com
configure.ac, decoupled from these). VERSION is the display form, VERSIONID
the dotted numeric form, AFF_VERSION the short form shown in footers,
LIB_VERSION the data/cache format generation. */
#define HTTRACK_VERSION "3.49-9"
#define HTTRACK_VERSIONID "3.49.9"
#define HTTRACK_VERSION "3.49-10"
#define HTTRACK_VERSIONID "3.49.10"
#define HTTRACK_AFF_VERSION "3.x"
#define HTTRACK_LIB_VERSION "2.0"

View File

@@ -106,10 +106,10 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,
const lien_url*const lien = (const lien_url*) value;
const char *const adr = !former ? lien->adr : lien->former_adr;
const char *const fil = !former ? lien->fil : lien->former_fil;
const char *const adr_norm = adr != NULL ?
( hash->normalized ? jump_normalized_const(adr)
: jump_identification_const(adr) )
: NULL;
const char *const adr_norm =
adr != NULL ? (hash->norm_host ? jump_normalized_const(adr)
: jump_identification_const(adr))
: NULL;
// copy address
assertf(adr_norm != NULL);
@@ -117,10 +117,18 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,
// copy link
assertf(fil != NULL);
if (hash->normalized) {
fil_normalized(fil, &hash->normfil[strlen(hash->normfil)]);
} else {
strcpy(&hash->normfil[strlen(hash->normfil)], fil);
{
/* resolve the per-URL strip keys; strip applies even when urlhack is off */
char BIGSTK keybuf[HTS_URLMAXSIZE];
const char *const keys = hts_query_strip_keys(hash->strip_query, adr, fil,
keybuf, sizeof(keybuf));
if (hash->norm_slash || hash->norm_query || keys != NULL) {
fil_normalized_filtered_ex(fil, &hash->normfil[strlen(hash->normfil)],
keys, hash->norm_slash, hash->norm_query);
} else {
strcpy(&hash->normfil[strlen(hash->normfil)], fil);
}
}
// hash
@@ -132,8 +140,7 @@ static int key_adrfil_equals_generic(void *arg,
coucal_key_const a_,
coucal_key_const b_,
const int former) {
hash_struct *const hash = (hash_struct*) arg;
const int normalized = hash->normalized;
hash_struct *const hash = (hash_struct *) arg;
const lien_url*const a = (const lien_url*) a_;
const lien_url*const b = (const lien_url*) b_;
const char *const a_adr = !former ? a->adr : a->former_adr;
@@ -150,10 +157,10 @@ static int key_adrfil_equals_generic(void *arg,
assertf(b_fil != NULL);
// skip scheme and authentication to the domain (possibly without www.)
ja = normalized
? jump_normalized_const(a_adr) : jump_identification_const(a_adr);
jb = normalized
? jump_normalized_const(b_adr) : jump_identification_const(b_adr);
ja = hash->norm_host ? jump_normalized_const(a_adr)
: jump_identification_const(a_adr);
jb = hash->norm_host ? jump_normalized_const(b_adr)
: jump_identification_const(b_adr);
assertf(ja != NULL);
assertf(jb != NULL);
if (strcasecmp(ja, jb) != 0) {
@@ -161,12 +168,23 @@ static int key_adrfil_equals_generic(void *arg,
}
// now compare pathes
if (normalized) {
fil_normalized(a_fil, hash->normfil);
fil_normalized(b_fil, hash->normfil2);
return strcmp(hash->normfil, hash->normfil2) == 0;
} else {
return strcmp(a_fil, b_fil) == 0;
{
char BIGSTK ka[HTS_URLMAXSIZE], kb[HTS_URLMAXSIZE];
const char *const keysa =
hts_query_strip_keys(hash->strip_query, a_adr, a_fil, ka, sizeof(ka));
const char *const keysb =
hts_query_strip_keys(hash->strip_query, b_adr, b_fil, kb, sizeof(kb));
if (hash->norm_slash || hash->norm_query || keysa != NULL ||
keysb != NULL) {
fil_normalized_filtered_ex(a_fil, hash->normfil, keysa, hash->norm_slash,
hash->norm_query);
fil_normalized_filtered_ex(b_fil, hash->normfil2, keysb, hash->norm_slash,
hash->norm_query);
return strcmp(hash->normfil, hash->normfil2) == 0;
} else {
return strcmp(a_fil, b_fil) == 0;
}
}
}
@@ -222,11 +240,17 @@ static int key_former_adrfil_equals(void *arg,
return key_adrfil_equals_generic(arg, a, b, 1);
}
void hash_init(httrackp *opt, hash_struct * hash, int normalized) {
void hash_init(httrackp *opt, hash_struct *hash, hts_boolean normalized) {
hash->sav = coucal_new(0);
hash->adrfil = coucal_new(0);
hash->former_adrfil = coucal_new(0);
hash->normalized = normalized;
/* urlhack is the umbrella; per-feature negatives opt out of each part */
hash->norm_host = normalized && !opt->no_www_dedup;
hash->norm_slash = normalized && !opt->no_slash_dedup;
hash->norm_query = normalized && !opt->no_query_dedup;
/* snapshot the query-strip list (not owned; valid for the hash lifetime) */
hash->strip_query =
StringNotEmpty(opt->strip_query) ? StringBuff(opt->strip_query) : NULL;
hts_set_hash_handler(hash->sav, opt);
hts_set_hash_handler(hash->adrfil, opt);
@@ -282,6 +306,26 @@ void hash_free(hash_struct *hash) {
}
}
/* Test helper: do the two URLs dedupe to the same key under opt's urlhack
flags? Exercises the live hash compare (norm_host/slash/query resolution). */
hts_boolean hash_url_equals(httrackp *opt, const char *adra, const char *fila,
const char *adrb, const char *filb) {
hash_struct hash;
lien_url la, lb;
hts_boolean eq;
memset(&la, 0, sizeof(la));
memset(&lb, 0, sizeof(lb));
la.adr = key_duphandler(NULL, adra);
la.fil = key_duphandler(NULL, fila);
lb.adr = key_duphandler(NULL, adrb);
lb.fil = key_duphandler(NULL, filb);
hash_init(opt, &hash, opt->urlhack);
eq = key_adrfil_equals(&hash, &la, &lb);
hash_free(&hash);
return eq;
}
// retour: position ou -1 si non trouvé
int hash_read(const hash_struct * hash, const char *nom1, const char *nom2,
hash_struct_type type) {

View File

@@ -51,8 +51,12 @@ typedef enum hash_struct_type {
} hash_struct_type;
// tables de hachage
void hash_init(httrackp *opt, hash_struct *hash, int normalized);
void hash_init(httrackp *opt, hash_struct *hash, hts_boolean normalized);
void hash_free(hash_struct *hash);
/* Test helper: HTS_TRUE if the two URLs dedupe together under opt's urlhack
flags. */
hts_boolean hash_url_equals(httrackp *opt, const char *adra, const char *fila,
const char *adrb, const char *filb);
int hash_read(const hash_struct * hash, const char *nom1, const char *nom2,
hash_struct_type type);
void hash_write(hash_struct * hash, size_t lpos);

View File

@@ -521,6 +521,7 @@ void help(const char *app, int more) {
infomsg(" EN maximum mirror time in seconds (60=1 minute, 3600=1 hour)");
infomsg(" AN maximum transfer rate in bytes/seconds (1000=1KB/s max)");
infomsg(" %cN maximum number of connections/seconds (*%c10)");
infomsg(" %G random pause of MIN[:MAX] seconds between files (e.g. %G5:10)");
infomsg
(" GN pause transfer if N bytes reached, and wait until lock file is deleted");
infomsg("");
@@ -563,6 +564,7 @@ void help(const char *app, int more) {
(" %x do not include any password for external password protected websites (%x0 include)");
infomsg
(" %q *include query string for local files (useless, for information purpose only) (%q0 don't include)");
infomsg(" %g strip query keys for dedup ([host/pattern=]key1,key2,...)");
infomsg
(" o *generate output html file in case of error (404..) (o0 don't generate)");
infomsg(" X *purge old files after update (X0 keep delete)");
@@ -571,6 +573,7 @@ void help(const char *app, int more) {
infomsg("");
infomsg("Spider options:");
infomsg(" bN accept cookies in cookies.txt (0=do not accept,* 1=accept)");
infomsg(" %K load extra cookies from a Netscape cookies.txt");
infomsg
(" u check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always)");
infomsg
@@ -587,6 +590,9 @@ void help(const char *app, int more) {
(" %s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..)");
infomsg
(" %u url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..)");
infomsg(" opt out of one url-hack part: --keep-www-prefix "
"(www.foo.com<>foo.com), --keep-double-slashes (//), "
"--keep-query-order (?b&a)");
infomsg
(" %A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip)");
infomsg(" shortcut: '--assume standard' is equivalent to -%A "

View File

@@ -3610,7 +3610,10 @@ static int sortNormFnc(const void *a_, const void *b_) {
return strcmp(*a + 1, *b + 1);
}
HTSEXT_API char *fil_normalized(const char *source, char *dest) {
/* Path normalizer core: optionally collapse redundant '//' (DO_SLASH) and/or
sort query arguments (DO_QUERY) so equivalent URLs dedupe. */
static char *fil_normalized_ex(const char *source, char *dest, int do_slash,
int do_query) {
char lastc = 0;
int gotquery = 0;
int ampargs = 0;
@@ -3620,8 +3623,8 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
for(i = j = 0; source[i] != '\0'; i++) {
if (!gotquery && source[i] == '?')
gotquery = ampargs = 1;
if ((!gotquery && lastc == '/' && source[i] == '/') // foo//bar -> foo/bar
) {
if (do_slash && !gotquery && lastc == '/' && source[i] == '/') {
// foo//bar -> foo/bar
} else {
if (gotquery && source[i] == '&') {
ampargs++;
@@ -3633,7 +3636,7 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
dest[j++] = '\0';
/* Sort arguments (&foo=1&bar=2 == &bar=2&foo=1) */
if (ampargs > 1) {
if (do_query && ampargs > 1) {
char **amps = malloct(ampargs * sizeof(char *));
char *copyBuff = NULL;
size_t qLen = 0;
@@ -3681,6 +3684,153 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
return dest;
}
HTSEXT_API char *fil_normalized(const char *source, char *dest) {
return fil_normalized_ex(source, dest, 1, 1);
}
/* Is query key ARG[0..keylen) in the comma-separated STRIP list? "*" = all;
case-sensitive, space-trimmed tokens. */
static int hts_query_key_stripped(const char *arg, size_t keylen,
const char *strip) {
const char *p = strip;
while (*p != '\0') {
const char *start = p;
size_t toklen;
while (*p != '\0' && *p != ',')
p++;
toklen = (size_t) (p - start);
while (toklen > 0 && *start == ' ') {
start++;
toklen--;
}
while (toklen > 0 && start[toklen - 1] == ' ')
toklen--;
if (toklen == 1 && start[0] == '*')
return 1;
if (toklen == keylen && strncmp(start, arg, keylen) == 0)
return 1;
if (*p == ',')
p++;
}
return 0;
}
/* see htscore.h */
char *fil_normalized_filtered_ex(const char *source, char *dest,
const char *strip, int do_slash,
int do_query) {
const char *query;
char BIGSTK tmp[HTS_URLMAXSIZE * 2];
htsbuff cb;
int wrote = 0;
/* No strip list, or no query: plain normalization. */
if (strip == NULL || *strip == '\0' ||
(query = strchr(source, '?')) == NULL) {
return fil_normalized_ex(source, dest, do_slash, do_query);
}
/* Copy the path, re-emit kept query args, let fil_normalized() sort. Walk
every field incl. empty/trailing ("a&","?&&") so the result is a fixpoint
(the read re-normalizes it; a dropped empty arg would miss dedup). */
cb = htsbuff_ptr(tmp, sizeof(tmp));
htsbuff_catn(&cb, source, (size_t) (query - source));
for (query++;;) {
const char *const arg = query;
const char *eq = NULL;
size_t keylen, arglen;
while (*query != '\0' && *query != '&') {
if (eq == NULL && *query == '=')
eq = query;
query++;
}
arglen = (size_t) (query - arg);
keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
if (!hts_query_key_stripped(arg, keylen, strip)) {
htsbuff_catc(&cb, wrote ? '&' : '?');
htsbuff_catn(&cb, arg, arglen);
wrote = 1;
}
if (*query == '\0')
break;
query++;
}
return fil_normalized_ex(tmp, dest, do_slash, do_query);
}
/* see htscore.h */
char *fil_normalized_filtered(const char *source, char *dest,
const char *strip) {
return fil_normalized_filtered_ex(source, dest, strip, 1, 1);
}
/* see htscore.h */
const char *hts_query_strip_keys(const char *rules, const char *adr,
const char *fil, char *dest, size_t destsize) {
const char *p, *q;
const char *result = NULL;
char BIGSTK url[HTS_URLMAXSIZE * 2];
if (rules == NULL || *rules == '\0' || destsize == 0)
return NULL;
/* Match string = normalized host/path, query removed. jump_normalized_const
collapses www+scheme/auth so read and write (double-normalized) agree;
query excluded keeps the decision on host/path only. */
url[0] = '\0';
strcatbuff(url, jump_normalized_const(adr));
if (fil[0] != '/')
strcatbuff(url, "/");
q = strchr(fil, '?');
if (q != NULL)
strncatbuff(url, fil, (int) (q - fil));
else
strcatbuff(url, fil);
/* Walk the '\n' entries; last match wins (like the +/- filter eval). Each is
"pattern=keys"; no '=' is the bare form, pattern "*". */
for (p = rules; *p != '\0';) {
const char *const line = p;
const char *eol, *eq, *keys;
char BIGSTK pat[HTS_URLMAXSIZE * 2];
while (*p != '\0' && *p != '\n')
p++;
eol = p;
if (*p == '\n')
p++;
if (eol == line)
continue;
eq = memchr(line, '=', (size_t) (eol - line));
if (eq != NULL) {
size_t patlen = (size_t) (eq - line);
if (patlen >= sizeof(pat))
patlen = sizeof(pat) - 1;
memcpy(pat, line, patlen);
pat[patlen] = '\0';
keys = eq + 1;
} else {
pat[0] = '*';
pat[1] = '\0';
keys = line;
}
if (strjoker(url, pat, NULL, NULL) != NULL) {
size_t klen = (size_t) (eol - keys);
if (klen >= destsize)
klen = destsize - 1;
memcpy(dest, keys, klen);
dest[klen] = '\0';
result = dest;
}
}
return result;
}
#define endwith(a) ( (len >= (sizeof(a)-1)) ? ( strncmp(dest, a+len-(sizeof(a)-1), sizeof(a)-1) == 0 ) : 0 );
HTSEXT_API char *adr_normalized_sized(const char *source, char *dest,
size_t destsize) {
@@ -5890,7 +6040,14 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->verbosedisplay = HTS_VERBOSE_NONE; // no text animation
opt->sizehack = HTS_FALSE;
opt->urlhack = HTS_TRUE;
opt->no_www_dedup = HTS_FALSE;
opt->no_slash_dedup = HTS_FALSE;
opt->no_query_dedup = HTS_FALSE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
StringCopy(opt->strip_query, "");
StringCopy(opt->cookies_file, "");
opt->pause_min_ms = 0;
opt->pause_max_ms = 0;
opt->ftp_proxy = HTS_TRUE;
opt->convert_utf8 = HTS_TRUE;
StringCopy(opt->filelist, "");
@@ -6035,6 +6192,8 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
StringFree(opt->urllist);
StringFree(opt->footer);
StringFree(opt->mod_blacklist);
StringFree(opt->strip_query);
StringFree(opt->cookies_file);
StringFree(opt->path_html);
StringFree(opt->path_html_utf8);

View File

@@ -198,6 +198,13 @@ int url_savename(lien_adrfilsave *const afs,
// copy of fil, used for lookups (see urlhack)
const char *normadr = adr;
const char *normfil = fil_complete;
/* query keys to strip for this URL (NULL = none); decoupled from urlhack */
char BIGSTK stripkeys[HTS_URLMAXSIZE];
const char *const strip =
StringNotEmpty(opt->strip_query)
? hts_query_strip_keys(StringBuff(opt->strip_query), adr,
fil_complete, stripkeys, sizeof(stripkeys))
: NULL;
const char *const print_adr = jump_protocol_const(adr);
const char *start_pos = NULL, *nom_pos = NULL, *dot_pos = NULL; // Position nom et point
@@ -230,9 +237,13 @@ int url_savename(lien_adrfilsave *const afs,
// www-42.foo.com -> foo.com
// foo.com/bar//foobar -> foo.com/bar/foobar
if (opt->urlhack) {
// copy of adr (without protocol), used for lookups (see urlhack)
normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
normfil = fil_normalized(fil_complete, normfil_);
// dedup-lookup key; honor the per-feature negatives like htshash.c so
// distinct URLs keep distinct savenames (else keep normadr = adr)
if (!opt->no_www_dedup)
normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
normfil =
fil_normalized_filtered_ex(fil_complete, normfil_, strip,
!opt->no_slash_dedup, !opt->no_query_dedup);
} else {
if (link_has_authority(adr_complete)) { // https or other protocols : in "http/" subfolder
char *pos = strchr(adr_complete, ':');
@@ -245,6 +256,11 @@ int url_savename(lien_adrfilsave *const afs,
normadr = normadr_;
}
}
// strip still applies with urlhack off (host left untouched); no // or
// query-sort here, to match the hash key (norm_slash/norm_query are 0 when
// urlhack is off) so a URL is looked up under the key it was stored with
if (strip != NULL)
normfil = fil_normalized_filtered_ex(fil_complete, normfil_, strip, 0, 0);
}
// à afficher sans ftp://

View File

@@ -529,6 +529,16 @@ struct httrackp {
htslibhandles libHandles; /**< loaded external module handles */
//
htsoptstate state; /**< embedded live engine state */
String strip_query; /**< query keys to drop when deduping URLs (-strip-query);
appended at the tail to keep field offsets stable */
hts_boolean
no_www_dedup; /**< with urlhack, keep www.host distinct from host */
hts_boolean no_slash_dedup; /**< with urlhack, keep redundant // in paths */
hts_boolean no_query_dedup; /**< with urlhack, keep query-argument order */
String cookies_file; /**< extra Netscape cookies.txt to preload
(--cookies-file) */
int pause_min_ms; /**< inter-file pause lower bound, ms (0=off, #185) */
int pause_max_ms; /**< inter-file pause upper bound, ms */
};
/* Running statistics for a mirror. */

View File

@@ -302,6 +302,14 @@ static HTS_INLINE char html_prevc(const char *html, const char *start) {
return html > start ? html[-1] : ' ';
}
/* Drop a redirect Location's #fragment: a UA anchor, never part of the fetched
* resource (#204). */
static void url_drop_fragment(char *const url) {
char *const frag = strchr(url, '#');
if (frag != NULL)
*frag = '\0';
}
/* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
argument is a method, not a URL: #218). Case-insensitive. */
static int is_http_method(const char *s, size_t len) {
@@ -3596,22 +3604,35 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
//
strcpybuff(mov_url, r->location);
url_drop_fragment(mov_url);
// url qque -> adresse+fichier
if ((reponse =
ident_url_relatif(mov_url, urladr(), urlfil(), moved)) >= 0) {
int set_prio_to = 0; // pas de priotité fixéd par wizard
// check whether URLHack is harmless or not
if (opt->urlhack) {
// check whether URLHack is harmless or not (per the effective
// sub-flags)
if (opt->urlhack && (!opt->no_www_dedup || !opt->no_slash_dedup ||
!opt->no_query_dedup)) {
const int norm_host = !opt->no_www_dedup;
const int norm_slash = !opt->no_slash_dedup;
const int norm_query = !opt->no_query_dedup;
char BIGSTK n_adr[HTS_URLMAXSIZE * 2], n_fil[HTS_URLMAXSIZE * 2];
char BIGSTK pn_adr[HTS_URLMAXSIZE * 2], pn_fil[HTS_URLMAXSIZE * 2];
n_adr[0] = n_fil[0] = '\0';
(void) adr_normalized_sized(moved->adr, n_adr, sizeof(n_adr));
(void) fil_normalized(moved->fil, n_fil);
(void) adr_normalized_sized(urladr(), pn_adr, sizeof(pn_adr));
(void) fil_normalized(urlfil(), pn_fil);
strlcpybuff(n_adr,
norm_host ? jump_normalized_const(moved->adr)
: jump_identification_const(moved->adr),
sizeof(n_adr));
strlcpybuff(pn_adr,
norm_host ? jump_normalized_const(urladr())
: jump_identification_const(urladr()),
sizeof(pn_adr));
fil_normalized_filtered_ex(moved->fil, n_fil, NULL, norm_slash,
norm_query);
fil_normalized_filtered_ex(urlfil(), pn_fil, NULL, norm_slash,
norm_query);
if (strcasecmp(n_adr, pn_adr) == 0
&& strcasecmp(n_fil, pn_fil) == 0) {
hts_log_print(opt, LOG_WARNING,
@@ -4791,6 +4812,7 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
mov_url[0] = '\0';
strcpybuff(mov_url, back[b].r.location); // copier URL
url_drop_fragment(mov_url);
/* Remove (temporarily created) file if it was created */
UNLINK(fconv(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), back[b].url_sav));

View File

@@ -512,15 +512,21 @@ static int string_safety_selftests(void) {
/* ------------------------------------------------------------ */
static int st_filter(httrackp *opt, int argc, char **argv) {
char *str, *pat;
int matched;
(void) opt;
if (argc < 2) {
fprintf(stderr, "filter: needs a filter pattern and a string\n");
return 1;
}
if (strjoker(argv[1], argv[0], NULL, NULL))
printf("%s does match %s\n", argv[1], argv[0]);
else
printf("%s does NOT match %s\n", argv[1], argv[0]);
/* exact-size heap copies so a sanitizer traps any over-read of the pattern */
str = strdupt(argv[1]);
pat = strdupt(argv[0]);
matched = strjoker(str, pat, NULL, NULL) != NULL;
printf("%s does %s %s\n", argv[1], matched ? "match" : "NOT match", argv[0]);
freet(str);
freet(pat);
return 0;
}
@@ -899,12 +905,71 @@ static int st_copyopt(httrackp *opt, int argc, char **argv) {
if (to->parseall != HTS_TRUE)
err = 1;
/* String field: a non-empty source deep-copies across, an empty source
leaves the target intact (StringNotEmpty guard). Covers the exported
copy_htsopt String path that no crawl test reaches. */
StringCopy(from->cookies_file, "/tmp/jar.txt");
StringCopy(to->cookies_file, "");
copy_htsopt(from, to);
if (strcmp(StringBuff(to->cookies_file), "/tmp/jar.txt") != 0)
err = 1;
StringCopy(from->cookies_file, "");
copy_htsopt(from, to);
if (strcmp(StringBuff(to->cookies_file), "/tmp/jar.txt") != 0)
err = 1;
/* #185 pause pair: copied when enabled (max>0), the 0 sentinel skips */
from->pause_min_ms = 5000;
from->pause_max_ms = 10000;
to->pause_min_ms = to->pause_max_ms = 0;
copy_htsopt(from, to);
if (to->pause_min_ms != 5000 || to->pause_max_ms != 10000)
err = 1;
from->pause_min_ms = from->pause_max_ms = 0;
copy_htsopt(from, to);
if (to->pause_min_ms != 5000 || to->pause_max_ms != 10000)
err = 1;
hts_free_opt(from);
hts_free_opt(to);
printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");
return err;
}
static int st_pause(httrackp *opt, int argc, char **argv) {
int err = 0, i, seen_low = 0, seen_high = 0;
(void) opt;
(void) argc;
(void) argv;
/* Consecutive-ms seeds (production shape: launch timestamps a few ms apart)
must stay in range and spread, not collapse to a bound -- worst case for a
weak low-bit mixer. */
for (i = 0; i < 10000; i++) {
int t = hts_pause_target_ms((TStamp) (1719500000000LL + i), 5000, 10000);
if (t < 5000 || t > 10000)
err = 1;
seen_low |= (t < 6000);
seen_high |= (t > 9000);
}
if (!seen_low || !seen_high)
err = 1;
if (hts_pause_target_ms(12345, 8000, 8000) != 8000) /* equal bounds = fixed */
err = 1;
/* deterministic: a seed yields the same target even after an intervening call
with another seed (no global PRNG state to perturb it) */
{
int a = hts_pause_target_ms(99, 5000, 10000);
(void) hts_pause_target_ms(54321, 5000, 10000);
if (hts_pause_target_ms(99, 5000, 10000) != a)
err = 1;
}
printf("pause: %s\n", err ? "FAIL" : "OK");
return err;
}
static int st_relative(httrackp *opt, int argc, char **argv) {
char s[HTS_URLMAXSIZE * 2];
@@ -1052,6 +1117,173 @@ static int st_cookies(httrackp *opt, int argc, char **argv) {
return err;
}
/* --strip-query: resolver + fil_normalized_filtered, end to end. */
static int st_stripquery(httrackp *opt, int argc, char **argv) {
char dest[1024], keys[256], ref[1024];
const char *k;
(void) opt;
(void) argc;
(void) argv;
/* empty rules == plain fil_normalized */
assertf(hts_query_strip_keys(NULL, "h.com", "/p?a=1", keys, sizeof(keys)) ==
NULL);
assertf(hts_query_strip_keys("", "h.com", "/p?a=1", keys, sizeof(keys)) ==
NULL);
assertf(strcmp(fil_normalized_filtered("/p?b=2&a=1", dest, NULL),
fil_normalized("/p?b=2&a=1", ref)) == 0);
/* bare form (*=keys): strip the key everywhere, keep+sort the rest */
k = hts_query_strip_keys("sid", "any.com", "/p?b=2&sid=x&a=1", keys,
sizeof(keys));
assertf(k != NULL && strcmp(k, "sid") == 0);
assertf(strcmp(fil_normalized_filtered("/p?b=2&sid=x&a=1", dest, k),
"/p?a=1&b=2") == 0);
/* reordered variant + an extra stripped key == the clean URL */
assertf(strcmp(fil_normalized_filtered("/p?sid=y&a=1&b=2", dest, "sid"),
fil_normalized("/p?a=1&b=2", ref)) == 0);
/* host pattern matches only that host, incl. its www-normalized forms */
assertf(hts_query_strip_keys("ex.com/*=utm", "other.com", "/p?utm=1", keys,
sizeof(keys)) == NULL);
assertf(hts_query_strip_keys("ex.com/*=utm", "ex.com", "/p?utm=1", keys,
sizeof(keys)) != NULL);
assertf(hts_query_strip_keys("ex.com/*=utm", "www.ex.com", "/p?utm=1", keys,
sizeof(keys)) != NULL);
assertf(hts_query_strip_keys("ex.com/*=utm", "http://www-3.ex.com",
"/p?utm=1", keys, sizeof(keys)) != NULL);
/* last match wins, wholesale: host rule overrides global, no union */
k = hts_query_strip_keys("*=sid\nex.com/*=utm", "ex.com",
"/p?sid=1&utm=2&a=3", keys, sizeof(keys));
assertf(k != NULL && strcmp(k, "utm") == 0);
assertf(strcmp(fil_normalized_filtered("/p?sid=1&utm=2&a=3", dest, k),
"/p?a=3&sid=1") == 0);
k = hts_query_strip_keys("*=sid\nex.com/*=utm", "z.com", "/p?sid=1&a=3", keys,
sizeof(keys));
assertf(k != NULL && strcmp(k, "sid") == 0);
/* whole-key match, not prefix: "utm" must not strip utm_source */
assertf(strcmp(fil_normalized_filtered("/p?utm_source=x&a=1", dest, "utm"),
"/p?a=1&utm_source=x") == 0);
/* "*" drops every param; a fully-stripped single-arg query loses its '?' */
assertf(strcmp(fil_normalized_filtered("/p?a=1&b=2", dest, "*"), "/p") == 0);
assertf(strcmp(fil_normalized_filtered("/p?utm=1", dest, "utm"), "/p") == 0);
/* degenerate forms a=, b, c== (key 'c'); strip c keeps a= and b */
assertf(strcmp(fil_normalized_filtered("/p?a=&b&c==", dest, "c"),
"/p?a=&b") == 0);
/* short key must not strip a longer one: 'c' must not touch 'cc' */
assertf(strcmp(fil_normalized_filtered("/p?cc=1&c=2", dest, "c"),
"/p?cc=1") == 0);
/* repeated key: every occurrence is stripped, not just the first */
assertf(
strcmp(fil_normalized_filtered("/p?foo=42&bar=13&foo=43", dest, "foo"),
"/p?bar=13") == 0);
/* repeated key mixing missing/empty values */
assertf(
strcmp(fil_normalized_filtered("/p?foo&bar=13&foo=42&foo=", dest, "foo"),
"/p?bar=13") == 0);
/* repeated key kept (no match): all occurrences retained, then sorted */
assertf(strcmp(fil_normalized_filtered("/p?foo=42&bar=13&foo=43", dest, "z"),
"/p?bar=13&foo=42&foo=43") == 0);
/* value containing '=': the key is only the part before the first '='. Strip
'foo' drops "foo=42=17" whole; the '=' in the value is not a delimiter. */
assertf(strcmp(fil_normalized_filtered("/p?foo=42=17&bar=", dest, "foo"),
"/p?bar=") == 0);
/* keeping it preserves the embedded '=' verbatim */
assertf(strcmp(fil_normalized_filtered("/p?foo=42=17&bar=", dest, "bar"),
"/p?foo=42=17") == 0);
/* a value segment is not a key: stripping "42" must not touch foo=42=17 */
assertf(strcmp(fil_normalized_filtered("/p?foo=42=17", dest, "42"),
"/p?foo=42=17") == 0);
/* Idempotency: the read path re-normalizes an already-normalized fil, so the
result must be a fixpoint or dedup misses (catches a dropped empty/trailing
arg like "?&&", "a&"). */
{
static const char *const qs[] = {"/p?a=&b&c==",
"/p?a&&b",
"/p?&a",
"/p?a&",
"/p?",
"/p?=v",
"/p?&&",
"/p?b=2&a=1",
"/p?utm=x&",
"/p?&utm=x",
"/p?foo=42&bar=13&foo=43",
"/p?foo&bar=13&foo=42&foo=",
"/p?foo=42=17&bar="};
static const char *const strips[] = {NULL, "z", "utm", "*", "a", "foo"};
char once[1024], twice[1024];
size_t i, j;
for (i = 0; i < sizeof(qs) / sizeof(qs[0]); i++) {
for (j = 0; j < sizeof(strips) / sizeof(strips[0]); j++) {
fil_normalized_filtered(qs[i], once, strips[j]);
fil_normalized_filtered(once, twice, strips[j]);
assertf(strcmp(once, twice) == 0);
}
}
}
printf("strip-query self-test OK\n");
return 0;
}
/* -%u url-hack split (#271): each sub-flag must toggle independently. */
static int st_urlhack(httrackp *opt, int argc, char **argv) {
(void) argc;
(void) argv;
#define EQ(aa, fa, ab, fb) hash_url_equals(opt, aa, fa, ab, fb)
/* urlhack on, no opt-outs: www, // and query order all collapse */
opt->urlhack = HTS_TRUE;
opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = HTS_FALSE;
assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
/* keep-www-prefix: host off; // and query still collapse */
opt->no_www_dedup = HTS_TRUE;
assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
opt->no_www_dedup = HTS_FALSE;
/* keep-double-slashes: // significant; www, query order still collapse */
opt->no_slash_dedup = HTS_TRUE;
assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
opt->no_slash_dedup = HTS_FALSE;
/* keep-query-order: query order significant; www and // still collapse */
opt->no_query_dedup = HTS_TRUE;
assertf(!EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
opt->no_query_dedup = HTS_FALSE;
/* all opt-outs == urlhack off entirely */
opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = HTS_TRUE;
assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
assertf(!EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
opt->urlhack = HTS_FALSE;
opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = HTS_FALSE;
assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
#undef EQ
printf("urlhack self-test OK\n");
return 0;
}
/* ------------------------------------------------------------ */
/* Registry: name -> handler, with a usage hint and a one-line description. */
/* ------------------------------------------------------------ */
@@ -1068,6 +1300,10 @@ static const struct selftest_entry {
"size-aware filter verdict (negative size = unknown/scan time)",
st_filtersize},
{"simplify", "<path>", "collapse ./ and ../ in a path", st_simplify},
{"stripquery", "", "--strip-query pattern/key stripping self-test",
st_stripquery},
{"urlhack", "", "-%u url-hack sub-flag (www/slash/query) self-test",
st_urlhack},
{"mime", "<filename>", "MIME type for a filename", st_mime},
{"charset", "<charset> <string>",
"convert a string to UTF-8 from a charset", st_charset},
@@ -1080,6 +1316,7 @@ static const struct selftest_entry {
{"strsafe", "[overflow|overflow-buff [str]]", "bounded string-op self-test",
st_strsafe},
{"copyopt", "", "copy_htsopt option-copy self-test", st_copyopt},
{"pause", "", "randomized inter-file pause target self-test", st_pause},
{"relative", "<link> <curr-file>", "relative link between two paths",
st_relative},
{"resolve", "<link> <adr> <fil>", "resolve a link against an origin",

View File

@@ -90,4 +90,16 @@ refused "dangling-quote argument not refused cleanly"
run_only "$tmp/q-lone" '"'
refused "lone-quote argument not refused cleanly"
# --pause (#185): valid MIN[:MAX] accepted; malformed, reversed, over-range and
# non-finite values refused cleanly. NaN defeats naive `<`/`>` checks (it
# compares false to everything), so it must not slip through to the int cast.
run "$tmp/pause-ok" --pause 0.2:0.4
accepted "$tmp/pause-ok" "#185: valid --pause range rejected"
run "$tmp/pause-fix" --pause 0.2
accepted "$tmp/pause-fix" "#185: valid fixed --pause rejected"
for bad in nan nan:5 5:nan inf 10:5 99999; do
run "$tmp/pause-bad" --pause "$bad"
refused "#185: invalid --pause '$bad' not refused cleanly"
done
exit 0

View File

@@ -18,6 +18,21 @@ ent '&amp;' '&'
ent '&lt;&gt;' '<>'
ent '&eacute;' 'é'
# HTML5 names from the WHATWG set
ent '&hellip;' '…'
ent '&bigcup;' ''
# longest name (31 chars) exercises the name-length cap
ent '&CounterClockwiseContourIntegral;' '∳'
# astral codepoint -> 4-byte UTF-8
ent '&Aopf;' '𝔸'
# multi-codepoint refs are skipped at generation, so left verbatim
ent '&fjlig;' '&fjlig;'
# common HTML4 names still decode (regression guard against accidental drops)
ent '&copy;&reg;&trade;' '©®™'
ent '&mdash;&ndash;' '—–'
ent '&alpha;&beta;' 'αβ'
# numeric: decimal and hex
ent '&#65;&#66;' 'AB'
ent '&#x41;' 'A'

View File

@@ -50,27 +50,54 @@ match '*foo*bar' 'foozbar'
# '?' is the query-string marker, not a single-char wildcard
nomatch 'a?c' 'abc'
# backslash escapes a metacharacter inside a class so it is matched literally.
# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
# both X and '\'. These assertions pin that behavior.
# Inside a class, backslash escapes the next char as a literal member (#148):
# '\X' matches X only (not '\'), and an escaped ']' is a member, not the terminator.
match '*[\*]' '*'
match '*[\*]' "\\"
nomatch '*[\*]' 'a'
nomatch '*[\*]' "\\"
match '*[\\]' "\\"
nomatch '*[\\]' 'a'
nomatch '*[\\]' '*'
match '*[\[]' '['
match '*[\[]' "\\"
nomatch '*[\[]' 'a'
nomatch '*[\[]' "\\"
match '*[\]]' ']'
nomatch '*[\]]' "\\"
# A literal ']' cannot be a class member: the class parser stops at the first
# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x'
# '*[\[\]]' is "the [ or ] character", as the filter guide documents.
match '*[\[\]]' '['
match '*[\[\]]' ']'
nomatch '*[\[\]]' 'a'
match '*[\[,\]]' '[' # comma between members is optional
match '*[\[,\]]' ']'
match '*[a,\[]' 'a' # an escaped member no longer eats the preceding one
match '*[a,\[]' '['
# Escape is decoded before the range/separator/size checks, so '\-' '\,' '\<'
# are literal members, not operators.
match '*[a\-z]' 'a'
match '*[a\-z]' 'z'
nomatch '*[a\-z]' 'b' # not the a..z range
match '*[\,]' ','
nomatch '*[\,]' "\\" # the escape must not leak '\' into the class
match '*[\<]' '<'
nomatch '*[\<]' "\\"
match '*[\[,\],a]' '['
match '*[\[,\],a]' ']'
match '*[\[,\],a]' 'a'
# A truncated range '*[a-' is the literal members {a,-}; the parser must not
# read past the end decoding it (was a 1-byte heap over-read in the range arm).
match '*[a-' 'a'
nomatch '*[a-' 'b'
# *(...) matches exactly one char from the class; *[...] matches a run.
match '*(a,b)' 'a'
nomatch '*(a,b)' 'aa'
nomatch '*(a,b)' 'c'
# documented composite filters (filters.html)
match 'www.*[path].com/*[path].zip' 'www.foo.com/a/b.zip'
nomatch 'www.*[path].com/*[path].zip' 'www.foo.com/a/b.tar'
match '*.html*[]' 'page.html'
nomatch '*.html*[]' 'page.html?x=1' # *[] forbids the trailing query
# Size-based rules (-#test=filtersize <size> <string> <filter...>): a negative size
# means the size is still unknown (scan time). A size exclusion must stay neutral

15
tests/01_engine-pause.test Executable file
View File

@@ -0,0 +1,15 @@
#!/bin/bash
#
# --pause (#185): the inter-file pause target must stay in [min,max] and spread
# across it (a per-call rand() would collapse it toward min). Driven by the
# in-process 'httrack -#test=pause' test. POSIX-portable ($(BASH) is /bin/sh on macOS).
set -eu
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=pause run)
test "$out" = "pause: OK" || {
echo "expected 'pause: OK', got: $out" >&2
exit 1
}

View File

@@ -0,0 +1,8 @@
#!/bin/bash
#
set -euo pipefail
# --strip-query: pattern-scoped query-key stripping for dedup. All assertions
# live in the engine self-test (hts_query_strip_keys + fil_normalized_filtered).
httrack -O /dev/null -#test=stripquery | grep -q "strip-query self-test OK"

View File

@@ -0,0 +1,8 @@
#!/bin/bash
#
set -euo pipefail
# -%u url-hack split (#271): www / // / query-order dedup toggle independently.
# All assertions live in the engine self-test (hash compare flag resolution).
httrack -O /dev/null -#test=urlhack run | grep -q "urlhack self-test OK"

23
tests/26_local-strip-query.test Executable file
View File

@@ -0,0 +1,23 @@
#!/bin/bash
#
# End-to-end --strip-query (#112): two links to one resource differing only by
# ?utm_source dedup to a single saved file (2 files written: index + resource);
# the control crawl without the option keeps both variants (3 files). Locks the
# CLI->opt->hash plumbing the engine self-test can't reach.
set -e
: "${top_srcdir:=..}"
# stripped: the two ?utm_source variants collapse to one resource
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
httrack 'BASEURL/stripquery/index.html' --strip-query 'utm_source'
# control: no stripping -> both query-named variants are saved
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
httrack 'BASEURL/stripquery/index.html'
# strip still applies with url-hack off (-%u0): exercises the urlhack-off
# savename branch, which must normalize the dedup key the same way the hash does
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
httrack 'BASEURL/stripquery/index.html' -%u0 --strip-query 'utm_source'

View File

@@ -0,0 +1,22 @@
#!/bin/bash
#
# End-to-end --cookies-file (#215): /gated/secret.php needs a cookie no page
# ever Set-Cookies, so it is reachable only when the option preloads it from a
# Netscape cookies.txt. Locks the CLI->opt->cookie_load->wire plumbing.
set -e
: "${top_srcdir:=..}"
# preloaded cookie -> secret page is served. -o0 means a 500 leaves no file, so
# --found/--files only hold when the secret is genuinely fetched (200).
bash "$top_srcdir/tests/local-crawl.sh" --cookie 'session=opensesame' \
--errors 0 --files 2 \
--found 'gated/index.html' --found 'gated/secret.html' \
httrack 'BASEURL/gated/index.php' -o0
# control: without the cookie the secret 500s; -o0 suppresses the error page so
# its absence is real (error + missing file)
bash "$top_srcdir/tests/local-crawl.sh" --errors 1 \
--found 'gated/index.html' --not-found 'gated/secret.html' \
httrack 'BASEURL/gated/index.php' -o0

36
tests/28_local-pause.test Executable file
View File

@@ -0,0 +1,36 @@
#!/bin/bash
#
# --pause (#185): a fixed inter-file delay must slow a multi-file crawl. Measure
# the same crawl with and without --pause and compare: the harness overhead
# cancels, leaving only the pause. Integer seconds keep it portable (BSD date
# has no %N); a lower bound is not timing-flaky since a pause only adds time.
set -e
: "${top_srcdir:=..}"
# python3 runs the local server (mirror local-crawl.sh); skip when absent, else
# run() swallows its exit-77 and the serverless 0s/0s crawl looks like a fail.
command -v python3 >/dev/null || {
echo "python3 not found; skipping local crawl tests"
exit 77
}
run() { # echoes the wall-clock seconds of one crawl
local t0 t1
t0=$(date +%s)
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
httrack 'BASEURL/types/index.html' -c1 "$@" >/dev/null 2>&1
t1=$(date +%s)
echo $((t1 - t0))
}
base=$(run)
paused=$(run --pause 0.5)
delta=$((paused - base))
echo "crawl: ${base}s, with --pause 0.5: ${paused}s (delta ${delta}s)"
if [ "$delta" -lt 2 ]; then
echo "FAIL: --pause did not delay the crawl (delta ${delta}s)" >&2
exit 1
fi

View File

@@ -0,0 +1,11 @@
#!/bin/bash
# Issue #204: a 302 Location with a #fragment must drop the fragment before the
# target is fetched. The server is strict (400 on a '#' in the request-target),
# so a leaked fragment logs an error and the target is never saved.
set -e
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'redir/target.html' \
httrack 'BASEURL/redir/index.html'

View File

@@ -5,6 +5,7 @@ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
local-crawl.sh local-server.py server.crt server.key \
server-root/simple/basic.html server-root/simple/link.html \
server-root/stripquery/index.html server-root/stripquery/a.html \
fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT =
@@ -40,12 +41,15 @@ TESTS = \
01_engine-idna.test \
01_engine-mime.test \
01_engine-parse.test \
01_engine-pause.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-savename.test \
01_engine-selftest-dispatch.test \
01_engine-simplify.test \
01_engine-stripquery.test \
01_engine-strsafe.test \
01_engine-urlhack.test \
02_manpage-regen.test \
02_update-cache.test \
10_crawl-simple.test \
@@ -68,6 +72,10 @@ TESTS = \
22_local-broken-size.test \
23_local-errpage.test \
24_local-resume-overlap.test \
25_local-mime-exclude.test
25_local-mime-exclude.test \
26_local-strip-query.test \
27_local-cookies-file.test \
28_local-pause.test \
29_local-redirect-fragment.test
CLEANFILES = check-network_sh.cache

View File

@@ -12,11 +12,14 @@
# the mirror directory name.
#
# Usage:
# bash local-crawl.sh [--tls] [--root DIR] \
# bash local-crawl.sh [--tls] [--root DIR] [--cookie NAME=VALUE ...] \
# --errors N --files N --found PATH ... --directory PATH ... \
# --log-found REGEX ... --log-not-found REGEX ... \
# httrack BASEURL/some/path [httrack-args...]
# --log-found/--log-not-found grep (ERE) the crawl's hts-log.txt.
# --cookie writes a Netscape cookies.txt (scoped to the discovered host:port,
# which the ephemeral port forces into the cookie domain) and passes it to
# httrack via --cookies-file, to exercise preloaded cookies.
set -u
@@ -85,6 +88,7 @@ tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create
# --- parse leading control flags --------------------------------------------
declare -a audit=()
declare -a cookies=()
scheme=http
pos=0
args=("$@")
@@ -105,6 +109,10 @@ while test "$pos" -lt "$nargs"; do
pos=$((pos + 1))
root="${args[$pos]}"
;;
--cookie)
pos=$((pos + 1))
cookies+=("${args[$pos]}")
;;
--errors | --files)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
@@ -158,6 +166,17 @@ while test "$pos" -lt "$nargs"; do
pos=$((pos + 1))
done
# --- materialize any --cookie entries into a cookies.txt ---------------------
if test "${#cookies[@]}" -gt 0; then
jar="${tmpdir}/cookies.txt"
: >"$jar"
for spec in "${cookies[@]}"; do
printf '127.0.0.1:%s\tTRUE\t/\tFALSE\t1999999999\t%s\t%s\n' \
"$port" "${spec%%=*}" "${spec#*=}" >>"$jar"
done
hts+=(--cookies-file "$jar")
fi
# --- run httrack -------------------------------------------------------------
which httrack >/dev/null || die "could not find httrack"
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')

View File

@@ -110,6 +110,19 @@ class Handler(SimpleHTTPRequestHandler):
return self.fail_cookie("badger")
self.send_html("\tThis is a test.")
# --cookies-file (#215): the secret page needs a cookie no page ever sets,
# so it is reachable only when --cookies-file preloads it.
GATE_COOKIE = ("session", "opensesame")
def route_gated_index(self):
self.send_html('\tThis is a <a href="secret.php">link</a>')
def route_gated_secret(self):
name, value = self.GATE_COOKIE
if self.request_cookies().get(name) != value:
return self.fail_cookie(name)
self.send_html("\tThis is the secret.")
def route_robots(self):
body = b"User-agent: *\nDisallow:\n"
self.send_response(200)
@@ -341,10 +354,27 @@ class Handler(SimpleHTTPRequestHandler):
if self.command != "HEAD":
self.wfile.write(body)
# 302 whose Location carries a #fragment (#204): the fragment is a UA anchor
# that must be dropped before the target is fetched. A leaked '#' reaches the
# strict-server guard below and 400s.
def route_redir_index(self):
self.send_html('\t<a href="go.php">go</a>')
def route_redir_go(self):
self.send_response(302, "Found")
self.send_header("Location", "target.html#section")
self.send_header("Content-Length", "0")
self.end_headers()
def route_redir_target(self):
self.send_raw(b"<html><body>redirect target</body></html>\n", "text/html")
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
"/cookies/third.php": route_third,
"/gated/index.php": route_gated_index,
"/gated/secret.php": route_gated_secret,
"/robots.txt": route_robots,
"/types/index.html": route_types_index,
"/types/control.php": route_types,
@@ -376,10 +406,23 @@ class Handler(SimpleHTTPRequestHandler):
"/mimex/index.html": route_mimex_index,
"/mimex/blob.pdf": route_mimex_blob,
"/mimex/real.html": route_mimex_real,
"/redir/index.html": route_redir_index,
"/redir/go.php": route_redir_go,
"/redir/target.html": route_redir_target,
}
# --- dispatch ----------------------------------------------------------
def reject_fragment(self):
# Strict server: a '#' in the request-target is the client failing to
# drop a fragment (#204). RFC 3986 forbids it on the wire; answer 400.
if "#" in self.path:
self.send_response(400, "Bad Request")
self.send_header("Content-Length", "0")
self.end_headers()
return True
return False
def dispatch(self):
self._set_cookies = []
path = urlsplit(self.path).path
@@ -391,10 +434,14 @@ class Handler(SimpleHTTPRequestHandler):
return False
def do_GET(self):
if self.reject_fragment():
return
if not self.dispatch():
super().do_GET()
def do_HEAD(self):
if self.reject_fragment():
return
if not self.dispatch():
super().do_HEAD()

View File

@@ -0,0 +1 @@
<html><body>resource A</body></html>

View File

@@ -0,0 +1,5 @@
<html><body>
Two links to one resource, differing only by a tracking parameter.
<a href="a.html?utm_source=x">x</a>
<a href="a.html?utm_source=y">y</a>
</body></html>