Compare commits

4 Commits

Author SHA1 Message Date
Xavier Roche
896a589f94 Add --pause to space out file downloads by a random delay (#185)
A new --pause MIN[:MAX] (seconds, -%G) waits a random MIN..MAX between
files so a crawl looks less like a bot and is gentler on the server; a
single value is a fixed delay. Disabled by default.

It reuses the existing non-blocking launch gate
(back_pluggable_sockets_strict): rather than Sleep() -- which would freeze
the single select() pump and stall the other in-flight transfers -- the
gate just withholds new launches until the delay elapses, one file per
gap. The per-gap target is derived from the last-request timestamp so it
stays stable across the many gate evaluations within a gap yet rerolls on
each launch; sampling rand() per evaluation would instead bias the
realized delay toward MIN.
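
The bias described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not HTTrack code: `lcg_next`, `draw_ms` and `mean_realized_ms` are hypothetical helpers, the gate is assumed to be polled every `step_ms`, and a plain LCG stands in for both rand() and the timestamp-derived target.

```c
#include <assert.h>
#include <stdint.h>

/* Small deterministic generator (hypothetical helper, not HTTrack code). */
static uint32_t lcg_next(uint32_t *state) {
  *state = *state * 1664525u + 1013904223u;
  return *state >> 8; /* drop the weakest low bits */
}

/* Uniform-ish draw in [min_ms, max_ms]. */
static int draw_ms(uint32_t *state, int min_ms, int max_ms) {
  return min_ms + (int) (lcg_next(state) % (uint32_t) (max_ms - min_ms + 1));
}

/* Mean realized delay over `gaps` gaps when the gate is polled every step_ms.
   resample != 0: a fresh target is drawn at every poll (the biased variant).
   resample == 0: one target is drawn per gap and reused at every poll. */
static double mean_realized_ms(int resample, int gaps, int step_ms,
                               int min_ms, int max_ms) {
  uint32_t state = 12345u;
  double total = 0.0;
  int g;

  for (g = 0; g < gaps; g++) {
    int target = draw_ms(&state, min_ms, max_ms);
    int elapsed = 0;

    for (;;) {
      if (resample)
        target = draw_ms(&state, min_ms, max_ms);
      if (elapsed >= target)
        break; /* the gate opens: this is the realized delay */
      elapsed += step_ms;
    }
    total += elapsed;
  }
  return total / gaps;
}
```

With bounds of 5000 and 10000 ms and a 10 ms poll, the per-gap variant averages near the middle of the range, while the resampled variant opens shortly after MIN: every poll past MIN is another chance to draw a target that has already elapsed.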

Two int fields appended at the httrackp tail (ABI-stable, no soname bump).
Covered by a pure-function self-test (range + spread, with teeth against
the min-bias bug) and a local-server crawl that asserts the pause slows a
multi-file mirror.

Closes #185

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-27 23:55:35 +02:00
Xavier Roche
5be8ba4bbd Add --cookies-file to preload a Netscape cookies.txt (#215) (#437)
Mirroring a site behind a login meant either re-implementing the auth
flow or dropping a file literally named cookies.txt into the output or
working directory, the only two places the engine looked. This adds a
CLI option to point at an arbitrary Netscape/Mozilla cookies.txt, so a
session exported from a browser (the "Get cookies.txt" extensions write
exactly this format) is replayed on the crawl and authenticated pages
come down.

The plumbing already existed: cookie_load parses the format into the
shared jar and the request path sends every matching cookie. The new
opt->cookies_file is loaded last, after the mirror/CWD defaults, so a
user-supplied value wins on a name/domain/path conflict. The field is
appended at the tail of httrackp, so the exported ABI is unchanged.
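
Why load order decides the winner can be sketched with a toy jar keyed on (domain, path, name). Everything here is hypothetical (`jar`, `jar_find`, `jar_set`); it only illustrates last-write-wins under that assumption and is not how cookie_load stores cookies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define JAR_MAX 16

typedef struct {
  char domain[64];
  char path[64];
  char name[64];
  char value[64];
} jar_cookie;

typedef struct {
  jar_cookie items[JAR_MAX];
  int count;
} jar;

/* Find the entry with the same (domain, path, name) key, or NULL. */
static jar_cookie *jar_find(jar *j, const char *domain, const char *path,
                            const char *name) {
  int i;

  for (i = 0; i < j->count; i++) {
    jar_cookie *c = &j->items[i];

    if (strcmp(c->domain, domain) == 0 && strcmp(c->path, path) == 0 &&
        strcmp(c->name, name) == 0)
      return c;
  }
  return NULL;
}

/* Insert a cookie, or overwrite the value stored under an existing key.
   Overlong fields are truncated. Returns 0, or -1 when the jar is full. */
static int jar_set(jar *j, const char *domain, const char *path,
                   const char *name, const char *value) {
  jar_cookie *c = jar_find(j, domain, path, name);

  if (c == NULL) {
    if (j->count >= JAR_MAX)
      return -1;
    c = &j->items[j->count++];
    snprintf(c->domain, sizeof c->domain, "%s", domain);
    snprintf(c->path, sizeof c->path, "%s", path);
    snprintf(c->name, sizeof c->name, "%s", name);
  }
  snprintf(c->value, sizeof c->value, "%s", value);
  return 0;
}
```

Setting the same key first from a default file and then from the user-supplied file leaves a single entry holding the user's value, which is the property the load order relies on.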

Cookies key on host[:port], so a bare-domain file matches a normal crawl
of a default-port site; only an explicit-port URL needs the port in the
cookie domain. Covered by 27_local-cookies-file.test: a gated page returns
a 500 unless the request carries a cookie that no page ever sets, so it is
reachable only once the file preloads it (with -o0, so that the absence of
a 500 error page is meaningful),
plus a no-cookie control. The local-crawl harness grows a --cookie helper
that writes a port-scoped jar. The copyopt self-test also gains a String
round-trip so the exported copy_htsopt path for the new field is covered.
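
For reference, a Netscape cookies.txt record is one line of seven tab-separated fields: domain, include-subdomains flag, path, secure flag, expiry (epoch seconds), name, value. The sketch below formats such a line and puts the port into the domain field only when one is given, mirroring the host[:port] keying described above. `format_cookie_line` is a hypothetical helper, and the fixed flags and expiry are illustrative values taken from the test harness in this change.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one cookies.txt record into out. A port > 0 is appended to the
   domain as host:port; port <= 0 leaves a bare domain.
   Returns the line length, or -1 if a buffer is too small. */
static int format_cookie_line(char *out, size_t size, const char *host,
                              int port, const char *name, const char *value) {
  char domain[256];
  int n;

  if (port > 0)
    n = snprintf(domain, sizeof domain, "%s:%d", host, port);
  else
    n = snprintf(domain, sizeof domain, "%s", host);
  if (n < 0 || (size_t) n >= sizeof domain)
    return -1;

  /* domain, subdomains flag, path, secure flag, expiry, name, value */
  n = snprintf(out, size, "%s\tTRUE\t/\tFALSE\t1999999999\t%s\t%s\n",
               domain, name, value);
  if (n < 0 || (size_t) n >= size)
    return -1;
  return n;
}
```

A crawl of a default-port site would use the bare-domain form; the port-scoped form is what the local test server's ephemeral port requires.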

Closes #215

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 22:57:05 +02:00
Xavier Roche
247a46068e debian: lead webhttrack browser dep with chromium, not firefox-esr (#436)
webhttrack's "firefox-esr | chromium | www-browser" made the httrack
source inherit firefox-esr's autoremoval clock: britney keys off the
first alternative of a disjunction, so every firefox-esr RC bug
(currently #1127569) dragged httrack toward removal even though
webhttrack stays installable via the other alternatives.

Lead with chromium instead. lintian requires a real package as the
first alternative (virtual-package-depends-without-real-package-depends),
so the virtual www-browser cannot go first; chromium is real, keeps the
dep lintian-clean, and makes britney track chromium rather than the
RC-bug-prone firefox-esr.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 20:44:30 +02:00
Xavier Roche
669947cd23 Split -%u URL Hacks into independent www/slash/query toggles (#271) (#435)
-%u (--urlhack) bundled three dedup normalizations under one switch:
www.host == host, redundant // collapse, and query-argument reordering.
A mirror that needed one but not another (e.g. keep www. distinct) had to
turn the whole umbrella off. Add three opt-out sub-options, defaulting to
the umbrella so existing -%u/-%u0 behavior is unchanged:

  --keep-www-prefix      keep www.foo.com distinct from foo.com   (-%j)
  --keep-double-slashes  keep redundant // in the path            (-%o)
  --keep-query-order     keep query-argument order significant    (-%y)

The split is resolved once in hash_init() into norm_host/norm_slash/
norm_query and threaded through the dedup hash (htshash.c), the savename
lookup key (htsname.c) and the redirect-loop diagnostic (htsparse.c) so
all three stay consistent. fil_normalized() gains an internal
fil_normalized_ex(do_slash, do_query) core; the public
fil_normalized()/fil_normalized_filtered() keep their signatures.
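
As an illustration of three independently switchable normalizations, here is a minimal sketch of a dedup-key builder. It is not the htshash.c code: `norm_flags` and `dedup_key` are hypothetical, the input is assumed to be a scheme-less "host/path?query" string, and query arguments are ordered with plain strcmp, which may differ from the engine's ordering.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  int norm_host;  /* treat www.host as host */
  int norm_slash; /* collapse redundant slashes */
  int norm_query; /* sort query arguments */
} norm_flags;

static int cmp_str(const void *a, const void *b) {
  return strcmp(*(char *const *) a, *(char *const *) b);
}

/* Build a dedup key for a scheme-less "host/path?query" URL.
   Returns 0 on success, -1 if the input does not fit. */
static int dedup_key(const char *url, norm_flags f, char *out, size_t size) {
  char buf[512];
  char *args[32];
  char *query, *p;
  size_t len = 0;
  int nargs = 0, i;

  if (strlen(url) >= sizeof buf || size < strlen(url) + 1)
    return -1;
  if (f.norm_host && strncmp(url, "www.", 4) == 0)
    url += 4;
  strcpy(buf, url);
  query = strchr(buf, '?');
  if (query != NULL)
    *query++ = '\0';

  /* Copy host and path, optionally skipping a slash that follows a slash. */
  for (i = 0; buf[i] != '\0'; i++) {
    if (f.norm_slash && buf[i] == '/' && len > 0 && out[len - 1] == '/')
      continue;
    out[len++] = buf[i];
  }

  if (query != NULL) {
    out[len++] = '?';
    if (f.norm_query) {
      /* Split on '&' in place, sort, then re-join. */
      for (p = query; p != NULL;) {
        char *amp = strchr(p, '&');

        if (nargs >= 32)
          return -1;
        if (amp != NULL)
          *amp++ = '\0';
        args[nargs++] = p;
        p = amp;
      }
      qsort(args, (size_t) nargs, sizeof args[0], cmp_str);
      for (i = 0; i < nargs; i++) {
        if (i > 0)
          out[len++] = '&';
        memcpy(out + len, args[i], strlen(args[i]));
        len += strlen(args[i]);
      }
    } else {
      memcpy(out + len, query, strlen(query));
      len += strlen(query);
    }
  }
  out[len] = '\0';
  return 0;
}
```

Because each flag gates exactly one step, clearing `norm_host` alone keeps www.foo.com distinct from foo.com while slashes and query order are still normalized. Using one such function for both storing and looking up keys is what keeps the two sides consistent.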

Normalization (slash/query) now follows urlhack and its sub-flags
uniformly, while --strip-query stays orthogonal. So with urlhack off,
strip-query strips keys without sorting the remainder; the url_savename
urlhack-off branch is moved to the same do_slash=0/do_query=0 normalizer
the hash uses, so a URL is always looked up under the key it was stored
with (a self-lookup mismatch this otherwise introduced).

http/https are always merged in the dedup key (the scheme is stripped
regardless of -%u), so that part of the request needs no toggle.

The opt-outs are spelled positively (--keep-*) because httrack's generic
--no<opt> prefix only appends the disabling "0" for parametered options,
not "single" booleans, so --nowww-dedup would silently no-op.

opt grows three hts_boolean fields appended at the struct tail (offsets
stable, no soname bump, matching the strip_query addition in #112).
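
The "offsets stable" claim can be checked mechanically with offsetof. The two structs below are hypothetical stand-ins for the option struct before and after the change (int replaces hts_boolean); they are not the real httrackp layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout as an older client would have compiled it. */
struct opts_before {
  int retry;
  int maxsoc;
  int urlhack;
};

/* Hypothetical layout after the change: new fields only at the tail. */
struct opts_after {
  int retry;
  int maxsoc;
  int urlhack;
  int no_www_dedup;
  int no_slash_dedup;
  int no_query_dedup;
};

/* Non-zero when every pre-existing field kept its offset. */
static int existing_offsets_unchanged(void) {
  return offsetof(struct opts_before, retry) ==
             offsetof(struct opts_after, retry) &&
         offsetof(struct opts_before, maxsoc) ==
             offsetof(struct opts_after, maxsoc) &&
         offsetof(struct opts_before, urlhack) ==
             offsetof(struct opts_after, urlhack);
}

/* Non-zero when the struct grew, which tail appends always cause. */
static int struct_grew(void) {
  return sizeof(struct opts_after) > sizeof(struct opts_before);
}
```

Stable offsets keep old field accesses valid, but the struct still grows, so this only stays safe when callers let the library allocate it (for example through hts_create_opt) rather than sizing it themselves.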

Tested by a -#test=urlhack engine self-test (hash_url_equals over each
flag combination) plus a -%u0 + --strip-query crawl case exercising the
urlhack-off savename branch.

Closes #271

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 20:26:28 +02:00
17 changed files with 287 additions and 4 deletions

2
debian/control vendored

@@ -30,7 +30,7 @@ Description: Copy websites to your computer (Offline browser)
Package: webhttrack
Architecture: any
Multi-Arch: foreign
-Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, firefox-esr | chromium | www-browser
+Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, chromium | firefox-esr | www-browser
Replaces: webhttrack-common (<< 3.43.9-2)
Breaks: webhttrack-common (<< 3.43.9-2)
Suggests: httrack, httrack-doc

@@ -24,6 +24,7 @@ httrack \- offline browser : copy websites to a local directory
[ \fB\-EN, \-\-max\-time[=N]\fR ]
[ \fB\-AN, \-\-max\-rate[=N]\fR ]
[ \fB\-%cN, \-\-connection\-per\-second[=N]\fR ]
[ \fB\-%G, \-\-pause\fR ]
[ \fB\-GN, \-\-max\-pause[=N]\fR ]
[ \fB\-cN, \-\-sockets[=N]\fR ]
[ \fB\-TN, \-\-timeout[=N]\fR ]
@@ -49,6 +50,7 @@ httrack \- offline browser : copy websites to a local directory
[ \fB\-%p, \-\-preserve\fR ]
[ \fB\-%T, \-\-utf8\-conversion\fR ]
[ \fB\-bN, \-\-cookies[=N]\fR ]
[ \fB\-%K, \-\-cookies\-file\fR ]
[ \fB\-u, \-\-check\-type[=N]\fR ]
[ \fB\-j, \-\-parse\-java[=N]\fR ]
[ \fB\-sN, \-\-robots[=N]\fR ]
@@ -154,6 +156,8 @@ maximum mirror time in seconds (60=1 minute, 3600=1 hour) (\-\-max\-time[=N])
maximum transfer rate in bytes/seconds (1000=1KB/s max) (\-\-max\-rate[=N])
.IP \-%cN
maximum number of connections/seconds (*%c10) (\-\-connection\-per\-second[=N])
.IP \-%G
random pause of MIN[:MAX] seconds between files (e.g. %G5:10) (\-\-pause <param>)
.IP \-GN
pause transfer if N bytes reached, and wait until lock file is deleted (\-\-max\-pause[=N])
.SS Flow control:
@@ -212,6 +216,8 @@ links conversion to UTF\-8 (\-\-utf8\-conversion)
.SS Spider options:
.IP \-bN
accept cookies in cookies.txt (0=do not accept,* 1=accept) (\-\-cookies[=N])
.IP \-%K
load extra cookies from a Netscape cookies.txt (\-\-cookies\-file <param>)
.IP \-u
check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (\-\-check\-type[=N])
.IP \-j

@@ -112,6 +112,10 @@ const char *hts_optalias[][4] = {
{"include-query-string", "-%q", "single", ""},
{"strip-query", "-%g", "param1",
"strip [host/pattern=]key1,key2,... from URLs"},
{"cookies-file", "-%K", "param1",
"load extra cookies from a Netscape cookies.txt"},
{"pause", "-%G", "param1",
"random pause of MIN[:MAX] seconds between files"},
{"generate-errors", "-o", "single", ""},
{"do-not-generate-errors", "-o0", "single", ""},
{"purge-old", "-X", "param", ""},

@@ -35,6 +35,7 @@ Please visit our Website: http://www.httrack.com
#include <fcntl.h>
#include <ctype.h>
#include <stdint.h> /* uint64_t for the pause mixer (already a hard dep via md5.h) */
/* File defs */
#include "htscore.h"
@@ -523,9 +524,12 @@ int httpmirror(char *url1, httrackp * opt) {
opt->cookie = &cookie;
cookie.max_len = 30000; // max len
strcpybuff(cookie.data, "");
-// Charger cookies.txt par défaut ou cookies.txt du miroir
+// Load the mirror's cookies.txt, then the one in the current directory
cookie_load(opt->cookie, StringBuff(opt->path_log), "cookies.txt");
cookie_load(opt->cookie, "", "cookies.txt");
// A user-supplied cookie file is merged last so it wins on conflicts
if (strnotempty(StringBuff(opt->cookies_file)))
cookie_load(opt->cookie, "", StringBuff(opt->cookies_file));
} else
opt->cookie = NULL;
@@ -3311,6 +3315,21 @@ HTS_INLINE int back_fillmax(struct_back * sback, httrackp * opt,
return -1; /* plus de place */
}
/* Seed-derived: stable within a gap, rerolls per launch; a per-call rand()
would bias the delay toward min_ms (see header). Jitter, not crypto. */
int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms) {
uint64_t z = (uint64_t) seed;
if (max_ms <= min_ms)
return min_ms;
/* SplitMix64 finalizer: scrambles the low-entropy ms timestamp. */
z += 0x9E3779B97F4A7C15ULL;
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
z ^= z >> 31;
return min_ms + (int) (z % (uint64_t) (max_ms - min_ms + 1));
}
int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
int n = opt->maxsoc - back_nsoc(sback);
@@ -3331,6 +3350,18 @@ int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
}
}
// #185 randomized inter-file pause: non-blocking, one launch per gap
if (n > 0 && opt->pause_max_ms > 0 && HTS_STAT.last_connect > 0) {
TStamp opTime =
HTS_STAT.last_request ? HTS_STAT.last_request : HTS_STAT.last_connect;
TStamp lap = mtime_local() - opTime;
if (lap < hts_pause_target_ms(opTime, opt->pause_min_ms, opt->pause_max_ms))
n = 0;
else
n = 1;
}
return n;
}
@@ -3742,6 +3773,14 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (StringNotEmpty(from->strip_query))
StringCopyS(to->strip_query, from->strip_query);
if (StringNotEmpty(from->cookies_file))
StringCopyS(to->cookies_file, from->cookies_file);
if (from->pause_max_ms > 0) {
to->pause_min_ms = from->pause_min_ms;
to->pause_max_ms = from->pause_max_ms;
}
if (from->retry > -1)
to->retry = from->retry;

@@ -418,6 +418,10 @@ int back_pluggable_sockets(struct_back * sback, httrackp * opt);
int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt);
/* Randomized inter-file pause target in [min_ms,max_ms] (#185), derived from a
timestamp seed so it is stable within one gap and rerolls per launch. */
int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms);
/* Schedule more links from the heap into free slots. Returns the number queued,
or <=0 if none could be added (no free slot / paused / stopped). */
int back_fill(struct_back * sback, httrackp * opt, cache_back * cache,

@@ -1976,6 +1976,51 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
StringCat(opt->strip_query, argv[na]);
}
break;
case 'K': // cookies-file: extra Netscape cookies.txt to preload
if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
HTS_PANIC_PRINTF(
"Option cookies-file needs a blank space and "
"a cookies.txt path");
printf("Example: --cookies-file \"/home/me/cookies.txt\"\n");
htsmain_free();
return -1;
} else {
na++;
if (strlen(argv[na]) >= 1024) {
HTS_PANIC_PRINTF("Cookie file path too long");
htsmain_free();
return -1;
}
StringCopy(opt->cookies_file, argv[na]);
}
break;
case 'G': // pause: randomized inter-file delay MIN[:MAX] seconds
if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
HTS_PANIC_PRINTF("Option pause needs a blank space and a "
"delay in seconds (MIN[:MAX])");
printf("Example: --pause 5:10\n");
htsmain_free();
return -1;
} else {
double pmin = 0, pmax = 0;
int nf;
na++;
nf = sscanf(argv[na], "%lf:%lf", &pmin, &pmax);
if (nf < 2)
pmax = pmin; /* a single value means a fixed delay */
/* positive-form bounds: NaN fails every comparison, so this
rejects it before the undefined (int)(NaN*1000) cast */
if (nf < 1 || !(pmin >= 0 && pmax >= pmin && pmax <= 86400)) {
HTS_PANIC_PRINTF("Invalid --pause range (expected "
"MIN[:MAX] seconds, 0<=MIN<=MAX<=86400)");
htsmain_free();
return -1;
}
opt->pause_min_ms = (int) (pmin * 1000.0);
opt->pause_max_ms = (int) (pmax * 1000.0);
}
break;
case 't': /* do not change type (ending) of filenames according to the MIME type */
opt->no_type_change = 1;
if (*(com+1)=='0') { opt->no_type_change = 0; com++; }

@@ -521,6 +521,7 @@ void help(const char *app, int more) {
infomsg(" EN maximum mirror time in seconds (60=1 minute, 3600=1 hour)");
infomsg(" AN maximum transfer rate in bytes/seconds (1000=1KB/s max)");
infomsg(" %cN maximum number of connections/seconds (*%c10)");
infomsg(" %G random pause of MIN[:MAX] seconds between files (e.g. %G5:10)");
infomsg
(" GN pause transfer if N bytes reached, and wait until lock file is deleted");
infomsg("");
@@ -572,6 +573,7 @@ void help(const char *app, int more) {
infomsg("");
infomsg("Spider options:");
infomsg(" bN accept cookies in cookies.txt (0=do not accept,* 1=accept)");
infomsg(" %K load extra cookies from a Netscape cookies.txt");
infomsg
(" u check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always)");
infomsg

@@ -6045,6 +6045,9 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->no_query_dedup = HTS_FALSE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
StringCopy(opt->strip_query, "");
StringCopy(opt->cookies_file, "");
opt->pause_min_ms = 0;
opt->pause_max_ms = 0;
opt->ftp_proxy = HTS_TRUE;
opt->convert_utf8 = HTS_TRUE;
StringCopy(opt->filelist, "");
@@ -6190,6 +6193,7 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
StringFree(opt->footer);
StringFree(opt->mod_blacklist);
StringFree(opt->strip_query);
StringFree(opt->cookies_file);
StringFree(opt->path_html);
StringFree(opt->path_html_utf8);

@@ -535,6 +535,10 @@ struct httrackp {
no_www_dedup; /**< with urlhack, keep www.host distinct from host */
hts_boolean no_slash_dedup; /**< with urlhack, keep redundant // in paths */
hts_boolean no_query_dedup; /**< with urlhack, keep query-argument order */
String cookies_file; /**< extra Netscape cookies.txt to preload
(--cookies-file) */
int pause_min_ms; /**< inter-file pause lower bound, ms (0=off, #185) */
int pause_max_ms; /**< inter-file pause upper bound, ms */
};
/* Running statistics for a mirror. */

@@ -899,12 +899,71 @@ static int st_copyopt(httrackp *opt, int argc, char **argv) {
if (to->parseall != HTS_TRUE)
err = 1;
/* String field: a non-empty source deep-copies across, an empty source
leaves the target intact (StringNotEmpty guard). Covers the exported
copy_htsopt String path that no crawl test reaches. */
StringCopy(from->cookies_file, "/tmp/jar.txt");
StringCopy(to->cookies_file, "");
copy_htsopt(from, to);
if (strcmp(StringBuff(to->cookies_file), "/tmp/jar.txt") != 0)
err = 1;
StringCopy(from->cookies_file, "");
copy_htsopt(from, to);
if (strcmp(StringBuff(to->cookies_file), "/tmp/jar.txt") != 0)
err = 1;
/* #185 pause pair: copied when enabled (max>0), the 0 sentinel skips */
from->pause_min_ms = 5000;
from->pause_max_ms = 10000;
to->pause_min_ms = to->pause_max_ms = 0;
copy_htsopt(from, to);
if (to->pause_min_ms != 5000 || to->pause_max_ms != 10000)
err = 1;
from->pause_min_ms = from->pause_max_ms = 0;
copy_htsopt(from, to);
if (to->pause_min_ms != 5000 || to->pause_max_ms != 10000)
err = 1;
hts_free_opt(from);
hts_free_opt(to);
printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");
return err;
}
static int st_pause(httrackp *opt, int argc, char **argv) {
int err = 0, i, seen_low = 0, seen_high = 0;
(void) opt;
(void) argc;
(void) argv;
/* Consecutive-ms seeds (production shape: launch timestamps a few ms apart)
must stay in range and spread, not collapse to a bound -- worst case for a
weak low-bit mixer. */
for (i = 0; i < 10000; i++) {
int t = hts_pause_target_ms((TStamp) (1719500000000LL + i), 5000, 10000);
if (t < 5000 || t > 10000)
err = 1;
seen_low |= (t < 6000);
seen_high |= (t > 9000);
}
if (!seen_low || !seen_high)
err = 1;
if (hts_pause_target_ms(12345, 8000, 8000) != 8000) /* equal bounds = fixed */
err = 1;
/* deterministic: a seed yields the same target even after an intervening call
with another seed (no global PRNG state to perturb it) */
{
int a = hts_pause_target_ms(99, 5000, 10000);
(void) hts_pause_target_ms(54321, 5000, 10000);
if (hts_pause_target_ms(99, 5000, 10000) != a)
err = 1;
}
printf("pause: %s\n", err ? "FAIL" : "OK");
return err;
}
static int st_relative(httrackp *opt, int argc, char **argv) {
char s[HTS_URLMAXSIZE * 2];
@@ -1251,6 +1310,7 @@ static const struct selftest_entry {
{"strsafe", "[overflow|overflow-buff [str]]", "bounded string-op self-test",
st_strsafe},
{"copyopt", "", "copy_htsopt option-copy self-test", st_copyopt},
{"pause", "", "randomized inter-file pause target self-test", st_pause},
{"relative", "<link> <curr-file>", "relative link between two paths",
st_relative},
{"resolve", "<link> <adr> <fil>", "resolve a link against an origin",

@@ -90,4 +90,16 @@ refused "dangling-quote argument not refused cleanly"
run_only "$tmp/q-lone" '"'
refused "lone-quote argument not refused cleanly"
# --pause (#185): valid MIN[:MAX] accepted; malformed, reversed, over-range and
# non-finite values refused cleanly. NaN defeats naive `<`/`>` checks (it
# compares false to everything), so it must not slip through to the int cast.
run "$tmp/pause-ok" --pause 0.2:0.4
accepted "$tmp/pause-ok" "#185: valid --pause range rejected"
run "$tmp/pause-fix" --pause 0.2
accepted "$tmp/pause-fix" "#185: valid fixed --pause rejected"
for bad in nan nan:5 5:nan inf 10:5 99999; do
run "$tmp/pause-bad" --pause "$bad"
refused "#185: invalid --pause '$bad' not refused cleanly"
done
exit 0

15
tests/01_engine-pause.test Executable file

@@ -0,0 +1,15 @@
#!/bin/bash
#
# --pause (#185): the inter-file pause target must stay in [min,max] and spread
# across it (a per-call rand() would collapse it toward min). Driven by the
# in-process 'httrack -#test=pause' test. POSIX-portable ($(BASH) is /bin/sh on macOS).
set -eu
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=pause run)
test "$out" = "pause: OK" || {
echo "expected 'pause: OK', got: $out" >&2
exit 1
}

@@ -0,0 +1,22 @@
#!/bin/bash
#
# End-to-end --cookies-file (#215): /gated/secret.php needs a cookie no page
# ever Set-Cookies, so it is reachable only when the option preloads it from a
# Netscape cookies.txt. Locks the CLI->opt->cookie_load->wire plumbing.
set -e
: "${top_srcdir:=..}"
# preloaded cookie -> secret page is served. -o0 means a 500 leaves no file, so
# --found/--files only hold when the secret is genuinely fetched (200).
bash "$top_srcdir/tests/local-crawl.sh" --cookie 'session=opensesame' \
--errors 0 --files 2 \
--found 'gated/index.html' --found 'gated/secret.html' \
httrack 'BASEURL/gated/index.php' -o0
# control: without the cookie the secret 500s; -o0 suppresses the error page so
# its absence is real (error + missing file)
bash "$top_srcdir/tests/local-crawl.sh" --errors 1 \
--found 'gated/index.html' --not-found 'gated/secret.html' \
httrack 'BASEURL/gated/index.php' -o0

29
tests/28_local-pause.test Executable file

@@ -0,0 +1,29 @@
#!/bin/bash
#
# --pause (#185): a fixed inter-file delay must slow a multi-file crawl. Measure
# the same crawl with and without --pause and compare: the harness overhead
# cancels, leaving only the pause. Integer seconds keep it portable (BSD date
# has no %N); a lower bound is not timing-flaky since a pause only adds time.
set -e
: "${top_srcdir:=..}"
run() { # echoes the wall-clock seconds of one crawl
local t0 t1
t0=$(date +%s)
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
httrack 'BASEURL/types/index.html' -c1 "$@" >/dev/null 2>&1
t1=$(date +%s)
echo $((t1 - t0))
}
base=$(run)
paused=$(run --pause 0.5)
delta=$((paused - base))
echo "crawl: ${base}s, with --pause 0.5: ${paused}s (delta ${delta}s)"
if [ "$delta" -lt 2 ]; then
echo "FAIL: --pause did not delay the crawl (delta ${delta}s)" >&2
exit 1
fi

@@ -41,6 +41,7 @@ TESTS = \
01_engine-idna.test \
01_engine-mime.test \
01_engine-parse.test \
01_engine-pause.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-savename.test \
@@ -72,6 +73,8 @@ TESTS = \
23_local-errpage.test \
24_local-resume-overlap.test \
25_local-mime-exclude.test \
-26_local-strip-query.test
+26_local-strip-query.test \
+27_local-cookies-file.test \
+28_local-pause.test
CLEANFILES = check-network_sh.cache

@@ -12,11 +12,14 @@
# the mirror directory name.
#
# Usage:
-# bash local-crawl.sh [--tls] [--root DIR] \
+# bash local-crawl.sh [--tls] [--root DIR] [--cookie NAME=VALUE ...] \
# --errors N --files N --found PATH ... --directory PATH ... \
# --log-found REGEX ... --log-not-found REGEX ... \
# httrack BASEURL/some/path [httrack-args...]
# --log-found/--log-not-found grep (ERE) the crawl's hts-log.txt.
# --cookie writes a Netscape cookies.txt (scoped to the discovered host:port,
# which the ephemeral port forces into the cookie domain) and passes it to
# httrack via --cookies-file, to exercise preloaded cookies.
set -u
@@ -85,6 +88,7 @@ tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create
# --- parse leading control flags --------------------------------------------
declare -a audit=()
declare -a cookies=()
scheme=http
pos=0
args=("$@")
@@ -105,6 +109,10 @@ while test "$pos" -lt "$nargs"; do
pos=$((pos + 1))
root="${args[$pos]}"
;;
--cookie)
pos=$((pos + 1))
cookies+=("${args[$pos]}")
;;
--errors | --files)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
@@ -158,6 +166,17 @@ while test "$pos" -lt "$nargs"; do
pos=$((pos + 1))
done
# --- materialize any --cookie entries into a cookies.txt ---------------------
if test "${#cookies[@]}" -gt 0; then
jar="${tmpdir}/cookies.txt"
: >"$jar"
for spec in "${cookies[@]}"; do
printf '127.0.0.1:%s\tTRUE\t/\tFALSE\t1999999999\t%s\t%s\n' \
"$port" "${spec%%=*}" "${spec#*=}" >>"$jar"
done
hts+=(--cookies-file "$jar")
fi
# --- run httrack -------------------------------------------------------------
which httrack >/dev/null || die "could not find httrack"
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')

@@ -110,6 +110,19 @@ class Handler(SimpleHTTPRequestHandler):
return self.fail_cookie("badger")
self.send_html("\tThis is a test.")
# --cookies-file (#215): the secret page needs a cookie no page ever sets,
# so it is reachable only when --cookies-file preloads it.
GATE_COOKIE = ("session", "opensesame")
def route_gated_index(self):
self.send_html('\tThis is a <a href="secret.php">link</a>')
def route_gated_secret(self):
name, value = self.GATE_COOKIE
if self.request_cookies().get(name) != value:
return self.fail_cookie(name)
self.send_html("\tThis is the secret.")
def route_robots(self):
body = b"User-agent: *\nDisallow:\n"
self.send_response(200)
@@ -345,6 +358,8 @@ class Handler(SimpleHTTPRequestHandler):
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
"/cookies/third.php": route_third,
"/gated/index.php": route_gated_index,
"/gated/secret.php": route_gated_secret,
"/robots.txt": route_robots,
"/types/index.html": route_types_index,
"/types/control.php": route_types,