Compare commits

...

8 Commits

Author SHA1 Message Date
Xavier Roche
40a66600ff Add --strip-query to drop query keys from dedup naming (#112) (#434)
Two URLs that differ only in tracking or session query parameters
(?utm_source=x versus ?utm_source=y) were saved as separate files, and a
single CGI could fan out into thousands of near-duplicate pages.
fil_normalized already sorted query args, so reordered parameters dedup,
but there was no way to drop a named key.

--strip-query "[host/pattern=]key1,key2,..." (repeatable) removes the
listed keys when computing the dedup key and the saved name. The fetched
URL is untouched, so a required sid= is still sent on the wire; only the
local namespace collapses. Patterns match the normalized host/path with
the +/- filter glob (strjoker), last match wins as in the filter list,
and stripping is decoupled from urlhack (-%u) so it never silently
no-ops with -%u0.

It all funnels through one chokepoint, fil_normalized: an internal
fil_normalized_filtered() strips then delegates, and hts_query_strip_keys
resolves the per-URL key list. The strip pass walks every query field,
including empty and trailing ones, so its output is a fixpoint under the
read path's second normalization (otherwise dedup silently misses).
Exported ABI is unchanged; the strip_query field is appended at the tail
of httrackp. Covered by a -#test=stripquery self-test (degenerate queries
like a=&b&c== and a 50-case idempotency fixpoint) and an end-to-end dedup
crawl test.

Closes #112

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 11:13:16 +02:00
Xavier Roche
768756e231 ci: add a MemorySanitizer job for the offline engine self-tests (#433)
MSan is the only sanitizer that catches a read of uninitialized memory --
the class of #143, where the size filter tested an uninitialized stack
LLint and forbade files at random. ASan and UBSan let that through.

MSan reports any byte produced by an uninstrumented library as
uninitialized, so the job stays inside our own code: clang, a static link
(the MSan runtime is not injected into shared objects), --disable-https to
drop openssl, and only the offline 01_engine-* self-tests minus the
zlib-backed cache trio. Those self-tests drive the hostile-input parsers
(charset, mime, html, entities, idna, filters) straight through MSan.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:40:22 +02:00
Xavier Roche
b138c87a93 filtersize self-test: parse the size with sscanf(LLintP), and lock the '>' operator (#432)
Use the portable sscanf(argv[0], LLintP, &sz) idiom the rest of the tree
uses to read an LLint, instead of strtoll: LLint is not always long long
(MSVC __int64, plus fallbacks) and strtoll is absent on old MSVC.

Add two cases so the size-rule scan-time neutrality is pinned for the '>'
operator too, not only '<': -*.jpg*[>10] stays neutral at scan time and
cancels once the size is known.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 22:01:14 +02:00
Xavier Roche
3de47433b7 Keep size-based filter rules neutral until the file size is known (#143) (#431)
A rule such as -*.jpg*[<10] is meant to fetch every JPG, then delete the
ones under 10KB once their size is known. Instead it could forbid all of
them up front: at scan time the wizard calls fa_strjoker with no size, but
fa_strjoker always handed strjoker the address of an uninitialized local sz,
so the *[<10] predicate ran against stack garbage. When that garbage fell in
[0,10) the rule "matched" and the link was dropped before it was ever
downloaded ("(wizard) explicit forbidden (-*.jpg*[<10])").

Pass no size pointer when the size is unknown, routing into strjoker's
existing "test impossible -> no match" path so size rules stay neutral at
scan time and only fire once the real size is in. The size-known path is
unchanged.

Add a filtersize engine self-test that drives fa_strjoker through both
phases and a tests/01_engine-filter.test block locking the scenario.

Also lock #144: the *[name]/*[file]/*[path] classes do not span '?'; a
trailing query is tolerated by the same global rule that lets *.aspx match
page.aspx?y=2, not by the class. Working as intended.

Closes #143

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 21:21:54 +02:00
Xavier Roche
fb8827718e htscore: report why a -%L URL list could not be loaded (#49) (#430)
A missing, unreadable, or non-regular -%L file all collapsed into one
reasonless "Could not include URL list: <name>", which is what left the
#49 reporter unable to tell why the list was rejected. Open and stat()
the file explicitly so the log carries the cause: the errno text (no
such file, permission denied), "not a regular file", or "file too
large". The loader keeps the original regular-file guard, so it still
won't open a directory or FIFO.

Covered by an offline file:// test: a readable list loads with a
non-zero count, while a missing file, an unreadable file, and a
directory each fail with a distinct reason instead of the bare message.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 20:49:20 +02:00
Xavier Roche
7228210061 Abort the download when the response MIME type is excluded by -mime: (#58) (#429)
A -mime: exclusion only took effect after the full body had been
downloaded and then discarded (leaving a .delayed temp behind), wasting
bandwidth. Honor it as soon as the response Content-Type arrives:
back_wait now aborts the transfer before the body when hts_acceptmime
forbids the declared type, finishing the slot with a new
STATUSCODE_EXCLUDED clean-skip status rather than fetching and dropping.

Covers the reported case (an HTML-looking URL served as application/pdf
past a +*.html include) and any -mime: match regardless of extension.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 20:10:37 +02:00
Xavier Roche
38882c0aee Honor the server's Content-Range when resuming a partial download (#198) (#428)
* Honor the server's Content-Range when resuming a partial download (#198)

A resumed download (Range: bytes=N-) may be answered with a 206 whose range
starts before N: block-aligned caches and CDNs routinely round the start down
to a block boundary, and RFC 7233 lets the server pick the range it returns.
httrack ignored the returned Content-Range and blindly appended the 206 body to
the bytes already on disk, so the overlapping bytes were duplicated and the file
grew by the overlap. With timing deciding which files get interrupted (and thus
resumed), this surfaced as a random subset of files corrupted on each run, each
a few bytes too large.

Resume at the server's crange_start instead: ftruncate the partial to that
offset and write the 206 body there (the in-memory branch keeps only that
prefix). When the returned range is unusable (a forward gap, no/garbage
Content-Range, or one that doesn't reach EOF) drop the partial and refetch the
whole file rather than stitch a corrupt one.

Reading the existing crange_start/crange_end/crange fields only, no ABI change.
Driven by tests/24_local-resume-overlap.test: pass 1 interrupts a download
mid-body, pass 2 resumes against a 206 that backs up 8 bytes, and the result
must be byte-identical to the same content fetched whole.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Harden #198 fix: verify the truncate, assert the test hit the resume path

Two follow-ups from review of the resume fix.

If HTS_FTRUNCATE fails the partial could keep a stale tail (only when the
resource shrank between runs, sz > full, so the body write no longer covers
the old end). Check its return and, on failure, drop the partial and refetch
the whole file instead of writing a possibly-corrupt one.

The resume test only compared the resumed bytes against the whole file, which
also passes if httrack silently re-downloads the file with no Range (the bug
never fires). Mark when the server actually serves a resume 206 and assert pass
2 hit that path, so a full re-download fails the test instead of passing it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* tests: run 24_local-resume-overlap under set -e

Follow the golden rule for shell scripts: start with set -e so a non-last
failure can't be masked. Guard the backgrounded-crawl kill/wait spots with
|| true so the expected SIGTERM exit doesn't abort the run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 17:42:26 +02:00
Xavier Roche
bfc4a016ab Replace single-letter -# self-tests with a named -#test=NAME registry (#427)
The hidden engine self-tests had accreted into a grab-bag of arbitrary
single-letter/-digit -# arms (-#0, -#A, -#W, ...) buried in the htscoremain.c
option switch, with no mnemonics and stale --help text. Collapse them into one
registry: -#test lists every test with a usage hint and one-line description,
and -#test=NAME [args] runs one.

The handlers and the two helpers they used (basic_selftests,
string_safety_selftests) move to a new htsselftest.c keyed by a
{name, args, desc, fn} table; htscoremain.c keeps only a small dispatch that
runs ahead of the no-URL usage gate, so a bare -#test (or an arg-less test like
copyopt/dns/cookies) no longer needs a dummy URL token to be reached. The
genuine debug knobs (-#L, -#C, -#R, -#h, ...) stay as letters in the switch;
only the unit self-tests, whose sole callers are tests/01_engine-*.test, are
renamed, so this is internal-only with no compatibility surface. Behavior is
preserved: each test prints the same result line and exit code, which the
existing assertions pin. Three now-unused includes (htscache_selftest.h,
htsdns_selftest.h, htsencoding.h) drop out of htscoremain.c.

Tests: the engine tests move to -#test=NAME; 01_engine-hashtable now asserts its
success line (not just exit code) so a misrouted registry row can't pass, and a
new 01_engine-selftest-dispatch covers the bare-list and unknown-name paths.

The --help/man "guru options" list now points at -#test instead of enumerating
a stale subset. The lone vestigial alias --debug-testfilters still resolves to
the removed -#0 (it was already non-functional: param1 supplies one argument,
-#0 required two); it is left untouched because editing that array forces
clang-format to reflow the whole untouched table.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:05:59 +02:00
44 changed files with 2137 additions and 1119 deletions

View File

@@ -188,6 +188,51 @@ jobs:
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# MemorySanitizer catches reads of uninitialized memory (#143's stack-garbage
# size filter) that ASan/UBSan miss. It flags any byte an uninstrumented lib
# wrote, so the job stays in our own code: offline self-tests only, no openssl
# (--disable-https), no zlib cache tests, static (the runtime is not in .so's).
msan:
name: msan (MemorySanitizer, clang)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential clang autoconf automake libtool autoconf-archive \
zlib1g-dev
- name: Configure (MSan, static, no https)
run: |
set -euo pipefail
autoreconf -fi
./configure CC=clang \
CFLAGS="-fsanitize=memory -fsanitize-memory-track-origins=2 -fno-sanitize-recover=all -g -O1 -fno-omit-frame-pointer" \
LDFLAGS="-fsanitize=memory" \
--disable-https --disable-shared --enable-static
- name: Build
run: make -j"$(nproc)"
- name: Test (offline self-tests under MSan)
env:
MSAN_OPTIONS: abort_on_error=1:halt_on_error=1
run: |
set -euo pipefail
# Engine self-tests only; the cache trio pulls in uninstrumented zlib.
tests="$(cd tests && ls 01_engine-*.test | grep -v -- '-cache' | tr '\n' ' ')"
make check TESTS="$tests"
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Optional-dependency build: compile and test with HTTPS/OpenSSL disabled --
# the configuration users on minimal systems build, and one libssl is not even
# installed here so configure cannot silently re-enable it. The matrix above

View File

@@ -33,8 +33,9 @@ the operational checklist: toolchain, invariants, and how to ship a change.
- Be terse. Comment the why, in English; translate French comments you touch.
- Strip AI tells from prose (em-dash overuse, rule-of-three, filler, vague
attributions). Ref: Wikipedia "Signs of AI writing". Claude Code: `/humanizer`.
- Behavior change → add a test. Fast path: a hidden `httrack -#N` debug
subcommand (`htscoremain.c`) driven by a `tests/NN_*.test`, over a slow crawl.
- Behavior change → add a test. Fast path: a hidden `httrack -#test=NAME` engine
self-test (registry in `htsselftest.c`; `-#test` lists them) driven by a
`tests/NN_*.test`, over a slow crawl.
## Review your change adversarially (strongly suggested)
Before pushing, and when reviewing others, don't skim for bugs:

View File

@@ -3,7 +3,7 @@
.\"
.\" This file is generated by man/makeman.sh; do not edit by hand.
.\" SPDX-License-Identifier: GPL-3.0-or-later
.TH httrack 1 "13 June 2026" "httrack website copier"
.TH httrack 1 "27 June 2026" "httrack website copier"
.SH NAME
httrack \- offline browser : copy websites to a local directory
.SH SYNOPSIS
@@ -43,6 +43,7 @@ httrack \- offline browser : copy websites to a local directory
[ \fB\-x, \-\-replace\-external\fR ]
[ \fB\-%x, \-\-disable\-passwords\fR ]
[ \fB\-%q, \-\-include\-query\-string\fR ]
[ \fB\-%g, \-\-strip\-query\fR ]
[ \fB\-o, \-\-generate\-errors\fR ]
[ \fB\-X, \-\-purge\-old[=N]\fR ]
[ \fB\-%p, \-\-preserve\fR ]
@@ -198,6 +199,8 @@ replace external html links by error pages (\-\-replace\-external)
do not include any password for external password protected websites (%x0 include) (\-\-disable\-passwords)
.IP \-%q
*include query string for local files (useless, for information purpose only) (%q0 don't include) (\-\-include\-query\-string)
.IP \-%g
strip query keys for dedup ([host/pattern=]key1,key2,...) (\-\-strip\-query <param>)
.IP \-o
*generate output html file in case of error (404..) (o0 don't generate) (\-\-generate\-errors)
.IP \-X
@@ -313,12 +316,8 @@ debug HTTP headers in logfile (\-\-debug\-headers)
.SS Guru options: (do NOT use if possible)
.IP \-#X
*use optimized engine (limited memory boundary checks) (\-\-fast\-engine)
.IP \-#0
filter test (\-#0 '*.gif' 'www.bar.com/foo.gif') (\-\-debug\-testfilters <param>)
.IP \-#1
simplify test (\-#1 ./foo/bar/../foobar)
.IP \-#2
type test (\-#2 /foo/bar.php)
.IP \-#test
list engine self\-tests (run one with \-#test=NAME [args])
.IP \-#C
cache list (\-#C '*.com/spider*.gif' (\-\-debug\-cache <param>)
.IP \-#R

View File

@@ -56,7 +56,7 @@ whttrackrundir = $(bindir)
whttrackrun_SCRIPTS = webhttrack
libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
htscache_selftest.c htsdns_selftest.c \
htscache_selftest.c htsdns_selftest.c htsselftest.c \
htscatchurl.c htsfilters.c htsftp.c htshash.c coucal/coucal.c \
htshelp.c htslib.c htscoremain.c \
htsname.c htsrobots.c htstools.c htswizard.c \
@@ -66,7 +66,7 @@ libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
md5.c \
minizip/ioapi.c minizip/mztools.c minizip/unzip.c minizip/zip.c \
hts-indextmpl.h htsalias.h htsback.h htsbase.h htssafe.h \
htsbasenet.h htsbauth.h htscache.h htscache_selftest.h htsdns_selftest.h htscatchurl.h \
htsbasenet.h htsbauth.h htscache.h htscache_selftest.h htsdns_selftest.h htsselftest.h htscatchurl.h \
htsconfig.h htscore.h htsparse.h htscoremain.h htsdefines.h \
htsfilters.h htsftp.h htsglobal.h htshash.h coucal/coucal.h \
htshelp.h htsindex.h htslib.h htsmd5.h \

View File

@@ -60,6 +60,9 @@ Please visit our Website: http://www.httrack.com
param1 : this option must be alone, and needs one distinct parameter (-P <path>)
param0 : this option must be alone, but the parameter should be put together (+*.gif)
*/
/* clang-format off: hand-aligned table; clang-format reflows the whole
initializer (2->4 space) on any edit, churning every untouched row. */
/* clang-format off */
const char *hts_optalias[][4] = {
/* {"","","",""}, */
{"path", "-O", "param1", "output path"},
@@ -107,6 +110,8 @@ const char *hts_optalias[][4] = {
{"disable-passwords", "-%x", "single", ""}, {"disable-password", "-%x",
"single", ""},
{"include-query-string", "-%q", "single", ""},
{"strip-query", "-%g", "param1",
"strip [host/pattern=]key1,key2,... from URLs"},
{"generate-errors", "-o", "single", ""},
{"do-not-generate-errors", "-o0", "single", ""},
{"purge-old", "-X", "param", ""},
@@ -241,6 +246,7 @@ const char *hts_optalias[][4] = {
{"", "", "", ""}
};
/* clang-format on */
/*
Check for alias in command-line

View File

@@ -57,7 +57,10 @@ Please visit our Website: http://www.httrack.com
// DOS
#include <process.h> /* _beginthread, _endthread */
#endif
#include <io.h> /* _chsize_s */
#define HTS_FTRUNCATE(fp, sz) _chsize_s(_fileno(fp), (sz))
#else
#define HTS_FTRUNCATE(fp, sz) ftruncate(fileno(fp), (sz))
#endif
#define VT_CLREOL "\33[K"
@@ -3763,7 +3766,27 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
}
#endif
/********** **************************** ********** */
} else { // il faut aller le chercher
}
// MIME type excluded by a -mime: filter: abort, don't fetch
// the body (#58)
else if (HTTP_IS_OK(back[i].r.statuscode) &&
!back[i].testmode &&
strnotempty(back[i].r.contenttype) &&
hts_acceptmime(opt, 0, back[i].url_adr,
back[i].url_fil,
back[i].r.contenttype) == 1) {
deletehttp(&back[i].r);
back[i].r.soc = INVALID_SOCKET;
back[i].status = STATUS_READY;
back_set_finished(sback, i);
back[i].r.statuscode = STATUSCODE_EXCLUDED;
strcpybuff(back[i].r.msg, "Excluded by MIME type filter");
hts_log_print(
opt, LOG_NOTICE,
"File excluded by MIME type filter (%s): %s%s",
back[i].r.contenttype, back[i].url_adr,
back[i].url_fil);
} else { // il faut aller le chercher
// effacer buffer (requète)
if (!noFreebuff) {
@@ -3774,35 +3797,70 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
// xxc SI CHUNK VERIFIER QUE CA MARCHE??
if (back[i].r.statuscode == 206) { // on nous envoie un morceau (la fin) coz une partie sur disque!
off_t sz = fsize_utf8(back[i].url_sav);
/* RFC 7233: resume at the server's Content-Range start,
not the offset we requested; a server may resume
earlier and appending the overlap duplicates bytes
(#198). */
const LLint resume = back[i].r.crange_start;
const hts_boolean range_ok =
back[i].r.crange > 0 && resume >= 0 &&
resume <= (LLint) sz &&
back[i].r.crange_end + 1 == back[i].r.crange &&
(back[i].r.totalsize < 0 ||
back[i].r.totalsize ==
back[i].r.crange_end - resume + 1);
#if HDEBUG
printf("partial content: " LLintP " on disk..\n",
(LLint) sz);
#endif
if (sz >= 0) {
if (sz >= 0 && range_ok) {
if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_sav)) { // pas HTML
if (opt->getmode & HTS_GETMODE_NONHTML) {
filenote(&opt->state.strc, back[i].url_sav, NULL); // noter fichier comme connu
file_notify(opt, back[i].url_adr, back[i].url_fil,
back[i].url_sav, 0, 1,
back[i].r.notmodified);
back[i].r.out = FOPEN(fconv(catbuff, sizeof(catbuff), back[i].url_sav), "ab"); // append
back[i].r.out =
FOPEN(fconv(catbuff, sizeof(catbuff),
back[i].url_sav),
"r+b"); // resume in place
if (back[i].r.out && opt->cache != 0) {
back[i].r.is_write = 1; // écrire
back[i].r.size = sz; // déja écrit
back[i].r.statuscode = HTTP_OK; // Forcer 'OK'
back[i].r.is_write = 1;
back[i].r.size = resume; // bytes already on disk
back[i].r.statuscode = HTTP_OK; // force 'OK'
if (back[i].r.totalsize >= 0)
back[i].r.totalsize += sz; // plus en fait
fseek(back[i].r.out, 0, SEEK_END); // à la fin
/* create a temporary reference file in case of broken mirror */
if (back_serialize_ref(opt, &back[i]) != 0) {
hts_log_print(opt, LOG_WARNING,
"Could not create temporary reference file for %s%s",
back[i].url_adr, back[i].url_fil);
}
back[i].r.totalsize += resume; // -> full size
// drop bytes past the resume point; a silent
// failure could leave a stale tail, so on error
// drop the partial and refetch the whole file
if (HTS_FTRUNCATE(back[i].r.out,
(off_t) resume) != 0) {
fclose(back[i].r.out);
back[i].r.out = NULL;
url_savename_refname_remove(
opt, back[i].url_adr, back[i].url_fil);
UNLINK(back[i].url_sav);
back[i].status = STATUS_READY;
back_set_finished(sback, i);
strcpybuff(back[i].r.msg,
"Can not truncate partial file, "
"restarting");
} else {
fseeko(back[i].r.out, (off_t) resume, SEEK_SET);
/* create a temporary reference file in case of
* broken mirror */
if (back_serialize_ref(opt, &back[i]) != 0) {
hts_log_print(opt, LOG_WARNING,
"Could not create temporary "
"reference file for %s%s",
back[i].url_adr,
back[i].url_fil);
}
#if HDEBUG
printf("continue interrupted file\n");
printf("continue interrupted file\n");
#endif
}
} else { // On est dans la m**
back[i].status = STATUS_READY; // terminé (voir plus loin)
back_set_finished(sback, i);
@@ -3814,17 +3872,18 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
FILE *fp =
FOPEN(fconv(catbuff, sizeof(catbuff), back[i].url_sav), "rb");
if (fp) {
LLint alloc_mem = sz + 1;
LLint alloc_mem = resume + 1;
if (back[i].r.totalsize >= 0)
alloc_mem += back[i].r.totalsize; // AJOUTER RESTANT!
if (deleteaddr(&back[i].r)
&& (back[i].r.adr =
(char *) malloct((size_t) alloc_mem))) {
back[i].r.size = sz;
back[i].r.size = resume;
if (back[i].r.totalsize >= 0)
back[i].r.totalsize += sz; // plus en fait
if ((fread(back[i].r.adr, 1, sz, fp)) != sz) {
back[i].r.totalsize += resume; // -> full size
if ((fread(back[i].r.adr, 1, (size_t) resume,
fp)) != (size_t) resume) {
back[i].status = STATUS_READY; // terminé (voir plus loin)
back_set_finished(sback, i);
strcpybuff(back[i].r.msg,
@@ -3842,14 +3901,30 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
"No memory for partial file");
}
fclose(fp);
} else { // Argh..
} else { // open failed
back[i].status = STATUS_READY; // terminé (voir plus loin)
back_set_finished(sback, i);
strcpybuff(back[i].r.msg,
"Can not open partial file");
}
}
} else { // Non trouvé??
} else if (sz >=
0) { // unusable range -> restart whole file
hts_log_print(opt, LOG_WARNING,
"Unusable partial-content range for %s%s "
"(have " LLintP " bytes, got " LLintP
"-" LLintP "/" LLintP "), restarting",
back[i].url_adr, back[i].url_fil,
(LLint) sz, back[i].r.crange_start,
back[i].r.crange_end, back[i].r.crange);
url_savename_refname_remove(opt, back[i].url_adr,
back[i].url_fil);
UNLINK(back[i].url_sav);
back[i].status = STATUS_READY;
back_set_finished(sback, i);
strcpybuff(back[i].r.msg,
"Unusable partial content, restarting");
} else { // partial not found
back[i].status = STATUS_READY; // terminé (voir plus loin)
back_set_finished(sback, i);
strcpybuff(back[i].r.msg, "Can not find partial file");
@@ -3930,7 +4005,6 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
}
}
}
/*} */

View File

@@ -146,7 +146,8 @@ typedef enum BackStatusCode {
STATUSCODE_NON_FATAL = -5,
STATUSCODE_SSL_HANDSHAKE = -6,
STATUSCODE_TOO_BIG = -7,
STATUSCODE_TEST_OK = -10
STATUSCODE_TEST_OK = -10,
STATUSCODE_EXCLUDED = -11 /* aborted: MIME excluded by a -mime: filter */
} BackStatusCode;
/** HTTrack status ('status' member of of 'lien_back') **/

View File

@@ -736,26 +736,39 @@ int httpmirror(char *url1, httrackp * opt) {
/* OPTIMIZED for fast load */
if (StringNotEmpty(opt->filelist)) {
char *filelist_buff = NULL;
const size_t filelist_sz = off_t_to_size_t(fsize(StringBuff(opt->filelist)));
size_t filelist_sz = 0;
const char *filelist_err = NULL; /* failure reason, NULL on success */
const off_t fs = fsize(StringBuff(opt->filelist));
if (filelist_sz != (size_t) -1) {
if (fs < 0) {
/* fsize() hides the cause; redo stat() for a precise errno (#49) */
struct stat st;
filelist_err = stat(StringBuff(opt->filelist), &st) != 0
? strerror(errno)
: "not a regular file";
} else if ((filelist_sz = off_t_to_size_t(fs)) == (size_t) -1) {
filelist_err = "file too large";
filelist_sz = 0;
} else {
FILE *fp = fopen(StringBuff(opt->filelist), "rb");
if (fp) {
if (fp == NULL) {
filelist_err = strerror(errno);
} else {
filelist_buff = malloct(filelist_sz + 1);
if (filelist_buff) {
if (fread(filelist_buff, 1, filelist_sz, fp) != filelist_sz) {
freet(filelist_buff);
filelist_buff = NULL;
} else {
*(filelist_buff + filelist_sz) = '\0';
}
if (filelist_buff == NULL) {
filelist_err = "out of memory";
} else if (fread(filelist_buff, 1, filelist_sz, fp) != filelist_sz) {
freet(filelist_buff);
filelist_err = "read error";
} else {
filelist_buff[filelist_sz] = '\0';
}
fclose(fp);
}
}
if (filelist_buff) {
if (filelist_buff != NULL) {
int filelist_ptr = 0;
int n = 0;
char BIGSTK line[HTS_URLMAXSIZE * 2];
@@ -780,8 +793,8 @@ int httpmirror(char *url1, httrackp * opt) {
// Free buffer
freet(filelist_buff);
} else {
hts_log_print(opt, LOG_ERROR, "Could not include URL list: %s",
StringBuff(opt->filelist));
hts_log_print(opt, LOG_ERROR, "Could not include URL list \"%s\": %s",
StringBuff(opt->filelist), filelist_err);
}
}
@@ -3726,6 +3739,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (StringNotEmpty(from->user_agent))
StringCopyS(to->user_agent, from->user_agent);
if (StringNotEmpty(from->strip_query))
StringCopyS(to->strip_query, from->strip_query);
if (from->retry > -1)
to->retry = from->retry;

View File

@@ -236,6 +236,8 @@ struct hash_struct {
coucal former_adrfil;
/* scratch buffers reused across lookups (not reentrant) */
int normalized;
/* query-strip keys (not owned); set from opt->strip_query at hash_init */
const char *strip_query;
char normfil[HTS_URLMAXSIZE * 2];
char normfil2[HTS_URLMAXSIZE * 2];
char catbuff[CATBUFF_SIZE];
@@ -364,6 +366,17 @@ int fspc(httrackp * opt, FILE * fp, const char *type);
char *next_token(char *p, int flag);
/* Like fil_normalized(), but first drops query keys in STRIP (comma-separated,
"*" = all); STRIP NULL/empty behaves exactly like fil_normalized(). */
char *fil_normalized_filtered(const char *source, char *dest,
const char *strip);
/* For URL ADR/FIL, return (in DEST) the comma keylist to strip from the
'\n'-separated "[pattern=]keys" RULES (patterns matched on host/path via
strjoker, last wins); NULL if none match. Feeds fil_normalized_filtered(). */
const char *hts_query_strip_keys(const char *rules, const char *adr,
const char *fil, char *dest, size_t destsize);
/* Read a whole file into a freshly malloc'd, NUL-terminated buffer; the caller
owns it and must release it with freet(). Return NULL on missing/unreadable
file (readfile_or substitutes defaultdata instead). The byte content is NOT

File diff suppressed because it is too large Load Diff

View File

@@ -76,7 +76,8 @@ int fa_strjoker(int type, char **filters, int nfil, const char *nom, LLint * siz
}
if (size)
sz = *size;
if (strjoker(nom, filters[i] + filteroffs, &sz, size_flag)) { // reconnu
/* size unknown (scan time): no size pointer => size tests stay neutral */
if (strjoker(nom, filters[i] + filteroffs, size ? &sz : NULL, size_flag)) {
if (size)
if (sz != *size)
sizelimit = sz;

View File

@@ -117,10 +117,17 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,
// copy link
assertf(fil != NULL);
if (hash->normalized) {
fil_normalized(fil, &hash->normfil[strlen(hash->normfil)]);
} else {
strcpy(&hash->normfil[strlen(hash->normfil)], fil);
{
/* resolve the per-URL strip keys; strip applies even when urlhack is off */
char BIGSTK keybuf[HTS_URLMAXSIZE];
const char *const keys = hts_query_strip_keys(hash->strip_query, adr, fil,
keybuf, sizeof(keybuf));
if (hash->normalized || keys != NULL) {
fil_normalized_filtered(fil, &hash->normfil[strlen(hash->normfil)], keys);
} else {
strcpy(&hash->normfil[strlen(hash->normfil)], fil);
}
}
// hash
@@ -161,12 +168,20 @@ static int key_adrfil_equals_generic(void *arg,
}
// now compare pathes
if (normalized) {
fil_normalized(a_fil, hash->normfil);
fil_normalized(b_fil, hash->normfil2);
return strcmp(hash->normfil, hash->normfil2) == 0;
} else {
return strcmp(a_fil, b_fil) == 0;
{
char BIGSTK ka[HTS_URLMAXSIZE], kb[HTS_URLMAXSIZE];
const char *const keysa =
hts_query_strip_keys(hash->strip_query, a_adr, a_fil, ka, sizeof(ka));
const char *const keysb =
hts_query_strip_keys(hash->strip_query, b_adr, b_fil, kb, sizeof(kb));
if (normalized || keysa != NULL || keysb != NULL) {
fil_normalized_filtered(a_fil, hash->normfil, keysa);
fil_normalized_filtered(b_fil, hash->normfil2, keysb);
return strcmp(hash->normfil, hash->normfil2) == 0;
} else {
return strcmp(a_fil, b_fil) == 0;
}
}
}
@@ -227,6 +242,9 @@ void hash_init(httrackp *opt, hash_struct * hash, int normalized) {
hash->adrfil = coucal_new(0);
hash->former_adrfil = coucal_new(0);
hash->normalized = normalized;
/* snapshot the query-strip list (not owned; valid for the hash lifetime) */
hash->strip_query =
StringNotEmpty(opt->strip_query) ? StringBuff(opt->strip_query) : NULL;
hts_set_hash_handler(hash->sav, opt);
hts_set_hash_handler(hash->adrfil, opt);

View File

@@ -563,6 +563,7 @@ void help(const char *app, int more) {
(" %x do not include any password for external password protected websites (%x0 include)");
infomsg
(" %q *include query string for local files (useless, for information purpose only) (%q0 don't include)");
infomsg(" %g strip query keys for dedup ([host/pattern=]key1,key2,...)");
infomsg
(" o *generate output html file in case of error (404..) (o0 don't generate)");
infomsg(" X *purge old files after update (X0 keep delete)");
@@ -646,9 +647,7 @@ void help(const char *app, int more) {
infomsg("");
infomsg("Guru options: (do NOT use if possible)");
infomsg(" #X *use optimized engine (limited memory boundary checks)");
infomsg(" #0 filter test (-#0 '*.gif' 'www.bar.com/foo.gif')");
infomsg(" #1 simplify test (-#1 ./foo/bar/../foobar)");
infomsg(" #2 type test (-#2 /foo/bar.php)");
infomsg(" #test list engine self-tests (run one with -#test=NAME [args])");
infomsg(" #C cache list (-#C '*.com/spider*.gif'");
infomsg(" #R cache repair (damaged cache)");
infomsg(" #d debug parser");

View File

@@ -3681,6 +3681,142 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
return dest;
}
/* Is query key ARG[0..keylen) in the comma-separated STRIP list? "*" = all;
case-sensitive, space-trimmed tokens. */
static int hts_query_key_stripped(const char *arg, size_t keylen,
const char *strip) {
const char *p = strip;
while (*p != '\0') {
const char *start = p;
size_t toklen;
while (*p != '\0' && *p != ',')
p++;
toklen = (size_t) (p - start);
while (toklen > 0 && *start == ' ') {
start++;
toklen--;
}
while (toklen > 0 && start[toklen - 1] == ' ')
toklen--;
if (toklen == 1 && start[0] == '*')
return 1;
if (toklen == keylen && strncmp(start, arg, keylen) == 0)
return 1;
if (*p == ',')
p++;
}
return 0;
}
/* see htscore.h */
char *fil_normalized_filtered(const char *source, char *dest,
const char *strip) {
const char *query;
char BIGSTK tmp[HTS_URLMAXSIZE * 2];
htsbuff cb;
int wrote = 0;
/* No strip list, or no query: plain normalization. */
if (strip == NULL || *strip == '\0' ||
(query = strchr(source, '?')) == NULL) {
return fil_normalized(source, dest);
}
/* Copy the path, re-emit kept query args, let fil_normalized() sort. Walk
every field incl. empty/trailing ("a&","?&&") so the result is a fixpoint
(the read re-normalizes it; a dropped empty arg would miss dedup). */
cb = htsbuff_ptr(tmp, sizeof(tmp));
htsbuff_catn(&cb, source, (size_t) (query - source));
for (query++;;) {
const char *const arg = query;
const char *eq = NULL;
size_t keylen, arglen;
while (*query != '\0' && *query != '&') {
if (eq == NULL && *query == '=')
eq = query;
query++;
}
arglen = (size_t) (query - arg);
keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
if (!hts_query_key_stripped(arg, keylen, strip)) {
htsbuff_catc(&cb, wrote ? '&' : '?');
htsbuff_catn(&cb, arg, arglen);
wrote = 1;
}
if (*query == '\0')
break;
query++;
}
return fil_normalized(tmp, dest);
}
/* see htscore.h */
const char *hts_query_strip_keys(const char *rules, const char *adr,
const char *fil, char *dest, size_t destsize) {
const char *p, *q;
const char *result = NULL;
char BIGSTK url[HTS_URLMAXSIZE * 2];
if (rules == NULL || *rules == '\0' || destsize == 0)
return NULL;
/* Match string = normalized host/path, query removed. jump_normalized_const
collapses www+scheme/auth so read and write (double-normalized) agree;
query excluded keeps the decision on host/path only. */
url[0] = '\0';
strcatbuff(url, jump_normalized_const(adr));
if (fil[0] != '/')
strcatbuff(url, "/");
q = strchr(fil, '?');
if (q != NULL)
strncatbuff(url, fil, (int) (q - fil));
else
strcatbuff(url, fil);
/* Walk the '\n' entries; last match wins (like the +/- filter eval). Each is
"pattern=keys"; no '=' is the bare form, pattern "*". */
for (p = rules; *p != '\0';) {
const char *const line = p;
const char *eol, *eq, *keys;
char BIGSTK pat[HTS_URLMAXSIZE * 2];
while (*p != '\0' && *p != '\n')
p++;
eol = p;
if (*p == '\n')
p++;
if (eol == line)
continue;
eq = memchr(line, '=', (size_t) (eol - line));
if (eq != NULL) {
size_t patlen = (size_t) (eq - line);
if (patlen >= sizeof(pat))
patlen = sizeof(pat) - 1;
memcpy(pat, line, patlen);
pat[patlen] = '\0';
keys = eq + 1;
} else {
pat[0] = '*';
pat[1] = '\0';
keys = line;
}
if (strjoker(url, pat, NULL, NULL) != NULL) {
size_t klen = (size_t) (eol - keys);
if (klen >= destsize)
klen = destsize - 1;
memcpy(dest, keys, klen);
dest[klen] = '\0';
result = dest;
}
}
return result;
}
#define endwith(a) ( (len >= (sizeof(a)-1)) ? ( strncmp(dest, a+len-(sizeof(a)-1), sizeof(a)-1) == 0 ) : 0 );
HTSEXT_API char *adr_normalized_sized(const char *source, char *dest,
size_t destsize) {
@@ -5891,6 +6027,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->sizehack = HTS_FALSE;
opt->urlhack = HTS_TRUE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
StringCopy(opt->strip_query, "");
opt->ftp_proxy = HTS_TRUE;
opt->convert_utf8 = HTS_TRUE;
StringCopy(opt->filelist, "");
@@ -6035,6 +6172,7 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
StringFree(opt->urllist);
StringFree(opt->footer);
StringFree(opt->mod_blacklist);
StringFree(opt->strip_query);
StringFree(opt->path_html);
StringFree(opt->path_html_utf8);

View File

@@ -198,6 +198,13 @@ int url_savename(lien_adrfilsave *const afs,
// copy of fil, used for lookups (see urlhack)
const char *normadr = adr;
const char *normfil = fil_complete;
/* query keys to strip for this URL (NULL = none); decoupled from urlhack */
char BIGSTK stripkeys[HTS_URLMAXSIZE];
const char *const strip =
StringNotEmpty(opt->strip_query)
? hts_query_strip_keys(StringBuff(opt->strip_query), adr,
fil_complete, stripkeys, sizeof(stripkeys))
: NULL;
const char *const print_adr = jump_protocol_const(adr);
const char *start_pos = NULL, *nom_pos = NULL, *dot_pos = NULL; // Position nom et point
@@ -232,7 +239,7 @@ int url_savename(lien_adrfilsave *const afs,
if (opt->urlhack) {
// copy of adr (without protocol), used for lookups (see urlhack)
normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
normfil = fil_normalized(fil_complete, normfil_);
normfil = fil_normalized_filtered(fil_complete, normfil_, strip);
} else {
if (link_has_authority(adr_complete)) { // https or other protocols : in "http/" subfolder
char *pos = strchr(adr_complete, ':');
@@ -245,6 +252,9 @@ int url_savename(lien_adrfilsave *const afs,
normadr = normadr_;
}
}
// strip still applies with urlhack off (host left untouched)
if (strip != NULL)
normfil = fil_normalized_filtered(fil_complete, normfil_, strip);
}
// à afficher sans ftp://

View File

@@ -529,6 +529,8 @@ struct httrackp {
htslibhandles libHandles; /**< loaded external module handles */
//
htsoptstate state; /**< embedded live engine state */
String strip_query; /**< query keys to drop when deduping URLs (-strip-query);
appended at the tail to keep field offsets stable */
};
/* Running statistics for a mirror. */

1244
src/htsselftest.c Normal file

File diff suppressed because it is too large Load Diff

52
src/htsselftest.h Normal file
View File

@@ -0,0 +1,52 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 2026 Xavier Roche and other contributors
SPDX-License-Identifier: GPL-3.0-or-later
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Ethical use: we kindly ask that you NOT use this software to harvest email
addresses or to collect any other private information about people. Doing so
would dishonor our work and waste the many hours we have spent on it.
Please visit our Website: http://www.httrack.com
*/
/* ------------------------------------------------------------ */
/* File: htsselftest.h */
/* named dispatch for the hidden engine self-tests */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#ifndef HTSSELFTEST_DEFH
#define HTSSELFTEST_DEFH
#ifdef HTS_INTERNAL_BYTECODE
#ifndef HTS_DEF_FWSTRUCT_httrackp
#define HTS_DEF_FWSTRUCT_httrackp
typedef struct httrackp httrackp;
#endif
/* Run engine self-test `name` over the positional args argv[0..argc-1], or list
the available tests when name is NULL, empty, or "list". Prints the result;
returns the process exit code (0 == success). The caller owns option cleanup.
Reached through the hidden `httrack -#test[=NAME ...]` subcommand. */
int hts_selftest(httrackp *opt, const char *name, int argc, char **argv);
#endif
#endif

View File

@@ -4,7 +4,7 @@
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
# tool flags despite the #!/bin/bash above.
# Golden cache-format regression test (driven by 'httrack -#B <dir>').
# Golden cache-format regression test (driven by 'httrack -#test=cache-golden <dir>').
#
# 01_engine-cache.test writes the cache with the same build it reads back (a
# round-trip), so it cannot catch a read-path or ZIP-format regression where
@@ -13,7 +13,7 @@
# byte-exact.
#
# Regenerate the fixture after a deliberate format change with
# 'httrack -#B <dir> regen', then copy <dir>/hts-cache/new.zip over the
# 'httrack -#test=cache-golden <dir> regen', then copy <dir>/hts-cache/new.zip over the
# committed file.
set -eu
@@ -37,11 +37,11 @@ trap 'rm -rf "$dir"' EXIT
mkdir -p "$dir/hts-cache"
cp "$fixture/hts-cache/new.zip" "$dir/hts-cache/new.zip"
out=$(httrack -#B "$dir")
out=$(httrack -#test=cache-golden "$dir")
# Match the exact success line: the read must have found and verified every
# entry, not merely failed to enter the mode (a bad -#B falls through to the
# usage screen, which also exits non-zero but never prints this).
# entry, not merely failed to enter the mode (a renamed/removed test prints the
# registry to stderr, which also exits non-zero but never prints this).
test "$out" = "cache-golden: OK" || {
echo "expected 'cache-golden: OK', got: $out" >&2
exit 1

View File

@@ -4,20 +4,20 @@
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
# tool flags despite the #!/bin/bash above.
# Cache write-failure handling (httrack -#W <dir>). #174/#219.
# Cache write-failure handling (httrack -#test=cache-writefail <dir>). #174/#219.
# A failing new.zip write (disk full) used to crash the process via assertf; it
# must instead stop the mirror with a fatal error (exit_xh=-1), no crash. The
# self-test asserts that; reverting the fix makes -#W abort (SIGABRT) and fail.
# self-test asserts that; reverting the fix makes -#test=cache-writefail abort (SIGABRT) and fail.
set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
out=$(httrack -#W "$dir")
out=$(httrack -#test=cache-writefail "$dir")
# Match the exact success line (error logs also go to stdout); a bad -#W falls
# through to the usage screen, which exits non-zero but never prints this.
# Match the exact success line (error logs also go to stdout); a renamed/removed
# test prints the registry to stderr, which exits non-zero but never prints this.
printf '%s\n' "$out" | grep -qx "cache-writefail: OK" || {
echo "expected 'cache-writefail: OK', got: $out" >&2
exit 1

View File

@@ -4,7 +4,7 @@
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
# tool flags despite the #!/bin/bash above.
# Cache create/read/update logic (driven by 'httrack -#A <dir>').
# Cache create/read/update logic (driven by 'httrack -#test=cache <dir>').
#
# The in-process self-test stores several hand-crafted edge entries (normal
# HTML, an empty redirect with a near-limit location, a non-HTML body kept via
@@ -20,13 +20,13 @@ set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
# Like the other -# debug modes, a trailing token (the working directory) is
# required; a bare '-#A' falls through to the usage screen.
out=$(httrack -#A "$dir")
# The working directory is a required argument; without it the test prints a
# usage line to stderr and returns non-zero.
out=$(httrack -#test=cache "$dir")
# Match the exact success line, so the test cannot pass for an unrelated reason
# (e.g. the -#A mode being gone and falling through to the usage screen, which
# also exits non-zero but never prints this).
# (e.g. the cache test being gone, which prints the registry to stderr but
# never prints this line).
test "$out" = "cache-selftest: OK" || {
echo "expected 'cache-selftest: OK', got: $out" >&2
exit 1

View File

@@ -4,13 +4,13 @@
set -euo pipefail
# charset -> UTF-8 conversion (hts_convertStringToUTF8).
# -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8.
# -#test=charset <charset> <string> prints the string re-decoded from <charset> as UTF-8.
conv() {
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
test "$(httrack -O /dev/null -#test=charset "$1" "$2")" == "$3" || exit 1
}
# crash probe: malformed input must exit cleanly, not abort.
runs() {
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
httrack -O /dev/null -#test=charset "$1" "$2" >/dev/null 2>&1 || exit 1
}
# the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9.
@@ -31,7 +31,7 @@ conv 'us-ascii' 'hello' 'hello'
# unknown charset: ASCII passes through unchanged, but non-ASCII input cannot be
# decoded and yields empty output (an error is printed to stderr).
conv 'no-such-charset-xyz' 'abc' 'abc'
test "$(httrack -O /dev/null -#3 'no-such-charset-xyz' 'café' 2>/dev/null)" == "" || exit 1
test "$(httrack -O /dev/null -#test=charset 'no-such-charset-xyz' 'café' 2>/dev/null)" == "" || exit 1
# malformed UTF-8 (lone continuation byte, truncated lead byte) must not crash
runs 'utf-8' $'\x80'

View File

@@ -1,14 +1,15 @@
#!/bin/bash
#
# Issue #151 guard: the request Cookie header must be bare RFC 6265 name=value
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#Q' selftest.
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#test=cookies' selftest.
set -eu
# A trailing token is required; a bare '-#Q' falls through to the usage screen.
out=$(httrack -#Q run)
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=cookies run)
# Exact-match the success line so a fall-through to usage can't pass the test.
# Exact-match the success line so a renamed/removed test (it prints the registry
# to stderr) can't pass.
test "$out" = "cookie-header: OK" || {
echo "expected 'cookie-header: OK', got: $out" >&2
exit 1

View File

@@ -2,15 +2,16 @@
#
# Regression guard for the unsigned-enum sentinel trap: copy_htsopt's
# `if (from->X > -1)` guard is always false for unsigned hts_boolean fields, so
# they silently stop being copied. Driven by the in-process 'httrack -#9' test.
# they silently stop being copied. Driven by the in-process 'httrack -#test=copyopt' test.
# Keep POSIX-portable (harness runs it via $(BASH), a plain /bin/sh on macOS).
set -eu
# A trailing token is required; a bare '-#9' falls through to the usage screen.
out=$(httrack -#9 run)
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=copyopt run)
# Exact-match the success line so a fall-through to usage can't pass the test.
# Exact-match the success line so a renamed/removed test (it prints the registry
# to stderr) can't pass.
test "$out" = "copy-htsopt: OK" || {
echo "expected 'copy-htsopt: OK', got: $out" >&2
exit 1

View File

@@ -5,9 +5,8 @@ set -euo pipefail
# DNS resolver/cache self-test: a mock getaddrinfo (no network) checks address
# family, single-address selection, the -@i4/-@i6 family filter, and cache reuse.
# The trailing token is required, like the other -# selftests, so a bare command
# line isn't treated as "no arguments" and routed to the usage screen.
out=$(httrack -#D run)
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=dns run)
test "$out" = "dns-selftest: OK" || {
echo "expected 'dns-selftest: OK', got: $out" >&2

View File

@@ -4,13 +4,13 @@
set -euo pipefail
# HTML entity unescaping (hts_unescapeEntitiesWithCharset).
# -#6 <string> prints the string with entities decoded (UTF-8 output).
# -#test=entities <string> prints the string with entities decoded (UTF-8 output).
ent() {
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
test "$(httrack -O /dev/null -#test=entities "$1")" == "$2" || exit 1
}
# crash probe: malformed input must exit cleanly, not abort.
runs() {
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
httrack -O /dev/null -#test=entities "$1" >/dev/null 2>&1 || exit 1
}
# named entities

View File

@@ -0,0 +1,65 @@
#!/bin/bash
#
# -%L URL-list loading (#49): a readable list is honored; an unusable one fails
# with the reason (errno / not-a-regular-file), not a bare "Could not include
# URL list". Offline: file:// fixture, no server. Asserts on httrack's own
# strings and the message shape, so it is locale-independent.
set -euo pipefail
tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_filelist.XXXXXX") || exit 1
trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM
echo '<html><body>hi</body></html>' >"$tmp/index.html"
# run httrack with the given -%L target; structured log lands in $out/hts-log.txt
run() {
local out="$1" list="$2"
rm -rf "$out"
mkdir -p "$out"
httrack -O "$out" --quiet -n "-%L" "$list" >"$out/.stdout" 2>&1 || true
LOG="$out/hts-log.txt"
}
fail() {
echo "FAIL: $1"
cat "$LOG"
exit 1
}
loghas() {
grep -Eq "$1" "$LOG" || fail "expected /$1/ in $LOG"
}
lognot() {
if grep -Eq "$1" "$LOG"; then fail "unexpected /$1/ in $LOG"; fi
}
# readable list: its one URL is loaded and counted (count must be non-zero)
printf 'file://%s/index.html\n' "$tmp" >"$tmp/urls.txt"
run "$tmp/ok" "$tmp/urls.txt"
loghas '[1-9][0-9]* links added from'
# missing file: quoted name + a non-empty reason, never the old reasonless
# "Could not include URL list: <name>". The reason is the stat() errno, not the
# directory fallback literal (guards against dropping the errno lookup).
run "$tmp/miss" "$tmp/nope.txt"
loghas 'Could not include URL list "[^"]+": .+'
lognot 'Could not include URL list: '
lognot 'not a regular file'
# a directory is rejected with our own reason (locale-independent)
mkdir -p "$tmp/adir"
run "$tmp/dir" "$tmp/adir"
loghas 'Could not include URL list "[^"]+": not a regular file'
# unreadable regular file: the fopen() errno arm fires, distinct from the
# directory branch. Root bypasses mode 000, so skip it there.
if test "$(id -u)" -ne 0; then
: >"$tmp/noperm.txt"
chmod 000 "$tmp/noperm.txt"
run "$tmp/perm" "$tmp/noperm.txt"
chmod 644 "$tmp/noperm.txt"
loghas 'Could not include URL list "[^"]+": .+'
lognot 'not a regular file'
fi
exit 0

View File

@@ -4,13 +4,13 @@
set -euo pipefail
# wildcard filter engine (strjoker), the core of +/- include/exclude rules.
# -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
# -#test=filter <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
match() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
test "$(httrack -O /dev/null -#test=filter "$1" "$2")" == "$2 does match $1" || exit 1
}
nomatch() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
test "$(httrack -O /dev/null -#test=filter "$1" "$2")" == "$2 does NOT match $1" || exit 1
}
# bare star matches everything
@@ -71,3 +71,27 @@ nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x'
# Size-based rules (-#test=filtersize <size> <string> <filter...>): a negative size
# means the size is still unknown (scan time). A size exclusion must stay neutral
# then, so the file is fetched and only cancelled once its size is known (#143).
fsize() {
local want="$1"
shift
test "$(httrack -O /dev/null -#test=filtersize "$@")" == "$want" || exit 1
}
fsize 'verdict=allowed size_flag=0' -1 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # scan time: keep
fsize 'verdict=forbidden size_flag=1' 5 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # <10KB: cancel
fsize 'verdict=allowed size_flag=1' 20 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # >=10KB: keep
fsize 'verdict=forbidden size_flag=0' -1 foo.txt -* '+*.jpg' '-*.jpg*[<10]' # not a jpg
# the '>' operator is just as neutral at scan time, and fires once size is known
fsize 'verdict=allowed size_flag=0' -1 foo.jpg -* '+*.jpg' '-*.jpg*[>10]' # scan time: keep
fsize 'verdict=forbidden size_flag=1' 20 foo.jpg -* '+*.jpg' '-*.jpg*[>10]' # >10KB: cancel
# [name]/[file]/[path] never span '?' mid-string; a trailing query is still
# tolerated by the global '?' rule (same as plain *.aspx), not the class (#144).
nomatch '*[path]/end' 'a?b/end'
nomatch '*[file]end' 'foo?xend'
nomatch '*[name]X' 'abc?X'
match '*[file]' 'foo?x=1' # trailing query: tolerated, as for *.aspx
match '*.aspx' 'page.aspx?y=2'

View File

@@ -3,5 +3,7 @@
set -euo pipefail
# httrack internal hashtable autotest on 100K keys
httrack -#7 100000
# httrack internal hashtable autotest on 100K keys. Assert the success line (on
# stderr) so a misrouted registry entry can't pass on exit code alone.
out=$(httrack -#test=hashtable 100000 2>&1)
printf '%s\n' "$out" | grep -q "all hashtable tests were successful!" || exit 1

View File

@@ -3,13 +3,13 @@
set -euo pipefail
# IDNA / punycode encode (-#4) and decode (-#5). This code has a CVE history,
# IDNA / punycode encode (-#test=idna-encode) and decode (-#test=idna-decode). This code has a CVE history,
# so the edge cases below cover passthrough, round-trips, and malformed input.
enc() { test "$(httrack -O /dev/null -#4 "$1")" == "$2" || exit 1; }
dec() { test "$(httrack -O /dev/null -#5 "$1")" == "$2" || exit 1; }
enc() { test "$(httrack -O /dev/null -#test=idna-encode "$1")" == "$2" || exit 1; }
dec() { test "$(httrack -O /dev/null -#test=idna-decode "$1")" == "$2" || exit 1; }
# crash probe: malformed ACE input must exit cleanly, not abort.
runs() { httrack -O /dev/null -#5 "$1" >/dev/null 2>&1 || exit 1; }
runs() { httrack -O /dev/null -#test=idna-decode "$1" >/dev/null 2>&1 || exit 1; }
# encode
enc 'www.café.com' 'www.xn--caf-dma.com'

View File

@@ -4,13 +4,13 @@
set -euo pipefail
# MIME type guessing from extension (get_httptype / give_mimext).
# -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
# -#test=mime <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
mime() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
test "$(httrack -O /dev/null -#test=mime "$1" | head -1)" == "$1 is '$2'" || exit 1
}
unknown() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
test "$(httrack -O /dev/null -#test=mime "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
}
mime '/a/b.html' 'text/html'

View File

@@ -8,7 +8,7 @@ set -euo pipefail
# relative path from <curr>'s directory to <link>
rel() {
local got
got=$(httrack -O /dev/null -#l "$1" "$2")
got=$(httrack -O /dev/null -#test=relative "$1" "$2")
test "$got" == "relative=$3" ||
{
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
@@ -19,7 +19,7 @@ rel() {
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
ident() {
local got
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
got=$(httrack -O /dev/null -#test=resolve "$1" "$2" "$3")
test "$got" == "$4" ||
{
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"

View File

@@ -3,11 +3,11 @@
set -euo pipefail
# Local save-name extension resolution (url_savename via -#N <fil> <content-type>).
# Local save-name extension resolution (url_savename via -#test=savename <fil> <content-type>).
# Asserts on the basename of "savename: <path>".
name() {
out="$(httrack -O /dev/null -#N "$1" "$2" | sed -n 's/^savename: //p')"
out="$(httrack -O /dev/null -#test=savename "$1" "$2" | sed -n 's/^savename: //p')"
test "${out##*/}" == "$3" || {
echo "FAIL: '$1' '$2' -> '$out' (want '$3')"
exit 1

View File

@@ -0,0 +1,17 @@
#!/bin/bash
#
# The -#test dispatch itself: a bare -#test lists the registry, and an unknown
# name errors (non-zero, diagnostic) instead of silently passing.
set -eu
# Bare -#test lists known tests (printed to stderr).
list=$(httrack -#test 2>&1)
printf '%s\n' "$list" | grep -q "filter" || exit 1
printf '%s\n' "$list" | grep -q "cache-writefail" || exit 1
# Unknown name: non-zero exit + diagnostic, and no test result line.
rc=0
err=$(httrack -#test=bogus 2>&1) || rc=$?
test "$rc" -ne 0 || exit 1
printf '%s\n' "$err" | grep -q "Unknown self-test" || exit 1

View File

@@ -5,7 +5,7 @@ set -euo pipefail
# path simplify engine (fil_simplifie): collapses ./ and ../ segments.
simp() {
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
test "$(httrack -O /dev/null -#test=simplify "$1")" == "simplified=$2" || exit 1
}
simp './foo/bar/' 'foo/bar/'

View File

@@ -0,0 +1,8 @@
#!/bin/bash
#
set -euo pipefail
# --strip-query: pattern-scoped query-key stripping for dedup. All assertions
# live in the engine self-test (hts_query_strip_keys + fil_normalized_filtered).
httrack -O /dev/null -#test=stripquery | grep -q "strip-query self-test OK"

View File

@@ -3,23 +3,22 @@
set -euo pipefail
# htssafe.h bounded string operations (driven by 'httrack -#8').
# htssafe.h bounded string operations (driven by 'httrack -#test=strsafe').
# Success path: every bounded op (strcpybuff/strcatbuff/strncatbuff/strlcpybuff)
# must behave correctly. Like the other -# debug modes, a trailing token is
# required (a bare '-#8' falls through to the usage screen).
# must behave correctly. 'run' selects the success path (vs the overflow modes).
rc=0
out=$(httrack -#8 run) || rc=$?
out=$(httrack -#test=strsafe run) || rc=$?
test "$rc" -eq 0 || exit 1
test "$out" == "strsafe: OK" || exit 1
# Overflow path: an over-capacity write into a sized buffer must be caught by
# the bounded macro and abort the process, not be silently truncated/completed.
# Assert the htssafe abort signature specifically, so the test cannot pass for
# an unrelated reason (e.g. the -#8 mode being gone and falling through to the
# usage screen, which also exits non-zero).
# an unrelated reason (e.g. the strsafe test being gone, which prints the
# registry to stderr and also exits non-zero).
# the bounded macro aborts (non-zero exit), so don't let set -e trip on it
err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
err=$(httrack -#test=strsafe overflow "this string is far too long for the buffer" 2>&1) || true
case "$err" in
*"strsafe: NOT aborted"*)
echo "over-capacity write was NOT caught" >&2
@@ -36,7 +35,7 @@ esac
# capacity (4 bytes into a 4-byte buffer), so this also pins the boundary: a
# '<=' off-by-one in the capacity check would let it through (and print "NOT
# aborted"). Match the specific htsbuff abort message, not just any assert.
err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
err=$(httrack -#test=strsafe overflow-buff "abcd" 2>&1) || true
case "$err" in
*"strsafe: NOT aborted"*)
echo "htsbuff over-capacity write was NOT caught" >&2

View File

@@ -0,0 +1,109 @@
#!/bin/bash
# Issue #198: on a resumed download the server may answer the Range with a 206
# that starts *before* the offset we asked for (block-aligned ranges). httrack
# must honor the returned Content-Range, not blindly append, or the overlap
# bytes get duplicated and the file grows (corrupt PDFs). Pass 1 interrupts
# flaky.bin mid-body (partial + temp-ref); pass 2 resumes against a 206 that
# backs up 8 bytes. The result must equal the same bytes fetched whole (full.bin).
set -eu
: "${top_srcdir:=..}"
testdir=$(cd "$(dirname "$0")" && pwd)
server="${testdir}/local-server.py"
command -v python3 >/dev/null || ! echo "python3 not found; skipping" || exit 77
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/httrack_198.XXXXXX") || exit 1
serverpid=
crawlpid=
cleanup() {
if test -n "$crawlpid"; then kill -9 "$crawlpid" 2>/dev/null || true; fi
if test -n "$serverpid"; then
kill "$serverpid" 2>/dev/null || true
wait "$serverpid" 2>/dev/null || true
fi
rm -rf "$tmpdir"
}
trap cleanup EXIT HUP INT QUIT PIPE TERM
# OVERLAP_COUNTER gets a byte per flaky.bin request so pass 1 knows when to interrupt.
serverlog="${tmpdir}/server.log"
counter="${tmpdir}/hits"
resumed="${tmpdir}/resumed" # gets a byte when the server serves a resume 206
OVERLAP_COUNTER="$counter" OVERLAP_RESUMED="$resumed" \
python3 "$server" --root "${testdir}/server-root" \
>"$serverlog" 2>&1 &
serverpid=$!
port=
for _ in $(seq 1 50); do
line=$(head -n1 "$serverlog" 2>/dev/null)
if test "${line%% *}" == "PORT"; then
port="${line#PORT }"
break
fi
kill -0 "$serverpid" 2>/dev/null || {
echo "server exited early: $(cat "$serverlog")"
exit 1
}
sleep 0.1
done
test -n "$port" || {
echo "could not discover server port"
exit 1
}
base="http://127.0.0.1:${port}"
which httrack >/dev/null || {
echo "could not find httrack"
exit 1
}
out="${tmpdir}/crawl"
common=(-O "$out" --quiet --disable-security-limits --robots=0 --timeout=30 --retries=0 -c1)
refdir="${out}/hts-cache/ref"
# pass 1: interrupt once flaky.bin's prefix is streaming (partial + temp-ref).
printf '[pass 1: interrupt flaky.bin] ..\t'
httrack "${common[@]}" "${base}/overlap/index.html" >"${tmpdir}/log1" 2>&1 &
crawlpid=$!
for _ in $(seq 1 300); do
test -s "$counter" && break
kill -0 "$crawlpid" 2>/dev/null || break
sleep 0.1
done
sleep 0.5
kill -TERM "$crawlpid" 2>/dev/null || true
wait "$crawlpid" 2>/dev/null || true
crawlpid=
test -n "$(find "$refdir" -name '*.ref' 2>/dev/null)" || {
echo "FAIL: no temp-ref survived pass 1; cannot drive the resume"
exit 1
}
echo "OK (temp-ref present)"
# pass 2: --continue -> resume Range -> 206 that starts 8 bytes early.
printf '[pass 2: resume flaky.bin] ..\t'
httrack "${common[@]}" --continue "${base}/overlap/index.html" >"${tmpdir}/log2" 2>&1 || true
echo "OK"
# Guard against a silent full re-download: the byte-compare below only tests the
# fix if pass 2 actually went through the resume Range -> 206 path.
printf '[resume path was exercised] ..\t'
if ! test -s "$resumed"; then
echo "FAIL: pass 2 never triggered a resume 206; the overlap fix was not exercised"
exit 1
fi
echo "OK"
printf '[resumed file is not corrupted] ..\t'
dir=$(find "$out" -maxdepth 1 -type d -name '127.0.0.1*' | head -1)
flaky="${dir}/overlap/flaky.bin"
full="${dir}/overlap/full.bin"
if ! test -f "$flaky" || ! test -f "$full"; then
echo "FAIL: flaky.bin or full.bin missing after pass 2"
exit 1
fi
if ! cmp -s "$flaky" "$full"; then
echo "FAIL: resumed flaky.bin ($(wc -c <"$flaky")) != full.bin ($(wc -c <"$full")); overlap duplicated"
exit 1
fi
echo "OK ($(wc -c <"$flaky") bytes, byte-identical)"

View File

@@ -0,0 +1,16 @@
#!/bin/bash
#
# A -mime: exclusion must abort the transfer on the response Content-Type, not
# fetch the whole 1 MB body then discard it (#58). The bytes-received guard is
# the real one: the file is absent either way, but only the fix keeps the count
# tiny (header only) instead of pulling the body. Match it positively (a small,
# <=4-digit count) so a vanished/reworded summary line fails rather than passes.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'mimex/real.html' \
--not-found 'mimex/blob.pdf' \
--log-found 'excluded by MIME type filter' \
--log-found '\[[0-9]{1,4} bytes received' \
httrack 'BASEURL/mimex/index.html' '-mime:application/pdf'

18
tests/26_local-strip-query.test Executable file
View File

@@ -0,0 +1,18 @@
#!/bin/bash
#
# End-to-end --strip-query (#112): two links to one resource differing only by
# ?utm_source dedup to a single saved file (2 files written: index + resource);
# the control crawl without the option keeps both variants (3 files). Locks the
# CLI->opt->hash plumbing the engine self-test can't reach.
set -e
: "${top_srcdir:=..}"
# stripped: the two ?utm_source variants collapse to one resource
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
httrack 'BASEURL/stripquery/index.html' --strip-query 'utm_source'
# control: no stripping -> both query-named variants are saved
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
httrack 'BASEURL/stripquery/index.html'

View File

@@ -5,6 +5,7 @@ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
local-crawl.sh local-server.py server.crt server.key \
server-root/simple/basic.html server-root/simple/link.html \
server-root/stripquery/index.html server-root/stripquery/a.html \
fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT =
@@ -34,6 +35,7 @@ TESTS = \
01_engine-dns.test \
01_engine-doitlog.test \
01_engine-entities.test \
01_engine-filelist.test \
01_engine-filter.test \
01_engine-hashtable.test \
01_engine-idna.test \
@@ -42,7 +44,9 @@ TESTS = \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-savename.test \
01_engine-selftest-dispatch.test \
01_engine-simplify.test \
01_engine-stripquery.test \
01_engine-strsafe.test \
02_manpage-regen.test \
02_update-cache.test \
@@ -64,6 +68,9 @@ TESTS = \
20_local-resume-loop.test \
21_local-intl-update.test \
22_local-broken-size.test \
23_local-errpage.test
23_local-errpage.test \
24_local-resume-overlap.test \
25_local-mime-exclude.test \
26_local-strip-query.test
CLEANFILES = check-network_sh.cache

View File

@@ -177,6 +177,24 @@ class Handler(SimpleHTTPRequestHandler):
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
# --- MIME-type exclusion abort (issue #58) -----------------------------
# A -mime:application/pdf filter must abort the transfer once the header
# arrives, not download the whole body and discard it.
def route_mimex_index(self):
self.send_html(
'\t<a href="blob.pdf">pdf</a>\n' '\t<a href="real.html">real</a>\n'
)
# 1 MB body: the fix aborts after the header, so httrack's "bytes received"
# stays tiny; without it the engine reads the body and the count jumps.
MIMEX_BLOB = b"%PDF-1.4\n" + b"\x00" * (1024 * 1024)
def route_mimex_blob(self):
self.send_raw(self.MIMEX_BLOB, "application/pdf")
def route_mimex_real(self):
self.send_raw(b"<html><body>real</body></html>", "text/html")
# --- special chars in URLs across an update (issue #157) ---------------
# A dotless, accented basename served as text/html (MediaWiki style). The
# name the first crawl picks (.html) must survive the update pass.
@@ -225,6 +243,71 @@ class Handler(SimpleHTTPRequestHandler):
self.send_header("Content-Length", "0")
self.end_headers()
# 206 resume must honor the server's Content-Range, not the offset we asked
# for (#198): a server resuming a few bytes *before* the request must not
# leave httrack duplicating the overlap onto the partial. flaky.bin
# interrupts once then resumes OVERLAP_EARLY bytes early; full.bin serves
# the identical bytes in one shot, so the test can compare the two.
OVERLAP_BLOB = b"%PDF-1.4\n" + bytes((i * 37 + 11) % 256 for i in range(8000))
OVERLAP_EARLY = 8
OVERLAP_PREFIX_LEN = 4000 # flushed before the stall
_overlap_started = False
def route_overlap_index(self):
self.send_html('\t<a href="flaky.bin">flaky</a>\n\t<a href="full.bin">full</a>')
def route_overlap_full(self):
self.send_raw(self.OVERLAP_BLOB, "application/octet-stream")
def route_overlap(self):
counter = os.environ.get("OVERLAP_COUNTER")
if counter:
with open(counter, "a") as fp:
fp.write("x")
blob = self.OVERLAP_BLOB
rng = self.headers.get("Range")
# First GET: stream a prefix then stall, so the crawl can be interrupted
# mid-body (partial + temp-ref on disk).
if rng is None and not Handler._overlap_started:
Handler._overlap_started = True
self.send_response(200)
self.send_header("Content-Type", "application/octet-stream")
self.send_header("Content-Length", str(len(blob)))
self.send_header("Accept-Ranges", "bytes")
self.end_headers()
if self.command != "HEAD":
self.wfile.write(blob[: self.OVERLAP_PREFIX_LEN])
self.wfile.flush()
try:
while True:
time.sleep(3600)
except OSError:
pass
return
if rng is None: # no resume request: serve the whole file
return self.route_overlap_full()
# Resume: honor the Range, but back up OVERLAP_EARLY bytes.
start = (
int(rng[len("bytes=") :].split("-")[0]) if rng.startswith("bytes=") else 0
)
start = max(0, start - self.OVERLAP_EARLY)
# Signal that the resume Range -> 206 path actually fired, so the test
# can prove it was exercised (not a silent full re-download).
resumed = os.environ.get("OVERLAP_RESUMED")
if resumed:
with open(resumed, "a") as fp:
fp.write("x")
part = blob[start:]
self.send_response(206, "Partial Content")
self.send_header("Content-Type", "application/octet-stream")
self.send_header("Content-Length", str(len(part)))
self.send_header(
"Content-Range", "bytes %d-%d/%d" % (start, len(blob) - 1, len(blob))
)
self.end_headers()
if self.command != "HEAD":
self.wfile.write(part)
# error pages / 0-byte files (#17): -o0 ("no error pages") must keep 4xx/5xx
# bodies off disk; a genuine 0-byte 200 is a valid file and stays.
def route_errpage_index(self):
@@ -281,12 +364,18 @@ class Handler(SimpleHTTPRequestHandler):
"/intl/" + INTL_NAME: route_intl_page,
"/resume/index.html": route_resume_index,
"/resume/blob.txt": route_resume,
"/overlap/index.html": route_overlap_index,
"/overlap/flaky.bin": route_overlap,
"/overlap/full.bin": route_overlap_full,
"/size/index.html": route_size_index,
"/size/oversize.bin": route_size_oversize,
"/errpage/index.html": route_errpage_index,
"/errpage/good.html": route_errpage_good,
"/errpage/missing.html": route_errpage_missing,
"/errpage/empty.html": route_errpage_empty,
"/mimex/index.html": route_mimex_index,
"/mimex/blob.pdf": route_mimex_blob,
"/mimex/real.html": route_mimex_real,
}
# --- dispatch ----------------------------------------------------------

View File

@@ -0,0 +1 @@
<html><body>resource A</body></html>

View File

@@ -0,0 +1,5 @@
<html><body>
Two links to one resource, differing only by a tracking parameter.
<a href="a.html?utm_source=x">x</a>
<a href="a.html?utm_source=y">y</a>
</body></html>