Compare commits

14 Commits

Author SHA1 Message Date
Xavier Roche
484fc47eab One bad cache entry ends the whole mirror (#494)
* Reword the "bogus state" cache-skip warnings

Users read "file not stored in cache due to bogus state" as a crash or
cache corruption; it is a benign skip when the transfer is shorter than
the Content-Length. Say what happened, the consequence, and the -%B
override instead, and give the delayed-type variant a plain wording.
Test pins updated; 22_local-broken-size gains set -e (its first crawl's
audit failure was masked by the second crawl's exit status).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Make one bad cache-entry write skippable instead of ending the mirror

Since #426 any new.zip write failure cleanly aborts the whole mirror.
Keep that for storage-level trouble (fatal errno such as ENOSPC, or
every entry failing), but let an isolated failure drop only the current
entry: abandon it, warn with the URL, and keep the mirror and the cache
stream going. A streak of CACHE_MAX_WRITE_FAILURES consecutive failures
still aborts. Also degrade the >2GB assertf crash in cache_add: an
oversized on-disk body is stored headers-only (X-In-Cache: 0), an
in-memory one drops the entry.

The cache-writefail self-test now pins all four regimes (fatal errno,
persistent streak, isolated skip with sibling round-trip, oversize); on
the previous code it fails the new assertions and hits the oversize
assertf (SIGABRT).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* review: pin streak semantics, assert the skip warning, add EDQUOT

Adversarial test audit found the policy's core semantics unpinned: no
test distinguished a consecutive-failure cap from a total count, or
proved a stored entry resets the streak. Phase 1 now asserts the abort
lands exactly on the 8th consecutive failure, and a new phase drives 10
failures interleaved with successes and asserts no abort. The .test now
greps a URL-bearing skip warning. check_fatal_io_errno gains EDQUOT
(quota exhaustion is disk-full for our purposes).
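
The streak semantics pinned above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the code in htscache.c:
write_policy and policy_record are hypothetical names, and only the cap of
8 consecutive failures is taken from this series.

```c
#include <stdbool.h>

/* Hypothetical cap mirroring CACHE_MAX_WRITE_FAILURES (8 in this series). */
#define MAX_WRITE_FAILURES 8

/* Illustrative state: current failure streak and a sticky abort flag. */
typedef struct {
  int streak;
  bool aborted;
} write_policy;

/* Record the outcome of one entry write.
   Returns true when the mirror must abort. */
static bool policy_record(write_policy *p, bool write_ok, bool fatal_errno) {
  if (write_ok) {
    p->streak = 0; /* a stored entry resets the streak */
    return p->aborted;
  }
  p->streak++;
  if (fatal_errno || p->streak >= MAX_WRITE_FAILURES)
    p->aborted = true; /* storage-level trouble: stop the mirror */
  return p->aborted;
}
```

With this shape, seven failures in a row only skip entries, the eighth
aborts, and any number of failures interleaved with successes never
reaches the cap.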

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* review: trim comments to the one-line default

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-05 15:15:07 +02:00
Xavier Roche
abaf9b69a2 Harden the zip cache read path against a corrupt X-Size (#493)
A tampered X-Size in a zip cache entry wrapped into an oversized malloc
(an allocation-size-too-big abort under ASan). The alloc casts to int
(malloct((int) r.size + 1)), so besides a negative size, any positive
value at or above INT_MAX truncates negative and wraps too; reject the
whole out-of-range span before the load.
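
A range check of the kind described above could look like this. It is a
sketch only: xsize_in_range is a hypothetical helper name, and the real
check in the cache read path may be written differently.

```c
#include <limits.h>
#include <stdbool.h>

/* True when `size` can feed an allocation of (int) size + 1 bytes
   without truncating to a negative int or overflowing the + 1.
   Hypothetical helper, not the actual htscache.c code. */
static bool xsize_in_range(long long size) {
  return size >= 0 && size < INT_MAX;
}
```

INT_MAX itself is rejected because adding 1 to it overflows a signed int.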

-#test=cache-corrupt byte-injects a real two-entry zip (bad/oversized/
negative X-Size, blanked X-In-Cache, smashed local header, garbled
deflate) and checks each entry degrades to STATUSCODE_INVALID in the same
read session as its intact sibling, so one corrupt entry never taints the
cache. Only the live zip format is covered; the legacy .dat reader is dead
and slated for removal.

Re-lands the change orphaned when it was merged into its stacked base
branch instead of master.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-05 11:19:34 +02:00
Xavier Roche
b0466b1d7b Cache reconcile policy was triplicated and broken for zip caches (#491)
* Extract the cache generation reconcile policy into hts_cache_reconcile()

The old-vs-new hts-cache generation dance was written out three times:
startup promote (htscoremain.c), interrupted-run keep-larger
(htscoremain.c) and end-of-run rollback (htscore.c), with bare 32768/65536
thresholds. Fold them into one policy function in htscache.c, selected by
mode, with named thresholds.

Behavior-preserving, with one provably-dead branch dropped: the
interrupted-run legacy .dat arm sat in the else of fexist(new.dat) while
itself requiring new.dat, so it could never run.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Fix the reconcile format gates: zip caches were skipped or destroyed

The extracted policies carried format gates mangled when the zip cache
landed:
- promote: the legacy .dat/.ndx arm hid in the else of 'new.zip exists',
  so a pure-legacy cache never got its old generation promoted;
- interrupted-run: the zip arm was gated on fexist(new.dat), never true
  for a modern cache, so the whole site was a no-op;
- rollback: only .dat/.ndx were restored. On a zip cache the restore was
  a no-op, so a transient outage left the thin error cache as new.zip and
  the next run's rotation deleted old.zip: the good generation was lost
  and everything got re-downloaded.

Handle the two formats independently in each mode, and restore the
.lst/.txt sidecars regardless of format.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Test the reconcile policies: engine self-test and dead-server update

-#test=reconcile drives hts_cache_reconcile() over sized file fixtures for
the three modes; the cases covering the fixed gates fail against the old
code (13 checks). 37_local-cache-outage crawls, stops the server, re-runs
the mirror and asserts the previous new.zip comes back byte-identical with
no old.zip left behind (local-crawl.sh grows --rerun-dead for this); it
fails against the old rollback too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Pin healthy-update behavior: the no-data rollback must stay quiet

The audit flagged the rollback trigger (stat_files<=0 && recv<32K) as
misfiring on a legit all-304 small-site update. Probing shows it does not:
stat_files counts cache-carried files too (even -p0 reports them written),
so the trigger is unreachable for any run that scanned links, and only
real failures reach it. No engine change; pin the contract instead:
38_local-update-304 updates a tiny fully-cacheable site (/mini304/, served
through the /big/ 304 validator) and asserts no rollback notice, a
'no files updated' summary, and intact files.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Guard the interrupted-run reconcile against an absent new cache

fsize() returns -1 for a missing file, which passes the 'new < TINY'
size test, so an interrupted-run reconcile could promote a solid old
generation onto an absent new one. That is unreachable today (PROMOTE
runs first and normalizes the missing-new case), but it is a latent
sharp edge a future reordering would expose. Require the new file to
exist before the size comparison, on both the zip and legacy arms.
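
The guard can be pictured as below. This is an assumption-laden sketch:
the function name, the two threshold values and the exact shape of the
size comparison are illustrative, not copied from hts_cache_reconcile().

```c
#include <stdbool.h>

/* Hypothetical thresholds; the real named constants live in htscache.c. */
#define TINY_BYTES 32768LL
#define SOLID_BYTES 65536LL

/* Sizes follow the fsize() convention described above: -1 means missing.
   Returns true when the old generation should replace the new one. */
static bool interrupted_run_keeps_old(long long new_size, long long old_size) {
  if (new_size < 0)
    return false; /* absent new cache: do not let -1 pass the size test */
  return new_size < TINY_BYTES && old_size >= SOLID_BYTES;
}
```

Without the first test, a missing file (-1) would compare as smaller than
any positive threshold and be treated as a tiny cache.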

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-05 10:52:53 +02:00
Xavier Roche
f785286c87 debian: restore the 3.49.10-2 changelog entry omitted from 3.49.11-1 (#490)
The 3.49.10-2 FTBFS-fix upload was made straight to the archive and its
changelog entry was never committed back, so the 3.49.11-1 changelog
jumped from 3.49.10-1 to 3.49.11-1. The BTS derives its version graph
from upload changelogs, so #1140983 (fixed in 3.49.10-2) is counted as
still present in 3.49.11-1 and britney blocks testing migration.

Restore the entry verbatim from the archive; the BTS metadata itself is
corrected separately with a control message (fixed 1140983 3.49.11-1).

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-05 07:49:53 +02:00
Xavier Roche
440a8603a9 mkdeb: suppress the recommended-field lintian tag (stale-local skew) (#489)
trixie's lintian 2.122 still warns that the Priority field #466 dropped
is recommended; the sid lintian in CI accepts the drop. Same stale-local
class as newer-standards-version, and it broke the release gate on the
release host.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-05 00:53:28 +02:00
Xavier Roche
4979e58dc0 mkdeb: don't require Build-Depends locally for a source-only build (#488)
A source package build runs no debhelper, but dpkg-checkbuilddeps still
aborts when the host lags the declared compat (trixie ships debhelper
13.24, the package now wants 14). Pass -d on the source-only path; the
buildds and the --sbuild gate keep enforcing Build-Depends. Hit cutting
3.49.11-1.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-05 00:42:47 +02:00
Xavier Roche
894cf5a8d2 Release 3.49.11 (#487)
Bump the package version (configure.ac, htsglobal.h), VERSION_INFO to
3:3:0 (compatible additions only, soname stays .so.3), add the 3.49-11
history.txt block and the 3.49.11-1 Debian changelog entry.
Standards-Version 4.7.4 is still current (policy 4.7.4.1).

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-05 00:11:21 +02:00
Xavier Roche
0c1aa51385 make check never crawls a site-shaped fixture nor the 304 update path (#486)
* Add a seeded /big/ pseudo-site to the local test server

A deterministic ~360-file tree: 96 pages with 12 rotating pattern
families, sha256-derived asset bodies with honest magic bytes, planted
errors, and a fixed Last-Modified with If-Modified-Since 304 handling
so an update pass can revalidate instead of re-downloading.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Add 36_local-bigcrawl: diverse-crawl and 304-update regression test

Crawl the /big/ fixture plus an update pass: exact file and error
counts, decoy absence, rewrite spot checks, and a pinned 'no files
updated' proving the update pass was revalidation-only (the safety net
for the cache-reconcile rework).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Harden 36_local-bigcrawl per adversarial review

Make the og:image/twitter:image/formaction/ping decoys extensionless so
they stay unfetchable in every aggressive-parser state (today they
survive only because parseall_lastc freezes after the first script tag:
htsparse.c never resets inscript_state_pos on </script>). Close the
audit gaps: pin one per-page image name, forbid unrewritten absolute
hrefs in a tree page, pin the composition of the 4 planted errors, and
bound the mirror size from below (new --min-mirror-bytes audit).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 23:53:17 +02:00
Xavier Roche
fb4267c6d7 Add a -M byte-cap non-regression test (#485)
The -M limit had no test coverage. New /bigfiles/ fixture (8 fast 640KB
files), a --max-mirror-bytes audit in local-crawl.sh, and a crawl that
must log the "giving up" error and stay under the uncapped total.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 15:23:25 +02:00
Xavier Roche
f0b044c2f3 Deduplicate the wait-for-socket macros and the stats-refresh loop tick (#484)
* Fold the twin wait-for-socket macros into one shared function

URLSAVENAME_WAIT_FOR_AVAILABLE_SOCKET (htsname.c) and
WAIT_FOR_AVAILABLE_SOCKET (htsparse.h) were byte-for-byte duplicates,
both carrying the #481 checkmirror break. Replace them with a single
hts_wait_available_socket() in htscore.c; the callback-abort path still
returns -1 from the callers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Extract the duplicated stats-refresh loop tick into hts_loop_tick()

The wait-for-socket macro body (stats refresh + loop callback) also
existed open-coded at six more sites across htsname.c and htsparse.c,
differing only in the slot index passed to the callback and the abort
action. Collapse all of them onto a shared hts_loop_tick(); each caller
keeps its own abort path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 15:17:48 +02:00
Xavier Roche
dfafe28002 Enforce the -E time limit inside the transfer wait cycle (#482)
-E was only evaluated at per-link boundaries, so a slow or throttling
server starved the check for minutes, and the smooth stop it finally
requested drained the remaining transfers at server pace with no bound.
back_wait now checks the deadline every cycle and, once a short grace
period expires, aborts the in-flight HTTP transfers like the -T timeout
path does (FTP slots stay with their owning thread). back_checkmirror's
0 return, previously dead, now carries the hard stop.
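
For reference, the grace arithmetic used by this change (a tenth of the
limit, clamped to between 5 and 30 seconds, per back_maxtime_grace in the
diff) can be restated as a standalone sketch. The function names here are
illustrative, and hard_stop_due returns nonzero for the hard stop, whereas
back_checkmirror signals it by returning 0.

```c
/* Grace in seconds: maxtime / 10, clamped to the range [5, 30]. */
static int maxtime_grace(int maxtime) {
  int grace = maxtime / 10;
  if (grace > 30)
    grace = 30;
  if (grace < 5)
    grace = 5;
  return grace;
}

/* Nonzero once the -E deadline has overrun its grace period. */
static int hard_stop_due(long elapsed, int maxtime) {
  return maxtime > 0 && elapsed >= maxtime &&
         elapsed - maxtime >= maxtime_grace(maxtime);
}
```

So a 100-second limit leaves 10 seconds for the smooth stop, a 10-second
limit leaves 5, and an hour-long limit leaves 30.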

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:13:33 +02:00
Xavier Roche
a3f04bde72 Sniff magic bytes before the wire type renames a URL extension (#478)
A URL whose extension maps to a specific type but is served with a
disagreeing specific Content-Type was always renamed after the wire
(photo.jpg served as image/png became photo.png). The contested
verdict (#480) is now settled by the leading body bytes: magic proving
the extension's own type keeps it, anything inconclusive trusts the
wire as before, and the #267 soft-404 guard is unchanged.
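
The tie-break can be illustrated with two well-known signatures (PNG and
JPEG). This is a simplified sketch, not the htssniff.c table: the enum and
function names are hypothetical, and the real sniffer covers many more
types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { TYPE_UNKNOWN, TYPE_PNG, TYPE_JPEG } sniffed_type;

/* Identify a body head by its magic bytes (PNG and JPEG only here). */
static sniffed_type sniff_head(const unsigned char *head, size_t len) {
  static const unsigned char png_magic[8] = {0x89, 'P',  'N',  'G',
                                             0x0D, 0x0A, 0x1A, 0x0A};
  static const unsigned char jpeg_magic[3] = {0xFF, 0xD8, 0xFF};
  if (len >= sizeof(png_magic) &&
      memcmp(head, png_magic, sizeof(png_magic)) == 0)
    return TYPE_PNG;
  if (len >= sizeof(jpeg_magic) &&
      memcmp(head, jpeg_magic, sizeof(jpeg_magic)) == 0)
    return TYPE_JPEG;
  return TYPE_UNKNOWN;
}

/* Contested verdict: keep the URL extension only when the magic proves the
   extension's own type; anything else trusts the wire type. */
static bool contested_keeps_ext(sniffed_type ext_type,
                                const unsigned char *head, size_t len) {
  return ext_type != TYPE_UNKNOWN && sniff_head(head, len) == ext_type;
}
```

A photo.jpg whose body starts with the JPEG signature keeps its name even
when served as image/png; a head that is too short or unrecognized falls
back to the wire type.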

New htssniff.c covers the magic-sniffable part of the supported MIME
set (images, A/V containers by RIFF subtype and ftyp brand, zip/OLE
document containers, archives, fonts, conservative text prefixes).
hts_wait_delayed waits for a sniffable head (or EOF) only on contested
verdicts; the head is read from the live backing slot (memory,
url_sav, or the compressed-stream tmpfile, inflated in memory). Update
runs never re-read bytes: they reproduce the previous run's verdict
from the recorded X-Save name (cache_read_including_broken grows a
return_save), so names never churn across updates or upgrades.
Non-delayed mode never sniffs; its HEAD probe has no body on the
first run. Also unlock the waiter's slot on the user-cancel abort.

Tests flip the #480 contract pins to the sniffed outcomes (wrongtype/
bigtype/packed/mutant keep their extension, lie.png stays png), add
-#test=sniff table rows, and pin the recorded-verdict proxy in
01_zlib-savename-cached (kept out of the MSan job: uninstrumented
zlib). All discriminate against the pre-sniff binary.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 10:05:41 +02:00
Xavier Roche
11beef52e1 Name the contested case in extension naming and pin the current contract (#480)
* Make the wire-type-vs-extension naming decision an explicit verdict

Behavior-preserving refactor of wire_patches_ext: the decision becomes
a three-way wire_ext_verdict (ext kept / wire wins / contested), with
the contested case, a specific declared type disagreeing with a
specific URL extension, named explicitly instead of falling through.
Today a contested verdict trusts the wire, unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Pin the naming contract: knobs and fixtures for content-independent naming

-#test=savename gains body= (leading body bytes via a temp url_sav
file) and cached= (a real one-entry cache, reopened read-only, whose
stored body is PNG magic); new rows and 01_zlib-savename-cached.test
pin that naming never depends on content or on the previously recorded
save name, only on headers. e2e fixtures (wrongtype.jpg served as
image/png, a gzip variant, a 16 KiB body, content that changes between
crawls) pin the wire-wins outcome across fresh and update passes. Any
future content-based tie-break must flip these rows explicitly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 09:13:47 +02:00
Xavier Roche
d7c4eab1f5 Delayed type check finalizes fast transfers under their .delayed placeholder name (#479)
* Fix delayed-slot races: lock guards and finalize order

A slot whose savename is still the .delayed placeholder could be
finalized and recorded under that transient name when the transfer
completed before hts_wait_delayed resolved the type (fast servers):
back_clean ignored the lock hts_wait_delayed holds, the direct-to-disk
shortcut bound the output file to the placeholder name, and the
type-known finalize ran before url_sav was patched, caching the entry
under the wrong name (the 'bogus state (incomplete type)' warnings).
Skip locked slots in back_clean and in the disk shortcut, and patch
url_sav before finalizing.

The delayed-naming body sniff (next commit) waits for body bytes and
would hit these races deterministically; 15_local-types and
18_local-update cover the flow end to end.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Exercise the degenerate delayed-type paths; audit .delayed leftovers everywhere

New delayed/ fixtures (302 without Location, self-loop, a chain one hop
past the redirect budget, a proper redirect, a typeless unknown-ext
file) with 33_local-delayed.test, run twice via --rerun. local-crawl.sh
now fails ANY local test whose finished mirror contains a *.delayed
temporary, guarding the whole suite against the #107 class.

Probing the degenerate paths settled a review question: afs->save can
still be a fresh placeholder when url_sav is patched (redirects that
never resolve), but nothing name-keyed is stored there -- non-OK
entries are cached header-only with an empty X-Save, the link is
dropped, no file is left. The test pins exactly that.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Empty responses finalized delayed slots under the .delayed name

Found while probing for a harness trigger of the finalize race: a
delayed-type URL answered 200 with Content-Length: 0 takes the
empty-response branch, which reaches READY inside the header round and
calls back_finalize directly, while url_sav still holds the .delayed
placeholder. Deterministic on any such URL: the 'bogus state
(incomplete type)' warning and an entry recorded under the placeholder
name. Empty responses are common in the wild (beacons, placeholder
files), so this is the reproducible face of #5.

Defer the finalize when the slot is locked, matching the other guards:
the waiter finalizes right after patching url_sav. The new empty.php
fixture makes 33_local-delayed fail on master and pass here; the
timing race itself remains untriggerable from the harness (headers and
body advance in separate back_wait rounds; probe notes in the PR).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 08:37:14 +02:00
41 changed files with 2890 additions and 505 deletions

.gitignore

@@ -39,3 +39,6 @@ Makefile
 # Editor / autotools backup files.
 *~
+# Python bytecode (tests/local-server.py).
+__pycache__/

@@ -1,6 +1,6 @@
 AC_PREREQ([2.71])
-AC_INIT([httrack], [3.49.10], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
+AC_INIT([httrack], [3.49.11], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
 AC_COPYRIGHT([
 HTTrack Website Copier, Offline Browser for Windows and Unix
 Copyright (C) 1998-2015 Xavier Roche and other contributors
@@ -29,10 +29,11 @@ AC_CONFIG_SRCDIR(src/httrack.c)
 AC_CONFIG_MACRO_DIR([m4])
 AC_CONFIG_HEADERS(config.h)
 AM_INIT_AUTOMAKE([subdir-objects])
-# 3:2:0: 3.49.10 only appends tail fields to the options struct (no existing
-# symbol or offset changed vs 3.49.9), so it stays soname .so.3; bump revision.
+# 3:3:0: 3.49.11 only adds enum values, macros and inline helpers to the
+# installed headers (no struct layout or exported signature changed vs
+# 3.49.10), so it stays soname .so.3; bump revision.
 # (3:0:0 was the htsblk mime-buffer widening, the ABI break that moved .so.2 -> .so.3.)
-VERSION_INFO="3:2:0"
+VERSION_INFO="3:3:0"
 AM_MAINTAINER_MODE
 AC_USE_SYSTEM_EXTENSIONS

debian/changelog

@@ -1,3 +1,24 @@
+httrack (3.49.11-1) unstable; urgency=medium
+* New upstream release: crawl correctness and security fixes (network-facing
+buffer overflows, file-type detection, redirect handling) and modernized
+web defaults; full list in history.txt.
+* Add DEP-12 upstream metadata (#466).
+* Bump debhelper compat to 14 (#466).
+* Drop the redundant Priority field and update the NMU lintian override to
+the current tag names (#466).
+-- Xavier Roche <xavier@debian.org> Sun, 05 Jul 2026 00:03:18 +0200
+httrack (3.49.10-2) unstable; urgency=medium
+* Fix FTBFS: tests/28_local-pause failed instead of skipping when python3 is
+absent (the local-server tests need python3, which the buildds lack). Add
+patches/skip-local-pause-test-without-python3.patch to guard the test on
+python3 up front, like its siblings, so it skips cleanly.
+-- Xavier Roche <xavier@debian.org> Sun, 28 Jun 2026 20:18:46 +0200
 httrack (3.49.10-1) unstable; urgency=medium
 * New upstream release: new download-pacing and URL-handling options plus a

@@ -4,6 +4,23 @@ HTTrack Website Copier release history:
 This file lists all changes and fixes that have been made for HTTrack
+3.49-11
++ New: parse robots.txt Allow rules and path wildcards per RFC 9309 (#452)
++ New: advertise deflate in Accept-Encoding and decode deflate responses (#450)
++ New: follow <source> and <track> media elements as embedded links (#451)
++ New: added modern web MIME types to the type/extension table (#448)
++ Fixed: enforce the -E time limit during a slow transfer instead of only between files (#481)
++ Fixed: sniff the leading bytes of a download so a misdeclared Content-Type no longer renames a correct URL extension
++ Fixed: fast transfers could be saved under their temporary .delayed placeholder name (#5, #107)
++ Fixed: follow a redirect that maps to the same saved file instead of writing a self-pointing stub (#159)
++ Fixed: several network-facing buffer overflows in the FTP, Java and HTML parsers
++ Fixed: the htsjava plugin could not be loaded (hidden entry points, stale library name)
++ Fixed: HTML-escape truncation and a cache-buffer leak in the parser
++ Changed: modernized the default User-Agent to an honest HTTrack identifier (#449)
++ Changed: decode the full WHATWG set of HTML named character references (#443)
++ Changed: refreshed stale HTTP status, proxy-port and TLS-floor constants (#453)
++ Changed: multiple internal hardening, build, test and CI improvements
 3.49-10
 + New: --cookies-file to preload a Netscape cookies.txt before crawling (#215)
 + New: --pause to space out file downloads by a random delay (#185)

@@ -62,7 +62,7 @@ libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
 htsname.c htsrobots.c htstools.c htswizard.c \
 htsalias.c htsthread.c htsindex.c htsbauth.c \
 htsmd5.c htszlib.c htswrap.c htsconcat.c \
-htsmodules.c htscharset.c punycode.c htsencoding.c \
+htsmodules.c htscharset.c punycode.c htsencoding.c htssniff.c \
 md5.c \
 minizip/ioapi.c minizip/mztools.c minizip/unzip.c minizip/zip.c \
 hts-indextmpl.h htsalias.h htsback.h htsbase.h htssafe.h \
@@ -70,7 +70,7 @@ libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
 htsconfig.h htscore.h htsparse.h htscoremain.h htsdefines.h \
 htsfilters.h htsftp.h htsglobal.h htshash.h coucal/coucal.h \
 htshelp.h htsindex.h htslib.h htsmd5.h \
-htsmodules.h htsname.h htsnet.h \
+htsmodules.h htsname.h htsnet.h htssniff.h \
 htsopt.h htsrobots.h htsthread.h \
 htstools.h htswizard.h htswrap.h htszlib.h \
 htsstrings.h htsarrays.h httrack-library.h \

@@ -572,9 +572,12 @@ int back_finalize(httrackp * opt, cache_back * cache, struct_back * sback,
 && back[p].r.size != back[p].r.totalsize && !opt->tolerant) {
 if (back[p].status == STATUS_READY) {
 hts_log_print(opt, LOG_WARNING,
-"file not stored in cache due to bogus state (broken size, expected "
-LLintP " got " LLintP "): %s%s", back[p].r.totalsize,
-back[p].r.size, back[p].url_adr, back[p].url_fil);
+"incomplete transfer (expected " LLintP
+" bytes, got " LLintP
+"): file not cached, will be retried on the next update"
+" (use -%%B to cache anyway): %s%s",
+back[p].r.totalsize, back[p].r.size, back[p].url_adr,
+back[p].url_fil);
 } else {
 hts_log_print(opt, LOG_INFO,
 "incomplete file not yet stored in cache (expected "
@@ -879,11 +882,12 @@ int back_finalize(httrackp * opt, cache_back * cache, struct_back * sback,
 back[p].url_fil, NULL);
 } else {
 /* Partial file, but marked as "ok" ? */
-hts_log_print(opt, LOG_WARNING,
-"file not stored in cache due to bogus state (incomplete type with %s (%d), size "
-LLintP "): %s%s", back[p].r.msg, back[p].r.statuscode,
-(LLint) back[p].r.size, back[p].url_adr,
-back[p].url_fil);
+hts_log_print(
+opt, LOG_WARNING,
+"file with unresolved type not cached (%s (%d), size " LLintP
+"): %s%s",
+back[p].r.msg, back[p].r.statuscode, (LLint) back[p].r.size,
+back[p].url_adr, back[p].url_fil);
 }
 }
@@ -1359,6 +1363,18 @@ int back_flush_output(httrackp * opt, cache_back * cache, struct_back * sback,
 }
 // effacer entrée
+/* Discard a cancelled mid-write .delayed placeholder (unusable across runs). */
+static void back_delayed_discard(httrackp *opt, lien_back *back) {
+if (back->r.out != NULL) {
+fclose(back->r.out);
+back->r.out = NULL;
+}
+back->r.is_write = 0;
+if (opt != NULL)
+url_savename_refname_remove(opt, back->url_adr, back->url_fil);
+(void) UNLINK(back->url_sav);
+}
 int back_delete(httrackp * opt, cache_back * cache, struct_back * sback,
 const int p) {
 lien_back *const back = sback->lnk;
@@ -1366,6 +1382,12 @@ int back_delete(httrackp * opt, cache_back * cache, struct_back * sback,
 assertf(p >= 0 && p < back_max);
 if (p >= 0 && p < sback->count) { // on sait jamais..
+/* mid-write cancel: drop a .delayed placeholder; real-named partials
+survive for resume (--continue) */
+if (back[p].r.is_write && IS_DELAYED_EXT(back[p].url_sav) &&
+(back[p].status != STATUS_READY || back[p].r.statuscode <= 0)) {
+back_delayed_discard(opt, &back[p]);
+}
 // Vérificateur d'intégrité
 #if DEBUG_CHECKINT
 _CHECKINT(&back[p], "Appel back_delete")
@@ -2237,12 +2259,13 @@ int host_wait(httrackp * opt, lien_back * back) {
 static int slot_can_be_cleaned(const lien_back * back) {
 return (back->status == STATUS_READY) // ready
-/* Check autoclean */
-&& (!back->testmode) // not test mode
-&& (strnotempty(back->url_sav)) // filename exists
-&& (HTTP_IS_OK(back->r.statuscode)) // HTTP "OK"
-&& (back->r.size >= 0) // size>=0
-;
+/* Check autoclean */
+&& (!back->locked) // not held by hts_wait_delayed (name pending)
+&& (!back->testmode) // not test mode
+&& (strnotempty(back->url_sav)) // filename exists
+&& (HTTP_IS_OK(back->r.statuscode)) // HTTP "OK"
+&& (back->r.size >= 0) // size>=0
+;
 }
 static int slot_can_be_finalized(httrackp * opt, const lien_back * back) {
@@ -2418,6 +2441,34 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 back_clean(opt, cache, sback);
 #endif
+/* Time limit exceeded past grace: abort in-flight transfers so no wait loop
+starves (#481). FTP slots stay, their thread owns the socket. */
+if (!back_checkmirror(opt)) {
+int aborted = 0;
+unsigned int i;
+for (i = 0; i < (unsigned int) back_max; i++) {
+if (back[i].status > 0 && back[i].status < STATUS_FTP_TRANSFER) {
+if (back[i].r.soc != INVALID_SOCKET) {
+deletehttp(&back[i].r);
+}
+back[i].r.soc = INVALID_SOCKET;
+/* drop a .delayed placeholder; real partials survive for resume */
+if (back[i].r.is_write && IS_DELAYED_EXT(back[i].url_sav))
+back_delayed_discard(opt, &back[i]);
+back[i].r.statuscode = STATUSCODE_TIMEOUT;
+strcpybuff(back[i].r.msg, "Mirror Time Out");
+back[i].status = STATUS_READY;
+back_set_finished(sback, i);
+aborted++;
+}
+}
+if (aborted > 0)
+hts_log_print(opt, LOG_WARNING,
+"time limit reached, %d transfer(s) aborted", aborted);
+return;
+}
 // recevoir tant qu'il y a des données (avec un maximum de max_loop boucles)
 do_wait = 0;
 gestion_timeout = 0;
@@ -2891,10 +2942,10 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 // range size hack old location
 #if HTS_DIRECTDISK
-// Court-circuit:
-// Peut-on stocker le fichier directement sur disque?
-// Ahh que ca serait vachement mieux et que ahh que la mémoire vous dit merci!
-if (back[i].status) {
+// Shortcut: store the file directly on disk when possible,
+// sparing memory
+if (back[i].status &&
+!back[i].locked) { // name still pending when locked
 if (back[i].r.is_write == 0) { // mode mémoire
 if (back[i].r.adr == NULL) { // rien n'a été écrit
 if (!back[i].testmode) { // pas mode test
@@ -3960,8 +4011,12 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 && (back[i].r.adr = (char *) malloct(2))) {
 back[i].r.adr[0] = 0;
 }
-hts_log_print(opt, LOG_TRACE, "finalizing empty");
-back_finalize(opt, cache, sback, i);
+/* locked = name pending; the waiter finalizes after
+patching url_sav (else: cached as .delayed, #5) */
+if (!back[i].locked) {
+hts_log_print(opt, LOG_TRACE, "finalizing empty");
+back_finalize(opt, cache, sback, i);
+}
 } else if (!back[i].r.is_chunk) { // pas de chunk
 //if (back[i].r.http11!=2) { // pas de chunk
 back[i].is_chunk = 0;
@@ -4159,6 +4214,11 @@ int back_checksize(httrackp * opt, lien_back * eback, int check_only_totalsize)
 return 1;
 }
+/* Grace left to the smooth stop before in-flight transfers are aborted. */
+static int back_maxtime_grace(const int maxtime) {
+return maximum(5, minimum(30, maxtime / 10));
+}
 int back_checkmirror(httrackp * opt) {
 // Check max size
 if ((opt->maxsite > 0) && (HTS_STAT.stat_bytes >= opt->maxsite)) {
@@ -4175,13 +4235,19 @@ int back_checkmirror(httrackp * opt) {
 */
 }
 // Check max time
-if ((opt->maxtime > 0)
-&& ((time_local() - HTS_STAT.stat_timestart) >= opt->maxtime)) {
-if (!opt->state.stop) { /* not yet stopped */
-hts_log_print(opt, LOG_ERROR, "More than %d seconds passed.. giving up",
-opt->maxtime);
-/* cancel mirror smoothly */
-hts_request_stop(opt, 0);
+if (opt->maxtime > 0) {
+const TStamp elapsed = time_local() - HTS_STAT.stat_timestart;
+if (elapsed >= opt->maxtime) {
+if (!opt->state.stop) { /* not yet stopped */
+hts_log_print(opt, LOG_ERROR, "More than %d seconds passed.. giving up",
+opt->maxtime);
+/* cancel mirror smoothly */
+hts_request_stop(opt, 0);
+}
+/* smooth stop starved past the grace period: stop waiting (#481) */
+if (elapsed - opt->maxtime >= back_maxtime_grace(opt->maxtime))
+return 0;
 }
 }
 return 1; /* Ok, go on */

@@ -136,6 +136,8 @@ void back_solve(httrackp * opt, lien_back * sback);
 int host_wait(httrackp * opt, lien_back * sback);
 #endif
 int back_checksize(httrackp * opt, lien_back * eback, int check_only_totalsize);
+/* Enforce -M/-E quotas: requests a smooth stop when reached; returns 0 once
+the -E deadline overran its grace period (callers must stop waiting). */
 int back_checkmirror(httrackp * opt);
 #endif

@@ -40,6 +40,7 @@ Please visit our Website: http://www.httrack.com
 #include "htscore.h"
 #include "htsbasenet.h"
 #include "htsmd5.h"
+#include <limits.h>
 #include <time.h>
 #include "htszlib.h"
@@ -220,23 +221,38 @@ struct cache_back_zip_entry {
} \
} while(0)
/* A cache (new.zip) write failed: storage is gone (disk full / dropped share),
so the mirror is doomed too. Abort it via exit_xh, don't crash as assertf
did. */
/* Consecutive entry write failures before the cache stream is declared dead. */
#define CACHE_MAX_WRITE_FAILURES 8
/* Cache write failed: a fatal errno or a failure streak aborts the mirror
(exit_xh); an isolated failure only drops the current entry. */
static void cache_zip_write_failed(httrackp *opt, cache_back *cache,
const char *what, int zErr) {
if (!cache->zipWriteFailed) {
cache->zipWriteFailed = HTS_TRUE;
if (check_fatal_io_errno()) {
hts_log_print(opt, LOG_ERROR,
"Mirror aborted: disk full or filesystem problems");
} else {
hts_log_print(opt, LOG_ERROR,
"Mirror aborted: cache write failed (%s): %s", what,
hts_get_zerror(zErr));
const char *what, int zErr,
hts_boolean entry_open, const char *url_adr,
const char *url_fil) {
const int fatal_errno = zErr == ZIP_ERRNO && check_fatal_io_errno();
cache->zipWriteFailures++;
if (fatal_errno || cache->zipWriteFailures >= CACHE_MAX_WRITE_FAILURES) {
if (!cache->zipWriteFailed) {
cache->zipWriteFailed = HTS_TRUE;
if (fatal_errno) {
hts_log_print(opt, LOG_ERROR,
"Mirror aborted: disk full or filesystem problems");
} else {
hts_log_print(opt, LOG_ERROR,
"Mirror aborted: cache write failed (%s): %s", what,
hts_get_zerror(zErr));
}
}
opt->state.exit_xh = -1; /* fatal: stop the mirror, exit non-zero */
} else {
if (entry_open)
zipCloseFileInZip((zipFile) cache->zipOutput); /* abandon, best-effort */
hts_log_print(opt, LOG_WARNING,
"cache write failed (%s: %s), entry not cached: %s%s", what,
hts_get_zerror(zErr), url_adr, url_fil);
}
opt->state.exit_xh = -1; /* fatal: stop the mirror, exit non-zero */
}
/* Add a file to the cache */
@@ -286,10 +302,19 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
if (r->size < 0) // error
return;
// data in cache
if (dataincache) {
assertf(((int) r->size) == r->size);
//entryBodySize = (int) r->size;
// data in cache: the body must fit the 32-bit zip write API
if (dataincache && (LLint) (int) r->size != r->size) {
if (r->is_write && url_save != NULL && strnotempty(url_save)) {
hts_log_print(opt, LOG_WARNING,
"file too large for the cache, storing headers only: %s%s",
url_adr, url_fil);
dataincache = 0;
} else {
hts_log_print(opt, LOG_WARNING,
"entry too large for the cache, not cached: %s%s", url_adr,
url_fil);
return;
}
}
/* Fields */
@@ -369,7 +394,8 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
*/
headers, (uInt) strlen(headers), NULL, 0, NULL, /* comment */
Z_DEFLATED, Z_DEFAULT_COMPRESSION)) != Z_OK) {
cache_zip_write_failed(opt, cache, "opening a cache entry", zErr);
cache_zip_write_failed(opt, cache, "opening a cache entry", zErr, HTS_FALSE,
url_adr, url_fil);
return;
}
@@ -380,7 +406,8 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
if ((zErr =
zipWriteInFileInZip((zipFile) cache->zipOutput, r->adr,
(int) r->size)) != Z_OK) {
cache_zip_write_failed(opt, cache, "writing to the cache", zErr);
cache_zip_write_failed(opt, cache, "writing to the cache", zErr,
HTS_TRUE, url_adr, url_fil);
return;
}
}
@@ -402,8 +429,8 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
if ((zErr =
zipWriteInFileInZip((zipFile) cache->zipOutput, buff,
(int) nl)) != Z_OK) {
cache_zip_write_failed(opt, cache, "writing to the cache",
zErr);
cache_zip_write_failed(opt, cache, "writing to the cache", zErr,
HTS_TRUE, url_adr, url_fil);
fclose(fp);
return;
}
@@ -419,15 +446,19 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
/* Close */
if ((zErr = zipCloseFileInZip((zipFile) cache->zipOutput)) != Z_OK) {
cache_zip_write_failed(opt, cache, "closing a cache entry", zErr);
cache_zip_write_failed(opt, cache, "closing a cache entry", zErr, HTS_FALSE,
url_adr, url_fil);
return;
}
/* Flush */
if ((zErr = zipFlush((zipFile) cache->zipOutput)) != 0) {
cache_zip_write_failed(opt, cache, "flushing the cache", zErr);
cache_zip_write_failed(opt, cache, "flushing the cache", zErr, HTS_FALSE,
url_adr, url_fil);
return;
}
cache->zipWriteFailures = 0; /* entry stored: reset the failure streak */
}
#else
@@ -596,15 +627,18 @@ htsblk cache_read_ro(httrackp * opt, cache_back * cache, const char *adr,
return cache_readex(opt, cache, adr, fil, save, location, NULL, 1);
}
htsblk cache_read_including_broken(httrackp * opt, cache_back * cache,
const char *adr, const char *fil) {
htsblk r = cache_read(opt, cache, adr, fil, NULL, NULL);
htsblk cache_read_including_broken(httrackp *opt, cache_back *cache,
const char *adr, const char *fil,
char *return_save) {
htsblk r = cache_readex(opt, cache, adr, fil, NULL, NULL, return_save, 0);
if (r.statuscode == -1) {
lien_back *itemback = NULL;
if (back_unserialize_ref(opt, adr, fil, &itemback) == 0) {
r = itemback->r;
if (return_save != NULL)
strlcpybuff(return_save, itemback->url_sav, HTS_URLMAXSIZE * 2);
/* cleanup */
back_clear_entry(itemback); /* delete entry content */
freet(itemback); /* delete item */
@@ -765,6 +799,15 @@ static htsblk cache_readex_new(httrackp * opt, cache_back * cache,
strlcpybuff(return_save, previous_save, HTS_URLMAXSIZE * 2);
}
/* A tampered X-Size must be rejected before the size-driven malloc.
The alloc casts to int (malloct((int) r.size + 1)), so bound it to
[0, INT_MAX): a negative value, or a positive one whose (int) cast
truncates negative, would otherwise wrap to a huge allocation. */
if (r.size < 0 || r.size >= INT_MAX) {
r.statuscode = STATUSCODE_INVALID;
strcpybuff(r.msg, "Cache Read Error : Bad Size");
}
/* Complete fields */
r.totalsize = r.size;
r.adr = NULL;
@@ -791,7 +834,8 @@ static htsblk cache_readex_new(httrackp * opt, cache_back * cache,
} // otherwise, the ZIP file is supposed to be consistent with data.
}
/* Read data ? */
else { /* not a header-only read */
else if (r.statuscode !=
STATUSCODE_INVALID) { /* not a header-only read */
int ok = 0;
#if HTS_DIRECTDISK
@@ -1417,6 +1461,86 @@ static int hts_rename(httrackp * opt, const char *a, const char *b) {
return rename(a, b);
}
/* Pathname of a file inside the mirror dir (rotating concat buffer). */
static char *reconcile_path(httrackp *opt, const char *name) {
return fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log), name);
}
/* Interrupted-run heuristic: prefer the old generation when the new cache
stalled below NEW_TINY while the old one grew past OLD_SOLID (historical
arbitrary thresholds). */
#define CACHE_RECONCILE_NEW_TINY 32768
#define CACHE_RECONCILE_OLD_SOLID 65536
/* Replace the new-generation file by the old one, when the old one exists. */
static void reconcile_promote(httrackp *opt, const char *oldname,
const char *newname) {
if (fexist(reconcile_path(opt, oldname))) {
remove(reconcile_path(opt, newname));
rename(reconcile_path(opt, oldname), reconcile_path(opt, newname));
}
}
void hts_cache_reconcile(httrackp *opt, hts_cache_reconcile_mode mode) {
switch (mode) {
case CACHE_RECONCILE_PROMOTE:
/* Previous run rotated new.* to old.* then died before writing: promote
the old generation back, whichever format it uses. */
if (!fexist(reconcile_path(opt, "hts-cache/new.zip")))
reconcile_promote(opt, "hts-cache/old.zip", "hts-cache/new.zip");
if ((!fexist(reconcile_path(opt, "hts-cache/new.dat")) ||
!fexist(reconcile_path(opt, "hts-cache/new.ndx"))) &&
fexist(reconcile_path(opt, "hts-cache/old.dat")) &&
fexist(reconcile_path(opt, "hts-cache/old.ndx"))) {
reconcile_promote(opt, "hts-cache/old.dat", "hts-cache/new.dat");
reconcile_promote(opt, "hts-cache/old.ndx", "hts-cache/new.ndx");
}
break;
case CACHE_RECONCILE_INTERRUPTED:
/* Aborted run: keep the larger generation when the new cache is
suspiciously small next to the old one. The new file must exist: fsize()
is -1 for a missing file, which would spuriously pass the "< TINY" test
and overwrite a solid old generation that PROMOTE/ROLLBACK should keep.
*/
if (!opt->cache || !fexist(reconcile_path(opt, "hts-in_progress.lock")))
break;
if (fexist(reconcile_path(opt, "hts-cache/new.zip")) &&
fexist(reconcile_path(opt, "hts-cache/old.zip")) &&
fsize(reconcile_path(opt, "hts-cache/new.zip")) <
CACHE_RECONCILE_NEW_TINY &&
fsize(reconcile_path(opt, "hts-cache/old.zip")) >
CACHE_RECONCILE_OLD_SOLID &&
fsize(reconcile_path(opt, "hts-cache/old.zip")) >
fsize(reconcile_path(opt, "hts-cache/new.zip")))
reconcile_promote(opt, "hts-cache/old.zip", "hts-cache/new.zip");
if (fexist(reconcile_path(opt, "hts-cache/new.dat")) &&
fexist(reconcile_path(opt, "hts-cache/old.dat")) &&
fexist(reconcile_path(opt, "hts-cache/old.ndx")) &&
fsize(reconcile_path(opt, "hts-cache/new.dat")) <
CACHE_RECONCILE_NEW_TINY &&
fsize(reconcile_path(opt, "hts-cache/old.dat")) >
CACHE_RECONCILE_OLD_SOLID &&
fsize(reconcile_path(opt, "hts-cache/old.dat")) >
fsize(reconcile_path(opt, "hts-cache/new.dat"))) {
reconcile_promote(opt, "hts-cache/old.dat", "hts-cache/new.dat");
reconcile_promote(opt, "hts-cache/old.ndx", "hts-cache/new.ndx");
}
break;
case CACHE_RECONCILE_ROLLBACK:
/* Nothing transferred: restore the previous generation and sidecars. */
reconcile_promote(opt, "hts-cache/old.zip", "hts-cache/new.zip");
if (fexist(reconcile_path(opt, "hts-cache/old.dat")) &&
fexist(reconcile_path(opt, "hts-cache/old.ndx"))) {
reconcile_promote(opt, "hts-cache/old.dat", "hts-cache/new.dat");
reconcile_promote(opt, "hts-cache/old.ndx", "hts-cache/new.ndx");
}
reconcile_promote(opt, "hts-cache/old.lst", "hts-cache/new.lst");
reconcile_promote(opt, "hts-cache/old.txt", "hts-cache/new.txt");
break;
}
}
// return the header only, or NULL on error
// return NULL upon error, and set -1 to r.statuscode
htsblk *cache_header(httrackp * opt, cache_back * cache, const char *adr,

@@ -66,8 +66,11 @@ htsblk cache_read(httrackp * opt, cache_back * cache, const char *adr,
const char *fil, const char *save, char *location);
htsblk cache_read_ro(httrackp * opt, cache_back * cache, const char *adr,
const char *fil, const char *save, char *location);
htsblk cache_read_including_broken(httrackp * opt, cache_back * cache,
const char *adr, const char *fil);
/* Like cache_read, but also yields entries whose transfer broke; return_save
(optional, HTS_URLMAXSIZE*2) receives the entry's recorded save name. */
htsblk cache_read_including_broken(httrackp *opt, cache_back *cache,
const char *adr, const char *fil,
char *return_save);
htsblk cache_readex(httrackp * opt, cache_back * cache, const char *adr,
const char *fil, const char *save, char *location,
char *return_save, int readonly);
@@ -75,6 +78,17 @@ htsblk *cache_header(httrackp * opt, cache_back * cache, const char *adr,
const char *fil, htsblk * r);
void cache_init(cache_back * cache, httrackp * opt);
/* Which hts-cache/ generation (new.* vs old.*) is authoritative. */
typedef enum {
CACHE_RECONCILE_PROMOTE, /* no new cache: promote the old generation */
CACHE_RECONCILE_INTERRUPTED, /* aborted run: keep the larger generation */
CACHE_RECONCILE_ROLLBACK /* nothing transferred: restore the old one */
} hts_cache_reconcile_mode;
/* Reconcile the on-disk cache generations according to mode; a no-op when
the involved files are absent. */
void hts_cache_reconcile(httrackp *opt, hts_cache_reconcile_mode mode);
int cache_writedata(FILE * cache_ndx, FILE * cache_dat, const char *str1,
const char *str2, char *outbuff, int len);
int cache_readdata(cache_back * cache, const char *str1, const char *str2,
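
The size rule behind `CACHE_RECONCILE_INTERRUPTED` can be sketched as a pure predicate. This is a minimal sketch under stated assumptions: `prefer_old_generation` is a hypothetical helper, sizes are passed in directly instead of being read with `fsize()`, and the lock-file and `opt->cache` preconditions of the real function are left out.

```c
#include <assert.h>

#define NEW_TINY 32768L  /* mirrors CACHE_RECONCILE_NEW_TINY */
#define OLD_SOLID 65536L /* mirrors CACHE_RECONCILE_OLD_SOLID */

/* Sizes are in bytes; -1 means the file is missing, as fsize() reports it.
   Returns 1 when the old generation should replace the new one. */
static int prefer_old_generation(long new_size, long old_size) {
  assert(new_size >= -1 && old_size >= -1);
  if (new_size < 0 || old_size < 0)
    return 0; /* both generations must exist */
  return new_size < NEW_TINY && old_size > OLD_SOLID && old_size > new_size;
}
```

The explicit existence check matters because a missing file's size of -1 would otherwise satisfy the "new cache is tiny" comparison.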

@@ -48,6 +48,7 @@ Please visit our Website: http://www.httrack.com
#include "htszlib.h"
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
@@ -321,6 +322,7 @@ typedef struct {
size_t budget; /**< bytes allowed through before writes start failing */
int fail_errno; /**< errno set on the failing write (ENOSPC, EIO, ...) */
int writes; /**< zwrite call count, to detect re-entry into the stream */
int fail_once; /**< recover (unlimited budget) after the first failure */
} writefail_inject;
/* zwrite that copies until the budget runs out, then fails with inj->fail_errno
@@ -335,6 +337,8 @@ static uLong selftest_failing_zwrite(voidpf opaque, voidpf stream,
inj->budget -= (size_t) size;
return (uLong) fwrite(buf, 1, (size_t) size, (FILE *) stream);
}
if (inj->fail_once)
inj->budget = (size_t) -1; /* the backend recovers after this failure */
errno = inj->fail_errno;
return 0; /* short write -> the minizip op returns an error */
}
@@ -373,9 +377,50 @@ static void writefail_store(httrackp *opt, cache_back *cache, const char *fil,
freet(bodycopy);
}
/* #174/#219: a failing cache write used to crash via assertf(); it must instead
stop the mirror (exit_xh = -1) without crashing. Assert that, plus the cache
is flagged and a sibling write doesn't re-enter the broken stream. */
/* Store an entry claiming a >2GB body; the degrade path never reads data. */
static void writefail_store_oversized(httrackp *opt, cache_back *cache,
const char *fil, int is_write) {
htsblk r;
char locbuf[4];
hts_init_htsblk(&r);
r.statuscode = 200;
r.size = (LLint) INT_MAX + 1;
strcpybuff(r.msg, "OK");
strcpybuff(r.contenttype, "application/octet-stream");
locbuf[0] = '\0';
r.location = locbuf;
r.is_write = (short int) is_write;
cache_add(opt, cache, &r, "example.com", fil, "example.com/big.bin", 1, NULL);
}
/* Read back `entryname`: extra field (cached headers) and body. Returns the
body length, or -1 if the entry is absent or unreadable. */
static int writefail_read_entry(const char *path, const char *entryname,
char *extra, size_t extralen, char *body,
size_t bodylen) {
unzFile z = unzOpen(path);
int n = -1;
if (z == NULL)
return -1;
if (unzLocateFile(z, entryname, 1) == UNZ_OK &&
unzOpenCurrentFile(z) == UNZ_OK) {
const int elen = unzGetLocalExtrafield(z, extra, (unsigned) (extralen - 1));
if (elen >= 0) {
extra[elen] = '\0';
n = unzReadCurrentFile(z, body, (unsigned) bodylen);
}
unzCloseCurrentFile(z);
}
unzClose(z);
return n;
}
/* Cache write-failure policy (#174/#219): fatal errno or a failure streak
stops the mirror (exit_xh=-1, no crash); isolated/oversized drops the entry.
*/
int cache_write_failure_selftest(httrackp *opt, const char *dir) {
int fail = 0;
char path[HTS_URLMAXSIZE];
@@ -388,9 +433,8 @@ int cache_write_failure_selftest(httrackp *opt, const char *dir) {
gen_body(body, body_len, 1 /* incompressible */);
fconcat(path, sizeof(path), dir, "/wfail.zip");
/* phase 0: fail on the body write, fatal errno (ENOSPC, the disk-full
branch). phase 1: fail on the open, non-fatal errno (EIO, dropped-share
branch). Both must abort the mirror. */
/* phase 0: fatal errno (ENOSPC) aborts at once; phase 1: persistent EIO
drops entries until the streak caps out, then aborts. */
for (phase = 0; phase < 2; phase++) {
cache_back cache;
writefail_inject inj;
@@ -399,6 +443,7 @@ int cache_write_failure_selftest(httrackp *opt, const char *dir) {
inj.budget = (phase == 0) ? 4096 : 0;
inj.fail_errno = (phase == 0) ? ENOSPC : EIO;
inj.writes = 0;
inj.fail_once = 0;
memset(&cache, 0, sizeof(cache));
cache.type = 1;
cache.log = stderr;
@@ -412,7 +457,25 @@ int cache_write_failure_selftest(httrackp *opt, const char *dir) {
}
opt->state.exit_xh = 0; /* clear; the failing write must set it to -1 */
writefail_store(opt, &cache, "/blob.bin", body, body_len);
if (phase == 0) {
writefail_store(opt, &cache, "/blob.bin", body, body_len);
} else {
/* the abort must land exactly on the 8th consecutive failure */
int i;
for (i = 0; i < 7; i++) {
char fil[32];
snprintf(fil, sizeof(fil), "/b%d.bin", i);
writefail_store(opt, &cache, fil, body, 16);
}
if (cache.zipWriteFailed) {
fprintf(stderr, "cache-writefail: phase 1: aborted before the "
"8th consecutive failure\n");
fail++;
}
writefail_store(opt, &cache, "/b7.bin", body, 16);
}
if (!cache.zipWriteFailed) {
fprintf(stderr, "cache-writefail: phase %d: write error not caught\n",
phase);
@@ -443,6 +506,136 @@ int cache_write_failure_selftest(httrackp *opt, const char *dir) {
}
}
/* failures with successes in between reset the streak: never aborts */
{
cache_back cache;
writefail_inject inj;
int i;
inj.budget = (size_t) -1;
inj.fail_errno = EIO;
inj.writes = 0;
inj.fail_once = 0;
memset(&cache, 0, sizeof(cache));
cache.type = 1;
cache.log = stderr;
cache.errlog = stderr;
cache.hashtable = coucal_new(0);
cache.zipOutput = selftest_open_failing_zip(path, &inj);
opt->state.exit_xh = 0;
for (i = 0; i < 10; i++) {
char fil[32];
inj.budget = 0; /* this store fails */
snprintf(fil, sizeof(fil), "/s%d.bin", i);
writefail_store(opt, &cache, fil, body, 16);
inj.budget = (size_t) -1; /* this one succeeds and resets the streak */
snprintf(fil, sizeof(fil), "/ok%d.bin", i);
writefail_store(opt, &cache, fil, body, 16);
}
if (cache.zipWriteFailed || opt->state.exit_xh != 0) {
fprintf(stderr,
"cache-writefail: scattered: non-consecutive failures aborted "
"the mirror (flagged=%d, exit_xh=%d)\n",
(int) cache.zipWriteFailed, opt->state.exit_xh);
fail++;
}
zipClose(cache.zipOutput, NULL);
cache.zipOutput = NULL;
}
/* isolated failure: only that entry drops; a later sibling round-trips */
{
cache_back cache;
writefail_inject inj;
char extra[8192];
char rbody[64];
int n;
inj.budget = 4096;
inj.fail_errno = EIO;
inj.writes = 0;
inj.fail_once = 1;
memset(&cache, 0, sizeof(cache));
cache.type = 1;
cache.log = stderr;
cache.errlog = stderr;
cache.hashtable = coucal_new(0);
cache.zipOutput = selftest_open_failing_zip(path, &inj);
opt->state.exit_xh = 0;
writefail_store(opt, &cache, "/blob.bin", body, body_len);
if (cache.zipWriteFailed || opt->state.exit_xh != 0) {
fprintf(stderr,
"cache-writefail: skip: isolated failure aborted the mirror "
"(flagged=%d, exit_xh=%d)\n",
(int) cache.zipWriteFailed, opt->state.exit_xh);
fail++;
}
writefail_store(opt, &cache, "/blob2.bin", body, 16);
zipClose(cache.zipOutput, NULL);
cache.zipOutput = NULL;
n = writefail_read_entry(path, "http://example.com/blob2.bin", extra,
sizeof(extra), rbody, sizeof(rbody));
if (n != 16 || memcmp(rbody, body, 16) != 0) {
fprintf(stderr,
"cache-writefail: skip: sibling entry lost after a skipped "
"entry (%d)\n",
n);
fail++;
}
}
/* >2GB bodies: in-memory drops the entry, on-disk degrades to headers-only */
{
cache_back cache;
writefail_inject inj;
char extra[8192];
char rbody[64];
int n;
inj.budget = (size_t) -1; /* no injected failure */
inj.fail_errno = 0;
inj.writes = 0;
inj.fail_once = 0;
memset(&cache, 0, sizeof(cache));
cache.type = 1;
cache.log = stderr;
cache.errlog = stderr;
cache.hashtable = coucal_new(0);
cache.zipOutput = selftest_open_failing_zip(path, &inj);
opt->state.exit_xh = 0;
writefail_store_oversized(opt, &cache, "/bigmem.bin", 0 /* in-memory */);
writefail_store_oversized(opt, &cache, "/bigdisk.bin", 1 /* on-disk */);
zipClose(cache.zipOutput, NULL);
cache.zipOutput = NULL;
if (cache.zipWriteFailed || opt->state.exit_xh != 0) {
fprintf(stderr,
"cache-writefail: oversize: mirror aborted (flagged=%d, "
"exit_xh=%d)\n",
(int) cache.zipWriteFailed, opt->state.exit_xh);
fail++;
}
if (writefail_read_entry(path, "http://example.com/bigmem.bin", extra,
sizeof(extra), rbody, sizeof(rbody)) >= 0) {
fprintf(stderr,
"cache-writefail: oversize: in-memory entry was stored\n");
fail++;
}
n = writefail_read_entry(path, "http://example.com/bigdisk.bin", extra,
sizeof(extra), rbody, sizeof(rbody));
if (n != 0 || strstr(extra, "X-In-Cache: 0") == NULL) {
fprintf(stderr,
"cache-writefail: oversize: on-disk entry not stored "
"headers-only (%d)\n",
n);
fail++;
}
}
freet(body);
return fail;
}
@@ -716,3 +909,398 @@ int cache_golden_selftest(httrackp *opt, const char *dir, int regen) {
return failures;
}
/* --- hts_cache_reconcile() policies -------------------------------------- */
/* All reconcile inputs/outputs, wiped between cases. */
static const char *const reconcile_files[] = {
"hts-cache/new.zip", "hts-cache/old.zip", "hts-cache/new.dat",
"hts-cache/old.dat", "hts-cache/new.ndx", "hts-cache/old.ndx",
"hts-cache/new.lst", "hts-cache/old.lst", "hts-cache/new.txt",
"hts-cache/old.txt", "hts-in_progress.lock"};
static char *reconcile_st_path(httrackp *opt, const char *name) {
return fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log), name);
}
static void reconcile_wipe(httrackp *opt) {
size_t i;
for (i = 0; i < sizeof(reconcile_files) / sizeof(reconcile_files[0]); i++)
remove(reconcile_st_path(opt, reconcile_files[i]));
}
/* Create a filler file of exactly `size` bytes. */
static void reconcile_put(httrackp *opt, const char *name, size_t size) {
FILE *const fp = fopen(reconcile_st_path(opt, name), "wb");
static const char filler[1024] = {'x'};
assertf(fp != NULL);
while (size > 0) {
const size_t n = size > sizeof(filler) ? sizeof(filler) : size;
assertf(fwrite(filler, 1, n, fp) == n);
size -= n;
}
fclose(fp);
}
/* Expect `name` to weigh `size` bytes, or be absent when size == -1. */
static int reconcile_expect(httrackp *opt, const char *name, off_t size,
const char *what) {
const off_t got = fsize(reconcile_st_path(opt, name));
if (got != size) {
fprintf(stderr, "cache-reconcile: %s: %s is %d bytes, expected %d\n", what,
name, (int) got, (int) size);
return 1;
}
return 0;
}
int cache_reconcile_selftest(httrackp *opt, const char *dir) {
int failures = 0;
/* around the interrupted-run thresholds (new < 32768, old > 65536) */
static const off_t TINY = 1024, MID = 40000, SOLID = 131072;
golden_setup(opt, dir);
#ifdef _WIN32
mkdir(reconcile_st_path(opt, "hts-cache"));
#else
mkdir(reconcile_st_path(opt, "hts-cache"), HTS_PROTECT_FOLDER);
#endif
/* PROMOTE: a zip old generation replaces a missing new one */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/old.zip", SOLID);
hts_cache_reconcile(opt, CACHE_RECONCILE_PROMOTE);
failures += reconcile_expect(opt, "hts-cache/new.zip", SOLID, "promote-zip");
failures += reconcile_expect(opt, "hts-cache/old.zip", -1, "promote-zip");
/* PROMOTE: an existing new.zip is left alone */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/new.zip", TINY);
reconcile_put(opt, "hts-cache/old.zip", SOLID);
hts_cache_reconcile(opt, CACHE_RECONCILE_PROMOTE);
failures +=
reconcile_expect(opt, "hts-cache/new.zip", TINY, "promote-zip-noop");
failures +=
reconcile_expect(opt, "hts-cache/old.zip", SOLID, "promote-zip-noop");
/* PROMOTE: a pure-legacy old generation is promoted too (was dead when no
zip cache existed) */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/old.dat", SOLID);
reconcile_put(opt, "hts-cache/old.ndx", TINY);
hts_cache_reconcile(opt, CACHE_RECONCILE_PROMOTE);
failures += reconcile_expect(opt, "hts-cache/new.dat", SOLID, "promote-dat");
failures += reconcile_expect(opt, "hts-cache/new.ndx", TINY, "promote-dat");
failures += reconcile_expect(opt, "hts-cache/old.dat", -1, "promote-dat");
/* PROMOTE: a half-written legacy new pair is replaced by the old pair */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/new.dat", TINY);
reconcile_put(opt, "hts-cache/old.dat", SOLID);
reconcile_put(opt, "hts-cache/old.ndx", TINY);
hts_cache_reconcile(opt, CACHE_RECONCILE_PROMOTE);
failures +=
reconcile_expect(opt, "hts-cache/new.dat", SOLID, "promote-dat-partial");
failures +=
reconcile_expect(opt, "hts-cache/new.ndx", TINY, "promote-dat-partial");
/* INTERRUPTED: no lock file, no action */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/new.zip", TINY);
reconcile_put(opt, "hts-cache/old.zip", SOLID);
hts_cache_reconcile(opt, CACHE_RECONCILE_INTERRUPTED);
failures +=
reconcile_expect(opt, "hts-cache/new.zip", TINY, "interrupted-nolock");
/* INTERRUPTED: an absent new.zip must NOT promote old.zip (fsize() returns -1
for a missing file, which would spuriously pass "< TINY"); leave the solid
old generation for ROLLBACK */
reconcile_wipe(opt);
reconcile_put(opt, "hts-in_progress.lock", 0);
reconcile_put(opt, "hts-cache/old.zip", SOLID);
hts_cache_reconcile(opt, CACHE_RECONCILE_INTERRUPTED);
failures +=
reconcile_expect(opt, "hts-cache/new.zip", -1, "interrupted-nonew");
failures +=
reconcile_expect(opt, "hts-cache/old.zip", SOLID, "interrupted-nonew");
/* INTERRUPTED: stalled tiny new.zip loses to a solid old.zip (was dead for
zip caches: the arm was gated on a legacy new.dat) */
reconcile_wipe(opt);
reconcile_put(opt, "hts-in_progress.lock", 0);
reconcile_put(opt, "hts-cache/new.zip", TINY);
reconcile_put(opt, "hts-cache/old.zip", SOLID);
hts_cache_reconcile(opt, CACHE_RECONCILE_INTERRUPTED);
failures +=
reconcile_expect(opt, "hts-cache/new.zip", SOLID, "interrupted-zip");
failures += reconcile_expect(opt, "hts-cache/old.zip", -1, "interrupted-zip");
/* INTERRUPTED: old below the confidence threshold, keep new */
reconcile_wipe(opt);
reconcile_put(opt, "hts-in_progress.lock", 0);
reconcile_put(opt, "hts-cache/new.zip", TINY);
reconcile_put(opt, "hts-cache/old.zip", MID);
hts_cache_reconcile(opt, CACHE_RECONCILE_INTERRUPTED);
failures +=
reconcile_expect(opt, "hts-cache/new.zip", TINY, "interrupted-smallold");
/* INTERRUPTED: new big enough to trust, keep it */
reconcile_wipe(opt);
reconcile_put(opt, "hts-in_progress.lock", 0);
reconcile_put(opt, "hts-cache/new.zip", MID);
reconcile_put(opt, "hts-cache/old.zip", SOLID);
hts_cache_reconcile(opt, CACHE_RECONCILE_INTERRUPTED);
failures +=
reconcile_expect(opt, "hts-cache/new.zip", MID, "interrupted-bignew");
/* INTERRUPTED: the legacy pair follows the same size rule (was dead code) */
reconcile_wipe(opt);
reconcile_put(opt, "hts-in_progress.lock", 0);
reconcile_put(opt, "hts-cache/new.dat", TINY);
reconcile_put(opt, "hts-cache/new.ndx", TINY);
reconcile_put(opt, "hts-cache/old.dat", SOLID);
reconcile_put(opt, "hts-cache/old.ndx", MID);
hts_cache_reconcile(opt, CACHE_RECONCILE_INTERRUPTED);
failures +=
reconcile_expect(opt, "hts-cache/new.dat", SOLID, "interrupted-dat");
failures +=
reconcile_expect(opt, "hts-cache/new.ndx", MID, "interrupted-dat");
/* ROLLBACK: the old zip generation is restored (a zip cache used to lose
its only good generation here) */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/new.zip", TINY);
reconcile_put(opt, "hts-cache/old.zip", SOLID);
hts_cache_reconcile(opt, CACHE_RECONCILE_ROLLBACK);
failures += reconcile_expect(opt, "hts-cache/new.zip", SOLID, "rollback-zip");
failures += reconcile_expect(opt, "hts-cache/old.zip", -1, "rollback-zip");
/* ROLLBACK: sidecars are restored regardless of format */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/new.lst", TINY);
reconcile_put(opt, "hts-cache/old.lst", MID);
reconcile_put(opt, "hts-cache/old.txt", MID);
hts_cache_reconcile(opt, CACHE_RECONCILE_ROLLBACK);
failures += reconcile_expect(opt, "hts-cache/new.lst", MID, "rollback-lst");
failures += reconcile_expect(opt, "hts-cache/new.txt", MID, "rollback-txt");
/* ROLLBACK: full legacy generation incl. sidecars (historical behavior) */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/new.dat", TINY);
reconcile_put(opt, "hts-cache/new.ndx", TINY);
reconcile_put(opt, "hts-cache/old.dat", SOLID);
reconcile_put(opt, "hts-cache/old.ndx", MID);
reconcile_put(opt, "hts-cache/old.lst", MID);
reconcile_put(opt, "hts-cache/old.txt", MID);
hts_cache_reconcile(opt, CACHE_RECONCILE_ROLLBACK);
failures += reconcile_expect(opt, "hts-cache/new.dat", SOLID, "rollback-dat");
failures += reconcile_expect(opt, "hts-cache/new.ndx", MID, "rollback-dat");
failures += reconcile_expect(opt, "hts-cache/new.lst", MID, "rollback-dat");
failures += reconcile_expect(opt, "hts-cache/new.txt", MID, "rollback-dat");
/* ROLLBACK: nothing to restore, the new generation stays */
reconcile_wipe(opt);
reconcile_put(opt, "hts-cache/new.zip", TINY);
hts_cache_reconcile(opt, CACHE_RECONCILE_ROLLBACK);
failures += reconcile_expect(opt, "hts-cache/new.zip", TINY, "rollback-noop");
reconcile_wipe(opt);
return failures;
}
/* --- read-side corruption injection --------------------------------------- */
/* canary read back intact after each corruption; victim gets the byte surgery
*/
#define CORRUPT_ADR "corrupt.example.com"
static char corrupt_body_a[33 + 1];
static char corrupt_body_b[44 + 1];
/* Write a fresh two-entry cache: /canary.html then /victim.html. */
static void corrupt_build(httrackp *opt) {
cache_back cache;
memset(corrupt_body_a, 'a', sizeof(corrupt_body_a) - 1);
memset(corrupt_body_b, 'b', sizeof(corrupt_body_b) - 1);
remove(reconcile_st_path(opt, "hts-cache/new.zip"));
remove(reconcile_st_path(opt, "hts-cache/old.zip"));
selftest_open_for_write(&cache, opt);
store_entry(opt, &cache, CORRUPT_ADR, "/canary.html", "canary.html", 200,
"OK", "text/html", "utf-8", "", "", "", "", corrupt_body_a,
strlen(corrupt_body_a));
store_entry(opt, &cache, CORRUPT_ADR, "/victim.html", "victim.html", 200,
"OK", "text/html", "utf-8", "", "", "", "", corrupt_body_b,
strlen(corrupt_body_b));
selftest_close(&cache);
}
/* Like corrupt_build, but the victim carries a 20-char Etag whose header line
is later overwritten with a forged oversized X-Size (same byte length). */
static void corrupt_build_etag(httrackp *opt) {
cache_back cache;
memset(corrupt_body_a, 'a', sizeof(corrupt_body_a) - 1);
memset(corrupt_body_b, 'b', sizeof(corrupt_body_b) - 1);
remove(reconcile_st_path(opt, "hts-cache/new.zip"));
remove(reconcile_st_path(opt, "hts-cache/old.zip"));
selftest_open_for_write(&cache, opt);
store_entry(opt, &cache, CORRUPT_ADR, "/canary.html", "canary.html", 200,
"OK", "text/html", "utf-8", "", "", "", "", corrupt_body_a,
strlen(corrupt_body_a));
store_entry(opt, &cache, CORRUPT_ADR, "/victim.html", "victim.html", 200,
"OK", "text/html", "utf-8", "", "AAAAAAAAAAAAAAAAAAAA", "", "",
corrupt_body_b, strlen(corrupt_body_b));
selftest_close(&cache);
}
/* Patch the nth of total occurrences of pat (same-length rep) in new.zip. */
static void corrupt_patch(httrackp *opt, const char *pat, size_t patlen,
const char *rep, size_t nth, size_t total) {
LLint fsz = 0;
char *data = readfile2(reconcile_st_path(opt, "hts-cache/new.zip"), &fsz);
const size_t n = (size_t) fsz;
size_t k, hits = 0, at = 0;
FILE *fp;
assertf(data != NULL);
for (k = 0; k + patlen <= n; k++) {
if (memcmp(data + k, pat, patlen) == 0) {
hits++;
if (hits == nth)
at = k;
}
}
assertf(hits == total);
memcpy(data + at, rep, patlen);
fp = fopen(reconcile_st_path(opt, "hts-cache/new.zip"), "wb");
assertf(fp != NULL);
assertf(fwrite(data, 1, n, fp) == n);
fclose(fp);
freet(data);
}
/* Garble the first bytes of the victim's deflated data (2nd local header). */
static void corrupt_victim_body(httrackp *opt) {
LLint fsz = 0;
char *data = readfile2(reconcile_st_path(opt, "hts-cache/new.zip"), &fsz);
const size_t n = (size_t) fsz;
size_t k, hits = 0, off = 0;
FILE *fp;
assertf(data != NULL);
for (k = 0; k + 4 <= n; k++) {
if (memcmp(data + k, "PK\x03\x04", 4) == 0 && ++hits == 2) {
const size_t namelen =
(unsigned char) data[k + 26] | ((unsigned char) data[k + 27] << 8);
const size_t extralen =
(unsigned char) data[k + 28] | ((unsigned char) data[k + 29] << 8);
off = k + 30 + namelen + extralen;
}
}
assertf(hits == 2);
assertf(off != 0 && off + 4 <= n);
memset(data + off, 0xFF, 4);
fp = fopen(reconcile_st_path(opt, "hts-cache/new.zip"), "wb");
assertf(fp != NULL);
assertf(fwrite(data, 1, n, fp) == n);
fclose(fp);
freet(data);
}
/* Read the corrupt /victim.html and, in the SAME read session, the intact
/canary.html: the victim must be rejected (wantmsg pins which path) and the
canary must still decode byte-exact, proving one bad entry never taints a
sibling read. */
static int corrupt_expect_victim(httrackp *opt, const char *wantmsg,
const char *what) {
cache_back cache;
htsblk v, c;
char BIGSTK lv[HTS_URLMAXSIZE * 2];
char BIGSTK lc[HTS_URLMAXSIZE * 2];
int fail = 0;
selftest_open_for_read(&cache, opt);
lv[0] = lc[0] = '\0';
v = cache_readex(opt, &cache, CORRUPT_ADR, "/victim.html", "", lv, NULL, 1);
if (v.statuscode != STATUSCODE_INVALID) {
fprintf(stderr, "%s: %s: victim: statuscode is %d, expected %d\n",
selftest_tag, what, v.statuscode, STATUSCODE_INVALID);
fail++;
}
if (wantmsg != NULL && strcmp(v.msg, wantmsg) != 0) {
fprintf(stderr, "%s: %s: victim: msg is '%s', expected '%s'\n",
selftest_tag, what, v.msg, wantmsg);
fail++;
}
c = cache_readex(opt, &cache, CORRUPT_ADR, "/canary.html", "", lc, NULL, 1);
if (c.statuscode != 200 || c.adr == NULL ||
c.size != (LLint) strlen(corrupt_body_a) ||
memcmp(c.adr, corrupt_body_a, strlen(corrupt_body_a)) != 0) {
fprintf(stderr, "%s: %s: canary tainted (status %d)\n", selftest_tag, what,
c.statuscode);
fail++;
}
if (v.adr != NULL)
freet(v.adr);
if (c.adr != NULL)
freet(c.adr);
selftest_close(&cache);
return fail;
}
/* One zip corruption case: build, patch, then check victim+canary in-session.
*/
static int corrupt_case_zip(httrackp *opt, const char *pat, const char *rep,
size_t nth, size_t total, const char *wantmsg,
const char *what) {
corrupt_build(opt);
corrupt_patch(opt, pat, strlen(pat), rep, nth, total);
return corrupt_expect_victim(opt, wantmsg, what);
}
int cache_corruption_selftest(httrackp *opt, const char *dir) {
int failures = 0;
selftest_tag = "cache-corrupt";
golden_setup(opt, dir);
failures +=
corrupt_case_zip(opt, "X-Size: 44", "X-Size: 99", 1, 1,
"Cache Read Error : Read Data", "oversized X-Size");
failures +=
corrupt_case_zip(opt, "X-Size: 44", "X-Size: -4", 1, 1,
"Cache Read Error : Bad Size", "negative X-Size");
/* both entries carry the line; the victim's is the second */
failures += corrupt_case_zip(opt, "X-In-Cache: 1", "X-In-Cache: 0", 2, 2,
"Previous cache file not found (empty filename)",
"blanked X-In-Cache");
/* smashed local file header: the entry is dropped at index load */
failures +=
corrupt_case_zip(opt, "PK\x03\x04", "XK\x03\x04", 2, 2,
"File Cache Entry Not Found", "smashed local header");
corrupt_build(opt);
corrupt_victim_body(opt);
failures += corrupt_expect_victim(opt, "Cache Read Error : Read Data",
"garbled deflate stream");
/* An X-Size above INT_MAX is positive as int64 (slips a bare sign check) but
truncates negative in the (int) cast the malloc uses: a wraparound alloc.
cache_add asserts size fits an int, so such a value only reaches the reader
from a corrupt/foreign cache; inject it by overwriting the victim's long
Etag line with a same-length forged X-Size line (the parser keeps the last
X-Size it sees), keeping the zip byte-length and offsets intact. */
corrupt_build_etag(opt);
corrupt_patch(opt, "Etag: AAAAAAAAAAAAAAAAAAAA", 26,
"X-Size: 2147483648AAAAAAAA", 1, 1);
failures += corrupt_expect_victim(opt, "Cache Read Error : Bad Size",
"X-Size above INT_MAX");
return failures;
}
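The last case above targets a size that is positive as a 64-bit value but does not fit an `int`. A minimal sketch of a range check that rejects such values before any narrowing cast is shown here. `sketch_parse_size` is a hypothetical name, and this is not HTTrack's reader, only an illustration of the check the test expects to exist somewhere on the read path.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical validator: parse a decimal size and accept it only when it is
   non-negative and survives an (int) cast. Trailing bytes after the digits
   are ignored, as in the forged header line used by the test.
   Returns 1 and stores the value on success, 0 otherwise. */
static int sketch_parse_size(const char *value, int *out) {
  char *end = NULL;
  long long v;

  assert(value != NULL && out != NULL);
  errno = 0;
  v = strtoll(value, &end, 10);
  if (end == value || errno == ERANGE)
    return 0; /* no digits, or out of long long range */
  if (v < 0 || v > INT_MAX)
    return 0; /* negative, or would wrap when narrowed to int */
  *out = (int) v;
  return 1;
}
```

With this check, `"44"` is accepted while `"-4"` and `"2147483648AAAAAAAA"` are both rejected, matching the "Bad Size" outcomes the selftest pins.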


@@ -52,10 +52,19 @@ int cache_selftests(httrackp *opt, const char *dir);
committed file, never by the test). Returns the failed-check count. */
int cache_golden_selftest(httrackp *opt, const char *dir, int regen);
/* #174/#219: assert a failing cache write aborts the mirror cleanly instead of
crashing. Returns the failed-check count. */
/* Cache write-failure policy (#174/#219): abort on fatal errno or a streak,
drop just the entry otherwise. Returns the failed-check count. */
int cache_write_failure_selftest(httrackp *opt, const char *dir);
/* Exercise the hts_cache_reconcile() generation policies on file fixtures
under <dir>. Returns the failed-check count. */
int cache_reconcile_selftest(httrackp *opt, const char *dir);
/* Inject read-side corruption (zip byte surgery: bad size, header, deflate)
under <dir> and assert every case degrades to STATUSCODE_INVALID without
tainting a sibling entry. */
int cache_corruption_selftest(httrackp *opt, const char *dir);
#endif
#endif


@@ -2137,47 +2137,7 @@ int httpmirror(char *url1, httrackp * opt) {
hts_log_print(opt, LOG_NOTICE,
"No data seems to have been transferred during this session! : restoring previous one!");
XH_uninit;
if ((fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log), "hts-cache/old.dat")))
&&
(fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.ndx")))) {
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.dat"));
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.ndx"));
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.lst"));
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.txt"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.dat"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.dat"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.ndx"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.ndx"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.lst"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.lst"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.txt"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.txt"));
}
hts_cache_reconcile(opt, CACHE_RECONCILE_ROLLBACK);
opt->state.exit_xh = 2; /* interrupted (no connection detected) */
return 1;
}
@@ -2892,6 +2852,9 @@ int check_fatal_io_errno(void) {
#endif
#ifdef EROFS
case EROFS: /* Read-only file system */
#endif
#ifdef EDQUOT
case EDQUOT: /* Disk quota exceeded */
#endif
return 1;
break;
@@ -3371,6 +3334,41 @@ int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
return n;
}
/* One engine-loop tick: refresh the transfer stats and run the loop callback
for slot b (-1 = none). HTS_FALSE = the callback requested an abort. */
hts_boolean hts_loop_tick(struct_back *sback, httrackp *opt, int b, int ptr) {
engine_stats();
HTS_STAT.stat_nsocket = back_nsoc(sback);
HTS_STAT.stat_errors = fspc(opt, NULL, "error");
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning");
HTS_STAT.stat_infos = fspc(opt, NULL, "info");
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr);
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback);
return RUN_CALLBACK7(
opt, loop, sback->lnk, sback->count, b, ptr, opt->lien_tot,
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)
? HTS_TRUE
: HTS_FALSE;
}
/* Single implementation of the historical WAIT_FOR_AVAILABLE_SOCKET macros. */
hts_boolean hts_wait_available_socket(struct_back *sback, httrackp *opt,
cache_back *cache, int ptr) {
const int prev = opt->state._hts_in_html_parsing;
while (back_pluggable_sockets_strict(sback, opt) <= 0) {
opt->state._hts_in_html_parsing = 6;
back_wait(sback, opt, cache, 0);
/* time limit (-E) exceeded: stop waiting for a socket (#481) */
if (!back_checkmirror(opt))
break;
if (!hts_loop_tick(sback, opt, -1, ptr))
return HTS_FALSE;
}
opt->state._hts_in_html_parsing = prev;
return HTS_TRUE;
}
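The shape of this wait loop (save a state flag, pump until a slot frees up, stop at the deadline, abort when the callback says so) can be sketched in isolation. Every name below is hypothetical and none comes from HTTrack. One deliberate difference: the sketch restores the saved state on every exit path, whereas the function in the diff returns directly on a callback abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical engine state; none of these names come from HTTrack. */
typedef struct {
  int free_slots;             /* <= 0 means no slot can be plugged yet */
  int state;                  /* UI state flag, saved across the wait */
  int ticks_left;             /* iterations left before the deadline */
  bool (*on_tick)(void *arg); /* returns false to request an abort */
  void *arg;
} sketch_engine;

/* Example callback: keeps going while the caller's budget lasts. */
static bool sketch_count_tick(void *arg) {
  int *budget = (int *) arg;
  return (*budget)-- > 0;
}

/* Stand-in for pumping transfers: each call frees one slot. */
static void sketch_pump(sketch_engine *e) { e->free_slots++; }

/* Wait for a free slot. Returns false only when the callback aborts; a
   passed deadline just ends the wait and still returns true. */
static bool sketch_wait_slot(sketch_engine *e) {
  bool ok = true;
  int prev;

  assert(e != NULL);
  prev = e->state;
  while (e->free_slots <= 0) {
    e->state = 6; /* "waiting for a socket" marker, as in the diff */
    sketch_pump(e);
    if (e->ticks_left-- <= 0)
      break; /* deadline passed: stop waiting, not an abort */
    if (e->on_tick != NULL && !e->on_tick(e->arg)) {
      ok = false; /* callback requested an abort */
      break;
    }
  }
  e->state = prev;
  return ok;
}
```

Turning the macro into a function with a boolean result is what lets each call site choose its own abort value (`return -1` in the hunks of this diff) instead of having a `return` hidden inside a macro body.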
int back_pluggable_sockets(struct_back * sback, httrackp * opt) {
int n;


@@ -216,6 +216,7 @@ struct cache_back {
int zipEntriesCapa;
hts_boolean
zipWriteFailed; /**< a cache write failed; stop touching the stream */
int zipWriteFailures; /**< consecutive entry write failures; reset on store */
};
#ifndef HTS_DEF_FWSTRUCT_hash_struct
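The two fields above carry the write-failure policy described in the commit message: a sticky abort flag plus a counter of consecutive failures that a successful store resets. A minimal sketch of that policy follows. All names are hypothetical stand-ins, and the cap of 8 is an assumption taken from the commit message's "8th consecutive failure"; the real `CACHE_MAX_WRITE_FAILURES` definition is not shown in this diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real fields live in struct cache_back. */
#define SKETCH_MAX_WRITE_FAILURES 8 /* assumed cap, see the note above */

typedef enum { WRITE_STORED, WRITE_SKIP_ENTRY, WRITE_ABORT } write_verdict;

typedef struct {
  int write_failed; /* sticky: stop touching the stream */
  int failures;     /* consecutive entry write failures */
} sketch_cache;

/* Storage-level errors that make further writes pointless. */
static int sketch_fatal_errno(int err) {
  switch (err) {
  case ENOSPC: /* no space left on device */
#ifdef EROFS
  case EROFS: /* read-only file system */
#endif
#ifdef EDQUOT
  case EDQUOT: /* disk quota exceeded */
#endif
    return 1;
  default:
    return 0;
  }
}

/* Record the outcome of one entry write; ok != 0 means it was stored. */
static write_verdict sketch_note_write(sketch_cache *c, int ok, int err) {
  assert(c != NULL);
  if (c->write_failed)
    return WRITE_ABORT; /* already aborted: nothing more is written */
  if (ok) {
    c->failures = 0; /* a stored entry resets the streak */
    return WRITE_STORED;
  }
  if (sketch_fatal_errno(err) || ++c->failures >= SKETCH_MAX_WRITE_FAILURES) {
    c->write_failed = 1;
    return WRITE_ABORT;
  }
  return WRITE_SKIP_ENTRY; /* isolated failure: drop only this entry */
}
```

Counting consecutive failures rather than a running total is what allows a long mirror with occasional bad entries to finish, while a stream that can no longer be written still stops quickly.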
@@ -432,6 +433,15 @@ int back_pluggable_sockets(struct_back * sback, httrackp * opt);
int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt);
/* One engine-loop tick: refresh the transfer stats and run the loop callback
for slot b (-1 = none). HTS_FALSE = the callback requested an abort. */
hts_boolean hts_loop_tick(struct_back *sback, httrackp *opt, int b, int ptr);
/* Wait until a test socket can be plugged, pumping transfers, stats and the
loop callback; gives up past the -E deadline. HTS_FALSE = callback abort. */
hts_boolean hts_wait_available_socket(struct_back *sback, httrackp *opt,
cache_back *cache, int ptr);
/* Randomized inter-file pause target in [min_ms,max_ms] (#185), derived from a
timestamp seed so it is stable within one gap and rerolls per launch. */
int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms);
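One way to get a pause that is random-looking yet stable for a given seed is to hash the seed and map the result into the range. The sketch below does this with a splitmix64-style mixer. It is an illustration under stated assumptions, not the function's actual body, which this diff does not show; `sketch_pause_target_ms` and `sketch_mix64` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64-style finalizer: spreads nearby seeds across the 64-bit range. */
static uint64_t sketch_mix64(uint64_t x) {
  x += 0x9E3779B97F4A7C15ULL;
  x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
  x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
  return x ^ (x >> 31);
}

/* Hypothetical sketch: deterministic value in [min_ms, max_ms] for a seed.
   The same seed always yields the same pause; a new seed rerolls it.
   A degenerate range (max_ms <= min_ms) collapses to min_ms. */
static int sketch_pause_target_ms(int64_t seed, int min_ms, int max_ms) {
  uint64_t span;

  assert(min_ms >= 0);
  if (max_ms <= min_ms)
    return min_ms;
  span = (uint64_t) (max_ms - min_ms) + 1;
  /* the modulo bias is negligible for millisecond-sized ranges */
  return min_ms + (int) (sketch_mix64((uint64_t) seed) % span);
}
```

Deriving the value from a timestamp instead of calling a random generator on every poll means repeated checks during one gap agree on the same target.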


@@ -544,69 +544,11 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
}
}
// Existence d'un cache - pas de new mais un old.. renommer
// No new cache but an old one? promote it
#if DEBUG_STEPS
printf("Checking cache\n");
#endif
if (!fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log), "hts-cache/new.zip"))) {
if (fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log), "hts-cache/old.zip"))) {
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/old.zip"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.zip"));
}
} else
if ((!fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log), "hts-cache/new.dat")))
||
(!fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.ndx")))) {
if ((fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log), "hts-cache/old.dat")))
&&
(fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/old.ndx")))) {
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.dat"));
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.ndx"));
//remove(fconcat(StringBuff(opt->path_log),"hts-cache/new.lst"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/old.dat"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.dat"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.ndx"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.ndx"));
//rename(fconcat(StringBuff(opt->path_log),"hts-cache/old.lst"),fconcat(StringBuff(opt->path_log),"hts-cache/new.lst"));
}
}
hts_cache_reconcile(opt, CACHE_RECONCILE_PROMOTE);
/* Interrupted mirror detected */
if (!opt->quiet) {
@@ -2554,109 +2496,8 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
printf("Cache & log settings\n");
#endif
// we use the cache..
// en cas de présence des deux versions, garder la version la plus avancée,
// cad la version contenant le plus de fichiers
if (opt->cache) {
if (fexist(fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log), "hts-in_progress.lock"))) { // problemes..
if (fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.dat"))) {
if (fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.zip"))) {
if (fsize
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.zip")) < 32768) {
if (fsize
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.zip")) > 65536) {
if (fsize
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.zip")) > fsize(fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->
path_log),
"hts-cache/new.zip")))
{
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.zip"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.zip"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.zip"));
}
}
}
}
} else
if (fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.dat"))
&&
fexist(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.ndx"))) {
if (fexist
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.dat"))
&&
fexist(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.ndx"))) {
// switcher si new<32Ko et old>65Ko (tailles arbitraires) ?
// ce cas est peut être une erreur ou un crash d'un miroir ancien, prendre
// alors l'ancien cache
if (fsize
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.dat")) < 32768) {
if (fsize
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.dat")) > 65536) {
if (fsize
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.dat")) > fsize(fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->
path_log),
"hts-cache/new.dat")))
{
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.dat"));
remove(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/new.ndx"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.dat"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.dat"));
rename(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_log),
"hts-cache/old.ndx"), fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_log),
"hts-cache/new.ndx"));
//} else { // ne rien faire
// remove("hts-cache/old.dat");
// remove("hts-cache/old.ndx");
}
}
}
}
}
}
}
// If both cache generations exist, keep the most complete one
hts_cache_reconcile(opt, CACHE_RECONCILE_INTERRUPTED);
// Header debugging
if (_DEBUG_HEAD) {
ioinfo =


@@ -43,8 +43,8 @@ Please visit our Website: http://www.httrack.com
configure.ac, decoupled from these). VERSION is the display form, VERSIONID
the dotted numeric form, AFF_VERSION the short form shown in footers,
LIB_VERSION the data/cache format generation. */
#define HTTRACK_VERSION "3.49-10"
#define HTTRACK_VERSIONID "3.49.10"
#define HTTRACK_VERSION "3.49-11"
#define HTTRACK_VERSIONID "3.49.11"
#define HTTRACK_AFF_VERSION "3.x"
#define HTTRACK_LIB_VERSION "2.0"


@@ -41,6 +41,10 @@ Please visit our Website: http://www.httrack.com
#include "htstools.h"
#include "htscharset.h"
#include "htsencoding.h"
#include "htssniff.h"
#if HTS_USEZLIB
#include "htszlib.h"
#endif
#include <ctype.h>
#define ADD_STANDARD_PATH \
@@ -70,31 +74,6 @@ static const char *hts_tbdev[] = {
""
};
#define URLSAVENAME_WAIT_FOR_AVAILABLE_SOCKET() do { \
int prev = opt->state._hts_in_html_parsing; \
while(back_pluggable_sockets_strict(sback, opt) <= 0) { \
opt->state. _hts_in_html_parsing = 6; \
/* Wait .. */ \
back_wait(sback,opt,cache,0); \
/* Transfer rate */ \
engine_stats(); \
/* Refresh various stats */ \
HTS_STAT.stat_nsocket=back_nsoc(sback); \
HTS_STAT.stat_errors=fspc(opt,NULL,"error"); \
HTS_STAT.stat_warnings=fspc(opt,NULL,"warning"); \
HTS_STAT.stat_infos=fspc(opt,NULL,"info"); \
HTS_STAT.nbk=backlinks_done(sback,opt->liens,opt->lien_tot,ptr); \
HTS_STAT.nb=back_transferred(HTS_STAT.stat_bytes,sback); \
/* Check */ \
{ \
if (!RUN_CALLBACK7(opt, loop, sback->lnk, sback->count,-1,ptr,opt->lien_tot,(int) (time_local()-HTS_STAT.stat_timestart),&HTS_STAT)) { \
return -1; \
} \
} \
} \
opt->state._hts_in_html_parsing = prev; \
} while(0)
/* Strip all // */
static void cleanDoubleSlash(char *s) {
int i, j;
@@ -138,46 +117,184 @@ static void cleanEndingSpaceOrDot(char *s) {
}
}
/* Should the wire Content-Type override the URL's own extension when naming the
saved file? True when the type is patchable (may_unknown2) and either the URL
extension implies no specific type or the server declared a disagreeing one.
A URL extension mapping to a specific non-HTML type is kept only when the
server declared NO type (the HTS_UNKNOWN_MIME sentinel; the #267 mangle
guard): a typeless .png stays .png, but a .pdf explicitly served as text/html
is named .html. The sentinel rides the cache, so updates stay consistent. */
static int wire_patches_ext(httrackp *opt, const char *wiremime,
const char *file) {
char urlmime[256];
/* Wire Content-Type vs URL extension: a patchable wire type wins over an
unspecific ext, the HTS_UNKNOWN_MIME sentinel keeps a specific non-HTML ext
(#267 guard), a declared disagreement is CONTESTED (sniffed below). */
typedef enum wire_verdict {
WIRE_KEEPS_EXT,
WIRE_WINS,
WIRE_CONTESTED
} wire_verdict;
static wire_verdict wire_ext_verdict(httrackp *opt, const char *wiremime,
const char *file, char *urlmime,
size_t urlmime_size) {
if (may_unknown2(opt, wiremime, file))
return 0; /* type kept verbatim (keep-list / bogus-multiple) */
return WIRE_KEEPS_EXT; /* type kept verbatim (keep-list / bogus-multiple) */
urlmime[0] = '\0';
/* type implied by the URL extension, only when confidently known (flag 0) */
if (!get_httptype_sized(opt, urlmime, sizeof(urlmime), file, 0))
return 1; /* URL ext implies no known type: trust the wire type */
if (!get_httptype_sized(opt, urlmime, urlmime_size, file, 0))
return WIRE_WINS; /* URL ext implies no known type */
if (strfield2(wiremime, urlmime))
return 0; /* wire agrees with the ext: keep it (no .htm->.html churn) */
/* wire disagrees with a specific non-HTML URL ext. Keep the ext only when
the server declared no type (the sentinel); an explicitly declared type,
even text/html, is trusted, so a binary-looking URL that really serves
HTML (login/error interstitial, soft-404) is named .html. */
return WIRE_KEEPS_EXT; /* agreement (no .htm->.html churn) */
if (!is_hypertext_mime(opt, urlmime, file) &&
strfield2(wiremime, HTS_UNKNOWN_MIME))
return WIRE_KEEPS_EXT; /* no declared type */
return WIRE_CONTESTED;
}
/* Optional evidence for a contested wire-vs-ext verdict. */
typedef struct sniff_src {
struct_back *sback; /* live backing (looked up by adr/fil) */
const lien_back *headers; /* snapshot: r.adr, else the url_sav file */
const char *adr, *fil;
const char *prev_save; /* previous run's save name (cache X-Save) */
} sniff_src;
#if HTS_USEZLIB
/* Inflate the head of a gzip/zlib stream; 0 when undecodable. */
static size_t sniff_inflate_head(const void *in, size_t in_len, void *out,
size_t out_len) {
z_stream zs;
size_t n = 0;
int err;
memset(&zs, 0, sizeof(zs));
if (inflateInit2(&zs, 47) != Z_OK) /* 47: gzip or zlib, autodetected */
return 0;
zs.next_in = (const Bytef *) in;
zs.avail_in = (uInt) in_len;
zs.next_out = (Bytef *) out;
zs.avail_out = (uInt) out_len;
err = inflate(&zs, Z_SYNC_FLUSH);
if (err == Z_OK || err == Z_STREAM_END || err == Z_BUF_ERROR)
n = out_len - zs.avail_out;
inflateEnd(&zs);
return n;
}
#endif
static size_t sniff_read_head(const char *path, void *buf, size_t len) {
char catbuff[CATBUFF_SIZE];
FILE *const fp = FOPEN(fconv(catbuff, sizeof(catbuff), path), "rb");
size_t n = 0;
if (fp != NULL) {
n = fread(buf, 1, len, fp);
fclose(fp);
}
return n;
}
/* Body head of one slot: memory, else its flushed on-disk file (url_sav, or
tmpfile for a compressed stream); inflated so the sniff sees the final body.
*/
static size_t sniff_slot_head(const lien_back *slot, void *buf, size_t len) {
const htsblk *const r = &slot->r;
size_t n = 0;
if (r->adr != NULL && r->size > 0) {
n = (size_t) r->size < len ? (size_t) r->size : len;
memcpy(buf, r->adr, n);
} else {
if (r->out != NULL)
fflush(r->out);
if (slot->url_sav[0] != '\0')
n = sniff_read_head(slot->url_sav, buf, len);
if (n == 0 && slot->tmpfile != NULL && slot->tmpfile[0] != '\0')
n = sniff_read_head(slot->tmpfile, buf, len);
}
if (n > 0 && r->compressed) {
#if HTS_USEZLIB
unsigned char raw[HTS_SNIFF_LEN];
if (n > sizeof(raw))
n = sizeof(raw);
memcpy(raw, buf, n);
n = sniff_inflate_head(raw, n, buf, len);
#else
n = 0;
#endif
}
return n;
}
/* Up to len leading body bytes; 0 when unavailable, and always in
non-delayed mode (its HEAD-probe first run couldn't sniff either). */
static size_t sniff_body_head(httrackp *opt, const sniff_src *src, void *buf,
size_t len) {
size_t n = 0;
if (src == NULL || opt->savename_delayed == HTS_SAVENAME_DELAYED_NONE)
return 0;
/* live backing slot: a snapshot (back_copy_static) loses r.adr/r.out */
if (src->sback != NULL && src->adr != NULL && src->fil != NULL) {
const int b = back_index(opt, src->sback, src->adr, src->fil, NULL);
if (b >= 0)
n = sniff_slot_head(&src->sback->lnk[b], buf, len);
}
if (n == 0 && src->headers != NULL)
n = sniff_slot_head(src->headers, buf, len);
return n;
}
/* Contested verdicts: magic proving the URL ext keeps it, else wire wins. */
static int wire_patches_ext(httrackp *opt, const sniff_src *src,
const char *wiremime, const char *file) {
char urlmime[256];
switch (wire_ext_verdict(opt, wiremime, file, urlmime, sizeof(urlmime))) {
case WIRE_KEEPS_EXT:
return 0;
case WIRE_WINS:
return 1;
case WIRE_CONTESTED:
break;
}
if (src != NULL) {
if (hts_sniff_mime_known(urlmime)) {
unsigned char head[HTS_SNIFF_LEN];
const size_t n = sniff_body_head(opt, src, head, sizeof(head));
if (n > 0)
return hts_sniff_mime_consistent(head, n, urlmime) ? 0 : 1;
}
/* no bytes: reproduce the previous run's verdict (cached X-Save name) */
if (src->prev_save != NULL && src->prev_save[0] != '\0') {
char prevmime[256];
prevmime[0] = '\0';
if (get_httptype_sized(opt, prevmime, sizeof(prevmime), src->prev_save,
0) &&
strfield2(prevmime, urlmime))
return 0;
}
}
return 1;
}
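The decision flow above (a three-way verdict, then a magic-byte check only for the contested case) can be sketched standalone. Everything here is a simplified stand-in with hypothetical names: the extension table covers three types, comparisons are case-sensitive, `SK_UNKNOWN_MIME` is a placeholder for the `HTS_UNKNOWN_MIME` sentinel whose value this diff does not show, and the keep-list check and the cached `prev_save` fallback are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_UNKNOWN_MIME "unknown/unknown" /* placeholder sentinel value */

typedef enum { SK_KEEPS_EXT, SK_WIRE_WINS, SK_CONTESTED } sk_verdict;

/* Type implied by the URL extension, or NULL when none is known. */
static const char *sk_ext_mime(const char *file) {
  const char *dot = strrchr(file, '.');
  if (dot == NULL) return NULL;
  if (strcmp(dot, ".png") == 0) return "image/png";
  if (strcmp(dot, ".pdf") == 0) return "application/pdf";
  if (strcmp(dot, ".html") == 0) return "text/html";
  return NULL;
}

/* Do the leading body bytes carry the magic number of this type? */
static int sk_magic_consistent(const unsigned char *head, size_t n,
                               const char *mime) {
  static const unsigned char png[8] = {0x89, 'P', 'N', 'G',
                                       '\r', '\n', 0x1A, '\n'};
  if (strcmp(mime, "image/png") == 0)
    return n >= sizeof(png) && memcmp(head, png, sizeof(png)) == 0;
  if (strcmp(mime, "application/pdf") == 0)
    return n >= 5 && memcmp(head, "%PDF-", 5) == 0;
  return 0; /* no magic known for this type */
}

static sk_verdict sk_ext_verdict(const char *wiremime, const char *file,
                                 const char **urlmime) {
  *urlmime = sk_ext_mime(file);
  if (*urlmime == NULL)
    return SK_WIRE_WINS; /* extension implies no known type */
  if (strcmp(wiremime, *urlmime) == 0)
    return SK_KEEPS_EXT; /* wire and extension agree */
  if (strcmp(*urlmime, "text/html") != 0 &&
      strcmp(wiremime, SK_UNKNOWN_MIME) == 0)
    return SK_KEEPS_EXT; /* server declared no type */
  return SK_CONTESTED;
}

/* 1 = name the file after the wire type, 0 = keep the URL extension. */
static int sk_wire_patches_ext(const char *wiremime, const char *file,
                               const unsigned char *head, size_t n) {
  const char *urlmime = NULL;

  assert(wiremime != NULL && file != NULL);
  switch (sk_ext_verdict(wiremime, file, &urlmime)) {
  case SK_KEEPS_EXT: return 0;
  case SK_WIRE_WINS: return 1;
  case SK_CONTESTED: break;
  }
  if (head != NULL && n > 0)
    return sk_magic_consistent(head, n, urlmime) ? 0 : 1;
  return 1; /* no evidence: the declared type wins */
}
```

In this sketch a `.png` URL served as `text/html` keeps its extension only when the body really starts with the PNG signature; an HTML interstitial behind the same URL is renamed after the wire type.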
int hts_ext_sniff_wanted(httrackp *opt, const char *wiremime,
const char *file) {
char urlmime[256];
return wiremime != NULL && strnotempty(wiremime) &&
wire_ext_verdict(opt, wiremime, file, urlmime, sizeof(urlmime)) ==
WIRE_CONTESTED &&
hts_sniff_mime_known(urlmime);
}
/* Wire-metadata name change: a Content-Disposition filename wins (returns 2),
else the declared type's ext when wire_patches_ext() allows (returns 1),
else 0. ext receives the new extension or replacement filename. */
static int resolve_extension(httrackp *opt, const char *cdispo,
const char *contenttype, const char *fil,
char *ext, size_t ext_size) {
static int resolve_extension(httrackp *opt, const sniff_src *src,
const char *cdispo, const char *contenttype,
const char *fil, char *ext, size_t ext_size) {
if (strnotempty(cdispo)) {
strlcpybuff(ext, cdispo, ext_size);
return 2;
}
if (wire_patches_ext(opt, contenttype, fil) &&
if (wire_patches_ext(opt, src, contenttype, fil) &&
give_mimext(ext, ext_size, contenttype))
return 1;
return 0;
@@ -429,14 +546,21 @@ int url_savename(lien_adrfilsave *const afs,
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD ||
ishtml(opt, fil) < 0) { // unsure whether it's html or a file
// read from the cache
htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test only
char BIGSTK previous_save[HTS_URLMAXSIZE * 2];
htsblk r;
previous_save[0] = '\0';
r = cache_read_including_broken(opt, cache, adr, fil,
previous_save); // test only
if (r.statuscode != -1) { // cache entry read OK
hts_log_print(opt, LOG_DEBUG, "Testing link type (from cache) %s%s",
adr_complete, fil_complete);
if (!HTTP_IS_REDIRECT(r.statuscode)) {
ext_chg = resolve_extension(opt, r.cdispo, r.contenttype, fil,
ext, sizeof(ext));
const sniff_src src = {sback, NULL, adr, fil, previous_save};
ext_chg = resolve_extension(opt, &src, r.cdispo, r.contenttype,
fil, ext, sizeof(ext));
}
} else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD &&
is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
@@ -463,7 +587,9 @@ int url_savename(lien_adrfilsave *const afs,
!opt->state.stop) {
// Check if the file is ready in backing.
if (headers != NULL && headers->status >= 0 && !is_redirect) {
ext_chg = resolve_extension(opt, headers->r.cdispo,
const sniff_src src = {sback, headers, adr, fil, NULL};
ext_chg = resolve_extension(opt, &src, headers->r.cdispo,
headers->r.contenttype,
headers->url_fil, ext, sizeof(ext));
}
@@ -501,11 +627,10 @@ int url_savename(lien_adrfilsave *const afs,
int has_been_moved = 0;
lien_adrfil current;
/* Ensure we don't use too many sockets by using a "testing" one
If we have only 1 simultaneous connection authorized, wait for pending download
Wait for an available slot
/* Wait for an available test slot, honoring the connection limits
*/
URLSAVENAME_WAIT_FOR_AVAILABLE_SOCKET();
if (!hts_wait_available_socket(sback, opt, cache, ptr))
return -1;
/* Rock'in */
current.adr[0] = current.fil[0] = '\0';
@@ -535,24 +660,11 @@ int url_savename(lien_adrfilsave *const afs,
if (ptr >= 0) {
back_fillmax(sback, opt, cache, ptr, numero_passe);
}
// on est obligé d'appeler le shell pour le refresh..
// Transfer rate
engine_stats();
// Refresh various stats
HTS_STAT.stat_nsocket = back_nsoc(sback);
HTS_STAT.stat_errors = fspc(opt, NULL, "error");
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning");
HTS_STAT.stat_infos = fspc(opt, NULL, "info");
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr);
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback);
if (!RUN_CALLBACK7
(opt, loop, sback->lnk, sback->count, b, ptr, opt->lien_tot,
(int) (time_local() - HTS_STAT.stat_timestart),
&HTS_STAT)) {
if (!hts_loop_tick(sback, opt, b, ptr)) {
return -1;
} else if (opt->state._hts_cancel || !back_checkmirror(opt)) { // cancel 2 ou 1 (cancel parsing)
} else if (opt->state._hts_cancel ||
!back_checkmirror(
opt)) { // cancel level 2 or 1 (cancel parsing)
back_delete(opt, cache, sback, b); // cancel test
stop_looping = 1;
}
@@ -617,8 +729,9 @@ int url_savename(lien_adrfilsave *const afs,
"Loop with HEAD request (during prefetch) at %s%s",
current.adr, current.fil);
}
// Ajouter
URLSAVENAME_WAIT_FOR_AVAILABLE_SOCKET();
if (!hts_wait_available_socket(sback, opt,
cache, ptr))
return -1;
if (back_add(sback, opt, cache, moved.adr, moved.fil, methode, referer_adr, referer_fil, 1) != -1) { // OK
hts_log_print(opt, LOG_DEBUG,
"(during prefetch) %s (%d) to link %s at %s%s",
@@ -674,7 +787,7 @@ int url_savename(lien_adrfilsave *const afs,
// no error: change the type?
ext_chg = resolve_extension(
opt, back[b].r.cdispo, back[b].r.contenttype,
opt, NULL, back[b].r.cdispo, back[b].r.contenttype,
back[b].url_fil, ext, sizeof(ext));
}
// END if not moved, force the type?


@@ -100,6 +100,8 @@ void standard_name(char *b, size_t bsize, const char *dot_pos,
const char *nom_pos, const char *fil_complete,
int short_ver);
void url_savename_addstr(char *d, const char *s);
/* Contested wire-vs-ext verdict that a body sniff could settle (htssniff.h). */
int hts_ext_sniff_wanted(httrackp *opt, const char *wiremime, const char *file);
char *url_md5(char *digest_buffer, const char *fil_complete);
void url_savename_refname(const char *adr, const char *fil, char *filename);
char *url_savename_refname_fullpath(httrackp * opt, const char *adr,


@@ -49,6 +49,7 @@ Please visit our Website: http://www.httrack.com
#include "htsindex.h"
#include "htscharset.h"
#include "htsencoding.h"
#include "htssniff.h"
/* external modules */
#include "htsmodules.h"
@@ -3398,20 +3399,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
back_wait(sback, opt, cache, HTS_STAT.stat_timestart);
back_fillmax(sback, opt, cache, ptr, numero_passe);
// Transfer rate
engine_stats();
// Refresh various stats
HTS_STAT.stat_nsocket = back_nsoc(sback);
HTS_STAT.stat_errors = fspc(opt, NULL, "error");
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning");
HTS_STAT.stat_infos = fspc(opt, NULL, "info");
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr);
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback);
if (!RUN_CALLBACK7
(opt, loop, sback->lnk, sback->count, 0, ptr, opt->lien_tot,
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)) {
if (!hts_loop_tick(sback, opt, 0, ptr)) {
hts_log_print(opt, LOG_ERROR, "Exit requested by shell or user");
*stre->exit_xh_ = 1; // exit requested
XH_uninit;
@@ -3422,7 +3410,6 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
nofollow = 1; // less drastic
opt->state._hts_cancel = 0;
}
}
// refresh the backing system each 2 seconds
if (engine_stats()) {
@@ -3959,22 +3946,8 @@ void hts_mirror_process_user_interaction(htsmoduleStruct * str,
{
back_wait(sback, opt, cache, HTS_STAT.stat_timestart);
// Transfer rate
engine_stats();
// Refresh various stats
HTS_STAT.stat_nsocket = back_nsoc(sback);
HTS_STAT.stat_errors = fspc(opt, NULL, "error");
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning");
HTS_STAT.stat_infos = fspc(opt, NULL, "info");
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr);
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback);
b = 0;
if (!RUN_CALLBACK7
(opt, loop, sback->lnk, sback->count, b, ptr, opt->lien_tot,
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)
|| !back_checkmirror(opt)) {
if (!hts_loop_tick(sback, opt, b, ptr) || !back_checkmirror(opt)) {
hts_log_print(opt, LOG_ERROR, "Exit requested by shell or user");
*stre->exit_xh_ = 1; // exit requested
XH_uninit;
@@ -4076,21 +4049,11 @@ void hts_mirror_process_user_interaction(htsmoduleStruct * str,
while(opt->state._hts_setpause || back_pluggable_sockets_strict(sback, opt) <= 0) { // pausing..
opt->state._hts_in_html_parsing = 6;
back_wait(sback, opt, cache, HTS_STAT.stat_timestart);
/* time limit (-E) exceeded: stop waiting for a socket (#481) */
if (!back_checkmirror(opt))
break;
// Transfer rate
engine_stats();
// Refresh various stats
HTS_STAT.stat_nsocket = back_nsoc(sback);
HTS_STAT.stat_errors = fspc(opt, NULL, "error");
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning");
HTS_STAT.stat_infos = fspc(opt, NULL, "info");
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr);
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback);
if (!RUN_CALLBACK7
(opt, loop, sback->lnk, sback->count, b, ptr, opt->lien_tot,
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)) {
if (!hts_loop_tick(sback, opt, b, ptr)) {
hts_log_print(opt, LOG_ERROR, "Exit requested by shell or user");
*stre->exit_xh_ = 1; // exit requested
XH_uninit;
@@ -4277,26 +4240,12 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
freet(s);
}
// Transfer rate
engine_stats();
// Refresh various stats
HTS_STAT.stat_nsocket = back_nsoc(sback);
HTS_STAT.stat_errors = fspc(opt, NULL, "error");
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning");
HTS_STAT.stat_infos = fspc(opt, NULL, "info");
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr);
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback);
if (!RUN_CALLBACK7
(opt, loop, sback->lnk, sback->count, b, ptr, opt->lien_tot,
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)) {
if (!hts_loop_tick(sback, opt, b, ptr)) {
hts_log_print(opt, LOG_ERROR, "Exit requested by shell or user");
*stre->exit_xh_ = 1; // exit requested
XH_uninit;
return 0;
}
}
#if HTS_POLL
@@ -4529,10 +4478,9 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
IS_DELAYED_EXT(afs->save) && continue_loop && loops < 7; loops++) {
continue_loop = 0;
/*
Wait for an available slot
*/
WAIT_FOR_AVAILABLE_SOCKET();
/* Wait for an available slot */
if (!hts_wait_available_socket(sback, opt, cache, ptr))
return -1;
/* We can lookup directly in the cache to speedup this mess */
if (opt->delayed_cached) {
@@ -4678,39 +4626,28 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
if (ptr >= 0) {
back_fillmax(sback, opt, cache, ptr, numero_passe);
}
// on est obligé d'appeler le shell pour le refresh..
{
// Transfer rate
engine_stats();
// Refresh various stats
HTS_STAT.stat_nsocket = back_nsoc(sback);
HTS_STAT.stat_errors = fspc(opt, NULL, "error");
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning");
HTS_STAT.stat_infos = fspc(opt, NULL, "info");
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr);
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback);
if (!RUN_CALLBACK7
(opt, loop, sback->lnk, sback->count, b, ptr, opt->lien_tot,
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)) {
return -1;
} else if (opt->state._hts_cancel || !back_checkmirror(opt)) { // cancel 2 ou 1 (cancel parsing)
back_delete(opt, cache, sback, b); // cancel test
break;
}
if (!hts_loop_tick(sback, opt, b, ptr)) {
back_set_unlocked(sback, b);
return -1;
} else if (opt->state._hts_cancel ||
!back_checkmirror(
opt)) { // cancel level 2 or 1 (cancel parsing)
back_delete(opt, cache, sback, b); // cancel test
break;
}
} while(
/* dns/connect/request */
(back[b].status >= 99 && back[b].status <= 101)
||
/* For redirects, wait for request to be terminated */
(HTTP_IS_REDIRECT(back[b].r.statuscode) && back[b].status > 0)
||
/* Same for errors */
(HTTP_IS_ERROR(back[b].r.statuscode) && back[b].status > 0)
);
} while (
/* dns/connect/request */
(back[b].status >= 99 && back[b].status <= 101) ||
/* For redirects, wait for request to be terminated */
(HTTP_IS_REDIRECT(back[b].r.statuscode) && back[b].status > 0) ||
/* Same for errors */
(HTTP_IS_ERROR(back[b].r.statuscode) && back[b].status > 0) ||
/* Contested type: wait for a sniffable body head (or EOF) */
(back[b].r.statuscode == HTTP_OK && back[b].status > 0 &&
strnotempty(back[b].r.cdispo) == 0 &&
back[b].r.size < HTS_SNIFF_LEN &&
hts_ext_sniff_wanted(opt, back[b].r.contenttype,
back[b].url_fil)));
if (b >= 0) {
back_set_unlocked(sback, b); // Unlocked entry
}
@@ -4845,6 +4782,9 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
/* Still have a back reference */
if (b >= 0) {
/* patch url_sav BEFORE finalize: it records/caches under this name
*/
strcpybuff(back[b].url_sav, afs->save);
/* Finalize now as we have the type */
if (back[b].status == STATUS_READY) {
if (!back[b].finalized) {
@@ -4852,8 +4792,6 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
back_finalize(opt, cache, sback, b);
}
}
/* Patch destination filename for direct-to-disk mode */
strcpybuff(back[b].url_sav, afs->save);
}
} // b >= 0

@@ -175,27 +175,4 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
/* Apply changes */ \
* str->ptr_ = ptr
#define WAIT_FOR_AVAILABLE_SOCKET() do { \
int prev = opt->state._hts_in_html_parsing; \
while(back_pluggable_sockets_strict(sback, opt) <= 0) { \
opt->state._hts_in_html_parsing = 6; \
/* Wait .. */ \
back_wait(sback,opt,cache,0); \
/* Transfer rate */ \
engine_stats(); \
/* Refresh various stats */ \
HTS_STAT.stat_nsocket=back_nsoc(sback); \
HTS_STAT.stat_errors=fspc(opt,NULL,"error"); \
HTS_STAT.stat_warnings=fspc(opt,NULL,"warning"); \
HTS_STAT.stat_infos=fspc(opt,NULL,"info"); \
HTS_STAT.nbk=backlinks_done(sback,opt->liens,opt->lien_tot,ptr); \
HTS_STAT.nb=back_transferred(HTS_STAT.stat_bytes,sback); \
/* Check */ \
if (!RUN_CALLBACK7(opt, loop, sback->lnk, sback->count, -1,ptr,opt->lien_tot,(int) (time_local()-HTS_STAT.stat_timestart),&HTS_STAT)) { \
return -1; \
} \
} \
opt->state._hts_in_html_parsing = prev; \
} while(0)
#endif

@@ -52,6 +52,7 @@ Please visit our Website: http://www.httrack.com
#include "htsencoding.h"
#include "htsftp.h"
#include "htsmd5.h"
#include "htssniff.h"
#if HTS_USEZLIB
#include "htszlib.h"
#endif
@@ -1117,6 +1118,46 @@ static int st_header(httrackp *opt, int argc, char **argv) {
return 0;
}
/* Decode a body argument ("hex:FFD8.." or literal text) into buf. */
static size_t st_decode_body(const char *arg, char *buf, size_t size) {
size_t n = 0;
if (strncmp(arg, "hex:", 4) == 0) {
const char *s = arg + 4;
for (; s[0] != '\0' && s[1] != '\0' && n + 1 < size; s += 2) {
unsigned int byte;
if (sscanf(s, "%2x", &byte) != 1)
break;
buf[n++] = (char) byte;
}
} else {
n = strlen(arg);
if (n >= size)
n = size - 1;
memcpy(buf, arg, n);
}
buf[n] = '\0';
return n;
}
static int st_sniff(httrackp *opt, int argc, char **argv) {
char BIGSTK body[1024];
size_t n;
(void) opt;
if (argc < 2) {
fprintf(stderr, "sniff: needs a content-type and a body\n");
return 1;
}
n = st_decode_body(argv[1], body, sizeof(body));
printf("sniff: known=%d consistent=%d\n",
hts_sniff_mime_known(argv[0]) == HTS_TRUE,
hts_sniff_mime_consistent(body, n, argv[0]) == HTS_TRUE);
return 0;
}
static int st_savename(httrackp *opt, int argc, char **argv) {
lien_adrfilsave afs;
cache_back cache;
@@ -1125,6 +1166,9 @@ static int st_savename(httrackp *opt, int argc, char **argv) {
lien_back headers;
const char *adr = "www.example.com";
const char *cdispo = NULL;
const char *body = NULL;
const char *cached = NULL;
const char *bodyfile = "st-savename-body.tmp";
int statuscode = HTTP_OK, status = 0;
int i;
@@ -1158,6 +1202,10 @@ static int st_savename(httrackp *opt, int argc, char **argv) {
opt->savename_83 = atoi(a + 4);
else if (strncmp(a, "type=", 5) == 0)
opt->savename_type = atoi(a + 5);
else if (strncmp(a, "body=", 5) == 0)
body = a + 5;
else if (strncmp(a, "cached=", 7) == 0)
cached = a + 7;
else if (strncmp(a, "prior=", 6) != 0) {
fprintf(stderr, "savename: unknown arg '%s'\n", a);
return 1;
@@ -1168,7 +1216,47 @@ static int st_savename(httrackp *opt, int argc, char **argv) {
strcpybuff(afs.af.fil, argv[0]);
memset(&cache, 0, sizeof(cache));
cache.hashtable = (void *) coucal_new(0);
if (cached != NULL) { /* cached=<content-type>|<save name> */
char *dup = strdupt(cached);
char *const sep = strchr(dup, '|');
char locbuf[64] = "";
htsblk cr;
if (sep == NULL) {
fprintf(stderr, "savename: cached needs ctype|save\n");
return 1;
}
*sep = '\0';
/* one-entry cache in cwd, reopened read-only; body is PNG magic on
purpose: only the recorded name (X-Save) may drive the naming */
StringCopy(opt->path_log, "");
cache.type = 1;
cache.log = cache.errlog = stderr;
cache.hashtable = coucal_new(0);
cache_init(&cache, opt);
hts_init_htsblk(&cr);
cr.statuscode = HTTP_OK;
strcpybuff(cr.msg, "OK");
strcpybuff(cr.contenttype, dup);
cr.location = locbuf;
cr.adr = strdupt("\x89PNG\r\n\x1a\n");
cr.size = 8;
cache_add(opt, &cache, &cr, adr, argv[0], sep + 1, 1, NULL);
freet(cr.adr);
if (cache.zipOutput != NULL) {
zipClose(cache.zipOutput, NULL);
cache.zipOutput = NULL;
}
memset(&cache, 0, sizeof(cache));
cache.type = 1;
cache.log = cache.errlog = stderr;
cache.hashtable = coucal_new(0);
cache.ro = 1;
cache_init(&cache, opt);
freet(dup);
} else {
cache.hashtable = (void *) coucal_new(0);
}
sback = back_new(opt, opt->maxsoc * 32 + 1024);
/* same wiring as hts_mirror (htscore.c) */
@@ -1201,9 +1289,23 @@ static int st_savename(httrackp *opt, int argc, char **argv) {
if (cdispo != NULL)
strcpybuff(headers.r.cdispo, cdispo);
strcpybuff(headers.url_fil, argv[0]);
if (body != NULL) { /* leading body bytes, read via url_sav */
char BIGSTK data[1024];
const size_t n = st_decode_body(body, data, sizeof(data));
FILE *const fp = fopen(bodyfile, "wb");
if (fp == NULL || fwrite(data, 1, n, fp) != n) {
fprintf(stderr, "savename: can not write %s\n", bodyfile);
return 1;
}
fclose(fp);
strcpybuff(headers.url_sav, bodyfile);
}
url_savename(&afs, NULL, NULL, NULL, opt, sback, &cache, &hash, 0, 0,
&headers);
if (body != NULL)
(void) UNLINK(bodyfile);
printf("savename: %s\n", afs.save);
return 0;
}
@@ -1245,6 +1347,30 @@ static int st_cache_writefail(httrackp *opt, int argc, char **argv) {
return err;
}
static int st_cache_corrupt(httrackp *opt, int argc, char **argv) {
int err;
if (argc < 1) {
fprintf(stderr, "cache-corrupt: needs a directory\n");
return 1;
}
err = cache_corruption_selftest(opt, argv[0]);
printf("cache-corrupt: %s\n", err ? "FAIL" : "OK");
return err;
}
static int st_reconcile(httrackp *opt, int argc, char **argv) {
int err;
if (argc < 1) {
fprintf(stderr, "reconcile: needs a directory\n");
return 1;
}
err = cache_reconcile_selftest(opt, argv[0]);
printf("cache-reconcile: %s\n", err ? "FAIL" : "OK");
return err;
}
static int st_dns(httrackp *opt, int argc, char **argv) {
const int err = dns_selftests(opt);
@@ -2010,11 +2136,17 @@ static const struct selftest_entry {
st_header},
{"savename", "<fil> <content-type> [key=value ...]",
"local save-name for a URL", st_savename},
{"sniff", "<content-type> <hex:..|text>", "MIME magic consistency",
st_sniff},
{"cache", "<dir>", "cache read/write round-trip self-test", st_cache},
{"cache-golden", "<dir> [regen]", "frozen cache-format read self-test",
st_cache_golden},
{"cache-writefail", "<dir>", "cache write-failure handling self-test",
st_cache_writefail},
{"reconcile", "<dir>", "cache generation reconcile policy self-test",
st_reconcile},
{"cache-corrupt", "<dir>", "cache read-side corruption self-test",
st_cache_corrupt},
{"dns", "", "DNS resolver/cache self-test", st_dns},
{"cookies", "", "cookie request-header self-test", st_cookies},
{"useragent", "", "default User-Agent self-test", st_useragent},
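
The `st_decode_body` helper in the hunk above turns a `hex:FFD8..` argument (or literal text) into raw bytes for the self-tests. A minimal standalone sketch of the same idea follows. It is not the commit's code: the names `hex_digit` and `decode_body` are hypothetical, and it validates each hex digit explicitly instead of calling `sscanf` with `%2x`, which would also accept a sign, whitespace, or a single digit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 when c is not a hex digit. */
static int hex_digit(char c) {
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  if (c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

/* Decode "hex:FFD8.." or literal text into buf (size bytes, always
   NUL-terminated). Returns the number of payload bytes stored; hex decoding
   stops at the first invalid or unpaired digit. */
static size_t decode_body(const char *arg, char *buf, size_t size) {
  size_t n = 0;

  assert(arg != NULL && buf != NULL && size > 0);
  if (strncmp(arg, "hex:", 4) == 0) {
    const char *s = arg + 4;

    /* n + 1 < size keeps one byte free for the terminating NUL */
    for (; s[0] != '\0' && s[1] != '\0' && n + 1 < size; s += 2) {
      const int hi = hex_digit(s[0]);
      const int lo = hex_digit(s[1]);

      if (hi < 0 || lo < 0)
        break;
      buf[n++] = (char) ((hi << 4) | lo);
    }
  } else {
    n = strlen(arg);
    if (n >= size) /* truncate literal text to fit */
      n = size - 1;
    memcpy(buf, arg, n);
  }
  buf[n] = '\0';
  return n;
}
```

The returned length matters more than the NUL terminator here, since decoded bodies such as JPEG or PNG magic may contain embedded zero bytes.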

352
src/htssniff.c Normal file

@@ -0,0 +1,352 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 1998-2017 Xavier Roche and other contributors
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Important notes:
- We hereby ask people using this source NOT to use it in purpose of grabbing
emails addresses, or collecting any other private information on persons.
This would disgrace our work, and spoil the many hours we spent on it.
Please visit our Website: http://www.httrack.com
*/
/* ------------------------------------------------------------ */
/* File: MIME magic-byte consistency checks */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#include "htssniff.h"
#include <string.h>
#include "htslib.h"
/* One magic rule: `len` bytes at `off` confirm `mime`. */
typedef struct sniff_magic {
const char *mime;
unsigned short off;
unsigned char len;
const char *bytes;
} sniff_magic;
/* Direction is mime -> magic (verify a claim, never classify); types with
no reliable magic (plain text, css, js..) are deliberately absent. Patterns
follow the WHATWG MIME Sniffing Standard tables where it defines them
(https://mimesniff.spec.whatwg.org/); the rest covers httrack's wider MIME
set. Spec-only types absent from our MIME tables (EOT, font/collection)
are omitted as unreachable. */
static const sniff_magic sniff_table[] = {
/* images */
{"image/jpeg", 0, 3, "\xff\xd8\xff"},
{"image/pipeg", 0, 3, "\xff\xd8\xff"},
{"image/pjpeg", 0, 3, "\xff\xd8\xff"},
{"image/png", 0, 8, "\x89PNG\r\n\x1a\n"},
{"image/gif", 0, 6, "GIF87a"},
{"image/gif", 0, 6, "GIF89a"},
{"image/bmp", 0, 2, "BM"},
{"image/tiff", 0, 4, "II*\0"},
{"image/tiff", 0, 4, "MM\0*"},
{"image/x-icon", 0, 4, "\0\0\1\0"},
{"image/x-icon", 0, 4, "\0\0\2\0"}, /* Windows cursor, per the spec */
{"image/x-portable-bitmap", 0, 2, "P1"},
{"image/x-portable-bitmap", 0, 2, "P4"},
{"image/x-portable-pixmap", 0, 2, "P3"},
{"image/x-portable-pixmap", 0, 2, "P6"},
{"image/x-xpixmap", 0, 9, "/* XPM */"},
{"image/x-xbitmap", 0, 7, "#define"},
{"image/x-rgb", 0, 2, "\x01\xda"},
{"image/x-cmu-raster", 0, 4, "\xf1\x00\x40\xbb"},
/* audio */
{"audio/mpeg", 0, 3, "ID3"},
{"audio/basic", 0, 4, ".snd"},
{"audio/mid", 0, 8, "MThd\0\0\0\6"},
{"audio/midi", 0, 8, "MThd\0\0\0\6"},
{"audio/x-pn-realaudio", 0, 4, ".ra\xfd"},
{"audio/x-pn-realaudio", 0, 4, ".RMF"},
{"audio/x-pn-realaudio-plugin", 0, 4, ".ra\xfd"},
{"audio/x-pn-realaudio-plugin", 0, 4, ".RMF"},
{"audio/flac", 0, 4, "fLaC"},
{"audio/aac", 0, 4, "ADIF"},
/* video */
{"video/mpeg", 0, 4, "\x00\x00\x01\xba"},
{"video/mpeg", 0, 4, "\x00\x00\x01\xb3"},
{"video/x-sgi-movie", 0, 4, "MOVI"},
/* archives / compression */
{"application/x-gzip", 0, 3, "\x1f\x8b\x08"},
{"multipart/x-gzip", 0, 3, "\x1f\x8b\x08"},
{"application/x-compressed", 0, 3, "\x1f\x8b\x08"},
{"application/x-compress", 0, 2, "\x1f\x9d"},
{"application/x-bzip2", 0, 3, "BZh"},
{"application/x-7z-compressed", 0, 6, "7z\xbc\xaf\x27\x1c"},
/* 6-byte prefix common to RAR4 (spec) and RAR5 */
{"application/x-rar-compressed", 0, 6, "Rar!\x1a\x07"},
{"application/zstd", 0, 4, "\x28\xb5\x2f\xfd"},
{"application/arj", 0, 2, "\x60\xea"},
{"application/x-cpio", 0, 6, "070701"},
{"application/x-cpio", 0, 6, "070707"},
{"application/x-cpio", 0, 2, "\xc7\x71"},
{"application/x-sv4cpio", 0, 6, "070701"},
{"application/x-sv4crc", 0, 6, "070702"},
{"application/x-stuffit", 0, 8, "StuffIt "},
{"application/x-stuffit", 0, 4, "SIT!"},
{"application/mac-binhex40", 0, 10, "(This file"},
/* documents */
{"application/pdf", 0, 5, "%PDF-"},
{"application/postscript", 0, 2, "%!"},
{"application/rtf", 0, 5, "{\\rtf"},
{"application/x-dvi", 0, 2, "\xf7\x02"},
{"application/x-hdf", 0, 4, "\x0e\x03\x13\x01"},
{"application/x-hdf", 0, 8, "\x89HDF\r\n\x1a\n"},
{"application/x-netcdf", 0, 4, "CDF\x01"},
{"application/x-netcdf", 0, 4, "CDF\x02"},
{"application/x-msaccess", 0, 19, "\0\1\0\0Standard Jet DB"},
/* fonts */
{"font/woff", 0, 4, "wOFF"},
{"font/woff2", 0, 4, "wOF2"},
{"font/ttf", 0, 4, "\0\1\0\0"},
{"font/ttf", 0, 4, "true"},
{"font/otf", 0, 4, "OTTO"},
/* misc */
{"application/x-shockwave-flash", 0, 3, "FWS"},
{"application/x-shockwave-flash", 0, 3, "CWS"},
{"application/x-shockwave-flash", 0, 3, "ZWS"},
{"application/futuresplash", 0, 3, "FWS"},
{"application/x-director", 0, 4, "RIFX"},
{"application/x-director", 0, 4, "XFIR"},
{"application/x-java-vm", 0, 4, "\xca\xfe\xba\xbe"},
{"application/wasm", 0, 4, "\0asm"},
{"application/x-msmetafile", 0, 4, "\xd7\xcd\xc6\x9a"},
{"application/x-msmetafile", 0, 4, "\x01\x00\x09\x00"},
{"application/x-x509-ca-cert", 0, 2, "\x30\x82"},
{"application/x-pkcs12", 0, 2, "\x30\x82"},
{"application/x-pkcs7-mime", 0, 2, "\x30\x82"},
{"application/x-pkcs7-signature", 0, 2, "\x30\x82"},
{"application/x-pkcs7-certificates", 0, 2, "\x30\x82"},
{"x-world/x-vrml", 0, 5, "#VRML"},
{"application/x-bittorrent", 0, 11, "d8:announce"},
{"drawing/x-dwf", 0, 4, "(DWF"},
{"application/acad", 0, 4, "AC10"},
{NULL, 0, 0, NULL}};
/* MIME families sharing a container magic */
static const char *const zip_mimes[] = {
"application/zip", "application/x-zip-compressed", "multipart/x-zip", NULL};
static const char *const zip_mime_prefixes[] = {
"application/vnd.openxmlformats-officedocument.",
"application/vnd.oasis.opendocument.", NULL};
static const char *const ole_mimes[] = {"application/msword",
"application/excel",
"application/vnd.ms-excel",
"application/powerpoint",
"application/vnd.ms-powerpoint",
"application/vnd.ms-project",
"application/vnd.ms-works",
"application/x-msmoney",
"application/x-mspublisher",
NULL};
static const char *const tar_mimes[] = {
"application/x-tar", "application/x-ustar", "application/x-gtar", NULL};
static const char *const ogg_mimes[] = {"application/ogg", "audio/ogg",
"video/ogg", "audio/opus", NULL};
static const char *const ebml_mimes[] = {"video/webm", "audio/webm", NULL};
/* ISO-BMFF, any 'ftyp' brand: containers overlap too much to split */
static const char *const bmff_mimes[] = {"video/mp4", "audio/mp4",
"video/quicktime", NULL};
static const char *const avif_mimes[] = {"image/avif", NULL};
static const char *const heic_mimes[] = {"image/heic", NULL};
static const char *const asf_mimes[] = {"video/x-ms-asf", "video/x-ms-wmv",
"video/x-la-asf", NULL};
static const char *const xml_mimes[] = {"application/xml", "text/xml",
"image/svg+xml", "image/svg-xml", NULL};
static const char *const svg_mimes[] = {"image/svg+xml", "image/svg-xml", NULL};
static const char *const html_mimes[] = {"text/html", NULL};
static const char *const pem_mimes[] = {
"application/x-x509-ca-cert", "application/x-pkcs7-certificates",
"application/x-pkcs7-mime", "application/x-pkcs7-signature", NULL};
static hts_boolean mime_in(const char *const *list, const char *mime) {
size_t i;
for (i = 0; list[i] != NULL; i++)
if (strfield2(list[i], mime))
return HTS_TRUE;
return HTS_FALSE;
}
static hts_boolean mime_in_prefix(const char *const *list, const char *mime) {
size_t i;
for (i = 0; list[i] != NULL; i++)
if (strfield(mime, list[i]))
return HTS_TRUE;
return HTS_FALSE;
}
static hts_boolean has_bytes(const unsigned char *d, size_t n, size_t off,
const char *bytes, size_t len) {
/* overflow-safe: untrusted n alone on one side */
return n >= off && len <= n - off && memcmp(d + off, bytes, len) == 0
? HTS_TRUE
: HTS_FALSE;
}
static unsigned char ascii_lower(unsigned char c) {
return c >= 'A' && c <= 'Z' ? (unsigned char) (c + 32) : c;
}
/* Case-insensitive text prefix after an optional UTF-8 BOM and whitespace. */
static hts_boolean has_text_prefix(const unsigned char *d, size_t n,
const char *prefix) {
const size_t len = strlen(prefix);
size_t i, k;
i = n >= 3 && memcmp(d, "\xef\xbb\xbf", 3) == 0 ? 3 : 0;
while (i < n && (d[i] == ' ' || d[i] == '\t' || d[i] == '\r' || d[i] == '\n'))
i++;
if (len > n - i) /* i <= n from the loop above */
return HTS_FALSE;
for (k = 0; k < len; k++)
if (ascii_lower(d[i + k]) != ascii_lower((unsigned char) prefix[k]))
return HTS_FALSE;
return HTS_TRUE;
}
typedef enum sniff_op {
SNIFF_QUERY_KNOWN, /* is any rule defined for this MIME? */
SNIFF_QUERY_MATCH /* do the bytes confirm this MIME? */
} sniff_op;
/* Single walk for both queries so the rule set can't drift apart. */
static hts_boolean sniff_eval(sniff_op op, const unsigned char *d, size_t n,
const char *mime) {
size_t i;
/* KNOWN short-circuits; MATCH tests the magic */
#define SNIFF_RULE(cond) \
do { \
if (op == SNIFF_QUERY_KNOWN) \
return HTS_TRUE; \
if (cond) \
return HTS_TRUE; \
} while (0)
for (i = 0; sniff_table[i].mime != NULL; i++) {
if (strfield2(sniff_table[i].mime, mime)) {
SNIFF_RULE(has_bytes(d, n, sniff_table[i].off, sniff_table[i].bytes,
sniff_table[i].len));
}
}
if (mime_in(zip_mimes, mime) || mime_in_prefix(zip_mime_prefixes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "PK\3\4", 4) ||
has_bytes(d, n, 0, "PK\5\6", 4));
}
if (mime_in(ole_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", 8));
}
if (mime_in(tar_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 257, "ustar", 5));
}
if (mime_in(ogg_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "OggS\0", 5));
}
if (mime_in(ebml_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "\x1a\x45\xdf\xa3", 4));
}
if (mime_in(bmff_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 4, "ftyp", 4));
}
if (mime_in(avif_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 4, "ftypavif", 8) ||
has_bytes(d, n, 4, "ftypavis", 8));
}
if (mime_in(heic_mimes, mime)) {
SNIFF_RULE(
has_bytes(d, n, 4, "ftyphei", 7) || has_bytes(d, n, 4, "ftyphev", 7) ||
has_bytes(d, n, 4, "ftypmif1", 8) || has_bytes(d, n, 4, "ftypmsf1", 8));
}
if (mime_in(asf_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "\x30\x26\xb2\x75\x8e\x66\xcf\x11", 8));
}
if (strfield2("audio/x-wav", mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "RIFF", 4) && has_bytes(d, n, 8, "WAVE", 4));
}
if (strfield2("video/x-msvideo", mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "RIFF", 4) && has_bytes(d, n, 8, "AVI ", 4));
}
if (strfield2("image/webp", mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "RIFF", 4) &&
has_bytes(d, n, 8, "WEBPVP", 6));
}
if (strfield2("image/x-portable-anymap", mime)) {
SNIFF_RULE(n >= 2 && d[0] == 'P' && d[1] >= '1' && d[1] <= '6');
}
if (strfield2("audio/x-aiff", mime)) {
SNIFF_RULE(
has_bytes(d, n, 0, "FORM", 4) &&
(has_bytes(d, n, 8, "AIFF", 4) || has_bytes(d, n, 8, "AIFC", 4)));
}
if (strfield2("audio/mpeg", mime)) {
/* MPEG audio frame sync (11 bits), valid layer and bitrate fields */
SNIFF_RULE(n >= 2 && d[0] == 0xff && (d[1] & 0xe0) == 0xe0 &&
(d[1] & 0x06) != 0);
}
if (strfield2("audio/aac", mime)) {
/* ADTS sync */
SNIFF_RULE(n >= 2 && d[0] == 0xff && (d[1] & 0xf6) == 0xf0);
}
if (strfield2("video/mp2t", mime)) {
SNIFF_RULE(n >= 1 && d[0] == 0x47 && (n <= 188 || d[188] == 0x47));
}
if (mime_in(xml_mimes, mime)) {
SNIFF_RULE(has_text_prefix(d, n, "<?xml"));
}
if (mime_in(svg_mimes, mime)) {
SNIFF_RULE(has_text_prefix(d, n, "<svg") ||
has_text_prefix(d, n, "<!DOCTYPE svg"));
}
if (mime_in(html_mimes, mime)) {
SNIFF_RULE(has_text_prefix(d, n, "<!DOCTYPE") ||
has_text_prefix(d, n, "<html") ||
has_text_prefix(d, n, "<head"));
}
if (mime_in(pem_mimes, mime)) {
SNIFF_RULE(has_text_prefix(d, n, "-----BEGIN"));
}
if (strfield2("audio/x-mpegurl", mime)) {
SNIFF_RULE(has_text_prefix(d, n, "#EXTM3U"));
}
if (strfield2("text/x-vcard", mime)) {
SNIFF_RULE(has_text_prefix(d, n, "BEGIN:VCARD"));
}
#undef SNIFF_RULE
return HTS_FALSE;
}
hts_boolean hts_sniff_mime_known(const char *mime) {
if (mime == NULL || *mime == '\0')
return HTS_FALSE;
return sniff_eval(SNIFF_QUERY_KNOWN, NULL, 0, mime);
}
hts_boolean hts_sniff_mime_consistent(const void *data, size_t size,
const char *mime) {
if (data == NULL || size == 0 || mime == NULL || *mime == '\0')
return HTS_FALSE;
return sniff_eval(SNIFF_QUERY_MATCH, (const unsigned char *) data, size,
mime);
}
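
The `has_bytes` helper above carries the comment "overflow-safe: untrusted n alone on one side". A small standalone sketch of that bounds check is given here for illustration; it returns a plain `int` rather than `hts_boolean` and adds an assertion, both of which are assumptions of this sketch and not part of the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero when the len bytes of magic appear at offset off inside d[0..n).
   The sum off + len is never computed, so a huge off or len cannot wrap
   around and slip past the check; n - off is only evaluated once n >= off
   is known to hold. */
static int has_bytes(const unsigned char *d, size_t n, size_t off,
                     const char *magic, size_t len) {
  assert(d != NULL && magic != NULL);
  return n >= off && len <= n - off && memcmp(d + off, magic, len) == 0;
}
```

A check written as `off + len <= n` looks equivalent but can wrap when `off` is close to `SIZE_MAX`, which is why the subtraction form is preferred when the sizes come from untrusted input.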

50
src/htssniff.h Normal file

@@ -0,0 +1,50 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 1998-2017 Xavier Roche and other contributors
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Important notes:
- We hereby ask people using this source NOT to use it in purpose of grabbing
emails addresses, or collecting any other private information on persons.
This would disgrace our work, and spoil the many hours we spent on it.
Please visit our Website: http://www.httrack.com
*/
/* ------------------------------------------------------------ */
/* File: MIME magic-byte consistency checks */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#ifndef HTSSNIFF_DEFH
#define HTSSNIFF_DEFH
#include <stddef.h>
#include "htsglobal.h"
/* Leading-body window read to arbitrate a wire/extension MIME conflict. */
#define HTS_SNIFF_LEN 512
/* Can a magic rule ever confirm this MIME? (whether sniffing is worth it) */
hts_boolean hts_sniff_mime_known(const char *mime);
/* TRUE when the leading body bytes are consistent with the claimed MIME;
FALSE on unknown MIME, unknown magic, or too-short data (fail-safe). */
hts_boolean hts_sniff_mime_consistent(const void *data, size_t size,
const char *mime);
#endif
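
The header exposes two queries: whether any magic rule exists for a MIME type, and whether the leading bytes confirm it. The sketch below shows, under simplifying assumptions, how a single table walk can answer both so the two answers cannot drift apart. The rule table is a tiny illustrative subset, the names `magic_rule`, `eval`, `mime_known` and `mime_consistent` are hypothetical, and the comparison uses case-sensitive `strcmp` where the real code uses `strfield2`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { QUERY_KNOWN, QUERY_MATCH } query_op;

typedef struct {
  const char *mime;  /* claimed MIME type */
  const char *magic; /* bytes expected at offset 0 */
  size_t len;        /* number of magic bytes */
} magic_rule;

/* Illustrative subset only; direction is mime -> magic (verify a claim). */
static const magic_rule rules[] = {{"image/png", "\x89PNG\r\n\x1a\n", 8},
                                   {"image/gif", "GIF87a", 6},
                                   {"image/gif", "GIF89a", 6},
                                   {"application/pdf", "%PDF-", 5},
                                   {NULL, NULL, 0}};

/* One walk for both queries: KNOWN returns on the first rule for the MIME,
   MATCH returns only when one of its rules matches the data. */
static int eval(query_op op, const unsigned char *d, size_t n,
                const char *mime) {
  size_t i;

  assert(mime != NULL);
  for (i = 0; rules[i].mime != NULL; i++) {
    if (strcmp(rules[i].mime, mime) != 0)
      continue;
    if (op == QUERY_KNOWN)
      return 1;
    if (rules[i].len <= n && memcmp(d, rules[i].magic, rules[i].len) == 0)
      return 1;
  }
  return 0;
}

static int mime_known(const char *mime) {
  return mime != NULL && *mime != '\0' && eval(QUERY_KNOWN, NULL, 0, mime);
}

/* Fail-safe: unknown MIME, missing data, or too-short data all yield 0. */
static int mime_consistent(const void *data, size_t size, const char *mime) {
  if (data == NULL || size == 0 || mime == NULL || *mime == '\0')
    return 0;
  return eval(QUERY_MATCH, (const unsigned char *) data, size, mime);
}
```

With this shape, a type without any rule (for example `text/css`) is reported as neither known nor consistent, which matches the `known=0 consistent=0` outcome the sniff self-test expects for such types.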

@@ -0,0 +1,17 @@
#!/bin/bash
#
# Cache generation reconcile policies (httrack -#test=reconcile <dir>):
# promote a stranded old generation, keep the larger one after an aborted
# run, and restore the old one when an update transferred nothing.
set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
out=$(httrack -#test=reconcile "$dir")
test "$out" = "cache-reconcile: OK" || {
echo "expected 'cache-reconcile: OK', got: $out" >&2
exit 1
}

@@ -7,8 +7,16 @@ set -euo pipefail
# name() asserts on the basename, full() on the whole path; prior= registers an
# already-crawled link whose sav is rooted under the -O path (/dev/null here).
# resolve httrack before cd: make check puts a RELATIVE ../src on PATH
httrack_bin=$(cd "$(dirname "$(command -v httrack)")" && pwd)/httrack
# scratch dir: body= and cached= write temp files (st-savename-body.tmp, hts-cache/)
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT
cd "$scratch"
run() {
httrack -O /dev/null -#test=savename "$@" | sed -n 's/^savename: //p'
"$httrack_bin" -O /dev/null -#test=savename "$@" | sed -n 's/^savename: //p'
}
name() {
@@ -73,6 +81,15 @@ name '/x.pdf' 'text/html' 'x.html' status=-1
name '/x.html' 'text/html' 'x.html' status=-1
name '/x.php' 'application/pdf' 'x.pdf' status=-1 cdispo=report.pdf
# Contested type (wire disagrees with a specific ext): magic bytes proving the
# extension right keep it, anything else trusts the wire as before.
name '/photo.jpg' 'image/png' 'photo.jpg' body=hex:FFD8FFE000104A46
name '/photo.jpg' 'image/png' 'photo.png' body=hex:89504E470D0A1A0A
name '/photo.jpg' 'image/png' 'photo.png'
name '/doc.pdf' 'text/html' 'doc.pdf' body=hex:255044462D312E34
name '/doc.pdf' 'text/html' 'doc.html' 'body=<html><body>soft 404</body></html>'
name '/style.css' 'image/png' 'style.png' 'body=body { }' # no rule for css: wire wins
# A redirect answer resolves nothing: delayed placeholder name.
name '/x.php' 'text/html' 'x.0.delayed' statuscode=301
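
The contested-type cases above state the policy in words: when the wire type disagrees with the type implied by the URL extension, magic bytes that prove the extension right keep it, and anything else trusts the wire. The following is a hypothetical sketch of that decision only. The naming code in htsname.c is not shown in this diff, so `pick_mime`, `body_confirms` and the three-entry rule table are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  const char *mime;
  const char *magic;
  size_t len;
} magic_rule;

/* Illustrative subset of magic rules (assumption, not the real table). */
static const magic_rule rules[] = {{"image/jpeg", "\xff\xd8\xff", 3},
                                   {"image/png", "\x89PNG\r\n\x1a\n", 8},
                                   {"application/pdf", "%PDF-", 5},
                                   {NULL, NULL, 0}};

/* Nonzero when the leading body bytes match a magic rule for mime. */
static int body_confirms(const char *mime, const unsigned char *d, size_t n) {
  size_t i;

  if (d == NULL || n == 0)
    return 0;
  for (i = 0; rules[i].mime != NULL; i++)
    if (strcmp(rules[i].mime, mime) == 0 && rules[i].len <= n &&
        memcmp(d, rules[i].magic, rules[i].len) == 0)
      return 1;
  return 0;
}

/* Type that drives the save name: the extension's type only when the body
   proves it right, otherwise the declared (wire) type. */
static const char *pick_mime(const char *wire, const char *ext,
                             const unsigned char *d, size_t n) {
  assert(wire != NULL && ext != NULL);
  if (strcmp(wire, ext) == 0)
    return wire;
  return body_confirms(ext, d, n) ? ext : wire;
}
```

Under these assumptions, `/photo.jpg` served as `image/png` with JPEG bytes resolves to `image/jpeg`, while the same URL with PNG bytes, or with no body at all, resolves to `image/png`.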

@@ -0,0 +1,87 @@
#!/bin/bash
#
set -euo pipefail
# MIME magic consistency (-#test=sniff <content-type> <hex:..|text>), the
# tie-break behind htsname's wire-vs-extension naming.
chk() {
local mime="$1" body="$2" want="$3"
out="$(httrack -#test=sniff "$mime" "$body" | sed -n 's/^sniff: //p')"
test "$out" == "$want" || {
echo "FAIL: '$mime' '$body' -> '$out' (want '$want')"
exit 1
}
}
yes='known=1 consistent=1'
no='known=1 consistent=0'
unk='known=0 consistent=0'
# images
chk image/jpeg hex:FFD8FFE000104A46 "$yes"
chk image/png hex:89504E470D0A1A0A "$yes"
chk image/png hex:FFD8FFE000104A46 "$no" # jpeg bytes are not a png
chk image/gif 'GIF89a' "$yes"
chk image/bmp 'BMxxxx' "$yes"
chk image/tiff hex:49492A00 "$yes"
chk image/tiff hex:4D4D002A "$yes" # both endians
chk image/x-icon hex:00000100 "$yes"
chk image/x-icon hex:00000200 "$yes" # Windows cursor, spec maps to x-icon
chk image/webp 'RIFFxxxxWEBPVP' "$yes"
chk image/webp 'RIFFxxxxWAVE' "$no" # riff subtype discriminates
chk image/avif hex:0000001C6674797061766966 "$yes"
chk image/avif hex:0000001C6674797068656963 "$no" # heic brand is not avif
chk image/heic hex:0000001C6674797068656963 "$yes"
chk image/svg+xml '<svg xmlns="x">' "$yes"
chk image/svg+xml $'\xef\xbb\xbf <?xml version="1.0"?>' "$yes" # BOM+ws skip
# audio / video
chk audio/mpeg 'ID3xxx' "$yes"
chk audio/mpeg hex:FFFB9000 "$yes" # bare frame sync
chk audio/aac hex:FFF15080 "$yes"
chk audio/flac 'fLaC' "$yes"
chk audio/ogg hex:4F67675300 "$yes"
chk audio/x-wav 'RIFFxxxxWAVE' "$yes"
chk video/x-msvideo 'RIFFxxxxAVI ' "$yes"
chk video/x-msvideo 'RIFFxxxxWAVE' "$no"
chk video/mp4 hex:000000186674797069736F6D "$yes"
chk video/webm hex:1A45DFA3 "$yes"
chk video/mpeg hex:000001BA "$yes"
chk video/x-ms-wmv hex:3026B2758E66CF11 "$yes"
# archives; zip magic covers the office-container families
chk application/zip hex:504B0304 "$yes"
chk application/vnd.openxmlformats-officedocument.wordprocessingml.document hex:504B0304 "$yes"
chk application/vnd.oasis.opendocument.text hex:504B0304 "$yes"
chk application/msword hex:D0CF11E0A1B11AE1 "$yes"
chk application/msword hex:504B0304 "$no" # legacy .doc is OLE, not zip
chk application/x-gzip hex:1F8B08 "$yes"
chk application/x-bzip2 'BZh9' "$yes"
chk application/x-7z-compressed hex:377ABCAF271C "$yes"
chk application/x-rar-compressed hex:526172211A07 "$yes"
chk application/zstd hex:28B52FFD "$yes"
chk application/x-tar "hex:$(printf '00%.0s' {1..257})7573746172" "$yes" # ustar at 257
chk application/x-tar hex:7573746172 "$no"
# documents, fonts, misc
chk application/pdf '%PDF-1.7' "$yes"
chk application/pdf '<html><body>soft 404</body></html>' "$no"
chk application/postscript '%!PS-Adobe' "$yes"
chk application/rtf '{\rtf1' "$yes"
chk font/woff2 'wOF2' "$yes"
chk font/otf 'OTTO' "$yes"
chk font/ttf hex:0001000000 "$yes"
chk application/x-shockwave-flash 'CWSx' "$yes"
chk application/x-java-vm hex:CAFEBABE "$yes"
chk application/wasm hex:0061736D "$yes"
chk text/html $' \r\n<!DOCTYPE html><html>' "$yes"
chk text/html '<html lang="en">' "$yes"
chk text/html 'plain text, no markup' "$no"
chk text/xml '<?xml version="1.0"?>' "$yes"
# no magic rule at all: never confirmed, never blocks the wire type
chk text/css 'body { }' "$unk"
chk text/plain 'hello' "$unk"
chk application/x-javascript 'var x;' "$unk"
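
Two of the cases above (`BOM+ws skip` for SVG and the whitespace-prefixed `<!DOCTYPE html>`) rely on text-prefix matching that ignores a UTF-8 BOM, leading whitespace, and ASCII case. A standalone sketch of that matching is shown here; it follows the shape of `has_text_prefix` in htssniff.c but returns `int` and adds an assertion, which are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static unsigned char ascii_lower(unsigned char c) {
  return c >= 'A' && c <= 'Z' ? (unsigned char) (c + 32) : c;
}

/* Case-insensitive prefix test after an optional UTF-8 BOM and any run of
   space, tab, CR or LF. */
static int has_text_prefix(const unsigned char *d, size_t n,
                           const char *prefix) {
  size_t len, i, k;

  assert(d != NULL && prefix != NULL);
  len = strlen(prefix);
  /* skip the BOM (EF BB BF) when present */
  i = n >= 3 && memcmp(d, "\xef\xbb\xbf", 3) == 0 ? 3 : 0;
  while (i < n && (d[i] == ' ' || d[i] == '\t' || d[i] == '\r' || d[i] == '\n'))
    i++;
  if (len > n - i) /* i <= n here, so n - i cannot wrap */
    return 0;
  for (k = 0; k < len; k++)
    if (ascii_lower(d[i + k]) != ascii_lower((unsigned char) prefix[k]))
      return 0;
  return 1;
}
```

Only ASCII letters are folded, so the comparison does not depend on the process locale.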

@@ -0,0 +1,19 @@
#!/bin/bash
#
# Read-side cache corruption (httrack -#test=cache-corrupt <dir>): zip byte
# surgery (bad/oversized X-Size, blanked X-In-Cache, smashed header, garbled
# deflate) must each be rejected per-entry, never crash, never taint the sibling.
set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
# the smashed-header case logs expected "Corrupted cache entry" warnings on
# stdout; the verdict is the last line
out=$(httrack -#test=cache-corrupt "$dir" 2>/dev/null | tail -n1)
test "$out" = "cache-corrupt: OK" || {
echo "expected 'cache-corrupt: OK', got: $out" >&2
exit 1
}

@@ -4,10 +4,9 @@
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
# tool flags despite the #!/bin/bash above.
# Cache write-failure handling (httrack -#test=cache-writefail <dir>). #174/#219.
# A failing new.zip write (disk full) used to crash the process via assertf; it
# must instead stop the mirror with a fatal error (exit_xh=-1), no crash. The
# self-test asserts that; reverting the fix makes -#test=cache-writefail abort (SIGABRT) and fail.
# Cache write-failure policy (-#test=cache-writefail <dir>). #174/#219: disk
# full or a failure streak aborts cleanly; an isolated failure or an oversized
# entry is only dropped.
set -eu
@@ -22,3 +21,9 @@ printf '%s\n' "$out" | grep -qx "cache-writefail: OK" || {
echo "expected 'cache-writefail: OK', got: $out" >&2
exit 1
}
# A skipped entry must be warned about with its URL.
printf '%s\n' "$out" | grep -q "entry not cached: example.com/" || {
echo "expected a URL-bearing skip warning" >&2
exit 1
}

@@ -0,0 +1,33 @@
#!/bin/bash
#
set -euo pipefail
# Update-run naming from a real cache entry (-#test=savename cached=<ctype>|<save>).
# Named 01_zlib-*: the cache writer needs zlib, which the MSan job can't run.
# resolve httrack before cd: make check puts a RELATIVE ../src on PATH
httrack_bin=$(cd "$(dirname "$(command -v httrack)")" && pwd)/httrack
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT
cd "$scratch"
name() {
local fil="$1" ctype="$2" want="$3"
shift 3
out="$("$httrack_bin" -O /dev/null -#test=savename "$fil" "$ctype" "$@" | sed -n 's/^savename: //p')"
test "${out##*/}" == "$want" || {
echo "FAIL: '$fil' '$ctype' $* -> '$out' (want '$want')"
exit 1
}
}
# No live bytes: the recorded save name (X-Save) reproduces the previous
# verdict; cached body bytes (PNG magic) are ignored; css has no magic rule.
name '/photo.jpg' 'image/png' 'photo.jpg' 'cached=image/png|www.example.com/photo.jpg'
name '/photo.jpg' 'image/png' 'photo.png' 'cached=image/png|www.example.com/photo.png'
name '/photo.jpg' 'image/jpeg' 'photo.jpg' 'cached=image/jpeg|www.example.com/photo.png'
name '/style.css' 'image/png' 'style.css' 'cached=image/png|www.example.com/style.css'
# agreement keeps the URL ext verbatim (.jpeg), never canonicalized to .jpg
name '/photo.jpeg' 'image/jpeg' 'photo.jpeg' 'cached=image/jpeg|www.example.com/photo.jpeg'

@@ -1,11 +1,10 @@
#!/bin/bash
#
# Content-Type vs URL-extension naming (issue #267 family) under the default
# delayed type check (-%N2). Policy: a MISSING Content-Type must not clobber a
# URL extension that maps to a specific non-HTML type (.png/.pdf stay as-is);
# an explicitly DECLARED type is trusted, so a binary-looking URL that really
# serves HTML (text/html on .pdf/.jpg) is named .html. The "wrong" names are
# asserted absent so a regression in either direction fails here.
# Content-Type vs URL-extension naming (#267 family, default -%N2). A MISSING
# type keeps a specific non-HTML ext; a DECLARED disagreeing type is trusted
# unless magic bytes prove the ext right (lie/wrongtype/packed keep theirs),
# so a real HTML body (report.pdf) still becomes .html. Wrong names are
# asserted absent so a regression in either direction fails.
: "${top_srcdir:=..}"
@@ -14,7 +13,11 @@ bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/notype.pdf' --not-found 'types/notype.html' \
--found 'types/photo.png' \
--found 'types/doc.pdf' \
--found 'types/lie.html' --not-found 'types/lie.png' \
--found 'types/lie.png' --not-found 'types/lie.html' \
--found 'types/wrongtype.jpg' --not-found 'types/wrongtype.png' \
--found 'types/bigtype.jpg' --not-found 'types/bigtype.png' \
--found 'types/mutant.jpg' --not-found 'types/mutant.png' \
--found 'types/packed.jpg' --not-found 'types/packed.png' \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/page.htm' --not-found 'types/page.html' \
--found 'types/script.js' \


@@ -1,15 +1,18 @@
#!/bin/bash
#
# A second (update) pass must keep the names the first crawl chose. The stored
# Content-Type rides the cache, so the update reads back the same value -- the
# unknown/unknown sentinel for a typeless response, the declared type otherwise
# -- and names consistently: a declared-text/html .pdf stays .html and a
# typeless .png stays .png across the update rather than reverting.
# An update pass keeps the names the first crawl chose: type and save name
# ride the cache, so a declared-text/html .pdf stays .html, a typeless .png
# stays .png, and a sniff-kept ext is reproduced from X-Save even when the
# refetched content changed (mutant.jpg serves PNG bytes on the rerun).
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/notype.png' --not-found 'types/notype.html' \
--found 'types/lie.html' \
--found 'types/lie.png' --not-found 'types/lie.html' \
--found 'types/wrongtype.jpg' --not-found 'types/wrongtype.png' \
--found 'types/bigtype.jpg' --not-found 'types/bigtype.png' \
--found 'types/packed.jpg' --not-found 'types/packed.png' \
--found 'types/mutant.jpg' --not-found 'types/mutant.png' \
httrack 'BASEURL/types/index.html'
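
Reviewer note: the naming policy described in the two test headers above can be read as a small decision function. The sketch below is only an illustration of that policy for a first crawl, not httrack's implementation. `pick_ext`, `TYPE_EXTS` and `MAGIC` are hypothetical names, the tables cover just the types used by the fixture, and the body is assumed to be already decoded (as the packed.jpg case requires). It also simplifies the missing-type rule by keeping any URL extension, whereas the tests only promise that for extensions that map to a specific non-HTML type.

```python
# Hypothetical tables: only the types exercised by the /types/ fixture.
TYPE_EXTS = {
    "text/html": ("html", "htm"),
    "image/png": ("png",),
    "image/jpeg": ("jpg", "jpeg"),
    "application/pdf": ("pdf",),
}
MAGIC = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpg": b"\xff\xd8\xff",
    "jpeg": b"\xff\xd8\xff",
    "pdf": b"%PDF-",
}


def pick_ext(url_ext, declared, body):
    """Pick the saved extension from URL ext, declared type and decoded body."""
    exts = TYPE_EXTS.get(declared or "")
    if exts is None:
        # Missing (or unknown) Content-Type: keep the URL extension.
        return url_ext
    if url_ext in exts:
        # Agreement: keep the URL extension verbatim (.jpeg stays .jpeg).
        return url_ext
    magic = MAGIC.get(url_ext)
    if magic is not None and body.startswith(magic):
        # Declared type disagrees, but the bytes prove the URL extension right.
        return url_ext
    # Otherwise the declared type is trusted.
    return exts[0]


print(pick_ext("png", "text/html", MAGIC["png"] + b"\x00" * 64))  # lie.png
print(pick_ext("pdf", "text/html", b"<html><body>real page</body></html>"))  # report.pdf
```

The update pass is out of scope here: per the test header, it reproduces the recorded save name from the cache instead of deciding again.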


@@ -1,17 +1,19 @@
#!/bin/bash
# Issues #32/#41: a Content-Length that disagrees with the body warns "bogus
# state (broken size)" and skips the cache; -%B (tolerant) accepts it.
# Issues #32/#41: a Content-Length that disagrees with the body warns
# "incomplete transfer" and skips the cache; -%B (tolerant) accepts it.
set -euo pipefail
: "${top_srcdir:=..}"
# Default: warn, but the file is still written.
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'size/oversize.bin' \
--log-found 'bogus state \(broken size' \
--log-found 'incomplete transfer \(expected' \
httrack 'BASEURL/size/index.html'
# -%B (tolerant): no warning, file written.
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'size/oversize.bin' \
--log-not-found 'bogus state' \
--log-not-found 'incomplete transfer|not cached' \
httrack 'BASEURL/size/index.html' '-%B'


@@ -0,0 +1,20 @@
#!/bin/bash
#
# Degenerate delayed-type paths (#5/#107 family): redirects that never resolve
# a name must drop cleanly -- no .delayed leftovers (audited by local-crawl.sh),
# no "not cached" warnings, resolvable links still land correctly.
set -euo pipefail
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --rerun --errors 0 \
--found 'delayed/real.pdf' \
--file-matches 'delayed/real.pdf' '%PDF' \
--found 'delayed/notype.bin.html' \
--found 'delayed/empty.html' \
--not-found 'delayed/noloc.html' \
--not-found 'delayed/selfloop.html' \
--not-found 'delayed/chain9.pdf' \
--log-not-found 'not cached' \
httrack 'BASEURL/delayed/index.html'


@@ -0,0 +1,21 @@
#!/bin/bash
#
# -E time limit (#481): server pages trickle for minutes; the engine must stop
# on its own at -E plus grace, aborting the in-flight transfers.
set -euo pipefail
: "${top_srcdir:=..}"
# cancelled crawls can orphan .delayed placeholders (#483): skip that audit
start=$(date +%s)
bash "$top_srcdir/tests/local-crawl.sh" \
--skip-delayed-audit \
--log-found 'More than 2 seconds passed' \
httrack 'BASEURL/trickle/index.html' -E2 -c4
wall=$(($(date +%s) - start))
# hard stop is due at -E2 + 5s grace; near TRICKLE_SECONDS means it never fired
if [ "$wall" -ge 30 ]; then
echo "crawl took ${wall}s, -E hard stop did not engage" >&2
exit 1
fi


@@ -0,0 +1,15 @@
#!/bin/bash
#
# -M byte cap (#77): the crawl must stop with the "giving up" error and keep
# the mirror well under the 8 x 640KB the fixture totals uncapped.
set -euo pipefail
: "${top_srcdir:=..}"
# cap = -M + the 4 in-flight files the smooth stop lets finish + one of margin
bash "$top_srcdir/tests/local-crawl.sh" \
--log-found 'More than 400000 bytes have been transferred.. giving up' \
--found bigfiles/p0.bin \
--max-mirror-bytes 3700000 \
httrack 'BASEURL/bigfiles/index.html' -M400000 -c4
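
Reviewer note: the 3700000 bound follows from the comment's own formula (the -M value, plus the 4 in-flight files, plus one file of margin). A quick arithmetic check, using only numbers that appear in this test and in the fixture's BIGFILE_BYTES; the variable names are just for illustration:

```python
CAP = 400_000            # -M400000
FILE_BYTES = 640 * 1024  # BIGFILE_BYTES in the fixture server
FILES = 8                # p0.bin .. p7.bin
IN_FLIGHT = 4            # -c4: transfers the smooth stop lets finish
MARGIN_FILES = 1         # "one of margin"

budget = CAP + (IN_FLIGHT + MARGIN_FILES) * FILE_BYTES
uncapped = FILES * FILE_BYTES
print(budget, uncapped)
```

The budget stays below the asserted 3700000 and well below the uncapped total, so a crawl that ignores -M cannot pass the size audit.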


@@ -0,0 +1,55 @@
#!/bin/bash
#
# Diverse seeded /big/ crawl: 12 pattern families, decoy absence, update pass
# must 304-revalidate. 360 = 1 index + 96 pages + 192 imgs + 5 shared + 60
# family + 6 singles; the 4 planted errors write -o1 pages, not counted.
set -euo pipefail
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --rerun \
--errors 4 --files 360 \
--found 'big/p/95.html' \
--found 'big/a/d1/d2/d3/d4/d5/d6/d7/d8/deep.png' \
--found 'big/a/f2-2x.png' \
--found 'big/a/subs.vtt' \
--found 'big/a/font.woff2' \
--found 'big/a/js-data.bin' \
--found 'big/d/01.pdf' \
--found 'big/d/named.pdf' \
--found 'big/a/doc.pdf' \
--found "big/f9/caf$(printf '\xc3\xa9').html" \
--found 'big/f7/fa.html' \
--found 'big/a/ref.png' \
--found 'big/f6/sub/leaf.html' \
--found 'big/f1/dir/index.html' \
--found 'big/f10/empty.html' \
--found 'big/indexd41d.html' \
--found 'big/a/i0a.png' \
--not-found 'big/x/og' \
--not-found 'big/x/tw' \
--not-found 'big/x/jsonld.png' \
--not-found 'big/x/never-scanned.png' \
--not-found 'big/x/atom-only.html' \
--not-found 'big/x/sitemap-only.html' \
--not-found 'big/x/form-target.html' \
--not-found 'big/x/formact' \
--not-found 'big/x/ping' \
--not-found 'big/x/aj.jar' \
--not-found 'big/x/bj.jar' \
--not-found 'big/x/is1.png' \
--not-found 'big/x/concat.html' \
--file-matches 'big/p/2.html' 'srcset="\.\./a/f2-1x\.png 1x, \.\./a/f2-2x\.png 2x"' \
--file-matches 'big/a/blk2.css' 'url\(blk2-bg\.png\)' \
--file-matches 'big/p/5.html' "document\\.write\\('<a href=\"\\.\\./f5/dw\\.html\"" \
--file-not-matches 'big/p/1.html' 'href="/big/' \
--log-not-found 'not cached|[Pp]anic|assert' \
--log-found '\(404\) at link [^ ]*/big/e/404\.html' \
--log-found '\(410\) at link [^ ]*/big/e/410\.html' \
--log-found '\(500\) at link [^ ]*/big/e/500\.html' \
--log-found 'decompressing.*big/e/gztrunc\.html' \
--log-found ', no files updated' \
--max-mirror-bytes 700000 \
--min-mirror-bytes 500000 \
httrack 'BASEURL/big/index.html' --retries=0 -c8 -%c100 -A100000000
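
Reviewer note: the `--files 360` pin can be re-derived from the breakdown in the header comment. The dictionary keys below are paraphrased labels, not identifiers from the fixture:

```python
breakdown = {
    "index": 1,
    "pages": 96,          # BIG_PAGES
    "page images": 192,   # two image URLs per page
    "shared assets": 5,
    "family files": 60,
    "singles": 6,
}
total = sum(breakdown.values())
print(total)
```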


@@ -0,0 +1,12 @@
#!/bin/bash
#
# An update run against a dead server must not destroy the cache: the no-data
# rollback restores the previous hts-cache generation (zip caches lost it).
set -eu
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun-dead \
--found 'simple/basic.html' \
httrack 'BASEURL/simple/basic.html'


@@ -0,0 +1,14 @@
#!/bin/bash
#
# An all-304 update of a tiny site (headers under the 32K rollback threshold)
# is a healthy run: it must not trip the no-data rollback as a fake outage.
set -eu
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--log-found 'no files updated' \
--log-not-found 'No data seems to have been transferred' \
--found 'mini304/index.html' --found 'mini304/page.html' \
httrack 'BASEURL/mini304/index.html'


@@ -48,12 +48,14 @@ TESTS = \
01_engine-parse.test \
01_engine-pause.test \
01_engine-rcfile.test \
01_engine-reconcile.test \
01_engine-redirect.test \
01_engine-relative.test \
01_engine-robots.test \
01_engine-savename.test \
01_engine-selftest-dispatch.test \
01_engine-simplify.test \
01_engine-sniff.test \
01_engine-status.test \
01_engine-stripquery.test \
01_engine-strsafe.test \
@@ -62,8 +64,10 @@ TESTS = \
01_engine-useragent.test \
01_zlib-acceptencoding.test \
01_zlib-cache.test \
01_zlib-cache-corrupt.test \
01_zlib-cache-golden.test \
01_zlib-cache-writefail.test \
01_zlib-savename-cached.test \
02_manpage-regen.test \
02_update-cache.test \
10_crawl-simple.test \
@@ -93,6 +97,12 @@ TESTS = \
29_local-redirect-fragment.test \
30_local-fragment-link.test \
31_local-javaclass.test \
32_local-cdispo.test
32_local-cdispo.test \
33_local-delayed.test \
34_local-maxtime.test \
35_local-maxsize.test \
36_local-bigcrawl.test \
37_local-cache-outage.test \
38_local-update-304.test
CLEANFILES = check-network_sh.cache


@@ -16,13 +16,17 @@
# --errors N --files N --found PATH ... --directory PATH ... \
# --log-found REGEX ... --log-not-found REGEX ... \
# --file-matches PATH REGEX ... --file-not-matches PATH REGEX ... \
# --max-mirror-bytes N \
# httrack BASEURL/some/path [httrack-args...]
# --log-found/--log-not-found grep (ERE) the crawl's hts-log.txt.
# --max/--min-mirror-bytes bound the mirrored content bytes (host root).
# --file-matches/--file-not-matches grep (ERE) a mirrored file (PATH under the
# host root), to assert rewritten link/content survived the crawl.
# --cookie writes a Netscape cookies.txt (scoped to the discovered host:port,
# which the ephemeral port forces into the cookie domain) and passes it to
# httrack via --cookies-file, to exercise preloaded cookies.
# --rerun-dead re-runs with the server stopped: the no-data rollback must
# restore the previous hts-cache generation byte-identical.
set -u
@@ -35,6 +39,7 @@ key="${testdir}/server.key"
tls=
verbose=
rerun=
rerun_dead=
tmpdir=
serverpid=
crawlpid=
@@ -92,6 +97,7 @@ tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create
# --- parse leading control flags --------------------------------------------
declare -a audit=()
declare -a cookies=()
skip_delayed_audit=""
scheme=http
pos=0
args=("$@")
@@ -99,7 +105,8 @@ nargs=$#
while test "$pos" -lt "$nargs"; do
case "${args[$pos]}" in
--debug) verbose=1 ;;
--rerun) rerun=1 ;; # run httrack a second time (update pass) before auditing
--rerun) rerun=1 ;; # run httrack a second time (update pass) before auditing
--rerun-dead) rerun_dead=1 ;; # re-run with the server stopped (cache rollback)
--no-purge)
nopurge=1
audit+=("--no-purge")
@@ -116,11 +123,14 @@ while test "$pos" -lt "$nargs"; do
pos=$((pos + 1))
cookies+=("${args[$pos]}")
;;
--skip-delayed-audit)
skip_delayed_audit=1
;;
--errors | --files)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
--found | --not-found | --directory | --log-found | --log-not-found)
--found | --not-found | --directory | --log-found | --log-not-found | --max-mirror-bytes | --min-mirror-bytes)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
@@ -235,6 +245,43 @@ if test -n "$rerun"; then
fi
fi
# --- optional dead pass: server stopped, the cache must survive the rollback --
if test -n "$rerun_dead"; then
zip="${out}/hts-cache/new.zip"
test -s "$zip" || die "no cache was written by the first pass"
cp "$zip" "${tmpdir}/cache-before.zip"
cp "${out}/hts-log.txt" "${tmpdir}/log-before.txt"
kill "$serverpid" 2>/dev/null
wait "$serverpid" 2>/dev/null
serverpid=
info "re-running httrack against the stopped server"
httrack -O "$out" --user-agent="httrack $ver local ($(uname -omrs))" \
"${moreargs[@]}" "${hts[@]}" >"${log}.dead" 2>&1 &
crawlpid=$!
wait "$crawlpid" || true
crawlpid=
result "OK (dead pass ran)"
# The dead pass must have gone through the no-data rollback, not bailed out
# before the mirror loop (which would leave the cache trivially untouched).
info "checking the dead pass hit the rollback"
if grep -aq "No data seems to have been transferred" "${out}/hts-log.txt"; then
result "OK"
else
result "rollback notice not found in hts-log.txt"
exit 1
fi
info "checking the previous cache generation was restored"
if cmp -s "$zip" "${tmpdir}/cache-before.zip" &&
test ! -e "${out}/hts-cache/old.zip"; then
result "OK"
else
result "new.zip differs from the pre-outage cache (or old.zip left behind)"
exit 1
fi
# Audits below describe the healthy crawl, not the dead pass.
cp "${tmpdir}/log-before.txt" "${out}/hts-log.txt"
fi
# --- discover the single host root (127.0.0.1_<port> or 127.0.0.1) -----------
hostroot=
for cand in "${out}/127.0.0.1_${port}" "${out}/127.0.0.1"; do
@@ -246,6 +293,17 @@ done
test -n "$hostroot" || die "could not find host root under $out"
debug "host root: $hostroot"
# A completed crawl must leave no .delayed temporaries (issue #107).
# --skip-delayed-audit: a cancelled crawl can orphan placeholders (issue #483)
if test -z "$skip_delayed_audit"; then
info "checking for leftover .delayed files"
leftovers=$(find "$out" -name '*.delayed' 2>/dev/null | head -5)
if test -z "$leftovers"; then result "OK"; else
result "leftover: $leftovers"
exit 1
fi
fi
# --- audit -------------------------------------------------------------------
i=0
while test "$i" -lt "${#audit[@]}"; do
@@ -301,6 +359,24 @@ while test "$i" -lt "${#audit[@]}"; do
exit 1
else result "OK"; fi
;;
--max-mirror-bytes)
i=$((i + 1))
sz=$(find "$hostroot" -type f -exec cat {} + | wc -c | tr -d '[:space:]')
info "checking mirror size ${sz} <= ${audit[$i]} bytes"
if test "$sz" -le "${audit[$i]}"; then result "OK"; else
result "mirror too big"
exit 1
fi
;;
--min-mirror-bytes)
i=$((i + 1))
sz=$(find "$hostroot" -type f -exec cat {} + | wc -c | tr -d '[:space:]')
info "checking mirror size ${sz} >= ${audit[$i]} bytes"
if test "$sz" -ge "${audit[$i]}"; then result "OK"; else
result "mirror too small"
exit 1
fi
;;
--file-matches)
path="${audit[$((i + 1))]}"
i=$((i + 2))


@@ -14,6 +14,8 @@ stdlib only (http.server + ssl) -- no new build or runtime dependency.
"""
import argparse
import gzip
import hashlib
import os
import time
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
@@ -41,6 +43,416 @@ PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"""
# --- /big/ seeded pseudo-site (36_local-bigcrawl) ---------------------------
# Deterministic ~360-file tree; bodies derive from sha256(BIG_SEED, name) so
# every run serves identical content and the test pins exact counts.
BIG_SEED = "bigcrawl-lite-1"
BIG_PAGES = 96
BIG_FANOUT = 4
# Fixed validator: a matching If-Modified-Since gets 304, so the update pass
# revalidates instead of re-downloading.
BIG_LASTMOD = "Mon, 01 Jan 2024 00:00:00 GMT"
BIG_CTYPES = {
"html": "text/html",
"css": "text/css",
"js": "application/x-javascript",
"png": "image/png",
"gif": "image/gif",
"jpg": "image/jpeg",
"webp": "image/webp",
"pdf": "application/pdf",
"woff2": "font/woff2",
"mp4": "video/mp4",
"webm": "video/webm",
"mp3": "audio/mpeg",
"vtt": "text/vtt",
"xml": "text/xml",
"svg": "image/svg+xml",
"jar": "application/java-archive",
"bin": "application/octet-stream",
}
# Honest magic bytes per claimed type so the #478 sniff never contests.
BIG_MAGIC = {
"png": b"\x89PNG\r\n\x1a\n",
"gif": b"GIF89a",
"jpg": b"\xff\xd8\xff\xe0",
"webp": b"RIFF\x10\x27\x00\x00WEBPVP8 ",
"pdf": b"%PDF-1.4\n",
"woff2": b"wOF2",
"mp4": b"\x00\x00\x00\x18ftypmp42",
"webm": b"\x1a\x45\xdf\xa3",
"mp3": b"ID3\x04\x00\x00\x00\x00\x00\x00",
"jar": b"PK\x03\x04",
}
def big_blob(name, size):
    out = b""
    n = 0
    while len(out) < size:
        out += hashlib.sha256(f"{BIG_SEED}/{name}/{n}".encode()).digest()
        n += 1
    return out[:size]


def big_asset(name):
    ext = name.rsplit(".", 1)[-1]
    size = 200 + int(hashlib.sha256(name.encode()).hexdigest(), 16) % 3800
    raw = big_blob(name, size)
    if ext in ("css", "js", "txt"):
        return b"/* " + raw.hex().encode() + b" */"
    return BIG_MAGIC.get(ext, b"") + raw


def big_html(title, inner):
    page = (
        "<!DOCTYPE html><html><head><title>%s</title></head><body>\n%s\n</body></html>"
        % (
            title,
            inner,
        )
    )
    return page.encode()


def _hexfill(name):
    return big_blob(name, 160).hex()
HOME = '<a href="/big/index.html">home</a>'
BIG_TEXT_ASSETS = {
"site.css": (
"body { background: url(bg.png); } /* %s */" % _hexfill("site.css"),
"text/css",
),
"print.css": ("p { margin: 0; } /* %s */" % _hexfill("print.css"), "text/css"),
"blk.css": (
'@import "blk2.css";\n'
'@font-face { font-family: big; src: local("Nope Sans"), '
'url(font.woff2) format("woff2"); }\n'
"/* %s */" % _hexfill("blk.css"),
"text/css",
),
# Absolute url() must come back relative after the rewrite (test greps it);
# the \/ escapes collapse to an already-linked URL if taken literally.
"blk2.css": (
"body { background: url(/big/a/blk2-bg.png); }\n"
"i { background: url(/big\\/a\\/bg.png); }\n"
"/* %s */" % _hexfill("blk2.css"),
"text/css",
),
# .open() grabs its first arg only (a method there is rejected, #218), so
# the window.open single-URL form is the token-detected shape.
"app.js": (
'var im = new Image(); im.src = "/big/a/js-img.png";\n'
'function pop() { window.open("/big/a/js-data.bin"); }\n'
"// %s\n" % _hexfill("app.js"),
"application/x-javascript",
),
"heavy.js": (
'var h = new Image(); h.src = "/big/a/js1.png";\n'
'function nav() { location.href = "/big/p/1.html"; }\n'
'function pop() { window.open("/big/a/js2.bin"); }\n'
"// %s\n" % _hexfill("heavy.js"),
"application/x-javascript",
),
# text/javascript is fetched but never scanned: the URL inside must stay
# out of the mirror.
"decoy.js": (
'var d = new Image(); d.src = "/big/x/never-scanned.png";\n',
"text/javascript",
),
"subs.vtt": ("WEBVTT\n\n00:00.000 --> 00:01.000\nbig\n", "text/vtt"),
"logo.svg": (
'<svg xmlns="http://www.w3.org/2000/svg" width="4" height="4">'
'<image href="ref.png" width="4" height="4"/></svg>',
"image/svg+xml",
),
}
def _fam_feeds(port):
return (
'<link rel="alternate" type="application/rss+xml" href="/big/f12/rss.xml">'
'<a href="/big/f12/atom.xml">atom</a>'
'<a href="/big/f12/sitemap.xml">sitemap</a>'
)
def _fam_plain(port):
return (
'<a href="../f1/one.html">one</a>'
'<a href="./two.html">two</a>'
'<a href="../../big/f1/tri.html">tri</a>'
'<a href="/big/f1/abs.html">abs</a>'
'<a href="/big/f1/list.html">list</a>'
'<a href="/big/f1/list.html?page=2">p2</a>'
'<a href="/big/f1/list.html?page=3&amp;sort=asc">p3</a>'
'<a href="/big/f1/dir">dir</a>'
'<a href="">self</a><a href="#">frag</a>'
'<a href="mailto:big@example.com">mail</a>'
'<a href="tel:+15551234">tel</a>'
'<a href="data:text/plain;base64,aGk=">data</a>'
)
def _fam_srcset(port):
return (
'<img src="/big/a/f2-base.png">'
'<img srcset="/big/a/f2-1x.png 1x, /big/a/f2-2x.png 2x"'
' src="/big/a/f2-base.png">'
'<img data-srcset="/big/a/f2-1x.png 1x, /big/a/f2-2x.png 2x"'
' src="/big/a/f2-base.png" loading="lazy">'
'<picture><source type="image/webp" srcset="/big/a/f2-alt.webp">'
'<img src="/big/a/f2-base.png"></picture>'
)
def _fam_media(port):
return (
'<video src="/big/a/clip.mp4" poster="/big/a/poster.jpg">'
'<source src="/big/a/clip.webm" type="video/webm">'
'<track src="/big/a/subs.vtt" kind="subtitles" srclang="en">'
"</video>"
'<audio><source src="/big/a/tune.mp3" type="audio/mpeg"></audio>'
)
def _fam_css(port):
# image-set with descriptors is a proven-safe decoy (engine-surface §6).
return (
'<link rel="stylesheet" href="/big/a/print.css" media="print">'
'<div style="background:url(/big/a/attr-bg.png)">styled</div>'
'<style>@import "/big/a/blk.css"; h1 { background: url(/big/a/blk-bg.gif); }'
' h2 { background-image: image-set("/big/x/is1.png" 1x, "/big/x/is2.png" 2x); }'
"</style>"
)
def _fam_js(port):
# The concatenated string is rejected by the scanner (no single literal).
return (
'<script src="/big/a/heavy.js"></script>'
'<script src="/big/a/decoy.js"></script>'
"<script>document.write('<a href=\"/big/f5/dw.html\">dw</a>');\n"
'var nope = "xx-" + "/big/x/concat.html";</script>'
)
def _fam_meta(port):
# Extensionless decoy targets stay unfetchable even if the aggressive
# parser fires (no known extension, no scheme: rejected in every state).
return (
'<meta http-equiv="refresh" content="2;URL=/big/f6/refreshed.html">'
'<a href="/big/f6/based.html">based</a>'
'<meta property="og:image" content="/big/x/og">'
'<meta name="twitter:image" content="/big/x/tw">'
'<script type="application/ld+json">'
'{"@type": "Thing", "image": "/big/x/jsonld.png"}</script>'
)
def _fam_legacy(port):
# Comma-valued applet archive is rejected whole by the engine (decoy).
return (
'<a href="/big/f7/frames.html">frames</a>'
'<img src="/big/a/map.gif" usemap="#m">'
'<map name="m">'
'<area shape="rect" coords="0,0,9,9" href="/big/f7/area.html"></map>'
'<embed src="/big/a/e.pdf" type="application/pdf" width="9" height="9">'
'<object data="/big/a/o.pdf" type="application/pdf"></object>'
'<applet archive="/big/x/aj.jar,/big/x/bj.jar" width="1" height="1"></applet>'
)
def _fam_svg(port):
return (
'<svg width="9" height="9">'
'<image href="/big/a/svg-in.png" width="4" height="4"/>'
'<use xlink:href="#icon"/></svg>'
'<img src="/big/a/logo.svg">'
)
def _fam_i18n(port):
return (
'<a href="/big/f9/caf%C3%A9.html">cafe</a>'
'<a href="/big/f9/latin1.html">latin1</a>'
'<a href="/big/f9/metaonly.html">meta</a>'
'<a href="/big/f9/bom.html">bom</a>'
)
def _fam_http(port):
return (
'<a href="/big/r/hop1">chain</a>'
'<a href="/big/r/get42">get42</a>'
'<a href="/big/d/01">d01</a>'
'<a href="/big/d/02">d02</a>'
'<a href="/big/f10/empty.html">empty</a>'
'<a href="/big/d/dl">dl</a>'
)
def _fam_forms(port):
# GET form action is rewritten but never fetched; formaction/ping are
# outside the attribute tables (decoys).
return (
'<form action="/big/x/form-target.html" method="get">'
'<input type="text" name="q">'
'<input type="image" src="/big/a/btn.png" alt="go"></form>'
'<a href="/big/f11/page.html">bare</a>'
'<a href="/big/f11/page.html?utm_source=news&amp;utm_medium=mail">utm</a>'
'<a href="/big/f11/sess.html?PHPSESSID=deadbeef123">sess</a>'
'<button formaction="/big/x/formact">go</button>'
'<a href="/big/f11/page.html" ping="/big/x/ping">ping</a>'
)
BIG_FAMILIES = [
_fam_feeds,
_fam_plain,
_fam_srcset,
_fam_media,
_fam_css,
_fam_js,
_fam_meta,
_fam_legacy,
_fam_svg,
_fam_i18n,
_fam_http,
_fam_forms,
]
def big_link(m, style):
    return ["%d.html" % m, "../p/%d.html" % m, "/big/p/%d.html" % m][style]


def big_page(n, port):
    style = n % 3
    home = ["../index.html", "/big/index.html", "../index.html"][style]
    parts = ['<a href="%s">home</a>' % home]
    if n > 0:
        parts.append('<a href="%s">up</a>' % big_link((n - 1) // BIG_FANOUT, style))
    for c in range(n * BIG_FANOUT + 1, n * BIG_FANOUT + BIG_FANOUT + 1):
        if c < BIG_PAGES:
            parts.append('<a href="%s">p%d</a>' % (big_link(c, style), c))
    parts.append('<link rel="stylesheet" href="/big/a/site.css">')
    parts.append('<script src="/big/a/app.js"></script>')
    exts = ["png", "gif", "jpg"]
    ia = "/big/a/i%da.%s" % (n, exts[n % 3])
    ib = "/big/a/i%db.%s" % (n, exts[(n + 1) % 3])
    # Rotate the second-image construct across deterministic table attributes.
    con = n % 4
    if con == 0:
        parts.append('<img src="%s"><img src="%s">' % (ia, ib))
    elif con == 1:
        parts.append(
            '<img src="%s"><table background="%s"><tr><td>t</td></tr></table>'
            % (ia, ib)
        )
    elif con == 2:
        parts.append('<img src="%s"><img src="%s" data-src="%s">' % (ia, ia, ib))
    else:
        parts.append(
            '<img src="%s" loading="lazy"><video poster="%s"></video>' % (ia, ib)
        )
    parts.append(BIG_FAMILIES[n % 12](port))
    return big_html("p%d" % n, "\n".join(parts))
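
Reviewer note: big_page links page n to children 4n+1 .. 4n+4 (capped at BIG_PAGES) and back up to (n - 1) // BIG_FANOUT, so the whole page set should be reachable from p/0.html. The standalone sketch below checks that property with the same constants. `children`, `parent` and the breadth-first walk are illustrative helpers, not part of the fixture:

```python
from collections import deque

PAGES = 96   # BIG_PAGES
FANOUT = 4   # BIG_FANOUT


def children(n):
    """Child page numbers linked from page n, as in big_page's for loop."""
    return [c for c in range(n * FANOUT + 1, n * FANOUT + FANOUT + 1) if c < PAGES]


def parent(n):
    """Target of the 'up' link emitted for n > 0."""
    return (n - 1) // FANOUT


# Breadth-first walk from the root page, following child links only.
seen = {0}
queue = deque([0])
while queue:
    n = queue.popleft()
    for c in children(n):
        if c not in seen:
            seen.add(c)
            queue.append(c)

print(len(seen))
```

If the walk reaches every page, the pinned count of 96 pages does not depend on crawl order.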
def big_index(port):
return big_html(
"big index",
'<link rel="stylesheet" href="/big/a/site.css">'
'<script src="/big/a/app.js"></script>'
'<a href="p/0.html">root</a>'
'<img src="/big/a/d1/d2/d3/d4/d5/d6/d7/d8/deep.png">'
'<a href="/big/f1/long.html?x=%s">long</a>'
'<a href="/big/f1/gzok.html">gzok</a>'
'<a href="//127.0.0.1:%d/big/f1/protorel.html">protorel</a>'
'<a href="http://127.0.0.1:%d/big/f1/abshost.html">abshost</a>'
'<a href="/big/e/404.html">e404</a>'
'<a href="/big/e/410.html">e410</a>'
'<a href="/big/e/500.html">e500</a>'
'<a href="/big/e/gztrunc.html">gzt</a>'
'<a href="?">query</a>' % ("a" * 900, port, port),
)
BIG_REDIRECTS = {
"/big/r/hop1": (301, "/big/r/hop2"),
"/big/r/hop2": (302, "/big/f10/land.html"),
"/big/r/get42": (301, "/big/a/doc.pdf"),
"/big/f1/dir": (301, "/big/f1/dir/"),
}
BIG_SIMPLE_PAGES = {
"/big/p/two.html": "dot-slash target",
"/big/f1/one.html": "one",
"/big/f1/tri.html": "tri",
"/big/f1/abs.html": "abs",
"/big/f1/dir/": "dir index",
"/big/f1/long.html": "long",
"/big/f1/gzok.html": "gzok",
"/big/f1/protorel.html": "protorel",
"/big/f1/abshost.html": "abshost",
"/big/f5/dw.html": "dw target",
"/big/f6/refreshed.html": "refreshed",
"/big/f6/sub/leaf.html": "leaf",
"/big/f7/fa.html": "frame a",
"/big/f7/fb.html": "frame b",
"/big/f7/fn.html": "noframes",
"/big/f7/area.html": "area",
"/big/f10/land.html": "landed",
"/big/f11/page.html": "the page",
"/big/f11/sess.html": "the sess page",
}
# Extensionless downloads: name resolution is wire-type driven (#478 contract).
BIG_DOWNLOADS = {
"/big/d/01": ("pdf", None),
"/big/d/02": ("png", None),
"/big/d/dl": ("pdf", 'attachment; filename="named.pdf"'),
}
def _big_rss(port):
# purl.org marker makes the feed parse; item URLs are already-linked pages.
return (
'<?xml version="1.0"?>\n'
'<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">\n'
"<channel><title>big</title><link>http://127.0.0.1:%d/big/index.html</link>\n"
"<item><title>i1</title><link>http://127.0.0.1:%d/big/p/1.html</link>\n"
'<enclosure url="http://127.0.0.1:%d/big/p/2.html" type="text/html"/></item>\n'
"</channel></rss>\n" % (port, port, port)
).encode()
def _big_atom(port):
# No purl marker: emitted verbatim, its URL must never be fetched.
return (
'<?xml version="1.0"?>\n'
'<feed xmlns="http://www.w3.org/2005/Atom"><title>big</title>\n'
"<entry><title>e1</title>"
'<link href="http://127.0.0.1:%d/big/x/atom-only.html"/>'
"</entry></feed>\n" % port
).encode()
def _big_sitemap(port):
return (
'<?xml version="1.0"?>\n'
'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
"<url><loc>http://127.0.0.1:%d/big/x/sitemap-only.html</loc></url>\n"
"</urlset>\n" % port
).encode()
class Handler(SimpleHTTPRequestHandler):
# Quieter logging; the launcher captures httrack's own log anyway.
def log_message(self, fmt, *args):
@@ -150,6 +562,8 @@ class Handler(SimpleHTTPRequestHandler):
# Fake-binary blobs for the image/pdf/typeless cases.
FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64
FAKE_PDF = b"%PDF-1.4\n" + b"\x00" * 64
FAKE_JPEG = b"\xff\xd8\xff\xe0" + b"\x00" * 64
BIG_JPEG = b"\xff\xd8\xff\xe0" + bytes(range(256)) * 64 # > sniff window
# path -> (body, content_type); None sends no header, "" sends an empty
# Content-Type value (no usable type, must be treated like None).
@@ -161,6 +575,8 @@ class Handler(SimpleHTTPRequestHandler):
"/types/notype.pdf": (FAKE_PDF, None),
"/types/emptyct.png": (FAKE_PNG, ""),
"/types/lie.png": (FAKE_PNG, "text/html"),
"/types/wrongtype.jpg": (FAKE_JPEG, "image/png"),
"/types/bigtype.jpg": (BIG_JPEG, "image/png"),
"/types/report.pdf": (b"<html><body>real page</body></html>", "text/html"),
"/types/page.htm": (b"<html><body>htm page</body></html>", "text/html"),
"/types/script.js": (b"var x = 1;\n", "application/javascript"),
@@ -178,6 +594,10 @@ class Handler(SimpleHTTPRequestHandler):
'\t<a href="notype.pdf">notypepdf</a>\n'
'\t<img src="emptyct.png" />\n'
'\t<img src="lie.png" />\n'
'\t<img src="wrongtype.jpg" />\n'
'\t<img src="bigtype.jpg" />\n'
'\t<img src="mutant.jpg" />\n'
'\t<img src="packed.jpg" />\n'
'\t<a href="report.pdf">report</a>\n'
'\t<a href="page.htm">htm</a>\n'
'\t<script src="script.js"></script>\n'
@@ -192,6 +612,25 @@ class Handler(SimpleHTTPRequestHandler):
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
    # content changes between crawls: run 1 sniffs JPEG, the update pass must
    # keep the run-1 name (recorded verdict) even though the body is now PNG
    MUTANT_SEEN = set()

    def route_types_mutant(self):
        path = urlsplit(self.path).path
        body = self.FAKE_PNG if path in self.MUTANT_SEEN else self.FAKE_JPEG
        if self.command != "HEAD":
            self.MUTANT_SEEN.add(path)
        self.send_raw(body, "image/png")

    # gzip on the wire: the sniff must see the decoded body, not the stream
    def route_types_packed(self):
        self.send_raw(
            gzip.compress(self.FAKE_JPEG),
            "image/png",
            extra_headers=[("Content-Encoding", "gzip")],
        )
# --- MIME-type exclusion abort (issue #58) -----------------------------
# A -mime:application/pdf filter must abort the transfer once the header
# arrives, not download the whole body and discard it.
@@ -342,7 +781,7 @@ class Handler(SimpleHTTPRequestHandler):
self.send_raw(b"", "text/html")
# broken Content-Length (#32/#41): declared size != bytes sent. httrack
# warns "bogus state (broken size)" and skips the cache unless -%B.
# warns "incomplete transfer" and skips the cache unless -%B.
def route_size_index(self):
self.send_html('\t<a href="oversize.bin">over</a>\n')
@@ -392,6 +831,97 @@ class Handler(SimpleHTTPRequestHandler):
def route_redir_target(self):
self.send_raw(b"<html><body>redirect target</body></html>\n", "text/html")
# --- /mini304/: tiny fully-cacheable site (an update gets only 304s) ---
def route_mini304_index(self):
self.big_send(
b'<html><body>\n\t<a href="page.html">page</a>\n</body></html>\n',
"text/html",
)
def route_mini304_page(self):
self.big_send(b"<html><body>tiny cacheable page</body></html>\n", "text/html")
# --- delayed-type degenerate paths (issues #5/#107) --------------------
def route_delayed_index(self):
self.send_html(
'\t<a href="noloc.php">noloc</a>\n'
'\t<a href="selfloop.php">selfloop</a>\n'
'\t<a href="chain1.php">chain</a>\n'
'\t<a href="redir.php">redir</a>\n'
'\t<a href="notype.bin">notype</a>\n'
'\t<a href="empty.php">empty</a>\n'
)
def send_redirect(self, location):
self.send_response(302, "Found")
if location is not None:
self.send_header("Location", location)
self.send_header("Content-Length", "0")
self.end_headers()
def route_delayed_noloc(self):
self.send_redirect(None) # 302 without Location: name never resolves
def route_delayed_selfloop(self):
self.send_redirect("selfloop.php")
def route_delayed_chain(self):
# chain1..chain9: one more hop than the type-check redirect budget
n = int(urlsplit(self.path).path.rsplit("chain", 1)[1].split(".")[0])
if n < 9:
self.send_redirect("chain%d.php" % (n + 1))
else:
self.send_raw(self.FAKE_PDF, "application/pdf")
def route_delayed_redir(self):
self.send_redirect("real.pdf")
def route_delayed_realpdf(self):
self.send_raw(self.FAKE_PDF, "application/pdf")
def route_delayed_notype(self):
self.send_raw(self.FAKE_PDF, None)
def route_delayed_empty(self):
self.send_raw(b"", "text/html") # 200 + Content-Length: 0
# -E time-limit (#481): pages that trickle far longer than any -E budget,
# so only an engine-side abort can end the crawl.
TRICKLE_SECONDS = 60
def send_bin_index(self):
"""Index page linking p0.bin..p7.bin (shared by trickle and bigfiles)."""
self.send_html(
"".join('\t<a href="p%d.bin">p%d</a>\n' % (i, i) for i in range(8))
)
def route_trickle_index(self):
self.send_bin_index()
    def route_trickle_page(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(2 * self.TRICKLE_SECONDS))
        self.end_headers()
        if self.command == "HEAD":
            return
        try:
            for _ in range(self.TRICKLE_SECONDS):
                self.wfile.write(b"xy")
                self.wfile.flush()
                time.sleep(1.0)
        except OSError:
            pass
# -M byte cap (#77): large fast files so a crawl overruns -M immediately.
BIGFILE_BYTES = 640 * 1024
def route_bigfiles_index(self):
self.send_bin_index()
def route_bigfile(self):
self.send_raw(b"x" * self.BIGFILE_BYTES, "application/octet-stream")
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
@@ -407,6 +937,10 @@ class Handler(SimpleHTTPRequestHandler):
"/types/notype.pdf": route_types,
"/types/emptyct.png": route_types,
"/types/lie.png": route_types,
"/types/wrongtype.jpg": route_types,
"/types/bigtype.jpg": route_types,
"/types/mutant.jpg": route_types_mutant,
"/types/packed.jpg": route_types_packed,
"/types/report.pdf": route_types,
"/types/page.htm": route_types,
"/types/script.js": route_types,
@@ -432,11 +966,187 @@ class Handler(SimpleHTTPRequestHandler):
"/cdispo/index.html": route_cdispo_index,
"/cdispo/fetch.php": route_cdispo,
"/cdispo/evil.php": route_cdispo,
"/delayed/index.html": route_delayed_index,
"/trickle/index.html": route_trickle_index,
"/trickle/p0.bin": route_trickle_page,
"/trickle/p1.bin": route_trickle_page,
"/trickle/p2.bin": route_trickle_page,
"/trickle/p3.bin": route_trickle_page,
"/trickle/p4.bin": route_trickle_page,
"/trickle/p5.bin": route_trickle_page,
"/trickle/p6.bin": route_trickle_page,
"/trickle/p7.bin": route_trickle_page,
"/bigfiles/index.html": route_bigfiles_index,
"/bigfiles/p0.bin": route_bigfile,
"/bigfiles/p1.bin": route_bigfile,
"/bigfiles/p2.bin": route_bigfile,
"/bigfiles/p3.bin": route_bigfile,
"/bigfiles/p4.bin": route_bigfile,
"/bigfiles/p5.bin": route_bigfile,
"/bigfiles/p6.bin": route_bigfile,
"/bigfiles/p7.bin": route_bigfile,
"/delayed/noloc.php": route_delayed_noloc,
"/delayed/selfloop.php": route_delayed_selfloop,
"/delayed/redir.php": route_delayed_redir,
"/delayed/real.pdf": route_delayed_realpdf,
"/delayed/notype.bin": route_delayed_notype,
"/delayed/empty.php": route_delayed_empty,
"/delayed/chain1.php": route_delayed_chain,
"/delayed/chain2.php": route_delayed_chain,
"/delayed/chain3.php": route_delayed_chain,
"/delayed/chain4.php": route_delayed_chain,
"/delayed/chain5.php": route_delayed_chain,
"/delayed/chain6.php": route_delayed_chain,
"/delayed/chain7.php": route_delayed_chain,
"/delayed/chain8.php": route_delayed_chain,
"/delayed/chain9.php": route_delayed_chain,
"/redir/index.html": route_redir_index,
"/redir/go.php": route_redir_go,
"/redir/target.html": route_redir_target,
"/mini304/index.html": route_mini304_index,
"/mini304/page.html": route_mini304_page,
}
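The eight p0.bin..p7.bin entries under each prefix mirror the range(8) used by send_bin_index. As a sketch only, that part of the table could be generated instead of listed; `numbered_routes` is a hypothetical helper, and plain strings stand in for the handler methods so the snippet runs on its own.

```python
def numbered_routes(prefix, handler, count=8):
    """Map /<prefix>/p0.bin .. /<prefix>/p<count-1>.bin to a single handler."""
    return {"/%s/p%d.bin" % (prefix, i): handler for i in range(count)}


# Strings replace the real route_* methods purely for illustration.
routes = {
    "/trickle/index.html": "route_trickle_index",
    **numbered_routes("trickle", "route_trickle_page"),
    "/bigfiles/index.html": "route_bigfiles_index",
    **numbered_routes("bigfiles", "route_bigfile"),
}
```

The explicit listing in the diff keeps every path greppable, which may matter more in a test fixture than brevity.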
# --- /big/ seeded pseudo-site ------------------------------------------
def big_send(self, body, ctype, code=200, extra=()):
if code == 200 and self.headers.get("If-Modified-Since") == BIG_LASTMOD:
self.send_response(304)
self.send_header("Content-Length", "0")
self.end_headers()
return
self.send_response(code)
if code == 200:
self.send_header("Last-Modified", BIG_LASTMOD)
self.send_header("Content-Type", ctype)
self.send_header("Content-Length", str(len(body)))
for name, value in extra:
self.send_header(name, value)
self.end_headers()
if self.command != "HEAD":
self.wfile.write(body)
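The conditional-GET rule in big_send can be isolated as a pure function: a 304 is returned only when the response would have been a 200 and the request's If-Modified-Since matches the Last-Modified string exactly. This is a minimal sketch; `conditional_status` is a hypothetical name, and the BIG_LASTMOD value below is a placeholder, since the real constant is defined elsewhere in the file.

```python
BIG_LASTMOD = "Thu, 01 Jan 2015 00:00:00 GMT"  # placeholder, not the real constant


def conditional_status(code, if_modified_since, last_modified=BIG_LASTMOD):
    """Status to answer with: 304 only for a 200 whose If-Modified-Since
    header equals the Last-Modified value as an exact string match."""
    if code == 200 and if_modified_since == last_modified:
        return 304
    return code
```

Exact string comparison, rather than date parsing, is enough here because the client is expected to echo back the value it was given.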
def big_error(self, code, reason):
body = big_html("error", "<p>%d</p>%s" % (code, HOME))
self.big_send(body, "text/html", code=code, extra=[("X-Reason", reason)])
def route_big(self):
split = urlsplit(self.path)
path = unquote(split.path)
port = self.server.server_address[1]
if path in BIG_REDIRECTS:
code, location = BIG_REDIRECTS[path]
self.send_response(code)
self.send_header("Location", location)
self.send_header("Content-Length", "0")
self.end_headers()
elif path == "/big/index.html":
self.big_send(big_index(port), "text/html")
elif path in BIG_SIMPLE_PAGES:
body = big_html(path, "<p>%s</p>%s" % (BIG_SIMPLE_PAGES[path], HOME))
if path == "/big/f1/gzok.html":
self.big_send(
gzip.compress(body, mtime=0),
"text/html",
extra=[("Content-Encoding", "gzip")],
)
else:
self.big_send(body, "text/html")
elif path == "/big/f1/list.html":
# Pagination: distinct content per query string.
body = big_html("list", "<p>listing %s</p>%s" % (split.query or "1", HOME))
self.big_send(body, "text/html")
elif path == "/big/f6/based.html":
self.big_send(
big_html(
"based",
'<base href="http://127.0.0.1:%d/big/f6/sub/">'
'<a href="leaf.html">leaf</a>' % port,
),
"text/html",
)
elif path == "/big/f7/frames.html":
self.big_send(
b'<html><frameset cols="50%,50%"><frame src="fa.html">'
b'<frame src="fb.html"><noframes><body><a href="fn.html">fn</a>'
b"</body></noframes></frameset></html>",
"text/html",
)
elif path == "/big/f9/café.html":
self.big_send(big_html("cafe", "<p>cafe</p>%s" % HOME), "text/html")
elif path == "/big/f9/latin1.html":
self.big_send(
b"<html><body><p>caf\xe9 latin</p></body></html>",
"text/html; charset=ISO-8859-1",
)
elif path == "/big/f9/metaonly.html":
self.big_send(
'<html><head><meta charset="utf-8"></head>'
"<body><p>café meta</p></body></html>".encode(),
"text/html",
)
elif path == "/big/f9/bom.html":
self.big_send(
b"\xef\xbb\xbf" + big_html("bom", "<p>bom</p>%s" % HOME), "text/html"
)
elif path == "/big/f10/empty.html":
self.big_send(b"", "text/html")
elif path == "/big/f12/rss.xml":
self.big_send(_big_rss(port), "text/xml")
elif path == "/big/f12/atom.xml":
self.big_send(_big_atom(port), "application/xml")
elif path == "/big/f12/sitemap.xml":
self.big_send(_big_sitemap(port), "text/xml")
elif path.startswith("/big/p/"):
try:
n = int(path[len("/big/p/") : -len(".html")])
except ValueError:
n = -1
if 0 <= n < BIG_PAGES and path.endswith(".html"):
self.big_send(big_page(n, port), "text/html")
else:
self.big_error(404, "no such page")
elif path.startswith("/big/a/") or path.startswith("/big/x/"):
name = path[len("/big/a/") :]
if path.startswith("/big/a/") and name in BIG_TEXT_ASSETS:
text, ctype = BIG_TEXT_ASSETS[name]
self.big_send(text.encode(), ctype)
elif name.endswith(".html"):
# Decoy targets 200 so a parser leak becomes a mirror file.
self.big_send(big_html(name, "<p>%s</p>" % name), "text/html")
else:
ext = name.rsplit(".", 1)[-1]
ctype = BIG_CTYPES.get(ext, "application/octet-stream")
self.big_send(big_asset(name), ctype)
elif path in BIG_DOWNLOADS:
ext, cdispo = BIG_DOWNLOADS[path]
extra = [("Content-Disposition", cdispo)] if cdispo else []
self.big_send(
big_asset(path[len("/big/") :] + "." + ext),
BIG_CTYPES[ext],
extra=extra,
)
elif path == "/big/e/404.html":
self.big_error(404, "Not Found")
elif path == "/big/e/410.html":
self.big_error(410, "Gone")
elif path == "/big/e/500.html":
self.big_error(500, "Server Error")
elif path == "/big/e/gztrunc.html":
# Half a gzip stream, honest Content-Length: decode fails, and the
# missing Last-Modified keeps it the one uncacheable resource.
full = gzip.compress(big_html("gz", "x" * 3000), mtime=0)
body = full[: len(full) // 2]
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.send_header("Content-Encoding", "gzip")
self.send_header("Content-Length", str(len(body)))
self.end_headers()
if self.command != "HEAD":
self.wfile.write(body)
else:
self.big_error(404, "no such big path")
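The gztrunc case relies on two properties of the standard gzip module: compression with mtime=0 is reproducible, and half of a gzip stream cannot be decoded. The sketch below checks both; `try_decode` and the payload are illustrative and not taken from the test server.

```python
import gzip
import zlib

payload = b"<html><body>" + b"x" * 3000 + b"</body></html>"

# mtime=0 pins the timestamp field in the gzip header.
full = gzip.compress(payload, mtime=0)
repeat = gzip.compress(payload, mtime=0)

# Keep only the first half of the stream, as the route does.
truncated = full[: len(full) // 2]


def try_decode(data):
    """Return the decompressed bytes, or None if the stream is unusable."""
    try:
        return gzip.decompress(data)
    except (EOFError, OSError, zlib.error):
        return None
```

Because the truncated body is sent with a matching Content-Length, the transfer itself looks complete and the failure only surfaces at the decoding step.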
# --- dispatch ----------------------------------------------------------
def reject_fragment(self):
@@ -452,6 +1162,9 @@ class Handler(SimpleHTTPRequestHandler):
def dispatch(self):
self._set_cookies = []
path = urlsplit(self.path).path
if path.startswith("/big/"):
self.route_big()
return True
# Match percent-encoded paths (accented #157 route) by their decoded form.
handler = self.ROUTES.get(path) or self.ROUTES.get(unquote(path))
if handler is not None:


@@ -211,7 +211,9 @@ main() {
# lintian ourselves below as the real gate.
local -a debuild_opts=(--no-lintian)
local -a build_opts=()
[[ $source_only -eq 1 ]] && build_opts+=(-S)
# -d: a source build runs no debhelper, so don't require Build-Depends
# locally (the buildds and the --sbuild gate enforce them).
[[ $source_only -eq 1 ]] && build_opts+=(-S -d)
if [[ $unsigned -eq 1 ]]; then
build_opts+=(-us -uc)
else
@@ -234,12 +236,15 @@ main() {
# The real lintian gate (debuild only reports, it does not fail on tags).
# --profile debian: CI runners are Ubuntu, whose vendor data would wrongly
# reject the Debian "unstable" distribution. newer-standards-version only
# means the local lintian is older than the buildds', not a package
# defect, so suppress it. set -e turns any error/warning tag into a failure.
# reject the Debian "unstable" distribution. Suppressed tags are stale-local-
# lintian skew, not package defects: newer-standards-version, and
# recommended-field (old lintian still wants the Priority field the sid
# lintian in CI accepts dropping). set -e turns any error/warning tag into
# a failure.
info "running lintian gate (--fail-on=error,warning)"
lintian --profile debian -I -i --fail-on=error,warning \
--suppress-tags newer-standards-version "${changes[@]}"
--suppress-tags newer-standards-version,recommended-field \
"${changes[@]}"
dcmd cp -- "${changes[@]}" "$outdir/"