Compare commits

...

29 Commits

Author SHA1 Message Date
Xavier Roche
6da801e3b9 Add a -M byte-cap non-regression test
The -M limit had no test coverage. New /bigfiles/ fixture (8 fast 640KB
files), a --max-mirror-bytes audit in local-crawl.sh, and a crawl that
must log the "giving up" error and stay under the uncapped total.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-07-04 15:05:15 +02:00
Xavier Roche
dfafe28002 Enforce the -E time limit inside the transfer wait cycle (#482)
-E was only evaluated at per-link boundaries, so a slow or throttling
server starved the check for minutes, and the smooth stop it finally
requested drained the remaining transfers at server pace with no bound.
back_wait now checks the deadline every cycle and, once a short grace
period expires, aborts the in-flight HTTP transfers like the -T timeout
path does (FTP slots stay with their owning thread). back_checkmirror's
0 return, previously dead, now carries the hard stop.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 13:13:33 +02:00
Xavier Roche
a3f04bde72 Sniff magic bytes before the wire type renames a URL extension (#478)
A URL whose extension maps to a specific type but is served with a
disagreeing specific Content-Type was always renamed after the wire
(photo.jpg served as image/png became photo.png). The contested
verdict (#480) is now settled by the leading body bytes: magic proving
the extension's own type keeps it, anything inconclusive trusts the
wire as before, and the #267 soft-404 guard is unchanged.

New htssniff.c covers the magic-sniffable part of the supported MIME
set (images, A/V containers by RIFF subtype and ftyp brand, zip/OLE
document containers, archives, fonts, conservative text prefixes).
hts_wait_delayed waits for a sniffable head (or EOF) only on contested
verdicts; the head is read from the live backing slot (memory,
url_sav, or the compressed-stream tmpfile, inflated in memory). Update
runs never re-read bytes: they reproduce the previous run's verdict
from the recorded X-Save name (cache_read_including_broken grows a
return_save), so names never churn across updates or upgrades.
Non-delayed mode never sniffs; its HEAD probe has no body on the
first run. Also unlock the waiter's slot on the user-cancel abort.

Tests flip the #480 contract pins to the sniffed outcomes (wrongtype/
bigtype/packed/mutant keep their extension, lie.png stays png), add
-#test=sniff table rows, and pin the recorded-verdict proxy in
01_zlib-savename-cached (kept out of the MSan job: uninstrumented
zlib). All discriminate against the pre-sniff binary.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 10:05:41 +02:00
Xavier Roche
11beef52e1 Name the contested case in extension naming and pin the current contract (#480)
* Make the wire-type-vs-extension naming decision an explicit verdict

Behavior-preserving refactor of wire_patches_ext: the decision becomes
a three-way wire_ext_verdict (ext kept / wire wins / contested), with
the contested case, a specific declared type disagreeing with a
specific URL extension, named explicitly instead of falling through.
Today a contested verdict trusts the wire, unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Pin the naming contract: knobs and fixtures for content-independent naming

-#test=savename gains body= (leading body bytes via a temp url_sav
file) and cached= (a real one-entry cache, reopened read-only, whose
stored body is PNG magic); new rows and 01_zlib-savename-cached.test
pin that naming never depends on content or on the previously recorded
save name, only on headers. e2e fixtures (wrongtype.jpg served as
image/png, a gzip variant, a 16 KiB body, content that changes between
crawls) pin the wire-wins outcome across fresh and update passes. Any
future content-based tie-break must flip these rows explicitly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 09:13:47 +02:00
Xavier Roche
d7c4eab1f5 Delayed type check finalizes fast transfers under their .delayed placeholder name (#479)
* Fix delayed-slot races: lock guards and finalize order

A slot whose savename is still the .delayed placeholder could be
finalized and recorded under that transient name when the transfer
completed before hts_wait_delayed resolved the type (fast servers):
back_clean ignored the lock hts_wait_delayed holds, the direct-to-disk
shortcut bound the output file to the placeholder name, and the
type-known finalize ran before url_sav was patched, caching the entry
under the wrong name (the 'bogus state (incomplete type)' warnings).
Skip locked slots in back_clean and in the disk shortcut, and patch
url_sav before finalizing.

The delayed-naming body sniff (next commit) waits for body bytes and
would hit these races deterministically; 15_local-types and
18_local-update cover the flow end to end.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Exercise the degenerate delayed-type paths; audit .delayed leftovers everywhere

New delayed/ fixtures (302 without Location, self-loop, a chain one hop
past the redirect budget, a proper redirect, a typeless unknown-ext
file) with 33_local-delayed.test, run twice via --rerun. local-crawl.sh
now fails ANY local test whose finished mirror contains a *.delayed
temporary, guarding the whole suite against the #107 class.

Probing the degenerate paths settled a review question: afs->save can
still be a fresh placeholder when url_sav is patched (redirects that
never resolve), but nothing name-keyed is stored there -- non-OK
entries are cached header-only with an empty X-Save, the link is
dropped, no file is left. The test pins exactly that.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Empty responses finalized delayed slots under the .delayed name

Found while probing for a harness trigger of the finalize race: a
delayed-type URL answered 200 with Content-Length: 0 takes the
empty-response branch, which reaches READY inside the header round and
calls back_finalize directly, while url_sav still holds the .delayed
placeholder. Deterministic on any such URL: the 'bogus state
(incomplete type)' warning and an entry recorded under the placeholder
name. Empty responses are common in the wild (beacons, placeholder
files), so this is the reproducible face of #5.

Defer the finalize when the slot is locked, matching the other guards:
the waiter finalizes right after patching url_sav. The new empty.php
fixture makes 33_local-delayed fail on master and pass here; the
timing race itself remains untriggerable from the harness (headers and
body advance in separate back_wait rounds; probe notes in the PR).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-04 08:37:14 +02:00
Xavier Roche
2eac19655b Widen the savename self-test and cover Content-Disposition end to end (#477)
* tests: widen the savename self-test and cover Content-Disposition end to end

-#test=savename grows key=value knobs (cdispo=, statuscode=, status=, adr=,
strip=, urlhack= and its negatives, n83=, type=) plus repeatable
prior=adr|fil|sav rows that register an already-crawled link, so the .test
can pin the still-downloading mime path, redirect delayed naming, dedup and
collision suffixing, 8-3 modes, --strip-query dedup and hostile fils
(traversal, control chars, oversized names) - the regression net for the
upcoming resolve_extension work.

A new cdispo/ endpoint in local-server.py and 32_local-cdispo.test give the
Content-Disposition branches their first end-to-end coverage, including a
traversal filename reduced to its last component.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* review follow-ups: isolate the wire cdispo strip, add over-match negatives

The review found the e2e traversal row masked by two independent layers
(the wire parser and url_savename both strip path components), so a new
-#test=header self-test pins treathead's Content-Disposition parse alone.
Three negative rows keep dedup honest: a kept query key that differs, a
distinct URL under urlhack, and a same-basename-different-directory prior
must all produce a fresh name, not a false match. route_cdispo now reuses
send_raw via an extra_headers argument.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:42:32 +02:00
Xavier Roche
83c231d50e Fold the type-probe's triplicated extension decision into one function (#476)
* htsname: extract resolve_extension() from the type-probe block

The cache, backing-headers and live-probe paths each carried the same
cdispo / wire_patches_ext / give_mimext decision, per the block's own
FIXME ("factorize this unholy mess"); fold the three copies into one
static resolve_extension(). Drop the dead DEFAULT_BIN_EXT arms (the
define has been commented out in htsconfig.h for years) and the ishtest
variable only they read.

-#test=savename gains an optional Content-Disposition argument, and the
engine test now pins the filename-replacement path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* tests: pin cdispo reserved-char sanitization in the savename table

Review follow-up: adds a row where the expected name differs from the
Content-Disposition input, exercising the hostile-cdispo cleaning path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 08:38:30 +02:00
Xavier Roche
9d29b8329b Phase 0 follow-ups: residual unescape OOB, dead java plugin, FTP test strength (#475)
* htsencoding: bound the raw UTF-8 flush in hts_unescapeUrlSpecial

The completed-sequence flush memcpy ends with a 'continue' that skips the
per-byte NUL-reserve guard, so a raw multi-byte character landing at the
exact end of dest let the trailing NUL write dest[max] (1-byte OOB, found
by the post-#474 review pass; ASan-verified via the extended
-#test=unescape-bounds).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Resurrect the java .class parser on modern Unix builds

The plugin was dead three ways on a current Linux build: hts_plug() was
compiled hidden (-fvisibility=hidden, EXTERNAL_FUNCTION expanded to nothing
on ELF), hts_create_opt() dlopens libhtsjava.so.2 which no longer exists
since the soname moved to .so.3, and JAVA_HEADER.magic is 'unsigned long'
(8 bytes on LP64) under a 10-byte fread, so major/count came from
uninitialized bytes and the 0xCAFEBABE check never matched.

EXTERNAL_FUNCTION now forces default visibility on ELF, the dlopen name is
derived from VERSION_INFO at configure time, and the header fields are
fixed-width. 31_local-javaclass.test crawls a generated .class and asserts
a resource named only in its constant pool is fetched; it fails if any of
the three regresses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* htsftp: assert nonzero buffer sizes, harden the userpass self-test

ftp_split_userpass underflows its size-1 math on a zero size; assert the
precondition now that the function is public in htsftp.h. The self-test
gains a tight-size run with guard bytes and exact-content checks, which the
256-byte buffers alone could not fail on an off-by-one.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* configure: use the dylib plugin name on Darwin

libtool names the module libhtsjava.N.dylib there, so the .so.N form can
never load; caught by 31_local-javaclass.test on the macOS CI job (the old
hardcoded .so.2 was just as dead, silently).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* tests: cover the %xx-encoded UTF-8 flush path in unescape-bounds

The raw-byte cases never take the utfBufferJ = lastJ rollback branch, so a
wrong flush offset there would have passed (review finding).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 23:04:02 +02:00
Xavier Roche
ac4a1ca48e Harden three parser buffer copies against overflow (#474)
Width-less sscanf("%s %s %s") parsed a request line into fixed method/url/
protocol buffers in the catch-URL proxy (htscatchurl.c) and the postfile
sender (htslib.c), so a long token could overflow the smallest one
(method[32] / [256]). Add field widths matching each buffer.

The makeindex title decoder copied the UTF-8-expanded title into s[] with a
raw strcpy; a charset that expands past the buffer would overflow it.
Truncate with snprintf and free through freet.

The entity/URL unescapers guarded the copy with `j + 1 > max`, which reserves
nothing for the trailing NUL: an input filling dest to exactly `max` chars
wrote dest[max], one byte past a max-sized buffer. Every production caller
passes max = buffer capacity (strlen+1 or sizeof), so this was a real 1-byte
OOB, latent only because output is usually shorter than the buffer. Use
`>=`. The st_entities self-test passed max = strlen (relying on strdup's
extra byte); correct it to strlen+1 to match the callers.

Self-test unescape-bounds exercises the exact-fill boundary on both
unescapers; it aborts under ASan on the pre-fix guard.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 21:54:09 +02:00
Xavier Roche
9f2f2e52fa Fix three network-facing overflows in the FTP and Java parsers (#473)
get_ftp_line() copied a server reply byte-by-byte into a fixed char[1024]
with no index bound, so a hostile or MITM FTP server could smash the stack
with an over-long CRLF-less line. Bound the write and truncate.

The ftp:// userinfo parser copied "user:pass@" into user[256]/pass[256] with
two unbounded loops, overflowing from a long userinfo supplied by a hostile
ftp:// link. Extract the split into ftp_split_userpass(), which truncates
each field to fit.

The Java .class parser did calloc(header.count, sizeof(RESP_STRUCT)) on an
attacker-controlled u2 count, allocating ~68 MB per crafted class (DoS). Cap
the count to the file size (each constant-pool entry is at least one byte on
disk) via a new hts_count_fits() guard, and move the alloc/free to the
bounds-checked calloct/freet wrappers.

Self-tests: ftp-line drives get_ftp_line over a socketpair with a 4 KB reply,
ftp-userpass feeds an over-long userinfo, java exercises the count cap. The
first two abort under ASan on the pre-fix code.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 21:54:01 +02:00
Xavier Roche
92db2f2b41 htsparse: follow same-file redirects instead of self-pointing stubs (#159) (#471)
An http->https redirect (or any alias where the savename strips a
component the URL keeps) collapses the source and target onto one saved
file. hts_mirror_check_moved wrote a "Page has moved" stub linking to the
target's savename, but that equals the source's, so the stub pointed at
itself and the real content was never saved.

Detect a same-file alias with a new hts_redirect_same_savefile helper
(scheme and userinfo stripped, www kept, path slash/query-normalized per
the URL-hack dedup flags) and follow the redirect through: record the
moved link at the same savename so its content overwrites the placeholder.
Genuinely-different moves keep the stub. The old informational "URL Hack
identical" log is superseded by an actionable message on the followed
redirect. Covered by a redirect-samefile engine self-test.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 09:41:17 +02:00
Xavier Roche
ec52112446 tests: cover anchored-link (#frag) rewriting (#279) (#470)
An anchored hyperlink target.html#sec must fetch the target with the
fragment dropped yet keep the fragment in the rewritten local link so
the anchor still resolves. This already works; #279 is a stale
report from the Google Code era with no current repro.

Pin the behavior with a local-crawl test: the strict server 400s on a
'#' in the request-target (so a leaked fragment fails the fetch), and a
new --file-matches audit asserts the mirrored link keeps #sec/#sec2 for
both the unquoted and quoted forms.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 22:26:16 +02:00
Xavier Roche
1eaddc9c0e htsparse: drive the extended-context field list from one X-macro (#469)
ENGINE_SET_CONTEXT and ENGINE_SAVE_CONTEXT kept hand-maintained parallel
copies of the mutable extended-context fields. The two lists drifting apart is
how the makestat_time throttle bug got in: a field reloaded by SET with no
matching SAVE. Move the six mutable fields into a single ENGINE_MUTABLE_FIELDS
list that DEFINE, SET and SAVE each expand through their own operation, so a
load without a matching store can no longer be written.

Pure refactor: object code matches the prior macros apart from embedded
__LINE__ constants shifting with the smaller source.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 22:02:03 +02:00
Xavier Roche
d97a7bdfd9 htsparse: convert JS-detection automaton macros to static functions (#468)
AUTOMATE_LOOKUP_CURRENT_ADR and INCREMENT_CURRENT_ADR captured four parser
locals (inscript, inscript_state, inscript_state_pos, html) by lexical scope.
Replace them with two static helpers driven through a small script_automate
struct of pointers set up once, and lift the INSCRIPT enum to file scope so the
helpers can name it. The obscure `new_state_pos*sizeof(row) < sizeof(table)`
bound becomes the equivalent `next < INSCRIPT_NSTATES`.

Behavior-preserving: a JS-heavy crawl (document.write, quoted/escaped strings,
// and /* */ comments, onclick handlers) mirrors byte-identically against
master.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 21:38:11 +02:00
Xavier Roche
d2d02d87c2 htsparse: extract HT_ADD_END into hts_finish_html_file, drop dead macros (#467)
The final HTML-flush step lived in the ~60-line HT_ADD_END macro, captured by
lexical scope in htsparse(). Move it to a real function in htscore.c (next to
its sibling hts_finish_makeindex), passing the output buffer as a plain
pointer+length since the TypedArray is an anonymous struct type that can't cross
a function boundary; the caller keeps the size guard and the trailing
TypedArrayFree. Behavior-preserving: a two-crawl mirror produces byte-identical
output against master.

Also delete the _FILTERS/_FILTERS_PTR macros, dead in htsparse.c (htscore.c and
htswizard.c keep their own local copies); _ROBOTS stays.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 21:08:16 +02:00
Xavier Roche
4958bb8666 htsparse: fix HTML-escape truncation, cache-buffer leak, and stats-throttle reset (#465)
* htsparse: reserve 6x room for full HTML-escaping, not 5x

HT_ADD_HTMLESCAPED_ANY reserved strlen*5+1024 on the assumption that
"&amp;" (5 bytes) is the worst-case expansion. That holds for
escape_for_html_print, but escape_for_html_print_full turns a high byte
into "&#xHH;" (6 bytes). Past ~1023 high bytes the reservation is short,
so the escaper hits its internal cap: it truncates the string mid-run and
its overflow return counts the terminating NUL, which then lands inside
the mirrored HTML file. The only _full call site rewrites a link into a
2KB buffer, so a long non-ASCII local path triggers it.

Give the macro a per-function expansion factor (HTS_HTMLESCAPE_MAXEXP=5,
HTS_HTMLESCAPE_FULL_MAXEXP=6) and pass 6 for the _full variant. A new
escape-room self-test pins each function's real worst-case expansion
against the constant the macro reserves, so the two can't drift again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* htsparse: free the cache buffer in HT_ADD_END

The not-modified fast path reads the stored //[HTML-MD5]// digest via
cache_readdata, which malloc's the buffer, but never freed it. Every page
whose on-disk size already matches the freshly rewritten one leaks that
buffer. Free it after the compare.

Wrapped the macro in clang-format off/on: it is hand-aligned and
clang-format realigns every backslash on any edit, churning untouched
lines.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* htsparse: keep makestat_time out of ENGINE_SET_CONTEXT

makestat_time throttles the makestat/maketrack stats to once per minute:
the wait loop compares time_local() against it and, when it fires, writes
it back to the local. But the field is by-value in the extended context,
so it can't round-trip through ENGINE_SAVE_CONTEXT, while ENGINE_SET_CONTEXT
re-read it from the load-once baseline on every loop iteration. That reset
the local before the next compare, so under -%v / maketrack the throttle
never held and the stats line plus the full back-stack dump were emitted
every iteration.

Drop makestat_time (and the never-changing makestat_fp) from SET_CONTEXT;
they belong to the load-once set. Wrapped the macro in clang-format off/on
for the same backslash-realignment reason as HT_ADD_END.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 19:49:14 +02:00
Xavier Roche
07da404cb8 debian: lintian cleanup (metadata, compat 14, priority, NMU override) (#466)
Add debian/upstream/metadata (DEP-12), bump debhelper-compat to 14, drop the
redundant Priority: optional source field, and rename the stale NMU override
tag changelog-should-mention-nmu to no-nmu-in-changelog with a note on why the
address mismatch trips it.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 19:42:34 +02:00
Xavier Roche
694e45c698 Collapse the 5 inplace_escape_* bodies into one shared helper (#464)
DECLARE_INPLACE_ESCAPE_VERSION stamped five byte-identical function
bodies, one per escaper. Move the body into a single static
inplace_escape() helper parameterized by a function pointer to the
underlying escape_*(); the five HTSEXT_API inplace_escape_* symbols
stay as thin wrappers, so the exported ABI is unchanged.

A new -#test=inplace-escape self-test asserts each inplace_escape_*()
equals its escape_*() applied to a copy across several samples
(including a >255-byte one to hit the helper's malloct path), guarding
the refactor.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:14:15 +02:00
Xavier Roche
db9ec2cc3b Replace duplicated HT_INDEX_END macro with a shared function (#463)
The macro that closes the makeindex index.html (footer, optional refresh meta, then the user "primary" command) was copy-pasted into htsparse.c and htscore.c, flagged by a `// COPY IN HTSCORE.C` comment and drifting in whitespace. Collapse both into hts_finish_makeindex() in htscore.c, declared in htscore.h.

The two copies were not byte-identical: the final usercommand() call passed "primary","primary" from htsparse.c but "","" from htscore.c. The helper takes those as adr/fil parameters so each call site keeps its exact behavior.

Add a -#test=makeindex self-test (driven by 01_engine-makeindex.test) that drives the function offline and asserts the footer is written, the refresh meta appears only for a single first link, and *fp/*done are updated.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:49:34 +02:00
Xavier Roche
6a9ab2a11f Fix macro-hygiene defects in htsstrings.h (dup define, double-eval) (#462)
StringSubRW was defined twice (the second under a redundant comment); drop
the duplicate. StringCatN and StringSetLength each evaluated their SIZE
argument twice, so a side-effecting argument would run twice and a wrapping
expression could clamp inconsistently. Capture SIZE once into a local, the
way StringCopyN already does. StringSetLength keeps a signed local so the
"negative means strlen(buffer_)" contract is preserved.

No current call site passes a side-effecting SIZE, so the change is
behavior-preserving; the strsafe self-test now passes a (counter++, V)
argument and asserts a single evaluation, which fails on the old macros.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:49:01 +02:00
Xavier Roche
13b31986d5 tests: skip connect-fallback test on GNU/Hurd (#461)
19_local-connect-fallback synthesizes a multi-address host by pinning
"deadhost" to a dead 127.0.0.2 then the live 127.0.0.1, so httrack's
per-address connect fallback has somewhere to fall back to. That fixture
needs a second loopback IP: the dead and live addresses share the URL's
port (htslib.c applies it to every resolved address), so they must differ
by IP. Linux and macOS route all of 127/8 to loopback; GNU/Hurd has only
127.0.0.1, so the dead address fails synchronously, the fallback never
engages, and the test fails on hurd-i386.

This is a fixture limitation, not a bug in the fallback code, so skip on
GNU/Hurd (uname -s = GNU). A runtime bind/connect probe was tried first
but both wrongly skipped macOS, which connects to 127.0.0.2 fine but does
not treat it as bindable; the one-loopback-IP fact is what actually
distinguishes Hurd. hurd-i386 is a non-release port and did not block
migration.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 19:40:07 +02:00
Xavier Roche
bd7e0989f6 Parse robots.txt with RFC 9309 Allow/Disallow precedence (#458)
The robots.txt handler only did substring Disallow matching against a flat
token blob: no Allow:, no path wildcards. Sites using "Disallow: /" plus
"Allow: /public/" were over-blocked, since Allow was never parsed.

Move the body parsing into robots_parse() (htsrobots.c) so both the crawler
and a self-test feed raw robots.txt. Rules are stored Allow/Disallow-tagged
and consulted with RFC 9309 precedence: the longest matching path pattern
wins, Allow breaking ties. Pattern matching supports '*' (any run) and a
trailing '$' (end-of-path anchor) via a linear two-pointer matcher with a
single resumable star position, so hostile patterns cannot trigger
exponential backtracking. Path matching is now case-sensitive per the RFC.

robots_wizard is internal (not in DevIncludes_DATA, no HTSEXT_API; htsopt.h
holds only an opaque pointer), so the in-memory format changed without an ABI
break. Sitemap:/Crawl-delay: lines are tolerated but ignored, as before.

New -#test=robots self-test plus tests/01_engine-robots.test cover the
Allow-over-Disallow longest match, the equal-length Allow tie, '*'/'$'
wildcards, and httrack-group selection.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 09:07:54 +02:00
Xavier Roche
bd74ec7cab Advertise deflate in Accept-Encoding and decode it (#459)
The request Accept-Encoding offered only gzip even though the response
parser already recognized deflate/x-deflate. But the actual decode path
(hts_zunpack) used zlib's gzread, which only inflates gzip and copies any
deflate body through verbatim, so a deflate response would have been
written out still compressed. Advertising deflate without fixing that
would corrupt files.

Rewrite hts_zunpack to inflate via inflateInit2 with format detection:
gzip and zlib (RFC1950) auto-detect with +32 windowBits, everything else
is treated as raw deflate (RFC1951). Then add deflate to the advertised
list through a small hts_acceptencoding() helper shared with the test.

A new -#test=acceptencoding self-test asserts the advertised header
carries both gzip and deflate, and round-trips gzip, zlib and raw-deflate
bodies through hts_zunpack on disk. Both halves fail on the old binary.

Brotli is intentionally out of scope (new dependency, larger change).

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 08:54:03 +02:00
Xavier Roche
1ed8ffad64 tests: group zlib-dependent self-tests under 01_zlib-* (#460)
The MSan job runs the offline 01_engine-* self-tests but must skip any that
exercise the system zlib: uninstrumented libz floods MemorySanitizer with
false positives (MSan can't see libz initialize its own internal state, so
every byte deflate/inflate produces reads as "uninitialized"). That was a
grep -v -- '-cache' exclusion, a list that would grow with each new zlib test.

Rename the three cache tests to a 01_zlib-* prefix so the MSan job selects
01_engine-* with no exclusion list. They still run in the normal suite and
under ASan+UBSan (full make check), where uninstrumented libz is fine. The
deflate Accept-Encoding test (PR #459) follows the same convention.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 08:39:43 +02:00
Xavier Roche
b68de172fa Follow <source>/<track> as embedded near-links (#451) (#457)
src/srcset are already extracted from any element via hts_detect[], so
<source>/<track> URLs were captured. But hts_detect_embed only listed
<img src>, so at the recursion-depth boundary under --near these media
were treated as plain links and dropped, unlike <img>. Add a separate
HTML5 media table (keeping the legacy one clang-format-stable) and chain
the embed lookup through both.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 00:18:28 +02:00
Xavier Roche
aabfd34380 Refresh minor legacy constants (#453) (#456)
Three independent low-risk items under tracking issue #453:

- HTTP status enum/reason map: add 429 (Too Many Requests) and 451
  (Unavailable For Legal Reasons); appended to the HTTPStatusCode enum
  (installed header, tail-append only) and to infostatuscode_const().
- Drop the 1990s relic port 31337 from the local proxy bind fallback list.
- Pin a TLS protocol floor (TLS1.2 via SSL_CTX_set_min_proto_version, with an
  SSL_OP_NO_* fallback for OpenSSL < 1.1.0) so obsolete SSLv3/TLS1.0/1.1 aren't
  negotiated. No cert verification added: that remains by design.

Adds a -#test=status engine self-test (01_engine-status.test) asserting the
reason phrases for 429/451.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 00:12:17 +02:00
Xavier Roche
65ff9e0f11 Modernize the default User-Agent (#449) (#455)
The default UA was "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
a 25-year-old string naming a dead OS and trivially fingerprinted. Replace
it with an honest crawler UA carrying the HTTrack token and a reference URL:
"Mozilla/5.0 (compatible; HTTrack/3.x; +https://www.httrack.com/)".

Both the engine default (hts_create_opt) and the webhttrack mini-server
config default now share one HTS_DEFAULT_USER_AGENT macro, built from
HTTRACK_AFF_VERSION so the version token tracks releases without going stale.

Closes #449

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 23:18:28 +02:00
Xavier Roche
730a1c8c5b Add modern web MIME types to the type/extension table (#454)
The MIME ⇄ extension table in htslib.c had no entry for formats that are now
common: webp, avif, woff/woff2, json, wasm, mp4/webm, opus/flac, and friends.
A crawl that met these relied on the application/<subtype> fallback (or got no
extension at all), so saved files landed with wrong or missing extensions and
get_httptype guessed nothing from the extension.

The new rows live in a separate hts_mime_modern[] table rather than appended to
the legacy hts_mime[]: clang-format reflows a whole brace initializer on any
in-place edit, which would churn every untouched legacy row. A small
hts_mime_lookup() helper scans a table in either direction; give_mimext() and
get_httptype_sized() now consult the legacy table first, then the modern one,
so a new row can never shadow a legacy mapping. Legacy aliases (flash,
realaudio) stay for archiving old content.

Self-test covers both lookups; the give_mimext cases use MIME types the
application/<=4-char-subtype fallback can't fabricate, so a missing row
actually fails the assert.

Closes #448

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 22:53:48 +02:00
Xavier Roche
f9ee4702a2 tests: skip 28_local-pause without python3; add a buildd-reproducing CI job (#445)
* tests: skip 28_local-pause when python3 is absent (Debian buildd)

The local-server tests all skip with exit 77 when python3 is missing, but
28_local-pause runs local-crawl.sh inside a command-substitution with output
redirected to /dev/null, swallowing that exit-77 skip signal. On a host with
no python3 (every Debian buildd) both crawls then run serverless, finish in
0s, and the test reports FAIL instead of SKIP, turning 3.49.10-1 red on all
architectures. Add the same python3 guard the other tests get via
local-crawl.sh, up front, so it skips cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* ci: add a no-python3 job reproducing the Debian buildd chroot

GitHub runners ship python3, so every make-check job exercised the
python3-present path and the local-server tests never skipped. The Debian
buildds build in a minimal chroot with no python3, where those tests must
SKIP (exit 77) -- and 28_local-pause failed there instead, FTBFS on every
arch for 3.49.10-1, invisible to CI.

Add a job that builds, removes python3, and runs make check, so the skip
path is exercised on every PR. Verified to fail on the pre-fix tree and pass
after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 20:12:49 +02:00
74 changed files with 3217 additions and 681 deletions

View File

@@ -61,6 +61,50 @@ jobs:
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Reproduce the Debian buildds: they build in a minimal chroot with no
# python3, so the local-server tests must SKIP (exit 77), not fail. GitHub
# runners ship python3, so every other job hides this path; here we remove it
# before `make check`. This is the guard that would have caught the 3.49.10-1
# FTBFS (28_local-pause failed instead of skipping when python3 was absent).
buildd-no-python3:
name: build (no python3, Debian buildd)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev
- name: Configure
run: |
set -euo pipefail
autoreconf -fi
./configure
- name: Build
run: make -j"$(nproc)"
- name: Test without python3
run: |
set -euo pipefail
# Hide every python3* so `command -v python3` fails like it does in the
# buildd chroot; masking with /bin/false would still resolve.
sudo find /usr/bin /usr/local/bin -maxdepth 1 -name 'python3*' \
-exec mv {} {}.hidden \;
! command -v python3
make check
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Portability: build and test on macOS (Darwin/clang) on a native runner --
# no VM. The tree has no __APPLE__ branches, so Darwin exercises the
# generic-Unix path on a second libc and kernel. brew's openssl@3 is keg-only,
@@ -225,8 +269,9 @@ jobs:
MSAN_OPTIONS: abort_on_error=1:halt_on_error=1
run: |
set -euo pipefail
# Engine self-tests only; the cache trio pulls in uninstrumented zlib.
tests="$(cd tests && ls 01_engine-*.test | grep -v -- '-cache' | tr '\n' ' ')"
# 01_engine-* only; zlib-dependent self-tests are named 01_zlib-* and
# skipped here (uninstrumented libz floods MSan with false positives).
tests="$(cd tests && ls 01_engine-*.test | tr '\n' ' ')"
make check TESTS="$tests"
- name: Print the test log on failure

View File

@@ -63,6 +63,16 @@ AC_SUBST(LT_CV_OBJDIR,$lt_cv_objdir)
# Export version info
AC_SUBST(VERSION_INFO)
# Versioned plugin name for dlopen() in hts_create_opt(); soname major is
# libtool's current - age, so this tracks VERSION_INFO bumps automatically.
HTS_SONAME_MAJOR=$((${VERSION_INFO%%:*} - ${VERSION_INFO##*:}))
case "$host_os" in
darwin*) HTS_LIBHTSJAVA_NAME="libhtsjava.$HTS_SONAME_MAJOR.dylib" ;;
*) HTS_LIBHTSJAVA_NAME="libhtsjava.so.$HTS_SONAME_MAJOR" ;;
esac
AC_DEFINE_UNQUOTED([HTS_LIBHTSJAVA_NAME], ["$HTS_LIBHTSJAVA_NAME"],
[Versioned libhtsjava runtime name, derived from VERSION_INFO])
### Default CFLAGS
DEFAULT_CFLAGS="-Wall -Wformat -Wformat-security \
-Wmultichar -Wwrite-strings -Wcast-qual -Wcast-align \

3
debian/control vendored
View File

@@ -1,9 +1,8 @@
Source: httrack
Section: web
Priority: optional
Maintainer: Xavier Roche <roche@httrack.com>
Standards-Version: 4.7.4
Build-Depends: debhelper-compat (= 13), autoconf, autoconf-archive, automake, libtool, zlib1g-dev, libssl-dev
Build-Depends: debhelper-compat (= 14), autoconf, autoconf-archive, automake, libtool, zlib1g-dev, libssl-dev
Rules-Requires-Root: no
Homepage: http://www.httrack.com
Vcs-Git: https://github.com/xroche/httrack.git

View File

@@ -1,4 +1,6 @@
httrack source: changelog-should-mention-nmu
# Maintainer uploads sign the changelog as xavier@debian.org while the control
# Maintainer is roche@httrack.com; lintian reads the address mismatch as an NMU.
httrack source: no-nmu-in-changelog
httrack source: source-nmu-has-incorrect-version-number
# The bundled HTML pages are the genuine upstream documentation taken from

6
debian/upstream/metadata vendored Normal file
View File

@@ -0,0 +1,6 @@
---
Repository: https://github.com/xroche/httrack.git
Repository-Browse: https://github.com/xroche/httrack
Bug-Database: https://github.com/xroche/httrack/issues
Bug-Submit: https://github.com/xroche/httrack/issues/new
Contact: Xavier Roche <roche@httrack.com>

View File

@@ -62,7 +62,7 @@ libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
htsname.c htsrobots.c htstools.c htswizard.c \
htsalias.c htsthread.c htsindex.c htsbauth.c \
htsmd5.c htszlib.c htswrap.c htsconcat.c \
htsmodules.c htscharset.c punycode.c htsencoding.c \
htsmodules.c htscharset.c punycode.c htsencoding.c htssniff.c \
md5.c \
minizip/ioapi.c minizip/mztools.c minizip/unzip.c minizip/zip.c \
hts-indextmpl.h htsalias.h htsback.h htsbase.h htssafe.h \
@@ -70,7 +70,7 @@ libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
htsconfig.h htscore.h htsparse.h htscoremain.h htsdefines.h \
htsfilters.h htsftp.h htsglobal.h htshash.h coucal/coucal.h \
htshelp.h htsindex.h htslib.h htsmd5.h \
htsmodules.h htsname.h htsnet.h \
htsmodules.h htsname.h htsnet.h htssniff.h \
htsopt.h htsrobots.h htsthread.h \
htstools.h htswizard.h htswrap.h htszlib.h \
htsstrings.h htsarrays.h httrack-library.h \

View File

@@ -1359,6 +1359,18 @@ int back_flush_output(httrackp * opt, cache_back * cache, struct_back * sback,
}
// effacer entrée
/* Discard a cancelled mid-write .delayed placeholder (unusable across runs). */
static void back_delayed_discard(httrackp *opt, lien_back *back) {
if (back->r.out != NULL) {
fclose(back->r.out);
back->r.out = NULL;
}
back->r.is_write = 0;
if (opt != NULL)
url_savename_refname_remove(opt, back->url_adr, back->url_fil);
(void) UNLINK(back->url_sav);
}
int back_delete(httrackp * opt, cache_back * cache, struct_back * sback,
const int p) {
lien_back *const back = sback->lnk;
@@ -1366,6 +1378,12 @@ int back_delete(httrackp * opt, cache_back * cache, struct_back * sback,
assertf(p >= 0 && p < back_max);
if (p >= 0 && p < sback->count) { // on sait jamais..
/* mid-write cancel: drop a .delayed placeholder; real-named partials
survive for resume (--continue) */
if (back[p].r.is_write && IS_DELAYED_EXT(back[p].url_sav) &&
(back[p].status != STATUS_READY || back[p].r.statuscode <= 0)) {
back_delayed_discard(opt, &back[p]);
}
// Vérificateur d'intégrité
#if DEBUG_CHECKINT
_CHECKINT(&back[p], "Appel back_delete")
@@ -2237,12 +2255,13 @@ int host_wait(httrackp * opt, lien_back * back) {
static int slot_can_be_cleaned(const lien_back * back) {
return (back->status == STATUS_READY) // ready
/* Check autoclean */
&& (!back->testmode) // not test mode
&& (strnotempty(back->url_sav)) // filename exists
&& (HTTP_IS_OK(back->r.statuscode)) // HTTP "OK"
&& (back->r.size >= 0) // size>=0
;
/* Check autoclean */
&& (!back->locked) // not held by hts_wait_delayed (name pending)
&& (!back->testmode) // not test mode
&& (strnotempty(back->url_sav)) // filename exists
&& (HTTP_IS_OK(back->r.statuscode)) // HTTP "OK"
&& (back->r.size >= 0) // size>=0
;
}
static int slot_can_be_finalized(httrackp * opt, const lien_back * back) {
@@ -2418,6 +2437,34 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
back_clean(opt, cache, sback);
#endif
/* Time limit exceeded past grace: abort in-flight transfers so no wait loop
starves (#481). FTP slots stay, their thread owns the socket. */
if (!back_checkmirror(opt)) {
int aborted = 0;
unsigned int i;
for (i = 0; i < (unsigned int) back_max; i++) {
if (back[i].status > 0 && back[i].status < STATUS_FTP_TRANSFER) {
if (back[i].r.soc != INVALID_SOCKET) {
deletehttp(&back[i].r);
}
back[i].r.soc = INVALID_SOCKET;
/* drop a .delayed placeholder; real partials survive for resume */
if (back[i].r.is_write && IS_DELAYED_EXT(back[i].url_sav))
back_delayed_discard(opt, &back[i]);
back[i].r.statuscode = STATUSCODE_TIMEOUT;
strcpybuff(back[i].r.msg, "Mirror Time Out");
back[i].status = STATUS_READY;
back_set_finished(sback, i);
aborted++;
}
}
if (aborted > 0)
hts_log_print(opt, LOG_WARNING,
"time limit reached, %d transfer(s) aborted", aborted);
return;
}
// recevoir tant qu'il y a des données (avec un maximum de max_loop boucles)
do_wait = 0;
gestion_timeout = 0;
@@ -2891,10 +2938,10 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
// range size hack old location
#if HTS_DIRECTDISK
// Court-circuit:
// Peut-on stocker le fichier directement sur disque?
// Ahh que ca serait vachement mieux et que ahh que la mémoire vous dit merci!
if (back[i].status) {
// Shortcut: store the file directly on disk when possible,
// sparing memory
if (back[i].status &&
!back[i].locked) { // name still pending when locked
if (back[i].r.is_write == 0) { // mode mémoire
if (back[i].r.adr == NULL) { // rien n'a été écrit
if (!back[i].testmode) { // pas mode test
@@ -3960,8 +4007,12 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
&& (back[i].r.adr = (char *) malloct(2))) {
back[i].r.adr[0] = 0;
}
hts_log_print(opt, LOG_TRACE, "finalizing empty");
back_finalize(opt, cache, sback, i);
/* locked = name pending; the waiter finalizes after
patching url_sav (else: cached as .delayed, #5) */
if (!back[i].locked) {
hts_log_print(opt, LOG_TRACE, "finalizing empty");
back_finalize(opt, cache, sback, i);
}
} else if (!back[i].r.is_chunk) { // pas de chunk
//if (back[i].r.http11!=2) { // pas de chunk
back[i].is_chunk = 0;
@@ -4159,6 +4210,11 @@ int back_checksize(httrackp * opt, lien_back * eback, int check_only_totalsize)
return 1;
}
/* Grace left to the smooth stop before in-flight transfers are aborted. */
static int back_maxtime_grace(const int maxtime) {
return maximum(5, minimum(30, maxtime / 10));
}
int back_checkmirror(httrackp * opt) {
// Check max size
if ((opt->maxsite > 0) && (HTS_STAT.stat_bytes >= opt->maxsite)) {
@@ -4175,13 +4231,19 @@ int back_checkmirror(httrackp * opt) {
*/
}
// Check max time
if ((opt->maxtime > 0)
&& ((time_local() - HTS_STAT.stat_timestart) >= opt->maxtime)) {
if (!opt->state.stop) { /* not yet stopped */
hts_log_print(opt, LOG_ERROR, "More than %d seconds passed.. giving up",
opt->maxtime);
/* cancel mirror smoothly */
hts_request_stop(opt, 0);
if (opt->maxtime > 0) {
const TStamp elapsed = time_local() - HTS_STAT.stat_timestart;
if (elapsed >= opt->maxtime) {
if (!opt->state.stop) { /* not yet stopped */
hts_log_print(opt, LOG_ERROR, "More than %d seconds passed.. giving up",
opt->maxtime);
/* cancel mirror smoothly */
hts_request_stop(opt, 0);
}
/* smooth stop starved past the grace period: stop waiting (#481) */
if (elapsed - opt->maxtime >= back_maxtime_grace(opt->maxtime))
return 0;
}
}
return 1; /* Ok, go on */

View File

@@ -136,6 +136,8 @@ void back_solve(httrackp * opt, lien_back * sback);
int host_wait(httrackp * opt, lien_back * sback);
#endif
int back_checksize(httrackp * opt, lien_back * eback, int check_only_totalsize);
/* Enforce -M/-E quotas: requests a smooth stop when reached; returns 0 once
the -E deadline overran its grace period (callers must stop waiting). */
int back_checkmirror(httrackp * opt);
#endif

View File

@@ -129,6 +129,8 @@ typedef enum HTTPStatusCode {
HTTP_UNSUPPORTED_MEDIA_TYPE = 415,
HTTP_REQUESTED_RANGE_NOT_SATISFIABLE = 416,
HTTP_EXPECTATION_FAILED = 417,
HTTP_TOO_MANY_REQUESTS = 429,
HTTP_UNAVAILABLE_FOR_LEGAL_REASONS = 451,
HTTP_INTERNAL_SERVER_ERROR = 500,
HTTP_NOT_IMPLEMENTED = 501,
HTTP_BAD_GATEWAY = 502,

View File

@@ -596,15 +596,18 @@ htsblk cache_read_ro(httrackp * opt, cache_back * cache, const char *adr,
return cache_readex(opt, cache, adr, fil, save, location, NULL, 1);
}
htsblk cache_read_including_broken(httrackp * opt, cache_back * cache,
const char *adr, const char *fil) {
htsblk r = cache_read(opt, cache, adr, fil, NULL, NULL);
htsblk cache_read_including_broken(httrackp *opt, cache_back *cache,
const char *adr, const char *fil,
char *return_save) {
htsblk r = cache_readex(opt, cache, adr, fil, NULL, NULL, return_save, 0);
if (r.statuscode == -1) {
lien_back *itemback = NULL;
if (back_unserialize_ref(opt, adr, fil, &itemback) == 0) {
r = itemback->r;
if (return_save != NULL)
strlcpybuff(return_save, itemback->url_sav, HTS_URLMAXSIZE * 2);
/* cleanup */
back_clear_entry(itemback); /* delete entry content */
freet(itemback); /* delete item */

View File

@@ -66,8 +66,11 @@ htsblk cache_read(httrackp * opt, cache_back * cache, const char *adr,
const char *fil, const char *save, char *location);
htsblk cache_read_ro(httrackp * opt, cache_back * cache, const char *adr,
const char *fil, const char *save, char *location);
htsblk cache_read_including_broken(httrackp * opt, cache_back * cache,
const char *adr, const char *fil);
/* Like cache_read, but also yields entries whose transfer broke; return_save
(optional, HTS_URLMAXSIZE*2) receives the entry's recorded save name. */
htsblk cache_read_including_broken(httrackp *opt, cache_back *cache,
const char *adr, const char *fil,
char *return_save);
htsblk cache_readex(httrackp * opt, cache_back * cache, const char *adr,
const char *fil, const char *save, char *location,
char *return_save, int readonly);

View File

@@ -64,7 +64,7 @@ Please visit our Website: http://www.httrack.com
// catch_url_init(&port,&return_host);
HTSEXT_API T_SOC catch_url_init_std(int *port_prox, char *adr_prox) {
T_SOC soc;
int try_to_listen_to[] = { 8080, 3128, 80, 81, 82, 8081, 3129, 31337, 0, -1 };
int try_to_listen_to[] = {8080, 3128, 80, 81, 82, 8081, 3129, 0, -1};
int i = 0;
do {
@@ -175,7 +175,9 @@ HTSEXT_API hts_boolean catch_url(T_SOC soc, char *url, char *method,
//
socinput(soc, line, 1000);
if (strnotempty(line)) {
if (sscanf(line, "%s %s %s", method, url, protocol) == 3) {
/* widths bound the caller buffers: method[32], url[HTS_URLMAXSIZE*2],
protocol[256] */
if (sscanf(line, "%31s %2047s %255s", method, url, protocol) == 3) {
lien_adrfil af;
// méthode en majuscule

View File

@@ -406,29 +406,106 @@ void hts_invalidate_link(httrackp * opt, int lpos) {
opt->liens[lpos]->pass2 = -1;
}
// Write the makeindex footer (refresh meta when makeindex_links==1), close
// the file, then run usercommand.
void hts_finish_makeindex(httrackp *opt, int *makeindex_done,
FILE **makeindex_fp, int makeindex_links,
const char *makeindex_firstlink,
const char *template_footer, const char *adr,
const char *fil) {
if (!*makeindex_done) {
if (*makeindex_fp) {
char BIGSTK tempo[1024];
if (makeindex_links == 1) {
char BIGSTK link_escaped[HTS_URLMAXSIZE * 2];
escape_uri_utf(makeindex_firstlink, link_escaped, sizeof(link_escaped));
snprintf(tempo, sizeof(tempo),
"<meta HTTP-EQUIV=\"Refresh\" CONTENT=\"0; URL=%s\">" CRLF,
link_escaped);
} else
tempo[0] = '\0';
hts_template_format(*makeindex_fp, template_footer,
"<!-- Mirror and index made by HTTrack Website "
"Copier/" HTTRACK_VERSION " " HTTRACK_AFF_AUTHORS
" -->",
tempo, /* EOF */ NULL);
fflush(*makeindex_fp);
fclose(*makeindex_fp);
*makeindex_fp = NULL;
usercommand(opt, 0, NULL,
fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_html_utf8), "index.html"),
adr, fil);
}
}
*makeindex_done = 1;
}
#define HT_INDEX_END do { \
if (!makeindex_done) { \
if (makeindex_fp) { \
char BIGSTK tempo[1024]; \
if (makeindex_links == 1) { \
char BIGSTK link_escaped[HTS_URLMAXSIZE*2]; \
escape_uri_utf(makeindex_firstlink, link_escaped, sizeof(link_escaped)); \
snprintf(tempo,sizeof(tempo),"<meta HTTP-EQUIV=\"Refresh\" CONTENT=\"0; URL=%s\">"CRLF, link_escaped); \
} else \
tempo[0]='\0'; \
hts_template_format(makeindex_fp,template_footer, \
"<!-- Mirror and index made by HTTrack Website Copier/"HTTRACK_VERSION" "HTTRACK_AFF_AUTHORS" -->", \
tempo, /* EOF */ NULL \
); \
fflush(makeindex_fp); \
fclose(makeindex_fp); /* à ne pas oublier sinon on passe une nuit blanche */ \
makeindex_fp=NULL; \
usercommand(opt,0,NULL,fconcat(OPT_GET_BUFF(opt),OPT_GET_BUFF_SIZE(opt),StringBuff(opt->path_html_utf8),"index.html"),"",""); \
} \
} \
makeindex_done=1; /* ok c'est fait */ \
} while(0)
/* Flush the parsed HTML output buffer to disk, skipping the rewrite when the
* on-disk MD5 is unchanged. */
void hts_finish_html_file(httrackp *opt, cache_back *cache, htsblk *r,
FILE **fp, const char *ht_buff, size_t ht_len,
const char *adr, const char *fil, const char *save) {
char digest[32 + 2];
off_t fsize_old =
fsize(fconv(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), save));
int ok = 0;
digest[0] = '\0';
domd5mem(ht_buff, ht_len, digest, 1);
if (fsize_old == (off_t) ht_len) {
int mlen = 0;
char *mbuff;
cache_readdata(cache, "//[HTML-MD5]//", save, &mbuff, &mlen);
if (mlen)
mbuff[mlen] = '\0';
if ((mlen == 32) && (strcmp(((mbuff != NULL) ? mbuff : ""), digest) == 0)) {
ok = 1;
hts_log_print(opt, LOG_DEBUG, "File not re-written (md5): %s", save);
}
freet(mbuff);
}
if (!ok) {
file_notify(opt, adr, fil, save, 1, 1, r->notmodified);
*fp = filecreate(&opt->state.strc, save);
if (*fp) {
if (ht_len > 0 && fwrite(ht_buff, 1, ht_len, *fp) != ht_len) {
int fcheck = check_fatal_io_errno();
if (fcheck)
opt->state.exit_xh = -1;
if (opt->log) {
hts_log_print(opt, LOG_ERROR | LOG_ERRNO,
"Unable to write HTML file %s", save);
if (fcheck)
hts_log_print(opt, LOG_ERROR, "* * Fatal write error, giving up");
}
}
fclose(*fp);
*fp = NULL;
if (strnotempty(r->lastmodified))
set_filetime_rfc822(save, r->lastmodified);
} else {
int fcheck = check_fatal_io_errno();
if (fcheck) {
hts_log_print(opt, LOG_ERROR,
"Mirror aborted: disk full or filesystem problems");
opt->state.exit_xh = -1;
}
hts_log_print(opt, LOG_ERROR | LOG_ERRNO, "Unable to save file %s", save);
if (fcheck)
hts_log_print(opt, LOG_ERROR, "* * Fatal write error, giving up");
}
} else {
file_notify(opt, adr, fil, save, 0, 0, r->notmodified);
filenote(&opt->state.strc, save, NULL);
}
if (cache->ndx)
cache_writedata(cache->ndx, cache->dat, "//[HTML-MD5]//", save, digest,
(int) strlen(digest));
}
/* does it look like XML ? (SVG et al.) */
static int look_like_xml(const char *s) {
@@ -1796,90 +1873,18 @@ int httpmirror(char *url1, httrackp * opt) {
if (strnotempty(savename()) == 0) { // pas de chemin de sauvegarde
if (strcmp(urlfil(), "/robots.txt") == 0) { // robots.txt
if (r.adr) {
int bptr = 0;
char BIGSTK line[1024];
char BIGSTK buff[8192];
char BIGSTK infobuff[8192];
int record = 0;
line[0] = '\0';
buff[0] = '\0';
infobuff[0] = '\0';
//
#if DEBUG_ROBOTS
printf("robots.txt dump:\n%s\n", r.adr);
#endif
do {
char *comm;
int llen;
bptr += binput(r.adr + bptr, line, sizeof(line) - 2);
/* strip comment */
comm = strchr(line, '#');
if (comm != NULL) {
*comm = '\0';
}
/* strip spaces */
llen = (int) strlen(line);
while(llen > 0 && is_realspace(line[llen - 1])) {
line[llen - 1] = '\0';
llen--;
}
if (strfield(line, "user-agent:")) {
char *a;
a = line + 11;
while(is_realspace(*a))
a++; // sauter espace(s)
if (*a == '*') {
if (record != 2)
record = 1; // c pour nous
} else if (strfield(a, "httrack") || strfield(a, "winhttrack")
|| strfield(a, "webhttrack")) {
buff[0] = '\0'; // re-enregistrer
infobuff[0] = '\0';
record = 2; // locked
#if DEBUG_ROBOTS
printf("explicit disallow for httrack\n");
#endif
} else
record = 0;
} else if (record) {
if (strfield(line, "disallow:")) {
char *a = line + 9;
while(is_realspace(*a))
a++; // sauter espace(s)
if (strnotempty(a)) {
#ifdef IGNORE_RESTRICTIVE_ROBOTS
if (strcmp(a, "/") != 0 ||
opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
hts_boolean keep_root = (opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
? HTS_TRUE
: HTS_FALSE;
#else
hts_boolean keep_root = HTS_TRUE;
#endif
{ /* ignoring disallow: / */
if ((strlen(buff) + strlen(a) + 8) < sizeof(buff)) {
strcatbuff(buff, a);
strcatbuff(buff, "\n");
if ((strlen(infobuff) + strlen(a) + 8) <
sizeof(infobuff)) {
if (strnotempty(infobuff))
strcatbuff(infobuff, ", ");
strcatbuff(infobuff, a);
}
}
}
#ifdef IGNORE_RESTRICTIVE_ROBOTS
else {
hts_log_print(opt, LOG_NOTICE,
"Note: %s robots.txt rules are too restrictive, ignoring /",
urladr());
}
#endif
}
}
}
} while((bptr < r.size) && (strlen(buff) < (sizeof(buff) - 32)));
if (strnotempty(buff)) {
checkrobots_set(&robots, urladr(), buff);
robots_parse(&robots, urladr(), r.adr, r.size, infobuff,
sizeof(infobuff), keep_root);
if (strnotempty(infobuff)) {
hts_log_print(opt, LOG_INFO,
"Note: robots.txt forbidden links for %s are: %s",
urladr(), infobuff);
@@ -2116,7 +2121,8 @@ int httpmirror(char *url1, httrackp * opt) {
/*
Ensure the index is being closed
*/
HT_INDEX_END;
hts_finish_makeindex(opt, &makeindex_done, &makeindex_fp, makeindex_links,
makeindex_firstlink, template_footer, "", "");
/*
updating-a-remotely-deteted-website hack

View File

@@ -362,6 +362,20 @@ void usercommand(httrackp * opt, int exe, const char *cmd, const char *file,
void usercommand_exe(const char *cmd, const char *file);
// Finish the makeindex index.html (footer + refresh meta), run usercommand.
// Updates *makeindex_done/*makeindex_fp in place; adr/fil are the mode strings.
void hts_finish_makeindex(httrackp *opt, int *makeindex_done,
FILE **makeindex_fp, int makeindex_links,
const char *makeindex_firstlink,
const char *template_footer, const char *adr,
const char *fil);
// Flush ht_buff[0..ht_len] to save on disk (skip if MD5 unchanged); *fp
// closed+NULLed on write. Precondition: ht_len>0.
void hts_finish_html_file(httrackp *opt, cache_back *cache, htsblk *r,
FILE **fp, const char *ht_buff, size_t ht_len,
const char *adr, const char *fil, const char *save);
int filters_init(char ***ptrfilters, int maxfilter, int filterinc);
int fspc(httrackp * opt, FILE * fp, const char *type);
@@ -470,4 +484,8 @@ void voidf(void);
/* HTML marker comment marking where the top index is spliced. */
#define HTS_TOPINDEX "TOP_INDEX_HTTRACK"
/* Worst-case byte expansion HT_ADD_HTMLESCAPED* must reserve per escaper. */
#define HTS_HTMLESCAPE_MAXEXP 5 /* escape_for_html_print: '&'->"&amp;" */
#define HTS_HTMLESCAPE_FULL_MAXEXP 6 /* _full: high byte->"&#xHH;" */
#endif

View File

@@ -69,11 +69,15 @@ typedef struct t_hts_callbackarg t_hts_callbackarg;
typedef struct t_hts_callbackarg t_hts_callbackarg;
#endif
/* Marks a symbol an external wrapper module exports back to the engine
(dllexport on Windows, nothing elsewhere). */
/* Marks a symbol an external wrapper module exports back to the engine.
Must override -fvisibility=hidden on ELF, or dlopen()ed plugins (htsjava)
hide their own hts_plug()/hts_unplug() entry points. */
#ifndef EXTERNAL_FUNCTION
#ifdef _WIN32
#define EXTERNAL_FUNCTION __declspec(dllexport)
#elif ((defined(__GNUC__) && (__GNUC__ >= 4)) || \
(defined(HAVE_VISIBILITY) && HAVE_VISIBILITY))
#define EXTERNAL_FUNCTION __attribute__((visibility("default")))
#else
#define EXTERNAL_FUNCTION
#endif

View File

@@ -190,9 +190,9 @@ int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t ma
}
}
}
/* copy */
if (j + 1 > max) {
/* reserve one byte for the trailing NUL written after the loop */
if (j + 1 >= max) {
/* overflow */
return -1;
}
@@ -300,6 +300,11 @@ int hts_unescapeUrlSpecial(const char *src, char *dest, const size_t max,
/* Was the character read successfully ? */
if (nRead == utfBufferSize) {
/* the 'continue' below skips the NUL-reserve guard: re-check */
if (utfBufferJ + utfBufferSize >= max) {
return -1;
}
/* Rollback write position to sequence start write position */
j = utfBufferJ;
@@ -314,8 +319,8 @@ int hts_unescapeUrlSpecial(const char *src, char *dest, const size_t max,
}
}
/* Check for overflow */
if (j + 1 > max) {
/* reserve one byte for the trailing NUL written after the loop */
if (j + 1 >= max) {
return -1;
}

View File

@@ -128,6 +128,33 @@ void launch_ftp(FTPDownloadStruct * params) {
return 0; \
}
/* Bounded split of a hostile-URL "user[:pass]@" prefix (see htsftp.h). */
void ftp_split_userpass(const char *src, const char *end, char *user,
size_t user_size, char *pass, size_t pass_size) {
size_t n = 0;
assertf(user_size > 0 && pass_size > 0); /* the size-1 math underflows on 0 */
while (src[n] != '\0' && src[n] != ':') {
if (n < user_size - 1)
user[n] = src[n];
n++;
}
user[n < user_size ? n : user_size - 1] = '\0';
pass[0] = '\0';
if (src[n] == ':') { // password follows the colon
const size_t base = n + 1;
size_t k = 0;
while (&src[base + k + 1] < end && src[base + k] != '\0') {
if (k < pass_size - 1)
pass[k] = src[base + k];
k++;
}
pass[k < pass_size ? k : pass_size - 1] = '\0';
}
}
// la véritable fonction une fois lancées les routines thread/fork
int run_launch_ftp(FTPDownloadStruct * pStruct) {
lien_back *back = pStruct->pBack;
@@ -173,24 +200,7 @@ int run_launch_ftp(FTPDownloadStruct * pStruct) {
while(*real_adr == '/')
real_adr++; // sauter /
if ((adr = jump_identification(real_adr)) != real_adr) { // user
int i = -1;
pass[0] = '\0';
do {
i++;
user[i] = real_adr[i];
} while((real_adr[i] != ':') && (real_adr[i]));
user[i] = '\0';
if (real_adr[i] == ':') { // pass
int j = -1;
i++; // oui on saute aussi le :
do {
j++;
pass[j] = real_adr[i + j];
} while(((&real_adr[i + j + 1]) < adr) && (real_adr[i + j]));
pass[j] = '\0';
}
ftp_split_userpass(real_adr, adr, user, sizeof(user), pass, sizeof(pass));
}
// Calculer RETR <nom>
{
@@ -984,8 +994,8 @@ int get_ftp_line(T_SOC soc, char *ptrline, size_t line_size, int timeout) {
//case 0: break; // pas encore --> erreur (on attend)!
case 1:
HTS_STAT.HTS_TOTAL_RECV += 1; // compter flux entrant
if ((b != 10) && (b != 13))
data[i++] = b;
if ((b != 10) && (b != 13) && (i < (int) sizeof(data) - 1))
data[i++] = b; // truncate hostile over-long reply lines
break;
default:
if (ptrline)

View File

@@ -70,6 +70,11 @@ int back_launch_ftp(FTPDownloadStruct * params);
int run_launch_ftp(FTPDownloadStruct * params);
int send_line(T_SOC soc, const char *data);
int get_ftp_line(T_SOC soc, char *line, size_t line_size, int timeout);
/* Split a "user[:pass]@" prefix (end = jump_identification result) into
bounded, NUL-terminated user/pass buffers, truncating to fit.
Both sizes must be nonzero. */
void ftp_split_userpass(const char *src, const char *end, char *user,
size_t user_size, char *pass, size_t pass_size);
T_SOC get_datasocket(char *to_send, size_t to_send_size);
int stop_ftp(lien_back * back);
char *linejmp(char *line);

View File

@@ -229,6 +229,10 @@ Please visit our Website: http://www.httrack.com
#define HTS_DEFAULT_FOOTER \
"<!-- Mirrored from %s%s by HTTrack Website Copier/" HTTRACK_AFF_VERSION \
" " HTTRACK_AFF_AUTHORS ", %s -->"
/* Honest crawler User-Agent; no fake OS/browser to go stale. */
#define HTS_DEFAULT_USER_AGENT \
"Mozilla/5.0 (compatible; HTTrack/" HTTRACK_AFF_VERSION \
"; +https://www.httrack.com/)"
#define HTTRACK_WEB "http://www.httrack.com"
#define HTS_UPDATE_WEBSITE \
"http://www.httrack.com/" \

View File

@@ -63,6 +63,9 @@ Please visit our Website: http://www.httrack.com
/* This file */
#include "htsjava.h"
/* calloct/freet wrappers */
#include "htssafe.h"
static int reverse_endian(void) {
int endian = 1;
@@ -204,7 +207,16 @@ static int hts_parse_java(t_hts_callbackarg * carg, httrackp * opt,
return 0;
}
tab = (RESP_STRUCT *) calloc(header.count, sizeof(RESP_STRUCT));
/* A constant-pool entry is >= 1 byte on disk; reject a count exceeding
the file size (hostile .class ~68 MB alloc DoS). */
if (!hts_count_fits(header.count, (LLint) fsize(file))) {
fclose(fpout);
sprintf(str->err_msg,
"Invalid constant pool count %u (file len " LLintP ")",
(unsigned) header.count, (LLint) fsize(file));
return 0;
}
tab = (RESP_STRUCT *) calloct(header.count, sizeof(RESP_STRUCT));
if (!tab) {
sprintf(str->err_msg, "Unable to alloc %d bytes",
(int) sizeof(RESP_STRUCT));
@@ -230,7 +242,7 @@ static int hts_parse_java(t_hts_callbackarg * carg, httrackp * opt,
} else { // ++ une erreur est survenue!
if (strnotempty(str->err_msg) == 0)
strcpy(str->err_msg, "Internal readtable error");
free(tab);
freet(tab);
if (fpout) {
fclose(fpout);
fpout = NULL;
@@ -288,7 +300,7 @@ static int hts_parse_java(t_hts_callbackarg * carg, httrackp * opt,
#if JAVADEBUG
printf("end\n");
#endif
free(tab);
freet(tab);
if (fpout) {
fclose(fpout);
fpout = NULL;

View File

@@ -33,15 +33,19 @@ Please visit our Website: http://www.httrack.com
#ifndef HTSJAVA_DEFH
#define HTSJAVA_DEFH
#include <stdint.h>
#ifndef HTS_DEF_FWSTRUCT_JAVA_HEADER
#define HTS_DEF_FWSTRUCT_JAVA_HEADER
typedef struct JAVA_HEADER JAVA_HEADER;
#endif
/* 10-byte on-disk .class header image, fread() directly: fields need exact
widths (LP64's 8-byte 'unsigned long' magic never matched 0xCAFEBABE). */
struct JAVA_HEADER {
unsigned long int magic;
unsigned short int minor;
unsigned short int major;
unsigned short int count;
uint32_t magic;
uint16_t minor;
uint16_t major;
uint16_t count;
};
#ifndef HTS_DEF_FWSTRUCT_RESP_STRUCT

View File

@@ -563,6 +563,39 @@ const char *hts_mime[][2] = {
{"", ""}
};
/* Modern web formats (post-2010), kept in their own table: appending to the
legacy hts_mime[] above makes clang-format reflow its whole initializer.
Scanned after hts_mime[], so it never shadows a legacy mapping. */
static const char *hts_mime_modern[][2] = {
{"image/webp", "webp"},
{"image/avif", "avif"},
{"image/heic", "heic"},
{"font/woff", "woff"},
{"font/woff2", "woff2"},
{"font/ttf", "ttf"},
{"font/otf", "otf"},
{"application/json", "json"},
{"application/ld+json", "jsonld"},
{"application/manifest+json", "webmanifest"},
{"application/wasm", "wasm"},
{"text/javascript", "js"},
{"text/javascript", "mjs"},
{"text/markdown", "md"},
{"video/mp4", "mp4"},
{"video/webm", "webm"},
{"video/ogg", "ogv"},
{"video/mp2t", "ts"},
{"audio/mp4", "m4a"},
{"audio/aac", "aac"},
{"audio/ogg", "oga"},
{"audio/opus", "opus"},
{"audio/flac", "flac"},
{"audio/webm", "weba"},
{"application/x-7z-compressed", "7z"},
{"application/x-rar-compressed", "rar"},
{"application/zstd", "zst"},
{"", ""}};
// Reserved (RFC2396)
#define CIS(c,ch) ( ((unsigned char)(c)) == (ch) )
#define CHAR_RESERVED(c) ( CIS(c,';') \
@@ -1116,7 +1149,8 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
char BIGSTK protocol[256], url[HTS_URLMAXSIZE * 2], method[256];
linput(fp, line, 1000);
if (sscanf(line, "%s %s %s", method, url, protocol) == 3) {
/* widths bound method[256], url[HTS_URLMAXSIZE*2], protocol[256] */
if (sscanf(line, "%255s %2047s %255s", method, url, protocol) == 3) {
size_t ret;
// selon que l'on a ou pas un proxy
if (retour->req.proxy.active) {
@@ -1293,16 +1327,12 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
// Compression accepted ?
if (retour->req.http11) {
hts_boolean compressible = HTS_FALSE;
#if HTS_USEZLIB
if ((!retour->req.range_used)
&& (!retour->req.nocompression))
print_buffer(&bstr, "Accept-Encoding: " "gzip" /* gzip if the preffered encoding */
", " "identity;q=0.9" H_CRLF);
else
print_buffer(&bstr, "Accept-Encoding: identity" H_CRLF); /* no compression */
#else
print_buffer(&bstr, "Accept-Encoding: identity" H_CRLF); /* no compression */
compressible = (!retour->req.range_used && !retour->req.nocompression);
#endif
print_buffer(&bstr, "Accept-Encoding: %s" H_CRLF,
hts_acceptencoding(compressible));
}
/* Authentification */
@@ -1918,6 +1948,10 @@ HTSEXT_API const char *infostatuscode_const(int statuscode) {
return "Requested Range Not Satisfiable";
case 417:
return "Expectation Failed";
case 429:
return "Too Many Requests";
case 451:
return "Unavailable For Legal Reasons";
case 500:
return "Internal Server Error";
case 501:
@@ -4098,25 +4132,33 @@ DECLARE_APPEND_ESCAPE_VERSION(escape_uri)
#undef DECLARE_APPEND_ESCAPE_VERSION
// Same as above, but in-place
#undef DECLARE_INPLACE_ESCAPE_VERSION
#define DECLARE_INPLACE_ESCAPE_VERSION(NAME) \
HTSEXT_API size_t inplace_ ##NAME(char *const dest, const size_t size) { \
char buffer[256]; \
const size_t len = strnlen(dest, size); \
const int in_buffer = len + 1 < sizeof(buffer); \
char *src = in_buffer ? buffer : malloct(len + 1); \
size_t ret; \
assertf(src != NULL); \
assertf(len < size); \
memcpy(src, dest, len + 1); \
ret = NAME(src, dest, size); \
if (!in_buffer) { \
freet(src); \
} \
return ret; \
// In-place escaping: copy dest aside, then escape that copy back into dest.
typedef size_t (*escape_fn_t)(const char *src, char *dest, size_t size);
static size_t inplace_escape(char *const dest, const size_t size,
escape_fn_t escape) {
char buffer[256];
const size_t len = strnlen(dest, size);
const int in_buffer = len + 1 < sizeof(buffer);
char *src = in_buffer ? buffer : malloct(len + 1);
size_t ret;
assertf(src != NULL);
assertf(len < size);
memcpy(src, dest, len + 1);
ret = escape(src, dest, size);
if (!in_buffer) {
freet(src);
}
return ret;
}
// Thin exported wrappers binding inplace_escape() to each escaper (ABI).
#undef DECLARE_INPLACE_ESCAPE_VERSION
#define DECLARE_INPLACE_ESCAPE_VERSION(NAME) \
HTSEXT_API size_t inplace_##NAME(char *const dest, const size_t size) { \
return inplace_escape(dest, size, NAME); \
}
DECLARE_INPLACE_ESCAPE_VERSION(escape_in_url)
DECLARE_INPLACE_ESCAPE_VERSION(escape_spc_url)
DECLARE_INPLACE_ESCAPE_VERSION(escape_uri_utf)
@@ -4308,6 +4350,20 @@ void guess_httptype(httrackp * opt, char *s, const char *fil) {
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, 1);
}
// first match in a NUL-terminated {mime,ext} table. key selects the lookup
// column (0=mime, 1=ext); returns the other column, or NULL if no row matches
// (a "*" partner means the row carries no value).
static const char *hts_mime_lookup(const char *(*table)[2], int key,
const char *needle) {
int j;
for (j = 0; strnotempty(table[j][1]); j++) {
if (strfield2(table[j][key], needle) && table[j][!key][0] != '*')
return table[j][!key];
}
return NULL;
}
// write the mime type for fil into s (capacity ssize)
// flag: 1 to always return a type (the "application/..." / octet-stream
// fallback) returns 1 if a type was written to s, 0 otherwise
@@ -4331,17 +4387,15 @@ HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
while ((a > fil) && (*a != '.') && (*a != '/'))
a--;
if (a >= fil && *a == '.' && strlen(a) < 32) {
int j = 0;
const char *mime;
a++;
while(strnotempty(hts_mime[j][1])) {
if (strfield2(hts_mime[j][1], a)) {
if (hts_mime[j][0][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][0], ssize);
return 1;
}
}
j++;
mime = hts_mime_lookup(hts_mime, 1, a);
if (mime == NULL)
mime = hts_mime_lookup(hts_mime_modern, 1, a);
if (mime != NULL) {
strlcpybuff(s, mime, ssize);
return 1;
}
if (flag) {
@@ -4365,6 +4419,11 @@ HTSEXT_API void get_httptype(httrackp *opt, char *s, const char *fil,
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, flag);
}
/* Advertised Accept-Encoding; gzip and deflate both decode via hts_zunpack */
const char *hts_acceptencoding(hts_boolean compressible) {
return compressible ? "gzip, deflate, identity;q=0.9" : "identity";
}
// get type of fil (php)
// s: buffer (text/html) or NULL
// return: 1 if known by user
@@ -4476,18 +4535,16 @@ int get_userhttptype(httrackp * opt, char *s, const char *fil) {
// returns 1 if an extension was found (and written to s), 0 otherwise
int give_mimext(char *s, size_t ssize, const char *st) {
int ok = 0;
int j = 0;
const char *ext;
st = hts_effective_mime(st); /* no declared type: derive an html ext */
s[0] = '\0';
while((!ok) && (strnotempty(hts_mime[j][1]))) {
if (strfield2(hts_mime[j][0], st)) {
if (hts_mime[j][1][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][1], ssize);
ok = 1;
}
}
j++;
ext = hts_mime_lookup(hts_mime, 0, st);
if (ext == NULL)
ext = hts_mime_lookup(hts_mime_modern, 0, st);
if (ext != NULL) {
strlcpybuff(s, ext, ssize);
ok = 1;
}
// wrap "x" mimetypes, such as:
// application/x-mp3
@@ -5754,6 +5811,13 @@ HTSEXT_API int hts_init(void) {
abortLog("unable to initialize TLS: SSL_CTX_new()");
assertf("unable to initialize TLS" == NULL);
}
/* Pin a TLS floor (no SSLv3/TLS1.0/1.1); no cert verify, by design. */
#if OPENSSL_VERSION_NUMBER >= 0x10100000L
SSL_CTX_set_min_proto_version(openssl_ctx, TLS1_2_VERSION);
#else
SSL_CTX_set_options(openssl_ctx, SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3 |
SSL_OP_NO_TLSv1 | SSL_OP_NO_TLSv1_1);
#endif
}
#endif
@@ -5959,9 +6023,11 @@ HTSEXT_API httrackp *hts_create_opt(void) {
"htsswf", "htsjava", "httrack-plugin", NULL
};
#else
static const char *defaultModules[] = {
"libhtsswf.so.1", "libhtsjava.so.2", "httrack-plugin", NULL
};
#ifndef HTS_LIBHTSJAVA_NAME
#define HTS_LIBHTSJAVA_NAME "libhtsjava.so" /* non-autoconf fallback */
#endif
static const char *defaultModules[] = {"libhtsswf.so.1", HTS_LIBHTSJAVA_NAME,
"httrack-plugin", NULL};
#endif
httrackp *opt = malloc(sizeof(httrackp));
@@ -6005,8 +6071,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->shell = HTS_FALSE;
opt->proxy.active = 0; // pas de proxy
opt->user_agent_send = HTS_TRUE;
StringCopy(opt->user_agent,
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->user_agent, HTS_DEFAULT_USER_AGENT);
StringCopy(opt->referer, "");
StringCopy(opt->from, "");
opt->savename_83 = HTS_SAVENAME_83_LONG; // long names by default

View File

@@ -285,6 +285,9 @@ int ishttperror(int err);
int get_userhttptype(httrackp * opt, char *s, const char *fil);
int give_mimext(char *s, size_t ssize, const char *st);
/* Advertised Accept-Encoding value (no header name/CRLF); see htslib.c. */
const char *hts_acceptencoding(hts_boolean compressible);
int may_bogus_multiple(httrackp * opt, const char *mime, const char *filename);
int may_unknown2(httrackp * opt, const char *mime, const char *filename);

View File

@@ -41,6 +41,10 @@ Please visit our Website: http://www.httrack.com
#include "htstools.h"
#include "htscharset.h"
#include "htsencoding.h"
#include "htssniff.h"
#if HTS_USEZLIB
#include "htszlib.h"
#endif
#include <ctype.h>
#define ADD_STANDARD_PATH \
@@ -70,30 +74,36 @@ static const char *hts_tbdev[] = {
""
};
#define URLSAVENAME_WAIT_FOR_AVAILABLE_SOCKET() do { \
int prev = opt->state._hts_in_html_parsing; \
while(back_pluggable_sockets_strict(sback, opt) <= 0) { \
opt->state. _hts_in_html_parsing = 6; \
/* Wait .. */ \
back_wait(sback,opt,cache,0); \
/* Transfer rate */ \
engine_stats(); \
/* Refresh various stats */ \
HTS_STAT.stat_nsocket=back_nsoc(sback); \
HTS_STAT.stat_errors=fspc(opt,NULL,"error"); \
HTS_STAT.stat_warnings=fspc(opt,NULL,"warning"); \
HTS_STAT.stat_infos=fspc(opt,NULL,"info"); \
HTS_STAT.nbk=backlinks_done(sback,opt->liens,opt->lien_tot,ptr); \
HTS_STAT.nb=back_transferred(HTS_STAT.stat_bytes,sback); \
/* Check */ \
{ \
if (!RUN_CALLBACK7(opt, loop, sback->lnk, sback->count,-1,ptr,opt->lien_tot,(int) (time_local()-HTS_STAT.stat_timestart),&HTS_STAT)) { \
return -1; \
} \
} \
} \
opt->state._hts_in_html_parsing = prev; \
} while(0)
#define URLSAVENAME_WAIT_FOR_AVAILABLE_SOCKET() \
do { \
int prev = opt->state._hts_in_html_parsing; \
while (back_pluggable_sockets_strict(sback, opt) <= 0) { \
opt->state._hts_in_html_parsing = 6; \
/* Wait .. */ \
back_wait(sback, opt, cache, 0); \
/* time limit (-E) exceeded: stop waiting for a socket (#481) */ \
if (!back_checkmirror(opt)) \
break; \
/* Transfer rate */ \
engine_stats(); \
/* Refresh various stats */ \
HTS_STAT.stat_nsocket = back_nsoc(sback); \
HTS_STAT.stat_errors = fspc(opt, NULL, "error"); \
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning"); \
HTS_STAT.stat_infos = fspc(opt, NULL, "info"); \
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr); \
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback); \
/* Check */ \
{ \
if (!RUN_CALLBACK7( \
opt, loop, sback->lnk, sback->count, -1, ptr, opt->lien_tot, \
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)) { \
return -1; \
} \
} \
} \
opt->state._hts_in_html_parsing = prev; \
} while (0)
/* Strip all // */
static void cleanDoubleSlash(char *s) {
@@ -138,37 +148,191 @@ static void cleanEndingSpaceOrDot(char *s) {
}
}
/* Should the wire Content-Type override the URL's own extension when naming the
saved file? True when the type is patchable (may_unknown2) and either the URL
extension implies no specific type or the server declared a disagreeing one.
A URL extension mapping to a specific non-HTML type is kept only when the
server declared NO type (the HTS_UNKNOWN_MIME sentinel; the #267 mangle
guard): a typeless .png stays .png, but a .pdf explicitly served as text/html
is named .html. The sentinel rides the cache, so updates stay consistent. */
static int wire_patches_ext(httrackp *opt, const char *wiremime,
const char *file) {
char urlmime[256];
/* Wire Content-Type vs URL extension: a patchable wire type wins over an
unspecific ext, the HTS_UNKNOWN_MIME sentinel keeps a specific non-HTML ext
(#267 guard), a declared disagreement is CONTESTED (sniffed below). */
typedef enum wire_verdict {
WIRE_KEEPS_EXT,
WIRE_WINS,
WIRE_CONTESTED
} wire_verdict;
static wire_verdict wire_ext_verdict(httrackp *opt, const char *wiremime,
const char *file, char *urlmime,
size_t urlmime_size) {
if (may_unknown2(opt, wiremime, file))
return 0; /* type kept verbatim (keep-list / bogus-multiple) */
return WIRE_KEEPS_EXT; /* type kept verbatim (keep-list / bogus-multiple) */
urlmime[0] = '\0';
/* type implied by the URL extension, only when confidently known (flag 0) */
if (!get_httptype_sized(opt, urlmime, sizeof(urlmime), file, 0))
return 1; /* URL ext implies no known type: trust the wire type */
if (!get_httptype_sized(opt, urlmime, urlmime_size, file, 0))
return WIRE_WINS; /* URL ext implies no known type */
if (strfield2(wiremime, urlmime))
return 0; /* wire agrees with the ext: keep it (no .htm->.html churn) */
/* wire disagrees with a specific non-HTML URL ext. Keep the ext only when
the server declared no type (the sentinel); an explicitly declared type,
even text/html, is trusted, so a binary-looking URL that really serves
HTML (login/error interstitial, soft-404) is named .html. */
return WIRE_KEEPS_EXT; /* agreement (no .htm->.html churn) */
if (!is_hypertext_mime(opt, urlmime, file) &&
strfield2(wiremime, HTS_UNKNOWN_MIME))
return WIRE_KEEPS_EXT; /* no declared type */
return WIRE_CONTESTED;
}
/* Optional evidence for a contested wire-vs-ext verdict. */
typedef struct sniff_src {
struct_back *sback; /* live backing (looked up by adr/fil) */
const lien_back *headers; /* snapshot: r.adr, else the url_sav file */
const char *adr, *fil;
const char *prev_save; /* previous run's save name (cache X-Save) */
} sniff_src;
#if HTS_USEZLIB
/* Inflate the head of a gzip/zlib stream; 0 when undecodable. */
static size_t sniff_inflate_head(const void *in, size_t in_len, void *out,
size_t out_len) {
z_stream zs;
size_t n = 0;
int err;
memset(&zs, 0, sizeof(zs));
if (inflateInit2(&zs, 47) != Z_OK) /* 47: gzip or zlib, autodetected */
return 0;
zs.next_in = (const Bytef *) in;
zs.avail_in = (uInt) in_len;
zs.next_out = (Bytef *) out;
zs.avail_out = (uInt) out_len;
err = inflate(&zs, Z_SYNC_FLUSH);
if (err == Z_OK || err == Z_STREAM_END || err == Z_BUF_ERROR)
n = out_len - zs.avail_out;
inflateEnd(&zs);
return n;
}
#endif
static size_t sniff_read_head(const char *path, void *buf, size_t len) {
char catbuff[CATBUFF_SIZE];
FILE *const fp = FOPEN(fconv(catbuff, sizeof(catbuff), path), "rb");
size_t n = 0;
if (fp != NULL) {
n = fread(buf, 1, len, fp);
fclose(fp);
}
return n;
}
/* Body head of one slot: memory, else its flushed on-disk file (url_sav, or
tmpfile for a compressed stream); inflated so the sniff sees the final body.
*/
static size_t sniff_slot_head(const lien_back *slot, void *buf, size_t len) {
const htsblk *const r = &slot->r;
size_t n = 0;
if (r->adr != NULL && r->size > 0) {
n = (size_t) r->size < len ? (size_t) r->size : len;
memcpy(buf, r->adr, n);
} else {
if (r->out != NULL)
fflush(r->out);
if (slot->url_sav[0] != '\0')
n = sniff_read_head(slot->url_sav, buf, len);
if (n == 0 && slot->tmpfile != NULL && slot->tmpfile[0] != '\0')
n = sniff_read_head(slot->tmpfile, buf, len);
}
if (n > 0 && r->compressed) {
#if HTS_USEZLIB
unsigned char raw[HTS_SNIFF_LEN];
if (n > sizeof(raw))
n = sizeof(raw);
memcpy(raw, buf, n);
n = sniff_inflate_head(raw, n, buf, len);
#else
n = 0;
#endif
}
return n;
}
/* Up to len leading body bytes; 0 when unavailable, and always in
non-delayed mode (its HEAD-probe first run couldn't sniff either). */
static size_t sniff_body_head(httrackp *opt, const sniff_src *src, void *buf,
size_t len) {
size_t n = 0;
if (src == NULL || opt->savename_delayed == HTS_SAVENAME_DELAYED_NONE)
return 0;
/* live backing slot: a snapshot (back_copy_static) loses r.adr/r.out */
if (src->sback != NULL && src->adr != NULL && src->fil != NULL) {
const int b = back_index(opt, src->sback, src->adr, src->fil, NULL);
if (b >= 0)
n = sniff_slot_head(&src->sback->lnk[b], buf, len);
}
if (n == 0 && src->headers != NULL)
n = sniff_slot_head(src->headers, buf, len);
return n;
}
/* Contested verdicts: magic proving the URL ext keeps it, else wire wins. */
static int wire_patches_ext(httrackp *opt, const sniff_src *src,
const char *wiremime, const char *file) {
char urlmime[256];
switch (wire_ext_verdict(opt, wiremime, file, urlmime, sizeof(urlmime))) {
case WIRE_KEEPS_EXT:
return 0;
case WIRE_WINS:
return 1;
case WIRE_CONTESTED:
break;
}
if (src != NULL) {
if (hts_sniff_mime_known(urlmime)) {
unsigned char head[HTS_SNIFF_LEN];
const size_t n = sniff_body_head(opt, src, head, sizeof(head));
if (n > 0)
return hts_sniff_mime_consistent(head, n, urlmime) ? 0 : 1;
}
/* no bytes: reproduce the previous run's verdict (cached X-Save name) */
if (src->prev_save != NULL && src->prev_save[0] != '\0') {
char prevmime[256];
prevmime[0] = '\0';
if (get_httptype_sized(opt, prevmime, sizeof(prevmime), src->prev_save,
0) &&
strfield2(prevmime, urlmime))
return 0;
}
}
return 1;
}
// forme le nom du fichier à sauver (save) à partir de fil et adr
// système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html)
int hts_ext_sniff_wanted(httrackp *opt, const char *wiremime,
const char *file) {
char urlmime[256];
return wiremime != NULL && strnotempty(wiremime) &&
wire_ext_verdict(opt, wiremime, file, urlmime, sizeof(urlmime)) ==
WIRE_CONTESTED &&
hts_sniff_mime_known(urlmime);
}
/* Wire-metadata name change: a Content-Disposition filename wins (returns 2),
else the declared type's ext when wire_patches_ext() allows (returns 1),
else 0. ext receives the new extension or replacement filename. */
static int resolve_extension(httrackp *opt, const sniff_src *src,
const char *cdispo, const char *contenttype,
const char *fil, char *ext, size_t ext_size) {
if (strnotempty(cdispo)) {
strlcpybuff(ext, cdispo, ext_size);
return 2;
}
if (wire_patches_ext(opt, src, contenttype, fil) &&
give_mimext(ext, ext_size, contenttype))
return 1;
return 0;
}
// Build the local save name (save) from adr/fil; renames on collision
// (e.g. INDEX.HTML vs index.html).
int url_savename(lien_adrfilsave *const afs,
lien_adrfil *const former,
const char *referer_adr, const char *referer_fil,
@@ -405,45 +569,30 @@ int url_savename(lien_adrfilsave *const afs,
// si option check_type activée
if (is_html < 0 && opt->check_type && !ext_chg) {
int ishtest = 0;
if (protocol != PROTOCOL_FILE
&& protocol != PROTOCOL_FTP
) {
// tester type avec requète HEAD si on ne connait pas le type du fichier
if (!((opt->check_type == 1) && (fil[strlen(fil) - 1] == '/'))) // slash doit être html?
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD ||
(ishtest = ishtml(opt, fil)) <
0) { // unsure whether it's html or a file
ishtml(opt, fil) < 0) { // unsure whether it's html or a file
// lire dans le cache
htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test uniquement
char BIGSTK previous_save[HTS_URLMAXSIZE * 2];
htsblk r;
if (r.statuscode != -1) { // pas d'erreur de lecture cache
char s[32];
previous_save[0] = '\0';
r = cache_read_including_broken(opt, cache, adr, fil,
previous_save); // test uniquement
s[0] = '\0';
if (r.statuscode != -1) { // cache entry read OK
hts_log_print(opt, LOG_DEBUG, "Testing link type (from cache) %s%s",
adr_complete, fil_complete);
if (!HTTP_IS_REDIRECT(r.statuscode)) {
if (strnotempty(r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, r.cdispo);
} else if (wire_patches_ext(opt, r.contenttype, fil)) {
if (give_mimext(s, sizeof(s),
r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
}
const sniff_src src = {sback, NULL, adr, fil, previous_save};
ext_chg = resolve_extension(opt, &src, r.cdispo, r.contenttype,
fil, ext, sizeof(ext));
}
#ifdef DEFAULT_BIN_EXT
// no extension and potentially bogus
else if (ishtest == -2) {
ext_chg = 1;
strcpybuff(ext, DEFAULT_BIN_EXT + 1);
}
#endif
//
} else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD &&
is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
Lookup mimetype not only by extension,
@@ -467,22 +616,13 @@ int url_savename(lien_adrfilsave *const afs,
// fail later
else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
!opt->state.stop) {
// Check if the file is ready in backing. We basically take the same logic as later.
// FIXME: we should cleanup and factorize this unholy mess
// Check if the file is ready in backing.
if (headers != NULL && headers->status >= 0 && !is_redirect) {
if (strnotempty(headers->r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, headers->r.cdispo);
} else if (wire_patches_ext(opt, headers->r.contenttype,
headers->url_fil)) {
char s[16];
if (give_mimext(
s, sizeof(s),
headers->r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
}
const sniff_src src = {sback, headers, adr, fil, NULL};
ext_chg = resolve_extension(opt, &src, headers->r.cdispo,
headers->r.contenttype,
headers->url_fil, ext, sizeof(ext));
}
else if (mime_type != NULL) {
ext[0] = '\0';
@@ -500,13 +640,6 @@ int url_savename(lien_adrfilsave *const afs,
if (!may_unknown2(opt, mime_type, fil)) {
ext_chg = 1;
}
#ifdef DEFAULT_BIN_EXT
// no extension and potentially bogus
else if (ishtml(opt, fil) == -2) {
ext_chg = 1;
strcpybuff(ext, DEFAULT_BIN_EXT + 1);
}
#endif
} else {
ext_chg = 0;
}
@@ -696,30 +829,10 @@ int url_savename(lien_adrfilsave *const afs,
// libérer emplacement backing
}
{ // pas d'erreur, changer type?
char s[16];
s[0] = '\0';
if (strnotempty(back[b].r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, back[b].r.cdispo);
} else if (wire_patches_ext(opt, back[b].r.contenttype,
back[b].url_fil)) {
if (give_mimext(
s, sizeof(s),
back[b].r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
}
#ifdef DEFAULT_BIN_EXT
// no extension and potentially bogus
else if (ishtest == -2) {
ext_chg = 1;
strcpybuff(ext, DEFAULT_BIN_EXT + 1);
}
#endif
}
// no error: change the type?
ext_chg = resolve_extension(
opt, NULL, back[b].r.cdispo, back[b].r.contenttype,
back[b].url_fil, ext, sizeof(ext));
}
// FIN Si non déplacé, forcer type?

View File

@@ -100,6 +100,8 @@ void standard_name(char *b, size_t bsize, const char *dot_pos,
const char *nom_pos, const char *fil_complete,
int short_ver);
void url_savename_addstr(char *d, const char *s);
/* Contested wire-vs-ext verdict that a body sniff could settle (htssniff.h). */
int hts_ext_sniff_wanted(httrackp *opt, const char *wiremime, const char *file);
char *url_md5(char *digest_buffer, const char *fil_complete);
void url_savename_refname(const char *adr, const char *fil, char *filename);
char *url_savename_refname_fullpath(httrackp * opt, const char *adr,

View File

@@ -49,6 +49,7 @@ Please visit our Website: http://www.httrack.com
#include "htsindex.h"
#include "htscharset.h"
#include "htsencoding.h"
#include "htssniff.h"
/* external modules */
#include "htsmodules.h"
@@ -77,13 +78,14 @@ Please visit our Website: http://www.httrack.com
/** Append to the output buffer the string 'A'. **/
#define HT_ADD(A) TypedArrayAppend(output_buffer, A, strlen(A))
/** Append to the output buffer the string 'A', html-escaped. **/
#define HT_ADD_HTMLESCAPED_ANY(A, FUNCTION) do { \
/* clang-format off: an edit realigns all backslashes, churning the macro. */
/* clang-format off */
/** Append 'A' to the output buffer, html-escaped; FACTOR = max byte expansion. **/
#define HT_ADD_HTMLESCAPED_ANY(A, FUNCTION, FACTOR) do { \
if ((opt->getmode & 1) != 0 && ptr>0) { \
const char *const str_ = (A); \
size_t size_; \
/* &amp; is the maximum expansion */ \
TypedArrayEnsureRoom(output_buffer, strlen(str_) * 5 + 1024); \
TypedArrayEnsureRoom(output_buffer, strlen(str_) * (FACTOR) + 1024); \
size_ = FUNCTION(str_, &TypedArrayTail(output_buffer), \
TypedArrayRoom(output_buffer)); \
TypedArraySize(output_buffer) += size_; \
@@ -91,188 +93,113 @@ Please visit our Website: http://www.httrack.com
} while(0)
/** Append to the output buffer the string 'A', html-escaped for &. **/
#define HT_ADD_HTMLESCAPED(A) HT_ADD_HTMLESCAPED_ANY(A, escape_for_html_print)
#define HT_ADD_HTMLESCAPED(A) \
HT_ADD_HTMLESCAPED_ANY(A, escape_for_html_print, HTS_HTMLESCAPE_MAXEXP)
/**
* Append to the output buffer the string 'A', html-escaped for & and
* Append to the output buffer the string 'A', html-escaped for & and
* high chars.
**/
#define HT_ADD_HTMLESCAPED_FULL(A) HT_ADD_HTMLESCAPED_ANY(A, escape_for_html_print_full)
#define HT_ADD_HTMLESCAPED_FULL(A) \
HT_ADD_HTMLESCAPED_ANY(A, escape_for_html_print_full, HTS_HTMLESCAPE_FULL_MAXEXP)
/* clang-format on */
// does nothing
#define XH_uninit do {} while(0)
#define HT_ADD_END { \
int ok=0;\
if (TypedArraySize(output_buffer) != 0) { \
const size_t ht_len = TypedArraySize(output_buffer); \
const char *const ht_buff = TypedArrayElts(output_buffer); \
char digest[32+2];\
off_t fsize_old = fsize(fconv(OPT_GET_BUFF(opt),OPT_GET_BUFF_SIZE(opt),savename()));\
digest[0] = '\0';\
domd5mem(TypedArrayElts(output_buffer), ht_len, digest, 1);\
if (fsize_old == (off_t) ht_len) { \
int mlen = 0;\
char* mbuff;\
cache_readdata(cache,"//[HTML-MD5]//",savename(),&mbuff,&mlen);\
if (mlen) \
mbuff[mlen]='\0';\
if ((mlen == 32) && (strcmp(((mbuff!=NULL)?mbuff:""),digest)==0)) {\
ok=1;\
hts_log_print(opt, LOG_DEBUG, "File not re-written (md5): %s",savename());\
} else {\
ok=0;\
} \
}\
if (!ok) { \
file_notify(opt,urladr(), urlfil(), savename(), 1, 1, r->notmodified); \
fp=filecreate(&opt->state.strc, savename()); \
if (fp) { \
if (ht_len>0) {\
if (fwrite(ht_buff,1,ht_len,fp) != ht_len) { \
int fcheck;\
if ((fcheck=check_fatal_io_errno())) {\
opt->state.exit_xh=-1;\
}\
if (opt->log) { \
hts_log_print(opt, LOG_ERROR | LOG_ERRNO, "Unable to write HTML file %s", savename());\
if (fcheck) {\
hts_log_print(opt, LOG_ERROR, "* * Fatal write error, giving up");\
}\
}\
}\
}\
fclose(fp); fp=NULL; \
if (strnotempty(r->lastmodified)) \
set_filetime_rfc822(savename(),r->lastmodified); \
} else {\
int fcheck;\
if ((fcheck=check_fatal_io_errno())) {\
hts_log_print(opt, LOG_ERROR, "Mirror aborted: disk full or filesystem problems"); \
opt->state.exit_xh=-1;\
}\
hts_log_print(opt, LOG_ERROR | LOG_ERRNO, "Unable to save file %s", savename());\
if (fcheck) {\
hts_log_print(opt, LOG_ERROR, "* * Fatal write error, giving up");\
}\
}\
} else {\
file_notify(opt,urladr(), urlfil(), savename(), 0, 0, r->notmodified); \
filenote(&opt->state.strc, savename(),NULL); \
}\
if (cache->ndx)\
cache_writedata(cache->ndx,cache->dat,"//[HTML-MD5]//",savename(),digest,(int)strlen(digest));\
} \
TypedArrayFree(output_buffer); \
}
#define HT_ADD_FOP
// COPY IN HTSCORE.C
#define HT_INDEX_END do { \
if (!makeindex_done) { \
if (makeindex_fp) { \
char BIGSTK tempo[1024]; \
if (makeindex_links == 1) { \
char BIGSTK link_escaped[HTS_URLMAXSIZE*2]; \
escape_uri_utf(makeindex_firstlink, link_escaped, sizeof(link_escaped)); \
snprintf(tempo,sizeof(tempo),"<meta HTTP-EQUIV=\"Refresh\" CONTENT=\"0; URL=%s\">"CRLF,link_escaped); \
} else \
tempo[0]='\0'; \
hts_template_format(makeindex_fp,template_footer, \
"<!-- Mirror and index made by HTTrack Website Copier/"HTTRACK_VERSION" "HTTRACK_AFF_AUTHORS" -->", \
tempo, /* EOF */ NULL \
); \
fflush(makeindex_fp); \
fclose(makeindex_fp); /* à ne pas oublier sinon on passe une nuit blanche */ \
makeindex_fp=NULL; \
usercommand(opt,0,NULL,fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_html_utf8),"index.html"),"primary","primary"); \
} \
} \
makeindex_done=1; /* ok c'est fait */ \
} while(0)
/* Mutable extended-context fields: one source of truth so the DEFINE/SET/SAVE
load and store lists can't drift apart. */
/* clang-format off */
#define ENGINE_MUTABLE_FIELDS(X) \
X(int, error, stre->error_) \
X(int, store_errpage, stre->store_errpage_) \
X(int, makeindex_done, stre->makeindex_done_) \
X(FILE *, makeindex_fp, stre->makeindex_fp_) \
X(int, makeindex_links, stre->makeindex_links_) \
X(LLint, stat_fragment, stre->stat_fragment_)
#define ENGINE_FIELD_DECLARE(type, name, src) type name = *(src);
#define ENGINE_FIELD_LOAD(type, name, src) name = *(src);
#define ENGINE_FIELD_STORE(type, name, src) *(src) = name;
#define ENGINE_DEFINE_CONTEXT() \
ENGINE_DEFINE_CONTEXT_BASE(); \
/* */ \
htsblk* const r HTS_UNUSED = stre->r_; \
hash_struct* const hash HTS_UNUSED = stre->hash_; \
char* const codebase HTS_UNUSED = stre->codebase; \
char* const base HTS_UNUSED = stre->base; \
/* */ \
const char * const template_header HTS_UNUSED = stre->template_header_; \
const char * const template_body HTS_UNUSED = stre->template_body_; \
const char * const template_footer HTS_UNUSED = stre->template_footer_; \
/* */ \
HTS_UNUSED char* const makeindex_firstlink = stre->makeindex_firstlink_; \
/* */ \
/* */ \
int error = * stre->error_; \
int store_errpage = * stre->store_errpage_; \
/* */ \
int makeindex_done = *stre->makeindex_done_; \
FILE* makeindex_fp = *stre->makeindex_fp_; \
int makeindex_links = *stre->makeindex_links_; \
/* */ \
LLint stat_fragment = *stre->stat_fragment_; \
ENGINE_MUTABLE_FIELDS(ENGINE_FIELD_DECLARE) \
/* load-once (kept out of SET/SAVE): re-reading would reset the throttle */ \
HTS_UNUSED TStamp makestat_time = stre->makestat_time; \
HTS_UNUSED FILE* makestat_fp = stre->makestat_fp
#define ENGINE_SET_CONTEXT() \
ENGINE_SET_CONTEXT_BASE(); \
/* */ \
error = * stre->error_; \
store_errpage = * stre->store_errpage_; \
/* */ \
makeindex_done = *stre->makeindex_done_; \
makeindex_fp = *stre->makeindex_fp_; \
makeindex_links = *stre->makeindex_links_; \
/* */ \
stat_fragment = *stre->stat_fragment_; \
makestat_time = stre->makestat_time; \
makestat_fp = stre->makestat_fp
ENGINE_MUTABLE_FIELDS(ENGINE_FIELD_LOAD)
#define ENGINE_LOAD_CONTEXT() \
ENGINE_DEFINE_CONTEXT()
#define ENGINE_SAVE_CONTEXT() \
ENGINE_SAVE_CONTEXT_BASE(); \
/* */ \
* stre->error_ = error; \
* stre->store_errpage_ = store_errpage; \
/* */ \
*stre->makeindex_done_ = makeindex_done; \
*stre->makeindex_fp_ = makeindex_fp; \
*stre->makeindex_links_ = makeindex_links; \
/* */ \
*stre->stat_fragment_ = stat_fragment
ENGINE_MUTABLE_FIELDS(ENGINE_FIELD_STORE)
/* clang-format on */
#define _FILTERS (*opt->filters.filters)
#define _FILTERS_PTR (opt->filters.filptr)
#define _ROBOTS ((robots_wizard*)opt->robotsptr)
/* Apply current *adr character for the script automate */
#define AUTOMATE_LOOKUP_CURRENT_ADR() do { \
if (inscript) { \
int new_state_pos; \
new_state_pos=inscript_state[inscript_state_pos][(unsigned char)*html]; \
if (new_state_pos < 0) { \
new_state_pos=inscript_state[inscript_state_pos][INSCRIPT_DEFAULT]; \
} \
assertf(new_state_pos >= 0); \
assertf(new_state_pos*sizeof(inscript_state[0]) < sizeof(inscript_state)); \
inscript_state_pos=new_state_pos; \
} \
} while(0)
/* JS-detection automaton states; INSCRIPT_DEFAULT is the synthetic "any other
char" column of the transition table. */
typedef enum {
INSCRIPT_START = 0,
INSCRIPT_ANTISLASH,
INSCRIPT_INQUOTE,
INSCRIPT_INQUOTE2,
INSCRIPT_SLASH,
INSCRIPT_SLASHSLASH,
INSCRIPT_COMMENT,
INSCRIPT_COMMENT2,
INSCRIPT_ANTISLASH_IN_QUOTE,
INSCRIPT_ANTISLASH_IN_QUOTE2,
INSCRIPT_DEFAULT = 256
} INSCRIPT;
/* Increment current pointer to 'steps' characters, modifying automate if necessary */
#define INCREMENT_CURRENT_ADR(steps) do { \
int steps__ = (int) ( steps ); \
while(steps__ > 0) { \
html++; \
AUTOMATE_LOOKUP_CURRENT_ADR(); \
steps__ --; \
} \
} while(0)
#define INSCRIPT_NSTATES 10 /* rows in the transition table */
/* Live view of the parser's automaton locals, set up once so the helpers below
can drive it without capturing them by lexical scope. */
typedef struct {
const int *inscript; /* nonzero while inside a script body */
const signed char (*table)[257]; /* [INSCRIPT_NSTATES][257] transitions */
INSCRIPT *pos; /* current state */
const char **html; /* parse cursor */
} script_automate;
/* Feed the current *html byte to the automaton. No-op outside a script body. */
static void hts_automate_lookup(const script_automate *aut) {
if (*aut->inscript) {
int next = aut->table[*aut->pos][(unsigned char) **aut->html];
if (next < 0) {
next = aut->table[*aut->pos][INSCRIPT_DEFAULT];
}
assertf(next >= 0 && next < INSCRIPT_NSTATES);
*aut->pos = (INSCRIPT) next;
}
}
/* Advance the cursor by 'steps' bytes, feeding each to the automaton. */
static void hts_automate_increment(const script_automate *aut, int steps) {
while (steps > 0) {
(*aut->html)++;
hts_automate_lookup(aut);
steps--;
}
}
/* Percent-encode the angle brackets of a string so it is safe to embed inside
an HTML comment (the default footer) or any other HTML context. A URL holding
@@ -417,20 +344,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int incomment = 0; // dans un <!--
int inscript = 0; // dans un scipt pour applets javascript)
int inscript_locked = 0; // in locked script (ie. js file)
signed char inscript_state[10][257];
typedef enum {
INSCRIPT_START = 0,
INSCRIPT_ANTISLASH,
INSCRIPT_INQUOTE,
INSCRIPT_INQUOTE2,
INSCRIPT_SLASH,
INSCRIPT_SLASHSLASH,
INSCRIPT_COMMENT,
INSCRIPT_COMMENT2,
INSCRIPT_ANTISLASH_IN_QUOTE,
INSCRIPT_ANTISLASH_IN_QUOTE2,
INSCRIPT_DEFAULT = 256
} INSCRIPT;
signed char inscript_state[INSCRIPT_NSTATES][257];
INSCRIPT inscript_state_pos = INSCRIPT_START;
const char *inscript_name = NULL; // script tag name
int inscript_tag = 0; // on est dans un <body onLoad="... terminé par >
@@ -491,6 +405,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
inscript_state[INSCRIPT_COMMENT2]['*'] = INSCRIPT_COMMENT2;
inscript_state[INSCRIPT_ANTISLASH_IN_QUOTE][INSCRIPT_DEFAULT] = INSCRIPT_INQUOTE; /* #8: escape in '' */
inscript_state[INSCRIPT_ANTISLASH_IN_QUOTE2][INSCRIPT_DEFAULT] = INSCRIPT_INQUOTE2; /* #9: escape in "" */
const script_automate saut = {&inscript, inscript_state,
&inscript_state_pos, &html};
/* Primary list or URLs */
if (ptr == 0) {
@@ -689,13 +605,14 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
// Decode title with encoding
if (str->page_charset_ != NULL
&& *str->page_charset_ != '\0') {
char *const sUtf =
hts_convertStringToUTF8(s, strlen(s), str->page_charset_);
if (str->page_charset_ != NULL &&
*str->page_charset_ != '\0') {
char *sUtf = hts_convertStringToUTF8(
s, strlen(s), str->page_charset_);
if (sUtf != NULL) {
strcpy(s, sUtf);
free(sUtf);
/* UTF-8 can expand past s[]; truncate to fit */
snprintf(s, sizeof(s), "%s", sUtf);
freet(sUtf);
}
}
@@ -709,7 +626,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
} else if (heap(ptr)->depth < opt->depth) { // on a sauté level1+1 et level1
HT_INDEX_END;
hts_finish_makeindex(opt, &makeindex_done, &makeindex_fp,
makeindex_links, makeindex_firstlink,
template_footer, "primary", "primary");
}
} // if (opt->makeindex)
}
@@ -927,7 +846,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
/* automate */
AUTOMATE_LOOKUP_CURRENT_ADR();
hts_automate_lookup(&saut);
// Note:
// Certaines pages ne respectent pas le html
@@ -1843,7 +1762,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
// sauter espaces
// adr+=p;
INCREMENT_CURRENT_ADR(p);
hts_automate_increment(&saut, p);
while((is_space(*html)
|| (inscriptgen && html[0] == '\\' && is_space(html[1])
)
@@ -1858,7 +1777,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
// puis quitter
// html++; // sauter les espaces, "" et cie
INCREMENT_CURRENT_ADR(1);
hts_automate_increment(&saut, 1);
}
/* Stop at \n (LF) if primary links or link lists */
@@ -1873,7 +1792,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (*html == '\\') {
if ((*(html + 1) == '\'') || (*(html + 1) == '"')) { // \" ou \'
// html+=2; // sauter
INCREMENT_CURRENT_ADR(2);
hts_automate_increment(&saut, 2);
}
}
}
@@ -1921,7 +1840,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (srcset_p) {
while(html < r->adr + r->size
&& (is_realspace(*html) || *html == ','))
INCREMENT_CURRENT_ADR(1);
hts_automate_increment(&saut, 1);
}
eadr = html;
@@ -3381,7 +3300,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
assertf(eadr - html >= 0); // Should not go back
if (eadr > html) {
INCREMENT_CURRENT_ADR(eadr - 1 - html);
hts_automate_increment(&saut, (int) (eadr - 1 - html));
}
// adr=eadr-1; // ** sauter
@@ -3400,7 +3319,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
q++; // skip whitespace and empty candidates
if (q < endp && *q != '\0' && *q != ',' && *q != quote
&& *q != '<' && *q != '>' && (unsigned char) *q >= 32) {
INCREMENT_CURRENT_ADR(q - html); // keep the automate in sync
hts_automate_increment(
&saut, (int) (q - html)); // keep the automate in sync
ok = 1;
goto srcset_next;
}
@@ -3540,7 +3460,12 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
/* Flush and save to disk */
HT_ADD_END; // achever
if (TypedArraySize(output_buffer) != 0) {
hts_finish_html_file(
opt, cache, r, &fp, TypedArrayElts(output_buffer),
TypedArraySize(output_buffer), urladr(), urlfil(), savename());
}
TypedArrayFree(output_buffer);
}
//
//
@@ -3565,6 +3490,24 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
return 0;
}
/* Mirror the savename to tell whether a redirect saves to the same file (#159);
* contract in htsparse.h. */
hts_boolean hts_redirect_same_savefile(httrackp *opt, const char *cur_adr,
const char *cur_fil,
const char *moved_adr,
const char *moved_fil) {
const int norm_slash = opt->urlhack && !opt->no_slash_dedup;
const int norm_query = opt->urlhack && !opt->no_query_dedup;
char BIGSTK n_fil[HTS_URLMAXSIZE * 2], pn_fil[HTS_URLMAXSIZE * 2];
if (strcasecmp(jump_identification_const(moved_adr),
jump_identification_const(cur_adr)) != 0)
return HTS_FALSE;
fil_normalized_filtered_ex(moved_fil, n_fil, NULL, norm_slash, norm_query);
fil_normalized_filtered_ex(cur_fil, pn_fil, NULL, norm_slash, norm_query);
return strcasecmp(n_fil, pn_fil) == 0;
}
/*
Check 301, 302, .. statuscodes (moved)
*/
@@ -3610,36 +3553,9 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
if ((reponse =
ident_url_relatif(mov_url, urladr(), urlfil(), moved)) >= 0) {
int set_prio_to = 0; // pas de priotité fixéd par wizard
// check whether URLHack is harmless or not (per the effective
// sub-flags)
if (opt->urlhack && (!opt->no_www_dedup || !opt->no_slash_dedup ||
!opt->no_query_dedup)) {
const int norm_host = !opt->no_www_dedup;
const int norm_slash = !opt->no_slash_dedup;
const int norm_query = !opt->no_query_dedup;
char BIGSTK n_adr[HTS_URLMAXSIZE * 2], n_fil[HTS_URLMAXSIZE * 2];
char BIGSTK pn_adr[HTS_URLMAXSIZE * 2], pn_fil[HTS_URLMAXSIZE * 2];
strlcpybuff(n_adr,
norm_host ? jump_normalized_const(moved->adr)
: jump_identification_const(moved->adr),
sizeof(n_adr));
strlcpybuff(pn_adr,
norm_host ? jump_normalized_const(urladr())
: jump_identification_const(urladr()),
sizeof(pn_adr));
fil_normalized_filtered_ex(moved->fil, n_fil, NULL, norm_slash,
norm_query);
fil_normalized_filtered_ex(urlfil(), pn_fil, NULL, norm_slash,
norm_query);
if (strcasecmp(n_adr, pn_adr) == 0
&& strcasecmp(n_fil, pn_fil) == 0) {
hts_log_print(opt, LOG_WARNING,
"Redirected link is identical because of 'URL Hack' option: %s%s and %s%s",
urladr(), urlfil(), moved->adr, moved->fil);
}
}
// A same-file alias redirect must be followed, not stubbed (#159).
const hts_boolean same_savefile = hts_redirect_same_savefile(
opt, urladr(), urlfil(), moved->adr, moved->fil);
//if (ident_url_absolute(mov_url,moved->adr,moved->fil)!=-1) { // ok URL reconnue
// c'est (en gros) la même URL..
// si c'est un problème de casse dans le host c'est que le serveur est buggé
@@ -3667,7 +3583,17 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
hts_log_print(opt, LOG_DEBUG, "moved link accepted: %s%s",
moved->adr, moved->fil);
}
} /* sinon traité normalement */
} else if (same_savefile) {
// A stub would point at itself; follow the redirect instead.
if (hts_acceptlink(opt, ptr, moved->adr, moved->fil, NULL, NULL,
&set_prio_to, NULL) != 1) {
get_it = 1;
hts_log_print(opt, LOG_WARNING,
"Redirect to a same-file alias, fetching real "
"content: %s%s -> %s%s",
urladr(), urlfil(), moved->adr, moved->fil);
}
} /* sinon traité normalement */
}
//if ((strfield2(moved->adr,urladr())!=0) && (strfield2(moved->fil,urlfil())!=0)) { // identique à casse près
@@ -3690,7 +3616,11 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
heap(heap(ptr)->precedent)->adr,
heap(heap(ptr)->precedent)->fil, opt,
sback, cache, hash, ptr, numero_passe, NULL) != -1) {
if (hash_read(hash, savedmoved.save, NULL, HASH_STRUCT_FILENAME) < 0) { // n'existe pas déja
// Same-file alias: the reserved name is the invalidated source,
// so record anyway.
if (same_savefile ||
hash_read(hash, savedmoved.save, NULL,
HASH_STRUCT_FILENAME) < 0) { // n'existe pas déja
// enregistrer lien avec SAV IDENTIQUE
if (hts_record_link(opt, moved->adr, moved->fil, heap(ptr)->sav, "", "", NULL)) {
// mode test?
@@ -3714,7 +3644,6 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
"moving %s to an existing file %s",
heap(ptr)->fil, urlfil());
}
}
}
@@ -4148,6 +4077,9 @@ void hts_mirror_process_user_interaction(htsmoduleStruct * str,
while(opt->state._hts_setpause || back_pluggable_sockets_strict(sback, opt) <= 0) { // on fait la pause..
opt->state._hts_in_html_parsing = 6;
back_wait(sback, opt, cache, HTS_STAT.stat_timestart);
/* time limit (-E) exceeded: stop waiting for a socket (#481) */
if (!back_checkmirror(opt))
break;
// Transfer rate
engine_stats();
@@ -4767,22 +4699,26 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
if (!RUN_CALLBACK7
(opt, loop, sback->lnk, sback->count, b, ptr, opt->lien_tot,
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)) {
back_set_unlocked(sback, b);
return -1;
} else if (opt->state._hts_cancel || !back_checkmirror(opt)) { // cancel 2 ou 1 (cancel parsing)
back_delete(opt, cache, sback, b); // cancel test
break;
}
}
} while(
/* dns/connect/request */
(back[b].status >= 99 && back[b].status <= 101)
||
/* For redirects, wait for request to be terminated */
(HTTP_IS_REDIRECT(back[b].r.statuscode) && back[b].status > 0)
||
/* Same for errors */
(HTTP_IS_ERROR(back[b].r.statuscode) && back[b].status > 0)
);
} while (
/* dns/connect/request */
(back[b].status >= 99 && back[b].status <= 101) ||
/* For redirects, wait for request to be terminated */
(HTTP_IS_REDIRECT(back[b].r.statuscode) && back[b].status > 0) ||
/* Same for errors */
(HTTP_IS_ERROR(back[b].r.statuscode) && back[b].status > 0) ||
/* Contested type: wait for a sniffable body head (or EOF) */
(back[b].r.statuscode == HTTP_OK && back[b].status > 0 &&
strnotempty(back[b].r.cdispo) == 0 &&
back[b].r.size < HTS_SNIFF_LEN &&
hts_ext_sniff_wanted(opt, back[b].r.contenttype,
back[b].url_fil)));
if (b >= 0) {
back_set_unlocked(sback, b); // Unlocked entry
}
@@ -4917,6 +4853,9 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
/* Still have a back reference */
if (b >= 0) {
/* patch url_sav BEFORE finalize: it records/caches under this name
*/
strcpybuff(back[b].url_sav, afs->save);
/* Finalize now as we have the type */
if (back[b].status == STATUS_READY) {
if (!back[b].finalized) {
@@ -4924,8 +4863,6 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
back_finalize(opt, cache, sback, b);
}
}
/* Patch destination filename for direct-to-disk mode */
strcpybuff(back[b].url_sav, afs->save);
}
} // b >= 0

View File

@@ -116,6 +116,19 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre);
int hts_mirror_check_moved(htsmoduleStruct * str,
htsmoduleStructExtended * stre);
/*
Non-zero if a redirect (cur_adr,cur_fil)->(moved_adr,moved_fil) saves to the
same local file, so it must be followed rather than turned into a
self-pointing "moved" stub (#159). Mirrors the savename: scheme+userinfo
stripped, www kept (www dedup is the crawl layer's job), path
slash/query-normalized per the URL-hack flags. Not hash_url_equals: that keys
on the dedup hash, which folds www and never collapses http<->https.
*/
hts_boolean hts_redirect_same_savefile(httrackp *opt, const char *cur_adr,
const char *cur_fil,
const char *moved_adr,
const char *moved_fil);
/*
Process user intercations: pause, add link, delete link..
*/
@@ -162,27 +175,33 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
/* Apply changes */ \
* str->ptr_ = ptr
#define WAIT_FOR_AVAILABLE_SOCKET() do { \
int prev = opt->state._hts_in_html_parsing; \
while(back_pluggable_sockets_strict(sback, opt) <= 0) { \
opt->state._hts_in_html_parsing = 6; \
/* Wait .. */ \
back_wait(sback,opt,cache,0); \
/* Transfer rate */ \
engine_stats(); \
/* Refresh various stats */ \
HTS_STAT.stat_nsocket=back_nsoc(sback); \
HTS_STAT.stat_errors=fspc(opt,NULL,"error"); \
HTS_STAT.stat_warnings=fspc(opt,NULL,"warning"); \
HTS_STAT.stat_infos=fspc(opt,NULL,"info"); \
HTS_STAT.nbk=backlinks_done(sback,opt->liens,opt->lien_tot,ptr); \
HTS_STAT.nb=back_transferred(HTS_STAT.stat_bytes,sback); \
/* Check */ \
if (!RUN_CALLBACK7(opt, loop, sback->lnk, sback->count, -1,ptr,opt->lien_tot,(int) (time_local()-HTS_STAT.stat_timestart),&HTS_STAT)) { \
return -1; \
} \
} \
opt->state._hts_in_html_parsing = prev; \
} while(0)
#define WAIT_FOR_AVAILABLE_SOCKET() \
do { \
int prev = opt->state._hts_in_html_parsing; \
while (back_pluggable_sockets_strict(sback, opt) <= 0) { \
opt->state._hts_in_html_parsing = 6; \
/* Wait .. */ \
back_wait(sback, opt, cache, 0); \
/* time limit (-E) exceeded: stop waiting for a socket (#481) */ \
if (!back_checkmirror(opt)) \
break; \
/* Transfer rate */ \
engine_stats(); \
/* Refresh various stats */ \
HTS_STAT.stat_nsocket = back_nsoc(sback); \
HTS_STAT.stat_errors = fspc(opt, NULL, "error"); \
HTS_STAT.stat_warnings = fspc(opt, NULL, "warning"); \
HTS_STAT.stat_infos = fspc(opt, NULL, "info"); \
HTS_STAT.nbk = backlinks_done(sback, opt->liens, opt->lien_tot, ptr); \
HTS_STAT.nb = back_transferred(HTS_STAT.stat_bytes, sback); \
/* Check */ \
if (!RUN_CALLBACK7( \
opt, loop, sback->lnk, sback->count, -1, ptr, opt->lien_tot, \
(int) (time_local() - HTS_STAT.stat_timestart), &HTS_STAT)) { \
return -1; \
} \
} \
opt->state._hts_in_html_parsing = prev; \
} while (0)
#endif

View File

@@ -44,28 +44,84 @@ Please visit our Website: http://www.httrack.com
// -- robots --
/* RFC 9309 path-prefix match; '*' any run, '$' anchors end; linear. */
static hts_boolean robots_pattern_match(const char *pattern, const char *path) {
size_t patlen = strlen(pattern);
hts_boolean anchored = HTS_FALSE;
const char *p, *pend, *s;
const char *star = NULL, *star_s = NULL;
if (patlen > 0 && pattern[patlen - 1] == '$') {
anchored = HTS_TRUE;
patlen--;
}
p = pattern;
pend = pattern + patlen;
s = path;
while (*s != '\0') {
if (p == pend) {
if (!anchored)
return HTS_TRUE; // prefix matched
if (star != NULL) { // anchored: '*' must eat the rest
p = star + 1;
s = ++star_s;
continue;
}
return HTS_FALSE;
}
if (*p == '*') {
star = p++;
star_s = s;
} else if (*p == *s) {
p++;
s++;
} else if (star != NULL) {
p = star + 1;
s = ++star_s;
} else {
return HTS_FALSE;
}
}
while (p < pend && *p == '*')
p++;
return (p == pend) ? HTS_TRUE : HTS_FALSE;
}
// fil="" : vérifier si règle déja enregistrée
int checkrobots(robots_wizard * robots, const char *adr, const char *fil) {
while(robots) {
if (strfield2(robots->adr, adr)) {
if (fil[0]) {
/* RFC 9309: longest pattern wins, Allow beats Disallow on ties. */
int ptr = 0;
char line[250];
char line[HTS_ROBOTS_TOKEN_SIZE];
size_t toklen = strlen(robots->token);
size_t best_len = 0;
hts_boolean matched = HTS_FALSE;
hts_boolean best_allow = HTS_FALSE;
if (strnotempty(robots->token)) {
do {
ptr += binput(robots->token + ptr, line, 200);
if (line[0] == '/') { // absolu
if (strfield(fil, line)) { // commence avec ligne
return -1; // interdit
}
} else { // relatif
if (strstrcase(fil, line)) {
return -1;
while (ptr < (int) toklen) {
ptr += binput(robots->token + ptr, line, sizeof(line) - 1);
if (line[0] != 'A' && line[0] != 'D')
continue;
{
const hts_boolean is_allow =
(line[0] == 'A') ? HTS_TRUE : HTS_FALSE;
const char *pat = line + 1;
if (robots_pattern_match(pat, fil)) {
const size_t len = strlen(pat);
if (!matched || len > best_len || (len == best_len && is_allow)) {
matched = HTS_TRUE;
best_len = len;
best_allow = is_allow;
}
}
} while((strnotempty(line)) && (ptr < (int) strlen(robots->token)));
}
}
if (matched && !best_allow)
return -1; // forbidden
} else {
return -1;
}
@@ -74,6 +130,93 @@ int checkrobots(robots_wizard * robots, const char *adr, const char *fil) {
}
return 0;
}
/* Append "<marker><pattern>\n" to the bounded rule blob if it fits. */
static void robots_blob_add(char *blob, size_t blobsize, char marker,
const char *pat) {
const size_t used = strlen(blob);
const size_t need = strlen(pat) + 2; // marker + '\n'
if (need < blobsize - used) { // overflow-safe: used <= blobsize-1
blob[used] = marker;
blob[used + 1] = '\0';
strlcatbuff(blob, pat, blobsize);
strlcatbuff(blob, "\n", blobsize);
}
}
void robots_parse(robots_wizard *robots, const char *adr, const char *body,
size_t bodysize, char *info, size_t infosize,
hts_boolean keep_root_disallow) {
size_t bptr = 0;
int record = 0;
char BIGSTK line[1024];
char BIGSTK blob[HTS_ROBOTS_TOKEN_SIZE];
blob[0] = '\0';
if (info != NULL && infosize > 0)
info[0] = '\0';
#if DEBUG_ROBOTS
printf("robots.txt dump:\n%s\n", body);
#endif
while (bptr < bodysize) {
char *comm;
int llen;
bptr += binput(body + bptr, line, sizeof(line) - 2);
comm = strchr(line, '#'); // strip comment
if (comm != NULL)
*comm = '\0';
llen = (int) strlen(line); // strip trailing spaces
while (llen > 0 && is_realspace(line[llen - 1])) {
line[llen - 1] = '\0';
llen--;
}
if (strfield(line, "user-agent:")) {
char *a = line + 11;
while (is_realspace(*a))
a++;
if (*a == '*') {
if (record != 2)
record = 1; // generic group applies to us
} else if (strfield(a, "httrack") || strfield(a, "winhttrack") ||
strfield(a, "webhttrack")) {
blob[0] = '\0'; // explicit group: restart capture
if (info != NULL && infosize > 0)
info[0] = '\0';
record = 2; // locked to the httrack group
} else
record = 0;
} else if (record) {
hts_boolean is_allow = strfield(line, "allow:");
hts_boolean is_disallow = !is_allow && strfield(line, "disallow:");
if (is_allow || is_disallow) {
char *a = line + (is_allow ? 6 : 9);
while (is_realspace(*a))
a++;
if (strnotempty(a)) {
if (is_disallow && !keep_root_disallow && strcmp(a, "/") == 0) {
// dropped: site-wide disallow ignored by option
} else {
robots_blob_add(blob, sizeof(blob), is_allow ? 'A' : 'D', a);
if (is_disallow && info != NULL &&
strlen(a) + 2 < infosize - strlen(info)) {
if (strnotempty(info))
strlcatbuff(info, ", ", infosize);
strlcatbuff(info, a, infosize);
}
}
}
}
}
}
if (strnotempty(blob))
checkrobots_set(robots, adr, blob);
}
int checkrobots_set(robots_wizard * robots, const char *adr, const char *data) {
if (((int) strlen(adr)) >= sizeof(robots->adr) - 2)
return 0;

View File

@@ -39,17 +39,27 @@ Please visit our Website: http://www.httrack.com
#define HTS_DEF_FWSTRUCT_robots_wizard
typedef struct robots_wizard robots_wizard;
#endif
/* Per-host blob: one rule per line, first byte 'A'/'D' then path pattern. */
#define HTS_ROBOTS_TOKEN_SIZE 4096
struct robots_wizard {
char adr[128];
char token[4096];
char token[HTS_ROBOTS_TOKEN_SIZE];
struct robots_wizard *next;
};
/* Library internal definictions */
#ifdef HTS_INTERNAL_BYTECODE
/* -1 if `fil` disallowed for `adr` (RFC 9309); empty: -1 if rules exist. */
int checkrobots(robots_wizard * robots, const char *adr, const char *fil);
void checkrobots_free(robots_wizard * robots);
int checkrobots_set(robots_wizard * robots, const char *adr, const char *data);
/* Parse robots.txt `body` for `adr`, storing the HTTrack group's rules; `info`
gets a disallow summary, `keep_root_disallow` FALSE drops "Disallow: /". */
void robots_parse(robots_wizard *robots, const char *adr, const char *body,
size_t bodysize, char *info, size_t infosize,
hts_boolean keep_root_disallow);
#endif
#endif

View File

@@ -456,6 +456,13 @@ static HTS_INLINE HTS_UNUSED const char *htsbuff_str(const htsbuff *b) {
return b->buf;
}
/** True if 'count' records of >= 1 byte each fit in 'available' bytes; guards
an attacker-controlled count driving a large allocation. */
static HTS_INLINE HTS_UNUSED hts_boolean hts_count_fits(size_t count,
LLint available) {
return (available >= 0 && (LLint) count <= available) ? HTS_TRUE : HTS_FALSE;
}
/* Thin aliases over the libc allocator/memcpy (historical "t" suffix); no
added bounds checking. freet() also NULLs the freed pointer and tolerates
NULL. memcpybuff() despite the name is a raw memcpy: the caller owns the

View File

@@ -45,11 +45,17 @@ Please visit our Website: http://www.httrack.com
#include "htscore.h"
#include "htsdefines.h"
#include "htslib.h"
#include "htsparse.h"
#include "htscache_selftest.h"
#include "htsdns_selftest.h"
#include "htscharset.h"
#include "htsencoding.h"
#include "htsftp.h"
#include "htsmd5.h"
#include "htssniff.h"
#if HTS_USEZLIB
#include "htszlib.h"
#endif
#include "coucal/coucal.h"
#include <ctype.h>
@@ -57,6 +63,10 @@ Please visit our Website: http://www.httrack.com
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef _WIN32
#include <sys/socket.h>
#include <unistd.h>
#endif
/* very minimalistic internal tests */
static void basic_selftests(void) {
@@ -239,6 +249,14 @@ static void basic_selftests(void) {
assertf(strcmp(ext, "html") == 0);
assertf(give_mimext(ext, sizeof(ext), "no/such-mime-type") == 0);
assertf(ext[0] == '\0');
// modern web formats -> extension. Avoid MIME types the
// application/<=4-char-subtype fallback could fabricate without a row.
assertf(give_mimext(ext, sizeof(ext), "image/webp") == 1);
assertf(strcmp(ext, "webp") == 0);
assertf(give_mimext(ext, sizeof(ext), "application/manifest+json") == 1);
assertf(strcmp(ext, "webmanifest") == 0);
assertf(give_mimext(ext, sizeof(ext), "font/woff2") == 1);
assertf(strcmp(ext, "woff2") == 0);
}
// convtolower(): lower-cases into the caller buffer (bounded by its size).
{
@@ -293,6 +311,16 @@ static void basic_selftests(void) {
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"x.gif", 0) == 1);
assertf(strcmp(r.contenttype, "image/gif") == 0);
// modern extensions map back to their MIME type
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"x.webp", 0) == 1);
assertf(strcmp(r.contenttype, "image/webp") == 0);
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"app.wasm", 0) == 1);
assertf(strcmp(r.contenttype, "application/wasm") == 0);
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"mod.mjs", 0) == 1);
assertf(strcmp(r.contenttype, "text/javascript") == 0);
// no extension and flag==0: nothing written, returns 0
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"noextfile", 0) == 0);
@@ -503,6 +531,41 @@ static int string_safety_selftests(void) {
return 1;
}
/* StringCatN/StringSetLength must eval SIZE once: (n_eval++, V) leaves
n_eval == 2 on a double-eval macro. */
{
String s = STRING_EMPTY;
int n_eval = 0;
StringCat(s, "hello");
StringCatN(s, "world", (n_eval++, 3)); /* strlen>SIZE so the clamp runs */
if (n_eval != 1 || strcmp(StringBuff(s), "hellowor") != 0) {
StringFree(s);
return 1;
}
n_eval = 0;
StringSetLength(s, (n_eval++, 5));
if (n_eval != 1 || StringLength(s) != 5) {
StringFree(s);
return 1;
}
StringFree(s);
}
/* StringSubRW still reads/writes after dropping its duplicate definition. */
{
String s = STRING_EMPTY;
StringCat(s, "abc");
StringSubRW(s, 1) = 'X';
if (StringSub(s, 1) != 'X' || strcmp(StringBuff(s), "aXc") != 0) {
StringFree(s);
return 1;
}
StringFree(s);
}
return 0;
}
@@ -651,7 +714,8 @@ static int st_entities(httrackp *opt, int argc, char **argv) {
}
s = strdupt(argv[0]);
enc = argc >= 2 ? argv[1] : "UTF-8";
if (s != NULL && hts_unescapeEntitiesWithCharset(s, s, strlen(s), enc) == 0) {
if (s != NULL &&
hts_unescapeEntitiesWithCharset(s, s, strlen(s) + 1, enc) == 0) {
printf("%s\n", s);
freet(s);
} else {
@@ -660,6 +724,34 @@ static int st_entities(httrackp *opt, int argc, char **argv) {
return 0;
}
/* The unescapers must reserve one byte for the trailing NUL: a 'max'-byte
dest holding 'max' output chars pre-fix wrote dest[max] (1-byte OOB, caught
by ASan). Both unescapeEntities and unescapeUrl share the guard. */
static int st_unescape_bounds(httrackp *opt, int argc, char **argv) {
char dest[4];
(void) opt;
(void) argc;
(void) argv;
assertf(hts_unescapeEntities("abcd", dest, sizeof(dest)) == -1);
assertf(hts_unescapeUrl("abcd", dest, sizeof(dest)) == -1);
assertf(hts_unescapeEntities("abc", dest, sizeof(dest)) == 0);
assertf(strcmp(dest, "abc") == 0);
/* raw multi-byte UTF-8 flush path (bypasses the per-byte guard) */
assertf(hts_unescapeUrl("ab\xC3\xA9", dest, sizeof(dest)) == -1);
assertf(hts_unescapeUrl("a\xC3\xA9", dest, sizeof(dest)) == 0);
assertf(strcmp(dest, "a\xC3\xA9") == 0);
{
/* %xx-encoded flush path (utfBufferJ = lastJ rollback) */
char wide[8];
assertf(hts_unescapeUrl("%C3%A9", wide, sizeof(wide)) == 0);
assertf(strcmp(wide, "\xC3\xA9") == 0);
}
printf("unescape-bounds self-test OK\n");
return 0;
}
static int st_hashtable(httrackp *opt, int argc, char **argv) {
char *snum;
unsigned long count = 0;
@@ -1002,35 +1094,218 @@ static int st_resolve(httrackp *opt, int argc, char **argv) {
return 0;
}
/* Extra args are key=value: adr= cdispo= statuscode= status= strip= urlhack=
no-www= no-slash= no-query= n83= type=, plus repeatable prior=adr|fil|sav
registering an already-crawled link (dedup/collision paths). */
/* Parse raw response-header lines and print the naming-relevant fields. */
static int st_header(httrackp *opt, int argc, char **argv) {
htsblk r;
int i;
(void) opt;
if (argc < 1) {
fprintf(stderr, "header: needs at least one raw header line\n");
return 1;
}
memset(&r, 0, sizeof(r));
for (i = 0; i < argc; i++) {
char BIGSTK line[HTS_URLMAXSIZE * 2];
strcpybuff(line, argv[i]);
treathead(NULL, "www.example.com", "/", &r, line);
}
printf("contenttype=%s cdispo=%s\n", r.contenttype, r.cdispo);
return 0;
}
/* Decode a body argument ("hex:FFD8.." or literal text) into buf. */
static size_t st_decode_body(const char *arg, char *buf, size_t size) {
size_t n = 0;
if (strncmp(arg, "hex:", 4) == 0) {
const char *s = arg + 4;
for (; s[0] != '\0' && s[1] != '\0' && n + 1 < size; s += 2) {
unsigned int byte;
if (sscanf(s, "%2x", &byte) != 1)
break;
buf[n++] = (char) byte;
}
} else {
n = strlen(arg);
if (n >= size)
n = size - 1;
memcpy(buf, arg, n);
}
buf[n] = '\0';
return n;
}
static int st_sniff(httrackp *opt, int argc, char **argv) {
char BIGSTK body[1024];
size_t n;
(void) opt;
if (argc < 2) {
fprintf(stderr, "sniff: needs a content-type and a body\n");
return 1;
}
n = st_decode_body(argv[1], body, sizeof(body));
printf("sniff: known=%d consistent=%d\n",
hts_sniff_mime_known(argv[0]) == HTS_TRUE,
hts_sniff_mime_consistent(body, n, argv[0]) == HTS_TRUE);
return 0;
}
static int st_savename(httrackp *opt, int argc, char **argv) {
lien_adrfilsave afs;
cache_back cache;
struct_back *sback;
hash_struct hash;
lien_back headers;
const char *adr = "www.example.com";
const char *cdispo = NULL;
const char *body = NULL;
const char *cached = NULL;
const char *bodyfile = "st-savename-body.tmp";
int statuscode = HTTP_OK, status = 0;
int i;
if (argc < 2) {
fprintf(stderr, "savename: needs a fil and a content-type\n");
return 1;
}
/* knobs first: hash_init and the prior links depend on them */
for (i = 2; i < argc; i++) {
const char *const a = argv[i];
if (strncmp(a, "adr=", 4) == 0)
adr = a + 4;
else if (strncmp(a, "cdispo=", 7) == 0)
cdispo = a + 7;
else if (strncmp(a, "statuscode=", 11) == 0)
statuscode = atoi(a + 11);
else if (strncmp(a, "status=", 7) == 0)
status = atoi(a + 7);
else if (strncmp(a, "strip=", 6) == 0)
StringCopy(opt->strip_query, a + 6);
else if (strncmp(a, "urlhack=", 8) == 0)
opt->urlhack = atoi(a + 8) ? HTS_TRUE : HTS_FALSE;
else if (strncmp(a, "no-www=", 7) == 0)
opt->no_www_dedup = atoi(a + 7) ? HTS_TRUE : HTS_FALSE;
else if (strncmp(a, "no-slash=", 9) == 0)
opt->no_slash_dedup = atoi(a + 9) ? HTS_TRUE : HTS_FALSE;
else if (strncmp(a, "no-query=", 9) == 0)
opt->no_query_dedup = atoi(a + 9) ? HTS_TRUE : HTS_FALSE;
else if (strncmp(a, "n83=", 4) == 0)
opt->savename_83 = atoi(a + 4);
else if (strncmp(a, "type=", 5) == 0)
opt->savename_type = atoi(a + 5);
else if (strncmp(a, "body=", 5) == 0)
body = a + 5;
else if (strncmp(a, "cached=", 7) == 0)
cached = a + 7;
else if (strncmp(a, "prior=", 6) != 0) {
fprintf(stderr, "savename: unknown arg '%s'\n", a);
return 1;
}
}
memset(&afs, 0, sizeof(afs));
strcpybuff(afs.af.adr, "www.example.com");
strcpybuff(afs.af.adr, adr);
strcpybuff(afs.af.fil, argv[0]);
memset(&cache, 0, sizeof(cache));
cache.hashtable = (void *) coucal_new(0);
if (cached != NULL) { /* cached=<content-type>|<save name> */
char *dup = strdupt(cached);
char *const sep = strchr(dup, '|');
char locbuf[64] = "";
htsblk cr;
if (sep == NULL) {
fprintf(stderr, "savename: cached needs ctype|save\n");
return 1;
}
*sep = '\0';
/* one-entry cache in cwd, reopened read-only; body is PNG magic on
purpose: only the recorded name (X-Save) may drive the naming */
StringCopy(opt->path_log, "");
cache.type = 1;
cache.log = cache.errlog = stderr;
cache.hashtable = coucal_new(0);
cache_init(&cache, opt);
hts_init_htsblk(&cr);
cr.statuscode = HTTP_OK;
strcpybuff(cr.msg, "OK");
strcpybuff(cr.contenttype, dup);
cr.location = locbuf;
cr.adr = strdupt("\x89PNG\r\n\x1a\n");
cr.size = 8;
cache_add(opt, &cache, &cr, adr, argv[0], sep + 1, 1, NULL);
freet(cr.adr);
if (cache.zipOutput != NULL) {
zipClose(cache.zipOutput, NULL);
cache.zipOutput = NULL;
}
memset(&cache, 0, sizeof(cache));
cache.type = 1;
cache.log = cache.errlog = stderr;
cache.hashtable = coucal_new(0);
cache.ro = 1;
cache_init(&cache, opt);
freet(dup);
} else {
cache.hashtable = (void *) coucal_new(0);
}
sback = back_new(opt, opt->maxsoc * 32 + 1024);
/* same wiring as hts_mirror (htscore.c) */
hash_init(opt, &hash, opt->urlhack);
hash.liens = (const lien_url *const *const *) &opt->liens;
opt->hash = &hash;
hts_record_init(opt);
for (i = 2; i < argc; i++) {
if (strncmp(argv[i], "prior=", 6) == 0) {
char *dup = strdupt(argv[i] + 6);
char *const p1 = strchr(dup, '|');
char *const p2 = p1 != NULL ? strchr(p1 + 1, '|') : NULL;
if (p2 == NULL) {
fprintf(stderr, "savename: prior needs adr|fil|sav\n");
return 1;
}
*p1 = *p2 = '\0';
if (!hts_record_link(opt, dup, p1 + 1, p2 + 1, "", "", NULL))
return 1;
freet(dup);
}
}
memset(&headers, 0, sizeof(headers));
headers.status = 0;
headers.r.statuscode = HTTP_OK;
headers.status = status;
headers.r.statuscode = statuscode;
strcpybuff(headers.r.contenttype, argv[1]);
if (cdispo != NULL)
strcpybuff(headers.r.cdispo, cdispo);
strcpybuff(headers.url_fil, argv[0]);
if (body != NULL) { /* leading body bytes, read via url_sav */
char BIGSTK data[1024];
const size_t n = st_decode_body(body, data, sizeof(data));
FILE *const fp = fopen(bodyfile, "wb");
if (fp == NULL || fwrite(data, 1, n, fp) != n) {
fprintf(stderr, "savename: can not write %s\n", bodyfile);
return 1;
}
fclose(fp);
strcpybuff(headers.url_sav, bodyfile);
}
url_savename(&afs, NULL, NULL, NULL, opt, sback, &cache, &hash, 0, 0,
&headers);
if (body != NULL)
(void) UNLINK(bodyfile);
printf("savename: %s\n", afs.save);
return 0;
}
@@ -1284,6 +1559,514 @@ static int st_urlhack(httrackp *opt, int argc, char **argv) {
return 0;
}
/* #159: hts_redirect_same_savefile decides whether a redirect is a same-file
* alias. */
static int st_redirect_samefile(httrackp *opt, int argc, char **argv) {
(void) argc;
(void) argv;
#define SAME(aa, fa, ab, fb) hts_redirect_same_savefile(opt, aa, fa, ab, fb)
/* scheme and userinfo collapse (the #159 case); a different path does not */
assertf(SAME("http://foo.com", "/a/b", "https://foo.com", "/a/b"));
assertf(SAME("http://user@foo.com", "/a", "http://foo.com", "/a"));
assertf(!SAME("http://foo.com", "/a", "http://foo.com", "/b"));
/* www stays distinct here; the crawl's dedup layer folds www, not this helper
*/
opt->urlhack = HTS_TRUE;
opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = HTS_FALSE;
assertf(!SAME("http://www.foo.com", "/a", "http://foo.com", "/a"));
/* slash/query fold only when the dedup flag is on */
assertf(SAME("https://foo.com", "/a//b", "http://foo.com", "/a/b"));
assertf(
SAME("https://foo.com", "/p?b=2&a=1", "http://foo.com", "/p?a=1&b=2"));
opt->no_slash_dedup = opt->no_query_dedup = HTS_TRUE;
assertf(!SAME("https://foo.com", "/a//b", "http://foo.com", "/a/b"));
assertf(
!SAME("https://foo.com", "/p?b=2&a=1", "http://foo.com", "/p?a=1&b=2"));
/* but a pure scheme alias still collapses regardless of dedup opt-outs */
assertf(SAME("http://foo.com", "/a/b", "https://foo.com", "/a/b"));
opt->no_slash_dedup = opt->no_query_dedup = HTS_FALSE;
#undef SAME
printf("redirect-samefile self-test OK\n");
return 0;
}
// hts_finish_makeindex writes the footer, emits the refresh meta only when
// makeindex_links==1, and clears *fp / sets *done. argv[0] is a writable dir.
static int st_makeindex(httrackp *opt, int argc, char **argv) {
char path[HTS_URLMAXSIZE];
char buf[4096];
FILE *fp;
size_t n;
int done;
assertf(argc >= 1);
snprintf(path, sizeof(path), "%s/index.html", argv[0]);
/* single first link: footer + a refresh meta carrying the escaped URL */
done = 0;
fp = fopen(path, "wb");
assertf(fp != NULL);
hts_finish_makeindex(opt, &done, &fp, 1, "http://example.com/a b", "%s%s", "",
"");
assertf(fp == NULL); /* the function closed and cleared it */
assertf(done != 0);
fp = fopen(path, "rb");
assertf(fp != NULL);
n = fread(buf, 1, sizeof(buf) - 1, fp);
fclose(fp);
buf[n] = '\0';
assertf(strstr(buf, "Mirror and index made by HTTrack") != NULL);
assertf(strstr(buf, "Refresh") != NULL);
assertf(strstr(buf, "example.com") != NULL);
/* no single link: footer only, no refresh meta */
done = 0;
fp = fopen(path, "wb");
assertf(fp != NULL);
hts_finish_makeindex(opt, &done, &fp, 0, NULL, "%s%s", "", "");
assertf(fp == NULL);
assertf(done != 0);
fp = fopen(path, "rb");
assertf(fp != NULL);
n = fread(buf, 1, sizeof(buf) - 1, fp);
fclose(fp);
buf[n] = '\0';
assertf(strstr(buf, "Mirror and index made by HTTrack") != NULL);
assertf(strstr(buf, "Refresh") == NULL);
UNLINK(path);
printf("makeindex self-test OK\n");
return 0;
}
/* Each inplace_escape_*() must equal escape_*() on a copy. */
static int st_inplace_escape(httrackp *opt, int argc, char **argv) {
/* >255 bytes forces the helper's malloct path, not the stack buffer */
static char longstr[600];
static const char *const samples[] = {
"", "abc", "a b/c?d=e&f", "h\x8ello w\x94rld",
"a%b\"c<d>", "/path to/file", longstr};
static size_t (*const inplace[])(char *, size_t) = {
inplace_escape_in_url, inplace_escape_spc_url, inplace_escape_uri_utf,
inplace_escape_check_url, inplace_escape_uri};
static size_t (*const plain[])(const char *, char *, size_t) = {
escape_in_url, escape_spc_url, escape_uri_utf, escape_check_url,
escape_uri};
size_t i, f;
(void) opt;
(void) argc;
(void) argv;
memset(longstr, 'a', sizeof(longstr) - 1);
for (f = 0; f < sizeof(inplace) / sizeof(inplace[0]); f++) {
for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
char ref[4096], work[4096];
size_t rret, iret;
rret = plain[f](samples[i], ref, sizeof(ref));
strcpybuff(work, samples[i]);
iret = inplace[f](work, sizeof(work));
assertf(iret == rret);
assertf(strcmp(work, ref) == 0);
}
}
printf("inplace-escape self-test OK\n");
return 0;
}
/* Pin HTS_HTMLESCAPE*_MAXEXP to each escaper's true max byte expansion. */
static int st_escape_room(httrackp *opt, int argc, char **argv) {
/* N > 1023: where 6n outgrows the old 5n+1024 reservation */
enum { N = 2000 };
char *src = malloct(N + 1);
char *dst;
size_t room, got;
(void) opt;
(void) argc;
(void) argv;
/* _full worst case: a high byte expands to "&#xHH;" (6 bytes) */
memset(src, 0xE9, N);
src[N] = '\0';
room = (size_t) N * HTS_HTMLESCAPE_FULL_MAXEXP + 1024;
dst = malloct(room);
got = escape_for_html_print_full(src, dst, room);
assertf(got == (size_t) N * HTS_HTMLESCAPE_FULL_MAXEXP);
assertf(strlen(dst) == got);
freet(dst);
/* one factor short overflows (returns size), truncating the page: the bug */
room = (size_t) N * (HTS_HTMLESCAPE_FULL_MAXEXP - 1) + 1024;
dst = malloct(room);
got = escape_for_html_print_full(src, dst, room);
assertf(got == room);
freet(dst);
/* plain escaper worst case: '&' -> "&amp;" (5); high bytes stay verbatim */
memset(src, '&', N);
src[N] = '\0';
room = (size_t) N * HTS_HTMLESCAPE_MAXEXP + 1024;
dst = malloct(room);
got = escape_for_html_print(src, dst, room);
assertf(got == (size_t) N * HTS_HTMLESCAPE_MAXEXP);
assertf(strlen(dst) == got);
freet(dst);
freet(src);
printf("escape-room self-test OK\n");
return 0;
}
/* Default User-Agent: honest HTTrack token, no resurrected Windows 98. */
static int st_useragent(httrackp *opt, int argc, char **argv) {
const char *ua = StringBuff(opt->user_agent);
(void) argc;
(void) argv;
assertf(ua != NULL);
assertf(strcmp(ua, HTS_DEFAULT_USER_AGENT) == 0);
/* Teeth independent of the macro: honest token + self-identifier, and no
legacy Mozilla/4.x fake-browser string (rejects the whole relic family). */
assertf(strstr(ua, "HTTrack/") != NULL);
assertf(strstr(ua, "+https://www.httrack.com/") != NULL);
assertf(strstr(ua, "Mozilla/4.") == NULL);
printf("useragent self-test OK: %s\n", ua);
return 0;
}
/* HTTP status code -> reason phrase, including the modern 429/451. */
static int st_status(httrackp *opt, int argc, char **argv) {
const char *s;
(void) opt;
(void) argc;
(void) argv;
s = infostatuscode_const(429);
assertf(s != NULL && strcmp(s, "Too Many Requests") == 0);
s = infostatuscode_const(451);
assertf(s != NULL && strcmp(s, "Unavailable For Legal Reasons") == 0);
/* A spot-check of a long-standing code, and an unknown one. */
s = infostatuscode_const(404);
assertf(s != NULL && strcmp(s, "Not Found") == 0);
assertf(infostatuscode_const(799) == NULL);
printf("status self-test OK\n");
return 0;
}
#if HTS_USEZLIB
/* Deflate src->path at windowBits (16+ gzip, + zlib, - raw); 0 on success. */
static int ae_write_packed(const char *path, int windowBits,
const unsigned char *src, size_t len) {
unsigned char out[8192];
z_stream strm;
FILE *f;
int zerr;
memset(&strm, 0, sizeof(strm));
if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, windowBits, 8,
Z_DEFAULT_STRATEGY) != Z_OK)
return 1;
if ((f = FOPEN(path, "wb")) == NULL) {
deflateEnd(&strm);
return 1;
}
strm.next_in = (Bytef *) src;
strm.avail_in = (uInt) len;
do {
size_t n;
strm.next_out = out;
strm.avail_out = sizeof(out);
zerr = deflate(&strm, Z_FINISH);
n = sizeof(out) - strm.avail_out;
if (n > 0 && fwrite(out, 1, n, f) != n) {
deflateEnd(&strm);
fclose(f);
return 1;
}
} while (zerr == Z_OK);
deflateEnd(&strm);
fclose(f);
return (zerr == Z_STREAM_END) ? 0 : 1;
}
/* Forged raw deflate (08 1D) that misdetects as zlib; only fallback decodes */
static int ae_write_collision(const char *path, const unsigned char *src,
size_t len) {
/* block-1 LEN low byte 0x1D: with 0x08, (0x081D)%31==0 */
const size_t n1 = 29;
size_t n2, p = 0;
unsigned char *buf;
FILE *f;
int ok;
if (len < n1 || len - n1 > 0xFFFF)
return 1;
n2 = len - n1;
buf = malloct(10 + len);
if (buf == NULL)
return 1;
buf[p++] = 0x08; /* BFINAL=0, BTYPE=00, forged padding -> zlib CMF nibble */
buf[p++] = (unsigned char) (n1 & 0xff);
buf[p++] = (unsigned char) (n1 >> 8);
buf[p++] = (unsigned char) (~n1 & 0xff);
buf[p++] = (unsigned char) ((~n1 >> 8) & 0xff);
memcpy(buf + p, src, n1);
p += n1;
buf[p++] = 0x01; /* BFINAL=1, BTYPE=00 */
buf[p++] = (unsigned char) (n2 & 0xff);
buf[p++] = (unsigned char) (n2 >> 8);
buf[p++] = (unsigned char) (~n2 & 0xff);
buf[p++] = (unsigned char) ((~n2 >> 8) & 0xff);
memcpy(buf + p, src + n1, n2);
p += n2;
f = FOPEN(path, "wb");
ok = (f != NULL && fwrite(buf, 1, p, f) == p);
if (f != NULL)
fclose(f);
freet(buf);
return ok ? 0 : 1;
}
/* Compare path's bytes to expect[0..len); 0 if equal. Streams (large files). */
static int ae_check_decoded(const char *path, const unsigned char *expect,
size_t len) {
unsigned char buf[8192];
FILE *f = FOPEN(path, "rb");
size_t off = 0, n;
if (f == NULL)
return 1;
while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
if (n > len - off || memcmp(buf, expect + off, n) != 0) {
fclose(f);
return 1;
}
off += n;
}
fclose(f);
return (off == len) ? 0 : 1;
}
#endif
/* Accept-Encoding (#450): advertise gzip+deflate; both decode (hts_zunpack) */
static int st_acceptencoding(httrackp *opt, int argc, char **argv) {
const char *off = hts_acceptencoding(HTS_FALSE);
const char *on = hts_acceptencoding(HTS_TRUE);
(void) opt;
assertf(strcmp(off, "identity") == 0);
assertf(strstr(on, "gzip") != NULL);
assertf(strstr(on, "deflate") != NULL); /* fails on the old gzip-only list */
#if HTS_USEZLIB
if (argc >= 1) {
static const int windowBits[] = {16 + MAX_WBITS, MAX_WBITS, -MAX_WBITS};
const unsigned char small[] =
"deflate round-trip: HTTrack decodes gzip and deflate alike. "
"deflate round-trip: HTTrack decodes gzip and deflate alike.";
const size_t slen = sizeof(small) - 1;
/* 64 KiB of varied (LCG) bytes: forces the multi-fread loop */
const size_t blen = 64 * 1024;
unsigned char *body = malloct(blen);
uint32_t x = 0x1234567u;
char inpath[HTS_URLMAXSIZE], outpath[HTS_URLMAXSIZE];
size_t i;
assertf(body != NULL);
for (i = 0; i < blen; i++) {
x = x * 1103515245u + 12345u;
body[i] = (unsigned char) (x >> 16);
}
/* gzip, zlib (RFC1950) and raw deflate (RFC1951), both small and large. */
for (i = 0; i < sizeof(windowBits) / sizeof(windowBits[0]); i++) {
snprintf(inpath, sizeof(inpath), "%s/ae-in-%d.z", argv[0], windowBits[i]);
snprintf(outpath, sizeof(outpath), "%s/ae-out-%d", argv[0],
windowBits[i]);
assertf(ae_write_packed(inpath, windowBits[i], small, slen) == 0);
assertf(hts_zunpack(inpath, outpath) == (int) slen);
assertf(ae_check_decoded(outpath, small, slen) == 0);
assertf(ae_write_packed(inpath, windowBits[i], body, blen) == 0);
assertf(hts_zunpack(inpath, outpath) == (int) blen);
assertf(ae_check_decoded(outpath, body, blen) == 0);
}
/* Fallback teeth: raw deflate misdetected as zlib; -1 without the retry. */
snprintf(inpath, sizeof(inpath), "%s/ae-collide.z", argv[0]);
snprintf(outpath, sizeof(outpath), "%s/ae-collide.out", argv[0]);
assertf(ae_write_collision(inpath, body, 64) == 0);
assertf(hts_zunpack(inpath, outpath) == 64);
assertf(ae_check_decoded(outpath, body, 64) == 0);
freet(body);
}
#else
(void) argc;
(void) argv;
#endif
printf("acceptencoding self-test OK: %s\n", on);
return 0;
}
/* Each call parses `txt` under a fresh host, then checkrobots() for `path`. */
static int rb_decide(robots_wizard *r, const char *txt, const char *path) {
static int n = 0;
char host[64];
snprintf(host, sizeof(host), "h%d.example", n++);
robots_parse(r, host, txt, strlen(txt), NULL, 0, HTS_TRUE);
return checkrobots(r, host, path);
}
static int st_robots(httrackp *opt, int argc, char **argv) {
robots_wizard robots;
(void) opt;
(void) argc;
(void) argv;
memset(&robots, 0, sizeof(robots));
/* Longer Allow re-opens subtree under Disallow: / (old matcher couldn't). */
{
const char *txt = "User-agent: *\nDisallow: /\nAllow: /public/\n";
assertf(rb_decide(&robots, txt, "/public/x") == 0); /* allowed */
assertf(rb_decide(&robots, txt, "/private") == -1); /* denied */
assertf(rb_decide(&robots, txt, "/") == -1); /* denied */
}
/* Equal-length match: Allow wins the tie over Disallow. */
{
const char *txt = "User-agent: *\nDisallow: /foo\nAllow: /foo\n";
assertf(rb_decide(&robots, txt, "/foo/bar") == 0);
}
/* Longest match wins even when it is not the last rule. */
{
assertf(rb_decide(&robots, "User-agent: *\nDisallow: /a/b\nAllow: /a\n",
"/a/b/c") == -1);
assertf(rb_decide(&robots, "User-agent: *\nAllow: /a/b\nDisallow: /a\n",
"/a/b/c") == 0);
}
/* '*' matches any run of characters. */
{
const char *txt = "User-agent: *\nDisallow: /*.php\n";
assertf(rb_decide(&robots, txt, "/a/b/index.php") == -1);
assertf(rb_decide(&robots, txt, "/a/b/index.html") == 0);
}
/* Trailing '$' anchors the end of the path. */
{
const char *txt = "User-agent: *\nDisallow: /a$\n";
assertf(rb_decide(&robots, txt, "/a") == -1);
assertf(rb_decide(&robots, txt, "/ab") == 0);
assertf(rb_decide(&robots, txt, "/a/b") == 0);
}
/* The httrack-specific group replaces the generic '*' group entirely. */
{
const char *txt = "User-agent: *\nDisallow: /everyone\n"
"User-agent: httrack\nDisallow: /\n";
assertf(rb_decide(&robots, txt, "/anything") == -1);
}
/* Replace, not merge: the generic group does not bind the httrack group. */
{
const char *txt = "User-agent: *\nDisallow: /x\n"
"User-agent: httrack\nDisallow: /y\n";
assertf(rb_decide(&robots, txt, "/x") == 0);
assertf(rb_decide(&robots, txt, "/y") == -1);
}
/* No rules: everything is allowed. */
assertf(rb_decide(&robots, "User-agent: *\nDisallow:\n", "/x") == 0);
checkrobots_free(&robots);
printf("robots self-test OK\n");
return 0;
}
/* get_ftp_line must bound a hostile, CRLF-less reply into its internal
1024-byte buffer; ASan turns the pre-fix overflow into an abort here. */
#ifndef _WIN32
static int st_ftpline(httrackp *opt, int argc, char **argv) {
int sv[2];
char line[2048];
char flood[4096];
(void) opt;
(void) argc;
(void) argv;
memset(flood, 'x', sizeof(flood));
assertf(socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0);
assertf(write(sv[1], "220 ", 4) == 4); // valid 3-digit code
assertf(write(sv[1], flood, sizeof(flood)) == (ssize_t) sizeof(flood));
assertf(write(sv[1], "\r\n", 2) == 2); // end the line so we return
close(sv[1]);
line[0] = '\0';
get_ftp_line(sv[0], line, sizeof(line), 5);
close(sv[0]);
printf("ftp-line self-test OK (bounded %d-byte reply)\n",
(int) sizeof(flood));
return 0;
}
#endif
/* ftp_split_userpass: well-formed split, plus a hostile over-long userinfo
that pre-fix overran user[256]/pass[256]. */
static int st_ftpuser(httrackp *opt, int argc, char **argv) {
char user[256], pass[256];
char in[1200];
(void) opt;
(void) argc;
(void) argv;
{
const char ok[] = "bob:secret@host/f"; // '@' at index 10
ftp_split_userpass(ok, ok + 11, user, sizeof(user), pass, sizeof(pass));
assertf(strcmp(user, "bob") == 0);
assertf(strcmp(pass, "secret") == 0);
}
memset(in, 'u', 400);
in[400] = ':';
memset(in + 401, 'p', 400);
in[801] = '@';
in[802] = '\0';
ftp_split_userpass(in, in + 802, user, sizeof(user), pass, sizeof(pass));
assertf(strlen(user) == sizeof(user) - 1);
assertf(strlen(pass) == sizeof(pass) - 1);
{
/* tight sizes + guard byte catch an off-by-one the 256 case can't */
char ubuf[16], pbuf[16];
memset(ubuf, 'Z', sizeof(ubuf));
memset(pbuf, 'Z', sizeof(pbuf));
ftp_split_userpass(in, in + 802, ubuf, 8, pbuf, 8);
assertf(strcmp(ubuf, "uuuuuuu") == 0);
assertf(strcmp(pbuf, "ppppppp") == 0);
assertf(ubuf[8] == 'Z' && pbuf[8] == 'Z');
}
printf("ftp-userpass self-test OK\n");
return 0;
}
/* hts_count_fits caps the .class constant-pool entry count to the file size,
rejecting the ~68 MB-per-file calloc DoS. */
static int st_java(httrackp *opt, int argc, char **argv) {
(void) opt;
(void) argc;
(void) argv;
assertf(hts_count_fits(10, 1000) == HTS_TRUE);
assertf(hts_count_fits(0, 10) == HTS_TRUE);
assertf(hts_count_fits(65535, 10) == HTS_FALSE);
assertf(hts_count_fits(1, 0) == HTS_FALSE);
assertf(hts_count_fits(1, -1) == HTS_FALSE);
printf("java constant-pool cap self-test OK\n");
return 0;
}
/* ------------------------------------------------------------ */
/* Registry: name -> handler, with a usage hint and a one-line description. */
/* ------------------------------------------------------------ */
@@ -1304,6 +2087,8 @@ static const struct selftest_entry {
st_stripquery},
{"urlhack", "", "-%u url-hack sub-flag (www/slash/query) self-test",
st_urlhack},
{"redirect-samefile", "", "same-file redirect detection self-test (#159)",
st_redirect_samefile},
{"mime", "<filename>", "MIME type for a filename", st_mime},
{"charset", "<charset> <string>",
"convert a string to UTF-8 from a charset", st_charset},
@@ -1312,6 +2097,8 @@ static const struct selftest_entry {
{"idna-decode", "<host>", "decode an IDNA/punycode hostname",
st_idna_decode},
{"entities", "<string> [encoding]", "unescape HTML entities", st_entities},
{"unescape-bounds", "", "unescapers reserve the NUL byte (no 1-byte OOB)",
st_unescape_bounds},
{"hashtable", "<count|file>", "coucal hashtable stress test", st_hashtable},
{"strsafe", "[overflow|overflow-buff [str]]", "bounded string-op self-test",
st_strsafe},
@@ -1321,8 +2108,12 @@ static const struct selftest_entry {
st_relative},
{"resolve", "<link> <adr> <fil>", "resolve a link against an origin",
st_resolve},
{"savename", "<fil> <content-type>", "local save-name for a URL",
st_savename},
{"header", "<raw-header-line> ...", "response header-line parsing",
st_header},
{"savename", "<fil> <content-type> [key=value ...]",
"local save-name for a URL", st_savename},
{"sniff", "<content-type> <hex:..|text>", "MIME magic consistency",
st_sniff},
{"cache", "<dir>", "cache read/write round-trip self-test", st_cache},
{"cache-golden", "<dir> [regen]", "frozen cache-format read self-test",
st_cache_golden},
@@ -1330,6 +2121,24 @@ static const struct selftest_entry {
st_cache_writefail},
{"dns", "", "DNS resolver/cache self-test", st_dns},
{"cookies", "", "cookie request-header self-test", st_cookies},
{"useragent", "", "default User-Agent self-test", st_useragent},
{"makeindex", "[dir]", "hts_finish_makeindex footer/refresh self-test",
st_makeindex},
{"inplace-escape", "", "inplace_escape_* vs escape_* equivalence self-test",
st_inplace_escape},
{"escape-room", "", "HT_ADD_HTMLESCAPED* reservation-factor self-test",
st_escape_room},
{"status", "", "HTTP status code -> reason phrase self-test", st_status},
{"acceptencoding", "[dir]",
"Accept-Encoding advertises gzip+deflate, both decode", st_acceptencoding},
{"robots", "", "robots.txt RFC 9309 Allow/Disallow precedence self-test",
st_robots},
#ifndef _WIN32
{"ftp-line", "", "get_ftp_line bounds a hostile FTP reply line",
st_ftpline},
#endif
{"ftp-userpass", "", "ftp_split_userpass bounds URL userinfo", st_ftpuser},
{"java", "", "java .class constant-pool count cap self-test", st_java},
};
static void list_selftests(void) {

View File

@@ -358,12 +358,12 @@ int smallserver(T_SOC soc, char *url, char *method, char *data, char *path) {
{NULL, 0}
};
initStrElt initStr[] = {
{"user", "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"},
{"footer",
"<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->"},
{"url2", "+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"},
{NULL, NULL}
};
{"user", HTS_DEFAULT_USER_AGENT},
{"footer", "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x "
"[XR&CO'2014], %s -->"},
{"url2",
"+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"},
{NULL, NULL}};
int i = 0;
for(i = 0; initInt[i].name; i++) {

352
src/htssniff.c Normal file
View File

@@ -0,0 +1,352 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 1998-2017 Xavier Roche and other contributors
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Important notes:
- We hereby ask people using this source NOT to use it in purpose of grabbing
emails addresses, or collecting any other private information on persons.
This would disgrace our work, and spoil the many hours we spent on it.
Please visit our Website: http://www.httrack.com
*/
/* ------------------------------------------------------------ */
/* File: MIME magic-byte consistency checks */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#include "htssniff.h"
#include <string.h>
#include "htslib.h"
/* One magic rule: `len` bytes at `off` confirm `mime`. */
typedef struct sniff_magic {
const char *mime;
unsigned short off;
unsigned char len;
const char *bytes;
} sniff_magic;
/* Direction is mime -> magic (verify a claim, never classify); types with
no reliable magic (plain text, css, js..) are deliberately absent. Patterns
follow the WHATWG MIME Sniffing Standard tables where it defines them
(https://mimesniff.spec.whatwg.org/); the rest covers httrack's wider MIME
set. Spec-only types absent from our MIME tables (EOT, font/collection)
are omitted as unreachable. */
static const sniff_magic sniff_table[] = {
/* images */
{"image/jpeg", 0, 3, "\xff\xd8\xff"},
{"image/pipeg", 0, 3, "\xff\xd8\xff"},
{"image/pjpeg", 0, 3, "\xff\xd8\xff"},
{"image/png", 0, 8, "\x89PNG\r\n\x1a\n"},
{"image/gif", 0, 6, "GIF87a"},
{"image/gif", 0, 6, "GIF89a"},
{"image/bmp", 0, 2, "BM"},
{"image/tiff", 0, 4, "II*\0"},
{"image/tiff", 0, 4, "MM\0*"},
{"image/x-icon", 0, 4, "\0\0\1\0"},
{"image/x-icon", 0, 4, "\0\0\2\0"}, /* Windows cursor, per the spec */
{"image/x-portable-bitmap", 0, 2, "P1"},
{"image/x-portable-bitmap", 0, 2, "P4"},
{"image/x-portable-pixmap", 0, 2, "P3"},
{"image/x-portable-pixmap", 0, 2, "P6"},
{"image/x-xpixmap", 0, 9, "/* XPM */"},
{"image/x-xbitmap", 0, 7, "#define"},
{"image/x-rgb", 0, 2, "\x01\xda"},
{"image/x-cmu-raster", 0, 4, "\xf1\x00\x40\xbb"},
/* audio */
{"audio/mpeg", 0, 3, "ID3"},
{"audio/basic", 0, 4, ".snd"},
{"audio/mid", 0, 8, "MThd\0\0\0\6"},
{"audio/midi", 0, 8, "MThd\0\0\0\6"},
{"audio/x-pn-realaudio", 0, 4, ".ra\xfd"},
{"audio/x-pn-realaudio", 0, 4, ".RMF"},
{"audio/x-pn-realaudio-plugin", 0, 4, ".ra\xfd"},
{"audio/x-pn-realaudio-plugin", 0, 4, ".RMF"},
{"audio/flac", 0, 4, "fLaC"},
{"audio/aac", 0, 4, "ADIF"},
/* video */
{"video/mpeg", 0, 4, "\x00\x00\x01\xba"},
{"video/mpeg", 0, 4, "\x00\x00\x01\xb3"},
{"video/x-sgi-movie", 0, 4, "MOVI"},
/* archives / compression */
{"application/x-gzip", 0, 3, "\x1f\x8b\x08"},
{"multipart/x-gzip", 0, 3, "\x1f\x8b\x08"},
{"application/x-compressed", 0, 3, "\x1f\x8b\x08"},
{"application/x-compress", 0, 2, "\x1f\x9d"},
{"application/x-bzip2", 0, 3, "BZh"},
{"application/x-7z-compressed", 0, 6, "7z\xbc\xaf\x27\x1c"},
/* 6-byte prefix common to RAR4 (spec) and RAR5 */
{"application/x-rar-compressed", 0, 6, "Rar!\x1a\x07"},
{"application/zstd", 0, 4, "\x28\xb5\x2f\xfd"},
{"application/arj", 0, 2, "\x60\xea"},
{"application/x-cpio", 0, 6, "070701"},
{"application/x-cpio", 0, 6, "070707"},
{"application/x-cpio", 0, 2, "\xc7\x71"},
{"application/x-sv4cpio", 0, 6, "070701"},
{"application/x-sv4crc", 0, 6, "070702"},
{"application/x-stuffit", 0, 8, "StuffIt "},
{"application/x-stuffit", 0, 4, "SIT!"},
{"application/mac-binhex40", 0, 10, "(This file"},
/* documents */
{"application/pdf", 0, 5, "%PDF-"},
{"application/postscript", 0, 2, "%!"},
{"application/rtf", 0, 5, "{\\rtf"},
{"application/x-dvi", 0, 2, "\xf7\x02"},
{"application/x-hdf", 0, 4, "\x0e\x03\x13\x01"},
{"application/x-hdf", 0, 8, "\x89HDF\r\n\x1a\n"},
{"application/x-netcdf", 0, 4, "CDF\x01"},
{"application/x-netcdf", 0, 4, "CDF\x02"},
{"application/x-msaccess", 0, 19, "\0\1\0\0Standard Jet DB"},
/* fonts */
{"font/woff", 0, 4, "wOFF"},
{"font/woff2", 0, 4, "wOF2"},
{"font/ttf", 0, 4, "\0\1\0\0"},
{"font/ttf", 0, 4, "true"},
{"font/otf", 0, 4, "OTTO"},
/* misc */
{"application/x-shockwave-flash", 0, 3, "FWS"},
{"application/x-shockwave-flash", 0, 3, "CWS"},
{"application/x-shockwave-flash", 0, 3, "ZWS"},
{"application/futuresplash", 0, 3, "FWS"},
{"application/x-director", 0, 4, "RIFX"},
{"application/x-director", 0, 4, "XFIR"},
{"application/x-java-vm", 0, 4, "\xca\xfe\xba\xbe"},
{"application/wasm", 0, 4, "\0asm"},
{"application/x-msmetafile", 0, 4, "\xd7\xcd\xc6\x9a"},
{"application/x-msmetafile", 0, 4, "\x01\x00\x09\x00"},
{"application/x-x509-ca-cert", 0, 2, "\x30\x82"},
{"application/x-pkcs12", 0, 2, "\x30\x82"},
{"application/x-pkcs7-mime", 0, 2, "\x30\x82"},
{"application/x-pkcs7-signature", 0, 2, "\x30\x82"},
{"application/x-pkcs7-certificates", 0, 2, "\x30\x82"},
{"x-world/x-vrml", 0, 5, "#VRML"},
{"application/x-bittorrent", 0, 11, "d8:announce"},
{"drawing/x-dwf", 0, 4, "(DWF"},
{"application/acad", 0, 4, "AC10"},
{NULL, 0, 0, NULL}};
/* MIME families sharing a container magic */
static const char *const zip_mimes[] = {
"application/zip", "application/x-zip-compressed", "multipart/x-zip", NULL};
static const char *const zip_mime_prefixes[] = {
"application/vnd.openxmlformats-officedocument.",
"application/vnd.oasis.opendocument.", NULL};
static const char *const ole_mimes[] = {"application/msword",
"application/excel",
"application/vnd.ms-excel",
"application/powerpoint",
"application/vnd.ms-powerpoint",
"application/vnd.ms-project",
"application/vnd.ms-works",
"application/x-msmoney",
"application/x-mspublisher",
NULL};
static const char *const tar_mimes[] = {
"application/x-tar", "application/x-ustar", "application/x-gtar", NULL};
static const char *const ogg_mimes[] = {"application/ogg", "audio/ogg",
"video/ogg", "audio/opus", NULL};
static const char *const ebml_mimes[] = {"video/webm", "audio/webm", NULL};
/* ISO-BMFF, any 'ftyp' brand: containers overlap too much to split */
static const char *const bmff_mimes[] = {"video/mp4", "audio/mp4",
"video/quicktime", NULL};
static const char *const avif_mimes[] = {"image/avif", NULL};
static const char *const heic_mimes[] = {"image/heic", NULL};
static const char *const asf_mimes[] = {"video/x-ms-asf", "video/x-ms-wmv",
"video/x-la-asf", NULL};
static const char *const xml_mimes[] = {"application/xml", "text/xml",
"image/svg+xml", "image/svg-xml", NULL};
static const char *const svg_mimes[] = {"image/svg+xml", "image/svg-xml", NULL};
static const char *const html_mimes[] = {"text/html", NULL};
static const char *const pem_mimes[] = {
"application/x-x509-ca-cert", "application/x-pkcs7-certificates",
"application/x-pkcs7-mime", "application/x-pkcs7-signature", NULL};
static hts_boolean mime_in(const char *const *list, const char *mime) {
size_t i;
for (i = 0; list[i] != NULL; i++)
if (strfield2(list[i], mime))
return HTS_TRUE;
return HTS_FALSE;
}
static hts_boolean mime_in_prefix(const char *const *list, const char *mime) {
size_t i;
for (i = 0; list[i] != NULL; i++)
if (strfield(mime, list[i]))
return HTS_TRUE;
return HTS_FALSE;
}
static hts_boolean has_bytes(const unsigned char *d, size_t n, size_t off,
const char *bytes, size_t len) {
/* overflow-safe: untrusted n alone on one side */
return n >= off && len <= n - off && memcmp(d + off, bytes, len) == 0
? HTS_TRUE
: HTS_FALSE;
}
static unsigned char ascii_lower(unsigned char c) {
return c >= 'A' && c <= 'Z' ? (unsigned char) (c + 32) : c;
}
/* Case-insensitive text prefix after an optional UTF-8 BOM and whitespace. */
static hts_boolean has_text_prefix(const unsigned char *d, size_t n,
const char *prefix) {
const size_t len = strlen(prefix);
size_t i, k;
i = n >= 3 && memcmp(d, "\xef\xbb\xbf", 3) == 0 ? 3 : 0;
while (i < n && (d[i] == ' ' || d[i] == '\t' || d[i] == '\r' || d[i] == '\n'))
i++;
if (len > n - i) /* i <= n from the loop above */
return HTS_FALSE;
for (k = 0; k < len; k++)
if (ascii_lower(d[i + k]) != ascii_lower((unsigned char) prefix[k]))
return HTS_FALSE;
return HTS_TRUE;
}
typedef enum sniff_op {
SNIFF_QUERY_KNOWN, /* is any rule defined for this MIME? */
SNIFF_QUERY_MATCH /* do the bytes confirm this MIME? */
} sniff_op;
/* Single walk for both queries so the rule set can't drift apart. */
static hts_boolean sniff_eval(sniff_op op, const unsigned char *d, size_t n,
const char *mime) {
size_t i;
/* KNOWN short-circuits; MATCH tests the magic */
#define SNIFF_RULE(cond) \
do { \
if (op == SNIFF_QUERY_KNOWN) \
return HTS_TRUE; \
if (cond) \
return HTS_TRUE; \
} while (0)
for (i = 0; sniff_table[i].mime != NULL; i++) {
if (strfield2(sniff_table[i].mime, mime)) {
SNIFF_RULE(has_bytes(d, n, sniff_table[i].off, sniff_table[i].bytes,
sniff_table[i].len));
}
}
if (mime_in(zip_mimes, mime) || mime_in_prefix(zip_mime_prefixes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "PK\3\4", 4) ||
has_bytes(d, n, 0, "PK\5\6", 4));
}
if (mime_in(ole_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", 8));
}
if (mime_in(tar_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 257, "ustar", 5));
}
if (mime_in(ogg_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "OggS\0", 5));
}
if (mime_in(ebml_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "\x1a\x45\xdf\xa3", 4));
}
if (mime_in(bmff_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 4, "ftyp", 4));
}
if (mime_in(avif_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 4, "ftypavif", 8) ||
has_bytes(d, n, 4, "ftypavis", 8));
}
if (mime_in(heic_mimes, mime)) {
SNIFF_RULE(
has_bytes(d, n, 4, "ftyphei", 7) || has_bytes(d, n, 4, "ftyphev", 7) ||
has_bytes(d, n, 4, "ftypmif1", 8) || has_bytes(d, n, 4, "ftypmsf1", 8));
}
if (mime_in(asf_mimes, mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "\x30\x26\xb2\x75\x8e\x66\xcf\x11", 8));
}
if (strfield2("audio/x-wav", mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "RIFF", 4) && has_bytes(d, n, 8, "WAVE", 4));
}
if (strfield2("video/x-msvideo", mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "RIFF", 4) && has_bytes(d, n, 8, "AVI ", 4));
}
if (strfield2("image/webp", mime)) {
SNIFF_RULE(has_bytes(d, n, 0, "RIFF", 4) &&
has_bytes(d, n, 8, "WEBPVP", 6));
}
if (strfield2("image/x-portable-anymap", mime)) {
SNIFF_RULE(n >= 2 && d[0] == 'P' && d[1] >= '1' && d[1] <= '6');
}
if (strfield2("audio/x-aiff", mime)) {
SNIFF_RULE(
has_bytes(d, n, 0, "FORM", 4) &&
(has_bytes(d, n, 8, "AIFF", 4) || has_bytes(d, n, 8, "AIFC", 4)));
}
if (strfield2("audio/mpeg", mime)) {
/* MPEG audio frame sync (11 bits), valid layer and bitrate fields */
SNIFF_RULE(n >= 2 && d[0] == 0xff && (d[1] & 0xe0) == 0xe0 &&
(d[1] & 0x06) != 0);
}
if (strfield2("audio/aac", mime)) {
/* ADTS sync */
SNIFF_RULE(n >= 2 && d[0] == 0xff && (d[1] & 0xf6) == 0xf0);
}
if (strfield2("video/mp2t", mime)) {
SNIFF_RULE(n >= 1 && d[0] == 0x47 && (n <= 188 || d[188] == 0x47));
}
if (mime_in(xml_mimes, mime)) {
SNIFF_RULE(has_text_prefix(d, n, "<?xml"));
}
if (mime_in(svg_mimes, mime)) {
SNIFF_RULE(has_text_prefix(d, n, "<svg") ||
has_text_prefix(d, n, "<!DOCTYPE svg"));
}
if (mime_in(html_mimes, mime)) {
SNIFF_RULE(has_text_prefix(d, n, "<!DOCTYPE") ||
has_text_prefix(d, n, "<html") ||
has_text_prefix(d, n, "<head"));
}
if (mime_in(pem_mimes, mime)) {
SNIFF_RULE(has_text_prefix(d, n, "-----BEGIN"));
}
if (strfield2("audio/x-mpegurl", mime)) {
SNIFF_RULE(has_text_prefix(d, n, "#EXTM3U"));
}
if (strfield2("text/x-vcard", mime)) {
SNIFF_RULE(has_text_prefix(d, n, "BEGIN:VCARD"));
}
#undef SNIFF_RULE
return HTS_FALSE;
}
hts_boolean hts_sniff_mime_known(const char *mime) {
if (mime == NULL || *mime == '\0')
return HTS_FALSE;
return sniff_eval(SNIFF_QUERY_KNOWN, NULL, 0, mime);
}
hts_boolean hts_sniff_mime_consistent(const void *data, size_t size,
const char *mime) {
if (data == NULL || size == 0 || mime == NULL || *mime == '\0')
return HTS_FALSE;
return sniff_eval(SNIFF_QUERY_MATCH, (const unsigned char *) data, size,
mime);
}

50
src/htssniff.h Normal file
View File

@@ -0,0 +1,50 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 1998-2017 Xavier Roche and other contributors
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Important notes:
- We hereby ask people using this source NOT to use it in purpose of grabbing
emails addresses, or collecting any other private information on persons.
This would disgrace our work, and spoil the many hours we spent on it.
Please visit our Website: http://www.httrack.com
*/
/* ------------------------------------------------------------ */
/* File: MIME magic-byte consistency checks */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#ifndef HTSSNIFF_DEFH
#define HTSSNIFF_DEFH
#include <stddef.h>
#include "htsglobal.h"
/* Leading-body window read to arbitrate a wire/extension MIME conflict. */
#define HTS_SNIFF_LEN 512
/* Can a magic rule ever confirm this MIME? (whether sniffing is worth it) */
hts_boolean hts_sniff_mime_known(const char *mime);
/* TRUE when the leading body bytes are consistent with the claimed MIME;
FALSE on unknown MIME, unknown magic, or too-short data (fail-safe). */
hts_boolean hts_sniff_mime_consistent(const void *data, size_t size,
const char *mime);
#endif

View File

@@ -121,9 +121,6 @@ struct String {
/** Byte at POS (read/write). No bounds check; POS must be < StringLength. **/
#define StringSubRW(BLK, POS) (StringBuffRW(BLK)[POS])
/** Subcharacter (read/write) **/
#define StringSubRW(BLK, POS) (StringBuffRW(BLK)[POS])
/** Byte POS positions from the end (read). POS==1 is the last byte. **/
#define StringRight(BLK, POS) (StringBuff(BLK)[StringLength(BLK) - POS])
@@ -191,8 +188,9 @@ HTS_STATIC char *StringBuffN_(String *blk, int size) {
asserts SIZE fits the existing content; does not (re)allocate. **/
#define StringSetLength(BLK, SIZE) \
do { \
if (SIZE >= 0) { \
(BLK).length_ = SIZE; \
const int len__ = (SIZE); /* signed: negative means strlen(buffer_) */ \
if (len__ >= 0) { \
(BLK).length_ = len__; \
} else { \
(BLK).length_ = strlen((BLK).buffer_); \
} \
@@ -308,10 +306,11 @@ HTS_STATIC void StringAttach(String *blk, char **str) {
#define StringCatN(BLK, STR, SIZE) \
do { \
const char *str__ = (STR); \
const size_t usize__ = (SIZE); \
if (str__ != NULL) { \
size_t size__ = strlen(str__); \
if (size__ > (SIZE)) { \
size__ = (SIZE); \
if (size__ > usize__) { \
size__ = usize__; \
} \
StringMemcat(BLK, str__, size__); \
} \

View File

@@ -80,6 +80,10 @@ htspair_t hts_detect_embed[] = {
{NULL, NULL}
};
/* HTML5 media siblings of <img src>: same near-link treatment (#451) */
static const htspair_t hts_detect_embed_html5[] = {
{"source", "src"}, {"source", "srcset"}, {"track", "src"}, {NULL, NULL}};
/* Internal */
static int hts_acceptlink_(httrackp * opt, int ptr, const char *adr,
const char *fil, const char *tag,
@@ -136,6 +140,17 @@ static int cmp_token(const char *tag, const char *cmp) {
&& !isalnum((unsigned char) tag[p]));
}
/* TRUE if (tag, attribute) matches an embedded-asset pair in the table */
static hts_boolean is_embed_pair(const htspair_t *table, const char *tag,
const char *attribute) {
int i;
for (i = 0; table[i].tag != NULL; i++) {
if (cmp_token(tag, table[i].tag) && cmp_token(attribute, table[i].attr))
return HTS_TRUE;
}
return HTS_FALSE;
}
static int hts_acceptlink_(httrackp * opt, int ptr,
const char *adr, const char *fil, const char *tag,
const char *attribute, int *set_prio_to,
@@ -163,15 +178,9 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
/* Built-in known tags (<img src=..>, ..) */
if (forbidden_url != 0 && opt->nearlink && tag != NULL && attribute != NULL) {
int i;
for(i = 0; hts_detect_embed[i].tag != NULL; i++) {
if (cmp_token(tag, hts_detect_embed[i].tag)
&& cmp_token(attribute, hts_detect_embed[i].attr)
) {
embedded_triggered = 1;
break;
}
if (is_embed_pair(hts_detect_embed, tag, attribute) ||
is_embed_pair(hts_detect_embed_html5, tag, attribute)) {
embedded_triggered = 1;
}
}

View File

@@ -47,48 +47,89 @@ Please visit our Website: http://www.httrack.com
*/
/*
Unpack file into a new file
Unpack file into a new file (gzip, zlib RFC1950 or raw deflate RFC1951).
Return value: size of the new file, or -1 if an error occurred
*/
/* Note: utf-8 */
int hts_zunpack(char *filename, char *newfile) {
int ret = -1;
if (filename != NULL && newfile != NULL) {
if (filename[0] && newfile[0]) {
char catbuff[CATBUFF_SIZE];
FILE *const in = FOPEN(fconv(catbuff, sizeof(catbuff), filename), "rb");
const int fd = in != NULL ? fileno(in) : -1;
const int dup_fd = fd != -1 ? dup(fd) : -1;
// Note: we must dup to be able to flose cleanly.
const gzFile gz = dup_fd != -1 ? gzdopen(dup_fd, "rb") : NULL;
if (filename != NULL && newfile != NULL && filename[0] && newfile[0]) {
char catbuff[CATBUFF_SIZE];
FILE *const in = FOPEN(fconv(catbuff, sizeof(catbuff), filename), "rb");
if (gz) {
FILE *const fpout = FOPEN(fconv(catbuff, sizeof(catbuff), newfile), "wb");
int size = 0;
if (in != NULL) {
unsigned char BIGSTK inbuf[8192];
size_t navail = fread(inbuf, 1, sizeof(inbuf), in);
/* gzip/zlib headers -> +32 windowBits; else raw deflate (RFC1951) */
const hts_boolean wrapped =
(navail >= 2 &&
((inbuf[0] == 0x1f && inbuf[1] == 0x8b) ||
((inbuf[0] & 0x0f) == Z_DEFLATED &&
(((unsigned) inbuf[0] << 8 | inbuf[1]) % 31) == 0)));
int attempt;
if (fpout) {
int nr;
/* deflate is ambiguous; on failure retry with the other windowBits */
for (attempt = 0; attempt < 2 && ret < 0; attempt++) {
const int windowBits =
(attempt == 0 ? wrapped : !wrapped) ? (32 + MAX_WBITS) : -MAX_WBITS;
FILE *fpout;
z_stream strm;
do {
char BIGSTK buff[1024];
nr = gzread(gz, buff, sizeof(buff));
if (nr > 0) {
size += nr;
if (fwrite(buff, 1, nr, fpout) != nr)
nr = size = -1;
}
} while(nr > 0);
if (attempt > 0) {
/* rewind input; reopening fpout "wb" discards the partial output */
if (fseek(in, 0, SEEK_SET) != 0)
break;
navail = fread(inbuf, 1, sizeof(inbuf), in);
}
fpout = FOPEN(fconv(catbuff, sizeof(catbuff), newfile), "wb");
if (fpout == NULL)
break;
memset(&strm, 0, sizeof(strm));
if (inflateInit2(&strm, windowBits) != Z_OK) {
fclose(fpout);
} else
size = -1;
gzclose(gz);
ret = (int) size;
}
if (in != NULL) {
fclose(in);
break;
}
{
hts_boolean ok = HTS_TRUE;
int size = 0;
int zerr = Z_OK;
/* chunked inflate; first chunk in inbuf, single member */
do {
strm.next_in = inbuf;
strm.avail_in = (uInt) navail;
do {
unsigned char BIGSTK outbuf[8192];
size_t produced;
strm.next_out = outbuf;
strm.avail_out = sizeof(outbuf);
zerr = inflate(&strm, Z_NO_FLUSH);
if (zerr == Z_NEED_DICT || zerr == Z_DATA_ERROR ||
zerr == Z_MEM_ERROR || zerr == Z_STREAM_ERROR) {
ok = HTS_FALSE;
break;
}
produced = sizeof(outbuf) - strm.avail_out;
if (produced > 0 &&
fwrite(outbuf, 1, produced, fpout) != produced) {
ok = HTS_FALSE;
break;
}
size += (int) produced;
} while (strm.avail_out == 0);
if (!ok || zerr == Z_STREAM_END)
break;
navail = fread(inbuf, 1, sizeof(inbuf), in);
} while (navail > 0);
if (ok && zerr == Z_STREAM_END)
ret = size;
}
inflateEnd(&strm);
fclose(fpout);
}
fclose(in);
}
}
return ret;

View File

@@ -497,6 +497,12 @@ static const char *GetHttpMessage(int statuscode) {
case 417:
return "Expectation Failed";
break;
case 429:
return "Too Many Requests";
break;
case 451:
return "Unavailable For Legal Reasons";
break;
case 500:
return "Internal Server Error";
break;

View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# HT_ADD_HTMLESCAPED* must reserve the escaper's worst case (6 for _full).
httrack -O /dev/null -#test=escape-room run | grep -q "escape-room self-test OK"

7
tests/01_engine-ftp-line.test Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# get_ftp_line bounds a hostile CRLF-less FTP reply into its 1024-byte buffer.
httrack -O /dev/null -#test=ftp-line run | grep -q "ftp-line self-test OK"

View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# ftp_split_userpass bounds an over-long user:pass@ from a hostile ftp:// URL.
httrack -O /dev/null -#test=ftp-userpass run | grep -q "ftp-userpass self-test OK"

View File

@@ -0,0 +1,29 @@
#!/bin/bash
#
set -euo pipefail
# Response header-line parsing (treathead via -#test=header <raw-line> ...).
# Isolates the wire layer from url_savename, which strips traversal on its own.
hdr() {
local want="$1"
shift
out="$(httrack -O /dev/null -#test=header "$@" | grep '^contenttype=')"
test "$out" == "$want" || {
echo "FAIL: $* -> '$out' (want '$want')"
exit 1
}
}
hdr 'contenttype=application/pdf cdispo=' 'Content-Type: application/pdf'
# filename= is honored quoted or bare.
hdr 'contenttype= cdispo=report.pdf' \
'Content-Disposition: attachment; filename="report.pdf"'
hdr 'contenttype= cdispo=report.pdf' \
'Content-Disposition: attachment; filename=report.pdf'
# Path components in the filename are dropped on the wire (RFC 2616).
hdr 'contenttype= cdispo=evil.pdf' \
'Content-Disposition: attachment; filename="../../evil.pdf"'

View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# inplace_escape_*() must match escape_*() on a copy: guards the shared helper.
httrack -O /dev/null -#test=inplace-escape run | grep -q "inplace-escape self-test OK"

7
tests/01_engine-java.test Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# .class constant-pool count is capped to the file size (calloc DoS).
httrack -O /dev/null -#test=java run | grep -q "java constant-pool cap self-test OK"

12
tests/01_engine-makeindex.test Executable file
View File

@@ -0,0 +1,12 @@
#!/bin/bash
#
set -euo pipefail
# hts_finish_makeindex writes the footer and gates the refresh meta on a single
# first link (guards the macro->function extraction).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
httrack -O /dev/null -#test=makeindex "$dir" run |
grep -q "makeindex self-test OK"

View File

@@ -323,4 +323,33 @@ grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
# HTML5 <source>/<track> follow as embedded near-links past the -r2 depth boundary (#451).
# img.gif positive control; plain.gif (bare <a href>) negative control proves the gate is selective.
site10="$tmp/html5media"
mkdir -p "$site10"
for f in img ss plain; do gif "$site10/$f.gif"; done
printf 'x' >"$site10/v.webm"
printf 'x' >"$site10/subs.vtt"
cat >"$site10/index.html" <<EOF
<html><body><a href="leaf.html">leaf</a></body></html>
EOF
cat >"$site10/leaf.html" <<EOF
<html><body>
<img src="img.gif">
<picture><source srcset="ss.gif 2x"></picture>
<video><source src="v.webm"></video>
<video><track src="subs.vtt"></video>
<a href="plain.gif">plain link past the boundary</a>
</body></html>
EOF
out10="$tmp/html5media-out"
rm -rf "$out10"
mkdir -p "$out10"
httrack "file://$site10/index.html" -O "$out10" --quiet --near -r2 >"$out10/.log" 2>&1 || true
found "img.gif" "$out10"
found "ss.gif" "$out10"
found "v.webm" "$out10"
found "subs.vtt" "$out10"
notfound "plain.gif" "$out10"
exit 0

View File

@@ -0,0 +1,9 @@
#!/bin/bash
#
set -euo pipefail
# #159: a redirect to a same-file alias (http<->https, user@host, ..) must be
# followed through, not turned into a self-pointing "moved" stub. The decision
# helper is exercised by the engine self-test.
httrack -O /dev/null -#test=redirect-samefile run | grep -q "redirect-samefile self-test OK"

7
tests/01_engine-robots.test Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# robots.txt RFC 9309 Allow/Disallow precedence (#452): longest match wins.
httrack -O /dev/null -#test=robots run | grep -q "robots self-test OK"

View File

@@ -3,13 +3,38 @@
set -euo pipefail
# Local save-name extension resolution (url_savename via -#test=savename <fil> <content-type>).
# Asserts on the basename of "savename: <path>".
# Local save-name resolution (url_savename via -#test=savename <fil> <content-type> [key=value ...]).
# name() asserts on the basename, full() on the whole path; prior= registers an
# already-crawled link whose sav is rooted under the -O path (/dev/null here).
# resolve httrack before cd: make check puts a RELATIVE ../src on PATH
httrack_bin=$(cd "$(dirname "$(command -v httrack)")" && pwd)/httrack
# scratch dir: body= and cached= write temp files (st-savename-body.tmp, hts-cache/)
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT
cd "$scratch"
run() {
"$httrack_bin" -O /dev/null -#test=savename "$@" | sed -n 's/^savename: //p'
}
name() {
out="$(httrack -O /dev/null -#test=savename "$1" "$2" | sed -n 's/^savename: //p')"
test "${out##*/}" == "$3" || {
echo "FAIL: '$1' '$2' -> '$out' (want '$3')"
local fil="$1" ctype="$2" want="$3"
shift 3
out="$(run "$fil" "$ctype" "$@")"
test "${out##*/}" == "$want" || {
echo "FAIL: '$fil' '$ctype' $* -> '$out' (want '$want')"
exit 1
}
}
full() {
local fil="$1" ctype="$2" want="$3"
shift 3
out="$(run "$fil" "$ctype" "$@")"
test "$out" == "$want" || {
echo "FAIL: '$fil' '$ctype' $* -> '$out' (want '$want')"
exit 1
}
}
@@ -39,3 +64,95 @@ name '/types/data.json' 'application/json' 'data.json'
# Agreeing type must not rewrite the extension's casing (no strip-and-reappend).
name '/x.JPG' 'image/jpeg' 'x.JPG'
# A Content-Disposition filename replaces the URL name outright.
name '/x.php' 'application/pdf' 'report.pdf' cdispo=report.pdf
name '/download' 'text/html' 'setup.exe' cdispo=setup.exe
# Reserved characters in a hostile Content-Disposition name are sanitized.
name '/x.php' 'application/pdf' 'set_up.exe' 'cdispo=set:up.exe'
# The md5-of-query suffix lands inside a Content-Disposition name too.
name '/x.php?id=1' 'application/pdf' 'report681a.pdf' cdispo=report.pdf
# Still-downloading path (status=-1): mime drives the ext, cdispo is ignored
# there (the deliberately unfolded 4th resolve_extension variant).
name '/x.pdf' 'text/html' 'x.html' status=-1
name '/x.html' 'text/html' 'x.html' status=-1
name '/x.php' 'application/pdf' 'x.pdf' status=-1 cdispo=report.pdf
# Contested type (wire disagrees with a specific ext): magic bytes proving the
# extension right keep it, anything else trusts the wire as before.
name '/photo.jpg' 'image/png' 'photo.jpg' body=hex:FFD8FFE000104A46
name '/photo.jpg' 'image/png' 'photo.png' body=hex:89504E470D0A1A0A
name '/photo.jpg' 'image/png' 'photo.png'
name '/doc.pdf' 'text/html' 'doc.pdf' body=hex:255044462D312E34
name '/doc.pdf' 'text/html' 'doc.html' 'body=<html><body>soft 404</body></html>'
name '/style.css' 'image/png' 'style.png' 'body=body { }' # no rule for css: wire wins
# A redirect answer resolves nothing: delayed placeholder name.
name '/x.php' 'text/html' 'x.0.delayed' statuscode=301
# Root and query-only URLs get index + the md5-of-query suffix.
name '/' 'text/html' 'index.html'
name '/?a=1' 'text/html' 'index3872.html'
# Same URL crawled before: reuse its sav verbatim (case preserved).
full '/X.PHP' 'text/html' 'www.example.com/CASE.HTML' \
'prior=www.example.com|/X.PHP|www.example.com/CASE.HTML'
# Another URL owns the name: collision suffix -2, then -3, case-insensitively.
name '/x.php' 'text/html' 'x-2.html' \
'prior=www.example.com|/other.html|/dev/null/www.example.com/x.html'
name '/x.php' 'text/html' 'x-3.html' \
'prior=www.example.com|/o1.html|/dev/null/www.example.com/x.html' \
'prior=www.example.com|/o2.html|/dev/null/www.example.com/x-2.html'
name '/INDEX.HTML' 'text/html' 'INDEX-2.HTML' \
'prior=www.example.com|/index.html|/dev/null/www.example.com/index.html'
# Same basename in another directory is NOT a collision.
name '/x.php' 'text/html' 'x.html' \
'prior=www.example.com|/sub/x.html|/dev/null/www.example.com/sub/x.html'
# 8-3 modes: DOS truncates every component to 8+3, ISO9660 level 2 to 31.
full '/directory-long/verylongfilename.html' 'text/html' \
'/dev/null/EXAMPLE/DIRECTOR/VERYLONG.HTM' n83=1
full '/directory-long/verylongfilename.html' 'text/html' \
'/dev/null/EXAMPLE_C/DIRECTORY_LONG/VERYLONGFILENAME.HTM' n83=2
name '/verylongfilename.php' 'text/html' 'VERYLO-2.HTM' n83=1 \
'prior=www.example.com|/other.html|/dev/null/EXAMPLE/VERYLONG.HTM'
# urlhack dedup (#271): // collapse and www-strip map to the prior link's sav;
# the per-feature negatives opt out and take a fresh name.
full '/a//b.php' 'text/html' '/dev/null/www.example.com/a/PRIOR.html' \
'prior=www.example.com|/a/b.php|/dev/null/www.example.com/a/PRIOR.html'
full '/a//b.php' 'text/html' '/dev/null/www.example.com/a/b.html' no-slash=1 \
'prior=www.example.com|/a/b.php|/dev/null/www.example.com/a/PRIOR.html'
full '/w.php' 'text/html' '/dev/null/www.example.com/W-PRIOR.html' adr=example.com \
'prior=www.example.com|/w.php|/dev/null/www.example.com/W-PRIOR.html'
full '/w.php' 'text/html' '/dev/null/example.com/w.html' adr=example.com no-www=1 \
'prior=www.example.com|/w.php|/dev/null/www.example.com/W-PRIOR.html'
# Distinct URLs must stay distinct under urlhack (no over-normalization).
full '/a//b.php' 'text/html' '/dev/null/www.example.com/a/b.html' \
'prior=www.example.com|/a/c.php|/dev/null/www.example.com/a/C-PRIOR.html'
# --strip-query (#112): stripped key dedups onto the prior sav; without the
# option the same URLs stay distinct.
full '/page.php?id=3&sid=42' 'text/html' '/dev/null/www.example.com/PAGE-PRIOR.html' \
strip=sid 'prior=www.example.com|/page.php?id=3|/dev/null/www.example.com/PAGE-PRIOR.html'
full '/page.php?id=3&sid=42' 'text/html' '/dev/null/www.example.com/page475b.html' \
'prior=www.example.com|/page.php?id=3|/dev/null/www.example.com/PAGE-PRIOR.html'
# A kept key that differs must still block the dedup (no over-stripping).
full '/page.php?id=3&sid=42' 'text/html' '/dev/null/www.example.com/page475b.html' \
strip=sid 'prior=www.example.com|/page.php?id=4|/dev/null/www.example.com/PAGE-PRIOR.html'
# Hostile fils stay rooted under the mirror: ../ (raw or %2e-encoded) drops out,
# control characters become spaces, oversized names cap at 210 chars (the cap
# can chop the extension off entirely).
full '/../../etc/passwd' 'text/html' '/dev/null/www.example.com///etc/passwd.html'
full '/%2e%2e/%2e%2e/etc/passwd' 'text/html' '/dev/null/www.example.com///etc/passwd.html'
full '/x.php' 'application/pdf' '/dev/null/www.example.com///evil.exe' 'cdispo=../../evil.exe'
name $'/evil\rname\t.php' 'text/html' 'evil name .html'
name "/$(printf 'a%.0s' {1..300}).php" 'text/html' "$(printf 'a%.0s' {1..210})"

View File

@@ -0,0 +1,87 @@
#!/bin/bash
#
set -euo pipefail
# MIME magic consistency (-#test=sniff <content-type> <hex:..|text>), the
# tie-break behind htsname's wire-vs-extension naming.
chk() {
local mime="$1" body="$2" want="$3"
out="$(httrack -#test=sniff "$mime" "$body" | sed -n 's/^sniff: //p')"
test "$out" == "$want" || {
echo "FAIL: '$mime' '$body' -> '$out' (want '$want')"
exit 1
}
}
yes='known=1 consistent=1'
no='known=1 consistent=0'
unk='known=0 consistent=0'
# images
chk image/jpeg hex:FFD8FFE000104A46 "$yes"
chk image/png hex:89504E470D0A1A0A "$yes"
chk image/png hex:FFD8FFE000104A46 "$no" # jpeg bytes are not a png
chk image/gif 'GIF89a' "$yes"
chk image/bmp 'BMxxxx' "$yes"
chk image/tiff hex:49492A00 "$yes"
chk image/tiff hex:4D4D002A "$yes" # both endians
chk image/x-icon hex:00000100 "$yes"
chk image/x-icon hex:00000200 "$yes" # Windows cursor, spec maps to x-icon
chk image/webp 'RIFFxxxxWEBPVP' "$yes"
chk image/webp 'RIFFxxxxWAVE' "$no" # riff subtype discriminates
chk image/avif hex:0000001C6674797061766966 "$yes"
chk image/avif hex:0000001C6674797068656963 "$no" # heic brand is not avif
chk image/heic hex:0000001C6674797068656963 "$yes"
chk image/svg+xml '<svg xmlns="x">' "$yes"
chk image/svg+xml $'\xef\xbb\xbf <?xml version="1.0"?>' "$yes" # BOM+ws skip
# audio / video
chk audio/mpeg 'ID3xxx' "$yes"
chk audio/mpeg hex:FFFB9000 "$yes" # bare frame sync
chk audio/aac hex:FFF15080 "$yes"
chk audio/flac 'fLaC' "$yes"
chk audio/ogg hex:4F67675300 "$yes"
chk audio/x-wav 'RIFFxxxxWAVE' "$yes"
chk video/x-msvideo 'RIFFxxxxAVI ' "$yes"
chk video/x-msvideo 'RIFFxxxxWAVE' "$no"
chk video/mp4 hex:000000186674797069736F6D "$yes"
chk video/webm hex:1A45DFA3 "$yes"
chk video/mpeg hex:000001BA "$yes"
chk video/x-ms-wmv hex:3026B2758E66CF11 "$yes"
# archives; zip magic covers the office-container families
chk application/zip hex:504B0304 "$yes"
chk application/vnd.openxmlformats-officedocument.wordprocessingml.document hex:504B0304 "$yes"
chk application/vnd.oasis.opendocument.text hex:504B0304 "$yes"
chk application/msword hex:D0CF11E0A1B11AE1 "$yes"
chk application/msword hex:504B0304 "$no" # legacy .doc is OLE, not zip
chk application/x-gzip hex:1F8B08 "$yes"
chk application/x-bzip2 'BZh9' "$yes"
chk application/x-7z-compressed hex:377ABCAF271C "$yes"
chk application/x-rar-compressed hex:526172211A07 "$yes"
chk application/zstd hex:28B52FFD "$yes"
chk application/x-tar "hex:$(printf '00%.0s' {1..257})7573746172" "$yes" # ustar at 257
chk application/x-tar hex:7573746172 "$no"
# documents, fonts, misc
chk application/pdf '%PDF-1.7' "$yes"
chk application/pdf '<html><body>soft 404</body></html>' "$no"
chk application/postscript '%!PS-Adobe' "$yes"
chk application/rtf '{\rtf1' "$yes"
chk font/woff2 'wOF2' "$yes"
chk font/otf 'OTTO' "$yes"
chk font/ttf hex:0001000000 "$yes"
chk application/x-shockwave-flash 'CWSx' "$yes"
chk application/x-java-vm hex:CAFEBABE "$yes"
chk application/wasm hex:0061736D "$yes"
chk text/html $' \r\n<!DOCTYPE html><html>' "$yes"
chk text/html '<html lang="en">' "$yes"
chk text/html 'plain text, no markup' "$no"
chk text/xml '<?xml version="1.0"?>' "$yes"
# no magic rule at all: never confirmed, never blocks the wire type
chk text/css 'body { }' "$unk"
chk text/plain 'hello' "$unk"
chk application/x-javascript 'var x;' "$unk"

7
tests/01_engine-status.test Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# HTTP status -> reason phrase, including the modern 429/451 (#453).
httrack -O /dev/null -#test=status run | grep -q "status self-test OK"

View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# Entity/URL unescapers reserve one byte for the trailing NUL (no 1-byte OOB).
httrack -O /dev/null -#test=unescape-bounds run | grep -q "unescape-bounds self-test OK"

7
tests/01_engine-useragent.test Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# Default User-Agent (#449): honest HTTrack token, no Windows 98 relic.
httrack -O /dev/null -#test=useragent run | grep -q "useragent self-test OK"

View File

@@ -0,0 +1,11 @@
#!/bin/bash
#
set -euo pipefail
# Accept-Encoding (#450): advertise gzip+deflate; decode gzip/zlib/raw-deflate.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
httrack -O /dev/null -#test=acceptencoding "$dir" run |
grep -q "acceptencoding self-test OK"

View File

@@ -6,7 +6,7 @@
# Golden cache-format regression test (driven by 'httrack -#test=cache-golden <dir>').
#
# 01_engine-cache.test writes the cache with the same build it reads back (a
# 01_zlib-cache.test writes the cache with the same build it reads back (a
# round-trip), so it cannot catch a read-path or ZIP-format regression where
# writer and reader drift together. This reads a *committed* cache frozen by an
# earlier build and asserts a fixed set of entries still decodes field- and

View File

@@ -0,0 +1,33 @@
#!/bin/bash
#
set -euo pipefail
# Update-run naming from a real cache entry (-#test=savename cached=<ctype>|<save>).
# Named 01_zlib-*: the cache writer needs zlib, which the MSan job can't run.
# resolve httrack before cd: make check puts a RELATIVE ../src on PATH
httrack_bin=$(cd "$(dirname "$(command -v httrack)")" && pwd)/httrack
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT
cd "$scratch"
name() {
local fil="$1" ctype="$2" want="$3"
shift 3
out="$("$httrack_bin" -O /dev/null -#test=savename "$fil" "$ctype" "$@" | sed -n 's/^savename: //p')"
test "${out##*/}" == "$want" || {
echo "FAIL: '$fil' '$ctype' $* -> '$out' (want '$want')"
exit 1
}
}
# No live bytes: the recorded save name (X-Save) reproduces the previous
# verdict; cached body bytes (PNG magic) are ignored; css has no magic rule.
name '/photo.jpg' 'image/png' 'photo.jpg' 'cached=image/png|www.example.com/photo.jpg'
name '/photo.jpg' 'image/png' 'photo.png' 'cached=image/png|www.example.com/photo.png'
name '/photo.jpg' 'image/jpeg' 'photo.jpg' 'cached=image/jpeg|www.example.com/photo.png'
name '/style.css' 'image/png' 'style.css' 'cached=image/png|www.example.com/style.css'
# agreement keeps the URL ext verbatim (.jpeg), never canonicalized to .jpg
name '/photo.jpeg' 'image/jpeg' 'photo.jpeg' 'cached=image/jpeg|www.example.com/photo.jpeg'

View File

@@ -1,11 +1,10 @@
#!/bin/bash
#
# Content-Type vs URL-extension naming (issue #267 family) under the default
# delayed type check (-%N2). Policy: a MISSING Content-Type must not clobber a
# URL extension that maps to a specific non-HTML type (.png/.pdf stay as-is);
# an explicitly DECLARED type is trusted, so a binary-looking URL that really
# serves HTML (text/html on .pdf/.jpg) is named .html. The "wrong" names are
# asserted absent so a regression in either direction fails here.
# Content-Type vs URL-extension naming (#267 family, default -%N2). A MISSING
# type keeps a specific non-HTML ext; a DECLARED disagreeing type is trusted
# unless magic bytes prove the ext right (lie/wrongtype/packed keep theirs),
# so a real HTML body (report.pdf) still becomes .html. Wrong names are
# asserted absent so a regression in either direction fails.
: "${top_srcdir:=..}"
@@ -14,7 +13,11 @@ bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/notype.pdf' --not-found 'types/notype.html' \
--found 'types/photo.png' \
--found 'types/doc.pdf' \
--found 'types/lie.html' --not-found 'types/lie.png' \
--found 'types/lie.png' --not-found 'types/lie.html' \
--found 'types/wrongtype.jpg' --not-found 'types/wrongtype.png' \
--found 'types/bigtype.jpg' --not-found 'types/bigtype.png' \
--found 'types/mutant.jpg' --not-found 'types/mutant.png' \
--found 'types/packed.jpg' --not-found 'types/packed.png' \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/page.htm' --not-found 'types/page.html' \
--found 'types/script.js' \

View File

@@ -1,15 +1,18 @@
#!/bin/bash
#
# A second (update) pass must keep the names the first crawl chose. The stored
# Content-Type rides the cache, so the update reads back the same value -- the
# unknown/unknown sentinel for a typeless response, the declared type otherwise
# -- and names consistently: a declared-text/html .pdf stays .html and a
# typeless .png stays .png across the update rather than reverting.
# An update pass keeps the names the first crawl chose: type and save name
# ride the cache, so a declared-text/html .pdf stays .html, a typeless .png
# stays .png, and a sniff-kept ext is reproduced from X-Save even when the
# refetched content changed (mutant.jpg serves PNG bytes on the rerun).
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/notype.png' --not-found 'types/notype.html' \
--found 'types/lie.html' \
--found 'types/lie.png' --not-found 'types/lie.html' \
--found 'types/wrongtype.jpg' --not-found 'types/wrongtype.png' \
--found 'types/bigtype.jpg' --not-found 'types/bigtype.png' \
--found 'types/packed.jpg' --not-found 'types/packed.png' \
--found 'types/mutant.jpg' --not-found 'types/mutant.png' \
httrack 'BASEURL/types/index.html'

View File

@@ -20,6 +20,14 @@ if ! command -v python3 >/dev/null 2>&1; then
echo "python3 missing, skipping"
exit 77
fi
# The fixture needs a second loopback IP (dead 127.0.0.2 + live 127.0.0.1) for
# the fallback to have a target; GNU/Hurd has only 127.0.0.1, so skip there.
case "$(uname -s)" in
GNU | GNU/*)
echo "GNU/Hurd: single loopback IP, connect-fallback fixture unbuildable, skipping"
exit 77
;;
esac
server="$top_srcdir/tests/local-server.py"
root="$top_srcdir/tests/server-root"

View File

@@ -9,6 +9,13 @@ set -e
: "${top_srcdir:=..}"
# python3 runs the local server (mirror local-crawl.sh); skip when absent, else
# run() swallows its exit-77 and the serverless 0s/0s crawl looks like a fail.
command -v python3 >/dev/null || {
echo "python3 not found; skipping local crawl tests"
exit 77
}
run() { # echoes the wall-clock seconds of one crawl
local t0 t1
t0=$(date +%s)

View File

@@ -0,0 +1,13 @@
#!/bin/bash
# Issue #279: an anchored link (target.html#sec, quoted or bare) fetches the
# target with the fragment dropped (strict server 400s on a '#' in the request)
# but keeps it in the rewritten local link so the anchor still works.
set -e
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'fraglink/target.html' \
--file-matches 'fraglink/index.html' 'href=target\.html#sec' \
--file-matches 'fraglink/index.html' 'href="target\.html#sec2"' \
httrack 'BASEURL/fraglink/index.html'

View File

@@ -0,0 +1,23 @@
#!/bin/bash
# The java plugin must load (versioned dlopen name) and parse a .class
# constant pool: a resource named only inside Foo.class gets crawled.
set -e
: "${top_srcdir:=..}"
tmproot=$(mktemp -d)
trap 'rm -rf "$tmproot"' EXIT
mkdir "$tmproot/javaclass"
cat >"$tmproot/javaclass/index.html" <<'EOF'
<html><body><a href="Foo.class">applet</a></body></html>
EOF
printf 'GIF89a' >"$tmproot/javaclass/hello.gif"
# magic/minor/major, count=2, one CONSTANT_Utf8 "hello.gif", class/superclass
printf '\xCA\xFE\xBA\xBE\x00\x00\x00\x32\x00\x02\x01\x00\x09hello.gif\x00\x00\x00\x00' \
>"$tmproot/javaclass/Foo.class"
bash "$top_srcdir/tests/local-crawl.sh" --root "$tmproot" --errors 0 \
--found 'javaclass/Foo.class' \
--found 'javaclass/hello.gif' \
httrack 'BASEURL/javaclass/index.html'

View File

@@ -0,0 +1,17 @@
#!/bin/bash
#
# Content-Disposition names the saved file: the attachment filename replaces
# the URL-derived name, and a traversal filename is reduced to its last
# component, inside the mirror.
set -euo pipefail
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'cdispo/report.pdf' \
--file-matches 'cdispo/report.pdf' '%PDF' \
--not-found 'cdispo/fetch.pdf' \
--found 'cdispo/evil.pdf' \
--not-found 'evil.pdf' \
httrack 'BASEURL/cdispo/index.html'

View File

@@ -0,0 +1,20 @@
#!/bin/bash
#
# Degenerate delayed-type paths (#5/#107 family): redirects that never resolve
# a name must drop cleanly -- no .delayed leftovers (audited by local-crawl.sh),
# no "bogus state" cache warnings, resolvable links still land correctly.
set -euo pipefail
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --rerun --errors 0 \
--found 'delayed/real.pdf' \
--file-matches 'delayed/real.pdf' '%PDF' \
--found 'delayed/notype.bin.html' \
--found 'delayed/empty.html' \
--not-found 'delayed/noloc.html' \
--not-found 'delayed/selfloop.html' \
--not-found 'delayed/chain9.pdf' \
--log-not-found 'bogus state' \
httrack 'BASEURL/delayed/index.html'

View File

@@ -0,0 +1,21 @@
#!/bin/bash
#
# -E time limit (#481): server pages trickle for minutes; the engine must stop
# on its own at -E plus grace, aborting the in-flight transfers.
set -euo pipefail
: "${top_srcdir:=..}"
# cancelled crawls can orphan .delayed placeholders (#483): skip that audit
start=$(date +%s)
bash "$top_srcdir/tests/local-crawl.sh" \
--skip-delayed-audit \
--log-found 'More than 2 seconds passed' \
httrack 'BASEURL/trickle/index.html' -E2 -c4
wall=$(($(date +%s) - start))
# hard stop is due at -E2 + 5s grace; near TRICKLE_SECONDS means it never fired
if [ "$wall" -ge 30 ]; then
echo "crawl took ${wall}s, -E hard stop did not engage" >&2
exit 1
fi

View File

@@ -0,0 +1,13 @@
#!/bin/bash
#
# -M byte cap (#77): the crawl must stop with the "giving up" error and keep
# the mirror well under the 8 x 640KB the fixture totals uncapped.
set -euo pipefail
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" \
--log-found 'More than 400000 bytes have been transferred.. giving up' \
--max-mirror-bytes 4000000 \
httrack 'BASEURL/bigfiles/index.html' -M400000 -c4

View File

@@ -1,4 +1,4 @@
# Committed binary fixture read by 01_engine-cache-golden.test. List it
# Committed binary fixture read by 01_zlib-cache-golden.test. List it
# explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would
# silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
@@ -6,6 +6,7 @@ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
local-crawl.sh local-server.py server.crt server.key \
server-root/simple/basic.html server-root/simple/link.html \
server-root/stripquery/index.html server-root/stripquery/a.html \
server-root/fraglink/index.html server-root/fraglink/target.html \
fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT =
@@ -25,9 +26,6 @@ TEST_EXTENSIONS = .test
TEST_LOG_COMPILER = $(BASH)
TESTS = \
00_runnable.test \
01_engine-cache.test \
01_engine-cache-golden.test \
01_engine-cache-writefail.test \
01_engine-charset.test \
01_engine-cmdline.test \
01_engine-cookies.test \
@@ -37,19 +35,37 @@ TESTS = \
01_engine-entities.test \
01_engine-filelist.test \
01_engine-filter.test \
01_engine-ftp-line.test \
01_engine-ftp-userpass.test \
01_engine-hashtable.test \
01_engine-header.test \
01_engine-idna.test \
01_engine-escape-room.test \
01_engine-inplace-escape.test \
01_engine-java.test \
01_engine-makeindex.test \
01_engine-mime.test \
01_engine-parse.test \
01_engine-pause.test \
01_engine-rcfile.test \
01_engine-redirect.test \
01_engine-relative.test \
01_engine-robots.test \
01_engine-savename.test \
01_engine-selftest-dispatch.test \
01_engine-simplify.test \
01_engine-sniff.test \
01_engine-status.test \
01_engine-stripquery.test \
01_engine-strsafe.test \
01_engine-urlhack.test \
01_engine-unescape-bounds.test \
01_engine-useragent.test \
01_zlib-acceptencoding.test \
01_zlib-cache.test \
01_zlib-cache-golden.test \
01_zlib-cache-writefail.test \
01_zlib-savename-cached.test \
02_manpage-regen.test \
02_update-cache.test \
10_crawl-simple.test \
@@ -76,6 +92,12 @@ TESTS = \
26_local-strip-query.test \
27_local-cookies-file.test \
28_local-pause.test \
29_local-redirect-fragment.test
29_local-redirect-fragment.test \
30_local-fragment-link.test \
31_local-javaclass.test \
32_local-cdispo.test \
33_local-delayed.test \
34_local-maxtime.test \
35_local-maxsize.test
CLEANFILES = check-network_sh.cache

Binary file not shown.

View File

@@ -15,8 +15,13 @@
# bash local-crawl.sh [--tls] [--root DIR] [--cookie NAME=VALUE ...] \
# --errors N --files N --found PATH ... --directory PATH ... \
# --log-found REGEX ... --log-not-found REGEX ... \
# --file-matches PATH REGEX ... --file-not-matches PATH REGEX ... \
# --max-mirror-bytes N \
# httrack BASEURL/some/path [httrack-args...]
# --log-found/--log-not-found grep (ERE) the crawl's hts-log.txt.
# --max-mirror-bytes asserts the mirrored content (host root) stays under N.
# --file-matches/--file-not-matches grep (ERE) a mirrored file (PATH under the
# host root), to assert rewritten link/content survived the crawl.
# --cookie writes a Netscape cookies.txt (scoped to the discovered host:port,
# which the ephemeral port forces into the cookie domain) and passes it to
# httrack via --cookies-file, to exercise preloaded cookies.
@@ -89,6 +94,7 @@ tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create
# --- parse leading control flags --------------------------------------------
declare -a audit=()
declare -a cookies=()
skip_delayed_audit=""
scheme=http
pos=0
args=("$@")
@@ -113,14 +119,21 @@ while test "$pos" -lt "$nargs"; do
pos=$((pos + 1))
cookies+=("${args[$pos]}")
;;
--skip-delayed-audit)
skip_delayed_audit=1
;;
--errors | --files)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
--found | --not-found | --directory | --log-found | --log-not-found)
--found | --not-found | --directory | --log-found | --log-not-found | --max-mirror-bytes)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
--file-matches | --file-not-matches)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}" "${args[$((pos + 2))]}")
pos=$((pos + 2))
;;
httrack)
pos=$((pos + 1))
break
@@ -239,6 +252,17 @@ done
test -n "$hostroot" || die "could not find host root under $out"
debug "host root: $hostroot"
# A completed crawl must leave no .delayed temporaries (issue #107).
# --skip-delayed-audit: a cancelled crawl can orphan placeholders (issue #483)
if test -z "$skip_delayed_audit"; then
info "checking for leftover .delayed files"
leftovers=$(find "$out" -name '*.delayed' 2>/dev/null | head -5)
if test -z "$leftovers"; then result "OK"; else
result "leftover: $leftovers"
exit 1
fi
fi
# --- audit -------------------------------------------------------------------
i=0
while test "$i" -lt "${#audit[@]}"; do
@@ -294,6 +318,33 @@ while test "$i" -lt "${#audit[@]}"; do
exit 1
else result "OK"; fi
;;
--max-mirror-bytes)
i=$((i + 1))
sz=$(find "$hostroot" -type f -exec cat {} + | wc -c | tr -d '[:space:]')
info "checking mirror size ${sz} <= ${audit[$i]} bytes"
if test "$sz" -le "${audit[$i]}"; then result "OK"; else
result "mirror too big"
exit 1
fi
;;
--file-matches)
path="${audit[$((i + 1))]}"
i=$((i + 2))
info "checking ${path} matches ${audit[$i]}"
if grep -aqE "${audit[$i]}" "${hostroot}/${path}"; then result "OK"; else
result "no match"
exit 1
fi
;;
--file-not-matches)
path="${audit[$((i + 1))]}"
i=$((i + 2))
info "checking ${path} lacks ${audit[$i]}"
if grep -aqE "${audit[$i]}" "${hostroot}/${path}"; then
result "matched"
exit 1
else result "OK"; fi
;;
esac
i=$((i + 1))
done

View File

@@ -14,6 +14,7 @@ stdlib only (http.server + ssl) -- no new build or runtime dependency.
"""
import argparse
import gzip
import os
import time
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
@@ -134,12 +135,14 @@ class Handler(SimpleHTTPRequestHandler):
# --- type/extension matrix (issue #267 family) -------------------------
def send_raw(self, body, content_type):
def send_raw(self, body, content_type, extra_headers=()):
"""Send a raw body with an explicit Content-Type, or none at all when
content_type is None (to observe httrack's typeless-file naming)."""
self.send_response(200)
if content_type is not None:
self.send_header("Content-Type", content_type)
for name, value in extra_headers:
self.send_header(name, value)
self.send_header("Content-Length", str(len(body)))
self.end_headers()
if self.command != "HEAD":
@@ -148,6 +151,8 @@ class Handler(SimpleHTTPRequestHandler):
# Fake-binary blobs for the image/pdf/typeless cases.
FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64
FAKE_PDF = b"%PDF-1.4\n" + b"\x00" * 64
FAKE_JPEG = b"\xff\xd8\xff\xe0" + b"\x00" * 64
BIG_JPEG = b"\xff\xd8\xff\xe0" + bytes(range(256)) * 64 # > sniff window
# path -> (body, content_type); None sends no header, "" sends an empty
# Content-Type value (no usable type, must be treated like None).
@@ -159,6 +164,8 @@ class Handler(SimpleHTTPRequestHandler):
"/types/notype.pdf": (FAKE_PDF, None),
"/types/emptyct.png": (FAKE_PNG, ""),
"/types/lie.png": (FAKE_PNG, "text/html"),
"/types/wrongtype.jpg": (FAKE_JPEG, "image/png"),
"/types/bigtype.jpg": (BIG_JPEG, "image/png"),
"/types/report.pdf": (b"<html><body>real page</body></html>", "text/html"),
"/types/page.htm": (b"<html><body>htm page</body></html>", "text/html"),
"/types/script.js": (b"var x = 1;\n", "application/javascript"),
@@ -176,6 +183,10 @@ class Handler(SimpleHTTPRequestHandler):
'\t<a href="notype.pdf">notypepdf</a>\n'
'\t<img src="emptyct.png" />\n'
'\t<img src="lie.png" />\n'
'\t<img src="wrongtype.jpg" />\n'
'\t<img src="bigtype.jpg" />\n'
'\t<img src="mutant.jpg" />\n'
'\t<img src="packed.jpg" />\n'
'\t<a href="report.pdf">report</a>\n'
'\t<a href="page.htm">htm</a>\n'
'\t<script src="script.js"></script>\n'
@@ -190,6 +201,25 @@ class Handler(SimpleHTTPRequestHandler):
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
# content changes between crawls: run 1 sniffs JPEG, the update pass must
# keep the run-1 name (recorded verdict) even though the body is now PNG
MUTANT_SEEN = set()
def route_types_mutant(self):
path = urlsplit(self.path).path
body = self.FAKE_PNG if path in self.MUTANT_SEEN else self.FAKE_JPEG
if self.command != "HEAD":
self.MUTANT_SEEN.add(path)
self.send_raw(body, "image/png")
# gzip on the wire: the sniff must see the decoded body, not the stream
def route_types_packed(self):
self.send_raw(
gzip.compress(self.FAKE_JPEG),
"image/png",
extra_headers=[("Content-Encoding", "gzip")],
)
# --- MIME-type exclusion abort (issue #58) -----------------------------
# A -mime:application/pdf filter must abort the transfer once the header
# arrives, not download the whole body and discard it.
@@ -354,6 +384,27 @@ class Handler(SimpleHTTPRequestHandler):
if self.command != "HEAD":
self.wfile.write(body)
# Content-Disposition naming: the attachment filename replaces the
# URL-derived name; path components in it are stripped (RFC 2616).
CDISPO_NAMES = {
"/cdispo/fetch.php": "report.pdf",
"/cdispo/evil.php": "../../evil.pdf",
}
def route_cdispo_index(self):
self.send_html(
'\t<a href="fetch.php">report</a>\n' '\t<a href="evil.php">evil</a>\n'
)
def route_cdispo(self):
filename = self.CDISPO_NAMES[urlsplit(self.path).path]
cdispo = 'attachment; filename="%s"' % filename
self.send_raw(
self.FAKE_PDF,
"application/pdf",
extra_headers=[("Content-Disposition", cdispo)],
)
# 302 whose Location carries a #fragment (#204): the fragment is a UA anchor
# that must be dropped before the target is fetched. A leaked '#' reaches the
# strict-server guard below and 400s.
@@ -369,6 +420,85 @@ class Handler(SimpleHTTPRequestHandler):
def route_redir_target(self):
self.send_raw(b"<html><body>redirect target</body></html>\n", "text/html")
# --- delayed-type degenerate paths (issues #5/#107) --------------------
def route_delayed_index(self):
self.send_html(
'\t<a href="noloc.php">noloc</a>\n'
'\t<a href="selfloop.php">selfloop</a>\n'
'\t<a href="chain1.php">chain</a>\n'
'\t<a href="redir.php">redir</a>\n'
'\t<a href="notype.bin">notype</a>\n'
'\t<a href="empty.php">empty</a>\n'
)
def send_redirect(self, location):
self.send_response(302, "Found")
if location is not None:
self.send_header("Location", location)
self.send_header("Content-Length", "0")
self.end_headers()
def route_delayed_noloc(self):
self.send_redirect(None) # 302 without Location: name never resolves
def route_delayed_selfloop(self):
self.send_redirect("selfloop.php")
def route_delayed_chain(self):
# chain1..chain9: one more hop than the type-check redirect budget
n = int(urlsplit(self.path).path.rsplit("chain", 1)[1].split(".")[0])
if n < 9:
self.send_redirect("chain%d.php" % (n + 1))
else:
self.send_raw(self.FAKE_PDF, "application/pdf")
def route_delayed_redir(self):
self.send_redirect("real.pdf")
def route_delayed_realpdf(self):
self.send_raw(self.FAKE_PDF, "application/pdf")
def route_delayed_notype(self):
self.send_raw(self.FAKE_PDF, None)
def route_delayed_empty(self):
self.send_raw(b"", "text/html") # 200 + Content-Length: 0
# -E time-limit (#481): pages that trickle far longer than any -E budget,
# so only an engine-side abort can end the crawl.
TRICKLE_SECONDS = 60
def route_trickle_index(self):
self.send_html(
"".join('\t<a href="p%d.bin">p%d</a>\n' % (i, i) for i in range(8))
)
def route_trickle_page(self):
self.send_response(200)
self.send_header("Content-Type", "application/octet-stream")
self.send_header("Content-Length", str(2 * self.TRICKLE_SECONDS))
self.end_headers()
if self.command == "HEAD":
return
try:
for _ in range(self.TRICKLE_SECONDS):
self.wfile.write(b"xy")
self.wfile.flush()
time.sleep(1.0)
except OSError:
pass
# -M byte cap (#77): large fast files so a crawl overruns -M immediately.
BIGFILE_BYTES = 640 * 1024
def route_bigfiles_index(self):
self.send_html(
"".join('\t<a href="p%d.bin">p%d</a>\n' % (i, i) for i in range(8))
)
def route_bigfile(self):
self.send_raw(b"x" * self.BIGFILE_BYTES, "application/octet-stream")
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
@@ -384,6 +514,10 @@ class Handler(SimpleHTTPRequestHandler):
"/types/notype.pdf": route_types,
"/types/emptyct.png": route_types,
"/types/lie.png": route_types,
"/types/wrongtype.jpg": route_types,
"/types/bigtype.jpg": route_types,
"/types/mutant.jpg": route_types_mutant,
"/types/packed.jpg": route_types_packed,
"/types/report.pdf": route_types,
"/types/page.htm": route_types,
"/types/script.js": route_types,
@@ -406,6 +540,43 @@ class Handler(SimpleHTTPRequestHandler):
"/mimex/index.html": route_mimex_index,
"/mimex/blob.pdf": route_mimex_blob,
"/mimex/real.html": route_mimex_real,
"/cdispo/index.html": route_cdispo_index,
"/cdispo/fetch.php": route_cdispo,
"/cdispo/evil.php": route_cdispo,
"/delayed/index.html": route_delayed_index,
"/trickle/index.html": route_trickle_index,
"/trickle/p0.bin": route_trickle_page,
"/trickle/p1.bin": route_trickle_page,
"/trickle/p2.bin": route_trickle_page,
"/trickle/p3.bin": route_trickle_page,
"/trickle/p4.bin": route_trickle_page,
"/trickle/p5.bin": route_trickle_page,
"/trickle/p6.bin": route_trickle_page,
"/trickle/p7.bin": route_trickle_page,
"/bigfiles/index.html": route_bigfiles_index,
"/bigfiles/p0.bin": route_bigfile,
"/bigfiles/p1.bin": route_bigfile,
"/bigfiles/p2.bin": route_bigfile,
"/bigfiles/p3.bin": route_bigfile,
"/bigfiles/p4.bin": route_bigfile,
"/bigfiles/p5.bin": route_bigfile,
"/bigfiles/p6.bin": route_bigfile,
"/bigfiles/p7.bin": route_bigfile,
"/delayed/noloc.php": route_delayed_noloc,
"/delayed/selfloop.php": route_delayed_selfloop,
"/delayed/redir.php": route_delayed_redir,
"/delayed/real.pdf": route_delayed_realpdf,
"/delayed/notype.bin": route_delayed_notype,
"/delayed/empty.php": route_delayed_empty,
"/delayed/chain1.php": route_delayed_chain,
"/delayed/chain2.php": route_delayed_chain,
"/delayed/chain3.php": route_delayed_chain,
"/delayed/chain4.php": route_delayed_chain,
"/delayed/chain5.php": route_delayed_chain,
"/delayed/chain6.php": route_delayed_chain,
"/delayed/chain7.php": route_delayed_chain,
"/delayed/chain8.php": route_delayed_chain,
"/delayed/chain9.php": route_delayed_chain,
"/redir/index.html": route_redir_index,
"/redir/go.php": route_redir_go,
"/redir/target.html": route_redir_target,

View File

@@ -0,0 +1,4 @@
<html><body>
<a href=target.html#sec>unquoted fragment link</a>
<a href="target.html#sec2">quoted fragment link</a>
</body></html>

View File

@@ -0,0 +1 @@
<html><body><a name="sec"></a><a name="sec2"></a>target</body></html>