mirror of
https://github.com/xroche/httrack.git
synced 2026-06-28 21:17:57 +03:00
-%u (--urlhack) bundled three dedup normalizations under one switch:
www.host == host, redundant // collapse, and query-argument reordering.
A mirror that needed one but not another (e.g. keep www. distinct) had to
turn the whole umbrella off.

Add three opt-out sub-options, defaulting to the umbrella so existing
-%u/-%u0 behavior is unchanged:

  --keep-www-prefix      keep www.foo.com distinct from foo.com (-%j)
  --keep-double-slashes  keep redundant // in the path (-%o)
  --keep-query-order     keep query-argument order significant (-%y)

The split is resolved once in hash_init() into
norm_host/norm_slash/norm_query and threaded through the dedup hash
(htshash.c), the savename lookup key (htsname.c) and the redirect-loop
diagnostic (htsparse.c) so all three stay consistent. fil_normalized()
gains an internal fil_normalized_ex(do_slash, do_query) core; the public
fil_normalized()/fil_normalized_filtered() keep their signatures.

Normalization (slash/query) now follows urlhack and its sub-flags
uniformly, while --strip-query stays orthogonal. So with urlhack off,
strip-query strips keys without sorting the remainder; the url_savename
urlhack-off branch is moved to the same do_slash=0/do_query=0 normalizer
the hash uses, so a URL is always looked up under the key it was stored
with (a self-lookup mismatch this otherwise introduced).

http/https are always merged in the dedup key (the scheme is stripped
regardless of -%u), so that part of the request needs no toggle.

The opt-outs are spelled positively (--keep-*) because httrack's generic
--no<opt> prefix only appends the disabling "0" for parametered options,
not "single" booleans, so --nowww-dedup would silently no-op.

opt grows three hts_boolean fields appended at the struct tail (offsets
stable, no soname bump, matching the strip_query addition in #112).

Tested by a -#test=urlhack engine self-test (hash_url_equals over each
flag combination) plus a -%u0 + --strip-query crawl case exercising the
urlhack-off savename branch.

Closes #271

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
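
The flag resolution and the three normalizations described in the commit message can be illustrated with a small standalone sketch. This is not httrack's code: the demo_opts/demo_norm types and the demo_resolve()/demo_dedup_key() helpers are hypothetical names invented for illustration, and only the norm_host/norm_slash/norm_query names come from the message. It assumes each sub-flag is simply "umbrella on and opt-out not set", and that the scheme is always stripped from the key.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical option bundle; field names are illustrative only. */
typedef struct {
  int urlhack;           /* -%u umbrella switch */
  int keep_www;          /* --keep-www-prefix */
  int keep_double_slash; /* --keep-double-slashes */
  int keep_query_order;  /* --keep-query-order */
} demo_opts;

/* Resolved per-normalization switches (names taken from the commit message). */
typedef struct {
  int norm_host, norm_slash, norm_query;
} demo_norm;

/* Assumption: a normalization is active when the umbrella is on
   and its opt-out is not set. */
static demo_norm demo_resolve(const demo_opts *o) {
  demo_norm n;
  n.norm_host = o->urlhack && !o->keep_www;
  n.norm_slash = o->urlhack && !o->keep_double_slash;
  n.norm_query = o->urlhack && !o->keep_query_order;
  return n;
}

static int cmp_str(const void *a, const void *b) {
  return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Build a dedup key for url into out. Returns 0 on success, -1 if a
   buffer is too small. Uses strtok(), so it is not reentrant. */
static int demo_dedup_key(const char *url, demo_norm n, char *out, size_t outsz) {
  size_t len = 0;
  const char *p = url;
  const char *q;

  assert(url != NULL && out != NULL);
  if (outsz == 0)
    return -1;

  /* The scheme is always dropped, so http and https share one key. */
  if (strncmp(p, "http://", 7) == 0)
    p += 7;
  else if (strncmp(p, "https://", 8) == 0)
    p += 8;

  /* www.host == host */
  if (n.norm_host && strncmp(p, "www.", 4) == 0)
    p += 4;

  /* Copy host and path up to the query, collapsing // when requested. */
  q = strchr(p, '?');
  for (; *p != '\0' && p != q; p++) {
    if (n.norm_slash && *p == '/' && len > 0 && out[len - 1] == '/')
      continue;
    if (len + 1 >= outsz)
      return -1;
    out[len++] = *p;
  }
  out[len] = '\0';

  if (q != NULL) {
    char buf[1024];
    char *args[64];
    size_t nargs = 0, i;
    size_t qlen = strlen(q + 1);
    char *tok;

    if (qlen >= sizeof(buf))
      return -1;
    memcpy(buf, q + 1, qlen + 1);
    if (n.norm_query) {
      /* Split on '&' and sort so argument order stops mattering. */
      for (tok = strtok(buf, "&"); tok != NULL; tok = strtok(NULL, "&")) {
        if (nargs == 64)
          return -1;
        args[nargs++] = tok;
      }
      qsort(args, nargs, sizeof(args[0]), cmp_str);
    } else {
      args[nargs++] = buf; /* keep the query as one untouched piece */
    }
    for (i = 0; i < nargs; i++) {
      size_t alen = strlen(args[i]);
      if (len + 1 + alen >= outsz)
        return -1;
      out[len++] = (i == 0) ? '?' : '&';
      memcpy(out + len, args[i], alen);
      len += alen;
    }
    out[len] = '\0';
  }
  return 0;
}
```

Under these assumptions, with only the umbrella set, http://www.foo.com//a?b=2&a=1 and https://foo.com/a?a=1&b=2 both map to the key foo.com/a?a=1&b=2. Setting keep_www alone leaves the www. prefix in the first key while the slash and query normalizations still apply, which is the independence the sub-options are meant to provide.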
9 lines
265 B
Bash
#!/bin/bash
#

set -euo pipefail

# -%u url-hack split (#271): www / // / query-order dedup toggle independently.
# All assertions live in the engine self-test (hash compare flag resolution).
httrack -O /dev/null -#test=urlhack run | grep -q "urlhack self-test OK"