httrack

mirror of https://github.com/xroche/httrack.git synced 2026-06-30 05:55:46 +03:00

Files

Xavier Roche 40a66600ff Add --strip-query to drop query keys from dedup naming (#112) (#434)

Two URLs that differ only in tracking or session query parameters
(?utm_source=x versus ?utm_source=y) were saved as separate files, and a
single CGI could fan out into thousands of near-duplicate pages.
fil_normalized already sorted query args, so URLs that differ only in
parameter order dedup, but there was no way to drop a named key.

--strip-query "[host/pattern=]key1,key2,..." (repeatable) removes the
listed keys when computing the dedup key and the saved name. The fetched
URL is untouched, so a required sid= is still sent on the wire; only the
local namespace collapses. Patterns match the normalized host/path with
the +/- filter glob (strjoker), and the last match wins, as in the filter
list. Stripping is decoupled from urlhack (-%u), so it never silently
no-ops with -%u0.
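
The rule semantics described above can be illustrated with a short Python
sketch. This is not HTTrack's C code: parse_rule, keys_to_strip and
dedup_key are hypothetical names, fnmatchcase only stands in for the
strjoker glob (its matching rules differ), and "last match wins" is read
here as the last matching rule replacing earlier ones rather than merging
with them.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def parse_rule(spec):
    """Split '[pattern=]key1,key2' into (pattern, keys).

    Assumption: a spec without '=' is unscoped and applies to every URL.
    """
    pattern, sep, keys = spec.rpartition("=")
    if not sep:
        pattern = "*"
    return pattern, frozenset(k for k in keys.split(",") if k)


def keys_to_strip(rules, url):
    """Return the key set of the LAST rule whose pattern matches host/path."""
    parts = urlsplit(url)
    target = parts.netloc.lower() + parts.path
    chosen = frozenset()
    for pattern, keys in rules:
        if fnmatchcase(target, pattern):  # stand-in for the strjoker glob
            chosen = keys                 # later matches override earlier ones
    return chosen


def dedup_key(rules, url):
    """Build a dedup key: drop the listed query keys, then sort the rest.

    Only the local key changes; the URL that would be fetched is untouched.
    """
    parts = urlsplit(url)
    strip = keys_to_strip(rules, url)
    fields = [f for f in parts.query.split("&")
              if f.split("=", 1)[0] not in strip]
    query = "&".join(sorted(fields))
    return parts.netloc.lower() + parts.path + ("?" + query if query else "")


# Two rules, as if --strip-query had been given twice (hypothetical hosts).
RULES = [
    parse_rule("utm_source,utm_medium"),
    parse_rule("shop.example.com/*=sid"),
]

a = dedup_key(RULES, "https://example.com/page?utm_source=x&id=1")
b = dedup_key(RULES, "https://example.com/page?id=1&utm_source=y")
print(a == b)  # True: both collapse to example.com/page?id=1
```

Under this reading, a URL on shop.example.com matches both rules, so only
sid is stripped there and utm_source is kept.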

It all funnels through one chokepoint, fil_normalized: an internal
fil_normalized_filtered() strips then delegates, and hts_query_strip_keys
resolves the per-URL key list. The strip pass walks every query field,
including empty and trailing ones, so its output is a fixpoint under the
read path's second normalization (otherwise dedup silently misses).
Exported ABI is unchanged; the strip_query field is appended at the tail
of httrackp. Covered by a -#test=stripquery self-test (degenerate queries
like a=&b&c== and a 50-case idempotency fixpoint) and an end-to-end dedup
crawl test.
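
The fixpoint requirement can be stated as a property: stripping an
already-stripped query must change nothing. Below is a minimal Python
sketch of that check, not the actual -#test=stripquery code. strip_query
is a hypothetical helper, and keeping empty fields verbatim is an
assumption made for illustration.

```python
def strip_query(query, strip):
    """Drop fields whose key is in `strip`, then sort the remaining fields.

    Every field is visited, including empty ones ("a&&b") and the empty
    field after a trailing "&", so that a second pass sees the same input
    shape as the first.
    """
    kept = [f for f in query.split("&")
            if f.split("=", 1)[0] not in strip]
    return "&".join(sorted(kept))


# Degenerate queries of the kind the commit message mentions.
CASES = ["a=&b&c==", "&&a=1&", "b", "", "c==&b=&a", "x=1&b=2&x=1&"]
STRIP = {"b"}

for q in CASES:
    once = strip_query(q, STRIP)
    twice = strip_query(once, STRIP)
    # Idempotency: the output of one pass is a fixpoint of the next.
    assert once == twice, (q, once, twice)

print(strip_query("a=&b&c==", STRIP))  # a=&c==
```

If the second pass produced a different string, the key computed when
writing a page and the key recomputed on the read path would disagree,
which is the silent dedup miss described above.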

Closes #112

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-27 11:13:16 +02:00

simple

tests: add offline local test server prototype (cookies + HTTPS)

2026-06-20 16:35:13 +02:00

stripquery

Add --strip-query to drop query keys from dedup naming (#112) (#434)

2026-06-27 11:13:16 +02:00