mirror of
https://github.com/xroche/httrack.git
synced 2026-06-30 05:55:46 +03:00
Two URLs that differ only in tracking or session query parameters (?utm_source=x versus ?utm_source=y) were saved as separate files, and a single CGI could fan out into thousands of near-duplicate pages. fil_normalized already sorted query args, so reordered parameters dedup, but there was no way to drop a named key.

--strip-query "[host/pattern=]key1,key2,..." (repeatable) removes the listed keys when computing the dedup key and the saved name. The fetched URL is untouched, so a required sid= is still sent on the wire; only the local namespace collapses. Patterns match the normalized host/path with the +/- filter glob (strjoker), last match wins as in the filter list, and stripping is decoupled from urlhack (-%u) so it never silently no-ops with -%u0.

It all funnels through one chokepoint, fil_normalized: an internal fil_normalized_filtered() strips then delegates, and hts_query_strip_keys resolves the per-URL key list. The strip pass walks every query field, including empty and trailing ones, so its output is a fixpoint under the read path's second normalization (otherwise dedup silently misses).

Exported ABI is unchanged; the strip_query field is appended at the tail of httrackp. Covered by a -#test=stripquery self-test (degenerate queries like a=&b&c== and a 50-case idempotency fixpoint) and an end-to-end dedup crawl test.

Closes #112

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
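
To illustrate the stripping idea described above, here is a minimal standalone sketch in C. It is not the HTTrack implementation: strip_query_keys and key_listed are hypothetical names, host/pattern matching and argument sorting are left out, key comparison is assumed to be case-sensitive, fragments are not handled, and empty fields are simply dropped, which may differ from what fil_normalized does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if key[0..len) is one of the entries in the comma-separated list. */
static int key_listed(const char *key, size_t len, const char *csv) {
  const char *p = csv;
  while (*p != '\0') {
    const char *end = strchr(p, ',');
    size_t n = (end != NULL) ? (size_t)(end - p) : strlen(p);
    if (n == len && strncmp(p, key, len) == 0)
      return 1;
    if (end == NULL)
      break;
    p = end + 1;
  }
  return 0;
}

/* Hypothetical helper (illustration only): copy url into out, dropping every
   query field whose key is listed in csv. Empty fields are dropped as well.
   Returns 0 on success, -1 if out is too small. */
static int strip_query_keys(const char *url, const char *csv,
                            char *out, size_t outsz) {
  const char *q, *f;
  size_t pos;
  int kept = 0;

  assert(url != NULL && csv != NULL && out != NULL);
  q = strchr(url, '?');
  pos = (q != NULL) ? (size_t)(q - url) : strlen(url);
  if (pos >= outsz)
    return -1;
  memcpy(out, url, pos);            /* everything before '?' is copied as-is */
  out[pos] = '\0';
  if (q == NULL)
    return 0;

  f = q + 1;
  for (;;) {
    size_t flen = strcspn(f, "&");  /* whole field, possibly empty */
    size_t klen = strcspn(f, "=&"); /* key part, up to the first '=' */
    if (flen > 0 && !key_listed(f, klen, csv)) {
      if (pos + 1 + flen >= outsz)
        return -1;
      out[pos++] = kept ? '&' : '?';
      memcpy(out + pos, f, flen);
      pos += flen;
      out[pos] = '\0';
      kept = 1;
    }
    if (f[flen] == '\0')            /* last field reached */
      break;
    f += flen + 1;
  }
  return 0;
}
```

Under these assumptions, calling strip_query_keys on ?utm_source=x&id=7 and on ?id=7&utm_source=y with the key list "utm_source,sid" leaves both with the query ?id=7, so they would map to the same local name. Feeding the output back through the function leaves it unchanged, which is the fixpoint property the self-test checks for the real code.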