Release 3.49.10

Bump the package version to 3.49.10 and curate the release notes.

VERSION_INFO goes 3:1:0 -> 3:2:0: the cycle only appended tail fields to the installed options struct (--cookies-file, --pause, --strip-query, the -%u split), no existing symbol or offset changed, so the soname stays .so.3.

history.txt gets the 3.49-10 block; debian/changelog gets 3.49.10-1 with the Debian-specific items (DEP-5 copyright, chromium-first browser dep, minizip embedded-library override).

Standards-Version 4.7.0 -> 4.7.4: the intervening Policy changes (usr-merge, /usr/games, Priority recommendation, -dev linker scripts, non-free-firmware) need no package change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Strip the #fragment from a redirect Location before fetching (#204) (#441)
2026-06-28 21:17:57 +03:00 · 2026-06-28 14:21:37 +02:00 · 2026-06-28 13:52:21 +02:00 · 2026-06-28 12:56:11 +02:00 · 2026-06-28 12:45:07 +02:00 · 2026-06-28 11:34:56 +02:00
37 changed files with 1065 additions and 82 deletions
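As a rough illustration of why a 3:1:0 -> 3:2:0 bump keeps the soname at .so.3: on Linux/ELF targets, libtool derives the soname major as CURRENT minus AGE, and the installed file suffix as (CURRENT - AGE).AGE.REVISION. The helper below is a hypothetical sketch written for this note (the name `libtool_suffix` is invented; it is not part of HTTrack or libtool), and it only models the Linux naming scheme.

```c
#include <stdio.h>

/* Hypothetical helper (not part of HTTrack or libtool): format the Linux/ELF
   file suffix that libtool derives from -version-info CURRENT:REVISION:AGE.
   The soname major is CURRENT - AGE; the full suffix is
   (CURRENT - AGE).AGE.REVISION. Returns the soname major. */
static int libtool_suffix(int current, int revision, int age, char *dest,
                          size_t destsize) {
  const int major = current - age;

  snprintf(dest, destsize, "%d.%d.%d", major, age, revision);
  return major;
}
```

Under that assumption, both 3:1:0 and 3:2:0 give major 3, and only the trailing revision component of the file name changes.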
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -39,6 +39,10 @@ Welcome, and nothing to disclose. Two rules:

 The sign-off covers AI-assisted code too.

+## Translations
+
+Interface strings live in [`lang/`](lang/). See [lang/README.md](lang/README.md) for the file format and how to add or update a language.
+
 ## Bugs

 Open an issue with the version, OS, command used, and expected vs actual result.
--- a/configure.ac
+++ b/configure.ac
@@ -1,6 +1,6 @@
 AC_PREREQ([2.71])

-AC_INIT([httrack], [3.49.9], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
+AC_INIT([httrack], [3.49.10], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
 AC_COPYRIGHT([
 HTTrack Website Copier, Offline Browser for Windows and Unix
 Copyright (C) 1998-2015 Xavier Roche and other contributors
@@ -29,10 +29,10 @@ AC_CONFIG_SRCDIR(src/httrack.c)
 AC_CONFIG_MACRO_DIR([m4])
 AC_CONFIG_HEADERS(config.h)
 AM_INIT_AUTOMAKE([subdir-objects])
-# 3:1:0: 3.49.9 changed code but not the exported interface vs 3.49.8 (same 164
-# symbols, no struct-layout change), so bump revision only. (3:0:0 was the htsblk
-# mime-buffer widening, an ABI break that moved the soname .so.2 -> .so.3.)
-VERSION_INFO="3:1:0"
+# 3:2:0: 3.49.10 only appends tail fields to the options struct (no existing
+# symbol or offset changed vs 3.49.9), so it stays soname .so.3; bump revision.
+# (3:0:0 was the htsblk mime-buffer widening, the ABI break that moved .so.2 -> .so.3.)
+VERSION_INFO="3:2:0"
 AM_MAINTAINER_MODE
 AC_USE_SYSTEM_EXTENSIONS

--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,16 @@
+httrack (3.49.10-1) unstable; urgency=medium
+
+  * New upstream release: new download-pacing and URL-handling options plus a
+    batch of crawl and robustness fixes (full list in history.txt).
+  * Rewrite debian/copyright in machine-readable DEP-5 format, crediting the
+    bundled minizip, md5 and coucal sources (#415).
+  * Lead the webhttrack browser dependency with chromium so httrack is not
+    dragged into the firefox-esr autoremoval cascade (#436).
+  * Override the embedded-library lint for the bundled minizip (#419).
+  * Bump Standards-Version to 4.7.4 (no changes required).
+
+ -- Xavier Roche <xavier@debian.org>  Sun, 28 Jun 2026 14:01:53 +0200
+
 httrack (3.49.9-1) unstable; urgency=medium

  * New upstream release: Content-Type and file-type detection fixes (trust a
--- a/debian/control
+++ b/debian/control
@@ -2,7 +2,7 @@ Source: httrack
 Section: web
 Priority: optional
 Maintainer: Xavier Roche <roche@httrack.com>
-Standards-Version: 4.7.0
+Standards-Version: 4.7.4
 Build-Depends: debhelper-compat (= 13), autoconf, autoconf-archive, automake, libtool, zlib1g-dev, libssl-dev
 Rules-Requires-Root: no
 Homepage: http://www.httrack.com
@@ -30,7 +30,7 @@ Description: Copy websites to your computer (Offline browser)
 Package: webhttrack
 Architecture: any
 Multi-Arch: foreign
-Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, firefox-esr | chromium | www-browser
+Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, chromium | firefox-esr | www-browser
 Replaces: webhttrack-common (<< 3.43.9-2)
 Breaks: webhttrack-common (<< 3.43.9-2)
 Suggests: httrack, httrack-doc
--- a/history.txt
+++ b/history.txt
@@ -4,7 +4,25 @@ HTTrack Website Copier release history:

 This file lists all changes and fixes that have been made for HTTrack

-3.49-9
+3.49-10
+ New: --cookies-file to preload a Netscape cookies.txt before crawling (#215)
+ New: --pause to space out file downloads by a random delay (#185)
+ New: --strip-query to drop selected query keys from the dedup naming (#112)
+ Changed: split the -%u URL hacks into independent --keep-www-prefix, --keep-double-slashes and --keep-query-order toggles (#271)
+ Fixed: follow a redirect Location after dropping its #fragment, instead of requesting the fragment and polluting the saved name (#204)
+ Fixed: escaped brackets inside a *[...] filter character class (#148)
+ Fixed: honor the server's Content-Range when resuming a partial download, instead of appending overlapping bytes (#198)
+ Fixed: abort the download as soon as the response type is excluded by -mime:, instead of fetching then discarding the body (#58)
+ Fixed: keep size-based filter rules neutral until the file size is known (#143)
+ Fixed: stop the mirror with a clean fatal error on a cache write failure, instead of crashing (#174, #219)
+ Fixed: stop the 412/416 partial re-get loop on --continue and --update (#206)
+ Fixed: keep an unrecognized URL tail instead of mangling it to .html (#115)
+ Fixed: honor --tolerant (-%B) on a broken Content-Length, and fix an out-of-bounds read it exposed (#32, #41)
+ Fixed: fall back to the next resolved address when a connection fails or stalls, instead of hanging on a dead IPv6 address
+ Fixed: report why a -%L URL list could not be loaded (#49)
+ Changed: multiple internal hardening, build and CI improvements
+
+.49-9
 + Fixed: file-type detection from the Content-Type header: trust a declared type over a binary URL extension, honor --assume under the delayed type check, and keep a known extension against a bogus or empty Content-Type (#267, #29, #56)
 + Fixed: an uninitialized-buffer read when the Content-Type is empty (#411)
 + Fixed: restored C++ source-compatibility of the installed headers so reverse dependencies (httraqt) build again (#413)
--- a/html/filters.html
+++ b/html/filters.html
@@ -247,7 +247,7 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
        <td>the \ character</td>
      </tr>
      <tr>
-        <td nowrap><tt>*[\[\]]</tt></td>
+        <td nowrap><tt>*[\[,\]]</tt></td>
        <td>the [ or ] character</td>
      </tr>
      <tr>
--- a/lang/English.txt
+++ b/lang/English.txt
@@ -295,7 +295,7 @@ Max Depth
 Maximum external depth:
 Maximum external depth:
 Filters (refuse/accept links) :
-Filters (refuse/accept links) :
+Filters (refuse/accept links):
 Paths
 Paths
 Save prefs
--- a/lang/README.md
+++ b/lang/README.md
@@ -0,0 +1,37 @@
+# Translating HTTrack
+
+Interface strings live here, one `.txt` file per language. `English.txt` is the reference: every other file maps each English string to its translation.
+
+## File format
+
+Plain text, entries in consecutive pairs of lines:
+
+```
+<English string>
+<translation>
+```
+
+The first line of a pair is the lookup key and must stay identical to the one in `English.txt`; translate only the second line. Missing entries fall back to the English text at runtime, so a partial translation works.
+
+Preserve any `\r\n`, `\t` and `printf` placeholders (`%s`, `%d`, ...) in the translation.
+
+A few `LANGUAGE_*` entries at the top describe the file itself:
+
+| Key | Meaning |
+| --- | --- |
+| `LANGUAGE_NAME` | Name shown in the language picker, in its own language (`Deutsch`, not `German`) |
+| `LANGUAGE_ISO` | ISO 639 code, with region if needed (`de`, `pt_BR`) |
+| `LANGUAGE_CHARSET` | Encoding the file is saved in (`ISO-8859-1`, `windows-1251`, `UTF-8`, ...) |
+| `LANGUAGE_AUTHOR` | Your name and contact |
+| `LANGUAGE_WINDOWSID` | Windows locale name used by WinHTTrack (`German (Standard)`) |
+
+Save the file in exactly its declared `LANGUAGE_CHARSET`; an editor that rewrites it as UTF-8 will corrupt the non-ASCII bytes.
+
+## Adding or updating a language
+
+1. Copy `English.txt` to `<Language>.txt`, or edit the existing file.
+2. Translate each second line; leave the English keys untouched.
+3. Fill in the `LANGUAGE_*` header for a new file.
+4. Open a pull request, or attach the file to a GitHub issue.
+
+When new strings land in `English.txt` they show up untranslated (as English) until a translator fills them in.
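The pair layout described in lang/README.md can be sketched as a lookup over consecutive lines. This is a minimal illustration, assuming the file has already been split into an array of lines; `lang_lookup` is a hypothetical name and this is not HTTrack's actual loader.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup (not HTTrack's loader): LINES holds NLINES strings laid
   out as consecutive key/translation pairs. Returns the translation for KEY,
   or KEY itself when no pair matches (the English fallback). */
static const char *lang_lookup(const char *const *lines, size_t nlines,
                               const char *key) {
  size_t i;

  /* step by two so that a translation is never mistaken for a key */
  for (i = 0; i + 1 < nlines; i += 2) {
    if (strcmp(lines[i], key) == 0)
      return lines[i + 1];
  }
  return key;
}
```

Stepping by two is what makes the first line of each pair the only lookup key, which is why the README asks translators to leave that line untouched.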
--- a/man/httrack.1
+++ b/man/httrack.1
@@ -3,7 +3,7 @@
 .\"
 .\" This file is generated by man/makeman.sh; do not edit by hand.
 .\" SPDX-License-Identifier: GPL-3.0-or-later
-.TH httrack 1 "26 June 2026" "httrack website copier"
+.TH httrack 1 "27 June 2026" "httrack website copier"
 .SH NAME
 httrack \- offline browser : copy websites to a local directory
 .SH SYNOPSIS
@@ -24,6 +24,7 @@ httrack \- offline browser : copy websites to a local directory
 [ \fB\-EN, \-\-max\-time[=N]\fR ]
 [ \fB\-AN, \-\-max\-rate[=N]\fR ]
 [ \fB\-%cN, \-\-connection\-per\-second[=N]\fR ]
+[ \fB\-%G, \-\-pause\fR ]
 [ \fB\-GN, \-\-max\-pause[=N]\fR ]
 [ \fB\-cN, \-\-sockets[=N]\fR ]
 [ \fB\-TN, \-\-timeout[=N]\fR ]
@@ -43,11 +44,13 @@ httrack \- offline browser : copy websites to a local directory
 [ \fB\-x, \-\-replace\-external\fR ]
 [ \fB\-%x, \-\-disable\-passwords\fR ]
 [ \fB\-%q, \-\-include\-query\-string\fR ]
+[ \fB\-%g, \-\-strip\-query\fR ]
 [ \fB\-o, \-\-generate\-errors\fR ]
 [ \fB\-X, \-\-purge\-old[=N]\fR ]
 [ \fB\-%p, \-\-preserve\fR ]
 [ \fB\-%T, \-\-utf8\-conversion\fR ]
 [ \fB\-bN, \-\-cookies[=N]\fR ]
+[ \fB\-%K, \-\-cookies\-file\fR ]
 [ \fB\-u, \-\-check\-type[=N]\fR ]
 [ \fB\-j, \-\-parse\-java[=N]\fR ]
 [ \fB\-sN, \-\-robots[=N]\fR ]
@@ -153,6 +156,8 @@ maximum mirror time in seconds (60=1 minute, 3600=1 hour) (\-\-max\-time[=N])
 maximum transfer rate in bytes/seconds (1000=1KB/s max) (\-\-max\-rate[=N])
 .IP \-%cN
 maximum number of connections/seconds (*%c10) (\-\-connection\-per\-second[=N])
+.IP \-%G
+random pause of MIN[:MAX] seconds between files (e.g. %G5:10) (\-\-pause <param>)
 .IP \-GN
 pause transfer if N bytes reached, and wait until lock file is deleted (\-\-max\-pause[=N])
 .SS Flow control:
@@ -198,6 +203,8 @@ replace external html links by error pages (\-\-replace\-external)
 do not include any password for external password protected websites (%x0 include) (\-\-disable\-passwords)
 .IP \-%q
 *include query string for local files (useless, for information purpose only) (%q0 don't include) (\-\-include\-query\-string)
+.IP \-%g
+strip query keys for dedup ([host/pattern=]key1,key2,...) (\-\-strip\-query <param>)
 .IP \-o
 *generate output html file in case of error (404..) (o0 don't generate) (\-\-generate\-errors)
 .IP \-X
@@ -209,6 +216,8 @@ links conversion to UTF\-8 (\-\-utf8\-conversion)
 .SS Spider options:
 .IP \-bN
 accept cookies in cookies.txt (0=do not accept,* 1=accept) (\-\-cookies[=N])
+.IP \-%K
+load extra cookies from a Netscape cookies.txt (\-\-cookies\-file <param>)
 .IP \-u
 check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (\-\-check\-type[=N])
 .IP \-j
@@ -225,6 +234,8 @@ tolerant requests (accept bogus responses on some servers, but not standard!) (\
 update hacks: various hacks to limit re\-transfers when updating (identical size, bogus response..) (\-\-updatehack)
 .IP \-%u
 url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (\-\-urlhack)
+.br
+opt out of one url\-hack part: \-\-keep\-www\-prefix (www.foo.com<>foo.com), \-\-keep\-double\-slashes (//), \-\-keep\-query\-order (?b&a)
 .IP \-%A
 assume that a type (cgi,asp..) is always linked with a mime type (\-%A php3,cgi=text/html;dat,bin=application/x\-zip) (\-\-assume <param>)
 .br
--- a/src/htsalias.c
+++ b/src/htsalias.c
@@ -60,6 +60,9 @@ Please visit our Website: http://www.httrack.com
  param1 : this option must be alone, and needs one distinct parameter (-P <path>)
  param0 : this option must be alone, but the parameter should be put together (+*.gif)
 */
+/* clang-format off: hand-aligned table; clang-format reflows the whole
+   initializer (2->4 space) on any edit, churning every untouched row. */
+/* clang-format off */
 const char *hts_optalias[][4] = {
  /*   {"","","",""}, */
  {"path", "-O", "param1", "output path"},
@@ -107,6 +110,12 @@ const char *hts_optalias[][4] = {
  {"disable-passwords", "-%x", "single", ""}, {"disable-password", "-%x",
                                               "single", ""},
  {"include-query-string", "-%q", "single", ""},
+  {"strip-query", "-%g", "param1",
+   "strip [host/pattern=]key1,key2,... from URLs"},
+  {"cookies-file", "-%K", "param1",
+   "load extra cookies from a Netscape cookies.txt"},
+  {"pause", "-%G", "param1",
+   "random pause of MIN[:MAX] seconds between files"},
  {"generate-errors", "-o", "single", ""},
  {"do-not-generate-errors", "-o0", "single", ""},
  {"purge-old", "-X", "param", ""},
@@ -123,6 +132,9 @@ const char *hts_optalias[][4] = {
  {"tolerant", "-%B", "single", ""},
  {"updatehack", "-%s", "single", ""}, {"sizehack", "-%s", "single", ""},
  {"urlhack", "-%u", "single", ""},
+  {"keep-www-prefix", "-%j", "single", ""},
+  {"keep-double-slashes", "-%o", "single", ""},
+  {"keep-query-order", "-%y", "single", ""},
  {"user-agent", "-F", "param1", "user-agent identity"},
  {"referer", "-%R", "param1", "default referer URL"},
  {"from", "-%E", "param1", "from email address"},
@@ -241,6 +253,7 @@ const char *hts_optalias[][4] = {

  {"", "", "", ""}
 };
+/* clang-format on */

 /* 
  Check for alias in command-line 
--- a/src/htscore.c
+++ b/src/htscore.c
@@ -35,6 +35,7 @@ Please visit our Website: http://www.httrack.com

 #include <fcntl.h>
 #include <ctype.h>
+#include <stdint.h> /* uint64_t for the pause mixer (already a hard dep via md5.h) */

 /* File defs */
 #include "htscore.h"
@@ -523,9 +524,12 @@ int httpmirror(char *url1, httrackp * opt) {
    opt->cookie = &cookie;
    cookie.max_len = 30000;     // max len
    strcpybuff(cookie.data, "");
-    // Charger cookies.txt par défaut ou cookies.txt du miroir
+    // Load the mirror's cookies.txt, then the one in the current directory
    cookie_load(opt->cookie, StringBuff(opt->path_log), "cookies.txt");
    cookie_load(opt->cookie, "", "cookies.txt");
+    // A user-supplied cookie file is merged last so it wins on conflicts
+    if (strnotempty(StringBuff(opt->cookies_file)))
+      cookie_load(opt->cookie, "", StringBuff(opt->cookies_file));
  } else
    opt->cookie = NULL;

@@ -3311,6 +3315,21 @@ HTS_INLINE int back_fillmax(struct_back * sback, httrackp * opt,
  return -1;                    /* plus de place */
 }

+/* Seed-derived: stable within a gap, rerolls per launch; a per-call rand()
+   would bias the delay toward min_ms (see header). Jitter, not crypto. */
+int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms) {
+  uint64_t z = (uint64_t) seed;
+
+  if (max_ms <= min_ms)
+    return min_ms;
+  /* SplitMix64 finalizer: scrambles the low-entropy ms timestamp. */
+  z += 0x9E3779B97F4A7C15ULL;
+  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
+  z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
+  z ^= z >> 31;
+  return min_ms + (int) (z % (uint64_t) (max_ms - min_ms + 1));
+}
+
 int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
  int n = opt->maxsoc - back_nsoc(sback);

@@ -3331,6 +3350,18 @@ int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
    }
  }

+  // #185 randomized inter-file pause: non-blocking, one launch per gap
+  if (n > 0 && opt->pause_max_ms > 0 && HTS_STAT.last_connect > 0) {
+    TStamp opTime =
+        HTS_STAT.last_request ? HTS_STAT.last_request : HTS_STAT.last_connect;
+    TStamp lap = mtime_local() - opTime;
+
+    if (lap < hts_pause_target_ms(opTime, opt->pause_min_ms, opt->pause_max_ms))
+      n = 0;
+    else
+      n = 1;
+  }
+
  return n;
 }

@@ -3739,6 +3770,17 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
  if (StringNotEmpty(from->user_agent))
    StringCopyS(to->user_agent, from->user_agent);

+  if (StringNotEmpty(from->strip_query))
+    StringCopyS(to->strip_query, from->strip_query);
+
+  if (StringNotEmpty(from->cookies_file))
+    StringCopyS(to->cookies_file, from->cookies_file);
+
+  if (from->pause_max_ms > 0) {
+    to->pause_min_ms = from->pause_min_ms;
+    to->pause_max_ms = from->pause_max_ms;
+  }
+
  if (from->retry > -1)
    to->retry = from->retry;

--- a/src/htscore.h
+++ b/src/htscore.h
@@ -234,8 +234,12 @@ struct hash_struct {
  coucal adrfil;
  /* former address+path -> link index (renamed/moved entries) */
  coucal former_adrfil;
-  /* scratch buffers reused across lookups (not reentrant) */
-  int normalized;
+  /* effective urlhack sub-flags: www.==host / // collapse / query-arg sort */
+  hts_boolean norm_host;
+  hts_boolean norm_slash;
+  hts_boolean norm_query;
+  /* query-strip keys (not owned); set from opt->strip_query at hash_init */
+  const char *strip_query;
  char normfil[HTS_URLMAXSIZE * 2];
  char normfil2[HTS_URLMAXSIZE * 2];
  char catbuff[CATBUFF_SIZE];
@@ -364,6 +368,22 @@ int fspc(httrackp * opt, FILE * fp, const char *type);

 char *next_token(char *p, int flag);

+/* Like fil_normalized(), but first drops query keys in STRIP (comma-separated,
+   "*" = all); STRIP NULL/empty behaves exactly like fil_normalized(). */
+char *fil_normalized_filtered(const char *source, char *dest,
+                              const char *strip);
+
+/* As fil_normalized_filtered(), but DO_SLASH/DO_QUERY gate the // collapse and
+   the query-argument sort independently (the urlhack sub-flags). */
+char *fil_normalized_filtered_ex(const char *source, char *dest,
+                                 const char *strip, int do_slash, int do_query);
+
+/* For URL ADR/FIL, return (in DEST) the comma keylist to strip from the
+   '\n'-separated "[pattern=]keys" RULES (patterns matched on host/path via
+   strjoker, last wins); NULL if none match. Feeds fil_normalized_filtered(). */
+const char *hts_query_strip_keys(const char *rules, const char *adr,
+                                 const char *fil, char *dest, size_t destsize);
+
 /* Read a whole file into a freshly malloc'd, NUL-terminated buffer; the caller
   owns it and must release it with freet(). Return NULL on missing/unreadable
   file (readfile_or substitutes defaultdata instead). The byte content is NOT
@@ -398,6 +418,10 @@ int back_pluggable_sockets(struct_back * sback, httrackp * opt);

 int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt);

+/* Randomized inter-file pause target in [min_ms,max_ms] (#185), derived from a
+   timestamp seed so it is stable within one gap and rerolls per launch. */
+int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms);
+
 /* Schedule more links from the heap into free slots. Returns the number queued,
   or <=0 if none could be added (no free slot / paused / stopped). */
 int back_fill(struct_back * sback, httrackp * opt, cache_back * cache,
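To make the "stable within one gap" property of hts_pause_target_ms() easier to check in isolation, the sketch below copies the mixing steps from this change into a standalone function. The `demo_` names and the `demo_tstamp` typedef are assumptions made for the example (the real TStamp comes from the HTTrack headers), and the checker is only an illustration, not a test shipped with the release.

```c
#include <stdint.h>

/* Stand-in for HTTrack's TStamp, assumed here to be a 64-bit millisecond
   timestamp; the real typedef lives in the HTTrack headers. */
typedef long long demo_tstamp;

/* Same mixing steps as hts_pause_target_ms() in this change, copied into a
   standalone function so its properties can be checked in isolation. */
static int demo_pause_target_ms(demo_tstamp seed, int min_ms, int max_ms) {
  uint64_t z = (uint64_t) seed;

  if (max_ms <= min_ms)
    return min_ms;
  z += 0x9E3779B97F4A7C15ULL;
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
  z ^= z >> 31;
  return min_ms + (int) (z % (uint64_t) (max_ms - min_ms + 1));
}

/* Returns 1 if every seed in [first, first + count) maps into
   [min_ms, max_ms] and two calls with the same seed agree, else 0. */
static int demo_pause_check(demo_tstamp first, int count, int min_ms,
                            int max_ms) {
  int i;

  for (i = 0; i < count; i++) {
    const int a = demo_pause_target_ms(first + i, min_ms, max_ms);
    const int b = demo_pause_target_ms(first + i, min_ms, max_ms);

    if (a != b || a < min_ms || a > max_ms)
      return 0;
  }
  return 1;
}
```

Because the target depends only on the seed, polling the function repeatedly during one gap compares the elapsed time against the same threshold each time, which a fresh rand() per poll would not do.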
--- a/src/htscoremain.c
+++ b/src/htscoremain.c
@@ -1570,6 +1570,30 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
                  com++;
                }
                break;          // url hack
+              case 'j':
+                opt->no_www_dedup =
+                    HTS_TRUE; // --keep-www-prefix: keep www.X != X
+                if (*(com + 1) == '0') {
+                  opt->no_www_dedup = HTS_FALSE;
+                  com++;
+                }
+                break;
+              case 'o':
+                opt->no_slash_dedup =
+                    HTS_TRUE; // --keep-double-slashes: keep //
+                if (*(com + 1) == '0') {
+                  opt->no_slash_dedup = HTS_FALSE;
+                  com++;
+                }
+                break;
+              case 'y':
+                opt->no_query_dedup =
+                    HTS_TRUE; // --keep-query-order: keep ?b&a order
+                if (*(com + 1) == '0') {
+                  opt->no_query_dedup = HTS_FALSE;
+                  com++;
+                }
+                break;
              case 'v':
                opt->verbosedisplay = HTS_VERBOSE_FULL;
                if (isdigit((unsigned char) *(com + 1))) {
@@ -1937,6 +1961,66 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
                }
                break;

+              case 'g': // strip-query: accumulate "[pattern=]keys" entries
+                if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
+                  HTS_PANIC_PRINTF("Option strip-query needs a blank space and "
+                                   "[host/pattern=]key1,key2,...");
+                  printf("Example: --strip-query "
+                         "\"www.example.com/*=utm_source,sid\"\n");
+                  htsmain_free();
+                  return -1;
+                } else {
+                  na++;
+                  if (StringNotEmpty(opt->strip_query))
+                    StringCat(opt->strip_query, "\n");
+                  StringCat(opt->strip_query, argv[na]);
+                }
+                break;
+              case 'K': // cookies-file: extra Netscape cookies.txt to preload
+                if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
+                  HTS_PANIC_PRINTF(
+                      "Option cookies-file needs a blank space and "
+                      "a cookies.txt path");
+                  printf("Example: --cookies-file \"/home/me/cookies.txt\"\n");
+                  htsmain_free();
+                  return -1;
+                } else {
+                  na++;
+                  if (strlen(argv[na]) >= 1024) {
+                    HTS_PANIC_PRINTF("Cookie file path too long");
+                    htsmain_free();
+                    return -1;
+                  }
+                  StringCopy(opt->cookies_file, argv[na]);
+                }
+                break;
+              case 'G': // pause: randomized inter-file delay MIN[:MAX] seconds
+                if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
+                  HTS_PANIC_PRINTF("Option pause needs a blank space and a "
+                                   "delay in seconds (MIN[:MAX])");
+                  printf("Example: --pause 5:10\n");
+                  htsmain_free();
+                  return -1;
+                } else {
+                  double pmin = 0, pmax = 0;
+                  int nf;
+
+                  na++;
+                  nf = sscanf(argv[na], "%lf:%lf", &pmin, &pmax);
+                  if (nf < 2)
+                    pmax = pmin; /* a single value means a fixed delay */
+                  /* positive-form bounds: NaN fails every comparison, so this
+                     rejects it before the undefined (int)(NaN*1000) cast */
+                  if (nf < 1 || !(pmin >= 0 && pmax >= pmin && pmax <= 86400)) {
+                    HTS_PANIC_PRINTF("Invalid --pause range (expected "
+                                     "MIN[:MAX] seconds, 0<=MIN<=MAX<=86400)");
+                    htsmain_free();
+                    return -1;
+                  }
+                  opt->pause_min_ms = (int) (pmin * 1000.0);
+                  opt->pause_max_ms = (int) (pmax * 1000.0);
+                }
+                break;
              case 't':        /* do not change type (ending) of filenames according to the MIME type */
                opt->no_type_change = 1;
                if (*(com+1)=='0') { opt->no_type_change = 0; com++; }
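The --pause argument handling added above can be exercised on its own. The following is a minimal sketch that mirrors the same sscanf pattern and bounds test in a standalone function; `demo_parse_pause` is a hypothetical name, and the error reporting of the real option parser is replaced by a plain return code.

```c
#include <stdio.h>

/* Hypothetical standalone version of the --pause argument check: parse
   "MIN[:MAX]" seconds into milliseconds. Returns 0 on success, -1 on a
   malformed or out-of-range value (outputs are left untouched on failure). */
static int demo_parse_pause(const char *arg, int *min_ms, int *max_ms) {
  double pmin = 0, pmax = 0;
  const int nf = sscanf(arg, "%lf:%lf", &pmin, &pmax);

  if (nf < 2)
    pmax = pmin; /* a single value means a fixed delay */
  /* positive-form test: NaN fails every comparison, so it is rejected
     before the double-to-int conversion below */
  if (nf < 1 || !(pmin >= 0 && pmax >= pmin && pmax <= 86400))
    return -1;
  *min_ms = (int) (pmin * 1000.0);
  *max_ms = (int) (pmax * 1000.0);
  return 0;
}
```

Writing the range test in its positive form, then negating it, is the design point: a chain of `pmin < 0 || pmax < pmin || ...` checks would let NaN through, since each of those comparisons is also false for NaN.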
--- a/src/htsfilters.c
+++ b/src/htsfilters.c
@@ -193,7 +193,12 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
          int len = (int) strlen(joker);

          while((joker[i] != RIGHT) && (joker[i]) && (i < len)) {
-            if ((joker[i] == '<') || (joker[i] == '>')) {       // *[<10]
+            // '\' escapes the next char as a literal member, e.g. *[\[\]]
+            if (joker[i] == '\\' && joker[i + 1] != '\0') {
+              i++;
+              pass[(int) (unsigned char) joker[i]] = 1;
+              i++;
+            } else if ((joker[i] == '<') || (joker[i] == '>')) { // *[<10]
              int lsize = 0;
              int lverdict;

@@ -221,7 +226,9 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
                while(isdigit((unsigned char) joker[i]))
                  i++;
              }
-            } else if (joker[i + 1] == '-') {   // 2 car, ex: *[A-Z]
+            } else if (joker[i + 1] == '-' && joker[i + 2] != '\0') {
+              // range *[A-Z]; the '\0' guard rejects a truncated *[a- (else
+              // i+=3 overshoots the NUL)
              if ((int) (unsigned char) joker[i + 2] >
                  (int) (unsigned char) joker[i]) {
                int j;
@@ -233,10 +240,7 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
              }
              // else err=1;
              i += 3;
-            } else {            // 1 car, ex: *[ ]
-              if (joker[i + 2] == '\\' && joker[i + 3] != 0) {  // escaped char, such as *[\[] or *[\]]
-                i++;
-              }
+            } else { // 1 car, ex: *[ ]
              pass[(int) (unsigned char) joker[i]] = 1;
              i++;
            }
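The escape and truncated-range handling in the hunks above can be illustrated with a much smaller parser. This sketch is hypothetical and deliberately simplified: `demo_parse_class` is an invented name, it is not strjoker(), and it omits the size comparisons (`<`, `>`) and any separator handling that the real filter code may apply.

```c
#include <string.h>

/* Hypothetical, simplified class parser (not HTTrack's strjoker): fill PASS
   (256 entries) from the text that follows '[', up to the closing ']'.
   Handles only backslash-escaped literals, A-Z style ranges and single
   characters. Returns the index of the closing ']' in CLS, or -1 if the
   class is not terminated. */
static int demo_parse_class(const char *cls, unsigned char pass[256]) {
  int i = 0;

  memset(pass, 0, 256);
  while (cls[i] != '\0' && cls[i] != ']') {
    if (cls[i] == '\\' && cls[i + 1] != '\0') {
      /* escaped member: take the next char literally, even '[' or ']' */
      pass[(unsigned char) cls[i + 1]] = 1;
      i += 2;
    } else if (cls[i + 1] == '-' && cls[i + 2] != '\0') {
      /* range: the NUL guard keeps a truncated "a-" from stepping past
         the end of the string */
      int c;

      for (c = (unsigned char) cls[i]; c <= (unsigned char) cls[i + 2]; c++)
        pass[c] = 1;
      i += 3;
    } else {
      pass[(unsigned char) cls[i]] = 1;
      i++;
    }
  }
  return cls[i] == ']' ? i : -1;
}
```

Checking the escape first, before the range and single-character branches, is what lets `\]` become a member instead of ending the class early.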
--- a/src/htsglobal.h
+++ b/src/htsglobal.h
@@ -43,8 +43,8 @@ Please visit our Website: http://www.httrack.com
   configure.ac, decoupled from these). VERSION is the display form, VERSIONID
   the dotted numeric form, AFF_VERSION the short form shown in footers,
   LIB_VERSION the data/cache format generation. */
-#define HTTRACK_VERSION "3.49-9"
-#define HTTRACK_VERSIONID "3.49.9"
+#define HTTRACK_VERSION "3.49-10"
+#define HTTRACK_VERSIONID "3.49.10"
 #define HTTRACK_AFF_VERSION "3.x"
 #define HTTRACK_LIB_VERSION "2.0"

--- a/src/htshash.c
+++ b/src/htshash.c
@@ -106,10 +106,10 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,
  const lien_url*const lien = (const lien_url*) value;
  const char *const adr = !former ? lien->adr : lien->former_adr;
  const char *const fil = !former ? lien->fil : lien->former_fil;
-  const char *const adr_norm = adr != NULL ? 
-    ( hash->normalized  ? jump_normalized_const(adr)
-                        : jump_identification_const(adr) )
-    : NULL;
+  const char *const adr_norm =
+      adr != NULL ? (hash->norm_host ? jump_normalized_const(adr)
+                                     : jump_identification_const(adr))
+                  : NULL;

  // copy address
  assertf(adr_norm != NULL);
@@ -117,10 +117,18 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,

  // copy link
  assertf(fil != NULL);
-  if (hash->normalized) {
-    fil_normalized(fil, &hash->normfil[strlen(hash->normfil)]);
-  } else {
-    strcpy(&hash->normfil[strlen(hash->normfil)], fil);
+  {
+    /* resolve the per-URL strip keys; strip applies even when urlhack is off */
+    char BIGSTK keybuf[HTS_URLMAXSIZE];
+    const char *const keys = hts_query_strip_keys(hash->strip_query, adr, fil,
+                                                  keybuf, sizeof(keybuf));
+
+    if (hash->norm_slash || hash->norm_query || keys != NULL) {
+      fil_normalized_filtered_ex(fil, &hash->normfil[strlen(hash->normfil)],
+                                 keys, hash->norm_slash, hash->norm_query);
+    } else {
+      strcpy(&hash->normfil[strlen(hash->normfil)], fil);
+    }
  }

  // hash
@@ -132,8 +140,7 @@ static int key_adrfil_equals_generic(void *arg,
                                     coucal_key_const a_,
                                     coucal_key_const b_, 
                                     const int former) {
-  hash_struct *const hash = (hash_struct*) arg;
-  const int normalized = hash->normalized;
+  hash_struct *const hash = (hash_struct *) arg;
  const lien_url*const a = (const lien_url*) a_;
  const lien_url*const b = (const lien_url*) b_;
  const char *const a_adr = !former ? a->adr : a->former_adr;
@@ -150,10 +157,10 @@ static int key_adrfil_equals_generic(void *arg,
  assertf(b_fil != NULL);

  // skip scheme and authentication to the domain (possibly without www.)
-  ja = normalized
-    ? jump_normalized_const(a_adr) : jump_identification_const(a_adr);
-  jb = normalized
-    ? jump_normalized_const(b_adr) : jump_identification_const(b_adr);
+  ja = hash->norm_host ? jump_normalized_const(a_adr)
+                       : jump_identification_const(a_adr);
+  jb = hash->norm_host ? jump_normalized_const(b_adr)
+                       : jump_identification_const(b_adr);
  assertf(ja != NULL);
  assertf(jb != NULL);
  if (strcasecmp(ja, jb) != 0) {
@@ -161,12 +168,23 @@ static int key_adrfil_equals_generic(void *arg,
  }

  // now compare pathes
-  if (normalized) {
-    fil_normalized(a_fil, hash->normfil);
-    fil_normalized(b_fil, hash->normfil2);
-    return strcmp(hash->normfil, hash->normfil2) == 0;
-  } else {
-    return strcmp(a_fil, b_fil) == 0;
+  {
+    char BIGSTK ka[HTS_URLMAXSIZE], kb[HTS_URLMAXSIZE];
+    const char *const keysa =
+        hts_query_strip_keys(hash->strip_query, a_adr, a_fil, ka, sizeof(ka));
+    const char *const keysb =
+        hts_query_strip_keys(hash->strip_query, b_adr, b_fil, kb, sizeof(kb));
+
+    if (hash->norm_slash || hash->norm_query || keysa != NULL ||
+        keysb != NULL) {
+      fil_normalized_filtered_ex(a_fil, hash->normfil, keysa, hash->norm_slash,
+                                 hash->norm_query);
+      fil_normalized_filtered_ex(b_fil, hash->normfil2, keysb, hash->norm_slash,
+                                 hash->norm_query);
+      return strcmp(hash->normfil, hash->normfil2) == 0;
+    } else {
+      return strcmp(a_fil, b_fil) == 0;
+    }
  }
 }

@@ -222,11 +240,17 @@ static int key_former_adrfil_equals(void *arg,
  return key_adrfil_equals_generic(arg, a, b, 1);
 }

-void hash_init(httrackp *opt, hash_struct * hash, int normalized) {
+void hash_init(httrackp *opt, hash_struct *hash, hts_boolean normalized) {
  hash->sav = coucal_new(0);
  hash->adrfil = coucal_new(0);
  hash->former_adrfil = coucal_new(0);
-  hash->normalized = normalized;
+  /* urlhack is the umbrella; per-feature negatives opt out of each part */
+  hash->norm_host = normalized && !opt->no_www_dedup;
+  hash->norm_slash = normalized && !opt->no_slash_dedup;
+  hash->norm_query = normalized && !opt->no_query_dedup;
+  /* snapshot the query-strip list (not owned; valid for the hash lifetime) */
+  hash->strip_query =
+      StringNotEmpty(opt->strip_query) ? StringBuff(opt->strip_query) : NULL;

  hts_set_hash_handler(hash->sav, opt);
  hts_set_hash_handler(hash->adrfil, opt);
@@ -282,6 +306,26 @@ void hash_free(hash_struct *hash) {
  }
 }

+/* Test helper: do the two URLs dedupe to the same key under opt's urlhack
+   flags? Exercises the live hash compare (norm_host/slash/query resolution). */
+hts_boolean hash_url_equals(httrackp *opt, const char *adra, const char *fila,
+                            const char *adrb, const char *filb) {
+  hash_struct hash;
+  lien_url la, lb;
+  hts_boolean eq;
+
+  memset(&la, 0, sizeof(la));
+  memset(&lb, 0, sizeof(lb));
+  la.adr = key_duphandler(NULL, adra);
+  la.fil = key_duphandler(NULL, fila);
+  lb.adr = key_duphandler(NULL, adrb);
+  lb.fil = key_duphandler(NULL, filb);
+  hash_init(opt, &hash, opt->urlhack);
+  eq = key_adrfil_equals(&hash, &la, &lb);
+  hash_free(&hash);
+  return eq;
+}
+
 // retour: position ou -1 si non trouvé
 int hash_read(const hash_struct * hash, const char *nom1, const char *nom2,
              hash_struct_type type) {
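The body of fil_normalized_filtered_ex() is not part of this excerpt, so the following is only a sketch of what "dropping query keys before comparing" could look like, under stated assumptions: `demo_strip_query` and `demo_key_listed` are hypothetical names, the key list is the comma-separated form with "*" meaning all keys, and no sorting or slash collapsing is attempted.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if NAME (NAMELEN bytes) appears in the comma-separated KEYS,
   or if KEYS is exactly "*". */
static int demo_key_listed(const char *keys, const char *name,
                           size_t namelen) {
  const char *p = keys;

  if (strcmp(keys, "*") == 0)
    return 1;
  while (*p != '\0') {
    const size_t len = strcspn(p, ",");

    if (len == namelen && strncmp(p, name, len) == 0)
      return 1;
    p += len;
    if (*p == ',')
      p++;
  }
  return 0;
}

/* Hypothetical sketch (not HTTrack's fil_normalized_filtered): copy SRC to
   DEST, dropping every query argument whose name is listed in KEYS. Empty
   arguments are skipped. DEST must hold at least strlen(SRC) + 1 bytes.
   Returns DEST. */
static char *demo_strip_query(const char *src, const char *keys, char *dest) {
  const char *q = strchr(src, '?');
  size_t out;
  int kept = 0;

  if (q == NULL || keys == NULL || *keys == '\0')
    return strcpy(dest, src);
  out = (size_t) (q - src);
  memcpy(dest, src, out); /* path part, without the '?' */
  q++;
  while (*q != '\0') {
    const size_t arglen = strcspn(q, "&");
    const size_t namelen = strcspn(q, "=&");

    if (arglen > 0 && !demo_key_listed(keys, q, namelen)) {
      dest[out++] = kept++ ? '&' : '?';
      memcpy(dest + out, q, arglen);
      out += arglen;
    }
    q += arglen;
    if (*q == '&')
      q++;
  }
  dest[out] = '\0';
  return dest;
}
```

With such a filter applied to both sides before the strcmp(), two URLs that differ only in a tracking key would compare equal, which is the dedup effect the hash compare above relies on.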
--- a/src/htshash.h
+++ b/src/htshash.h
@@ -51,8 +51,12 @@ typedef enum hash_struct_type {
 } hash_struct_type;

 // tables de hachage
-void hash_init(httrackp *opt, hash_struct *hash, int normalized);
+void hash_init(httrackp *opt, hash_struct *hash, hts_boolean normalized);
 void hash_free(hash_struct *hash);
+/* Test helper: HTS_TRUE if the two URLs dedupe together under opt's urlhack
+   flags. */
+hts_boolean hash_url_equals(httrackp *opt, const char *adra, const char *fila,
+                            const char *adrb, const char *filb);
 int hash_read(const hash_struct * hash, const char *nom1, const char *nom2,
              hash_struct_type type);
 void hash_write(hash_struct * hash, size_t lpos);
--- a/src/htshelp.c
+++ b/src/htshelp.c
@@ -521,6 +521,7 @@ void help(const char *app, int more) {
  infomsg("  EN maximum mirror time in seconds (60=1 minute, 3600=1 hour)");
  infomsg("  AN maximum transfer rate in bytes/seconds (1000=1KB/s max)");
  infomsg(" %cN maximum number of connections/seconds (*%c10)");
+  infomsg(" %G  random pause of MIN[:MAX] seconds between files (e.g. %G5:10)");
  infomsg
    ("  GN pause transfer if N bytes reached, and wait until lock file is deleted");
  infomsg("");
@@ -563,6 +564,7 @@ void help(const char *app, int more) {
    (" %x  do not include any password for external password protected websites (%x0 include)");
  infomsg
    (" %q *include query string for local files (useless, for information purpose only) (%q0 don't include)");
+  infomsg(" %g  strip query keys for dedup ([host/pattern=]key1,key2,...)");
  infomsg
    ("  o *generate output html file in case of error (404..) (o0 don't generate)");
  infomsg("  X *purge old files after update (X0 keep delete)");
@@ -571,6 +573,7 @@ void help(const char *app, int more) {
  infomsg("");
  infomsg("Spider options:");
  infomsg("  bN accept cookies in cookies.txt (0=do not accept,* 1=accept)");
+  infomsg(" %K  load extra cookies from a Netscape cookies.txt");
  infomsg
    ("  u  check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always)");
  infomsg
@@ -587,6 +590,9 @@ void help(const char *app, int more) {
    (" %s  update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..)");
  infomsg
    (" %u  url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..)");
+  infomsg("     opt out of one url-hack part: --keep-www-prefix "
+          "(www.foo.com<>foo.com), --keep-double-slashes (//), "
+          "--keep-query-order (?b&a)");
  infomsg
    (" %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip)");
  infomsg("     shortcut: '--assume standard' is equivalent to -%A "
--- a/src/htslib.c
+++ b/src/htslib.c
@@ -3610,7 +3610,10 @@ static int sortNormFnc(const void *a_, const void *b_) {
  return strcmp(*a + 1, *b + 1);
 }

-HTSEXT_API char *fil_normalized(const char *source, char *dest) {
+/* Path normalizer core: optionally collapse redundant '//' (DO_SLASH) and/or
+   sort query arguments (DO_QUERY) so equivalent URLs dedupe. */
+static char *fil_normalized_ex(const char *source, char *dest, int do_slash,
+                               int do_query) {
  char lastc = 0;
  int gotquery = 0;
  int ampargs = 0;
@@ -3620,8 +3623,8 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
  for(i = j = 0; source[i] != '\0'; i++) {
    if (!gotquery && source[i] == '?')
      gotquery = ampargs = 1;
-    if ((!gotquery && lastc == '/' && source[i] == '/') // foo//bar -> foo/bar
-      ) {
+    if (do_slash && !gotquery && lastc == '/' && source[i] == '/') {
+      // foo//bar -> foo/bar
    } else {
      if (gotquery && source[i] == '&') {
        ampargs++;
@@ -3633,7 +3636,7 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
  dest[j++] = '\0';

  /* Sort arguments (&foo=1&bar=2 == &bar=2&foo=1) */
-  if (ampargs > 1) {
+  if (do_query && ampargs > 1) {
    char **amps = malloct(ampargs * sizeof(char *));
    char *copyBuff = NULL;
    size_t qLen = 0;
@@ -3681,6 +3684,153 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
  return dest;
 }

+HTSEXT_API char *fil_normalized(const char *source, char *dest) {
+  return fil_normalized_ex(source, dest, 1, 1);
+}
+
+/* Is query key ARG[0..keylen) in the comma-separated STRIP list? "*" = all;
+   case-sensitive, space-trimmed tokens. */
+static int hts_query_key_stripped(const char *arg, size_t keylen,
+                                  const char *strip) {
+  const char *p = strip;
+
+  while (*p != '\0') {
+    const char *start = p;
+    size_t toklen;
+
+    while (*p != '\0' && *p != ',')
+      p++;
+    toklen = (size_t) (p - start);
+    while (toklen > 0 && *start == ' ') {
+      start++;
+      toklen--;
+    }
+    while (toklen > 0 && start[toklen - 1] == ' ')
+      toklen--;
+    if (toklen == 1 && start[0] == '*')
+      return 1;
+    if (toklen == keylen && strncmp(start, arg, keylen) == 0)
+      return 1;
+    if (*p == ',')
+      p++;
+  }
+  return 0;
+}
+
+/* see htscore.h */
+char *fil_normalized_filtered_ex(const char *source, char *dest,
+                                 const char *strip, int do_slash,
+                                 int do_query) {
+  const char *query;
+  char BIGSTK tmp[HTS_URLMAXSIZE * 2];
+  htsbuff cb;
+  int wrote = 0;
+
+  /* No strip list, or no query: plain normalization. */
+  if (strip == NULL || *strip == '\0' ||
+      (query = strchr(source, '?')) == NULL) {
+    return fil_normalized_ex(source, dest, do_slash, do_query);
+  }
+
+  /* Copy the path, re-emit kept query args, let fil_normalized_ex() sort.
+     Walk every field incl. empty/trailing ("a&","?&&") so the result is a
+     fixpoint (the read re-normalizes it; a dropped empty arg misses dedup). */
+  cb = htsbuff_ptr(tmp, sizeof(tmp));
+  htsbuff_catn(&cb, source, (size_t) (query - source));
+  for (query++;;) {
+    const char *const arg = query;
+    const char *eq = NULL;
+    size_t keylen, arglen;
+
+    while (*query != '\0' && *query != '&') {
+      if (eq == NULL && *query == '=')
+        eq = query;
+      query++;
+    }
+    arglen = (size_t) (query - arg);
+    keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
+    if (!hts_query_key_stripped(arg, keylen, strip)) {
+      htsbuff_catc(&cb, wrote ? '&' : '?');
+      htsbuff_catn(&cb, arg, arglen);
+      wrote = 1;
+    }
+    if (*query == '\0')
+      break;
+    query++;
+  }
+  return fil_normalized_ex(tmp, dest, do_slash, do_query);
+}
+
+/* see htscore.h */
+char *fil_normalized_filtered(const char *source, char *dest,
+                              const char *strip) {
+  return fil_normalized_filtered_ex(source, dest, strip, 1, 1);
+}
+
+/* see htscore.h */
+const char *hts_query_strip_keys(const char *rules, const char *adr,
+                                 const char *fil, char *dest, size_t destsize) {
+  const char *p, *q;
+  const char *result = NULL;
+  char BIGSTK url[HTS_URLMAXSIZE * 2];
+
+  if (rules == NULL || *rules == '\0' || destsize == 0)
+    return NULL;
+
+  /* Match string = normalized host/path, query removed. jump_normalized_const
+     collapses www+scheme/auth so read and write (double-normalized) agree;
+     query excluded keeps the decision on host/path only. */
+  url[0] = '\0';
+  strcatbuff(url, jump_normalized_const(adr));
+  if (fil[0] != '/')
+    strcatbuff(url, "/");
+  q = strchr(fil, '?');
+  if (q != NULL)
+    strncatbuff(url, fil, (int) (q - fil));
+  else
+    strcatbuff(url, fil);
+
+  /* Walk the '\n' entries; last match wins (like the +/- filter eval). Each is
+     "pattern=keys"; no '=' is the bare form, pattern "*". */
+  for (p = rules; *p != '\0';) {
+    const char *const line = p;
+    const char *eol, *eq, *keys;
+    char BIGSTK pat[HTS_URLMAXSIZE * 2];
+
+    while (*p != '\0' && *p != '\n')
+      p++;
+    eol = p;
+    if (*p == '\n')
+      p++;
+    if (eol == line)
+      continue;
+    eq = memchr(line, '=', (size_t) (eol - line));
+    if (eq != NULL) {
+      size_t patlen = (size_t) (eq - line);
+
+      if (patlen >= sizeof(pat))
+        patlen = sizeof(pat) - 1;
+      memcpy(pat, line, patlen);
+      pat[patlen] = '\0';
+      keys = eq + 1;
+    } else {
+      pat[0] = '*';
+      pat[1] = '\0';
+      keys = line;
+    }
+    if (strjoker(url, pat, NULL, NULL) != NULL) {
+      size_t klen = (size_t) (eol - keys);
+
+      if (klen >= destsize)
+        klen = destsize - 1;
+      memcpy(dest, keys, klen);
+      dest[klen] = '\0';
+      result = dest;
+    }
+  }
+  return result;
+}
+
 #define endwith(a) ( (len >= (sizeof(a)-1)) ? ( strncmp(dest, a+len-(sizeof(a)-1), sizeof(a)-1) == 0 ) : 0 );
 HTSEXT_API char *adr_normalized_sized(const char *source, char *dest,
                                      size_t destsize) {
@@ -5890,7 +6040,14 @@ HTSEXT_API httrackp *hts_create_opt(void) {
  opt->verbosedisplay = HTS_VERBOSE_NONE; // no text animation
  opt->sizehack = HTS_FALSE;
  opt->urlhack = HTS_TRUE;
+  opt->no_www_dedup = HTS_FALSE;
+  opt->no_slash_dedup = HTS_FALSE;
+  opt->no_query_dedup = HTS_FALSE;
  StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
+  StringCopy(opt->strip_query, "");
+  StringCopy(opt->cookies_file, "");
+  opt->pause_min_ms = 0;
+  opt->pause_max_ms = 0;
  opt->ftp_proxy = HTS_TRUE;
  opt->convert_utf8 = HTS_TRUE;
  StringCopy(opt->filelist, "");
@@ -6035,6 +6192,8 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
    StringFree(opt->urllist);
    StringFree(opt->footer);
    StringFree(opt->mod_blacklist);
+    StringFree(opt->strip_query);
+    StringFree(opt->cookies_file);

    StringFree(opt->path_html);
    StringFree(opt->path_html_utf8);
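Reviewer note on the strip-list matching added in this file: a key is stripped only when it equals a whole list token, and the key is the part of an argument before its first `=`. The sketch below restates that rule in isolation, as an illustration rather than the patch's implementation. `key_listed` and `strip_query_keys` are hypothetical names, and unlike the patch the sketch does not trim spaces, sort the arguments or collapse `//`:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: is key[0..keylen) a whole token of the comma-separated
   list? A lone "*" token matches every key. Comparison is case-sensitive. */
static int key_listed(const char *key, size_t keylen, const char *list) {
  const char *p = list;

  while (*p != '\0') {
    const char *tok = p;
    size_t toklen;

    while (*p != '\0' && *p != ',')
      p++;
    toklen = (size_t) (p - tok);
    if (toklen == 1 && tok[0] == '*')
      return 1;
    if (toklen == keylen && strncmp(tok, key, keylen) == 0)
      return 1;
    if (*p == ',')
      p++;
  }
  return 0;
}

/* Hypothetical helper: copy url into dest, dropping every query argument whose
   key is listed. Returns dest, or NULL if dest is too small. */
static char *strip_query_keys(const char *url, const char *list, char *dest,
                              size_t destsize) {
  const char *q = strchr(url, '?');
  size_t j;
  int wrote = 0;

  if (q == NULL)
    q = url + strlen(url);
  j = (size_t) (q - url);
  if (j >= destsize)
    return NULL;
  memcpy(dest, url, j); /* path part, query excluded */
  if (*q == '?') {
    for (q++;;) {
      const char *const arg = q;
      const char *eq = NULL;
      size_t arglen, keylen;

      while (*q != '\0' && *q != '&') {
        if (eq == NULL && *q == '=')
          eq = q; /* only the first '=' ends the key */
        q++;
      }
      arglen = (size_t) (q - arg);
      keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
      if (!key_listed(arg, keylen, list)) {
        if (j + 1 + arglen >= destsize)
          return NULL;
        dest[j++] = wrote ? '&' : '?';
        memcpy(dest + j, arg, arglen);
        j += arglen;
        wrote = 1;
      }
      if (*q == '\0')
        break;
      q++;
    }
  }
  dest[j] = '\0';
  return dest;
}
```

Because the comparison requires equal lengths, a list of `utm` leaves `utm_source` alone and `c` leaves `cc` alone, and a query whose arguments are all stripped loses its `?` as well.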
--- a/src/htsname.c
+++ b/src/htsname.c
@@ -198,6 +198,13 @@ int url_savename(lien_adrfilsave *const afs,
  // copy of fil, used for lookups (see urlhack)
  const char *normadr = adr;
  const char *normfil = fil_complete;
+  /* query keys to strip for this URL (NULL = none); decoupled from urlhack */
+  char BIGSTK stripkeys[HTS_URLMAXSIZE];
+  const char *const strip =
+      StringNotEmpty(opt->strip_query)
+          ? hts_query_strip_keys(StringBuff(opt->strip_query), adr,
+                                 fil_complete, stripkeys, sizeof(stripkeys))
+          : NULL;
  const char *const print_adr = jump_protocol_const(adr);
  const char *start_pos = NULL, *nom_pos = NULL, *dot_pos = NULL;     // Position nom et point

@@ -230,9 +237,13 @@ int url_savename(lien_adrfilsave *const afs,
  // www-42.foo.com -> foo.com
  // foo.com/bar//foobar -> foo.com/bar/foobar
  if (opt->urlhack) {
-    // copy of adr (without protocol), used for lookups (see urlhack)
-    normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
-    normfil = fil_normalized(fil_complete, normfil_);
+    // dedup-lookup key; honor the per-feature negatives like htshash.c so
+    // distinct URLs keep distinct savenames (else keep normadr = adr)
+    if (!opt->no_www_dedup)
+      normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
+    normfil =
+        fil_normalized_filtered_ex(fil_complete, normfil_, strip,
+                                   !opt->no_slash_dedup, !opt->no_query_dedup);
  } else {
    if (link_has_authority(adr_complete)) {     // https or other protocols : in "http/" subfolder
      char *pos = strchr(adr_complete, ':');
@@ -245,6 +256,11 @@ int url_savename(lien_adrfilsave *const afs,
        normadr = normadr_;
      }
    }
+    // strip still applies with urlhack off (host left untouched); no // or
+    // query-sort here, to match the hash key (norm_slash/norm_query are 0 when
+    // urlhack is off) so a URL is looked up under the key it was stored with
+    if (strip != NULL)
+      normfil = fil_normalized_filtered_ex(fil_complete, normfil_, strip, 0, 0);
  }

  // à afficher sans ftp://
--- a/src/htsopt.h
+++ b/src/htsopt.h
@@ -529,6 +529,16 @@ struct httrackp {
  htslibhandles libHandles; /**< loaded external module handles */
  //
  htsoptstate state; /**< embedded live engine state */
+  String strip_query; /**< query keys to drop for URL dedup (--strip-query);
+                           appended at the tail to keep field offsets stable */
+  hts_boolean
+      no_www_dedup; /**< with urlhack, keep www.host distinct from host */
+  hts_boolean no_slash_dedup; /**< with urlhack, keep redundant // in paths */
+  hts_boolean no_query_dedup; /**< with urlhack, keep query-argument order */
+  String cookies_file;        /**< extra Netscape cookies.txt to preload
+                                 (--cookies-file) */
+  int pause_min_ms; /**< inter-file pause lower bound, ms (0=off, #185) */
+  int pause_max_ms; /**< inter-file pause upper bound, ms */
 };

 /* Running statistics for a mirror. */
--- a/src/htsparse.c
+++ b/src/htsparse.c
@@ -302,6 +302,14 @@ static HTS_INLINE char html_prevc(const char *html, const char *start) {
  return html > start ? html[-1] : ' ';
 }

+/* Drop a redirect Location's #fragment: a UA anchor, never part of the fetched
+ * resource (#204). */
+static void url_drop_fragment(char *const url) {
+  char *const frag = strchr(url, '#');
+  if (frag != NULL)
+    *frag = '\0';
+}
+
 /* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
   argument is a method, not a URL: #218). Case-insensitive. */
 static int is_http_method(const char *s, size_t len) {
@@ -3596,22 +3604,35 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
        //

        strcpybuff(mov_url, r->location);
+        url_drop_fragment(mov_url);

        // url qque -> adresse+fichier
        if ((reponse =
             ident_url_relatif(mov_url, urladr(), urlfil(), moved)) >= 0) {
          int set_prio_to = 0;  // pas de priotité fixéd par wizard

-          // check whether URLHack is harmless or not
-          if (opt->urlhack) {
+          // check whether URLHack is harmless or not (per the effective
+          // sub-flags)
+          if (opt->urlhack && (!opt->no_www_dedup || !opt->no_slash_dedup ||
+                               !opt->no_query_dedup)) {
+            const int norm_host = !opt->no_www_dedup;
+            const int norm_slash = !opt->no_slash_dedup;
+            const int norm_query = !opt->no_query_dedup;
            char BIGSTK n_adr[HTS_URLMAXSIZE * 2], n_fil[HTS_URLMAXSIZE * 2];
            char BIGSTK pn_adr[HTS_URLMAXSIZE * 2], pn_fil[HTS_URLMAXSIZE * 2];

-            n_adr[0] = n_fil[0] = '\0';
-            (void) adr_normalized_sized(moved->adr, n_adr, sizeof(n_adr));
-            (void) fil_normalized(moved->fil, n_fil);
-            (void) adr_normalized_sized(urladr(), pn_adr, sizeof(pn_adr));
-            (void) fil_normalized(urlfil(), pn_fil);
+            strlcpybuff(n_adr,
+                        norm_host ? jump_normalized_const(moved->adr)
+                                  : jump_identification_const(moved->adr),
+                        sizeof(n_adr));
+            strlcpybuff(pn_adr,
+                        norm_host ? jump_normalized_const(urladr())
+                                  : jump_identification_const(urladr()),
+                        sizeof(pn_adr));
+            fil_normalized_filtered_ex(moved->fil, n_fil, NULL, norm_slash,
+                                       norm_query);
+            fil_normalized_filtered_ex(urlfil(), pn_fil, NULL, norm_slash,
+                                       norm_query);
            if (strcasecmp(n_adr, pn_adr) == 0
                && strcasecmp(n_fil, pn_fil) == 0) {
              hts_log_print(opt, LOG_WARNING,
@@ -4791,6 +4812,7 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,

            mov_url[0] = '\0';
            strcpybuff(mov_url, back[b].r.location);    // copier URL
+            url_drop_fragment(mov_url);

            /* Remove (temporarily created) file if it was created */
            UNLINK(fconv(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), back[b].url_sav));
--- a/src/htsselftest.c
+++ b/src/htsselftest.c
@@ -512,15 +512,21 @@ static int string_safety_selftests(void) {
 /* ------------------------------------------------------------ */

 static int st_filter(httrackp *opt, int argc, char **argv) {
+  char *str, *pat;
+  int matched;
+
  (void) opt;
  if (argc < 2) {
    fprintf(stderr, "filter: needs a filter pattern and a string\n");
    return 1;
  }
-  if (strjoker(argv[1], argv[0], NULL, NULL))
-    printf("%s does match %s\n", argv[1], argv[0]);
-  else
-    printf("%s does NOT match %s\n", argv[1], argv[0]);
+  /* exact-size heap copies so a sanitizer traps any over-read of the pattern */
+  str = strdupt(argv[1]);
+  pat = strdupt(argv[0]);
+  matched = strjoker(str, pat, NULL, NULL) != NULL;
+  printf("%s does %s %s\n", argv[1], matched ? "match" : "NOT match", argv[0]);
+  freet(str);
+  freet(pat);
  return 0;
 }

@@ -899,12 +905,71 @@ static int st_copyopt(httrackp *opt, int argc, char **argv) {
  if (to->parseall != HTS_TRUE)
    err = 1;

+  /* String field: a non-empty source deep-copies across, an empty source
+     leaves the target intact (StringNotEmpty guard). Covers the exported
+     copy_htsopt String path that no crawl test reaches. */
+  StringCopy(from->cookies_file, "/tmp/jar.txt");
+  StringCopy(to->cookies_file, "");
+  copy_htsopt(from, to);
+  if (strcmp(StringBuff(to->cookies_file), "/tmp/jar.txt") != 0)
+    err = 1;
+  StringCopy(from->cookies_file, "");
+  copy_htsopt(from, to);
+  if (strcmp(StringBuff(to->cookies_file), "/tmp/jar.txt") != 0)
+    err = 1;
+
+  /* #185 pause pair: copied when enabled (max>0), the 0 sentinel skips */
+  from->pause_min_ms = 5000;
+  from->pause_max_ms = 10000;
+  to->pause_min_ms = to->pause_max_ms = 0;
+  copy_htsopt(from, to);
+  if (to->pause_min_ms != 5000 || to->pause_max_ms != 10000)
+    err = 1;
+  from->pause_min_ms = from->pause_max_ms = 0;
+  copy_htsopt(from, to);
+  if (to->pause_min_ms != 5000 || to->pause_max_ms != 10000)
+    err = 1;
+
  hts_free_opt(from);
  hts_free_opt(to);
  printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");
  return err;
 }

+static int st_pause(httrackp *opt, int argc, char **argv) {
+  int err = 0, i, seen_low = 0, seen_high = 0;
+
+  (void) opt;
+  (void) argc;
+  (void) argv;
+  /* Consecutive-ms seeds (production shape: launch timestamps a few ms apart)
+     must stay in range and spread, not collapse to a bound -- worst case for a
+     weak low-bit mixer. */
+  for (i = 0; i < 10000; i++) {
+    int t = hts_pause_target_ms((TStamp) (1719500000000LL + i), 5000, 10000);
+
+    if (t < 5000 || t > 10000)
+      err = 1;
+    seen_low |= (t < 6000);
+    seen_high |= (t > 9000);
+  }
+  if (!seen_low || !seen_high)
+    err = 1;
+  if (hts_pause_target_ms(12345, 8000, 8000) != 8000) /* equal bounds = fixed */
+    err = 1;
+  /* deterministic: a seed yields the same target even after an intervening call
+     with another seed (no global PRNG state to perturb it) */
+  {
+    int a = hts_pause_target_ms(99, 5000, 10000);
+
+    (void) hts_pause_target_ms(54321, 5000, 10000);
+    if (hts_pause_target_ms(99, 5000, 10000) != a)
+      err = 1;
+  }
+  printf("pause: %s\n", err ? "FAIL" : "OK");
+  return err;
+}
+
 static int st_relative(httrackp *opt, int argc, char **argv) {
  char s[HTS_URLMAXSIZE * 2];

@@ -1052,6 +1117,173 @@ static int st_cookies(httrackp *opt, int argc, char **argv) {
  return err;
 }

+/* --strip-query: resolver + fil_normalized_filtered, end to end. */
+static int st_stripquery(httrackp *opt, int argc, char **argv) {
+  char dest[1024], keys[256], ref[1024];
+  const char *k;
+
+  (void) opt;
+  (void) argc;
+  (void) argv;
+
+  /* empty rules == plain fil_normalized */
+  assertf(hts_query_strip_keys(NULL, "h.com", "/p?a=1", keys, sizeof(keys)) ==
+          NULL);
+  assertf(hts_query_strip_keys("", "h.com", "/p?a=1", keys, sizeof(keys)) ==
+          NULL);
+  assertf(strcmp(fil_normalized_filtered("/p?b=2&a=1", dest, NULL),
+                 fil_normalized("/p?b=2&a=1", ref)) == 0);
+
+  /* bare form (*=keys): strip the key everywhere, keep+sort the rest */
+  k = hts_query_strip_keys("sid", "any.com", "/p?b=2&sid=x&a=1", keys,
+                           sizeof(keys));
+  assertf(k != NULL && strcmp(k, "sid") == 0);
+  assertf(strcmp(fil_normalized_filtered("/p?b=2&sid=x&a=1", dest, k),
+                 "/p?a=1&b=2") == 0);
+
+  /* reordered variant + an extra stripped key == the clean URL */
+  assertf(strcmp(fil_normalized_filtered("/p?sid=y&a=1&b=2", dest, "sid"),
+                 fil_normalized("/p?a=1&b=2", ref)) == 0);
+
+  /* host pattern matches only that host, incl. its www-normalized forms */
+  assertf(hts_query_strip_keys("ex.com/*=utm", "other.com", "/p?utm=1", keys,
+                               sizeof(keys)) == NULL);
+  assertf(hts_query_strip_keys("ex.com/*=utm", "ex.com", "/p?utm=1", keys,
+                               sizeof(keys)) != NULL);
+  assertf(hts_query_strip_keys("ex.com/*=utm", "www.ex.com", "/p?utm=1", keys,
+                               sizeof(keys)) != NULL);
+  assertf(hts_query_strip_keys("ex.com/*=utm", "http://www-3.ex.com",
+                               "/p?utm=1", keys, sizeof(keys)) != NULL);
+
+  /* last match wins, wholesale: host rule overrides global, no union */
+  k = hts_query_strip_keys("*=sid\nex.com/*=utm", "ex.com",
+                           "/p?sid=1&utm=2&a=3", keys, sizeof(keys));
+  assertf(k != NULL && strcmp(k, "utm") == 0);
+  assertf(strcmp(fil_normalized_filtered("/p?sid=1&utm=2&a=3", dest, k),
+                 "/p?a=3&sid=1") == 0);
+  k = hts_query_strip_keys("*=sid\nex.com/*=utm", "z.com", "/p?sid=1&a=3", keys,
+                           sizeof(keys));
+  assertf(k != NULL && strcmp(k, "sid") == 0);
+
+  /* whole-key match, not prefix: "utm" must not strip utm_source */
+  assertf(strcmp(fil_normalized_filtered("/p?utm_source=x&a=1", dest, "utm"),
+                 "/p?a=1&utm_source=x") == 0);
+
+  /* "*" drops every param; a fully-stripped single-arg query loses its '?' */
+  assertf(strcmp(fil_normalized_filtered("/p?a=1&b=2", dest, "*"), "/p") == 0);
+  assertf(strcmp(fil_normalized_filtered("/p?utm=1", dest, "utm"), "/p") == 0);
+
+  /* degenerate forms a=, b, c== (key 'c'); strip c keeps a= and b */
+  assertf(strcmp(fil_normalized_filtered("/p?a=&b&c==", dest, "c"),
+                 "/p?a=&b") == 0);
+  /* short key must not strip a longer one: 'c' must not touch 'cc' */
+  assertf(strcmp(fil_normalized_filtered("/p?cc=1&c=2", dest, "c"),
+                 "/p?cc=1") == 0);
+
+  /* repeated key: every occurrence is stripped, not just the first */
+  assertf(
+      strcmp(fil_normalized_filtered("/p?foo=42&bar=13&foo=43", dest, "foo"),
+             "/p?bar=13") == 0);
+  /* repeated key mixing missing/empty values */
+  assertf(
+      strcmp(fil_normalized_filtered("/p?foo&bar=13&foo=42&foo=", dest, "foo"),
+             "/p?bar=13") == 0);
+  /* repeated key kept (no match): all occurrences retained, then sorted */
+  assertf(strcmp(fil_normalized_filtered("/p?foo=42&bar=13&foo=43", dest, "z"),
+                 "/p?bar=13&foo=42&foo=43") == 0);
+
+  /* value containing '=': the key is only the part before the first '='. Strip
+     'foo' drops "foo=42=17" whole; the '=' in the value is not a delimiter. */
+  assertf(strcmp(fil_normalized_filtered("/p?foo=42=17&bar=", dest, "foo"),
+                 "/p?bar=") == 0);
+  /* keeping it preserves the embedded '=' verbatim */
+  assertf(strcmp(fil_normalized_filtered("/p?foo=42=17&bar=", dest, "bar"),
+                 "/p?foo=42=17") == 0);
+  /* a value segment is not a key: stripping "42" must not touch foo=42=17 */
+  assertf(strcmp(fil_normalized_filtered("/p?foo=42=17", dest, "42"),
+                 "/p?foo=42=17") == 0);
+
+  /* Idempotency: the read path re-normalizes an already-normalized fil, so the
+     result must be a fixpoint or dedup misses (catches a dropped empty/trailing
+     arg like "?&&", "a&"). */
+  {
+    static const char *const qs[] = {"/p?a=&b&c==",
+                                     "/p?a&&b",
+                                     "/p?&a",
+                                     "/p?a&",
+                                     "/p?",
+                                     "/p?=v",
+                                     "/p?&&",
+                                     "/p?b=2&a=1",
+                                     "/p?utm=x&",
+                                     "/p?&utm=x",
+                                     "/p?foo=42&bar=13&foo=43",
+                                     "/p?foo&bar=13&foo=42&foo=",
+                                     "/p?foo=42=17&bar="};
+    static const char *const strips[] = {NULL, "z", "utm", "*", "a", "foo"};
+    char once[1024], twice[1024];
+    size_t i, j;
+
+    for (i = 0; i < sizeof(qs) / sizeof(qs[0]); i++) {
+      for (j = 0; j < sizeof(strips) / sizeof(strips[0]); j++) {
+        fil_normalized_filtered(qs[i], once, strips[j]);
+        fil_normalized_filtered(once, twice, strips[j]);
+        assertf(strcmp(once, twice) == 0);
+      }
+    }
+  }
+
+  printf("strip-query self-test OK\n");
+  return 0;
+}
+
+/* -%u url-hack split (#271): each sub-flag must toggle independently. */
+static int st_urlhack(httrackp *opt, int argc, char **argv) {
+  (void) argc;
+  (void) argv;
+#define EQ(aa, fa, ab, fb) hash_url_equals(opt, aa, fa, ab, fb)
+  /* urlhack on, no opt-outs: www, // and query order all collapse */
+  opt->urlhack = HTS_TRUE;
+  opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = HTS_FALSE;
+  assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+
+  /* keep-www-prefix: host off; // and query still collapse */
+  opt->no_www_dedup = HTS_TRUE;
+  assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+  opt->no_www_dedup = HTS_FALSE;
+
+  /* keep-double-slashes: // significant; www, query order still collapse */
+  opt->no_slash_dedup = HTS_TRUE;
+  assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+  opt->no_slash_dedup = HTS_FALSE;
+
+  /* keep-query-order: query order significant; www and // still collapse */
+  opt->no_query_dedup = HTS_TRUE;
+  assertf(!EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+  assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  opt->no_query_dedup = HTS_FALSE;
+
+  /* all opt-outs == urlhack off entirely */
+  opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = HTS_TRUE;
+  assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  assertf(!EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+  opt->urlhack = HTS_FALSE;
+  opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = HTS_FALSE;
+  assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+#undef EQ
+  printf("urlhack self-test OK\n");
+  return 0;
+}
+
 /* ------------------------------------------------------------ */
 /* Registry: name -> handler, with a usage hint and a one-line description. */
 /* ------------------------------------------------------------ */
@@ -1068,6 +1300,10 @@ static const struct selftest_entry {
     "size-aware filter verdict (negative size = unknown/scan time)",
     st_filtersize},
    {"simplify", "<path>", "collapse ./ and ../ in a path", st_simplify},
+    {"stripquery", "", "--strip-query pattern/key stripping self-test",
+     st_stripquery},
+    {"urlhack", "", "-%u url-hack sub-flag (www/slash/query) self-test",
+     st_urlhack},
    {"mime", "<filename>", "MIME type for a filename", st_mime},
    {"charset", "<charset> <string>",
     "convert a string to UTF-8 from a charset", st_charset},
@@ -1080,6 +1316,7 @@ static const struct selftest_entry {
    {"strsafe", "[overflow|overflow-buff [str]]", "bounded string-op self-test",
     st_strsafe},
    {"copyopt", "", "copy_htsopt option-copy self-test", st_copyopt},
+    {"pause", "", "randomized inter-file pause target self-test", st_pause},
    {"relative", "<link> <curr-file>", "relative link between two paths",
     st_relative},
    {"resolve", "<link> <adr> <fil>", "resolve a link against an origin",
--- a/tests/01_engine-cmdline.test
+++ b/tests/01_engine-cmdline.test
@@ -90,4 +90,16 @@ refused "dangling-quote argument not refused cleanly"
 run_only "$tmp/q-lone" '"'
 refused "lone-quote argument not refused cleanly"

+# --pause (#185): valid MIN[:MAX] accepted; malformed, reversed, over-range and
+# non-finite values refused cleanly. NaN defeats naive `<`/`>` checks (it
+# compares false to everything), so it must not slip through to the int cast.
+run "$tmp/pause-ok" --pause 0.2:0.4
+accepted "$tmp/pause-ok" "#185: valid --pause range rejected"
+run "$tmp/pause-fix" --pause 0.2
+accepted "$tmp/pause-fix" "#185: valid fixed --pause rejected"
+for bad in nan nan:5 5:nan inf 10:5 99999; do
+    run "$tmp/pause-bad" --pause "$bad"
+    refused "#185: invalid --pause '$bad' not refused cleanly"
+done
+
 exit 0
--- a/tests/01_engine-filter.test
+++ b/tests/01_engine-filter.test
@@ -50,27 +50,54 @@ match '*foo*bar' 'foozbar'
 # '?' is the query-string marker, not a single-char wildcard
 nomatch 'a?c' 'abc'

-# backslash escapes a metacharacter inside a class so it is matched literally.
-# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
-# both X and '\'. These assertions pin that behavior.
+# Inside a class, backslash escapes the next char as a literal member (#148):
+# '\X' matches X only (not '\'), and an escaped ']' is a member, not the terminator.
 match '*[\*]' '*'
-match '*[\*]' "\\"
-nomatch '*[\*]' 'a'
+nomatch '*[\*]' "\\"
 match '*[\\]' "\\"
-nomatch '*[\\]' 'a'
+nomatch '*[\\]' '*'
 match '*[\[]' '['
-match '*[\[]' "\\"
-nomatch '*[\[]' 'a'
+nomatch '*[\[]' "\\"
+match '*[\]]' ']'
+nomatch '*[\]]' "\\"

-# A literal ']' cannot be a class member: the class parser stops at the first
-# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
-# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
-# by a trailing literal ']'. These assertions document the current (buggy)
-# behavior so any future matcher fix is a deliberate, visible change.
-nomatch '*[\[\]]' '[' # not matched, despite the docs
-match '*[\[\]]' ']'   # only via the empty class-match + trailing ']'
-match '*[\[\]]' '[]'  # one of {'[','\'} then the trailing ']'
-nomatch '*[\[\]]' '[]x'
+# '*[\[\]]' is "the [ or ] character", as the filter guide documents.
+match '*[\[\]]' '['
+match '*[\[\]]' ']'
+nomatch '*[\[\]]' 'a'
+match '*[\[,\]]' '[' # comma between members is optional
+match '*[\[,\]]' ']'
+match '*[a,\[]' 'a' # an escaped member no longer eats the preceding one
+match '*[a,\[]' '['
+
+# Escape is decoded before the range/separator/size checks, so '\-' '\,' '\<'
+# are literal members, not operators.
+match '*[a\-z]' 'a'
+match '*[a\-z]' 'z'
+nomatch '*[a\-z]' 'b' # not the a..z range
+match '*[\,]' ','
+nomatch '*[\,]' "\\" # the escape must not leak '\' into the class
+match '*[\<]' '<'
+nomatch '*[\<]' "\\"
+match '*[\[,\],a]' '['
+match '*[\[,\],a]' ']'
+match '*[\[,\],a]' 'a'
+
+# A truncated range '*[a-' is the literal members {a,-}; the parser must not
+# read past the end decoding it (was a 1-byte heap over-read in the range arm).
+match '*[a-' 'a'
+nomatch '*[a-' 'b'
+
+# *(...) matches exactly one char from the class; *[...] matches a run.
+match '*(a,b)' 'a'
+nomatch '*(a,b)' 'aa'
+nomatch '*(a,b)' 'c'
+
+# documented composite filters (filters.html)
+match 'www.*[path].com/*[path].zip' 'www.foo.com/a/b.zip'
+nomatch 'www.*[path].com/*[path].zip' 'www.foo.com/a/b.tar'
+match '*.html*[]' 'page.html'
+nomatch '*.html*[]' 'page.html?x=1' # *[] forbids the trailing query

 # Size-based rules (-#test=filtersize <size> <string> <filter...>): a negative size
 # means the size is still unknown (scan time). A size exclusion must stay neutral
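The class assertions above pin down the order in which members are decoded. A minimal Python sketch of that order, assuming only what the assertions state: a backslash makes the next character a literal member, a comma is an optional separator, `x-y` is a range, and a range with nothing after the `-` falls back to literal members. `class_members` is a hypothetical helper for illustration, not the C matcher; it ignores size operators and named classes such as `*[path]`.

```python
def class_members(body):
    """Return the literal members of a class body (the text after '[')."""
    members = set()
    i, n = 0, len(body)
    while i < n:
        c = body[i]
        if c == "\\" and i + 1 < n:
            # Escape is decoded first: the next char is a literal member,
            # and the backslash itself is not added.
            members.add(body[i + 1])
            i += 2
        elif c == "]":
            break  # an unescaped ']' terminates the class
        elif c == ",":
            i += 1  # optional separator between members
        elif i + 2 < n and body[i + 1] == "-":
            # Complete range x-y; only taken when a char follows the '-'.
            lo, hi = ord(c), ord(body[i + 2])
            members.update(chr(k) for k in range(lo, hi + 1))
            i += 3
        else:
            members.add(c)  # plain literal, including a dangling '-'
            i += 1
    return members


print(sorted(class_members(r"\[\]]")))  # ['[', ']']
print(sorted(class_members("a-")))      # ['-', 'a']
```

Guarding the range arm on `i + 2 < n` is what keeps the truncated `a-` case from stepping past the end of the pattern in this sketch.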
--- a/tests/01_engine-pause.test
+++ b/tests/01_engine-pause.test
@@ -0,0 +1,15 @@
+#!/bin/bash
+#
+# --pause (#185): the inter-file pause target must stay in [min,max] and spread
+# across it (a per-call rand() would collapse it toward min). Driven by the
+# in-process 'httrack -#test=pause' test. POSIX-portable ($(BASH) is /bin/sh on macOS).
+
+set -eu
+
+# 'run' is an ignored placeholder argument.
+out=$(httrack -#test=pause run)
+
+test "$out" = "pause: OK" || {
+    echo "expected 'pause: OK', got: $out" >&2
+    exit 1
+}
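Why a per-call rand() collapses the delay toward the minimum can be shown with a small simulation. This is a sketch under assumptions, not httrack's code: `realized_delay`, `mean_delay`, the 1..3 second range and the 10 ms polling step are all hypothetical. It models a gate that is polled repeatedly and opens once the elapsed time reaches the target. If the target is re-sampled on every poll, the gate opens at the first poll whose fresh sample happens to be small enough, which favors short delays.

```python
import random


def realized_delay(min_s, max_s, step, rng, resample_each_poll):
    """Return the elapsed time at which a polled gate opens for one gap."""
    target = rng.uniform(min_s, max_s)  # sampled once per gap
    elapsed = 0.0
    while True:
        if resample_each_poll:
            # Biased variant: a fresh target on every gate evaluation.
            target = rng.uniform(min_s, max_s)
        if elapsed >= target:
            return elapsed
        elapsed += step


def mean_delay(resample_each_poll, gaps=2000, seed=1):
    """Average realized delay over many gaps with a seeded generator."""
    rng = random.Random(seed)
    total = sum(
        realized_delay(1.0, 3.0, 0.01, rng, resample_each_poll)
        for _ in range(gaps)
    )
    return total / gaps


stable = mean_delay(resample_each_poll=False)
biased = mean_delay(resample_each_poll=True)
print(f"stable target: {stable:.2f}s mean, per-poll resample: {biased:.2f}s mean")
```

With a stable per-gap target the mean sits near the middle of the range; with per-poll resampling it sits much closer to the minimum, which is the spread property the self-test is meant to guard.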
--- a/tests/01_engine-stripquery.test
+++ b/tests/01_engine-stripquery.test
@@ -0,0 +1,8 @@
+#!/bin/bash
+#
+
+set -euo pipefail
+
+# --strip-query: pattern-scoped query-key stripping for dedup. All assertions
+# live in the engine self-test (hts_query_strip_keys + fil_normalized_filtered).
+httrack -O /dev/null -#test=stripquery | grep -q "strip-query self-test OK"
--- a/tests/01_engine-urlhack.test
+++ b/tests/01_engine-urlhack.test
@@ -0,0 +1,8 @@
+#!/bin/bash
+#
+
+set -euo pipefail
+
+# -%u url-hack split (#271): www / // / query-order dedup toggle independently.
+# All assertions live in the engine self-test (hash compare flag resolution).
+httrack -O /dev/null -#test=urlhack run | grep -q "urlhack self-test OK"
--- a/tests/26_local-strip-query.test
+++ b/tests/26_local-strip-query.test
@@ -0,0 +1,23 @@
+#!/bin/bash
+#
+# End-to-end --strip-query (#112): two links to one resource differing only by
+# ?utm_source dedup to a single saved file (2 files written: index + resource);
+# the control crawl without the option keeps both variants (3 files). Locks the
+# CLI->opt->hash plumbing the engine self-test can't reach.
+
+set -e
+
+: "${top_srcdir:=..}"
+
+# stripped: the two ?utm_source variants collapse to one resource
+bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
+    httrack 'BASEURL/stripquery/index.html' --strip-query 'utm_source'
+
+# control: no stripping -> both query-named variants are saved
+bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
+    httrack 'BASEURL/stripquery/index.html'
+
+# strip still applies with url-hack off (-%u0): exercises the urlhack-off
+# savename branch, which must normalize the dedup key the same way the hash does
+bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
+    httrack 'BASEURL/stripquery/index.html' -%u0 --strip-query 'utm_source'
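The file counts asserted above (2 with stripping, 3 without) follow from how many distinct local keys the two links produce. A minimal sketch of that idea, assuming a key made of the path plus the remaining query; `dedup_key` is a hypothetical helper, not the engine's `fil_normalized_filtered`, and re-encoding through `parse_qsl`/`urlencode` does not reproduce the engine's handling of degenerate queries. Only the local key is modeled; the fetched URL would be left untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def dedup_key(url, strip_keys):
    """Hypothetical dedup key: the path plus the query minus stripped keys."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in strip_keys
    ]
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


links = ["a.html?utm_source=x", "a.html?utm_source=y"]
stripped = {dedup_key(u, {"utm_source"}) for u in links}
control = {dedup_key(u, set()) for u in links}

# One resource with stripping, two without; adding index.html gives 2 and 3.
print(len(stripped), len(control))  # 1 2
```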
--- a/tests/27_local-cookies-file.test
+++ b/tests/27_local-cookies-file.test
@@ -0,0 +1,22 @@
+#!/bin/bash
+#
+# End-to-end --cookies-file (#215): /gated/secret.php needs a cookie no page
+# ever Set-Cookies, so it is reachable only when the option preloads it from a
+# Netscape cookies.txt. Locks the CLI->opt->cookie_load->wire plumbing.
+
+set -e
+
+: "${top_srcdir:=..}"
+
+# preloaded cookie -> secret page is served. -o0 means a 500 leaves no file, so
+# --found/--files only hold when the secret is genuinely fetched (200).
+bash "$top_srcdir/tests/local-crawl.sh" --cookie 'session=opensesame' \
+    --errors 0 --files 2 \
+    --found 'gated/index.html' --found 'gated/secret.html' \
+    httrack 'BASEURL/gated/index.php' -o0
+
+# control: without the cookie the secret 500s; -o0 suppresses the error page so
+# its absence is real (error + missing file)
+bash "$top_srcdir/tests/local-crawl.sh" --errors 1 \
+    --found 'gated/index.html' --not-found 'gated/secret.html' \
+    httrack 'BASEURL/gated/index.php' -o0
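The jar that `--cookie` produces is one tab-separated line per cookie in the Netscape cookies.txt layout: domain, include-subdomains flag, path, secure flag, expiry, name, value. The sketch below mirrors the field order used by the harness for a NAME=VALUE spec; `cookie_line`, `parse_cookie_line` and the port 8080 are hypothetical and only illustrate the format, they are not part of the harness or of httrack.

```python
def cookie_line(host, port, spec, expiry=1999999999):
    """Build one Netscape cookies.txt line from a NAME=VALUE spec."""
    name, _, value = spec.partition("=")  # split at the first '='
    fields = [f"{host}:{port}", "TRUE", "/", "FALSE", str(expiry), name, value]
    return "\t".join(fields)


def parse_cookie_line(line):
    """Split a cookies.txt line back into its seven named fields."""
    domain, subdomains, path, secure, expiry, name, value = line.split("\t")
    return {
        "domain": domain,
        "include_subdomains": subdomains == "TRUE",
        "path": path,
        "secure": secure == "TRUE",
        "expiry": int(expiry),
        "name": name,
        "value": value,
    }


line = cookie_line("127.0.0.1", 8080, "session=opensesame")
cookie = parse_cookie_line(line)
print(cookie["domain"], cookie["name"], cookie["value"])
```

The port is part of the domain field here because the test server listens on an ephemeral port, which the harness comment gives as the reason for scoping the cookie to host:port.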
--- a/tests/28_local-pause.test
+++ b/tests/28_local-pause.test
@@ -0,0 +1,29 @@
+#!/bin/bash
+#
+# --pause (#185): a fixed inter-file delay must slow a multi-file crawl. Measure
+# the same crawl with and without --pause and compare: the harness overhead
+# cancels, leaving only the pause. Integer seconds keep it portable (BSD date
+# has no %N); a lower bound is not timing-flaky since a pause only adds time.
+
+set -e
+
+: "${top_srcdir:=..}"
+
+run() { # echoes the wall-clock seconds of one crawl
+    local t0 t1
+    t0=$(date +%s)
+    bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
+        httrack 'BASEURL/types/index.html' -c1 "$@" >/dev/null 2>&1
+    t1=$(date +%s)
+    echo $((t1 - t0))
+}
+
+base=$(run)
+paused=$(run --pause 0.5)
+delta=$((paused - base))
+
+echo "crawl: ${base}s, with --pause 0.5: ${paused}s (delta ${delta}s)"
+if [ "$delta" -lt 2 ]; then
+    echo "FAIL: --pause did not delay the crawl (delta ${delta}s)" >&2
+    exit 1
+fi
--- a/tests/29_local-redirect-fragment.test
+++ b/tests/29_local-redirect-fragment.test
@@ -0,0 +1,11 @@
+#!/bin/bash
+# Issue #204: a 302 Location with a #fragment must drop the fragment before the
+# target is fetched. The server is strict (400 on a '#' in the request-target),
+# so a leaked fragment logs an error and the target is never saved.
+set -e
+
+: "${top_srcdir:=..}"
+
+bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
+    --found 'redir/target.html' \
+    httrack 'BASEURL/redir/index.html'
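What the test expects from the client can be sketched in a few lines: resolve the Location against the redirecting URL, then drop the fragment before the target is requested. `redirect_target` and the host and port in the example are hypothetical; this uses Python's standard `urllib.parse`, not httrack's C helper.

```python
from urllib.parse import urldefrag, urljoin


def redirect_target(base_url, location):
    """Resolve a Location header and drop its #fragment before fetching."""
    absolute = urljoin(base_url, location)
    return urldefrag(absolute).url


target = redirect_target("http://127.0.0.1:8080/redir/go.php", "target.html#section")
print(target)  # http://127.0.0.1:8080/redir/target.html
```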
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -5,6 +5,7 @@ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
 	proxy-https-server.py \
 	local-crawl.sh local-server.py server.crt server.key \
 	server-root/simple/basic.html server-root/simple/link.html \
+	server-root/stripquery/index.html server-root/stripquery/a.html \
 	fixtures/cache-golden/hts-cache/new.zip

 TESTS_ENVIRONMENT =
@@ -40,12 +41,15 @@ TESTS = \
 	01_engine-idna.test \
 	01_engine-mime.test \
 	01_engine-parse.test \
+	01_engine-pause.test \
 	01_engine-rcfile.test \
 	01_engine-relative.test \
 	01_engine-savename.test \
 	01_engine-selftest-dispatch.test \
 	01_engine-simplify.test \
+	01_engine-stripquery.test \
 	01_engine-strsafe.test \
+	01_engine-urlhack.test \
 	02_manpage-regen.test \
 	02_update-cache.test \
 	10_crawl-simple.test \
@@ -68,6 +72,10 @@ TESTS = \
 	22_local-broken-size.test \
 	23_local-errpage.test \
 	24_local-resume-overlap.test \
-	25_local-mime-exclude.test
+	25_local-mime-exclude.test \
+	26_local-strip-query.test \
+	27_local-cookies-file.test \
+	28_local-pause.test \
+	29_local-redirect-fragment.test

 CLEANFILES = check-network_sh.cache
--- a/tests/local-crawl.sh
+++ b/tests/local-crawl.sh
@@ -12,11 +12,14 @@
 # the mirror directory name.
 #
 # Usage:
-#   bash local-crawl.sh [--tls] [--root DIR] \
+#   bash local-crawl.sh [--tls] [--root DIR] [--cookie NAME=VALUE ...] \
 #       --errors N --files N --found PATH ... --directory PATH ... \
 #       --log-found REGEX ... --log-not-found REGEX ... \
 #       httrack BASEURL/some/path [httrack-args...]
 # --log-found/--log-not-found grep (ERE) the crawl's hts-log.txt.
+# --cookie writes a Netscape cookies.txt (scoped to the discovered host:port,
+# which the ephemeral port forces into the cookie domain) and passes it to
+# httrack via --cookies-file, to exercise preloaded cookies.

 set -u

@@ -85,6 +88,7 @@ tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create

 # --- parse leading control flags --------------------------------------------
 declare -a audit=()
+declare -a cookies=()
 scheme=http
 pos=0
 args=("$@")
@@ -105,6 +109,10 @@ while test "$pos" -lt "$nargs"; do
        pos=$((pos + 1))
        root="${args[$pos]}"
        ;;
+    --cookie)
+        pos=$((pos + 1))
+        cookies+=("${args[$pos]}")
+        ;;
    --errors | --files)
        audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
        pos=$((pos + 1))
@@ -158,6 +166,17 @@ while test "$pos" -lt "$nargs"; do
    pos=$((pos + 1))
 done

+# --- materialize any --cookie entries into a cookies.txt ---------------------
+if test "${#cookies[@]}" -gt 0; then
+    jar="${tmpdir}/cookies.txt"
+    : >"$jar"
+    for spec in "${cookies[@]}"; do
+        printf '127.0.0.1:%s\tTRUE\t/\tFALSE\t1999999999\t%s\t%s\n' \
+            "$port" "${spec%%=*}" "${spec#*=}" >>"$jar"
+    done
+    hts+=(--cookies-file "$jar")
+fi
+
 # --- run httrack -------------------------------------------------------------
 which httrack >/dev/null || die "could not find httrack"
 ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
--- a/tests/local-server.py
+++ b/tests/local-server.py
@@ -110,6 +110,19 @@ class Handler(SimpleHTTPRequestHandler):
            return self.fail_cookie("badger")
        self.send_html("\tThis is a test.")

+    # --cookies-file (#215): the secret page needs a cookie no page ever sets,
+    # so it is reachable only when --cookies-file preloads it.
+    GATE_COOKIE = ("session", "opensesame")
+
+    def route_gated_index(self):
+        self.send_html('\tThis is a <a href="secret.php">link</a>')
+
+    def route_gated_secret(self):
+        name, value = self.GATE_COOKIE
+        if self.request_cookies().get(name) != value:
+            return self.fail_cookie(name)
+        self.send_html("\tThis is the secret.")
+
    def route_robots(self):
        body = b"User-agent: *\nDisallow:\n"
        self.send_response(200)
@@ -341,10 +354,27 @@ class Handler(SimpleHTTPRequestHandler):
        if self.command != "HEAD":
            self.wfile.write(body)

+    # 302 whose Location carries a #fragment (#204): the fragment is a UA anchor
+    # that must be dropped before the target is fetched. A leaked '#' reaches the
+    # strict-server guard below and 400s.
+    def route_redir_index(self):
+        self.send_html('\t<a href="go.php">go</a>')
+
+    def route_redir_go(self):
+        self.send_response(302, "Found")
+        self.send_header("Location", "target.html#section")
+        self.send_header("Content-Length", "0")
+        self.end_headers()
+
+    def route_redir_target(self):
+        self.send_raw(b"<html><body>redirect target</body></html>\n", "text/html")
+
    ROUTES = {
        "/cookies/entrance.php": route_entrance,
        "/cookies/second.php": route_second,
        "/cookies/third.php": route_third,
+        "/gated/index.php": route_gated_index,
+        "/gated/secret.php": route_gated_secret,
        "/robots.txt": route_robots,
        "/types/index.html": route_types_index,
        "/types/control.php": route_types,
@@ -376,10 +406,23 @@ class Handler(SimpleHTTPRequestHandler):
        "/mimex/index.html": route_mimex_index,
        "/mimex/blob.pdf": route_mimex_blob,
        "/mimex/real.html": route_mimex_real,
+        "/redir/index.html": route_redir_index,
+        "/redir/go.php": route_redir_go,
+        "/redir/target.html": route_redir_target,
    }

    # --- dispatch ----------------------------------------------------------

+    def reject_fragment(self):
+        # Strict server: a '#' in the request-target is the client failing to
+        # drop a fragment (#204). RFC 3986 forbids it on the wire; answer 400.
+        if "#" in self.path:
+            self.send_response(400, "Bad Request")
+            self.send_header("Content-Length", "0")
+            self.end_headers()
+            return True
+        return False
+
    def dispatch(self):
        self._set_cookies = []
        path = urlsplit(self.path).path
@@ -391,10 +434,14 @@ class Handler(SimpleHTTPRequestHandler):
        return False

    def do_GET(self):
+        if self.reject_fragment():
+            return
        if not self.dispatch():
            super().do_GET()

    def do_HEAD(self):
+        if self.reject_fragment():
+            return
        if not self.dispatch():
            super().do_HEAD()

--- a/tests/server-root/stripquery/a.html
+++ b/tests/server-root/stripquery/a.html
@@ -0,0 +1 @@
+<html><body>resource A</body></html>
--- a/tests/server-root/stripquery/index.html
+++ b/tests/server-root/stripquery/index.html
@@ -0,0 +1,5 @@
+<html><body>
+Two links to one resource, differing only by a tracking parameter.
+<a href="a.html?utm_source=x">x</a>
+<a href="a.html?utm_source=y">y</a>
+</body></html>
Author	SHA1	Message	Date
Xavier Roche	aecbaa9993	Release 3.49.10 Bump the package version to 3.49.10 and curate the release notes. VERSION_INFO goes 3:1:0 -> 3:2:0: the cycle only appended tail fields to the installed options struct (--cookies-file, --pause, --strip-query, the -%u split), no existing symbol or offset changed, so the soname stays .so.3. history.txt gets the 3.49-10 block; debian/changelog gets 3.49.10-1 with the Debian-specific items (DEP-5 copyright, chromium-first browser dep, minizip embedded-library override). Standards-Version 4.7.0 -> 4.7.4: the intervening Policy changes (usr-merge, /usr/games, Priority recommendation, -dev linker scripts, non-free-firmware) need no package change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>	2026-06-28 14:21:37 +02:00
Xavier Roche	a62f93a107	Strip the #fragment from a redirect Location before fetching (#204) (#441) * Strip the #fragment from a redirect Location before fetching (#204) A 302/30x Location is dereferenced, not displayed, so its #fragment is a client-side anchor that must be dropped before the target is requested. httrack kept it: the redirect followers copied r.location verbatim, so the re-request carried `GET /page.html#frag` (strict servers answer 400) and the mirror was saved under a fragment-polluted name. HTML links were already stripped at parse time; only the two Location followers were not. Drop the fragment in a small helper called at both follow sites, covering the live and cached-redirect paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> * test(#204): strict-server guard so a leaked fragment is a wire-level failure The first cut of 29_local-redirect-fragment only checked the saved filename. Python's urlsplit() drops the fragment before routing, so a `#` leaked into the GET line still routed to the target and the crawl passed: the assertion was a proxy, not the wire behavior the fix targets. Make the server strict (400 on any `#` in the request-target, like the real servers in #204), so a leaked fragment now logs an error and the target is never saved. Neutering the fix makes the test fail with the exact "400 Bad Request" from the issue. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> --------- Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 13:52:21 +02:00
Xavier Roche	799ec88dc7	filters: fix escaped brackets inside [...] character classes (#440) filters: decode escaped chars correctly inside [...] classes The escape branch in strjoker probed joker[i+2] instead of the current char, so a backslash escape only worked as the first class member: '[\[\]]' (documented as "the [ or ] character") matched only ']', and '[a,\[]' dropped the 'a'. The loop also treated any ']' as the class terminator, so an escaped ']' could never be a member. Decode the escape first in the loop body: a backslash takes the next char as the literal member (only that char, not also the backslash the old code added), and an escaped ']' is consumed before the terminator check. So '[\[\]]' now matches both brackets, and escape precedes the range/size checks ('\-' '\,' '\<' become literal members). The self-test previously pinned the buggy output as expected; it now asserts the documented behavior and fails against the old matcher. Closes #148 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> * filters: fix a 1-byte over-read on a truncated range [a- The [...] class parser's range arm does i += 3 unconditionally, so a pattern ending in a dangling '-' (e.g. [a-) read one byte past the NUL: joker[i+2] is the NUL, i jumps to len+1, and the separator skip and loop guard then read joker[len+1]. Guard the range arm on joker[i+2] != '\0' so a truncated range falls through to the literal-member path instead of overshooting. The filter self-test now copies the pattern and string into exact-size heap buffers so a sanitizer traps such over-reads; the pattern previously came straight from argv (no redzone), which is why this stayed invisible. A [a- test case exercises it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> --------- Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:56:11 +02:00
Xavier Roche	71af4a24f0	lang: add a translation guide and fix the English colon spacing (#439) Document the lang/*.txt format and contribution flow in lang/README.md (linked from CONTRIBUTING.md); the project had no written instructions for translators. Also drop the stray space before the colon in the English "Filters (refuse/accept links):" label so it matches the other labels. Only the English.txt value is changed, not the msgid key, so existing translations still resolve. Closes #74 Closes #75 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:45:07 +02:00
Xavier Roche	e17f4f12a0	Add --pause to space out file downloads by a random delay (#185) (#438) A new --pause MIN[:MAX] (seconds, -%G) waits a random MIN..MAX between files so a crawl looks less like a bot and is gentler on the server; a single value is a fixed delay. Disabled by default. It reuses the existing non-blocking launch gate (back_pluggable_sockets_strict): rather than Sleep() -- which would freeze the single select() pump and stall the other in-flight transfers -- the gate just withholds new launches until the delay elapses, one file per gap. The per-gap target is derived from the last-request timestamp so it stays stable across the many gate evaluations within a gap yet rerolls on each launch; sampling rand() per evaluation would instead bias the realized delay toward MIN. Two int fields appended at the httrackp tail (ABI-stable, no soname bump). Covered by a pure-function self-test (range + spread, with teeth against the min-bias bug) and a local-server crawl that asserts the pause slows a multi-file mirror. Closes #185 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:34:56 +02:00
Xavier Roche	5be8ba4bbd	Add --cookies-file to preload a Netscape cookies.txt (#215) (#437) Mirroring a site behind a login meant either re-implementing the auth flow or dropping a file literally named cookies.txt into the output or working directory, the only two places the engine looked. This adds a CLI option to point at an arbitrary Netscape/Mozilla cookies.txt, so a session exported from a browser (the "Get cookies.txt" extensions write exactly this format) is replayed on the crawl and authenticated pages come down. The plumbing already existed: cookie_load parses the format into the shared jar and the request path sends every matching cookie. The new opt->cookies_file is loaded last, after the mirror/CWD defaults, so a user-supplied value wins on a name/domain/path conflict. The field is appended at the tail of httrackp, so the exported ABI is unchanged. Cookies key on host[:port], so a bare-domain file matches a normal crawl of a default-port site; only an explicit-port URL needs the port in the cookie domain. Covered by 27_local-cookies-file.test: a gated page that 500s without a cookie no page ever sets, reachable only once the file preloads it (with -o0 so the absence of a 500 error page is meaningful), plus a no-cookie control. The local-crawl harness grows a --cookie helper that writes a port-scoped jar. The copyopt self-test also gains a String round-trip so the exported copy_htsopt path for the new field is covered. Closes #215 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 22:57:05 +02:00
Xavier Roche	247a46068e	debian: lead webhttrack browser dep with chromium, not firefox-esr (#436) webhttrack's "firefox-esr | chromium | www-browser" made the httrack source inherit firefox-esr's autoremoval clock: britney keys off the first alternative of a disjunction, so every firefox-esr RC bug (currently #1127569) dragged httrack toward removal even though webhttrack stays installable via the other alternatives. Lead with chromium instead. lintian requires a real package as the first alternative (virtual-package-depends-without-real-package-depends), so the virtual www-browser cannot go first; chromium is real, keeps the dep lintian-clean, and makes britney track chromium rather than the RC-bug-prone firefox-esr. Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 20:44:30 +02:00
Xavier Roche	669947cd23	Split -%u URL Hacks into independent www/slash/query toggles (#271) (#435) -%u (--urlhack) bundled three dedup normalizations under one switch: www.host == host, redundant // collapse, and query-argument reordering. A mirror that needed one but not another (e.g. keep www. distinct) had to turn the whole umbrella off. Add three opt-out sub-options, defaulting to the umbrella so existing -%u/-%u0 behavior is unchanged: --keep-www-prefix keep www.foo.com distinct from foo.com (-%j) --keep-double-slashes keep redundant // in the path (-%o) --keep-query-order keep query-argument order significant (-%y) The split is resolved once in hash_init() into norm_host/norm_slash/ norm_query and threaded through the dedup hash (htshash.c), the savename lookup key (htsname.c) and the redirect-loop diagnostic (htsparse.c) so all three stay consistent. fil_normalized() gains an internal fil_normalized_ex(do_slash, do_query) core; the public fil_normalized()/fil_normalized_filtered() keep their signatures. Normalization (slash/query) now follows urlhack and its sub-flags uniformly, while --strip-query stays orthogonal. So with urlhack off, strip-query strips keys without sorting the remainder; the url_savename urlhack-off branch is moved to the same do_slash=0/do_query=0 normalizer the hash uses, so a URL is always looked up under the key it was stored with (a self-lookup mismatch this otherwise introduced). http/https are always merged in the dedup key (the scheme is stripped regardless of -%u), so that part of the request needs no toggle. The opt-outs are spelled positively (--keep-*) because httrack's generic --no<opt> prefix only appends the disabling "0" for parametered options, not "single" booleans, so --nowww-dedup would silently no-op. opt grows three hts_boolean fields appended at the struct tail (offsets stable, no soname bump, matching the strip_query addition in #112). Tested by a -#test=urlhack engine self-test (hash_url_equals over each flag combination) plus a -%u0 + --strip-query crawl case exercising the urlhack-off savename branch. Closes #271 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 20:26:28 +02:00
Xavier Roche	40a66600ff	Add --strip-query to drop query keys from dedup naming (#112) (#434) Two URLs that differ only in tracking or session query parameters (?utm_source=x versus ?utm_source=y) were saved as separate files, and a single CGI could fan out into thousands of near-duplicate pages. fil_normalized already sorted query args, so reordered parameters dedup, but there was no way to drop a named key. --strip-query "[host/pattern=]key1,key2,..." (repeatable) removes the listed keys when computing the dedup key and the saved name. The fetched URL is untouched, so a required sid= is still sent on the wire; only the local namespace collapses. Patterns match the normalized host/path with the +/- filter glob (strjoker), last match wins as in the filter list, and stripping is decoupled from urlhack (-%u) so it never silently no-ops with -%u0. It all funnels through one chokepoint, fil_normalized: an internal fil_normalized_filtered() strips then delegates, and hts_query_strip_keys resolves the per-URL key list. The strip pass walks every query field, including empty and trailing ones, so its output is a fixpoint under the read path's second normalization (otherwise dedup silently misses). Exported ABI is unchanged; the strip_query field is appended at the tail of httrackp. Covered by a -#test=stripquery self-test (degenerate queries like a=&b&c== and a 50-case idempotency fixpoint) and an end-to-end dedup crawl test. Closes #112 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 11:13:16 +02:00