Split -%u URL Hacks into independent www/slash/query toggles (#271)

-%u (--urlhack) bundled three dedup normalizations under one switch:
www.host == host, redundant // collapse, and query-argument reordering.
A mirror that needed one but not another (e.g. keep www. distinct) had
to turn the whole umbrella off.

Add three opt-out sub-options, defaulting to the umbrella so existing
-%u/-%u0 behavior is unchanged:

  --keep-www-prefix      keep www.foo.com distinct from foo.com (-%j)
  --keep-double-slashes  keep redundant // in the path (-%o)
  --keep-query-order     keep query-argument order significant (-%y)

The split is resolved once in hash_init() into
norm_host/norm_slash/norm_query and threaded through the dedup hash
(htshash.c), the savename lookup key (htsname.c) and the redirect-loop
diagnostic (htsparse.c) so all three stay consistent. fil_normalized()
gains an internal fil_normalized_ex(do_slash, do_query) core; the public
fil_normalized()/fil_normalized_filtered() keep their signatures.

http/https are always merged in the dedup key (the scheme is stripped
regardless of -%u), so that part of the request needs no toggle.

The opt-outs are spelled positively (--keep-*) because httrack's generic
--no<opt> prefix only appends the disabling "0" for parametered options,
not "single" booleans, so --nowww-dedup would silently no-op.

opt grows three hts_boolean fields appended at the struct tail (offsets
stable, no soname bump, matching the strip_query addition in #112).

Tested by a -#test=urlhack engine self-test (hash_url_equals over each
flag combination) driven by 01_engine-urlhack.test.

Closes #271

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Add --strip-query to drop query keys from dedup naming (#112) (#434)
2026-06-27 20:47:19 +03:00 · 2026-06-27 13:36:38 +02:00 · 2026-06-27 11:13:16 +02:00 · 2026-06-27 08:40:22 +02:00 · 2026-06-26 22:01:14 +02:00 · 2026-06-26 21:21:54 +02:00
28 changed files with 1071 additions and 76 deletions
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -188,6 +188,51 @@ jobs:
        if: failure()
        run: cat tests/test-suite.log 2>/dev/null || true

+  # MemorySanitizer catches reads of uninitialized memory (#143's stack-garbage
+  # size filter) that ASan/UBSan miss. It flags any byte an uninstrumented lib
+  # wrote, so the job stays in our own code: offline self-tests only, no openssl
+  # (--disable-https), no zlib cache tests, static (the runtime is not in .so's).
+  msan:
+    name: msan (MemorySanitizer, clang)
+    runs-on: ubuntu-24.04
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          submodules: recursive
+
+      - name: Install build dependencies
+        run: |
+          set -euo pipefail
+          sudo apt-get update
+          sudo apt-get install -y --no-install-recommends \
+            build-essential clang autoconf automake libtool autoconf-archive \
+            zlib1g-dev
+
+      - name: Configure (MSan, static, no https)
+        run: |
+          set -euo pipefail
+          autoreconf -fi
+          ./configure CC=clang \
+            CFLAGS="-fsanitize=memory -fsanitize-memory-track-origins=2 -fno-sanitize-recover=all -g -O1 -fno-omit-frame-pointer" \
+            LDFLAGS="-fsanitize=memory" \
+            --disable-https --disable-shared --enable-static
+
+      - name: Build
+        run: make -j"$(nproc)"
+
+      - name: Test (offline self-tests under MSan)
+        env:
+          MSAN_OPTIONS: abort_on_error=1:halt_on_error=1
+        run: |
+          set -euo pipefail
+          # Engine self-tests only; the cache trio pulls in uninstrumented zlib.
+          tests="$(cd tests && ls 01_engine-*.test | grep -v -- '-cache' | tr '\n' ' ')"
+          make check TESTS="$tests"
+
+      - name: Print the test log on failure
+        if: failure()
+        run: cat tests/test-suite.log 2>/dev/null || true
+
  # Optional-dependency build: compile and test with HTTPS/OpenSSL disabled --
  # the configuration users on minimal systems build, and one libssl is not even
  # installed here so configure cannot silently re-enable it. The matrix above
--- a/man/httrack.1
+++ b/man/httrack.1
@@ -3,7 +3,7 @@
 .\"
 .\" This file is generated by man/makeman.sh; do not edit by hand.
 .\" SPDX-License-Identifier: GPL-3.0-or-later
-.TH httrack 1 "26 June 2026" "httrack website copier"
+.TH httrack 1 "27 June 2026" "httrack website copier"
 .SH NAME
 httrack \- offline browser : copy websites to a local directory
 .SH SYNOPSIS
@@ -43,6 +43,7 @@ httrack \- offline browser : copy websites to a local directory
 [ \fB\-x, \-\-replace\-external\fR ]
 [ \fB\-%x, \-\-disable\-passwords\fR ]
 [ \fB\-%q, \-\-include\-query\-string\fR ]
+[ \fB\-%g, \-\-strip\-query\fR ]
 [ \fB\-o, \-\-generate\-errors\fR ]
 [ \fB\-X, \-\-purge\-old[=N]\fR ]
 [ \fB\-%p, \-\-preserve\fR ]
@@ -198,6 +199,8 @@ replace external html links by error pages (\-\-replace\-external)
 do not include any password for external password protected websites (%x0 include) (\-\-disable\-passwords)
 .IP \-%q
 *include query string for local files (useless, for information purpose only) (%q0 don't include) (\-\-include\-query\-string)
+.IP \-%g
+strip query keys for dedup ([host/pattern=]key1,key2,...) (\-\-strip\-query <param>)
 .IP \-o
 *generate output html file in case of error (404..) (o0 don't generate) (\-\-generate\-errors)
 .IP \-X
@@ -225,6 +228,8 @@ tolerant requests (accept bogus responses on some servers, but not standard!) (\
 update hacks: various hacks to limit re\-transfers when updating (identical size, bogus response..) (\-\-updatehack)
 .IP \-%u
 url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (\-\-urlhack)
+.br
+opt out of one url\-hack part: \-\-keep\-www\-prefix (www.foo.com<>foo.com), \-\-keep\-double\-slashes (//), \-\-keep\-query\-order (?b&a)
 .IP \-%A
 assume that a type (cgi,asp..) is always linked with a mime type (\-%A php3,cgi=text/html;dat,bin=application/x\-zip) (\-\-assume <param>)
 .br
--- a/src/htsalias.c
+++ b/src/htsalias.c
@@ -60,6 +60,9 @@ Please visit our Website: http://www.httrack.com
  param1 : this option must be alone, and needs one distinct parameter (-P <path>)
  param0 : this option must be alone, but the parameter should be put together (+*.gif)
 */
+/* clang-format off: hand-aligned table; clang-format reflows the whole
+   initializer (2->4 space) on any edit, churning every untouched row. */
+/* clang-format off */
 const char *hts_optalias[][4] = {
  /*   {"","","",""}, */
  {"path", "-O", "param1", "output path"},
@@ -107,6 +110,8 @@ const char *hts_optalias[][4] = {
  {"disable-passwords", "-%x", "single", ""}, {"disable-password", "-%x",
                                               "single", ""},
  {"include-query-string", "-%q", "single", ""},
+  {"strip-query", "-%g", "param1",
+   "strip [host/pattern=]key1,key2,... from URLs"},
  {"generate-errors", "-o", "single", ""},
  {"do-not-generate-errors", "-o0", "single", ""},
  {"purge-old", "-X", "param", ""},
@@ -123,6 +128,9 @@ const char *hts_optalias[][4] = {
  {"tolerant", "-%B", "single", ""},
  {"updatehack", "-%s", "single", ""}, {"sizehack", "-%s", "single", ""},
  {"urlhack", "-%u", "single", ""},
+  {"keep-www-prefix", "-%j", "single", ""},
+  {"keep-double-slashes", "-%o", "single", ""},
+  {"keep-query-order", "-%y", "single", ""},
  {"user-agent", "-F", "param1", "user-agent identity"},
  {"referer", "-%R", "param1", "default referer URL"},
  {"from", "-%E", "param1", "from email address"},
@@ -241,6 +249,7 @@ const char *hts_optalias[][4] = {

  {"", "", "", ""}
 };
+/* clang-format on */

 /* 
  Check for alias in command-line 
--- a/src/htsback.c
+++ b/src/htsback.c
@@ -57,7 +57,10 @@ Please visit our Website: http://www.httrack.com
 // DOS
 #include <process.h>            /* _beginthread, _endthread */
 #endif
+#include <io.h> /* _chsize_s */
+#define HTS_FTRUNCATE(fp, sz) _chsize_s(_fileno(fp), (sz))
 #else
+#define HTS_FTRUNCATE(fp, sz) ftruncate(fileno(fp), (sz))
 #endif

 #define VT_CLREOL       "\33[K"
@@ -3763,7 +3766,27 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
                    }
 #endif
 /********** **************************** ********** */
-                  } else {      // il faut aller le chercher
+                  }
+                  // MIME type excluded by a -mime: filter: abort, don't fetch
+                  // the body (#58)
+                  else if (HTTP_IS_OK(back[i].r.statuscode) &&
+                           !back[i].testmode &&
+                           strnotempty(back[i].r.contenttype) &&
+                           hts_acceptmime(opt, 0, back[i].url_adr,
+                                          back[i].url_fil,
+                                          back[i].r.contenttype) == 1) {
+                    deletehttp(&back[i].r);
+                    back[i].r.soc = INVALID_SOCKET;
+                    back[i].status = STATUS_READY;
+                    back_set_finished(sback, i);
+                    back[i].r.statuscode = STATUSCODE_EXCLUDED;
+                    strcpybuff(back[i].r.msg, "Excluded by MIME type filter");
+                    hts_log_print(
+                        opt, LOG_NOTICE,
+                        "File excluded by MIME type filter (%s): %s%s",
+                        back[i].r.contenttype, back[i].url_adr,
+                        back[i].url_fil);
+                  } else { // il faut aller le chercher

                    // effacer buffer (requète)
                    if (!noFreebuff) {
@@ -3774,35 +3797,70 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
                    // xxc SI CHUNK VERIFIER QUE CA MARCHE??
                    if (back[i].r.statuscode == 206) {  // on nous envoie un morceau (la fin) coz une partie sur disque!
                      off_t sz = fsize_utf8(back[i].url_sav);
+                      /* RFC 7233: resume at the server's Content-Range start,
+                         not the offset we requested; a server may resume
+                         earlier and appending the overlap duplicates bytes
+                         (#198). */
+                      const LLint resume = back[i].r.crange_start;
+                      const hts_boolean range_ok =
+                          back[i].r.crange > 0 && resume >= 0 &&
+                          resume <= (LLint) sz &&
+                          back[i].r.crange_end + 1 == back[i].r.crange &&
+                          (back[i].r.totalsize < 0 ||
+                           back[i].r.totalsize ==
+                               back[i].r.crange_end - resume + 1);

 #if HDEBUG
                      printf("partial content: " LLintP " on disk..\n",
                             (LLint) sz);
 #endif
-                      if (sz >= 0) {
+                      if (sz >= 0 && range_ok) {
                        if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_sav)) {  // pas HTML
                          if (opt->getmode & HTS_GETMODE_NONHTML) {
                            filenote(&opt->state.strc, back[i].url_sav, NULL);  // noter fichier comme connu
                            file_notify(opt, back[i].url_adr, back[i].url_fil,
                                        back[i].url_sav, 0, 1,
                                        back[i].r.notmodified);
-                            back[i].r.out = FOPEN(fconv(catbuff, sizeof(catbuff), back[i].url_sav), "ab");       // append
+                            back[i].r.out =
+                                FOPEN(fconv(catbuff, sizeof(catbuff),
+                                            back[i].url_sav),
+                                      "r+b"); // resume in place
                            if (back[i].r.out && opt->cache != 0) {
-                              back[i].r.is_write = 1;   // écrire
-                              back[i].r.size = sz;      // déja écrit
-                              back[i].r.statuscode = HTTP_OK;   // Forcer 'OK'
+                              back[i].r.is_write = 1;
+                              back[i].r.size = resume; // bytes already on disk
+                              back[i].r.statuscode = HTTP_OK; // force 'OK'
                              if (back[i].r.totalsize >= 0)
-                                back[i].r.totalsize += sz;      // plus en fait
-                              fseek(back[i].r.out, 0, SEEK_END);        // à la fin
-                              /* create a temporary reference file in case of broken mirror */
-                              if (back_serialize_ref(opt, &back[i]) != 0) {
-                                hts_log_print(opt, LOG_WARNING,
-                                              "Could not create temporary reference file for %s%s",
-                                              back[i].url_adr, back[i].url_fil);
-                              }
+                                back[i].r.totalsize += resume; // -> full size
+                              // drop bytes past the resume point; a silent
+                              // failure could leave a stale tail, so on error
+                              // drop the partial and refetch the whole file
+                              if (HTS_FTRUNCATE(back[i].r.out,
+                                                (off_t) resume) != 0) {
+                                fclose(back[i].r.out);
+                                back[i].r.out = NULL;
+                                url_savename_refname_remove(
+                                    opt, back[i].url_adr, back[i].url_fil);
+                                UNLINK(back[i].url_sav);
+                                back[i].status = STATUS_READY;
+                                back_set_finished(sback, i);
+                                strcpybuff(back[i].r.msg,
+                                           "Can not truncate partial file, "
+                                           "restarting");
+                              } else {
+                                fseeko(back[i].r.out, (off_t) resume, SEEK_SET);
+                                /* create a temporary reference file in case of
+                                 * broken mirror */
+                                if (back_serialize_ref(opt, &back[i]) != 0) {
+                                  hts_log_print(opt, LOG_WARNING,
+                                                "Could not create temporary "
+                                                "reference file for %s%s",
+                                                back[i].url_adr,
+                                                back[i].url_fil);
+                                }
 #if HDEBUG
-                              printf("continue interrupted file\n");
+                                printf("continue interrupted file\n");
 #endif
+                              }
                            } else {    // On est dans la m**
                              back[i].status = STATUS_READY;    // terminé (voir plus loin)
                              back_set_finished(sback, i);
@@ -3814,17 +3872,18 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
                          FILE *fp =
                            FOPEN(fconv(catbuff, sizeof(catbuff), back[i].url_sav), "rb");
                          if (fp) {
-                            LLint alloc_mem = sz + 1;
+                            LLint alloc_mem = resume + 1;

                            if (back[i].r.totalsize >= 0)
                              alloc_mem += back[i].r.totalsize; // AJOUTER RESTANT!
                            if (deleteaddr(&back[i].r)
                                && (back[i].r.adr =
                                    (char *) malloct((size_t) alloc_mem))) {
-                              back[i].r.size = sz;
+                              back[i].r.size = resume;
                              if (back[i].r.totalsize >= 0)
-                                back[i].r.totalsize += sz;      // plus en fait
-                              if ((fread(back[i].r.adr, 1, sz, fp)) != sz) {
+                                back[i].r.totalsize += resume; // -> full size
+                              if ((fread(back[i].r.adr, 1, (size_t) resume,
+                                         fp)) != (size_t) resume) {
                                back[i].status = STATUS_READY;  // terminé (voir plus loin)
                                back_set_finished(sback, i);
                                strcpybuff(back[i].r.msg,
@@ -3842,14 +3901,30 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
                                         "No memory for partial file");
                            }
                            fclose(fp);
-                          } else {      // Argh.. 
+                          } else {                              // open failed
                            back[i].status = STATUS_READY;      // terminé (voir plus loin)
                            back_set_finished(sback, i);
                            strcpybuff(back[i].r.msg,
                                       "Can not open partial file");
                          }
                        }
-                      } else {  // Non trouvé??
+                      } else if (sz >=
+                                 0) { // unusable range -> restart whole file
+                        hts_log_print(opt, LOG_WARNING,
+                                      "Unusable partial-content range for %s%s "
+                                      "(have " LLintP " bytes, got " LLintP
+                                      "-" LLintP "/" LLintP "), restarting",
+                                      back[i].url_adr, back[i].url_fil,
+                                      (LLint) sz, back[i].r.crange_start,
+                                      back[i].r.crange_end, back[i].r.crange);
+                        url_savename_refname_remove(opt, back[i].url_adr,
+                                                    back[i].url_fil);
+                        UNLINK(back[i].url_sav);
+                        back[i].status = STATUS_READY;
+                        back_set_finished(sback, i);
+                        strcpybuff(back[i].r.msg,
+                                   "Unusable partial content, restarting");
+                      } else {                          // partial not found
                        back[i].status = STATUS_READY;  // terminé (voir plus loin)
                        back_set_finished(sback, i);
                        strcpybuff(back[i].r.msg, "Can not find partial file");
@@ -3930,7 +4005,6 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,

                      }
                    }
-
                  }

                  /*} */
--- a/src/htsbasenet.h
+++ b/src/htsbasenet.h
@@ -146,7 +146,8 @@ typedef enum BackStatusCode {
  STATUSCODE_NON_FATAL = -5,
  STATUSCODE_SSL_HANDSHAKE = -6,
  STATUSCODE_TOO_BIG = -7,
-  STATUSCODE_TEST_OK = -10
+  STATUSCODE_TEST_OK = -10,
+  STATUSCODE_EXCLUDED = -11 /* aborted: MIME excluded by a -mime: filter */
 } BackStatusCode;

 /** HTTrack status ('status' member of of 'lien_back') **/
--- a/src/htscore.c
+++ b/src/htscore.c
@@ -736,26 +736,39 @@ int httpmirror(char *url1, httrackp * opt) {
    /* OPTIMIZED for fast load */
    if (StringNotEmpty(opt->filelist)) {
      char *filelist_buff = NULL;
-      const size_t filelist_sz = off_t_to_size_t(fsize(StringBuff(opt->filelist)));
+      size_t filelist_sz = 0;
+      const char *filelist_err = NULL; /* failure reason, NULL on success */
+      const off_t fs = fsize(StringBuff(opt->filelist));

-      if (filelist_sz != (size_t) -1) {
+      if (fs < 0) {
+        /* fsize() hides the cause; redo stat() for a precise errno (#49) */
+        struct stat st;
+        filelist_err = stat(StringBuff(opt->filelist), &st) != 0
+                           ? strerror(errno)
+                           : "not a regular file";
+      } else if ((filelist_sz = off_t_to_size_t(fs)) == (size_t) -1) {
+        filelist_err = "file too large";
+        filelist_sz = 0;
+      } else {
        FILE *fp = fopen(StringBuff(opt->filelist), "rb");

-        if (fp) {
+        if (fp == NULL) {
+          filelist_err = strerror(errno);
+        } else {
          filelist_buff = malloct(filelist_sz + 1);
-          if (filelist_buff) {
-            if (fread(filelist_buff, 1, filelist_sz, fp) != filelist_sz) {
-              freet(filelist_buff);
-              filelist_buff = NULL;
-            } else {
-              *(filelist_buff + filelist_sz) = '\0';
-            }
+          if (filelist_buff == NULL) {
+            filelist_err = "out of memory";
+          } else if (fread(filelist_buff, 1, filelist_sz, fp) != filelist_sz) {
+            freet(filelist_buff);
+            filelist_err = "read error";
+          } else {
+            filelist_buff[filelist_sz] = '\0';
          }
          fclose(fp);
        }
      }

-      if (filelist_buff) {
+      if (filelist_buff != NULL) {
        int filelist_ptr = 0;
        int n = 0;
        char BIGSTK line[HTS_URLMAXSIZE * 2];
@@ -780,8 +793,8 @@ int httpmirror(char *url1, httrackp * opt) {
        // Free buffer
        freet(filelist_buff);
      } else {
-        hts_log_print(opt, LOG_ERROR, "Could not include URL list: %s",
-                      StringBuff(opt->filelist));
+        hts_log_print(opt, LOG_ERROR, "Could not include URL list \"%s\": %s",
+                      StringBuff(opt->filelist), filelist_err);
      }
    }

@@ -3726,6 +3739,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
  if (StringNotEmpty(from->user_agent))
    StringCopyS(to->user_agent, from->user_agent);

+  if (StringNotEmpty(from->strip_query))
+    StringCopyS(to->strip_query, from->strip_query);
+
  if (from->retry > -1)
    to->retry = from->retry;

--- a/src/htscore.h
+++ b/src/htscore.h
@@ -234,8 +234,12 @@ struct hash_struct {
  coucal adrfil;
  /* former address+path -> link index (renamed/moved entries) */
  coucal former_adrfil;
-  /* scratch buffers reused across lookups (not reentrant) */
-  int normalized;
+  /* effective urlhack sub-flags: www.==host / // collapse / query-arg sort */
+  int norm_host;
+  int norm_slash;
+  int norm_query;
+  /* query-strip keys (not owned); set from opt->strip_query at hash_init */
+  const char *strip_query;
  char normfil[HTS_URLMAXSIZE * 2];
  char normfil2[HTS_URLMAXSIZE * 2];
  char catbuff[CATBUFF_SIZE];
@@ -364,6 +368,22 @@ int fspc(httrackp * opt, FILE * fp, const char *type);

 char *next_token(char *p, int flag);

+/* Like fil_normalized(), but first drops query keys in STRIP (comma-separated,
+   "*" = all); STRIP NULL/empty behaves exactly like fil_normalized(). */
+char *fil_normalized_filtered(const char *source, char *dest,
+                              const char *strip);
+
+/* As fil_normalized_filtered(), but DO_SLASH/DO_QUERY gate the // collapse and
+   the query-argument sort independently (the urlhack sub-flags). */
+char *fil_normalized_filtered_ex(const char *source, char *dest,
+                                 const char *strip, int do_slash, int do_query);
+
+/* For URL ADR/FIL, return (in DEST) the comma keylist to strip from the
+   '\n'-separated "[pattern=]keys" RULES (patterns matched on host/path via
+   strjoker, last wins); NULL if none match. Feeds fil_normalized_filtered(). */
+const char *hts_query_strip_keys(const char *rules, const char *adr,
+                                 const char *fil, char *dest, size_t destsize);
+
 /* Read a whole file into a freshly malloc'd, NUL-terminated buffer; the caller
   owns it and must release it with freet(). Return NULL on missing/unreadable
   file (readfile_or substitutes defaultdata instead). The byte content is NOT
--- a/src/htscoremain.c
+++ b/src/htscoremain.c
@@ -1570,6 +1570,27 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
                  com++;
                }
                break;          // url hack
+              case 'j':
+                opt->no_www_dedup = 1; // --keep-www-prefix: keep www.X != X
+                if (*(com + 1) == '0') {
+                  opt->no_www_dedup = 0;
+                  com++;
+                }
+                break;
+              case 'o':
+                opt->no_slash_dedup = 1; // --keep-double-slashes: keep //
+                if (*(com + 1) == '0') {
+                  opt->no_slash_dedup = 0;
+                  com++;
+                }
+                break;
+              case 'y':
+                opt->no_query_dedup = 1; // --keep-query-order: keep ?b&a order
+                if (*(com + 1) == '0') {
+                  opt->no_query_dedup = 0;
+                  com++;
+                }
+                break;
              case 'v':
                opt->verbosedisplay = HTS_VERBOSE_FULL;
                if (isdigit((unsigned char) *(com + 1))) {
@@ -1937,6 +1958,21 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
                }
                break;

+              case 'g': // strip-query: accumulate "[pattern=]keys" entries
+                if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
+                  HTS_PANIC_PRINTF("Option strip-query needs a blank space and "
+                                   "[host/pattern=]key1,key2,...");
+                  printf("Example: --strip-query "
+                         "\"www.example.com/*=utm_source,sid\"\n");
+                  htsmain_free();
+                  return -1;
+                } else {
+                  na++;
+                  if (StringNotEmpty(opt->strip_query))
+                    StringCat(opt->strip_query, "\n");
+                  StringCat(opt->strip_query, argv[na]);
+                }
+                break;
              case 't':        /* do not change type (ending) of filenames according to the MIME type */
                opt->no_type_change = 1;
                if (*(com+1)=='0') { opt->no_type_change = 0; com++; }
--- a/src/htsfilters.c
+++ b/src/htsfilters.c
@@ -76,7 +76,8 @@ int fa_strjoker(int type, char **filters, int nfil, const char *nom, LLint * siz
    }
    if (size)
      sz = *size;
-    if (strjoker(nom, filters[i] + filteroffs, &sz, size_flag)) {       // reconnu
+    /* size unknown (scan time): no size pointer => size tests stay neutral */
+    if (strjoker(nom, filters[i] + filteroffs, size ? &sz : NULL, size_flag)) {
      if (size)
        if (sz != *size)
          sizelimit = sz;
--- a/src/htshash.c
+++ b/src/htshash.c
@@ -106,10 +106,10 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,
  const lien_url*const lien = (const lien_url*) value;
  const char *const adr = !former ? lien->adr : lien->former_adr;
  const char *const fil = !former ? lien->fil : lien->former_fil;
-  const char *const adr_norm = adr != NULL ? 
-    ( hash->normalized  ? jump_normalized_const(adr)
-                        : jump_identification_const(adr) )
-    : NULL;
+  const char *const adr_norm =
+      adr != NULL ? (hash->norm_host ? jump_normalized_const(adr)
+                                     : jump_identification_const(adr))
+                  : NULL;

  // copy address
  assertf(adr_norm != NULL);
@@ -117,10 +117,18 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,

  // copy link
  assertf(fil != NULL);
-  if (hash->normalized) {
-    fil_normalized(fil, &hash->normfil[strlen(hash->normfil)]);
-  } else {
-    strcpy(&hash->normfil[strlen(hash->normfil)], fil);
+  {
+    /* resolve the per-URL strip keys; strip applies even when urlhack is off */
+    char BIGSTK keybuf[HTS_URLMAXSIZE];
+    const char *const keys = hts_query_strip_keys(hash->strip_query, adr, fil,
+                                                  keybuf, sizeof(keybuf));
+
+    if (hash->norm_slash || hash->norm_query || keys != NULL) {
+      fil_normalized_filtered_ex(fil, &hash->normfil[strlen(hash->normfil)],
+                                 keys, hash->norm_slash, hash->norm_query);
+    } else {
+      strcpy(&hash->normfil[strlen(hash->normfil)], fil);
+    }
  }

  // hash
@@ -132,8 +140,7 @@ static int key_adrfil_equals_generic(void *arg,
                                     coucal_key_const a_,
                                     coucal_key_const b_, 
                                     const int former) {
-  hash_struct *const hash = (hash_struct*) arg;
-  const int normalized = hash->normalized;
+  hash_struct *const hash = (hash_struct *) arg;
  const lien_url*const a = (const lien_url*) a_;
  const lien_url*const b = (const lien_url*) b_;
  const char *const a_adr = !former ? a->adr : a->former_adr;
@@ -150,10 +157,10 @@ static int key_adrfil_equals_generic(void *arg,
  assertf(b_fil != NULL);

  // skip scheme and authentication to the domain (possibly without www.)
-  ja = normalized
-    ? jump_normalized_const(a_adr) : jump_identification_const(a_adr);
-  jb = normalized
-    ? jump_normalized_const(b_adr) : jump_identification_const(b_adr);
+  ja = hash->norm_host ? jump_normalized_const(a_adr)
+                       : jump_identification_const(a_adr);
+  jb = hash->norm_host ? jump_normalized_const(b_adr)
+                       : jump_identification_const(b_adr);
  assertf(ja != NULL);
  assertf(jb != NULL);
  if (strcasecmp(ja, jb) != 0) {
@@ -161,12 +168,23 @@ static int key_adrfil_equals_generic(void *arg,
  }

  // now compare pathes
-  if (normalized) {
-    fil_normalized(a_fil, hash->normfil);
-    fil_normalized(b_fil, hash->normfil2);
-    return strcmp(hash->normfil, hash->normfil2) == 0;
-  } else {
-    return strcmp(a_fil, b_fil) == 0;
+  {
+    char BIGSTK ka[HTS_URLMAXSIZE], kb[HTS_URLMAXSIZE];
+    const char *const keysa =
+        hts_query_strip_keys(hash->strip_query, a_adr, a_fil, ka, sizeof(ka));
+    const char *const keysb =
+        hts_query_strip_keys(hash->strip_query, b_adr, b_fil, kb, sizeof(kb));
+
+    if (hash->norm_slash || hash->norm_query || keysa != NULL ||
+        keysb != NULL) {
+      fil_normalized_filtered_ex(a_fil, hash->normfil, keysa, hash->norm_slash,
+                                 hash->norm_query);
+      fil_normalized_filtered_ex(b_fil, hash->normfil2, keysb, hash->norm_slash,
+                                 hash->norm_query);
+      return strcmp(hash->normfil, hash->normfil2) == 0;
+    } else {
+      return strcmp(a_fil, b_fil) == 0;
+    }
  }
 }

@@ -226,7 +244,13 @@ void hash_init(httrackp *opt, hash_struct * hash, int normalized) {
  hash->sav = coucal_new(0);
  hash->adrfil = coucal_new(0);
  hash->former_adrfil = coucal_new(0);
-  hash->normalized = normalized;
+  /* urlhack is the umbrella; per-feature negatives opt out of each part */
+  hash->norm_host = normalized && !opt->no_www_dedup;
+  hash->norm_slash = normalized && !opt->no_slash_dedup;
+  hash->norm_query = normalized && !opt->no_query_dedup;
+  /* snapshot the query-strip list (not owned; valid for the hash lifetime) */
+  hash->strip_query =
+      StringNotEmpty(opt->strip_query) ? StringBuff(opt->strip_query) : NULL;

  hts_set_hash_handler(hash->sav, opt);
  hts_set_hash_handler(hash->adrfil, opt);
@@ -282,6 +306,26 @@ void hash_free(hash_struct *hash) {
  }
 }

+/* Test helper: do the two URLs dedupe to the same key under opt's urlhack
+   flags? Exercises the live hash compare (norm_host/slash/query resolution). */
+int hash_url_equals(httrackp *opt, const char *adra, const char *fila,
+                    const char *adrb, const char *filb) {
+  hash_struct hash;
+  lien_url la, lb;
+  int eq;
+
+  memset(&la, 0, sizeof(la));
+  memset(&lb, 0, sizeof(lb));
+  la.adr = key_duphandler(NULL, adra);
+  la.fil = key_duphandler(NULL, fila);
+  lb.adr = key_duphandler(NULL, adrb);
+  lb.fil = key_duphandler(NULL, filb);
+  hash_init(opt, &hash, opt->urlhack);
+  eq = key_adrfil_equals(&hash, &la, &lb);
+  hash_free(&hash);
+  return eq;
+}
+
 // retour: position ou -1 si non trouvé
 int hash_read(const hash_struct * hash, const char *nom1, const char *nom2,
              hash_struct_type type) {
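Reviewer note on the htshash.c change above: the "umbrella plus opt-outs" resolution done in hash_init() can be sketched in isolation. The struct and function names below (urlhack_opts, urlhack_norm, resolve_urlhack) are hypothetical stand-ins, not part of httrack; only the boolean logic mirrors the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the httrackp fields this patch reads. */
struct urlhack_opts {
  bool urlhack;        /* -%u umbrella switch */
  bool no_www_dedup;   /* --keep-www-prefix */
  bool no_slash_dedup; /* --keep-double-slashes */
  bool no_query_dedup; /* --keep-query-order */
};

/* Hypothetical stand-in for hash->norm_host/norm_slash/norm_query. */
struct urlhack_norm {
  bool host;
  bool slash;
  bool query;
};

/* Each normalization is active only when the umbrella is on AND its
   opt-out is not set; with the umbrella off, opt-outs change nothing. */
static struct urlhack_norm resolve_urlhack(const struct urlhack_opts *o) {
  struct urlhack_norm n;

  n.host = o->urlhack && !o->no_www_dedup;
  n.slash = o->urlhack && !o->no_slash_dedup;
  n.query = o->urlhack && !o->no_query_dedup;
  return n;
}
```

With the umbrella on and only no_www_dedup set, this yields host off and slash/query on, which is the combination the "keep-www-prefix" case of the self-test checks.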
--- a/src/htshash.h
+++ b/src/htshash.h
@@ -53,6 +53,9 @@ typedef enum hash_struct_type {
 // tables de hachage
 void hash_init(httrackp *opt, hash_struct *hash, int normalized);
 void hash_free(hash_struct *hash);
+/* Test helper: 1 if the two URLs dedupe together under opt's urlhack flags. */
+int hash_url_equals(httrackp *opt, const char *adra, const char *fila,
+                    const char *adrb, const char *filb);
 int hash_read(const hash_struct * hash, const char *nom1, const char *nom2,
              hash_struct_type type);
 void hash_write(hash_struct * hash, size_t lpos);
--- a/src/htshelp.c
+++ b/src/htshelp.c
@@ -563,6 +563,7 @@ void help(const char *app, int more) {
    (" %x  do not include any password for external password protected websites (%x0 include)");
  infomsg
    (" %q *include query string for local files (useless, for information purpose only) (%q0 don't include)");
+  infomsg(" %g  strip query keys for dedup ([host/pattern=]key1,key2,...)");
  infomsg
    ("  o *generate output html file in case of error (404..) (o0 don't generate)");
  infomsg("  X *purge old files after update (X0 keep delete)");
@@ -587,6 +588,9 @@ void help(const char *app, int more) {
    (" %s  update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..)");
  infomsg
    (" %u  url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..)");
+  infomsg("     opt out of one url-hack part: --keep-www-prefix "
+          "(www.foo.com<>foo.com), --keep-double-slashes (//), "
+          "--keep-query-order (?b&a)");
  infomsg
    (" %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip)");
  infomsg("     shortcut: '--assume standard' is equivalent to -%A "
--- a/src/htslib.c
+++ b/src/htslib.c
@@ -3610,7 +3610,10 @@ static int sortNormFnc(const void *a_, const void *b_) {
  return strcmp(*a + 1, *b + 1);
 }

-HTSEXT_API char *fil_normalized(const char *source, char *dest) {
+/* Path normalizer core: optionally collapse redundant '//' (DO_SLASH) and/or
+   sort query arguments (DO_QUERY) so equivalent URLs dedupe. */
+static char *fil_normalized_ex(const char *source, char *dest, int do_slash,
+                               int do_query) {
  char lastc = 0;
  int gotquery = 0;
  int ampargs = 0;
@@ -3620,8 +3623,8 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
  for(i = j = 0; source[i] != '\0'; i++) {
    if (!gotquery && source[i] == '?')
      gotquery = ampargs = 1;
-    if ((!gotquery && lastc == '/' && source[i] == '/') // foo//bar -> foo/bar
-      ) {
+    if (do_slash && !gotquery && lastc == '/' && source[i] == '/') {
+      // foo//bar -> foo/bar
    } else {
      if (gotquery && source[i] == '&') {
        ampargs++;
@@ -3633,7 +3636,7 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
  dest[j++] = '\0';

  /* Sort arguments (&foo=1&bar=2 == &bar=2&foo=1) */
-  if (ampargs > 1) {
+  if (do_query && ampargs > 1) {
    char **amps = malloct(ampargs * sizeof(char *));
    char *copyBuff = NULL;
    size_t qLen = 0;
@@ -3681,6 +3684,153 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
  return dest;
 }

+HTSEXT_API char *fil_normalized(const char *source, char *dest) {
+  return fil_normalized_ex(source, dest, 1, 1);
+}
+
+/* Is query key ARG[0..keylen) in the comma-separated STRIP list? "*" = all;
+   case-sensitive, space-trimmed tokens. */
+static int hts_query_key_stripped(const char *arg, size_t keylen,
+                                  const char *strip) {
+  const char *p = strip;
+
+  while (*p != '\0') {
+    const char *start = p;
+    size_t toklen;
+
+    while (*p != '\0' && *p != ',')
+      p++;
+    toklen = (size_t) (p - start);
+    while (toklen > 0 && *start == ' ') {
+      start++;
+      toklen--;
+    }
+    while (toklen > 0 && start[toklen - 1] == ' ')
+      toklen--;
+    if (toklen == 1 && start[0] == '*')
+      return 1;
+    if (toklen == keylen && strncmp(start, arg, keylen) == 0)
+      return 1;
+    if (*p == ',')
+      p++;
+  }
+  return 0;
+}
+
+/* see htscore.h */
+char *fil_normalized_filtered_ex(const char *source, char *dest,
+                                 const char *strip, int do_slash,
+                                 int do_query) {
+  const char *query;
+  char BIGSTK tmp[HTS_URLMAXSIZE * 2];
+  htsbuff cb;
+  int wrote = 0;
+
+  /* No strip list, or no query: plain normalization. */
+  if (strip == NULL || *strip == '\0' ||
+      (query = strchr(source, '?')) == NULL) {
+    return fil_normalized_ex(source, dest, do_slash, do_query);
+  }
+
+  /* Copy the path, re-emit kept args; fil_normalized_ex() sorts if do_query.
+     Walk every field incl. empty/trailing ("a&","?&&") so the result is a
+     fixpoint (the read re-normalizes it; a dropped empty arg misses dedup). */
+  cb = htsbuff_ptr(tmp, sizeof(tmp));
+  htsbuff_catn(&cb, source, (size_t) (query - source));
+  for (query++;;) {
+    const char *const arg = query;
+    const char *eq = NULL;
+    size_t keylen, arglen;
+
+    while (*query != '\0' && *query != '&') {
+      if (eq == NULL && *query == '=')
+        eq = query;
+      query++;
+    }
+    arglen = (size_t) (query - arg);
+    keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
+    if (!hts_query_key_stripped(arg, keylen, strip)) {
+      htsbuff_catc(&cb, wrote ? '&' : '?');
+      htsbuff_catn(&cb, arg, arglen);
+      wrote = 1;
+    }
+    if (*query == '\0')
+      break;
+    query++;
+  }
+  return fil_normalized_ex(tmp, dest, do_slash, do_query);
+}
+
+/* see htscore.h */
+char *fil_normalized_filtered(const char *source, char *dest,
+                              const char *strip) {
+  return fil_normalized_filtered_ex(source, dest, strip, 1, 1);
+}
+
+/* see htscore.h */
+const char *hts_query_strip_keys(const char *rules, const char *adr,
+                                 const char *fil, char *dest, size_t destsize) {
+  const char *p, *q;
+  const char *result = NULL;
+  char BIGSTK url[HTS_URLMAXSIZE * 2];
+
+  if (rules == NULL || *rules == '\0' || destsize == 0)
+    return NULL;
+
+  /* Match string = normalized host/path with the query removed.
+     jump_normalized_const() collapses www + scheme/auth so the read and write
+     (double-normalized) keys agree; the rule match sees host/path only. */
+  url[0] = '\0';
+  strcatbuff(url, jump_normalized_const(adr));
+  if (fil[0] != '/')
+    strcatbuff(url, "/");
+  q = strchr(fil, '?');
+  if (q != NULL)
+    strncatbuff(url, fil, (int) (q - fil));
+  else
+    strcatbuff(url, fil);
+
+  /* Walk the '\n' entries; last match wins (like the +/- filter eval). Each is
+     "pattern=keys"; no '=' is the bare form, pattern "*". */
+  for (p = rules; *p != '\0';) {
+    const char *const line = p;
+    const char *eol, *eq, *keys;
+    char BIGSTK pat[HTS_URLMAXSIZE * 2];
+
+    while (*p != '\0' && *p != '\n')
+      p++;
+    eol = p;
+    if (*p == '\n')
+      p++;
+    if (eol == line)
+      continue;
+    eq = memchr(line, '=', (size_t) (eol - line));
+    if (eq != NULL) {
+      size_t patlen = (size_t) (eq - line);
+
+      if (patlen >= sizeof(pat))
+        patlen = sizeof(pat) - 1;
+      memcpy(pat, line, patlen);
+      pat[patlen] = '\0';
+      keys = eq + 1;
+    } else {
+      pat[0] = '*';
+      pat[1] = '\0';
+      keys = line;
+    }
+    if (strjoker(url, pat, NULL, NULL) != NULL) {
+      size_t klen = (size_t) (eol - keys);
+
+      if (klen >= destsize)
+        klen = destsize - 1;
+      memcpy(dest, keys, klen);
+      dest[klen] = '\0';
+      result = dest;
+    }
+  }
+  return result;
+}
+
 #define endwith(a) ( (len >= (sizeof(a)-1)) ? ( strncmp(dest, a+len-(sizeof(a)-1), sizeof(a)-1) == 0 ) : 0 );
 HTSEXT_API char *adr_normalized_sized(const char *source, char *dest,
                                      size_t destsize) {
@@ -5890,7 +6040,11 @@ HTSEXT_API httrackp *hts_create_opt(void) {
  opt->verbosedisplay = HTS_VERBOSE_NONE; // no text animation
  opt->sizehack = HTS_FALSE;
  opt->urlhack = HTS_TRUE;
+  opt->no_www_dedup = HTS_FALSE;
+  opt->no_slash_dedup = HTS_FALSE;
+  opt->no_query_dedup = HTS_FALSE;
  StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
+  StringCopy(opt->strip_query, "");
  opt->ftp_proxy = HTS_TRUE;
  opt->convert_utf8 = HTS_TRUE;
  StringCopy(opt->filelist, "");
@@ -6035,6 +6189,7 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
    StringFree(opt->urllist);
    StringFree(opt->footer);
    StringFree(opt->mod_blacklist);
+    StringFree(opt->strip_query);

    StringFree(opt->path_html);
    StringFree(opt->path_html_utf8);
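Reviewer note on the htslib.c change above: the whole-key matching and argument re-emission can be illustrated with a standalone sketch. This is a simplified approximation under stated assumptions, not the patch's code: key_is_stripped and strip_query_args are hypothetical names, the output goes to a caller-supplied buffer instead of htsbuff, tokens are not space-trimmed, and no sorting or // collapsing is done afterwards.

```c
#include <stddef.h>
#include <string.h>

/* Is key[0..keylen) one of the comma-separated tokens in strip?
   A lone "*" token matches every key. Whole-key, case-sensitive match. */
static int key_is_stripped(const char *key, size_t keylen, const char *strip) {
  const char *p = strip;

  while (*p != '\0') {
    const char *start = p;
    size_t toklen;

    while (*p != '\0' && *p != ',')
      p++;
    toklen = (size_t) (p - start);
    if (toklen == 1 && start[0] == '*')
      return 1;
    if (toklen == keylen && strncmp(start, key, keylen) == 0)
      return 1;
    if (*p == ',')
      p++;
  }
  return 0;
}

/* Copy src into dst, dropping every query argument whose key is listed in
   strip. Returns dst, or NULL when dst (dstsize bytes) is too small. */
static char *strip_query_args(const char *src, const char *strip, char *dst,
                              size_t dstsize) {
  const char *q = strchr(src, '?');
  size_t n;
  int wrote = 0;

  if (q == NULL)
    q = src + strlen(src);
  n = (size_t) (q - src);
  if (n >= dstsize)
    return NULL;
  memcpy(dst, src, n); /* path part, copied verbatim */
  if (*q == '?') {
    for (q++;;) {
      const char *const arg = q;
      const char *eq = NULL;
      size_t arglen, keylen;

      /* the key ends at the first '='; later '=' belong to the value */
      while (*q != '\0' && *q != '&') {
        if (eq == NULL && *q == '=')
          eq = q;
        q++;
      }
      arglen = (size_t) (q - arg);
      keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
      if (!key_is_stripped(arg, keylen, strip)) {
        if (n + 1 + arglen >= dstsize)
          return NULL;
        dst[n++] = wrote ? '&' : '?';
        memcpy(dst + n, arg, arglen);
        n += arglen;
        wrote = 1;
      }
      if (*q == '\0')
        break;
      q++;
    }
  }
  dst[n] = '\0';
  return dst;
}
```

Called as strip_query_args("/p?b=2&sid=x&a=1", "sid", buf, sizeof(buf)), the sketch produces "/p?b=2&a=1"; the patch additionally sorts the remaining arguments when do_query is set, which this sketch leaves out.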
--- a/src/htsname.c
+++ b/src/htsname.c
@@ -198,6 +198,13 @@ int url_savename(lien_adrfilsave *const afs,
  // copy of fil, used for lookups (see urlhack)
  const char *normadr = adr;
  const char *normfil = fil_complete;
+  /* query keys to strip for this URL (NULL = none); decoupled from urlhack */
+  char BIGSTK stripkeys[HTS_URLMAXSIZE];
+  const char *const strip =
+      StringNotEmpty(opt->strip_query)
+          ? hts_query_strip_keys(StringBuff(opt->strip_query), adr,
+                                 fil_complete, stripkeys, sizeof(stripkeys))
+          : NULL;
  const char *const print_adr = jump_protocol_const(adr);
  const char *start_pos = NULL, *nom_pos = NULL, *dot_pos = NULL;     // Position nom et point

@@ -230,9 +237,13 @@ int url_savename(lien_adrfilsave *const afs,
  // www-42.foo.com -> foo.com
  // foo.com/bar//foobar -> foo.com/bar/foobar
  if (opt->urlhack) {
-    // copy of adr (without protocol), used for lookups (see urlhack)
-    normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
-    normfil = fil_normalized(fil_complete, normfil_);
+    // dedup-lookup key; honor the per-feature negatives like htshash.c so
+    // distinct URLs keep distinct savenames (else keep normadr = adr)
+    if (!opt->no_www_dedup)
+      normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
+    normfil =
+        fil_normalized_filtered_ex(fil_complete, normfil_, strip,
+                                   !opt->no_slash_dedup, !opt->no_query_dedup);
  } else {
    if (link_has_authority(adr_complete)) {     // https or other protocols : in "http/" subfolder
      char *pos = strchr(adr_complete, ':');
@@ -245,6 +256,9 @@ int url_savename(lien_adrfilsave *const afs,
        normadr = normadr_;
      }
    }
+    // strip still applies with urlhack off (host left untouched)
+    if (strip != NULL)
+      normfil = fil_normalized_filtered(fil_complete, normfil_, strip);
  }

  // à afficher sans ftp://
--- a/src/htsopt.h
+++ b/src/htsopt.h
@@ -529,6 +529,12 @@ struct httrackp {
  htslibhandles libHandles; /**< loaded external module handles */
  //
  htsoptstate state; /**< embedded live engine state */
+  String strip_query; /**< query keys dropped for URL dedup (--strip-query);
+                           appended at the tail to keep field offsets stable */
+  hts_boolean
+      no_www_dedup; /**< with urlhack, keep www.host distinct from host */
+  hts_boolean no_slash_dedup; /**< with urlhack, keep redundant // in paths */
+  hts_boolean no_query_dedup; /**< with urlhack, keep query-argument order */
 };

 /* Running statistics for a mirror. */
--- a/src/htsparse.c
+++ b/src/htsparse.c
@@ -3602,16 +3602,28 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
             ident_url_relatif(mov_url, urladr(), urlfil(), moved)) >= 0) {
          int set_prio_to = 0;  // pas de priotité fixéd par wizard

-          // check whether URLHack is harmless or not
-          if (opt->urlhack) {
+          // check whether URLHack is harmless or not (per the effective
+          // sub-flags)
+          if (opt->urlhack && (!opt->no_www_dedup || !opt->no_slash_dedup ||
+                               !opt->no_query_dedup)) {
+            const int norm_host = !opt->no_www_dedup;
+            const int norm_slash = !opt->no_slash_dedup;
+            const int norm_query = !opt->no_query_dedup;
            char BIGSTK n_adr[HTS_URLMAXSIZE * 2], n_fil[HTS_URLMAXSIZE * 2];
            char BIGSTK pn_adr[HTS_URLMAXSIZE * 2], pn_fil[HTS_URLMAXSIZE * 2];

-            n_adr[0] = n_fil[0] = '\0';
-            (void) adr_normalized_sized(moved->adr, n_adr, sizeof(n_adr));
-            (void) fil_normalized(moved->fil, n_fil);
-            (void) adr_normalized_sized(urladr(), pn_adr, sizeof(pn_adr));
-            (void) fil_normalized(urlfil(), pn_fil);
+            strlcpybuff(n_adr,
+                        norm_host ? jump_normalized_const(moved->adr)
+                                  : jump_identification_const(moved->adr),
+                        sizeof(n_adr));
+            strlcpybuff(pn_adr,
+                        norm_host ? jump_normalized_const(urladr())
+                                  : jump_identification_const(urladr()),
+                        sizeof(pn_adr));
+            fil_normalized_filtered_ex(moved->fil, n_fil, NULL, norm_slash,
+                                       norm_query);
+            fil_normalized_filtered_ex(urlfil(), pn_fil, NULL, norm_slash,
+                                       norm_query);
            if (strcasecmp(n_adr, pn_adr) == 0
                && strcasecmp(n_fil, pn_fil) == 0) {
              hts_log_print(opt, LOG_WARNING,
--- a/src/htsselftest.c
+++ b/src/htsselftest.c
@@ -524,6 +524,32 @@ static int st_filter(httrackp *opt, int argc, char **argv) {
  return 0;
 }

+/* Size-aware filter verdict via fa_strjoker: a negative <size> means the size
+   is still unknown (scan time), so a size rule like -*.jpg*[<10] must stay
+   neutral. */
+static int st_filtersize(httrackp *opt, int argc, char **argv) {
+  LLint sz;
+  int size_flag = 0, verdict, known;
+
+  (void) opt;
+  if (argc < 3) {
+    fprintf(stderr, "filtersize: needs <size> <string> <filter> [filter...]\n");
+    return 1;
+  }
+  known = (argv[0][0] != '-'); /* "-1"/"-" => size unknown */
+  sz = -1;
+  if (known)
+    sscanf(argv[0], LLintP, &sz);
+  verdict = fa_strjoker(0, &argv[2], argc - 2, argv[1], known ? &sz : NULL,
+                        known ? &size_flag : NULL, NULL);
+  printf("verdict=%s size_flag=%d\n",
+         verdict > 0   ? "allowed"
+         : verdict < 0 ? "forbidden"
+                       : "unknown",
+         size_flag);
+  return 0;
+}
+
 static int st_simplify(httrackp *opt, int argc, char **argv) {
  (void) opt;
  if (argc < 1) {
@@ -1026,6 +1052,173 @@ static int st_cookies(httrackp *opt, int argc, char **argv) {
  return err;
 }

+/* --strip-query: resolver + fil_normalized_filtered, end to end. */
+static int st_stripquery(httrackp *opt, int argc, char **argv) {
+  char dest[1024], keys[256], ref[1024];
+  const char *k;
+
+  (void) opt;
+  (void) argc;
+  (void) argv;
+
+  /* empty rules == plain fil_normalized */
+  assertf(hts_query_strip_keys(NULL, "h.com", "/p?a=1", keys, sizeof(keys)) ==
+          NULL);
+  assertf(hts_query_strip_keys("", "h.com", "/p?a=1", keys, sizeof(keys)) ==
+          NULL);
+  assertf(strcmp(fil_normalized_filtered("/p?b=2&a=1", dest, NULL),
+                 fil_normalized("/p?b=2&a=1", ref)) == 0);
+
+  /* bare form (*=keys): strip the key everywhere, keep+sort the rest */
+  k = hts_query_strip_keys("sid", "any.com", "/p?b=2&sid=x&a=1", keys,
+                           sizeof(keys));
+  assertf(k != NULL && strcmp(k, "sid") == 0);
+  assertf(strcmp(fil_normalized_filtered("/p?b=2&sid=x&a=1", dest, k),
+                 "/p?a=1&b=2") == 0);
+
+  /* reordered variant + an extra stripped key == the clean URL */
+  assertf(strcmp(fil_normalized_filtered("/p?sid=y&a=1&b=2", dest, "sid"),
+                 fil_normalized("/p?a=1&b=2", ref)) == 0);
+
+  /* host pattern matches only that host, incl. its www-normalized forms */
+  assertf(hts_query_strip_keys("ex.com/*=utm", "other.com", "/p?utm=1", keys,
+                               sizeof(keys)) == NULL);
+  assertf(hts_query_strip_keys("ex.com/*=utm", "ex.com", "/p?utm=1", keys,
+                               sizeof(keys)) != NULL);
+  assertf(hts_query_strip_keys("ex.com/*=utm", "www.ex.com", "/p?utm=1", keys,
+                               sizeof(keys)) != NULL);
+  assertf(hts_query_strip_keys("ex.com/*=utm", "http://www-3.ex.com",
+                               "/p?utm=1", keys, sizeof(keys)) != NULL);
+
+  /* last match wins, wholesale: host rule overrides global, no union */
+  k = hts_query_strip_keys("*=sid\nex.com/*=utm", "ex.com",
+                           "/p?sid=1&utm=2&a=3", keys, sizeof(keys));
+  assertf(k != NULL && strcmp(k, "utm") == 0);
+  assertf(strcmp(fil_normalized_filtered("/p?sid=1&utm=2&a=3", dest, k),
+                 "/p?a=3&sid=1") == 0);
+  k = hts_query_strip_keys("*=sid\nex.com/*=utm", "z.com", "/p?sid=1&a=3", keys,
+                           sizeof(keys));
+  assertf(k != NULL && strcmp(k, "sid") == 0);
+
+  /* whole-key match, not prefix: "utm" must not strip utm_source */
+  assertf(strcmp(fil_normalized_filtered("/p?utm_source=x&a=1", dest, "utm"),
+                 "/p?a=1&utm_source=x") == 0);
+
+  /* "*" drops every param; a fully-stripped single-arg query loses its '?' */
+  assertf(strcmp(fil_normalized_filtered("/p?a=1&b=2", dest, "*"), "/p") == 0);
+  assertf(strcmp(fil_normalized_filtered("/p?utm=1", dest, "utm"), "/p") == 0);
+
+  /* degenerate forms a=, b, c== (key 'c'); strip c keeps a= and b */
+  assertf(strcmp(fil_normalized_filtered("/p?a=&b&c==", dest, "c"),
+                 "/p?a=&b") == 0);
+  /* short key must not strip a longer one: 'c' must not touch 'cc' */
+  assertf(strcmp(fil_normalized_filtered("/p?cc=1&c=2", dest, "c"),
+                 "/p?cc=1") == 0);
+
+  /* repeated key: every occurrence is stripped, not just the first */
+  assertf(
+      strcmp(fil_normalized_filtered("/p?foo=42&bar=13&foo=43", dest, "foo"),
+             "/p?bar=13") == 0);
+  /* repeated key mixing missing/empty values */
+  assertf(
+      strcmp(fil_normalized_filtered("/p?foo&bar=13&foo=42&foo=", dest, "foo"),
+             "/p?bar=13") == 0);
+  /* repeated key kept (no match): all occurrences retained, then sorted */
+  assertf(strcmp(fil_normalized_filtered("/p?foo=42&bar=13&foo=43", dest, "z"),
+                 "/p?bar=13&foo=42&foo=43") == 0);
+
+  /* value containing '=': the key is only the part before the first '='. Strip
+     'foo' drops "foo=42=17" whole; the '=' in the value is not a delimiter. */
+  assertf(strcmp(fil_normalized_filtered("/p?foo=42=17&bar=", dest, "foo"),
+                 "/p?bar=") == 0);
+  /* keeping it preserves the embedded '=' verbatim */
+  assertf(strcmp(fil_normalized_filtered("/p?foo=42=17&bar=", dest, "bar"),
+                 "/p?foo=42=17") == 0);
+  /* a value segment is not a key: stripping "42" must not touch foo=42=17 */
+  assertf(strcmp(fil_normalized_filtered("/p?foo=42=17", dest, "42"),
+                 "/p?foo=42=17") == 0);
+
+  /* Idempotency: the read path re-normalizes an already-normalized fil, so the
+     result must be a fixpoint or dedup misses (catches a dropped empty/trailing
+     arg like "?&&", "a&"). */
+  {
+    static const char *const qs[] = {"/p?a=&b&c==",
+                                     "/p?a&&b",
+                                     "/p?&a",
+                                     "/p?a&",
+                                     "/p?",
+                                     "/p?=v",
+                                     "/p?&&",
+                                     "/p?b=2&a=1",
+                                     "/p?utm=x&",
+                                     "/p?&utm=x",
+                                     "/p?foo=42&bar=13&foo=43",
+                                     "/p?foo&bar=13&foo=42&foo=",
+                                     "/p?foo=42=17&bar="};
+    static const char *const strips[] = {NULL, "z", "utm", "*", "a", "foo"};
+    char once[1024], twice[1024];
+    size_t i, j;
+
+    for (i = 0; i < sizeof(qs) / sizeof(qs[0]); i++) {
+      for (j = 0; j < sizeof(strips) / sizeof(strips[0]); j++) {
+        fil_normalized_filtered(qs[i], once, strips[j]);
+        fil_normalized_filtered(once, twice, strips[j]);
+        assertf(strcmp(once, twice) == 0);
+      }
+    }
+  }
+
+  printf("strip-query self-test OK\n");
+  return 0;
+}
+
+/* -%u url-hack split (#271): each sub-flag must toggle independently. */
+static int st_urlhack(httrackp *opt, int argc, char **argv) {
+  (void) argc;
+  (void) argv;
+#define EQ(aa, fa, ab, fb) hash_url_equals(opt, aa, fa, ab, fb)
+  /* urlhack on, no opt-outs: www, // and query order all collapse */
+  opt->urlhack = 1;
+  opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = 0;
+  assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+
+  /* keep-www-prefix: host off; // and query still collapse */
+  opt->no_www_dedup = 1;
+  assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+  opt->no_www_dedup = 0;
+
+  /* keep-double-slashes: // significant; www, query order still collapse */
+  opt->no_slash_dedup = 1;
+  assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+  opt->no_slash_dedup = 0;
+
+  /* keep-query-order: query order significant; www and // still collapse */
+  opt->no_query_dedup = 1;
+  assertf(!EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+  assertf(EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  opt->no_query_dedup = 0;
+
+  /* all opt-outs == urlhack off entirely */
+  opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = 1;
+  assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+  assertf(!EQ("foo.com", "/p?b=2&a=1", "foo.com", "/p?a=1&b=2"));
+  opt->urlhack = 0;
+  opt->no_www_dedup = opt->no_slash_dedup = opt->no_query_dedup = 0;
+  assertf(!EQ("www.foo.com", "/a", "foo.com", "/a"));
+  assertf(!EQ("foo.com", "/a//b", "foo.com", "/a/b"));
+#undef EQ
+  printf("urlhack self-test OK\n");
+  return 0;
+}
+
 /* ------------------------------------------------------------ */
 /* Registry: name -> handler, with a usage hint and a one-line description. */
 /* ------------------------------------------------------------ */
@@ -1038,7 +1231,14 @@ static const struct selftest_entry {
 } selftests[] = {
    {"filter", "<pattern> <string>", "match a string against a wildcard filter",
     st_filter},
+    {"filtersize", "<size> <string> <filter>...",
+     "size-aware filter verdict (negative size = unknown/scan time)",
+     st_filtersize},
    {"simplify", "<path>", "collapse ./ and ../ in a path", st_simplify},
+    {"stripquery", "", "--strip-query pattern/key stripping self-test",
+     st_stripquery},
+    {"urlhack", "", "-%u url-hack sub-flag (www/slash/query) self-test",
+     st_urlhack},
    {"mime", "<filename>", "MIME type for a filename", st_mime},
    {"charset", "<charset> <string>",
     "convert a string to UTF-8 from a charset", st_charset},
--- a/tests/01_engine-filelist.test
+++ b/tests/01_engine-filelist.test
@@ -0,0 +1,65 @@
+#!/bin/bash
+#
+# -%L URL-list loading (#49): a readable list is honored; an unusable one fails
+# with the reason (errno / not-a-regular-file), not a bare "Could not include
+# URL list". Offline: file:// fixture, no server. Asserts on httrack's own
+# strings and the message shape, so it is locale-independent.
+
+set -euo pipefail
+
+tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_filelist.XXXXXX") || exit 1
+trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM
+
+echo '<html><body>hi</body></html>' >"$tmp/index.html"
+
+# run httrack with the given -%L target; structured log lands in $out/hts-log.txt
+run() {
+    local out="$1" list="$2"
+    rm -rf "$out"
+    mkdir -p "$out"
+    httrack -O "$out" --quiet -n "-%L" "$list" >"$out/.stdout" 2>&1 || true
+    LOG="$out/hts-log.txt"
+}
+
+fail() {
+    echo "FAIL: $1"
+    cat "$LOG"
+    exit 1
+}
+loghas() {
+    grep -Eq "$1" "$LOG" || fail "expected /$1/ in $LOG"
+}
+lognot() {
+    if grep -Eq "$1" "$LOG"; then fail "unexpected /$1/ in $LOG"; fi
+}
+
+# readable list: its one URL is loaded and counted (count must be non-zero)
+printf 'file://%s/index.html\n' "$tmp" >"$tmp/urls.txt"
+run "$tmp/ok" "$tmp/urls.txt"
+loghas '[1-9][0-9]* links added from'
+
+# missing file: quoted name + a non-empty reason, never the old reasonless
+# "Could not include URL list: <name>". The reason is the stat() errno, not the
+# directory fallback literal (guards against dropping the errno lookup).
+run "$tmp/miss" "$tmp/nope.txt"
+loghas 'Could not include URL list "[^"]+": .+'
+lognot 'Could not include URL list: '
+lognot 'not a regular file'
+
+# a directory is rejected with our own reason (locale-independent)
+mkdir -p "$tmp/adir"
+run "$tmp/dir" "$tmp/adir"
+loghas 'Could not include URL list "[^"]+": not a regular file'
+
+# unreadable regular file: the fopen() errno arm fires, distinct from the
+# directory branch. Root bypasses mode 000, so skip it there.
+if test "$(id -u)" -ne 0; then
+    : >"$tmp/noperm.txt"
+    chmod 000 "$tmp/noperm.txt"
+    run "$tmp/perm" "$tmp/noperm.txt"
+    chmod 644 "$tmp/noperm.txt"
+    loghas 'Could not include URL list "[^"]+": .+'
+    lognot 'not a regular file'
+fi
+
+exit 0
--- a/tests/01_engine-filter.test
+++ b/tests/01_engine-filter.test
@@ -71,3 +71,27 @@ nomatch '*[\[\]]' '[' # not matched, despite the docs
 match '*[\[\]]' ']'   # only via the empty class-match + trailing ']'
 match '*[\[\]]' '[]'  # one of {'[','\'} then the trailing ']'
 nomatch '*[\[\]]' '[]x'
+
+# Size-based rules (-#test=filtersize <size> <string> <filter...>): a negative size
+# means the size is still unknown (scan time). A size exclusion must stay neutral
+# then, so the file is fetched and only cancelled once its size is known (#143).
+fsize() {
+    local want="$1"
+    shift
+    test "$(httrack -O /dev/null -#test=filtersize "$@")" == "$want" || exit 1
+}
+fsize 'verdict=allowed size_flag=0' -1 foo.jpg -* '+*.jpg' '-*.jpg*[<10]'   # scan time: keep
+fsize 'verdict=forbidden size_flag=1' 5 foo.jpg -* '+*.jpg' '-*.jpg*[<10]'  # <10KB: cancel
+fsize 'verdict=allowed size_flag=1' 20 foo.jpg -* '+*.jpg' '-*.jpg*[<10]'   # >=10KB: keep
+fsize 'verdict=forbidden size_flag=0' -1 foo.txt -* '+*.jpg' '-*.jpg*[<10]' # not a jpg
+# the '>' operator is just as neutral at scan time, and fires once size is known
+fsize 'verdict=allowed size_flag=0' -1 foo.jpg -* '+*.jpg' '-*.jpg*[>10]'   # scan time: keep
+fsize 'verdict=forbidden size_flag=1' 20 foo.jpg -* '+*.jpg' '-*.jpg*[>10]' # >10KB: cancel
+
+# [name]/[file]/[path] never span '?' mid-string; a trailing query is still
+# tolerated by the global '?' rule (same as plain *.aspx), not the class (#144).
+nomatch '*[path]/end' 'a?b/end'
+nomatch '*[file]end' 'foo?xend'
+nomatch '*[name]X' 'abc?X'
+match '*[file]' 'foo?x=1' # trailing query: tolerated, as for *.aspx
+match '*.aspx' 'page.aspx?y=2'
--- a/tests/01_engine-stripquery.test
+++ b/tests/01_engine-stripquery.test
@@ -0,0 +1,8 @@
+#!/bin/bash
+#
+
+set -euo pipefail
+
+# --strip-query: pattern-scoped query-key stripping for dedup. All assertions
+# live in the engine self-test (hts_query_strip_keys + fil_normalized_filtered).
+httrack -O /dev/null -#test=stripquery | grep -q "strip-query self-test OK"
--- a/tests/01_engine-urlhack.test
+++ b/tests/01_engine-urlhack.test
@@ -0,0 +1,8 @@
+#!/bin/bash
+#
+
+set -euo pipefail
+
+# -%u url-hack split (#271): www / // / query-order dedup toggle independently.
+# All assertions live in the engine self-test (hash compare flag resolution).
+httrack -O /dev/null -#test=urlhack run | grep -q "urlhack self-test OK"
--- a/tests/24_local-resume-overlap.test
+++ b/tests/24_local-resume-overlap.test
@@ -0,0 +1,109 @@
+#!/bin/bash
+# Issue #198: on a resumed download the server may answer the Range with a 206
+# that starts *before* the offset we asked for (block-aligned ranges). httrack
+# must honor the returned Content-Range, not blindly append, or the overlap
+# bytes get duplicated and the file grows (corrupt PDFs). Pass 1 interrupts
+# flaky.bin mid-body (partial + temp-ref); pass 2 resumes against a 206 that
+# backs up 8 bytes. The result must equal the same bytes fetched whole (full.bin).
+set -eu
+
+: "${top_srcdir:=..}"
+testdir=$(cd "$(dirname "$0")" && pwd)
+server="${testdir}/local-server.py"
+
+command -v python3 >/dev/null || ! echo "python3 not found; skipping" || exit 77
+
+tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/httrack_198.XXXXXX") || exit 1
+serverpid=
+crawlpid=
+cleanup() {
+    if test -n "$crawlpid"; then kill -9 "$crawlpid" 2>/dev/null || true; fi
+    if test -n "$serverpid"; then
+        kill "$serverpid" 2>/dev/null || true
+        wait "$serverpid" 2>/dev/null || true
+    fi
+    rm -rf "$tmpdir"
+}
+trap cleanup EXIT HUP INT QUIT PIPE TERM
+
+# OVERLAP_COUNTER gets a byte per flaky.bin request so pass 1 knows when to interrupt.
+serverlog="${tmpdir}/server.log"
+counter="${tmpdir}/hits"
+resumed="${tmpdir}/resumed" # gets a byte when the server serves a resume 206
+OVERLAP_COUNTER="$counter" OVERLAP_RESUMED="$resumed" \
+    python3 "$server" --root "${testdir}/server-root" \
+    >"$serverlog" 2>&1 &
+serverpid=$!
+port=
+for _ in $(seq 1 50); do
+    line=$(head -n1 "$serverlog" 2>/dev/null)
+    if test "${line%% *}" == "PORT"; then
+        port="${line#PORT }"
+        break
+    fi
+    kill -0 "$serverpid" 2>/dev/null || {
+        echo "server exited early: $(cat "$serverlog")"
+        exit 1
+    }
+    sleep 0.1
+done
+test -n "$port" || {
+    echo "could not discover server port"
+    exit 1
+}
+base="http://127.0.0.1:${port}"
+
+which httrack >/dev/null || {
+    echo "could not find httrack"
+    exit 1
+}
+out="${tmpdir}/crawl"
+common=(-O "$out" --quiet --disable-security-limits --robots=0 --timeout=30 --retries=0 -c1)
+refdir="${out}/hts-cache/ref"
+
+# pass 1: interrupt once flaky.bin's prefix is streaming (partial + temp-ref).
+printf '[pass 1: interrupt flaky.bin] ..\t'
+httrack "${common[@]}" "${base}/overlap/index.html" >"${tmpdir}/log1" 2>&1 &
+crawlpid=$!
+for _ in $(seq 1 300); do
+    test -s "$counter" && break
+    kill -0 "$crawlpid" 2>/dev/null || break
+    sleep 0.1
+done
+sleep 0.5
+kill -TERM "$crawlpid" 2>/dev/null || true
+wait "$crawlpid" 2>/dev/null || true
+crawlpid=
+test -n "$(find "$refdir" -name '*.ref' 2>/dev/null)" || {
+    echo "FAIL: no temp-ref survived pass 1; cannot drive the resume"
+    exit 1
+}
+echo "OK (temp-ref present)"
+
+# pass 2: --continue -> resume Range -> 206 that starts 8 bytes early.
+printf '[pass 2: resume flaky.bin] ..\t'
+httrack "${common[@]}" --continue "${base}/overlap/index.html" >"${tmpdir}/log2" 2>&1 || true
+echo "OK"
+
+# Guard against a silent full re-download: the byte-compare below only tests the
+# fix if pass 2 actually went through the resume Range -> 206 path.
+printf '[resume path was exercised] ..\t'
+if ! test -s "$resumed"; then
+    echo "FAIL: pass 2 never triggered a resume 206; the overlap fix was not exercised"
+    exit 1
+fi
+echo "OK"
+
+printf '[resumed file is not corrupted] ..\t'
+dir=$(find "$out" -maxdepth 1 -type d -name '127.0.0.1*' | head -1)
+flaky="${dir}/overlap/flaky.bin"
+full="${dir}/overlap/full.bin"
+if ! test -f "$flaky" || ! test -f "$full"; then
+    echo "FAIL: flaky.bin or full.bin missing after pass 2"
+    exit 1
+fi
+if ! cmp -s "$flaky" "$full"; then
+    echo "FAIL: resumed flaky.bin ($(wc -c <"$flaky")) != full.bin ($(wc -c <"$full")); overlap duplicated"
+    exit 1
+fi
+echo "OK ($(wc -c <"$flaky") bytes, byte-identical)"
--- a/tests/25_local-mime-exclude.test
+++ b/tests/25_local-mime-exclude.test
@@ -0,0 +1,16 @@
+#!/bin/bash
+#
+# A -mime: exclusion must abort the transfer on the response Content-Type, not
+# fetch the whole 1 MB body then discard it (#58). The bytes-received guard is
+# the real one: the file is absent either way, but only the fix keeps the count
+# tiny (header only) instead of pulling the body. Match it positively (a small,
+# <=4-digit count) so a vanished/reworded summary line fails rather than passes.
+
+: "${top_srcdir:=..}"
+
+bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
+    --found 'mimex/real.html' \
+    --not-found 'mimex/blob.pdf' \
+    --log-found 'excluded by MIME type filter' \
+    --log-found '\[[0-9]{1,4} bytes received' \
+    httrack 'BASEURL/mimex/index.html' '-mime:application/pdf'
--- a/tests/26_local-strip-query.test
+++ b/tests/26_local-strip-query.test
@@ -0,0 +1,18 @@
+#!/bin/bash
+#
+# End-to-end --strip-query (#112): two links to one resource differing only by
+# ?utm_source dedup to a single saved file (2 files written: index + resource);
+# the control crawl without the option keeps both variants (3 files). Locks the
+# CLI->opt->hash plumbing the engine self-test can't reach.
+
+set -e
+
+: "${top_srcdir:=..}"
+
+# stripped: the two ?utm_source variants collapse to one resource
+bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
+    httrack 'BASEURL/stripquery/index.html' --strip-query 'utm_source'
+
+# control: no stripping -> both query-named variants are saved
+bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
+    httrack 'BASEURL/stripquery/index.html'
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -5,6 +5,7 @@ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
 	proxy-https-server.py \
 	local-crawl.sh local-server.py server.crt server.key \
 	server-root/simple/basic.html server-root/simple/link.html \
+	server-root/stripquery/index.html server-root/stripquery/a.html \
 	fixtures/cache-golden/hts-cache/new.zip

 TESTS_ENVIRONMENT =
@@ -34,6 +35,7 @@ TESTS = \
 	01_engine-dns.test \
 	01_engine-doitlog.test \
 	01_engine-entities.test \
+	01_engine-filelist.test \
 	01_engine-filter.test \
 	01_engine-hashtable.test \
 	01_engine-idna.test \
@@ -44,7 +46,9 @@ TESTS = \
 	01_engine-savename.test \
 	01_engine-selftest-dispatch.test \
 	01_engine-simplify.test \
+	01_engine-stripquery.test \
 	01_engine-strsafe.test \
+	01_engine-urlhack.test \
 	02_manpage-regen.test \
 	02_update-cache.test \
 	10_crawl-simple.test \
@@ -65,6 +69,9 @@ TESTS = \
 	20_local-resume-loop.test \
 	21_local-intl-update.test \
 	22_local-broken-size.test \
-	23_local-errpage.test
+	23_local-errpage.test \
+	24_local-resume-overlap.test \
+	25_local-mime-exclude.test \
+	26_local-strip-query.test

 CLEANFILES = check-network_sh.cache
--- a/tests/local-server.py
+++ b/tests/local-server.py
@@ -177,6 +177,24 @@ class Handler(SimpleHTTPRequestHandler):
        body, ctype = self.TYPE_MATRIX[path]
        self.send_raw(body, ctype)

+    # --- MIME-type exclusion abort (issue #58) -----------------------------
+    # A -mime:application/pdf filter must abort the transfer once the header
+    # arrives, not download the whole body and discard it.
+    def route_mimex_index(self):
+        self.send_html(
+            '\t<a href="blob.pdf">pdf</a>\n' '\t<a href="real.html">real</a>\n'
+        )
+
+    # 1 MB body: the fix aborts after the header, so httrack's "bytes received"
+    # stays tiny; without it the engine reads the body and the count jumps.
+    MIMEX_BLOB = b"%PDF-1.4\n" + b"\x00" * (1024 * 1024)
+
+    def route_mimex_blob(self):
+        self.send_raw(self.MIMEX_BLOB, "application/pdf")
+
+    def route_mimex_real(self):
+        self.send_raw(b"<html><body>real</body></html>", "text/html")
+
    # --- special chars in URLs across an update (issue #157) ---------------
    # A dotless, accented basename served as text/html (MediaWiki style). The
    # name the first crawl picks (.html) must survive the update pass.
@@ -225,6 +243,71 @@ class Handler(SimpleHTTPRequestHandler):
        self.send_header("Content-Length", "0")
        self.end_headers()

+    # 206 resume must honor the server's Content-Range, not the offset we asked
+    # for (#198): a server resuming a few bytes *before* the request must not
+    # leave httrack duplicating the overlap onto the partial. flaky.bin
+    # interrupts once then resumes OVERLAP_EARLY bytes early; full.bin serves
+    # the identical bytes in one shot, so the test can compare the two.
+    OVERLAP_BLOB = b"%PDF-1.4\n" + bytes((i * 37 + 11) % 256 for i in range(8000))
+    OVERLAP_EARLY = 8
+    OVERLAP_PREFIX_LEN = 4000  # flushed before the stall
+    _overlap_started = False
+
+    def route_overlap_index(self):
+        self.send_html('\t<a href="flaky.bin">flaky</a>\n\t<a href="full.bin">full</a>')
+
+    def route_overlap_full(self):
+        self.send_raw(self.OVERLAP_BLOB, "application/octet-stream")
+
+    def route_overlap(self):
+        counter = os.environ.get("OVERLAP_COUNTER")
+        if counter:
+            with open(counter, "a") as fp:
+                fp.write("x")
+        blob = self.OVERLAP_BLOB
+        rng = self.headers.get("Range")
+        # First GET: stream a prefix then stall, so the crawl can be interrupted
+        # mid-body (partial + temp-ref on disk).
+        if rng is None and not Handler._overlap_started:
+            Handler._overlap_started = True
+            self.send_response(200)
+            self.send_header("Content-Type", "application/octet-stream")
+            self.send_header("Content-Length", str(len(blob)))
+            self.send_header("Accept-Ranges", "bytes")
+            self.end_headers()
+            if self.command != "HEAD":
+                self.wfile.write(blob[: self.OVERLAP_PREFIX_LEN])
+                self.wfile.flush()
+                try:
+                    while True:
+                        time.sleep(3600)
+                except OSError:
+                    pass
+            return
+        if rng is None:  # no resume request: serve the whole file
+            return self.route_overlap_full()
+        # Resume: honor the Range, but back up OVERLAP_EARLY bytes.
+        start = (
+            int(rng[len("bytes=") :].split("-")[0]) if rng.startswith("bytes=") else 0
+        )
+        start = max(0, start - self.OVERLAP_EARLY)
+        # Signal that the resume Range -> 206 path actually fired, so the test
+        # can prove it was exercised (not a silent full re-download).
+        resumed = os.environ.get("OVERLAP_RESUMED")
+        if resumed:
+            with open(resumed, "a") as fp:
+                fp.write("x")
+        part = blob[start:]
+        self.send_response(206, "Partial Content")
+        self.send_header("Content-Type", "application/octet-stream")
+        self.send_header("Content-Length", str(len(part)))
+        self.send_header(
+            "Content-Range", "bytes %d-%d/%d" % (start, len(blob) - 1, len(blob))
+        )
+        self.end_headers()
+        if self.command != "HEAD":
+            self.wfile.write(part)
+
    # error pages / 0-byte files (#17): -o0 ("no error pages") must keep 4xx/5xx
    # bodies off disk; a genuine 0-byte 200 is a valid file and stays.
    def route_errpage_index(self):
@@ -281,12 +364,18 @@ class Handler(SimpleHTTPRequestHandler):
        "/intl/" + INTL_NAME: route_intl_page,
        "/resume/index.html": route_resume_index,
        "/resume/blob.txt": route_resume,
+        "/overlap/index.html": route_overlap_index,
+        "/overlap/flaky.bin": route_overlap,
+        "/overlap/full.bin": route_overlap_full,
        "/size/index.html": route_size_index,
        "/size/oversize.bin": route_size_oversize,
        "/errpage/index.html": route_errpage_index,
        "/errpage/good.html": route_errpage_good,
        "/errpage/missing.html": route_errpage_missing,
        "/errpage/empty.html": route_errpage_empty,
+        "/mimex/index.html": route_mimex_index,
+        "/mimex/blob.pdf": route_mimex_blob,
+        "/mimex/real.html": route_mimex_real,
    }

    # --- dispatch ----------------------------------------------------------
--- a/tests/server-root/stripquery/a.html
+++ b/tests/server-root/stripquery/a.html
@@ -0,0 +1 @@
+<html><body>resource A</body></html>
--- a/tests/server-root/stripquery/index.html
+++ b/tests/server-root/stripquery/index.html
@@ -0,0 +1,5 @@
+<html><body>
+Two links to one resource, differing only by a tracking parameter.
+<a href="a.html?utm_source=x">x</a>
+<a href="a.html?utm_source=y">y</a>
+</body></html>
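
The stripquery fixture and 26_local-strip-query.test above rely on two links collapsing to one saved resource once `utm_source` is dropped from the dedup key. The following is only a minimal sketch of that idea, written with Python's `urllib.parse` rather than httrack's own `fil_normalized` code path; `dedup_key` and its `strip_keys` parameter are hypothetical names introduced for illustration, and the host in the sample URLs is a placeholder for the test server's `BASEURL`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def dedup_key(url, strip_keys=()):
    """Hypothetical dedup key: drop the listed query keys, sort the rest."""
    parts = urlsplit(url)
    # keep_blank_values=True so empty fields such as "a=" are not silently lost.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((k, v) for k, v in pairs if k not in strip_keys)
    query = urlencode(kept)
    # The scheme is left out of the key; host and path are kept as given.
    key = parts.netloc + parts.path
    return key + "?" + query if query else key


links = [
    "http://127.0.0.1/stripquery/a.html?utm_source=x",
    "http://127.0.0.1/stripquery/a.html?utm_source=y",
]
stripped = {dedup_key(u, strip_keys={"utm_source"}) for u in links}
control = {dedup_key(u) for u in links}
print(len(stripped), len(control))  # prints "1 2"
```

With the key stripped, both links map to one entry (index plus one resource, the `--files 2` case); without it they stay distinct (the `--files 3` control). Only the key changes here; the URL that would be fetched is left untouched.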
Author	SHA1	Message	Date
Xavier Roche	8ebfcbe416	Split -%u URL Hacks into independent www/slash/query toggles (#271) -%u (--urlhack) bundled three dedup normalizations under one switch: www.host == host, redundant // collapse, and query-argument reordering. A mirror that needed one but not another (e.g. keep www. distinct) had to turn the whole umbrella off. Add three opt-out sub-options, defaulting to the umbrella so existing -%u/-%u0 behavior is unchanged: --keep-www-prefix: keep www.foo.com distinct from foo.com (-%j); --keep-double-slashes: keep redundant // in the path (-%o); --keep-query-order: keep query-argument order significant (-%y). The split is resolved once in hash_init() into norm_host/norm_slash/norm_query and threaded through the dedup hash (htshash.c), the savename lookup key (htsname.c) and the redirect-loop diagnostic (htsparse.c) so all three stay consistent. fil_normalized() gains an internal fil_normalized_ex(do_slash, do_query) core; the public fil_normalized()/fil_normalized_filtered() keep their signatures. http/https are always merged in the dedup key (the scheme is stripped regardless of -%u), so that part of the request needs no toggle. The opt-outs are spelled positively (--keep-*) because httrack's generic --no<opt> prefix only appends the disabling "0" for parametered options, not "single" booleans, so --nowww-dedup would silently no-op. opt grows three hts_boolean fields appended at the struct tail (offsets stable, no soname bump, matching the strip_query addition in #112). Tested by a -#test=urlhack engine self-test (hash_url_equals over each flag combination) driven by 01_engine-urlhack.test. Closes #271 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>	2026-06-27 13:36:38 +02:00
Xavier Roche	40a66600ff	Add --strip-query to drop query keys from dedup naming (#112) (#434) Two URLs that differ only in tracking or session query parameters (?utm_source=x versus ?utm_source=y) were saved as separate files, and a single CGI could fan out into thousands of near-duplicate pages. fil_normalized already sorted query args, so reordered parameters dedup, but there was no way to drop a named key. --strip-query "[host/pattern=]key1,key2,..." (repeatable) removes the listed keys when computing the dedup key and the saved name. The fetched URL is untouched, so a required sid= is still sent on the wire; only the local namespace collapses. Patterns match the normalized host/path with the +/- filter glob (strjoker), last match wins as in the filter list, and stripping is decoupled from urlhack (-%u) so it never silently no-ops with -%u0. It all funnels through one chokepoint, fil_normalized: an internal fil_normalized_filtered() strips then delegates, and hts_query_strip_keys resolves the per-URL key list. The strip pass walks every query field, including empty and trailing ones, so its output is a fixpoint under the read path's second normalization (otherwise dedup silently misses). Exported ABI is unchanged; the strip_query field is appended at the tail of httrackp. Covered by a -#test=stripquery self-test (degenerate queries like a=&b&c== and a 50-case idempotency fixpoint) and an end-to-end dedup crawl test. Closes #112 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 11:13:16 +02:00
Xavier Roche	768756e231	ci: add a MemorySanitizer job for the offline engine self-tests (#433) MSan is the only sanitizer that catches a read of uninitialized memory -- the class of #143, where the size filter tested an uninitialized stack LLint and forbade files at random. ASan and UBSan let that through. MSan reports any byte produced by an uninstrumented library as uninitialized, so the job stays inside our own code: clang, a static link (the MSan runtime is not injected into shared objects), --disable-https to drop openssl, and only the offline 01_engine-* self-tests minus the zlib-backed cache trio. Those self-tests drive the hostile-input parsers (charset, mime, html, entities, idna, filters) straight through MSan. Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 08:40:22 +02:00
Xavier Roche	b138c87a93	filtersize self-test: parse the size with sscanf(LLintP), and lock the '>' operator (#432) Use the portable sscanf(argv[0], LLintP, &sz) idiom the rest of the tree uses to read an LLint, instead of strtoll: LLint is not always long long (MSVC __int64, plus fallbacks) and strtoll is absent on old MSVC. Add two cases so the size-rule scan-time neutrality is pinned for the '>' operator too, not only '<': -.jpg[>10] stays neutral at scan time and cancels once the size is known. Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 22:01:14 +02:00
Xavier Roche	3de47433b7	Keep size-based filter rules neutral until the file size is known (#143) (#431) A rule such as -.jpg[<10] is meant to fetch every JPG, then delete the ones under 10KB once their size is known. Instead it could forbid all of them up front: at scan time the wizard calls fa_strjoker with no size, but fa_strjoker always handed strjoker the address of an uninitialized local sz, so the [<10] predicate ran against stack garbage. When that garbage fell in [0,10) the rule "matched" and the link was dropped before it was ever downloaded ("(wizard) explicit forbidden (-.jpg[<10])"). Pass no size pointer when the size is unknown, routing into strjoker's existing "test impossible -> no match" path so size rules stay neutral at scan time and only fire once the real size is in. The size-known path is unchanged. Add a filtersize engine self-test that drives fa_strjoker through both phases and a tests/01_engine-filter.test block locking the scenario. Also lock #144: the [name]/[file]/[path] classes do not span '?'; a trailing query is tolerated by the same global rule that lets *.aspx match page.aspx?y=2, not by the class. Working as intended. Closes #143 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 21:21:54 +02:00
Xavier Roche	fb8827718e	htscore: report why a -%L URL list could not be loaded (#49) (#430) A missing, unreadable, or non-regular -%L file all collapsed into one reasonless "Could not include URL list: <name>", which is what left the #49 reporter unable to tell why the list was rejected. Open and stat() the file explicitly so the log carries the cause: the errno text (no such file, permission denied), "not a regular file", or "file too large". The loader keeps the original regular-file guard, so it still won't open a directory or FIFO. Covered by an offline file:// test: a readable list loads with a non-zero count, while a missing file, an unreadable file, and a directory each fail with a distinct reason instead of the bare message. Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 20:49:20 +02:00
Xavier Roche	7228210061	Abort the download when the response MIME type is excluded by -mime: (#58) (#429) A -mime: exclusion only took effect after the full body had been downloaded and then discarded (leaving a .delayed temp behind), wasting bandwidth. Honor it as soon as the response Content-Type arrives: back_wait now aborts the transfer before the body when hts_acceptmime forbids the declared type, finishing the slot with a new STATUSCODE_EXCLUDED clean-skip status rather than fetching and dropping. Covers the reported case (an HTML-looking URL served as application/pdf past a +*.html include) and any -mime: match regardless of extension. Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 20:10:37 +02:00
Xavier Roche	38882c0aee	Honor the server's Content-Range when resuming a partial download (#198) (#428) * Honor the server's Content-Range when resuming a partial download (#198) A resumed download (Range: bytes=N-) may be answered with a 206 whose range starts before N: block-aligned caches and CDNs routinely round the start down to a block boundary, and RFC 7233 lets the server pick the range it returns. httrack ignored the returned Content-Range and blindly appended the 206 body to the bytes already on disk, so the overlapping bytes were duplicated and the file grew by the overlap. With timing deciding which files get interrupted (and thus resumed), this surfaced as a random subset of files corrupted on each run, each a few bytes too large. Resume at the server's crange_start instead: ftruncate the partial to that offset and write the 206 body there (the in-memory branch keeps only that prefix). When the returned range is unusable (a forward gap, no/garbage Content-Range, or one that doesn't reach EOF) drop the partial and refetch the whole file rather than stitch a corrupt one. Reading the existing crange_start/crange_end/crange fields only, no ABI change. Driven by tests/24_local-resume-overlap.test: pass 1 interrupts a download mid-body, pass 2 resumes against a 206 that backs up 8 bytes, and the result must be byte-identical to the same content fetched whole. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> * Harden #198 fix: verify the truncate, assert the test hit the resume path Two follow-ups from review of the resume fix. If HTS_FTRUNCATE fails the partial could keep a stale tail (only when the resource shrank between runs, sz > full, so the body write no longer covers the old end). Check its return and, on failure, drop the partial and refetch the whole file instead of writing a possibly-corrupt one. The resume test only compared the resumed bytes against the whole file, which also passes if httrack silently re-downloads the file with no Range (the bug never fires). Mark when the server actually serves a resume 206 and assert pass 2 hit that path, so a full re-download fails the test instead of passing it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> * tests: run 24_local-resume-overlap under set -e Follow the golden rule for shell scripts: start with set -e so a non-last failure can't be masked. Guard the backgrounded-crawl kill/wait spots with || true so the expected SIGTERM exit doesn't abort the run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> --------- Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 17:42:26 +02:00
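
The resume rule described in the #198 commit message above (write the 206 body at the server's range start, and fall back to a full refetch when the range is unusable) can be sketched as follows. This is an illustrative Python model, not the C change in httrack: `apply_resume` is a hypothetical helper, the regular expression accepts only the simple `bytes start-end/total` form, and the sample data mirrors the 8000-byte blob, 4000-byte prefix and 8-byte back-up used by tests/local-server.py.

```python
import re

# Only the plain "bytes <start>-<end>/<total>" form is accepted in this sketch.
_CRANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+)$")


def apply_resume(partial, content_range, body):
    """Merge a 206 body onto a partial download.

    Returns the merged bytes, or None when the partial should be dropped
    and the whole file refetched.
    """
    m = _CRANGE.match(content_range or "")
    if m is None:
        return None  # missing or garbage Content-Range
    start, end, total = (int(g) for g in m.groups())
    if start > len(partial):
        return None  # forward gap: bytes between the partial and the range are missing
    if end != total - 1:
        return None  # the returned range does not reach EOF
    # Truncate the partial to the server's start, then write the body there.
    return partial[:start] + body


blob = bytes((i * 37 + 11) % 256 for i in range(8000))
partial = blob[:4000]   # what survived the interrupted first pass
start = 4000 - 8        # the server backs up 8 bytes
header = "bytes %d-%d/%d" % (start, len(blob) - 1, len(blob))

merged = apply_resume(partial, header, blob[start:])
naive = partial + blob[start:]  # blind append, the pre-fix behavior
print(merged == blob, len(naive) - len(blob))  # prints "True 8"
```

The blind append ends up 8 bytes too long, which is the overlap the byte-compare in 24_local-resume-overlap.test is meant to catch.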