ci: add a no-python3 job reproducing the Debian buildd chroot

GitHub runners ship python3, so every make-check job exercised the python3-present path and the local-server tests never skipped. The Debian buildds build in a minimal chroot with no python3, where those tests must SKIP (exit 77) -- and 28_local-pause failed there instead, FTBFS on every arch for 3.49.10-1, invisible to CI. Add a job that builds, removes python3, and runs make check, so the skip path is exercised on every PR. Verified to fail on the pre-fix tree and pass after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
tests: skip 28_local-pause when python3 is absent (Debian buildd)
2026-06-29 21:45:24 +03:00 · 2026-06-28 20:02:47 +02:00 · 2026-06-28 19:49:47 +02:00 · 2026-06-28 15:29:03 +02:00 · 2026-06-28 14:26:50 +02:00 · 2026-06-28 13:52:21 +02:00
23 changed files with 12504 additions and 1588 deletions
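
The failure mode behind the two test commits above can be sketched outside the test suite. The real tests are shell scripts; the Python below only mirrors their control flow, and `helper_status`, `unguarded_test` and `guarded_test` are hypothetical names, not part of HTTrack. A child process exits 77 (the status the Automake harness reports as SKIP) when its tool is missing. A caller that discards that status goes on to fail its own assertion, while a caller that checks for the tool up front skips cleanly.

```python
import subprocess
import sys

SKIP = 77  # exit status the Automake test harness treats as SKIP


def helper_status(tool_found: bool) -> int:
    """Run a child that exits 77 when its tool is missing, 0 otherwise."""
    code = 0 if tool_found else SKIP
    child = [sys.executable, "-c", f"import sys; sys.exit({code})"]
    done = subprocess.run(child, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    return done.returncode


def unguarded_test(tool_found: bool) -> int:
    """Discard the helper's status, then judge the run on timing alone."""
    helper_status(tool_found)  # the 77 is dropped here
    # Assumption for this sketch: the pause is only observed with a server.
    pause_observed = tool_found
    return 0 if pause_observed else 1


def guarded_test(tool_found: bool) -> int:
    """Check for the required tool first and skip before running anything."""
    if not tool_found:
        return SKIP
    return unguarded_test(tool_found)


if __name__ == "__main__":
    print("unguarded, tool missing:", unguarded_test(False))
    print("guarded, tool missing:", guarded_test(False))
```

The guard sits at the top of the test rather than inside the helper call because the caller is the only place where the skip status can still reach the harness.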
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -61,6 +61,50 @@ jobs:
        if: failure()
        run: cat tests/test-suite.log 2>/dev/null || true

+  # Reproduce the Debian buildds: they build in a minimal chroot with no
+  # python3, so the local-server tests must SKIP (exit 77), not fail. GitHub
+  # runners ship python3, so every other job hides this path; here we remove it
+  # before `make check`. This is the guard that would have caught the 3.49.10-1
+  # FTBFS (28_local-pause failed instead of skipping when python3 was absent).
+  buildd-no-python3:
+    name: build (no python3, Debian buildd)
+    runs-on: ubuntu-24.04
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          submodules: recursive
+
+      - name: Install build dependencies
+        run: |
+          set -euo pipefail
+          sudo apt-get update
+          sudo apt-get install -y --no-install-recommends \
+            build-essential autoconf automake libtool autoconf-archive \
+            zlib1g-dev libssl-dev
+
+      - name: Configure
+        run: |
+          set -euo pipefail
+          autoreconf -fi
+          ./configure
+
+      - name: Build
+        run: make -j"$(nproc)"
+
+      - name: Test without python3
+        run: |
+          set -euo pipefail
+          # Hide every python3* so `command -v python3` fails like it does in the
+          # buildd chroot; masking with /bin/false would still resolve.
+          sudo find /usr/bin /usr/local/bin -maxdepth 1 -name 'python3*' \
+            -exec mv {} {}.hidden \;
+          if command -v python3; then echo "python3 is still on PATH" >&2; exit 1; fi
+          make check
+
+      - name: Print the test log on failure
+        if: failure()
+        run: cat tests/test-suite.log 2>/dev/null || true
+
  # Portability: build and test on macOS (Darwin/clang) on a native runner --
  # no VM. The tree has no __APPLE__ branches, so Darwin exercises the
  # generic-Unix path on a second libc and kernel. brew's openssl@3 is keg-only,
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -39,6 +39,10 @@ Welcome, and nothing to disclose. Two rules:

 The sign-off covers AI-assisted code too.

+## Translations
+
+Interface strings live in [`lang/`](lang/). See [lang/README.md](lang/README.md) for the file format and how to add or update a language.
+
 ## Bugs

 Open an issue with the version, OS, command used, and expected vs actual result.
--- a/configure.ac
+++ b/configure.ac
@@ -1,6 +1,6 @@
 AC_PREREQ([2.71])

-AC_INIT([httrack], [3.49.9], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
+AC_INIT([httrack], [3.49.10], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
 AC_COPYRIGHT([
 HTTrack Website Copier, Offline Browser for Windows and Unix
 Copyright (C) 1998-2015 Xavier Roche and other contributors
@@ -29,10 +29,10 @@ AC_CONFIG_SRCDIR(src/httrack.c)
 AC_CONFIG_MACRO_DIR([m4])
 AC_CONFIG_HEADERS(config.h)
 AM_INIT_AUTOMAKE([subdir-objects])
-# 3:1:0: 3.49.9 changed code but not the exported interface vs 3.49.8 (same 164
-# symbols, no struct-layout change), so bump revision only. (3:0:0 was the htsblk
-# mime-buffer widening, an ABI break that moved the soname .so.2 -> .so.3.)
-VERSION_INFO="3:1:0"
+# 3:2:0: 3.49.10 only appends tail fields to the options struct (no existing
+# symbol or offset changed vs 3.49.9), so it stays soname .so.3; bump revision.
+# (3:0:0 was the htsblk mime-buffer widening, the ABI break that moved .so.2 -> .so.3.)
+VERSION_INFO="3:2:0"
 AM_MAINTAINER_MODE
 AC_USE_SYSTEM_EXTENSIONS

--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,16 @@
+httrack (3.49.10-1) unstable; urgency=medium
+
+  * New upstream release: new download-pacing and URL-handling options plus a
+    batch of crawl and robustness fixes (full list in history.txt).
+  * Rewrite debian/copyright in machine-readable DEP-5 format, crediting the
+    bundled minizip, md5 and coucal sources (#415).
+  * Lead the webhttrack browser dependency with chromium so httrack is not
+    dragged into the firefox-esr autoremoval cascade (#436).
+  * Override the embedded-library lint for the bundled minizip (#419).
+  * Bump Standards-Version to 4.7.4 (no changes required).
+
+ -- Xavier Roche <xavier@debian.org>  Sun, 28 Jun 2026 14:01:53 +0200
+
 httrack (3.49.9-1) unstable; urgency=medium

  * New upstream release: Content-Type and file-type detection fixes (trust a
--- a/debian/control
+++ b/debian/control
@@ -2,7 +2,7 @@ Source: httrack
 Section: web
 Priority: optional
 Maintainer: Xavier Roche <roche@httrack.com>
-Standards-Version: 4.7.0
+Standards-Version: 4.7.4
 Build-Depends: debhelper-compat (= 13), autoconf, autoconf-archive, automake, libtool, zlib1g-dev, libssl-dev
 Rules-Requires-Root: no
 Homepage: http://www.httrack.com
--- a/history.txt
+++ b/history.txt
@@ -4,7 +4,25 @@ HTTrack Website Copier release history:

 This file lists all changes and fixes that have been made for HTTrack

-3.49-9
+3.49-10
+ New: --cookies-file to preload a Netscape cookies.txt before crawling (#215)
+ New: --pause to space out file downloads by a random delay (#185)
+ New: --strip-query to drop selected query keys from the dedup naming (#112)
+ Changed: split the -%u URL hacks into independent --keep-www-prefix, --keep-double-slashes and --keep-query-order toggles (#271)
+ Fixed: follow a redirect Location after dropping its #fragment, instead of requesting the fragment and polluting the saved name (#204)
+ Fixed: escaped brackets inside a *[...] filter character class (#148)
+ Fixed: honor the server's Content-Range when resuming a partial download, instead of appending overlapping bytes (#198)
+ Fixed: abort the download as soon as the response type is excluded by -mime:, instead of fetching then discarding the body (#58)
+ Fixed: keep size-based filter rules neutral until the file size is known (#143)
+ Fixed: stop the mirror with a clean fatal error on a cache write failure, instead of crashing (#174, #219)
+ Fixed: stop the 412/416 partial re-get loop on --continue and --update (#206)
+ Fixed: keep an unrecognized URL tail instead of mangling it to .html (#115)
+ Fixed: honor --tolerant (-%B) on a broken Content-Length, and fix an out-of-bounds read it exposed (#32, #41)
+ Fixed: fall back to the next resolved address when a connection fails or stalls, instead of hanging on a dead IPv6 address
+ Fixed: report why a -%L URL list could not be loaded (#49)
+ Changed: multiple internal hardening, build and CI improvements
+
+3.49-9
 + Fixed: file-type detection from the Content-Type header: trust a declared type over a binary URL extension, honor --assume under the delayed type check, and keep a known extension against a bogus or empty Content-Type (#267, #29, #56)
 + Fixed: an uninitialized-buffer read when the Content-Type is empty (#411)
 + Fixed: restored C++ source-compatibility of the installed headers so reverse dependencies (httraqt) build again (#413)
--- a/html/filters.html
+++ b/html/filters.html
@@ -247,7 +247,7 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
        <td>the \ character</td>
      </tr>
      <tr>
-        <td nowrap><tt>*[\[\]]</tt></td>
+        <td nowrap><tt>*[\[,\]]</tt></td>
        <td>the [ or ] character</td>
      </tr>
      <tr>
--- a/lang/English.txt
+++ b/lang/English.txt
@@ -295,7 +295,7 @@ Max Depth
 Maximum external depth:
 Maximum external depth:
 Filters (refuse/accept links) :
-Filters (refuse/accept links) :
+Filters (refuse/accept links):
 Paths
 Paths
 Save prefs
--- a/lang/README.md
+++ b/lang/README.md
@@ -0,0 +1,37 @@
+# Translating HTTrack
+
+Interface strings live here, one `.txt` file per language. `English.txt` is the reference: every other file maps each English string to its translation.
+
+## File format
+
+Plain text, entries in consecutive pairs of lines:
+
+```
+<English string>
+<translation>
+```
+
+The first line of a pair is the lookup key and must stay identical to the one in `English.txt`; translate only the second line. Missing entries fall back to the English text at runtime, so a partial translation works.
+
+Preserve any `\r\n`, `\t` and `printf` placeholders (`%s`, `%d`, ...) in the translation.
+
+A few `LANGUAGE_*` entries at the top describe the file itself:
+
+| Key | Meaning |
+| --- | --- |
+| `LANGUAGE_NAME` | Name shown in the language picker, in its own language (`Deutsch`, not `German`) |
+| `LANGUAGE_ISO` | ISO 639 code, with region if needed (`de`, `pt_BR`) |
+| `LANGUAGE_CHARSET` | Encoding the file is saved in (`ISO-8859-1`, `windows-1251`, `UTF-8`, ...) |
+| `LANGUAGE_AUTHOR` | Your name and contact |
+| `LANGUAGE_WINDOWSID` | Windows locale name used by WinHTTrack (`German (Standard)`) |
+
+Save the file in exactly its declared `LANGUAGE_CHARSET`; an editor that rewrites it as UTF-8 will corrupt the non-ASCII bytes.
+
+## Adding or updating a language
+
+1. Copy `English.txt` to `<Language>.txt`, or edit the existing file.
+2. Translate each second line; leave the English keys untouched.
+3. Fill in the `LANGUAGE_*` header for a new file.
+4. Open a pull request, or attach the file to a GitHub issue.
+
+When new strings land in `English.txt` they show up untranslated (as English) until a translator fills them in.
--- a/src/htsbasiccharsets.sh
+++ b/src/htsbasiccharsets.sh
@@ -3,12 +3,12 @@

 # Change this to download files
 if false; then
-    echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
-    echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
-    echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
-    echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
-    echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
-    echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
+    echo "mget https://www.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
+    echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
+    echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
+    echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
+    echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
+    echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
    rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
 fi

--- a/src/htsencoding.c
+++ b/src/htsencoding.c
@@ -30,12 +30,14 @@ Please visit our Website: http://www.httrack.com
 /* Author: Xavier Roche                                         */
 /* ------------------------------------------------------------ */

+#include <stdint.h>
+
 #include "htscharset.h"
 #include "htsencoding.h"
 #include "htssafe.h"

-/* static int decode_entity(const unsigned int hash, const size_t len);
-*/
+/* static int decode_entity(const uint64_t hash, const size_t len);
+ */
 #include "htsentities.h"

 /* hexadecimal conversion */
@@ -50,30 +52,31 @@ static int get_hex_value(char c) {
    return -1;
 }

-/* Numerical Recipes,
-   see <http://en.wikipedia.org/wiki/Linear_congruential_generator> */
-#define HASH_PRIME ( 1664525 )
-#define HASH_CONST ( 1013904223 )
-#define HASH_ADD(HASH, C) do {                  \
-    (HASH) *= HASH_PRIME;                       \
-    (HASH) += HASH_CONST;                       \
-    (HASH) += (C);                              \
-  } while(0)
+/* 64-bit FNV-1a; must match htsentities.sh, which keys the entity table on it.
+ */
+#define HASH_INIT 0xcbf29ce484222325ULL
+#define HASH_PRIME 0x100000001b3ULL
+#define HASH_ADD(HASH, C)                                                      \
+  do {                                                                         \
+    (HASH) ^= (unsigned char) (C);                                             \
+    (HASH) *= HASH_PRIME;                                                      \
+  } while (0)

 int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t max, const char *charset) {
  size_t i, j, ampStart, ampStartDest;
  int uc;
  int hex;
-  unsigned int hash;
+  uint64_t hash;

  assertf(max != 0);
-  for(i = 0, j = 0, ampStart = (size_t) -1, ampStartDest = 0,
-        uc = -1, hex = 0, hash = 0 ; src[i] != '\0' ; i++) {
+  for (i = 0, j = 0, ampStart = (size_t) -1, ampStartDest = 0, uc = -1, hex = 0,
+      hash = HASH_INIT;
+       src[i] != '\0'; i++) {
    /* start of entity */
    if (src[i] == '&') {
      ampStart = i;
      ampStartDest = j;
-      hash = 0;
+      hash = HASH_INIT;
      uc = -1;
    }
    /* inside a potential entity */
@@ -174,14 +177,11 @@ int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t ma
      }
      /* alphanumerical entity */
      else {
-        /* alphanum and not too far ('&thetasym;' is the longest) */
-        if (i <= ampStart + 10 &&
-            (
-             (src[i] >= '0' && src[i] <= '9')
-             || (src[i] >= 'A' && src[i] <= 'Z')
-             || (src[i] >= 'a' && src[i] <= 'z')
-             )
-            ) {
+        /* alphanum, capped at the longest name
+         * '&CounterClockwiseContourIntegral;' (31) */
+        if (i <= ampStart + 31 && ((src[i] >= '0' && src[i] <= '9') ||
+                                   (src[i] >= 'A' && src[i] <= 'Z') ||
+                                   (src[i] >= 'a' && src[i] <= 'z'))) {
          /* compute hash */
          HASH_ADD(hash, (unsigned char) src[i]);
        } else {
--- a/src/htsentities.h
+++ b/src/htsentities.h
--- a/src/htsentities.sh
+++ b/src/htsentities.sh
@@ -1,75 +1,92 @@
 #!/bin/bash
 #
+# Regenerate htsentities.h from the WHATWG named character references.

-src=html40.txt
-url=http://www.w3.org/TR/1998/REC-html40-19980424/html40.txt
+set -euo pipefail
+
+src=entities.json
+url=https://html.spec.whatwg.org/entities.json
 dest=htsentities.h

-(
-    cat <<EOF
-/*
-  -- ${dest} --
-  FILE GENERATED BY $0, DO NOT MODIFY
+# 64-bit FNV-1a of $1, printed as a C constant. Must match the hash in
+# htsencoding.c. The offset basis is stored as its wrapped (signed) bit pattern;
+# bash arithmetic is 64-bit two's complement, so the result is bit-exact.
+fnv1a() {
+    local s=$1 i c h=$((0xcbf29ce484222325))
+    for ((i = 0; i < ${#s}; i++)); do
+        printf -v c '%d' "'${s:i:1}"
+        h=$(((h ^ (c & 0xff)) * 0x100000001b3))
+    done
+    printf '0x%016xULL' "$h"
+}

-  We compute the LCG hash
-  (see <http://en.wikipedia.org/wiki/Linear_congruential_generator>)
-  for each entity. We should in theory check using strncmp() that we
-  actually have the correct entity, but this is actually statistically
-  not needed.
+if [ ! -f "$src" ]; then
+    curl -fsS "$url" -o "$src"
+fi

-  We may want to do better, but we expect the hash function to be uniform, and
-  let the compiler be smart enough to optimize the switch (for example by
-  checking in log2() intervals)
-  
-  This code has been generated using the evil $0 script.
-*/
+# Keep ';'-terminated single-codepoint names; the ~93 multi-codepoint refs can't
+# fit decode_entity's single-codepoint return and are skipped (left verbatim).
+pairs=$(jq -r '
+    to_entries
+    | map(select((.key | endswith(";")) and (.value.codepoints | length == 1)))
+    | sort_by(.key)
+    | .[] | "\(.key | ltrimstr("&") | rtrimstr(";"))\t\(.value.codepoints[0])"' "$src")

-static int decode_entity(const unsigned int hash, const size_t len) {
+# Skipped multi-codepoint names, kept to prove none aliases an emitted hash.
+skipped=$(jq -r '
+    to_entries
+    | map(select((.key | endswith(";")) and (.value.codepoints | length > 1)))
+    | .[] | .key | ltrimstr("&") | rtrimstr(";")' "$src")
+
+cases=""
+emit_hashes=""
+while IFS=$'\t' read -r name cp; do
+    hash=$(fnv1a "$name")
+    cases+="    /* $name */"$'\n'
+    cases+="  case $hash:"$'\n'
+    cases+="    if (len == ${#name}) {"$'\n'
+    cases+="      return $cp;"$'\n'
+    cases+="    }"$'\n'
+    cases+="    break;"$'\n'
+    emit_hashes+="$hash"$'\n'
+done <<<"$pairs"
+
+skip_hashes=""
+while IFS= read -r name; do
+    [ -n "$name" ] && skip_hashes+="$(fnv1a "$name")"$'\n'
+done <<<"$skipped"
+
+# The switch keys on the hash alone, so the dispatch is correct only while every
+# emitted name hashes uniquely; prove it here, no runtime name compare needed.
+dups=$(printf '%s' "$emit_hashes" | sort | uniq -d || true)
+if [ -n "$dups" ]; then
+    echo "FATAL: two entity names share a hash (duplicate switch case); change the hash:" >&2
+    echo "$dups" >&2
+    exit 1
+fi
+# A skipped name colliding with an emitted hash would mis-decode instead of
+# staying verbatim; forbid that too.
+aliased=$(comm -12 <(printf '%s' "$emit_hashes" | sort -u) <(printf '%s' "$skip_hashes" | sort -u) || true)
+if [ -n "$aliased" ]; then
+    echo "FATAL: a skipped multi-codepoint name aliases an emitted hash:" >&2
+    echo "$aliased" >&2
+    exit 1
+fi
+
+cat >"$dest" <<EOF
+/* GENERATED by $0 from the WHATWG named character references
+   (${url}). DO NOT EDIT.
+   Dispatch keys on a 64-bit FNV-1a hash of the entity name; the generator
+   aborts on any hash collision, so no runtime name compare is needed. */
+
+#include <stdint.h>
+
+static int decode_entity(const uint64_t hash, const size_t len) {
  switch(hash) {
-EOF
-    (
-        if test -f ${src}; then
-            cat ${src}
-        else
-            GET "${url}"
-        fi
-    ) |
-        grep -E '^<!ENTITY [a-zA-Z0-9_]' |
-        sed \
-            -e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-            -e 's/-->$//' \
-            -e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
-        (
-            read -r A
-            while test -n "$A"; do
-                ent="${A%% *}"
-                code=$(echo "$A" | cut -f2 -d' ')
-                # compute hash
-                hash=0
-                i=0
-                a=1664525
-                c=1013904223
-                m="$((1 << 32))"
-                while test "$i" -lt ${#ent}; do
-                    d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
-                    hash="$((((hash * a) % (m) + d + c) % (m)))"
-                    i=$((i + 1))
-                done
-                echo -e "    /* $A */"
-                echo -e "  case ${hash}u:"
-                echo -e "    if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
-                echo -e "      return ${code};"
-                echo -e "    }"
-                echo -e "    break;"
-
-                # next
-                read -r A
-            done
-        )
-    cat <<EOF
-  }
+${cases}  }
  /* unknown */
  return -1;
 }
 EOF
-) >${dest}
+
+echo "wrote $dest ($(grep -c '^  case ' "$dest") entities)" >&2
--- a/src/htsfilters.c
+++ b/src/htsfilters.c
@@ -193,7 +193,12 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
          int len = (int) strlen(joker);

          while((joker[i] != RIGHT) && (joker[i]) && (i < len)) {
-            if ((joker[i] == '<') || (joker[i] == '>')) {       // *[<10]
+            // '\' escapes the next char as a literal member, e.g. *[\[\]]
+            if (joker[i] == '\\' && joker[i + 1] != '\0') {
+              i++;
+              pass[(int) (unsigned char) joker[i]] = 1;
+              i++;
+            } else if ((joker[i] == '<') || (joker[i] == '>')) { // *[<10]
              int lsize = 0;
              int lverdict;

@@ -221,7 +226,9 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
                while(isdigit((unsigned char) joker[i]))
                  i++;
              }
-            } else if (joker[i + 1] == '-') {   // 2 car, ex: *[A-Z]
+            } else if (joker[i + 1] == '-' && joker[i + 2] != '\0') {
+              // range *[A-Z]; the '\0' guard rejects a truncated *[a- (else
+              // i+=3 overshoots the NUL)
              if ((int) (unsigned char) joker[i + 2] >
                  (int) (unsigned char) joker[i]) {
                int j;
@@ -233,10 +240,7 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
              }
              // else err=1;
              i += 3;
-            } else {            // 1 car, ex: *[ ]
-              if (joker[i + 2] == '\\' && joker[i + 3] != 0) {  // escaped char, such as *[\[] or *[\]]
-                i++;
-              }
+            } else { // 1 car, ex: *[ ]
              pass[(int) (unsigned char) joker[i]] = 1;
              i++;
            }
--- a/src/htsglobal.h
+++ b/src/htsglobal.h
@@ -43,8 +43,8 @@ Please visit our Website: http://www.httrack.com
   configure.ac, decoupled from these). VERSION is the display form, VERSIONID
   the dotted numeric form, AFF_VERSION the short form shown in footers,
   LIB_VERSION the data/cache format generation. */
-#define HTTRACK_VERSION "3.49-9"
-#define HTTRACK_VERSIONID "3.49.9"
+#define HTTRACK_VERSION "3.49-10"
+#define HTTRACK_VERSIONID "3.49.10"
 #define HTTRACK_AFF_VERSION "3.x"
 #define HTTRACK_LIB_VERSION "2.0"

--- a/src/htsparse.c
+++ b/src/htsparse.c
@@ -302,6 +302,14 @@ static HTS_INLINE char html_prevc(const char *html, const char *start) {
  return html > start ? html[-1] : ' ';
 }

+/* Drop a redirect Location's #fragment: a UA anchor, never part of the fetched
+ * resource (#204). */
+static void url_drop_fragment(char *const url) {
+  char *const frag = strchr(url, '#');
+  if (frag != NULL)
+    *frag = '\0';
+}
+
 /* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
   argument is a method, not a URL: #218). Case-insensitive. */
 static int is_http_method(const char *s, size_t len) {
@@ -3596,6 +3604,7 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
        //

        strcpybuff(mov_url, r->location);
+        url_drop_fragment(mov_url);

        // url qque -> adresse+fichier
        if ((reponse =
@@ -4803,6 +4812,7 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,

            mov_url[0] = '\0';
            strcpybuff(mov_url, back[b].r.location);    // copier URL
+            url_drop_fragment(mov_url);

            /* Remove (temporarily created) file if it was created */
            UNLINK(fconv(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), back[b].url_sav));
--- a/src/htsselftest.c
+++ b/src/htsselftest.c
@@ -512,15 +512,21 @@ static int string_safety_selftests(void) {
 /* ------------------------------------------------------------ */

 static int st_filter(httrackp *opt, int argc, char **argv) {
+  char *str, *pat;
+  int matched;
+
  (void) opt;
  if (argc < 2) {
    fprintf(stderr, "filter: needs a filter pattern and a string\n");
    return 1;
  }
-  if (strjoker(argv[1], argv[0], NULL, NULL))
-    printf("%s does match %s\n", argv[1], argv[0]);
-  else
-    printf("%s does NOT match %s\n", argv[1], argv[0]);
+  /* exact-size heap copies so a sanitizer traps any over-read of the pattern */
+  str = strdupt(argv[1]);
+  pat = strdupt(argv[0]);
+  matched = strjoker(str, pat, NULL, NULL) != NULL;
+  printf("%s does %s %s\n", argv[1], matched ? "match" : "NOT match", argv[0]);
+  freet(str);
+  freet(pat);
  return 0;
 }

--- a/tests/01_engine-entities.test
+++ b/tests/01_engine-entities.test
@@ -18,6 +18,21 @@ ent '&amp;' '&'
 ent '&lt;&gt;' '<>'
 ent '&eacute;' 'é'

+# HTML5 names from the WHATWG set
+ent '&hellip;' '…'
+ent '&bigcup;' '⋃'
+# longest name (31 chars) exercises the name-length cap
+ent '&CounterClockwiseContourIntegral;' '∳'
+# astral codepoint -> 4-byte UTF-8
+ent '&Aopf;' '𝔸'
+# multi-codepoint refs are skipped at generation, so left verbatim
+ent '&fjlig;' '&fjlig;'
+
+# common HTML4 names still decode (regression guard against accidental drops)
+ent '&copy;&reg;&trade;' '©®™'
+ent '&mdash;&ndash;' '—–'
+ent '&alpha;&beta;' 'αβ'
+
 # numeric: decimal and hex
 ent '&#65;&#66;' 'AB'
 ent '&#x41;' 'A'
--- a/tests/01_engine-filter.test
+++ b/tests/01_engine-filter.test
@@ -50,27 +50,54 @@ match '*foo*bar' 'foozbar'
 # '?' is the query-string marker, not a single-char wildcard
 nomatch 'a?c' 'abc'

-# backslash escapes a metacharacter inside a class so it is matched literally.
-# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
-# both X and '\'. These assertions pin that behavior.
+# Inside a class, backslash escapes the next char as a literal member (#148):
+# '\X' matches X only (not '\'), and an escaped ']' is a member, not the terminator.
 match '*[\*]' '*'
-match '*[\*]' "\\"
-nomatch '*[\*]' 'a'
+nomatch '*[\*]' "\\"
 match '*[\\]' "\\"
-nomatch '*[\\]' 'a'
+nomatch '*[\\]' '*'
 match '*[\[]' '['
-match '*[\[]' "\\"
-nomatch '*[\[]' 'a'
+nomatch '*[\[]' "\\"
+match '*[\]]' ']'
+nomatch '*[\]]' "\\"

-# A literal ']' cannot be a class member: the class parser stops at the first
-# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
-# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
-# by a trailing literal ']'. These assertions document the current (buggy)
-# behavior so any future matcher fix is a deliberate, visible change.
-nomatch '*[\[\]]' '[' # not matched, despite the docs
-match '*[\[\]]' ']'   # only via the empty class-match + trailing ']'
-match '*[\[\]]' '[]'  # one of {'[','\'} then the trailing ']'
-nomatch '*[\[\]]' '[]x'
+# '*[\[\]]' is "the [ or ] character", as the filter guide documents.
+match '*[\[\]]' '['
+match '*[\[\]]' ']'
+nomatch '*[\[\]]' 'a'
+match '*[\[,\]]' '[' # comma between members is optional
+match '*[\[,\]]' ']'
+match '*[a,\[]' 'a' # an escaped member no longer eats the preceding one
+match '*[a,\[]' '['
+
+# Escape is decoded before the range/separator/size checks, so '\-' '\,' '\<'
+# are literal members, not operators.
+match '*[a\-z]' 'a'
+match '*[a\-z]' 'z'
+nomatch '*[a\-z]' 'b' # not the a..z range
+match '*[\,]' ','
+nomatch '*[\,]' "\\" # the escape must not leak '\' into the class
+match '*[\<]' '<'
+nomatch '*[\<]' "\\"
+match '*[\[,\],a]' '['
+match '*[\[,\],a]' ']'
+match '*[\[,\],a]' 'a'
+
+# A truncated range '*[a-' is the literal members {a,-}; the parser must not
+# read past the end decoding it (was a 1-byte heap over-read in the range arm).
+match '*[a-' 'a'
+nomatch '*[a-' 'b'
+
+# *(...) matches exactly one char from the class; *[...] matches a run.
+match '*(a,b)' 'a'
+nomatch '*(a,b)' 'aa'
+nomatch '*(a,b)' 'c'
+
+# documented composite filters (filters.html)
+match 'www.*[path].com/*[path].zip' 'www.foo.com/a/b.zip'
+nomatch 'www.*[path].com/*[path].zip' 'www.foo.com/a/b.tar'
+match '*.html*[]' 'page.html'
+nomatch '*.html*[]' 'page.html?x=1' # *[] forbids the trailing query

 # Size-based rules (-#test=filtersize <size> <string> <filter...>): a negative size
 # means the size is still unknown (scan time). A size exclusion must stay neutral
--- a/tests/28_local-pause.test
+++ b/tests/28_local-pause.test
@@ -9,6 +9,13 @@ set -e

 : "${top_srcdir:=..}"

+# python3 runs the local server (mirror local-crawl.sh); skip when absent, else
+# run() swallows its exit-77 and the serverless 0s/0s crawl looks like a fail.
+command -v python3 >/dev/null || {
+    echo "python3 not found; skipping local crawl tests"
+    exit 77
+}
+
 run() { # echoes the wall-clock seconds of one crawl
    local t0 t1
    t0=$(date +%s)
--- a/tests/29_local-redirect-fragment.test
+++ b/tests/29_local-redirect-fragment.test
@@ -0,0 +1,11 @@
+#!/bin/bash
+# Issue #204: a 302 Location with a #fragment must drop the fragment before the
+# target is fetched. The server is strict (400 on a '#' in the request-target),
+# so a leaked fragment logs an error and the target is never saved.
+set -e
+
+: "${top_srcdir:=..}"
+
+bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
+    --found 'redir/target.html' \
+    httrack 'BASEURL/redir/index.html'
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -75,6 +75,7 @@ TESTS = \
 	25_local-mime-exclude.test \
 	26_local-strip-query.test \
 	27_local-cookies-file.test \
-	28_local-pause.test
+	28_local-pause.test \
+	29_local-redirect-fragment.test

 CLEANFILES = check-network_sh.cache
--- a/tests/local-server.py
+++ b/tests/local-server.py
@@ -354,6 +354,21 @@ class Handler(SimpleHTTPRequestHandler):
        if self.command != "HEAD":
            self.wfile.write(body)

+    # 302 whose Location carries a #fragment (#204): the fragment is a UA anchor
+    # that must be dropped before the target is fetched. A leaked '#' reaches the
+    # strict-server guard below and 400s.
+    def route_redir_index(self):
+        self.send_html('\t<a href="go.php">go</a>')
+
+    def route_redir_go(self):
+        self.send_response(302, "Found")
+        self.send_header("Location", "target.html#section")
+        self.send_header("Content-Length", "0")
+        self.end_headers()
+
+    def route_redir_target(self):
+        self.send_raw(b"<html><body>redirect target</body></html>\n", "text/html")
+
    ROUTES = {
        "/cookies/entrance.php": route_entrance,
        "/cookies/second.php": route_second,
@@ -391,10 +406,23 @@ class Handler(SimpleHTTPRequestHandler):
        "/mimex/index.html": route_mimex_index,
        "/mimex/blob.pdf": route_mimex_blob,
        "/mimex/real.html": route_mimex_real,
+        "/redir/index.html": route_redir_index,
+        "/redir/go.php": route_redir_go,
+        "/redir/target.html": route_redir_target,
    }

    # --- dispatch ----------------------------------------------------------

+    def reject_fragment(self):
+        # Strict server: a '#' in the request-target is the client failing to
+        # drop a fragment (#204). RFC 3986 forbids it on the wire; answer 400.
+        if "#" in self.path:
+            self.send_response(400, "Bad Request")
+            self.send_header("Content-Length", "0")
+            self.end_headers()
+            return True
+        return False
+
    def dispatch(self):
        self._set_cookies = []
        path = urlsplit(self.path).path
@@ -406,10 +434,14 @@ class Handler(SimpleHTTPRequestHandler):
        return False

    def do_GET(self):
+        if self.reject_fragment():
+            return
        if not self.dispatch():
            super().do_GET()

    def do_HEAD(self):
+        if self.reject_fragment():
+            return
        if not self.dispatch():
            super().do_HEAD()
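
The client-side rule that the strict test server enforces can be illustrated with the Python standard library. This is a sketch, not HTTrack's implementation (the actual fix is the C helper `url_drop_fragment`); `request_target` is a hypothetical name and the base address is an assumed example. `urllib.parse.urldefrag` splits a URL into the part before `#` and the fragment, and only the first part belongs in the request.

```python
from urllib.parse import urldefrag, urljoin


def request_target(base_url: str, location: str) -> str:
    """Resolve a redirect Location against base_url and drop any #fragment."""
    absolute = urljoin(base_url, location)
    return urldefrag(absolute).url


if __name__ == "__main__":
    base = "http://127.0.0.1:8080/redir/go.php"  # assumed test-server address
    print(request_target(base, "target.html#section"))
```

A Location without a fragment passes through unchanged, so the same helper can be applied to every redirect.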
Author	SHA1	Message	Date
Xavier Roche	f578bede0c	ci: add a no-python3 job reproducing the Debian buildd chroot GitHub runners ship python3, so every make-check job exercised the python3-present path and the local-server tests never skipped. The Debian buildds build in a minimal chroot with no python3, where those tests must SKIP (exit 77) -- and 28_local-pause failed there instead, FTBFS on every arch for 3.49.10-1, invisible to CI. Add a job that builds, removes python3, and runs make check, so the skip path is exercised on every PR. Verified to fail on the pre-fix tree and pass after. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>	2026-06-28 20:02:47 +02:00
Xavier Roche	45279d7357	tests: skip 28_local-pause when python3 is absent (Debian buildd) The local-server tests all skip with exit 77 when python3 is missing, but 28_local-pause runs local-crawl.sh inside a command-substitution with output redirected to /dev/null, swallowing that exit-77 skip signal. On a host with no python3 (every Debian buildd) both crawls then run serverless, finish in 0s, and the test reports FAIL instead of SKIP, turning 3.49.10-1 red on all architectures. Add the same python3 guard the other tests get via local-crawl.sh, up front, so it skips cleanly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>	2026-06-28 19:49:47 +02:00
Xavier Roche	cca83e5f4a	Modernize HTML entity decoding to the WHATWG named character references (#444) * Modernize HTML entity decoding to the WHATWG named character references Regenerate htsentities.h from the WHATWG entities.json (2032 single-codepoint names) instead of the 1998 HTML 4.0 set (252 names). The dispatch hash moves from a 32-bit LCG to 64-bit FNV-1a; the generator now aborts on any (hash,len) collision, so the hash-only switch stays correct without a runtime name compare. Bump the consumer name-length cap from 10 to 31, the longest name (CounterClockwiseContourIntegral), or long names would be rejected outright. Multi-codepoint references (~93 obscure math entities) can't fit the single-codepoint return and are skipped, left verbatim as before. Also fix the dead ftp://ftp.unicode.org URLs in htsbasiccharsets.sh. Closes #443 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> * entities: harden the generator collision guard and widen test coverage Review follow-up. The switch keys on the hash alone, so check hash-alone uniqueness among emitted names (a same-hash/different-len pair would otherwise slip the old (hash,len) check and surface only as a cryptic duplicate-case compile error). Also hash the ~93 skipped multi-codepoint names and abort if any aliases an emitted hash, so "skipped means verbatim" is enforced rather than assumed on future regens. Add a runtime sweep of common HTML4 names (copy/reg/trade/mdash/ndash/alpha/beta) to 01_engine-entities.test: a regression guard against accidental drops and a generator-vs-consumer hash cross-check on names beyond the handful already probed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> --------- Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:29:03 +02:00
Xavier Roche	97f398e508	Release 3.49.10 (#442) Bump the package version to 3.49.10 and curate the release notes. VERSION_INFO goes 3:1:0 -> 3:2:0: the cycle only appended tail fields to the installed options struct (--cookies-file, --pause, --strip-query, the -%u split), no existing symbol or offset changed, so the soname stays .so.3. history.txt gets the 3.49-10 block; debian/changelog gets 3.49.10-1 with the Debian-specific items (DEP-5 copyright, chromium-first browser dep, minizip embedded-library override). Standards-Version 4.7.0 -> 4.7.4: the intervening Policy changes (usr-merge, /usr/games, Priority recommendation, -dev linker scripts, non-free-firmware) need no package change. Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 14:26:50 +02:00
Xavier Roche	a62f93a107	Strip the #fragment from a redirect Location before fetching (#204) (#441) * Strip the #fragment from a redirect Location before fetching (#204) A 302/30x Location is dereferenced, not displayed, so its #fragment is a client-side anchor that must be dropped before the target is requested. httrack kept it: the redirect followers copied r.location verbatim, so the re-request carried `GET /page.html#frag` (strict servers answer 400) and the mirror was saved under a fragment-polluted name. HTML links were already stripped at parse time; only the two Location followers were not. Drop the fragment in a small helper called at both follow sites, covering the live and cached-redirect paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> * test(#204): strict-server guard so a leaked fragment is a wire-level failure The first cut of 29_local-redirect-fragment only checked the saved filename. Python's urlsplit() drops the fragment before routing, so a `#` leaked into the GET line still routed to the target and the crawl passed: the assertion was a proxy, not the wire behavior the fix targets. Make the server strict (400 on any `#` in the request-target, like the real servers in #204), so a leaked fragment now logs an error and the target is never saved. Neutering the fix makes the test fail with the exact "400 Bad Request" from the issue. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> --------- Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 13:52:21 +02:00
Xavier Roche	799ec88dc7	filters: fix escaped brackets inside [...] character classes (#440) filters: decode escaped chars correctly inside [...] classes The escape branch in strjoker probed joker[i+2] instead of the current char, so a backslash escape only worked as the first class member: '[\[\]]' (documented as "the [ or ] character") matched only ']', and '[a,\[]' dropped the 'a'. The loop also treated any ']' as the class terminator, so an escaped ']' could never be a member. Decode the escape first in the loop body: a backslash takes the next char as the literal member (only that char, not also the backslash the old code added), and an escaped ']' is consumed before the terminator check. So '[\[\]]' now matches both brackets, and escape precedes the range/size checks ('\-' '\,' '\<' become literal members). The self-test previously pinned the buggy output as expected; it now asserts the documented behavior and fails against the old matcher. Closes #148 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> * filters: fix a 1-byte over-read on a truncated range [a- The [...] class parser's range arm does i += 3 unconditionally, so a pattern ending in a dangling '-' (e.g. [a-) read one byte past the NUL: joker[i+2] is the NUL, i jumps to len+1, and the separator skip and loop guard then read joker[len+1]. Guard the range arm on joker[i+2] != '\0' so a truncated range falls through to the literal-member path instead of overshooting. The filter self-test now copies the pattern and string into exact-size heap buffers so a sanitizer traps such over-reads; the pattern previously came straight from argv (no redzone), which is why this stayed invisible. A [a- test case exercises it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com> --------- Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:56:11 +02:00
Xavier Roche	71af4a24f0	lang: add a translation guide and fix the English colon spacing (#439) Document the lang/*.txt format and contribution flow in lang/README.md (linked from CONTRIBUTING.md); the project had no written instructions for translators. Also drop the stray space before the colon in the English "Filters (refuse/accept links):" label so it matches the other labels. Only the English.txt value is changed, not the msgid key, so existing translations still resolve. Closes #74 Closes #75 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:45:07 +02:00
Xavier Roche	e17f4f12a0	Add --pause to space out file downloads by a random delay (#185) (#438) A new --pause MIN[:MAX] (seconds, -%G) waits a random MIN..MAX between files so a crawl looks less like a bot and is gentler on the server; a single value is a fixed delay. Disabled by default. It reuses the existing non-blocking launch gate (back_pluggable_sockets_strict): rather than Sleep() -- which would freeze the single select() pump and stall the other in-flight transfers -- the gate just withholds new launches until the delay elapses, one file per gap. The per-gap target is derived from the last-request timestamp so it stays stable across the many gate evaluations within a gap yet rerolls on each launch; sampling rand() per evaluation would instead bias the realized delay toward MIN. Two int fields appended at the httrackp tail (ABI-stable, no soname bump). Covered by a pure-function self-test (range + spread, with teeth against the min-bias bug) and a local-server crawl that asserts the pause slows a multi-file mirror. Closes #185 Signed-off-by: Xavier Roche <roche@httrack.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:34:56 +02:00
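
The per-gap delay described in the `--pause` commit can be sketched as follows. This is only an illustration in Python of the stated idea (a target derived from the last-request timestamp), not httrack's C implementation: the names `pause_target` and `may_launch`, and the choice of seeding a PRNG with the timestamp, are assumptions made for the example.

```python
import random


def pause_target(last_request_ts: int, min_s: float, max_s: float) -> float:
    # Hypothetical derivation: seed a private PRNG with the last-request
    # timestamp, so every gate evaluation inside the same gap sees the same
    # target, while a new launch (new timestamp) rerolls it.
    rng = random.Random(last_request_ts)
    return min_s + rng.random() * (max_s - min_s)


def may_launch(now: float, last_request_ts: int,
               min_s: float, max_s: float) -> bool:
    # Non-blocking gate: withhold the next launch until the delay elapses,
    # without sleeping, so other in-flight transfers keep being served.
    return now - last_request_ts >= pause_target(last_request_ts, min_s, max_s)
```

Drawing a fresh random value on every call to `may_launch()` would instead release the gate as soon as any one draw fell below the elapsed time, which is the bias toward MIN that the commit message describes.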