Lock CSS background-image url() rewriting in the parser test

background-image is already captured and rewritten through the style/CSS url() path, in both an external <style> block and an inline style attribute, with the URL unquoted, double-quoted or single-quoted. Extend the offline parser test to cover all of these so the behavior stays locked.

closes #237
Merge pull request #326 from xroche/parser/srcset-candidates
2026-06-14 14:24:34 +03:00 · 2026-06-14 01:07:42 +02:00 · 2026-06-14 00:42:48 +02:00 · 2026-06-14 00:37:20 +02:00 · 2026-06-13 10:41:24 +02:00 · 2026-06-13 10:17:24 +02:00
9 changed files with 271 additions and 45 deletions
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ http://www.httrack.com/

 ## Compile trunk release
 ```sh
-git clone https://github.com/xroche/httrack.git --recurse
+git clone https://github.com/xroche/httrack.git --recurse-submodules
 cd httrack
 ./configure --prefix=$HOME/usr && make -j8 && make install
 ```
--- a/html/fcguide.html
+++ b/html/fcguide.html
@@ -181,17 +181,17 @@ used for some time.

 <p align=justify> The rest of this manual is dedicated to detailing what
 you find in the help message and providing examples - lots and lots of
-examples...  Here is what you get (page by page - use <enter> to move to
+examples...  Here is what you get (page by page - use &lt;enter&gt; to move to
 the next page in the real program) if you type 'httrack --help':

 <pre>
 >httrack --help
 HTTrack version 3.03BETAo4 (compiled Jul  1 2001)
-	usage: ./httrack <URLs [-option] [+<FILTERs>] [-<FILTERs>]
+	usage: ./httrack &lt;URLs&gt; [-option] [+&lt;FILTERs&gt;] [-&lt;FILTERs&gt;]
 	with options listed below: (* is the default value)

 General options:
-  O  path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path <param>)
+  O  path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path &lt;param&gt;)
 %O  top path if no path defined (-O path_mirror[,path_cache_and_logfiles])

 Action options:
@@ -202,7 +202,7 @@ Action options:
  Y   mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)

 Proxy options:
-  P  proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
+  P  proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy &lt;param&gt;)
 %f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])

 Limits options:
@@ -227,7 +227,7 @@ Links options:
 %P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
  n  get non-html files 'near' an html file (ex: an image located outside) (--near)
  t  test all URLs (even forbidden ones) (--test)
- %L <file add all URL located in this text file (one URL per line) (--list <param>)
+ %L &lt;file&gt; add all URL located in this text file (one URL per line) (--list &lt;param&gt;)

 Build options:
  NN structure type (0 *original structure, 1+: see below) (--structure[=N])
@@ -248,12 +248,12 @@ Spider options:
 %h  force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
 %B  tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
 %s  update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
- %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
+ %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume &lt;param&gt;)

 Browser ID:
-  F  user-agent field (-F "user-agent name") (--user-agent <param>)
- %F  footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
- %l  preferred language (-%l "fr, en, jp, *" (--language <param>)
+  F  user-agent field (-F "user-agent name") (--user-agent &lt;param&gt;)
+ %F  footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer &lt;param&gt;)
+ %l  preferred language (-%l "fr, en, jp, *" (--language &lt;param&gt;)

 Log, index, cache
  C  create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
@@ -303,8 +303,8 @@ Guru options: (do NOT use)
 #!  Execute a shell command (-#! "echo hello")

 Command-line specific options:
-  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
- %U run the engine with another id when called as root (-%U smith) (--user <param>)
+  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd &lt;param&gt;)
+ %U run the engine with another id when called as root (-%U smith) (--user &lt;param&gt;)

 Details: Option N
  N0 Site-structure (default)
@@ -340,14 +340,14 @@ Details: User-defined option N
  %[param] param variable in query string

 Shortcuts:
--mirror      <URLs *make a mirror of site(s) (default)
--get         <URLs  get the files indicated, do not seek other URLs (-qg)
--list   <text file  add all URL located in this text file (-%L)
--mirrorlinks <URLs  mirror all links in 1st level pages (-Y)
--testlinks   <URLs  test links in pages (-r1p0C0I0t)
--spider      <URLs  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite    <URLs  identical to --spider
--skeleton    <URLs  make a mirror, but gets only html files (-p1)
+--mirror      &lt;URLs&gt; *make a mirror of site(s) (default)
+--get         &lt;URLs&gt;  get the files indicated, do not seek other URLs (-qg)
+--list   &lt;text file&gt;  add all URL located in this text file (-%L)
+--mirrorlinks &lt;URLs&gt;  mirror all links in 1st level pages (-Y)
+--testlinks   &lt;URLs&gt;  test links in pages (-r1p0C0I0t)
+--spider      &lt;URLs&gt;  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
+--testsite    &lt;URLs&gt;  identical to --spider
+--skeleton    &lt;URLs&gt;  make a mirror, but gets only html files (-p1)
 --update              update a mirror, without confirmation (-iC2)
 --continue            continue a mirror, without confirmation (-iC1)

@@ -387,13 +387,13 @@ with examples... I will be here a while...
 <hr>
 <h2> Syntax </h2>

-<pre><b><i>httrack <URLs> [-option] [+<FILTERs>] [-<FILTERs>] </i></b></pre>
+<pre><b><i>httrack &lt;URLs&gt; [-option] [+&lt;FILTERs&gt;] [-&lt;FILTERs&gt;] </i></b></pre>

 <p align=justify> The syntax of httrack is quite simple.  You specify
-the URLs you wish to start the process from (<URLS>), any options you
+the URLs you wish to start the process from (&lt;URLS&gt;), any options you
 might want to add ([-option], any filters specifying places you should
-([+<FILTERs>]) and should not ([-<FILTERs>]) go, and end the command
-line by pressing <enter>.  Httrack then goes off and does your bidding.
+([+&lt;FILTERs&gt;]) and should not ([-&lt;FILTERs&gt;]) go, and end the command
+line by pressing &lt;enter&gt;.  Httrack then goes off and does your bidding.
 For example:

 <pre><b><i>
@@ -425,7 +425,7 @@ site. Specifically, the defauls are:
  pN priority mode: (* p3)  *3 save all files
  D  *can only go down into subdirs
  a  *stay on the same address
-  --mirror      <URLs> *make a mirror of site(s) (default)
+  --mirror      &lt;URLs&gt; *make a mirror of site(s) (default)
 </pre>

 <p align=justify> Here's what all of that means:
@@ -542,7 +542,7 @@ subdirectories of the starting directory to be investigated.
 search started are to be collected.  Other sites they point to are not
 to be imaged. 

-<pre><b><i>  --mirror      <URLs> *make a mirror of site(s) (default) </i></b></pre>
+<pre><b><i>  --mirror      &lt;URLs&gt; *make a mirror of site(s) (default) </i></b></pre>

 <p align=justify> This indicates that the program should try to make a
 copy of the site as well as it can. 
@@ -921,7 +921,7 @@ Links options:
 %P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use)
  n   get non-html files 'near' an html file (ex: an image located outside)
  t   test all URLs (even forbidden ones)
- %L <file> add all URL located in this text file (one URL per line)
+ %L &lt;file&gt; add all URL located in this text file (one URL per line)
 </i></b></pre>

 <p align=justify> The links options allow you to control what links are
@@ -1183,7 +1183,7 @@ Spider options:
 %h  force HTTP/1.0 requests (reduce update features, only for old servers or proxies)
 %B  tolerant requests (accept bogus responses on some servers, but not standard!)
 %s  update hacks: various hacks to limit re-transfers when updating
- %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
+ %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume &lt;param&gt;)
 </i></b></pre>

 <p align=justify> By default, cookies are universally accepted and
@@ -1387,7 +1387,7 @@ web servers leave footprints in the browser.
 Browser ID:
  F  user-agent field (-F "user-agent name")
 %F  footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]"
- %l  preferred language (-%l "fr, en, jp, *" (--language <param>)
+ %l  preferred language (-%l "fr, en, jp, *" (--language &lt;param&gt;)
 </i></b></pre>

 <p align=justify> The user-agent field is used by browsers to determine
@@ -1799,7 +1799,7 @@ based authentication)

 <pre><b><i>
 Command-line specific options:
-  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
+  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd &lt;param&gt;)
 </i></b></pre>

 <p align=justify> This option is very nice for a wide array of actions
@@ -1811,7 +1811,7 @@ httrack http://www.shoesizes.com/bob/ -O /tmp/shoesizes -V "/bin/echo \$0"
 </i></b></pre>

 <pre>
- %U run the engine with another id when called as root (-%U smith) (--user <param>)
+ %U run the engine with another id when called as root (-%U smith) (--user &lt;param&gt;)
 </pre>

 <p align=justify> Change the UID of the owner when running as r00t
@@ -1856,14 +1856,14 @@ of other options that are commonly used.

 <pre><b><i>
 Shortcuts:
--mirror      <URLs> *make a mirror of site(s) (default)
--get         <URLs>  get the files indicated, do not seek other URLs (-qg)
--list   <text file>  add all URL located in this text file (-%L)
--mirrorlinks <URLs>  mirror all links in 1st level pages (-Y)
--testlinks   <URLs>  test links in pages (-r1p0C0I0t)
--spider      <URLs>  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite    <URLs>  identical to --spider
--skeleton    <URLs>  make a mirror, but gets only html files (-p1)
+--mirror      &lt;URLs&gt; *make a mirror of site(s) (default)
+--get         &lt;URLs&gt;  get the files indicated, do not seek other URLs (-qg)
+--list   &lt;text file&gt;  add all URL located in this text file (-%L)
+--mirrorlinks &lt;URLs&gt;  mirror all links in 1st level pages (-Y)
+--testlinks   &lt;URLs&gt;  test links in pages (-r1p0C0I0t)
+--spider      &lt;URLs&gt;  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
+--testsite    &lt;URLs&gt;  identical to --spider
+--skeleton    &lt;URLs&gt;  make a mirror, but gets only html files (-p1)
 --update              update a mirror, without confirmation (-iC2)
 --continue            continue a mirror, without confirmation (-iC1)
 --catchurl            create a temporary proxy to capture an URL or a form post URL
@@ -2019,15 +2019,15 @@ are in reverse priority order.  Here's an example:
        <td>no characters must be present after</a></td>
      </tr>
 	<tr>
-		<td> <b> <filter>*[&lt NN]</b></td>
+		<td> <b> &lt;filter&gt;*[&lt NN]</b></td>
 		<td> size less than NN Kbytes</td>
 	</tr>
 	<tr>
-		<td> <b> <filter>*[&gt PP]</b></td>
+		<td> <b> &lt;filter&gt;*[&gt PP]</b></td>
 		<td> size more than PP Kbytes</td>
 	</tr>
 	<tr>
-		<td> <b> <filter>*[&lt NN &gt PP]</b></td>
+		<td> <b> &lt;filter&gt;*[&lt NN &gt PP]</b></td>
 		<td> size less than NN Kbytes and more than PP Kbytes</td>
 	</tr>
    </table>
--- a/lang/Ukrainian.txt
+++ b/lang/Ukrainian.txt
@@ -7,7 +7,7 @@ uk
 LANGUAGE_AUTHOR
 Andrij Shevchuk (http://programy.com.ua, http://vic-info.com.ua) \r\n
 LANGUAGE_CHARSET
-ISO-8859-5
+windows-1251
 LANGUAGE_WINDOWSID
 Ukrainian
 OK
--- a/src/htslib.c
+++ b/src/htslib.c
@@ -121,6 +121,7 @@ const char *hts_detect[] = {
  "lowsrc",
  "profile",                    // element META
  "src",
+  "srcset",                     // HTML5 responsive images (<img>, <source>)
  "swurl",
  "url",
  "usemap",
--- a/src/htsparse.c
+++ b/src/htsparse.c
@@ -532,6 +532,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
        int valid_p = 0;        // force to take p even if == 0
        int ending_p = '\0';    // ending quote?
        int archivetag_p = 0;   // avoid multiple-archives with commas
+        int srcset_p = 0;       // srcset="url1 480w, url2 2x": list of URLs
        int unquoted_script = 0;
        INSCRIPT inscript_state_pos_prev = inscript_state_pos;

@@ -1050,6 +1051,12 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
                          if (strcmp(hts_detect[i], "archive") == 0) {
                            archivetag_p = 1;
                          }
+                          /* srcset: a comma-list of candidate URLs, each split
+                             out and rewritten below (#235, #236) */
+                          else if (strcmp(hts_detect[i], "srcset") == 0
+                                   || strcmp(hts_detect[i], "data-srcset") == 0) {
+                            srcset_p = 1;
+                          }
                        }
                        i++;
                      }
@@ -1815,6 +1822,14 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
                html++;          // sauter # pour usemap etc
              }
            }
+ srcset_next:
+            /* srcset: skip leading whitespace/commas before each candidate;
+               the skipped bytes flush verbatim below */
+            if (srcset_p) {
+              while(html < r->adr + r->size
+                    && (is_realspace(*html) || *html == ','))
+                INCREMENT_CURRENT_ADR(1);
+            }
            eadr = html;

            // ne pas flusher après code si on doit écrire le codebase avant!
@@ -1844,6 +1859,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
                    if ((*eadr == quote && (!quoteinscript || *(eadr - 1) == '\\'))     // end quote
                        || (noquote && (*eadr == '\"' || *eadr == '\''))        // end at any quote
                        || (!noquote && quote == '\0' && is_realspace(*eadr))   // unquoted href
+                        || srcset_p     // whitespace ends a srcset candidate URL
                      )         // si pas d'attente de quote spéciale ou si quote atteinte
                      ok = 0;
                  } else if (ending_p && (*eadr == ending_p))
@@ -1872,6 +1888,16 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
                      break;    // \" ou \' point d'arrêt
                    case '?':  /*quote_adr=adr; */
                      break;    // noter position query
+                    case ',':
+                      if (srcset_p) {
+                        /* split only on a trailing comma; one inside the URL
+                           (data: URI, CDN path) is kept, per the WHATWG algo */
+                        const char *const n = eadr + 1;
+
+                        if (n >= r->adr + r->size || is_space(*n) || *n == ',')
+                          ok = 0;
+                      }
+                      break;
                    }
                  }
                  //}
@@ -3250,6 +3276,28 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
            }
            // adr=eadr-1;  // ** sauter

+            /* srcset candidate loop: skip the descriptor and comma, then
+               re-enter the capture for the next URL. Backward goto, not a loop:
+               the per-candidate body is this whole block. */
+            if (srcset_p && ok == 0) {
+              const char *const endp = r->adr + r->size;
+              const char *q = html;
+              while(q < endp && *q != '\0' && *q != ',' && *q != quote
+                    && *q != '<' && *q != '>' && (unsigned char) *q >= 32)
+                q++;            // skip the descriptor
+              if (q < endp && *q == ',') {
+                q++;
+                while(q < endp && (is_realspace(*q) || *q == ','))
+                  q++;          // skip whitespace and empty candidates
+                if (q < endp && *q != '\0' && *q != ',' && *q != quote
+                    && *q != '<' && *q != '>' && (unsigned char) *q >= 32) {
+                  INCREMENT_CURRENT_ADR(q - html);   // keep the automate in sync
+                  ok = 1;
+                  goto srcset_next;
+                }
+              }
+            }
+
            /* We skipped bytes and skip the " : reset state */
            /*if (inscript) {
               inscript_state_pos = INSCRIPT_START;
--- a/tests/01_engine-filter.test
+++ b/tests/01_engine-filter.test
@@ -47,3 +47,25 @@ match '*foo*bar' 'foozbar'

 # '?' is the query-string marker, not a single-char wildcard
 nomatch 'a?c' 'abc'
+
+# backslash escapes a metacharacter inside a class so it is matched literally.
+# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
+# both X and '\'. These assertions pin that behavior.
+match '*[\*]' '*'
+match '*[\*]' "\\"
+nomatch '*[\*]' 'a'
+match '*[\\]' "\\"
+nomatch '*[\\]' 'a'
+match '*[\[]' '['
+match '*[\[]' "\\"
+nomatch '*[\[]' 'a'
+
+# A literal ']' cannot be a class member: the class parser stops at the first
+# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
+# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
+# by a trailing literal ']'. These assertions document the current (buggy)
+# behavior so any future matcher fix is a deliberate, visible change.
+nomatch '*[\[\]]' '['   # not matched, despite the docs
+match '*[\[\]]' ']'     # only via the empty class-match + trailing ']'
+match '*[\[\]]' '[]'    # one of {'[','\'} then the trailing ']'
+nomatch '*[\[\]]' '[]x'
--- a/tests/01_engine-parse.test
+++ b/tests/01_engine-parse.test
@@ -0,0 +1,155 @@
+#!/bin/bash
+#
+
+# Offline HTML parser tests: each section crawls a file:// fixture (no network)
+# and checks which assets the parser captured and how it rewrote the links.
+
+set -u
+
+tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_parse.XXXXXX") || exit 1
+trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM
+
+# a minimal valid 1x1 GIF, reused for every referenced asset
+gif() {
+    printf 'GIF89a\1\0\1\0\200\0\0\0\0\0\377\377\377!\371\4\1\0\0\0\0,\0\0\0\0\1\0\1\0\0\2\2D\1\0;' >"$1"
+}
+
+# crawl <fixture-html> into <out> with link rewriting on, no extra fetching
+crawl() {
+    local html="$1" out="$2"
+    rm -rf "$out"
+    mkdir -p "$out"
+    httrack "file://$html" -O "$out" --quiet --near -n >"$out/.log" 2>&1
+}
+
+# assert a file with the given basename was saved somewhere under <out>
+found() {
+    test -n "$(find "$2" -type f -name "$1" -print -quit)" ||
+        ! echo "FAIL: expected '$1' to be downloaded under $2" || exit 1
+}
+
+# assert NO file with the given basename was saved (e.g. a descriptor token must
+# not be mistaken for a URL)
+notfound() {
+    test -z "$(find "$2" -type f -name "$1" -print -quit)" ||
+        ! echo "FAIL: '$1' should not have been downloaded under $2" || exit 1
+}
+
+# the mirrored fixture page (under "file/"), not HTTrack's own landing index
+savedhtml() {
+    find "$1" -type f -path '*/file/*' -name index.html -print -quit
+}
+
+# srcset on <img> and <source> (#235, #236): every candidate captured and
+# rewritten, descriptors preserved, following attributes left intact.
+site="$tmp/srcset"
+mkdir -p "$site"
+for f in a b c d e f g h i j v dz; do gif "$site/$f.gif"; done
+# unquoted heredoc: $site expands in the absolute-URL candidate
+cat >"$site/index.html" <<EOF
+<html><body>
+<img src="a.gif" srcset="b.gif 480w, c.gif 800w">
+<picture><source srcset="d.gif 1x, c.gif 2x"><img src="a.gif"></picture>
+<img srcset="e.gif, f.gif">
+<img srcset="g.gif 2x" alt="trailing attr after srcset">
+<img srcset="  h.gif   2x ,  i.gif  ">
+<video><source src="v.gif"></video>
+<img srcset="file://$site/j.gif 2x">
+<img srcset="data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x, dz.gif 2x">
+<img srcset="">
+<a href="a.gif">plain link still works</a>
+</body></html>
+EOF
+out="$tmp/srcset-out"
+crawl "$site/index.html" "$out"
+
+# every candidate downloads, incl. unique tails (catches first-only parsing),
+# whitespace-padded (h,i), <source src> (v), absolute (j), post-data: URI (dz)
+for f in a b c d e f g h i j v dz; do found "$f.gif" "$out"; done
+
+# the width/density descriptors are not URLs and must not be fetched
+notfound "480w" "$out"
+notfound "800w" "$out"
+notfound "2x" "$out"
+
+saved=$(savedhtml "$out")
+test -n "$saved" || ! echo "FAIL: saved index.html not found" || exit 1
+
+# descriptors must survive the rewrite (no "b.gif 480w" mangled into a path)
+grep -Eq 'srcset="[^"]*480w[^"]*800w' "$saved" ||
+    ! echo "FAIL: srcset width descriptors lost/reordered in rewritten HTML" || exit 1
+grep -Eq 'srcset="[^"]*1x[^"]*2x' "$saved" ||
+    ! echo "FAIL: srcset density descriptors lost/reordered in rewritten HTML" || exit 1
+# the descriptor-less comma form keeps both candidates and the separator verbatim
+grep -Eq 'srcset="e\.gif, f\.gif"' "$saved" ||
+    ! echo "FAIL: comma-separated srcset without descriptors was altered" || exit 1
+# an attribute following srcset in the same tag must be left intact
+grep -q 'alt="trailing attr after srcset"' "$saved" ||
+    ! echo "FAIL: srcset swallowed a following attribute" || exit 1
+
+# a comma inside a URL (data: URI, CDN path) is part of the URL, not a split
+# point (WHATWG): the data: URI stays verbatim; the next candidate (dz) downloads
+grep -Fq 'data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x' "$saved" ||
+    ! echo "FAIL: a comma inside a data: URI srcset candidate was mis-split" || exit 1
+
+# real rewrite, not passthrough: the absolute file:// candidate becomes local
+# (a flat fixture can't show this; the footer comment's file:// is not in srcset)
+grep -Eq 'srcset="j\.gif 2x"' "$saved" ||
+    ! echo "FAIL: absolute file:// srcset URL was not rewritten to a local link" || exit 1
+! grep -Eq 'srcset="[^"]*file://' "$saved" ||
+    ! echo "FAIL: a file:// URL survived inside a rewritten srcset attribute" || exit 1
+
+# xlink:href (#298) and CSS background-image (#237): detected and rewritten to
+# local. background-image is covered in both an external <style> block and an
+# inline style attribute, with the URL unquoted, double-quoted and single-quoted
+# (the quote style is preserved on rewrite). No-detect attributes (title, alt,
+# ...) are left untouched. Asserted by rewrite (deterministic), not download.
+# data-* (#201/#203) is omitted: its detection is currently nondeterministic and
+# can't be locked yet.
+site2="$tmp/attrs"
+mkdir -p "$site2"
+for f in xl ibg ibgs cex cexd cexs tt; do gif "$site2/$f.gif"; done
+cat >"$site2/index.html" <<EOF
+<html><head><style>
+.a { background-image: url(file://$site2/cex.gif); }
+.b { background-image: url("file://$site2/cexd.gif"); }
+.c { background-image: url('file://$site2/cexs.gif'); }
+</style></head><body>
+<a xlink:href="file://$site2/xl.gif">xlink:href (#298)</a>
+<div style="background-image:url(file://$site2/ibg.gif)"></div>
+<div style="background-image:url('file://$site2/ibgs.gif')"></div>
+<span title="file://$site2/tt.gif">excluded attribute</span>
+</body></html>
+EOF
+out2="$tmp/attrs-out"
+crawl "$site2/index.html" "$out2"
+saved2=$(savedhtml "$out2")
+test -n "$saved2" || ! echo "FAIL: saved attrs page not found" || exit 1
+
+# detected attributes: the absolute URL is rewritten to a local link
+grep -Eq 'xlink:href="xl\.gif"' "$saved2" ||
+    ! echo "FAIL #298: xlink:href not detected/rewritten" || exit 1
+
+# #237 external <style> block, each quoting form, quote style preserved
+grep -Eq 'url\(cex\.gif\)' "$saved2" ||
+    ! echo "FAIL #237: unquoted background-image in <style> not rewritten" || exit 1
+grep -Eq 'url\("cexd\.gif"\)' "$saved2" ||
+    ! echo "FAIL #237: double-quoted background-image in <style> not rewritten" || exit 1
+grep -Eq "url\('cexs\.gif'\)" "$saved2" ||
+    ! echo "FAIL #237: single-quoted background-image in <style> not rewritten" || exit 1
+
+# #237 inline style attribute, unquoted and single-quoted url()
+grep -Eq 'style="background-image:url\(ibg\.gif\)"' "$saved2" ||
+    ! echo "FAIL #237: inline unquoted background-image not rewritten" || exit 1
+grep -Eq "style=\"background-image:url\('ibgs\.gif'\)\"" "$saved2" ||
+    ! echo "FAIL #237: inline single-quoted background-image not rewritten" || exit 1
+
+# no file:// URL survived inside any rewritten background-image
+! grep -Eq 'background-image:[^;"]*file://' "$saved2" ||
+    ! echo "FAIL #237: a file:// URL survived inside a rewritten background-image" || exit 1
+
+# excluded attribute: title is on the no-detect list, so its value is left as-is
+grep -q 'title="file://' "$saved2" ||
+    ! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
+
+exit 0
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -9,6 +9,6 @@ TESTS_ENVIRONMENT += HTTPS_SUPPORT=$(HTTPS_SUPPORT)
 TESTS_ENVIRONMENT += top_srcdir=$(top_srcdir)

 TEST_EXTENSIONS = .test
-TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
+TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test

 CLEANFILES = check-network_sh.cache
--- a/tests/Makefile.in
+++ b/tests/Makefile.in
@@ -472,7 +472,7 @@ TESTS_ENVIRONMENT = PATH=$(top_builddir)/src$(PATH_SEPARATOR)$$PATH \
 	ONLINE_UNIT_TESTS=$(ONLINE_UNIT_TESTS) \
 	HTTPS_SUPPORT=$(HTTPS_SUPPORT) top_srcdir=$(top_srcdir)
 TEST_EXTENSIONS = .test
-TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
+TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
 CLEANFILES = check-network_sh.cache
 all: all-am
Author	SHA1	Message	Date
Xavier Roche	ca810ef7e3	Lock CSS background-image url() rewriting in the parser test background-image is already captured and rewritten through the style/CSS url() path, in both an external <style> block and an inline style attribute, with the URL unquoted, double-quoted or single-quoted. Extend the offline parser test to cover all of these so the behavior stays locked. closes #237	2026-06-14 01:07:42 +02:00
Xavier Roche	1bf90ce294	Merge pull request #326 from xroche/parser/srcset-candidates Capture every srcset candidate URL on <img> and <source>	2026-06-14 00:42:48 +02:00
Xavier Roche	583817dcd4	Capture every srcset candidate URL on <img> and <source> A srcset value is a comma-separated list of "URL descriptor" entries (480w, 2x). HTTrack only had "data-srcset" in the link-detection table and left the plain "srcset" attribute untouched, so responsive images were never mirrored. The parser now captures and rewrites each candidate URL in turn, preserving the descriptors and the commas between entries verbatim, and bounds every new buffer scan against the page end. Candidate splitting follows the WHATWG srcset algorithm: the URL is a run of non-whitespace characters, so a comma inside a URL (a data: URI, a CDN transform path like w_300,c_fill) stays part of the URL and is not mis-split; only a trailing comma or a comma after the descriptor separates candidates. Adds tests/01_engine-parse.test, an offline file:// parser test that asserts each candidate is queued and rewritten (including the comma-in-URL cases), and also locks the existing xlink:href (#298) and inline background-image (#237) handling. closes #235 closes #236	2026-06-14 00:37:20 +02:00
Xavier Roche	5351e96d71	Merge pull request #325 from xroche/docs/rfc2606-example-domains docs: use www.example.com in examples; add html manual regen target	2026-06-13 10:41:24 +02:00
Xavier Roche	a0bf50f6b1	Merge pull request #324 from xroche/test/filter-escape-characterize test: characterize wildcard class escape behavior	2026-06-13 10:17:24 +02:00
Xavier Roche	794404bba2	test: characterize wildcard class escape behavior Add -#0 self-test cases for backslash escapes inside a '[...]' class. They pin two quirks of the current decoder: '\X' matches both X and the backslash itself, and a literal ']' cannot be a class member because the parser stops at the first ']' (escaped or not). The latter is why the filter guide's '[\[\]]' = "the [ or ] character" claim is wrong (#148): it parses as the class {[,\} plus a trailing literal ']'. These tests lock the behavior down so a later matcher fix is a deliberate change. refs #148	2026-06-13 10:15:45 +02:00
Xavier Roche	82d08aaeaf	Merge pull request #323 from xroche/fix/doc-lang-nits docs: fix help-guide placeholders, README clone flag, Ukrainian charset	2026-06-13 10:12:09 +02:00
Xavier Roche	459f06e758	docs: fix help-guide placeholders, README clone flag, Ukrainian charset Escape the literal <URLs>, <FILTERs>, <param>, <filter>, <file> and related placeholders in fcguide.html so they render instead of being swallowed as unknown HTML tags; several were also missing their closing '>'. Use --recurse-submodules in the README clone command. Relabel lang/Ukrainian.txt as windows-1251, which is what its bytes actually are (ISO-8859-5 decodes them to garbage). closes #132, closes #103, closes #167	2026-06-13 10:05:40 +02:00