Compare commits

..

14 Commits

Author SHA1 Message Date
Xavier Roche
5351e96d71 Merge pull request #325 from xroche/docs/rfc2606-example-domains
docs: use www.example.com in examples; add html manual regen target
2026-06-13 10:41:24 +02:00
Xavier Roche
9d39a57576 build: add regen target for html/httrack.man.html
The rendered HTML manual had no regeneration path. Add regen-man-html,
which runs groff's html device over httrack.1, alongside the existing
regen-man target.
2026-06-13 10:38:31 +02:00
Xavier Roche
e3d4ec01f7 docs: use www.example.com in examples instead of www.someweb.com
someweb.com is a real registrable domain; example.com is reserved for
documentation (RFC 2606). Replace it across the HTML guides, the CLI
--help text (htshelp.c) and code comments, then regenerate man/httrack.1
and the rendered html/httrack.man.html. Other placeholder domains are
left alone: they appear inside filter/wildcard examples where the host
interacts with the pattern.
2026-06-13 10:38:31 +02:00
Xavier Roche
a0bf50f6b1 Merge pull request #324 from xroche/test/filter-escape-characterize
test: characterize wildcard class escape behavior
2026-06-13 10:17:24 +02:00
Xavier Roche
794404bba2 test: characterize wildcard class escape behavior
Add -#0 self-test cases for backslash escapes inside a '*[...]' class.
They pin two quirks of the current decoder: '\X' matches both X and the
backslash itself, and a literal ']' cannot be a class member because the
parser stops at the first ']' (escaped or not). The latter is why the
filter guide's '*[\[\]]' = "the [ or ] character" claim is wrong (#148):
it parses as the class {[,\} plus a trailing literal ']'. These tests
lock the behavior down so a later matcher fix is a deliberate change.

refs #148
2026-06-13 10:15:45 +02:00
Xavier Roche
82d08aaeaf Merge pull request #323 from xroche/fix/doc-lang-nits
docs: fix help-guide placeholders, README clone flag, Ukrainian charset
2026-06-13 10:12:09 +02:00
Xavier Roche
459f06e758 docs: fix help-guide placeholders, README clone flag, Ukrainian charset
Escape the literal <URLs>, <FILTERs>, <param>, <filter>, <file> and
related placeholders in fcguide.html so they render instead of being
swallowed as unknown HTML tags; several were also missing their closing
'>'. Use --recurse-submodules in the README clone command. Relabel
lang/Ukrainian.txt as windows-1251, which is what its bytes actually
are (ISO-8859-5 decodes them to garbage).

closes #132, closes #103, closes #167
2026-06-13 10:05:40 +02:00
Xavier Roche
89b25e418b Merge pull request #322 from xroche/test/expand-engine-coverage
test: expand offline engine self-test coverage
2026-06-13 09:58:03 +02:00
Xavier Roche
017c634c53 Merge pull request #321 from xroche/fix/mutex-init-race-297
Fix race in lazy mutex initialization
2026-06-13 09:18:39 +02:00
Xavier Roche
f2b36c4b29 Merge pull request #320 from xroche/fix/lockpath-overflow-183
Fix abort on long log path (lock-file buffer too small)
2026-06-13 09:18:10 +02:00
Xavier Roche
19947efd74 Merge pull request #319 from xroche/fix/footer-xss-165
Fix XSS via unescaped URL in the page footer comment
2026-06-13 09:18:02 +02:00
Xavier Roche
de26ad881a fix: synchronize lazy mutex initialization (closes #297)
Two threads locking the same mutex for the first time could both run the
unsynchronized lazy init, corrupting the underlying pthread mutex and aborting
or deadlocking. Build the object and publish it with a single atomic
compare-and-swap; threads that lose the race free the object they built. This
needs no statically-initializable guard, so it stays valid on Windows 2000.
2026-06-13 09:15:31 +02:00
Xavier Roche
106d34d82c fix: size the lock-file path buffer to the concat buffer (closes #183)
A long log path made the lock-file path overflow the fixed 256-byte n_lock
buffer, tripping the guarded copy and aborting with signal 6. Size n_lock to
the concat-buffer capacity so it holds any path fconcat can produce.

(cherry picked from commit 15144ffd24667712cca2ac0fee96bd355239eff6)
2026-06-12 23:24:20 +02:00
Xavier Roche
61e0b3250b fix: escape angle brackets in the page footer URL (closes #165)
The default footer embeds the page URL inside an HTML comment. A URL
containing "-->" closed the comment and let an attacker inject script into
the mirrored page. Percent-encode < and > before the URL reaches the footer.

(cherry picked from commit 606883229244dc233d16915678e63cfa62000ff0)
2026-06-12 23:24:20 +02:00
18 changed files with 205 additions and 128 deletions

View File

@@ -23,7 +23,7 @@ http://www.httrack.com/
## Compile trunk release
```sh
git clone https://github.com/xroche/httrack.git --recurse
git clone https://github.com/xroche/httrack.git --recurse-submodules
cd httrack
./configure --prefix=$HOME/usr && make -j8 && make install
```

View File

@@ -118,11 +118,11 @@ The command-line version
<br>
<br>
<li>Add the URLs, separated by a blank space</li>
<br><small><tt>httrack www.someweb.com/foo/</tt></small>
<br><small><tt>httrack www.example.com/foo/</tt></small>
<br>
<br>
<li>If you need, add some options (see the <a href="options.html">option list</a>)</li>
<br><small><tt>httrack www.someweb.com/foo/ -O "/webs" -N4 -P proxy.myhost.com:3128</tt></small>
<br><small><tt>httrack www.example.com/foo/ -O "/webs" -N4 -P proxy.myhost.com:3128</tt></small>
<br>
<br>
<li>Launch the command line, and wait until the mirror is finishing</li>

View File

@@ -303,43 +303,43 @@ Okay, let me explain how to precisely control the capture process.<br>
Let's take an example:<br>
<br>
Imagine you want to capture the following site:<br>
<tt>www.someweb.com/gallery/flowers/</tt><br>
<tt>www.example.com/gallery/flowers/</tt><br>
<br>
HTTrack, by default, will capture all links encountered in <tt>www.someweb.com/gallery/flowers/</tt> or in lower directories, like
<tt>www.someweb.com/gallery/flowers/roses/</tt>.<br>
HTTrack, by default, will capture all links encountered in <tt>www.example.com/gallery/flowers/</tt> or in lower directories, like
<tt>www.example.com/gallery/flowers/roses/</tt>.<br>
It will not follow links to other websites, because this behaviour might cause to capture the Web entirely!<br>
It will not follow links located in higher directories, too (for example, <tt>www.someweb.com/gallery/flowers/</tt> itself) because this
It will not follow links located in higher directories, too (for example, <tt>www.example.com/gallery/flowers/</tt> itself) because this
might cause to capture too much data.<br>
<br>
This is the <b><u>default behaviour</b></u> of HTTrack, BUT, of course, if you want, you can tell HTTrack to capture other directorie(s), website(s)!..
<br>
In our example, we might want also to capture all links in <tt>www.someweb.com/gallery/trees/</tt>, and in <tt>www.someweb.com/photos/</tt><br>
In our example, we might want also to capture all links in <tt>www.example.com/gallery/trees/</tt>, and in <tt>www.example.com/photos/</tt><br>
<br>
This can easily done by using filters: go to the Option panel, select the 'Scan rules' tab, and enter this line:
(you can leave a blank space between each rules, instead of entering a carriage return)<br>
<tt>+www.someweb.com/gallery/trees/*<br>
+www.someweb.com/photos/*</tt><br>
<tt>+www.example.com/gallery/trees/*<br>
+www.example.com/photos/*</tt><br>
<br>
This means "accept all links begining with <tt>www.someweb.com/gallery/trees/</tt> and <tt>www.someweb.com/photos/</tt>"
This means "accept all links begining with <tt>www.example.com/gallery/trees/</tt> and <tt>www.example.com/photos/</tt>"
- the <tt>+</tt> means "accept" and the final <tt>*</tt> means "any character will match after the previous ones".
Remember the <tt>*.doc</tt> or <tt>*.zip</tt> encountered when you want to select all files from a certain type on your computer:
it is almost the same here, except the begining "+"<br>
<br>
Now, we might want to exclude all links in <tt>www.someweb.com/gallery/trees/hugetrees/</tt>, because with the previous filter,
Now, we might want to exclude all links in <tt>www.example.com/gallery/trees/hugetrees/</tt>, because with the previous filter,
we accepted too many files. Here again, you can add a filter rule to refuse these links. Modify the previous filters to:<br>
<tt>+www.someweb.com/gallery/trees/*<br>
+www.someweb.com/photos/*<br>
-www.someweb.com/gallery/trees/hugetrees/*</tt><br>
<tt>+www.example.com/gallery/trees/*<br>
+www.example.com/photos/*<br>
-www.example.com/gallery/trees/hugetrees/*</tt><br>
<br>
You have noticed the <tt>-</tt> in the begining of the third rule: this means "refuse links matching the rule"
; and the rule is "any files begining with <tt>www.someweb.com/gallery/trees/hugetrees/</tt><br>
; and the rule is "any files begining with <tt>www.example.com/gallery/trees/hugetrees/</tt><br>
Voila! With these three rules, you have precisely defined what you wanted to capture.<br>
<br>
A more complex example?<br>
<br>
Imagine that you want to accept all jpg files (files with .jpg type) that have "blue" in the name and located in www.someweb.com<br>
<tt>+www.someweb.com/*blue*.jpg</tt><br>
Imagine that you want to accept all jpg files (files with .jpg type) that have "blue" in the name and located in www.example.com<br>
<tt>+www.example.com/*blue*.jpg</tt><br>
<br>
More detailed information can be found <a href="filters.html">here</a>!<br>
<br>
@@ -440,7 +440,7 @@ This will cause a performance loss, but will increase the compatibility with som
<a NAME="QT1">Q: <strong>Only the first page is caught. What's wrong?</a></strong></br>
A: <em>First, check the <tt>hts-log.txt</tt> file (and/or <tt>hts-err.txt</tt> error log file) - this can give you precious information.<br>
The problem can be a website that redirects you to another site (for example, <tt>www.someweb.com</tt> to <tt>public.someweb.com</tt>) :
The problem can be a website that redirects you to another site (for example, <tt>www.example.com</tt> to <tt>public.example.com</tt>) :
in this case, use filters to accept this site<br>
This can be, also, a problem in the HTTrack options (link depth too low, for example)</em>
@@ -485,10 +485,10 @@ You may also want to capture files that are forbidden by default by the <a href=
In these cases, HTTrack does not capture these links automatically, you have to tell it to do so.
<br><br>
<ul><li>Either use the <a href="filters.html">filters</a>.<br>
Example: You are downloading <tt>http://www.someweb.com/foo/</tt> and can not get .jpg images located
in <tt>http://www.someweb.com/bar/</tt> (for example, http://www.someweb.com/bar/blue.jpg)<br>
Then, add the filter rule <tt>+www.someweb.com/bar/*.jpg</tt> to accept all .jpg files from this location<br>
You can, also, accept all files from the /bar folder with <tt>+www.someweb.com/bar/*</tt>, or only html files with <tt>+www.someweb.com/bar/*.html</tt> and so on..<br><br>
Example: You are downloading <tt>http://www.example.com/foo/</tt> and can not get .jpg images located
in <tt>http://www.example.com/bar/</tt> (for example, http://www.example.com/bar/blue.jpg)<br>
Then, add the filter rule <tt>+www.example.com/bar/*.jpg</tt> to accept all .jpg files from this location<br>
You can, also, accept all files from the /bar folder with <tt>+www.example.com/bar/*</tt>, or only html files with <tt>+www.example.com/bar/*.html</tt> and so on..<br><br>
</li><li>
If the problems are related to robots.txt rules, that do not let you access some folders (check in the logs if you are not sure),
you may want to disable the default robots.txt rules in the options. (but only disable this option with great care,
@@ -509,8 +509,8 @@ and rescan the website as described before. HTTrack will be obliged to recatch t
<a NAME="Q1bb">Q: <strong>FTP links are not caught! What's happening?</strong><br>
A: <em>FTP files might be seen as external links, especially if they are located in outside domain. You have either to accept all external links (See the links options, -n option) or
only specific files (see <a href="filters.html">filters</a> section). <br>
Example: You are downloading <tt>http://www.someweb.com/foo/</tt> and can not get ftp://ftp.someweb.com files<br>
Then, add the filter rule <tt>+ftp.someweb.com/*</tt> to accept all files from this (ftp) location<br>
Example: You are downloading <tt>http://www.example.com/foo/</tt> and can not get ftp://ftp.example.com files<br>
Then, add the filter rule <tt>+ftp.example.com/*</tt> to accept all files from this (ftp) location<br>
</em>
<br>
@@ -551,10 +551,10 @@ Note: In some rare cases, duplicate data files can be found when the website red
<a NAME="Q1b2">Q: <strong>I'm downloading too many files! What can I do?</strong><br>
A: <em>This is often the case when you use too large a filter, for example <tt>+*.html</tt>, which asks the
engine to catch all .html pages (even ones on other sites!). In this case, try to use more specific filters, like <tt>+www.someweb.com/specificfolder/*.html</tt><br>
If you still have too many files, use filters to avoid somes files. For example, if you have too many files from www.someweb.com/big/,
use <tt>-www.someweb.com/big/*</tt> to avoid all files from this folder. Remember that the default behaviour of the engine, when
mirroring http://www.someweb.com/big/index.html, is to catch everything in http://www.someweb.com/big/. Filters are your friends,
engine to catch all .html pages (even ones on other sites!). In this case, try to use more specific filters, like <tt>+www.example.com/specificfolder/*.html</tt><br>
If you still have too many files, use filters to avoid somes files. For example, if you have too many files from www.example.com/big/,
use <tt>-www.example.com/big/*</tt> to avoid all files from this folder. Remember that the default behaviour of the engine, when
mirroring http://www.example.com/big/index.html, is to catch everything in http://www.example.com/big/. Filters are your friends,
use them!
</em>
<br>
@@ -562,7 +562,7 @@ use them!
<a NAME="Q1b22">Q: <strong>The engine turns crazy, getting thousands of files! What's going on?</strong><br>
A: <em>This can happen if a loop occurs in some bogus website. For example, a page that refers to itself, with a timestamp
in the query string (e.g. <tt>http://www.someweb.com/foo.asp?ts=2000/10/10,09:45:17:147</tt>).
in the query string (e.g. <tt>http://www.example.com/foo.asp?ts=2000/10/10,09:45:17:147</tt>).
These are really annoying, as it is VERY difficult to detect the loop (the timestamp might be a page number).
To limit the problem: set a recurse level (for example to 6), or avoid the bogus pages (use the filters)
</em>
@@ -571,7 +571,7 @@ To limit the problem: set a recurse level (for example to 6), or avoid the bogus
<a NAME="Q1b3">Q: <strong>File are sometimes renamed (the type is changed)! Why?</strong><br>
A: <em>By default, HTTrack tries to know the type of remote files. This is useful when links like
<tt>http://www.someweb.com/foo.cgi?id=1</tt> can be either HTML pages, images or anything else.
<tt>http://www.example.com/foo.cgi?id=1</tt> can be either HTML pages, images or anything else.
Locally, foo.cgi will not be recognized as an html page, or as an image, by your browser. HTTrack has to rename the file
as foo.html or foo.gif so that it can be viewed.<br>
</em>
@@ -730,8 +730,8 @@ but this is a smart bug..
the domain, too. How to retrieve them?</strong><br>
A: <em>If you just want to retrieve files that can be reached through links, just activate
the 'get file near links' option. But if you want to retrieve html pages too, you can both
use wildcards or explicit addresses ; e.g. add <tt>www.someweb.com/*</tt> to accept all
files and pages from www.someweb.com.<br>
use wildcards or explicit addresses ; e.g. add <tt>www.example.com/*</tt> to accept all
files and pages from www.example.com.<br>
<br>
</em></a><a NAME="Q6">Q: <strong>I have forgotten some URLs of files during a long
mirror.. Should I redo all?</strong><br>
@@ -744,7 +744,7 @@ A: <em>You can use different methods. You can use the 'get files near a link' op
files are in a foreign domain. You can use, too, a filter adress: adding <tt>+*.zip</tt>
in the URL list (or in the filter list) will accept all ZIP files, even if these files are
outside the address. <br>
Example : <tt>httrack www.someweb.com/someaddress.html +*.zip</tt> will allow
Example : <tt>httrack www.example.com/someaddress.html +*.zip</tt> will allow
you to retrieve all zip files that are linked on the site.</em><br>
<br>
</a><a NAME="Q8">Q: <strong>There are ZIP files in a page, but I don't want to transfer
@@ -771,7 +771,7 @@ them on filters!</strong><br>
A: <em>By default, HTTrack retrieves all types of files on authorized links. To avoid
that, define filters like </a><a NAME="Q7"><tt>-* +&lt;website&gt;/*.html
+&lt;website&gt;/*.htm +&lt;website&gt;/ +*.&lt;type wanted&gt;</tt></a><a NAME="Q10"><br>
Example: <tt>httrack www.someweb.com/index.html -* +www.someweb.com/*.htm* +www.someweb.com/*.gif +www.someweb.com/*.jpg</tt><br>
Example: <tt>httrack www.example.com/index.html -* +www.example.com/*.htm* +www.example.com/*.gif +www.example.com/*.jpg</tt><br>
<br>
</em><a NAME="Q10">Q: <strong>When I use filters, I get too many files!</strong><br>
A: <em>You might use too large a filter, for example <tt>*.html</tt> will get ALL html
@@ -779,13 +779,13 @@ files identified. If you want to get all files on an address, use <tt>www.&lt;ad
If you want to get ONLY files defined by your filters, use something like <tt>-* +www.foo.com/*</tt>, because
<tt>+www.foo.com/*</tt> will only accept selected links without forbidding other ones!<br>
There are lots of possibilities using filters.<br>
Example:<tt>httrack www.someweb.com +*.someweb.com/*.htm*</tt><br>
Example:<tt>httrack www.example.com +*.example.com/*.htm*</tt><br>
<br>
</em></a><a NAME="Q11">Q: <strong>When I use filters, I can't access another domain, but I
have filtered it!</strong><br>
A: <em>You may have done a mistake declaring filters, for example <tt>+www.someweb.com/*
-*someweb* </tt></em>will not work, because -*someweb* has an upper priority (because it has
been declared after +www.someweb.com)<br>
A: <em>You may have done a mistake declaring filters, for example <tt>+www.example.com/*
-*example* </tt></em>will not work, because -*example* has an upper priority (because it has
been declared after +www.example.com)<br>
<br>
</a><a NAME="Q12">Q: <strong>Must I add a&nbsp; '+' or '-' in the filter list when I want
to use filters?</strong><br>
@@ -800,7 +800,7 @@ filter list) and accept only html files and the file(s) you want to retrieve (BU
forget to add <tt>+&lt;website&gt;*.html</tt> in the filter list, or pages will not be
scanned! Add the name of files you want with a <tt>*/</tt> before ; i.e. if you want to
retrieve file.zip, add <tt>*/file.zip</tt>)<br>
Example:<tt>httrack www.someweb.com +www.someweb.com/*.htm* +thefileiwant.zip</tt><br>
Example:<tt>httrack www.example.com +www.example.com/*.htm* +thefileiwant.zip</tt><br>
<br>
</em>
@@ -828,7 +828,7 @@ A: <em>Yes. See the URL capture abilities (--catchurl for command-line release,
A: <em>Yes. See the shell system command option (-V option for command-line release)</em>
<br><br><a NAME="QM6">Q: <strong>Can I use username/password authentication on a site?</strong></a><br>
A: <em>Yes. Use user:password@your_url (example: <tt>http://foo:bar@www.someweb.com/private/mybox.html</tt>)</em>
A: <em>Yes. Use user:password@your_url (example: <tt>http://foo:bar@www.example.com/private/mybox.html</tt>)</em>
<br><br><a NAME="QM7">Q: <strong>Can I use username/password authentication for a proxy?</strong></a><br>
A: <em>Yes. Use user:password@your_proxy_name as your proxy name (example: <tt>smith:foo@proxy.mycorp.com</tt>)</em>

View File

@@ -181,17 +181,17 @@ used for some time.
<p align=justify> The rest of this manual is dedicated to detailing what
you find in the help message and providing examples - lots and lots of
examples... Here is what you get (page by page - use <enter> to move to
examples... Here is what you get (page by page - use &lt;enter&gt; to move to
the next page in the real program) if you type 'httrack --help':
<pre>
>httrack --help
HTTrack version 3.03BETAo4 (compiled Jul 1 2001)
usage: ./httrack <URLs [-option] [+<FILTERs>] [-<FILTERs>]
usage: ./httrack &lt;URLs&gt; [-option] [+&lt;FILTERs&gt;] [-&lt;FILTERs&gt;]
with options listed below: (* is the default value)
General options:
O path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path <param>)
O path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path &lt;param&gt;)
%O top path if no path defined (-O path_mirror[,path_cache_and_logfiles])
Action options:
@@ -202,7 +202,7 @@ Action options:
Y mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)
Proxy options:
P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy &lt;param&gt;)
%f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])
Limits options:
@@ -227,7 +227,7 @@ Links options:
%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
n get non-html files 'near' an html file (ex: an image located outside) (--near)
t test all URLs (even forbidden ones) (--test)
%L <file add all URL located in this text file (one URL per line) (--list <param>)
%L &lt;file&gt; add all URL located in this text file (one URL per line) (--list &lt;param&gt;)
Build options:
NN structure type (0 *original structure, 1+: see below) (--structure[=N])
@@ -248,12 +248,12 @@ Spider options:
%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
%B tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
%s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume &lt;param&gt;)
Browser ID:
F user-agent field (-F "user-agent name") (--user-agent <param>)
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
%l preferred language (-%l "fr, en, jp, *" (--language <param>)
F user-agent field (-F "user-agent name") (--user-agent &lt;param&gt;)
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer &lt;param&gt;)
%l preferred language (-%l "fr, en, jp, *" (--language &lt;param&gt;)
Log, index, cache
C create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
@@ -303,8 +303,8 @@ Guru options: (do NOT use)
#! Execute a shell command (-#! "echo hello")
Command-line specific options:
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
%U run the engine with another id when called as root (-%U smith) (--user <param>)
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd &lt;param&gt;)
%U run the engine with another id when called as root (-%U smith) (--user &lt;param&gt;)
Details: Option N
N0 Site-structure (default)
@@ -332,7 +332,7 @@ Details: User-defined option N
%N Name of file, including file type (ex: image.gif)
%t File type (ex: gif)
%p Path [without ending /] (ex: /someimages)
%h Host name (ex: www.someweb.com) (--http-10)
%h Host name (ex: www.example.com) (--http-10)
%M URL MD5 (128 bits, 32 ascii bytes)
%Q query string MD5 (128 bits, 32 ascii bytes)
%q small query string MD5 (16 bits, 4 ascii bytes) (--include-query-string)
@@ -340,14 +340,14 @@ Details: User-defined option N
%[param] param variable in query string
Shortcuts:
--mirror <URLs *make a mirror of site(s) (default)
--get <URLs get the files indicated, do not seek other URLs (-qg)
--list <text file add all URL located in this text file (-%L)
--mirrorlinks <URLs mirror all links in 1st level pages (-Y)
--testlinks <URLs test links in pages (-r1p0C0I0t)
--spider <URLs spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite <URLs identical to --spider
--skeleton <URLs make a mirror, but gets only html files (-p1)
--mirror &lt;URLs&gt; *make a mirror of site(s) (default)
--get &lt;URLs&gt; get the files indicated, do not seek other URLs (-qg)
--list &lt;text file&gt; add all URL located in this text file (-%L)
--mirrorlinks &lt;URLs&gt; mirror all links in 1st level pages (-Y)
--testlinks &lt;URLs&gt; test links in pages (-r1p0C0I0t)
--spider &lt;URLs&gt; spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite &lt;URLs&gt; identical to --spider
--skeleton &lt;URLs&gt; make a mirror, but gets only html files (-p1)
--update update a mirror, without confirmation (-iC2)
--continue continue a mirror, without confirmation (-iC1)
@@ -356,17 +356,17 @@ Shortcuts:
--http10 force http/1.0 requests (-%h)
example: httrack www.someweb.com/bob/
means: mirror site www.someweb.com/bob/ and only this site
example: httrack www.example.com/bob/
means: mirror site www.example.com/bob/ and only this site
example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
example: httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
means: mirror the two sites together (with shared links) and accept any .jpg files on .com sites
example: httrack www.someweb.com/bob/bobby.html +* -r6
example: httrack www.example.com/bob/bobby.html +* -r6
means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web
example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
runs the spider on www.someweb.com/bob/bobby.html using a proxy
example: httrack www.example.com/bob/bobby.html --spider -P proxy.myhost.com:8080
runs the spider on www.example.com/bob/bobby.html using a proxy
example: httrack --update
updates a mirror in the current folder
@@ -387,13 +387,13 @@ with examples... I will be here a while...
<hr>
<h2> Syntax </h2>
<pre><b><i>httrack <URLs> [-option] [+<FILTERs>] [-<FILTERs>] </i></b></pre>
<pre><b><i>httrack &lt;URLs&gt; [-option] [+&lt;FILTERs&gt;] [-&lt;FILTERs&gt;] </i></b></pre>
<p align=justify> The syntax of httrack is quite simple. You specify
the URLs you wish to start the process from (<URLS>), any options you
the URLs you wish to start the process from (&lt;URLS&gt;), any options you
might want to add ([-option], any filters specifying places you should
([+<FILTERs>]) and should not ([-<FILTERs>]) go, and end the command
line by pressing <enter>. Httrack then goes off and does your bidding.
([+&lt;FILTERs&gt;]) and should not ([-&lt;FILTERs&gt;]) go, and end the command
line by pressing &lt;enter&gt;. Httrack then goes off and does your bidding.
For example:
<pre><b><i>
@@ -425,7 +425,7 @@ site. Specifically, the defauls are:
pN priority mode: (* p3) *3 save all files
D *can only go down into subdirs
a *stay on the same address
--mirror <URLs> *make a mirror of site(s) (default)
--mirror &lt;URLs&gt; *make a mirror of site(s) (default)
</pre>
<p align=justify> Here's what all of that means:
@@ -542,7 +542,7 @@ subdirectories of the starting directory to be investigated.
search started are to be collected. Other sites they point to are not
to be imaged.
<pre><b><i> --mirror <URLs> *make a mirror of site(s) (default) </i></b></pre>
<pre><b><i> --mirror &lt;URLs&gt; *make a mirror of site(s) (default) </i></b></pre>
<p align=justify> This indicates that the program should try to make a
copy of the site as well as it can.
@@ -921,7 +921,7 @@ Links options:
%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use)
n get non-html files 'near' an html file (ex: an image located outside)
t test all URLs (even forbidden ones)
%L <file> add all URL located in this text file (one URL per line)
%L &lt;file&gt; add all URL located in this text file (one URL per line)
</i></b></pre>
<p align=justify> The links options allow you to control what links are
@@ -1183,7 +1183,7 @@ Spider options:
%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies)
%B tolerant requests (accept bogus responses on some servers, but not standard!)
%s update hacks: various hacks to limit re-transfers when updating
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume &lt;param&gt;)
</i></b></pre>
<p align=justify> By default, cookies are universally accepted and
@@ -1387,7 +1387,7 @@ web servers leave footprints in the browser.
Browser ID:
F user-agent field (-F "user-agent name")
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]"
%l preferred language (-%l "fr, en, jp, *" (--language <param>)
%l preferred language (-%l "fr, en, jp, *" (--language &lt;param&gt;)
</i></b></pre>
<p align=justify> The user-agent field is used by browsers to determine
@@ -1799,7 +1799,7 @@ based authentication)
<pre><b><i>
Command-line specific options:
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd &lt;param&gt;)
</i></b></pre>
<p align=justify> This option is very nice for a wide array of actions
@@ -1811,7 +1811,7 @@ httrack http://www.shoesizes.com/bob/ -O /tmp/shoesizes -V "/bin/echo \$0"
</i></b></pre>
<pre>
%U run the engine with another id when called as root (-%U smith) (--user <param>)
%U run the engine with another id when called as root (-%U smith) (--user &lt;param&gt;)
</pre>
<p align=justify> Change the UID of the owner when running as r00t
@@ -1856,14 +1856,14 @@ of other options that are commonly used.
<pre><b><i>
Shortcuts:
--mirror <URLs> *make a mirror of site(s) (default)
--get <URLs> get the files indicated, do not seek other URLs (-qg)
--list <text file> add all URL located in this text file (-%L)
--mirrorlinks <URLs> mirror all links in 1st level pages (-Y)
--testlinks <URLs> test links in pages (-r1p0C0I0t)
--spider <URLs> spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite <URLs> identical to --spider
--skeleton <URLs> make a mirror, but gets only html files (-p1)
--mirror &lt;URLs&gt; *make a mirror of site(s) (default)
--get &lt;URLs&gt; get the files indicated, do not seek other URLs (-qg)
--list &lt;text file&gt; add all URL located in this text file (-%L)
--mirrorlinks &lt;URLs&gt; mirror all links in 1st level pages (-Y)
--testlinks &lt;URLs&gt; test links in pages (-r1p0C0I0t)
--spider &lt;URLs&gt; spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite &lt;URLs&gt; identical to --spider
--skeleton &lt;URLs&gt; make a mirror, but gets only html files (-p1)
--update update a mirror, without confirmation (-iC2)
--continue continue a mirror, without confirmation (-iC1)
--catchurl create a temporary proxy to capture an URL or a form post URL
@@ -2019,15 +2019,15 @@ are in reverse priority order. Here's an example:
<td>no characters must be present after</a></td>
</tr>
<tr>
<td> <b> <filter>*[&lt NN]</b></td>
<td> <b> &lt;filter&gt;*[&lt NN]</b></td>
<td> size less than NN Kbytes</td>
</tr>
<tr>
<td> <b> <filter>*[&gt PP]</b></td>
<td> <b> &lt;filter&gt;*[&gt PP]</b></td>
<td> size more than PP Kbytes</td>
</tr>
<tr>
<td> <b> <filter>*[&lt NN &gt PP]</b></td>
<td> <b> &lt;filter&gt;*[&lt NN &gt PP]</b></td>
<td> size less than NN Kbytes and more than PP Kbytes</td>
</tr>
</table>
@@ -2054,8 +2054,8 @@ generated automatically using the interface)
<td>This will accept all zip files in .com addresses</td>
</tr>
<tr>
<td><b>-*someweb*/*.tar*</b></td>
<td>This will refuse all tar (or tar.gz etc.) files in hosts containing someweb</td>
<td><b>-*example*/*.tar*</b></td>
<td>This will refuse all tar (or tar.gz etc.) files in hosts containing example</td>
</tr>
<tr>
<td><b>+*/*somepage*</b></td>

View File

@@ -109,8 +109,8 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
<i>You have to know that once you have defined
starts links, the default mode is to mirror these links - i.e. if one of your start page is
www.someweb.com/test/index.html, all links starting with www.someweb.com/test/ will be
accepted. But links directly in www.someweb.com/.. will not be accepted, however, because
www.example.com/test/index.html, all links starting with www.example.com/test/ will be
accepted. But links directly in www.example.com/.. will not be accepted, however, because
they are in a higher strcuture. This prevent HTTrack from mirroring the whole site. (All
files in structure levels equal or lower than the primary links will be retrieved.)<br>
</i>
@@ -278,8 +278,8 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
<td>This will refuse/accept all zip files in .com addresses</td>
</tr>
<tr>
<td nowrap><tt>*someweb*/*.tar*</tt></td>
<td>This will refuse/accept all tar (or tar.gz etc.) files in hosts containing someweb</td>
<td nowrap><tt>*example*/*.tar*</tt></td>
<td>This will refuse/accept all tar (or tar.gz etc.) files in hosts containing example</td>
</tr>
<tr>
<td nowrap><tt>*/*somepage*</tt></td>
@@ -289,13 +289,13 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
<td nowrap><tt>*.html</tt></td>
<td>This will refuse/accept all html files. <br>
Warning! With this filter you will accept ALL html files, even those in other addresses.
(causing a global (!) web mirror..) Use www.someweb.com/*.html to accept all html files from
(causing a global (!) web mirror..) Use www.example.com/*.html to accept all html files from
a web.</td>
</tr>
<tr>
<td nowrap><tt>*.html*[]</tt></td>
<td>Identical to <tt>*.html</tt>, but the link must not have any supplemental characters
at the end (links with parameters, like <tt>www.someweb.com/index.html?page=10</tt>, will be
at the end (links with parameters, like <tt>www.example.com/index.html?page=10</tt>, will be
refused)</td>
</tr>
</table>

View File

@@ -123,12 +123,12 @@ mirrored site, and resume interrupted downloads.</p>
<p style="margin-left:11%; margin-top: 1em"><b>httrack
www.someweb.com/bob/</b></p>
www.example.com/bob/</b></p>
<p style="margin-left:22%;">mirror site
www.someweb.com/bob/ and only this site</p>
www.example.com/bob/ and only this site</p>
<p style="margin-left:11%;"><b>httrack www.someweb.com/bob/
<p style="margin-left:11%;"><b>httrack www.example.com/bob/
www.anothertest.com/mike/ +*.com/*.jpg <br>
-mime:application/*</b></p>
@@ -137,18 +137,18 @@ www.anothertest.com/mike/ +*.com/*.jpg <br>
sites</p>
<p style="margin-left:11%;"><b>httrack
www.someweb.com/bob/bobby.html +* -r6</b></p>
www.example.com/bob/bobby.html +* -r6</b></p>
<p style="margin-left:22%;">means get all files starting
from bobby.html, with 6 link-depth, and possibility of going
everywhere on the web</p>
<p style="margin-left:11%;"><b>httrack
www.someweb.com/bob/bobby.html --spider -P <br>
www.example.com/bob/bobby.html --spider -P <br>
proxy.myhost.com:8080</b></p>
<p style="margin-left:22%;">runs the spider on
www.someweb.com/bob/bobby.html using a proxy</p>
www.example.com/bob/bobby.html using a proxy</p>
<p style="margin-left:11%;"><b>httrack --update</b></p>
@@ -1877,7 +1877,7 @@ User-defined option N</b> <br>
%N Name of file, including file type (ex: image.gif) <br>
%t File type (ex: gif) <br>
%p Path [without ending /] (ex: /someimages) <br>
%h Host name (ex: www.someweb.com) <br>
%h Host name (ex: www.example.com) <br>
%M URL MD5 (128 bits, 32 ascii bytes) <br>
%Q query string MD5 (128 bits, 32 ascii bytes) <br>
%k full query string <br>

View File

@@ -131,16 +131,16 @@ This is the default primary scanning option, the engine does not go out of domai
d stay on the same principal domain
This option lets the engine go on all sites that exist on the same principal domain.
Example: a link located at www.someweb.com that goes to members.someweb.com will be followed.
Example: a link located at www.example.com that goes to members.example.com will be followed.
l stay on the same location (.com, etc.)
This option lets the engine go on all sites that exist on the same location.
Example: a link located at www.someweb.com that goes to www.anyotherweb.com will be followed.
Example: a link located at www.example.com that goes to www.anyotherweb.com will be followed.
Warning: this is a potentially dangerous option, limit the recurse depth with r option.
e go everywhere on the web
This option lets the engine go on any sites.
Example: a link located at www.someweb.com that goes to www.anyotherweb.org will be followed.
Example: a link located at www.example.com that goes to www.anyotherweb.org will be followed.
Warning: this is a potentially dangerous option, limit the recurse depth with r option.
n get non-html files 'near' an html file (ex: an image located outside)

View File

@@ -117,7 +117,7 @@ h4 { margin: 0; font-weight: bold; font-size: 1.18em; }
<li>HTML Footer</li>
<br><small>Enter here the optionnal text that will be included as a comment in each HTML file to make archiving easier
<br>The string entered is generally an HTML comment (<tt>&lt;!-- HTML comment --&gt;</tt>) with optionnal %s, which will be transformed into a specific string information:
<br>%s #1 : host name (for example, www.someweb.com)
<br>%s #1 : host name (for example, www.example.com)
<br>%s #2 : file name (for example, /index.html)
<br>%s #3 : date of the mirror
<br><b>Example</b>: <tt>&lt;!-- Page mirrored from %s, file %s. Archive date: %s --&gt;</tt>

View File

@@ -7,7 +7,7 @@ uk
LANGUAGE_AUTHOR
Andrij Shevchuk (http://programy.com.ua, http://vic-info.com.ua) \r\n
LANGUAGE_CHARSET
ISO-8859-5
windows-1251
LANGUAGE_WINDOWSID
Ukrainian
OK

View File

@@ -13,3 +13,9 @@ regen-man: makeman.sh $(top_builddir)/src/httrack$(EXEEXT)
README='$(top_srcdir)/README' $(SHELL) $(srcdir)/makeman.sh \
'$(top_builddir)/src/httrack$(EXEEXT)' > $(srcdir)/httrack.1
.PHONY: regen-man
# Render html/httrack.man.html from httrack.1. Needs the groff html device
# (Debian: full "groff" package, not "groff-base"). Run by hand: make -C man regen-man-html
regen-man-html: httrack.1
groff -t -man -Thtml $(srcdir)/httrack.1 > $(top_srcdir)/html/httrack.man.html
.PHONY: regen-man-html

View File

@@ -551,6 +551,12 @@ regen-man: makeman.sh $(top_builddir)/src/httrack$(EXEEXT)
'$(top_builddir)/src/httrack$(EXEEXT)' > $(srcdir)/httrack.1
.PHONY: regen-man
# Render html/httrack.man.html from httrack.1. Needs the groff html device
# (Debian: full "groff" package, not "groff-base"). Run by hand: make -C man regen-man-html
regen-man-html: httrack.1
groff -t -man -Thtml $(srcdir)/httrack.1 > $(top_srcdir)/html/httrack.man.html
.PHONY: regen-man-html
# Tell versions [3.59,3.63) of GNU make to not export all variables.
# Otherwise a system limit (for SysV at least) may be exceeded.
.NOEXPORT:

View File

@@ -2,7 +2,7 @@
.\" groff -man -Tascii httrack.1
.\"
.\" This file is generated by man/makeman.sh; do not edit by hand.
.TH httrack 1 "07 June 2026" "httrack website copier"
.TH httrack 1 "13 June 2026" "httrack website copier"
.SH NAME
httrack \- offline browser : copy websites to a local directory
.SH SYNOPSIS
@@ -98,15 +98,15 @@ httrack \- offline browser : copy websites to a local directory
allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads.
.SH EXAMPLES
.TP
.B httrack www.someweb.com/bob/
mirror site www.someweb.com/bob/ and only this site
.B httrack www.example.com/bob/
mirror site www.example.com/bob/ and only this site
.TP
.B httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg \-mime:application/*
.B httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg \-mime:application/*
mirror the two sites together (with shared links) and accept any .jpg files on .com sites
.TP
.B httrack www.someweb.com/bob/bobby.html +* \-r6
.B httrack www.example.com/bob/bobby.html +* \-r6
.TP
.B httrack www.someweb.com/bob/bobby.html \-\-spider \-P proxy.myhost.com:8080
.B httrack www.example.com/bob/bobby.html \-\-spider \-P proxy.myhost.com:8080
.TP
.B httrack \-\-update
.TP
@@ -411,7 +411,7 @@ File type (ex: gif)
.IP \-%p
Path [without ending /] (ex: /someimages)
.IP \-%h
Host name (ex: www.someweb.com)
Host name (ex: www.example.com)
.IP \-%M
URL MD5 (128 bits, 32 ascii bytes)
.IP \-%Q

View File

@@ -2899,7 +2899,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
}
{
char n_lock[256];
/* Sized to the concat-buffer capacity so it can always hold the lock-file
path produced by fconcat(), even with a long log path (issue #183). */
char n_lock[OPT_GET_BUFF_SIZE(opt)];
// on peut pas avoir un affichage ET un fichier log
// ca sera pour la version 2

View File

@@ -712,7 +712,7 @@ void help(const char *app, int more) {
infomsg(" '%N' Name of file, including file type (ex: image.gif)");
infomsg(" '%t' File type (ex: gif)");
infomsg(" '%p' Path [without ending /] (ex: /someimages)");
infomsg(" '%h' Host name (ex: www.someweb.com)");
infomsg(" '%h' Host name (ex: www.example.com)");
infomsg(" '%M' URL MD5 (128 bits, 32 ascii bytes)");
infomsg(" '%Q' query string MD5 (128 bits, 32 ascii bytes)");
infomsg(" '%k' full query string");
@@ -767,21 +767,21 @@ void help(const char *app, int more) {
infomsg("Details: Option %W: External callbacks prototypes");
infomsg("see htsdefines.h");
infomsg("");
infomsg("example: httrack www.someweb.com/bob/");
infomsg("means: mirror site www.someweb.com/bob/ and only this site");
infomsg("example: httrack www.example.com/bob/");
infomsg("means: mirror site www.example.com/bob/ and only this site");
infomsg("");
infomsg
("example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*");
("example: httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*");
infomsg
("means: mirror the two sites together (with shared links) and accept any .jpg files on .com sites");
infomsg("");
infomsg("example: httrack www.someweb.com/bob/bobby.html +* -r6");
infomsg("example: httrack www.example.com/bob/bobby.html +* -r6");
infomsg
("means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web");
infomsg("");
infomsg
("example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080");
infomsg("runs the spider on www.someweb.com/bob/bobby.html using a proxy");
("example: httrack www.example.com/bob/bobby.html --spider -P proxy.myhost.com:8080");
infomsg("runs the spider on www.example.com/bob/bobby.html using a proxy");
infomsg("");
infomsg("example: httrack --update");
infomsg("updates a mirror in the current folder");

View File

@@ -895,9 +895,9 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
// possibilité non documentée: >post: et >postfile:
// si présence d'un tag >post: alors executer un POST
// exemple: http://www.someweb.com/test.cgi?foo>post:posteddata=10&foo=5
// exemple: http://www.example.com/test.cgi?foo>post:posteddata=10&foo=5
// si présence d'un tag >postfile: alors envoyer en tête brut contenu dans le fichier en question
// exemple: http://www.someweb.com/test.cgi?foo>postfile:post0.txt
// exemple: http://www.example.com/test.cgi?foo>postfile:post0.txt
search_tag = strstr(fil, POSTTOK ":");
if (!search_tag) {
search_tag = strstr(fil, POSTTOK "file:");

View File

@@ -274,6 +274,28 @@ Please visit our Website: http://www.httrack.com
} \
} while(0)
/* Percent-encode the angle brackets of a string so it is safe to embed inside
an HTML comment (the default footer) or any other HTML context. A URL holding
"-->" would otherwise close the footer comment and inject markup (issue #165).
Raw '<' and '>' are not valid URL characters, so encoding them is harmless. */
static const char *html_inline_safe(const char *src, char *dst, size_t size) {
size_t i, j;
for(i = 0, j = 0; src[i] != '\0' && j + 4 < size; i++) {
const char c = src[i];
if (c == '<' || c == '>') {
dst[j++] = '%';
dst[j++] = '3';
dst[j++] = (c == '<') ? 'C' : 'E';
} else {
dst[j++] = c;
}
}
dst[j] = '\0';
return dst;
}
/* Main parser */
int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
char catbuff[CATBUFF_SIZE];
@@ -719,13 +741,16 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (StringNotEmpty(opt->footer)) {
char BIGSTK tempo[1024 + HTS_URLMAXSIZE * 2];
char gmttime[256];
char BIGSTK safe_adr[HTS_URLMAXSIZE * 3 + 4];
char BIGSTK safe_fil[HTS_URLMAXSIZE * 3 + 4];
tempo[0] = '\0';
time_gmt_rfc822(gmttime);
strcatbuff(tempo, eol);
hts_template_format_str(tempo + strlen(tempo), sizeof(tempo) - strlen(tempo),
StringBuff(opt->footer),
jump_identification_const(urladr()), urlfil(), gmttime,
html_inline_safe(jump_identification_const(urladr()), safe_adr, sizeof(safe_adr)),
html_inline_safe(urlfil(), safe_fil, sizeof(safe_fil)), gmttime,
HTTRACK_VERSIONID, /* EOF */ NULL);
strcatbuff(tempo, eol);
//fwrite(tempo,1,strlen(tempo),fp);

View File

@@ -193,7 +193,23 @@ HTSEXT_API void hts_mutexfree(htsmutex * mutex) {
HTSEXT_API void hts_mutexlock(htsmutex * mutex) {
assertf(mutex != NULL);
if (*mutex == HTSMUTEX_INIT) { /* must be initialized */
hts_mutexinit(mutex);
/* Initialize exactly once, even when several threads race to lock the same
mutex for the first time. Build our own object, then publish it with a
single atomic compare-and-swap; the threads that lose the race free the
object they built (issue #297). No static guard is needed, which keeps
this safe on Windows 2000 (no statically-initializable lock there). */
htsmutex created = HTSMUTEX_INIT;
hts_mutexinit(&created);
#ifdef _WIN32
if (InterlockedCompareExchangePointer((PVOID volatile *) mutex, created,
HTSMUTEX_INIT) != HTSMUTEX_INIT)
#else
if (!__sync_bool_compare_and_swap(mutex, HTSMUTEX_INIT, created))
#endif
{
hts_mutexfree(&created);
}
}
assertf(*mutex != NULL);
#ifdef _WIN32

View File

@@ -47,3 +47,25 @@ match '*foo*bar' 'foozbar'
# '?' is the query-string marker, not a single-char wildcard
nomatch 'a?c' 'abc'
# backslash escapes a metacharacter inside a class so it is matched literally.
# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
# both X and '\'. These assertions pin that behavior.
match '*[\*]' '*'
match '*[\*]' "\\"
nomatch '*[\*]' 'a'
match '*[\\]' "\\"
nomatch '*[\\]' 'a'
match '*[\[]' '['
match '*[\[]' "\\"
nomatch '*[\[]' 'a'
# A literal ']' cannot be a class member: the class parser stops at the first
# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x'