Add obfuscated personal email as alternate security contact

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Add contributor governance: CONTRIBUTING, COC, SECURITY, DCO
2026-06-14 22:33:54 +03:00 · 2026-06-14 13:47:15 +02:00 · 2026-06-14 13:41:19 +02:00 · 2026-06-14 00:42:48 +02:00 · 2026-06-14 00:37:20 +02:00 · 2026-06-13 10:41:24 +02:00
22 changed files with 495 additions and 127 deletions
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -61,6 +61,37 @@ jobs:
        if: failure()
        run: cat tests/test-suite.log 2>/dev/null || true

+  dco:
+    name: DCO sign-off
+    # Only checkable on a PR, where we have the base..head commit range.
+    if: github.event_name == 'pull_request'
+    runs-on: ubuntu-24.04
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Every commit must be signed off
+        env:
+          BASE: ${{ github.event.pull_request.base.sha }}
+          HEAD: ${{ github.event.pull_request.head.sha }}
+        run: |
+          set -euo pipefail
+          fail=0
+          # --no-merges: merge commits are GitHub-generated and carry no sign-off.
+          for sha in $(git rev-list --no-merges "$BASE..$HEAD"); do
+            if [ -z "$(git log -1 --format='%(trailers:key=Signed-off-by)' "$sha")" ]; then
+              echo "Missing Signed-off-by: $(git log -1 --format='%h %s' "$sha")"
+              fail=1
+            fi
+          done
+          if [ "$fail" -ne 0 ]; then
+            echo
+            echo "Sign commits with 'git commit -s'; fix a branch with 'git rebase --signoff $BASE'."
+            echo "See CONTRIBUTING.md (Developer Certificate of Origin)."
+            exit 1
+          fi
+
  lint:
    name: lint (shellcheck, shfmt)
    runs-on: ubuntu-24.04
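
Review note on the `dco` job above: the rule it enforces can be sketched as a pure function over commit messages, which is convenient for reasoning about edge cases without a repository. This is a minimal sketch, not the workflow's implementation: the helper names `has_signoff` and `unsigned` are hypothetical, and the matching is a simplification of what Git's `%(trailers:key=Signed-off-by)` does (case-insensitive key matching and "last paragraph only" are assumptions of this sketch).

```python
import re
from typing import Dict, List

# Assumption of this sketch: a trailer is a "Signed-off-by: <something>" line
# in the last paragraph of the message. Git's real trailer parser is richer.
SIGNOFF = re.compile(r"^Signed-off-by:\s*\S.*$", re.IGNORECASE)


def has_signoff(message: str) -> bool:
    """Return True if the last paragraph contains a Signed-off-by line."""
    paragraphs = [p for p in re.split(r"\n\s*\n", message.strip()) if p.strip()]
    if not paragraphs:
        return False
    return any(SIGNOFF.match(line.strip()) for line in paragraphs[-1].splitlines())


def unsigned(commits: Dict[str, str]) -> List[str]:
    """Return the ids of commits whose message lacks a sign-off, in input order."""
    return [sha for sha, message in commits.items() if not has_signoff(message)]


if __name__ == "__main__":
    # Hypothetical commit ids and messages, for illustration only.
    sample = {
        "a1b2c3d": "Fix parser\n\nSigned-off-by: A Dev <dev@example.com>",
        "e4f5a6b": "Tweak docs",
    }
    for sha in unsigned(sample):
        print("Missing Signed-off-by:", sha)
```

As in the workflow, a non-empty result is what would make the check fail; merge commits are assumed to have been filtered out before this point.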
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,83 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
+
+We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
+
+## Our Standards
+
+Examples of behavior that contributes to a positive environment for our community include:
+
+* Demonstrating empathy and kindness toward other people
+* Being respectful of differing opinions, viewpoints, and experiences
+* Giving and gracefully accepting constructive feedback
+* Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
+* Focusing on what is best not just for us as individuals, but for the overall community
+
+Examples of unacceptable behavior include:
+
+* The use of sexualized language or imagery, and sexual attention or advances of any kind
+* Trolling, insulting or derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or email address, without their explicit permission
+* Other conduct which could reasonably be considered inappropriate in a professional setting
+
+## Enforcement Responsibilities
+
+Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
+
+Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
+
+## Scope
+
+This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at <roche@httrack.com>. All complaints will be reviewed and investigated promptly and fairly.
+
+All community leaders are obligated to respect the privacy and security of the reporter of any incident.
+
+## Enforcement Guidelines
+
+Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
+
+### 1. Correction
+
+**Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
+
+**Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
+
+### 2. Warning
+
+**Community Impact**: A violation through a single incident or series of actions.
+
+**Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
+
+### 3. Temporary Ban
+
+**Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.
+
+**Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
+
+### 4. Permanent Ban
+
+**Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
+
+**Consequence**: A permanent ban from any sort of public interaction within the community.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.1, available at [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
+
+Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder][Mozilla CoC].
+
+For answers to common questions about this code of conduct, see the FAQ at [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at [https://www.contributor-covenant.org/translations][translations].
+
+[homepage]: https://www.contributor-covenant.org
+[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
+[Mozilla CoC]: https://github.com/mozilla/diversity
+[FAQ]: https://www.contributor-covenant.org/faq
+[translations]: https://www.contributor-covenant.org/translations
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -0,0 +1,39 @@
+# Contributing to HTTrack
+
+HTTrack is small and old. Keep changes easy to review and safe to merge.
+
+## Pull requests
+
+- One change per PR. Small diffs merge fast.
+- PRs are squash-merged: the title and description become the commit message, so
+  explain *why*.
+- Add or update tests for engine changes (`tests/`), and keep CI green.
+
+## Style
+
+- C, matching nearby code. **Format only the lines you change** (`git
+  clang-format` against the repo `.clang-format`). Never reformat untouched code.
+- Comment the *why*, in English.
+- HTTrack parses hostile input off the network. Check bounds, avoid unchecked
+  copies, and never let an attacker-controlled length drive arithmetic unchecked.
+
+## Sign your work
+
+Every commit needs a `Signed-off-by` line, the
+[DCO](https://developercertificate.org/): `git commit -s`. CI rejects unsigned
+commits; fix a branch with `git rebase --signoff master`.
+
+## AI assistants
+
+Welcome, and nothing to disclose. Two rules:
+
+- **Own every line** as if you wrote it. Can't explain it in review? Not ready.
+- **Don't push your work onto reviewers.** A raw generated patch a maintainer has
+  to vet from scratch will be closed.
+
+The sign-off covers AI-assisted code too.
+
+## Bugs
+
+Open an issue with the version, OS, command used, and expected vs actual result.
+For security issues see [SECURITY.md](SECURITY.md), not a public issue.
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ http://www.httrack.com/

 ## Compile trunk release
 ```sh
-git clone https://github.com/xroche/httrack.git --recurse
+git clone https://github.com/xroche/httrack.git --recurse-submodules
 cd httrack
 ./configure --prefix=$HOME/usr && make -j8 && make install
 ```
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -0,0 +1,23 @@
+# Security Policy
+
+## Reporting
+
+Report privately, not in a public issue or PR: use GitHub
+[private advisories](https://github.com/xroche/httrack/security/advisories/new)
+or email <roche@httrack.com> (alternate: `xroche at gmail dot com`).
+
+Include the HTTrack version and platform, a concrete reproduction (command line,
+a sample page or server response, or a small proof of concept), and what an
+attacker gains. We'll acknowledge it and keep you posted. Please allow time for a
+release before disclosing publicly.
+
+## Supported versions
+
+Fixes land on `master` and ship in the next release; older releases aren't
+maintained. Confirm against current `master` when you can.
+
+## AI-assisted findings
+
+Scanners and LLMs are fine, but only send reports you have verified yourself. A
+confirmed, reproducible issue is worth our time; a plausible one that doesn't
+reproduce is not, and will be closed. If a report is AI-assisted, say so.
--- a/html/cmddoc.html
+++ b/html/cmddoc.html
@@ -118,11 +118,11 @@ The command-line version
      <br>
      <br>
    <li>Add the URLs, separated by a blank space</li>
-      <br><small><tt>httrack www.someweb.com/foo/</tt></small>
+      <br><small><tt>httrack www.example.com/foo/</tt></small>
      <br>
      <br>
    <li>If you need, add some options (see the <a href="options.html">option list</a>)</li>
-      <br><small><tt>httrack www.someweb.com/foo/ -O "/webs" -N4 -P proxy.myhost.com:3128</tt></small>
+      <br><small><tt>httrack www.example.com/foo/ -O "/webs" -N4 -P proxy.myhost.com:3128</tt></small>
      <br>
      <br>
    <li>Launch the command line, and wait until the mirror is finishing</li>
--- a/html/faq.html
+++ b/html/faq.html
@@ -303,43 +303,43 @@ Okay, let me explain how to precisely control the capture process.<br>
 Let's take an example:<br>
 <br>
 Imagine you want to capture the following site:<br>
-<tt>www.someweb.com/gallery/flowers/</tt><br>
+<tt>www.example.com/gallery/flowers/</tt><br>
 <br>
-HTTrack, by default, will capture all links encountered in <tt>www.someweb.com/gallery/flowers/</tt> or in lower directories, like
-<tt>www.someweb.com/gallery/flowers/roses/</tt>.<br>
+HTTrack, by default, will capture all links encountered in <tt>www.example.com/gallery/flowers/</tt> or in lower directories, like
+<tt>www.example.com/gallery/flowers/roses/</tt>.<br>
 It will not follow links to other websites, because this behaviour might cause to capture the Web entirely!<br>
-It will not follow links located in higher directories, too (for example, <tt>www.someweb.com/gallery/flowers/</tt> itself) because this 
+It will not follow links located in higher directories, too (for example, <tt>www.example.com/gallery/flowers/</tt> itself) because this 
 might cause to capture too much data.<br>
 <br>
 This is the <b><u>default behaviour</b></u> of HTTrack, BUT, of course, if you want, you can tell HTTrack to capture other directorie(s), website(s)!..
 <br>
-In our example, we might want also to capture all links in <tt>www.someweb.com/gallery/trees/</tt>, and in <tt>www.someweb.com/photos/</tt><br>
+In our example, we might want also to capture all links in <tt>www.example.com/gallery/trees/</tt>, and in <tt>www.example.com/photos/</tt><br>
 <br>
 This can easily done by using filters: go to the Option panel, select the 'Scan rules' tab, and enter this line:
 (you can leave a blank space between each rules, instead of entering a carriage return)<br>
-<tt>+www.someweb.com/gallery/trees/*<br>
-+www.someweb.com/photos/*</tt><br>
+<tt>+www.example.com/gallery/trees/*<br>
++www.example.com/photos/*</tt><br>
 <br>
-This means "accept all links begining with <tt>www.someweb.com/gallery/trees/</tt> and <tt>www.someweb.com/photos/</tt>" 
+This means "accept all links begining with <tt>www.example.com/gallery/trees/</tt> and <tt>www.example.com/photos/</tt>" 
 - the <tt>+</tt> means "accept" and the final <tt>*</tt> means "any character will match after the previous ones".
 Remember the <tt>*.doc</tt> or <tt>*.zip</tt> encountered when you want to select all files from a certain type on your computer: 
 it is almost the same here, except the begining "+"<br>
 <br>
-Now, we might want to exclude all links in <tt>www.someweb.com/gallery/trees/hugetrees/</tt>, because with the previous filter,
+Now, we might want to exclude all links in <tt>www.example.com/gallery/trees/hugetrees/</tt>, because with the previous filter,
 we accepted too many files. Here again, you can add a filter rule to refuse these links. Modify the previous filters to:<br>
-<tt>+www.someweb.com/gallery/trees/*<br>
-+www.someweb.com/photos/*<br>
--www.someweb.com/gallery/trees/hugetrees/*</tt><br>
+<tt>+www.example.com/gallery/trees/*<br>
++www.example.com/photos/*<br>
+-www.example.com/gallery/trees/hugetrees/*</tt><br>
 <br>
 You have noticed the <tt>-</tt> in the begining of the third rule: this means "refuse links matching the rule" 
-; and the rule is "any files begining with <tt>www.someweb.com/gallery/trees/hugetrees/</tt><br>
+; and the rule is "any files begining with <tt>www.example.com/gallery/trees/hugetrees/</tt><br>

 Voila! With these three rules, you have precisely defined what you wanted to capture.<br>
 <br>
 A more complex example?<br>
 <br>
-Imagine that you want to accept all jpg files (files with .jpg type) that have "blue" in the name and located in www.someweb.com<br>
-<tt>+www.someweb.com/*blue*.jpg</tt><br>
+Imagine that you want to accept all jpg files (files with .jpg type) that have "blue" in the name and located in www.example.com<br>
+<tt>+www.example.com/*blue*.jpg</tt><br>
 <br>
 More detailed information can be found <a href="filters.html">here</a>!<br>
 <br>
@@ -440,7 +440,7 @@ This will cause a performance loss, but will increase the compatibility with som

 <a NAME="QT1">Q: <strong>Only the first page is caught. What's wrong?</a></strong></br>
 A: <em>First, check the <tt>hts-log.txt</tt> file (and/or <tt>hts-err.txt</tt> error log file) - this can give you precious information.<br>
-The problem can be a website that redirects you to another site (for example, <tt>www.someweb.com</tt> to <tt>public.someweb.com</tt>) : 
+The problem can be a website that redirects you to another site (for example, <tt>www.example.com</tt> to <tt>public.example.com</tt>) : 
 in this case, use filters to accept this site<br>
 This can be, also, a problem in the HTTrack options (link depth too low, for example)</em>

@@ -485,10 +485,10 @@ You may also want to capture files that are forbidden by default by the <a href=
 In these cases, HTTrack does not capture these links automatically, you have to tell it to do so. 
 <br><br>
 <ul><li>Either use the <a href="filters.html">filters</a>.<br>
-Example: You are downloading <tt>http://www.someweb.com/foo/</tt> and can not get .jpg images located
-in <tt>http://www.someweb.com/bar/</tt> (for example, http://www.someweb.com/bar/blue.jpg)<br>
-Then, add the filter rule <tt>+www.someweb.com/bar/*.jpg</tt> to accept all .jpg files from this location<br>
-You can, also, accept all files from the /bar folder with <tt>+www.someweb.com/bar/*</tt>, or only html files with <tt>+www.someweb.com/bar/*.html</tt> and so on..<br><br>
+Example: You are downloading <tt>http://www.example.com/foo/</tt> and can not get .jpg images located
+in <tt>http://www.example.com/bar/</tt> (for example, http://www.example.com/bar/blue.jpg)<br>
+Then, add the filter rule <tt>+www.example.com/bar/*.jpg</tt> to accept all .jpg files from this location<br>
+You can, also, accept all files from the /bar folder with <tt>+www.example.com/bar/*</tt>, or only html files with <tt>+www.example.com/bar/*.html</tt> and so on..<br><br>
 </li><li>
 If the problems are related to robots.txt rules, that do not let you access some folders (check in the logs if you are not sure),
 you may want to disable the default robots.txt rules in the options. (but only disable this option with great care, 
@@ -509,8 +509,8 @@ and rescan the website as described before. HTTrack will be obliged to recatch t
 <a NAME="Q1bb">Q: <strong>FTP links are not caught! What's happening?</strong><br>
 A: <em>FTP files might be seen as external links, especially if they are located in outside domain. You have either to accept all external links (See the links options, -n option) or
 only specific files (see <a href="filters.html">filters</a> section). <br>
-Example: You are downloading <tt>http://www.someweb.com/foo/</tt> and can not get ftp://ftp.someweb.com files<br>
-Then, add the filter rule <tt>+ftp.someweb.com/*</tt> to accept all files from this (ftp) location<br>
+Example: You are downloading <tt>http://www.example.com/foo/</tt> and can not get ftp://ftp.example.com files<br>
+Then, add the filter rule <tt>+ftp.example.com/*</tt> to accept all files from this (ftp) location<br>
 </em>
 <br>

@@ -551,10 +551,10 @@ Note: In some rare cases, duplicate data files can be found when the website red

 <a NAME="Q1b2">Q: <strong>I'm downloading too many files! What can I do?</strong><br>
 A: <em>This is often the case when you use too large a filter, for example <tt>+*.html</tt>, which asks the
-engine to catch all .html pages (even ones on other sites!). In this case, try to use more specific filters, like <tt>+www.someweb.com/specificfolder/*.html</tt><br>
-If you still have too many files, use filters to avoid somes files. For example, if you have too many files from www.someweb.com/big/, 
-use <tt>-www.someweb.com/big/*</tt> to avoid all files from this folder. Remember that the default behaviour of the engine, when
-mirroring http://www.someweb.com/big/index.html, is to catch everything in http://www.someweb.com/big/. Filters are your friends,
+engine to catch all .html pages (even ones on other sites!). In this case, try to use more specific filters, like <tt>+www.example.com/specificfolder/*.html</tt><br>
+If you still have too many files, use filters to avoid somes files. For example, if you have too many files from www.example.com/big/, 
+use <tt>-www.example.com/big/*</tt> to avoid all files from this folder. Remember that the default behaviour of the engine, when
+mirroring http://www.example.com/big/index.html, is to catch everything in http://www.example.com/big/. Filters are your friends,
 use them!
 </em>
 <br>
@@ -562,7 +562,7 @@ use them!

 <a NAME="Q1b22">Q: <strong>The engine turns crazy, getting thousands of files! What's going on?</strong><br>
 A: <em>This can happen if a loop occurs in some bogus website. For example, a page that refers to itself, with a timestamp
-in the query string (e.g. <tt>http://www.someweb.com/foo.asp?ts=2000/10/10,09:45:17:147</tt>). 
+in the query string (e.g. <tt>http://www.example.com/foo.asp?ts=2000/10/10,09:45:17:147</tt>). 
 These are really annoying, as it is VERY difficult to detect the loop (the timestamp might be a page number).
 To limit the problem: set a recurse level (for example to 6), or avoid the bogus pages (use the filters)
 </em>
@@ -571,7 +571,7 @@ To limit the problem: set a recurse level (for example to 6), or avoid the bogus

 <a NAME="Q1b3">Q: <strong>File are sometimes renamed (the type is changed)! Why?</strong><br>
 A: <em>By default, HTTrack tries to know the type of remote files. This is useful when links like
-<tt>http://www.someweb.com/foo.cgi?id=1</tt> can be either HTML pages, images or anything else. 
+<tt>http://www.example.com/foo.cgi?id=1</tt> can be either HTML pages, images or anything else. 
 Locally, foo.cgi will not be recognized as an html page, or as an image, by your browser. HTTrack has to rename the file
 as foo.html or foo.gif so that it can be viewed.<br>
 </em>
@@ -730,8 +730,8 @@ but this is a smart bug..
 the domain, too. How to retrieve them?</strong><br>
 A: <em>If you just want to retrieve files that can be reached through links, just activate
 the 'get file near links' option. But if you want to retrieve html pages too, you can both
-use wildcards or explicit addresses ; e.g. add <tt>www.someweb.com/*</tt> to accept all
-files and pages from www.someweb.com.<br>
+use wildcards or explicit addresses ; e.g. add <tt>www.example.com/*</tt> to accept all
+files and pages from www.example.com.<br>
 <br>
 </em></a><a NAME="Q6">Q: <strong>I have forgotten some URLs of files during a long
 mirror.. Should I redo all?</strong><br>
@@ -744,7 +744,7 @@ A: <em>You can use different methods. You can use the 'get files near a link' op
 files are in a foreign domain. You can use, too, a filter adress: adding <tt>+*.zip</tt>
 in the URL list (or in the filter list) will accept all ZIP files, even if these files are
 outside the address. <br>
-Example : <tt>httrack www.someweb.com/someaddress.html +*.zip</tt> will allow
+Example : <tt>httrack www.example.com/someaddress.html +*.zip</tt> will allow
 you to retrieve all zip files that are linked on the site.</em><br>
 <br>
 </a><a NAME="Q8">Q: <strong>There are ZIP files in a page, but I don't want to transfer
@@ -771,7 +771,7 @@ them on filters!</strong><br>
 A: <em>By default, HTTrack retrieves all types of files on authorized links. To avoid
 that, define filters like </a><a NAME="Q7"><tt>-* +&lt;website&gt;/*.html
 +&lt;website&gt;/*.htm +&lt;website&gt;/ +*.&lt;type wanted&gt;</tt></a><a NAME="Q10"><br>
-Example: <tt>httrack www.someweb.com/index.html -* +www.someweb.com/*.htm* +www.someweb.com/*.gif +www.someweb.com/*.jpg</tt><br>
+Example: <tt>httrack www.example.com/index.html -* +www.example.com/*.htm* +www.example.com/*.gif +www.example.com/*.jpg</tt><br>
 <br>
 </em><a NAME="Q10">Q: <strong>When I use filters, I get too many files!</strong><br>
 A: <em>You might use too large a filter, for example <tt>*.html</tt> will get ALL html
@@ -779,13 +779,13 @@ files identified. If you want to get all files on an address, use <tt>www.&lt;ad
 If you want to get ONLY files defined by your filters, use something like <tt>-* +www.foo.com/*</tt>, because 
 <tt>+www.foo.com/*</tt> will only accept selected links without forbidding other ones!<br>
 There are lots of possibilities using filters.<br>
-Example:<tt>httrack www.someweb.com +*.someweb.com/*.htm*</tt><br>
+Example:<tt>httrack www.example.com +*.example.com/*.htm*</tt><br>
 <br>
 </em></a><a NAME="Q11">Q: <strong>When I use filters, I can't access another domain, but I
 have filtered it!</strong><br>
-A: <em>You may have done a mistake declaring filters, for example <tt>+www.someweb.com/*
--*someweb* </tt></em>will not work, because -*someweb* has an upper priority (because it has
-been declared after +www.someweb.com)<br>
+A: <em>You may have done a mistake declaring filters, for example <tt>+www.example.com/*
+-*example* </tt></em>will not work, because -*example* has an upper priority (because it has
+been declared after +www.example.com)<br>
 <br>
 </a><a NAME="Q12">Q: <strong>Must I add a&nbsp; '+' or '-' in the filter list when I want
 to use filters?</strong><br>
@@ -800,7 +800,7 @@ filter list) and accept only html files and the file(s) you want to retrieve (BU
 forget to add <tt>+&lt;website&gt;*.html</tt> in the filter list, or pages will not be
 scanned! Add the name of files you want with a <tt>*/</tt> before ; i.e. if you want to
 retrieve file.zip, add <tt>*/file.zip</tt>)<br>
-Example:<tt>httrack www.someweb.com +www.someweb.com/*.htm* +thefileiwant.zip</tt><br>
+Example:<tt>httrack www.example.com +www.example.com/*.htm* +thefileiwant.zip</tt><br>
 <br>
 </em>

@@ -828,7 +828,7 @@ A: <em>Yes. See the URL capture abilities (--catchurl for command-line release,
 A: <em>Yes. See the shell system command option (-V option for command-line release)</em>

 <br><br><a NAME="QM6">Q: <strong>Can I use username/password authentication on a site?</strong></a><br>
-A: <em>Yes. Use user:password@your_url (example: <tt>http://foo:bar@www.someweb.com/private/mybox.html</tt>)</em>
+A: <em>Yes. Use user:password@your_url (example: <tt>http://foo:bar@www.example.com/private/mybox.html</tt>)</em>

 <br><br><a NAME="QM7">Q: <strong>Can I use username/password authentication for a proxy?</strong></a><br>
 A: <em>Yes. Use user:password@your_proxy_name as your proxy name (example: <tt>smith:foo@proxy.mycorp.com</tt>)</em>
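
Review note on the faq.html hunks above: the scan-rule behaviour the FAQ describes (`+` accepts, `-` refuses, `*` matches any run of characters, and a rule declared later takes priority over an earlier one) can be modelled in a few lines. This is a simplified sketch under stated assumptions, not HTTrack's engine: the function name `decide` is hypothetical, and Python's `fnmatchcase` stands in for HTTrack's own wildcard matching, which supports more syntax than shown here.

```python
from fnmatch import fnmatchcase
from typing import List, Optional


def decide(url: str, rules: List[str]) -> Optional[bool]:
    """Model of the FAQ's filter semantics.

    Returns True (accept), False (refuse), or None when no rule matches,
    in which case the engine's default behaviour would apply.
    """
    verdict: Optional[bool] = None
    for rule in rules:
        sign, pattern = rule[:1], rule[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with '+' or '-': {rule!r}")
        if fnmatchcase(url, pattern):
            # Later rules override earlier ones, as in the Q11 answer.
            verdict = sign == "+"
    return verdict


if __name__ == "__main__":
    rules = [
        "+www.example.com/gallery/trees/*",
        "+www.example.com/photos/*",
        "-www.example.com/gallery/trees/hugetrees/*",
    ]
    for url in (
        "www.example.com/gallery/trees/oak.html",
        "www.example.com/gallery/trees/hugetrees/big.jpg",
        "www.example.com/gallery/flowers/",
    ):
        print(url, "->", decide(url, rules))
```

The same model reproduces the Q11 pitfall: with `+www.example.com/*` followed by `-*example*`, every URL on the host is refused because the refusing rule comes last.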
--- a/html/fcguide.html
+++ b/html/fcguide.html
@@ -181,17 +181,17 @@ used for some time.

 <p align=justify> The rest of this manual is dedicated to detailing what
 you find in the help message and providing examples - lots and lots of
-examples...  Here is what you get (page by page - use <enter> to move to
+examples...  Here is what you get (page by page - use &lt;enter&gt; to move to
 the next page in the real program) if you type 'httrack --help':

 <pre>
 >httrack --help
 HTTrack version 3.03BETAo4 (compiled Jul  1 2001)
-	usage: ./httrack <URLs [-option] [+<FILTERs>] [-<FILTERs>]
+	usage: ./httrack &lt;URLs&gt; [-option] [+&lt;FILTERs&gt;] [-&lt;FILTERs&gt;]
 	with options listed below: (* is the default value)

 General options:
-  O  path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path <param>)
+  O  path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path &lt;param&gt;)
 %O  top path if no path defined (-O path_mirror[,path_cache_and_logfiles])

 Action options:
@@ -202,7 +202,7 @@ Action options:
  Y   mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)

 Proxy options:
-  P  proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
+  P  proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy &lt;param&gt;)
 %f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])

 Limits options:
@@ -227,7 +227,7 @@ Links options:
 %P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
  n  get non-html files 'near' an html file (ex: an image located outside) (--near)
  t  test all URLs (even forbidden ones) (--test)
- %L <file add all URL located in this text file (one URL per line) (--list <param>)
+ %L &lt;file&gt; add all URL located in this text file (one URL per line) (--list &lt;param&gt;)

 Build options:
  NN structure type (0 *original structure, 1+: see below) (--structure[=N])
@@ -248,12 +248,12 @@ Spider options:
 %h  force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
 %B  tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
 %s  update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
- %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
+ %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume &lt;param&gt;)

 Browser ID:
-  F  user-agent field (-F "user-agent name") (--user-agent <param>)
- %F  footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
- %l  preferred language (-%l "fr, en, jp, *" (--language <param>)
+  F  user-agent field (-F "user-agent name") (--user-agent &lt;param&gt;)
+ %F  footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer &lt;param&gt;)
+ %l  preferred language (-%l "fr, en, jp, *" (--language &lt;param&gt;)

 Log, index, cache
  C  create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
@@ -303,8 +303,8 @@ Guru options: (do NOT use)
 #!  Execute a shell command (-#! "echo hello")

 Command-line specific options:
-  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
- %U run the engine with another id when called as root (-%U smith) (--user <param>)
+  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd &lt;param&gt;)
+ %U run the engine with another id when called as root (-%U smith) (--user &lt;param&gt;)

 Details: Option N
  N0 Site-structure (default)
@@ -332,7 +332,7 @@ Details: User-defined option N
  %N Name of file, including file type (ex: image.gif)
  %t File type (ex: gif)
  %p Path [without ending /] (ex: /someimages)
-  %h Host name (ex: www.someweb.com) (--http-10)
+  %h Host name (ex: www.example.com) (--http-10)
  %M URL MD5 (128 bits, 32 ascii bytes)
  %Q query string MD5 (128 bits, 32 ascii bytes)
  %q small query string MD5 (16 bits, 4 ascii bytes) (--include-query-string)
@@ -340,14 +340,14 @@ Details: User-defined option N
  %[param] param variable in query string

 Shortcuts:
---mirror      <URLs *make a mirror of site(s) (default)
---get         <URLs  get the files indicated, do not seek other URLs (-qg)
---list   <text file  add all URL located in this text file (-%L)
---mirrorlinks <URLs  mirror all links in 1st level pages (-Y)
---testlinks   <URLs  test links in pages (-r1p0C0I0t)
---spider      <URLs  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
---testsite    <URLs  identical to --spider
---skeleton    <URLs  make a mirror, but gets only html files (-p1)
+--mirror      &lt;URLs&gt; *make a mirror of site(s) (default)
+--get         &lt;URLs&gt;  get the files indicated, do not seek other URLs (-qg)
+--list   &lt;text file&gt;  add all URL located in this text file (-%L)
+--mirrorlinks &lt;URLs&gt;  mirror all links in 1st level pages (-Y)
+--testlinks   &lt;URLs&gt;  test links in pages (-r1p0C0I0t)
+--spider      &lt;URLs&gt;  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
+--testsite    &lt;URLs&gt;  identical to --spider
+--skeleton    &lt;URLs&gt;  make a mirror, but gets only html files (-p1)
 --update              update a mirror, without confirmation (-iC2)
 --continue            continue a mirror, without confirmation (-iC1)

@@ -356,17 +356,17 @@ Shortcuts:

 --http10              force http/1.0 requests (-%h)

-example: httrack www.someweb.com/bob/
-means:   mirror site www.someweb.com/bob/ and only this site
+example: httrack www.example.com/bob/
+means:   mirror site www.example.com/bob/ and only this site

-example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
+example: httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
 means:   mirror the two sites together (with shared links) and accept any .jpg files on .com sites

-example: httrack www.someweb.com/bob/bobby.html +* -r6
+example: httrack www.example.com/bob/bobby.html +* -r6
 means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web

-example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
-runs the spider on www.someweb.com/bob/bobby.html using a proxy
+example: httrack www.example.com/bob/bobby.html --spider -P proxy.myhost.com:8080
+runs the spider on www.example.com/bob/bobby.html using a proxy

 example: httrack --update
 updates a mirror in the current folder
@@ -387,13 +387,13 @@ with examples... I will be here a while...
 <hr>
 <h2> Syntax </h2>

-<pre><b><i>httrack <URLs> [-option] [+<FILTERs>] [-<FILTERs>] </i></b></pre>
+<pre><b><i>httrack &lt;URLs&gt; [-option] [+&lt;FILTERs&gt;] [-&lt;FILTERs&gt;] </i></b></pre>

 <p align=justify> The syntax of httrack is quite simple.  You specify
-the URLs you wish to start the process from (<URLS>), any options you
+the URLs you wish to start the process from (&lt;URLS&gt;), any options you
 might want to add ([-option], any filters specifying places you should
-([+<FILTERs>]) and should not ([-<FILTERs>]) go, and end the command
-line by pressing <enter>.  Httrack then goes off and does your bidding.
+([+&lt;FILTERs&gt;]) and should not ([-&lt;FILTERs&gt;]) go, and end the command
+line by pressing &lt;enter&gt;.  Httrack then goes off and does your bidding.
 For example:

 <pre><b><i>
@@ -425,7 +425,7 @@ site. Specifically, the defauls are:
  pN priority mode: (* p3)  *3 save all files
  D  *can only go down into subdirs
  a  *stay on the same address
-  --mirror      <URLs> *make a mirror of site(s) (default)
+  --mirror      &lt;URLs&gt; *make a mirror of site(s) (default)
 </pre>

 <p align=justify> Here's what all of that means:
@@ -542,7 +542,7 @@ subdirectories of the starting directory to be investigated.
 search started are to be collected.  Other sites they point to are not
 to be imaged. 

-<pre><b><i>  --mirror      <URLs> *make a mirror of site(s) (default) </i></b></pre>
+<pre><b><i>  --mirror      &lt;URLs&gt; *make a mirror of site(s) (default) </i></b></pre>

 <p align=justify> This indicates that the program should try to make a
 copy of the site as well as it can. 
@@ -921,7 +921,7 @@ Links options:
 %P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use)
  n   get non-html files 'near' an html file (ex: an image located outside)
  t   test all URLs (even forbidden ones)
- %L <file> add all URL located in this text file (one URL per line)
+ %L &lt;file&gt; add all URL located in this text file (one URL per line)
 </i></b></pre>

 <p align=justify> The links options allow you to control what links are
@@ -1183,7 +1183,7 @@ Spider options:
 %h  force HTTP/1.0 requests (reduce update features, only for old servers or proxies)
 %B  tolerant requests (accept bogus responses on some servers, but not standard!)
 %s  update hacks: various hacks to limit re-transfers when updating
- %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
+ %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume &lt;param&gt;)
 </i></b></pre>

 <p align=justify> By default, cookies are universally accepted and
@@ -1387,7 +1387,7 @@ web servers leave footprints in the browser.
 Browser ID:
  F  user-agent field (-F "user-agent name")
 %F  footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]"
- %l  preferred language (-%l "fr, en, jp, *" (--language <param>)
+ %l  preferred language (-%l "fr, en, jp, *" (--language &lt;param&gt;)
 </i></b></pre>

 <p align=justify> The user-agent field is used by browsers to determine
@@ -1799,7 +1799,7 @@ based authentication)

 <pre><b><i>
 Command-line specific options:
-  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
+  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd &lt;param&gt;)
 </i></b></pre>

 <p align=justify> This option is very nice for a wide array of actions
@@ -1811,7 +1811,7 @@ httrack http://www.shoesizes.com/bob/ -O /tmp/shoesizes -V "/bin/echo \$0"
 </i></b></pre>

 <pre>
- %U run the engine with another id when called as root (-%U smith) (--user <param>)
+ %U run the engine with another id when called as root (-%U smith) (--user &lt;param&gt;)
 </pre>

 <p align=justify> Change the UID of the owner when running as r00t
@@ -1856,14 +1856,14 @@ of other options that are commonly used.

 <pre><b><i>
 Shortcuts:
--mirror      <URLs> *make a mirror of site(s) (default)
--get         <URLs>  get the files indicated, do not seek other URLs (-qg)
--list   <text file>  add all URL located in this text file (-%L)
--mirrorlinks <URLs>  mirror all links in 1st level pages (-Y)
--testlinks   <URLs>  test links in pages (-r1p0C0I0t)
--spider      <URLs>  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite    <URLs>  identical to --spider
--skeleton    <URLs>  make a mirror, but gets only html files (-p1)
+--mirror      &lt;URLs&gt; *make a mirror of site(s) (default)
+--get         &lt;URLs&gt;  get the files indicated, do not seek other URLs (-qg)
+--list   &lt;text file&gt;  add all URL located in this text file (-%L)
+--mirrorlinks &lt;URLs&gt;  mirror all links in 1st level pages (-Y)
+--testlinks   &lt;URLs&gt;  test links in pages (-r1p0C0I0t)
+--spider      &lt;URLs&gt;  spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
+--testsite    &lt;URLs&gt;  identical to --spider
+--skeleton    &lt;URLs&gt;  make a mirror, but gets only html files (-p1)
 --update              update a mirror, without confirmation (-iC2)
 --continue            continue a mirror, without confirmation (-iC1)
 --catchurl            create a temporary proxy to capture an URL or a form post URL
@@ -2019,15 +2019,15 @@ are in reverse priority order.  Here's an example:
        <td>no characters must be present after</a></td>
      </tr>
 	<tr>
-		<td> <b> <filter>*[&lt NN]</b></td>
+		<td> <b> &lt;filter&gt;*[&lt NN]</b></td>
 		<td> size less than NN Kbytes</td>
 	</tr>
 	<tr>
-		<td> <b> <filter>*[&gt PP]</b></td>
+		<td> <b> &lt;filter&gt;*[&gt PP]</b></td>
 		<td> size more than PP Kbytes</td>
 	</tr>
 	<tr>
-		<td> <b> <filter>*[&lt NN &gt PP]</b></td>
+		<td> <b> &lt;filter&gt;*[&lt NN &gt PP]</b></td>
 		<td> size less than NN Kbytes and more than PP Kbytes</td>
 	</tr>
    </table>
@@ -2054,8 +2054,8 @@ generated automatically using the interface)
        <td>This will accept all zip files in .com addresses</td>
      </tr>
      <tr>
-        <td><b>-*someweb*/*.tar*</b></td>
-        <td>This will refuse all tar (or tar.gz etc.) files in hosts containing someweb</td>
+        <td><b>-*example*/*.tar*</b></td>
+        <td>This will refuse all tar (or tar.gz etc.) files in hosts containing example</td>
      </tr>
      <tr>
        <td><b>+*/*somepage*</b></td>
--- a/html/filters.html
+++ b/html/filters.html
@@ -109,8 +109,8 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>

    <i>You have to know that once you have defined
    starts links, the default mode is to mirror these links - i.e. if one of your start page is
-    www.someweb.com/test/index.html, all links starting with www.someweb.com/test/ will be
-    accepted. But links directly in www.someweb.com/.. will not be accepted, however, because
+    www.example.com/test/index.html, all links starting with www.example.com/test/ will be
+    accepted. But links directly in www.example.com/.. will not be accepted, however, because
    they are in a higher strcuture. This prevent HTTrack from mirroring the whole site. (All
    files in structure levels equal or lower than the primary links will be retrieved.)<br>
    </i>
@@ -278,8 +278,8 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
        <td>This will refuse/accept all zip files in .com addresses</td>
      </tr>
      <tr>
-        <td nowrap><tt>*someweb*/*.tar*</tt></td>
-        <td>This will refuse/accept all tar (or tar.gz etc.) files in hosts containing someweb</td>
+        <td nowrap><tt>*example*/*.tar*</tt></td>
+        <td>This will refuse/accept all tar (or tar.gz etc.) files in hosts containing example</td>
      </tr>
      <tr>
        <td nowrap><tt>*/*somepage*</tt></td>
@@ -289,13 +289,13 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
        <td nowrap><tt>*.html</tt></td>
        <td>This will refuse/accept all html files. <br>
        Warning! With this filter you will accept ALL html files, even those in other addresses.
-        (causing a global (!) web mirror..) Use www.someweb.com/*.html to accept all html files from
+        (causing a global (!) web mirror..) Use www.example.com/*.html to accept all html files from
        a web.</td>
      </tr>
      <tr>
        <td nowrap><tt>*.html*[]</tt></td>
        <td>Identical to <tt>*.html</tt>, but the link must not have any supplemental characters
-        at the end (links with parameters, like <tt>www.someweb.com/index.html?page=10</tt>, will be
+        at the end (links with parameters, like <tt>www.example.com/index.html?page=10</tt>, will be
        refused)</td>
      </tr>
    </table>
--- a/html/httrack.man.html
+++ b/html/httrack.man.html
@@ -123,12 +123,12 @@ mirrored site, and resume interrupted downloads.</p>


 <p style="margin-left:11%; margin-top: 1em"><b>httrack
-www.someweb.com/bob/</b></p>
+www.example.com/bob/</b></p>

 <p style="margin-left:22%;">mirror site
-www.someweb.com/bob/ and only this site</p>
+www.example.com/bob/ and only this site</p>

-<p style="margin-left:11%;"><b>httrack www.someweb.com/bob/
+<p style="margin-left:11%;"><b>httrack www.example.com/bob/
 www.anothertest.com/mike/ +*.com/*.jpg <br>
 -mime:application/*</b></p>

@@ -137,18 +137,18 @@ www.anothertest.com/mike/ +*.com/*.jpg <br>
 sites</p>

 <p style="margin-left:11%;"><b>httrack
-www.someweb.com/bob/bobby.html +* -r6</b></p>
+www.example.com/bob/bobby.html +* -r6</b></p>

 <p style="margin-left:22%;">means get all files starting
 from bobby.html, with 6 link-depth, and possibility of going
 everywhere on the web</p>

 <p style="margin-left:11%;"><b>httrack
-www.someweb.com/bob/bobby.html --spider -P <br>
+www.example.com/bob/bobby.html --spider -P <br>
 proxy.myhost.com:8080</b></p>

 <p style="margin-left:22%;">runs the spider on
-www.someweb.com/bob/bobby.html using a proxy</p>
+www.example.com/bob/bobby.html using a proxy</p>

 <p style="margin-left:11%;"><b>httrack --update</b></p>

@@ -1877,7 +1877,7 @@ User-defined option N</b> <br>
 %N Name of file, including file type (ex: image.gif) <br>
 %t File type (ex: gif) <br>
 %p Path [without ending /] (ex: /someimages) <br>
-%h Host name (ex: www.someweb.com) <br>
+%h Host name (ex: www.example.com) <br>
 %M URL MD5 (128 bits, 32 ascii bytes) <br>
 %Q query string MD5 (128 bits, 32 ascii bytes) <br>
 %k full query string <br>
--- a/html/options.html
+++ b/html/options.html
@@ -131,16 +131,16 @@ This is the default primary scanning option, the engine does not go out of domai

 d   stay on the same principal domain
 This option lets the engine go on all sites that exist on the same principal domain.
-Example: a link located at www.someweb.com that goes to members.someweb.com will be followed.
+Example: a link located at www.example.com that goes to members.example.com will be followed.

 l   stay on the same location (.com, etc.)
 This option lets the engine go on all sites that exist on the same location.
-Example: a link located at www.someweb.com that goes to www.anyotherweb.com will be followed.
+Example: a link located at www.example.com that goes to www.anyotherweb.com will be followed.
 Warning: this is a potentially dangerous option, limit the recurse depth with r option.

 e   go everywhere on the web
 This option lets the engine go on any sites.
-Example: a link located at www.someweb.com that goes to www.anyotherweb.org will be followed.
+Example: a link located at www.example.com that goes to www.anyotherweb.org will be followed.
 Warning: this is a potentially dangerous option, limit the recurse depth with r option.

 n   get non-html files 'near' an html file (ex: an image located outside)
--- a/html/step9_opt8.html
+++ b/html/step9_opt8.html
@@ -117,7 +117,7 @@ h4 { margin: 0;  font-weight: bold;  font-size: 1.18em; }
  <li>HTML Footer</li>
  <br><small>Enter here the optionnal text that will be included as a comment in each HTML file to make archiving easier
  <br>The string entered is generally an HTML comment (<tt>&lt;!-- HTML comment --&gt;</tt>) with optionnal %s, which will be transformed into a specific string information:
-  <br>%s #1 : host name (for example, www.someweb.com)
+  <br>%s #1 : host name (for example, www.example.com)
  <br>%s #2 : file name (for example, /index.html)
  <br>%s #3 : date of the mirror
  <br><b>Example</b>: <tt>&lt;!-- Page mirrored from %s, file %s. Archive date: %s --&gt;</tt>
--- a/lang/Ukrainian.txt
+++ b/lang/Ukrainian.txt
@@ -7,7 +7,7 @@ uk
 LANGUAGE_AUTHOR
 Andrij Shevchuk (http://programy.com.ua, http://vic-info.com.ua) \r\n
 LANGUAGE_CHARSET
-ISO-8859-5
+windows-1251
 LANGUAGE_WINDOWSID
 Ukrainian
 OK
--- a/man/Makefile.am
+++ b/man/Makefile.am
@@ -13,3 +13,9 @@ regen-man: makeman.sh $(top_builddir)/src/httrack$(EXEEXT)
 	README='$(top_srcdir)/README' $(SHELL) $(srcdir)/makeman.sh \
 		'$(top_builddir)/src/httrack$(EXEEXT)' > $(srcdir)/httrack.1
 .PHONY: regen-man
+
+# Render html/httrack.man.html from httrack.1. Needs the groff html device
+# (Debian: full "groff" package, not "groff-base"). Run by hand: make -C man regen-man-html
+regen-man-html: httrack.1
+	groff -t -man -Thtml $(srcdir)/httrack.1 > $(top_srcdir)/html/httrack.man.html
+.PHONY: regen-man-html
--- a/man/Makefile.in
+++ b/man/Makefile.in
@@ -551,6 +551,12 @@ regen-man: makeman.sh $(top_builddir)/src/httrack$(EXEEXT)
 		'$(top_builddir)/src/httrack$(EXEEXT)' > $(srcdir)/httrack.1
 .PHONY: regen-man

+# Render html/httrack.man.html from httrack.1. Needs the groff html device
+# (Debian: full "groff" package, not "groff-base"). Run by hand: make -C man regen-man-html
+regen-man-html: httrack.1
+	groff -t -man -Thtml $(srcdir)/httrack.1 > $(top_srcdir)/html/httrack.man.html
+.PHONY: regen-man-html
+
 # Tell versions [3.59,3.63) of GNU make to not export all variables.
 # Otherwise a system limit (for SysV at least) may be exceeded.
 .NOEXPORT:
--- a/man/httrack.1
+++ b/man/httrack.1
@@ -2,7 +2,7 @@
 .\" groff -man -Tascii httrack.1
 .\"
 .\" This file is generated by man/makeman.sh; do not edit by hand.
-.TH httrack 1 "07 June 2026" "httrack website copier"
+.TH httrack 1 "13 June 2026" "httrack website copier"
 .SH NAME
 httrack \- offline browser : copy websites to a local directory
 .SH SYNOPSIS
@@ -98,15 +98,15 @@ httrack \- offline browser : copy websites to a local directory
 allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads.
 .SH EXAMPLES
 .TP
-.B httrack www.someweb.com/bob/
-mirror site www.someweb.com/bob/ and only this site
+.B httrack www.example.com/bob/
+mirror site www.example.com/bob/ and only this site
 .TP
-.B httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg \-mime:application/*
+.B httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg \-mime:application/*
 mirror the two sites together (with shared links) and accept any .jpg files on .com sites
 .TP
-.B httrack www.someweb.com/bob/bobby.html +* \-r6
+.B httrack www.example.com/bob/bobby.html +* \-r6
 .TP
-.B httrack www.someweb.com/bob/bobby.html \-\-spider \-P proxy.myhost.com:8080
+.B httrack www.example.com/bob/bobby.html \-\-spider \-P proxy.myhost.com:8080
 .TP
 .B httrack \-\-update
 .TP
@@ -411,7 +411,7 @@ File type (ex: gif)
 .IP \-%p
 Path [without ending /] (ex: /someimages)
 .IP \-%h
-Host name (ex: www.someweb.com)
+Host name (ex: www.example.com)
 .IP \-%M
 URL MD5 (128 bits, 32 ascii bytes)
 .IP \-%Q
--- a/src/htshelp.c
+++ b/src/htshelp.c
@@ -712,7 +712,7 @@ void help(const char *app, int more) {
  infomsg("  '%N' Name of file, including file type (ex: image.gif)");
  infomsg("  '%t' File type (ex: gif)");
  infomsg("  '%p' Path [without ending /] (ex: /someimages)");
-  infomsg("  '%h' Host name (ex: www.someweb.com)");
+  infomsg("  '%h' Host name (ex: www.example.com)");
  infomsg("  '%M' URL MD5 (128 bits, 32 ascii bytes)");
  infomsg("  '%Q' query string MD5 (128 bits, 32 ascii bytes)");
  infomsg("  '%k' full query string");
@@ -767,21 +767,21 @@ void help(const char *app, int more) {
  infomsg("Details: Option %W: External callbacks prototypes");
  infomsg("see htsdefines.h");
  infomsg("");
-  infomsg("example: httrack www.someweb.com/bob/");
-  infomsg("means:   mirror site www.someweb.com/bob/ and only this site");
+  infomsg("example: httrack www.example.com/bob/");
+  infomsg("means:   mirror site www.example.com/bob/ and only this site");
  infomsg("");
  infomsg
-    ("example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*");
+    ("example: httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*");
  infomsg
    ("means:   mirror the two sites together (with shared links) and accept any .jpg files on .com sites");
  infomsg("");
-  infomsg("example: httrack www.someweb.com/bob/bobby.html +* -r6");
+  infomsg("example: httrack www.example.com/bob/bobby.html +* -r6");
  infomsg
    ("means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web");
  infomsg("");
  infomsg
-    ("example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080");
-  infomsg("runs the spider on www.someweb.com/bob/bobby.html using a proxy");
+    ("example: httrack www.example.com/bob/bobby.html --spider -P proxy.myhost.com:8080");
+  infomsg("runs the spider on www.example.com/bob/bobby.html using a proxy");
  infomsg("");
  infomsg("example: httrack --update");
  infomsg("updates a mirror in the current folder");
--- a/src/htslib.c
+++ b/src/htslib.c
@@ -121,6 +121,7 @@ const char *hts_detect[] = {
  "lowsrc",
  "profile",                    // element META
  "src",
+  "srcset",                     // HTML5 responsive images (<img>, <source>)
  "swurl",
  "url",
  "usemap",
@@ -895,9 +896,9 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,

  // possibilité non documentée: >post: et >postfile:
  // si présence d'un tag >post: alors executer un POST
-  // exemple: http://www.someweb.com/test.cgi?foo>post:posteddata=10&foo=5
+  // exemple: http://www.example.com/test.cgi?foo>post:posteddata=10&foo=5
  // si présence d'un tag >postfile: alors envoyer en tête brut contenu dans le fichier en question
-  // exemple: http://www.someweb.com/test.cgi?foo>postfile:post0.txt
+  // exemple: http://www.example.com/test.cgi?foo>postfile:post0.txt
  search_tag = strstr(fil, POSTTOK ":");
  if (!search_tag) {
    search_tag = strstr(fil, POSTTOK "file:");
--- a/src/htsparse.c
+++ b/src/htsparse.c
@@ -532,6 +532,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
        int valid_p = 0;        // force to take p even if == 0
        int ending_p = '\0';    // ending quote?
        int archivetag_p = 0;   // avoid multiple-archives with commas
+        int srcset_p = 0;       // srcset="url1 480w, url2 2x": list of URLs
        int unquoted_script = 0;
        INSCRIPT inscript_state_pos_prev = inscript_state_pos;

@@ -1050,6 +1051,12 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
                          if (strcmp(hts_detect[i], "archive") == 0) {
                            archivetag_p = 1;
                          }
+                          /* srcset: a comma-list of candidate URLs, each split
+                             out and rewritten below (#235, #236) */
+                          else if (strcmp(hts_detect[i], "srcset") == 0
+                                   || strcmp(hts_detect[i], "data-srcset") == 0) {
+                            srcset_p = 1;
+                          }
                        }
                        i++;
                      }
@@ -1815,6 +1822,14 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
                html++;          // sauter # pour usemap etc
              }
            }
+ srcset_next:
+            /* srcset: skip leading whitespace/commas before each candidate;
+               the skipped bytes flush verbatim below */
+            if (srcset_p) {
+              while(html < r->adr + r->size
+                    && (is_realspace(*html) || *html == ','))
+                INCREMENT_CURRENT_ADR(1);
+            }
            eadr = html;

            // ne pas flusher après code si on doit écrire le codebase avant!
@@ -1844,6 +1859,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
                    if ((*eadr == quote && (!quoteinscript || *(eadr - 1) == '\\'))     // end quote
                        || (noquote && (*eadr == '\"' || *eadr == '\''))        // end at any quote
                        || (!noquote && quote == '\0' && is_realspace(*eadr))   // unquoted href
+                        || srcset_p     // whitespace ends a srcset candidate URL
                      )         // si pas d'attente de quote spéciale ou si quote atteinte
                      ok = 0;
                  } else if (ending_p && (*eadr == ending_p))
@@ -1872,6 +1888,16 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
                      break;    // \" ou \' point d'arrêt
                    case '?':  /*quote_adr=adr; */
                      break;    // noter position query
+                    case ',':
+                      if (srcset_p) {
+                        /* split only on a trailing comma; one inside the URL
+                           (data: URI, CDN path) is kept, per the WHATWG algo */
+                        const char *const n = eadr + 1;
+
+                        if (n >= r->adr + r->size || is_space(*n) || *n == ',')
+                          ok = 0;
+                      }
+                      break;
                    }
                  }
                  //}
@@ -3250,6 +3276,28 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
            }
            // adr=eadr-1;  // ** sauter

+            /* srcset candidate loop: skip the descriptor and comma, then
+               re-enter the capture for the next URL. Backward goto, not a loop:
+               the per-candidate body is this whole block. */
+            if (srcset_p && ok == 0) {
+              const char *const endp = r->adr + r->size;
+              const char *q = html;
+              while(q < endp && *q != '\0' && *q != ',' && *q != quote
+                    && *q != '<' && *q != '>' && (unsigned char) *q >= 32)
+                q++;            // skip the descriptor
+              if (q < endp && *q == ',') {
+                q++;
+                while(q < endp && (is_realspace(*q) || *q == ','))
+                  q++;          // skip whitespace and empty candidates
+                if (q < endp && *q != '\0' && *q != ',' && *q != quote
+                    && *q != '<' && *q != '>' && (unsigned char) *q >= 32) {
+                  INCREMENT_CURRENT_ADR(q - html);   // keep the automate in sync
+                  ok = 1;
+                  goto srcset_next;
+                }
+              }
+            }
+
            /* We skipped bytes and skip the " : reset state */
            /*if (inscript) {
               inscript_state_pos = INSCRIPT_START;
--- a/tests/01_engine-parse.test
+++ b/tests/01_engine-parse.test
@@ -0,0 +1,131 @@
+#!/bin/bash
+#
+
+# Offline HTML parser tests: each section crawls a file:// fixture (no network)
+# and checks which assets the parser captured and how it rewrote the links.
+
+set -u
+
+tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_parse.XXXXXX") || exit 1
+trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM
+
+# a minimal valid 1x1 GIF, reused for every referenced asset
+gif() {
+    printf 'GIF89a\1\0\1\0\200\0\0\0\0\0\377\377\377!\371\4\1\0\0\0\0,\0\0\0\0\1\0\1\0\0\2\2D\1\0;' >"$1"
+}
+
+# crawl <fixture-html> into <out> with link rewriting on, no extra fetching
+crawl() {
+    local html="$1" out="$2"
+    rm -rf "$out"
+    mkdir -p "$out"
+    httrack "file://$html" -O "$out" --quiet --near -n >"$out/.log" 2>&1
+}
+
+# assert a file with the given basename was saved somewhere under <out>
+found() {
+    test -n "$(find "$2" -type f -name "$1" -print -quit)" ||
+        ! echo "FAIL: expected '$1' to be downloaded under $2" || exit 1
+}
+
+# assert NO file with the given basename was saved (e.g. a descriptor token must
+# not be mistaken for a URL)
+notfound() {
+    test -z "$(find "$2" -type f -name "$1" -print -quit)" ||
+        ! echo "FAIL: '$1' should not have been downloaded under $2" || exit 1
+}
+
+# the mirrored fixture page (under "file/"), not HTTrack's own landing index
+savedhtml() {
+    find "$1" -type f -path '*/file/*' -name index.html -print -quit
+}
+
+# srcset on <img> and <source> (#235, #236): every candidate captured and
+# rewritten, descriptors preserved, following attributes left intact.
+site="$tmp/srcset"
+mkdir -p "$site"
+for f in a b c d e f g h i j v dz; do gif "$site/$f.gif"; done
+# unquoted heredoc: $site expands in the absolute-URL candidate
+cat >"$site/index.html" <<EOF
+<html><body>
+<img src="a.gif" srcset="b.gif 480w, c.gif 800w">
+<picture><source srcset="d.gif 1x, c.gif 2x"><img src="a.gif"></picture>
+<img srcset="e.gif, f.gif">
+<img srcset="g.gif 2x" alt="trailing attr after srcset">
+<img srcset="  h.gif   2x ,  i.gif  ">
+<video><source src="v.gif"></video>
+<img srcset="file://$site/j.gif 2x">
+<img srcset="data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x, dz.gif 2x">
+<img srcset="">
+<a href="a.gif">plain link still works</a>
+</body></html>
+EOF
+out="$tmp/srcset-out"
+crawl "$site/index.html" "$out"
+
+# every candidate downloads, incl. unique tails (catches first-only parsing),
+# whitespace-padded (h,i), <source src> (v), absolute (j), post-data: URI (dz)
+for f in a b c d e f g h i j v dz; do found "$f.gif" "$out"; done
+
+# the width/density descriptors are not URLs and must not be fetched
+notfound "480w" "$out"
+notfound "800w" "$out"
+notfound "2x" "$out"
+
+saved=$(savedhtml "$out")
+test -n "$saved" || ! echo "FAIL: saved index.html not found" || exit 1
+
+# descriptors must survive the rewrite (no "b.gif 480w" mangled into a path)
+grep -Eq 'srcset="[^"]*480w[^"]*800w' "$saved" ||
+    ! echo "FAIL: srcset width descriptors lost/reordered in rewritten HTML" || exit 1
+grep -Eq 'srcset="[^"]*1x[^"]*2x' "$saved" ||
+    ! echo "FAIL: srcset density descriptors lost/reordered in rewritten HTML" || exit 1
+# the descriptor-less comma form keeps both candidates and the separator verbatim
+grep -Eq 'srcset="e\.gif, f\.gif"' "$saved" ||
+    ! echo "FAIL: comma-separated srcset without descriptors was altered" || exit 1
+# an attribute following srcset in the same tag must be left intact
+grep -q 'alt="trailing attr after srcset"' "$saved" ||
+    ! echo "FAIL: srcset swallowed a following attribute" || exit 1
+
+# a comma inside a URL (data: URI, CDN path) is part of the URL, not a split
+# point (WHATWG): the data: URI stays verbatim; the next candidate (dz) downloads
+grep -Fq 'data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x' "$saved" ||
+    ! echo "FAIL: a comma inside a data: URI srcset candidate was mis-split" || exit 1
+
+# real rewrite, not passthrough: the absolute file:// candidate becomes local
+# (a flat fixture can't show this; the footer comment's file:// is not in srcset)
+grep -Eq 'srcset="j\.gif 2x"' "$saved" ||
+    ! echo "FAIL: absolute file:// srcset URL was not rewritten to a local link" || exit 1
+! grep -Eq 'srcset="[^"]*file://' "$saved" ||
+    ! echo "FAIL: a file:// URL survived inside a rewritten srcset attribute" || exit 1
+
+# xlink:href (#298) and inline background-image (#237): detected and rewritten
+# to local; no-detect attributes (title, alt, ...) left untouched. Asserted by
+# rewrite (deterministic), not download. data-* (#201/#203) is omitted: its
+# detection is currently nondeterministic and can't be locked yet.
+site2="$tmp/attrs"
+mkdir -p "$site2"
+for f in xl ibg tt; do gif "$site2/$f.gif"; done
+cat >"$site2/index.html" <<EOF
+<html><body>
+<a xlink:href="file://$site2/xl.gif">xlink:href (#298)</a>
+<div style="background-image:url(file://$site2/ibg.gif)"></div>
+<span title="file://$site2/tt.gif">excluded attribute</span>
+</body></html>
+EOF
+out2="$tmp/attrs-out"
+crawl "$site2/index.html" "$out2"
+saved2=$(savedhtml "$out2")
+test -n "$saved2" || ! echo "FAIL: saved attrs page not found" || exit 1
+
+# detected attributes: the absolute URL is rewritten to a local link
+grep -Eq 'xlink:href="xl\.gif"' "$saved2" ||
+    ! echo "FAIL #298: xlink:href not detected/rewritten" || exit 1
+grep -Eq 'style="background-image:url\(ibg\.gif\)"' "$saved2" ||
+    ! echo "FAIL #237: inline background-image url() not detected/rewritten" || exit 1
+
+# excluded attribute: title is on the no-detect list, so its value is left as-is
+grep -q 'title="file://' "$saved2" ||
+    ! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
+
+exit 0
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -9,6 +9,6 @@ TESTS_ENVIRONMENT += HTTPS_SUPPORT=$(HTTPS_SUPPORT)
 TESTS_ENVIRONMENT += top_srcdir=$(top_srcdir)

 TEST_EXTENSIONS = .test
-TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
+TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test

 CLEANFILES = check-network_sh.cache
--- a/tests/Makefile.in
+++ b/tests/Makefile.in
@@ -472,7 +472,7 @@ TESTS_ENVIRONMENT = PATH=$(top_builddir)/src$(PATH_SEPARATOR)$$PATH \
 	ONLINE_UNIT_TESTS=$(ONLINE_UNIT_TESTS) \
 	HTTPS_SUPPORT=$(HTTPS_SUPPORT) top_srcdir=$(top_srcdir)
 TEST_EXTENSIONS = .test
-TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
+TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
 CLEANFILES = check-network_sh.cache
 all: all-am
Author	SHA1	Message	Date
Xavier Roche	62be177e35	Add obfuscated personal email as alternate security contact Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>	2026-06-14 13:47:15 +02:00
Xavier Roche	452a9f6c67	Add contributor governance: CONTRIBUTING, COC, SECURITY, DCO httrack had no community-health files. Add a short CONTRIBUTING (PR/style basics, security-sensitivity, an outcome-only AI-assistance policy), the Contributor Covenant 2.1 as CODE_OF_CONDUCT, and a SECURITY policy with a verified-reproduction bar for AI-assisted reports. Require a Signed-off-by (DCO) on every commit and enforce it in CI via a new pull_request-only job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Roche <roche@httrack.com>	2026-06-14 13:41:19 +02:00
Xavier Roche	1bf90ce294	Merge pull request #326 from xroche/parser/srcset-candidates Capture every srcset candidate URL on <img> and <source>	2026-06-14 00:42:48 +02:00
Xavier Roche	583817dcd4	Capture every srcset candidate URL on <img> and <source> A srcset value is a comma-separated list of "URL descriptor" entries (480w, 2x). HTTrack only had "data-srcset" in the link-detection table and left the plain "srcset" attribute untouched, so responsive images were never mirrored. The parser now captures and rewrites each candidate URL in turn, preserving the descriptors and the commas between entries verbatim, and bounds every new buffer scan against the page end. Candidate splitting follows the WHATWG srcset algorithm: the URL is a run of non-whitespace characters, so a comma inside a URL (a data: URI, a CDN transform path like w_300,c_fill) stays part of the URL and is not mis-split; only a trailing comma or a comma after the descriptor separates candidates. Adds tests/01_engine-parse.test, an offline file:// parser test that asserts each candidate is queued and rewritten (including the comma-in-URL cases), and also locks the existing xlink:href (#298) and inline background-image (#237) handling. closes #235 closes #236	2026-06-14 00:37:20 +02:00
Xavier Roche	5351e96d71	Merge pull request #325 from xroche/docs/rfc2606-example-domains docs: use www.example.com in examples; add html manual regen target	2026-06-13 10:41:24 +02:00
Xavier Roche	9d39a57576	build: add regen target for html/httrack.man.html The rendered HTML manual had no regeneration path. Add regen-man-html, which runs groff's html device over httrack.1, alongside the existing regen-man target.	2026-06-13 10:38:31 +02:00
Xavier Roche	e3d4ec01f7	docs: use www.example.com in examples instead of www.someweb.com someweb.com is a real registrable domain; example.com is reserved for documentation (RFC 2606). Replace it across the HTML guides, the CLI --help text (htshelp.c) and code comments, then regenerate man/httrack.1 and the rendered html/httrack.man.html. Other placeholder domains are left alone: they appear inside filter/wildcard examples where the host interacts with the pattern.	2026-06-13 10:38:31 +02:00
Xavier Roche	a0bf50f6b1	Merge pull request #324 from xroche/test/filter-escape-characterize test: characterize wildcard class escape behavior	2026-06-13 10:17:24 +02:00
Xavier Roche	82d08aaeaf	Merge pull request #323 from xroche/fix/doc-lang-nits docs: fix help-guide placeholders, README clone flag, Ukrainian charset	2026-06-13 10:12:09 +02:00
Xavier Roche	459f06e758	docs: fix help-guide placeholders, README clone flag, Ukrainian charset Escape the literal <URLs>, <FILTERs>, <param>, <filter>, <file> and related placeholders in fcguide.html so they render instead of being swallowed as unknown HTML tags; several were also missing their closing '>'. Use --recurse-submodules in the README clone command. Relabel lang/Ukrainian.txt as windows-1251, which is what its bytes actually are (ISO-8859-5 decodes them to garbage). closes #132, closes #103, closes #167	2026-06-13 10:05:40 +02:00
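
The candidate-splitting rule described in commit 583817dcd4 can be illustrated with a small sketch. This is not HTTrack's C parser. It is a simplified Python rendering of the rule as the commit message states it, and the function name `split_srcset` is hypothetical. It assumes descriptors never contain commas, so it ignores the parenthesized-descriptor case that the full WHATWG algorithm handles.

```python
def split_srcset(value):
    """Split a srcset value into (url, descriptor) pairs.

    Simplified sketch: the URL is a run of non-whitespace characters,
    so commas inside a URL stay part of it. Only a trailing comma on
    the URL, or a comma after the descriptor, separates candidates.
    """
    ws = " \t\n\r\f"
    candidates = []
    i, n = 0, len(value)
    while i < n:
        # Skip whitespace and stray commas before the next candidate.
        while i < n and (value[i] in ws or value[i] == ","):
            i += 1
        if i >= n:
            break
        # Collect the URL: everything up to the next whitespace.
        start = i
        while i < n and value[i] not in ws:
            i += 1
        url = value[start:i]
        if url.endswith(","):
            # A trailing comma ends the candidate; there is no descriptor.
            url = url.rstrip(",")
            descriptor = ""
        else:
            # The descriptor runs up to the next comma (assumed comma-free).
            start = i
            while i < n and value[i] != ",":
                i += 1
            descriptor = value[start:i].strip(ws)
        if url:
            candidates.append((url, descriptor))
    return candidates


if __name__ == "__main__":
    # A CDN transform path with an embedded comma is kept as one URL.
    print(split_srcset("img/w_300,c_fill/a.jpg 1x, b.jpg 2x"))
```

A rewriter built on this would replace only the `url` part of each pair and leave the descriptors and separators untouched, which is the behavior the commit message describes.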