Compare commits

9 Commits

Author SHA1 Message Date
Xavier Roche
62be177e35 Add obfuscated personal email as alternate security contact
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 13:47:15 +02:00
Xavier Roche
452a9f6c67 Add contributor governance: CONTRIBUTING, COC, SECURITY, DCO
httrack had no community-health files. Add a short CONTRIBUTING (PR/style
basics, security-sensitivity, an outcome-only AI-assistance policy), the
Contributor Covenant 2.1 as CODE_OF_CONDUCT, and a SECURITY policy with a
verified-reproduction bar for AI-assisted reports.

Require a Signed-off-by (DCO) on every commit and enforce it in CI via a new
pull_request-only job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-14 13:41:19 +02:00
Xavier Roche
1bf90ce294 Merge pull request #326 from xroche/parser/srcset-candidates
Capture every srcset candidate URL on <img> and <source>
2026-06-14 00:42:48 +02:00
Xavier Roche
583817dcd4 Capture every srcset candidate URL on <img> and <source>
A srcset value is a comma-separated list of "URL descriptor" entries
(480w, 2x). HTTrack only had "data-srcset" in the link-detection table and
left the plain "srcset" attribute untouched, so responsive images were never
mirrored. The parser now captures and rewrites each candidate URL in turn,
preserving the descriptors and the commas between entries verbatim, and bounds
every new buffer scan against the page end.

Candidate splitting follows the WHATWG srcset algorithm: the URL is a run of
non-whitespace characters, so a comma inside a URL (a data: URI, a CDN
transform path like w_300,c_fill) stays part of the URL and is not mis-split;
only a trailing comma or a comma after the descriptor separates candidates.
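
As an illustration only, the splitting rule above can be sketched in POSIX shell. This is not the parser code from this change (that is C, inside htsparse); `srcset_urls` is a hypothetical helper name, and the sketch leaves out the parenthesized-descriptor case that the full WHATWG algorithm handles.

```sh
# Print each candidate URL of a srcset value, one per line.
# Sketch only: hypothetical helper, not HTTrack's implementation.
srcset_urls() (
  s=$1
  while :; do
    # Skip leading whitespace and empty candidates (stray commas).
    while :; do
      case $s in
        [[:space:],]*) s=${s#?} ;;
        *) break ;;
      esac
    done
    [ -n "$s" ] || break
    # The URL is the run of non-whitespace characters, commas included.
    url=${s%%[[:space:]]*}
    s=${s#"$url"}
    case $url in
      *,)
        # Trailing comma(s) end the candidate; there is no descriptor.
        while :; do
          case $url in
            *,) url=${url%,} ;;
            *) break ;;
          esac
        done
        ;;
      *)
        # Skip the descriptor: everything up to the next comma.
        case $s in
          *,*) s=${s#*,} ;;
          *) s= ;;
        esac
        ;;
    esac
    printf '%s\n' "$url"
  done
)

srcset_urls 'b.gif 480w, c.gif 800w'
```

With the input `data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x, dz.gif 2x`, the same function prints the data: URI intact on one line and `dz.gif` on the next, because the comma inside the URL is never treated as a separator.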

Adds tests/01_engine-parse.test, an offline file:// parser test that asserts
each candidate is queued and rewritten (including the comma-in-URL cases), and
also locks the existing xlink:href (#298) and inline background-image (#237)
handling.

closes #235
closes #236
2026-06-14 00:37:20 +02:00
Xavier Roche
5351e96d71 Merge pull request #325 from xroche/docs/rfc2606-example-domains
docs: use www.example.com in examples; add html manual regen target
2026-06-13 10:41:24 +02:00
Xavier Roche
a0bf50f6b1 Merge pull request #324 from xroche/test/filter-escape-characterize
test: characterize wildcard class escape behavior
2026-06-13 10:17:24 +02:00
Xavier Roche
794404bba2 test: characterize wildcard class escape behavior
Add -#0 self-test cases for backslash escapes inside a '*[...]' class.
They pin two quirks of the current decoder: '\X' matches both X and the
backslash itself, and a literal ']' cannot be a class member because the
parser stops at the first ']' (escaped or not). The latter is why the
filter guide's '*[\[\]]' = "the [ or ] character" claim is wrong (#148):
it parses as the class {[,\} plus a trailing literal ']'. These tests
lock the behavior down so a later matcher fix is a deliberate change.

refs #148
2026-06-13 10:15:45 +02:00
Xavier Roche
82d08aaeaf Merge pull request #323 from xroche/fix/doc-lang-nits
docs: fix help-guide placeholders, README clone flag, Ukrainian charset
2026-06-13 10:12:09 +02:00
Xavier Roche
459f06e758 docs: fix help-guide placeholders, README clone flag, Ukrainian charset
Escape the literal <URLs>, <FILTERs>, <param>, <filter>, <file> and
related placeholders in fcguide.html so they render instead of being
swallowed as unknown HTML tags; several were also missing their closing
'>'. Use --recurse-submodules in the README clone command. Relabel
lang/Ukrainian.txt as windows-1251, which is what its bytes actually
are (ISO-8859-5 decodes them to garbage).
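
A small worked example of why the label matters, as a shell sketch. The three bytes are illustrative (the word "Так" as windows-1251 encodes it), not taken from lang/Ukrainian.txt, and `decode_both` is a hypothetical helper. It assumes only that both code pages map a contiguous byte range onto U+0410..U+044F, windows-1251 starting at 0xC0 and ISO-8859-5 at 0xB0, so the arithmetic holds only for bytes 0xC0 through 0xEF.

```sh
# Show the code point each charset label assigns to the same byte.
# Valid for bytes 0xC0..0xEF only (see the assumption above).
decode_both() (
  for b in "$@"; do
    printf '0x%02X windows-1251=U+%04X ISO-8859-5=U+%04X\n' \
      "$b" "$(($b + 0x350))" "$(($b + 0x360))"
  done
)

decode_both 0xD2 0xE0 0xEA
```

Read as windows-1251, the bytes give U+0422 U+0430 U+043A ("Так"); read as ISO-8859-5, the same bytes give U+0432 U+0440 U+044A ("връ"), which is the kind of garbage the commit message refers to.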

closes #132, closes #103, closes #167
2026-06-13 10:05:40 +02:00
13 changed files with 423 additions and 45 deletions

@@ -61,6 +61,37 @@ jobs:
        if: failure()
        run: cat tests/test-suite.log 2>/dev/null || true
  dco:
    name: DCO sign-off
    # Only checkable on a PR, where we have the base..head commit range.
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Every commit must be signed off
        env:
          BASE: ${{ github.event.pull_request.base.sha }}
          HEAD: ${{ github.event.pull_request.head.sha }}
        run: |
          set -euo pipefail
          fail=0
          # --no-merges: merge commits are GitHub-generated and carry no sign-off.
          for sha in $(git rev-list --no-merges "$BASE..$HEAD"); do
            if [ -z "$(git log -1 --format='%(trailers:key=Signed-off-by)' "$sha")" ]; then
              echo "Missing Signed-off-by: $(git log -1 --format='%h %s' "$sha")"
              fail=1
            fi
          done
          if [ "$fail" -ne 0 ]; then
            echo
            echo "Sign commits with 'git commit -s'; fix a branch with 'git rebase --signoff $BASE'."
            echo "See CONTRIBUTING.md (Developer Certificate of Origin)."
            exit 1
          fi
  lint:
    name: lint (shellcheck, shfmt)
    runs-on: ubuntu-24.04

CODE_OF_CONDUCT.md (new file, 83 lines)

@@ -0,0 +1,83 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at <roche@httrack.com>. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series of actions.
**Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.1, available at [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder][Mozilla CoC].
For answers to common questions about this code of conduct, see the FAQ at [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at [https://www.contributor-covenant.org/translations][translations].
[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations

CONTRIBUTING.md (new file, 39 lines)

@@ -0,0 +1,39 @@
# Contributing to HTTrack
HTTrack is small and old. Keep changes easy to review and safe to merge.
## Pull requests
- One change per PR. Small diffs merge fast.
- PRs are squash-merged: the title and description become the commit message, so
explain *why*.
- Add or update tests for engine changes (`tests/`), and keep CI green.
## Style
- C, matching nearby code. **Format only the lines you change** (`git
clang-format` against the repo `.clang-format`). Never reformat untouched code.
- Comment the *why*, in English.
- HTTrack parses hostile input off the network. Check bounds, avoid unchecked
copies, and never let an attacker-controlled length drive arithmetic unchecked.
## Sign your work
Every commit needs a `Signed-off-by` line, the
[DCO](https://developercertificate.org/): `git commit -s`. CI rejects unsigned
commits; fix a branch with `git rebase --signoff master`.
## AI assistants
Welcome, and nothing to disclose. Two rules:
- **Own every line** as if you wrote it. Can't explain it in review? Not ready.
- **Don't push your work onto reviewers.** A raw generated patch a maintainer has
to vet from scratch will be closed.
The sign-off covers AI-assisted code too.
## Bugs
Open an issue with the version, OS, command used, and expected vs actual result.
For security issues see [SECURITY.md](SECURITY.md), not a public issue.

@@ -23,7 +23,7 @@ http://www.httrack.com/
## Compile trunk release
```sh
git clone https://github.com/xroche/httrack.git --recurse
git clone https://github.com/xroche/httrack.git --recurse-submodules
cd httrack
./configure --prefix=$HOME/usr && make -j8 && make install
```

SECURITY.md (new file, 23 lines)

@@ -0,0 +1,23 @@
# Security Policy
## Reporting
Report privately, not in a public issue or PR: use GitHub
[private advisories](https://github.com/xroche/httrack/security/advisories/new)
or email <roche@httrack.com> (alternate: `xroche at gmail dot com`).
Include the HTTrack version and platform, a concrete reproduction (command line,
a sample page or server response, or a small proof of concept), and what an
attacker gains. We'll acknowledge it and keep you posted. Please allow time for a
release before disclosing publicly.
## Supported versions
Fixes land on `master` and ship in the next release; older releases aren't
maintained. Confirm against current `master` when you can.
## AI-assisted findings
Scanners and LLMs are fine, but only send reports you have verified yourself. A
confirmed, reproducible issue is worth our time; a plausible one that doesn't
reproduce is not, and will be closed. If a report is AI-assisted, say so.

@@ -181,17 +181,17 @@ used for some time.
<p align=justify> The rest of this manual is dedicated to detailing what
you find in the help message and providing examples - lots and lots of
examples... Here is what you get (page by page - use <enter> to move to
examples... Here is what you get (page by page - use &lt;enter&gt; to move to
the next page in the real program) if you type 'httrack --help':
<pre>
>httrack --help
HTTrack version 3.03BETAo4 (compiled Jul 1 2001)
usage: ./httrack <URLs [-option] [+<FILTERs>] [-<FILTERs>]
usage: ./httrack &lt;URLs&gt; [-option] [+&lt;FILTERs&gt;] [-&lt;FILTERs&gt;]
with options listed below: (* is the default value)
General options:
O path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path <param>)
O path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path &lt;param&gt;)
%O top path if no path defined (-O path_mirror[,path_cache_and_logfiles])
Action options:
@@ -202,7 +202,7 @@ Action options:
Y mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)
Proxy options:
P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy &lt;param&gt;)
%f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])
Limits options:
@@ -227,7 +227,7 @@ Links options:
%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
n get non-html files 'near' an html file (ex: an image located outside) (--near)
t test all URLs (even forbidden ones) (--test)
%L <file add all URL located in this text file (one URL per line) (--list <param>)
%L &lt;file&gt; add all URL located in this text file (one URL per line) (--list &lt;param&gt;)
Build options:
NN structure type (0 *original structure, 1+: see below) (--structure[=N])
@@ -248,12 +248,12 @@ Spider options:
%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
%B tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
%s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume &lt;param&gt;)
Browser ID:
F user-agent field (-F "user-agent name") (--user-agent <param>)
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
%l preferred language (-%l "fr, en, jp, *" (--language <param>)
F user-agent field (-F "user-agent name") (--user-agent &lt;param&gt;)
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer &lt;param&gt;)
%l preferred language (-%l "fr, en, jp, *" (--language &lt;param&gt;)
Log, index, cache
C create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
@@ -303,8 +303,8 @@ Guru options: (do NOT use)
#! Execute a shell command (-#! "echo hello")
Command-line specific options:
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
%U run the engine with another id when called as root (-%U smith) (--user <param>)
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd &lt;param&gt;)
%U run the engine with another id when called as root (-%U smith) (--user &lt;param&gt;)
Details: Option N
N0 Site-structure (default)
@@ -340,14 +340,14 @@ Details: User-defined option N
%[param] param variable in query string
Shortcuts:
--mirror <URLs *make a mirror of site(s) (default)
--get <URLs get the files indicated, do not seek other URLs (-qg)
--list <text file add all URL located in this text file (-%L)
--mirrorlinks <URLs mirror all links in 1st level pages (-Y)
--testlinks <URLs test links in pages (-r1p0C0I0t)
--spider <URLs spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite <URLs identical to --spider
--skeleton <URLs make a mirror, but gets only html files (-p1)
--mirror &lt;URLs&gt; *make a mirror of site(s) (default)
--get &lt;URLs&gt; get the files indicated, do not seek other URLs (-qg)
--list &lt;text file&gt; add all URL located in this text file (-%L)
--mirrorlinks &lt;URLs&gt; mirror all links in 1st level pages (-Y)
--testlinks &lt;URLs&gt; test links in pages (-r1p0C0I0t)
--spider &lt;URLs&gt; spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite &lt;URLs&gt; identical to --spider
--skeleton &lt;URLs&gt; make a mirror, but gets only html files (-p1)
--update update a mirror, without confirmation (-iC2)
--continue continue a mirror, without confirmation (-iC1)
@@ -387,13 +387,13 @@ with examples... I will be here a while...
<hr>
<h2> Syntax </h2>
<pre><b><i>httrack <URLs> [-option] [+<FILTERs>] [-<FILTERs>] </i></b></pre>
<pre><b><i>httrack &lt;URLs&gt; [-option] [+&lt;FILTERs&gt;] [-&lt;FILTERs&gt;] </i></b></pre>
<p align=justify> The syntax of httrack is quite simple. You specify
the URLs you wish to start the process from (<URLS>), any options you
the URLs you wish to start the process from (&lt;URLS&gt;), any options you
might want to add ([-option], any filters specifying places you should
([+<FILTERs>]) and should not ([-<FILTERs>]) go, and end the command
line by pressing <enter>. Httrack then goes off and does your bidding.
([+&lt;FILTERs&gt;]) and should not ([-&lt;FILTERs&gt;]) go, and end the command
line by pressing &lt;enter&gt;. Httrack then goes off and does your bidding.
For example:
<pre><b><i>
@@ -425,7 +425,7 @@ site. Specifically, the defauls are:
pN priority mode: (* p3) *3 save all files
D *can only go down into subdirs
a *stay on the same address
--mirror <URLs> *make a mirror of site(s) (default)
--mirror &lt;URLs&gt; *make a mirror of site(s) (default)
</pre>
<p align=justify> Here's what all of that means:
@@ -542,7 +542,7 @@ subdirectories of the starting directory to be investigated.
search started are to be collected. Other sites they point to are not
to be imaged.
<pre><b><i> --mirror <URLs> *make a mirror of site(s) (default) </i></b></pre>
<pre><b><i> --mirror &lt;URLs&gt; *make a mirror of site(s) (default) </i></b></pre>
<p align=justify> This indicates that the program should try to make a
copy of the site as well as it can.
@@ -921,7 +921,7 @@ Links options:
%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use)
n get non-html files 'near' an html file (ex: an image located outside)
t test all URLs (even forbidden ones)
%L <file> add all URL located in this text file (one URL per line)
%L &lt;file&gt; add all URL located in this text file (one URL per line)
</i></b></pre>
<p align=justify> The links options allow you to control what links are
@@ -1183,7 +1183,7 @@ Spider options:
%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies)
%B tolerant requests (accept bogus responses on some servers, but not standard!)
%s update hacks: various hacks to limit re-transfers when updating
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume &lt;param&gt;)
</i></b></pre>
<p align=justify> By default, cookies are universally accepted and
@@ -1387,7 +1387,7 @@ web servers leave footprints in the browser.
Browser ID:
F user-agent field (-F "user-agent name")
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]"
%l preferred language (-%l "fr, en, jp, *" (--language <param>)
%l preferred language (-%l "fr, en, jp, *" (--language &lt;param&gt;)
</i></b></pre>
<p align=justify> The user-agent field is used by browsers to determine
@@ -1799,7 +1799,7 @@ based authentication)
<pre><b><i>
Command-line specific options:
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd &lt;param&gt;)
</i></b></pre>
<p align=justify> This option is very nice for a wide array of actions
@@ -1811,7 +1811,7 @@ httrack http://www.shoesizes.com/bob/ -O /tmp/shoesizes -V "/bin/echo \$0"
</i></b></pre>
<pre>
%U run the engine with another id when called as root (-%U smith) (--user <param>)
%U run the engine with another id when called as root (-%U smith) (--user &lt;param&gt;)
</pre>
<p align=justify> Change the UID of the owner when running as r00t
@@ -1856,14 +1856,14 @@ of other options that are commonly used.
<pre><b><i>
Shortcuts:
--mirror <URLs> *make a mirror of site(s) (default)
--get <URLs> get the files indicated, do not seek other URLs (-qg)
--list <text file> add all URL located in this text file (-%L)
--mirrorlinks <URLs> mirror all links in 1st level pages (-Y)
--testlinks <URLs> test links in pages (-r1p0C0I0t)
--spider <URLs> spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite <URLs> identical to --spider
--skeleton <URLs> make a mirror, but gets only html files (-p1)
--mirror &lt;URLs&gt; *make a mirror of site(s) (default)
--get &lt;URLs&gt; get the files indicated, do not seek other URLs (-qg)
--list &lt;text file&gt; add all URL located in this text file (-%L)
--mirrorlinks &lt;URLs&gt; mirror all links in 1st level pages (-Y)
--testlinks &lt;URLs&gt; test links in pages (-r1p0C0I0t)
--spider &lt;URLs&gt; spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite &lt;URLs&gt; identical to --spider
--skeleton &lt;URLs&gt; make a mirror, but gets only html files (-p1)
--update update a mirror, without confirmation (-iC2)
--continue continue a mirror, without confirmation (-iC1)
--catchurl create a temporary proxy to capture an URL or a form post URL
@@ -2019,15 +2019,15 @@ are in reverse priority order. Here's an example:
<td>no characters must be present after</a></td>
</tr>
<tr>
<td> <b> <filter>*[&lt NN]</b></td>
<td> <b> &lt;filter&gt;*[&lt NN]</b></td>
<td> size less than NN Kbytes</td>
</tr>
<tr>
<td> <b> <filter>*[&gt PP]</b></td>
<td> <b> &lt;filter&gt;*[&gt PP]</b></td>
<td> size more than PP Kbytes</td>
</tr>
<tr>
<td> <b> <filter>*[&lt NN &gt PP]</b></td>
<td> <b> &lt;filter&gt;*[&lt NN &gt PP]</b></td>
<td> size less than NN Kbytes and more than PP Kbytes</td>
</tr>
</table>

@@ -7,7 +7,7 @@ uk
LANGUAGE_AUTHOR
Andrij Shevchuk (http://programy.com.ua, http://vic-info.com.ua) \r\n
LANGUAGE_CHARSET
ISO-8859-5
windows-1251
LANGUAGE_WINDOWSID
Ukrainian
OK

@@ -121,6 +121,7 @@ const char *hts_detect[] = {
"lowsrc",
"profile", // element META
"src",
"srcset", // HTML5 responsive images (<img>, <source>)
"swurl",
"url",
"usemap",

@@ -532,6 +532,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int valid_p = 0; // force to take p even if == 0
int ending_p = '\0'; // ending quote?
int archivetag_p = 0; // avoid multiple-archives with commas
int srcset_p = 0; // srcset="url1 480w, url2 2x": list of URLs
int unquoted_script = 0;
INSCRIPT inscript_state_pos_prev = inscript_state_pos;
@@ -1050,6 +1051,12 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (strcmp(hts_detect[i], "archive") == 0) {
archivetag_p = 1;
}
/* srcset: a comma-list of candidate URLs, each split
out and rewritten below (#235, #236) */
else if (strcmp(hts_detect[i], "srcset") == 0
|| strcmp(hts_detect[i], "data-srcset") == 0) {
srcset_p = 1;
}
}
i++;
}
@@ -1815,6 +1822,14 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
html++; // sauter # pour usemap etc
}
}
srcset_next:
/* srcset: skip leading whitespace/commas before each candidate;
the skipped bytes flush verbatim below */
if (srcset_p) {
while(html < r->adr + r->size
&& (is_realspace(*html) || *html == ','))
INCREMENT_CURRENT_ADR(1);
}
eadr = html;
// ne pas flusher après code si on doit écrire le codebase avant!
@@ -1844,6 +1859,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((*eadr == quote && (!quoteinscript || *(eadr - 1) == '\\')) // end quote
|| (noquote && (*eadr == '\"' || *eadr == '\'')) // end at any quote
|| (!noquote && quote == '\0' && is_realspace(*eadr)) // unquoted href
|| srcset_p // whitespace ends a srcset candidate URL
) // si pas d'attente de quote spéciale ou si quote atteinte
ok = 0;
} else if (ending_p && (*eadr == ending_p))
@@ -1872,6 +1888,16 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
break; // \" ou \' point d'arrêt
case '?': /*quote_adr=adr; */
break; // noter position query
case ',':
if (srcset_p) {
/* split only on a trailing comma; one inside the URL
(data: URI, CDN path) is kept, per the WHATWG algo */
const char *const n = eadr + 1;
if (n >= r->adr + r->size || is_space(*n) || *n == ',')
ok = 0;
}
break;
}
}
//}
@@ -3250,6 +3276,28 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
// adr=eadr-1; // ** sauter
/* srcset candidate loop: skip the descriptor and comma, then
re-enter the capture for the next URL. Backward goto, not a loop:
the per-candidate body is this whole block. */
if (srcset_p && ok == 0) {
const char *const endp = r->adr + r->size;
const char *q = html;
while(q < endp && *q != '\0' && *q != ',' && *q != quote
&& *q != '<' && *q != '>' && (unsigned char) *q >= 32)
q++; // skip the descriptor
if (q < endp && *q == ',') {
q++;
while(q < endp && (is_realspace(*q) || *q == ','))
q++; // skip whitespace and empty candidates
if (q < endp && *q != '\0' && *q != ',' && *q != quote
&& *q != '<' && *q != '>' && (unsigned char) *q >= 32) {
INCREMENT_CURRENT_ADR(q - html); // keep the automate in sync
ok = 1;
goto srcset_next;
}
}
}
/* We skipped bytes and skip the " : reset state */
/*if (inscript) {
inscript_state_pos = INSCRIPT_START;

@@ -47,3 +47,25 @@ match '*foo*bar' 'foozbar'
# '?' is the query-string marker, not a single-char wildcard
nomatch 'a?c' 'abc'
# backslash escapes a metacharacter inside a class so it is matched literally.
# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
# both X and '\'. These assertions pin that behavior.
match '*[\*]' '*'
match '*[\*]' "\\"
nomatch '*[\*]' 'a'
match '*[\\]' "\\"
nomatch '*[\\]' 'a'
match '*[\[]' '['
match '*[\[]' "\\"
nomatch '*[\[]' 'a'
# A literal ']' cannot be a class member: the class parser stops at the first
# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x'

tests/01_engine-parse.test (new executable file, 131 lines)

@@ -0,0 +1,131 @@
#!/bin/bash
#
# Offline HTML parser tests: each section crawls a file:// fixture (no network)
# and checks which assets the parser captured and how it rewrote the links.
set -u
tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_parse.XXXXXX") || exit 1
trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM
# a minimal valid 1x1 GIF, reused for every referenced asset
gif() {
printf 'GIF89a\1\0\1\0\200\0\0\0\0\0\377\377\377!\371\4\1\0\0\0\0,\0\0\0\0\1\0\1\0\0\2\2D\1\0;' >"$1"
}
# crawl <fixture-html> into <out> with link rewriting on, no extra fetching
crawl() {
local html="$1" out="$2"
rm -rf "$out"
mkdir -p "$out"
httrack "file://$html" -O "$out" --quiet --near -n >"$out/.log" 2>&1
}
# assert a file with the given basename was saved somewhere under <out>
found() {
test -n "$(find "$2" -type f -name "$1" -print -quit)" ||
! echo "FAIL: expected '$1' to be downloaded under $2" || exit 1
}
# assert NO file with the given basename was saved (e.g. a descriptor token must
# not be mistaken for a URL)
notfound() {
test -z "$(find "$2" -type f -name "$1" -print -quit)" ||
! echo "FAIL: '$1' should not have been downloaded under $2" || exit 1
}
# the mirrored fixture page (under "file/"), not HTTrack's own landing index
savedhtml() {
find "$1" -type f -path '*/file/*' -name index.html -print -quit
}
# srcset on <img> and <source> (#235, #236): every candidate captured and
# rewritten, descriptors preserved, following attributes left intact.
site="$tmp/srcset"
mkdir -p "$site"
for f in a b c d e f g h i j v dz; do gif "$site/$f.gif"; done
# unquoted heredoc: $site expands in the absolute-URL candidate
cat >"$site/index.html" <<EOF
<html><body>
<img src="a.gif" srcset="b.gif 480w, c.gif 800w">
<picture><source srcset="d.gif 1x, c.gif 2x"><img src="a.gif"></picture>
<img srcset="e.gif, f.gif">
<img srcset="g.gif 2x" alt="trailing attr after srcset">
<img srcset=" h.gif 2x , i.gif ">
<video><source src="v.gif"></video>
<img srcset="file://$site/j.gif 2x">
<img srcset="data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x, dz.gif 2x">
<img srcset="">
<a href="a.gif">plain link still works</a>
</body></html>
EOF
out="$tmp/srcset-out"
crawl "$site/index.html" "$out"
# every candidate downloads, incl. unique tails (catches first-only parsing),
# whitespace-padded (h,i), <source src> (v), absolute (j), post-data: URI (dz)
for f in a b c d e f g h i j v dz; do found "$f.gif" "$out"; done
# the width/density descriptors are not URLs and must not be fetched
notfound "480w" "$out"
notfound "800w" "$out"
notfound "2x" "$out"
saved=$(savedhtml "$out")
test -n "$saved" || ! echo "FAIL: saved index.html not found" || exit 1
# descriptors must survive the rewrite (no "b.gif 480w" mangled into a path)
grep -Eq 'srcset="[^"]*480w[^"]*800w' "$saved" ||
! echo "FAIL: srcset width descriptors lost/reordered in rewritten HTML" || exit 1
grep -Eq 'srcset="[^"]*1x[^"]*2x' "$saved" ||
! echo "FAIL: srcset density descriptors lost/reordered in rewritten HTML" || exit 1
# the descriptor-less comma form keeps both candidates and the separator verbatim
grep -Eq 'srcset="e\.gif, f\.gif"' "$saved" ||
! echo "FAIL: comma-separated srcset without descriptors was altered" || exit 1
# an attribute following srcset in the same tag must be left intact
grep -q 'alt="trailing attr after srcset"' "$saved" ||
! echo "FAIL: srcset swallowed a following attribute" || exit 1
# a comma inside a URL (data: URI, CDN path) is part of the URL, not a split
# point (WHATWG): the data: URI stays verbatim; the next candidate (dz) downloads
grep -Fq 'data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x' "$saved" ||
! echo "FAIL: a comma inside a data: URI srcset candidate was mis-split" || exit 1
# real rewrite, not passthrough: the absolute file:// candidate becomes local
# (a flat fixture can't show this; the footer comment's file:// is not in srcset)
grep -Eq 'srcset="j\.gif 2x"' "$saved" ||
! echo "FAIL: absolute file:// srcset URL was not rewritten to a local link" || exit 1
! grep -Eq 'srcset="[^"]*file://' "$saved" ||
! echo "FAIL: a file:// URL survived inside a rewritten srcset attribute" || exit 1
# xlink:href (#298) and inline background-image (#237): detected and rewritten
# to local; no-detect attributes (title, alt, ...) left untouched. Asserted by
# rewrite (deterministic), not download. data-* (#201/#203) is omitted: its
# detection is currently nondeterministic and can't be locked yet.
site2="$tmp/attrs"
mkdir -p "$site2"
for f in xl ibg tt; do gif "$site2/$f.gif"; done
cat >"$site2/index.html" <<EOF
<html><body>
<a xlink:href="file://$site2/xl.gif">xlink:href (#298)</a>
<div style="background-image:url(file://$site2/ibg.gif)"></div>
<span title="file://$site2/tt.gif">excluded attribute</span>
</body></html>
EOF
out2="$tmp/attrs-out"
crawl "$site2/index.html" "$out2"
saved2=$(savedhtml "$out2")
test -n "$saved2" || ! echo "FAIL: saved attrs page not found" || exit 1
# detected attributes: the absolute URL is rewritten to a local link
grep -Eq 'xlink:href="xl\.gif"' "$saved2" ||
! echo "FAIL #298: xlink:href not detected/rewritten" || exit 1
grep -Eq 'style="background-image:url\(ibg\.gif\)"' "$saved2" ||
! echo "FAIL #237: inline background-image url() not detected/rewritten" || exit 1
# excluded attribute: title is on the no-detect list, so its value is left as-is
grep -q 'title="file://' "$saved2" ||
! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
exit 0

@@ -9,6 +9,6 @@ TESTS_ENVIRONMENT += HTTPS_SUPPORT=$(HTTPS_SUPPORT)
TESTS_ENVIRONMENT += top_srcdir=$(top_srcdir)
TEST_EXTENSIONS = .test
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
CLEANFILES = check-network_sh.cache

@@ -472,7 +472,7 @@ TESTS_ENVIRONMENT = PATH=$(top_builddir)/src$(PATH_SEPARATOR)$$PATH \
ONLINE_UNIT_TESTS=$(ONLINE_UNIT_TESTS) \
HTTPS_SUPPORT=$(HTTPS_SUPPORT) top_srcdir=$(top_srcdir)
TEST_EXTENSIONS = .test
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
CLEANFILES = check-network_sh.cache
all: all-am