Compare commits

...

12 Commits

Author SHA1 Message Date
Xavier Roche
02796dff34 Fix macro-hygiene defects in htsstrings.h (dup define, double-eval)
StringSubRW was defined twice (the second under a redundant comment); drop
the duplicate. StringCatN and StringSetLength each evaluated their SIZE
argument twice, so a side-effecting argument would run twice and a wrapping
expression could clamp inconsistently. Capture SIZE once into a local, the
way StringCopyN already does. StringSetLength keeps a signed local so the
"negative means strlen(buffer_)" contract is preserved.

No current call site passes a side-effecting SIZE, so the change is
behavior-preserving; the strsafe self-test now passes a (counter++, V)
argument and asserts a single evaluation, which fails on the old macros.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-29 20:16:30 +02:00
Xavier Roche
13b31986d5 tests: skip connect-fallback test on GNU/Hurd (#461)
19_local-connect-fallback synthesizes a multi-address host by pinning
"deadhost" to a dead 127.0.0.2 then the live 127.0.0.1, so httrack's
per-address connect fallback has somewhere to fall back to. That fixture
needs a second loopback IP: the dead and live addresses share the URL's
port (htslib.c applies it to every resolved address), so they must differ
by IP. Linux and macOS route all of 127/8 to loopback; GNU/Hurd has only
127.0.0.1, so the dead address fails synchronously, the fallback never
engages, and the test fails on hurd-i386.

This is a fixture limitation, not a bug in the fallback code, so skip on
GNU/Hurd (uname -s = GNU). A runtime bind/connect probe was tried first
but both wrongly skipped macOS, which connects to 127.0.0.2 fine but does
not treat it as bindable; the one-loopback-IP fact is what actually
distinguishes Hurd. hurd-i386 is a non-release port and did not block
migration.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 19:40:07 +02:00
Xavier Roche
bd7e0989f6 Parse robots.txt with RFC 9309 Allow/Disallow precedence (#458)
The robots.txt handler only did substring Disallow matching against a flat
token blob: no Allow:, no path wildcards. Sites using "Disallow: /" plus
"Allow: /public/" were over-blocked, since Allow was never parsed.

Move the body parsing into robots_parse() (htsrobots.c) so both the crawler
and a self-test feed raw robots.txt. Rules are stored Allow/Disallow-tagged
and consulted with RFC 9309 precedence: the longest matching path pattern
wins, Allow breaking ties. Pattern matching supports '*' (any run) and a
trailing '$' (end-of-path anchor) via a linear two-pointer matcher with a
single resumable star position, so hostile patterns cannot trigger
exponential backtracking. Path matching is now case-sensitive per the RFC.

robots_wizard is internal (not in DevIncludes_DATA, no HTSEXT_API; htsopt.h
holds only an opaque pointer), so the in-memory format changed without an ABI
break. Sitemap:/Crawl-delay: lines are tolerated but ignored, as before.

New -#test=robots self-test plus tests/01_engine-robots.test cover the
Allow-over-Disallow longest match, the equal-length Allow tie, '*'/'$'
wildcards, and httrack-group selection.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 09:07:54 +02:00
Xavier Roche
bd74ec7cab Advertise deflate in Accept-Encoding and decode it (#459)
The request Accept-Encoding offered only gzip even though the response
parser already recognized deflate/x-deflate. But the actual decode path
(hts_zunpack) used zlib's gzread, which only inflates gzip and copies any
deflate body through verbatim, so a deflate response would have been
written out still compressed. Advertising deflate without fixing that
would corrupt files.

Rewrite hts_zunpack to inflate via inflateInit2 with format detection:
gzip and zlib (RFC1950) auto-detect with +32 windowBits, everything else
is treated as raw deflate (RFC1951). Then add deflate to the advertised
list through a small hts_acceptencoding() helper shared with the test.

A new -#test=acceptencoding self-test asserts the advertised header
carries both gzip and deflate, and round-trips gzip, zlib and raw-deflate
bodies through hts_zunpack on disk. Both halves fail on the old binary.

Brotli is intentionally out of scope (new dependency, larger change).

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 08:54:03 +02:00
Xavier Roche
1ed8ffad64 tests: group zlib-dependent self-tests under 01_zlib-* (#460)
The MSan job runs the offline 01_engine-* self-tests but must skip any that
exercise the system zlib: uninstrumented libz floods MemorySanitizer with
false positives (MSan can't see libz initialize its own internal state, so
every byte deflate/inflate produces reads as "uninitialized"). That was a
grep -v -- '-cache' exclusion, a list that would grow with each new zlib test.

Rename the three cache tests to a 01_zlib-* prefix so the MSan job selects
01_engine-* with no exclusion list. They still run in the normal suite and
under ASan+UBSan (full make check), where uninstrumented libz is fine. The
deflate Accept-Encoding test (PR #459) follows the same convention.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 08:39:43 +02:00
Xavier Roche
b68de172fa Follow <source>/<track> as embedded near-links (#451) (#457)
src/srcset are already extracted from any element via hts_detect[], so
<source>/<track> URLs were captured. But hts_detect_embed only listed
<img src>, so at the recursion-depth boundary under --near these media
were treated as plain links and dropped, unlike <img>. Add a separate
HTML5 media table (keeping the legacy one clang-format-stable) and chain
the embed lookup through both.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 00:18:28 +02:00
Xavier Roche
aabfd34380 Refresh minor legacy constants (#453) (#456)
Three independent low-risk items under tracking issue #453:

- HTTP status enum/reason map: add 429 (Too Many Requests) and 451
  (Unavailable For Legal Reasons); appended to the HTTPStatusCode enum
  (installed header, tail-append only) and to infostatuscode_const().
- Drop the 1990s relic port 31337 from the local proxy bind fallback list.
- Pin a TLS protocol floor (TLS1.2 via SSL_CTX_set_min_proto_version, with an
  SSL_OP_NO_* fallback for OpenSSL < 1.1.0) so obsolete SSLv3/TLS1.0/1.1 aren't
  negotiated. No cert verification added: that remains by design.

Adds a -#test=status engine self-test (01_engine-status.test) asserting the
reason phrases for 429/451.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 00:12:17 +02:00
Xavier Roche
65ff9e0f11 Modernize the default User-Agent (#449) (#455)
The default UA was "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
a 25-year-old string naming a dead OS and trivially fingerprinted. Replace
it with an honest crawler UA carrying the HTTrack token and a reference URL:
"Mozilla/5.0 (compatible; HTTrack/3.x; +https://www.httrack.com/)".

Both the engine default (hts_create_opt) and the webhttrack mini-server
config default now share one HTS_DEFAULT_USER_AGENT macro, built from
HTTRACK_AFF_VERSION so the version token tracks releases without going stale.

Closes #449

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 23:18:28 +02:00
Xavier Roche
730a1c8c5b Add modern web MIME types to the type/extension table (#454)
The MIME ⇄ extension table in htslib.c had no entry for formats that are now
common: webp, avif, woff/woff2, json, wasm, mp4/webm, opus/flac, and friends.
A crawl that met these relied on the application/<subtype> fallback (or got no
extension at all), so saved files landed with wrong or missing extensions and
get_httptype guessed nothing from the extension.

The new rows live in a separate hts_mime_modern[] table rather than appended to
the legacy hts_mime[]: clang-format reflows a whole brace initializer on any
in-place edit, which would churn every untouched legacy row. A small
hts_mime_lookup() helper scans a table in either direction; give_mimext() and
get_httptype_sized() now consult the legacy table first, then the modern one,
so a new row can never shadow a legacy mapping. Legacy aliases (flash,
realaudio) stay for archiving old content.

Self-test covers both lookups; the give_mimext cases use MIME types the
application/<=4-char-subtype fallback can't fabricate, so a missing row
actually fails the assert.

Closes #448

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 22:53:48 +02:00
Xavier Roche
f9ee4702a2 tests: skip 28_local-pause without python3; add a buildd-reproducing CI job (#445)
* tests: skip 28_local-pause when python3 is absent (Debian buildd)

The local-server tests all skip with exit 77 when python3 is missing, but
28_local-pause runs local-crawl.sh inside a command-substitution with output
redirected to /dev/null, swallowing that exit-77 skip signal. On a host with
no python3 (every Debian buildd) both crawls then run serverless, finish in
0s, and the test reports FAIL instead of SKIP, turning 3.49.10-1 red on all
architectures. Add the same python3 guard the other tests get via
local-crawl.sh, up front, so it skips cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* ci: add a no-python3 job reproducing the Debian buildd chroot

GitHub runners ship python3, so every make-check job exercised the
python3-present path and the local-server tests never skipped. The Debian
buildds build in a minimal chroot with no python3, where those tests must
SKIP (exit 77) -- and 28_local-pause failed there instead, FTBFS on every
arch for 3.49.10-1, invisible to CI.

Add a job that builds, removes python3, and runs make check, so the skip
path is exercised on every PR. Verified to fail on the pre-fix tree and pass
after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 20:12:49 +02:00
Xavier Roche
cca83e5f4a Modernize HTML entity decoding to the WHATWG named character references (#444)
* Modernize HTML entity decoding to the WHATWG named character references

Regenerate htsentities.h from the WHATWG entities.json (2032 single-codepoint
names) instead of the 1998 HTML 4.0 set (252 names). The dispatch hash moves
from a 32-bit LCG to 64-bit FNV-1a; the generator now aborts on any (hash,len)
collision, so the hash-only switch stays correct without a runtime name compare.
Bump the consumer name-length cap from 10 to 31, the longest name
(CounterClockwiseContourIntegral), or long names would be rejected outright.
Multi-codepoint references (~93 obscure math entities) can't fit the
single-codepoint return and are skipped, left verbatim as before.

Also fix the dead ftp://ftp.unicode.org URLs in htsbasiccharsets.sh.

Closes #443

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* entities: harden the generator collision guard and widen test coverage

Review follow-up. The switch keys on the hash alone, so check hash-alone
uniqueness among emitted names (a same-hash/different-len pair would otherwise
slip the old (hash,len) check and surface only as a cryptic duplicate-case
compile error). Also hash the ~93 skipped multi-codepoint names and abort if any
aliases an emitted hash, so "skipped means verbatim" is enforced rather than
assumed on future regens.

Add a runtime sweep of common HTML4 names (copy/reg/trade/mdash/ndash/alpha/beta)
to 01_engine-entities.test: a regression guard against accidental drops and a
generator-vs-consumer hash cross-check on names beyond the handful already
probed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:29:03 +02:00
Xavier Roche
97f398e508 Release 3.49.10 (#442)
Bump the package version to 3.49.10 and curate the release notes.

VERSION_INFO goes 3:1:0 -> 3:2:0: the cycle only appended tail fields to the
installed options struct (--cookies-file, --pause, --strip-query, the -%u
split), no existing symbol or offset changed, so the soname stays .so.3.

history.txt gets the 3.49-10 block; debian/changelog gets 3.49.10-1 with the
Debian-specific items (DEP-5 copyright, chromium-first browser dep, minizip
embedded-library override). Standards-Version 4.7.0 -> 4.7.4: the intervening
Policy changes (usr-merge, /usr/games, Priority recommendation, -dev linker
scripts, non-free-firmware) need no package change.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 14:26:50 +02:00
35 changed files with 13129 additions and 1741 deletions

View File

@@ -61,6 +61,50 @@ jobs:
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Reproduce the Debian buildds: they build in a minimal chroot with no
# python3, so the local-server tests must SKIP (exit 77), not fail. GitHub
# runners ship python3, so every other job hides this path; here we remove it
# before `make check`. This is the guard that would have caught the 3.49.10-1
# FTBFS (28_local-pause failed instead of skipping when python3 was absent).
buildd-no-python3:
name: build (no python3, Debian buildd)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev
- name: Configure
run: |
set -euo pipefail
autoreconf -fi
./configure
- name: Build
run: make -j"$(nproc)"
- name: Test without python3
run: |
set -euo pipefail
# Hide every python3* so `command -v python3` fails like it does in the
# buildd chroot; masking with /bin/false would still resolve.
sudo find /usr/bin /usr/local/bin -maxdepth 1 -name 'python3*' \
-exec mv {} {}.hidden \;
! command -v python3
make check
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Portability: build and test on macOS (Darwin/clang) on a native runner --
# no VM. The tree has no __APPLE__ branches, so Darwin exercises the
# generic-Unix path on a second libc and kernel. brew's openssl@3 is keg-only,
@@ -225,8 +269,9 @@ jobs:
MSAN_OPTIONS: abort_on_error=1:halt_on_error=1
run: |
set -euo pipefail
# Engine self-tests only; the cache trio pulls in uninstrumented zlib.
tests="$(cd tests && ls 01_engine-*.test | grep -v -- '-cache' | tr '\n' ' ')"
# 01_engine-* only; zlib-dependent self-tests are named 01_zlib-* and
# skipped here (uninstrumented libz floods MSan with false positives).
tests="$(cd tests && ls 01_engine-*.test | tr '\n' ' ')"
make check TESTS="$tests"
- name: Print the test log on failure

View File

@@ -1,6 +1,6 @@
AC_PREREQ([2.71])
AC_INIT([httrack], [3.49.9], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
AC_INIT([httrack], [3.49.10], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
AC_COPYRIGHT([
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 1998-2015 Xavier Roche and other contributors
@@ -29,10 +29,10 @@ AC_CONFIG_SRCDIR(src/httrack.c)
AC_CONFIG_MACRO_DIR([m4])
AC_CONFIG_HEADERS(config.h)
AM_INIT_AUTOMAKE([subdir-objects])
# 3:1:0: 3.49.9 changed code but not the exported interface vs 3.49.8 (same 164
# symbols, no struct-layout change), so bump revision only. (3:0:0 was the htsblk
# mime-buffer widening, an ABI break that moved the soname .so.2 -> .so.3.)
VERSION_INFO="3:1:0"
# 3:2:0: 3.49.10 only appends tail fields to the options struct (no existing
# symbol or offset changed vs 3.49.9), so it stays soname .so.3; bump revision.
# (3:0:0 was the htsblk mime-buffer widening, the ABI break that moved .so.2 -> .so.3.)
VERSION_INFO="3:2:0"
AM_MAINTAINER_MODE
AC_USE_SYSTEM_EXTENSIONS

13
debian/changelog vendored
View File

@@ -1,3 +1,16 @@
httrack (3.49.10-1) unstable; urgency=medium
* New upstream release: new download-pacing and URL-handling options plus a
batch of crawl and robustness fixes (full list in history.txt).
* Rewrite debian/copyright in machine-readable DEP-5 format, crediting the
bundled minizip, md5 and coucal sources (#415).
* Lead the webhttrack browser dependency with chromium so httrack is not
dragged into the firefox-esr autoremoval cascade (#436).
* Override the embedded-library lint for the bundled minizip (#419).
* Bump Standards-Version to 4.7.4 (no changes required).
-- Xavier Roche <xavier@debian.org> Sun, 28 Jun 2026 14:01:53 +0200
httrack (3.49.9-1) unstable; urgency=medium
* New upstream release: Content-Type and file-type detection fixes (trust a

2
debian/control vendored
View File

@@ -2,7 +2,7 @@ Source: httrack
Section: web
Priority: optional
Maintainer: Xavier Roche <roche@httrack.com>
Standards-Version: 4.7.0
Standards-Version: 4.7.4
Build-Depends: debhelper-compat (= 13), autoconf, autoconf-archive, automake, libtool, zlib1g-dev, libssl-dev
Rules-Requires-Root: no
Homepage: http://www.httrack.com

View File

@@ -4,7 +4,25 @@ HTTrack Website Copier release history:
This file lists all changes and fixes that have been made for HTTrack
3.49-9
3.49-10
+ New: --cookies-file to preload a Netscape cookies.txt before crawling (#215)
+ New: --pause to space out file downloads by a random delay (#185)
+ New: --strip-query to drop selected query keys from the dedup naming (#112)
+ Changed: split the -%u URL hacks into independent --keep-www-prefix, --keep-double-slashes and --keep-query-order toggles (#271)
+ Fixed: follow a redirect Location after dropping its #fragment, instead of requesting the fragment and polluting the saved name (#204)
+ Fixed: escaped brackets inside a *[...] filter character class (#148)
+ Fixed: honor the server's Content-Range when resuming a partial download, instead of appending overlapping bytes (#198)
+ Fixed: abort the download as soon as the response type is excluded by -mime:, instead of fetching then discarding the body (#58)
+ Fixed: keep size-based filter rules neutral until the file size is known (#143)
+ Fixed: stop the mirror with a clean fatal error on a cache write failure, instead of crashing (#174, #219)
+ Fixed: stop the 412/416 partial re-get loop on --continue and --update (#206)
+ Fixed: keep an unrecognized URL tail instead of mangling it to .html (#115)
+ Fixed: honor --tolerant (-%B) on a broken Content-Length, and fix an out-of-bounds read it exposed (#32, #41)
+ Fixed: fall back to the next resolved address when a connection fails or stalls, instead of hanging on a dead IPv6 address
+ Fixed: report why a -%L URL list could not be loaded (#49)
+ Changed: multiple internal hardening, build and CI improvements
.49-9
+ Fixed: file-type detection from the Content-Type header: trust a declared type over a binary URL extension, honor --assume under the delayed type check, and keep a known extension against a bogus or empty Content-Type (#267, #29, #56)
+ Fixed: an uninitialized-buffer read when the Content-Type is empty (#411)
+ Fixed: restored C++ source-compatibility of the installed headers so reverse dependencies (httraqt) build again (#413)

View File

@@ -129,6 +129,8 @@ typedef enum HTTPStatusCode {
HTTP_UNSUPPORTED_MEDIA_TYPE = 415,
HTTP_REQUESTED_RANGE_NOT_SATISFIABLE = 416,
HTTP_EXPECTATION_FAILED = 417,
HTTP_TOO_MANY_REQUESTS = 429,
HTTP_UNAVAILABLE_FOR_LEGAL_REASONS = 451,
HTTP_INTERNAL_SERVER_ERROR = 500,
HTTP_NOT_IMPLEMENTED = 501,
HTTP_BAD_GATEWAY = 502,

View File

@@ -3,12 +3,12 @@
# Change this to download files
if false; then
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
fi

View File

@@ -64,7 +64,7 @@ Please visit our Website: http://www.httrack.com
// catch_url_init(&port,&return_host);
HTSEXT_API T_SOC catch_url_init_std(int *port_prox, char *adr_prox) {
T_SOC soc;
int try_to_listen_to[] = { 8080, 3128, 80, 81, 82, 8081, 3129, 31337, 0, -1 };
int try_to_listen_to[] = {8080, 3128, 80, 81, 82, 8081, 3129, 0, -1};
int i = 0;
do {

View File

@@ -1796,90 +1796,18 @@ int httpmirror(char *url1, httrackp * opt) {
if (strnotempty(savename()) == 0) { // pas de chemin de sauvegarde
if (strcmp(urlfil(), "/robots.txt") == 0) { // robots.txt
if (r.adr) {
int bptr = 0;
char BIGSTK line[1024];
char BIGSTK buff[8192];
char BIGSTK infobuff[8192];
int record = 0;
line[0] = '\0';
buff[0] = '\0';
infobuff[0] = '\0';
//
#if DEBUG_ROBOTS
printf("robots.txt dump:\n%s\n", r.adr);
#endif
do {
char *comm;
int llen;
bptr += binput(r.adr + bptr, line, sizeof(line) - 2);
/* strip comment */
comm = strchr(line, '#');
if (comm != NULL) {
*comm = '\0';
}
/* strip spaces */
llen = (int) strlen(line);
while(llen > 0 && is_realspace(line[llen - 1])) {
line[llen - 1] = '\0';
llen--;
}
if (strfield(line, "user-agent:")) {
char *a;
a = line + 11;
while(is_realspace(*a))
a++; // sauter espace(s)
if (*a == '*') {
if (record != 2)
record = 1; // c pour nous
} else if (strfield(a, "httrack") || strfield(a, "winhttrack")
|| strfield(a, "webhttrack")) {
buff[0] = '\0'; // re-enregistrer
infobuff[0] = '\0';
record = 2; // locked
#if DEBUG_ROBOTS
printf("explicit disallow for httrack\n");
#endif
} else
record = 0;
} else if (record) {
if (strfield(line, "disallow:")) {
char *a = line + 9;
while(is_realspace(*a))
a++; // sauter espace(s)
if (strnotempty(a)) {
#ifdef IGNORE_RESTRICTIVE_ROBOTS
if (strcmp(a, "/") != 0 ||
opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
hts_boolean keep_root = (opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
? HTS_TRUE
: HTS_FALSE;
#else
hts_boolean keep_root = HTS_TRUE;
#endif
{ /* ignoring disallow: / */
if ((strlen(buff) + strlen(a) + 8) < sizeof(buff)) {
strcatbuff(buff, a);
strcatbuff(buff, "\n");
if ((strlen(infobuff) + strlen(a) + 8) <
sizeof(infobuff)) {
if (strnotempty(infobuff))
strcatbuff(infobuff, ", ");
strcatbuff(infobuff, a);
}
}
}
#ifdef IGNORE_RESTRICTIVE_ROBOTS
else {
hts_log_print(opt, LOG_NOTICE,
"Note: %s robots.txt rules are too restrictive, ignoring /",
urladr());
}
#endif
}
}
}
} while((bptr < r.size) && (strlen(buff) < (sizeof(buff) - 32)));
if (strnotempty(buff)) {
checkrobots_set(&robots, urladr(), buff);
robots_parse(&robots, urladr(), r.adr, r.size, infobuff,
sizeof(infobuff), keep_root);
if (strnotempty(infobuff)) {
hts_log_print(opt, LOG_INFO,
"Note: robots.txt forbidden links for %s are: %s",
urladr(), infobuff);

View File

@@ -30,12 +30,14 @@ Please visit our Website: http://www.httrack.com
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#include <stdint.h>
#include "htscharset.h"
#include "htsencoding.h"
#include "htssafe.h"
/* static int decode_entity(const unsigned int hash, const size_t len);
*/
/* static int decode_entity(const uint64_t hash, const size_t len);
*/
#include "htsentities.h"
/* hexadecimal conversion */
@@ -50,30 +52,31 @@ static int get_hex_value(char c) {
return -1;
}
/* Numerical Recipes,
see <http://en.wikipedia.org/wiki/Linear_congruential_generator> */
#define HASH_PRIME ( 1664525 )
#define HASH_CONST ( 1013904223 )
#define HASH_ADD(HASH, C) do { \
(HASH) *= HASH_PRIME; \
(HASH) += HASH_CONST; \
(HASH) += (C); \
} while(0)
/* 64-bit FNV-1a; must match htsentities.sh, which keys the entity table on it.
*/
#define HASH_INIT 0xcbf29ce484222325ULL
#define HASH_PRIME 0x100000001b3ULL
#define HASH_ADD(HASH, C) \
do { \
(HASH) ^= (unsigned char) (C); \
(HASH) *= HASH_PRIME; \
} while (0)
int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t max, const char *charset) {
size_t i, j, ampStart, ampStartDest;
int uc;
int hex;
unsigned int hash;
uint64_t hash;
assertf(max != 0);
for(i = 0, j = 0, ampStart = (size_t) -1, ampStartDest = 0,
uc = -1, hex = 0, hash = 0 ; src[i] != '\0' ; i++) {
for (i = 0, j = 0, ampStart = (size_t) -1, ampStartDest = 0, uc = -1, hex = 0,
hash = HASH_INIT;
src[i] != '\0'; i++) {
/* start of entity */
if (src[i] == '&') {
ampStart = i;
ampStartDest = j;
hash = 0;
hash = HASH_INIT;
uc = -1;
}
/* inside a potential entity */
@@ -174,14 +177,11 @@ int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t ma
}
/* alphanumerical entity */
else {
/* alphanum and not too far ('&thetasym;' is the longest) */
if (i <= ampStart + 10 &&
(
(src[i] >= '0' && src[i] <= '9')
|| (src[i] >= 'A' && src[i] <= 'Z')
|| (src[i] >= 'a' && src[i] <= 'z')
)
) {
/* alphanum, capped at the longest name
* '&CounterClockwiseContourIntegral;' (31) */
if (i <= ampStart + 31 && ((src[i] >= '0' && src[i] <= '9') ||
(src[i] >= 'A' && src[i] <= 'Z') ||
(src[i] >= 'a' && src[i] <= 'z'))) {
/* compute hash */
HASH_ADD(hash, (unsigned char) src[i]);
} else {

File diff suppressed because it is too large Load Diff

View File

@@ -1,75 +1,92 @@
#!/bin/bash
#
# Regenerate htsentities.h from the WHATWG named character references.
src=html40.txt
url=http://www.w3.org/TR/1998/REC-html40-19980424/html40.txt
set -euo pipefail
src=entities.json
url=https://html.spec.whatwg.org/entities.json
dest=htsentities.h
(
cat <<EOF
/*
-- ${dest} --
FILE GENERATED BY $0, DO NOT MODIFY
# 64-bit FNV-1a of $1, printed as a C constant. Must match the hash in
# htsencoding.c. The offset basis is stored as its wrapped (signed) bit pattern;
# bash arithmetic is 64-bit two's complement, so the result is bit-exact.
fnv1a() {
local s=$1 i c h=$((0xcbf29ce484222325))
for ((i = 0; i < ${#s}; i++)); do
printf -v c '%d' "'${s:i:1}"
h=$(((h ^ (c & 0xff)) * 0x100000001b3))
done
printf '0x%016xULL' "$h"
}
We compute the LCG hash
(see <http://en.wikipedia.org/wiki/Linear_congruential_generator>)
for each entity. We should in theory check using strncmp() that we
actually have the correct entity, but this is actually statistically
not needed.
if [ ! -f "$src" ]; then
curl -fsS "$url" -o "$src"
fi
We may want to do better, but we expect the hash function to be uniform, and
let the compiler be smart enough to optimize the switch (for example by
checking in log2() intervals)
This code has been generated using the evil $0 script.
*/
# Keep ';'-terminated single-codepoint names; the ~93 multi-codepoint refs can't
# fit decode_entity's single-codepoint return and are skipped (left verbatim).
pairs=$(jq -r '
to_entries
| map(select((.key | endswith(";")) and (.value.codepoints | length == 1)))
| sort_by(.key)
| .[] | "\(.key | ltrimstr("&") | rtrimstr(";"))\t\(.value.codepoints[0])"' "$src")
static int decode_entity(const unsigned int hash, const size_t len) {
# Skipped multi-codepoint names, kept to prove none aliases an emitted hash.
skipped=$(jq -r '
to_entries
| map(select((.key | endswith(";")) and (.value.codepoints | length > 1)))
| .[] | .key | ltrimstr("&") | rtrimstr(";")' "$src")
cases=""
emit_hashes=""
while IFS=$'\t' read -r name cp; do
hash=$(fnv1a "$name")
cases+=" /* $name */"$'\n'
cases+=" case $hash:"$'\n'
cases+=" if (len == ${#name}) {"$'\n'
cases+=" return $cp;"$'\n'
cases+=" }"$'\n'
cases+=" break;"$'\n'
emit_hashes+="$hash"$'\n'
done <<<"$pairs"
skip_hashes=""
while IFS= read -r name; do
[ -n "$name" ] && skip_hashes+="$(fnv1a "$name")"$'\n'
done <<<"$skipped"
# The switch keys on the hash alone, so the dispatch is correct only while every
# emitted name hashes uniquely; prove it here, no runtime name compare needed.
dups=$(printf '%s' "$emit_hashes" | sort | uniq -d || true)
if [ -n "$dups" ]; then
echo "FATAL: two entity names share a hash (duplicate switch case); change the hash:" >&2
echo "$dups" >&2
exit 1
fi
# A skipped name colliding with an emitted hash would mis-decode instead of
# staying verbatim; forbid that too.
aliased=$(comm -12 <(printf '%s' "$emit_hashes" | sort -u) <(printf '%s' "$skip_hashes" | sort -u) || true)
if [ -n "$aliased" ]; then
echo "FATAL: a skipped multi-codepoint name aliases an emitted hash:" >&2
echo "$aliased" >&2
exit 1
fi
cat >"$dest" <<EOF
/* GENERATED by $0 from the WHATWG named character references
(${url}). DO NOT EDIT.
Dispatch keys on a 64-bit FNV-1a hash of the entity name; the generator
aborts on any hash collision, so no runtime name compare is needed. */
#include <stdint.h>
static int decode_entity(const uint64_t hash, const size_t len) {
switch(hash) {
EOF
(
if test -f ${src}; then
cat ${src}
else
GET "${url}"
fi
) |
grep -E '^<!ENTITY [a-zA-Z0-9_]' |
sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
(
read -r A
while test -n "$A"; do
ent="${A%% *}"
code=$(echo "$A" | cut -f2 -d' ')
# compute hash
hash=0
i=0
a=1664525
c=1013904223
m="$((1 << 32))"
while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
hash="$((((hash * a) % (m) + d + c) % (m)))"
i=$((i + 1))
done
echo -e " /* $A */"
echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};"
echo -e " }"
echo -e " break;"
# next
read -r A
done
)
cat <<EOF
}
${cases} }
/* unknown */
return -1;
}
EOF
) >${dest}
echo "wrote $dest ($(grep -c '^ case ' "$dest") entities)" >&2

View File

@@ -43,8 +43,8 @@ Please visit our Website: http://www.httrack.com
configure.ac, decoupled from these). VERSION is the display form, VERSIONID
the dotted numeric form, AFF_VERSION the short form shown in footers,
LIB_VERSION the data/cache format generation. */
#define HTTRACK_VERSION "3.49-9"
#define HTTRACK_VERSIONID "3.49.9"
#define HTTRACK_VERSION "3.49-10"
#define HTTRACK_VERSIONID "3.49.10"
#define HTTRACK_AFF_VERSION "3.x"
#define HTTRACK_LIB_VERSION "2.0"
@@ -229,6 +229,10 @@ Please visit our Website: http://www.httrack.com
#define HTS_DEFAULT_FOOTER \
"<!-- Mirrored from %s%s by HTTrack Website Copier/" HTTRACK_AFF_VERSION \
" " HTTRACK_AFF_AUTHORS ", %s -->"
/* Honest crawler User-Agent; no fake OS/browser to go stale. */
#define HTS_DEFAULT_USER_AGENT \
"Mozilla/5.0 (compatible; HTTrack/" HTTRACK_AFF_VERSION \
"; +https://www.httrack.com/)"
#define HTTRACK_WEB "http://www.httrack.com"
#define HTS_UPDATE_WEBSITE \
"http://www.httrack.com/" \

View File

@@ -563,6 +563,39 @@ const char *hts_mime[][2] = {
{"", ""}
};
/* Modern web formats (post-2010), kept in their own table: appending to the
legacy hts_mime[] above makes clang-format reflow its whole initializer.
Scanned after hts_mime[], so it never shadows a legacy mapping. */
static const char *hts_mime_modern[][2] = {
{"image/webp", "webp"},
{"image/avif", "avif"},
{"image/heic", "heic"},
{"font/woff", "woff"},
{"font/woff2", "woff2"},
{"font/ttf", "ttf"},
{"font/otf", "otf"},
{"application/json", "json"},
{"application/ld+json", "jsonld"},
{"application/manifest+json", "webmanifest"},
{"application/wasm", "wasm"},
{"text/javascript", "js"},
{"text/javascript", "mjs"},
{"text/markdown", "md"},
{"video/mp4", "mp4"},
{"video/webm", "webm"},
{"video/ogg", "ogv"},
{"video/mp2t", "ts"},
{"audio/mp4", "m4a"},
{"audio/aac", "aac"},
{"audio/ogg", "oga"},
{"audio/opus", "opus"},
{"audio/flac", "flac"},
{"audio/webm", "weba"},
{"application/x-7z-compressed", "7z"},
{"application/x-rar-compressed", "rar"},
{"application/zstd", "zst"},
{"", ""}};
// Reserved (RFC2396)
#define CIS(c,ch) ( ((unsigned char)(c)) == (ch) )
#define CHAR_RESERVED(c) ( CIS(c,';') \
@@ -1293,16 +1326,12 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
// Compression accepted ?
if (retour->req.http11) {
hts_boolean compressible = HTS_FALSE;
#if HTS_USEZLIB
if ((!retour->req.range_used)
&& (!retour->req.nocompression))
print_buffer(&bstr, "Accept-Encoding: " "gzip" /* gzip if the preffered encoding */
", " "identity;q=0.9" H_CRLF);
else
print_buffer(&bstr, "Accept-Encoding: identity" H_CRLF); /* no compression */
#else
print_buffer(&bstr, "Accept-Encoding: identity" H_CRLF); /* no compression */
compressible = (!retour->req.range_used && !retour->req.nocompression);
#endif
print_buffer(&bstr, "Accept-Encoding: %s" H_CRLF,
hts_acceptencoding(compressible));
}
/* Authentification */
@@ -1918,6 +1947,10 @@ HTSEXT_API const char *infostatuscode_const(int statuscode) {
return "Requested Range Not Satisfiable";
case 417:
return "Expectation Failed";
case 429:
return "Too Many Requests";
case 451:
return "Unavailable For Legal Reasons";
case 500:
return "Internal Server Error";
case 501:
@@ -4308,6 +4341,20 @@ void guess_httptype(httrackp * opt, char *s, const char *fil) {
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, 1);
}
// first match in a NUL-terminated {mime,ext} table. key selects the lookup
// column (0=mime, 1=ext); returns the other column, or NULL if no row matches
// (a "*" partner means the row carries no value).
static const char *hts_mime_lookup(const char *(*table)[2], int key,
const char *needle) {
int j;
for (j = 0; strnotempty(table[j][1]); j++) {
if (strfield2(table[j][key], needle) && table[j][!key][0] != '*')
return table[j][!key];
}
return NULL;
}
// write the mime type for fil into s (capacity ssize)
// flag: 1 to always return a type (the "application/..." / octet-stream
// fallback) returns 1 if a type was written to s, 0 otherwise
@@ -4331,17 +4378,15 @@ HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
while ((a > fil) && (*a != '.') && (*a != '/'))
a--;
if (a >= fil && *a == '.' && strlen(a) < 32) {
int j = 0;
const char *mime;
a++;
while(strnotempty(hts_mime[j][1])) {
if (strfield2(hts_mime[j][1], a)) {
if (hts_mime[j][0][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][0], ssize);
return 1;
}
}
j++;
mime = hts_mime_lookup(hts_mime, 1, a);
if (mime == NULL)
mime = hts_mime_lookup(hts_mime_modern, 1, a);
if (mime != NULL) {
strlcpybuff(s, mime, ssize);
return 1;
}
if (flag) {
@@ -4365,6 +4410,11 @@ HTSEXT_API void get_httptype(httrackp *opt, char *s, const char *fil,
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, flag);
}
/* Advertised Accept-Encoding; gzip and deflate both decode via hts_zunpack */
const char *hts_acceptencoding(hts_boolean compressible) {
return compressible ? "gzip, deflate, identity;q=0.9" : "identity";
}
// get type of fil (php)
// s: buffer (text/html) or NULL
// return: 1 if known by user
@@ -4476,18 +4526,16 @@ int get_userhttptype(httrackp * opt, char *s, const char *fil) {
// returns 1 if an extension was found (and written to s), 0 otherwise
int give_mimext(char *s, size_t ssize, const char *st) {
int ok = 0;
int j = 0;
const char *ext;
st = hts_effective_mime(st); /* no declared type: derive an html ext */
s[0] = '\0';
while((!ok) && (strnotempty(hts_mime[j][1]))) {
if (strfield2(hts_mime[j][0], st)) {
if (hts_mime[j][1][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][1], ssize);
ok = 1;
}
}
j++;
ext = hts_mime_lookup(hts_mime, 0, st);
if (ext == NULL)
ext = hts_mime_lookup(hts_mime_modern, 0, st);
if (ext != NULL) {
strlcpybuff(s, ext, ssize);
ok = 1;
}
// wrap "x" mimetypes, such as:
// application/x-mp3
@@ -5754,6 +5802,13 @@ HTSEXT_API int hts_init(void) {
abortLog("unable to initialize TLS: SSL_CTX_new()");
assertf("unable to initialize TLS" == NULL);
}
/* Pin a TLS floor (no SSLv3/TLS1.0/1.1); no cert verify, by design. */
#if OPENSSL_VERSION_NUMBER >= 0x10100000L
SSL_CTX_set_min_proto_version(openssl_ctx, TLS1_2_VERSION);
#else
SSL_CTX_set_options(openssl_ctx, SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3 |
SSL_OP_NO_TLSv1 | SSL_OP_NO_TLSv1_1);
#endif
}
#endif
@@ -6005,8 +6060,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->shell = HTS_FALSE;
opt->proxy.active = 0; // pas de proxy
opt->user_agent_send = HTS_TRUE;
StringCopy(opt->user_agent,
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->user_agent, HTS_DEFAULT_USER_AGENT);
StringCopy(opt->referer, "");
StringCopy(opt->from, "");
opt->savename_83 = HTS_SAVENAME_83_LONG; // long names by default

View File

@@ -285,6 +285,9 @@ int ishttperror(int err);
int get_userhttptype(httrackp * opt, char *s, const char *fil);
int give_mimext(char *s, size_t ssize, const char *st);
/* Advertised Accept-Encoding value (no header name/CRLF); see htslib.c. */
const char *hts_acceptencoding(hts_boolean compressible);
int may_bogus_multiple(httrackp * opt, const char *mime, const char *filename);
int may_unknown2(httrackp * opt, const char *mime, const char *filename);

View File

@@ -44,28 +44,84 @@ Please visit our Website: http://www.httrack.com
// -- robots --
/* RFC 9309 path-prefix match; '*' any run, '$' anchors end; linear. */
static hts_boolean robots_pattern_match(const char *pattern, const char *path) {
size_t patlen = strlen(pattern);
hts_boolean anchored = HTS_FALSE;
const char *p, *pend, *s;
const char *star = NULL, *star_s = NULL;
if (patlen > 0 && pattern[patlen - 1] == '$') {
anchored = HTS_TRUE;
patlen--;
}
p = pattern;
pend = pattern + patlen;
s = path;
while (*s != '\0') {
if (p == pend) {
if (!anchored)
return HTS_TRUE; // prefix matched
if (star != NULL) { // anchored: '*' must eat the rest
p = star + 1;
s = ++star_s;
continue;
}
return HTS_FALSE;
}
if (*p == '*') {
star = p++;
star_s = s;
} else if (*p == *s) {
p++;
s++;
} else if (star != NULL) {
p = star + 1;
s = ++star_s;
} else {
return HTS_FALSE;
}
}
while (p < pend && *p == '*')
p++;
return (p == pend) ? HTS_TRUE : HTS_FALSE;
}
// fil="" : vérifier si règle déja enregistrée
int checkrobots(robots_wizard * robots, const char *adr, const char *fil) {
while(robots) {
if (strfield2(robots->adr, adr)) {
if (fil[0]) {
/* RFC 9309: longest pattern wins, Allow beats Disallow on ties. */
int ptr = 0;
char line[250];
char line[HTS_ROBOTS_TOKEN_SIZE];
size_t toklen = strlen(robots->token);
size_t best_len = 0;
hts_boolean matched = HTS_FALSE;
hts_boolean best_allow = HTS_FALSE;
if (strnotempty(robots->token)) {
do {
ptr += binput(robots->token + ptr, line, 200);
if (line[0] == '/') { // absolu
if (strfield(fil, line)) { // commence avec ligne
return -1; // interdit
}
} else { // relatif
if (strstrcase(fil, line)) {
return -1;
while (ptr < (int) toklen) {
ptr += binput(robots->token + ptr, line, sizeof(line) - 1);
if (line[0] != 'A' && line[0] != 'D')
continue;
{
const hts_boolean is_allow =
(line[0] == 'A') ? HTS_TRUE : HTS_FALSE;
const char *pat = line + 1;
if (robots_pattern_match(pat, fil)) {
const size_t len = strlen(pat);
if (!matched || len > best_len || (len == best_len && is_allow)) {
matched = HTS_TRUE;
best_len = len;
best_allow = is_allow;
}
}
} while((strnotempty(line)) && (ptr < (int) strlen(robots->token)));
}
}
if (matched && !best_allow)
return -1; // forbidden
} else {
return -1;
}
@@ -74,6 +130,93 @@ int checkrobots(robots_wizard * robots, const char *adr, const char *fil) {
}
return 0;
}
/* Append "<marker><pattern>\n" to the bounded rule blob if it fits. */
static void robots_blob_add(char *blob, size_t blobsize, char marker,
const char *pat) {
const size_t used = strlen(blob);
const size_t need = strlen(pat) + 2; // marker + '\n'
if (need < blobsize - used) { // overflow-safe: used <= blobsize-1
blob[used] = marker;
blob[used + 1] = '\0';
strlcatbuff(blob, pat, blobsize);
strlcatbuff(blob, "\n", blobsize);
}
}
void robots_parse(robots_wizard *robots, const char *adr, const char *body,
size_t bodysize, char *info, size_t infosize,
hts_boolean keep_root_disallow) {
size_t bptr = 0;
int record = 0;
char BIGSTK line[1024];
char BIGSTK blob[HTS_ROBOTS_TOKEN_SIZE];
blob[0] = '\0';
if (info != NULL && infosize > 0)
info[0] = '\0';
#if DEBUG_ROBOTS
printf("robots.txt dump:\n%s\n", body);
#endif
while (bptr < bodysize) {
char *comm;
int llen;
bptr += binput(body + bptr, line, sizeof(line) - 2);
comm = strchr(line, '#'); // strip comment
if (comm != NULL)
*comm = '\0';
llen = (int) strlen(line); // strip trailing spaces
while (llen > 0 && is_realspace(line[llen - 1])) {
line[llen - 1] = '\0';
llen--;
}
if (strfield(line, "user-agent:")) {
char *a = line + 11;
while (is_realspace(*a))
a++;
if (*a == '*') {
if (record != 2)
record = 1; // generic group applies to us
} else if (strfield(a, "httrack") || strfield(a, "winhttrack") ||
strfield(a, "webhttrack")) {
blob[0] = '\0'; // explicit group: restart capture
if (info != NULL && infosize > 0)
info[0] = '\0';
record = 2; // locked to the httrack group
} else
record = 0;
} else if (record) {
hts_boolean is_allow = strfield(line, "allow:");
hts_boolean is_disallow = !is_allow && strfield(line, "disallow:");
if (is_allow || is_disallow) {
char *a = line + (is_allow ? 6 : 9);
while (is_realspace(*a))
a++;
if (strnotempty(a)) {
if (is_disallow && !keep_root_disallow && strcmp(a, "/") == 0) {
// dropped: site-wide disallow ignored by option
} else {
robots_blob_add(blob, sizeof(blob), is_allow ? 'A' : 'D', a);
if (is_disallow && info != NULL &&
strlen(a) + 2 < infosize - strlen(info)) {
if (strnotempty(info))
strlcatbuff(info, ", ", infosize);
strlcatbuff(info, a, infosize);
}
}
}
}
}
}
if (strnotempty(blob))
checkrobots_set(robots, adr, blob);
}
int checkrobots_set(robots_wizard * robots, const char *adr, const char *data) {
if (((int) strlen(adr)) >= sizeof(robots->adr) - 2)
return 0;

View File

@@ -39,17 +39,27 @@ Please visit our Website: http://www.httrack.com
#define HTS_DEF_FWSTRUCT_robots_wizard
typedef struct robots_wizard robots_wizard;
#endif
/* Per-host blob: one rule per line, first byte 'A'/'D' then path pattern. */
#define HTS_ROBOTS_TOKEN_SIZE 4096
struct robots_wizard {
char adr[128];
char token[4096];
char token[HTS_ROBOTS_TOKEN_SIZE];
struct robots_wizard *next;
};
/* Library internal definictions */
#ifdef HTS_INTERNAL_BYTECODE
/* -1 if `fil` disallowed for `adr` (RFC 9309); empty: -1 if rules exist. */
int checkrobots(robots_wizard * robots, const char *adr, const char *fil);
void checkrobots_free(robots_wizard * robots);
int checkrobots_set(robots_wizard * robots, const char *adr, const char *data);
/* Parse robots.txt `body` for `adr`, storing the HTTrack group's rules; `info`
gets a disallow summary, `keep_root_disallow` FALSE drops "Disallow: /". */
void robots_parse(robots_wizard *robots, const char *adr, const char *body,
size_t bodysize, char *info, size_t infosize,
hts_boolean keep_root_disallow);
#endif
#endif

View File

@@ -50,6 +50,9 @@ Please visit our Website: http://www.httrack.com
#include "htscharset.h"
#include "htsencoding.h"
#include "htsmd5.h"
#if HTS_USEZLIB
#include "htszlib.h"
#endif
#include "coucal/coucal.h"
#include <ctype.h>
@@ -239,6 +242,14 @@ static void basic_selftests(void) {
assertf(strcmp(ext, "html") == 0);
assertf(give_mimext(ext, sizeof(ext), "no/such-mime-type") == 0);
assertf(ext[0] == '\0');
// modern web formats -> extension. Avoid MIME types the
// application/<=4-char-subtype fallback could fabricate without a row.
assertf(give_mimext(ext, sizeof(ext), "image/webp") == 1);
assertf(strcmp(ext, "webp") == 0);
assertf(give_mimext(ext, sizeof(ext), "application/manifest+json") == 1);
assertf(strcmp(ext, "webmanifest") == 0);
assertf(give_mimext(ext, sizeof(ext), "font/woff2") == 1);
assertf(strcmp(ext, "woff2") == 0);
}
// convtolower(): lower-cases into the caller buffer (bounded by its size).
{
@@ -293,6 +304,16 @@ static void basic_selftests(void) {
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"x.gif", 0) == 1);
assertf(strcmp(r.contenttype, "image/gif") == 0);
// modern extensions map back to their MIME type
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"x.webp", 0) == 1);
assertf(strcmp(r.contenttype, "image/webp") == 0);
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"app.wasm", 0) == 1);
assertf(strcmp(r.contenttype, "application/wasm") == 0);
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"mod.mjs", 0) == 1);
assertf(strcmp(r.contenttype, "text/javascript") == 0);
// no extension and flag==0: nothing written, returns 0
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"noextfile", 0) == 0);
@@ -503,6 +524,41 @@ static int string_safety_selftests(void) {
return 1;
}
/* StringCatN/StringSetLength must eval SIZE once: (n_eval++, V) leaves
n_eval == 2 on a double-eval macro. */
{
String s = STRING_EMPTY;
int n_eval = 0;
StringCat(s, "hello");
StringCatN(s, "world", (n_eval++, 3)); /* strlen>SIZE so the clamp runs */
if (n_eval != 1 || strcmp(StringBuff(s), "hellowor") != 0) {
StringFree(s);
return 1;
}
n_eval = 0;
StringSetLength(s, (n_eval++, 5));
if (n_eval != 1 || StringLength(s) != 5) {
StringFree(s);
return 1;
}
StringFree(s);
}
/* StringSubRW still reads/writes after dropping its duplicate definition. */
{
String s = STRING_EMPTY;
StringCat(s, "abc");
StringSubRW(s, 1) = 'X';
if (StringSub(s, 1) != 'X' || strcmp(StringBuff(s), "aXc") != 0) {
StringFree(s);
return 1;
}
StringFree(s);
}
return 0;
}
@@ -1284,6 +1340,275 @@ static int st_urlhack(httrackp *opt, int argc, char **argv) {
return 0;
}
/* Default User-Agent: honest HTTrack token, no resurrected Windows 98. */
static int st_useragent(httrackp *opt, int argc, char **argv) {
const char *ua = StringBuff(opt->user_agent);
(void) argc;
(void) argv;
assertf(ua != NULL);
assertf(strcmp(ua, HTS_DEFAULT_USER_AGENT) == 0);
/* Teeth independent of the macro: honest token + self-identifier, and no
legacy Mozilla/4.x fake-browser string (rejects the whole relic family). */
assertf(strstr(ua, "HTTrack/") != NULL);
assertf(strstr(ua, "+https://www.httrack.com/") != NULL);
assertf(strstr(ua, "Mozilla/4.") == NULL);
printf("useragent self-test OK: %s\n", ua);
return 0;
}
/* HTTP status code -> reason phrase, including the modern 429/451. */
static int st_status(httrackp *opt, int argc, char **argv) {
const char *s;
(void) opt;
(void) argc;
(void) argv;
s = infostatuscode_const(429);
assertf(s != NULL && strcmp(s, "Too Many Requests") == 0);
s = infostatuscode_const(451);
assertf(s != NULL && strcmp(s, "Unavailable For Legal Reasons") == 0);
/* A spot-check of a long-standing code, and an unknown one. */
s = infostatuscode_const(404);
assertf(s != NULL && strcmp(s, "Not Found") == 0);
assertf(infostatuscode_const(799) == NULL);
printf("status self-test OK\n");
return 0;
}
#if HTS_USEZLIB
/* Deflate src->path at windowBits (16+ gzip, + zlib, - raw); 0 on success. */
static int ae_write_packed(const char *path, int windowBits,
const unsigned char *src, size_t len) {
unsigned char out[8192];
z_stream strm;
FILE *f;
int zerr;
memset(&strm, 0, sizeof(strm));
if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, windowBits, 8,
Z_DEFAULT_STRATEGY) != Z_OK)
return 1;
if ((f = FOPEN(path, "wb")) == NULL) {
deflateEnd(&strm);
return 1;
}
strm.next_in = (Bytef *) src;
strm.avail_in = (uInt) len;
do {
size_t n;
strm.next_out = out;
strm.avail_out = sizeof(out);
zerr = deflate(&strm, Z_FINISH);
n = sizeof(out) - strm.avail_out;
if (n > 0 && fwrite(out, 1, n, f) != n) {
deflateEnd(&strm);
fclose(f);
return 1;
}
} while (zerr == Z_OK);
deflateEnd(&strm);
fclose(f);
return (zerr == Z_STREAM_END) ? 0 : 1;
}
/* Forged raw deflate (08 1D) that misdetects as zlib; only fallback decodes */
static int ae_write_collision(const char *path, const unsigned char *src,
size_t len) {
/* block-1 LEN low byte 0x1D: with 0x08, (0x081D)%31==0 */
const size_t n1 = 29;
size_t n2, p = 0;
unsigned char *buf;
FILE *f;
int ok;
if (len < n1 || len - n1 > 0xFFFF)
return 1;
n2 = len - n1;
buf = malloct(10 + len);
if (buf == NULL)
return 1;
buf[p++] = 0x08; /* BFINAL=0, BTYPE=00, forged padding -> zlib CMF nibble */
buf[p++] = (unsigned char) (n1 & 0xff);
buf[p++] = (unsigned char) (n1 >> 8);
buf[p++] = (unsigned char) (~n1 & 0xff);
buf[p++] = (unsigned char) ((~n1 >> 8) & 0xff);
memcpy(buf + p, src, n1);
p += n1;
buf[p++] = 0x01; /* BFINAL=1, BTYPE=00 */
buf[p++] = (unsigned char) (n2 & 0xff);
buf[p++] = (unsigned char) (n2 >> 8);
buf[p++] = (unsigned char) (~n2 & 0xff);
buf[p++] = (unsigned char) ((~n2 >> 8) & 0xff);
memcpy(buf + p, src + n1, n2);
p += n2;
f = FOPEN(path, "wb");
ok = (f != NULL && fwrite(buf, 1, p, f) == p);
if (f != NULL)
fclose(f);
freet(buf);
return ok ? 0 : 1;
}
/* Compare path's bytes to expect[0..len); 0 if equal. Streams (large files). */
static int ae_check_decoded(const char *path, const unsigned char *expect,
size_t len) {
unsigned char buf[8192];
FILE *f = FOPEN(path, "rb");
size_t off = 0, n;
if (f == NULL)
return 1;
while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
if (n > len - off || memcmp(buf, expect + off, n) != 0) {
fclose(f);
return 1;
}
off += n;
}
fclose(f);
return (off == len) ? 0 : 1;
}
#endif
/* Accept-Encoding (#450): advertise gzip+deflate; both decode (hts_zunpack) */
static int st_acceptencoding(httrackp *opt, int argc, char **argv) {
const char *off = hts_acceptencoding(HTS_FALSE);
const char *on = hts_acceptencoding(HTS_TRUE);
(void) opt;
assertf(strcmp(off, "identity") == 0);
assertf(strstr(on, "gzip") != NULL);
assertf(strstr(on, "deflate") != NULL); /* fails on the old gzip-only list */
#if HTS_USEZLIB
if (argc >= 1) {
static const int windowBits[] = {16 + MAX_WBITS, MAX_WBITS, -MAX_WBITS};
const unsigned char small[] =
"deflate round-trip: HTTrack decodes gzip and deflate alike. "
"deflate round-trip: HTTrack decodes gzip and deflate alike.";
const size_t slen = sizeof(small) - 1;
/* 64 KiB of varied (LCG) bytes: forces the multi-fread loop */
const size_t blen = 64 * 1024;
unsigned char *body = malloct(blen);
uint32_t x = 0x1234567u;
char inpath[HTS_URLMAXSIZE], outpath[HTS_URLMAXSIZE];
size_t i;
assertf(body != NULL);
for (i = 0; i < blen; i++) {
x = x * 1103515245u + 12345u;
body[i] = (unsigned char) (x >> 16);
}
/* gzip, zlib (RFC1950) and raw deflate (RFC1951), both small and large. */
for (i = 0; i < sizeof(windowBits) / sizeof(windowBits[0]); i++) {
snprintf(inpath, sizeof(inpath), "%s/ae-in-%d.z", argv[0], windowBits[i]);
snprintf(outpath, sizeof(outpath), "%s/ae-out-%d", argv[0],
windowBits[i]);
assertf(ae_write_packed(inpath, windowBits[i], small, slen) == 0);
assertf(hts_zunpack(inpath, outpath) == (int) slen);
assertf(ae_check_decoded(outpath, small, slen) == 0);
assertf(ae_write_packed(inpath, windowBits[i], body, blen) == 0);
assertf(hts_zunpack(inpath, outpath) == (int) blen);
assertf(ae_check_decoded(outpath, body, blen) == 0);
}
/* Fallback teeth: raw deflate misdetected as zlib; -1 without the retry. */
snprintf(inpath, sizeof(inpath), "%s/ae-collide.z", argv[0]);
snprintf(outpath, sizeof(outpath), "%s/ae-collide.out", argv[0]);
assertf(ae_write_collision(inpath, body, 64) == 0);
assertf(hts_zunpack(inpath, outpath) == 64);
assertf(ae_check_decoded(outpath, body, 64) == 0);
freet(body);
}
#else
(void) argc;
(void) argv;
#endif
printf("acceptencoding self-test OK: %s\n", on);
return 0;
}
/* Each call parses `txt` under a fresh host, then checkrobots() for `path`. */
static int rb_decide(robots_wizard *r, const char *txt, const char *path) {
static int n = 0;
char host[64];
snprintf(host, sizeof(host), "h%d.example", n++);
robots_parse(r, host, txt, strlen(txt), NULL, 0, HTS_TRUE);
return checkrobots(r, host, path);
}
static int st_robots(httrackp *opt, int argc, char **argv) {
robots_wizard robots;
(void) opt;
(void) argc;
(void) argv;
memset(&robots, 0, sizeof(robots));
/* Longer Allow re-opens subtree under Disallow: / (old matcher couldn't). */
{
const char *txt = "User-agent: *\nDisallow: /\nAllow: /public/\n";
assertf(rb_decide(&robots, txt, "/public/x") == 0); /* allowed */
assertf(rb_decide(&robots, txt, "/private") == -1); /* denied */
assertf(rb_decide(&robots, txt, "/") == -1); /* denied */
}
/* Equal-length match: Allow wins the tie over Disallow. */
{
const char *txt = "User-agent: *\nDisallow: /foo\nAllow: /foo\n";
assertf(rb_decide(&robots, txt, "/foo/bar") == 0);
}
/* Longest match wins even when it is not the last rule. */
{
assertf(rb_decide(&robots, "User-agent: *\nDisallow: /a/b\nAllow: /a\n",
"/a/b/c") == -1);
assertf(rb_decide(&robots, "User-agent: *\nAllow: /a/b\nDisallow: /a\n",
"/a/b/c") == 0);
}
/* '*' matches any run of characters. */
{
const char *txt = "User-agent: *\nDisallow: /*.php\n";
assertf(rb_decide(&robots, txt, "/a/b/index.php") == -1);
assertf(rb_decide(&robots, txt, "/a/b/index.html") == 0);
}
/* Trailing '$' anchors the end of the path. */
{
const char *txt = "User-agent: *\nDisallow: /a$\n";
assertf(rb_decide(&robots, txt, "/a") == -1);
assertf(rb_decide(&robots, txt, "/ab") == 0);
assertf(rb_decide(&robots, txt, "/a/b") == 0);
}
/* The httrack-specific group replaces the generic '*' group entirely. */
{
const char *txt = "User-agent: *\nDisallow: /everyone\n"
"User-agent: httrack\nDisallow: /\n";
assertf(rb_decide(&robots, txt, "/anything") == -1);
}
/* Replace, not merge: the generic group does not bind the httrack group. */
{
const char *txt = "User-agent: *\nDisallow: /x\n"
"User-agent: httrack\nDisallow: /y\n";
assertf(rb_decide(&robots, txt, "/x") == 0);
assertf(rb_decide(&robots, txt, "/y") == -1);
}
/* No rules: everything is allowed. */
assertf(rb_decide(&robots, "User-agent: *\nDisallow:\n", "/x") == 0);
checkrobots_free(&robots);
printf("robots self-test OK\n");
return 0;
}
/* ------------------------------------------------------------ */
/* Registry: name -> handler, with a usage hint and a one-line description. */
/* ------------------------------------------------------------ */
@@ -1330,6 +1655,12 @@ static const struct selftest_entry {
st_cache_writefail},
{"dns", "", "DNS resolver/cache self-test", st_dns},
{"cookies", "", "cookie request-header self-test", st_cookies},
{"useragent", "", "default User-Agent self-test", st_useragent},
{"status", "", "HTTP status code -> reason phrase self-test", st_status},
{"acceptencoding", "[dir]",
"Accept-Encoding advertises gzip+deflate, both decode", st_acceptencoding},
{"robots", "", "robots.txt RFC 9309 Allow/Disallow precedence self-test",
st_robots},
};
static void list_selftests(void) {

View File

@@ -358,12 +358,12 @@ int smallserver(T_SOC soc, char *url, char *method, char *data, char *path) {
{NULL, 0}
};
initStrElt initStr[] = {
{"user", "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"},
{"footer",
"<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->"},
{"url2", "+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"},
{NULL, NULL}
};
{"user", HTS_DEFAULT_USER_AGENT},
{"footer", "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x "
"[XR&CO'2014], %s -->"},
{"url2",
"+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"},
{NULL, NULL}};
int i = 0;
for(i = 0; initInt[i].name; i++) {

View File

@@ -121,9 +121,6 @@ struct String {
/** Byte at POS (read/write). No bounds check; POS must be < StringLength. **/
#define StringSubRW(BLK, POS) (StringBuffRW(BLK)[POS])
/** Subcharacter (read/write) **/
#define StringSubRW(BLK, POS) (StringBuffRW(BLK)[POS])
/** Byte POS positions from the end (read). POS==1 is the last byte. **/
#define StringRight(BLK, POS) (StringBuff(BLK)[StringLength(BLK) - POS])
@@ -191,8 +188,9 @@ HTS_STATIC char *StringBuffN_(String *blk, int size) {
asserts SIZE fits the existing content; does not (re)allocate. **/
#define StringSetLength(BLK, SIZE) \
do { \
if (SIZE >= 0) { \
(BLK).length_ = SIZE; \
const int len__ = (SIZE); /* signed: negative means strlen(buffer_) */ \
if (len__ >= 0) { \
(BLK).length_ = len__; \
} else { \
(BLK).length_ = strlen((BLK).buffer_); \
} \
@@ -308,10 +306,11 @@ HTS_STATIC void StringAttach(String *blk, char **str) {
#define StringCatN(BLK, STR, SIZE) \
do { \
const char *str__ = (STR); \
const size_t usize__ = (SIZE); \
if (str__ != NULL) { \
size_t size__ = strlen(str__); \
if (size__ > (SIZE)) { \
size__ = (SIZE); \
if (size__ > usize__) { \
size__ = usize__; \
} \
StringMemcat(BLK, str__, size__); \
} \

View File

@@ -80,6 +80,10 @@ htspair_t hts_detect_embed[] = {
{NULL, NULL}
};
/* HTML5 media siblings of <img src>: same near-link treatment (#451) */
static const htspair_t hts_detect_embed_html5[] = {
{"source", "src"}, {"source", "srcset"}, {"track", "src"}, {NULL, NULL}};
/* Internal */
static int hts_acceptlink_(httrackp * opt, int ptr, const char *adr,
const char *fil, const char *tag,
@@ -136,6 +140,17 @@ static int cmp_token(const char *tag, const char *cmp) {
&& !isalnum((unsigned char) tag[p]));
}
/* TRUE if (tag, attribute) matches an embedded-asset pair in the table */
static hts_boolean is_embed_pair(const htspair_t *table, const char *tag,
const char *attribute) {
int i;
for (i = 0; table[i].tag != NULL; i++) {
if (cmp_token(tag, table[i].tag) && cmp_token(attribute, table[i].attr))
return HTS_TRUE;
}
return HTS_FALSE;
}
static int hts_acceptlink_(httrackp * opt, int ptr,
const char *adr, const char *fil, const char *tag,
const char *attribute, int *set_prio_to,
@@ -163,15 +178,9 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
/* Built-in known tags (<img src=..>, ..) */
if (forbidden_url != 0 && opt->nearlink && tag != NULL && attribute != NULL) {
int i;
for(i = 0; hts_detect_embed[i].tag != NULL; i++) {
if (cmp_token(tag, hts_detect_embed[i].tag)
&& cmp_token(attribute, hts_detect_embed[i].attr)
) {
embedded_triggered = 1;
break;
}
if (is_embed_pair(hts_detect_embed, tag, attribute) ||
is_embed_pair(hts_detect_embed_html5, tag, attribute)) {
embedded_triggered = 1;
}
}

View File

@@ -47,48 +47,89 @@ Please visit our Website: http://www.httrack.com
*/
/*
Unpack file into a new file
Unpack file into a new file (gzip, zlib RFC1950 or raw deflate RFC1951).
Return value: size of the new file, or -1 if an error occurred
*/
/* Note: utf-8 */
int hts_zunpack(char *filename, char *newfile) {
int ret = -1;
if (filename != NULL && newfile != NULL) {
if (filename[0] && newfile[0]) {
char catbuff[CATBUFF_SIZE];
FILE *const in = FOPEN(fconv(catbuff, sizeof(catbuff), filename), "rb");
const int fd = in != NULL ? fileno(in) : -1;
const int dup_fd = fd != -1 ? dup(fd) : -1;
// Note: we must dup to be able to flose cleanly.
const gzFile gz = dup_fd != -1 ? gzdopen(dup_fd, "rb") : NULL;
if (filename != NULL && newfile != NULL && filename[0] && newfile[0]) {
char catbuff[CATBUFF_SIZE];
FILE *const in = FOPEN(fconv(catbuff, sizeof(catbuff), filename), "rb");
if (gz) {
FILE *const fpout = FOPEN(fconv(catbuff, sizeof(catbuff), newfile), "wb");
int size = 0;
if (in != NULL) {
unsigned char BIGSTK inbuf[8192];
size_t navail = fread(inbuf, 1, sizeof(inbuf), in);
/* gzip/zlib headers -> +32 windowBits; else raw deflate (RFC1951) */
const hts_boolean wrapped =
(navail >= 2 &&
((inbuf[0] == 0x1f && inbuf[1] == 0x8b) ||
((inbuf[0] & 0x0f) == Z_DEFLATED &&
(((unsigned) inbuf[0] << 8 | inbuf[1]) % 31) == 0)));
int attempt;
if (fpout) {
int nr;
/* deflate is ambiguous; on failure retry with the other windowBits */
for (attempt = 0; attempt < 2 && ret < 0; attempt++) {
const int windowBits =
(attempt == 0 ? wrapped : !wrapped) ? (32 + MAX_WBITS) : -MAX_WBITS;
FILE *fpout;
z_stream strm;
do {
char BIGSTK buff[1024];
nr = gzread(gz, buff, sizeof(buff));
if (nr > 0) {
size += nr;
if (fwrite(buff, 1, nr, fpout) != nr)
nr = size = -1;
}
} while(nr > 0);
if (attempt > 0) {
/* rewind input; reopening fpout "wb" discards the partial output */
if (fseek(in, 0, SEEK_SET) != 0)
break;
navail = fread(inbuf, 1, sizeof(inbuf), in);
}
fpout = FOPEN(fconv(catbuff, sizeof(catbuff), newfile), "wb");
if (fpout == NULL)
break;
memset(&strm, 0, sizeof(strm));
if (inflateInit2(&strm, windowBits) != Z_OK) {
fclose(fpout);
} else
size = -1;
gzclose(gz);
ret = (int) size;
}
if (in != NULL) {
fclose(in);
break;
}
{
hts_boolean ok = HTS_TRUE;
int size = 0;
int zerr = Z_OK;
/* chunked inflate; first chunk in inbuf, single member */
do {
strm.next_in = inbuf;
strm.avail_in = (uInt) navail;
do {
unsigned char BIGSTK outbuf[8192];
size_t produced;
strm.next_out = outbuf;
strm.avail_out = sizeof(outbuf);
zerr = inflate(&strm, Z_NO_FLUSH);
if (zerr == Z_NEED_DICT || zerr == Z_DATA_ERROR ||
zerr == Z_MEM_ERROR || zerr == Z_STREAM_ERROR) {
ok = HTS_FALSE;
break;
}
produced = sizeof(outbuf) - strm.avail_out;
if (produced > 0 &&
fwrite(outbuf, 1, produced, fpout) != produced) {
ok = HTS_FALSE;
break;
}
size += (int) produced;
} while (strm.avail_out == 0);
if (!ok || zerr == Z_STREAM_END)
break;
navail = fread(inbuf, 1, sizeof(inbuf), in);
} while (navail > 0);
if (ok && zerr == Z_STREAM_END)
ret = size;
}
inflateEnd(&strm);
fclose(fpout);
}
fclose(in);
}
}
return ret;

View File

@@ -497,6 +497,12 @@ static const char *GetHttpMessage(int statuscode) {
case 417:
return "Expectation Failed";
break;
case 429:
return "Too Many Requests";
break;
case 451:
return "Unavailable For Legal Reasons";
break;
case 500:
return "Internal Server Error";
break;

View File

@@ -18,6 +18,21 @@ ent '&amp;' '&'
ent '&lt;&gt;' '<>'
ent '&eacute;' 'é'
# HTML5 names from the WHATWG set
ent '&hellip;' '…'
ent '&bigcup;' ''
# longest name (31 chars) exercises the name-length cap
ent '&CounterClockwiseContourIntegral;' '∳'
# astral codepoint -> 4-byte UTF-8
ent '&Aopf;' '𝔸'
# multi-codepoint refs are skipped at generation, so left verbatim
ent '&fjlig;' '&fjlig;'
# common HTML4 names still decode (regression guard against accidental drops)
ent '&copy;&reg;&trade;' '©®™'
ent '&mdash;&ndash;' '—–'
ent '&alpha;&beta;' 'αβ'
# numeric: decimal and hex
ent '&#65;&#66;' 'AB'
ent '&#x41;' 'A'

View File

@@ -323,4 +323,33 @@ grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
# HTML5 <source>/<track> follow as embedded near-links past the -r2 depth boundary (#451).
# img.gif positive control; plain.gif (bare <a href>) negative control proves the gate is selective.
site10="$tmp/html5media"
mkdir -p "$site10"
for f in img ss plain; do gif "$site10/$f.gif"; done
printf 'x' >"$site10/v.webm"
printf 'x' >"$site10/subs.vtt"
cat >"$site10/index.html" <<EOF
<html><body><a href="leaf.html">leaf</a></body></html>
EOF
cat >"$site10/leaf.html" <<EOF
<html><body>
<img src="img.gif">
<picture><source srcset="ss.gif 2x"></picture>
<video><source src="v.webm"></video>
<video><track src="subs.vtt"></video>
<a href="plain.gif">plain link past the boundary</a>
</body></html>
EOF
out10="$tmp/html5media-out"
rm -rf "$out10"
mkdir -p "$out10"
httrack "file://$site10/index.html" -O "$out10" --quiet --near -r2 >"$out10/.log" 2>&1 || true
found "img.gif" "$out10"
found "ss.gif" "$out10"
found "v.webm" "$out10"
found "subs.vtt" "$out10"
notfound "plain.gif" "$out10"
exit 0

7
tests/01_engine-robots.test Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# robots.txt RFC 9309 Allow/Disallow precedence (#452): longest match wins.
httrack -O /dev/null -#test=robots run | grep -q "robots self-test OK"

7
tests/01_engine-status.test Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# HTTP status -> reason phrase, including the modern 429/451 (#453).
httrack -O /dev/null -#test=status run | grep -q "status self-test OK"

7
tests/01_engine-useragent.test Executable file
View File

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# Default User-Agent (#449): honest HTTrack token, no Windows 98 relic.
httrack -O /dev/null -#test=useragent run | grep -q "useragent self-test OK"

View File

@@ -0,0 +1,11 @@
#!/bin/bash
#
set -euo pipefail
# Accept-Encoding (#450): advertise gzip+deflate; decode gzip/zlib/raw-deflate.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
httrack -O /dev/null -#test=acceptencoding "$dir" run |
grep -q "acceptencoding self-test OK"

View File

@@ -6,7 +6,7 @@
# Golden cache-format regression test (driven by 'httrack -#test=cache-golden <dir>').
#
# 01_engine-cache.test writes the cache with the same build it reads back (a
# 01_zlib-cache.test writes the cache with the same build it reads back (a
# round-trip), so it cannot catch a read-path or ZIP-format regression where
# writer and reader drift together. This reads a *committed* cache frozen by an
# earlier build and asserts a fixed set of entries still decodes field- and

View File

@@ -20,6 +20,14 @@ if ! command -v python3 >/dev/null 2>&1; then
echo "python3 missing, skipping"
exit 77
fi
# The fixture needs a second loopback IP (dead 127.0.0.2 + live 127.0.0.1) for
# the fallback to have a target; GNU/Hurd has only 127.0.0.1, so skip there.
case "$(uname -s)" in
GNU | GNU/*)
echo "GNU/Hurd: single loopback IP, connect-fallback fixture unbuildable, skipping"
exit 77
;;
esac
server="$top_srcdir/tests/local-server.py"
root="$top_srcdir/tests/server-root"

View File

@@ -9,6 +9,13 @@ set -e
: "${top_srcdir:=..}"
# python3 runs the local server (mirror local-crawl.sh); skip when absent, else
# run() swallows its exit-77 and the serverless 0s/0s crawl looks like a fail.
command -v python3 >/dev/null || {
echo "python3 not found; skipping local crawl tests"
exit 77
}
run() { # echoes the wall-clock seconds of one crawl
local t0 t1
t0=$(date +%s)

View File

@@ -1,4 +1,4 @@
# Committed binary fixture read by 01_engine-cache-golden.test. List it
# Committed binary fixture read by 01_zlib-cache-golden.test. List it
# explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would
# silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
@@ -25,9 +25,6 @@ TEST_EXTENSIONS = .test
TEST_LOG_COMPILER = $(BASH)
TESTS = \
00_runnable.test \
01_engine-cache.test \
01_engine-cache-golden.test \
01_engine-cache-writefail.test \
01_engine-charset.test \
01_engine-cmdline.test \
01_engine-cookies.test \
@@ -44,12 +41,19 @@ TESTS = \
01_engine-pause.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-robots.test \
01_engine-savename.test \
01_engine-selftest-dispatch.test \
01_engine-simplify.test \
01_engine-status.test \
01_engine-stripquery.test \
01_engine-strsafe.test \
01_engine-urlhack.test \
01_engine-useragent.test \
01_zlib-acceptencoding.test \
01_zlib-cache.test \
01_zlib-cache-golden.test \
01_zlib-cache-writefail.test \
02_manpage-regen.test \
02_update-cache.test \
10_crawl-simple.test \