Compare commits

1 Commits

Author SHA1 Message Date
Xavier Roche
63671fb4cf Stop the 412/416 partial-reget loop on continue/update (#206)
On resume, the Range request is rebuilt by back_add from a temp-ref keyed on
(adr,fil) that records the partial download's real save name. A 412
("Precondition Failed") or 416 ("Range Not Satisfiable") means that partial
is stale and the whole file must be re-fetched. The handler only removed
heap->sav, so when the resume
pass recomputed a save name different from the temp-ref's (the default
delayed-type machinery renames freely), the partial was never cleared:
back_add re-sent the same Range, earned the same 416, and the link was
re-recorded forever, growing the scan counter without bound.

Clear the whole partial wherever it lives -- the temp-ref and the file it
points at, plus heap->sav -- so the re-record falls through to a plain full
GET. Re-get only when there was a partial to discard and both Range triggers
(the ref and the on-disk file) are actually gone; once they are, a fresh 416
with nothing left to drop means the whole-file GET itself failed, so the link
gives up cleanly instead of re-queueing. A failed removal (read-only or full
cache) also gives up rather than looping, since back_add would otherwise
re-Range the surviving ref; url_savename_refname_remove now reports the
removal result so the handler can tell. (The request's range_used flag would
be the natural one-shot signal, but it does not survive the delayed-type
two-pass, so we key off the partial instead.)
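
The decision rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions: `partial_state`, `range_error_action` and `decide_on_range_error` are hypothetical names, not the handler's actual types or code, and the three flags stand in for whatever the handler really checks after cleanup.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not HTTrack's actual types. */
typedef enum { ACTION_REGET_FULL, ACTION_GIVE_UP } range_error_action;

typedef struct {
  bool had_partial; /* a temp-ref or partial file existed before cleanup */
  bool ref_gone;    /* the temp-ref is absent after cleanup */
  bool file_gone;   /* the on-disk partial is absent after cleanup */
} partial_state;

/* What to do after a 412/416 on a resume request. */
static range_error_action decide_on_range_error(partial_state s) {
  /* Nothing left to drop: the whole-file GET itself failed. */
  if (!s.had_partial)
    return ACTION_GIVE_UP;
  /* Both Range triggers are gone: the next request is a plain full GET. */
  if (s.ref_gone && s.file_gone)
    return ACTION_REGET_FULL;
  /* A removal failed: a surviving ref would be re-Ranged, so stop here. */
  return ACTION_GIVE_UP;
}
```

Keeping the rule free of I/O makes each of the three outcomes easy to check in isolation.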

tests/20_local-resume-loop.test drives it offline: pass 1 is interrupted
(SIGTERM, so the exit handler finalizes the cache and the temp-ref) to leave
a partial, then pass 2 --continue gets 416 on every resume request. A
portable watchdog kills pass 2 if it loops; the test asserts it terminates
and attempts exactly one whole-file re-get (2 <= requests <= 8). It fails on
the pre-fix handler (loops) and on a re-get that silently drops the link.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-24 21:06:05 +02:00
76 changed files with 2774 additions and 15624 deletions

View File

@@ -61,50 +61,6 @@ jobs:
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Reproduce the Debian buildds: they build in a minimal chroot with no
# python3, so the local-server tests must SKIP (exit 77), not fail. GitHub
# runners ship python3, so every other job hides this path; here we remove it
# before `make check`. This is the guard that would have caught the 3.49.10-1
# FTBFS (28_local-pause failed instead of skipping when python3 was absent).
buildd-no-python3:
name: build (no python3, Debian buildd)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev
- name: Configure
run: |
set -euo pipefail
autoreconf -fi
./configure
- name: Build
run: make -j"$(nproc)"
- name: Test without python3
run: |
set -euo pipefail
# Hide every python3* so `command -v python3` fails like it does in the
# buildd chroot; masking with /bin/false would still resolve.
sudo find /usr/bin /usr/local/bin -maxdepth 1 -name 'python3*' \
-exec mv {} {}.hidden \;
! command -v python3
make check
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Portability: build and test on macOS (Darwin/clang) on a native runner --
# no VM. The tree has no __APPLE__ branches, so Darwin exercises the
# generic-Unix path on a second libc and kernel. brew's openssl@3 is keg-only,
@@ -232,51 +188,6 @@ jobs:
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# MemorySanitizer catches reads of uninitialized memory (#143's stack-garbage
# size filter) that ASan/UBSan miss. It flags any byte an uninstrumented lib
# wrote, so the job stays in our own code: offline self-tests only, no openssl
# (--disable-https), no zlib cache tests, static (the runtime is not in .so's).
msan:
name: msan (MemorySanitizer, clang)
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install build dependencies
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential clang autoconf automake libtool autoconf-archive \
zlib1g-dev
- name: Configure (MSan, static, no https)
run: |
set -euo pipefail
autoreconf -fi
./configure CC=clang \
CFLAGS="-fsanitize=memory -fsanitize-memory-track-origins=2 -fno-sanitize-recover=all -g -O1 -fno-omit-frame-pointer" \
LDFLAGS="-fsanitize=memory" \
--disable-https --disable-shared --enable-static
- name: Build
run: make -j"$(nproc)"
- name: Test (offline self-tests under MSan)
env:
MSAN_OPTIONS: abort_on_error=1:halt_on_error=1
run: |
set -euo pipefail
# Engine self-tests only; the cache trio pulls in uninstrumented zlib.
tests="$(cd tests && ls 01_engine-*.test | grep -v -- '-cache' | tr '\n' ' ')"
make check TESTS="$tests"
- name: Print the test log on failure
if: failure()
run: cat tests/test-suite.log 2>/dev/null || true
# Optional-dependency build: compile and test with HTTPS/OpenSSL disabled --
# the configuration users on minimal systems build, and one libssl is not even
# installed here so configure cannot silently re-enable it. The matrix above

View File

@@ -33,9 +33,8 @@ the operational checklist: toolchain, invariants, and how to ship a change.
- Be terse. Comment the why, in English; translate French comments you touch.
- Strip AI tells from prose (em-dash overuse, rule-of-three, filler, vague
attributions). Ref: Wikipedia "Signs of AI writing". Claude Code: `/humanizer`.
- Behavior change → add a test. Fast path: a hidden `httrack -#test=NAME` engine
self-test (registry in `htsselftest.c`; `-#test` lists them) driven by a
`tests/NN_*.test`, over a slow crawl.
- Behavior change → add a test. Fast path: a hidden `httrack -#N` debug
subcommand (`htscoremain.c`) driven by a `tests/NN_*.test`, over a slow crawl.
## Review your change adversarially (strongly suggested)
Before pushing, and when reviewing others, don't skim for bugs:

View File

@@ -39,10 +39,6 @@ Welcome, and nothing to disclose. Two rules:
The sign-off covers AI-assisted code too.
## Translations
Interface strings live in [`lang/`](lang/). See [lang/README.md](lang/README.md) for the file format and how to add or update a language.
## Bugs
Open an issue with the version, OS, command used, and expected vs actual result.

View File

@@ -1,6 +1,6 @@
AC_PREREQ([2.71])
AC_INIT([httrack], [3.49.10], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
AC_INIT([httrack], [3.49.9], [roche+packaging@httrack.com], [httrack], [http://www.httrack.com/])
AC_COPYRIGHT([
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 1998-2015 Xavier Roche and other contributors
@@ -29,10 +29,10 @@ AC_CONFIG_SRCDIR(src/httrack.c)
AC_CONFIG_MACRO_DIR([m4])
AC_CONFIG_HEADERS(config.h)
AM_INIT_AUTOMAKE([subdir-objects])
# 3:2:0: 3.49.10 only appends tail fields to the options struct (no existing
# symbol or offset changed vs 3.49.9), so it stays soname .so.3; bump revision.
# (3:0:0 was the htsblk mime-buffer widening, the ABI break that moved .so.2 -> .so.3.)
VERSION_INFO="3:2:0"
# 3:1:0: 3.49.9 changed code but not the exported interface vs 3.49.8 (same 164
# symbols, no struct-layout change), so bump revision only. (3:0:0 was the htsblk
# mime-buffer widening, an ABI break that moved the soname .so.2 -> .so.3.)
VERSION_INFO="3:1:0"
AM_MAINTAINER_MODE
AC_USE_SYSTEM_EXTENSIONS

13
debian/changelog vendored
View File

@@ -1,16 +1,3 @@
httrack (3.49.10-1) unstable; urgency=medium
* New upstream release: new download-pacing and URL-handling options plus a
batch of crawl and robustness fixes (full list in history.txt).
* Rewrite debian/copyright in machine-readable DEP-5 format, crediting the
bundled minizip, md5 and coucal sources (#415).
* Lead the webhttrack browser dependency with chromium so httrack is not
dragged into the firefox-esr autoremoval cascade (#436).
* Override the embedded-library lint for the bundled minizip (#419).
* Bump Standards-Version to 4.7.4 (no changes required).
-- Xavier Roche <xavier@debian.org> Sun, 28 Jun 2026 14:01:53 +0200
httrack (3.49.9-1) unstable; urgency=medium
* New upstream release: Content-Type and file-type detection fixes (trust a

4
debian/control vendored
View File

@@ -2,7 +2,7 @@ Source: httrack
Section: web
Priority: optional
Maintainer: Xavier Roche <roche@httrack.com>
Standards-Version: 4.7.4
Standards-Version: 4.7.0
Build-Depends: debhelper-compat (= 13), autoconf, autoconf-archive, automake, libtool, zlib1g-dev, libssl-dev
Rules-Requires-Root: no
Homepage: http://www.httrack.com
@@ -30,7 +30,7 @@ Description: Copy websites to your computer (Offline browser)
Package: webhttrack
Architecture: any
Multi-Arch: foreign
Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, chromium | firefox-esr | www-browser
Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, firefox-esr | chromium | www-browser
Replaces: webhttrack-common (<< 3.43.9-2)
Breaks: webhttrack-common (<< 3.43.9-2)
Suggests: httrack, httrack-doc

View File

@@ -4,25 +4,7 @@ HTTrack Website Copier release history:
This file lists all changes and fixes that have been made for HTTrack
3.49-10
+ New: --cookies-file to preload a Netscape cookies.txt before crawling (#215)
+ New: --pause to space out file downloads by a random delay (#185)
+ New: --strip-query to drop selected query keys from the dedup naming (#112)
+ Changed: split the -%u URL hacks into independent --keep-www-prefix, --keep-double-slashes and --keep-query-order toggles (#271)
+ Fixed: follow a redirect Location after dropping its #fragment, instead of requesting the fragment and polluting the saved name (#204)
+ Fixed: escaped brackets inside a *[...] filter character class (#148)
+ Fixed: honor the server's Content-Range when resuming a partial download, instead of appending overlapping bytes (#198)
+ Fixed: abort the download as soon as the response type is excluded by -mime:, instead of fetching then discarding the body (#58)
+ Fixed: keep size-based filter rules neutral until the file size is known (#143)
+ Fixed: stop the mirror with a clean fatal error on a cache write failure, instead of crashing (#174, #219)
+ Fixed: stop the 412/416 partial re-get loop on --continue and --update (#206)
+ Fixed: keep an unrecognized URL tail instead of mangling it to .html (#115)
+ Fixed: honor --tolerant (-%B) on a broken Content-Length, and fix an out-of-bounds read it exposed (#32, #41)
+ Fixed: fall back to the next resolved address when a connection fails or stalls, instead of hanging on a dead IPv6 address
+ Fixed: report why a -%L URL list could not be loaded (#49)
+ Changed: multiple internal hardening, build and CI improvements
.49-9
3.49-9
+ Fixed: file-type detection from the Content-Type header: trust a declared type over a binary URL extension, honor --assume under the delayed type check, and keep a known extension against a bogus or empty Content-Type (#267, #29, #56)
+ Fixed: an uninitialized-buffer read when the Content-Type is empty (#411)
+ Fixed: restored C++ source-compatibility of the installed headers so reverse dependencies (httraqt) build again (#413)

View File

@@ -247,7 +247,7 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
<td>the \ character</td>
</tr>
<tr>
<td nowrap><tt>*[\[,\]]</tt></td>
<td nowrap><tt>*[\[\]]</tt></td>
<td>the [ or ] character</td>
</tr>
<tr>

View File

@@ -295,7 +295,7 @@ Max Depth
Maximum external depth:
Maximum external depth:
Filters (refuse/accept links) :
Filters (refuse/accept links):
Filters (refuse/accept links) :
Paths
Paths
Save prefs

View File

@@ -1,37 +0,0 @@
# Translating HTTrack
Interface strings live here, one `.txt` file per language. `English.txt` is the reference: every other file maps each English string to its translation.
## File format
Plain text, entries in consecutive pairs of lines:
```
<English string>
<translation>
```
The first line of a pair is the lookup key and must stay identical to the one in `English.txt`; translate only the second line. Missing entries fall back to the English text at runtime, so a partial translation works.
Preserve any `\r\n`, `\t` and `printf` placeholders (`%s`, `%d`, ...) in the translation.
A few `LANGUAGE_*` entries at the top describe the file itself:
| Key | Meaning |
| --- | --- |
| `LANGUAGE_NAME` | Name shown in the language picker, in its own language (`Deutsch`, not `German`) |
| `LANGUAGE_ISO` | ISO 639 code, with region if needed (`de`, `pt_BR`) |
| `LANGUAGE_CHARSET` | Encoding the file is saved in (`ISO-8859-1`, `windows-1251`, `UTF-8`, ...) |
| `LANGUAGE_AUTHOR` | Your name and contact |
| `LANGUAGE_WINDOWSID` | Windows locale name used by WinHTTrack (`German (Standard)`) |
Save the file in exactly its declared `LANGUAGE_CHARSET`; an editor that rewrites it as UTF-8 will corrupt the non-ASCII bytes.
## Adding or updating a language
1. Copy `English.txt` to `<Language>.txt`, or edit the existing file.
2. Translate each second line; leave the English keys untouched.
3. Fill in the `LANGUAGE_*` header for a new file.
4. Open a pull request, or attach the file to a GitHub issue.
When new strings land in `English.txt` they show up untranslated (as English) until a translator fills them in.
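
The pair layout and the English fallback can be sketched as a lookup over the file's lines. This is a minimal sketch, not HTTrack's loader: `lang_lookup` is a hypothetical helper, and it assumes the lines were already read into an array in file order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not HTTrack's loader. `lines` holds the file's lines
   in order, as consecutive (English key, translation) pairs. */
static const char *lang_lookup(const char *const *lines, size_t n,
                               const char *english) {
  size_t i;
  /* Step two lines at a time: even index is the key, odd is the translation. */
  for (i = 0; i + 1 < n; i += 2) {
    if (strcmp(lines[i], english) == 0)
      return lines[i + 1];
  }
  return english; /* missing entry: fall back to the English text */
}
```

A key with no matching pair comes back unchanged, which is why a partial translation still works.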

View File

@@ -3,7 +3,7 @@
.\"
.\" This file is generated by man/makeman.sh; do not edit by hand.
.\" SPDX-License-Identifier: GPL-3.0-or-later
.TH httrack 1 "27 June 2026" "httrack website copier"
.TH httrack 1 "13 June 2026" "httrack website copier"
.SH NAME
httrack \- offline browser : copy websites to a local directory
.SH SYNOPSIS
@@ -24,7 +24,6 @@ httrack \- offline browser : copy websites to a local directory
[ \fB\-EN, \-\-max\-time[=N]\fR ]
[ \fB\-AN, \-\-max\-rate[=N]\fR ]
[ \fB\-%cN, \-\-connection\-per\-second[=N]\fR ]
[ \fB\-%G, \-\-pause\fR ]
[ \fB\-GN, \-\-max\-pause[=N]\fR ]
[ \fB\-cN, \-\-sockets[=N]\fR ]
[ \fB\-TN, \-\-timeout[=N]\fR ]
@@ -44,13 +43,11 @@ httrack \- offline browser : copy websites to a local directory
[ \fB\-x, \-\-replace\-external\fR ]
[ \fB\-%x, \-\-disable\-passwords\fR ]
[ \fB\-%q, \-\-include\-query\-string\fR ]
[ \fB\-%g, \-\-strip\-query\fR ]
[ \fB\-o, \-\-generate\-errors\fR ]
[ \fB\-X, \-\-purge\-old[=N]\fR ]
[ \fB\-%p, \-\-preserve\fR ]
[ \fB\-%T, \-\-utf8\-conversion\fR ]
[ \fB\-bN, \-\-cookies[=N]\fR ]
[ \fB\-%K, \-\-cookies\-file\fR ]
[ \fB\-u, \-\-check\-type[=N]\fR ]
[ \fB\-j, \-\-parse\-java[=N]\fR ]
[ \fB\-sN, \-\-robots[=N]\fR ]
@@ -156,8 +153,6 @@ maximum mirror time in seconds (60=1 minute, 3600=1 hour) (\-\-max\-time[=N])
maximum transfer rate in bytes/seconds (1000=1KB/s max) (\-\-max\-rate[=N])
.IP \-%cN
maximum number of connections/seconds (*%c10) (\-\-connection\-per\-second[=N])
.IP \-%G
random pause of MIN[:MAX] seconds between files (e.g. %G5:10) (\-\-pause <param>)
.IP \-GN
pause transfer if N bytes reached, and wait until lock file is deleted (\-\-max\-pause[=N])
.SS Flow control:
@@ -203,8 +198,6 @@ replace external html links by error pages (\-\-replace\-external)
do not include any password for external password protected websites (%x0 include) (\-\-disable\-passwords)
.IP \-%q
*include query string for local files (useless, for information purpose only) (%q0 don't include) (\-\-include\-query\-string)
.IP \-%g
strip query keys for dedup ([host/pattern=]key1,key2,...) (\-\-strip\-query <param>)
.IP \-o
*generate output html file in case of error (404..) (o0 don't generate) (\-\-generate\-errors)
.IP \-X
@@ -216,8 +209,6 @@ links conversion to UTF\-8 (\-\-utf8\-conversion)
.SS Spider options:
.IP \-bN
accept cookies in cookies.txt (0=do not accept,* 1=accept) (\-\-cookies[=N])
.IP \-%K
load extra cookies from a Netscape cookies.txt (\-\-cookies\-file <param>)
.IP \-u
check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (\-\-check\-type[=N])
.IP \-j
@@ -234,8 +225,6 @@ tolerant requests (accept bogus responses on some servers, but not standard!) (\
update hacks: various hacks to limit re\-transfers when updating (identical size, bogus response..) (\-\-updatehack)
.IP \-%u
url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (\-\-urlhack)
.br
opt out of one url\-hack part: \-\-keep\-www\-prefix (www.foo.com<>foo.com), \-\-keep\-double\-slashes (//), \-\-keep\-query\-order (?b&a)
.IP \-%A
assume that a type (cgi,asp..) is always linked with a mime type (\-%A php3,cgi=text/html;dat,bin=application/x\-zip) (\-\-assume <param>)
.br
@@ -324,8 +313,12 @@ debug HTTP headers in logfile (\-\-debug\-headers)
.SS Guru options: (do NOT use if possible)
.IP \-#X
*use optimized engine (limited memory boundary checks) (\-\-fast\-engine)
.IP \-#test
list engine self\-tests (run one with \-#test=NAME [args])
.IP \-#0
filter test (\-#0 '*.gif' 'www.bar.com/foo.gif') (\-\-debug\-testfilters <param>)
.IP \-#1
simplify test (\-#1 ./foo/bar/../foobar)
.IP \-#2
type test (\-#2 /foo/bar.php)
.IP \-#C
cache list (\-#C '*.com/spider*.gif' (\-\-debug\-cache <param>)
.IP \-#R

View File

@@ -56,7 +56,7 @@ whttrackrundir = $(bindir)
whttrackrun_SCRIPTS = webhttrack
libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
htscache_selftest.c htsdns_selftest.c htsselftest.c \
htscache_selftest.c htsdns_selftest.c \
htscatchurl.c htsfilters.c htsftp.c htshash.c coucal/coucal.c \
htshelp.c htslib.c htscoremain.c \
htsname.c htsrobots.c htstools.c htswizard.c \
@@ -66,7 +66,7 @@ libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
md5.c \
minizip/ioapi.c minizip/mztools.c minizip/unzip.c minizip/zip.c \
hts-indextmpl.h htsalias.h htsback.h htsbase.h htssafe.h \
htsbasenet.h htsbauth.h htscache.h htscache_selftest.h htsdns_selftest.h htsselftest.h htscatchurl.h \
htsbasenet.h htsbauth.h htscache.h htscache_selftest.h htsdns_selftest.h htscatchurl.h \
htsconfig.h htscore.h htsparse.h htscoremain.h htsdefines.h \
htsfilters.h htsftp.h htsglobal.h htshash.h coucal/coucal.h \
htshelp.h htsindex.h htslib.h htsmd5.h \

View File

@@ -60,9 +60,6 @@ Please visit our Website: http://www.httrack.com
param1 : this option must be alone, and needs one distinct parameter (-P <path>)
param0 : this option must be alone, but the parameter should be put together (+*.gif)
*/
/* clang-format off: hand-aligned table; clang-format reflows the whole
initializer (2->4 space) on any edit, churning every untouched row. */
/* clang-format off */
const char *hts_optalias[][4] = {
/* {"","","",""}, */
{"path", "-O", "param1", "output path"},
@@ -110,12 +107,6 @@ const char *hts_optalias[][4] = {
{"disable-passwords", "-%x", "single", ""}, {"disable-password", "-%x",
"single", ""},
{"include-query-string", "-%q", "single", ""},
{"strip-query", "-%g", "param1",
"strip [host/pattern=]key1,key2,... from URLs"},
{"cookies-file", "-%K", "param1",
"load extra cookies from a Netscape cookies.txt"},
{"pause", "-%G", "param1",
"random pause of MIN[:MAX] seconds between files"},
{"generate-errors", "-o", "single", ""},
{"do-not-generate-errors", "-o0", "single", ""},
{"purge-old", "-X", "param", ""},
@@ -132,9 +123,6 @@ const char *hts_optalias[][4] = {
{"tolerant", "-%B", "single", ""},
{"updatehack", "-%s", "single", ""}, {"sizehack", "-%s", "single", ""},
{"urlhack", "-%u", "single", ""},
{"keep-www-prefix", "-%j", "single", ""},
{"keep-double-slashes", "-%o", "single", ""},
{"keep-query-order", "-%y", "single", ""},
{"user-agent", "-F", "param1", "user-agent identity"},
{"referer", "-%R", "param1", "default referer URL"},
{"from", "-%E", "param1", "from email address"},
@@ -253,7 +241,6 @@ const char *hts_optalias[][4] = {
{"", "", "", ""}
};
/* clang-format on */
/*
Check for alias in command-line

View File

@@ -57,10 +57,7 @@ Please visit our Website: http://www.httrack.com
// DOS
#include <process.h> /* _beginthread, _endthread */
#endif
#include <io.h> /* _chsize_s */
#define HTS_FTRUNCATE(fp, sz) _chsize_s(_fileno(fp), (sz))
#else
#define HTS_FTRUNCATE(fp, sz) ftruncate(fileno(fp), (sz))
#endif
#define VT_CLREOL "\33[K"
@@ -3766,27 +3763,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
}
#endif
/********** **************************** ********** */
}
// MIME type excluded by a -mime: filter: abort, don't fetch
// the body (#58)
else if (HTTP_IS_OK(back[i].r.statuscode) &&
!back[i].testmode &&
strnotempty(back[i].r.contenttype) &&
hts_acceptmime(opt, 0, back[i].url_adr,
back[i].url_fil,
back[i].r.contenttype) == 1) {
deletehttp(&back[i].r);
back[i].r.soc = INVALID_SOCKET;
back[i].status = STATUS_READY;
back_set_finished(sback, i);
back[i].r.statuscode = STATUSCODE_EXCLUDED;
strcpybuff(back[i].r.msg, "Excluded by MIME type filter");
hts_log_print(
opt, LOG_NOTICE,
"File excluded by MIME type filter (%s): %s%s",
back[i].r.contenttype, back[i].url_adr,
back[i].url_fil);
} else { // we have to fetch it
} else { // we have to fetch it
// clear buffer (request)
if (!noFreebuff) {
@@ -3797,70 +3774,35 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
// xxc IF CHUNKED, CHECK THAT THIS WORKS??
if (back[i].r.statuscode == 206) { // we are sent a piece (the tail) because part of the file is on disk!
off_t sz = fsize_utf8(back[i].url_sav);
/* RFC 7233: resume at the server's Content-Range start,
not the offset we requested; a server may resume
earlier and appending the overlap duplicates bytes
(#198). */
const LLint resume = back[i].r.crange_start;
const hts_boolean range_ok =
back[i].r.crange > 0 && resume >= 0 &&
resume <= (LLint) sz &&
back[i].r.crange_end + 1 == back[i].r.crange &&
(back[i].r.totalsize < 0 ||
back[i].r.totalsize ==
back[i].r.crange_end - resume + 1);
#if HDEBUG
printf("partial content: " LLintP " on disk..\n",
(LLint) sz);
#endif
if (sz >= 0 && range_ok) {
if (sz >= 0) {
if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_sav)) { // not HTML
if (opt->getmode & HTS_GETMODE_NONHTML) {
filenote(&opt->state.strc, back[i].url_sav, NULL); // mark the file as known
file_notify(opt, back[i].url_adr, back[i].url_fil,
back[i].url_sav, 0, 1,
back[i].r.notmodified);
back[i].r.out =
FOPEN(fconv(catbuff, sizeof(catbuff),
back[i].url_sav),
"r+b"); // resume in place
back[i].r.out = FOPEN(fconv(catbuff, sizeof(catbuff), back[i].url_sav), "ab"); // append
if (back[i].r.out && opt->cache != 0) {
back[i].r.is_write = 1;
back[i].r.size = resume; // bytes already on disk
back[i].r.statuscode = HTTP_OK; // force 'OK'
back[i].r.is_write = 1; // write
back[i].r.size = sz; // already written
back[i].r.statuscode = HTTP_OK; // force 'OK'
if (back[i].r.totalsize >= 0)
back[i].r.totalsize += resume; // -> full size
// drop bytes past the resume point; a silent
// failure could leave a stale tail, so on error
// drop the partial and refetch the whole file
if (HTS_FTRUNCATE(back[i].r.out,
(off_t) resume) != 0) {
fclose(back[i].r.out);
back[i].r.out = NULL;
url_savename_refname_remove(
opt, back[i].url_adr, back[i].url_fil);
UNLINK(back[i].url_sav);
back[i].status = STATUS_READY;
back_set_finished(sback, i);
strcpybuff(back[i].r.msg,
"Can not truncate partial file, "
"restarting");
} else {
fseeko(back[i].r.out, (off_t) resume, SEEK_SET);
/* create a temporary reference file in case of
* broken mirror */
if (back_serialize_ref(opt, &back[i]) != 0) {
hts_log_print(opt, LOG_WARNING,
"Could not create temporary "
"reference file for %s%s",
back[i].url_adr,
back[i].url_fil);
}
#if HDEBUG
printf("continue interrupted file\n");
#endif
back[i].r.totalsize += sz; // actually larger
fseek(back[i].r.out, 0, SEEK_END); // to the end
/* create a temporary reference file in case of broken mirror */
if (back_serialize_ref(opt, &back[i]) != 0) {
hts_log_print(opt, LOG_WARNING,
"Could not create temporary reference file for %s%s",
back[i].url_adr, back[i].url_fil);
}
#if HDEBUG
printf("continue interrupted file\n");
#endif
} else { // we are in trouble
back[i].status = STATUS_READY; // done (see below)
back_set_finished(sback, i);
@@ -3872,18 +3814,17 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
FILE *fp =
FOPEN(fconv(catbuff, sizeof(catbuff), back[i].url_sav), "rb");
if (fp) {
LLint alloc_mem = resume + 1;
LLint alloc_mem = sz + 1;
if (back[i].r.totalsize >= 0)
alloc_mem += back[i].r.totalsize; // ADD THE REMAINDER!
if (deleteaddr(&back[i].r)
&& (back[i].r.adr =
(char *) malloct((size_t) alloc_mem))) {
back[i].r.size = resume;
back[i].r.size = sz;
if (back[i].r.totalsize >= 0)
back[i].r.totalsize += resume; // -> full size
if ((fread(back[i].r.adr, 1, (size_t) resume,
fp)) != (size_t) resume) {
back[i].r.totalsize += sz; // actually larger
if ((fread(back[i].r.adr, 1, sz, fp)) != sz) {
back[i].status = STATUS_READY; // done (see below)
back_set_finished(sback, i);
strcpybuff(back[i].r.msg,
@@ -3901,30 +3842,14 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
"No memory for partial file");
}
fclose(fp);
} else { // open failed
} else { // Argh..
back[i].status = STATUS_READY; // done (see below)
back_set_finished(sback, i);
strcpybuff(back[i].r.msg,
"Can not open partial file");
}
}
} else if (sz >=
0) { // unusable range -> restart whole file
hts_log_print(opt, LOG_WARNING,
"Unusable partial-content range for %s%s "
"(have " LLintP " bytes, got " LLintP
"-" LLintP "/" LLintP "), restarting",
back[i].url_adr, back[i].url_fil,
(LLint) sz, back[i].r.crange_start,
back[i].r.crange_end, back[i].r.crange);
url_savename_refname_remove(opt, back[i].url_adr,
back[i].url_fil);
UNLINK(back[i].url_sav);
back[i].status = STATUS_READY;
back_set_finished(sback, i);
strcpybuff(back[i].r.msg,
"Unusable partial content, restarting");
} else { // partial not found
} else { // Not found??
back[i].status = STATUS_READY; // done (see below)
back_set_finished(sback, i);
strcpybuff(back[i].r.msg, "Can not find partial file");
@@ -4005,6 +3930,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
}
}
}
/*} */

View File

@@ -146,8 +146,7 @@ typedef enum BackStatusCode {
STATUSCODE_NON_FATAL = -5,
STATUSCODE_SSL_HANDSHAKE = -6,
STATUSCODE_TOO_BIG = -7,
STATUSCODE_TEST_OK = -10,
STATUSCODE_EXCLUDED = -11 /* aborted: MIME excluded by a -mime: filter */
STATUSCODE_TEST_OK = -10
} BackStatusCode;
/** HTTrack status ('status' member of 'lien_back') **/

View File

@@ -3,12 +3,12 @@
# Change this to download files
if false; then
echo "mget https://www.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget https://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
fi

View File

@@ -220,25 +220,6 @@ struct cache_back_zip_entry {
} \
} while(0)
/* A cache (new.zip) write failed: storage is gone (disk full / dropped share),
so the mirror is doomed too. Abort it via exit_xh, don't crash as assertf
did. */
static void cache_zip_write_failed(httrackp *opt, cache_back *cache,
const char *what, int zErr) {
if (!cache->zipWriteFailed) {
cache->zipWriteFailed = HTS_TRUE;
if (check_fatal_io_errno()) {
hts_log_print(opt, LOG_ERROR,
"Mirror aborted: disk full or filesystem problems");
} else {
hts_log_print(opt, LOG_ERROR,
"Mirror aborted: cache write failed (%s): %s", what,
hts_get_zerror(zErr));
}
}
opt->state.exit_xh = -1; /* fatal: stop the mirror, exit non-zero */
}
/* Add a file to the cache */
void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
const char *url_adr, const char *url_fil, const char *url_save,
@@ -255,10 +236,6 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
const char *url_save_suffix = url_save;
int zErr;
/* already failed and aborting; don't touch the broken stream again */
if (cache->zipWriteFailed)
return;
// robots.txt hack
if (url_save == NULL) {
dataincache = 0; // testing links
@@ -369,8 +346,9 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
*/
headers, (uInt) strlen(headers), NULL, 0, NULL, /* comment */
Z_DEFLATED, Z_DEFAULT_COMPRESSION)) != Z_OK) {
cache_zip_write_failed(opt, cache, "opening a cache entry", zErr);
return;
int zip_zipOpenNewFileInZip_failed = 0;
assertf(zip_zipOpenNewFileInZip_failed);
}
/* Write data in cache */
@@ -380,8 +358,9 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
if ((zErr =
zipWriteInFileInZip((zipFile) cache->zipOutput, r->adr,
(int) r->size)) != Z_OK) {
cache_zip_write_failed(opt, cache, "writing to the cache", zErr);
return;
int zip_zipWriteInFileInZip_failed = 0;
assertf(zip_zipWriteInFileInZip_failed);
}
}
} else {
@@ -402,10 +381,9 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
if ((zErr =
zipWriteInFileInZip((zipFile) cache->zipOutput, buff,
(int) nl)) != Z_OK) {
cache_zip_write_failed(opt, cache, "writing to the cache",
zErr);
fclose(fp);
return;
int zip_zipWriteInFileInZip_failed = 0;
assertf(zip_zipWriteInFileInZip_failed);
}
}
} while(nl > 0);
@@ -419,14 +397,16 @@ void cache_add(httrackp * opt, cache_back * cache, const htsblk * r,
/* Close */
if ((zErr = zipCloseFileInZip((zipFile) cache->zipOutput)) != Z_OK) {
cache_zip_write_failed(opt, cache, "closing a cache entry", zErr);
return;
int zip_zipCloseFileInZip_failed = 0;
assertf(zip_zipCloseFileInZip_failed);
}
/* Flush */
if ((zErr = zipFlush((zipFile) cache->zipOutput)) != 0) {
cache_zip_write_failed(opt, cache, "flushing the cache", zErr);
return;
int zip_zipFlush_failed = 0;
assertf(zip_zipFlush_failed);
}
}

View File

@@ -47,7 +47,6 @@ Please visit our Website: http://www.httrack.com
#include "htslib.h"
#include "htszlib.h"
#include <errno.h>
#include <stdio.h>
#include <string.h>
@@ -317,136 +316,6 @@ static int disk_fallback_selftest(httrackp *opt) {
return fail;
}
typedef struct {
size_t budget; /**< bytes allowed through before writes start failing */
int fail_errno; /**< errno set on the failing write (ENOSPC, EIO, ...) */
int writes; /**< zwrite call count, to detect re-entry into the stream */
} writefail_inject;
/* zwrite that copies until the budget runs out, then fails with inj->fail_errno
(the #174/#219 condition). Counts calls so the test can prove a flagged cache
never re-enters the stream. */
static uLong selftest_failing_zwrite(voidpf opaque, voidpf stream,
const void *buf, uLong size) {
writefail_inject *inj = (writefail_inject *) opaque;
inj->writes++;
if (inj->budget >= (size_t) size) {
inj->budget -= (size_t) size;
return (uLong) fwrite(buf, 1, (size_t) size, (FILE *) stream);
}
errno = inj->fail_errno;
return 0; /* short write -> the minizip op returns an error */
}
/* Open a ZIP whose writes fail past inj->budget, so cache_add() hits an error.
*/
static zipFile selftest_open_failing_zip(const char *path,
writefail_inject *inj) {
zlib_filefunc_def ff;
fill_fopen_filefunc(&ff); /* real fopen/read/seek/close; ignores opaque */
ff.zwrite_file = selftest_failing_zwrite;
ff.opaque = inj;
return zipOpen2(path, APPEND_STATUS_CREATE, NULL, &ff);
}
/* Store one octet-stream body into `cache` (all-in-cache, body in the ZIP). */
static void writefail_store(httrackp *opt, cache_back *cache, const char *fil,
const char *body, size_t body_len) {
htsblk r;
char locbuf[4];
char *bodycopy = malloct(body_len);
hts_init_htsblk(&r);
r.statuscode = 200;
r.size = (LLint) body_len;
strcpybuff(r.msg, "OK");
strcpybuff(r.contenttype, "application/octet-stream");
locbuf[0] = '\0';
r.location = locbuf;
r.is_write = 0;
memcpy(bodycopy, body, body_len);
r.adr = bodycopy;
cache_add(opt, cache, &r, "example.com", fil, "example.com/blob.bin", 1,
NULL);
freet(bodycopy);
}
/* #174/#219: a failing cache write used to crash via assertf(); it must instead
stop the mirror (exit_xh = -1) without crashing. Assert that, plus the cache
is flagged and a sibling write doesn't re-enter the broken stream. */
int cache_write_failure_selftest(httrackp *opt, const char *dir) {
int fail = 0;
char path[HTS_URLMAXSIZE];
/* incompressible + big, so deflate flushes (and fails) mid-write, before
* close */
static const size_t body_len = 256 * 1024;
char *body = malloct(body_len);
int phase;
gen_body(body, body_len, 1 /* incompressible */);
fconcat(path, sizeof(path), dir, "/wfail.zip");
/* phase 0: fail on the body write, fatal errno (ENOSPC, the disk-full
branch). phase 1: fail on the open, non-fatal errno (EIO, dropped-share
branch). Both must abort the mirror. */
for (phase = 0; phase < 2; phase++) {
cache_back cache;
writefail_inject inj;
int writes_after_fail;
inj.budget = (phase == 0) ? 4096 : 0;
inj.fail_errno = (phase == 0) ? ENOSPC : EIO;
inj.writes = 0;
memset(&cache, 0, sizeof(cache));
cache.type = 1;
cache.log = stderr;
cache.errlog = stderr;
cache.hashtable = coucal_new(0);
cache.zipOutput = selftest_open_failing_zip(path, &inj);
if (cache.zipOutput == NULL) {
fprintf(stderr, "cache-writefail: could not open injected ZIP\n");
fail++;
continue;
}
opt->state.exit_xh = 0; /* clear; the failing write must set it to -1 */
writefail_store(opt, &cache, "/blob.bin", body, body_len);
if (!cache.zipWriteFailed) {
fprintf(stderr, "cache-writefail: phase %d: write error not caught\n",
phase);
fail++;
}
if (opt->state.exit_xh != -1) {
fprintf(stderr,
"cache-writefail: phase %d: mirror not aborted (exit_xh=%d)\n",
phase, opt->state.exit_xh);
fail++;
}
/* a flagged cache must no-op a sibling write: no further backend write */
writes_after_fail = inj.writes;
writefail_store(opt, &cache, "/blob2.bin", body, 16);
if (inj.writes != writes_after_fail) {
fprintf(stderr,
"cache-writefail: phase %d: sibling write re-entered the broken "
"stream (%d extra backend writes)\n",
phase, inj.writes - writes_after_fail);
fail++;
}
if (cache.zipOutput != NULL) {
zipClose(cache.zipOutput,
NULL); /* best-effort; may fail on the backend */
cache.zipOutput = NULL;
}
}
freet(body);
return fail;
}
int cache_selftests(httrackp *opt, const char *dir) {
int failures = 0;
cache_back cache;

@@ -52,10 +52,6 @@ int cache_selftests(httrackp *opt, const char *dir);
committed file, never by the test). Returns the failed-check count. */
int cache_golden_selftest(httrackp *opt, const char *dir, int regen);
/* #174/#219: assert a failing cache write aborts the mirror cleanly instead of
crashing. Returns the failed-check count. */
int cache_write_failure_selftest(httrackp *opt, const char *dir);
#endif
#endif

@@ -35,7 +35,6 @@ Please visit our Website: http://www.httrack.com
#include <fcntl.h>
#include <ctype.h>
#include <stdint.h> /* uint64_t for the pause mixer (already a hard dep via md5.h) */
/* File defs */
#include "htscore.h"
@@ -524,12 +523,9 @@ int httpmirror(char *url1, httrackp * opt) {
opt->cookie = &cookie;
cookie.max_len = 30000; // max len
strcpybuff(cookie.data, "");
// Load the mirror's cookies.txt, then the one in the current directory
// Load the default cookies.txt or the mirror's cookies.txt
cookie_load(opt->cookie, StringBuff(opt->path_log), "cookies.txt");
cookie_load(opt->cookie, "", "cookies.txt");
// A user-supplied cookie file is merged last so it wins on conflicts
if (strnotempty(StringBuff(opt->cookies_file)))
cookie_load(opt->cookie, "", StringBuff(opt->cookies_file));
} else
opt->cookie = NULL;
@@ -740,39 +736,26 @@ int httpmirror(char *url1, httrackp * opt) {
/* OPTIMIZED for fast load */
if (StringNotEmpty(opt->filelist)) {
char *filelist_buff = NULL;
size_t filelist_sz = 0;
const char *filelist_err = NULL; /* failure reason, NULL on success */
const off_t fs = fsize(StringBuff(opt->filelist));
const size_t filelist_sz = off_t_to_size_t(fsize(StringBuff(opt->filelist)));
if (fs < 0) {
/* fsize() hides the cause; redo stat() for a precise errno (#49) */
struct stat st;
filelist_err = stat(StringBuff(opt->filelist), &st) != 0
? strerror(errno)
: "not a regular file";
} else if ((filelist_sz = off_t_to_size_t(fs)) == (size_t) -1) {
filelist_err = "file too large";
filelist_sz = 0;
} else {
if (filelist_sz != (size_t) -1) {
FILE *fp = fopen(StringBuff(opt->filelist), "rb");
if (fp == NULL) {
filelist_err = strerror(errno);
} else {
if (fp) {
filelist_buff = malloct(filelist_sz + 1);
if (filelist_buff == NULL) {
filelist_err = "out of memory";
} else if (fread(filelist_buff, 1, filelist_sz, fp) != filelist_sz) {
freet(filelist_buff);
filelist_err = "read error";
} else {
filelist_buff[filelist_sz] = '\0';
if (filelist_buff) {
if (fread(filelist_buff, 1, filelist_sz, fp) != filelist_sz) {
freet(filelist_buff);
filelist_buff = NULL;
} else {
*(filelist_buff + filelist_sz) = '\0';
}
}
fclose(fp);
}
}
if (filelist_buff != NULL) {
if (filelist_buff) {
int filelist_ptr = 0;
int n = 0;
char BIGSTK line[HTS_URLMAXSIZE * 2];
@@ -797,8 +780,8 @@ int httpmirror(char *url1, httrackp * opt) {
// Free buffer
freet(filelist_buff);
} else {
hts_log_print(opt, LOG_ERROR, "Could not include URL list \"%s\": %s",
StringBuff(opt->filelist), filelist_err);
hts_log_print(opt, LOG_ERROR, "Could not include URL list: %s",
StringBuff(opt->filelist));
}
}
@@ -3315,21 +3298,6 @@ HTS_INLINE int back_fillmax(struct_back * sback, httrackp * opt,
return -1; /* no room left */
}
/* Seed-derived: stable within a gap, rerolls per launch; a per-call rand()
would bias the delay toward min_ms (see header). Jitter, not crypto. */
int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms) {
uint64_t z = (uint64_t) seed;
if (max_ms <= min_ms)
return min_ms;
/* SplitMix64 finalizer: scrambles the low-entropy ms timestamp. */
z += 0x9E3779B97F4A7C15ULL;
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
z ^= z >> 31;
return min_ms + (int) (z % (uint64_t) (max_ms - min_ms + 1));
}
int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
int n = opt->maxsoc - back_nsoc(sback);
@@ -3350,18 +3318,6 @@ int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt) {
}
}
// #185 randomized inter-file pause: non-blocking, one launch per gap
if (n > 0 && opt->pause_max_ms > 0 && HTS_STAT.last_connect > 0) {
TStamp opTime =
HTS_STAT.last_request ? HTS_STAT.last_request : HTS_STAT.last_connect;
TStamp lap = mtime_local() - opTime;
if (lap < hts_pause_target_ms(opTime, opt->pause_min_ms, opt->pause_max_ms))
n = 0;
else
n = 1;
}
return n;
}
@@ -3770,17 +3726,6 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (StringNotEmpty(from->user_agent))
StringCopyS(to->user_agent, from->user_agent);
if (StringNotEmpty(from->strip_query))
StringCopyS(to->strip_query, from->strip_query);
if (StringNotEmpty(from->cookies_file))
StringCopyS(to->cookies_file, from->cookies_file);
if (from->pause_max_ms > 0) {
to->pause_min_ms = from->pause_min_ms;
to->pause_max_ms = from->pause_max_ms;
}
if (from->retry > -1)
to->retry = from->retry;
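
The seed-derived pause target in the hunks above (`hts_pause_target_ms`) can be tried in isolation. The following is a minimal sketch rather than the project's code: the name `pause_target_ms` is hypothetical, and `int64_t` stands in for HTTrack's `TStamp`, which is an assumption. It keeps the same SplitMix64-style finalizer, so a given seed always maps to the same delay within `[min_ms, max_ms]`.

```c
#include <stdint.h>

/* Hypothetical standalone copy of the jitter computation shown in the diff.
   Assumption: the seed is a millisecond timestamp that fits in int64_t. */
int pause_target_ms(int64_t seed, int min_ms, int max_ms) {
  uint64_t z = (uint64_t) seed;

  /* Degenerate or inverted range: no jitter, just the minimum. */
  if (max_ms <= min_ms)
    return min_ms;

  /* SplitMix64 finalizer: spreads a low-entropy timestamp over 64 bits. */
  z += 0x9E3779B97F4A7C15ULL;
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
  z ^= z >> 31;

  /* Reduce to the inclusive range [min_ms, max_ms]. */
  return min_ms + (int) (z % (uint64_t) (max_ms - min_ms + 1));
}
```

Because the result depends only on the seed, polling the function repeatedly during one gap returns the same target each time, which appears to be why the diff's comment prefers it over a per-call `rand()`.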

@@ -214,8 +214,6 @@ struct cache_back {
cache_back_zip_entry *zipEntries;
int zipEntriesOffs;
int zipEntriesCapa;
hts_boolean
zipWriteFailed; /**< a cache write failed; stop touching the stream */
};
#ifndef HTS_DEF_FWSTRUCT_hash_struct
@@ -234,12 +232,8 @@ struct hash_struct {
coucal adrfil;
/* former address+path -> link index (renamed/moved entries) */
coucal former_adrfil;
/* effective urlhack sub-flags: www.==host / // collapse / query-arg sort */
hts_boolean norm_host;
hts_boolean norm_slash;
hts_boolean norm_query;
/* query-strip keys (not owned); set from opt->strip_query at hash_init */
const char *strip_query;
/* scratch buffers reused across lookups (not reentrant) */
int normalized;
char normfil[HTS_URLMAXSIZE * 2];
char normfil2[HTS_URLMAXSIZE * 2];
char catbuff[CATBUFF_SIZE];
@@ -368,22 +362,6 @@ int fspc(httrackp * opt, FILE * fp, const char *type);
char *next_token(char *p, int flag);
/* Like fil_normalized(), but first drops query keys in STRIP (comma-separated,
"*" = all); STRIP NULL/empty behaves exactly like fil_normalized(). */
char *fil_normalized_filtered(const char *source, char *dest,
const char *strip);
/* As fil_normalized_filtered(), but DO_SLASH/DO_QUERY gate the // collapse and
the query-argument sort independently (the urlhack sub-flags). */
char *fil_normalized_filtered_ex(const char *source, char *dest,
const char *strip, int do_slash, int do_query);
/* For URL ADR/FIL, return (in DEST) the comma keylist to strip from the
'\n'-separated "[pattern=]keys" RULES (patterns matched on host/path via
strjoker, last wins); NULL if none match. Feeds fil_normalized_filtered(). */
const char *hts_query_strip_keys(const char *rules, const char *adr,
const char *fil, char *dest, size_t destsize);
/* Read a whole file into a freshly malloc'd, NUL-terminated buffer; the caller
owns it and must release it with freet(). Return NULL on missing/unreadable
file (readfile_or substitutes defaultdata instead). The byte content is NOT
@@ -418,10 +396,6 @@ int back_pluggable_sockets(struct_back * sback, httrackp * opt);
int back_pluggable_sockets_strict(struct_back * sback, httrackp * opt);
/* Randomized inter-file pause target in [min_ms,max_ms] (#185), derived from a
timestamp seed so it is stable within one gap and rerolls per launch. */
int hts_pause_target_ms(TStamp seed, int min_ms, int max_ms);
/* Schedule more links from the heap into free slots. Returns the number queued,
or <=0 if none could be added (no free slot / paused / stopped). */
int back_fill(struct_back * sback, httrackp * opt, cache_back * cache,

File diff suppressed because it is too large

@@ -30,14 +30,12 @@ Please visit our Website: http://www.httrack.com
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#include <stdint.h>
#include "htscharset.h"
#include "htsencoding.h"
#include "htssafe.h"
/* static int decode_entity(const uint64_t hash, const size_t len);
*/
/* static int decode_entity(const unsigned int hash, const size_t len);
*/
#include "htsentities.h"
/* hexadecimal conversion */
@@ -52,31 +50,30 @@ static int get_hex_value(char c) {
return -1;
}
/* 64-bit FNV-1a; must match htsentities.sh, which keys the entity table on it.
*/
#define HASH_INIT 0xcbf29ce484222325ULL
#define HASH_PRIME 0x100000001b3ULL
#define HASH_ADD(HASH, C) \
do { \
(HASH) ^= (unsigned char) (C); \
(HASH) *= HASH_PRIME; \
} while (0)
/* Numerical Recipes,
see <http://en.wikipedia.org/wiki/Linear_congruential_generator> */
#define HASH_PRIME ( 1664525 )
#define HASH_CONST ( 1013904223 )
#define HASH_ADD(HASH, C) do { \
(HASH) *= HASH_PRIME; \
(HASH) += HASH_CONST; \
(HASH) += (C); \
} while(0)
int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t max, const char *charset) {
size_t i, j, ampStart, ampStartDest;
int uc;
int hex;
uint64_t hash;
unsigned int hash;
assertf(max != 0);
for (i = 0, j = 0, ampStart = (size_t) -1, ampStartDest = 0, uc = -1, hex = 0,
hash = HASH_INIT;
src[i] != '\0'; i++) {
for(i = 0, j = 0, ampStart = (size_t) -1, ampStartDest = 0,
uc = -1, hex = 0, hash = 0 ; src[i] != '\0' ; i++) {
/* start of entity */
if (src[i] == '&') {
ampStart = i;
ampStartDest = j;
hash = HASH_INIT;
hash = 0;
uc = -1;
}
/* inside a potential entity */
@@ -177,11 +174,14 @@ int hts_unescapeEntitiesWithCharset(const char *src, char *dest, const size_t ma
}
/* alphanumerical entity */
else {
/* alphanum, capped at the longest name
* '&CounterClockwiseContourIntegral;' (31) */
if (i <= ampStart + 31 && ((src[i] >= '0' && src[i] <= '9') ||
(src[i] >= 'A' && src[i] <= 'Z') ||
(src[i] >= 'a' && src[i] <= 'z'))) {
/* alphanum and not too far ('&thetasym;' is the longest) */
if (i <= ampStart + 10 &&
(
(src[i] >= '0' && src[i] <= '9')
|| (src[i] >= 'A' && src[i] <= 'Z')
|| (src[i] >= 'a' && src[i] <= 'z')
)
) {
/* compute hash */
HASH_ADD(hash, (unsigned char) src[i]);
} else {
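
One side of this hunk keys the entity table on a 64-bit FNV-1a hash through the `HASH_INIT`, `HASH_PRIME` and `HASH_ADD` macros. As a minimal sketch of that hash, the function below uses the same offset basis and prime; the name `fnv1a64` and its signature are hypothetical and do not appear in the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants as they appear in the diff (standard 64-bit FNV-1a values). */
#define FNV1A64_OFFSET 0xcbf29ce484222325ULL
#define FNV1A64_PRIME 0x100000001b3ULL

/* Hypothetical helper: hash `len` bytes of `s` with 64-bit FNV-1a.
   Each step XORs in one byte, then multiplies by the prime (mod 2^64). */
uint64_t fnv1a64(const char *s, size_t len) {
  uint64_t h = FNV1A64_OFFSET;
  size_t i;

  for (i = 0; i < len; i++) {
    h ^= (unsigned char) s[i];
    h *= FNV1A64_PRIME;
  }
  return h;
}
```

Hashing an entity name such as `amp` with `fnv1a64("amp", 3)` would give the kind of key the generated `switch` dispatches on, assuming the generator and the decoder feed the hash the same bytes.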

File diff suppressed because it is too large

@@ -1,92 +1,75 @@
#!/bin/bash
#
# Regenerate htsentities.h from the WHATWG named character references.
set -euo pipefail
src=entities.json
url=https://html.spec.whatwg.org/entities.json
src=html40.txt
url=http://www.w3.org/TR/1998/REC-html40-19980424/html40.txt
dest=htsentities.h
# 64-bit FNV-1a of $1, printed as a C constant. Must match the hash in
# htsencoding.c. The offset basis is stored as its wrapped (signed) bit pattern;
# bash arithmetic is 64-bit two's complement, so the result is bit-exact.
fnv1a() {
local s=$1 i c h=$((0xcbf29ce484222325))
for ((i = 0; i < ${#s}; i++)); do
printf -v c '%d' "'${s:i:1}"
h=$(((h ^ (c & 0xff)) * 0x100000001b3))
done
printf '0x%016xULL' "$h"
}
(
cat <<EOF
/*
-- ${dest} --
FILE GENERATED BY $0, DO NOT MODIFY
if [ ! -f "$src" ]; then
curl -fsS "$url" -o "$src"
fi
We compute the LCG hash
(see <http://en.wikipedia.org/wiki/Linear_congruential_generator>)
for each entity. We should in theory check using strncmp() that we
actually have the correct entity, but this is actually statistically
not needed.
# Keep ';'-terminated single-codepoint names; the ~93 multi-codepoint refs can't
# fit decode_entity's single-codepoint return and are skipped (left verbatim).
pairs=$(jq -r '
to_entries
| map(select((.key | endswith(";")) and (.value.codepoints | length == 1)))
| sort_by(.key)
| .[] | "\(.key | ltrimstr("&") | rtrimstr(";"))\t\(.value.codepoints[0])"' "$src")
We may want to do better, but we expect the hash function to be uniform, and
let the compiler be smart enough to optimize the switch (for example by
checking in log2() intervals)
This code has been generated using the evil $0 script.
*/
# Skipped multi-codepoint names, kept to prove none aliases an emitted hash.
skipped=$(jq -r '
to_entries
| map(select((.key | endswith(";")) and (.value.codepoints | length > 1)))
| .[] | .key | ltrimstr("&") | rtrimstr(";")' "$src")
cases=""
emit_hashes=""
while IFS=$'\t' read -r name cp; do
hash=$(fnv1a "$name")
cases+=" /* $name */"$'\n'
cases+=" case $hash:"$'\n'
cases+=" if (len == ${#name}) {"$'\n'
cases+=" return $cp;"$'\n'
cases+=" }"$'\n'
cases+=" break;"$'\n'
emit_hashes+="$hash"$'\n'
done <<<"$pairs"
skip_hashes=""
while IFS= read -r name; do
[ -n "$name" ] && skip_hashes+="$(fnv1a "$name")"$'\n'
done <<<"$skipped"
# The switch keys on the hash alone, so the dispatch is correct only while every
# emitted name hashes uniquely; prove it here, no runtime name compare needed.
dups=$(printf '%s' "$emit_hashes" | sort | uniq -d || true)
if [ -n "$dups" ]; then
echo "FATAL: two entity names share a hash (duplicate switch case); change the hash:" >&2
echo "$dups" >&2
exit 1
fi
# A skipped name colliding with an emitted hash would mis-decode instead of
# staying verbatim; forbid that too.
aliased=$(comm -12 <(printf '%s' "$emit_hashes" | sort -u) <(printf '%s' "$skip_hashes" | sort -u) || true)
if [ -n "$aliased" ]; then
echo "FATAL: a skipped multi-codepoint name aliases an emitted hash:" >&2
echo "$aliased" >&2
exit 1
fi
cat >"$dest" <<EOF
/* GENERATED by $0 from the WHATWG named character references
(${url}). DO NOT EDIT.
Dispatch keys on a 64-bit FNV-1a hash of the entity name; the generator
aborts on any hash collision, so no runtime name compare is needed. */
#include <stdint.h>
static int decode_entity(const uint64_t hash, const size_t len) {
static int decode_entity(const unsigned int hash, const size_t len) {
switch(hash) {
${cases} }
EOF
(
if test -f ${src}; then
cat ${src}
else
GET "${url}"
fi
) |
grep -E '^<!ENTITY [a-zA-Z0-9_]' |
sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
(
read -r A
while test -n "$A"; do
ent="${A%% *}"
code=$(echo "$A" | cut -f2 -d' ')
# compute hash
hash=0
i=0
a=1664525
c=1013904223
m="$((1 << 32))"
while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
hash="$((((hash * a) % (m) + d + c) % (m)))"
i=$((i + 1))
done
echo -e " /* $A */"
echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};"
echo -e " }"
echo -e " break;"
# next
read -r A
done
)
cat <<EOF
}
/* unknown */
return -1;
}
EOF
echo "wrote $dest ($(grep -c '^ case ' "$dest") entities)" >&2
) >${dest}

@@ -76,8 +76,7 @@ int fa_strjoker(int type, char **filters, int nfil, const char *nom, LLint * siz
}
if (size)
sz = *size;
/* size unknown (scan time): no size pointer => size tests stay neutral */
if (strjoker(nom, filters[i] + filteroffs, size ? &sz : NULL, size_flag)) {
if (strjoker(nom, filters[i] + filteroffs, &sz, size_flag)) { // matched
if (size)
if (sz != *size)
sizelimit = sz;
@@ -193,12 +192,7 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
int len = (int) strlen(joker);
while((joker[i] != RIGHT) && (joker[i]) && (i < len)) {
// '\' escapes the next char as a literal member, e.g. *[\[\]]
if (joker[i] == '\\' && joker[i + 1] != '\0') {
i++;
pass[(int) (unsigned char) joker[i]] = 1;
i++;
} else if ((joker[i] == '<') || (joker[i] == '>')) { // *[<10]
if ((joker[i] == '<') || (joker[i] == '>')) { // *[<10]
int lsize = 0;
int lverdict;
@@ -226,9 +220,7 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
while(isdigit((unsigned char) joker[i]))
i++;
}
} else if (joker[i + 1] == '-' && joker[i + 2] != '\0') {
// range *[A-Z]; the '\0' guard rejects a truncated *[a- (else
// i+=3 overshoots the NUL)
} else if (joker[i + 1] == '-') { // 2 chars, e.g. *[A-Z]
if ((int) (unsigned char) joker[i + 2] >
(int) (unsigned char) joker[i]) {
int j;
@@ -240,7 +232,10 @@ HTS_INLINE const char *strjoker(const char *chaine, const char *joker, LLint * s
}
// else err=1;
i += 3;
} else { // 1 char, e.g. *[ ]
} else { // 1 char, e.g. *[ ]
if (joker[i + 2] == '\\' && joker[i + 3] != 0) { // escaped char, such as *[\[] or *[\]]
i++;
}
pass[(int) (unsigned char) joker[i]] = 1;
i++;
}

@@ -43,8 +43,8 @@ Please visit our Website: http://www.httrack.com
configure.ac, decoupled from these). VERSION is the display form, VERSIONID
the dotted numeric form, AFF_VERSION the short form shown in footers,
LIB_VERSION the data/cache format generation. */
#define HTTRACK_VERSION "3.49-10"
#define HTTRACK_VERSIONID "3.49.10"
#define HTTRACK_VERSION "3.49-9"
#define HTTRACK_VERSIONID "3.49.9"
#define HTTRACK_AFF_VERSION "3.x"
#define HTTRACK_LIB_VERSION "2.0"
@@ -229,10 +229,6 @@ Please visit our Website: http://www.httrack.com
#define HTS_DEFAULT_FOOTER \
"<!-- Mirrored from %s%s by HTTrack Website Copier/" HTTRACK_AFF_VERSION \
" " HTTRACK_AFF_AUTHORS ", %s -->"
/* Honest crawler User-Agent; no fake OS/browser to go stale. */
#define HTS_DEFAULT_USER_AGENT \
"Mozilla/5.0 (compatible; HTTrack/" HTTRACK_AFF_VERSION \
"; +https://www.httrack.com/)"
#define HTTRACK_WEB "http://www.httrack.com"
#define HTS_UPDATE_WEBSITE \
"http://www.httrack.com/" \

@@ -106,10 +106,10 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,
const lien_url*const lien = (const lien_url*) value;
const char *const adr = !former ? lien->adr : lien->former_adr;
const char *const fil = !former ? lien->fil : lien->former_fil;
const char *const adr_norm =
adr != NULL ? (hash->norm_host ? jump_normalized_const(adr)
: jump_identification_const(adr))
: NULL;
const char *const adr_norm = adr != NULL ?
( hash->normalized ? jump_normalized_const(adr)
: jump_identification_const(adr) )
: NULL;
// copy address
assertf(adr_norm != NULL);
@@ -117,18 +117,10 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,
// copy link
assertf(fil != NULL);
{
/* resolve the per-URL strip keys; strip applies even when urlhack is off */
char BIGSTK keybuf[HTS_URLMAXSIZE];
const char *const keys = hts_query_strip_keys(hash->strip_query, adr, fil,
keybuf, sizeof(keybuf));
if (hash->norm_slash || hash->norm_query || keys != NULL) {
fil_normalized_filtered_ex(fil, &hash->normfil[strlen(hash->normfil)],
keys, hash->norm_slash, hash->norm_query);
} else {
strcpy(&hash->normfil[strlen(hash->normfil)], fil);
}
if (hash->normalized) {
fil_normalized(fil, &hash->normfil[strlen(hash->normfil)]);
} else {
strcpy(&hash->normfil[strlen(hash->normfil)], fil);
}
// hash
@@ -140,7 +132,8 @@ static int key_adrfil_equals_generic(void *arg,
coucal_key_const a_,
coucal_key_const b_,
const int former) {
hash_struct *const hash = (hash_struct *) arg;
hash_struct *const hash = (hash_struct*) arg;
const int normalized = hash->normalized;
const lien_url*const a = (const lien_url*) a_;
const lien_url*const b = (const lien_url*) b_;
const char *const a_adr = !former ? a->adr : a->former_adr;
@@ -157,10 +150,10 @@ static int key_adrfil_equals_generic(void *arg,
assertf(b_fil != NULL);
// skip scheme and authentication to the domain (possibly without www.)
ja = hash->norm_host ? jump_normalized_const(a_adr)
: jump_identification_const(a_adr);
jb = hash->norm_host ? jump_normalized_const(b_adr)
: jump_identification_const(b_adr);
ja = normalized
? jump_normalized_const(a_adr) : jump_identification_const(a_adr);
jb = normalized
? jump_normalized_const(b_adr) : jump_identification_const(b_adr);
assertf(ja != NULL);
assertf(jb != NULL);
if (strcasecmp(ja, jb) != 0) {
@@ -168,23 +161,12 @@ static int key_adrfil_equals_generic(void *arg,
}
// now compare paths
{
char BIGSTK ka[HTS_URLMAXSIZE], kb[HTS_URLMAXSIZE];
const char *const keysa =
hts_query_strip_keys(hash->strip_query, a_adr, a_fil, ka, sizeof(ka));
const char *const keysb =
hts_query_strip_keys(hash->strip_query, b_adr, b_fil, kb, sizeof(kb));
if (hash->norm_slash || hash->norm_query || keysa != NULL ||
keysb != NULL) {
fil_normalized_filtered_ex(a_fil, hash->normfil, keysa, hash->norm_slash,
hash->norm_query);
fil_normalized_filtered_ex(b_fil, hash->normfil2, keysb, hash->norm_slash,
hash->norm_query);
return strcmp(hash->normfil, hash->normfil2) == 0;
} else {
return strcmp(a_fil, b_fil) == 0;
}
if (normalized) {
fil_normalized(a_fil, hash->normfil);
fil_normalized(b_fil, hash->normfil2);
return strcmp(hash->normfil, hash->normfil2) == 0;
} else {
return strcmp(a_fil, b_fil) == 0;
}
}
@@ -240,17 +222,11 @@ static int key_former_adrfil_equals(void *arg,
return key_adrfil_equals_generic(arg, a, b, 1);
}
void hash_init(httrackp *opt, hash_struct *hash, hts_boolean normalized) {
void hash_init(httrackp *opt, hash_struct * hash, int normalized) {
hash->sav = coucal_new(0);
hash->adrfil = coucal_new(0);
hash->former_adrfil = coucal_new(0);
/* urlhack is the umbrella; per-feature negatives opt out of each part */
hash->norm_host = normalized && !opt->no_www_dedup;
hash->norm_slash = normalized && !opt->no_slash_dedup;
hash->norm_query = normalized && !opt->no_query_dedup;
/* snapshot the query-strip list (not owned; valid for the hash lifetime) */
hash->strip_query =
StringNotEmpty(opt->strip_query) ? StringBuff(opt->strip_query) : NULL;
hash->normalized = normalized;
hts_set_hash_handler(hash->sav, opt);
hts_set_hash_handler(hash->adrfil, opt);
@@ -306,26 +282,6 @@ void hash_free(hash_struct *hash) {
}
}
/* Test helper: do the two URLs dedupe to the same key under opt's urlhack
flags? Exercises the live hash compare (norm_host/slash/query resolution). */
hts_boolean hash_url_equals(httrackp *opt, const char *adra, const char *fila,
const char *adrb, const char *filb) {
hash_struct hash;
lien_url la, lb;
hts_boolean eq;
memset(&la, 0, sizeof(la));
memset(&lb, 0, sizeof(lb));
la.adr = key_duphandler(NULL, adra);
la.fil = key_duphandler(NULL, fila);
lb.adr = key_duphandler(NULL, adrb);
lb.fil = key_duphandler(NULL, filb);
hash_init(opt, &hash, opt->urlhack);
eq = key_adrfil_equals(&hash, &la, &lb);
hash_free(&hash);
return eq;
}
// returns: position, or -1 if not found
int hash_read(const hash_struct * hash, const char *nom1, const char *nom2,
hash_struct_type type) {

@@ -51,12 +51,8 @@ typedef enum hash_struct_type {
} hash_struct_type;
// hash tables
void hash_init(httrackp *opt, hash_struct *hash, hts_boolean normalized);
void hash_init(httrackp *opt, hash_struct *hash, int normalized);
void hash_free(hash_struct *hash);
/* Test helper: HTS_TRUE if the two URLs dedupe together under opt's urlhack
flags. */
hts_boolean hash_url_equals(httrackp *opt, const char *adra, const char *fila,
const char *adrb, const char *filb);
int hash_read(const hash_struct * hash, const char *nom1, const char *nom2,
hash_struct_type type);
void hash_write(hash_struct * hash, size_t lpos);

@@ -521,7 +521,6 @@ void help(const char *app, int more) {
infomsg(" EN maximum mirror time in seconds (60=1 minute, 3600=1 hour)");
infomsg(" AN maximum transfer rate in bytes/seconds (1000=1KB/s max)");
infomsg(" %cN maximum number of connections/seconds (*%c10)");
infomsg(" %G random pause of MIN[:MAX] seconds between files (e.g. %G5:10)");
infomsg
(" GN pause transfer if N bytes reached, and wait until lock file is deleted");
infomsg("");
@@ -564,7 +563,6 @@ void help(const char *app, int more) {
(" %x do not include any password for external password protected websites (%x0 include)");
infomsg
(" %q *include query string for local files (useless, for information purpose only) (%q0 don't include)");
infomsg(" %g strip query keys for dedup ([host/pattern=]key1,key2,...)");
infomsg
(" o *generate output html file in case of error (404..) (o0 don't generate)");
infomsg(" X *purge old files after update (X0 keep delete)");
@@ -573,7 +571,6 @@ void help(const char *app, int more) {
infomsg("");
infomsg("Spider options:");
infomsg(" bN accept cookies in cookies.txt (0=do not accept,* 1=accept)");
infomsg(" %K load extra cookies from a Netscape cookies.txt");
infomsg
(" u check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always)");
infomsg
@@ -590,9 +587,6 @@ void help(const char *app, int more) {
(" %s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..)");
infomsg
(" %u url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..)");
infomsg(" opt out of one url-hack part: --keep-www-prefix "
"(www.foo.com<>foo.com), --keep-double-slashes (//), "
"--keep-query-order (?b&a)");
infomsg
(" %A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip)");
infomsg(" shortcut: '--assume standard' is equivalent to -%A "
@@ -652,7 +646,9 @@ void help(const char *app, int more) {
infomsg("");
infomsg("Guru options: (do NOT use if possible)");
infomsg(" #X *use optimized engine (limited memory boundary checks)");
infomsg(" #test list engine self-tests (run one with -#test=NAME [args])");
infomsg(" #0 filter test (-#0 '*.gif' 'www.bar.com/foo.gif')");
infomsg(" #1 simplify test (-#1 ./foo/bar/../foobar)");
infomsg(" #2 type test (-#2 /foo/bar.php)");
infomsg(" #C cache list (-#C '*.com/spider*.gif')");
infomsg(" #R cache repair (damaged cache)");
infomsg(" #d debug parser");

@@ -563,39 +563,6 @@ const char *hts_mime[][2] = {
{"", ""}
};
/* Modern web formats (post-2010), kept in their own table: appending to the
legacy hts_mime[] above makes clang-format reflow its whole initializer.
Scanned after hts_mime[], so it never shadows a legacy mapping. */
static const char *hts_mime_modern[][2] = {
{"image/webp", "webp"},
{"image/avif", "avif"},
{"image/heic", "heic"},
{"font/woff", "woff"},
{"font/woff2", "woff2"},
{"font/ttf", "ttf"},
{"font/otf", "otf"},
{"application/json", "json"},
{"application/ld+json", "jsonld"},
{"application/manifest+json", "webmanifest"},
{"application/wasm", "wasm"},
{"text/javascript", "js"},
{"text/javascript", "mjs"},
{"text/markdown", "md"},
{"video/mp4", "mp4"},
{"video/webm", "webm"},
{"video/ogg", "ogv"},
{"video/mp2t", "ts"},
{"audio/mp4", "m4a"},
{"audio/aac", "aac"},
{"audio/ogg", "oga"},
{"audio/opus", "opus"},
{"audio/flac", "flac"},
{"audio/webm", "weba"},
{"application/x-7z-compressed", "7z"},
{"application/x-rar-compressed", "rar"},
{"application/zstd", "zst"},
{"", ""}};
// Reserved (RFC2396)
#define CIS(c,ch) ( ((unsigned char)(c)) == (ch) )
#define CHAR_RESERVED(c) ( CIS(c,';') \
@@ -3643,10 +3610,7 @@ static int sortNormFnc(const void *a_, const void *b_) {
return strcmp(*a + 1, *b + 1);
}
/* Path normalizer core: optionally collapse redundant '//' (DO_SLASH) and/or
sort query arguments (DO_QUERY) so equivalent URLs dedupe. */
static char *fil_normalized_ex(const char *source, char *dest, int do_slash,
int do_query) {
HTSEXT_API char *fil_normalized(const char *source, char *dest) {
char lastc = 0;
int gotquery = 0;
int ampargs = 0;
@@ -3656,8 +3620,8 @@ static char *fil_normalized_ex(const char *source, char *dest, int do_slash,
for(i = j = 0; source[i] != '\0'; i++) {
if (!gotquery && source[i] == '?')
gotquery = ampargs = 1;
if (do_slash && !gotquery && lastc == '/' && source[i] == '/') {
// foo//bar -> foo/bar
if ((!gotquery && lastc == '/' && source[i] == '/') // foo//bar -> foo/bar
) {
} else {
if (gotquery && source[i] == '&') {
ampargs++;
@@ -3669,7 +3633,7 @@ static char *fil_normalized_ex(const char *source, char *dest, int do_slash,
dest[j++] = '\0';
/* Sort arguments (&foo=1&bar=2 == &bar=2&foo=1) */
if (do_query && ampargs > 1) {
if (ampargs > 1) {
char **amps = malloct(ampargs * sizeof(char *));
char *copyBuff = NULL;
size_t qLen = 0;
@@ -3717,153 +3681,6 @@ static char *fil_normalized_ex(const char *source, char *dest, int do_slash,
return dest;
}
HTSEXT_API char *fil_normalized(const char *source, char *dest) {
return fil_normalized_ex(source, dest, 1, 1);
}
/* Is query key ARG[0..keylen) in the comma-separated STRIP list? "*" = all;
case-sensitive, space-trimmed tokens. */
static int hts_query_key_stripped(const char *arg, size_t keylen,
const char *strip) {
const char *p = strip;
while (*p != '\0') {
const char *start = p;
size_t toklen;
while (*p != '\0' && *p != ',')
p++;
toklen = (size_t) (p - start);
while (toklen > 0 && *start == ' ') {
start++;
toklen--;
}
while (toklen > 0 && start[toklen - 1] == ' ')
toklen--;
if (toklen == 1 && start[0] == '*')
return 1;
if (toklen == keylen && strncmp(start, arg, keylen) == 0)
return 1;
if (*p == ',')
p++;
}
return 0;
}
/* see htscore.h */
char *fil_normalized_filtered_ex(const char *source, char *dest,
const char *strip, int do_slash,
int do_query) {
const char *query;
char BIGSTK tmp[HTS_URLMAXSIZE * 2];
htsbuff cb;
int wrote = 0;
/* No strip list, or no query: plain normalization. */
if (strip == NULL || *strip == '\0' ||
(query = strchr(source, '?')) == NULL) {
return fil_normalized_ex(source, dest, do_slash, do_query);
}
/* Copy the path, re-emit kept query args, let fil_normalized() sort. Walk
every field incl. empty/trailing ("a&","?&&") so the result is a fixpoint
(the read re-normalizes it; a dropped empty arg would miss dedup). */
cb = htsbuff_ptr(tmp, sizeof(tmp));
htsbuff_catn(&cb, source, (size_t) (query - source));
for (query++;;) {
const char *const arg = query;
const char *eq = NULL;
size_t keylen, arglen;
while (*query != '\0' && *query != '&') {
if (eq == NULL && *query == '=')
eq = query;
query++;
}
arglen = (size_t) (query - arg);
keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
if (!hts_query_key_stripped(arg, keylen, strip)) {
htsbuff_catc(&cb, wrote ? '&' : '?');
htsbuff_catn(&cb, arg, arglen);
wrote = 1;
}
if (*query == '\0')
break;
query++;
}
return fil_normalized_ex(tmp, dest, do_slash, do_query);
}
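
The strip-list matching and the "re-emit kept query args" loop above can be sketched on their own. This is a minimal sketch on plain C strings: it leaves out the sorting and slash cleanup that `fil_normalized_ex` does afterwards. The names `key_is_stripped` and `strip_query_keys` are hypothetical, not HTTrack API, and returning NULL on overflow is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Is key[0..keylen) listed in the comma-separated strip list? "*" = all.
   Tokens are space-trimmed and compared case-sensitively. */
static int key_is_stripped(const char *key, size_t keylen, const char *strip) {
  const char *p = strip;
  while (*p != '\0') {
    const char *start = p;
    size_t toklen;
    while (*p != '\0' && *p != ',')
      p++;
    toklen = (size_t) (p - start);
    while (toklen > 0 && *start == ' ') {
      start++;
      toklen--;
    }
    while (toklen > 0 && start[toklen - 1] == ' ')
      toklen--;
    if (toklen == 1 && start[0] == '*')
      return 1;
    if (toklen == keylen && strncmp(start, key, keylen) == 0)
      return 1;
    if (*p == ',')
      p++;
  }
  return 0;
}

/* Copy url into dest (capacity destsize), dropping stripped query args.
   Empty and trailing fields are walked too, so "a=1&" keeps its final '&'.
   Returns dest, or NULL if the result does not fit (sketch-only policy). */
char *strip_query_keys(const char *url, const char *strip, char *dest,
                       size_t destsize) {
  const char *query = strchr(url, '?');
  size_t len;
  int wrote = 0;
  if (query == NULL)
    query = url + strlen(url);
  len = (size_t) (query - url);
  if (len >= destsize)
    return NULL;
  memcpy(dest, url, len); /* the path part, copied untouched */
  if (*query == '?') {
    for (query++;;) {
      const char *const arg = query;
      const char *eq = NULL;
      size_t keylen, arglen;
      while (*query != '\0' && *query != '&') {
        if (eq == NULL && *query == '=')
          eq = query;
        query++;
      }
      arglen = (size_t) (query - arg);
      keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
      if (!key_is_stripped(arg, keylen, strip)) {
        if (len + 1 + arglen >= destsize)
          return NULL;
        dest[len++] = wrote ? '&' : '?';
        memcpy(dest + len, arg, arglen);
        len += arglen;
        wrote = 1;
      }
      if (*query == '\0')
        break;
      query++;
    }
  }
  dest[len] = '\0';
  return dest;
}
```

Called with the URL `/p?utm_source=x&id=7&fbclid=y` and the strip list `utm_source, fbclid`, the sketch keeps only the `id` argument.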
/* see htscore.h */
char *fil_normalized_filtered(const char *source, char *dest,
const char *strip) {
return fil_normalized_filtered_ex(source, dest, strip, 1, 1);
}
/* see htscore.h */
const char *hts_query_strip_keys(const char *rules, const char *adr,
const char *fil, char *dest, size_t destsize) {
const char *p, *q;
const char *result = NULL;
char BIGSTK url[HTS_URLMAXSIZE * 2];
if (rules == NULL || *rules == '\0' || destsize == 0)
return NULL;
/* Match string = normalized host/path, query removed. jump_normalized_const
collapses www+scheme/auth so read and write (double-normalized) agree;
query excluded keeps the decision on host/path only. */
url[0] = '\0';
strcatbuff(url, jump_normalized_const(adr));
if (fil[0] != '/')
strcatbuff(url, "/");
q = strchr(fil, '?');
if (q != NULL)
strncatbuff(url, fil, (int) (q - fil));
else
strcatbuff(url, fil);
/* Walk the '\n' entries; last match wins (like the +/- filter eval). Each is
"pattern=keys"; no '=' is the bare form, pattern "*". */
for (p = rules; *p != '\0';) {
const char *const line = p;
const char *eol, *eq, *keys;
char BIGSTK pat[HTS_URLMAXSIZE * 2];
while (*p != '\0' && *p != '\n')
p++;
eol = p;
if (*p == '\n')
p++;
if (eol == line)
continue;
eq = memchr(line, '=', (size_t) (eol - line));
if (eq != NULL) {
size_t patlen = (size_t) (eq - line);
if (patlen >= sizeof(pat))
patlen = sizeof(pat) - 1;
memcpy(pat, line, patlen);
pat[patlen] = '\0';
keys = eq + 1;
} else {
pat[0] = '*';
pat[1] = '\0';
keys = line;
}
if (strjoker(url, pat, NULL, NULL) != NULL) {
size_t klen = (size_t) (eol - keys);
if (klen >= destsize)
klen = destsize - 1;
memcpy(dest, keys, klen);
dest[klen] = '\0';
result = dest;
}
}
return result;
}
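
The "last match wins" walk over newline-separated `pattern=keys` rules can be tried in isolation. This is a sketch under assumptions: `glob_match` is a hypothetical stand-in that only understands `*` (HTTrack's real matcher is `strjoker`, which supports much more), `pick_strip_keys` is an illustrative name, and the 256-byte pattern buffer is arbitrary.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strjoker(): literal characters plus '*'. */
static int glob_match(const char *pat, const char *s) {
  if (*pat == '\0')
    return *s == '\0';
  if (*pat == '*') {
    /* '*' matches any run of characters, including the empty one */
    for (;; s++) {
      if (glob_match(pat + 1, s))
        return 1;
      if (*s == '\0')
        return 0;
    }
  }
  return *s == *pat && glob_match(pat + 1, s + 1);
}

/* Return the keys of the LAST rule whose pattern matches url (copied into
   dest), or NULL if no rule matches. A rule without '=' is the bare form and
   applies to every URL (pattern "*"). */
const char *pick_strip_keys(const char *rules, const char *url, char *dest,
                            size_t destsize) {
  const char *result = NULL;
  const char *p = rules;
  if (destsize == 0)
    return NULL;
  while (*p != '\0') {
    const char *const line = p;
    const char *eol, *eq, *keys;
    char pat[256];
    while (*p != '\0' && *p != '\n')
      p++;
    eol = p;
    if (*p == '\n')
      p++;
    if (eol == line)
      continue; /* skip empty lines */
    eq = memchr(line, '=', (size_t) (eol - line));
    if (eq != NULL) {
      size_t patlen = (size_t) (eq - line);
      if (patlen >= sizeof(pat))
        patlen = sizeof(pat) - 1;
      memcpy(pat, line, patlen);
      pat[patlen] = '\0';
      keys = eq + 1;
    } else {
      pat[0] = '*';
      pat[1] = '\0';
      keys = line;
    }
    if (glob_match(pat, url)) {
      size_t klen = (size_t) (eol - keys);
      if (klen >= destsize)
        klen = destsize - 1;
      memcpy(dest, keys, klen);
      dest[klen] = '\0';
      result = dest; /* a later matching rule overwrites this */
    }
  }
  return result;
}
```

With the two rules `a,b` and `*example.com/shop/*=sid`, a shop URL ends up with `sid` because the later rule overrides the bare one, while any other URL keeps `a,b`.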
#define endwith(a) ( (len >= (sizeof(a)-1)) ? ( strncmp(dest, a+len-(sizeof(a)-1), sizeof(a)-1) == 0 ) : 0 );
HTSEXT_API char *adr_normalized_sized(const char *source, char *dest,
size_t destsize) {
@@ -4341,20 +4158,6 @@ void guess_httptype(httrackp * opt, char *s, const char *fil) {
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, 1);
}
// first match in a NUL-terminated {mime,ext} table. key selects the lookup
// column (0=mime, 1=ext); returns the other column, or NULL if no row matches
// (a "*" partner means the row carries no value).
static const char *hts_mime_lookup(const char *(*table)[2], int key,
const char *needle) {
int j;
for (j = 0; strnotempty(table[j][1]); j++) {
if (strfield2(table[j][key], needle) && table[j][!key][0] != '*')
return table[j][!key];
}
return NULL;
}
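
The two-column lookup above can be exercised on a toy table. This is a sketch under assumptions: the rows in `demo_mime` are made up for illustration, `pair_lookup` and `eq_nocase` are hypothetical names, and the case-insensitive whole-string comparison is only my reading of what `strfield2` does.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive equality of two NUL-terminated strings
   (assumed to approximate strfield2). */
static int eq_nocase(const char *a, const char *b) {
  while (*a != '\0' && *b != '\0') {
    if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
      return 0;
    a++;
    b++;
  }
  return *a == *b;
}

/* Hypothetical {mime, ext} rows; a row with an empty ext ends the table. */
static const char *const demo_mime[][2] = {
  {"text/html", "html"},
  {"image/jpeg", "jpg"},
  {"*", "bak"}, /* extension is listed, but carries no mime value */
  {"", ""}
};

/* key selects the column to search (0 = mime, 1 = ext); the other column is
   returned. Rows whose partner is "*" are skipped, so the scan continues. */
const char *pair_lookup(const char *const (*table)[2], int key,
                        const char *needle) {
  int j;
  for (j = 0; table[j][1][0] != '\0'; j++) {
    if (eq_nocase(table[j][key], needle) && table[j][!key][0] != '*')
      return table[j][!key];
  }
  return NULL;
}
```

Searching by extension is `pair_lookup(demo_mime, 1, "html")`; searching by type is `pair_lookup(demo_mime, 0, "image/jpeg")`. Using `!key` to address the partner column is what lets one helper serve both directions.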
// write the mime type for fil into s (capacity ssize)
// flag: 1 to always return a type (the "application/..." / octet-stream
// fallback) returns 1 if a type was written to s, 0 otherwise
@@ -4374,19 +4177,20 @@ HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
/* Check html -> text/html */
const char *a = fil + strlen(fil) - 1;
/* a < fil when fil is empty: bound before dereferencing */
while ((a > fil) && (*a != '.') && (*a != '/'))
while((*a != '.') && (*a != '/') && (a > fil))
a--;
if (a >= fil && *a == '.' && strlen(a) < 32) {
const char *mime;
if (*a == '.' && strlen(a) < 32) {
int j = 0;
a++;
mime = hts_mime_lookup(hts_mime, 1, a);
if (mime == NULL)
mime = hts_mime_lookup(hts_mime_modern, 1, a);
if (mime != NULL) {
strlcpybuff(s, mime, ssize);
return 1;
while(strnotempty(hts_mime[j][1])) {
if (strfield2(hts_mime[j][1], a)) {
if (hts_mime[j][0][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][0], ssize);
return 1;
}
}
j++;
}
if (flag) {
@@ -4521,16 +4325,18 @@ int get_userhttptype(httrackp * opt, char *s, const char *fil) {
// returns 1 if an extension was found (and written to s), 0 otherwise
int give_mimext(char *s, size_t ssize, const char *st) {
int ok = 0;
const char *ext;
int j = 0;
st = hts_effective_mime(st); /* no declared type: derive an html ext */
s[0] = '\0';
ext = hts_mime_lookup(hts_mime, 0, st);
if (ext == NULL)
ext = hts_mime_lookup(hts_mime_modern, 0, st);
if (ext != NULL) {
strlcpybuff(s, ext, ssize);
ok = 1;
while((!ok) && (strnotempty(hts_mime[j][1]))) {
if (strfield2(hts_mime[j][0], st)) {
if (hts_mime[j][1][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][1], ssize);
ok = 1;
}
}
j++;
}
// wrap "x" mimetypes, such as:
// application/x-mp3
@@ -6048,7 +5854,8 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->shell = HTS_FALSE;
opt->proxy.active = 0; // no proxy
opt->user_agent_send = HTS_TRUE;
StringCopy(opt->user_agent, HTS_DEFAULT_USER_AGENT);
StringCopy(opt->user_agent,
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->referer, "");
StringCopy(opt->from, "");
opt->savename_83 = HTS_SAVENAME_83_LONG; // long names by default
@@ -6082,14 +5889,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->verbosedisplay = HTS_VERBOSE_NONE; // no text animation
opt->sizehack = HTS_FALSE;
opt->urlhack = HTS_TRUE;
opt->no_www_dedup = HTS_FALSE;
opt->no_slash_dedup = HTS_FALSE;
opt->no_query_dedup = HTS_FALSE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
StringCopy(opt->strip_query, "");
StringCopy(opt->cookies_file, "");
opt->pause_min_ms = 0;
opt->pause_max_ms = 0;
opt->ftp_proxy = HTS_TRUE;
opt->convert_utf8 = HTS_TRUE;
StringCopy(opt->filelist, "");
@@ -6234,8 +6034,6 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
StringFree(opt->urllist);
StringFree(opt->footer);
StringFree(opt->mod_blacklist);
StringFree(opt->strip_query);
StringFree(opt->cookies_file);
StringFree(opt->path_html);
StringFree(opt->path_html_utf8);

@@ -198,13 +198,6 @@ int url_savename(lien_adrfilsave *const afs,
// copy of fil, used for lookups (see urlhack)
const char *normadr = adr;
const char *normfil = fil_complete;
/* query keys to strip for this URL (NULL = none); decoupled from urlhack */
char BIGSTK stripkeys[HTS_URLMAXSIZE];
const char *const strip =
StringNotEmpty(opt->strip_query)
? hts_query_strip_keys(StringBuff(opt->strip_query), adr,
fil_complete, stripkeys, sizeof(stripkeys))
: NULL;
const char *const print_adr = jump_protocol_const(adr);
const char *start_pos = NULL, *nom_pos = NULL, *dot_pos = NULL; // name and dot positions
@@ -237,13 +230,9 @@ int url_savename(lien_adrfilsave *const afs,
// www-42.foo.com -> foo.com
// foo.com/bar//foobar -> foo.com/bar/foobar
if (opt->urlhack) {
// dedup-lookup key; honor the per-feature negatives like htshash.c so
// distinct URLs keep distinct savenames (else keep normadr = adr)
if (!opt->no_www_dedup)
normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
normfil =
fil_normalized_filtered_ex(fil_complete, normfil_, strip,
!opt->no_slash_dedup, !opt->no_query_dedup);
// copy of adr (without protocol), used for lookups (see urlhack)
normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
normfil = fil_normalized(fil_complete, normfil_);
} else {
if (link_has_authority(adr_complete)) { // https or other protocols : in "http/" subfolder
char *pos = strchr(adr_complete, ':');
@@ -256,11 +245,6 @@ int url_savename(lien_adrfilsave *const afs,
normadr = normadr_;
}
}
// strip still applies with urlhack off (host left untouched); no // or
// query-sort here, to match the hash key (norm_slash/norm_query are 0 when
// urlhack is off) so a URL is looked up under the key it was stored with
if (strip != NULL)
normfil = fil_normalized_filtered_ex(fil_complete, normfil_, strip, 0, 0);
}
// to be displayed without ftp://

@@ -529,16 +529,6 @@ struct httrackp {
htslibhandles libHandles; /**< loaded external module handles */
//
htsoptstate state; /**< embedded live engine state */
String strip_query; /**< query keys to drop when deduping URLs (-strip-query);
appended at the tail to keep field offsets stable */
hts_boolean
no_www_dedup; /**< with urlhack, keep www.host distinct from host */
hts_boolean no_slash_dedup; /**< with urlhack, keep redundant // in paths */
hts_boolean no_query_dedup; /**< with urlhack, keep query-argument order */
String cookies_file; /**< extra Netscape cookies.txt to preload
(--cookies-file) */
int pause_min_ms; /**< inter-file pause lower bound, ms (0=off, #185) */
int pause_max_ms; /**< inter-file pause upper bound, ms */
};
/* Running statistics for a mirror. */

@@ -302,14 +302,6 @@ static HTS_INLINE char html_prevc(const char *html, const char *start) {
return html > start ? html[-1] : ' ';
}
/* Drop a redirect Location's #fragment: a UA anchor, never part of the fetched
* resource (#204). */
static void url_drop_fragment(char *const url) {
char *const frag = strchr(url, '#');
if (frag != NULL)
*frag = '\0';
}
/* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
argument is a method, not a URL: #218). Case-insensitive. */
static int is_http_method(const char *s, size_t len) {
@@ -3604,35 +3596,22 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
//
strcpybuff(mov_url, r->location);
url_drop_fragment(mov_url);
// any URL -> address + file
if ((reponse =
ident_url_relatif(mov_url, urladr(), urlfil(), moved)) >= 0) {
int set_prio_to = 0; // no priority set by the wizard
// check whether URLHack is harmless or not (per the effective
// sub-flags)
if (opt->urlhack && (!opt->no_www_dedup || !opt->no_slash_dedup ||
!opt->no_query_dedup)) {
const int norm_host = !opt->no_www_dedup;
const int norm_slash = !opt->no_slash_dedup;
const int norm_query = !opt->no_query_dedup;
// check whether URLHack is harmless or not
if (opt->urlhack) {
char BIGSTK n_adr[HTS_URLMAXSIZE * 2], n_fil[HTS_URLMAXSIZE * 2];
char BIGSTK pn_adr[HTS_URLMAXSIZE * 2], pn_fil[HTS_URLMAXSIZE * 2];
strlcpybuff(n_adr,
norm_host ? jump_normalized_const(moved->adr)
: jump_identification_const(moved->adr),
sizeof(n_adr));
strlcpybuff(pn_adr,
norm_host ? jump_normalized_const(urladr())
: jump_identification_const(urladr()),
sizeof(pn_adr));
fil_normalized_filtered_ex(moved->fil, n_fil, NULL, norm_slash,
norm_query);
fil_normalized_filtered_ex(urlfil(), pn_fil, NULL, norm_slash,
norm_query);
n_adr[0] = n_fil[0] = '\0';
(void) adr_normalized_sized(moved->adr, n_adr, sizeof(n_adr));
(void) fil_normalized(moved->fil, n_fil);
(void) adr_normalized_sized(urladr(), pn_adr, sizeof(pn_adr));
(void) fil_normalized(urlfil(), pn_fil);
if (strcasecmp(n_adr, pn_adr) == 0
&& strcasecmp(n_fil, pn_fil) == 0) {
hts_log_print(opt, LOG_WARNING,
@@ -4812,7 +4791,6 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
mov_url[0] = '\0';
strcpybuff(mov_url, back[b].r.location); // copy the URL
url_drop_fragment(mov_url);
/* Remove (temporarily created) file if it was created */
UNLINK(fconv(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), back[b].url_sav));

File diff suppressed because it is too large

@@ -1,52 +0,0 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 2026 Xavier Roche and other contributors
SPDX-License-Identifier: GPL-3.0-or-later
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Ethical use: we kindly ask that you NOT use this software to harvest email
addresses or to collect any other private information about people. Doing so
would dishonor our work and waste the many hours we have spent on it.
Please visit our Website: http://www.httrack.com
*/
/* ------------------------------------------------------------ */
/* File: htsselftest.h */
/* named dispatch for the hidden engine self-tests */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */
#ifndef HTSSELFTEST_DEFH
#define HTSSELFTEST_DEFH
#ifdef HTS_INTERNAL_BYTECODE
#ifndef HTS_DEF_FWSTRUCT_httrackp
#define HTS_DEF_FWSTRUCT_httrackp
typedef struct httrackp httrackp;
#endif
/* Run engine self-test `name` over the positional args argv[0..argc-1], or list
the available tests when name is NULL, empty, or "list". Prints the result;
returns the process exit code (0 == success). The caller owns option cleanup.
Reached through the hidden `httrack -#test[=NAME ...]` subcommand. */
int hts_selftest(httrackp *opt, const char *name, int argc, char **argv);
#endif
#endif

@@ -358,12 +358,12 @@ int smallserver(T_SOC soc, char *url, char *method, char *data, char *path) {
{NULL, 0}
};
initStrElt initStr[] = {
{"user", HTS_DEFAULT_USER_AGENT},
{"footer", "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x "
"[XR&CO'2014], %s -->"},
{"url2",
"+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"},
{NULL, NULL}};
{"user", "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"},
{"footer",
"<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->"},
{"url2", "+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"},
{NULL, NULL}
};
int i = 0;
for(i = 0; initInt[i].name; i++) {

@@ -80,10 +80,6 @@ htspair_t hts_detect_embed[] = {
{NULL, NULL}
};
/* HTML5 media siblings of <img src>: same near-link treatment (#451) */
static const htspair_t hts_detect_embed_html5[] = {
{"source", "src"}, {"source", "srcset"}, {"track", "src"}, {NULL, NULL}};
/* Internal */
static int hts_acceptlink_(httrackp * opt, int ptr, const char *adr,
const char *fil, const char *tag,
@@ -140,17 +136,6 @@ static int cmp_token(const char *tag, const char *cmp) {
&& !isalnum((unsigned char) tag[p]));
}
/* TRUE if (tag, attribute) matches an embedded-asset pair in the table */
static hts_boolean is_embed_pair(const htspair_t *table, const char *tag,
const char *attribute) {
int i;
for (i = 0; table[i].tag != NULL; i++) {
if (cmp_token(tag, table[i].tag) && cmp_token(attribute, table[i].attr))
return HTS_TRUE;
}
return HTS_FALSE;
}
static int hts_acceptlink_(httrackp * opt, int ptr,
const char *adr, const char *fil, const char *tag,
const char *attribute, int *set_prio_to,
@@ -178,9 +163,15 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
/* Built-in known tags (<img src=..>, ..) */
if (forbidden_url != 0 && opt->nearlink && tag != NULL && attribute != NULL) {
if (is_embed_pair(hts_detect_embed, tag, attribute) ||
is_embed_pair(hts_detect_embed_html5, tag, attribute)) {
embedded_triggered = 1;
int i;
for(i = 0; hts_detect_embed[i].tag != NULL; i++) {
if (cmp_token(tag, hts_detect_embed[i].tag)
&& cmp_token(attribute, hts_detect_embed[i].attr)
) {
embedded_triggered = 1;
break;
}
}
}

@@ -4,7 +4,7 @@
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
# tool flags despite the #!/bin/bash above.
# Golden cache-format regression test (driven by 'httrack -#test=cache-golden <dir>').
# Golden cache-format regression test (driven by 'httrack -#B <dir>').
#
# 01_engine-cache.test writes the cache with the same build it reads back (a
# round-trip), so it cannot catch a read-path or ZIP-format regression where
@@ -13,7 +13,7 @@
# byte-exact.
#
# Regenerate the fixture after a deliberate format change with
# 'httrack -#test=cache-golden <dir> regen', then copy <dir>/hts-cache/new.zip over the
# 'httrack -#B <dir> regen', then copy <dir>/hts-cache/new.zip over the
# committed file.
set -eu
@@ -37,11 +37,11 @@ trap 'rm -rf "$dir"' EXIT
mkdir -p "$dir/hts-cache"
cp "$fixture/hts-cache/new.zip" "$dir/hts-cache/new.zip"
out=$(httrack -#test=cache-golden "$dir")
out=$(httrack -#B "$dir")
# Match the exact success line: the read must have found and verified every
# entry, not merely failed to enter the mode (a renamed/removed test prints the
# registry to stderr, which also exits non-zero but never prints this).
# entry, not merely failed to enter the mode (a bad -#B falls through to the
# usage screen, which also exits non-zero but never prints this).
test "$out" = "cache-golden: OK" || {
echo "expected 'cache-golden: OK', got: $out" >&2
exit 1

@@ -1,24 +0,0 @@
#!/bin/bash
#
# Keep this POSIX-portable: the harness runs it via $(BASH), which is a plain
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
# tool flags despite the #!/bin/bash above.
# Cache write-failure handling (httrack -#test=cache-writefail <dir>). #174/#219.
# A failing new.zip write (disk full) used to crash the process via assertf; it
# must instead stop the mirror with a fatal error (exit_xh=-1), no crash. The
# self-test asserts that; reverting the fix makes -#test=cache-writefail abort (SIGABRT) and fail.
set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
out=$(httrack -#test=cache-writefail "$dir")
# Match the exact success line (error logs also go to stdout); a renamed/removed
# test prints the registry to stderr, which exits non-zero but never prints this.
printf '%s\n' "$out" | grep -qx "cache-writefail: OK" || {
echo "expected 'cache-writefail: OK', got: $out" >&2
exit 1
}

@@ -4,7 +4,7 @@
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
# tool flags despite the #!/bin/bash above.
# Cache create/read/update logic (driven by 'httrack -#test=cache <dir>').
# Cache create/read/update logic (driven by 'httrack -#A <dir>').
#
# The in-process self-test stores several hand-crafted edge entries (normal
# HTML, an empty redirect with a near-limit location, a non-HTML body kept via
@@ -20,13 +20,13 @@ set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
# The working directory is a required argument; without it the test prints a
# usage line to stderr and returns non-zero.
out=$(httrack -#test=cache "$dir")
# Like the other -# debug modes, a trailing token (the working directory) is
# required; a bare '-#A' falls through to the usage screen.
out=$(httrack -#A "$dir")
# Match the exact success line, so the test cannot pass for an unrelated reason
# (e.g. the cache test being gone, which prints the registry to stderr but
# never prints this line).
# (e.g. the -#A mode being gone and falling through to the usage screen, which
# also exits non-zero but never prints this).
test "$out" = "cache-selftest: OK" || {
echo "expected 'cache-selftest: OK', got: $out" >&2
exit 1

@@ -4,13 +4,13 @@
set -euo pipefail
# charset -> UTF-8 conversion (hts_convertStringToUTF8).
# -#test=charset <charset> <string> prints the string re-decoded from <charset> as UTF-8.
# -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8.
conv() {
test "$(httrack -O /dev/null -#test=charset "$1" "$2")" == "$3" || exit 1
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
}
# crash probe: malformed input must exit cleanly, not abort.
runs() {
httrack -O /dev/null -#test=charset "$1" "$2" >/dev/null 2>&1 || exit 1
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
}
# the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9.
@@ -31,7 +31,7 @@ conv 'us-ascii' 'hello' 'hello'
# unknown charset: ASCII passes through unchanged, but non-ASCII input cannot be
# decoded and yields empty output (an error is printed to stderr).
conv 'no-such-charset-xyz' 'abc' 'abc'
test "$(httrack -O /dev/null -#test=charset 'no-such-charset-xyz' 'café' 2>/dev/null)" == "" || exit 1
test "$(httrack -O /dev/null -#3 'no-such-charset-xyz' 'café' 2>/dev/null)" == "" || exit 1
# malformed UTF-8 (lone continuation byte, truncated lead byte) must not crash
runs 'utf-8' $'\x80'

@@ -90,16 +90,4 @@ refused "dangling-quote argument not refused cleanly"
run_only "$tmp/q-lone" '"'
refused "lone-quote argument not refused cleanly"
# --pause (#185): valid MIN[:MAX] accepted; malformed, reversed, over-range and
# non-finite values refused cleanly. NaN defeats naive `<`/`>` checks (it
# compares false to everything), so it must not slip through to the int cast.
run "$tmp/pause-ok" --pause 0.2:0.4
accepted "$tmp/pause-ok" "#185: valid --pause range rejected"
run "$tmp/pause-fix" --pause 0.2
accepted "$tmp/pause-fix" "#185: valid fixed --pause rejected"
for bad in nan nan:5 5:nan inf 10:5 99999; do
run "$tmp/pause-bad" --pause "$bad"
refused "#185: invalid --pause '$bad' not refused cleanly"
done
exit 0
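
The NaN remark above is easiest to see in code: a check written as "reject if `v < 0` or `v > max`" lets NaN through, because every comparison with NaN is false, while "accept only if `v >= 0` and `v <= max`" rejects it. This is a minimal sketch of a `MIN[:MAX]` parser built that way. It is not HTTrack's parser: `parse_pause`, `parse_seconds` and the `PAUSE_MAX_SECONDS` cap are hypothetical, and the real upper bound is not shown in this diff.

```c
#include <stdlib.h>

#define PAUSE_MAX_SECONDS 3600.0 /* hypothetical cap, chosen for the sketch */

/* Parse one non-negative number of seconds at s. The range test is written in
   the positive form so that NaN (which compares false to everything) and
   infinities are rejected. Returns 1 on success and sets *end past the
   number. */
static int parse_seconds(const char *s, const char **end, double *out) {
  char *e;
  const double v = strtod(s, &e);
  if (e == s)
    return 0; /* no digits at all */
  if (!(v >= 0.0 && v <= PAUSE_MAX_SECONDS))
    return 0; /* negative, too large, inf, or NaN */
  *end = e;
  *out = v;
  return 1;
}

/* Parse "MIN" or "MIN:MAX" (seconds) into milliseconds.
   Returns 1 if valid, 0 otherwise; outputs are untouched on failure. */
int parse_pause(const char *arg, int *min_ms, int *max_ms) {
  const char *p;
  double lo, hi;
  if (!parse_seconds(arg, &p, &lo))
    return 0;
  hi = lo; /* a single value means a fixed pause */
  if (*p == ':' && !parse_seconds(p + 1, &p, &hi))
    return 0;
  if (*p != '\0' || !(hi >= lo))
    return 0; /* trailing junk or reversed range */
  *min_ms = (int) (lo * 1000.0 + 0.5);
  *max_ms = (int) (hi * 1000.0 + 0.5);
  return 1;
}
```

Because the values are range-checked before the cast, the conversion to `int` never sees a non-finite or out-of-range double.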

@@ -1,15 +1,14 @@
#!/bin/bash
#
# Issue #151 guard: the request Cookie header must be bare RFC 6265 name=value
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#test=cookies' selftest.
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#Q' selftest.
set -eu
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=cookies run)
# A trailing token is required; a bare '-#Q' falls through to the usage screen.
out=$(httrack -#Q run)
# Exact-match the success line so a renamed/removed test (it prints the registry
# to stderr) can't pass.
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "cookie-header: OK" || {
echo "expected 'cookie-header: OK', got: $out" >&2
exit 1

@@ -2,16 +2,15 @@
#
# Regression guard for the unsigned-enum sentinel trap: copy_htsopt's
# `if (from->X > -1)` guard is always false for unsigned hts_boolean fields, so
# they silently stop being copied. Driven by the in-process 'httrack -#test=copyopt' test.
# they silently stop being copied. Driven by the in-process 'httrack -#9' test.
# Keep POSIX-portable (harness runs it via $(BASH), a plain /bin/sh on macOS).
set -eu
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=copyopt run)
# A trailing token is required; a bare '-#9' falls through to the usage screen.
out=$(httrack -#9 run)
# Exact-match the success line so a renamed/removed test (it prints the registry
# to stderr) can't pass.
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "copy-htsopt: OK" || {
echo "expected 'copy-htsopt: OK', got: $out" >&2
exit 1

@@ -5,8 +5,9 @@ set -euo pipefail
# DNS resolver/cache self-test: a mock getaddrinfo (no network) checks address
# family, single-address selection, the -@i4/-@i6 family filter, and cache reuse.
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=dns run)
# The trailing token is required, like the other -# selftests, so a bare command
# line isn't treated as "no arguments" and routed to the usage screen.
out=$(httrack -#D run)
test "$out" = "dns-selftest: OK" || {
echo "expected 'dns-selftest: OK', got: $out" >&2

@@ -4,13 +4,13 @@
set -euo pipefail
# HTML entity unescaping (hts_unescapeEntitiesWithCharset).
# -#test=entities <string> prints the string with entities decoded (UTF-8 output).
# -#6 <string> prints the string with entities decoded (UTF-8 output).
ent() {
test "$(httrack -O /dev/null -#test=entities "$1")" == "$2" || exit 1
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
}
# crash probe: malformed input must exit cleanly, not abort.
runs() {
httrack -O /dev/null -#test=entities "$1" >/dev/null 2>&1 || exit 1
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
}
# named entities
@@ -18,21 +18,6 @@ ent '&amp;' '&'
ent '&lt;&gt;' '<>'
ent '&eacute;' 'é'
# HTML5 names from the WHATWG set
ent '&hellip;' '…'
ent '&bigcup;' '⋃'
# longest name (31 chars) exercises the name-length cap
ent '&CounterClockwiseContourIntegral;' '∳'
# astral codepoint -> 4-byte UTF-8
ent '&Aopf;' '𝔸'
# multi-codepoint refs are skipped at generation, so left verbatim
ent '&fjlig;' '&fjlig;'
# common HTML4 names still decode (regression guard against accidental drops)
ent '&copy;&reg;&trade;' '©®™'
ent '&mdash;&ndash;' '—–'
ent '&alpha;&beta;' 'αβ'
# numeric: decimal and hex
ent '&#65;&#66;' 'AB'
ent '&#x41;' 'A'
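
The numeric and astral-codepoint cases in this test come down to encoding a code point as UTF-8. A minimal sketch follows, assuming a hypothetical helper `utf8_encode`; it illustrates the encoding rules of RFC 3629 and is not HTTrack's actual entity decoder.

```c
#include <stddef.h>

/* Encode the Unicode scalar value cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for surrogates (U+D800..U+DFFF)
   and values above U+10FFFF, which have no valid UTF-8 form. */
size_t utf8_encode(unsigned long cp, unsigned char *out) {
  if (cp < 0x80) {
    out[0] = (unsigned char) cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char) (0xC0 | (cp >> 6));
    out[1] = (unsigned char) (0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;
  if (cp < 0x10000) {
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    /* astral planes: the 4-byte form exercised by '&Aopf;' (U+1D538) */
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

For U+1D538 the sketch produces the four bytes F0 9D 94 B8, and for U+00E9 ("é") the two bytes C3 A9.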

@@ -1,65 +0,0 @@
#!/bin/bash
#
# -%L URL-list loading (#49): a readable list is honored; an unusable one fails
# with the reason (errno / not-a-regular-file), not a bare "Could not include
# URL list". Offline: file:// fixture, no server. Asserts on httrack's own
# strings and the message shape, so it is locale-independent.
set -euo pipefail
tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_filelist.XXXXXX") || exit 1
trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM
echo '<html><body>hi</body></html>' >"$tmp/index.html"
# run httrack with the given -%L target; structured log lands in $out/hts-log.txt
run() {
local out="$1" list="$2"
rm -rf "$out"
mkdir -p "$out"
httrack -O "$out" --quiet -n "-%L" "$list" >"$out/.stdout" 2>&1 || true
LOG="$out/hts-log.txt"
}
fail() {
echo "FAIL: $1"
cat "$LOG"
exit 1
}
loghas() {
grep -Eq "$1" "$LOG" || fail "expected /$1/ in $LOG"
}
lognot() {
if grep -Eq "$1" "$LOG"; then fail "unexpected /$1/ in $LOG"; fi
}
# readable list: its one URL is loaded and counted (count must be non-zero)
printf 'file://%s/index.html\n' "$tmp" >"$tmp/urls.txt"
run "$tmp/ok" "$tmp/urls.txt"
loghas '[1-9][0-9]* links added from'
# missing file: quoted name + a non-empty reason, never the old reasonless
# "Could not include URL list: <name>". The reason is the stat() errno, not the
# directory fallback literal (guards against dropping the errno lookup).
run "$tmp/miss" "$tmp/nope.txt"
loghas 'Could not include URL list "[^"]+": .+'
lognot 'Could not include URL list: '
lognot 'not a regular file'
# a directory is rejected with our own reason (locale-independent)
mkdir -p "$tmp/adir"
run "$tmp/dir" "$tmp/adir"
loghas 'Could not include URL list "[^"]+": not a regular file'
# unreadable regular file: the fopen() errno arm fires, distinct from the
# directory branch. Root bypasses mode 000, so skip it there.
if test "$(id -u)" -ne 0; then
: >"$tmp/noperm.txt"
chmod 000 "$tmp/noperm.txt"
run "$tmp/perm" "$tmp/noperm.txt"
chmod 644 "$tmp/noperm.txt"
loghas 'Could not include URL list "[^"]+": .+'
lognot 'not a regular file'
fi
exit 0

@@ -4,13 +4,13 @@
set -euo pipefail
# wildcard filter engine (strjoker), the core of +/- include/exclude rules.
# -#test=filter <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
# -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
match() {
test "$(httrack -O /dev/null -#test=filter "$1" "$2")" == "$2 does match $1" || exit 1
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
}
nomatch() {
test "$(httrack -O /dev/null -#test=filter "$1" "$2")" == "$2 does NOT match $1" || exit 1
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
}
# bare star matches everything
@@ -50,75 +50,24 @@ match '*foo*bar' 'foozbar'
# '?' is the query-string marker, not a single-char wildcard
nomatch 'a?c' 'abc'
# Inside a class, backslash escapes the next char as a literal member (#148):
# '\X' matches X only (not '\'), and an escaped ']' is a member, not the terminator.
# backslash escapes a metacharacter inside a class so it is matched literally.
# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
# both X and '\'. These assertions pin that behavior.
match '*[\*]' '*'
nomatch '*[\*]' "\\"
match '*[\*]' "\\"
nomatch '*[\*]' 'a'
match '*[\\]' "\\"
nomatch '*[\\]' '*'
nomatch '*[\\]' 'a'
match '*[\[]' '['
nomatch '*[\[]' "\\"
match '*[\]]' ']'
nomatch '*[\]]' "\\"
match '*[\[]' "\\"
nomatch '*[\[]' 'a'
# '*[\[\]]' is "the [ or ] character", as the filter guide documents.
match '*[\[\]]' '['
match '*[\[\]]' ']'
nomatch '*[\[\]]' 'a'
match '*[\[,\]]' '[' # comma between members is optional
match '*[\[,\]]' ']'
match '*[a,\[]' 'a' # an escaped member no longer eats the preceding one
match '*[a,\[]' '['
# Escape is decoded before the range/separator/size checks, so '\-' '\,' '\<'
# are literal members, not operators.
match '*[a\-z]' 'a'
match '*[a\-z]' 'z'
nomatch '*[a\-z]' 'b' # not the a..z range
match '*[\,]' ','
nomatch '*[\,]' "\\" # the escape must not leak '\' into the class
match '*[\<]' '<'
nomatch '*[\<]' "\\"
match '*[\[,\],a]' '['
match '*[\[,\],a]' ']'
match '*[\[,\],a]' 'a'
# A truncated range '*[a-' is the literal members {a,-}; the parser must not
# read past the end decoding it (was a 1-byte heap over-read in the range arm).
match '*[a-' 'a'
nomatch '*[a-' 'b'
# *(...) matches exactly one char from the class; *[...] matches a run.
match '*(a,b)' 'a'
nomatch '*(a,b)' 'aa'
nomatch '*(a,b)' 'c'
# documented composite filters (filters.html)
match 'www.*[path].com/*[path].zip' 'www.foo.com/a/b.zip'
nomatch 'www.*[path].com/*[path].zip' 'www.foo.com/a/b.tar'
match '*.html*[]' 'page.html'
nomatch '*.html*[]' 'page.html?x=1' # *[] forbids the trailing query
# Size-based rules (-#test=filtersize <size> <string> <filter...>): a negative size
# means the size is still unknown (scan time). A size exclusion must stay neutral
# then, so the file is fetched and only cancelled once its size is known (#143).
fsize() {
local want="$1"
shift
test "$(httrack -O /dev/null -#test=filtersize "$@")" == "$want" || exit 1
}
fsize 'verdict=allowed size_flag=0' -1 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # scan time: keep
fsize 'verdict=forbidden size_flag=1' 5 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # <10KB: cancel
fsize 'verdict=allowed size_flag=1' 20 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # >=10KB: keep
fsize 'verdict=forbidden size_flag=0' -1 foo.txt -* '+*.jpg' '-*.jpg*[<10]' # not a jpg
# the '>' operator is just as neutral at scan time, and fires once size is known
fsize 'verdict=allowed size_flag=0' -1 foo.jpg -* '+*.jpg' '-*.jpg*[>10]' # scan time: keep
fsize 'verdict=forbidden size_flag=1' 20 foo.jpg -* '+*.jpg' '-*.jpg*[>10]' # >10KB: cancel
# [name]/[file]/[path] never span '?' mid-string; a trailing query is still
# tolerated by the global '?' rule (same as plain *.aspx), not the class (#144).
nomatch '*[path]/end' 'a?b/end'
nomatch '*[file]end' 'foo?xend'
nomatch '*[name]X' 'abc?X'
match '*[file]' 'foo?x=1' # trailing query: tolerated, as for *.aspx
match '*.aspx' 'page.aspx?y=2'
# A literal ']' cannot be a class member: the class parser stops at the first
# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x'

@@ -3,7 +3,5 @@
set -euo pipefail
# httrack internal hashtable autotest on 100K keys. Assert the success line (on
# stderr) so a misrouted registry entry can't pass on exit code alone.
out=$(httrack -#test=hashtable 100000 2>&1)
printf '%s\n' "$out" | grep -q "all hashtable tests were successful!" || exit 1
# httrack internal hashtable autotest on 100K keys
httrack -#7 100000

@@ -3,13 +3,13 @@
set -euo pipefail
# IDNA / punycode encode (-#test=idna-encode) and decode (-#test=idna-decode). This code has a CVE history,
# IDNA / punycode encode (-#4) and decode (-#5). This code has a CVE history,
# so the edge cases below cover passthrough, round-trips, and malformed input.
enc() { test "$(httrack -O /dev/null -#test=idna-encode "$1")" == "$2" || exit 1; }
dec() { test "$(httrack -O /dev/null -#test=idna-decode "$1")" == "$2" || exit 1; }
enc() { test "$(httrack -O /dev/null -#4 "$1")" == "$2" || exit 1; }
dec() { test "$(httrack -O /dev/null -#5 "$1")" == "$2" || exit 1; }
# crash probe: malformed ACE input must exit cleanly, not abort.
runs() { httrack -O /dev/null -#test=idna-decode "$1" >/dev/null 2>&1 || exit 1; }
runs() { httrack -O /dev/null -#5 "$1" >/dev/null 2>&1 || exit 1; }
# encode
enc 'www.café.com' 'www.xn--caf-dma.com'

@@ -4,13 +4,13 @@
set -euo pipefail
# MIME type guessing from extension (get_httptype / give_mimext).
# -#test=mime <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
# -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
mime() {
test "$(httrack -O /dev/null -#test=mime "$1" | head -1)" == "$1 is '$2'" || exit 1
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
}
unknown() {
test "$(httrack -O /dev/null -#test=mime "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
}
mime '/a/b.html' 'text/html'
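
For comparison, the same extension-to-type lookup can be sketched with Python's standard `mimetypes` module. This uses Python's own table, not HTTrack's `get_httptype` table, so agreement is only expected for common types; `guess` is a hypothetical helper name.

```python
import mimetypes


def guess(path):
    """Return the MIME type guessed from the path's extension, or None."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime


print(guess("/a/b.html"))
```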


@@ -323,33 +323,4 @@ grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
# HTML5 <source>/<track> follow as embedded near-links past the -r2 depth boundary (#451).
# img.gif positive control; plain.gif (bare <a href>) negative control proves the gate is selective.
site10="$tmp/html5media"
mkdir -p "$site10"
for f in img ss plain; do gif "$site10/$f.gif"; done
printf 'x' >"$site10/v.webm"
printf 'x' >"$site10/subs.vtt"
cat >"$site10/index.html" <<EOF
<html><body><a href="leaf.html">leaf</a></body></html>
EOF
cat >"$site10/leaf.html" <<EOF
<html><body>
<img src="img.gif">
<picture><source srcset="ss.gif 2x"></picture>
<video><source src="v.webm"></video>
<video><track src="subs.vtt"></video>
<a href="plain.gif">plain link past the boundary</a>
</body></html>
EOF
out10="$tmp/html5media-out"
rm -rf "$out10"
mkdir -p "$out10"
httrack "file://$site10/index.html" -O "$out10" --quiet --near -r2 >"$out10/.log" 2>&1 || true
found "img.gif" "$out10"
found "ss.gif" "$out10"
found "v.webm" "$out10"
found "subs.vtt" "$out10"
notfound "plain.gif" "$out10"
exit 0


@@ -1,15 +0,0 @@
#!/bin/bash
#
# --pause (#185): the inter-file pause target must stay in [min,max] and spread
# across it (a per-call rand() would collapse it toward min). Driven by the
# in-process 'httrack -#test=pause' test. POSIX-portable ($(BASH) is /bin/sh on macOS).
set -eu
# 'run' is an ignored placeholder argument.
out=$(httrack -#test=pause run)
test "$out" = "pause: OK" || {
echo "expected 'pause: OK', got: $out" >&2
exit 1
}


@@ -8,7 +8,7 @@ set -euo pipefail
# relative path from <curr>'s directory to <link>
rel() {
local got
got=$(httrack -O /dev/null -#test=relative "$1" "$2")
got=$(httrack -O /dev/null -#l "$1" "$2")
test "$got" == "relative=$3" ||
{
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
@@ -19,7 +19,7 @@ rel() {
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
ident() {
local got
got=$(httrack -O /dev/null -#test=resolve "$1" "$2" "$3")
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
test "$got" == "$4" ||
{
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"


@@ -3,11 +3,11 @@
set -euo pipefail
# Local save-name extension resolution (url_savename via -#test=savename <fil> <content-type>).
# Local save-name extension resolution (url_savename via -#N <fil> <content-type>).
# Asserts on the basename of "savename: <path>".
name() {
out="$(httrack -O /dev/null -#test=savename "$1" "$2" | sed -n 's/^savename: //p')"
out="$(httrack -O /dev/null -#N "$1" "$2" | sed -n 's/^savename: //p')"
test "${out##*/}" == "$3" || {
echo "FAIL: '$1' '$2' -> '$out' (want '$3')"
exit 1


@@ -1,17 +0,0 @@
#!/bin/bash
#
# The -#test dispatch itself: a bare -#test lists the registry, and an unknown
# name errors (non-zero, diagnostic) instead of silently passing.
set -eu
# Bare -#test lists known tests (printed to stderr).
list=$(httrack -#test 2>&1)
printf '%s\n' "$list" | grep -q "filter" || exit 1
printf '%s\n' "$list" | grep -q "cache-writefail" || exit 1
# Unknown name: non-zero exit + diagnostic, and no test result line.
rc=0
err=$(httrack -#test=bogus 2>&1) || rc=$?
test "$rc" -ne 0 || exit 1
printf '%s\n' "$err" | grep -q "Unknown self-test" || exit 1


@@ -5,7 +5,7 @@ set -euo pipefail
# path simplify engine (fil_simplifie): collapses ./ and ../ segments.
simp() {
test "$(httrack -O /dev/null -#test=simplify "$1")" == "simplified=$2" || exit 1
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
}
simp './foo/bar/' 'foo/bar/'
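
A minimal sketch of the kind of collapsing this test exercises, written in Python. It is not `fil_simplifie` itself: the function name `simplify` and its handling of edge cases (a leading `..`, empty segments) are assumptions made for illustration.

```python
def simplify(path):
    """Collapse './' and '../' segments, keeping a trailing slash if present."""
    out = []
    for seg in path.split("/"):
        if seg == ".":
            continue                      # './' contributes nothing
        if seg == ".." and out and out[-1] not in ("", ".."):
            out.pop()                     # '../' cancels the previous segment
            continue
        out.append(seg)
    return "/".join(out)


print(simplify("./foo/bar/"))
print(simplify("a/b/../c"))
```

Unlike `posixpath.normpath`, this keeps the trailing slash, which the `'./foo/bar/'` case above requires.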


@@ -1,8 +0,0 @@
#!/bin/bash
#
set -euo pipefail
# --strip-query: pattern-scoped query-key stripping for dedup. All assertions
# live in the engine self-test (hts_query_strip_keys + fil_normalized_filtered).
httrack -O /dev/null -#test=stripquery | grep -q "strip-query self-test OK"


@@ -3,22 +3,23 @@
set -euo pipefail
# htssafe.h bounded string operations (driven by 'httrack -#test=strsafe').
# htssafe.h bounded string operations (driven by 'httrack -#8').
# Success path: every bounded op (strcpybuff/strcatbuff/strncatbuff/strlcpybuff)
# must behave correctly. 'run' selects the success path (vs the overflow modes).
# must behave correctly. Like the other -# debug modes, a trailing token is
# required (a bare '-#8' falls through to the usage screen).
rc=0
out=$(httrack -#test=strsafe run) || rc=$?
out=$(httrack -#8 run) || rc=$?
test "$rc" -eq 0 || exit 1
test "$out" == "strsafe: OK" || exit 1
# Overflow path: an over-capacity write into a sized buffer must be caught by
# the bounded macro and abort the process, not be silently truncated/completed.
# Assert the htssafe abort signature specifically, so the test cannot pass for
# an unrelated reason (e.g. the strsafe test being gone, which prints the
# registry to stderr and also exits non-zero).
# an unrelated reason (e.g. the -#8 mode being gone and falling through to the
# usage screen, which also exits non-zero).
# the bounded macro aborts (non-zero exit), so don't let set -e trip on it
err=$(httrack -#test=strsafe overflow "this string is far too long for the buffer" 2>&1) || true
err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
case "$err" in
*"strsafe: NOT aborted"*)
echo "over-capacity write was NOT caught" >&2
@@ -35,7 +36,7 @@ esac
# capacity (4 bytes into a 4-byte buffer), so this also pins the boundary: a
# '<=' off-by-one in the capacity check would let it through (and print "NOT
# aborted"). Match the specific htsbuff abort message, not just any assert.
err=$(httrack -#test=strsafe overflow-buff "abcd" 2>&1) || true
err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
case "$err" in
*"strsafe: NOT aborted"*)
echo "htsbuff over-capacity write was NOT caught" >&2
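
The boundary the `overflow-buff` case pins can be modeled in Python. This is a sketch, not the htssafe.h macros: `bounded_copy` and `BufferOverflow` are hypothetical names, and the model assumes the buffer must also hold a terminating NUL, so 4 payload bytes do not fit in a 4-byte buffer.

```python
class BufferOverflow(Exception):
    """Stands in for the process abort performed by the bounded macros."""


def bounded_copy(capacity, src):
    """Copy `src` into a buffer of `capacity` bytes, or fail loudly."""
    needed = len(src) + 1                 # payload plus the terminating NUL
    if needed > capacity:                 # strict check: equality on payload aborts
        raise BufferOverflow("%d bytes needed, capacity %d" % (needed, capacity))
    return src + b"\0"


print(bounded_copy(4, b"abc"))
try:
    bounded_copy(4, b"abcd")
except BufferOverflow as exc:
    print("aborted:", exc)
```

An off-by-one in the comparison would accept the `b"abcd"` case, which is the regression the test is designed to catch.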


@@ -1,8 +0,0 @@
#!/bin/bash
#
set -euo pipefail
# -%u url-hack split (#271): www / // / query-order dedup toggle independently.
# All assertions live in the engine self-test (hash compare flag resolution).
httrack -O /dev/null -#test=urlhack run | grep -q "urlhack self-test OK"


@@ -1,7 +0,0 @@
#!/bin/bash
#
set -euo pipefail
# Default User-Agent (#449): honest HTTrack token, no Windows 98 relic.
httrack -O /dev/null -#test=useragent run | grep -q "useragent self-test OK"


@@ -1,11 +0,0 @@
#!/bin/bash
#
# #157: a dotless, accented URL named .html on the first crawl must keep .html
# across an update -- not revert to the extensionless name.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--found 'intl/Instalação_CVS_no_Ubuntu.html' \
--not-found 'intl/Instalação_CVS_no_Ubuntu' \
httrack 'BASEURL/intl/index.html'


@@ -1,17 +0,0 @@
#!/bin/bash
# Issues #32/#41: a Content-Length that disagrees with the body warns "bogus
# state (broken size)" and skips the cache; -%B (tolerant) accepts it.
: "${top_srcdir:=..}"
# Default: warn, but the file is still written.
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'size/oversize.bin' \
--log-found 'bogus state \(broken size' \
httrack 'BASEURL/size/index.html'
# -%B (tolerant): no warning, file written.
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'size/oversize.bin' \
--log-not-found 'bogus state' \
httrack 'BASEURL/size/index.html' '-%B'


@@ -1,19 +0,0 @@
#!/bin/bash
# Issue #17: with "no error pages" (-o0), 4xx/5xx bodies must not be written;
# a genuine 0-byte 200 stays. Default (-o1) writes the error page. (#17's purge
# half also does not reproduce; the purge path is not exercised here.)
set -e
: "${top_srcdir:=..}"
# -o0: 404 suppressed, good page and the legit 0-byte 200 kept.
bash "$top_srcdir/tests/local-crawl.sh" --errors 1 \
--found 'errpage/good.html' \
--found 'errpage/empty.html' \
--not-found 'errpage/missing.html' \
httrack 'BASEURL/errpage/index.html' '-o0'
# Control -o1 (default): the 404 error page is written.
bash "$top_srcdir/tests/local-crawl.sh" --errors 1 \
--found 'errpage/missing.html' \
httrack 'BASEURL/errpage/index.html' '-o1'


@@ -1,109 +0,0 @@
#!/bin/bash
# Issue #198: on a resumed download the server may answer the Range with a 206
# that starts *before* the offset we asked for (block-aligned ranges). httrack
# must honor the returned Content-Range, not blindly append, or the overlap
# bytes get duplicated and the file grows (corrupt PDFs). Pass 1 interrupts
# flaky.bin mid-body (partial + temp-ref); pass 2 resumes against a 206 that
# backs up 8 bytes. The result must equal the same bytes fetched whole (full.bin).
set -eu
: "${top_srcdir:=..}"
testdir=$(cd "$(dirname "$0")" && pwd)
server="${testdir}/local-server.py"
command -v python3 >/dev/null || ! echo "python3 not found; skipping" || exit 77
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/httrack_198.XXXXXX") || exit 1
serverpid=
crawlpid=
cleanup() {
if test -n "$crawlpid"; then kill -9 "$crawlpid" 2>/dev/null || true; fi
if test -n "$serverpid"; then
kill "$serverpid" 2>/dev/null || true
wait "$serverpid" 2>/dev/null || true
fi
rm -rf "$tmpdir"
}
trap cleanup EXIT HUP INT QUIT PIPE TERM
# OVERLAP_COUNTER gets a byte per flaky.bin request so pass 1 knows when to interrupt.
serverlog="${tmpdir}/server.log"
counter="${tmpdir}/hits"
resumed="${tmpdir}/resumed" # gets a byte when the server serves a resume 206
OVERLAP_COUNTER="$counter" OVERLAP_RESUMED="$resumed" \
python3 "$server" --root "${testdir}/server-root" \
>"$serverlog" 2>&1 &
serverpid=$!
port=
for _ in $(seq 1 50); do
line=$(head -n1 "$serverlog" 2>/dev/null)
if test "${line%% *}" == "PORT"; then
port="${line#PORT }"
break
fi
kill -0 "$serverpid" 2>/dev/null || {
echo "server exited early: $(cat "$serverlog")"
exit 1
}
sleep 0.1
done
test -n "$port" || {
echo "could not discover server port"
exit 1
}
base="http://127.0.0.1:${port}"
which httrack >/dev/null || {
echo "could not find httrack"
exit 1
}
out="${tmpdir}/crawl"
common=(-O "$out" --quiet --disable-security-limits --robots=0 --timeout=30 --retries=0 -c1)
refdir="${out}/hts-cache/ref"
# pass 1: interrupt once flaky.bin's prefix is streaming (partial + temp-ref).
printf '[pass 1: interrupt flaky.bin] ..\t'
httrack "${common[@]}" "${base}/overlap/index.html" >"${tmpdir}/log1" 2>&1 &
crawlpid=$!
for _ in $(seq 1 300); do
test -s "$counter" && break
kill -0 "$crawlpid" 2>/dev/null || break
sleep 0.1
done
sleep 0.5
kill -TERM "$crawlpid" 2>/dev/null || true
wait "$crawlpid" 2>/dev/null || true
crawlpid=
test -n "$(find "$refdir" -name '*.ref' 2>/dev/null)" || {
echo "FAIL: no temp-ref survived pass 1; cannot drive the resume"
exit 1
}
echo "OK (temp-ref present)"
# pass 2: --continue -> resume Range -> 206 that starts 8 bytes early.
printf '[pass 2: resume flaky.bin] ..\t'
httrack "${common[@]}" --continue "${base}/overlap/index.html" >"${tmpdir}/log2" 2>&1 || true
echo "OK"
# Guard against a silent full re-download: the byte-compare below only tests the
# fix if pass 2 actually went through the resume Range -> 206 path.
printf '[resume path was exercised] ..\t'
if ! test -s "$resumed"; then
echo "FAIL: pass 2 never triggered a resume 206; the overlap fix was not exercised"
exit 1
fi
echo "OK"
printf '[resumed file is not corrupted] ..\t'
dir=$(find "$out" -maxdepth 1 -type d -name '127.0.0.1*' | head -1)
flaky="${dir}/overlap/flaky.bin"
full="${dir}/overlap/full.bin"
if ! test -f "$flaky" || ! test -f "$full"; then
echo "FAIL: flaky.bin or full.bin missing after pass 2"
exit 1
fi
if ! cmp -s "$flaky" "$full"; then
echo "FAIL: resumed flaky.bin ($(wc -c <"$flaky")) != full.bin ($(wc -c <"$full")); overlap duplicated"
exit 1
fi
echo "OK ($(wc -c <"$flaky") bytes, byte-identical)"
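
The failure mode this test guards against can be sketched in Python. The sketch is not HTTrack's resume code: `merge_resumed` is a hypothetical helper, and the blob and offsets are made-up values chosen only to mirror the 8-byte back-up used by the server.

```python
import re


def merge_resumed(partial, body, content_range):
    """Merge a 206 body onto a partial file at the offset the server reports."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", content_range)
    if m is None:
        raise ValueError("unsupported Content-Range: %r" % content_range)
    start, end, _total = (int(g) for g in m.groups())
    if start > len(partial):
        raise ValueError("206 starts past the end of the partial (gap)")
    if end - start + 1 != len(body):
        raise ValueError("Content-Range does not match the body length")
    # Truncate to the server's start offset, then append: overlap is overwritten.
    return partial[:start] + body


blob = bytes(range(256)) * 4              # stand-in for the full file
partial = blob[:600]                      # what pass 1 left on disk
start = len(partial) - 8                  # server backs up 8 bytes
header = "bytes %d-%d/%d" % (start, len(blob) - 1, len(blob))
merged = merge_resumed(partial, blob[start:], header)
naive = partial + blob[start:]            # blind append duplicates the overlap
print(merged == blob, len(naive) - len(blob))
```

Blindly appending leaves the file 8 bytes too long, which is what the byte-compare against full.bin would detect.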


@@ -1,16 +0,0 @@
#!/bin/bash
#
# A -mime: exclusion must abort the transfer on the response Content-Type, not
# fetch the whole 1 MB body then discard it (#58). The bytes-received guard is
# the real one: the file is absent either way, but only the fix keeps the count
# tiny (header only) instead of pulling the body. Match it positively (a small,
# <=4-digit count) so a vanished/reworded summary line fails rather than passes.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'mimex/real.html' \
--not-found 'mimex/blob.pdf' \
--log-found 'excluded by MIME type filter' \
--log-found '\[[0-9]{1,4} bytes received' \
httrack 'BASEURL/mimex/index.html' '-mime:application/pdf'
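
The positive match on a small byte count can be tried out in isolation. In the sketch below the regular expression is the one passed to `--log-found`; the two sample fragments are invented for illustration and are not real hts-log.txt lines.

```python
import re

# Same ERE as the --log-found argument above.
SMALL_COUNT = re.compile(r"\[[0-9]{1,4} bytes received")

header_only = "[312 bytes received"       # hypothetical: transfer aborted early
whole_body = "[1048585 bytes received"    # hypothetical: 1 MB body was pulled

print(bool(SMALL_COUNT.search(header_only)))
print(bool(SMALL_COUNT.search(whole_body)))
```

Because the digits must sit directly between `[` and ` bytes`, a 7-digit count cannot match, and a missing or reworded summary line fails the check instead of passing it.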


@@ -1,23 +0,0 @@
#!/bin/bash
#
# End-to-end --strip-query (#112): two links to one resource differing only by
# ?utm_source dedup to a single saved file (2 files written: index + resource);
# the control crawl without the option keeps both variants (3 files). Locks the
# CLI->opt->hash plumbing the engine self-test can't reach.
set -e
: "${top_srcdir:=..}"
# stripped: the two ?utm_source variants collapse to one resource
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
httrack 'BASEURL/stripquery/index.html' --strip-query 'utm_source'
# control: no stripping -> both query-named variants are saved
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
httrack 'BASEURL/stripquery/index.html'
# strip still applies with url-hack off (-%u0): exercises the urlhack-off
# savename branch, which must normalize the dedup key the same way the hash does
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
httrack 'BASEURL/stripquery/index.html' -%u0 --strip-query 'utm_source'
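
The dedup idea behind these three crawls can be sketched with `urllib.parse`. This is an illustration only: `strip_keys` is a hypothetical helper, and it ignores the pattern scoping that the real `--strip-query` option applies.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_keys(url, keys):
    """Return `url` with the named query keys removed (dedup key sketch)."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in keys
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


links = ["a.html?utm_source=x", "a.html?utm_source=y"]
print({strip_keys(u, {"utm_source"}) for u in links})
print(set(links))
```

With the key stripped both links collapse to one name (index plus one resource, 2 files); without it they stay distinct (3 files).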


@@ -1,22 +0,0 @@
#!/bin/bash
#
# End-to-end --cookies-file (#215): /gated/secret.php needs a cookie no page
# ever Set-Cookies, so it is reachable only when the option preloads it from a
# Netscape cookies.txt. Locks the CLI->opt->cookie_load->wire plumbing.
set -e
: "${top_srcdir:=..}"
# preloaded cookie -> secret page is served. -o0 means a 500 leaves no file, so
# --found/--files only hold when the secret is genuinely fetched (200).
bash "$top_srcdir/tests/local-crawl.sh" --cookie 'session=opensesame' \
--errors 0 --files 2 \
--found 'gated/index.html' --found 'gated/secret.html' \
httrack 'BASEURL/gated/index.php' -o0
# control: without the cookie the secret 500s; -o0 suppresses the error page so
# its absence is real (error + missing file)
bash "$top_srcdir/tests/local-crawl.sh" --errors 1 \
--found 'gated/index.html' --not-found 'gated/secret.html' \
httrack 'BASEURL/gated/index.php' -o0
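
The cookie jar that `--cookie` produces is a Netscape cookies.txt, one tab-separated line per cookie. A Python sketch of the line local-crawl.sh writes follows; `cookie_line` is a hypothetical helper, and the port is an example value (the real one is discovered at run time).

```python
def cookie_line(host, port, spec, expires=1999999999):
    """Build one Netscape cookies.txt line from a NAME=VALUE spec."""
    name, _, value = spec.partition("=")  # split on the first '=' only
    fields = [
        "%s:%d" % (host, port),  # domain (the harness includes the port)
        "TRUE",                  # include-subdomains flag
        "/",                     # path
        "FALSE",                 # secure-only flag
        str(expires),            # expiry, seconds since the epoch
        name,
        value,
    ]
    return "\t".join(fields) + "\n"


print(cookie_line("127.0.0.1", 8080, "session=opensesame"), end="")
```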


@@ -1,36 +0,0 @@
#!/bin/bash
#
# --pause (#185): a fixed inter-file delay must slow a multi-file crawl. Measure
# the same crawl with and without --pause and compare: the harness overhead
# cancels, leaving only the pause. Integer seconds keep it portable (BSD date
# has no %N); a lower bound is not timing-flaky since a pause only adds time.
set -e
: "${top_srcdir:=..}"
# python3 runs the local server (mirror local-crawl.sh); skip when absent, else
# run() swallows its exit-77 and the serverless 0s/0s crawl looks like a fail.
command -v python3 >/dev/null || {
echo "python3 not found; skipping local crawl tests"
exit 77
}
run() { # echoes the wall-clock seconds of one crawl
local t0 t1
t0=$(date +%s)
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
httrack 'BASEURL/types/index.html' -c1 "$@" >/dev/null 2>&1
t1=$(date +%s)
echo $((t1 - t0))
}
base=$(run)
paused=$(run --pause 0.5)
delta=$((paused - base))
echo "crawl: ${base}s, with --pause 0.5: ${paused}s (delta ${delta}s)"
if [ "$delta" -lt 2 ]; then
echo "FAIL: --pause did not delay the crawl (delta ${delta}s)" >&2
exit 1
fi


@@ -1,11 +0,0 @@
#!/bin/bash
# Issue #204: a 302 Location with a #fragment must drop the fragment before the
# target is fetched. The server is strict (400 on a '#' in the request-target),
# so a leaked fragment logs an error and the target is never saved.
set -e
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'redir/target.html' \
httrack 'BASEURL/redir/index.html'
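
What the client is expected to do with the `Location` header can be sketched with `urllib.parse`. This is not HTTrack's code: `redirect_target` is a hypothetical helper and the port is an example value.

```python
from urllib.parse import urldefrag, urljoin


def redirect_target(request_url, location):
    """Resolve a Location header and drop its fragment before fetching."""
    absolute = urljoin(request_url, location)  # Location may be relative
    return urldefrag(absolute).url             # the fragment never goes on the wire


print(redirect_target("http://127.0.0.1:8080/redir/go.php", "target.html#section"))
```

The `#section` part is only meaningful to the user agent after the fetch, so the request-target sent to the strict server must not contain it.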


@@ -5,7 +5,6 @@ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
local-crawl.sh local-server.py server.crt server.key \
server-root/simple/basic.html server-root/simple/link.html \
server-root/stripquery/index.html server-root/stripquery/a.html \
fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT =
@@ -27,7 +26,6 @@ TESTS = \
00_runnable.test \
01_engine-cache.test \
01_engine-cache-golden.test \
01_engine-cache-writefail.test \
01_engine-charset.test \
01_engine-cmdline.test \
01_engine-cookies.test \
@@ -35,22 +33,16 @@ TESTS = \
01_engine-dns.test \
01_engine-doitlog.test \
01_engine-entities.test \
01_engine-filelist.test \
01_engine-filter.test \
01_engine-hashtable.test \
01_engine-idna.test \
01_engine-mime.test \
01_engine-parse.test \
01_engine-pause.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-savename.test \
01_engine-selftest-dispatch.test \
01_engine-simplify.test \
01_engine-stripquery.test \
01_engine-strsafe.test \
01_engine-urlhack.test \
01_engine-useragent.test \
02_manpage-regen.test \
02_update-cache.test \
10_crawl-simple.test \
@@ -68,15 +60,6 @@ TESTS = \
17_local-empty-ct.test \
18_local-update.test \
19_local-connect-fallback.test \
20_local-resume-loop.test \
21_local-intl-update.test \
22_local-broken-size.test \
23_local-errpage.test \
24_local-resume-overlap.test \
25_local-mime-exclude.test \
26_local-strip-query.test \
27_local-cookies-file.test \
28_local-pause.test \
29_local-redirect-fragment.test
20_local-resume-loop.test
CLEANFILES = check-network_sh.cache


@@ -12,14 +12,9 @@
# the mirror directory name.
#
# Usage:
# bash local-crawl.sh [--tls] [--root DIR] [--cookie NAME=VALUE ...] \
# bash local-crawl.sh [--tls] [--root DIR] \
# --errors N --files N --found PATH ... --directory PATH ... \
# --log-found REGEX ... --log-not-found REGEX ... \
# httrack BASEURL/some/path [httrack-args...]
# --log-found/--log-not-found grep (ERE) the crawl's hts-log.txt.
# --cookie writes a Netscape cookies.txt (scoped to the discovered host:port,
# which the ephemeral port forces into the cookie domain) and passes it to
# httrack via --cookies-file, to exercise preloaded cookies.
set -u
@@ -88,7 +83,6 @@ tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create
# --- parse leading control flags --------------------------------------------
declare -a audit=()
declare -a cookies=()
scheme=http
pos=0
args=("$@")
@@ -109,15 +103,11 @@ while test "$pos" -lt "$nargs"; do
pos=$((pos + 1))
root="${args[$pos]}"
;;
--cookie)
pos=$((pos + 1))
cookies+=("${args[$pos]}")
;;
--errors | --files)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
--found | --not-found | --directory | --log-found | --log-not-found)
--found | --not-found | --directory)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
@@ -166,17 +156,6 @@ while test "$pos" -lt "$nargs"; do
pos=$((pos + 1))
done
# --- materialize any --cookie entries into a cookies.txt ---------------------
if test "${#cookies[@]}" -gt 0; then
jar="${tmpdir}/cookies.txt"
: >"$jar"
for spec in "${cookies[@]}"; do
printf '127.0.0.1:%s\tTRUE\t/\tFALSE\t1999999999\t%s\t%s\n' \
"$port" "${spec%%=*}" "${spec#*=}" >>"$jar"
done
hts+=(--cookies-file "$jar")
fi
# --- run httrack -------------------------------------------------------------
which httrack >/dev/null || die "could not find httrack"
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
@@ -217,15 +196,6 @@ if test -n "$rerun"; then
exit 1
}
result "OK (update)"
# The update summary reports "files updated"; a fresh crawl never does. Assert
# it so a regression that bypasses the cache (re-crawls fresh) can't pass.
info "checking update used the cache"
if grep -aqE "mirror complete in .*files updated" "${out}/hts-log.txt"; then
result "OK"
else
result "update pass did not report cache activity"
exit 1
fi
fi
# --- discover the single host root (127.0.0.1_<port> or 127.0.0.1) -----------
@@ -278,22 +248,6 @@ while test "$i" -lt "${#audit[@]}"; do
exit 1
fi
;;
--log-found)
i=$((i + 1))
info "checking log matches ${audit[$i]}"
if grep -aqE "${audit[$i]}" "${out}/hts-log.txt"; then result "OK"; else
result "not in log"
exit 1
fi
;;
--log-not-found)
i=$((i + 1))
info "checking log lacks ${audit[$i]}"
if grep -aqE "${audit[$i]}" "${out}/hts-log.txt"; then
result "present in log"
exit 1
else result "OK"; fi
;;
esac
i=$((i + 1))
done


@@ -110,19 +110,6 @@ class Handler(SimpleHTTPRequestHandler):
return self.fail_cookie("badger")
self.send_html("\tThis is a test.")
# --cookies-file (#215): the secret page needs a cookie no page ever sets,
# so it is reachable only when --cookies-file preloads it.
GATE_COOKIE = ("session", "opensesame")
def route_gated_index(self):
self.send_html('\tThis is a <a href="secret.php">link</a>')
def route_gated_secret(self):
name, value = self.GATE_COOKIE
if self.request_cookies().get(name) != value:
return self.fail_cookie(name)
self.send_html("\tThis is the secret.")
def route_robots(self):
body = b"User-agent: *\nDisallow:\n"
self.send_response(200)
@@ -190,35 +177,6 @@ class Handler(SimpleHTTPRequestHandler):
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
# --- MIME-type exclusion abort (issue #58) -----------------------------
# A -mime:application/pdf filter must abort the transfer once the header
# arrives, not download the whole body and discard it.
def route_mimex_index(self):
self.send_html(
'\t<a href="blob.pdf">pdf</a>\n' '\t<a href="real.html">real</a>\n'
)
# 1 MB body: the fix aborts after the header, so httrack's "bytes received"
# stays tiny; without it the engine reads the body and the count jumps.
MIMEX_BLOB = b"%PDF-1.4\n" + b"\x00" * (1024 * 1024)
def route_mimex_blob(self):
self.send_raw(self.MIMEX_BLOB, "application/pdf")
def route_mimex_real(self):
self.send_raw(b"<html><body>real</body></html>", "text/html")
# --- special chars in URLs across an update (issue #157) ---------------
# A dotless, accented basename served as text/html (MediaWiki style). The
# name the first crawl picks (.html) must survive the update pass.
INTL_NAME = "Instalação_CVS_no_Ubuntu"
def route_intl_index(self):
self.send_html('\t<a href="%s">accented</a>\n' % self.INTL_NAME)
def route_intl_page(self):
self.send_raw(b"<html><body>accented page</body></html>\n", "text/html")
# resume / 416 loop (#206): the first GET stalls after a prefix so the crawl
# can be interrupted (partial + temp-ref); every later request is 416.
RESUME_PREFIX = b"PARTIAL-" + b"x" * 4096 # flushed before the stall
@@ -256,125 +214,10 @@ class Handler(SimpleHTTPRequestHandler):
self.send_header("Content-Length", "0")
self.end_headers()
# 206 resume must honor the server's Content-Range, not the offset we asked
# for (#198): a server resuming a few bytes *before* the request must not
# leave httrack duplicating the overlap onto the partial. flaky.bin
# interrupts once then resumes OVERLAP_EARLY bytes early; full.bin serves
# the identical bytes in one shot, so the test can compare the two.
OVERLAP_BLOB = b"%PDF-1.4\n" + bytes((i * 37 + 11) % 256 for i in range(8000))
OVERLAP_EARLY = 8
OVERLAP_PREFIX_LEN = 4000 # flushed before the stall
_overlap_started = False
def route_overlap_index(self):
self.send_html('\t<a href="flaky.bin">flaky</a>\n\t<a href="full.bin">full</a>')
def route_overlap_full(self):
self.send_raw(self.OVERLAP_BLOB, "application/octet-stream")
def route_overlap(self):
counter = os.environ.get("OVERLAP_COUNTER")
if counter:
with open(counter, "a") as fp:
fp.write("x")
blob = self.OVERLAP_BLOB
rng = self.headers.get("Range")
# First GET: stream a prefix then stall, so the crawl can be interrupted
# mid-body (partial + temp-ref on disk).
if rng is None and not Handler._overlap_started:
Handler._overlap_started = True
self.send_response(200)
self.send_header("Content-Type", "application/octet-stream")
self.send_header("Content-Length", str(len(blob)))
self.send_header("Accept-Ranges", "bytes")
self.end_headers()
if self.command != "HEAD":
self.wfile.write(blob[: self.OVERLAP_PREFIX_LEN])
self.wfile.flush()
try:
while True:
time.sleep(3600)
except OSError:
pass
return
if rng is None: # no resume request: serve the whole file
return self.route_overlap_full()
# Resume: honor the Range, but back up OVERLAP_EARLY bytes.
start = (
int(rng[len("bytes=") :].split("-")[0]) if rng.startswith("bytes=") else 0
)
start = max(0, start - self.OVERLAP_EARLY)
# Signal that the resume Range -> 206 path actually fired, so the test
# can prove it was exercised (not a silent full re-download).
resumed = os.environ.get("OVERLAP_RESUMED")
if resumed:
with open(resumed, "a") as fp:
fp.write("x")
part = blob[start:]
self.send_response(206, "Partial Content")
self.send_header("Content-Type", "application/octet-stream")
self.send_header("Content-Length", str(len(part)))
self.send_header(
"Content-Range", "bytes %d-%d/%d" % (start, len(blob) - 1, len(blob))
)
self.end_headers()
if self.command != "HEAD":
self.wfile.write(part)
# error pages / 0-byte files (#17): -o0 ("no error pages") must keep 4xx/5xx
# bodies off disk; a genuine 0-byte 200 is a valid file and stays.
def route_errpage_index(self):
self.send_html(
'\t<a href="good.html">good</a>\n'
'\t<a href="missing.html">missing</a>\n'
'\t<a href="empty.html">empty</a>\n'
)
def route_errpage_good(self):
self.send_raw(b"<html><body>good page</body></html>\n", "text/html")
def route_errpage_missing(self):
self.send_html("\t404 error body", status=404, extra_status="Not Found")
def route_errpage_empty(self):
self.send_raw(b"", "text/html")
# broken Content-Length (#32/#41): declared size != bytes sent. httrack
# warns "bogus state (broken size)" and skips the cache unless -%B.
def route_size_index(self):
self.send_html('\t<a href="oversize.bin">over</a>\n')
def route_size_oversize(self):
body = b"A" * 100
self.send_response(200)
self.send_header("Content-Type", "application/octet-stream")
self.send_header("Content-Length", str(len(body) - 2)) # lie: too short
self.send_header("Connection", "close")
self.end_headers()
if self.command != "HEAD":
self.wfile.write(body)
# 302 whose Location carries a #fragment (#204): the fragment is a UA anchor
# that must be dropped before the target is fetched. A leaked '#' reaches the
# strict-server guard below and 400s.
def route_redir_index(self):
self.send_html('\t<a href="go.php">go</a>')
def route_redir_go(self):
self.send_response(302, "Found")
self.send_header("Location", "target.html#section")
self.send_header("Content-Length", "0")
self.end_headers()
def route_redir_target(self):
self.send_raw(b"<html><body>redirect target</body></html>\n", "text/html")
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
"/cookies/third.php": route_third,
"/gated/index.php": route_gated_index,
"/gated/secret.php": route_gated_secret,
"/robots.txt": route_robots,
"/types/index.html": route_types_index,
"/types/control.php": route_types,
@@ -390,58 +233,26 @@ class Handler(SimpleHTTPRequestHandler):
"/types/style.css": route_types,
"/types/data.json": route_types,
"/types/gen.php": route_types,
"/intl/index.html": route_intl_index,
"/intl/" + INTL_NAME: route_intl_page,
"/resume/index.html": route_resume_index,
"/resume/blob.txt": route_resume,
"/overlap/index.html": route_overlap_index,
"/overlap/flaky.bin": route_overlap,
"/overlap/full.bin": route_overlap_full,
"/size/index.html": route_size_index,
"/size/oversize.bin": route_size_oversize,
"/errpage/index.html": route_errpage_index,
"/errpage/good.html": route_errpage_good,
"/errpage/missing.html": route_errpage_missing,
"/errpage/empty.html": route_errpage_empty,
"/mimex/index.html": route_mimex_index,
"/mimex/blob.pdf": route_mimex_blob,
"/mimex/real.html": route_mimex_real,
"/redir/index.html": route_redir_index,
"/redir/go.php": route_redir_go,
"/redir/target.html": route_redir_target,
}
# --- dispatch ----------------------------------------------------------
def reject_fragment(self):
# Strict server: a '#' in the request-target is the client failing to
# drop a fragment (#204). RFC 3986 forbids it on the wire; answer 400.
if "#" in self.path:
self.send_response(400, "Bad Request")
self.send_header("Content-Length", "0")
self.end_headers()
return True
return False
def dispatch(self):
self._set_cookies = []
path = urlsplit(self.path).path
# Match percent-encoded paths (accented #157 route) by their decoded form.
handler = self.ROUTES.get(path) or self.ROUTES.get(unquote(path))
handler = self.ROUTES.get(path)
if handler is not None:
handler(self)
return True
return False
def do_GET(self):
if self.reject_fragment():
return
if not self.dispatch():
super().do_GET()
def do_HEAD(self):
if self.reject_fragment():
return
if not self.dispatch():
super().do_HEAD()


@@ -1 +0,0 @@
<html><body>resource A</body></html>


@@ -1,5 +0,0 @@
<html><body>
Two links to one resource, differing only by a tracking parameter.
<a href="a.html?utm_source=x">x</a>
<a href="a.html?utm_source=y">y</a>
</body></html>