mirror of https://github.com/xroche/httrack.git, synced 2026-06-27 12:37:05 +03:00
Compare commits: selftest-n...master (8 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 40a66600ff | | |
| | 768756e231 | | |
| | b138c87a93 | | |
| | 3de47433b7 | | |
| | fb8827718e | | |
| | 7228210061 | | |
| | 38882c0aee | | |
| | bfc4a016ab | | |
45
.github/workflows/ci.yml
vendored
@@ -188,6 +188,51 @@ jobs:
         if: failure()
         run: cat tests/test-suite.log 2>/dev/null || true
 
+  # MemorySanitizer catches reads of uninitialized memory (#143's stack-garbage
+  # size filter) that ASan/UBSan miss. It flags any byte an uninstrumented lib
+  # wrote, so the job stays in our own code: offline self-tests only, no openssl
+  # (--disable-https), no zlib cache tests, static (the runtime is not in .so's).
+  msan:
+    name: msan (MemorySanitizer, clang)
+    runs-on: ubuntu-24.04
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          submodules: recursive
+
+      - name: Install build dependencies
+        run: |
+          set -euo pipefail
+          sudo apt-get update
+          sudo apt-get install -y --no-install-recommends \
+            build-essential clang autoconf automake libtool autoconf-archive \
+            zlib1g-dev
+
+      - name: Configure (MSan, static, no https)
+        run: |
+          set -euo pipefail
+          autoreconf -fi
+          ./configure CC=clang \
+            CFLAGS="-fsanitize=memory -fsanitize-memory-track-origins=2 -fno-sanitize-recover=all -g -O1 -fno-omit-frame-pointer" \
+            LDFLAGS="-fsanitize=memory" \
+            --disable-https --disable-shared --enable-static
+
+      - name: Build
+        run: make -j"$(nproc)"
+
+      - name: Test (offline self-tests under MSan)
+        env:
+          MSAN_OPTIONS: abort_on_error=1:halt_on_error=1
+        run: |
+          set -euo pipefail
+          # Engine self-tests only; the cache trio pulls in uninstrumented zlib.
+          tests="$(cd tests && ls 01_engine-*.test | grep -v -- '-cache' | tr '\n' ' ')"
+          make check TESTS="$tests"
+
+      - name: Print the test log on failure
+        if: failure()
+        run: cat tests/test-suite.log 2>/dev/null || true
+
   # Optional-dependency build: compile and test with HTTPS/OpenSSL disabled --
   # the configuration users on minimal systems build, and one libssl is not even
   # installed here so configure cannot silently re-enable it. The matrix above
@@ -33,8 +33,9 @@ the operational checklist: toolchain, invariants, and how to ship a change.
 - Be terse. Comment the why, in English; translate French comments you touch.
 - Strip AI tells from prose (em-dash overuse, rule-of-three, filler, vague
   attributions). Ref: Wikipedia "Signs of AI writing". Claude Code: `/humanizer`.
-- Behavior change → add a test. Fast path: a hidden `httrack -#N` debug
-  subcommand (`htscoremain.c`) driven by a `tests/NN_*.test`, over a slow crawl.
+- Behavior change → add a test. Fast path: a hidden `httrack -#test=NAME` engine
+  self-test (registry in `htsselftest.c`; `-#test` lists them) driven by a
+  `tests/NN_*.test`, over a slow crawl.
 
 ## Review your change adversarially (strongly suggested)
 Before pushing, and when reviewing others, don't skim for bugs:
@@ -3,7 +3,7 @@
|
||||
.\"
|
||||
.\" This file is generated by man/makeman.sh; do not edit by hand.
|
||||
.\" SPDX-License-Identifier: GPL-3.0-or-later
|
||||
.TH httrack 1 "13 June 2026" "httrack website copier"
|
||||
.TH httrack 1 "27 June 2026" "httrack website copier"
|
||||
.SH NAME
|
||||
httrack \- offline browser : copy websites to a local directory
|
||||
.SH SYNOPSIS
|
||||
@@ -43,6 +43,7 @@ httrack \- offline browser : copy websites to a local directory
|
||||
[ \fB\-x, \-\-replace\-external\fR ]
|
||||
[ \fB\-%x, \-\-disable\-passwords\fR ]
|
||||
[ \fB\-%q, \-\-include\-query\-string\fR ]
|
||||
[ \fB\-%g, \-\-strip\-query\fR ]
|
||||
[ \fB\-o, \-\-generate\-errors\fR ]
|
||||
[ \fB\-X, \-\-purge\-old[=N]\fR ]
|
||||
[ \fB\-%p, \-\-preserve\fR ]
|
||||
@@ -198,6 +199,8 @@ replace external html links by error pages (\-\-replace\-external)
|
||||
do not include any password for external password protected websites (%x0 include) (\-\-disable\-passwords)
|
||||
.IP \-%q
|
||||
*include query string for local files (useless, for information purpose only) (%q0 don't include) (\-\-include\-query\-string)
|
||||
.IP \-%g
|
||||
strip query keys for dedup ([host/pattern=]key1,key2,...) (\-\-strip\-query <param>)
|
||||
.IP \-o
|
||||
*generate output html file in case of error (404..) (o0 don't generate) (\-\-generate\-errors)
|
||||
.IP \-X
|
||||
@@ -313,12 +316,8 @@ debug HTTP headers in logfile (\-\-debug\-headers)
|
||||
.SS Guru options: (do NOT use if possible)
|
||||
.IP \-#X
|
||||
*use optimized engine (limited memory boundary checks) (\-\-fast\-engine)
|
||||
.IP \-#0
|
||||
filter test (\-#0 '*.gif' 'www.bar.com/foo.gif') (\-\-debug\-testfilters <param>)
|
||||
.IP \-#1
|
||||
simplify test (\-#1 ./foo/bar/../foobar)
|
||||
.IP \-#2
|
||||
type test (\-#2 /foo/bar.php)
|
||||
.IP \-#test
|
||||
list engine self\-tests (run one with \-#test=NAME [args])
|
||||
.IP \-#C
|
||||
cache list (\-#C '*.com/spider*.gif' (\-\-debug\-cache <param>)
|
||||
.IP \-#R
|
||||
|
||||
@@ -56,7 +56,7 @@ whttrackrundir = $(bindir)
|
||||
whttrackrun_SCRIPTS = webhttrack
|
||||
|
||||
libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
|
||||
htscache_selftest.c htsdns_selftest.c \
|
||||
htscache_selftest.c htsdns_selftest.c htsselftest.c \
|
||||
htscatchurl.c htsfilters.c htsftp.c htshash.c coucal/coucal.c \
|
||||
htshelp.c htslib.c htscoremain.c \
|
||||
htsname.c htsrobots.c htstools.c htswizard.c \
|
||||
@@ -66,7 +66,7 @@ libhttrack_la_SOURCES = htscore.c htsparse.c htsback.c htscache.c \
|
||||
md5.c \
|
||||
minizip/ioapi.c minizip/mztools.c minizip/unzip.c minizip/zip.c \
|
||||
hts-indextmpl.h htsalias.h htsback.h htsbase.h htssafe.h \
|
||||
htsbasenet.h htsbauth.h htscache.h htscache_selftest.h htsdns_selftest.h htscatchurl.h \
|
||||
htsbasenet.h htsbauth.h htscache.h htscache_selftest.h htsdns_selftest.h htsselftest.h htscatchurl.h \
|
||||
htsconfig.h htscore.h htsparse.h htscoremain.h htsdefines.h \
|
||||
htsfilters.h htsftp.h htsglobal.h htshash.h coucal/coucal.h \
|
||||
htshelp.h htsindex.h htslib.h htsmd5.h \
|
||||
|
||||
@@ -60,6 +60,9 @@ Please visit our Website: http://www.httrack.com
|
||||
param1 : this option must be alone, and needs one distinct parameter (-P <path>)
|
||||
param0 : this option must be alone, but the parameter should be put together (+*.gif)
|
||||
*/
|
||||
/* clang-format off: hand-aligned table; clang-format reflows the whole
|
||||
initializer (2->4 space) on any edit, churning every untouched row. */
|
||||
/* clang-format off */
|
||||
const char *hts_optalias[][4] = {
|
||||
/* {"","","",""}, */
|
||||
{"path", "-O", "param1", "output path"},
|
||||
@@ -107,6 +110,8 @@ const char *hts_optalias[][4] = {
|
||||
{"disable-passwords", "-%x", "single", ""}, {"disable-password", "-%x",
|
||||
"single", ""},
|
||||
{"include-query-string", "-%q", "single", ""},
|
||||
{"strip-query", "-%g", "param1",
|
||||
"strip [host/pattern=]key1,key2,... from URLs"},
|
||||
{"generate-errors", "-o", "single", ""},
|
||||
{"do-not-generate-errors", "-o0", "single", ""},
|
||||
{"purge-old", "-X", "param", ""},
|
||||
@@ -241,6 +246,7 @@ const char *hts_optalias[][4] = {
|
||||
|
||||
{"", "", "", ""}
|
||||
};
|
||||
/* clang-format on */
|
||||
|
||||
/*
|
||||
Check for alias in command-line
|
||||
|
||||
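The alias table above maps a long option name to its short form, an argument style, and a help string, and ends with an all-empty sentinel row. A minimal sketch of a lookup over such a table follows. The three rows are copied from the hunks above; the table name `aliases` and the helper `alias_short()` are hypothetical and are not HTTrack's actual lookup code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rows copied from the diff: {long name, short option, arg style, help}.
   The all-empty row is the end-of-table sentinel. */
static const char *const aliases[][4] = {
    {"path", "-O", "param1", "output path"},
    {"strip-query", "-%g", "param1",
     "strip [host/pattern=]key1,key2,... from URLs"},
    {"generate-errors", "-o", "single", ""},
    {"", "", "", ""}};

/* Hypothetical helper: return the short option for a long name, or NULL
   when the name is not in the table. */
static const char *alias_short(const char *longname) {
  size_t i;

  assert(longname != NULL);
  for (i = 0; aliases[i][0][0] != '\0'; i++) {
    if (strcmp(aliases[i][0], longname) == 0)
      return aliases[i][1];
  }
  return NULL;
}
```

With these rows, `alias_short("strip-query")` yields `-%g`, and an unknown name yields `NULL`.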
118
src/htsback.c
@@ -57,7 +57,10 @@ Please visit our Website: http://www.httrack.com
|
||||
// DOS
|
||||
#include <process.h> /* _beginthread, _endthread */
|
||||
#endif
|
||||
#include <io.h> /* _chsize_s */
|
||||
#define HTS_FTRUNCATE(fp, sz) _chsize_s(_fileno(fp), (sz))
|
||||
#else
|
||||
#define HTS_FTRUNCATE(fp, sz) ftruncate(fileno(fp), (sz))
|
||||
#endif
|
||||
|
||||
#define VT_CLREOL "\33[K"
|
||||
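The `HTS_FTRUNCATE` macro above exists so that a partial file can be cut back to the resume offset before new bytes are written in place. A minimal POSIX-only sketch of that sequence is shown here, assuming a stream opened for update (`"r+b"` or similar). The helper name `resume_at()` is hypothetical; on Windows the patch uses `_chsize_s` instead of `ftruncate`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: drop every byte past `resume` and position the
   stream so the next write lands at `resume`.
   Returns 0 on success, -1 on failure (the caller should then discard
   the partial file and refetch, as the patch does). */
static int resume_at(FILE *fp, off_t resume) {
  assert(fp != NULL && resume >= 0);
  if (fflush(fp) != 0) /* push buffered data before touching the fd */
    return -1;
  if (ftruncate(fileno(fp), resume) != 0)
    return -1;
  return fseeko(fp, resume, SEEK_SET);
}
```

Checking the truncate result matters: if it failed silently, a stale tail longer than the new data would survive at the end of the file.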
@@ -3763,7 +3766,27 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
|
||||
}
|
||||
#endif
|
||||
/********** **************************** ********** */
|
||||
} else { // il faut aller le chercher
|
||||
}
|
||||
// MIME type excluded by a -mime: filter: abort, don't fetch
|
||||
// the body (#58)
|
||||
else if (HTTP_IS_OK(back[i].r.statuscode) &&
|
||||
!back[i].testmode &&
|
||||
strnotempty(back[i].r.contenttype) &&
|
||||
hts_acceptmime(opt, 0, back[i].url_adr,
|
||||
back[i].url_fil,
|
||||
back[i].r.contenttype) == 1) {
|
||||
deletehttp(&back[i].r);
|
||||
back[i].r.soc = INVALID_SOCKET;
|
||||
back[i].status = STATUS_READY;
|
||||
back_set_finished(sback, i);
|
||||
back[i].r.statuscode = STATUSCODE_EXCLUDED;
|
||||
strcpybuff(back[i].r.msg, "Excluded by MIME type filter");
|
||||
hts_log_print(
|
||||
opt, LOG_NOTICE,
|
||||
"File excluded by MIME type filter (%s): %s%s",
|
||||
back[i].r.contenttype, back[i].url_adr,
|
||||
back[i].url_fil);
|
||||
} else { // il faut aller le chercher
|
||||
|
||||
// effacer buffer (requète)
|
||||
if (!noFreebuff) {
|
||||
@@ -3774,35 +3797,70 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
|
||||
// xxc SI CHUNK VERIFIER QUE CA MARCHE??
|
||||
if (back[i].r.statuscode == 206) { // on nous envoie un morceau (la fin) coz une partie sur disque!
|
||||
off_t sz = fsize_utf8(back[i].url_sav);
|
||||
/* RFC 7233: resume at the server's Content-Range start,
|
||||
not the offset we requested; a server may resume
|
||||
earlier and appending the overlap duplicates bytes
|
||||
(#198). */
|
||||
const LLint resume = back[i].r.crange_start;
|
||||
const hts_boolean range_ok =
|
||||
back[i].r.crange > 0 && resume >= 0 &&
|
||||
resume <= (LLint) sz &&
|
||||
back[i].r.crange_end + 1 == back[i].r.crange &&
|
||||
(back[i].r.totalsize < 0 ||
|
||||
back[i].r.totalsize ==
|
||||
back[i].r.crange_end - resume + 1);
|
||||
|
||||
#if HDEBUG
|
||||
printf("partial content: " LLintP " on disk..\n",
|
||||
(LLint) sz);
|
||||
#endif
|
||||
if (sz >= 0) {
|
||||
if (sz >= 0 && range_ok) {
|
||||
if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_sav)) { // pas HTML
|
||||
if (opt->getmode & HTS_GETMODE_NONHTML) {
|
||||
filenote(&opt->state.strc, back[i].url_sav, NULL); // noter fichier comme connu
|
||||
file_notify(opt, back[i].url_adr, back[i].url_fil,
|
||||
back[i].url_sav, 0, 1,
|
||||
back[i].r.notmodified);
|
||||
back[i].r.out = FOPEN(fconv(catbuff, sizeof(catbuff), back[i].url_sav), "ab"); // append
|
||||
back[i].r.out =
|
||||
FOPEN(fconv(catbuff, sizeof(catbuff),
|
||||
back[i].url_sav),
|
||||
"r+b"); // resume in place
|
||||
if (back[i].r.out && opt->cache != 0) {
|
||||
back[i].r.is_write = 1; // écrire
|
||||
back[i].r.size = sz; // déja écrit
|
||||
back[i].r.statuscode = HTTP_OK; // Forcer 'OK'
|
||||
back[i].r.is_write = 1;
|
||||
back[i].r.size = resume; // bytes already on disk
|
||||
back[i].r.statuscode = HTTP_OK; // force 'OK'
|
||||
if (back[i].r.totalsize >= 0)
|
||||
back[i].r.totalsize += sz; // plus en fait
|
||||
fseek(back[i].r.out, 0, SEEK_END); // à la fin
|
||||
/* create a temporary reference file in case of broken mirror */
|
||||
if (back_serialize_ref(opt, &back[i]) != 0) {
|
||||
hts_log_print(opt, LOG_WARNING,
|
||||
"Could not create temporary reference file for %s%s",
|
||||
back[i].url_adr, back[i].url_fil);
|
||||
}
|
||||
back[i].r.totalsize += resume; // -> full size
|
||||
// drop bytes past the resume point; a silent
|
||||
// failure could leave a stale tail, so on error
|
||||
// drop the partial and refetch the whole file
|
||||
if (HTS_FTRUNCATE(back[i].r.out,
|
||||
(off_t) resume) != 0) {
|
||||
fclose(back[i].r.out);
|
||||
back[i].r.out = NULL;
|
||||
url_savename_refname_remove(
|
||||
opt, back[i].url_adr, back[i].url_fil);
|
||||
UNLINK(back[i].url_sav);
|
||||
back[i].status = STATUS_READY;
|
||||
back_set_finished(sback, i);
|
||||
strcpybuff(back[i].r.msg,
|
||||
"Can not truncate partial file, "
|
||||
"restarting");
|
||||
} else {
|
||||
fseeko(back[i].r.out, (off_t) resume, SEEK_SET);
|
||||
/* create a temporary reference file in case of
|
||||
* broken mirror */
|
||||
if (back_serialize_ref(opt, &back[i]) != 0) {
|
||||
hts_log_print(opt, LOG_WARNING,
|
||||
"Could not create temporary "
|
||||
"reference file for %s%s",
|
||||
back[i].url_adr,
|
||||
back[i].url_fil);
|
||||
}
|
||||
#if HDEBUG
|
||||
printf("continue interrupted file\n");
|
||||
printf("continue interrupted file\n");
|
||||
#endif
|
||||
}
|
||||
} else { // On est dans la m**
|
||||
back[i].status = STATUS_READY; // terminé (voir plus loin)
|
||||
back_set_finished(sback, i);
|
||||
@@ -3814,17 +3872,18 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
|
||||
FILE *fp =
|
||||
FOPEN(fconv(catbuff, sizeof(catbuff), back[i].url_sav), "rb");
|
||||
if (fp) {
|
||||
LLint alloc_mem = sz + 1;
|
||||
LLint alloc_mem = resume + 1;
|
||||
|
||||
if (back[i].r.totalsize >= 0)
|
||||
alloc_mem += back[i].r.totalsize; // AJOUTER RESTANT!
|
||||
if (deleteaddr(&back[i].r)
|
||||
&& (back[i].r.adr =
|
||||
(char *) malloct((size_t) alloc_mem))) {
|
||||
back[i].r.size = sz;
|
||||
back[i].r.size = resume;
|
||||
if (back[i].r.totalsize >= 0)
|
||||
back[i].r.totalsize += sz; // plus en fait
|
||||
if ((fread(back[i].r.adr, 1, sz, fp)) != sz) {
|
||||
back[i].r.totalsize += resume; // -> full size
|
||||
if ((fread(back[i].r.adr, 1, (size_t) resume,
|
||||
fp)) != (size_t) resume) {
|
||||
back[i].status = STATUS_READY; // terminé (voir plus loin)
|
||||
back_set_finished(sback, i);
|
||||
strcpybuff(back[i].r.msg,
|
||||
@@ -3842,14 +3901,30 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
|
||||
"No memory for partial file");
|
||||
}
|
||||
fclose(fp);
|
||||
} else { // Argh..
|
||||
} else { // open failed
|
||||
back[i].status = STATUS_READY; // terminé (voir plus loin)
|
||||
back_set_finished(sback, i);
|
||||
strcpybuff(back[i].r.msg,
|
||||
"Can not open partial file");
|
||||
}
|
||||
}
|
||||
} else { // Non trouvé??
|
||||
} else if (sz >=
|
||||
0) { // unusable range -> restart whole file
|
||||
hts_log_print(opt, LOG_WARNING,
|
||||
"Unusable partial-content range for %s%s "
|
||||
"(have " LLintP " bytes, got " LLintP
|
||||
"-" LLintP "/" LLintP "), restarting",
|
||||
back[i].url_adr, back[i].url_fil,
|
||||
(LLint) sz, back[i].r.crange_start,
|
||||
back[i].r.crange_end, back[i].r.crange);
|
||||
url_savename_refname_remove(opt, back[i].url_adr,
|
||||
back[i].url_fil);
|
||||
UNLINK(back[i].url_sav);
|
||||
back[i].status = STATUS_READY;
|
||||
back_set_finished(sback, i);
|
||||
strcpybuff(back[i].r.msg,
|
||||
"Unusable partial content, restarting");
|
||||
} else { // partial not found
|
||||
back[i].status = STATUS_READY; // terminé (voir plus loin)
|
||||
back_set_finished(sback, i);
|
||||
strcpybuff(back[i].r.msg, "Can not find partial file");
|
||||
@@ -3930,7 +4005,6 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
/*} */
|
||||
|
||||
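The `range_ok` condition in the 206 branch above can be read as a standalone predicate. The sketch below restates it with plain integers, assuming `start`, `end` and `total` come from `Content-Range: bytes start-end/total` and that `body_len` is the announced body length (negative when unknown). The function name `range_usable()` and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical restatement of the patch's range_ok test: can a 206 reply
   extend a partial file that already holds `on_disk` bytes? */
static bool range_usable(int64_t start, int64_t end, int64_t total,
                         int64_t on_disk, int64_t body_len) {
  assert(on_disk >= 0); /* the patch checks sz >= 0 separately */
  if (total <= 0 || start < 0)
    return false; /* missing or malformed Content-Range */
  if (start > on_disk)
    return false; /* resuming here would leave a hole */
  if (end + 1 != total)
    return false; /* the range must run to the end of the entity */
  if (body_len >= 0 && body_len != end - start + 1)
    return false; /* body length disagrees with the range */
  return true;
}
```

A server that resumes earlier than requested (`start < on_disk`) is accepted: the file is truncated to `start` and the overlap is rewritten rather than appended twice.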
@@ -146,7 +146,8 @@ typedef enum BackStatusCode {
   STATUSCODE_NON_FATAL = -5,
   STATUSCODE_SSL_HANDSHAKE = -6,
   STATUSCODE_TOO_BIG = -7,
-  STATUSCODE_TEST_OK = -10
+  STATUSCODE_TEST_OK = -10,
+  STATUSCODE_EXCLUDED = -11 /* aborted: MIME excluded by a -mime: filter */
 } BackStatusCode;
 
 /** HTTrack status ('status' member of of 'lien_back') **/
@@ -736,26 +736,39 @@ int httpmirror(char *url1, httrackp * opt) {
|
||||
/* OPTIMIZED for fast load */
|
||||
if (StringNotEmpty(opt->filelist)) {
|
||||
char *filelist_buff = NULL;
|
||||
const size_t filelist_sz = off_t_to_size_t(fsize(StringBuff(opt->filelist)));
|
||||
size_t filelist_sz = 0;
|
||||
const char *filelist_err = NULL; /* failure reason, NULL on success */
|
||||
const off_t fs = fsize(StringBuff(opt->filelist));
|
||||
|
||||
if (filelist_sz != (size_t) -1) {
|
||||
if (fs < 0) {
|
||||
/* fsize() hides the cause; redo stat() for a precise errno (#49) */
|
||||
struct stat st;
|
||||
filelist_err = stat(StringBuff(opt->filelist), &st) != 0
|
||||
? strerror(errno)
|
||||
: "not a regular file";
|
||||
} else if ((filelist_sz = off_t_to_size_t(fs)) == (size_t) -1) {
|
||||
filelist_err = "file too large";
|
||||
filelist_sz = 0;
|
||||
} else {
|
||||
FILE *fp = fopen(StringBuff(opt->filelist), "rb");
|
||||
|
||||
if (fp) {
|
||||
if (fp == NULL) {
|
||||
filelist_err = strerror(errno);
|
||||
} else {
|
||||
filelist_buff = malloct(filelist_sz + 1);
|
||||
if (filelist_buff) {
|
||||
if (fread(filelist_buff, 1, filelist_sz, fp) != filelist_sz) {
|
||||
freet(filelist_buff);
|
||||
filelist_buff = NULL;
|
||||
} else {
|
||||
*(filelist_buff + filelist_sz) = '\0';
|
||||
}
|
||||
if (filelist_buff == NULL) {
|
||||
filelist_err = "out of memory";
|
||||
} else if (fread(filelist_buff, 1, filelist_sz, fp) != filelist_sz) {
|
||||
freet(filelist_buff);
|
||||
filelist_err = "read error";
|
||||
} else {
|
||||
filelist_buff[filelist_sz] = '\0';
|
||||
}
|
||||
fclose(fp);
|
||||
}
|
||||
}
|
||||
|
||||
if (filelist_buff) {
|
||||
if (filelist_buff != NULL) {
|
||||
int filelist_ptr = 0;
|
||||
int n = 0;
|
||||
char BIGSTK line[HTS_URLMAXSIZE * 2];
|
||||
@@ -780,8 +793,8 @@ int httpmirror(char *url1, httrackp * opt) {
|
||||
// Free buffer
|
||||
freet(filelist_buff);
|
||||
} else {
|
||||
hts_log_print(opt, LOG_ERROR, "Could not include URL list: %s",
|
||||
StringBuff(opt->filelist));
|
||||
hts_log_print(opt, LOG_ERROR, "Could not include URL list \"%s\": %s",
|
||||
StringBuff(opt->filelist), filelist_err);
|
||||
}
|
||||
}
|
||||
|
||||
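The URL-list change above replaces a single generic error with one reason per failure point. A minimal sketch of the same idea in portable C is shown here. It is not the patch's code: `read_list()` is a hypothetical helper that uses `fseek`/`ftell` for the size instead of HTTrack's `fsize()` and `stat()`, and plain `malloc`/`free` instead of `malloct`/`freet`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read `path` into a malloc'd, NUL-terminated buffer.
   Returns NULL on success (buffer in *out, length in *len, caller frees),
   or a short human-readable reason on failure. */
static const char *read_list(const char *path, char **out, size_t *len) {
  FILE *fp;
  long size = 0;
  char *buf = NULL;
  const char *err = NULL;

  assert(path != NULL && out != NULL && len != NULL);
  *out = NULL;
  *len = 0;
  fp = fopen(path, "rb");
  if (fp == NULL)
    return strerror(errno); /* e.g. "No such file or directory" */
  if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0 ||
      fseek(fp, 0, SEEK_SET) != 0) {
    err = "cannot determine size";
  } else if ((buf = malloc((size_t) size + 1)) == NULL) {
    err = "out of memory";
  } else if (fread(buf, 1, (size_t) size, fp) != (size_t) size) {
    free(buf);
    err = "read error";
  } else {
    buf[size] = '\0';
    *out = buf;
    *len = (size_t) size;
  }
  fclose(fp);
  return err;
}
```

The caller can then log the path and the reason together, in the spirit of the new `Could not include URL list "%s": %s` message.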
@@ -3726,6 +3739,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
|
||||
if (StringNotEmpty(from->user_agent))
|
||||
StringCopyS(to->user_agent, from->user_agent);
|
||||
|
||||
if (StringNotEmpty(from->strip_query))
|
||||
StringCopyS(to->strip_query, from->strip_query);
|
||||
|
||||
if (from->retry > -1)
|
||||
to->retry = from->retry;
|
||||
|
||||
|
||||
@@ -236,6 +236,8 @@ struct hash_struct {
|
||||
coucal former_adrfil;
|
||||
/* scratch buffers reused across lookups (not reentrant) */
|
||||
int normalized;
|
||||
/* query-strip keys (not owned); set from opt->strip_query at hash_init */
|
||||
const char *strip_query;
|
||||
char normfil[HTS_URLMAXSIZE * 2];
|
||||
char normfil2[HTS_URLMAXSIZE * 2];
|
||||
char catbuff[CATBUFF_SIZE];
|
||||
@@ -364,6 +366,17 @@ int fspc(httrackp * opt, FILE * fp, const char *type);
|
||||
|
||||
char *next_token(char *p, int flag);
|
||||
|
||||
/* Like fil_normalized(), but first drops query keys in STRIP (comma-separated,
|
||||
"*" = all); STRIP NULL/empty behaves exactly like fil_normalized(). */
|
||||
char *fil_normalized_filtered(const char *source, char *dest,
|
||||
const char *strip);
|
||||
|
||||
/* For URL ADR/FIL, return (in DEST) the comma keylist to strip from the
|
||||
'\n'-separated "[pattern=]keys" RULES (patterns matched on host/path via
|
||||
strjoker, last wins); NULL if none match. Feeds fil_normalized_filtered(). */
|
||||
const char *hts_query_strip_keys(const char *rules, const char *adr,
|
||||
const char *fil, char *dest, size_t destsize);
|
||||
|
||||
/* Read a whole file into a freshly malloc'd, NUL-terminated buffer; the caller
|
||||
owns it and must release it with freet(). Return NULL on missing/unreadable
|
||||
file (readfile_or substitutes defaultdata instead). The byte content is NOT
|
||||
|
||||
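The comment on `hts_query_strip_keys()` above describes newline-separated `[pattern=]keys` rules where the last matching rule wins. A minimal sketch of that selection follows. It is an illustration only: POSIX `fnmatch()` stands in for HTTrack's `strjoker()`, whose wildcard syntax differs, the URL is assumed to be already reduced to host/path, and the name `pick_keys()` and the 256-byte pattern buffer are hypothetical.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: choose the key list that applies to `url` from
   '\n'-separated "[pattern=]keys" rules. A rule without '=' applies to
   every URL. The last matching rule wins. Returns dest, or NULL. */
static const char *pick_keys(const char *rules, const char *url, char *dest,
                             size_t destsize) {
  const char *p = rules;
  const char *result = NULL;

  assert(rules != NULL && url != NULL && dest != NULL && destsize > 0);
  while (*p != '\0') {
    const char *const line = p;
    const char *eol, *eq, *keys;
    char pat[256];

    while (*p != '\0' && *p != '\n')
      p++;
    eol = p;
    if (*p == '\n')
      p++;
    if (eol == line)
      continue; /* skip empty lines */
    eq = memchr(line, '=', (size_t) (eol - line));
    if (eq != NULL) {
      size_t patlen = (size_t) (eq - line);

      if (patlen >= sizeof(pat))
        patlen = sizeof(pat) - 1;
      memcpy(pat, line, patlen);
      pat[patlen] = '\0';
      keys = eq + 1;
    } else {
      strcpy(pat, "*"); /* bare form: match everything */
      keys = line;
    }
    if (fnmatch(pat, url, 0) == 0) { /* stand-in for strjoker() */
      size_t klen = (size_t) (eol - keys);

      if (klen >= destsize)
        klen = destsize - 1;
      memcpy(dest, keys, klen);
      dest[klen] = '\0';
      result = dest; /* keep going: a later rule may override */
    }
  }
  return result;
}
```

Because evaluation never stops at the first hit, a general rule can be listed first and narrowed by a host-specific rule after it.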
1032
src/htscoremain.c
File diff suppressed because it is too large
@@ -76,7 +76,8 @@ int fa_strjoker(int type, char **filters, int nfil, const char *nom, LLint * siz
     }
     if (size)
       sz = *size;
-    if (strjoker(nom, filters[i] + filteroffs, &sz, size_flag)) { // reconnu
+    /* size unknown (scan time): no size pointer => size tests stay neutral */
+    if (strjoker(nom, filters[i] + filteroffs, size ? &sz : NULL, size_flag)) {
       if (size)
         if (sz != *size)
           sizelimit = sz;
@@ -117,10 +117,17 @@ static coucal_hashkeys key_adrfil_hashes_generic(void *arg,
|
||||
|
||||
// copy link
|
||||
assertf(fil != NULL);
|
||||
if (hash->normalized) {
|
||||
fil_normalized(fil, &hash->normfil[strlen(hash->normfil)]);
|
||||
} else {
|
||||
strcpy(&hash->normfil[strlen(hash->normfil)], fil);
|
||||
{
|
||||
/* resolve the per-URL strip keys; strip applies even when urlhack is off */
|
||||
char BIGSTK keybuf[HTS_URLMAXSIZE];
|
||||
const char *const keys = hts_query_strip_keys(hash->strip_query, adr, fil,
|
||||
keybuf, sizeof(keybuf));
|
||||
|
||||
if (hash->normalized || keys != NULL) {
|
||||
fil_normalized_filtered(fil, &hash->normfil[strlen(hash->normfil)], keys);
|
||||
} else {
|
||||
strcpy(&hash->normfil[strlen(hash->normfil)], fil);
|
||||
}
|
||||
}
|
||||
|
||||
// hash
|
||||
@@ -161,12 +168,20 @@ static int key_adrfil_equals_generic(void *arg,
|
||||
}
|
||||
|
||||
// now compare pathes
|
||||
if (normalized) {
|
||||
fil_normalized(a_fil, hash->normfil);
|
||||
fil_normalized(b_fil, hash->normfil2);
|
||||
return strcmp(hash->normfil, hash->normfil2) == 0;
|
||||
} else {
|
||||
return strcmp(a_fil, b_fil) == 0;
|
||||
{
|
||||
char BIGSTK ka[HTS_URLMAXSIZE], kb[HTS_URLMAXSIZE];
|
||||
const char *const keysa =
|
||||
hts_query_strip_keys(hash->strip_query, a_adr, a_fil, ka, sizeof(ka));
|
||||
const char *const keysb =
|
||||
hts_query_strip_keys(hash->strip_query, b_adr, b_fil, kb, sizeof(kb));
|
||||
|
||||
if (normalized || keysa != NULL || keysb != NULL) {
|
||||
fil_normalized_filtered(a_fil, hash->normfil, keysa);
|
||||
fil_normalized_filtered(b_fil, hash->normfil2, keysb);
|
||||
return strcmp(hash->normfil, hash->normfil2) == 0;
|
||||
} else {
|
||||
return strcmp(a_fil, b_fil) == 0;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -227,6 +242,9 @@ void hash_init(httrackp *opt, hash_struct * hash, int normalized) {
|
||||
hash->adrfil = coucal_new(0);
|
||||
hash->former_adrfil = coucal_new(0);
|
||||
hash->normalized = normalized;
|
||||
/* snapshot the query-strip list (not owned; valid for the hash lifetime) */
|
||||
hash->strip_query =
|
||||
StringNotEmpty(opt->strip_query) ? StringBuff(opt->strip_query) : NULL;
|
||||
|
||||
hts_set_hash_handler(hash->sav, opt);
|
||||
hts_set_hash_handler(hash->adrfil, opt);
|
||||
|
||||
@@ -563,6 +563,7 @@ void help(const char *app, int more) {
|
||||
(" %x do not include any password for external password protected websites (%x0 include)");
|
||||
infomsg
|
||||
(" %q *include query string for local files (useless, for information purpose only) (%q0 don't include)");
|
||||
infomsg(" %g strip query keys for dedup ([host/pattern=]key1,key2,...)");
|
||||
infomsg
|
||||
(" o *generate output html file in case of error (404..) (o0 don't generate)");
|
||||
infomsg(" X *purge old files after update (X0 keep delete)");
|
||||
@@ -646,9 +647,7 @@ void help(const char *app, int more) {
|
||||
infomsg("");
|
||||
infomsg("Guru options: (do NOT use if possible)");
|
||||
infomsg(" #X *use optimized engine (limited memory boundary checks)");
|
||||
infomsg(" #0 filter test (-#0 '*.gif' 'www.bar.com/foo.gif')");
|
||||
infomsg(" #1 simplify test (-#1 ./foo/bar/../foobar)");
|
||||
infomsg(" #2 type test (-#2 /foo/bar.php)");
|
||||
infomsg(" #test list engine self-tests (run one with -#test=NAME [args])");
|
||||
infomsg(" #C cache list (-#C '*.com/spider*.gif'");
|
||||
infomsg(" #R cache repair (damaged cache)");
|
||||
infomsg(" #d debug parser");
|
||||
|
||||
138
src/htslib.c
@@ -3681,6 +3681,142 @@ HTSEXT_API char *fil_normalized(const char *source, char *dest) {
|
||||
return dest;
|
||||
}
|
||||
|
||||
/* Is query key ARG[0..keylen) in the comma-separated STRIP list? "*" = all;
|
||||
case-sensitive, space-trimmed tokens. */
|
||||
static int hts_query_key_stripped(const char *arg, size_t keylen,
|
||||
const char *strip) {
|
||||
const char *p = strip;
|
||||
|
||||
while (*p != '\0') {
|
||||
const char *start = p;
|
||||
size_t toklen;
|
||||
|
||||
while (*p != '\0' && *p != ',')
|
||||
p++;
|
||||
toklen = (size_t) (p - start);
|
||||
while (toklen > 0 && *start == ' ') {
|
||||
start++;
|
||||
toklen--;
|
||||
}
|
||||
while (toklen > 0 && start[toklen - 1] == ' ')
|
||||
toklen--;
|
||||
if (toklen == 1 && start[0] == '*')
|
||||
return 1;
|
||||
if (toklen == keylen && strncmp(start, arg, keylen) == 0)
|
||||
return 1;
|
||||
if (*p == ',')
|
||||
p++;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* see htscore.h */
|
||||
char *fil_normalized_filtered(const char *source, char *dest,
|
||||
const char *strip) {
|
||||
const char *query;
|
||||
char BIGSTK tmp[HTS_URLMAXSIZE * 2];
|
||||
htsbuff cb;
|
||||
int wrote = 0;
|
||||
|
||||
/* No strip list, or no query: plain normalization. */
|
||||
if (strip == NULL || *strip == '\0' ||
|
||||
(query = strchr(source, '?')) == NULL) {
|
||||
return fil_normalized(source, dest);
|
||||
}
|
||||
|
||||
/* Copy the path, re-emit kept query args, let fil_normalized() sort. Walk
|
||||
every field incl. empty/trailing ("a&","?&&") so the result is a fixpoint
|
||||
(the read re-normalizes it; a dropped empty arg would miss dedup). */
|
||||
cb = htsbuff_ptr(tmp, sizeof(tmp));
|
||||
htsbuff_catn(&cb, source, (size_t) (query - source));
|
||||
for (query++;;) {
|
||||
const char *const arg = query;
|
||||
const char *eq = NULL;
|
||||
size_t keylen, arglen;
|
||||
|
||||
while (*query != '\0' && *query != '&') {
|
||||
if (eq == NULL && *query == '=')
|
||||
eq = query;
|
||||
query++;
|
||||
}
|
||||
arglen = (size_t) (query - arg);
|
||||
keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
|
||||
if (!hts_query_key_stripped(arg, keylen, strip)) {
|
||||
htsbuff_catc(&cb, wrote ? '&' : '?');
|
||||
htsbuff_catn(&cb, arg, arglen);
|
||||
wrote = 1;
|
||||
}
|
||||
if (*query == '\0')
|
||||
break;
|
||||
query++;
|
||||
}
|
||||
return fil_normalized(tmp, dest);
|
||||
}
|
||||
|
||||
/* see htscore.h */
|
||||
const char *hts_query_strip_keys(const char *rules, const char *adr,
|
||||
const char *fil, char *dest, size_t destsize) {
|
||||
const char *p, *q;
|
||||
const char *result = NULL;
|
||||
char BIGSTK url[HTS_URLMAXSIZE * 2];
|
||||
|
||||
if (rules == NULL || *rules == '\0' || destsize == 0)
|
||||
return NULL;
|
||||
|
||||
/* Match string = normalized host/path, query removed. jump_normalized_const
|
||||
collapses www+scheme/auth so read and write (double-normalized) agree;
|
||||
query excluded keeps the decision on host/path only. */
|
||||
url[0] = '\0';
|
||||
strcatbuff(url, jump_normalized_const(adr));
|
||||
if (fil[0] != '/')
|
||||
strcatbuff(url, "/");
|
||||
q = strchr(fil, '?');
|
||||
if (q != NULL)
|
||||
strncatbuff(url, fil, (int) (q - fil));
|
||||
else
|
||||
strcatbuff(url, fil);
|
||||
|
||||
/* Walk the '\n' entries; last match wins (like the +/- filter eval). Each is
|
||||
"pattern=keys"; no '=' is the bare form, pattern "*". */
|
||||
for (p = rules; *p != '\0';) {
|
||||
const char *const line = p;
|
||||
const char *eol, *eq, *keys;
|
||||
char BIGSTK pat[HTS_URLMAXSIZE * 2];
|
||||
|
||||
while (*p != '\0' && *p != '\n')
|
||||
p++;
|
||||
eol = p;
|
||||
if (*p == '\n')
|
||||
p++;
|
||||
if (eol == line)
|
||||
continue;
|
||||
eq = memchr(line, '=', (size_t) (eol - line));
|
||||
if (eq != NULL) {
|
||||
size_t patlen = (size_t) (eq - line);
|
||||
|
||||
if (patlen >= sizeof(pat))
|
||||
patlen = sizeof(pat) - 1;
|
||||
memcpy(pat, line, patlen);
|
||||
pat[patlen] = '\0';
|
||||
keys = eq + 1;
|
||||
} else {
|
||||
pat[0] = '*';
|
||||
pat[1] = '\0';
|
||||
keys = line;
|
||||
}
|
||||
if (strjoker(url, pat, NULL, NULL) != NULL) {
|
||||
size_t klen = (size_t) (eol - keys);
|
||||
|
||||
if (klen >= destsize)
|
||||
klen = destsize - 1;
|
||||
memcpy(dest, keys, klen);
|
||||
dest[klen] = '\0';
|
||||
result = dest;
|
||||
}
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
#define endwith(a) ( (len >= (sizeof(a)-1)) ? ( strncmp(dest, a+len-(sizeof(a)-1), sizeof(a)-1) == 0 ) : 0 );
|
||||
HTSEXT_API char *adr_normalized_sized(const char *source, char *dest,
|
||||
size_t destsize) {
|
||||
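The core of `fil_normalized_filtered()` above is a walk over the query fields that re-emits only the arguments whose key is not in the strip list. A reduced sketch of that walk is given here. It leaves out the final `fil_normalized()` pass (so nothing is sorted or otherwise normalized) and does not trim spaces around tokens; `key_listed()` and `strip_query()` are hypothetical names, and an explicit output size replaces HTTrack's `htsbuff`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Is key[0..keylen) in the comma-separated `strip` list? "*" matches all. */
static int key_listed(const char *key, size_t keylen, const char *strip) {
  const char *p = strip;

  while (*p != '\0') {
    const char *const start = p;
    size_t toklen;

    while (*p != '\0' && *p != ',')
      p++;
    toklen = (size_t) (p - start);
    if (toklen == 1 && start[0] == '*')
      return 1;
    if (toklen == keylen && strncmp(start, key, keylen) == 0)
      return 1;
    if (*p == ',')
      p++;
  }
  return 0;
}

/* Hypothetical helper: copy `url` to `out`, dropping listed query args.
   Returns 0 on success, -1 when `out` is too small. */
static int strip_query(const char *url, const char *strip, char *out,
                       size_t outsize) {
  const char *q;
  size_t n;
  int wrote = 0;

  assert(url != NULL && strip != NULL && out != NULL);
  q = strchr(url, '?');
  if (q == NULL)
    q = url + strlen(url);
  n = (size_t) (q - url);
  if (n >= outsize)
    return -1;
  memcpy(out, url, n); /* path part, unchanged */
  while (*q != '\0') { /* q sits on the '?' or '&' before the next field */
    const char *const arg = ++q;
    const char *eq = NULL;
    size_t arglen, keylen;

    while (*q != '\0' && *q != '&') {
      if (eq == NULL && *q == '=')
        eq = q;
      q++;
    }
    arglen = (size_t) (q - arg);
    keylen = eq != NULL ? (size_t) (eq - arg) : arglen;
    if (!key_listed(arg, keylen, strip)) {
      if (n + 1 + arglen >= outsize)
        return -1;
      out[n++] = wrote ? '&' : '?';
      memcpy(out + n, arg, arglen);
      n += arglen;
      wrote = 1;
    }
  }
  out[n] = '\0';
  return 0;
}
```

Keys are compared whole and case-sensitively, so stripping `utm` leaves `utm_x=1` alone. Empty fields are kept, as in the patch, so that running the function on its own output changes nothing.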
@@ -5891,6 +6027,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
|
||||
opt->sizehack = HTS_FALSE;
|
||||
opt->urlhack = HTS_TRUE;
|
||||
StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
|
||||
StringCopy(opt->strip_query, "");
|
||||
opt->ftp_proxy = HTS_TRUE;
|
||||
opt->convert_utf8 = HTS_TRUE;
|
||||
StringCopy(opt->filelist, "");
|
||||
@@ -6035,6 +6172,7 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
|
||||
StringFree(opt->urllist);
|
||||
StringFree(opt->footer);
|
||||
StringFree(opt->mod_blacklist);
|
||||
StringFree(opt->strip_query);
|
||||
|
||||
StringFree(opt->path_html);
|
||||
StringFree(opt->path_html_utf8);
|
||||
|
||||
@@ -198,6 +198,13 @@ int url_savename(lien_adrfilsave *const afs,
|
||||
// copy of fil, used for lookups (see urlhack)
|
||||
const char *normadr = adr;
|
||||
const char *normfil = fil_complete;
|
||||
/* query keys to strip for this URL (NULL = none); decoupled from urlhack */
|
||||
char BIGSTK stripkeys[HTS_URLMAXSIZE];
|
||||
const char *const strip =
|
||||
StringNotEmpty(opt->strip_query)
|
||||
? hts_query_strip_keys(StringBuff(opt->strip_query), adr,
|
||||
fil_complete, stripkeys, sizeof(stripkeys))
|
||||
: NULL;
|
||||
const char *const print_adr = jump_protocol_const(adr);
|
||||
const char *start_pos = NULL, *nom_pos = NULL, *dot_pos = NULL; // Position nom et point
|
||||
|
||||
@@ -232,7 +239,7 @@ int url_savename(lien_adrfilsave *const afs,
|
||||
if (opt->urlhack) {
|
||||
// copy of adr (without protocol), used for lookups (see urlhack)
|
||||
normadr = adr_normalized_sized(adr, normadr_, sizeof(normadr_));
|
||||
normfil = fil_normalized(fil_complete, normfil_);
|
||||
normfil = fil_normalized_filtered(fil_complete, normfil_, strip);
|
||||
} else {
|
||||
if (link_has_authority(adr_complete)) { // https or other protocols : in "http/" subfolder
|
||||
char *pos = strchr(adr_complete, ':');
|
||||
@@ -245,6 +252,9 @@ int url_savename(lien_adrfilsave *const afs,
|
||||
normadr = normadr_;
|
||||
}
|
||||
}
|
||||
// strip still applies with urlhack off (host left untouched)
|
||||
if (strip != NULL)
|
||||
normfil = fil_normalized_filtered(fil_complete, normfil_, strip);
|
||||
}
|
||||
|
||||
// à afficher sans ftp://
|
||||
|
||||
@@ -529,6 +529,8 @@ struct httrackp {
|
||||
htslibhandles libHandles; /**< loaded external module handles */
|
||||
//
|
||||
htsoptstate state; /**< embedded live engine state */
|
||||
String strip_query; /**< query keys to drop when deduping URLs (-strip-query);
|
||||
appended at the tail to keep field offsets stable */
|
||||
};
|
||||
|
||||
/* Running statistics for a mirror. */
|
||||
|
||||
1244
src/htsselftest.c
Normal file
File diff suppressed because it is too large
52
src/htsselftest.h
Normal file
@@ -0,0 +1,52 @@
/* ------------------------------------------------------------ */
/*
HTTrack Website Copier, Offline Browser for Windows and Unix
Copyright (C) 2026 Xavier Roche and other contributors

SPDX-License-Identifier: GPL-3.0-or-later

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.

Ethical use: we kindly ask that you NOT use this software to harvest email
addresses or to collect any other private information about people. Doing so
would dishonor our work and waste the many hours we have spent on it.

Please visit our Website: http://www.httrack.com
*/

/* ------------------------------------------------------------ */
/* File: htsselftest.h */
/* named dispatch for the hidden engine self-tests */
/* Author: Xavier Roche */
/* ------------------------------------------------------------ */

#ifndef HTSSELFTEST_DEFH
#define HTSSELFTEST_DEFH

#ifdef HTS_INTERNAL_BYTECODE

#ifndef HTS_DEF_FWSTRUCT_httrackp
#define HTS_DEF_FWSTRUCT_httrackp
typedef struct httrackp httrackp;
#endif

/* Run engine self-test `name` over the positional args argv[0..argc-1], or list
   the available tests when name is NULL, empty, or "list". Prints the result;
   returns the process exit code (0 == success). The caller owns option cleanup.
   Reached through the hidden `httrack -#test[=NAME ...]` subcommand. */
int hts_selftest(httrackp *opt, const char *name, int argc, char **argv);

#endif

#endif
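Review note: the contract documented for `hts_selftest` (run a named test, or list the registry when the name is NULL, empty, or "list", and return a process exit code) can be pictured as a name-to-function table. The following is a minimal Python sketch, not the suppressed `htsselftest.c`. The registry contents, the `selftest` and `_echo` names, and the exit code of 0 for a listing are assumptions made for illustration; only the stderr listing and the "Unknown self-test" diagnostic come from the tests in this change.

```python
import sys


def _echo(argv):
    # Hypothetical stand-in test body; the real tests live in htsselftest.c.
    print("echo: OK")
    return 0


# Hypothetical registry: test name -> callable taking the positional arguments.
REGISTRY = {"echo": _echo}


def selftest(name, argv):
    """Run the named test and return a process-style exit code (0 == success)."""
    if name is None or name == "" or name == "list":
        # List the known tests on stderr, as the dispatch test expects.
        for known in sorted(REGISTRY):
            print(known, file=sys.stderr)
        return 0  # assumption: listing on request is not an error
    test = REGISTRY.get(name)
    if test is None:
        # Unknown name: diagnostic plus the registry, and a non-zero exit code.
        print("Unknown self-test: " + name, file=sys.stderr)
        for known in sorted(REGISTRY):
            print(known, file=sys.stderr)
        return 1
    return test(argv)
```

A table of this shape explains why the rewritten tests match an exact success line: a renamed or removed entry still produces output and a non-zero status, but never the success line itself.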
@@ -4,7 +4,7 @@
|
||||
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
|
||||
# tool flags despite the #!/bin/bash above.
|
||||
|
||||
# Golden cache-format regression test (driven by 'httrack -#B <dir>').
|
||||
# Golden cache-format regression test (driven by 'httrack -#test=cache-golden <dir>').
|
||||
#
|
||||
# 01_engine-cache.test writes the cache with the same build it reads back (a
|
||||
# round-trip), so it cannot catch a read-path or ZIP-format regression where
|
||||
@@ -13,7 +13,7 @@
|
||||
# byte-exact.
|
||||
#
|
||||
# Regenerate the fixture after a deliberate format change with
|
||||
# 'httrack -#B <dir> regen', then copy <dir>/hts-cache/new.zip over the
|
||||
# 'httrack -#test=cache-golden <dir> regen', then copy <dir>/hts-cache/new.zip over the
|
||||
# committed file.
|
||||
|
||||
set -eu
|
||||
@@ -37,11 +37,11 @@ trap 'rm -rf "$dir"' EXIT
|
||||
mkdir -p "$dir/hts-cache"
|
||||
cp "$fixture/hts-cache/new.zip" "$dir/hts-cache/new.zip"
|
||||
|
||||
out=$(httrack -#B "$dir")
|
||||
out=$(httrack -#test=cache-golden "$dir")
|
||||
|
||||
# Match the exact success line: the read must have found and verified every
|
||||
# entry, not merely failed to enter the mode (a bad -#B falls through to the
|
||||
# usage screen, which also exits non-zero but never prints this).
|
||||
# entry, not merely failed to enter the mode (a renamed/removed test prints the
|
||||
# registry to stderr, which also exits non-zero but never prints this).
|
||||
test "$out" = "cache-golden: OK" || {
|
||||
echo "expected 'cache-golden: OK', got: $out" >&2
|
||||
exit 1
|
||||
|
||||
@@ -4,20 +4,20 @@
|
||||
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
|
||||
# tool flags despite the #!/bin/bash above.
|
||||
|
||||
# Cache write-failure handling (httrack -#W <dir>). #174/#219.
|
||||
# Cache write-failure handling (httrack -#test=cache-writefail <dir>). #174/#219.
|
||||
# A failing new.zip write (disk full) used to crash the process via assertf; it
|
||||
# must instead stop the mirror with a fatal error (exit_xh=-1), no crash. The
|
||||
# self-test asserts that; reverting the fix makes -#W abort (SIGABRT) and fail.
|
||||
# self-test asserts that; reverting the fix makes -#test=cache-writefail abort (SIGABRT) and fail.
|
||||
|
||||
set -eu
|
||||
|
||||
dir=$(mktemp -d)
|
||||
trap 'rm -rf "$dir"' EXIT
|
||||
|
||||
out=$(httrack -#W "$dir")
|
||||
out=$(httrack -#test=cache-writefail "$dir")
|
||||
|
||||
# Match the exact success line (error logs also go to stdout); a bad -#W falls
|
||||
# through to the usage screen, which exits non-zero but never prints this.
|
||||
# Match the exact success line (error logs also go to stdout); a renamed/removed
|
||||
# test prints the registry to stderr, which exits non-zero but never prints this.
|
||||
printf '%s\n' "$out" | grep -qx "cache-writefail: OK" || {
|
||||
echo "expected 'cache-writefail: OK', got: $out" >&2
|
||||
exit 1
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
# POSIX /bin/sh on some platforms (e.g. macOS), so avoid bashisms and GNU-only
|
||||
# tool flags despite the #!/bin/bash above.
|
||||
|
||||
# Cache create/read/update logic (driven by 'httrack -#A <dir>').
|
||||
# Cache create/read/update logic (driven by 'httrack -#test=cache <dir>').
|
||||
#
|
||||
# The in-process self-test stores several hand-crafted edge entries (normal
|
||||
# HTML, an empty redirect with a near-limit location, a non-HTML body kept via
|
||||
@@ -20,13 +20,13 @@ set -eu
|
||||
dir=$(mktemp -d)
|
||||
trap 'rm -rf "$dir"' EXIT
|
||||
|
||||
# Like the other -# debug modes, a trailing token (the working directory) is
|
||||
# required; a bare '-#A' falls through to the usage screen.
|
||||
out=$(httrack -#A "$dir")
|
||||
# The working directory is a required argument; without it the test prints a
|
||||
# usage line to stderr and returns non-zero.
|
||||
out=$(httrack -#test=cache "$dir")
|
||||
|
||||
# Match the exact success line, so the test cannot pass for an unrelated reason
|
||||
# (e.g. the -#A mode being gone and falling through to the usage screen, which
|
||||
# also exits non-zero but never prints this).
|
||||
# (e.g. the cache test being gone, which prints the registry to stderr but
|
||||
# never prints this line).
|
||||
test "$out" = "cache-selftest: OK" || {
|
||||
echo "expected 'cache-selftest: OK', got: $out" >&2
|
||||
exit 1
|
||||
|
||||
@@ -4,13 +4,13 @@
|
||||
set -euo pipefail
|
||||
|
||||
# charset -> UTF-8 conversion (hts_convertStringToUTF8).
|
||||
# -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8.
|
||||
# -#test=charset <charset> <string> prints the string re-decoded from <charset> as UTF-8.
|
||||
conv() {
|
||||
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
|
||||
test "$(httrack -O /dev/null -#test=charset "$1" "$2")" == "$3" || exit 1
|
||||
}
|
||||
# crash probe: malformed input must exit cleanly, not abort.
|
||||
runs() {
|
||||
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
|
||||
httrack -O /dev/null -#test=charset "$1" "$2" >/dev/null 2>&1 || exit 1
|
||||
}
|
||||
|
||||
# the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9.
|
||||
@@ -31,7 +31,7 @@ conv 'us-ascii' 'hello' 'hello'
|
||||
# unknown charset: ASCII passes through unchanged, but non-ASCII input cannot be
|
||||
# decoded and yields empty output (an error is printed to stderr).
|
||||
conv 'no-such-charset-xyz' 'abc' 'abc'
|
||||
test "$(httrack -O /dev/null -#3 'no-such-charset-xyz' 'café' 2>/dev/null)" == "" || exit 1
|
||||
test "$(httrack -O /dev/null -#test=charset 'no-such-charset-xyz' 'café' 2>/dev/null)" == "" || exit 1
|
||||
|
||||
# malformed UTF-8 (lone continuation byte, truncated lead byte) must not crash
|
||||
runs 'utf-8' $'\x80'
|
||||
|
||||
@@ -1,14 +1,15 @@
|
||||
#!/bin/bash
|
||||
#
|
||||
# Issue #151 guard: the request Cookie header must be bare RFC 6265 name=value
|
||||
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#Q' selftest.
|
||||
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#test=cookies' selftest.
|
||||
|
||||
set -eu
|
||||
|
||||
# A trailing token is required; a bare '-#Q' falls through to the usage screen.
|
||||
out=$(httrack -#Q run)
|
||||
# 'run' is an ignored placeholder argument.
|
||||
out=$(httrack -#test=cookies run)
|
||||
|
||||
# Exact-match the success line so a fall-through to usage can't pass the test.
|
||||
# Exact-match the success line so a renamed/removed test (it prints the registry
|
||||
# to stderr) can't pass.
|
||||
test "$out" = "cookie-header: OK" || {
|
||||
echo "expected 'cookie-header: OK', got: $out" >&2
|
||||
exit 1
|
||||
|
||||
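Review note: the property guarded here (#151) is that the request header carries only bare `name=value` pairs. A minimal Python sketch of that shape follows; `cookie_header` is a hypothetical helper written for illustration and is not httrack's implementation, which is in C.

```python
def cookie_header(pairs):
    """Build a request Cookie header from (name, value) pairs.

    Only bare name=value pairs are emitted, joined by "; ". No $Version or
    $Path attributes are added, which is the property the test guards.
    """
    return "Cookie: " + "; ".join("%s=%s" % (name, value) for name, value in pairs)


header = cookie_header([("sid", "abc"), ("lang", "fr")])
print(header)
```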
@@ -2,15 +2,16 @@
|
||||
#
|
||||
# Regression guard for the unsigned-enum sentinel trap: copy_htsopt's
|
||||
# `if (from->X > -1)` guard is always false for unsigned hts_boolean fields, so
|
||||
# they silently stop being copied. Driven by the in-process 'httrack -#9' test.
|
||||
# they silently stop being copied. Driven by the in-process 'httrack -#test=copyopt' test.
|
||||
# Keep POSIX-portable (harness runs it via $(BASH), a plain /bin/sh on macOS).
|
||||
|
||||
set -eu
|
||||
|
||||
# A trailing token is required; a bare '-#9' falls through to the usage screen.
|
||||
out=$(httrack -#9 run)
|
||||
# 'run' is an ignored placeholder argument.
|
||||
out=$(httrack -#test=copyopt run)
|
||||
|
||||
# Exact-match the success line so a fall-through to usage can't pass the test.
|
||||
# Exact-match the success line so a renamed/removed test (it prints the registry
|
||||
# to stderr) can't pass.
|
||||
test "$out" = "copy-htsopt: OK" || {
|
||||
echo "expected 'copy-htsopt: OK', got: $out" >&2
|
||||
exit 1
|
||||
|
||||
@@ -5,9 +5,8 @@ set -euo pipefail
|
||||
|
||||
# DNS resolver/cache self-test: a mock getaddrinfo (no network) checks address
|
||||
# family, single-address selection, the -@i4/-@i6 family filter, and cache reuse.
|
||||
# The trailing token is required, like the other -# selftests, so a bare command
|
||||
# line isn't treated as "no arguments" and routed to the usage screen.
|
||||
out=$(httrack -#D run)
|
||||
# 'run' is an ignored placeholder argument.
|
||||
out=$(httrack -#test=dns run)
|
||||
|
||||
test "$out" = "dns-selftest: OK" || {
|
||||
echo "expected 'dns-selftest: OK', got: $out" >&2
|
||||
|
||||
@@ -4,13 +4,13 @@
|
||||
set -euo pipefail
|
||||
|
||||
# HTML entity unescaping (hts_unescapeEntitiesWithCharset).
|
||||
# -#6 <string> prints the string with entities decoded (UTF-8 output).
|
||||
# -#test=entities <string> prints the string with entities decoded (UTF-8 output).
|
||||
ent() {
|
||||
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
|
||||
test "$(httrack -O /dev/null -#test=entities "$1")" == "$2" || exit 1
|
||||
}
|
||||
# crash probe: malformed input must exit cleanly, not abort.
|
||||
runs() {
|
||||
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
|
||||
httrack -O /dev/null -#test=entities "$1" >/dev/null 2>&1 || exit 1
|
||||
}
|
||||
|
||||
# named entities
|
||||
|
||||
65
tests/01_engine-filelist.test
Normal file
@@ -0,0 +1,65 @@
#!/bin/bash
#
# -%L URL-list loading (#49): a readable list is honored; an unusable one fails
# with the reason (errno / not-a-regular-file), not a bare "Could not include
# URL list". Offline: file:// fixture, no server. Asserts on httrack's own
# strings and the message shape, so it is locale-independent.

set -euo pipefail

tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_filelist.XXXXXX") || exit 1
trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM

echo '<html><body>hi</body></html>' >"$tmp/index.html"

# run httrack with the given -%L target; structured log lands in $out/hts-log.txt
run() {
    local out="$1" list="$2"
    rm -rf "$out"
    mkdir -p "$out"
    httrack -O "$out" --quiet -n "-%L" "$list" >"$out/.stdout" 2>&1 || true
    LOG="$out/hts-log.txt"
}

fail() {
    echo "FAIL: $1"
    cat "$LOG"
    exit 1
}
loghas() {
    grep -Eq "$1" "$LOG" || fail "expected /$1/ in $LOG"
}
lognot() {
    if grep -Eq "$1" "$LOG"; then fail "unexpected /$1/ in $LOG"; fi
}

# readable list: its one URL is loaded and counted (count must be non-zero)
printf 'file://%s/index.html\n' "$tmp" >"$tmp/urls.txt"
run "$tmp/ok" "$tmp/urls.txt"
loghas '[1-9][0-9]* links added from'

# missing file: quoted name + a non-empty reason, never the old reasonless
# "Could not include URL list: <name>". The reason is the stat() errno, not the
# directory fallback literal (guards against dropping the errno lookup).
run "$tmp/miss" "$tmp/nope.txt"
loghas 'Could not include URL list "[^"]+": .+'
lognot 'Could not include URL list: '
lognot 'not a regular file'

# a directory is rejected with our own reason (locale-independent)
mkdir -p "$tmp/adir"
run "$tmp/dir" "$tmp/adir"
loghas 'Could not include URL list "[^"]+": not a regular file'

# unreadable regular file: the fopen() errno arm fires, distinct from the
# directory branch. Root bypasses mode 000, so skip it there.
if test "$(id -u)" -ne 0; then
    : >"$tmp/noperm.txt"
    chmod 000 "$tmp/noperm.txt"
    run "$tmp/perm" "$tmp/noperm.txt"
    chmod 644 "$tmp/noperm.txt"
    loghas 'Could not include URL list "[^"]+": .+'
    lognot 'not a regular file'
fi

exit 0
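Review note: the three failure arms this test distinguishes (stat() errno, "not a regular file", fopen() errno) can be sketched as below. This is a Python illustration of the message shape the test asserts, under the assumption that the checks run in this order; `url_list_error` is a hypothetical name, and the real logic is C code that this page does not show.

```python
import os
import stat


def url_list_error(path):
    """Return None if `path` is a readable regular file, else a diagnostic."""
    prefix = 'Could not include URL list "%s": ' % path
    try:
        st = os.stat(path)
    except OSError as exc:
        # Missing file and similar: report the stat() errno text.
        return prefix + exc.strerror
    if not stat.S_ISREG(st.st_mode):
        # Directories and other non-files: a fixed, locale-independent reason.
        return prefix + "not a regular file"
    try:
        with open(path, "rb"):
            pass
    except OSError as exc:
        # Unreadable regular file: report the open() errno text.
        return prefix + exc.strerror
    return None
```

Keeping the directory reason as a literal is what lets the shell test assert on it without depending on the locale of the errno strings.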
@@ -4,13 +4,13 @@
|
||||
set -euo pipefail
|
||||
|
||||
# wildcard filter engine (strjoker), the core of +/- include/exclude rules.
|
||||
# -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
|
||||
# -#test=filter <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
|
||||
|
||||
match() {
|
||||
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
|
||||
test "$(httrack -O /dev/null -#test=filter "$1" "$2")" == "$2 does match $1" || exit 1
|
||||
}
|
||||
nomatch() {
|
||||
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
|
||||
test "$(httrack -O /dev/null -#test=filter "$1" "$2")" == "$2 does NOT match $1" || exit 1
|
||||
}
|
||||
|
||||
# bare star matches everything
|
||||
@@ -71,3 +71,27 @@ nomatch '*[\[\]]' '[' # not matched, despite the docs
|
||||
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
|
||||
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
|
||||
nomatch '*[\[\]]' '[]x'
|
||||
|
||||
# Size-based rules (-#test=filtersize <size> <string> <filter...>): a negative size
|
||||
# means the size is still unknown (scan time). A size exclusion must stay neutral
|
||||
# then, so the file is fetched and only cancelled once its size is known (#143).
|
||||
fsize() {
|
||||
local want="$1"
|
||||
shift
|
||||
test "$(httrack -O /dev/null -#test=filtersize "$@")" == "$want" || exit 1
|
||||
}
|
||||
fsize 'verdict=allowed size_flag=0' -1 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # scan time: keep
|
||||
fsize 'verdict=forbidden size_flag=1' 5 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # <10KB: cancel
|
||||
fsize 'verdict=allowed size_flag=1' 20 foo.jpg -* '+*.jpg' '-*.jpg*[<10]' # >=10KB: keep
|
||||
fsize 'verdict=forbidden size_flag=0' -1 foo.txt -* '+*.jpg' '-*.jpg*[<10]' # not a jpg
|
||||
# the '>' operator is just as neutral at scan time, and fires once size is known
|
||||
fsize 'verdict=allowed size_flag=0' -1 foo.jpg -* '+*.jpg' '-*.jpg*[>10]' # scan time: keep
|
||||
fsize 'verdict=forbidden size_flag=1' 20 foo.jpg -* '+*.jpg' '-*.jpg*[>10]' # >10KB: cancel
|
||||
|
||||
# [name]/[file]/[path] never span '?' mid-string; a trailing query is still
|
||||
# tolerated by the global '?' rule (same as plain *.aspx), not the class (#144).
|
||||
nomatch '*[path]/end' 'a?b/end'
|
||||
nomatch '*[file]end' 'foo?xend'
|
||||
nomatch '*[name]X' 'abc?X'
|
||||
match '*[file]' 'foo?x=1' # trailing query: tolerated, as for *.aspx
|
||||
match '*.aspx' 'page.aspx?y=2'
|
||||
|
||||
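Review note: the size-rule cases above (#143) follow one idea: a rule carrying a size test is skipped entirely while the size is unknown (negative), and only takes part in the verdict once the size is known. The Python sketch below reproduces the six expected `verdict`/`size_flag` pairs under stated assumptions: rules are pre-parsed tuples rather than httrack's filter syntax, the last matching rule wins, the size argument and the rule limit share one unit, and `size_flag` means "a size rule whose pattern matched was evaluated with a known size". It uses `fnmatch`, not strjoker, and is not the engine's code.

```python
from fnmatch import fnmatchcase


def evaluate(size, name, rules):
    """Return (verdict, size_flag) for `name` under ordered `rules`.

    Each rule is (sign, glob, size_test); size_test is None or (op, limit)
    with op in {"<", ">"}. A negative size means "not known yet".
    """
    verdict = "unknown"  # hypothetical default when no rule matches
    size_flag = 0
    for sign, glob, size_test in rules:
        if not fnmatchcase(name, glob):
            continue
        if size_test is not None:
            if size < 0:
                continue  # scan time: a size rule stays neutral
            size_flag = 1
            op, limit = size_test
            hit = size < limit if op == "<" else size > limit
            if not hit:
                continue
        verdict = "allowed" if sign == "+" else "forbidden"
    return verdict, size_flag


# Pre-parsed equivalents of: -* '+*.jpg' '-*.jpg*[<10]' and '-*.jpg*[>10]'
RULES_LT = [("-", "*", None), ("+", "*.jpg", None), ("-", "*.jpg*", ("<", 10))]
RULES_GT = [("-", "*", None), ("+", "*.jpg", None), ("-", "*.jpg*", (">", 10))]
```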
@@ -3,5 +3,7 @@
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# httrack internal hashtable autotest on 100K keys
|
||||
httrack -#7 100000
|
||||
# httrack internal hashtable autotest on 100K keys. Assert the success line (on
|
||||
# stderr) so a misrouted registry entry can't pass on exit code alone.
|
||||
out=$(httrack -#test=hashtable 100000 2>&1)
|
||||
printf '%s\n' "$out" | grep -q "all hashtable tests were successful!" || exit 1
|
||||
|
||||
@@ -3,13 +3,13 @@
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# IDNA / punycode encode (-#4) and decode (-#5). This code has a CVE history,
|
||||
# IDNA / punycode encode (-#test=idna-encode) and decode (-#test=idna-decode). This code has a CVE history,
|
||||
# so the edge cases below cover passthrough, round-trips, and malformed input.
|
||||
|
||||
enc() { test "$(httrack -O /dev/null -#4 "$1")" == "$2" || exit 1; }
|
||||
dec() { test "$(httrack -O /dev/null -#5 "$1")" == "$2" || exit 1; }
|
||||
enc() { test "$(httrack -O /dev/null -#test=idna-encode "$1")" == "$2" || exit 1; }
|
||||
dec() { test "$(httrack -O /dev/null -#test=idna-decode "$1")" == "$2" || exit 1; }
|
||||
# crash probe: malformed ACE input must exit cleanly, not abort.
|
||||
runs() { httrack -O /dev/null -#5 "$1" >/dev/null 2>&1 || exit 1; }
|
||||
runs() { httrack -O /dev/null -#test=idna-decode "$1" >/dev/null 2>&1 || exit 1; }
|
||||
|
||||
# encode
|
||||
enc 'www.café.com' 'www.xn--caf-dma.com'
|
||||
|
||||
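Review note: the expected ACE form in the `enc` case above can be cross-checked independently of httrack. The snippet below uses Python's built-in `idna` codec, which is a separate implementation and may differ from httrack's on edge cases; it is only a sanity check of the one pair shown here.

```python
# "café" spelled with an explicit escape so the source encoding cannot matter.
host = "www.caf\u00e9.com"

# Encode each label to its ASCII-compatible (punycode) form, then decode back.
encoded = host.encode("idna").decode("ascii")
decoded = encoded.encode("ascii").decode("idna")

print(encoded)
print(decoded == host)
```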
@@ -4,13 +4,13 @@
|
||||
set -euo pipefail
|
||||
|
||||
# MIME type guessing from extension (get_httptype / give_mimext).
|
||||
# -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
|
||||
# -#test=mime <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
|
||||
|
||||
mime() {
|
||||
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
|
||||
test "$(httrack -O /dev/null -#test=mime "$1" | head -1)" == "$1 is '$2'" || exit 1
|
||||
}
|
||||
unknown() {
|
||||
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
|
||||
test "$(httrack -O /dev/null -#test=mime "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
|
||||
}
|
||||
|
||||
mime '/a/b.html' 'text/html'
|
||||
|
||||
@@ -8,7 +8,7 @@ set -euo pipefail
|
||||
# relative path from <curr>'s directory to <link>
|
||||
rel() {
|
||||
local got
|
||||
got=$(httrack -O /dev/null -#l "$1" "$2")
|
||||
got=$(httrack -O /dev/null -#test=relative "$1" "$2")
|
||||
test "$got" == "relative=$3" ||
|
||||
{
|
||||
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
|
||||
@@ -19,7 +19,7 @@ rel() {
|
||||
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
|
||||
ident() {
|
||||
local got
|
||||
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
|
||||
got=$(httrack -O /dev/null -#test=resolve "$1" "$2" "$3")
|
||||
test "$got" == "$4" ||
|
||||
{
|
||||
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"
|
||||
|
||||
@@ -3,11 +3,11 @@
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# Local save-name extension resolution (url_savename via -#N <fil> <content-type>).
|
||||
# Local save-name extension resolution (url_savename via -#test=savename <fil> <content-type>).
|
||||
# Asserts on the basename of "savename: <path>".
|
||||
|
||||
name() {
|
||||
out="$(httrack -O /dev/null -#N "$1" "$2" | sed -n 's/^savename: //p')"
|
||||
out="$(httrack -O /dev/null -#test=savename "$1" "$2" | sed -n 's/^savename: //p')"
|
||||
test "${out##*/}" == "$3" || {
|
||||
echo "FAIL: '$1' '$2' -> '$out' (want '$3')"
|
||||
exit 1
|
||||
|
||||
17
tests/01_engine-selftest-dispatch.test
Normal file
@@ -0,0 +1,17 @@
#!/bin/bash
#
# The -#test dispatch itself: a bare -#test lists the registry, and an unknown
# name errors (non-zero, diagnostic) instead of silently passing.

set -eu

# Bare -#test lists known tests (printed to stderr).
list=$(httrack -#test 2>&1)
printf '%s\n' "$list" | grep -q "filter" || exit 1
printf '%s\n' "$list" | grep -q "cache-writefail" || exit 1

# Unknown name: non-zero exit + diagnostic, and no test result line.
rc=0
err=$(httrack -#test=bogus 2>&1) || rc=$?
test "$rc" -ne 0 || exit 1
printf '%s\n' "$err" | grep -q "Unknown self-test" || exit 1
@@ -5,7 +5,7 @@ set -euo pipefail
|
||||
|
||||
# path simplify engine (fil_simplifie): collapses ./ and ../ segments.
|
||||
simp() {
|
||||
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
|
||||
test "$(httrack -O /dev/null -#test=simplify "$1")" == "simplified=$2" || exit 1
|
||||
}
|
||||
|
||||
simp './foo/bar/' 'foo/bar/'
|
||||
|
||||
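Review note: for readers unfamiliar with what "collapses ./ and ../ segments" means, here is a small Python sketch of that kind of simplification. It is not `fil_simplifie`: only the single pair visible in this hunk (`./foo/bar/` to `foo/bar/`) is taken from the test, and the handling of `..` (pop the previous segment, drop it when nothing is left to pop) is an assumption.

```python
def simplify(path):
    """Collapse "." and ".." path segments (illustrative only)."""
    out = []
    for part in path.split("/"):
        if part == ".":
            continue  # "./" contributes nothing
        if part == "..":
            # Assumption: step back one segment when there is one to drop.
            if out and out[-1] != "":
                out.pop()
            continue
        out.append(part)
    return "/".join(out)


print("simplified=" + simplify("./foo/bar/"))
```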
8
tests/01_engine-stripquery.test
Executable file
@@ -0,0 +1,8 @@
#!/bin/bash
#

set -euo pipefail

# --strip-query: pattern-scoped query-key stripping for dedup. All assertions
# live in the engine self-test (hts_query_strip_keys + fil_normalized_filtered).
httrack -O /dev/null -#test=stripquery | grep -q "strip-query self-test OK"
@@ -3,23 +3,22 @@
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# htssafe.h bounded string operations (driven by 'httrack -#8').
|
||||
# htssafe.h bounded string operations (driven by 'httrack -#test=strsafe').
|
||||
|
||||
# Success path: every bounded op (strcpybuff/strcatbuff/strncatbuff/strlcpybuff)
|
||||
# must behave correctly. Like the other -# debug modes, a trailing token is
|
||||
# required (a bare '-#8' falls through to the usage screen).
|
||||
# must behave correctly. 'run' selects the success path (vs the overflow modes).
|
||||
rc=0
|
||||
out=$(httrack -#8 run) || rc=$?
|
||||
out=$(httrack -#test=strsafe run) || rc=$?
|
||||
test "$rc" -eq 0 || exit 1
|
||||
test "$out" == "strsafe: OK" || exit 1
|
||||
|
||||
# Overflow path: an over-capacity write into a sized buffer must be caught by
|
||||
# the bounded macro and abort the process, not be silently truncated/completed.
|
||||
# Assert the htssafe abort signature specifically, so the test cannot pass for
|
||||
# an unrelated reason (e.g. the -#8 mode being gone and falling through to the
|
||||
# usage screen, which also exits non-zero).
|
||||
# an unrelated reason (e.g. the strsafe test being gone, which prints the
|
||||
# registry to stderr and also exits non-zero).
|
||||
# the bounded macro aborts (non-zero exit), so don't let set -e trip on it
|
||||
err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
|
||||
err=$(httrack -#test=strsafe overflow "this string is far too long for the buffer" 2>&1) || true
|
||||
case "$err" in
|
||||
*"strsafe: NOT aborted"*)
|
||||
echo "over-capacity write was NOT caught" >&2
|
||||
@@ -36,7 +35,7 @@ esac
|
||||
# capacity (4 bytes into a 4-byte buffer), so this also pins the boundary: a
|
||||
# '<=' off-by-one in the capacity check would let it through (and print "NOT
|
||||
# aborted"). Match the specific htsbuff abort message, not just any assert.
|
||||
err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
|
||||
err=$(httrack -#test=strsafe overflow-buff "abcd" 2>&1) || true
|
||||
case "$err" in
|
||||
*"strsafe: NOT aborted"*)
|
||||
echo "htsbuff over-capacity write was NOT caught" >&2
|
||||
|
||||
109
tests/24_local-resume-overlap.test
Normal file
@@ -0,0 +1,109 @@
#!/bin/bash
# Issue #198: on a resumed download the server may answer the Range with a 206
# that starts *before* the offset we asked for (block-aligned ranges). httrack
# must honor the returned Content-Range, not blindly append, or the overlap
# bytes get duplicated and the file grows (corrupt PDFs). Pass 1 interrupts
# flaky.bin mid-body (partial + temp-ref); pass 2 resumes against a 206 that
# backs up 8 bytes. The result must equal the same bytes fetched whole (full.bin).
set -eu

: "${top_srcdir:=..}"
testdir=$(cd "$(dirname "$0")" && pwd)
server="${testdir}/local-server.py"

command -v python3 >/dev/null || ! echo "python3 not found; skipping" || exit 77

tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/httrack_198.XXXXXX") || exit 1
serverpid=
crawlpid=
cleanup() {
    if test -n "$crawlpid"; then kill -9 "$crawlpid" 2>/dev/null || true; fi
    if test -n "$serverpid"; then
        kill "$serverpid" 2>/dev/null || true
        wait "$serverpid" 2>/dev/null || true
    fi
    rm -rf "$tmpdir"
}
trap cleanup EXIT HUP INT QUIT PIPE TERM

# OVERLAP_COUNTER gets a byte per flaky.bin request so pass 1 knows when to interrupt.
serverlog="${tmpdir}/server.log"
counter="${tmpdir}/hits"
resumed="${tmpdir}/resumed" # gets a byte when the server serves a resume 206
OVERLAP_COUNTER="$counter" OVERLAP_RESUMED="$resumed" \
    python3 "$server" --root "${testdir}/server-root" \
    >"$serverlog" 2>&1 &
serverpid=$!
port=
for _ in $(seq 1 50); do
    line=$(head -n1 "$serverlog" 2>/dev/null)
    if test "${line%% *}" == "PORT"; then
        port="${line#PORT }"
        break
    fi
    kill -0 "$serverpid" 2>/dev/null || {
        echo "server exited early: $(cat "$serverlog")"
        exit 1
    }
    sleep 0.1
done
test -n "$port" || {
    echo "could not discover server port"
    exit 1
}
base="http://127.0.0.1:${port}"

which httrack >/dev/null || {
    echo "could not find httrack"
    exit 1
}
out="${tmpdir}/crawl"
common=(-O "$out" --quiet --disable-security-limits --robots=0 --timeout=30 --retries=0 -c1)
refdir="${out}/hts-cache/ref"

# pass 1: interrupt once flaky.bin's prefix is streaming (partial + temp-ref).
printf '[pass 1: interrupt flaky.bin] ..\t'
httrack "${common[@]}" "${base}/overlap/index.html" >"${tmpdir}/log1" 2>&1 &
crawlpid=$!
for _ in $(seq 1 300); do
    test -s "$counter" && break
    kill -0 "$crawlpid" 2>/dev/null || break
    sleep 0.1
done
sleep 0.5
kill -TERM "$crawlpid" 2>/dev/null || true
wait "$crawlpid" 2>/dev/null || true
crawlpid=
test -n "$(find "$refdir" -name '*.ref' 2>/dev/null)" || {
    echo "FAIL: no temp-ref survived pass 1; cannot drive the resume"
    exit 1
}
echo "OK (temp-ref present)"

# pass 2: --continue -> resume Range -> 206 that starts 8 bytes early.
printf '[pass 2: resume flaky.bin] ..\t'
httrack "${common[@]}" --continue "${base}/overlap/index.html" >"${tmpdir}/log2" 2>&1 || true
echo "OK"

# Guard against a silent full re-download: the byte-compare below only tests the
# fix if pass 2 actually went through the resume Range -> 206 path.
printf '[resume path was exercised] ..\t'
if ! test -s "$resumed"; then
    echo "FAIL: pass 2 never triggered a resume 206; the overlap fix was not exercised"
    exit 1
fi
echo "OK"

printf '[resumed file is not corrupted] ..\t'
dir=$(find "$out" -maxdepth 1 -type d -name '127.0.0.1*' | head -1)
flaky="${dir}/overlap/flaky.bin"
full="${dir}/overlap/full.bin"
if ! test -f "$flaky" || ! test -f "$full"; then
    echo "FAIL: flaky.bin or full.bin missing after pass 2"
    exit 1
fi
if ! cmp -s "$flaky" "$full"; then
    echo "FAIL: resumed flaky.bin ($(wc -c <"$flaky")) != full.bin ($(wc -c <"$full")); overlap duplicated"
    exit 1
fi
echo "OK ($(wc -c <"$flaky") bytes, byte-identical)"
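Review note: the invariant this test pins down (#198) is that the resumed body is written at the offset the server reports in `Content-Range`, not appended at the end of the partial file. A minimal Python sketch of that rule follows, reusing the fixture's blob formula, 4000-byte prefix, and 8-byte overlap. `parse_content_range` and `merge_resumed` are hypothetical helpers written for illustration; httrack's actual fix is C code that this page does not show.

```python
import re


def parse_content_range(value):
    """Parse 'bytes START-END/TOTAL' and return (start, end), both inclusive."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if m is None:
        raise ValueError("unsupported Content-Range: %r" % value)
    return int(m.group(1)), int(m.group(2))


def merge_resumed(partial, content_range, body):
    """Place `body` at the offset the server reported, not at len(partial)."""
    start, end = parse_content_range(content_range)
    if start > len(partial):
        raise ValueError("206 would leave a gap after the partial data")
    if end - start + 1 != len(body):
        raise ValueError("body length does not match Content-Range")
    return partial[:start] + body


# Same bytes as OVERLAP_BLOB in local-server.py: 9-byte header + 8000 bytes.
BLOB = b"%PDF-1.4\n" + bytes((i * 37 + 11) % 256 for i in range(8000))
partial = BLOB[:4000]          # what pass 1 left on disk
start = len(partial) - 8       # the server backs up 8 bytes
header = "bytes %d-%d/%d" % (start, len(BLOB) - 1, len(BLOB))

merged = merge_resumed(partial, header, BLOB[start:])
naive = partial + BLOB[start:]  # blind append: duplicates the overlap
```

The blind append ends up 8 bytes longer than the original, which is the file growth the `cmp -s` check in the test is designed to catch.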
16
tests/25_local-mime-exclude.test
Executable file
@@ -0,0 +1,16 @@
#!/bin/bash
#
# A -mime: exclusion must abort the transfer on the response Content-Type, not
# fetch the whole 1 MB body then discard it (#58). The bytes-received guard is
# the real one: the file is absent either way, but only the fix keeps the count
# tiny (header only) instead of pulling the body. Match it positively (a small,
# <=4-digit count) so a vanished/reworded summary line fails rather than passes.

: "${top_srcdir:=..}"

bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
    --found 'mimex/real.html' \
    --not-found 'mimex/blob.pdf' \
    --log-found 'excluded by MIME type filter' \
    --log-found '\[[0-9]{1,4} bytes received' \
    httrack 'BASEURL/mimex/index.html' '-mime:application/pdf'
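Review note: the reason the bytes-received guard is the meaningful one (#58) is easier to see in a toy model: decide on the `Content-Type` as soon as the header block ends, and never touch the body of an excluded type. The Python sketch below simulates that over an in-memory stream. `fetch` is a hypothetical helper and not httrack's transfer loop; the response layout is an assumption kept minimal on purpose.

```python
import io


def fetch(stream, excluded_types):
    """Read one HTTP response; return (bytes_received, body or None).

    The body is skipped entirely when the Content-Type is excluded.
    """
    received = 0
    headers = {}
    while True:
        line = stream.readline()
        received += len(line)
        if line in (b"\r\n", b""):
            break  # blank line ends the header block
        if b":" in line:
            key, value = line.split(b":", 1)
            headers[key.strip().lower().decode()] = value.strip().decode()
    ctype = headers.get("content-type", "").split(";")[0].strip()
    if ctype in excluded_types:
        return received, None  # abort: only the header was consumed
    body = stream.read()
    return received + len(body), body


# Same shape as the fixture: a tiny PDF marker followed by 1 MB of padding.
BODY = b"%PDF-1.4\n" + b"\x00" * (1024 * 1024)
RESPONSE = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n" + BODY

received, body = fetch(io.BytesIO(RESPONSE), {"application/pdf"})
```

With the exclusion in place the count stays at the size of the header block, which is why the test matches a count of at most four digits.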
18
tests/26_local-strip-query.test
Executable file
@@ -0,0 +1,18 @@
#!/bin/bash
#
# End-to-end --strip-query (#112): two links to one resource differing only by
# ?utm_source dedup to a single saved file (2 files written: index + resource);
# the control crawl without the option keeps both variants (3 files). Locks the
# CLI->opt->hash plumbing the engine self-test can't reach.

set -e

: "${top_srcdir:=..}"

# stripped: the two ?utm_source variants collapse to one resource
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 2 \
    httrack 'BASEURL/stripquery/index.html' --strip-query 'utm_source'

# control: no stripping -> both query-named variants are saved
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
    httrack 'BASEURL/stripquery/index.html'
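Review note: the dedup effect exercised by the two crawls (#112) can be illustrated with a few lines of Python. This sketch drops the named query keys before URLs are compared; it ignores the pattern scoping mentioned in the engine test, re-encodes the remaining query with `urlencode`, and uses a hypothetical `strip_query_keys` helper, so it illustrates the idea rather than `hts_query_strip_keys` itself.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_keys(url, keys):
    """Return `url` with the given query keys removed (for dedup comparison)."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in keys
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two links that differ only by the tracking key collapse to one dedup key.
variants = [
    "http://example.com/a.html?utm_source=mail",
    "http://example.com/a.html?utm_source=feed",
]
unique = {strip_query_keys(u, {"utm_source"}) for u in variants}
print(len(unique))
```

Without the stripping step the two variants stay distinct, which matches the control crawl's three saved files against the stripped crawl's two.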
@@ -5,6 +5,7 @@ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
|
||||
proxy-https-server.py \
|
||||
local-crawl.sh local-server.py server.crt server.key \
|
||||
server-root/simple/basic.html server-root/simple/link.html \
|
||||
server-root/stripquery/index.html server-root/stripquery/a.html \
|
||||
fixtures/cache-golden/hts-cache/new.zip
|
||||
|
||||
TESTS_ENVIRONMENT =
|
||||
@@ -34,6 +35,7 @@ TESTS = \
|
||||
01_engine-dns.test \
|
||||
01_engine-doitlog.test \
|
||||
01_engine-entities.test \
|
||||
01_engine-filelist.test \
|
||||
01_engine-filter.test \
|
||||
01_engine-hashtable.test \
|
||||
01_engine-idna.test \
|
||||
@@ -42,7 +44,9 @@ TESTS = \
|
||||
01_engine-rcfile.test \
|
||||
01_engine-relative.test \
|
||||
01_engine-savename.test \
|
||||
01_engine-selftest-dispatch.test \
|
||||
01_engine-simplify.test \
|
||||
01_engine-stripquery.test \
|
||||
01_engine-strsafe.test \
|
||||
02_manpage-regen.test \
|
||||
02_update-cache.test \
|
||||
@@ -64,6 +68,9 @@ TESTS = \
|
||||
20_local-resume-loop.test \
|
||||
21_local-intl-update.test \
|
||||
22_local-broken-size.test \
|
||||
23_local-errpage.test
|
||||
23_local-errpage.test \
|
||||
24_local-resume-overlap.test \
|
||||
25_local-mime-exclude.test \
|
||||
26_local-strip-query.test
|
||||
|
||||
CLEANFILES = check-network_sh.cache
|
||||
|
||||
@@ -177,6 +177,24 @@ class Handler(SimpleHTTPRequestHandler):
        body, ctype = self.TYPE_MATRIX[path]
        self.send_raw(body, ctype)

    # --- MIME-type exclusion abort (issue #58) -----------------------------
    # A -mime:application/pdf filter must abort the transfer once the header
    # arrives, not download the whole body and discard it.
    def route_mimex_index(self):
        self.send_html(
            '\t<a href="blob.pdf">pdf</a>\n' '\t<a href="real.html">real</a>\n'
        )

    # 1 MB body: the fix aborts after the header, so httrack's "bytes received"
    # stays tiny; without it the engine reads the body and the count jumps.
    MIMEX_BLOB = b"%PDF-1.4\n" + b"\x00" * (1024 * 1024)

    def route_mimex_blob(self):
        self.send_raw(self.MIMEX_BLOB, "application/pdf")

    def route_mimex_real(self):
        self.send_raw(b"<html><body>real</body></html>", "text/html")

    # --- special chars in URLs across an update (issue #157) ---------------
    # A dotless, accented basename served as text/html (MediaWiki style). The
    # name the first crawl picks (.html) must survive the update pass.
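The MIME-exclusion routes above serve a 1 MB `application/pdf` body so that a test can tell whether the client stopped after the headers. As a rough illustration of that idea (not httrack's C implementation), the sketch below parses a response from any file-like stream and declines to read the body when the `Content-Type` is excluded. The function name `read_unless_excluded` and the `excluded_types` parameter are hypothetical.

```python
import io


def read_unless_excluded(stream, excluded_types):
    """Parse an HTTP response from `stream`; skip the body if its type is excluded.

    Returns (content_type, body); body is None for an excluded type.
    Hypothetical helper for illustration only.
    """
    stream.readline()  # status line, e.g. b"HTTP/1.1 200 OK\r\n"
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # blank line (or EOF): end of the header block
        name, _, value = line.decode("latin-1").partition(":")
        headers[name.strip().lower()] = value.strip()
    ctype = headers.get("content-type", "").split(";")[0].strip().lower()
    if ctype in excluded_types:
        # The caller would close the connection here; the body is never read.
        return ctype, None
    length = int(headers.get("content-length", "0"))
    return ctype, stream.read(length)


# Usage: the PDF body stays unread in the stream, the HTML body is returned.
pdf = io.BytesIO(
    b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n"
    b"Content-Length: 5\r\n\r\nABCDE"
)
pdf_result = read_unless_excluded(pdf, {"application/pdf"})
pdf_unread = pdf.read()
```

Deciding on the headers alone is what keeps the received byte count small: the only bytes consumed for the excluded resource are the status line and header block.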
@@ -225,6 +243,71 @@ class Handler(SimpleHTTPRequestHandler):
        self.send_header("Content-Length", "0")
        self.end_headers()

    # 206 resume must honor the server's Content-Range, not the offset we asked
    # for (#198): a server resuming a few bytes *before* the request must not
    # leave httrack duplicating the overlap onto the partial. flaky.bin
    # interrupts once then resumes OVERLAP_EARLY bytes early; full.bin serves
    # the identical bytes in one shot, so the test can compare the two.
    OVERLAP_BLOB = b"%PDF-1.4\n" + bytes((i * 37 + 11) % 256 for i in range(8000))
    OVERLAP_EARLY = 8
    OVERLAP_PREFIX_LEN = 4000  # flushed before the stall
    _overlap_started = False

    def route_overlap_index(self):
        self.send_html('\t<a href="flaky.bin">flaky</a>\n\t<a href="full.bin">full</a>')

    def route_overlap_full(self):
        self.send_raw(self.OVERLAP_BLOB, "application/octet-stream")

    def route_overlap(self):
        counter = os.environ.get("OVERLAP_COUNTER")
        if counter:
            with open(counter, "a") as fp:
                fp.write("x")
        blob = self.OVERLAP_BLOB
        rng = self.headers.get("Range")
        # First GET: stream a prefix then stall, so the crawl can be interrupted
        # mid-body (partial + temp-ref on disk).
        if rng is None and not Handler._overlap_started:
            Handler._overlap_started = True
            self.send_response(200)
            self.send_header("Content-Type", "application/octet-stream")
            self.send_header("Content-Length", str(len(blob)))
            self.send_header("Accept-Ranges", "bytes")
            self.end_headers()
            if self.command != "HEAD":
                self.wfile.write(blob[: self.OVERLAP_PREFIX_LEN])
                self.wfile.flush()
                try:
                    while True:
                        time.sleep(3600)
                except OSError:
                    pass
            return
        if rng is None:  # no resume request: serve the whole file
            return self.route_overlap_full()
        # Resume: honor the Range, but back up OVERLAP_EARLY bytes.
        start = (
            int(rng[len("bytes=") :].split("-")[0]) if rng.startswith("bytes=") else 0
        )
        start = max(0, start - self.OVERLAP_EARLY)
        # Signal that the resume Range -> 206 path actually fired, so the test
        # can prove it was exercised (not a silent full re-download).
        resumed = os.environ.get("OVERLAP_RESUMED")
        if resumed:
            with open(resumed, "a") as fp:
                fp.write("x")
        part = blob[start:]
        self.send_response(206, "Partial Content")
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(part)))
        self.send_header(
            "Content-Range", "bytes %d-%d/%d" % (start, len(blob) - 1, len(blob))
        )
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(part)

    # error pages / 0-byte files (#17): -o0 ("no error pages") must keep 4xx/5xx
    # bodies off disk; a genuine 0-byte 200 is a valid file and stays.
    def route_errpage_index(self):
@@ -281,12 +364,18 @@ class Handler(SimpleHTTPRequestHandler):
        "/intl/" + INTL_NAME: route_intl_page,
        "/resume/index.html": route_resume_index,
        "/resume/blob.txt": route_resume,
        "/overlap/index.html": route_overlap_index,
        "/overlap/flaky.bin": route_overlap,
        "/overlap/full.bin": route_overlap_full,
        "/size/index.html": route_size_index,
        "/size/oversize.bin": route_size_oversize,
        "/errpage/index.html": route_errpage_index,
        "/errpage/good.html": route_errpage_good,
        "/errpage/missing.html": route_errpage_missing,
        "/errpage/empty.html": route_errpage_empty,
        "/mimex/index.html": route_mimex_index,
        "/mimex/blob.pdf": route_mimex_blob,
        "/mimex/real.html": route_mimex_real,
    }

    # --- dispatch ----------------------------------------------------------

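The `/overlap/` routes above answer a resume request with a 206 whose `Content-Range` starts `OVERLAP_EARLY` bytes before the requested offset. A minimal sketch of how a client could reconcile that reply with its partial file follows, assuming the partial is held in memory. `parse_content_range` and `merge_resumed` are hypothetical names, and this is not the engine's actual code.

```python
import re

# "bytes <start>-<end>/<total>", where total may be "*" when unknown.
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value):
    """Return (start, end, total) from a Content-Range value; total may be None."""
    m = _CONTENT_RANGE.match(value.strip())
    if m is None:
        raise ValueError("unsupported Content-Range: %r" % value)
    total = None if m.group(3) == "*" else int(m.group(3))
    return int(m.group(1)), int(m.group(2)), total


def merge_resumed(partial, content_range, body):
    """Place `body` at the offset the server reported, not the one requested."""
    start, end, _total = parse_content_range(content_range)
    if start > len(partial):
        raise ValueError("server resumed past the partial; bytes would be missing")
    if len(body) != end - start + 1:
        raise ValueError("body length does not match Content-Range")
    # Drop the overlapping tail of the partial, then append the server's bytes.
    return partial[:start] + body


# Same shape as the test server: 4000 bytes on disk, server backs up 8 bytes.
blob = b"%PDF-1.4\n" + bytes((i * 37 + 11) % 256 for i in range(8000))
partial = blob[:4000]
start = 4000 - 8
part = blob[start:]
header = "bytes %d-%d/%d" % (start, len(blob) - 1, len(blob))
merged = merge_resumed(partial, header, part)
naive = partial + part  # appending at the requested offset duplicates the overlap
```

Appending at the requested offset yields a file 8 bytes too long, whereas writing at the reported start reproduces the original bytes. That difference is what comparing `flaky.bin` against `full.bin` is meant to expose.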
1
tests/server-root/stripquery/a.html
Normal file
@@ -0,0 +1 @@
<html><body>resource A</body></html>
5
tests/server-root/stripquery/index.html
Normal file
@@ -0,0 +1,5 @@
<html><body>
Two links to one resource, differing only by a tracking parameter.
<a href="a.html?utm_source=x">x</a>
<a href="a.html?utm_source=y">y</a>
</body></html>
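The fixture above links to one resource twice, with only the `utm_source` value differing. As a hedged sketch of the general idea behind query stripping, the helper below drops named query keys so that both links collapse to the same URL. The name `strip_query_keys` and the choice of keys are assumptions for illustration; how httrack selects and strips parameters is not shown in this diff.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_keys(url, keys):
    """Return `url` with every query parameter named in `keys` removed.

    Hypothetical helper; other parameters keep their original order.
    """
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in keys
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Both fixture links reduce to the same key once the tracking parameter is gone.
links = ["a.html?utm_source=x", "a.html?utm_source=y"]
canonical = {strip_query_keys(link, {"utm_source"}) for link in links}
```

With the links reduced to a single canonical form, a crawler that keys its download queue on that form would fetch and store the resource once rather than twice.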