Compare commits

3 Commits

Author SHA1 Message Date
Xavier Roche
0bea390973 Add HTTRACK_DEBUG_RESOLVE and a deterministic connect-fallback test
Exercising the connect fallback needs a host that resolves to a dead address
first and a live one next, deterministically and offline. A true SYN
black-hole can't be simulated without root, but a refused address can.

HTTRACK_DEBUG_RESOLVE="host:ip[,ip...]" pins a host's resolution to a fixed
address list (curl --resolve style), reusing the PR1 resolver seam: an
addrinfo backend that synthesizes the listed addresses for the named host and
delegates other hosts to libc (copying into its own allocations so one
freeaddrinfo frees both). It is a debug/test hook, inert unless the env var is
set, and IPv6-build-only like the rest of the resolver list.
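
A minimal Python sketch of how a "host:ip[,ip...]" pin could be parsed. It assumes the host ends at the first colon, so IPv6 literals in the address list keep their own colons. The name parse_resolve_pin is hypothetical; the real parser is part of the C resolver backend and may behave differently.

```python
import ipaddress


def parse_resolve_pin(spec):
    """Split 'host:ip[,ip...]' into (host, [ip, ...]), preserving list order.

    Hypothetical helper: assumes the host ends at the FIRST colon, so IPv6
    literals in the address list keep their own colons.
    """
    host, sep, addr_list = spec.partition(":")
    if not sep or not host or not addr_list:
        raise ValueError("expected host:ip[,ip...], got %r" % spec)
    addresses = []
    for item in addr_list.split(","):
        # ip_address() raises ValueError on anything that is not a literal IP.
        addresses.append(str(ipaddress.ip_address(item.strip())))
    return host, addresses


if __name__ == "__main__":
    print(parse_resolve_pin("example.test:127.0.0.2,127.0.0.1"))
```

The order of the list matters: the first address is tried first, which is what lets a test put a dead address ahead of a live one.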

The new local crawl test binds the server to 127.0.0.1 and resolves a host to
127.0.0.2 (refused) then 127.0.0.1: the mirror only succeeds via the fallback.
A V6_SUPPORT substitution (mirroring HTTPS_SUPPORT) lets it skip on non-IPv6
builds.
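
The test scenario above (a refused address first, a live one second) can be sketched in Python as follows. This is an illustration only, not the test script itself: connect_first is a hypothetical helper, and the demo assumes 127.0.0.2 refuses connections, which holds on Linux loopback but is not guaranteed elsewhere.

```python
import socket


def connect_first(candidates, timeout=2.0):
    """Try (ip, port) candidates in order; return (index, connected socket).

    A refused or timed-out candidate is skipped and the next one is tried.
    Raises OSError when every candidate fails.
    """
    last_error = None
    for index, (ip, port) in enumerate(candidates):
        try:
            return index, socket.create_connection((ip, port), timeout=timeout)
        except OSError as exc:  # covers ConnectionRefusedError and timeouts
            last_error = exc
    raise OSError("all %d candidates failed: %s" % (len(candidates), last_error))


if __name__ == "__main__":
    # Listen on 127.0.0.1 only, as the test server does.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    # Assumption: 127.0.0.2 refuses (or stalls until the timeout).
    index, conn = connect_first([("127.0.0.2", port), ("127.0.0.1", port)])
    print("connected via candidate", index)
    conn.close()
    server.close()
```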

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-22 20:43:51 +02:00
Xavier Roche
67af1c2f0b Fall back to the next address when a connect fails or stalls
A slot connected to a single resolved address and waited the full slot
timeout (default 120s) if that address was dead -- a blackholed IPv6 on a
dual-stack host would stall the whole mirror. With the cache now holding the
full address list, retry the next address instead of failing.

In back_wait, a connecting slot probes the resolved address count once, then:
on a refused/failed connect (a new SO_ERROR check at connect completion, since
a failed non-blocking connect is reported writable too) it falls back
immediately; on a stalled connect it falls back after a short per-candidate
deadline (min(timeout, 10s)) rather than the full timeout. The last candidate
keeps the full timeout, so single-address hosts are unchanged. Per-slot state
(current index, count, connect start) lives in struct_back, parallel to the
slot array -- no htsblk/lien_back layout change, so the ABI is untouched.
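
The SO_ERROR point is easy to miss, so here is a short Python sketch of it, assuming POSIX socket semantics. The function name nonblocking_connect_result is hypothetical and the code is IPv4-only for brevity; it is not the back_wait logic itself.

```python
import errno
import select
import socket


def nonblocking_connect_result(ip, port, timeout=2.0):
    """Start a non-blocking connect and report how it completed.

    Returns 0 on success, the errno read from SO_ERROR on failure, or None if
    the socket never became writable within `timeout` (a stalled connect).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        rc = sock.connect_ex((ip, port))
        if rc not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            return rc  # failed immediately
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return None  # still connecting: the caller applies its deadline
        # Writable does NOT mean connected: a refused connect is reported
        # writable too, so the real outcome must be read from SO_ERROR.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()
```

A non-zero result maps to the "fall back immediately" case, and None maps to the "stalled" case that the per-candidate deadline handles.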

back_connect_fallback_due() (the deadline decision) and newhttp_addr()'s
address selection are unit-tested through the DNS mock.
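
The deadline rule described above (min(timeout, 10s) per candidate, full timeout on the last one) can be written as a small pure function. This Python sketch uses hypothetical names and assumes the comparison is inclusive (elapsed >= deadline); the C back_connect_fallback_due() may differ in signature and boundary handling.

```python
def connect_deadline(timeout, index, count, per_candidate_cap=10.0):
    """Seconds a connecting slot waits on candidate `index` out of `count`."""
    is_last = index >= count - 1
    return timeout if is_last else min(timeout, per_candidate_cap)


def fallback_due(elapsed, timeout, index, count):
    """True when a stalled connect should move on to the next candidate."""
    if index >= count - 1:
        return False  # no candidate left: the normal slot timeout applies
    return elapsed >= connect_deadline(timeout, index, count)


if __name__ == "__main__":
    # Two candidates, default 120s slot timeout.
    print(connect_deadline(120, 0, 2), connect_deadline(120, 1, 2))
```

Keeping the decision free of sockets and clocks is what makes it straightforward to unit-test.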

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-22 20:37:28 +02:00
Xavier Roche
542d6a56b5 Resolve hosts to multiple addresses and cache the full list
The DNS cache kept a single address per host and the resolver copied only
the head of getaddrinfo's result, discarding the rest. That leaves no
fallback when the chosen address is unreachable (e.g. a blackholed IPv6 on a
dual-stack host) -- the root of the "stuck on connect" stalls.

Widen t_dnscache to hold up to HTS_MAXADDRNUM addresses in resolver order and
walk ai_next when resolving. New hts_dns_resolve_all() exposes the list;
hts_dns_resolve2() still returns the first address, so existing callers are
unchanged. newhttp_addr() connects to a chosen candidate index (newhttp() is
the index-0 wrapper), for the connect-fallback path added next.
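
As a rough Python analogue of the change (not the C implementation), the sketch below keeps every address the resolver returns, in order, up to a fixed cap, and leaves a first-address wrapper for callers that only want one. MAX_ADDRS is a stand-in for HTS_MAXADDRNUM, whose real value is not shown here; resolve_all and resolve_first are hypothetical names.

```python
import socket

MAX_ADDRS = 4  # stand-in for HTS_MAXADDRNUM; the real value is not shown here


def resolve_all(host, port, max_addrs=MAX_ADDRS):
    """Return up to `max_addrs` addresses for `host`, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        if len(addresses) >= max_addrs:
            break  # clamp: results past the cap are dropped
        addresses.append(sockaddr[0])
    return addresses


def resolve_first(host, port):
    """Single-address wrapper: existing callers keep seeing one address."""
    addresses = resolve_all(host, port, max_addrs=1)
    return addresses[0] if addresses else None


if __name__ == "__main__":
    print(resolve_all("127.0.0.1", 80))
```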

No ABI change: t_dnscache is engine-internal (httrackp holds only a pointer;
no plugin reads its fields) and the htsblk/lien_back layout is untouched.

The DNS self-test now covers the list path: count, resolver order, the
family filter, and clamping past HTS_MAXADDRNUM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-22 19:19:48 +02:00
14 changed files with 118 additions and 377 deletions

View File

@@ -232,42 +232,30 @@ jobs:
deb:
name: deb package (lintian)
runs-on: ubuntu-24.04
# Build and gate inside Debian sid, the upload target. A Debian dpkg-deb
# produces archive-legal xz members (an Ubuntu host defaults to zstd, which
# the archive's lintian rejects), and sid's lintian carries the same
# data-driven checks (embedded-lib fingerprints and the like) the buildds and
# UDD apply -- so issues surface here instead of after upload.
container: debian:sid
steps:
- name: Install packaging toolchain
run: |
set -euo pipefail
apt-get update
apt-get install -y --no-install-recommends \
ca-certificates git \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev \
debhelper devscripts lintian fakeroot
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install packaging toolchain
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev \
debhelper devscripts lintian fakeroot
# --unsigned: CI has no GPG key (also skips the release sig/checksums).
# mkdeb builds every package then runs the lintian gate (--fail-on=error,
# warning); debuild runs the packaged test pass.
# debuild builds every package, then lintian gates on errors.
#
# DEB_BUILD_OPTIONS trims work CI does not need (release builds via
# mkdeb.sh are untouched): noautodbgsym drops the -dbgsym packages whose
# LTO payloads are slow to compress and that CI never ships; parallel uses
# every core.
- name: Build and lint Debian packages
# every core. We let debuild run its test pass -- the only one now that
# mkdeb no longer runs its own -- so CI exercises the packaged tests.
- name: Build Debian packages
run: |
set -euo pipefail
# The workspace volume is owned by the host runner uid, but the
# container runs as root, so mkdeb's git calls (superproject and the
# coucal submodule) trip "dubious ownership"; mark them all safe.
git config --global --add safe.directory "*"
export DEB_BUILD_OPTIONS="noautodbgsym parallel=$(nproc)"
bash tools/mkdeb.sh --unsigned --no-release-artifacts

View File

@@ -1,8 +1,3 @@
# The shared libraries ship without a versioned symbols control file (ABI is
# tracked via the SONAME plus a >= upstream-version dependency, see debian/rules).
libhttrack3: no-symbols-control-file usr/lib/*
# Bundled, locally patched minizip (src/minizip): it adds a zipFlush() API the
# system libminizip lacks (htscache.c flushes the cache .zip so an interrupted
# crawl leaves a valid archive), plus Android/old-zlib portability fixes.
libhttrack3: embedded-library *libminizip*

View File

@@ -1,3 +0,0 @@
# Statically linked against httrack's bundled, patched minizip (see src/minizip
# and libhttrack3's override): the zipFlush() API is absent from the system one.
proxytrack: embedded-library *libminizip*

View File

@@ -2468,44 +2468,6 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
htsmain_free();
return err;
} break;
case 'N': { // url_savename name resolution: httrack -#N <fil>
// <content-type>
if (na + 2 < argc) {
lien_adrfilsave afs;
cache_back cache;
struct_back *sback;
hash_struct hash;
lien_back headers;
memset(&afs, 0, sizeof(afs));
strcpybuff(afs.af.adr, "www.example.com");
strcpybuff(afs.af.fil, argv[na + 1]);
memset(&cache, 0, sizeof(cache));
cache.hashtable = (void *) coucal_new(0);
sback = back_new(opt, opt->maxsoc * 32 + 1024);
hash_init(opt, &hash, opt->urlhack);
memset(&headers, 0, sizeof(headers));
headers.status = 0;
headers.r.statuscode = HTTP_OK;
strcpybuff(headers.r.contenttype, argv[na + 2]);
strcpybuff(headers.url_fil, argv[na + 1]);
url_savename(&afs, NULL, NULL, NULL, opt, sback, &cache,
&hash, 0, 0, &headers);
printf("savename: %s\n", afs.save);
htsmain_free();
return 0;
} else {
fprintf(
stderr,
"Option #N requires <fil> <content-type> arguments\n");
htsmain_free();
return 1;
}
} break;
case 'C': // list cache files : httrack -#C '*spid*.gif' will attempt to find the matching file
{
int hasFilter = 0;

View File

@@ -4766,51 +4766,64 @@ int hts_read(htsblk * r, char *buff, int size) {
// -- DNS cache management --
// 'RX98
// Free a DNS cache record (coucal value handler).
static void hts_cache_value_free(coucal_opaque arg, coucal_value value) {
void *record = value.ptr;
(void) arg;
freet(record);
}
// opt's DNS cache hashtable, created on first use. Records (t_dnscache*) are
// owned by the table and freed by hts_cache_value_free on coucal_delete.
coucal hts_cache(httrackp *opt) {
// 'capsule' holding only the cache
t_dnscache *hts_cache(httrackp * opt) {
assertf(opt != NULL);
if (opt->state.dns_cache == NULL) {
coucal cache = coucal_new(0);
coucal_set_name(cache, "dns_cache");
coucal_value_set_value_handler(cache, hts_cache_value_free, NULL);
opt->state.dns_cache = cache;
opt->state.dns_cache = (t_dnscache *) malloct(sizeof(t_dnscache));
memset(opt->state.dns_cache, 0, sizeof(t_dnscache));
}
assertf(opt->state.dns_cache != NULL);
/* first entry is NULL */
assertf(opt->state.dns_cache->iadr == NULL);
return opt->state.dns_cache;
}
// MUST BE LOCKED (coucal is not internally serialized vs FTP/web threads)
// Free DNS cache.
void hts_cache_free(t_dnscache *const root) {
if (root != NULL) {
t_dnscache *cache;
for(cache = root; cache != NULL; ) {
t_dnscache *const next = cache->next;
cache->next = NULL;
freet(cache);
cache = next;
}
}
}
// lock the DNS cache for any add operation
// safer when several threads may write to it
// -1: status? 0: release 1: lock
// MUST BE LOCKED
// Look up iadr in the DNS cache, filling out[0..min(count,max)-1].
// Returns: -1 not yet tested; 0 negative-cached (not in DNS); >0 address count.
static int hts_ghbn_all(coucal cache, const char *const iadr,
static int hts_ghbn_all(const t_dnscache *cache, const char *const iadr,
SOCaddr *const out, const int max) {
void *ptr;
assertf(out != NULL);
assertf(iadr != NULL);
if (*iadr == '\0') {
return -1;
}
if (coucal_read_pvoid(cache, iadr, &ptr)) { // ok, found
const t_dnscache *const record = (const t_dnscache *) ptr;
int i;
/* first entry is empty */
if (cache->iadr == NULL) {
cache = cache->next;
}
for(; cache != NULL; cache = cache->next) {
assertf(cache != NULL);
assertf(cache->iadr != NULL);
assertf(cache->iadr == (const char*) cache + sizeof(t_dnscache));
if (strcmp(cache->iadr, iadr) == 0) { // ok, found
int i;
assertf(record->host_count <= HTS_MAXADDRNUM);
for (i = 0; i < record->host_count && i < max; i++) {
assertf(record->host_length[i] <= sizeof(record->host_addr[i]));
SOCaddr_copyaddr2(out[i], record->host_addr[i], record->host_length[i]);
assertf(cache->host_count <= HTS_MAXADDRNUM);
for (i = 0; i < cache->host_count && i < max; i++) {
assertf(cache->host_length[i] <= sizeof(cache->host_addr[i]));
SOCaddr_copyaddr2(out[i], cache->host_addr[i], cache->host_length[i]);
}
return cache->host_count;
}
return record->host_count;
}
return -1;
}
@@ -5075,7 +5088,7 @@ static int hts_dns_resolve_list_(httrackp *opt, const char *_iadr,
SOCaddr *const out, const int max,
const char **error) {
char BIGSTK iadr[HTS_URLMAXSIZE * 2];
coucal cache = hts_cache(opt); // the DNS cache
t_dnscache *cache = hts_cache(opt); // cache address
int count;
assertf(opt != NULL);
@@ -5096,10 +5109,13 @@ static int hts_dns_resolve_list_(httrackp *opt, const char *_iadr,
if (count >= 0) { // cache hit (0 == negative-cached)
return count;
} else { // not in the DNS cache, test it
const size_t iadr_len = strlen(iadr) + 1;
SOCaddr resolved[HTS_MAXADDRNUM];
t_dnscache *record;
int i;
// find queue
for(; cache->next != NULL; cache = cache->next) ;
#if DEBUGDNS
printf("resolving (not cached) %s\n", iadr);
#endif
@@ -5110,18 +5126,22 @@ static int hts_dns_resolve_list_(httrackp *opt, const char *_iadr,
DEBUG_W("gethostbyname done\n");
#endif
/* attempt to store new entry (coucal owns it and dups the host key) */
record = malloct(sizeof(t_dnscache));
if (record != NULL) {
memset(record, 0, sizeof(*record));
record->host_count = count;
/* attempt to store new entry */
cache->next = malloct(sizeof(t_dnscache) + iadr_len);
if (cache->next != NULL) {
t_dnscache *const next = cache->next;
char *const block = (char*) cache->next;
char *const str = block + sizeof(t_dnscache);
memcpy(str, iadr, iadr_len);
next->iadr = str;
next->host_count = count;
for (i = 0; i < count; i++) {
record->host_length[i] = SOCaddr_size(resolved[i]);
assertf(record->host_length[i] <= sizeof(record->host_addr[i]));
memcpy(record->host_addr[i], &SOCaddr_sockaddr(resolved[i]),
record->host_length[i]);
next->host_length[i] = SOCaddr_size(resolved[i]);
assertf(next->host_length[i] <= sizeof(next->host_addr[i]));
memcpy(next->host_addr[i], &SOCaddr_sockaddr(resolved[i]),
next->host_length[i]);
}
coucal_add_pvoid(cache, iadr, record);
next->next = NULL;
}
/* copy result to caller (cache store may have failed; result still valid)
@@ -5992,14 +6012,14 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
/* Cache */
if (opt->state.dns_cache != NULL) {
coucal root;
t_dnscache *root;
hts_mutexlock(&opt->state.lock);
root = opt->state.dns_cache;
opt->state.dns_cache = NULL;
hts_mutexrelease(&opt->state.lock);
coucal_delete(&root); // frees records via hts_cache_value_free
hts_cache_free(root);
}
/* Cancel chain */

View File

@@ -147,8 +147,9 @@ struct OLD_htsblk {
#define HTS_DEF_FWSTRUCT_t_dnscache
typedef struct t_dnscache t_dnscache;
#endif
// One DNS cache record, stored as a coucal value keyed by hostname.
struct t_dnscache {
struct t_dnscache *next;
const char *iadr;
// resolved addresses, in resolver (RFC 6724) order; host_count==0 means the
// name does not resolve (negative cache). host_count<=HTS_MAXADDRNUM.
int host_count;
@@ -244,9 +245,8 @@ HTSEXT_API int check_hostname_dns(const char *const hostname);
int ftp_available(void);
#if HTS_DNSCACHE
/* Return opt's DNS cache hashtable (hostname -> t_dnscache record), creating it
on first use. Records are owned by the table and freed on coucal_delete. */
coucal hts_cache(httrackp *opt);
void hts_cache_free(t_dnscache *const cache);
t_dnscache *hts_cache(httrackp * opt);
#endif
// outils divers

View File

@@ -760,9 +760,9 @@ int url_savename(lien_adrfilsave *const afs,
strcatbuff(fil, DEFAULT_HTML); // name the default page (presumably HTML here, from an HTTP proxy)
}
}
// Change the extension? e.g. php3 saved as html, cgi as html or gif/xbm
// depending on the resolved type.
if (ext_chg && !opt->no_type_change) {
// Change extension?
// for example, php3 will be saved as html, cgi as html or gif, xbm etc., depending on the case
if (ext_chg && !opt->no_type_change) { // change ext
char *a = fil + strlen(fil) - 1;
if ((opt->debug > 1) && (opt->log != NULL)) {
@@ -774,19 +774,11 @@ int url_savename(lien_adrfilsave *const afs,
adr_complete, fil_complete, ext);
}
if (ext_chg == 1) {
// Cut the old extension only when it is empty (a bare trailing dot), the
// new one, or a recognized one; an unknown trailing ".token" (e.g.
// /article-1.884291, #115) is part of the name, not an extension.
const char *const old_ext = get_ext(catbuff, sizeof(catbuff), fil);
const int known_ext = !*old_ext || strfield2(old_ext, ext) ||
is_knowntype(opt, fil) || is_dyntype(old_ext) ||
ishtml_ext(old_ext) != -1;
while((a > fil) && (*a != '.') && (*a != '/'))
a--;
if (*a == '.' && known_ext)
*a = '\0'; // cut
strcatbuff(fil, "."); // re-add the dot
if (*a == '.')
*a = '\0'; // cut
strcatbuff(fil, "."); // copy the dot back
} else {
while((a > fil) && (*a != '/'))
a--;
@@ -794,7 +786,7 @@ int url_savename(lien_adrfilsave *const afs,
a++;
*a = '\0';
}
strcatbuff(fil, ext); // append ext/name
strcatbuff(fil, ext); // copy ext/name
}
// Find the first / and the last .
{
@@ -1729,10 +1721,10 @@ char *url_savename_refname_fullpath(httrackp * opt, const char *adr,
StringBuff(opt->path_log), digest_filename);
}
/* remove refname if any; HTS_TRUE if it was removed */
hts_boolean url_savename_refname_remove(httrackp *opt, const char *adr,
const char *fil) {
/* remove refname if any */
void url_savename_refname_remove(httrackp * opt, const char *adr,
const char *fil) {
char *filename = url_savename_refname_fullpath(opt, adr, fil);
return UNLINK(filename) == 0 ? HTS_TRUE : HTS_FALSE;
(void) UNLINK(filename);
}

View File

@@ -104,9 +104,8 @@ char *url_md5(char *digest_buffer, const char *fil_complete);
void url_savename_refname(const char *adr, const char *fil, char *filename);
char *url_savename_refname_fullpath(httrackp * opt, const char *adr,
const char *fil);
/* Remove the temp-ref for (adr,fil); HTS_TRUE if it was removed. */
hts_boolean url_savename_refname_remove(httrackp *opt, const char *adr,
const char *fil);
void url_savename_refname_remove(httrackp * opt, const char *adr,
const char *fil);
#endif
#endif

View File

@@ -241,7 +241,7 @@ struct htsoptstate {
char *userhttptype;
int verif_backblue_done; /**< backblue.gif/fade.gif already emitted */
int verif_external_status;
coucal dns_cache; /**< DNS resolution cache: hostname -> t_dnscache record */
t_dnscache *dns_cache; /**< DNS resolution cache */
int dns_cache_nthreads; /**< number of in-flight DNS resolver threads */
/* HTML parsing state */
char _hts_errmsg[HTS_CDLMAXSIZE + 256]; /**< last engine error message */

View File

@@ -3749,60 +3749,44 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
} // block
// HTTP error (e.g. 404, not found)
} else if ((r->statuscode == HTTP_PRECONDITION_FAILED) ||
(r->statuscode == HTTP_REQUESTED_RANGE_NOT_SATISFIABLE)) {
// 412/416: the resume partial is stale; re-get the whole file (#206)
lien_back *itemback = NULL;
int had_partial = 0;
int ref_existed = 0;
int ref_gone;
// Drop the temp-ref, its partial, and heap->sav so the re-get carries no
// Range; else back_add rebuilds the same Range and loops.
if (back_unserialize_ref(opt, heap(ptr)->adr, heap(ptr)->fil,
&itemback) == 0) {
had_partial = 1;
ref_existed = 1;
// best-effort: an orphaned partial cannot re-Range once the ref is gone
if (fexist_utf8(itemback->url_sav))
(void) UNLINK(fconv(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
itemback->url_sav));
back_clear_entry(itemback);
freet(itemback);
}
// don't re-record if the ref survived (it would re-Range and loop)
ref_gone =
url_savename_refname_remove(opt, heap(ptr)->adr, heap(ptr)->fil) ||
!ref_existed;
} else if ((r->statuscode == HTTP_PRECONDITION_FAILED)
|| (r->statuscode == HTTP_REQUESTED_RANGE_NOT_SATISFIABLE)
) { // Precondition Failed, i.e. for us: request the WHOLE file again
if (fexist_utf8(heap(ptr)->sav)) {
had_partial = 1;
remove(heap(ptr)->sav);
remove(heap(ptr)->sav); // remove it
} else {
hts_log_print(opt, LOG_WARNING,
"Unexpected 412/416 error (%s) for %s%s, '%s' could not be found on disk",
r->msg, urladr(), urlfil(),
heap(ptr)->sav != NULL ? heap(ptr)->sav : "");
}
// Re-get once, only if a partial existed and both Range triggers are
// gone; a failed removal gives up rather than looping. range_used is
// unreliable (it does not survive the delayed-type two-pass).
if (had_partial && ref_gone && !fexist_utf8(heap(ptr)->sav)) {
if (!fexist_utf8(heap(ptr)->sav)) { // Really removed? (otherwise we loop)
#if HDEBUG
printf("Partial content NOT up-to-date, reget all file for %s\n",
heap(ptr)->sav);
#endif
hts_log_print(opt, LOG_DEBUG, "Partial file reget (%s) for %s%s",
r->msg, urladr(), urlfil());
// record the SAME link
if (hts_record_link(opt, heap(ptr)->adr, heap(ptr)->fil, heap(ptr)->sav, "", "", NULL)) {
heap_top()->testmode = heap(ptr)->testmode;
heap_top()->link_import = 0;
heap_top()->testmode = heap(ptr)->testmode; // test mode?
heap_top()->link_import = 0; // not import mode
heap_top()->depth = heap(ptr)->depth;
heap_top()->pass2 = max(heap(ptr)->pass2, numero_passe);
heap_top()->retry = heap(ptr)->retry;
heap_top()->premier = heap(ptr)->premier;
heap_top()->precedent = ptr;
//
// cancel the current link
error = 1;
hts_invalidate_link(opt, ptr); // invalidate hashtable entry
} else { // out of memory
XH_uninit;
hts_invalidate_link(opt, ptr); // invalidate hashtable entry
//
} else { // oops, error: out of memory
XH_uninit; // free memory and buffers
return 0;
}
} else {
hts_log_print(opt, LOG_WARNING,
"Giving up on partial reget (%s) for %s%s", r->msg,
urladr(), urlfil());
hts_log_print(opt, LOG_ERROR, "Can not remove old file %s", urlfil());
error = 1;
}

View File

@@ -1,41 +0,0 @@
#!/bin/bash
#
set -euo pipefail
# Local save-name extension resolution (url_savename via -#N <fil> <content-type>).
# Asserts on the basename of "savename: <path>".
name() {
out="$(httrack -O /dev/null -#N "$1" "$2" | sed -n 's/^savename: //p')"
test "${out##*/}" == "$3" || {
echo "FAIL: '$1' '$2' -> '$out' (want '$3')"
exit 1
}
}
# #115: an unknown trailing ".token" is part of the name, keep it and append the type.
name '/article-1.884291' 'text/html' 'article-1.884291.html'
name '/news/story-12345.987654' 'text/html' 'story-12345.987654.html'
# Recognized extensions still collapse to the resolved type.
name '/page.php' 'text/html' 'page.html'
name '/page.asp' 'text/html' 'page.html'
name '/foo' 'text/html' 'foo.html'
# A bare trailing dot is not a tail to keep.
name '/page.' 'text/html' 'page.html'
# Soft-404 (#267/#408): a binary URL served as HTML is named .html.
name '/x.pdf' 'text/html' 'x.html'
name '/x.gif' 'text/html' 'x.html'
# Type agrees with the extension: keep it, no churn, no double extension.
name '/x.pdf' 'application/pdf' 'x.pdf'
name '/x.jpg' 'image/jpeg' 'x.jpg'
name '/x.html' 'text/html' 'x.html'
name '/x.js' 'application/x-javascript' 'x.js'
name '/types/data.json' 'application/json' 'data.json'
# Agreeing type must not rewrite the extension's casing (no strip-and-reappend).
name '/x.JPG' 'image/jpeg' 'x.JPG'

View File

@@ -1,113 +0,0 @@
#!/bin/bash
# Issue #206: a continue/update crawl looped forever when the resume Range got a
# 416. Pass 1 leaves a partial + temp-ref; pass 2 must terminate and not loop.
set -u
: "${top_srcdir:=..}"
testdir=$(cd "$(dirname "$0")" && pwd)
server="${testdir}/local-server.py"
command -v python3 >/dev/null || ! echo "python3 not found; skipping" || exit 77
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/httrack_206.XXXXXX") || exit 1
serverpid=
crawlpid=
cleanup() {
test -n "$crawlpid" && kill -9 "$crawlpid" 2>/dev/null
if test -n "$serverpid"; then
kill "$serverpid" 2>/dev/null
wait "$serverpid" 2>/dev/null
fi
rm -rf "$tmpdir"
}
trap cleanup EXIT HUP INT QUIT PIPE TERM
# --- start the server, discover its ephemeral port --------------------------
# RESUME_COUNTER gets a byte per /resume/blob.txt request (pass-2 delta bounds re-gets).
serverlog="${tmpdir}/server.log"
counter="${tmpdir}/blobcount"
RESUME_COUNTER="$counter" python3 "$server" --root "${testdir}/server-root" >"$serverlog" 2>&1 &
serverpid=$!
port=
for _ in $(seq 1 50); do
line=$(head -n1 "$serverlog" 2>/dev/null)
if test "${line%% *}" == "PORT"; then
port="${line#PORT }"
break
fi
kill -0 "$serverpid" 2>/dev/null || {
echo "server exited early: $(cat "$serverlog")"
exit 1
}
sleep 0.1
done
test -n "$port" || {
echo "could not discover server port"
exit 1
}
base="http://127.0.0.1:${port}"
which httrack >/dev/null || {
echo "could not find httrack"
exit 1
}
out="${tmpdir}/crawl"
mkdir "$out"
common=(-O "$out" --quiet --disable-security-limits --robots=0 --timeout=30 --retries=0)
refdir="${out}/hts-cache/ref"
# --- pass 1: crawl, interrupt once the blob download is underway -------------
printf '[pass 1: interrupt mid-download] ..\t'
httrack "${common[@]}" "${base}/resume/index.html" >"${tmpdir}/log1" 2>&1 &
crawlpid=$!
# Wait until blob.txt is requested, then SIGTERM so httrack's exit handler
# finalizes the cache and serializes the temp-ref.
for _ in $(seq 1 300); do
test -s "$counter" && break
kill -0 "$crawlpid" 2>/dev/null || break
sleep 0.1
done
sleep 0.5
kill -TERM "$crawlpid" 2>/dev/null
wait "$crawlpid" 2>/dev/null
crawlpid=
test -n "$(find "$refdir" -name '*.ref' 2>/dev/null)" || {
echo "FAIL: no temp-ref survived pass 1; cannot drive #206"
exit 1
}
echo "OK (temp-ref present)"
before=$(wc -c <"$counter" 2>/dev/null || echo 0)
# --- pass 2: --continue -> resume Range -> 416, bounded against the #206 loop -
# Kill pass 2 after a deadline (portable stand-in for `timeout`, absent on macOS).
printf '[pass 2: resume must terminate] ..\t'
HANG_RC=137 # 128 + SIGKILL
httrack "${common[@]}" --continue "${base}/resume/index.html" >"${tmpdir}/log2" 2>&1 &
crawlpid=$!
(sleep 30 && kill -9 "$crawlpid" 2>/dev/null) &
guard=$!
rc=0
wait "$crawlpid" 2>/dev/null || rc=$?
crawlpid=
kill "$guard" 2>/dev/null || true
wait "$guard" 2>/dev/null || true
if test "$rc" -eq "$HANG_RC"; then
echo "FAIL: pass 2 did not terminate (#206 resume->416 loop)"
exit 1
fi
echo "OK (terminated, rc=$rc)"
# The fix re-gets once (resume Range + range-less re-get = 2): the lower bound
# rejects a drop-the-link non-fix (1), the upper bound rejects the loop (many).
after=$(wc -c <"$counter" 2>/dev/null || echo 0)
hits=$((after - before))
printf '[bounded re-get count] ..\t'
if test "$hits" -lt 2; then
echo "FAIL: only ${hits} pass-2 request(s); the stale partial was not re-got"
exit 1
fi
if test "$hits" -gt 8; then
echo "FAIL: ${hits} pass-2 requests for blob.txt (resume is looping)"
exit 1
fi
echo "OK (${hits} requests)"

View File

@@ -40,7 +40,6 @@ TESTS = \
01_engine-parse.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-savename.test \
01_engine-simplify.test \
01_engine-strsafe.test \
02_manpage-regen.test \
@@ -59,7 +58,6 @@ TESTS = \
16_local-assume.test \
17_local-empty-ct.test \
18_local-update.test \
19_local-connect-fallback.test \
20_local-resume-loop.test
19_local-connect-fallback.test
CLEANFILES = check-network_sh.cache

View File

@@ -15,7 +15,6 @@ stdlib only (http.server + ssl) -- no new build or runtime dependency.
import argparse
import os
import time
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote, unquote, urlsplit
@@ -177,43 +176,6 @@ class Handler(SimpleHTTPRequestHandler):
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
# resume / 416 loop (#206): the first GET stalls after a prefix so the crawl
# can be interrupted (partial + temp-ref); every later request is 416.
RESUME_PREFIX = b"PARTIAL-" + b"x" * 4096 # flushed before the stall
RESUME_LEN = len(RESUME_PREFIX) + 4096 # declared length never delivered
_resume_started = False
def route_resume_index(self):
self.send_html('\t<a href="blob.txt">blob</a>')
def route_resume(self):
counter = os.environ.get("RESUME_COUNTER")
if counter:
with open(counter, "a") as fp:
fp.write("x")
# First GET: stall mid-body so the crawl can be interrupted with a partial.
if not Handler._resume_started:
Handler._resume_started = True
self.send_response(200)
self.send_header("Content-Type", "image/png")
self.send_header("Content-Length", str(self.RESUME_LEN))
self.send_header("Accept-Ranges", "bytes")
self.end_headers()
if self.command != "HEAD":
self.wfile.write(self.RESUME_PREFIX)
self.wfile.flush()
try:
while True:
time.sleep(3600)
except OSError:
pass
return
self.send_response(416, "Requested Range Not Satisfiable")
self.send_header("Content-Type", "image/png")
self.send_header("Content-Range", "bytes */%d" % self.RESUME_LEN)
self.send_header("Content-Length", "0")
self.end_headers()
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
@@ -233,8 +195,6 @@ class Handler(SimpleHTTPRequestHandler):
"/types/style.css": route_types,
"/types/data.json": route_types,
"/types/gen.php": route_types,
"/resume/index.html": route_resume_index,
"/resume/blob.txt": route_resume,
}
# --- dispatch ----------------------------------------------------------