Compare commits

6 Commits

Author SHA1 Message Date
Xavier Roche
82d1de5d06 tests: lock special-char URL naming across an update (#157)
#157 reported accented URLs (pt-BR MediaWiki) losing their .html extension
on an update pass, observed with 3.49-2 on Windows. It does not reproduce on
current master: the update resolves the cached content-type and re-applies
.html consistently, for UTF-8 and ISO-8859-1 sources, raw Latin-1 href bytes,
either percent-encoding case, and dotted tails. The original symptom was a
Windows codepage vs UTF-8 X-Save filename mismatch that cannot occur on a
UTF-8 filesystem.

Add a regression test that locks the invariant: a dotless, accented basename
served as text/html, crawled then updated, must keep its .html name and not
leave an extensionless sibling.

Also assert in the --rerun harness that the update pass reported "files
updated" (a fresh crawl never does), so a regression that bypasses the cache
and silently re-crawls fresh can no longer pass the update tests.
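
The harness check described above can be sketched in Python as follows. The regular expression is the one the harness passes to grep; the helper name `update_used_cache` and both log excerpts are made up for illustration and are not real httrack output.

```python
import re

# Pattern copied from the harness's grep call; everything else is illustrative.
UPDATE_SUMMARY = re.compile(r"mirror complete in .*files updated")


def update_used_cache(log_text: str) -> bool:
    """Return True when some log line carries the update summary."""
    return any(UPDATE_SUMMARY.search(line) for line in log_text.splitlines())


# Hypothetical log excerpts (assumed wording, not captured from httrack).
fresh_log = "mirror complete in 1 seconds: 3 links scanned, 3 files written"
update_log = "mirror complete in 1 seconds: 3 links scanned, 1 files updated"

print(update_used_cache(fresh_log))   # False
print(update_used_cache(update_log))  # True
```

A regression that silently re-crawls from scratch would produce a log like the first excerpt, so the assertion fails even though every expected file is on disk.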

Closes #157

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-24 22:16:30 +02:00
Xavier Roche
c97b3e233e Stop the 412/416 partial-reget loop on continue/update (#206) (#422)
On resume, the Range request is rebuilt by back_add from a temp-ref keyed on
(adr,fil) that records the partial download's real save name. A 412
("Precondition Failed") or 416 ("Range Not Satisfiable") means that partial
is stale and the whole file
must be re-fetched. The handler only removed heap->sav, so when the resume
pass recomputed a save name different from the temp-ref's (the default
delayed-type machinery renames freely), the partial was never cleared:
back_add re-sent the same Range, earned the same 416, and the link was
re-recorded forever, growing the scan counter without bound.

Clear the whole partial wherever it lives -- the temp-ref and the file it
points at, plus heap->sav -- so the re-record falls through to a plain full
GET. Re-get only when there was a partial to discard and both Range triggers
(the ref and the on-disk file) are actually gone; once they are, a fresh 416
with nothing left to drop means the whole-file GET itself failed, so the link
gives up cleanly instead of re-queueing. A failed removal (read-only or full
cache) also gives up rather than looping, since back_add would otherwise
re-Range the surviving ref; url_savename_refname_remove now reports the
removal result so the handler can tell. (The request's range_used flag would
be the natural one-shot signal, but it does not survive the delayed-type
two-pass, so we key off the partial instead.)
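
A minimal Python model of the re-get decision described above, assuming the four observations below are all the handler needs. `PartialState` and `should_reget` are hypothetical names for this sketch; the real logic lives in the C handler in the diff.

```python
from dataclasses import dataclass


@dataclass
class PartialState:
    ref_existed: bool  # a temp-ref for (adr, fil) could be loaded
    ref_removed: bool  # removing the temp-ref succeeded
    sav_existed: bool  # heap->sav was on disk before cleanup
    sav_removed: bool  # removing heap->sav succeeded


def should_reget(s: PartialState) -> bool:
    """Re-queue a plain GET only if a partial existed and every Range trigger is gone."""
    had_partial = s.ref_existed or s.sav_existed
    ref_gone = s.ref_removed or not s.ref_existed
    sav_gone = s.sav_removed or not s.sav_existed
    return had_partial and ref_gone and sav_gone


# First 416 on resume: partial and ref both present and both removed.
print(should_reget(PartialState(True, True, True, True)))      # True
# Second 416 with nothing left to drop: the whole-file GET itself failed.
print(should_reget(PartialState(False, False, False, False)))  # False
# Read-only cache: the ref survives, so give up rather than loop.
print(should_reget(PartialState(True, False, True, True)))     # False
```

Keying the decision off what was actually on disk, rather than a per-request flag, is what makes the second 416 terminate.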

tests/20_local-resume-loop.test drives it offline: pass 1 is interrupted
(SIGTERM, so the exit handler finalizes the cache and the temp-ref) to leave
a partial, then pass 2 --continue gets 416 on every resume request. A
portable watchdog kills pass 2 if it loops; the test asserts it terminates
and bounds the blob request count (2 <= requests <= 8): the lower bound
covers the resume Range plus one whole-file re-get, the upper bound rejects
a loop. It fails on the pre-fix handler (loops) and on a non-fix that
silently drops the link.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 21:12:40 +02:00
Xavier Roche
b615a4e7fd Keep unrecognized URL tails instead of mangling them to .html (#421)
url_savename truncated any trailing ".token" when applying a resolved
content-type, so /article-1.884291 served as text/html was saved as
article-1.html, dropping the .884291 tail and colliding with every
sibling sharing the prefix. Cut the old extension only when it is empty
(a bare trailing dot), the resolved type, a known MIME extension, a
dynamic-page extension, or an html-family extension; otherwise keep the
tail and append the type (article-1.884291.html).
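
The cut-or-keep rule above can be illustrated with a small Python sketch. It assumes the engine has already decided to change the extension, and it replaces the MIME, dynamic-page and html-family lookups with one illustrative set; `KNOWN_EXTS` and `apply_type` are hypothetical names, and the case-insensitive comparison is an assumption of the sketch.

```python
# Illustrative stand-in for the known-MIME, dynamic-page and html-family checks.
KNOWN_EXTS = {"html", "htm", "php", "asp", "pdf", "gif", "jpg", "js", "json"}


def apply_type(fil: str, new_ext: str) -> str:
    """Append new_ext, cutting the old extension only when it is recognized."""
    head, slash, base = fil.rpartition("/")
    stem, dot, old_ext = base.rpartition(".")
    if dot:
        cut = (
            old_ext == ""                          # bare trailing dot
            or old_ext.lower() == new_ext.lower()  # already the resolved type
            or old_ext.lower() in KNOWN_EXTS       # recognized extension
        )
        if cut:
            base = stem
    return head + slash + base + "." + new_ext


print(apply_type("/article-1.884291", "html"))  # /article-1.884291.html
print(apply_type("/page.php", "html"))          # /page.html
print(apply_type("/page.", "html"))             # /page.html
print(apply_type("/foo", "html"))               # /foo.html
```

The first call is the case that used to collapse to article-1.html; the others show that recognized extensions, a trailing dot, and dotless names behave as before.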

Recognized extensions still collapse as before, so the #267/#408
soft-404 behavior (a binary URL served as HTML named .html) is
preserved, and a type that agrees with the extension causes no churn.

Add a hidden -#N <fil> <content-type> self-test driving url_savename
offline, plus tests/01_engine-savename.test covering the matrix.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 18:33:52 +02:00
Xavier Roche
594cf0da39 debian: override embedded-library for bundled minizip, lint under debian:sid (#419)
httrack statically links its own patched minizip (src/minizip): it carries a
zipFlush() API the system libminizip lacks, which htscache.c uses to flush the
cache .zip so an interrupted crawl leaves a valid archive, plus Android and
old-zlib portability fixes. The system library can't be substituted until that
lands upstream, so add justified lintian overrides for the resulting
embedded-library tag on libhttrack3 and proxytrack.

The tag never showed in CI because the deb job built and linted on the Ubuntu
runner, whose lintian predates the minizip fingerprint; it only fires on the
newer lintian the Debian buildds and UDD run. Build and lint the package inside
a debian:sid container instead, matching the upload target. That also keeps the
archive legal: a Debian dpkg-deb writes xz members where an Ubuntu host defaults
to zstd, which Debian's lintian rejects as a malformed deb. And being unable to
unpack a zstd member, lintian never scans the binaries the embedded-library
check reads, so the override would otherwise have looked unused.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 22:27:18 +02:00
Xavier Roche
3845cd1fb3 Store the DNS cache in a coucal hashtable (#420)
The resolver cache was a hand-rolled singly-linked list with a dummy head
node: O(n) lookup, O(n^2) build, and each record carried its own next
pointer plus an inline copy of the hostname key. Swap it for coucal, the
hashtable already used for the backing cache and the ready slots, keyed by
hostname with the address record as the value.

coucal owns the records (freed through a value handler on coucal_delete)
and dups the key itself, so t_dnscache sheds both its next link and its
inline iadr string and becomes a pure address record. The state field
keeps the same pointer width (t_dnscache* -> coucal), so the installed
htsopt.h layout and the ABI are unchanged.

Behaviour is identical: same -1/0/>0 lookup contract, same negative
caching, same resolve-once semantics, all under the existing
opt->state.lock (coucal is not internally serialized against the FTP/web
threads). The DNS self-test exercises the full contract black-box and
passes unchanged.
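
As a rough model of the contract listed above, the following Python sketch uses a dict in place of the coucal table. `DnsCache` and `MAX_ADDRS` are hypothetical; the real HTS_MAXADDRNUM value is not shown in this diff, and locking is left to the caller as in the engine.

```python
from typing import Dict, List

MAX_ADDRS = 4  # stand-in for HTS_MAXADDRNUM (actual value not shown here)


class DnsCache:
    """Hostname -> address list, mirroring the -1 / 0 / >0 lookup contract."""

    def __init__(self) -> None:
        self._records: Dict[str, List[str]] = {}

    def store(self, host: str, addrs: List[str]) -> None:
        # An empty list is a negative-cache entry: the name does not resolve.
        self._records[host] = list(addrs[:MAX_ADDRS])

    def lookup(self, host: str, out: List[str], max_out: int) -> int:
        """Fill out with up to max_out addresses; return the cached count."""
        if not host or host not in self._records:
            return -1  # not yet tested
        record = self._records[host]
        out[:] = record[:max_out]
        return len(record)  # 0 means negative-cached


cache = DnsCache()
found: List[str] = []
print(cache.lookup("example.test", found, 8))  # -1
cache.store("example.test", ["2001:db8::1", "192.0.2.1"])
print(cache.lookup("example.test", found, 1))  # 2
print(found)                                   # ['2001:db8::1']
```

As in the C lookup, the return value is the full cached count even when the caller's buffer holds fewer entries.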

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:18:53 +02:00
Xavier Roche
94bffb0804 Fall back to the next address when a connect fails or stalls (#418)
* Resolve hosts to multiple addresses and cache the full list

The DNS cache kept a single address per host and the resolver copied only
the head of getaddrinfo's result, discarding the rest. That leaves no
fallback when the chosen address is unreachable (e.g. a blackholed IPv6 on a
dual-stack host) -- the root of the "stuck on connect" stalls.

Widen t_dnscache to hold up to HTS_MAXADDRNUM addresses in resolver order and
walk ai_next when resolving. New hts_dns_resolve_all() exposes the list;
hts_dns_resolve2() still returns the first address, so existing callers are
unchanged. newhttp_addr() connects to a chosen candidate index (newhttp() is
the index-0 wrapper), for the connect-fallback path added next.

No ABI change: t_dnscache is engine-internal (httrackp holds only a pointer;
no plugin reads its fields) and the htsblk/lien_back layout is untouched.

The DNS self-test now covers the list path: count, resolver order, the
family filter, and clamping past HTS_MAXADDRNUM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Fall back to the next address when a connect fails or stalls

A slot connected to a single resolved address and waited the full slot
timeout (default 120s) if that address was dead -- a blackholed IPv6 on a
dual-stack host would stall the whole mirror. With the cache now holding the
full address list, retry the next address instead of failing.

In back_wait, a connecting slot probes the resolved address count once, then:
on a refused/failed connect (a new SO_ERROR check at connect completion, since
a failed non-blocking connect is reported writable too) it falls back
immediately; on a stalled connect it falls back after a short per-candidate
deadline (min(timeout, 10s)) rather than the full timeout. The last candidate
keeps the full timeout, so single-address hosts are unchanged. Per-slot state
(current index, count, connect start) lives in struct_back, parallel to the
slot array -- no htsblk/lien_back layout change, so the ABI is untouched.
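
The deadline rule in the paragraph above, sketched in Python under stated assumptions: `connect_deadline` and `fallback_due` are hypothetical names (the C helper is back_connect_fallback_due()), indices are 0-based, and whether the real comparison is inclusive is a guess.

```python
PER_CANDIDATE_CAP_S = 10.0  # the "10s" from the commit message


def connect_deadline(timeout_s: float, index: int, count: int) -> float:
    """Seconds a connect to candidate `index` may take before giving up on it."""
    is_last = index >= count - 1
    return timeout_s if is_last else min(timeout_s, PER_CANDIDATE_CAP_S)


def fallback_due(elapsed_s: float, timeout_s: float, index: int, count: int) -> bool:
    """True when a stalled connect should move on to the next address."""
    has_next = index < count - 1
    # Assumption: inclusive comparison; the real code may use a strict one.
    return has_next and elapsed_s >= connect_deadline(timeout_s, index, count)


print(connect_deadline(120, 0, 2))  # 10.0
print(connect_deadline(120, 1, 2))  # 120
print(fallback_due(11, 120, 0, 2))  # True
print(fallback_due(11, 120, 0, 1))  # False
```

The last call shows why single-address hosts are unchanged: with no next candidate, only the full slot timeout applies.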

back_connect_fallback_due() (the deadline decision) and newhttp_addr()'s
address selection are unit-tested through the DNS mock.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* Add HTTRACK_DEBUG_RESOLVE and a deterministic connect-fallback test

Exercising the connect fallback needs a host that resolves to a dead address
first and a live one next, deterministically and offline. A true SYN
black-hole can't be simulated without root, but a refused address can.

HTTRACK_DEBUG_RESOLVE="host:ip[,ip...]" pins a host's resolution to a fixed
address list (curl --resolve style), reusing the PR1 resolver seam: an
addrinfo backend that synthesizes the listed addresses for the named host and
delegates other hosts to libc (copying into its own allocations so one
freeaddrinfo frees both). It is a debug/test hook, inert unless the env var is
set, and IPv6-build-only like the rest of the resolver list.
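
For illustration only, one way to parse the "host:ip[,ip...]" form in Python. The actual parser is C code not shown in this diff; `parse_debug_resolve` is a hypothetical name, and splitting on the first colon (so IPv6 literals can follow unbracketed) is an assumption.

```python
import ipaddress
from typing import List, Tuple


def parse_debug_resolve(spec: str) -> Tuple[str, List[str]]:
    """Split 'host:ip[,ip...]' into the host and its pinned address list."""
    # Split on the first colon only, so IPv6 literals may contain more colons.
    host, sep, addr_list = spec.partition(":")
    if not sep or not host or not addr_list:
        raise ValueError("expected host:ip[,ip...]")
    # ip_address() rejects anything that is not a literal IPv4/IPv6 address.
    addrs = [str(ipaddress.ip_address(a.strip())) for a in addr_list.split(",")]
    return host, addrs


print(parse_debug_resolve("example.test:127.0.0.2,127.0.0.1"))
# ('example.test', ['127.0.0.2', '127.0.0.1'])
```

The address order is preserved, which is what lets a test place a refused address first and a live one second.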

The new local crawl test binds the server to 127.0.0.1 and resolves a host to
127.0.0.2 (refused) then 127.0.0.1: the mirror only succeeds via the fallback.
A V6_SUPPORT substitution (mirroring HTTPS_SUPPORT) lets it skip on non-IPv6
builds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 20:56:18 +02:00
16 changed files with 411 additions and 117 deletions

@@ -232,30 +232,42 @@ jobs:
deb:
name: deb package (lintian)
runs-on: ubuntu-24.04
# Build and gate inside Debian sid, the upload target. A Debian dpkg-deb
# produces archive-legal xz members (an Ubuntu host defaults to zstd, which
# the archive's lintian rejects), and sid's lintian carries the same
# data-driven checks (embedded-lib fingerprints and the like) the buildds and
# UDD apply -- so issues surface here instead of after upload.
container: debian:sid
steps:
- uses: actions/checkout@v6
with:
submodules: recursive
- name: Install packaging toolchain
run: |
set -euo pipefail
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
apt-get update
apt-get install -y --no-install-recommends \
ca-certificates git \
build-essential autoconf automake libtool autoconf-archive \
zlib1g-dev libssl-dev \
debhelper devscripts lintian fakeroot
- uses: actions/checkout@v6
with:
submodules: recursive
# --unsigned: CI has no GPG key (also skips the release sig/checksums).
# debuild builds every package, then lintian gates on errors.
# mkdeb builds every package then runs the lintian gate (--fail-on=error,
# warning); debuild runs the packaged test pass.
#
# DEB_BUILD_OPTIONS trims work CI does not need (release builds via
# mkdeb.sh are untouched): noautodbgsym drops the -dbgsym packages whose
# LTO payloads are slow to compress and that CI never ships; parallel uses
# every core. We let debuild run its test pass -- the only one now that
# mkdeb no longer runs its own -- so CI exercises the packaged tests.
- name: Build Debian packages
# every core.
- name: Build and lint Debian packages
run: |
set -euo pipefail
# The workspace volume is owned by the host runner uid, but the
# container runs as root, so mkdeb's git calls (superproject and the
# coucal submodule) trip "dubious ownership"; mark them all safe.
git config --global --add safe.directory "*"
export DEB_BUILD_OPTIONS="noautodbgsym parallel=$(nproc)"
bash tools/mkdeb.sh --unsigned --no-release-artifacts

@@ -1,3 +1,8 @@
# The shared libraries ship without a versioned symbols control file (ABI is
# tracked via the SONAME plus a >= upstream-version dependency, see debian/rules).
libhttrack3: no-symbols-control-file usr/lib/*
# Bundled, locally patched minizip (src/minizip): it adds a zipFlush() API the
# system libminizip lacks (htscache.c flushes the cache .zip so an interrupted
# crawl leaves a valid archive), plus Android/old-zlib portability fixes.
libhttrack3: embedded-library *libminizip*

debian/proxytrack.lintian-overrides (new file, 3 lines)

@@ -0,0 +1,3 @@
# Statically linked against httrack's bundled, patched minizip (see src/minizip
# and libhttrack3's override): the zipFlush() API is absent from the system one.
proxytrack: embedded-library *libminizip*

@@ -2468,6 +2468,44 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
htsmain_free();
return err;
} break;
case 'N': { // url_savename name resolution: httrack -#N <fil>
// <content-type>
if (na + 2 < argc) {
lien_adrfilsave afs;
cache_back cache;
struct_back *sback;
hash_struct hash;
lien_back headers;
memset(&afs, 0, sizeof(afs));
strcpybuff(afs.af.adr, "www.example.com");
strcpybuff(afs.af.fil, argv[na + 1]);
memset(&cache, 0, sizeof(cache));
cache.hashtable = (void *) coucal_new(0);
sback = back_new(opt, opt->maxsoc * 32 + 1024);
hash_init(opt, &hash, opt->urlhack);
memset(&headers, 0, sizeof(headers));
headers.status = 0;
headers.r.statuscode = HTTP_OK;
strcpybuff(headers.r.contenttype, argv[na + 2]);
strcpybuff(headers.url_fil, argv[na + 1]);
url_savename(&afs, NULL, NULL, NULL, opt, sback, &cache,
&hash, 0, 0, &headers);
printf("savename: %s\n", afs.save);
htsmain_free();
return 0;
} else {
fprintf(
stderr,
"Option #N requires <fil> <content-type> arguments\n");
htsmain_free();
return 1;
}
} break;
case 'C': // list cache files : httrack -#C '*spid*.gif' will attempt to find the matching file
{
int hasFilter = 0;

@@ -4766,64 +4766,51 @@ int hts_read(htsblk * r, char *buff, int size) {
// -- DNS cache management --
// 'RX98
// 'capsule' containing only the cache
t_dnscache *hts_cache(httrackp * opt) {
// Free a DNS cache record (coucal value handler).
static void hts_cache_value_free(coucal_opaque arg, coucal_value value) {
void *record = value.ptr;
(void) arg;
freet(record);
}
// opt's DNS cache hashtable, created on first use. Records (t_dnscache*) are
// owned by the table and freed by hts_cache_value_free on coucal_delete.
coucal hts_cache(httrackp *opt) {
assertf(opt != NULL);
if (opt->state.dns_cache == NULL) {
opt->state.dns_cache = (t_dnscache *) malloct(sizeof(t_dnscache));
memset(opt->state.dns_cache, 0, sizeof(t_dnscache));
coucal cache = coucal_new(0);
coucal_set_name(cache, "dns_cache");
coucal_value_set_value_handler(cache, hts_cache_value_free, NULL);
opt->state.dns_cache = cache;
}
assertf(opt->state.dns_cache != NULL);
/* first entry is NULL */
assertf(opt->state.dns_cache->iadr == NULL);
return opt->state.dns_cache;
}
// Free DNS cache.
void hts_cache_free(t_dnscache *const root) {
if (root != NULL) {
t_dnscache *cache;
for(cache = root; cache != NULL; ) {
t_dnscache *const next = cache->next;
cache->next = NULL;
freet(cache);
cache = next;
}
}
}
// lock le cache dns pour tout opération d'ajout
// plus prudent quand plusieurs threads peuvent écrire dedans..
// -1: status? 0: libérer 1:locker
// MUST BE LOCKED
// MUST BE LOCKED (coucal is not internally serialized vs FTP/web threads)
// Look up iadr in the DNS cache, filling out[0..min(count,max)-1].
// Returns: -1 not yet tested; 0 negative-cached (not in DNS); >0 address count.
static int hts_ghbn_all(const t_dnscache *cache, const char *const iadr,
static int hts_ghbn_all(coucal cache, const char *const iadr,
SOCaddr *const out, const int max) {
void *ptr;
assertf(out != NULL);
assertf(iadr != NULL);
if (*iadr == '\0') {
return -1;
}
/* first entry is empty */
if (cache->iadr == NULL) {
cache = cache->next;
}
for(; cache != NULL; cache = cache->next) {
assertf(cache != NULL);
assertf(cache->iadr != NULL);
assertf(cache->iadr == (const char*) cache + sizeof(t_dnscache));
if (strcmp(cache->iadr, iadr) == 0) { // ok trouvé
int i;
if (coucal_read_pvoid(cache, iadr, &ptr)) { // ok trouvé
const t_dnscache *const record = (const t_dnscache *) ptr;
int i;
assertf(cache->host_count <= HTS_MAXADDRNUM);
for (i = 0; i < cache->host_count && i < max; i++) {
assertf(cache->host_length[i] <= sizeof(cache->host_addr[i]));
SOCaddr_copyaddr2(out[i], cache->host_addr[i], cache->host_length[i]);
}
return cache->host_count;
assertf(record->host_count <= HTS_MAXADDRNUM);
for (i = 0; i < record->host_count && i < max; i++) {
assertf(record->host_length[i] <= sizeof(record->host_addr[i]));
SOCaddr_copyaddr2(out[i], record->host_addr[i], record->host_length[i]);
}
return record->host_count;
}
return -1;
}
@@ -5088,7 +5075,7 @@ static int hts_dns_resolve_list_(httrackp *opt, const char *_iadr,
SOCaddr *const out, const int max,
const char **error) {
char BIGSTK iadr[HTS_URLMAXSIZE * 2];
t_dnscache *cache = hts_cache(opt); // adresse du cache
coucal cache = hts_cache(opt); // le cache dns
int count;
assertf(opt != NULL);
@@ -5109,13 +5096,10 @@ static int hts_dns_resolve_list_(httrackp *opt, const char *_iadr,
if (count >= 0) { // cache hit (0 == negative-cached)
return count;
} else { // not in the DNS cache: resolve it
const size_t iadr_len = strlen(iadr) + 1;
SOCaddr resolved[HTS_MAXADDRNUM];
t_dnscache *record;
int i;
// find queue
for(; cache->next != NULL; cache = cache->next) ;
#if DEBUGDNS
printf("resolving (not cached) %s\n", iadr);
#endif
@@ -5126,22 +5110,18 @@ static int hts_dns_resolve_list_(httrackp *opt, const char *_iadr,
DEBUG_W("gethostbyname done\n");
#endif
/* attempt to store new entry */
cache->next = malloct(sizeof(t_dnscache) + iadr_len);
if (cache->next != NULL) {
t_dnscache *const next = cache->next;
char *const block = (char*) cache->next;
char *const str = block + sizeof(t_dnscache);
memcpy(str, iadr, iadr_len);
next->iadr = str;
next->host_count = count;
/* attempt to store new entry (coucal owns it and dups the host key) */
record = malloct(sizeof(t_dnscache));
if (record != NULL) {
memset(record, 0, sizeof(*record));
record->host_count = count;
for (i = 0; i < count; i++) {
next->host_length[i] = SOCaddr_size(resolved[i]);
assertf(next->host_length[i] <= sizeof(next->host_addr[i]));
memcpy(next->host_addr[i], &SOCaddr_sockaddr(resolved[i]),
next->host_length[i]);
record->host_length[i] = SOCaddr_size(resolved[i]);
assertf(record->host_length[i] <= sizeof(record->host_addr[i]));
memcpy(record->host_addr[i], &SOCaddr_sockaddr(resolved[i]),
record->host_length[i]);
}
next->next = NULL;
coucal_add_pvoid(cache, iadr, record);
}
/* copy result to caller (cache store may have failed; result still valid)
@@ -6012,14 +5992,14 @@ HTSEXT_API void hts_free_opt(httrackp * opt) {
/* Cache */
if (opt->state.dns_cache != NULL) {
t_dnscache *root;
coucal root;
hts_mutexlock(&opt->state.lock);
root = opt->state.dns_cache;
opt->state.dns_cache = NULL;
hts_mutexrelease(&opt->state.lock);
hts_cache_free(root);
coucal_delete(&root); // frees records via hts_cache_value_free
}
/* Cancel chain */

@@ -147,9 +147,8 @@ struct OLD_htsblk {
#define HTS_DEF_FWSTRUCT_t_dnscache
typedef struct t_dnscache t_dnscache;
#endif
// One DNS cache record, stored as a coucal value keyed by hostname.
struct t_dnscache {
struct t_dnscache *next;
const char *iadr;
// resolved addresses, in resolver (RFC 6724) order; host_count==0 means the
// name does not resolve (negative cache). host_count<=HTS_MAXADDRNUM.
int host_count;
@@ -245,8 +244,9 @@ HTSEXT_API int check_hostname_dns(const char *const hostname);
int ftp_available(void);
#if HTS_DNSCACHE
void hts_cache_free(t_dnscache *const cache);
t_dnscache *hts_cache(httrackp * opt);
/* Return opt's DNS cache hashtable (hostname -> t_dnscache record), creating it
on first use. Records are owned by the table and freed on coucal_delete. */
coucal hts_cache(httrackp *opt);
#endif
// outils divers

@@ -760,9 +760,9 @@ int url_savename(lien_adrfilsave *const afs,
strcatbuff(fil, DEFAULT_HTML); // nommer page par défaut (à priori ici html depuis un proxy http)
}
}
// Changer extension?
// par exemple, php3 sera sauvé en html, cgi en html ou gif, xbm etc.. selon les cas
if (ext_chg && !opt->no_type_change) { // changer ext
// Change the extension? e.g. php3 saved as html, cgi as html or gif/xbm
// depending on the resolved type.
if (ext_chg && !opt->no_type_change) {
char *a = fil + strlen(fil) - 1;
if ((opt->debug > 1) && (opt->log != NULL)) {
@@ -774,11 +774,19 @@ int url_savename(lien_adrfilsave *const afs,
adr_complete, fil_complete, ext);
}
if (ext_chg == 1) {
// Cut the old extension only when it is empty (a bare trailing dot), the
// new one, or a recognized one; an unknown trailing ".token" (e.g.
// /article-1.884291, #115) is part of the name, not an extension.
const char *const old_ext = get_ext(catbuff, sizeof(catbuff), fil);
const int known_ext = !*old_ext || strfield2(old_ext, ext) ||
is_knowntype(opt, fil) || is_dyntype(old_ext) ||
ishtml_ext(old_ext) != -1;
while((a > fil) && (*a != '.') && (*a != '/'))
a--;
if (*a == '.')
*a = '\0'; // couper
strcatbuff(fil, "."); // recopier point
if (*a == '.' && known_ext)
*a = '\0'; // cut
strcatbuff(fil, "."); // re-add the dot
} else {
while((a > fil) && (*a != '/'))
a--;
@@ -786,7 +794,7 @@ int url_savename(lien_adrfilsave *const afs,
a++;
*a = '\0';
}
strcatbuff(fil, ext); // copier ext/nom
strcatbuff(fil, ext); // append ext/name
}
// Find the first '/' and the last '.'
{
@@ -1721,10 +1729,10 @@ char *url_savename_refname_fullpath(httrackp * opt, const char *adr,
StringBuff(opt->path_log), digest_filename);
}
/* remove refname if any */
void url_savename_refname_remove(httrackp * opt, const char *adr,
const char *fil) {
/* remove refname if any; HTS_TRUE if it was removed */
hts_boolean url_savename_refname_remove(httrackp *opt, const char *adr,
const char *fil) {
char *filename = url_savename_refname_fullpath(opt, adr, fil);
(void) UNLINK(filename);
return UNLINK(filename) == 0 ? HTS_TRUE : HTS_FALSE;
}

@@ -104,8 +104,9 @@ char *url_md5(char *digest_buffer, const char *fil_complete);
void url_savename_refname(const char *adr, const char *fil, char *filename);
char *url_savename_refname_fullpath(httrackp * opt, const char *adr,
const char *fil);
void url_savename_refname_remove(httrackp * opt, const char *adr,
const char *fil);
/* Remove the temp-ref for (adr,fil); HTS_TRUE if it was removed. */
hts_boolean url_savename_refname_remove(httrackp *opt, const char *adr,
const char *fil);
#endif
#endif

@@ -241,7 +241,7 @@ struct htsoptstate {
char *userhttptype;
int verif_backblue_done; /**< backblue.gif/fade.gif already emitted */
int verif_external_status;
t_dnscache *dns_cache; /**< DNS resolution cache */
coucal dns_cache; /**< DNS resolution cache: hostname -> t_dnscache record */
int dns_cache_nthreads; /**< number of in-flight DNS resolver threads */
/* HTML parsing state */
char _hts_errmsg[HTS_CDLMAXSIZE + 256]; /**< last engine error message */

@@ -3749,44 +3749,60 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
} // block
// HTTP error (e.g. 404, not found)
} else if ((r->statuscode == HTTP_PRECONDITION_FAILED)
|| (r->statuscode == HTTP_REQUESTED_RANGE_NOT_SATISFIABLE)
) { // Precondition Failed, c'est à dire pour nous redemander TOUT le fichier
if (fexist_utf8(heap(ptr)->sav)) {
remove(heap(ptr)->sav); // Eliminer
} else {
hts_log_print(opt, LOG_WARNING,
"Unexpected 412/416 error (%s) for %s%s, '%s' could not be found on disk",
r->msg, urladr(), urlfil(),
heap(ptr)->sav != NULL ? heap(ptr)->sav : "");
} else if ((r->statuscode == HTTP_PRECONDITION_FAILED) ||
(r->statuscode == HTTP_REQUESTED_RANGE_NOT_SATISFIABLE)) {
// 412/416: the resume partial is stale; re-get the whole file (#206)
lien_back *itemback = NULL;
int had_partial = 0;
int ref_existed = 0;
int ref_gone;
// Drop the temp-ref, its partial, and heap->sav so the re-get carries no
// Range; else back_add rebuilds the same Range and loops.
if (back_unserialize_ref(opt, heap(ptr)->adr, heap(ptr)->fil,
&itemback) == 0) {
had_partial = 1;
ref_existed = 1;
// best-effort: an orphaned partial cannot re-Range once the ref is gone
if (fexist_utf8(itemback->url_sav))
(void) UNLINK(fconv(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
itemback->url_sav));
back_clear_entry(itemback);
freet(itemback);
}
if (!fexist_utf8(heap(ptr)->sav)) { // Bien éliminé? (sinon on boucle..)
#if HDEBUG
printf("Partial content NOT up-to-date, reget all file for %s\n",
heap(ptr)->sav);
#endif
// don't re-record if the ref survived (it would re-Range and loop)
ref_gone =
url_savename_refname_remove(opt, heap(ptr)->adr, heap(ptr)->fil) ||
!ref_existed;
if (fexist_utf8(heap(ptr)->sav)) {
had_partial = 1;
remove(heap(ptr)->sav);
}
// Re-get once, only if a partial existed and both Range triggers are
// gone; a failed removal gives up rather than looping. range_used is
// unreliable (it does not survive the delayed-type two-pass).
if (had_partial && ref_gone && !fexist_utf8(heap(ptr)->sav)) {
hts_log_print(opt, LOG_DEBUG, "Partial file reget (%s) for %s%s",
r->msg, urladr(), urlfil());
// enregistrer le MEME lien
if (hts_record_link(opt, heap(ptr)->adr, heap(ptr)->fil, heap(ptr)->sav, "", "", NULL)) {
heap_top()->testmode = heap(ptr)->testmode; // mode test?
heap_top()->link_import = 0; // pas mode import
heap_top()->testmode = heap(ptr)->testmode;
heap_top()->link_import = 0;
heap_top()->depth = heap(ptr)->depth;
heap_top()->pass2 = max(heap(ptr)->pass2, numero_passe);
heap_top()->retry = heap(ptr)->retry;
heap_top()->premier = heap(ptr)->premier;
heap_top()->precedent = ptr;
//
// canceller lien actuel
error = 1;
hts_invalidate_link(opt, ptr); // invalidate hashtable entry
//
} else { // oups erreur, plus de mémoire!!
XH_uninit; // désallocation mémoire & buffers
hts_invalidate_link(opt, ptr); // invalidate hashtable entry
} else { // out of memory
XH_uninit;
return 0;
}
} else {
hts_log_print(opt, LOG_ERROR, "Can not remove old file %s", urlfil());
hts_log_print(opt, LOG_WARNING,
"Giving up on partial reget (%s) for %s%s", r->msg,
urladr(), urlfil());
error = 1;
}

tests/01_engine-savename.test (new executable file, 41 lines)

@@ -0,0 +1,41 @@
#!/bin/bash
#
set -euo pipefail
# Local save-name extension resolution (url_savename via -#N <fil> <content-type>).
# Asserts on the basename of "savename: <path>".
name() {
out="$(httrack -O /dev/null -#N "$1" "$2" | sed -n 's/^savename: //p')"
test "${out##*/}" == "$3" || {
echo "FAIL: '$1' '$2' -> '$out' (want '$3')"
exit 1
}
}
# #115: an unknown trailing ".token" is part of the name, keep it and append the type.
name '/article-1.884291' 'text/html' 'article-1.884291.html'
name '/news/story-12345.987654' 'text/html' 'story-12345.987654.html'
# Recognized extensions still collapse to the resolved type.
name '/page.php' 'text/html' 'page.html'
name '/page.asp' 'text/html' 'page.html'
name '/foo' 'text/html' 'foo.html'
# A bare trailing dot is not a tail to keep.
name '/page.' 'text/html' 'page.html'
# Soft-404 (#267/#408): a binary URL served as HTML is named .html.
name '/x.pdf' 'text/html' 'x.html'
name '/x.gif' 'text/html' 'x.html'
# Type agrees with the extension: keep it, no churn, no double extension.
name '/x.pdf' 'application/pdf' 'x.pdf'
name '/x.jpg' 'image/jpeg' 'x.jpg'
name '/x.html' 'text/html' 'x.html'
name '/x.js' 'application/x-javascript' 'x.js'
name '/types/data.json' 'application/json' 'data.json'
# Agreeing type must not rewrite the extension's casing (no strip-and-reappend).
name '/x.JPG' 'image/jpeg' 'x.JPG'

tests/20_local-resume-loop.test (new executable file, 113 lines)

@@ -0,0 +1,113 @@
#!/bin/bash
# Issue #206: a continue/update crawl looped forever when the resume Range got a
# 416. Pass 1 leaves a partial + temp-ref; pass 2 must terminate and not loop.
set -u
: "${top_srcdir:=..}"
testdir=$(cd "$(dirname "$0")" && pwd)
server="${testdir}/local-server.py"
command -v python3 >/dev/null || ! echo "python3 not found; skipping" || exit 77
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/httrack_206.XXXXXX") || exit 1
serverpid=
crawlpid=
cleanup() {
test -n "$crawlpid" && kill -9 "$crawlpid" 2>/dev/null
if test -n "$serverpid"; then
kill "$serverpid" 2>/dev/null
wait "$serverpid" 2>/dev/null
fi
rm -rf "$tmpdir"
}
trap cleanup EXIT HUP INT QUIT PIPE TERM
# --- start the server, discover its ephemeral port --------------------------
# RESUME_COUNTER gets a byte per /resume/blob.txt request (pass-2 delta bounds re-gets).
serverlog="${tmpdir}/server.log"
counter="${tmpdir}/blobcount"
RESUME_COUNTER="$counter" python3 "$server" --root "${testdir}/server-root" >"$serverlog" 2>&1 &
serverpid=$!
port=
for _ in $(seq 1 50); do
line=$(head -n1 "$serverlog" 2>/dev/null)
if test "${line%% *}" == "PORT"; then
port="${line#PORT }"
break
fi
kill -0 "$serverpid" 2>/dev/null || {
echo "server exited early: $(cat "$serverlog")"
exit 1
}
sleep 0.1
done
test -n "$port" || {
echo "could not discover server port"
exit 1
}
base="http://127.0.0.1:${port}"
which httrack >/dev/null || {
echo "could not find httrack"
exit 1
}
out="${tmpdir}/crawl"
mkdir "$out"
common=(-O "$out" --quiet --disable-security-limits --robots=0 --timeout=30 --retries=0)
refdir="${out}/hts-cache/ref"
# --- pass 1: crawl, interrupt once the blob download is underway -------------
printf '[pass 1: interrupt mid-download] ..\t'
httrack "${common[@]}" "${base}/resume/index.html" >"${tmpdir}/log1" 2>&1 &
crawlpid=$!
# Wait until blob.txt is requested, then SIGTERM so httrack's exit handler
# finalizes the cache and serializes the temp-ref.
for _ in $(seq 1 300); do
test -s "$counter" && break
kill -0 "$crawlpid" 2>/dev/null || break
sleep 0.1
done
sleep 0.5
kill -TERM "$crawlpid" 2>/dev/null
wait "$crawlpid" 2>/dev/null
crawlpid=
test -n "$(find "$refdir" -name '*.ref' 2>/dev/null)" || {
echo "FAIL: no temp-ref survived pass 1; cannot drive #206"
exit 1
}
echo "OK (temp-ref present)"
before=$(wc -c <"$counter" 2>/dev/null || echo 0)
# --- pass 2: --continue -> resume Range -> 416, bounded against the #206 loop -
# Kill pass 2 after a deadline (portable stand-in for `timeout`, absent on macOS).
printf '[pass 2: resume must terminate] ..\t'
HANG_RC=137 # 128 + SIGKILL
httrack "${common[@]}" --continue "${base}/resume/index.html" >"${tmpdir}/log2" 2>&1 &
crawlpid=$!
(sleep 30 && kill -9 "$crawlpid" 2>/dev/null) &
guard=$!
rc=0
wait "$crawlpid" 2>/dev/null || rc=$?
crawlpid=
kill "$guard" 2>/dev/null || true
wait "$guard" 2>/dev/null || true
if test "$rc" -eq "$HANG_RC"; then
echo "FAIL: pass 2 did not terminate (#206 resume->416 loop)"
exit 1
fi
echo "OK (terminated, rc=$rc)"
# The fix re-gets once (resume Range + range-less re-get = 2): the lower bound
# rejects a drop-the-link non-fix (1), the upper bound rejects the loop (many).
after=$(wc -c <"$counter" 2>/dev/null || echo 0)
hits=$((after - before))
printf '[bounded re-get count] ..\t'
if test "$hits" -lt 2; then
echo "FAIL: only ${hits} pass-2 request(s); the stale partial was not re-got"
exit 1
fi
if test "$hits" -gt 8; then
echo "FAIL: ${hits} pass-2 requests for blob.txt (resume is looping)"
exit 1
fi
echo "OK (${hits} requests)"

@@ -0,0 +1,11 @@
#!/bin/bash
#
# #157: a dotless, accented URL named .html on the first crawl must keep .html
# across an update -- not revert to the extensionless name.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--found 'intl/Instalação_CVS_no_Ubuntu.html' \
--not-found 'intl/Instalação_CVS_no_Ubuntu' \
httrack 'BASEURL/intl/index.html'

@@ -40,6 +40,7 @@ TESTS = \
01_engine-parse.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-savename.test \
01_engine-simplify.test \
01_engine-strsafe.test \
02_manpage-regen.test \
@@ -58,6 +59,8 @@ TESTS = \
16_local-assume.test \
17_local-empty-ct.test \
18_local-update.test \
19_local-connect-fallback.test
19_local-connect-fallback.test \
20_local-resume-loop.test \
21_local-intl-update.test
CLEANFILES = check-network_sh.cache

@@ -196,6 +196,15 @@ if test -n "$rerun"; then
exit 1
}
result "OK (update)"
# The update summary reports "files updated"; a fresh crawl never does. Assert
# it so a regression that bypasses the cache (re-crawls fresh) can't pass.
info "checking update used the cache"
if grep -aqE "mirror complete in .*files updated" "${out}/hts-log.txt"; then
result "OK"
else
result "update pass did not report cache activity"
exit 1
fi
fi
# --- discover the single host root (127.0.0.1_<port> or 127.0.0.1) -----------

@@ -15,6 +15,7 @@ stdlib only (http.server + ssl) -- no new build or runtime dependency.
import argparse
import os
import time
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote, unquote, urlsplit
@@ -176,6 +177,54 @@ class Handler(SimpleHTTPRequestHandler):
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
# --- special chars in URLs across an update (issue #157) ---------------
# A dotless, accented basename served as text/html (MediaWiki style). The
# name the first crawl picks (.html) must survive the update pass.
INTL_NAME = "Instalação_CVS_no_Ubuntu"
def route_intl_index(self):
self.send_html('\t<a href="%s">accented</a>\n' % self.INTL_NAME)
def route_intl_page(self):
self.send_raw(b"<html><body>accented page</body></html>\n", "text/html")
# resume / 416 loop (#206): the first GET stalls after a prefix so the crawl
# can be interrupted (partial + temp-ref); every later request is 416.
RESUME_PREFIX = b"PARTIAL-" + b"x" * 4096 # flushed before the stall
RESUME_LEN = len(RESUME_PREFIX) + 4096 # declared length never delivered
_resume_started = False
def route_resume_index(self):
self.send_html('\t<a href="blob.txt">blob</a>')
def route_resume(self):
counter = os.environ.get("RESUME_COUNTER")
if counter:
with open(counter, "a") as fp:
fp.write("x")
# First GET: stall mid-body so the crawl can be interrupted with a partial.
if not Handler._resume_started:
Handler._resume_started = True
self.send_response(200)
self.send_header("Content-Type", "image/png")
self.send_header("Content-Length", str(self.RESUME_LEN))
self.send_header("Accept-Ranges", "bytes")
self.end_headers()
if self.command != "HEAD":
self.wfile.write(self.RESUME_PREFIX)
self.wfile.flush()
try:
while True:
time.sleep(3600)
except OSError:
pass
return
self.send_response(416, "Requested Range Not Satisfiable")
self.send_header("Content-Type", "image/png")
self.send_header("Content-Range", "bytes */%d" % self.RESUME_LEN)
self.send_header("Content-Length", "0")
self.end_headers()
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
@@ -195,6 +244,10 @@ class Handler(SimpleHTTPRequestHandler):
"/types/style.css": route_types,
"/types/data.json": route_types,
"/types/gen.php": route_types,
"/intl/index.html": route_intl_index,
"/intl/" + INTL_NAME: route_intl_page,
"/resume/index.html": route_resume_index,
"/resume/blob.txt": route_resume,
}
# --- dispatch ----------------------------------------------------------
@@ -202,7 +255,8 @@ class Handler(SimpleHTTPRequestHandler):
def dispatch(self):
self._set_cookies = []
path = urlsplit(self.path).path
handler = self.ROUTES.get(path)
# Match percent-encoded paths (accented #157 route) by their decoded form.
handler = self.ROUTES.get(path) or self.ROUTES.get(unquote(path))
if handler is not None:
handler(self)
return True