Compare commits

4 Commits

Author SHA1 Message Date
Xavier Roche
5501faa7b1 tests: lock "no error pages" (-o0) write-suppression (#17) (#425)
#17 (WinHTTrack 3.47-19, 2013) reported 404 error pages and 0-byte files
kept and unpurged with "no error pages" set. It does not reproduce on
current master/Linux: -o0 keeps 4xx/5xx bodies off disk and out of the
purge list, a genuine 0-byte 200 is correctly saved, and purge removes
stale files on update. The report's .html names were the extension-mangle
bug (Defect A, fixed in #408 — the reporter switched to HTTP/1.0 because
binaries were renamed .html); the settings-revert-on-update path is fixed
by the hts_tristate option work (4549ec3, #413).

Add an /errpage/ route group to local-server.py and 23_local-errpage.test
locking -o0 suppression with an -o1 control. Negative-control verified:
neutering the errpage gate (htsparse.c:3902) makes the test fail.
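
The write-suppression rule locked by this test can be modelled in a few lines. This is a toy sketch, not httrack's code: `should_write` and its parameters are hypothetical names, and it encodes only what the message above states (with -o0, 4xx/5xx bodies stay off disk; a 0-byte 200 is still saved; with -o1 the error page is written).

```python
def should_write(status: int, error_pages: bool) -> bool:
    """Toy model of the -o0 / -o1 write decision (hypothetical helper).

    error_pages=False stands for -o0, error_pages=True for -o1.
    """
    if status >= 400:
        # 4xx/5xx bodies reach disk only when error pages are enabled.
        return error_pages
    # Any other response is saved regardless of body size,
    # so a genuine 0-byte 200 is kept.
    return True
```

Under this model the -o1 run acts as the control: the same 404 that is suppressed with -o0 must appear on disk with -o1, which is what the paired test invocations check.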

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 18:02:28 +02:00
Xavier Roche
6322b6fb1f Lock --tolerant (-%B) on broken Content-Length, and fix an OOB it surfaced (#32/#41) (#424)
* tests: lock --tolerant (-%B) behavior on broken Content-Length (#32/#41)

A response whose Content-Length disagrees with the bytes actually sent
warns "bogus state (broken size)" and is skipped from the cache, so it is
re-fetched and re-warned on every run. --tolerant (-%B) already accepts
such responses; either way the file reaches disk. Pin that contract with a
local-server /size route (declares a length two bytes short of the body)
and a test asserting the warning fires by default and is silenced under
-%B, with the file present in both passes.
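
The contract described above can be written down as a small decision table. The sketch below is a model of the commit message, not httrack's implementation: `Outcome` and `handle_body` are hypothetical names, and the "cached under -%B" case is inferred from the phrase "skips the cache unless -%B" in the test server's comment.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    warned: bool   # "bogus state (broken size)" appears in the log
    cached: bool   # response is stored in the cache
    on_disk: bool  # file is written to the mirror


def handle_body(declared_length: int, received: int, tolerant: bool) -> Outcome:
    """Model the default and --tolerant (-%B) handling of a size mismatch."""
    mismatch = declared_length != received
    if mismatch and not tolerant:
        # Default: warn and skip the cache, but still write the file.
        return Outcome(warned=True, cached=False, on_disk=True)
    # Sizes agree, or -%B accepts the response as-is.
    return Outcome(warned=False, cached=True, on_disk=True)
```

With the /size route's 100-byte body and a declared length of 98, the default pass corresponds to `handle_body(98, 100, tolerant=False)` and the -%B pass to `handle_body(98, 100, tolerant=True)`.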

Adds --log-found/--log-not-found ERE assertions on hts-log.txt to
local-crawl.sh for the warning checks.
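
For readers more at home in Python than in shell, the two log assertions behave roughly like the sketch below. It is an approximation under stated assumptions: `log_matches` is a hypothetical helper, Python's `re` syntax is not POSIX ERE (the simple patterns used by these tests behave the same in both), and the sample log line is invented for illustration.

```python
import re


def log_matches(pattern: str, log_text: str) -> bool:
    """Return True when any line of the log matches, like `grep -qE`."""
    rx = re.compile(pattern)
    return any(rx.search(line) for line in log_text.splitlines())


# Invented log content; a real hts-log.txt will look different.
sample_log = (
    "mirror launched\n"
    "Warning: bogus state (broken size) at link /size/oversize.bin\n"
)

found = log_matches(r"bogus state \(broken size", sample_log)   # --log-found
absent = not log_matches(r"bogus state", "mirror launched\n")    # --log-not-found
```

The parenthesis is escaped in the pattern because an unescaped `(` opens a group in both ERE and Python regular expressions.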

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* htslib: fix global-buffer-overflow in get_httptype_sized on empty filename

get_httptype_sized() set a = fil + strlen(fil) - 1, then dereferenced *a
in the extension scan before the a > fil bound was checked, so an empty
fil ("") read one byte before the string. istoobig() passes a literal ""
to is_hypertext_mime() whenever it classifies by mime alone (the quota
check in back_checksize), so any octet-stream-ish download hit it. Bound
the loop and the dot test before dereferencing.

Latent (an OOB read of one .rodata byte); surfaced under ASan by the new
22_local-broken-size.test, whose oversize.bin is application/octet-stream.
Adds a direct empty-fil case to the -#7 basic_selftests block as a fast,
deterministic leaf-level regression (it aborts under ASan on the old code).
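
The ordering bug is easy to reproduce in miniature. The sketch below is a Python analogue, not the C code: indices stand in for pointers, `IndexError` stands in for the out-of-bounds read, and both function names are hypothetical.

```python
from typing import Optional


def extension_of(fil: str) -> Optional[str]:
    """Bounded scan: check the index before looking at the character."""
    a = len(fil) - 1  # -1 for an empty name, like fil + strlen(fil) - 1
    while a > 0 and fil[a] not in "./":
        a -= 1
    # a >= 0 guards the empty-name case before the dot test.
    if a >= 0 and fil[a] == "." and len(fil) - a < 32:
        return fil[a + 1:]
    return None


def extension_of_unbounded(fil: str) -> Optional[str]:
    """Pre-fix ordering: the character is read before the bound is checked."""
    a = len(fil) - 1
    while fil[a] not in "./" and a > 0:  # fil[-1] on "" raises IndexError
        a -= 1
    if fil[a] == "." and len(fil) - a < 32:
        return fil[a + 1:]
    return None
```

Both versions agree on non-empty names; only the empty string separates them, which is why a direct empty-`fil` self-test makes a fast regression check.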

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:18:06 +02:00
Xavier Roche
58f368a91a tests: lock special-char URL naming across an update (#157) (#423)
#157 reported accented URLs (pt-BR MediaWiki) losing their .html extension
on an update pass, observed with 3.49-2 on Windows. It does not reproduce on
current master: the update resolves the cached content-type and re-applies
.html consistently, for UTF-8 and ISO-8859-1 sources, raw Latin-1 href bytes,
either percent-encoding case, and dotted tails. The original symptom was a
Windows codepage vs UTF-8 X-Save filename mismatch that cannot occur on a
UTF-8 filesystem.

Add a regression test that locks the invariant: a dotless, accented basename
served as text/html, crawled then updated, must keep its .html name and not
leave an extensionless sibling.
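
Stated as a predicate over the set of saved paths, the invariant looks like the sketch below. `naming_invariant_holds` is a hypothetical helper written for illustration; the actual test expresses the same two conditions through `--found` and `--not-found`.

```python
from typing import Set


def naming_invariant_holds(saved: Set[str], base: str) -> bool:
    """True when base.html exists and no extensionless sibling was left."""
    return (base + ".html") in saved and base not in saved


base = "intl/Instalação_CVS_no_Ubuntu"
ok = naming_invariant_holds({"intl/index.html", base + ".html"}, base)
regressed = naming_invariant_holds({"intl/index.html", base + ".html", base}, base)
```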

Also assert in the --rerun harness that the update pass reported "files
updated" (a fresh crawl never does), so a regression that bypasses the cache
and silently re-crawls fresh can no longer pass the update tests.

Closes #157

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:35:55 +02:00
Xavier Roche
c97b3e233e Stop the 412/416 partial-reget loop on continue/update (#206) (#422)
On resume, the Range request is rebuilt by back_add from a temp-ref keyed on
(adr,fil) that records the partial download's real save name. A 412/416
("Range Not Satisfiable") means that partial is stale and the whole file
must be re-fetched. The handler only removed heap->sav, so when the resume
pass recomputed a save name different from the temp-ref's (the default
delayed-type machinery renames freely), the partial was never cleared:
back_add re-sent the same Range, earned the same 416, and the link was
re-recorded forever, growing the scan counter without bound.

Clear the whole partial wherever it lives -- the temp-ref and the file it
points at, plus heap->sav -- so the re-record falls through to a plain full
GET. Re-get only when there was a partial to discard and both Range triggers
(the ref and the on-disk file) are actually gone; once they are, a fresh 416
with nothing left to drop means the whole-file GET itself failed, so the link
gives up cleanly instead of re-queueing. A failed removal (read-only or full
cache) also gives up rather than looping, since back_add would otherwise
re-Range the surviving ref; url_savename_refname_remove now reports the
removal result so the handler can tell. (The request's range_used flag would
be the natural one-shot signal, but it does not survive the delayed-type
two-pass, so we key off the partial instead.)
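
The termination argument can be checked with a toy model. The sketch below is not the handler's code: `on_range_error` and `simulate` are hypothetical names, and the model assumes a server that answers every request with 416, as the regression test does.

```python
from typing import List


def on_range_error(had_partial: bool, ref_removed: bool, file_removed: bool) -> str:
    """Decide what to do after a 412/416, following the rule described above."""
    if had_partial and ref_removed and file_removed:
        return "reget"    # both Range triggers are gone: plain full GET
    return "give_up"      # nothing to drop, or removal failed


def simulate(removal_works: bool, max_steps: int = 10) -> List[str]:
    """Run the decision against a server that always answers 416."""
    partial = True
    actions: List[str] = []
    for _ in range(max_steps):
        had_partial = partial
        removed = had_partial and removal_works
        if removed:
            partial = False
        action = on_range_error(had_partial, removed, removed)
        actions.append(action)
        if action == "give_up":
            break
    return actions
```

In this model `simulate(True)` performs one re-get and then gives up, and `simulate(False)` gives up immediately, so neither path can loop.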

tests/20_local-resume-loop.test drives it offline: pass 1 is interrupted
(SIGTERM, so the exit handler finalizes the cache and the temp-ref) to leave
a partial, then pass 2 --continue gets 416 on every resume request. A
portable watchdog kills pass 2 if it loops; the test asserts it terminates
and attempts exactly one whole-file re-get (2 <= requests <= 8). It fails on
the pre-fix handler (loops) and on a re-get that silently drops the link.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 21:12:40 +02:00
8 changed files with 144 additions and 5 deletions

@@ -353,6 +353,14 @@ static void basic_selftests(void) {
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"noextfile", 1) == 1);
assertf(strcmp(r.contenttype, "application/octet-stream") == 0);
// empty fil: no extension to scan; must not over-read before the string.
// flag==0 -> 0 (nothing written), flag==1 -> octet-stream.
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype), "",
0) == 0);
assertf(r.contenttype[0] == '\0');
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype), "",
1) == 1);
assertf(strcmp(r.contenttype, "application/octet-stream") == 0);
// a user --assume rule with an empty value matches but writes nothing:
// get_userhttptype returns 1 with the buffer empty, so get_httptype_sized
// must still report 0 (callers test the return like the old

@@ -4177,9 +4177,10 @@ HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
/* Check html -> text/html */
const char *a = fil + strlen(fil) - 1;
-while((*a != '.') && (*a != '/') && (a > fil))
+/* a < fil when fil is empty: bound before dereferencing */
+while ((a > fil) && (*a != '.') && (*a != '/'))
a--;
-if (*a == '.' && strlen(a) < 32) {
+if (a >= fil && *a == '.' && strlen(a) < 32) {
int j = 0;
a++;

@@ -0,0 +1,11 @@
#!/bin/bash
#
# #157: a dotless, accented URL named .html on the first crawl must keep .html
# across an update -- not revert to the extensionless name.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --rerun \
--found 'intl/Instalação_CVS_no_Ubuntu.html' \
--not-found 'intl/Instalação_CVS_no_Ubuntu' \
httrack 'BASEURL/intl/index.html'

tests/22_local-broken-size.test (executable file)

@@ -0,0 +1,17 @@
#!/bin/bash
# Issues #32/#41: a Content-Length that disagrees with the body warns "bogus
# state (broken size)" and skips the cache; -%B (tolerant) accepts it.
: "${top_srcdir:=..}"
# Default: warn, but the file is still written.
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'size/oversize.bin' \
--log-found 'bogus state \(broken size' \
httrack 'BASEURL/size/index.html'
# -%B (tolerant): no warning, file written.
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'size/oversize.bin' \
--log-not-found 'bogus state' \
httrack 'BASEURL/size/index.html' '-%B'

@@ -0,0 +1,19 @@
#!/bin/bash
# Issue #17: with "no error pages" (-o0), 4xx/5xx bodies must not be written;
# a genuine 0-byte 200 stays. Default (-o1) writes the error page. (#17's purge
# half also does not reproduce; the purge path is not exercised here.)
set -e
: "${top_srcdir:=..}"
# -o0: 404 suppressed, good page and the legit 0-byte 200 kept.
bash "$top_srcdir/tests/local-crawl.sh" --errors 1 \
--found 'errpage/good.html' \
--found 'errpage/empty.html' \
--not-found 'errpage/missing.html' \
httrack 'BASEURL/errpage/index.html' '-o0'
# Control -o1 (default): the 404 error page is written.
bash "$top_srcdir/tests/local-crawl.sh" --errors 1 \
--found 'errpage/missing.html' \
httrack 'BASEURL/errpage/index.html' '-o1'

@@ -60,6 +60,9 @@ TESTS = \
17_local-empty-ct.test \
18_local-update.test \
19_local-connect-fallback.test \
-20_local-resume-loop.test
+20_local-resume-loop.test \
+21_local-intl-update.test \
+22_local-broken-size.test \
+23_local-errpage.test
CLEANFILES = check-network_sh.cache

@@ -14,7 +14,9 @@
# Usage:
# bash local-crawl.sh [--tls] [--root DIR] \
# --errors N --files N --found PATH ... --directory PATH ... \
# --log-found REGEX ... --log-not-found REGEX ... \
# httrack BASEURL/some/path [httrack-args...]
# --log-found/--log-not-found grep (ERE) the crawl's hts-log.txt.
set -u
@@ -107,7 +109,7 @@ while test "$pos" -lt "$nargs"; do
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
---found | --not-found | --directory)
+--found | --not-found | --directory | --log-found | --log-not-found)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
@@ -196,6 +198,15 @@ if test -n "$rerun"; then
exit 1
}
result "OK (update)"
# The update summary reports "files updated"; a fresh crawl never does. Assert
# it so a regression that bypasses the cache (re-crawls fresh) can't pass.
info "checking update used the cache"
if grep -aqE "mirror complete in .*files updated" "${out}/hts-log.txt"; then
result "OK"
else
result "update pass did not report cache activity"
exit 1
fi
fi
# --- discover the single host root (127.0.0.1_<port> or 127.0.0.1) -----------
@@ -248,6 +259,22 @@ while test "$i" -lt "${#audit[@]}"; do
exit 1
fi
;;
--log-found)
i=$((i + 1))
info "checking log matches ${audit[$i]}"
if grep -aqE "${audit[$i]}" "${out}/hts-log.txt"; then result "OK"; else
result "not in log"
exit 1
fi
;;
--log-not-found)
i=$((i + 1))
info "checking log lacks ${audit[$i]}"
if grep -aqE "${audit[$i]}" "${out}/hts-log.txt"; then
result "present in log"
exit 1
else result "OK"; fi
;;
esac
i=$((i + 1))
done

@@ -177,6 +177,17 @@ class Handler(SimpleHTTPRequestHandler):
body, ctype = self.TYPE_MATRIX[path]
self.send_raw(body, ctype)
# --- special chars in URLs across an update (issue #157) ---------------
# A dotless, accented basename served as text/html (MediaWiki style). The
# name the first crawl picks (.html) must survive the update pass.
INTL_NAME = "Instalação_CVS_no_Ubuntu"
def route_intl_index(self):
self.send_html('\t<a href="%s">accented</a>\n' % self.INTL_NAME)
def route_intl_page(self):
self.send_raw(b"<html><body>accented page</body></html>\n", "text/html")
# resume / 416 loop (#206): the first GET stalls after a prefix so the crawl
# can be interrupted (partial + temp-ref); every later request is 416.
RESUME_PREFIX = b"PARTIAL-" + b"x" * 4096 # flushed before the stall
@@ -214,6 +225,39 @@ class Handler(SimpleHTTPRequestHandler):
self.send_header("Content-Length", "0")
self.end_headers()
# error pages / 0-byte files (#17): -o0 ("no error pages") must keep 4xx/5xx
# bodies off disk; a genuine 0-byte 200 is a valid file and stays.
def route_errpage_index(self):
self.send_html(
'\t<a href="good.html">good</a>\n'
'\t<a href="missing.html">missing</a>\n'
'\t<a href="empty.html">empty</a>\n'
)
def route_errpage_good(self):
self.send_raw(b"<html><body>good page</body></html>\n", "text/html")
def route_errpage_missing(self):
self.send_html("\t404 error body", status=404, extra_status="Not Found")
def route_errpage_empty(self):
self.send_raw(b"", "text/html")
# broken Content-Length (#32/#41): declared size != bytes sent. httrack
# warns "bogus state (broken size)" and skips the cache unless -%B.
def route_size_index(self):
self.send_html('\t<a href="oversize.bin">over</a>\n')
def route_size_oversize(self):
body = b"A" * 100
self.send_response(200)
self.send_header("Content-Type", "application/octet-stream")
self.send_header("Content-Length", str(len(body) - 2)) # lie: too short
self.send_header("Connection", "close")
self.end_headers()
if self.command != "HEAD":
self.wfile.write(body)
ROUTES = {
"/cookies/entrance.php": route_entrance,
"/cookies/second.php": route_second,
@@ -233,8 +277,16 @@ class Handler(SimpleHTTPRequestHandler):
"/types/style.css": route_types,
"/types/data.json": route_types,
"/types/gen.php": route_types,
"/intl/index.html": route_intl_index,
"/intl/" + INTL_NAME: route_intl_page,
"/resume/index.html": route_resume_index,
"/resume/blob.txt": route_resume,
"/size/index.html": route_size_index,
"/size/oversize.bin": route_size_oversize,
"/errpage/index.html": route_errpage_index,
"/errpage/good.html": route_errpage_good,
"/errpage/missing.html": route_errpage_missing,
"/errpage/empty.html": route_errpage_empty,
}
# --- dispatch ----------------------------------------------------------
@@ -242,7 +294,8 @@ class Handler(SimpleHTTPRequestHandler):
def dispatch(self):
self._set_cookies = []
path = urlsplit(self.path).path
-handler = self.ROUTES.get(path)
+# Match percent-encoded paths (accented #157 route) by their decoded form.
+handler = self.ROUTES.get(path) or self.ROUTES.get(unquote(path))
if handler is not None:
handler(self)
return True
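
The decoded-path fallback in `dispatch` can be tried in isolation. The sketch below is a standalone approximation: the `ROUTES` table and `lookup` function are simplified stand-ins for the handler's own, with a string in place of the route method.

```python
from urllib.parse import unquote, urlsplit

# Simplified stand-in for the handler's route table.
ROUTES = {"/intl/Instalação_CVS_no_Ubuntu": "route_intl_page"}


def lookup(raw_path: str):
    """Try the path as sent, then its percent-decoded form."""
    path = urlsplit(raw_path).path  # drops any query string
    return ROUTES.get(path) or ROUTES.get(unquote(path))


utf8_hit = lookup("/intl/Instala%C3%A7%C3%A3o_CVS_no_Ubuntu")
latin1_miss = lookup("/intl/Instala%E7%E3o_CVS_no_Ubuntu")
```

`unquote` decodes as UTF-8 by default and replaces invalid sequences, so in this sketch a path percent-encoded as ISO-8859-1 does not match the UTF-8 key; passing `encoding="latin-1"` would be needed to match such requests.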