mirror of
https://github.com/xroche/httrack.git
synced 2026-06-25 03:27:22 +03:00
#409 distinguished "the server declared text/html" from "no Content-Type, defaulted to text/html" with a new htsblk.contenttype_given flag, so a binary-looking URL that really serves HTML is saved .html while a typeless response keeps its URL extension. That worked on a fresh crawl but had two costs: the flag was never persisted, so on --update the cache read it as unset and the names reverted (report.html became report.pdf again, and the two passes disagreed); and it was an installed-struct ABI break (soname 4, libhttrack4).

Replace the flag with a sentinel: when no Content-Type is received, store "unknown/unknown" as the type instead of text/html. The sentinel is treated as html for every type test (added to is_html_mime_type), so parsing, storage and filtering of a typeless response are unchanged; only the naming code (wire_patches_ext) reads it as "no declared type" and keeps the URL extension. Because the type string rides the cache, an update reads the same sentinel and names consistently -- the revert is fixed at the source.

The sentinel never reaches a consumer as a real type: a single helper, hts_effective_mime(), maps it back to text/html wherever a stored type is derived (give_mimext) or emitted/persisted -- the httrack stdout serve, the ProxyTrack live serve, and the ProxyTrack .arc export (both the replayed response header and the index record). The .arc export was caught by an adversarial spill audit; without the map a typeless page archived via proxytrack would carry Content-Type: unknown/unknown.

Since the sentinel makes contenttype_given unnecessary, #409's ABI break is undone: the field is removed, soname returns to 3, and the Debian package reverts libhttrack4 -> libhttrack3. soname 4 was never released (Debian NEW carries libhttrack3), so this re-aligns master with the archive rather than flip-flopping anything downstream.
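The sentinel scheme described above can be sketched in a few lines of C. This is an illustrative sketch only, not the code from the commit: the names is_html_mime_type and hts_effective_mime come from the message, but their signatures and bodies here are assumptions, and has_declared_type and pick_ext are hypothetical helpers standing in for the naming logic.

```c
#include <string.h>

/* Sentinel stored when no Content-Type was received (value from the commit message). */
#define MIME_UNKNOWN "unknown/unknown"

/* ASSUMPTION: simplified stand-in; the real is_html_mime_type() may
   recognise more types and has its own signature. */
int is_html_mime_type(const char *mime) {
  return strcmp(mime, "text/html") == 0
      || strcmp(mime, "application/xhtml+xml") == 0
      || strcmp(mime, MIME_UNKNOWN) == 0; /* sentinel counts as html */
}

/* ASSUMPTION: guessed signature. Maps the sentinel back to text/html so
   it is never emitted or persisted as a real type. */
const char *hts_effective_mime(const char *mime) {
  return strcmp(mime, MIME_UNKNOWN) == 0 ? "text/html" : mime;
}

/* HYPOTHETICAL helper: only naming code asks this question. */
int has_declared_type(const char *mime) {
  return strcmp(mime, MIME_UNKNOWN) != 0;
}

/* HYPOTHETICAL naming rule: a declared html type forces "html";
   a typeless response keeps the extension taken from the URL. */
const char *pick_ext(const char *mime, const char *url_ext) {
  if (!has_declared_type(mime))
    return url_ext;
  return is_html_mime_type(mime) ? "html" : url_ext;
}
```

Under these assumptions, pick_ext("text/html", "pdf") yields "html" while pick_ext(MIME_UNKNOWN, "pdf") yields "pdf", and because the sentinel is an ordinary type string it survives any cache that stores the type, which is the property the flag lacked.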
Tests: 18_local-update re-mirrors and asserts the names survive the update pass; 15_local-types gains a notype.html negative control; 17_local-empty-ct stays green. Full make check: 27 pass, 0 fail.

One accepted behavior change: a mime filter matching exactly text/html no longer matches a typeless response (its type is the sentinel, html-ish but not literally text/html); the response is still parsed and crawled as html.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
4 lines
82 B
Plaintext
usr/lib/*/libhttrack.so.3*
usr/lib/*/libhtsjava.so.3*
usr/share/httrack/templates