Mirror of https://github.com/xroche/httrack.git, synced 2026-06-24 19:17:31 +03:00
#409 distinguished "the server declared text/html" from "no Content-Type, defaulted to text/html" with a new htsblk.contenttype_given flag, so a binary-looking URL that really serves HTML is saved .html while a typeless response keeps its URL extension. That worked on a fresh crawl but had two costs: the flag was never persisted, so on --update the cache read it as unset and the names reverted (report.html became report.pdf again, and the two passes disagreed); and it was an installed-struct ABI break (soname 4, libhttrack4).

Replace the flag with a sentinel: when no Content-Type is received, store "unknown/unknown" as the type instead of text/html. The sentinel is treated as html for every type test (added to is_html_mime_type), so parsing, storage and filtering of a typeless response are unchanged; only the naming code (wire_patches_ext) reads it as "no declared type" and keeps the URL extension. Because the type string rides the cache, an update reads the same sentinel and names consistently -- the revert is fixed at the source.

The sentinel never reaches a consumer as a real type: a single helper, hts_effective_mime(), maps it back to text/html wherever a stored type is derived (give_mimext) or emitted/persisted -- the httrack stdout serve, the ProxyTrack live serve, and the ProxyTrack .arc export (both the replayed response header and the index record). The .arc export was caught by an adversarial spill audit; without the map a typeless page archived via proxytrack would carry Content-Type: unknown/unknown.

Since the sentinel makes contenttype_given unnecessary, #409's ABI break is undone: the field is removed, soname returns to 3, and the Debian package reverts libhttrack4 -> libhttrack3. soname 4 was never released (Debian NEW carries libhttrack3), so this re-aligns master with the archive rather than flip-flopping anything downstream.

Tests: 18_local-update re-mirrors and asserts the names survive the update pass; 15_local-types gains a notype.html negative control; 17_local-empty-ct stays green. Full make check: 27 pass, 0 fail.

One accepted behavior change: a mime filter matching exactly text/html no longer matches a typeless response (its type is the sentinel, html-ish but not literally text/html); the response is still parsed and crawled as html.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
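The sentinel scheme described in the commit message can be illustrated with a minimal C sketch. This is not the httrack source: the names `SKETCH_MIME_UNKNOWN`, `sketch_is_html`, `sketch_effective_mime` and `sketch_pick_ext` are hypothetical stand-ins for the roles the message assigns to is_html_mime_type, hts_effective_mime() and the naming code, and the real signatures and type lists may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name for the sentinel stored when no Content-Type arrives. */
#define SKETCH_MIME_UNKNOWN "unknown/unknown"

/* Type test: the sentinel counts as HTML, so a typeless response is still
   parsed and crawled like a page. (Assumed, shortened list of HTML types.) */
static int sketch_is_html(const char *mime) {
  assert(mime != NULL);
  return strcmp(mime, "text/html") == 0
      || strcmp(mime, "application/xhtml+xml") == 0
      || strcmp(mime, SKETCH_MIME_UNKNOWN) == 0;
}

/* Map the sentinel back to text/html before a type is emitted or persisted,
   so no consumer ever sees "unknown/unknown" as a real type. */
static const char *sketch_effective_mime(const char *mime) {
  assert(mime != NULL);
  return strcmp(mime, SKETCH_MIME_UNKNOWN) == 0 ? "text/html" : mime;
}

/* Naming decision: only this path treats the sentinel as "no declared type". */
static const char *sketch_pick_ext(const char *mime, const char *url_ext) {
  assert(mime != NULL && url_ext != NULL);
  if (strcmp(mime, SKETCH_MIME_UNKNOWN) == 0)
    return url_ext;   /* typeless response: keep the URL extension */
  if (sketch_is_html(mime))
    return "html";    /* declared HTML wins over a binary-looking URL */
  return url_ext;     /* simplification: other types keep the URL extension */
}
```

Under these assumptions, `sketch_pick_ext("text/html", "pdf")` yields `html` while `sketch_pick_ext("unknown/unknown", "pdf")` yields `pdf`. Because the decision depends only on the stored type string, an update pass that reads the same string from the cache reaches the same name. The accepted behavior change also falls out of the sketch: an exact string comparison against `text/html` does not match the sentinel, even though `sketch_is_html` accepts it.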
103 lines
3.9 KiB
Plaintext
Source: httrack
Section: web
Priority: optional
Maintainer: Xavier Roche <roche@httrack.com>
Standards-Version: 4.7.0
Build-Depends: debhelper-compat (= 13), autoconf, autoconf-archive, automake, libtool, zlib1g-dev, libssl-dev
Rules-Requires-Root: no
Homepage: http://www.httrack.com
Vcs-Git: https://github.com/xroche/httrack.git
Vcs-Browser: https://github.com/xroche/httrack

Package: httrack
Architecture: any
Multi-Arch: foreign
Depends: ${misc:Depends}, ${shlibs:Depends}
Suggests: webhttrack, httrack-doc
Description: Copy websites to your computer (Offline browser)
 HTTrack is an offline browser utility, allowing you to download a World
 Wide website from the Internet to a local directory, building recursively
 all directories, getting html, images, and other files from the server to
 your computer.
 .
 HTTrack arranges the original site's relative link-structure. Simply
 open a page of the "mirrored" website in your browser, and you can
 browse the site from link to link, as if you were viewing it online.
 HTTrack can also update an existing mirrored site, and resume
 interrupted downloads. HTTrack is fully configurable, and has an
 integrated help system.

Package: webhttrack
Architecture: any
Multi-Arch: foreign
Depends: ${misc:Depends}, ${shlibs:Depends}, webhttrack-common, sensible-utils, firefox-esr | chromium | www-browser
Replaces: webhttrack-common (<< 3.43.9-2)
Breaks: webhttrack-common (<< 3.43.9-2)
Suggests: httrack, httrack-doc
Enhances: httrack
Description: Copy websites to your computer, httrack with a Web interface
 WebHTTrack is an offline browser utility, allowing you to download a World
 Wide website from the Internet to a local directory, building recursively
 all directories, getting html, images, and other files from the server to
 your computer, using a step-by-step web interface.
 .
 WebHTTrack arranges the original site's relative link-structure. Simply
 open a page of the "mirrored" website in your browser, and you can
 browse the site from link to link, as if you were viewing it online.
 HTTrack can also update an existing mirrored site, and resume
 interrupted downloads. WebHTTrack is fully configurable, and has an
 integrated help system.
 .
 Snapshots: http://www.httrack.com/page/21/

Package: webhttrack-common
Architecture: all
Multi-Arch: foreign
Depends: ${misc:Depends}
Description: webhttrack common files
 This package is the common files of webhttrack, website copier and
 mirroring utility

Package: libhttrack3
Architecture: any
Multi-Arch: same
Section: libs
Depends: ${misc:Depends}, ${shlibs:Depends}
Replaces: libhttrack2, httrack (<< 3.49.8-2~)
Breaks: libhttrack2, httrack (<< 3.49.8-2~)
Description: Httrack website copier library
 This package is the library part of httrack, website copier and mirroring
 utility

Package: libhttrack-dev
Architecture: any
Multi-Arch: same
Section: libdevel
Depends: ${misc:Depends}, ${shlibs:Depends}, zlib1g-dev
Description: Httrack website copier includes and development files
 This package adds supplemental files for using the httrack website copier
 library

Package: httrack-doc
Architecture: all
Multi-Arch: foreign
Section: doc
Depends: ${misc:Depends}
Description: Httrack website copier additional documentation
 This package adds supplemental documentation for httrack and webhttrack
 as a browsable html documentation

Package: proxytrack
Architecture: any
Multi-Arch: foreign
Depends: ${misc:Depends}, ${shlibs:Depends}
Suggests: squid, httrack
Description: Build HTTP Caches using archived websites copied by HTTrack
 ProxyTrack is a simple proxy server aimed to deliver content archived by
 HTTrack sessions. It can aggregate multiple download caches, for direct
 use (through any browser) or as an upstream cache slave server.
 This proxy can handle HTTP/1.1 proxy connections, and is able to reply to
 ICPv2 requests for an efficient integration within other cache servers,
 such as Squid. It can also handle transparent HTTP requests to allow
 cached live connections inside an offline network.