Compare commits

7 Commits

Author SHA1 Message Date
Xavier Roche
bd7e0989f6 Parse robots.txt with RFC 9309 Allow/Disallow precedence (#458)
The robots.txt handler only did substring Disallow matching against a flat
token blob: no Allow:, no path wildcards. Sites using "Disallow: /" plus
"Allow: /public/" were over-blocked, since Allow was never parsed.

Move the body parsing into robots_parse() (htsrobots.c) so the crawler and a
self-test can both feed it raw robots.txt. Rules are stored Allow/Disallow-tagged
and consulted with RFC 9309 precedence: the longest matching path pattern
wins, Allow breaking ties. Pattern matching supports '*' (any run) and a
trailing '$' (end-of-path anchor) via a linear two-pointer matcher with a
single resumable star position, so hostile patterns cannot trigger
exponential backtracking. Path matching is now case-sensitive per the RFC.
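
The precedence rule described above can be sketched in isolation. This is a
minimal illustration, not the code from htsrobots.c: the demo_rule struct and
demo_robots_allowed() are hypothetical names, rules are plain prefixes (no '*'
or '$'), and the real commit keeps rules as 'A'/'D'-tagged lines in a text blob.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule record, for illustration only. */
typedef struct {
  int allow;          /* 1 = Allow, 0 = Disallow */
  const char *prefix; /* plain path prefix; wildcards are left out here */
} demo_rule;

/* Returns 1 if path may be fetched, 0 if it is disallowed. */
static int demo_robots_allowed(const demo_rule *rules, size_t n,
                               const char *path) {
  size_t i, best_len = 0;
  int matched = 0, best_allow = 0;

  assert(rules != NULL || n == 0);
  assert(path != NULL);
  for (i = 0; i < n; i++) {
    const size_t len = strlen(rules[i].prefix);
    if (strncmp(path, rules[i].prefix, len) != 0)
      continue; /* this rule does not match the path */
    /* Longest pattern wins; on equal length, Allow beats Disallow. */
    if (!matched || len > best_len || (len == best_len && rules[i].allow)) {
      matched = 1;
      best_len = len;
      best_allow = rules[i].allow;
    }
  }
  /* No matching rule at all means the path is allowed. */
  return !matched || best_allow;
}
```

With the rules "Disallow: /" and "Allow: /public/", the path /public/x matches
both, and the longer Allow pattern decides; /private matches only the Disallow.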

robots_wizard is internal (not in DevIncludes_DATA, no HTSEXT_API; htsopt.h
holds only an opaque pointer), so the in-memory format changed without an ABI
break. Sitemap:/Crawl-delay: lines are tolerated but ignored, as before.

New -#test=robots self-test plus tests/01_engine-robots.test cover the
Allow-over-Disallow longest match, the equal-length Allow tie, '*'/'$'
wildcards, and httrack-group selection.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 09:07:54 +02:00
Xavier Roche
bd74ec7cab Advertise deflate in Accept-Encoding and decode it (#459)
The request Accept-Encoding offered only gzip even though the response
parser already recognized deflate/x-deflate. But the actual decode path
(hts_zunpack) used zlib's gzread, which only inflates gzip and copies any
deflate body through verbatim, so a deflate response would have been
written out still compressed. Advertising deflate without fixing that
would corrupt files.

Rewrite hts_zunpack to inflate via inflateInit2 with format detection:
gzip and zlib (RFC1950) auto-detect with +32 windowBits, everything else
is treated as raw deflate (RFC1951). Then add deflate to the advertised
list through a small hts_acceptencoding() helper shared with the test.
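
To make the detection step concrete, here is a rough sketch of header sniffing
written without zlib. It is an approximation under stated assumptions, not the
hts_zunpack code: demo_sniff() and the demo_fmt enum are hypothetical, and the
commit itself relies on zlib's own auto-detection (windowBits + 32) rather than
a hand-written check.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEMO_FMT_GZIP, DEMO_FMT_ZLIB, DEMO_FMT_RAW } demo_fmt;

/* Guess the container format from the first two bytes of a body. */
static demo_fmt demo_sniff(const unsigned char *b, size_t len) {
  assert(b != NULL || len == 0);
  if (len >= 2 && b[0] == 0x1f && b[1] == 0x8b)
    return DEMO_FMT_GZIP; /* RFC 1952 magic bytes */
  if (len >= 2 && (b[0] & 0x0f) == 8 && (b[0] >> 4) <= 7 &&
      ((unsigned) b[0] * 256u + b[1]) % 31u == 0)
    return DEMO_FMT_ZLIB; /* RFC 1950: CM=8, CINFO<=7, header check passes */
  return DEMO_FMT_RAW;    /* anything else: assume raw deflate (RFC 1951) */
}
```

A two-byte check like this can be fooled: a raw deflate stream that happens to
begin with 08 1D passes the zlib header test, which is the case the self-test's
forged "collision" input exercises, so a decoder needs a raw-deflate retry when
the zlib attempt fails.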

A new -#test=acceptencoding self-test asserts the advertised header
carries both gzip and deflate, and round-trips gzip, zlib and raw-deflate
bodies through hts_zunpack on disk. Both halves fail on the old binary.

Brotli is intentionally out of scope (new dependency, larger change).

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 08:54:03 +02:00
Xavier Roche
1ed8ffad64 tests: group zlib-dependent self-tests under 01_zlib-* (#460)
The MSan job runs the offline 01_engine-* self-tests but must skip any that
exercise the system zlib: uninstrumented libz floods MemorySanitizer with
false positives (MSan can't see libz initialize its own internal state, so
every byte deflate/inflate produces reads as "uninitialized"). The skip was a
grep -v -- '-cache' exclusion, a list that would grow with each new zlib test.

Rename the three cache tests to a 01_zlib-* prefix so the MSan job selects
01_engine-* with no exclusion list. They still run in the normal suite and
under ASan+UBSan (full make check), where uninstrumented libz is fine. The
deflate Accept-Encoding test (PR #459) follows the same convention.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 08:39:43 +02:00
Xavier Roche
b68de172fa Follow <source>/<track> as embedded near-links (#451) (#457)
src/srcset are already extracted from any element via hts_detect[], so
<source>/<track> URLs were captured. But hts_detect_embed only listed
<img src>, so at the recursion-depth boundary under --near these media
were treated as plain links and dropped, unlike <img>. Add a separate
HTML5 media table (keeping the legacy one clang-format-stable) and chain
the embed lookup through both.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 00:18:28 +02:00
Xavier Roche
aabfd34380 Refresh minor legacy constants (#453) (#456)
Three independent low-risk items under tracking issue #453:

- HTTP status enum/reason map: add 429 (Too Many Requests) and 451
  (Unavailable For Legal Reasons); appended to the HTTPStatusCode enum
  (installed header, tail-append only) and to infostatuscode_const().
- Drop the 1990s relic port 31337 from the local proxy bind fallback list.
- Pin a TLS protocol floor (TLS1.2 via SSL_CTX_set_min_proto_version, with an
  SSL_OP_NO_* fallback for OpenSSL < 1.1.0) so obsolete SSLv3/TLS1.0/1.1 aren't
  negotiated. No cert verification added: that remains by design.

Adds a -#test=status engine self-test (01_engine-status.test) asserting the
reason phrases for 429/451.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 00:12:17 +02:00
Xavier Roche
65ff9e0f11 Modernize the default User-Agent (#449) (#455)
The default UA was "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
a 25-year-old string naming a dead OS and trivially fingerprinted. Replace
it with an honest crawler UA carrying the HTTrack token and a reference URL:
"Mozilla/5.0 (compatible; HTTrack/3.x; +https://www.httrack.com/)".

Both the engine default (hts_create_opt) and the webhttrack mini-server
config default now share one HTS_DEFAULT_USER_AGENT macro, built from
HTTRACK_AFF_VERSION so the version token tracks releases without going stale.
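
The shared-macro approach rests on compile-time string-literal concatenation.
A small sketch, assuming a placeholder version string: DEMO_AFF_VERSION and its
value "3.x" are hypothetical stand-ins for HTTRACK_AFF_VERSION, whose real
value comes from the build, and the helper names are illustrative only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for HTTRACK_AFF_VERSION. */
#define DEMO_AFF_VERSION "3.x"

/* Adjacent string literals are joined by the compiler into one string. */
#define DEMO_DEFAULT_USER_AGENT                            \
  "Mozilla/5.0 (compatible; HTTrack/" DEMO_AFF_VERSION     \
  "; +https://www.httrack.com/)"

/* Single source of truth: every default refers to the same macro. */
static const char *demo_default_user_agent(void) {
  return DEMO_DEFAULT_USER_AGENT;
}

/* Checks that do not depend on the macro: honest token, no Mozilla/4.x. */
static int demo_ua_is_honest(const char *ua) {
  assert(ua != NULL);
  return strstr(ua, "HTTrack/") != NULL &&
         strstr(ua, "+https://www.httrack.com/") != NULL &&
         strstr(ua, "Mozilla/4.") == NULL;
}
```

Because the version token is spliced in at compile time, bumping the version
macro updates every default that uses the User-Agent macro at once.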

Closes #449

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 23:18:28 +02:00
Xavier Roche
730a1c8c5b Add modern web MIME types to the type/extension table (#454)
The MIME ⇄ extension table in htslib.c had no entries for formats that are now
common: webp, avif, woff/woff2, json, wasm, mp4/webm, opus/flac, and friends.
A crawl that met these relied on the application/<subtype> fallback (or got no
extension at all), so saved files landed with wrong or missing extensions and
get_httptype guessed nothing from the extension.

The new rows live in a separate hts_mime_modern[] table rather than appended to
the legacy hts_mime[]: clang-format reflows a whole brace initializer on any
in-place edit, which would churn every untouched legacy row. A small
hts_mime_lookup() helper scans a table in either direction; give_mimext() and
get_httptype_sized() now consult the legacy table first, then the modern one,
so a new row can never shadow a legacy mapping. Legacy aliases (flash,
realaudio) stay for archiving old content.
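
The legacy-first lookup order can be sketched with two toy tables. This is an
illustration under assumptions, not the htslib.c code: the demo_* names and
table rows are hypothetical, matching uses exact strcmp() where the real helper
uses HTTrack's own comparison, and the "*" placeholder rows are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy {mime, ext} tables, each terminated by an empty row. */
static const char *const demo_legacy[][2] = {
    {"text/html", "html"}, {"image/gif", "gif"}, {"", ""}};
/* The image/gif row below conflicts with the legacy one on purpose. */
static const char *const demo_modern[][2] = {
    {"image/webp", "webp"}, {"image/gif", "gif2"}, {"", ""}};

/* key selects the lookup column (0 = mime, 1 = ext); returns the other one. */
static const char *demo_lookup(const char *const (*table)[2], int key,
                               const char *needle) {
  size_t j;
  assert(key == 0 || key == 1);
  for (j = 0; table[j][1][0] != '\0'; j++) {
    if (strcmp(table[j][key], needle) == 0)
      return table[j][!key];
  }
  return NULL;
}

/* Legacy table first, modern table only as a fallback. */
static const char *demo_mime_to_ext(const char *mime) {
  const char *ext = demo_lookup(demo_legacy, 0, mime);
  if (ext == NULL)
    ext = demo_lookup(demo_modern, 0, mime);
  return ext;
}
```

Since the modern table is consulted only when the legacy scan returns nothing,
the conflicting image/gif row in the toy modern table is never reached.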

Self-test covers both lookups; the give_mimext cases use MIME types the
application/<=4-char-subtype fallback can't fabricate, so a missing row
actually fails the assert.

Closes #448

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 22:53:48 +02:00
23 changed files with 738 additions and 176 deletions

@@ -269,8 +269,9 @@ jobs:
MSAN_OPTIONS: abort_on_error=1:halt_on_error=1
run: |
set -euo pipefail
# Engine self-tests only; the cache trio pulls in uninstrumented zlib.
tests="$(cd tests && ls 01_engine-*.test | grep -v -- '-cache' | tr '\n' ' ')"
# 01_engine-* only; zlib-dependent self-tests are named 01_zlib-* and
# skipped here (uninstrumented libz floods MSan with false positives).
tests="$(cd tests && ls 01_engine-*.test | tr '\n' ' ')"
make check TESTS="$tests"
- name: Print the test log on failure

@@ -129,6 +129,8 @@ typedef enum HTTPStatusCode {
HTTP_UNSUPPORTED_MEDIA_TYPE = 415,
HTTP_REQUESTED_RANGE_NOT_SATISFIABLE = 416,
HTTP_EXPECTATION_FAILED = 417,
HTTP_TOO_MANY_REQUESTS = 429,
HTTP_UNAVAILABLE_FOR_LEGAL_REASONS = 451,
HTTP_INTERNAL_SERVER_ERROR = 500,
HTTP_NOT_IMPLEMENTED = 501,
HTTP_BAD_GATEWAY = 502,

@@ -64,7 +64,7 @@ Please visit our Website: http://www.httrack.com
// catch_url_init(&port,&return_host);
HTSEXT_API T_SOC catch_url_init_std(int *port_prox, char *adr_prox) {
T_SOC soc;
int try_to_listen_to[] = { 8080, 3128, 80, 81, 82, 8081, 3129, 31337, 0, -1 };
int try_to_listen_to[] = {8080, 3128, 80, 81, 82, 8081, 3129, 0, -1};
int i = 0;
do {

@@ -1796,90 +1796,18 @@ int httpmirror(char *url1, httrackp * opt) {
if (strnotempty(savename()) == 0) { // pas de chemin de sauvegarde
if (strcmp(urlfil(), "/robots.txt") == 0) { // robots.txt
if (r.adr) {
int bptr = 0;
char BIGSTK line[1024];
char BIGSTK buff[8192];
char BIGSTK infobuff[8192];
int record = 0;
line[0] = '\0';
buff[0] = '\0';
infobuff[0] = '\0';
//
#if DEBUG_ROBOTS
printf("robots.txt dump:\n%s\n", r.adr);
#endif
do {
char *comm;
int llen;
bptr += binput(r.adr + bptr, line, sizeof(line) - 2);
/* strip comment */
comm = strchr(line, '#');
if (comm != NULL) {
*comm = '\0';
}
/* strip spaces */
llen = (int) strlen(line);
while(llen > 0 && is_realspace(line[llen - 1])) {
line[llen - 1] = '\0';
llen--;
}
if (strfield(line, "user-agent:")) {
char *a;
a = line + 11;
while(is_realspace(*a))
a++; // sauter espace(s)
if (*a == '*') {
if (record != 2)
record = 1; // c pour nous
} else if (strfield(a, "httrack") || strfield(a, "winhttrack")
|| strfield(a, "webhttrack")) {
buff[0] = '\0'; // re-enregistrer
infobuff[0] = '\0';
record = 2; // locked
#if DEBUG_ROBOTS
printf("explicit disallow for httrack\n");
#endif
} else
record = 0;
} else if (record) {
if (strfield(line, "disallow:")) {
char *a = line + 9;
while(is_realspace(*a))
a++; // sauter espace(s)
if (strnotempty(a)) {
#ifdef IGNORE_RESTRICTIVE_ROBOTS
if (strcmp(a, "/") != 0 ||
opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
hts_boolean keep_root = (opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
? HTS_TRUE
: HTS_FALSE;
#else
hts_boolean keep_root = HTS_TRUE;
#endif
{ /* ignoring disallow: / */
if ((strlen(buff) + strlen(a) + 8) < sizeof(buff)) {
strcatbuff(buff, a);
strcatbuff(buff, "\n");
if ((strlen(infobuff) + strlen(a) + 8) <
sizeof(infobuff)) {
if (strnotempty(infobuff))
strcatbuff(infobuff, ", ");
strcatbuff(infobuff, a);
}
}
}
#ifdef IGNORE_RESTRICTIVE_ROBOTS
else {
hts_log_print(opt, LOG_NOTICE,
"Note: %s robots.txt rules are too restrictive, ignoring /",
urladr());
}
#endif
}
}
}
} while((bptr < r.size) && (strlen(buff) < (sizeof(buff) - 32)));
if (strnotempty(buff)) {
checkrobots_set(&robots, urladr(), buff);
robots_parse(&robots, urladr(), r.adr, r.size, infobuff,
sizeof(infobuff), keep_root);
if (strnotempty(infobuff)) {
hts_log_print(opt, LOG_INFO,
"Note: robots.txt forbidden links for %s are: %s",
urladr(), infobuff);

@@ -229,6 +229,10 @@ Please visit our Website: http://www.httrack.com
#define HTS_DEFAULT_FOOTER \
"<!-- Mirrored from %s%s by HTTrack Website Copier/" HTTRACK_AFF_VERSION \
" " HTTRACK_AFF_AUTHORS ", %s -->"
/* Honest crawler User-Agent; no fake OS/browser to go stale. */
#define HTS_DEFAULT_USER_AGENT \
"Mozilla/5.0 (compatible; HTTrack/" HTTRACK_AFF_VERSION \
"; +https://www.httrack.com/)"
#define HTTRACK_WEB "http://www.httrack.com"
#define HTS_UPDATE_WEBSITE \
"http://www.httrack.com/" \

@@ -563,6 +563,39 @@ const char *hts_mime[][2] = {
{"", ""}
};
/* Modern web formats (post-2010), kept in their own table: appending to the
legacy hts_mime[] above makes clang-format reflow its whole initializer.
Scanned after hts_mime[], so it never shadows a legacy mapping. */
static const char *hts_mime_modern[][2] = {
{"image/webp", "webp"},
{"image/avif", "avif"},
{"image/heic", "heic"},
{"font/woff", "woff"},
{"font/woff2", "woff2"},
{"font/ttf", "ttf"},
{"font/otf", "otf"},
{"application/json", "json"},
{"application/ld+json", "jsonld"},
{"application/manifest+json", "webmanifest"},
{"application/wasm", "wasm"},
{"text/javascript", "js"},
{"text/javascript", "mjs"},
{"text/markdown", "md"},
{"video/mp4", "mp4"},
{"video/webm", "webm"},
{"video/ogg", "ogv"},
{"video/mp2t", "ts"},
{"audio/mp4", "m4a"},
{"audio/aac", "aac"},
{"audio/ogg", "oga"},
{"audio/opus", "opus"},
{"audio/flac", "flac"},
{"audio/webm", "weba"},
{"application/x-7z-compressed", "7z"},
{"application/x-rar-compressed", "rar"},
{"application/zstd", "zst"},
{"", ""}};
// Reserved (RFC2396)
#define CIS(c,ch) ( ((unsigned char)(c)) == (ch) )
#define CHAR_RESERVED(c) ( CIS(c,';') \
@@ -1293,16 +1326,12 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
// Compression accepted ?
if (retour->req.http11) {
hts_boolean compressible = HTS_FALSE;
#if HTS_USEZLIB
if ((!retour->req.range_used)
&& (!retour->req.nocompression))
print_buffer(&bstr, "Accept-Encoding: " "gzip" /* gzip if the preffered encoding */
", " "identity;q=0.9" H_CRLF);
else
print_buffer(&bstr, "Accept-Encoding: identity" H_CRLF); /* no compression */
#else
print_buffer(&bstr, "Accept-Encoding: identity" H_CRLF); /* no compression */
compressible = (!retour->req.range_used && !retour->req.nocompression);
#endif
print_buffer(&bstr, "Accept-Encoding: %s" H_CRLF,
hts_acceptencoding(compressible));
}
/* Authentification */
@@ -1918,6 +1947,10 @@ HTSEXT_API const char *infostatuscode_const(int statuscode) {
return "Requested Range Not Satisfiable";
case 417:
return "Expectation Failed";
case 429:
return "Too Many Requests";
case 451:
return "Unavailable For Legal Reasons";
case 500:
return "Internal Server Error";
case 501:
@@ -4308,6 +4341,20 @@ void guess_httptype(httrackp * opt, char *s, const char *fil) {
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, 1);
}
// first match in a NUL-terminated {mime,ext} table. key selects the lookup
// column (0=mime, 1=ext); returns the other column, or NULL if no row matches
// (a "*" partner means the row carries no value).
static const char *hts_mime_lookup(const char *(*table)[2], int key,
const char *needle) {
int j;
for (j = 0; strnotempty(table[j][1]); j++) {
if (strfield2(table[j][key], needle) && table[j][!key][0] != '*')
return table[j][!key];
}
return NULL;
}
// write the mime type for fil into s (capacity ssize)
// flag: 1 to always return a type (the "application/..." / octet-stream
// fallback) returns 1 if a type was written to s, 0 otherwise
@@ -4331,17 +4378,15 @@ HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
while ((a > fil) && (*a != '.') && (*a != '/'))
a--;
if (a >= fil && *a == '.' && strlen(a) < 32) {
int j = 0;
const char *mime;
a++;
while(strnotempty(hts_mime[j][1])) {
if (strfield2(hts_mime[j][1], a)) {
if (hts_mime[j][0][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][0], ssize);
return 1;
}
}
j++;
mime = hts_mime_lookup(hts_mime, 1, a);
if (mime == NULL)
mime = hts_mime_lookup(hts_mime_modern, 1, a);
if (mime != NULL) {
strlcpybuff(s, mime, ssize);
return 1;
}
if (flag) {
@@ -4365,6 +4410,11 @@ HTSEXT_API void get_httptype(httrackp *opt, char *s, const char *fil,
(void) get_httptype_sized(opt, s, HTS_MIMETYPE_SIZE, fil, flag);
}
/* Advertised Accept-Encoding; gzip and deflate both decode via hts_zunpack */
const char *hts_acceptencoding(hts_boolean compressible) {
return compressible ? "gzip, deflate, identity;q=0.9" : "identity";
}
// get type of fil (php)
// s: buffer (text/html) or NULL
// return: 1 if known by user
@@ -4476,18 +4526,16 @@ int get_userhttptype(httrackp * opt, char *s, const char *fil) {
// returns 1 if an extension was found (and written to s), 0 otherwise
int give_mimext(char *s, size_t ssize, const char *st) {
int ok = 0;
int j = 0;
const char *ext;
st = hts_effective_mime(st); /* no declared type: derive an html ext */
s[0] = '\0';
while((!ok) && (strnotempty(hts_mime[j][1]))) {
if (strfield2(hts_mime[j][0], st)) {
if (hts_mime[j][1][0] != '*') { // a match exists
strlcpybuff(s, hts_mime[j][1], ssize);
ok = 1;
}
}
j++;
ext = hts_mime_lookup(hts_mime, 0, st);
if (ext == NULL)
ext = hts_mime_lookup(hts_mime_modern, 0, st);
if (ext != NULL) {
strlcpybuff(s, ext, ssize);
ok = 1;
}
// wrap "x" mimetypes, such as:
// application/x-mp3
@@ -5754,6 +5802,13 @@ HTSEXT_API int hts_init(void) {
abortLog("unable to initialize TLS: SSL_CTX_new()");
assertf("unable to initialize TLS" == NULL);
}
/* Pin a TLS floor (no SSLv3/TLS1.0/1.1); no cert verify, by design. */
#if OPENSSL_VERSION_NUMBER >= 0x10100000L
SSL_CTX_set_min_proto_version(openssl_ctx, TLS1_2_VERSION);
#else
SSL_CTX_set_options(openssl_ctx, SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3 |
SSL_OP_NO_TLSv1 | SSL_OP_NO_TLSv1_1);
#endif
}
#endif
@@ -6005,8 +6060,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->shell = HTS_FALSE;
opt->proxy.active = 0; // pas de proxy
opt->user_agent_send = HTS_TRUE;
StringCopy(opt->user_agent,
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->user_agent, HTS_DEFAULT_USER_AGENT);
StringCopy(opt->referer, "");
StringCopy(opt->from, "");
opt->savename_83 = HTS_SAVENAME_83_LONG; // long names by default

@@ -285,6 +285,9 @@ int ishttperror(int err);
int get_userhttptype(httrackp * opt, char *s, const char *fil);
int give_mimext(char *s, size_t ssize, const char *st);
/* Advertised Accept-Encoding value (no header name/CRLF); see htslib.c. */
const char *hts_acceptencoding(hts_boolean compressible);
int may_bogus_multiple(httrackp * opt, const char *mime, const char *filename);
int may_unknown2(httrackp * opt, const char *mime, const char *filename);

@@ -44,28 +44,84 @@ Please visit our Website: http://www.httrack.com
// -- robots --
/* RFC 9309 path-prefix match; '*' any run, '$' anchors end; linear. */
static hts_boolean robots_pattern_match(const char *pattern, const char *path) {
size_t patlen = strlen(pattern);
hts_boolean anchored = HTS_FALSE;
const char *p, *pend, *s;
const char *star = NULL, *star_s = NULL;
if (patlen > 0 && pattern[patlen - 1] == '$') {
anchored = HTS_TRUE;
patlen--;
}
p = pattern;
pend = pattern + patlen;
s = path;
while (*s != '\0') {
if (p == pend) {
if (!anchored)
return HTS_TRUE; // prefix matched
if (star != NULL) { // anchored: '*' must eat the rest
p = star + 1;
s = ++star_s;
continue;
}
return HTS_FALSE;
}
if (*p == '*') {
star = p++;
star_s = s;
} else if (*p == *s) {
p++;
s++;
} else if (star != NULL) {
p = star + 1;
s = ++star_s;
} else {
return HTS_FALSE;
}
}
while (p < pend && *p == '*')
p++;
return (p == pend) ? HTS_TRUE : HTS_FALSE;
}
// fil="" : vérifier si règle déja enregistrée
int checkrobots(robots_wizard * robots, const char *adr, const char *fil) {
while(robots) {
if (strfield2(robots->adr, adr)) {
if (fil[0]) {
/* RFC 9309: longest pattern wins, Allow beats Disallow on ties. */
int ptr = 0;
char line[250];
char line[HTS_ROBOTS_TOKEN_SIZE];
size_t toklen = strlen(robots->token);
size_t best_len = 0;
hts_boolean matched = HTS_FALSE;
hts_boolean best_allow = HTS_FALSE;
if (strnotempty(robots->token)) {
do {
ptr += binput(robots->token + ptr, line, 200);
if (line[0] == '/') { // absolu
if (strfield(fil, line)) { // commence avec ligne
return -1; // interdit
}
} else { // relatif
if (strstrcase(fil, line)) {
return -1;
while (ptr < (int) toklen) {
ptr += binput(robots->token + ptr, line, sizeof(line) - 1);
if (line[0] != 'A' && line[0] != 'D')
continue;
{
const hts_boolean is_allow =
(line[0] == 'A') ? HTS_TRUE : HTS_FALSE;
const char *pat = line + 1;
if (robots_pattern_match(pat, fil)) {
const size_t len = strlen(pat);
if (!matched || len > best_len || (len == best_len && is_allow)) {
matched = HTS_TRUE;
best_len = len;
best_allow = is_allow;
}
}
} while((strnotempty(line)) && (ptr < (int) strlen(robots->token)));
}
}
if (matched && !best_allow)
return -1; // forbidden
} else {
return -1;
}
@@ -74,6 +130,93 @@ int checkrobots(robots_wizard * robots, const char *adr, const char *fil) {
}
return 0;
}
/* Append "<marker><pattern>\n" to the bounded rule blob if it fits. */
static void robots_blob_add(char *blob, size_t blobsize, char marker,
const char *pat) {
const size_t used = strlen(blob);
const size_t need = strlen(pat) + 2; // marker + '\n'
if (need < blobsize - used) { // overflow-safe: used <= blobsize-1
blob[used] = marker;
blob[used + 1] = '\0';
strlcatbuff(blob, pat, blobsize);
strlcatbuff(blob, "\n", blobsize);
}
}
void robots_parse(robots_wizard *robots, const char *adr, const char *body,
size_t bodysize, char *info, size_t infosize,
hts_boolean keep_root_disallow) {
size_t bptr = 0;
int record = 0;
char BIGSTK line[1024];
char BIGSTK blob[HTS_ROBOTS_TOKEN_SIZE];
blob[0] = '\0';
if (info != NULL && infosize > 0)
info[0] = '\0';
#if DEBUG_ROBOTS
printf("robots.txt dump:\n%s\n", body);
#endif
while (bptr < bodysize) {
char *comm;
int llen;
bptr += binput(body + bptr, line, sizeof(line) - 2);
comm = strchr(line, '#'); // strip comment
if (comm != NULL)
*comm = '\0';
llen = (int) strlen(line); // strip trailing spaces
while (llen > 0 && is_realspace(line[llen - 1])) {
line[llen - 1] = '\0';
llen--;
}
if (strfield(line, "user-agent:")) {
char *a = line + 11;
while (is_realspace(*a))
a++;
if (*a == '*') {
if (record != 2)
record = 1; // generic group applies to us
} else if (strfield(a, "httrack") || strfield(a, "winhttrack") ||
strfield(a, "webhttrack")) {
blob[0] = '\0'; // explicit group: restart capture
if (info != NULL && infosize > 0)
info[0] = '\0';
record = 2; // locked to the httrack group
} else
record = 0;
} else if (record) {
hts_boolean is_allow = strfield(line, "allow:");
hts_boolean is_disallow = !is_allow && strfield(line, "disallow:");
if (is_allow || is_disallow) {
char *a = line + (is_allow ? 6 : 9);
while (is_realspace(*a))
a++;
if (strnotempty(a)) {
if (is_disallow && !keep_root_disallow && strcmp(a, "/") == 0) {
// dropped: site-wide disallow ignored by option
} else {
robots_blob_add(blob, sizeof(blob), is_allow ? 'A' : 'D', a);
if (is_disallow && info != NULL &&
strlen(a) + 2 < infosize - strlen(info)) {
if (strnotempty(info))
strlcatbuff(info, ", ", infosize);
strlcatbuff(info, a, infosize);
}
}
}
}
}
}
if (strnotempty(blob))
checkrobots_set(robots, adr, blob);
}
int checkrobots_set(robots_wizard * robots, const char *adr, const char *data) {
if (((int) strlen(adr)) >= sizeof(robots->adr) - 2)
return 0;

@@ -39,17 +39,27 @@ Please visit our Website: http://www.httrack.com
#define HTS_DEF_FWSTRUCT_robots_wizard
typedef struct robots_wizard robots_wizard;
#endif
/* Per-host blob: one rule per line, first byte 'A'/'D' then path pattern. */
#define HTS_ROBOTS_TOKEN_SIZE 4096
struct robots_wizard {
char adr[128];
char token[4096];
char token[HTS_ROBOTS_TOKEN_SIZE];
struct robots_wizard *next;
};
/* Library internal definictions */
#ifdef HTS_INTERNAL_BYTECODE
/* -1 if `fil` disallowed for `adr` (RFC 9309); empty: -1 if rules exist. */
int checkrobots(robots_wizard * robots, const char *adr, const char *fil);
void checkrobots_free(robots_wizard * robots);
int checkrobots_set(robots_wizard * robots, const char *adr, const char *data);
/* Parse robots.txt `body` for `adr`, storing the HTTrack group's rules; `info`
gets a disallow summary, `keep_root_disallow` FALSE drops "Disallow: /". */
void robots_parse(robots_wizard *robots, const char *adr, const char *body,
size_t bodysize, char *info, size_t infosize,
hts_boolean keep_root_disallow);
#endif
#endif

@@ -50,6 +50,9 @@ Please visit our Website: http://www.httrack.com
#include "htscharset.h"
#include "htsencoding.h"
#include "htsmd5.h"
#if HTS_USEZLIB
#include "htszlib.h"
#endif
#include "coucal/coucal.h"
#include <ctype.h>
@@ -239,6 +242,14 @@ static void basic_selftests(void) {
assertf(strcmp(ext, "html") == 0);
assertf(give_mimext(ext, sizeof(ext), "no/such-mime-type") == 0);
assertf(ext[0] == '\0');
// modern web formats -> extension. Avoid MIME types the
// application/<=4-char-subtype fallback could fabricate without a row.
assertf(give_mimext(ext, sizeof(ext), "image/webp") == 1);
assertf(strcmp(ext, "webp") == 0);
assertf(give_mimext(ext, sizeof(ext), "application/manifest+json") == 1);
assertf(strcmp(ext, "webmanifest") == 0);
assertf(give_mimext(ext, sizeof(ext), "font/woff2") == 1);
assertf(strcmp(ext, "woff2") == 0);
}
// convtolower(): lower-cases into the caller buffer (bounded by its size).
{
@@ -293,6 +304,16 @@ static void basic_selftests(void) {
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"x.gif", 0) == 1);
assertf(strcmp(r.contenttype, "image/gif") == 0);
// modern extensions map back to their MIME type
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"x.webp", 0) == 1);
assertf(strcmp(r.contenttype, "image/webp") == 0);
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"app.wasm", 0) == 1);
assertf(strcmp(r.contenttype, "application/wasm") == 0);
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"mod.mjs", 0) == 1);
assertf(strcmp(r.contenttype, "text/javascript") == 0);
// no extension and flag==0: nothing written, returns 0
assertf(get_httptype_sized(opt, r.contenttype, sizeof(r.contenttype),
"noextfile", 0) == 0);
@@ -1284,6 +1305,275 @@ static int st_urlhack(httrackp *opt, int argc, char **argv) {
return 0;
}
/* Default User-Agent: honest HTTrack token, no resurrected Windows 98. */
static int st_useragent(httrackp *opt, int argc, char **argv) {
const char *ua = StringBuff(opt->user_agent);
(void) argc;
(void) argv;
assertf(ua != NULL);
assertf(strcmp(ua, HTS_DEFAULT_USER_AGENT) == 0);
/* Teeth independent of the macro: honest token + self-identifier, and no
legacy Mozilla/4.x fake-browser string (rejects the whole relic family). */
assertf(strstr(ua, "HTTrack/") != NULL);
assertf(strstr(ua, "+https://www.httrack.com/") != NULL);
assertf(strstr(ua, "Mozilla/4.") == NULL);
printf("useragent self-test OK: %s\n", ua);
return 0;
}
/* HTTP status code -> reason phrase, including the modern 429/451. */
static int st_status(httrackp *opt, int argc, char **argv) {
const char *s;
(void) opt;
(void) argc;
(void) argv;
s = infostatuscode_const(429);
assertf(s != NULL && strcmp(s, "Too Many Requests") == 0);
s = infostatuscode_const(451);
assertf(s != NULL && strcmp(s, "Unavailable For Legal Reasons") == 0);
/* A spot-check of a long-standing code, and an unknown one. */
s = infostatuscode_const(404);
assertf(s != NULL && strcmp(s, "Not Found") == 0);
assertf(infostatuscode_const(799) == NULL);
printf("status self-test OK\n");
return 0;
}
#if HTS_USEZLIB
/* Deflate src->path at windowBits (16+ gzip, + zlib, - raw); 0 on success. */
static int ae_write_packed(const char *path, int windowBits,
const unsigned char *src, size_t len) {
unsigned char out[8192];
z_stream strm;
FILE *f;
int zerr;
memset(&strm, 0, sizeof(strm));
if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, windowBits, 8,
Z_DEFAULT_STRATEGY) != Z_OK)
return 1;
if ((f = FOPEN(path, "wb")) == NULL) {
deflateEnd(&strm);
return 1;
}
strm.next_in = (Bytef *) src;
strm.avail_in = (uInt) len;
do {
size_t n;
strm.next_out = out;
strm.avail_out = sizeof(out);
zerr = deflate(&strm, Z_FINISH);
n = sizeof(out) - strm.avail_out;
if (n > 0 && fwrite(out, 1, n, f) != n) {
deflateEnd(&strm);
fclose(f);
return 1;
}
} while (zerr == Z_OK);
deflateEnd(&strm);
fclose(f);
return (zerr == Z_STREAM_END) ? 0 : 1;
}
/* Forged raw deflate (08 1D) that misdetects as zlib; only fallback decodes */
static int ae_write_collision(const char *path, const unsigned char *src,
size_t len) {
/* block-1 LEN low byte 0x1D: with 0x08, (0x081D)%31==0 */
const size_t n1 = 29;
size_t n2, p = 0;
unsigned char *buf;
FILE *f;
int ok;
if (len < n1 || len - n1 > 0xFFFF)
return 1;
n2 = len - n1;
buf = malloct(10 + len);
if (buf == NULL)
return 1;
buf[p++] = 0x08; /* BFINAL=0, BTYPE=00, forged padding -> zlib CMF nibble */
buf[p++] = (unsigned char) (n1 & 0xff);
buf[p++] = (unsigned char) (n1 >> 8);
buf[p++] = (unsigned char) (~n1 & 0xff);
buf[p++] = (unsigned char) ((~n1 >> 8) & 0xff);
memcpy(buf + p, src, n1);
p += n1;
buf[p++] = 0x01; /* BFINAL=1, BTYPE=00 */
buf[p++] = (unsigned char) (n2 & 0xff);
buf[p++] = (unsigned char) (n2 >> 8);
buf[p++] = (unsigned char) (~n2 & 0xff);
buf[p++] = (unsigned char) ((~n2 >> 8) & 0xff);
memcpy(buf + p, src + n1, n2);
p += n2;
f = FOPEN(path, "wb");
ok = (f != NULL && fwrite(buf, 1, p, f) == p);
if (f != NULL)
fclose(f);
freet(buf);
return ok ? 0 : 1;
}
/* Compare path's bytes to expect[0..len); 0 if equal. Streams (large files). */
static int ae_check_decoded(const char *path, const unsigned char *expect,
size_t len) {
unsigned char buf[8192];
FILE *f = FOPEN(path, "rb");
size_t off = 0, n;
if (f == NULL)
return 1;
while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
if (n > len - off || memcmp(buf, expect + off, n) != 0) {
fclose(f);
return 1;
}
off += n;
}
fclose(f);
return (off == len) ? 0 : 1;
}
#endif
/* Accept-Encoding (#450): advertise gzip+deflate; both decode (hts_zunpack) */
static int st_acceptencoding(httrackp *opt, int argc, char **argv) {
const char *off = hts_acceptencoding(HTS_FALSE);
const char *on = hts_acceptencoding(HTS_TRUE);
(void) opt;
assertf(strcmp(off, "identity") == 0);
assertf(strstr(on, "gzip") != NULL);
assertf(strstr(on, "deflate") != NULL); /* fails on the old gzip-only list */
#if HTS_USEZLIB
if (argc >= 1) {
static const int windowBits[] = {16 + MAX_WBITS, MAX_WBITS, -MAX_WBITS};
const unsigned char small[] =
"deflate round-trip: HTTrack decodes gzip and deflate alike. "
"deflate round-trip: HTTrack decodes gzip and deflate alike.";
const size_t slen = sizeof(small) - 1;
/* 64 KiB of varied (LCG) bytes: forces the multi-fread loop */
const size_t blen = 64 * 1024;
unsigned char *body = malloct(blen);
uint32_t x = 0x1234567u;
char inpath[HTS_URLMAXSIZE], outpath[HTS_URLMAXSIZE];
size_t i;
assertf(body != NULL);
for (i = 0; i < blen; i++) {
x = x * 1103515245u + 12345u;
body[i] = (unsigned char) (x >> 16);
}
/* gzip, zlib (RFC1950) and raw deflate (RFC1951), both small and large. */
for (i = 0; i < sizeof(windowBits) / sizeof(windowBits[0]); i++) {
snprintf(inpath, sizeof(inpath), "%s/ae-in-%d.z", argv[0], windowBits[i]);
snprintf(outpath, sizeof(outpath), "%s/ae-out-%d", argv[0],
windowBits[i]);
assertf(ae_write_packed(inpath, windowBits[i], small, slen) == 0);
assertf(hts_zunpack(inpath, outpath) == (int) slen);
assertf(ae_check_decoded(outpath, small, slen) == 0);
assertf(ae_write_packed(inpath, windowBits[i], body, blen) == 0);
assertf(hts_zunpack(inpath, outpath) == (int) blen);
assertf(ae_check_decoded(outpath, body, blen) == 0);
}
/* Fallback teeth: raw deflate misdetected as zlib; -1 without the retry. */
snprintf(inpath, sizeof(inpath), "%s/ae-collide.z", argv[0]);
snprintf(outpath, sizeof(outpath), "%s/ae-collide.out", argv[0]);
assertf(ae_write_collision(inpath, body, 64) == 0);
assertf(hts_zunpack(inpath, outpath) == 64);
assertf(ae_check_decoded(outpath, body, 64) == 0);
freet(body);
}
#else
(void) argc;
(void) argv;
#endif
printf("acceptencoding self-test OK: %s\n", on);
return 0;
}
/* Each call parses `txt` under a fresh host, then checkrobots() for `path`. */
static int rb_decide(robots_wizard *r, const char *txt, const char *path) {
static int n = 0;
char host[64];
snprintf(host, sizeof(host), "h%d.example", n++);
robots_parse(r, host, txt, strlen(txt), NULL, 0, HTS_TRUE);
return checkrobots(r, host, path);
}
static int st_robots(httrackp *opt, int argc, char **argv) {
robots_wizard robots;
(void) opt;
(void) argc;
(void) argv;
memset(&robots, 0, sizeof(robots));
/* Longer Allow re-opens subtree under Disallow: / (old matcher couldn't). */
{
const char *txt = "User-agent: *\nDisallow: /\nAllow: /public/\n";
assertf(rb_decide(&robots, txt, "/public/x") == 0); /* allowed */
assertf(rb_decide(&robots, txt, "/private") == -1); /* denied */
assertf(rb_decide(&robots, txt, "/") == -1); /* denied */
}
/* Equal-length match: Allow wins the tie over Disallow. */
{
const char *txt = "User-agent: *\nDisallow: /foo\nAllow: /foo\n";
assertf(rb_decide(&robots, txt, "/foo/bar") == 0);
}
/* Longest match wins even when it is not the last rule. */
{
assertf(rb_decide(&robots, "User-agent: *\nDisallow: /a/b\nAllow: /a\n",
"/a/b/c") == -1);
assertf(rb_decide(&robots, "User-agent: *\nAllow: /a/b\nDisallow: /a\n",
"/a/b/c") == 0);
}
/* '*' matches any run of characters. */
{
const char *txt = "User-agent: *\nDisallow: /*.php\n";
assertf(rb_decide(&robots, txt, "/a/b/index.php") == -1);
assertf(rb_decide(&robots, txt, "/a/b/index.html") == 0);
}
/* Trailing '$' anchors the end of the path. */
{
const char *txt = "User-agent: *\nDisallow: /a$\n";
assertf(rb_decide(&robots, txt, "/a") == -1);
assertf(rb_decide(&robots, txt, "/ab") == 0);
assertf(rb_decide(&robots, txt, "/a/b") == 0);
}
/* The httrack-specific group replaces the generic '*' group entirely. */
{
const char *txt = "User-agent: *\nDisallow: /everyone\n"
"User-agent: httrack\nDisallow: /\n";
assertf(rb_decide(&robots, txt, "/anything") == -1);
}
/* Replace, not merge: the generic group does not bind the httrack group. */
{
const char *txt = "User-agent: *\nDisallow: /x\n"
"User-agent: httrack\nDisallow: /y\n";
assertf(rb_decide(&robots, txt, "/x") == 0);
assertf(rb_decide(&robots, txt, "/y") == -1);
}
/* No rules: everything is allowed. */
assertf(rb_decide(&robots, "User-agent: *\nDisallow:\n", "/x") == 0);
checkrobots_free(&robots);
printf("robots self-test OK\n");
return 0;
}
/* ------------------------------------------------------------ */
/* Registry: name -> handler, with a usage hint and a one-line description. */
/* ------------------------------------------------------------ */
@@ -1330,6 +1620,12 @@ static const struct selftest_entry {
st_cache_writefail},
{"dns", "", "DNS resolver/cache self-test", st_dns},
{"cookies", "", "cookie request-header self-test", st_cookies},
{"useragent", "", "default User-Agent self-test", st_useragent},
{"status", "", "HTTP status code -> reason phrase self-test", st_status},
{"acceptencoding", "[dir]",
"Accept-Encoding advertises gzip+deflate, both decode", st_acceptencoding},
{"robots", "", "robots.txt RFC 9309 Allow/Disallow precedence self-test",
st_robots},
};
static void list_selftests(void) {
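
Reviewer sketch for the robots self-test in this hunk: a minimal, self-contained illustration of the two behaviours it asserts, namely '*' / trailing '$' pattern matching with a single resumable star position, and longest-match precedence with Allow winning ties. This is not HTTrack's code. The names sketch_rule, sketch_match and sketch_allowed are hypothetical, and the return convention (1 = allowed, 0 = denied) is an assumption that differs from checkrobots(). Empty patterns are assumed to have been dropped by the parser beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule record; the real robots_wizard layout is not shown. */
typedef struct {
  int allow;           /* 1 = Allow, 0 = Disallow */
  const char *pattern; /* path pattern; may hold '*' and a trailing '$' */
} sketch_rule;

/* Return 1 if `pattern` matches a prefix of `path` (the whole path when the
   pattern ends in '$'). Only the most recent '*' is remembered, so a mismatch
   resumes there instead of recursing: no exponential backtracking. */
static int sketch_match(const char *pattern, const char *path) {
  size_t plen, slen, p = 0, s = 0, star_s = 0;
  size_t star_p = (size_t) -1; /* no '*' seen yet */
  int anchored = 0;

  assert(pattern != NULL && path != NULL);
  plen = strlen(pattern);
  slen = strlen(path);
  if (plen > 0 && pattern[plen - 1] == '$') {
    anchored = 1; /* trailing '$': the path must end where the pattern ends */
    plen--;
  }
  for (;;) {
    if (p == plen) {
      if (!anchored || s == slen)
        return 1; /* pattern consumed (and the path too, if anchored) */
      /* anchored with leftover path: fall through and retry from the star */
    } else if (pattern[p] == '*') {
      star_p = p++; /* remember the star and where it started matching */
      star_s = s;
      continue;
    } else if (s < slen && pattern[p] == path[s]) {
      p++;
      s++;
      continue;
    }
    if (star_p == (size_t) -1 || star_s >= slen)
      return 0; /* nothing left to retry */
    p = star_p + 1; /* let the star swallow one more character */
    s = ++star_s;
  }
}

/* Longest matching pattern wins; Allow breaks ties; no match means allowed. */
static int sketch_allowed(const sketch_rule *rules, size_t n,
                          const char *path) {
  size_t i, best_len = 0;
  int found = 0, allow = 1;

  for (i = 0; i < n; i++) {
    const size_t len = strlen(rules[i].pattern);
    if (!sketch_match(rules[i].pattern, path))
      continue;
    if (!found || len > best_len || (len == best_len && rules[i].allow)) {
      found = 1;
      best_len = len;
      allow = rules[i].allow;
    }
  }
  return allow;
}
```

With the rules {Disallow "/", Allow "/public/"}, sketch_allowed() permits "/public/x" and denies "/private", mirroring the first block of the self-test. In the worst case the matcher does work proportional to pattern length times path length.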


@@ -358,12 +358,12 @@ int smallserver(T_SOC soc, char *url, char *method, char *data, char *path) {
{NULL, 0}
};
initStrElt initStr[] = {
{"user", "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"},
{"footer",
"<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->"},
{"url2", "+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"},
{NULL, NULL}
};
{"user", HTS_DEFAULT_USER_AGENT},
{"footer", "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x "
"[XR&CO'2014], %s -->"},
{"url2",
"+*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*"},
{NULL, NULL}};
int i = 0;
for(i = 0; initInt[i].name; i++) {


@@ -80,6 +80,10 @@ htspair_t hts_detect_embed[] = {
{NULL, NULL}
};
/* HTML5 media siblings of <img src>: same near-link treatment (#451) */
static const htspair_t hts_detect_embed_html5[] = {
{"source", "src"}, {"source", "srcset"}, {"track", "src"}, {NULL, NULL}};
/* Internal */
static int hts_acceptlink_(httrackp * opt, int ptr, const char *adr,
const char *fil, const char *tag,
@@ -136,6 +140,17 @@ static int cmp_token(const char *tag, const char *cmp) {
&& !isalnum((unsigned char) tag[p]));
}
/* TRUE if (tag, attribute) matches an embedded-asset pair in the table */
static hts_boolean is_embed_pair(const htspair_t *table, const char *tag,
const char *attribute) {
int i;
for (i = 0; table[i].tag != NULL; i++) {
if (cmp_token(tag, table[i].tag) && cmp_token(attribute, table[i].attr))
return HTS_TRUE;
}
return HTS_FALSE;
}
static int hts_acceptlink_(httrackp * opt, int ptr,
const char *adr, const char *fil, const char *tag,
const char *attribute, int *set_prio_to,
@@ -163,15 +178,9 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
/* Built-in known tags (<img src=..>, ..) */
if (forbidden_url != 0 && opt->nearlink && tag != NULL && attribute != NULL) {
int i;
for(i = 0; hts_detect_embed[i].tag != NULL; i++) {
if (cmp_token(tag, hts_detect_embed[i].tag)
&& cmp_token(attribute, hts_detect_embed[i].attr)
) {
embedded_triggered = 1;
break;
}
if (is_embed_pair(hts_detect_embed, tag, attribute) ||
is_embed_pair(hts_detect_embed_html5, tag, attribute)) {
embedded_triggered = 1;
}
}


@@ -47,48 +47,89 @@ Please visit our Website: http://www.httrack.com
*/
/*
Unpack file into a new file
Unpack file into a new file (gzip, zlib RFC1950 or raw deflate RFC1951).
Return value: size of the new file, or -1 if an error occurred
*/
/* Note: utf-8 */
int hts_zunpack(char *filename, char *newfile) {
int ret = -1;
if (filename != NULL && newfile != NULL) {
if (filename[0] && newfile[0]) {
char catbuff[CATBUFF_SIZE];
FILE *const in = FOPEN(fconv(catbuff, sizeof(catbuff), filename), "rb");
const int fd = in != NULL ? fileno(in) : -1;
const int dup_fd = fd != -1 ? dup(fd) : -1;
// Note: we must dup to be able to fclose cleanly.
const gzFile gz = dup_fd != -1 ? gzdopen(dup_fd, "rb") : NULL;
if (filename != NULL && newfile != NULL && filename[0] && newfile[0]) {
char catbuff[CATBUFF_SIZE];
FILE *const in = FOPEN(fconv(catbuff, sizeof(catbuff), filename), "rb");
if (gz) {
FILE *const fpout = FOPEN(fconv(catbuff, sizeof(catbuff), newfile), "wb");
int size = 0;
if (in != NULL) {
unsigned char BIGSTK inbuf[8192];
size_t navail = fread(inbuf, 1, sizeof(inbuf), in);
/* gzip/zlib headers -> +32 windowBits; else raw deflate (RFC1951) */
const hts_boolean wrapped =
(navail >= 2 &&
((inbuf[0] == 0x1f && inbuf[1] == 0x8b) ||
((inbuf[0] & 0x0f) == Z_DEFLATED &&
(((unsigned) inbuf[0] << 8 | inbuf[1]) % 31) == 0)));
int attempt;
if (fpout) {
int nr;
/* deflate is ambiguous; on failure retry with the other windowBits */
for (attempt = 0; attempt < 2 && ret < 0; attempt++) {
const int windowBits =
(attempt == 0 ? wrapped : !wrapped) ? (32 + MAX_WBITS) : -MAX_WBITS;
FILE *fpout;
z_stream strm;
do {
char BIGSTK buff[1024];
nr = gzread(gz, buff, sizeof(buff));
if (nr > 0) {
size += nr;
if (fwrite(buff, 1, nr, fpout) != nr)
nr = size = -1;
}
} while(nr > 0);
if (attempt > 0) {
/* rewind input; reopening fpout "wb" discards the partial output */
if (fseek(in, 0, SEEK_SET) != 0)
break;
navail = fread(inbuf, 1, sizeof(inbuf), in);
}
fpout = FOPEN(fconv(catbuff, sizeof(catbuff), newfile), "wb");
if (fpout == NULL)
break;
memset(&strm, 0, sizeof(strm));
if (inflateInit2(&strm, windowBits) != Z_OK) {
fclose(fpout);
} else
size = -1;
gzclose(gz);
ret = (int) size;
}
if (in != NULL) {
fclose(in);
break;
}
{
hts_boolean ok = HTS_TRUE;
int size = 0;
int zerr = Z_OK;
/* chunked inflate; first chunk in inbuf, single member */
do {
strm.next_in = inbuf;
strm.avail_in = (uInt) navail;
do {
unsigned char BIGSTK outbuf[8192];
size_t produced;
strm.next_out = outbuf;
strm.avail_out = sizeof(outbuf);
zerr = inflate(&strm, Z_NO_FLUSH);
if (zerr == Z_NEED_DICT || zerr == Z_DATA_ERROR ||
zerr == Z_MEM_ERROR || zerr == Z_STREAM_ERROR) {
ok = HTS_FALSE;
break;
}
produced = sizeof(outbuf) - strm.avail_out;
if (produced > 0 &&
fwrite(outbuf, 1, produced, fpout) != produced) {
ok = HTS_FALSE;
break;
}
size += (int) produced;
} while (strm.avail_out == 0);
if (!ok || zerr == Z_STREAM_END)
break;
navail = fread(inbuf, 1, sizeof(inbuf), in);
} while (navail > 0);
if (ok && zerr == Z_STREAM_END)
ret = size;
}
inflateEnd(&strm);
fclose(fpout);
}
fclose(in);
}
}
return ret;
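
Reviewer sketch of the format detection and retry choice used in this hunk, written without zlib so it compiles on its own. It shows the gzip magic check, the RFC 1950 header check (low nibble of the first byte is 8, and the first two bytes read as a big-endian number are a multiple of 31), and how the second attempt flips the guess. The names sketch_is_wrapped and sketch_window_bits are hypothetical, and the constants assume zlib's Z_DEFLATED = 8 and the default MAX_WBITS = 15.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_Z_DEFLATED 8 /* assumed value of zlib's Z_DEFLATED */
#define SKETCH_MAX_WBITS 15 /* assumed value of zlib's default MAX_WBITS */

/* Return 1 if the first bytes look like a gzip or zlib (RFC 1950) header,
   0 if the stream should be treated as raw deflate (RFC 1951). */
static int sketch_is_wrapped(const unsigned char *buf, size_t n) {
  assert(buf != NULL || n == 0);
  if (n < 2)
    return 0;
  if (buf[0] == 0x1f && buf[1] == 0x8b)
    return 1; /* gzip magic */
  /* zlib: CM nibble is "deflate" and (CMF << 8 | FLG) is divisible by 31 */
  return (buf[0] & 0x0f) == SKETCH_Z_DEFLATED &&
         ((((unsigned) buf[0] << 8) | buf[1]) % 31) == 0;
}

/* windowBits for inflateInit2: 32 + MAX_WBITS auto-detects gzip/zlib,
   -MAX_WBITS selects raw deflate. The second attempt flips the guess,
   because a raw deflate body can happen to pass the zlib header check. */
static int sketch_window_bits(int wrapped, int attempt) {
  const int use_wrapped = (attempt == 0) ? wrapped : !wrapped;
  return use_wrapped ? (32 + SKETCH_MAX_WBITS) : -SKETCH_MAX_WBITS;
}
```

For a body starting with 0x78 0x9c the first attempt uses 47 and the retry uses -15, which is the case the "ae-collide" fixture in the self-test is meant to exercise.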


@@ -497,6 +497,12 @@ static const char *GetHttpMessage(int statuscode) {
case 417:
return "Expectation Failed";
break;
case 429:
return "Too Many Requests";
break;
case 451:
return "Unavailable For Legal Reasons";
break;
case 500:
return "Internal Server Error";
break;


@@ -323,4 +323,33 @@ grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
# HTML5 <source>/<track> follow as embedded near-links past the -r2 depth boundary (#451).
# img.gif positive control; plain.gif (bare <a href>) negative control proves the gate is selective.
site10="$tmp/html5media"
mkdir -p "$site10"
for f in img ss plain; do gif "$site10/$f.gif"; done
printf 'x' >"$site10/v.webm"
printf 'x' >"$site10/subs.vtt"
cat >"$site10/index.html" <<EOF
<html><body><a href="leaf.html">leaf</a></body></html>
EOF
cat >"$site10/leaf.html" <<EOF
<html><body>
<img src="img.gif">
<picture><source srcset="ss.gif 2x"></picture>
<video><source src="v.webm"></video>
<video><track src="subs.vtt"></video>
<a href="plain.gif">plain link past the boundary</a>
</body></html>
EOF
out10="$tmp/html5media-out"
rm -rf "$out10"
mkdir -p "$out10"
httrack "file://$site10/index.html" -O "$out10" --quiet --near -r2 >"$out10/.log" 2>&1 || true
found "img.gif" "$out10"
found "ss.gif" "$out10"
found "v.webm" "$out10"
found "subs.vtt" "$out10"
notfound "plain.gif" "$out10"
exit 0

tests/01_engine-robots.test Executable file

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# robots.txt RFC 9309 Allow/Disallow precedence (#452): longest match wins.
httrack -O /dev/null -#test=robots run | grep -q "robots self-test OK"

tests/01_engine-status.test Executable file

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# HTTP status -> reason phrase, including the modern 429/451 (#453).
httrack -O /dev/null -#test=status run | grep -q "status self-test OK"

tests/01_engine-useragent.test Executable file

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# Default User-Agent (#449): honest HTTrack token, no Windows 98 relic.
httrack -O /dev/null -#test=useragent run | grep -q "useragent self-test OK"


@@ -0,0 +1,11 @@
#!/bin/bash
#
set -euo pipefail
# Accept-Encoding (#450): advertise gzip+deflate; decode gzip/zlib/raw-deflate.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
httrack -O /dev/null -#test=acceptencoding "$dir" run |
grep -q "acceptencoding self-test OK"


@@ -6,7 +6,7 @@
# Golden cache-format regression test (driven by 'httrack -#test=cache-golden <dir>').
#
# 01_engine-cache.test writes the cache with the same build it reads back (a
# 01_zlib-cache.test writes the cache with the same build it reads back (a
# round-trip), so it cannot catch a read-path or ZIP-format regression where
# writer and reader drift together. This reads a *committed* cache frozen by an
# earlier build and asserts a fixed set of entries still decodes field- and


@@ -1,4 +1,4 @@
# Committed binary fixture read by 01_engine-cache-golden.test. List it
# Committed binary fixture read by 01_zlib-cache-golden.test. List it
# explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would
# silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
@@ -25,9 +25,6 @@ TEST_EXTENSIONS = .test
TEST_LOG_COMPILER = $(BASH)
TESTS = \
00_runnable.test \
01_engine-cache.test \
01_engine-cache-golden.test \
01_engine-cache-writefail.test \
01_engine-charset.test \
01_engine-cmdline.test \
01_engine-cookies.test \
@@ -44,12 +41,19 @@ TESTS = \
01_engine-pause.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-robots.test \
01_engine-savename.test \
01_engine-selftest-dispatch.test \
01_engine-simplify.test \
01_engine-status.test \
01_engine-stripquery.test \
01_engine-strsafe.test \
01_engine-urlhack.test \
01_engine-useragent.test \
01_zlib-acceptencoding.test \
01_zlib-cache.test \
01_zlib-cache-golden.test \
01_zlib-cache-writefail.test \
02_manpage-regen.test \
02_update-cache.test \
10_crawl-simple.test \