Compare commits

9 Commits

Author SHA1 Message Date
Xavier Roche
b804ee2da1 htsparse: keep makestat_time out of ENGINE_SET_CONTEXT
makestat_time throttles the makestat/maketrack stats to once per minute:
the wait loop compares time_local() against it and, when it fires, writes
it back to the local. But the field is by-value in the extended context,
so it can't round-trip through ENGINE_SAVE_CONTEXT, while ENGINE_SET_CONTEXT
re-read it from the load-once baseline on every loop iteration. That reset
the local before the next compare, so under -%v / maketrack the throttle
never held and the stats line plus the full back-stack dump were emitted
every iteration.

Drop makestat_time (and the never-changing makestat_fp) from SET_CONTEXT;
they belong to the load-once set. Wrapped the macro in clang-format off/on
for the same backslash-realignment reason as HT_ADD_END.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-07-01 10:50:42 +02:00
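
The throttle failure described in this commit can be reproduced in isolation. The sketch below is not HTTrack code: `demo_context`, `count_stat_emits`, the 60-second comparison and the simulated one-second clock are all assumptions made for illustration. It only shows why re-reading a by-value field from a load-once baseline on every pass defeats a throttle that updates a local.

```c
#include <assert.h>

/* Hypothetical stand-in for the extended context: the timestamp is stored
   by value, so the loop can read it but has no pointer to write it back. */
typedef struct {
  long makestat_time;
} demo_context;

/* Simulate `iterations` loop passes, one second apart, with a 60-second
   throttle. When reload_each_pass is nonzero the local is re-read from the
   load-once baseline on every pass, which is the behaviour the commit
   removes. Returns how many times the stats would have been emitted. */
static int count_stat_emits(int iterations, int reload_each_pass) {
  const demo_context baseline = {-60}; /* loaded once, never updated */
  long local_time = baseline.makestat_time;
  int emits = 0;
  long now;

  assert(iterations >= 0);
  for (now = 0; now < iterations; now++) {
    if (reload_each_pass)
      local_time = baseline.makestat_time; /* wipes the last update */
    if (now - local_time >= 60) {
      emits++;
      local_time = now; /* only the local copy remembers this */
    }
  }
  return emits;
}
```

Over 120 simulated seconds, calling `count_stat_emits(120, 1)` fires on every pass, while `count_stat_emits(120, 0)` fires twice (at 0 and at 60).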
Xavier Roche
20317cb85b htsparse: free the cache buffer in HT_ADD_END
The not-modified fast path reads the stored //[HTML-MD5]// digest via
cache_readdata, which malloc's the buffer, but never freed it. Every page
whose on-disk size already matches the freshly rewritten one leaks that
buffer. Free it after the compare.

Wrapped the macro in clang-format off/on: it is hand-aligned and
clang-format realigns every backslash on any edit, churning untouched
lines.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-07-01 10:50:42 +02:00
Xavier Roche
98e382390b htsparse: reserve 6x room for full HTML-escaping, not 5x
HT_ADD_HTMLESCAPED_ANY reserved strlen*5+1024 on the assumption that
"&amp;" (5 bytes) is the worst-case expansion. That holds for
escape_for_html_print, but escape_for_html_print_full turns a high byte
into "&#xHH;" (6 bytes). Past ~1023 high bytes the reservation is short,
so the escaper hits its internal cap: it truncates the string mid-run and
its overflow return counts the terminating NUL, which then lands inside
the mirrored HTML file. The only _full call site rewrites a link into a
2KB buffer, so a long non-ASCII local path triggers it.

Give the macro a per-function expansion factor (HTS_HTMLESCAPE_MAXEXP=5,
HTS_HTMLESCAPE_FULL_MAXEXP=6) and pass 6 for the _full variant. A new
escape-room self-test pins each function's real worst-case expansion
against the constant the macro reserves, so the two can't drift again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-07-01 10:50:15 +02:00
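
The "~1023 high bytes" threshold follows from the reservation arithmetic. The sketch below is illustrative only: `reserved_room` and `worst_case_fits` are hypothetical names, and it assumes the escaper needs the escaped output plus one byte for the terminating NUL.

```c
#include <assert.h>
#include <stddef.h>

/* Room the macro reserves for a string of `len` bytes. */
static size_t reserved_room(size_t len, size_t factor) {
  assert(factor > 0);
  return len * factor + 1024;
}

/* Worst case for the _full escaper: every byte becomes "&#xHH;" (6 bytes),
   plus one byte for the terminating NUL (an assumption of this sketch). */
static int worst_case_fits(size_t len, size_t factor) {
  const size_t needed = len * 6 + 1;
  return needed <= reserved_room(len, factor);
}
```

With a factor of 5, a run of 1023 high bytes still fits, but 1024 does not; with a factor of 6 the 1024-byte slack is never consumed by expansion.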
Xavier Roche
694e45c698 Collapse the 5 inplace_escape_* bodies into one shared helper (#464)
DECLARE_INPLACE_ESCAPE_VERSION stamped five byte-identical function
bodies, one per escaper. Move the body into a single static
inplace_escape() helper parameterized by a function pointer to the
underlying escape_*(); the five HTSEXT_API inplace_escape_* symbols
stay as thin wrappers, so the exported ABI is unchanged.

A new -#test=inplace-escape self-test asserts each inplace_escape_*()
equals its escape_*() applied to a copy across several samples
(including a >255-byte one to hit the helper's malloct path), guarding
the refactor.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 09:14:15 +02:00
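
The refactoring pattern (one shared body parameterized by a function pointer, with thin per-escaper wrappers) can be sketched outside HTTrack. Everything here is hypothetical: `toy_escape_spaces` is a stand-in escaper, and `malloc`/`free` replace `malloct`/`freet`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef size_t (*escape_fn)(const char *src, char *dest, size_t size);

/* Toy escaper (hypothetical): ' ' becomes "%20". Stops early if dest is
   full; returns the number of bytes written, not counting the NUL. */
static size_t toy_escape_spaces(const char *src, char *dest, size_t size) {
  size_t j = 0;
  assert(size > 0);
  for (; *src != '\0'; src++) {
    if (*src == ' ') {
      if (j + 3 >= size)
        break;
      memcpy(dest + j, "%20", 3);
      j += 3;
    } else {
      if (j + 1 >= size)
        break;
      dest[j++] = *src;
    }
  }
  dest[j] = '\0';
  return j;
}

/* Shared body: copy dest aside, then escape the copy back into dest. */
static size_t inplace_apply(char *dest, size_t size, escape_fn escape) {
  char stack_copy[256];
  const char *nul = memchr(dest, '\0', size);
  size_t len, ret;
  char *src;
  assert(nul != NULL); /* dest must be NUL-terminated within size */
  len = (size_t) (nul - dest);
  src = (len + 1 <= sizeof(stack_copy)) ? stack_copy : malloc(len + 1);
  assert(src != NULL);
  memcpy(src, dest, len + 1);
  ret = escape(src, dest, size);
  if (src != stack_copy)
    free(src); /* only the heap copy needs releasing */
  return ret;
}

/* Thin wrapper: the per-escaper symbol only binds the function pointer. */
static size_t inplace_toy_escape_spaces(char *dest, size_t size) {
  return inplace_apply(dest, size, toy_escape_spaces);
}
```

Each exported name stays a one-line wrapper, so callers and symbol names are unaffected while the copy-aside logic exists once.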
Xavier Roche
db9ec2cc3b Replace duplicated HT_INDEX_END macro with a shared function (#463)
The macro that closes the makeindex index.html (footer, optional refresh meta, then the user "primary" command) was copy-pasted into htsparse.c and htscore.c, flagged by a `// COPY IN HTSCORE.C` comment and drifting in whitespace. Collapse both into hts_finish_makeindex() in htscore.c, declared in htscore.h.

The two copies were not byte-identical: the final usercommand() call passed "primary","primary" from htsparse.c but "","" from htscore.c. The helper takes those as adr/fil parameters so each call site keeps its exact behavior.

Add a -#test=makeindex self-test (driven by 01_engine-makeindex.test) that drives the function offline and asserts the footer is written, the refresh meta appears only for a single first link, and *fp/*done are updated.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:49:34 +02:00
Xavier Roche
6a9ab2a11f Fix macro-hygiene defects in htsstrings.h (dup define, double-eval) (#462)
StringSubRW was defined twice (the second under a redundant comment); drop
the duplicate. StringCatN and StringSetLength each evaluated their SIZE
argument twice, so a side-effecting argument would run twice and a wrapping
expression could clamp inconsistently. Capture SIZE once into a local, the
way StringCopyN already does. StringSetLength keeps a signed local so the
"negative means strlen(buffer_)" contract is preserved.

No current call site passes a side-effecting SIZE, so the change is
behavior-preserving; the strsafe self-test now passes a (counter++, V)
argument and asserts a single evaluation, which fails on the old macros.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:49:01 +02:00
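
The double-evaluation hazard and its fix can be shown with two toy macros. These are not the htsstrings.h macros: `CLAMP_TWICE`, `CLAMP_ONCE` and `count_evaluations` are hypothetical names that only mirror the idea of capturing the size argument into a local.

```c
#include <assert.h>

/* Double-evaluating clamp: N appears twice in the expansion. */
#define CLAMP_TWICE(OUT, N, MAX) ((OUT) = ((N) > (MAX) ? (MAX) : (N)))

/* Single-evaluating clamp: N is captured into a local first. */
#define CLAMP_ONCE(OUT, N, MAX) \
  do { \
    const int n_ = (N); \
    (OUT) = (n_ > (MAX)) ? (MAX) : n_; \
  } while (0)

/* Returns how many times the size argument was evaluated. */
static int count_evaluations(int use_fixed_macro) {
  int counter = 0;
  int out = 0;
  if (use_fixed_macro) {
    CLAMP_ONCE(out, (counter++, 3), 10);
  } else {
    CLAMP_TWICE(out, (counter++, 3), 10);
  }
  assert(out == 3); /* both macros compute the same value */
  return counter;
}
```

The `(counter++, 3)` argument is the same trick the strsafe self-test uses: the value is unchanged, but the counter reveals how often the macro touched its argument.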
Xavier Roche
13b31986d5 tests: skip connect-fallback test on GNU/Hurd (#461)
19_local-connect-fallback synthesizes a multi-address host by pinning
"deadhost" to a dead 127.0.0.2 then the live 127.0.0.1, so httrack's
per-address connect fallback has somewhere to fall back to. That fixture
needs a second loopback IP: the dead and live addresses share the URL's
port (htslib.c applies it to every resolved address), so they must differ
by IP. Linux and macOS route all of 127/8 to loopback; GNU/Hurd has only
127.0.0.1, so the dead address fails synchronously, the fallback never
engages, and the test fails on hurd-i386.

This is a fixture limitation, not a bug in the fallback code, so skip on
GNU/Hurd (uname -s = GNU). Runtime bind and connect probes were tried first,
but both wrongly skipped macOS, which connects to 127.0.0.2 fine but does
not treat it as bindable; the one-loopback-IP fact is what actually
distinguishes Hurd. hurd-i386 is a non-release port and did not block
migration.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 19:40:07 +02:00
Xavier Roche
bd7e0989f6 Parse robots.txt with RFC 9309 Allow/Disallow precedence (#458)
The robots.txt handler only did substring Disallow matching against a flat
token blob: no Allow:, no path wildcards. Sites using "Disallow: /" plus
"Allow: /public/" were over-blocked, since Allow was never parsed.

Move the body parsing into robots_parse() (htsrobots.c) so both the crawler
and a self-test feed raw robots.txt. Rules are stored Allow/Disallow-tagged
and consulted with RFC 9309 precedence: the longest matching path pattern
wins, Allow breaking ties. Pattern matching supports '*' (any run) and a
trailing '$' (end-of-path anchor) via a linear two-pointer matcher with a
single resumable star position, so hostile patterns cannot trigger
exponential backtracking. Path matching is now case-sensitive per the RFC.

robots_wizard is internal (not in DevIncludes_DATA, no HTSEXT_API; htsopt.h
holds only an opaque pointer), so the in-memory format changed without an ABI
break. Sitemap:/Crawl-delay: lines are tolerated but ignored, as before.

New -#test=robots self-test plus tests/01_engine-robots.test cover the
Allow-over-Disallow longest match, the equal-length Allow tie, '*'/'$'
wildcards, and httrack-group selection.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 09:07:54 +02:00
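
The precedence rule (longest matching pattern wins, Allow wins ties) can be sketched separately from the wildcard matcher. This is a simplified illustration, not the HTTrack implementation: `demo_rule` and `demo_is_allowed` are hypothetical, and patterns are treated as plain path prefixes with no '*' or '$' support.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  int allow;           /* 1 for Allow, 0 for Disallow */
  const char *pattern; /* plain path prefix in this sketch */
} demo_rule;

/* Returns 1 if `path` may be fetched, 0 if the winning rule is a Disallow. */
static int demo_is_allowed(const demo_rule *rules, size_t count,
                           const char *path) {
  size_t best_len = 0, i;
  int matched = 0, best_allow = 1;
  assert(rules != NULL || count == 0);
  for (i = 0; i < count; i++) {
    const size_t len = strlen(rules[i].pattern);
    if (strncmp(path, rules[i].pattern, len) != 0)
      continue; /* prefix does not match */
    /* Longer match replaces the best; on equal length, Allow wins. */
    if (!matched || len > best_len || (len == best_len && rules[i].allow)) {
      matched = 1;
      best_len = len;
      best_allow = rules[i].allow;
    }
  }
  return !matched || best_allow; /* no matching rule means allowed */
}
```

Because the decision depends only on match length and rule type, the order in which rules appear in robots.txt does not affect the result.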
Xavier Roche
bd74ec7cab Advertise deflate in Accept-Encoding and decode it (#459)
The request Accept-Encoding offered only gzip even though the response
parser already recognized deflate/x-deflate. But the actual decode path
(hts_zunpack) used zlib's gzread, which only inflates gzip and copies any
deflate body through verbatim, so a deflate response would have been
written out still compressed. Advertising deflate without fixing that
would corrupt files.

Rewrite hts_zunpack to inflate via inflateInit2 with format detection:
gzip and zlib (RFC 1950) auto-detect with +32 windowBits, everything else
is treated as raw deflate (RFC 1951). Then add deflate to the advertised
list through a small hts_acceptencoding() helper shared with the test.

A new -#test=acceptencoding self-test asserts the advertised header
carries both gzip and deflate, and round-trips gzip, zlib and raw-deflate
bodies through hts_zunpack on disk. Both halves fail on the old binary.

Brotli is intentionally out of scope (new dependency, larger change).

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 08:54:03 +02:00
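
The three body formats differ in their first bytes, which is what makes detection possible. The sketch below is a hypothetical header sniffer, not how hts_zunpack necessarily decides: `demo_format` and `demo_sniff_format` are invented names. In zlib itself, passing 15+32 as windowBits to inflateInit2 enables automatic zlib/gzip header detection, and a negative windowBits such as -15 selects raw deflate.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FMT_GZIP, FMT_ZLIB, FMT_RAW_DEFLATE } demo_format;

/* Guess the container from the first two bytes. A raw deflate stream can,
   by coincidence, begin with bytes that look like a zlib header, so this
   is a heuristic rather than a proof. */
static demo_format demo_sniff_format(const unsigned char *buf, size_t len) {
  assert(buf != NULL || len == 0);
  if (len >= 2) {
    /* gzip magic number (RFC 1952) */
    if (buf[0] == 0x1f && buf[1] == 0x8b)
      return FMT_GZIP;
    /* zlib header (RFC 1950): CM == 8, CINFO <= 7, and the two header
       bytes read as a big-endian 16-bit value are a multiple of 31 */
    if ((buf[0] & 0x0f) == 8 && (buf[0] >> 4) <= 7 &&
        ((unsigned) buf[0] * 256u + buf[1]) % 31u == 0)
      return FMT_ZLIB;
  }
  return FMT_RAW_DEFLATE; /* everything else: headerless deflate */
}
```

A caller could map the result to a windowBits value before initializing the inflater; letting zlib auto-detect gzip and zlib and falling back to raw deflate, as the commit describes, avoids maintaining such a sniffer by hand.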
14 changed files with 576 additions and 177 deletions

@@ -406,29 +406,40 @@ void hts_invalidate_link(httrackp * opt, int lpos) {
opt->liens[lpos]->pass2 = -1;
}
#define HT_INDEX_END do { \
if (!makeindex_done) { \
if (makeindex_fp) { \
char BIGSTK tempo[1024]; \
if (makeindex_links == 1) { \
char BIGSTK link_escaped[HTS_URLMAXSIZE*2]; \
escape_uri_utf(makeindex_firstlink, link_escaped, sizeof(link_escaped)); \
snprintf(tempo,sizeof(tempo),"<meta HTTP-EQUIV=\"Refresh\" CONTENT=\"0; URL=%s\">"CRLF, link_escaped); \
} else \
tempo[0]='\0'; \
hts_template_format(makeindex_fp,template_footer, \
"<!-- Mirror and index made by HTTrack Website Copier/"HTTRACK_VERSION" "HTTRACK_AFF_AUTHORS" -->", \
tempo, /* EOF */ NULL \
); \
fflush(makeindex_fp); \
fclose(makeindex_fp); /* do not forget this, or face a sleepless night */ \
makeindex_fp=NULL; \
usercommand(opt,0,NULL,fconcat(OPT_GET_BUFF(opt),OPT_GET_BUFF_SIZE(opt),StringBuff(opt->path_html_utf8),"index.html"),"",""); \
} \
} \
makeindex_done=1; /* ok, done */ \
} while(0)
// Write the makeindex footer (refresh meta when makeindex_links==1), close
// the file, then run usercommand.
void hts_finish_makeindex(httrackp *opt, int *makeindex_done,
FILE **makeindex_fp, int makeindex_links,
const char *makeindex_firstlink,
const char *template_footer, const char *adr,
const char *fil) {
if (!*makeindex_done) {
if (*makeindex_fp) {
char BIGSTK tempo[1024];
if (makeindex_links == 1) {
char BIGSTK link_escaped[HTS_URLMAXSIZE * 2];
escape_uri_utf(makeindex_firstlink, link_escaped, sizeof(link_escaped));
snprintf(tempo, sizeof(tempo),
"<meta HTTP-EQUIV=\"Refresh\" CONTENT=\"0; URL=%s\">" CRLF,
link_escaped);
} else
tempo[0] = '\0';
hts_template_format(*makeindex_fp, template_footer,
"<!-- Mirror and index made by HTTrack Website "
"Copier/" HTTRACK_VERSION " " HTTRACK_AFF_AUTHORS
" -->",
tempo, /* EOF */ NULL);
fflush(*makeindex_fp);
fclose(*makeindex_fp);
*makeindex_fp = NULL;
usercommand(opt, 0, NULL,
fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt),
StringBuff(opt->path_html_utf8), "index.html"),
adr, fil);
}
}
*makeindex_done = 1;
}
/* does it look like XML ? (SVG et al.) */
static int look_like_xml(const char *s) {
@@ -1796,90 +1807,18 @@ int httpmirror(char *url1, httrackp * opt) {
if (strnotempty(savename()) == 0) { // no save path
if (strcmp(urlfil(), "/robots.txt") == 0) { // robots.txt
if (r.adr) {
int bptr = 0;
char BIGSTK line[1024];
char BIGSTK buff[8192];
char BIGSTK infobuff[8192];
int record = 0;
line[0] = '\0';
buff[0] = '\0';
infobuff[0] = '\0';
//
#if DEBUG_ROBOTS
printf("robots.txt dump:\n%s\n", r.adr);
#endif
do {
char *comm;
int llen;
bptr += binput(r.adr + bptr, line, sizeof(line) - 2);
/* strip comment */
comm = strchr(line, '#');
if (comm != NULL) {
*comm = '\0';
}
/* strip spaces */
llen = (int) strlen(line);
while(llen > 0 && is_realspace(line[llen - 1])) {
line[llen - 1] = '\0';
llen--;
}
if (strfield(line, "user-agent:")) {
char *a;
a = line + 11;
while(is_realspace(*a))
a++; // skip space(s)
if (*a == '*') {
if (record != 2)
record = 1; // this group applies to us
} else if (strfield(a, "httrack") || strfield(a, "winhttrack")
|| strfield(a, "webhttrack")) {
buff[0] = '\0'; // record again from scratch
infobuff[0] = '\0';
record = 2; // locked
#if DEBUG_ROBOTS
printf("explicit disallow for httrack\n");
#endif
} else
record = 0;
} else if (record) {
if (strfield(line, "disallow:")) {
char *a = line + 9;
while(is_realspace(*a))
a++; // skip space(s)
if (strnotempty(a)) {
#ifdef IGNORE_RESTRICTIVE_ROBOTS
if (strcmp(a, "/") != 0 ||
opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
hts_boolean keep_root = (opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
? HTS_TRUE
: HTS_FALSE;
#else
hts_boolean keep_root = HTS_TRUE;
#endif
{ /* ignoring disallow: / */
if ((strlen(buff) + strlen(a) + 8) < sizeof(buff)) {
strcatbuff(buff, a);
strcatbuff(buff, "\n");
if ((strlen(infobuff) + strlen(a) + 8) <
sizeof(infobuff)) {
if (strnotempty(infobuff))
strcatbuff(infobuff, ", ");
strcatbuff(infobuff, a);
}
}
}
#ifdef IGNORE_RESTRICTIVE_ROBOTS
else {
hts_log_print(opt, LOG_NOTICE,
"Note: %s robots.txt rules are too restrictive, ignoring /",
urladr());
}
#endif
}
}
}
} while((bptr < r.size) && (strlen(buff) < (sizeof(buff) - 32)));
if (strnotempty(buff)) {
checkrobots_set(&robots, urladr(), buff);
robots_parse(&robots, urladr(), r.adr, r.size, infobuff,
sizeof(infobuff), keep_root);
if (strnotempty(infobuff)) {
hts_log_print(opt, LOG_INFO,
"Note: robots.txt forbidden links for %s are: %s",
urladr(), infobuff);
@@ -2116,7 +2055,8 @@ int httpmirror(char *url1, httrackp * opt) {
/*
Ensure the index is being closed
*/
HT_INDEX_END;
hts_finish_makeindex(opt, &makeindex_done, &makeindex_fp, makeindex_links,
makeindex_firstlink, template_footer, "", "");
/*
updating-a-remotely-deleted-website hack

@@ -362,6 +362,14 @@ void usercommand(httrackp * opt, int exe, const char *cmd, const char *file,
void usercommand_exe(const char *cmd, const char *file);
// Finish the makeindex index.html (footer + refresh meta), run usercommand.
// Updates *makeindex_done/*makeindex_fp in place; adr/fil are the mode strings.
void hts_finish_makeindex(httrackp *opt, int *makeindex_done,
FILE **makeindex_fp, int makeindex_links,
const char *makeindex_firstlink,
const char *template_footer, const char *adr,
const char *fil);
int filters_init(char ***ptrfilters, int maxfilter, int filterinc);
int fspc(httrackp * opt, FILE * fp, const char *type);
@@ -470,4 +478,8 @@ void voidf(void);
/* HTML marker comment marking where the top index is spliced. */
#define HTS_TOPINDEX "TOP_INDEX_HTTRACK"
/* Worst-case byte expansion HT_ADD_HTMLESCAPED* must reserve per escaper. */
#define HTS_HTMLESCAPE_MAXEXP 5 /* escape_for_html_print: '&'->"&amp;" */
#define HTS_HTMLESCAPE_FULL_MAXEXP 6 /* _full: high byte->"&#xHH;" */
#endif

@@ -4131,25 +4131,33 @@ DECLARE_APPEND_ESCAPE_VERSION(escape_uri)
#undef DECLARE_APPEND_ESCAPE_VERSION
// Same as above, but in-place
#undef DECLARE_INPLACE_ESCAPE_VERSION
#define DECLARE_INPLACE_ESCAPE_VERSION(NAME) \
HTSEXT_API size_t inplace_ ##NAME(char *const dest, const size_t size) { \
char buffer[256]; \
const size_t len = strnlen(dest, size); \
const int in_buffer = len + 1 < sizeof(buffer); \
char *src = in_buffer ? buffer : malloct(len + 1); \
size_t ret; \
assertf(src != NULL); \
assertf(len < size); \
memcpy(src, dest, len + 1); \
ret = NAME(src, dest, size); \
if (!in_buffer) { \
freet(src); \
} \
return ret; \
// In-place escaping: copy dest aside, then escape that copy back into dest.
typedef size_t (*escape_fn_t)(const char *src, char *dest, size_t size);
static size_t inplace_escape(char *const dest, const size_t size,
escape_fn_t escape) {
char buffer[256];
const size_t len = strnlen(dest, size);
const int in_buffer = len + 1 < sizeof(buffer);
char *src = in_buffer ? buffer : malloct(len + 1);
size_t ret;
assertf(src != NULL);
assertf(len < size);
memcpy(src, dest, len + 1);
ret = escape(src, dest, size);
if (!in_buffer) {
freet(src);
}
return ret;
}
// Thin exported wrappers binding inplace_escape() to each escaper (ABI).
#undef DECLARE_INPLACE_ESCAPE_VERSION
#define DECLARE_INPLACE_ESCAPE_VERSION(NAME) \
HTSEXT_API size_t inplace_##NAME(char *const dest, const size_t size) { \
return inplace_escape(dest, size, NAME); \
}
DECLARE_INPLACE_ESCAPE_VERSION(escape_in_url)
DECLARE_INPLACE_ESCAPE_VERSION(escape_spc_url)
DECLARE_INPLACE_ESCAPE_VERSION(escape_uri_utf)

@@ -77,13 +77,14 @@ Please visit our Website: http://www.httrack.com
/** Append to the output buffer the string 'A'. **/
#define HT_ADD(A) TypedArrayAppend(output_buffer, A, strlen(A))
/** Append to the output buffer the string 'A', html-escaped. **/
#define HT_ADD_HTMLESCAPED_ANY(A, FUNCTION) do { \
/* clang-format off: an edit realigns all backslashes, churning the macro. */
/* clang-format off */
/** Append 'A' to the output buffer, html-escaped; FACTOR = max byte expansion. **/
#define HT_ADD_HTMLESCAPED_ANY(A, FUNCTION, FACTOR) do { \
if ((opt->getmode & 1) != 0 && ptr>0) { \
const char *const str_ = (A); \
size_t size_; \
/* &amp; is the maximum expansion */ \
TypedArrayEnsureRoom(output_buffer, strlen(str_) * 5 + 1024); \
TypedArrayEnsureRoom(output_buffer, strlen(str_) * (FACTOR) + 1024); \
size_ = FUNCTION(str_, &TypedArrayTail(output_buffer), \
TypedArrayRoom(output_buffer)); \
TypedArraySize(output_buffer) += size_; \
@@ -91,17 +92,22 @@ Please visit our Website: http://www.httrack.com
} while(0)
/** Append to the output buffer the string 'A', html-escaped for &. **/
#define HT_ADD_HTMLESCAPED(A) HT_ADD_HTMLESCAPED_ANY(A, escape_for_html_print)
#define HT_ADD_HTMLESCAPED(A) \
HT_ADD_HTMLESCAPED_ANY(A, escape_for_html_print, HTS_HTMLESCAPE_MAXEXP)
/**
* Append to the output buffer the string 'A', html-escaped for & and
* Append to the output buffer the string 'A', html-escaped for & and
* high chars.
**/
#define HT_ADD_HTMLESCAPED_FULL(A) HT_ADD_HTMLESCAPED_ANY(A, escape_for_html_print_full)
#define HT_ADD_HTMLESCAPED_FULL(A) \
HT_ADD_HTMLESCAPED_ANY(A, escape_for_html_print_full, HTS_HTMLESCAPE_FULL_MAXEXP)
/* clang-format on */
// does nothing
#define XH_uninit do {} while(0)
/* clang-format off: an edit realigns all backslashes, churning the macro. */
/* clang-format off */
#define HT_ADD_END { \
int ok=0;\
if (TypedArraySize(output_buffer) != 0) { \
@@ -123,6 +129,7 @@ Please visit our Website: http://www.httrack.com
} else {\
ok=0;\
} \
freet(mbuff);\
}\
if (!ok) { \
file_notify(opt,urladr(), urlfil(), savename(), 1, 1, r->notmodified); \
@@ -165,32 +172,9 @@ Please visit our Website: http://www.httrack.com
} \
TypedArrayFree(output_buffer); \
}
/* clang-format on */
#define HT_ADD_FOP
// COPY IN HTSCORE.C
#define HT_INDEX_END do { \
if (!makeindex_done) { \
if (makeindex_fp) { \
char BIGSTK tempo[1024]; \
if (makeindex_links == 1) { \
char BIGSTK link_escaped[HTS_URLMAXSIZE*2]; \
escape_uri_utf(makeindex_firstlink, link_escaped, sizeof(link_escaped)); \
snprintf(tempo,sizeof(tempo),"<meta HTTP-EQUIV=\"Refresh\" CONTENT=\"0; URL=%s\">"CRLF,link_escaped); \
} else \
tempo[0]='\0'; \
hts_template_format(makeindex_fp,template_footer, \
"<!-- Mirror and index made by HTTrack Website Copier/"HTTRACK_VERSION" "HTTRACK_AFF_AUTHORS" -->", \
tempo, /* EOF */ NULL \
); \
fflush(makeindex_fp); \
fclose(makeindex_fp); /* do not forget this, or face a sleepless night */ \
makeindex_fp=NULL; \
usercommand(opt,0,NULL,fconcat(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_html_utf8),"index.html"),"primary","primary"); \
} \
} \
makeindex_done=1; /* ok, done */ \
} while(0)
#define ENGINE_DEFINE_CONTEXT() \
ENGINE_DEFINE_CONTEXT_BASE(); \
/* */ \
@@ -217,6 +201,9 @@ Please visit our Website: http://www.httrack.com
HTS_UNUSED TStamp makestat_time = stre->makestat_time; \
HTS_UNUSED FILE* makestat_fp = stre->makestat_fp
/* clang-format off: an edit realigns all backslashes, churning the macro. */
/* clang-format off */
/* Load-once: re-reading resets makestat_time (mutated locally, never SAVEd). */
#define ENGINE_SET_CONTEXT() \
ENGINE_SET_CONTEXT_BASE(); \
/* */ \
@@ -227,9 +214,8 @@ Please visit our Website: http://www.httrack.com
makeindex_fp = *stre->makeindex_fp_; \
makeindex_links = *stre->makeindex_links_; \
/* */ \
stat_fragment = *stre->stat_fragment_; \
makestat_time = stre->makestat_time; \
makestat_fp = stre->makestat_fp
stat_fragment = *stre->stat_fragment_
/* clang-format on */
#define ENGINE_LOAD_CONTEXT() \
ENGINE_DEFINE_CONTEXT()
@@ -709,7 +695,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
} else if (heap(ptr)->depth < opt->depth) { // we skipped level1+1 and level1
HT_INDEX_END;
hts_finish_makeindex(opt, &makeindex_done, &makeindex_fp,
makeindex_links, makeindex_firstlink,
template_footer, "primary", "primary");
}
} // if (opt->makeindex)
}

@@ -44,28 +44,84 @@ Please visit our Website: http://www.httrack.com
// -- robots --
/* RFC 9309 path-prefix match; '*' any run, '$' anchors end; linear. */
static hts_boolean robots_pattern_match(const char *pattern, const char *path) {
size_t patlen = strlen(pattern);
hts_boolean anchored = HTS_FALSE;
const char *p, *pend, *s;
const char *star = NULL, *star_s = NULL;
if (patlen > 0 && pattern[patlen - 1] == '$') {
anchored = HTS_TRUE;
patlen--;
}
p = pattern;
pend = pattern + patlen;
s = path;
while (*s != '\0') {
if (p == pend) {
if (!anchored)
return HTS_TRUE; // prefix matched
if (star != NULL) { // anchored: '*' must eat the rest
p = star + 1;
s = ++star_s;
continue;
}
return HTS_FALSE;
}
if (*p == '*') {
star = p++;
star_s = s;
} else if (*p == *s) {
p++;
s++;
} else if (star != NULL) {
p = star + 1;
s = ++star_s;
} else {
return HTS_FALSE;
}
}
while (p < pend && *p == '*')
p++;
return (p == pend) ? HTS_TRUE : HTS_FALSE;
}
// fil="": check whether a rule is already recorded
int checkrobots(robots_wizard * robots, const char *adr, const char *fil) {
while(robots) {
if (strfield2(robots->adr, adr)) {
if (fil[0]) {
/* RFC 9309: longest pattern wins, Allow beats Disallow on ties. */
int ptr = 0;
char line[250];
char line[HTS_ROBOTS_TOKEN_SIZE];
size_t toklen = strlen(robots->token);
size_t best_len = 0;
hts_boolean matched = HTS_FALSE;
hts_boolean best_allow = HTS_FALSE;
if (strnotempty(robots->token)) {
do {
ptr += binput(robots->token + ptr, line, 200);
if (line[0] == '/') { // absolute
if (strfield(fil, line)) { // starts with this line
return -1; // forbidden
}
} else { // relative
if (strstrcase(fil, line)) {
return -1;
while (ptr < (int) toklen) {
ptr += binput(robots->token + ptr, line, sizeof(line) - 1);
if (line[0] != 'A' && line[0] != 'D')
continue;
{
const hts_boolean is_allow =
(line[0] == 'A') ? HTS_TRUE : HTS_FALSE;
const char *pat = line + 1;
if (robots_pattern_match(pat, fil)) {
const size_t len = strlen(pat);
if (!matched || len > best_len || (len == best_len && is_allow)) {
matched = HTS_TRUE;
best_len = len;
best_allow = is_allow;
}
}
} while((strnotempty(line)) && (ptr < (int) strlen(robots->token)));
}
}
if (matched && !best_allow)
return -1; // forbidden
} else {
return -1;
}
@@ -74,6 +130,93 @@ int checkrobots(robots_wizard * robots, const char *adr, const char *fil) {
}
return 0;
}
/* Append "<marker><pattern>\n" to the bounded rule blob if it fits. */
static void robots_blob_add(char *blob, size_t blobsize, char marker,
const char *pat) {
const size_t used = strlen(blob);
const size_t need = strlen(pat) + 2; // marker + '\n'
if (need < blobsize - used) { // overflow-safe: used <= blobsize-1
blob[used] = marker;
blob[used + 1] = '\0';
strlcatbuff(blob, pat, blobsize);
strlcatbuff(blob, "\n", blobsize);
}
}
void robots_parse(robots_wizard *robots, const char *adr, const char *body,
size_t bodysize, char *info, size_t infosize,
hts_boolean keep_root_disallow) {
size_t bptr = 0;
int record = 0;
char BIGSTK line[1024];
char BIGSTK blob[HTS_ROBOTS_TOKEN_SIZE];
blob[0] = '\0';
if (info != NULL && infosize > 0)
info[0] = '\0';
#if DEBUG_ROBOTS
printf("robots.txt dump:\n%s\n", body);
#endif
while (bptr < bodysize) {
char *comm;
int llen;
bptr += binput(body + bptr, line, sizeof(line) - 2);
comm = strchr(line, '#'); // strip comment
if (comm != NULL)
*comm = '\0';
llen = (int) strlen(line); // strip trailing spaces
while (llen > 0 && is_realspace(line[llen - 1])) {
line[llen - 1] = '\0';
llen--;
}
if (strfield(line, "user-agent:")) {
char *a = line + 11;
while (is_realspace(*a))
a++;
if (*a == '*') {
if (record != 2)
record = 1; // generic group applies to us
} else if (strfield(a, "httrack") || strfield(a, "winhttrack") ||
strfield(a, "webhttrack")) {
blob[0] = '\0'; // explicit group: restart capture
if (info != NULL && infosize > 0)
info[0] = '\0';
record = 2; // locked to the httrack group
} else
record = 0;
} else if (record) {
hts_boolean is_allow = strfield(line, "allow:");
hts_boolean is_disallow = !is_allow && strfield(line, "disallow:");
if (is_allow || is_disallow) {
char *a = line + (is_allow ? 6 : 9);
while (is_realspace(*a))
a++;
if (strnotempty(a)) {
if (is_disallow && !keep_root_disallow && strcmp(a, "/") == 0) {
// dropped: site-wide disallow ignored by option
} else {
robots_blob_add(blob, sizeof(blob), is_allow ? 'A' : 'D', a);
if (is_disallow && info != NULL &&
strlen(a) + 2 < infosize - strlen(info)) {
if (strnotempty(info))
strlcatbuff(info, ", ", infosize);
strlcatbuff(info, a, infosize);
}
}
}
}
}
}
if (strnotempty(blob))
checkrobots_set(robots, adr, blob);
}
int checkrobots_set(robots_wizard * robots, const char *adr, const char *data) {
if (((int) strlen(adr)) >= sizeof(robots->adr) - 2)
return 0;

@@ -39,17 +39,27 @@ Please visit our Website: http://www.httrack.com
#define HTS_DEF_FWSTRUCT_robots_wizard
typedef struct robots_wizard robots_wizard;
#endif
/* Per-host blob: one rule per line, first byte 'A'/'D' then path pattern. */
#define HTS_ROBOTS_TOKEN_SIZE 4096
struct robots_wizard {
char adr[128];
char token[4096];
char token[HTS_ROBOTS_TOKEN_SIZE];
struct robots_wizard *next;
};
/* Library internal definitions */
#ifdef HTS_INTERNAL_BYTECODE
/* -1 if `fil` disallowed for `adr` (RFC 9309); empty: -1 if rules exist. */
int checkrobots(robots_wizard * robots, const char *adr, const char *fil);
void checkrobots_free(robots_wizard * robots);
int checkrobots_set(robots_wizard * robots, const char *adr, const char *data);
/* Parse robots.txt `body` for `adr`, storing the HTTrack group's rules; `info`
gets a disallow summary, `keep_root_disallow` FALSE drops "Disallow: /". */
void robots_parse(robots_wizard *robots, const char *adr, const char *body,
size_t bodysize, char *info, size_t infosize,
hts_boolean keep_root_disallow);
#endif
#endif

@@ -524,6 +524,41 @@ static int string_safety_selftests(void) {
return 1;
}
/* StringCatN/StringSetLength must eval SIZE once: (n_eval++, V) leaves
n_eval == 2 on a double-eval macro. */
{
String s = STRING_EMPTY;
int n_eval = 0;
StringCat(s, "hello");
StringCatN(s, "world", (n_eval++, 3)); /* strlen>SIZE so the clamp runs */
if (n_eval != 1 || strcmp(StringBuff(s), "hellowor") != 0) {
StringFree(s);
return 1;
}
n_eval = 0;
StringSetLength(s, (n_eval++, 5));
if (n_eval != 1 || StringLength(s) != 5) {
StringFree(s);
return 1;
}
StringFree(s);
}
/* StringSubRW still reads/writes after dropping its duplicate definition. */
{
String s = STRING_EMPTY;
StringCat(s, "abc");
StringSubRW(s, 1) = 'X';
if (StringSub(s, 1) != 'X' || strcmp(StringBuff(s), "aXc") != 0) {
StringFree(s);
return 1;
}
StringFree(s);
}
return 0;
}
@@ -1305,6 +1340,134 @@ static int st_urlhack(httrackp *opt, int argc, char **argv) {
return 0;
}
// hts_finish_makeindex writes the footer, emits the refresh meta only when
// makeindex_links==1, and clears *fp / sets *done. argv[0] is a writable dir.
static int st_makeindex(httrackp *opt, int argc, char **argv) {
char path[HTS_URLMAXSIZE];
char buf[4096];
FILE *fp;
size_t n;
int done;
assertf(argc >= 1);
snprintf(path, sizeof(path), "%s/index.html", argv[0]);
/* single first link: footer + a refresh meta carrying the escaped URL */
done = 0;
fp = fopen(path, "wb");
assertf(fp != NULL);
hts_finish_makeindex(opt, &done, &fp, 1, "http://example.com/a b", "%s%s", "",
"");
assertf(fp == NULL); /* the function closed and cleared it */
assertf(done != 0);
fp = fopen(path, "rb");
assertf(fp != NULL);
n = fread(buf, 1, sizeof(buf) - 1, fp);
fclose(fp);
buf[n] = '\0';
assertf(strstr(buf, "Mirror and index made by HTTrack") != NULL);
assertf(strstr(buf, "Refresh") != NULL);
assertf(strstr(buf, "example.com") != NULL);
/* no single link: footer only, no refresh meta */
done = 0;
fp = fopen(path, "wb");
assertf(fp != NULL);
hts_finish_makeindex(opt, &done, &fp, 0, NULL, "%s%s", "", "");
assertf(fp == NULL);
assertf(done != 0);
fp = fopen(path, "rb");
assertf(fp != NULL);
n = fread(buf, 1, sizeof(buf) - 1, fp);
fclose(fp);
buf[n] = '\0';
assertf(strstr(buf, "Mirror and index made by HTTrack") != NULL);
assertf(strstr(buf, "Refresh") == NULL);
UNLINK(path);
printf("makeindex self-test OK\n");
return 0;
}
/* Each inplace_escape_*() must equal escape_*() on a copy. */
static int st_inplace_escape(httrackp *opt, int argc, char **argv) {
/* >255 bytes forces the helper's malloct path, not the stack buffer */
static char longstr[600];
static const char *const samples[] = {
"", "abc", "a b/c?d=e&f", "h\x8ello w\x94rld",
"a%b\"c<d>", "/path to/file", longstr};
static size_t (*const inplace[])(char *, size_t) = {
inplace_escape_in_url, inplace_escape_spc_url, inplace_escape_uri_utf,
inplace_escape_check_url, inplace_escape_uri};
static size_t (*const plain[])(const char *, char *, size_t) = {
escape_in_url, escape_spc_url, escape_uri_utf, escape_check_url,
escape_uri};
size_t i, f;
(void) opt;
(void) argc;
(void) argv;
memset(longstr, 'a', sizeof(longstr) - 1);
for (f = 0; f < sizeof(inplace) / sizeof(inplace[0]); f++) {
for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
char ref[4096], work[4096];
size_t rret, iret;
rret = plain[f](samples[i], ref, sizeof(ref));
strcpybuff(work, samples[i]);
iret = inplace[f](work, sizeof(work));
assertf(iret == rret);
assertf(strcmp(work, ref) == 0);
}
}
printf("inplace-escape self-test OK\n");
return 0;
}
/* Pin HTS_HTMLESCAPE*_MAXEXP to each escaper's true max byte expansion. */
static int st_escape_room(httrackp *opt, int argc, char **argv) {
/* N > 1023: where 6n outgrows the old 5n+1024 reservation */
enum { N = 2000 };
char *src = malloct(N + 1);
char *dst;
size_t room, got;
(void) opt;
(void) argc;
(void) argv;
/* _full worst case: a high byte expands to "&#xHH;" (6 bytes) */
memset(src, 0xE9, N);
src[N] = '\0';
room = (size_t) N * HTS_HTMLESCAPE_FULL_MAXEXP + 1024;
dst = malloct(room);
got = escape_for_html_print_full(src, dst, room);
assertf(got == (size_t) N * HTS_HTMLESCAPE_FULL_MAXEXP);
assertf(strlen(dst) == got);
freet(dst);
/* one factor short overflows (returns size), truncating the page: the bug */
room = (size_t) N * (HTS_HTMLESCAPE_FULL_MAXEXP - 1) + 1024;
dst = malloct(room);
got = escape_for_html_print_full(src, dst, room);
assertf(got == room);
freet(dst);
/* plain escaper worst case: '&' -> "&amp;" (5); high bytes stay verbatim */
memset(src, '&', N);
src[N] = '\0';
room = (size_t) N * HTS_HTMLESCAPE_MAXEXP + 1024;
dst = malloct(room);
got = escape_for_html_print(src, dst, room);
assertf(got == (size_t) N * HTS_HTMLESCAPE_MAXEXP);
assertf(strlen(dst) == got);
freet(dst);
freet(src);
printf("escape-room self-test OK\n");
return 0;
}
/* Default User-Agent: honest HTTrack token, no resurrected Windows 98. */
static int st_useragent(httrackp *opt, int argc, char **argv) {
const char *ua = StringBuff(opt->user_agent);
@@ -1491,6 +1654,89 @@ static int st_acceptencoding(httrackp *opt, int argc, char **argv) {
return 0;
}
/* Each call parses `txt` under a fresh host, then checkrobots() for `path`. */
static int rb_decide(robots_wizard *r, const char *txt, const char *path) {
static int n = 0;
char host[64];
snprintf(host, sizeof(host), "h%d.example", n++);
robots_parse(r, host, txt, strlen(txt), NULL, 0, HTS_TRUE);
return checkrobots(r, host, path);
}
static int st_robots(httrackp *opt, int argc, char **argv) {
robots_wizard robots;
(void) opt;
(void) argc;
(void) argv;
memset(&robots, 0, sizeof(robots));
/* Longer Allow re-opens subtree under Disallow: / (old matcher couldn't). */
{
const char *txt = "User-agent: *\nDisallow: /\nAllow: /public/\n";
assertf(rb_decide(&robots, txt, "/public/x") == 0); /* allowed */
assertf(rb_decide(&robots, txt, "/private") == -1); /* denied */
assertf(rb_decide(&robots, txt, "/") == -1); /* denied */
}
/* Equal-length match: Allow wins the tie over Disallow. */
{
const char *txt = "User-agent: *\nDisallow: /foo\nAllow: /foo\n";
assertf(rb_decide(&robots, txt, "/foo/bar") == 0);
}
/* Longest match wins even when it is not the last rule. */
{
assertf(rb_decide(&robots, "User-agent: *\nDisallow: /a/b\nAllow: /a\n",
"/a/b/c") == -1);
assertf(rb_decide(&robots, "User-agent: *\nAllow: /a/b\nDisallow: /a\n",
"/a/b/c") == 0);
}
/* '*' matches any run of characters. */
{
const char *txt = "User-agent: *\nDisallow: /*.php\n";
assertf(rb_decide(&robots, txt, "/a/b/index.php") == -1);
assertf(rb_decide(&robots, txt, "/a/b/index.html") == 0);
}
/* Trailing '$' anchors the end of the path. */
{
const char *txt = "User-agent: *\nDisallow: /a$\n";
assertf(rb_decide(&robots, txt, "/a") == -1);
assertf(rb_decide(&robots, txt, "/ab") == 0);
assertf(rb_decide(&robots, txt, "/a/b") == 0);
}
/* The httrack-specific group replaces the generic '*' group entirely. */
{
const char *txt = "User-agent: *\nDisallow: /everyone\n"
"User-agent: httrack\nDisallow: /\n";
assertf(rb_decide(&robots, txt, "/anything") == -1);
}
/* Replace, not merge: the generic group does not bind the httrack group. */
{
const char *txt = "User-agent: *\nDisallow: /x\n"
"User-agent: httrack\nDisallow: /y\n";
assertf(rb_decide(&robots, txt, "/x") == 0);
assertf(rb_decide(&robots, txt, "/y") == -1);
}
/* No rules: everything is allowed. */
assertf(rb_decide(&robots, "User-agent: *\nDisallow:\n", "/x") == 0);
checkrobots_free(&robots);
printf("robots self-test OK\n");
return 0;
}
/* ------------------------------------------------------------ */
/* Registry: name -> handler, with a usage hint and a one-line description. */
/* ------------------------------------------------------------ */
@@ -1538,9 +1784,17 @@ static const struct selftest_entry {
{"dns", "", "DNS resolver/cache self-test", st_dns},
{"cookies", "", "cookie request-header self-test", st_cookies},
{"useragent", "", "default User-Agent self-test", st_useragent},
{"makeindex", "[dir]", "hts_finish_makeindex footer/refresh self-test",
st_makeindex},
{"inplace-escape", "", "inplace_escape_* vs escape_* equivalence self-test",
st_inplace_escape},
{"escape-room", "", "HT_ADD_HTMLESCAPED* reservation-factor self-test",
st_escape_room},
{"status", "", "HTTP status code -> reason phrase self-test", st_status},
{"acceptencoding", "[dir]",
"Accept-Encoding advertises gzip+deflate, both decode", st_acceptencoding},
{"robots", "", "robots.txt RFC 9309 Allow/Disallow precedence self-test",
st_robots},
};
static void list_selftests(void) {


@@ -121,9 +121,6 @@ struct String {
/** Byte at POS (read/write). No bounds check; POS must be < StringLength. **/
#define StringSubRW(BLK, POS) (StringBuffRW(BLK)[POS])
/** Subcharacter (read/write) **/
#define StringSubRW(BLK, POS) (StringBuffRW(BLK)[POS])
/** Byte POS positions from the end (read). POS==1 is the last byte. **/
#define StringRight(BLK, POS) (StringBuff(BLK)[StringLength(BLK) - POS])
@@ -191,8 +188,9 @@ HTS_STATIC char *StringBuffN_(String *blk, int size) {
asserts SIZE fits the existing content; does not (re)allocate. **/
#define StringSetLength(BLK, SIZE) \
do { \
if (SIZE >= 0) { \
(BLK).length_ = SIZE; \
const int len__ = (SIZE); /* signed: negative means strlen(buffer_) */ \
if (len__ >= 0) { \
(BLK).length_ = len__; \
} else { \
(BLK).length_ = strlen((BLK).buffer_); \
} \
@@ -308,10 +306,11 @@ HTS_STATIC void StringAttach(String *blk, char **str) {
#define StringCatN(BLK, STR, SIZE) \
do { \
const char *str__ = (STR); \
const size_t usize__ = (SIZE); \
if (str__ != NULL) { \
size_t size__ = strlen(str__); \
if (size__ > (SIZE)) { \
size__ = (SIZE); \
if (size__ > usize__) { \
size__ = usize__; \
} \
StringMemcat(BLK, str__, size__); \
} \
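
One visible effect of the StringCatN hunk above is that SIZE is now expanded once, into the local `usize__`, where it used to be expanded twice. The sketch below illustrates that difference with two hypothetical clamp macros (`DEMO_CLAMP_TWICE`, `DEMO_CLAMP_ONCE`) and a counting helper (`demo_limit`). None of these names exist in HTTrack, and this is only an assumption about why caching the argument is useful, not a statement of the author's intent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical macros for illustration only; not HTTrack's StringCatN. */

/* SIZE is expanded twice: once in the compare, once in the assignment. */
#define DEMO_CLAMP_TWICE(LEN, SIZE) \
  do {                              \
    if ((LEN) > (SIZE)) {           \
      (LEN) = (SIZE);               \
    }                               \
  } while (0)

/* SIZE is expanded once into a typed local, then reused. */
#define DEMO_CLAMP_ONCE(LEN, SIZE)  \
  do {                              \
    const size_t usize__ = (SIZE);  \
    if ((LEN) > usize__) {          \
      (LEN) = usize__;              \
    }                               \
  } while (0)

/* Counts how often the macro argument is actually evaluated. */
static size_t demo_calls = 0;

static size_t demo_limit(void) {
  demo_calls++;
  return 3;
}

/* Returns 0 when both macros clamp correctly and the call counts differ
   as expected. */
int demo_macro_selfcheck(void) {
  size_t len = strlen("abcdef"); /* 6 */

  demo_calls = 0;
  DEMO_CLAMP_TWICE(len, demo_limit());
  assert(len == 3);
  assert(demo_calls == 2); /* argument evaluated in compare and assignment */

  len = strlen("abcdef");
  demo_calls = 0;
  DEMO_CLAMP_ONCE(len, demo_limit());
  assert(len == 3);
  assert(demo_calls == 1); /* argument evaluated exactly once */

  return 0;
}
```

Calling `demo_macro_selfcheck()` from a test driver exercises both variants. The single-evaluation form also gives the compare one fixed type (`size_t`) regardless of what the caller passes.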


@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# HT_ADD_HTMLESCAPED* must reserve the escaper's worst case (6 for _full).
httrack -O /dev/null -#test=escape-room run | grep -q "escape-room self-test OK"
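
The reservation rule this test guards can be sketched in plain C. The escaper below (`demo_escape_full`) is a hypothetical stand-in written for this illustration, assuming only what the self-test states: a high byte expands to "&#xHH;" (6 bytes), and an overflow is reported by returning the buffer size. It is not HTTrack's `escape_for_html_print_full`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a "full" escaper, NOT HTTrack's code: bytes
   >= 0x80 become "&#xHH;" (6 bytes), everything else is copied verbatim.
   Returns the escaped length, or `size` when the output does not fit. */
static size_t demo_escape_full(const char *src, char *dst, size_t size) {
  size_t j = 0;
  for (; *src != '\0'; src++) {
    const unsigned char c = (unsigned char) *src;
    const size_t need = (c >= 0x80) ? 6 : 1;
    if (j + need + 1 > size) { /* +1 keeps room for the trailing NUL */
      if (size > 0)
        dst[j] = '\0';
      return size;
    }
    if (c >= 0x80) {
      snprintf(dst + j, need + 1, "&#x%02X;", (unsigned) c);
      j += need;
    } else {
      dst[j++] = (char) c;
    }
  }
  if (size > 0)
    dst[j] = '\0';
  return j;
}

/* Escapes n high bytes into a buffer of n * factor + 1024 bytes and returns
   what the escaper reported; *room_out receives the buffer size. */
static size_t demo_try(size_t n, size_t factor, size_t *room_out) {
  const size_t room = n * factor + 1024;
  char *src = malloc(n + 1);
  char *dst = malloc(room);
  size_t got;
  assert(src != NULL && dst != NULL);
  memset(src, 0xE9, n);
  src[n] = '\0';
  got = demo_escape_full(src, dst, room);
  free(dst);
  free(src);
  *room_out = room;
  return got;
}

/* Returns 0 when the 6x reservation fits and the 5x one overflows exactly
   where the arithmetic says it should. */
int demo_escape_room_selfcheck(void) {
  size_t room, got;
  got = demo_try(2000, 6, &room); /* needs 12001, has 13024: fits */
  assert(got == 2000 * 6 && got < room);
  got = demo_try(2000, 5, &room); /* needs 12001, has 11024: overflow */
  assert(got == room);
  got = demo_try(1023, 5, &room); /* needs 6139, has 6139: last n that fits */
  assert(got == 1023 * 6);
  got = demo_try(1024, 5, &room); /* needs 6145, has 6144: first overflow */
  assert(got == room);
  return 0;
}
```

With the trailing NUL counted, a run of n high bytes needs 6n + 1 bytes, so a 5n + 1024 buffer is too small as soon as n exceeds 1023.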


@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# inplace_escape_*() must match escape_*() on a copy: guards the shared helper.
httrack -O /dev/null -#test=inplace-escape run | grep -q "inplace-escape self-test OK"

tests/01_engine-makeindex.test Executable file

@@ -0,0 +1,12 @@
#!/bin/bash
#
set -euo pipefail
# hts_finish_makeindex writes the footer and gates the refresh meta on a single
# first link (guards the macro->function extraction).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
httrack -O /dev/null -#test=makeindex "$dir" run |
grep -q "makeindex self-test OK"

tests/01_engine-robots.test Executable file

@@ -0,0 +1,7 @@
#!/bin/bash
#
set -euo pipefail
# robots.txt RFC 9309 Allow/Disallow precedence (#452): longest match wins.
httrack -O /dev/null -#test=robots run | grep -q "robots self-test OK"
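
The precedence rule named in the comment above (longest match wins, Allow wins a tie) can be sketched as a small standalone matcher. The names `demo_rule`, `demo_match` and `demo_decide` are hypothetical and are not HTTrack's `robots_parse`/`checkrobots`. The only things borrowed from the self-test are the return convention (0 allowed, -1 denied) and the expected results.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule record for illustration; not HTTrack's data structure. */
struct demo_rule {
  int allow;           /* 1 for Allow, 0 for Disallow */
  const char *pattern; /* path pattern, may contain '*' and a trailing '$' */
};

/* Returns 1 if `pat` matches a prefix of `path`. '*' matches any run of
   characters; a trailing '$' requires the match to end with the path. */
static int demo_match(const char *pat, const char *path) {
  if (*pat == '\0')
    return 1; /* pattern consumed: prefix match */
  if (pat[0] == '$' && pat[1] == '\0')
    return *path == '\0';
  if (*pat == '*') {
    for (;; path++) { /* try every run length, including the empty one */
      if (demo_match(pat + 1, path))
        return 1;
      if (*path == '\0')
        return 0;
    }
  }
  return (*path == *pat) ? demo_match(pat + 1, path + 1) : 0;
}

/* Returns 0 if `path` is allowed, -1 if denied. The longest matching
   pattern decides; on equal length, Allow wins. No match means allowed. */
static int demo_decide(const struct demo_rule *rules, size_t n,
                       const char *path) {
  size_t best_len = 0;
  int best_allow = 1;
  int found = 0;
  size_t i;
  for (i = 0; i < n; i++) {
    const size_t len = strlen(rules[i].pattern);
    if (len == 0)
      continue; /* empty "Disallow:" restricts nothing */
    if (!demo_match(rules[i].pattern, path))
      continue;
    if (!found || len > best_len || (len == best_len && rules[i].allow)) {
      found = 1;
      best_len = len;
      best_allow = rules[i].allow;
    }
  }
  return best_allow ? 0 : -1;
}

/* Replays a few of the cases from the self-test; returns 0 on success. */
int demo_robots_selfcheck(void) {
  const struct demo_rule reopen[] = {{0, "/"}, {1, "/public/"}};
  const struct demo_rule tie[] = {{0, "/foo"}, {1, "/foo"}};
  const struct demo_rule longest[] = {{0, "/a/b"}, {1, "/a"}};
  const struct demo_rule star[] = {{0, "/*.php"}};
  const struct demo_rule anchor[] = {{0, "/a$"}};

  assert(demo_decide(reopen, 2, "/public/x") == 0);
  assert(demo_decide(reopen, 2, "/private") == -1);
  assert(demo_decide(tie, 2, "/foo/bar") == 0);
  assert(demo_decide(longest, 2, "/a/b/c") == -1);
  assert(demo_decide(star, 1, "/a/b/index.php") == -1);
  assert(demo_decide(star, 1, "/a/b/index.html") == 0);
  assert(demo_decide(anchor, 1, "/a") == -1);
  assert(demo_decide(anchor, 1, "/ab") == 0);
  return 0;
}
```

This sketch ranks rules by raw pattern length, which is one simple reading of "longest match". User-agent group selection, which the self-test also covers, is left out.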


@@ -20,6 +20,14 @@ if ! command -v python3 >/dev/null 2>&1; then
echo "python3 missing, skipping"
exit 77
fi
# The fixture needs a second loopback IP (dead 127.0.0.2 + live 127.0.0.1) for
# the fallback to have a target; GNU/Hurd has only 127.0.0.1, so skip there.
case "$(uname -s)" in
GNU | GNU/*)
echo "GNU/Hurd: single loopback IP, connect-fallback fixture unbuildable, skipping"
exit 77
;;
esac
server="$top_srcdir/tests/local-server.py"
root="$top_srcdir/tests/server-root"


@@ -36,11 +36,15 @@ TESTS = \
01_engine-filter.test \
01_engine-hashtable.test \
01_engine-idna.test \
01_engine-escape-room.test \
01_engine-inplace-escape.test \
01_engine-makeindex.test \
01_engine-mime.test \
01_engine-parse.test \
01_engine-pause.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-robots.test \
01_engine-savename.test \
01_engine-selftest-dispatch.test \
01_engine-simplify.test \