Compare commits

..

15 Commits

Author SHA1 Message Date
Xavier Roche
c52a524a63 htslib: bound the proxy CONNECT response; harden + cover review findings
Follow-up to the CONNECT-tunnel change, from an adversarial review (the proxy
response is hostile input: a malicious or MITM proxy controls every byte).

- Bound the response read so a proxy cannot stall the single-threaded back_wait
  crawl: proxy_getline now fails on an over-long line instead of consuming it
  forever, the header drain is capped at 64 lines, and the send loop gives up
  rather than spin against a socket that reports writable but never accepts.
- Size `authority` to hold any url_adr host (HTS_URLMAXSIZE*2) so an oversized
  hostname can't trip the abort-on-overflow buff helpers; grow `req` to match.
- Reject control bytes in the CONNECT authority as a local backstop; today the
  CR/LF defense lives entirely upstream (escape_remove_control / header-line
  splitting).
- Test: the origin now records the headers it receives, and the test asserts
  Proxy-Authorization never reaches the origin through the tunnel (the previous
  assertions couldn't see a leak). Added a flooding-proxy scenario that proves
  the crawl terminates instead of hanging on an unbounded response.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 09:52:10 +02:00
Xavier Roche
1907621d37 htslib: tunnel https through the proxy via CONNECT (#85)
httrack opened https connections straight to the origin even when a proxy
was configured, so --proxy was silently ignored for https and the crawler
used the real IP. http_xfopen bypassed the proxy for any https:// URL,
because the absolute-URI proxy form it uses for http cannot carry https.

Connect to the proxy instead and, once the TCP connection is up, open an
HTTP CONNECT tunnel (http_proxy_tunnel) before the TLS handshake, so TLS
runs end-to-end with the origin. Proxy credentials now ride the CONNECT
request rather than the tunneled GET, where they would leak to the origin.
The exchange is a bounded blocking read inside the back_wait connect path:
no new async state, no struct/ABI change (the helpers stay visibility-hidden).

Verified end-to-end by 13_crawl_proxy_https.test: it crawls a local
self-signed https origin through a logging CONNECT proxy and asserts the
proxy saw the CONNECT and that credentials ride it. The assertion fails on
the pre-fix bypass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:43:56 +02:00
Xavier Roche
3b2d7afdaa Merge pull request #393 from xroche/fix/empty-footer-doitlog-106
Keep empty quoted args when reloading doit.log (#106)
2026-06-19 08:13:19 +02:00
Xavier Roche
6ee539619e htscoremain: keep empty quoted args when reloading doit.log (#106)
An empty footer (-%F "") is written to hts-cache/doit.log correctly as the
two-character token "", and next_token() unquotes it back to an empty string.
But the doit.log reload loop only re-inserted a token when strnotempty(lastp),
which dropped the empty one. With its argument gone, -%F absorbed the following
token (or had none), so a no-url --continue/--update reprise misparsed and
failed.

Track whether the token started with a quote (before next_token() strips it in
place) and keep it even when empty, so "" survives the round-trip. Whitespace
gaps still produce no token, so spacing behavior is unchanged.

01_engine-doitlog.test gains a scenario that mirrors with -%F "" -r2, then on
the no-url reprise checks the regenerated doit.log still round-trips the empty
token -- probing the reader's rebuilt argv, not just that the reprise didn't
crash. The trailing -r2 makes a dropped-token bug visible (it shifts into -%F's
slot and panics) rather than a harmless run off the end of argv. Reverting only
the guard makes the scenario fail (reprise exits 255).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:09:57 +02:00
Xavier Roche
fb098b27b4 Merge pull request #392 from xroche/fix/cookie-rfc6265-151
Drop $Version/$Path from the request Cookie header (#151)
2026-06-18 22:42:47 +02:00
Xavier Roche
5f6a3fb917 htslib: drop $Version/$Path from request Cookie header (#151)
The request "Cookie:" header was built in the obsolete RFC 2965 style,
emitting "$Version=1" before the first cookie and a "$Path=..." attribute
after every value:

  Cookie: $Version=1; name=value; $Path=/; has_js=1; $Path=/

Servers expecting RFC 6265 treat $Version and $Path as stray cookies and
reject or misread the request. Emit bare name=value pairs joined by "; ":

  Cookie: name=value; has_js=1

The cookie loop is factored out of http_sendhead into append_cookie_header
(same logic, same buffer), with a thin http_cookie_header_selftest wrapper
so the exact code path can be unit-tested. A new hidden "-#Q" subcommand
builds the header for two same-domain cookies plus one on a different
domain (which must be filtered out) and checks the output is the clean
RFC 6265 form with no $Version/$Path and no cross-domain leak; driven by
tests/01_engine-cookies.test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 22:12:28 +02:00
Xavier Roche
f9e676dbe3 Merge pull request #391 from xroche/feature/api-enum-callsites-savename83
htsopt: name the savename_83 enum and finish the call-site constant adoption
2026-06-18 21:43:34 +02:00
Xavier Roche
1b440c44b5 htsopt: name savename_83 enum and adopt enum constants at call sites
Type opt->savename_83 as a new hts_savename_83 enum (LONG/DOS/ISO9660 =
0/1/2) and replace the remaining magic-number literals for the already-
typed verbosedisplay and savename_delayed fields with their named enum
constants across the engine.

Behavior-preserving: every constant equals the literal it replaces, and a
C enum is int-sized, so struct layout is unchanged (sizeof(httrackp) and
offsetof(savename_83) are identical to origin/master, no soname bump). The
-L option block is deliberately reflowed to clang-format style, which is
what made the savename_83 retype tractable. Bitmask fields (travel/seeker/
getmode/parsejava/hostcontrol) intentionally stay int with named bit enums,
per the existing flags-as-enum split.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 21:03:33 +02:00
Xavier Roche
ac6dd1a570 Merge pull request #390 from xroche/fix/copy-htsopt-unsigned-enum-guards
copy_htsopt silently drops boolean option fields
2026-06-18 20:46:00 +02:00
Xavier Roche
4549ec3695 htsopt: fix copy_htsopt dropping unsigned-enum fields
copy_htsopt() copies each field only when it is not the "-1 means unset"
sentinel, written as `if (from->X > -1)`. The boolean/enum option
migrations turned nearlink, errpage and parseall into hts_boolean, which
GCC backs with unsigned int. `unsigned > -1` is always false, so those
three fields silently stopped being copied.

Cast to int at the guard to restore the signed sentinel test. Add a
hidden `httrack -#9` self-test that drives copy_htsopt over distinct
boolean values plus an int positive control (tests/01_engine-copyopt.test);
it fails on the unfixed guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 20:25:42 +02:00
Xavier Roche
ac56c31b24 Merge pull request #389 from xroche/fix/travel-test-all-enum
htsopt: fold HTS_TRAVEL_TEST_ALL into the hts_travel_scope enum
2026-06-18 18:40:33 +02:00
Xavier Roche
ee6beeeb7d htsopt: fold HTS_TRAVEL_TEST_ALL into the hts_travel_scope enum
The -t "test all" flag was a stray #define sitting next to the scope
enum; make it an enum constant so the named travel values live in one
place. The mask (HTS_TRAVEL_SCOPE_MASK) stays a #define: it selects the
scope out of opt->travel, it is not a member of the value set.

Name and value (1 << 8) are unchanged, so every use site compiles
identically and opt->travel stays plain int. No ABI change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 18:29:23 +02:00
Xavier Roche
6788bda380 Merge pull request #388 from xroche/feature/api-enum-fields-2
htsopt: type debug, savename_delayed and verbosedisplay as named enums
2026-06-18 18:25:44 +02:00
Xavier Roche
7ead8d595e htsopt: type three more option fields as named enums
debug becomes hts_log_type (it already stored LOG_* values; the int
declaration was a latent type hole), savename_delayed becomes a new
hts_savename_delayed { NONE, SOFT, HARD }, and verbosedisplay becomes a
new hts_verbosedisplay { NONE, SIMPLE, FULL }. hostcontrol stays int but
its bits are now named by a new hts_hostcontrol flags enum, matching the
existing getmode/seeker/travel/htsparsejava_flags pattern.

A C enum is int-sized, so struct layout, field offsets and
sizeof(httrackp) are unchanged: no ABI break, no soname bump. The three
sscanf("%d", ...) sites that fill these fields now write through an int*
(size-identical) to keep the format type exact.

These enums are unsigned-backed (all enumerators non-negative), so the
non-negative debug comparisons (debug < level, debug > LOG_INFO, etc.)
now compile to unsigned jumps. debug is never negative, never sscanf'd
and never tested against a negative bound, so the result is unchanged;
disassembly is otherwise byte-identical bar instruction scheduling.

savename_83 is left as int on purpose: its sscanf sits in the -L parser
block whose old indentation does not round-trip through clang-format.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 18:11:19 +02:00
Xavier Roche
93f502990c Merge pull request #387 from xroche/feature/api-bool-returns
Return hts_boolean from the yes/no library functions
2026-06-18 17:38:48 +02:00
15 changed files with 823 additions and 132 deletions

View File

@@ -2532,8 +2532,26 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
#if HTS_USEOPENSSL
/* SSL mode */
if (back[i].r.ssl) {
int tunnel_ok = 1;
// https via proxy: CONNECT-tunnel before TLS (#85)
if (back[i].r.req.proxy.active && back[i].r.ssl_con == NULL) {
const int timeout = back[i].timeout > 0 ? back[i].timeout : 30;
tunnel_ok =
http_proxy_tunnel(opt, &back[i].r, back[i].url_adr, timeout);
if (!tunnel_ok) {
if (!strnotempty(back[i].r.msg))
strcpybuff(back[i].r.msg, "proxy CONNECT failed");
deletehttp(&back[i].r);
back[i].r.soc = INVALID_SOCKET;
back[i].r.statuscode = STATUSCODE_NON_FATAL;
back[i].status = STATUS_READY;
back_set_finished(sback, i);
}
}
// handshake not yet launched
if (!back[i].r.ssl_con) {
if (tunnel_ok && !back[i].r.ssl_con) {
SSL_CTX_set_options(openssl_ctx, SSL_OP_ALL);
// new session
back[i].r.ssl_con = SSL_new(openssl_ctx);
@@ -2551,7 +2569,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
back[i].r.statuscode = STATUSCODE_SSL_HANDSHAKE;
}
/* Error */
if (back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
if (tunnel_ok && back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
strcpybuff(back[i].r.msg, "bad SSL/TLS handshake");
deletehttp(&back[i].r);
back[i].r.soc = INVALID_SOCKET;
@@ -3838,7 +3856,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
/* funny log for commandline users */
//if (!opt->quiet) {
// petite animation
if (opt->verbosedisplay == 1) {
if (opt->verbosedisplay == HTS_VERBOSE_SIMPLE) {
if (back[i].status == STATUS_READY) {
if (back[i].r.statuscode == HTTP_OK)
printf("* %s%s (" LLintP " bytes) - OK" VT_CLREOL "\r",

View File

@@ -3342,7 +3342,8 @@ int back_fill(struct_back * sback, httrackp * opt, cache_back * cache,
int ptr, int numero_passe) {
int n = back_pluggable_sockets(sback, opt);
if (opt->savename_delayed == 2 && !opt->delayed_cached) /* cancel (always delayed) */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
!opt->delayed_cached) /* cancel (always delayed) */
return 0;
if (n > 0) {
int p;
@@ -3702,7 +3703,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->maxsoc > 0)
to->maxsoc = from->maxsoc;
if (from->nearlink > -1)
/* hts_boolean/enum fields are unsigned (GCC), so a bare `> -1` unset-guard
is always false; cast to int to keep the -1 "unset" sentinel test. */
if ((int) from->nearlink > -1)
to->nearlink = from->nearlink;
if (from->timeout > -1)
@@ -3729,10 +3732,10 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->hostcontrol > -1)
to->hostcontrol = from->hostcontrol;
if (from->errpage > -1)
if ((int) from->errpage > -1)
to->errpage = from->errpage;
if (from->parseall > -1)
if ((int) from->parseall > -1)
to->parseall = from->parseall;
// test all: bit 8 de travel
@@ -3844,7 +3847,7 @@ int htsAddLink(htsmoduleStruct * str, char *link) {
a = opt->savename_type;
b = opt->savename_83;
opt->savename_type = 0;
opt->savename_83 = 0;
opt->savename_83 = HTS_SAVENAME_83_LONG;
// note: adr,fil peuvent être patchés
r =
url_savename(&afs, NULL, NULL, NULL, opt, sback, cache, hashptr, ptr, numero_passe,

View File

@@ -612,12 +612,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
/* Terminal is a tty, may ask questions and display funny information */
if (isatty(1)) {
opt->quiet = 0;
opt->verbosedisplay = 1;
opt->verbosedisplay = HTS_VERBOSE_SIMPLE;
}
/* Not a tty, no stdin input or funny output! */
else {
opt->quiet = 1;
opt->verbosedisplay = 0;
opt->verbosedisplay = HTS_VERBOSE_NONE;
}
#endif
@@ -953,9 +953,11 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
p = buff;
do {
int insert_after_argc;
int quoted; /* "" unquotes to empty but is still a real token (#106) */
// read next
lastp = p;
quoted = (p != NULL && *p == '"');
if (p) {
p = next_token(p, 1);
if (p) {
@@ -966,7 +968,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
/* Insert parameters BUT so that they can be in the same order */
if (lastp) {
if (strnotempty(lastp)) {
if (strnotempty(lastp) || quoted) {
insert_after_argc = argc - insert_after;
cmdl_ins(lastp, insert_after_argc, (argv + insert_after), x_argvblk,
x_argvblk_size, x_ptr);
@@ -1815,24 +1817,22 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
com++;
}
break;
case 'L':
{
sscanf(com + 1, "%d", &opt->savename_83);
switch (opt->savename_83) {
case 0: // 8-3 (ISO9660 L1)
opt->savename_83 = 1;
break;
case 1:
opt->savename_83 = 0;
break;
default: // 2 == ISO9660 (ISO9660 L2)
opt->savename_83 = 2;
break;
}
while(isdigit((unsigned char) *(com + 1)))
com++;
case 'L': {
sscanf(com + 1, "%d", (int *) &opt->savename_83);
switch (opt->savename_83) {
case 0: // 8-3 (ISO9660 L1)
opt->savename_83 = HTS_SAVENAME_83_DOS;
break;
case 1:
opt->savename_83 = HTS_SAVENAME_83_LONG;
break;
default: // 2 == ISO9660 (ISO9660 L2)
opt->savename_83 = HTS_SAVENAME_83_ISO9660;
break;
}
break;
while (isdigit((unsigned char) *(com + 1)))
com++;
} break;
case 's':
if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", (int *) &opt->robots);
@@ -1989,9 +1989,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
}
break; // url hack
case 'v':
opt->verbosedisplay = 2;
opt->verbosedisplay = HTS_VERBOSE_FULL;
if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->verbosedisplay);
sscanf(com + 1, "%d", (int *) &opt->verbosedisplay);
while(isdigit((unsigned char) *(com + 1)))
com++;
}
@@ -2004,9 +2004,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
}
break;
case 'N':
opt->savename_delayed = 2;
opt->savename_delayed = HTS_SAVENAME_DELAYED_HARD;
if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->savename_delayed);
sscanf(com + 1, "%d", (int *) &opt->savename_delayed);
while(isdigit((unsigned char) *(com + 1)))
com++;
}
@@ -3096,6 +3096,78 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
htsmain_free();
return 0;
break;
case '9': { // copy_htsopt selftest: httrack -#9
httrackp *from = hts_create_opt();
httrackp *to = hts_create_opt();
int err = 0;
/* from-values differ from both the to-values and the
hts_create_opt() defaults (nearlink FALSE, errpage/parseall
TRUE), so a copy that no-ops or just resets to defaults is
caught too, not only the unsigned-guard bug. */
from->retry = 7; /* int field: positive control */
to->retry = 0;
from->nearlink = HTS_TRUE;
to->nearlink = HTS_FALSE;
from->errpage = HTS_FALSE;
to->errpage = HTS_TRUE;
from->parseall = HTS_FALSE;
to->parseall = HTS_TRUE;
copy_htsopt(from, to);
if (to->retry != 7)
err = 1;
if (to->nearlink != HTS_TRUE)
err = 1;
if (to->errpage != HTS_FALSE)
err = 1;
if (to->parseall != HTS_FALSE)
err = 1;
hts_free_opt(from);
hts_free_opt(to);
printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");
htsmain_free();
return err;
} break;
case 'Q': { // cookie request-header selftest: httrack -#Q
static t_cookie cookie;
char hdr[1024];
/* RFC 6265: bare name=value pairs, no $Version/$Path (#151). */
const char *expected = "Cookie: name=value; has_js=1" H_CRLF;
int err = 0;
const char *dom = "www.example.com";
int added;
cookie.max_len = (int) sizeof(cookie.data);
cookie.data[0] = '\0';
added = cookie_add(&cookie, "name", "value", dom, "/");
added |= cookie_add(&cookie, "has_js", "1", dom, "/");
/* different domain: must be filtered out */
added |= cookie_add(&cookie, "junk", "x", "other.org", "/");
if (added) {
printf("cookie-header: FAIL (cookie_add setup)\n");
htsmain_free();
return 1;
}
http_cookie_header_selftest(&cookie, dom, "/", hdr,
sizeof(hdr));
if (strcmp(hdr, expected) != 0)
err = 1;
if (strstr(hdr, "$Version") != NULL ||
strstr(hdr, "$Path") != NULL)
err = 1;
if (strstr(hdr, "junk") != NULL) // wrong-domain cookie leaked
err = 1;
printf("cookie-header: %s\n", err ? "FAIL" : "OK");
if (err)
printf(" got: %s\n", hdr);
htsmain_free();
return err;
} break;
case '!':
HTS_PANIC_PRINTF
("Option #! is disabled for security reasons");

View File

@@ -644,6 +644,165 @@ T_SOC http_fopen(httrackp * opt, const char *adr, const char *fil, htsblk * reto
return http_xfopen(opt, 0, 1, 1, NULL, adr, fil, retour);
}
// Read a CRLF line from a non-blocking socket (waits up to timeout per recv).
// Returns the line length (0 = empty), or -1 on timeout/EOF/error.
static int proxy_getline(T_SOC soc, char *s, int max, int timeout) {
int j = 0;
for (;;) {
unsigned char ch;
int n;
if (!check_readinput_t(soc, timeout))
return -1; // timed out waiting for data
n = (int) recv(soc, &ch, 1, 0);
if (n == 1) {
if (ch == 13) // CR
continue;
if (ch == 10) // LF: end of line
break;
if (j >= max - 1)
return -1; // line too long: bound the read against a hostile proxy
s[j++] = (char) ch;
} else if (n == 0) {
return -1; // connection closed
} else {
#ifdef _WIN32
if (WSAGetLastError() == WSAEWOULDBLOCK)
continue;
#else
if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
continue;
#endif
return -1;
}
}
s[j] = '\0';
return j;
}
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout) {
const T_SOC soc = retour->soc;
const char *const host = jump_identification_const(adr); // host[:port]
const char *const portsep = jump_toport_const(adr); // ":port" or NULL
char BIGSTK authority[HTS_URLMAXSIZE * 2];
char BIGSTK req[HTS_URLMAXSIZE * 4 + 1100];
char line[1024];
int code;
if (soc == INVALID_SOCKET)
return 0;
// CONNECT needs an explicit host:port; default the https port
authority[0] = '\0';
if (portsep != NULL)
strlcatbuff(authority, host, sizeof(authority)); // already host:port
else
snprintf(authority, sizeof(authority), "%s:%d", host, 443);
// backstop: never let a stray CR/LF in the host smuggle a second line into
// the CONNECT request (the host is already sanitized upstream)
{
const char *c;
for (c = authority; *c != '\0'; c++) {
if ((unsigned char) *c < ' ') {
strcpybuff(retour->msg, "proxy CONNECT: invalid host");
return 0;
}
}
}
snprintf(req, sizeof(req), "CONNECT %s HTTP/1.0" H_CRLF "Host: %s" H_CRLF,
authority, authority);
// creds go on the CONNECT, not the tunneled origin request
if (link_has_authorization(retour->req.proxy.name)) {
const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name);
char autorisation[1100];
char user_pass[256];
autorisation[0] = user_pass[0] = '\0';
strncatbuff(user_pass, astart, (int) (a - astart) - 1);
strcpybuff(user_pass, unescape_http(OPT_GET_BUFF(opt),
OPT_GET_BUFF_SIZE(opt), user_pass));
code64((unsigned char *) user_pass, (int) strlen(user_pass),
(unsigned char *) autorisation, 0);
strlcatbuff(req, "Proxy-Authorization: Basic ", sizeof(req));
strlcatbuff(req, autorisation, sizeof(req));
strlcatbuff(req, H_CRLF, sizeof(req));
}
strlcatbuff(req, H_CRLF, sizeof(req)); // end of request headers
// raw send: ssl is set, so sendc() would route to TLS
{
const char *p = req;
size_t remain = strlen(req);
int stalls = 0;
while (remain > 0) {
const int n = (int) send(soc, p, (int) remain, 0);
if (n > 0) {
p += n;
remain -= (size_t) n;
stalls = 0;
} else {
#ifdef _WIN32
const int wouldblock = (WSAGetLastError() == WSAEWOULDBLOCK);
#else
const int wouldblock =
(errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR);
#endif
// don't spin forever on a fatal error or an unwritable socket
if (!wouldblock || !check_writeinput_t(soc, timeout) ||
++stalls > 100) {
strcpybuff(retour->msg, "proxy CONNECT: write error");
return 0;
}
}
}
}
// proxy status line: "HTTP/1.x <code> ..."
if (proxy_getline(soc, line, sizeof(line), timeout) < 0) {
strcpybuff(retour->msg, "proxy CONNECT: no response");
return 0;
}
if (sscanf(line, "HTTP/%*d.%*d %d", &code) < 1)
code = 0;
if (code < 200 || code >= 300) {
snprintf(retour->msg, sizeof(retour->msg), "proxy CONNECT refused: %s",
strnotempty(line) ? line : "(no status)");
return 0;
}
// drain headers to the blank line; cap the count so a flooding proxy can't
// stall the crawl
{
int headers = 0;
for (;;) {
const int n = proxy_getline(soc, line, sizeof(line), timeout);
if (n < 0) {
strcpybuff(retour->msg, "proxy CONNECT: truncated response");
return 0;
}
if (n == 0)
break; // blank line: tunnel ready
if (++headers > 64) {
strcpybuff(retour->msg, "proxy CONNECT: too many response headers");
return 0;
}
}
}
return 1;
}
// ouverture d'une liaison http, envoi d'une requète
// mode: 0 GET 1 HEAD [2 POST]
// treat: traiter header?
@@ -680,14 +839,14 @@ T_SOC http_xfopen(httrackp * opt, int mode, int treat, int waitconnect,
/* connexion */
if (retour) {
if ((!(retour->req.proxy.active))
|| ((strcmp(adr, "file://") == 0)
|| (strncmp(adr, "https://", 8) == 0)
)
) { /* pas de proxy, ou non utilisable ici */
/* no proxy, or proxy not usable here (local file) */
if ((!(retour->req.proxy.active)) || (strcmp(adr, "file://") == 0)) {
soc = newhttp(opt, adr, retour, -1, waitconnect);
} else {
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port, waitconnect); // ouvrir sur le proxy à la place
// to the proxy; https tunnels to the origin via CONNECT in back_wait
// (#85)
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port,
waitconnect);
}
} else {
soc = newhttp(opt, adr, NULL, -1, waitconnect);
@@ -874,6 +1033,50 @@ static void print_buffer(buff_struct*const str, const char *format, ...) {
assertf(str->pos < str->capacity);
}
/* Append the request "Cookie:" header line for every stored cookie matching
domain/path. RFC 6265 form: bare "name=value" pairs joined by "; ", no
$Version/$Path attributes (those are RFC 2965 syntax that modern servers
reject, issue #151). Returns the number of cookies emitted. */
static int append_cookie_header(buff_struct *bstr, t_cookie *cookie,
const char *domain, const char *path) {
char buffer[8192];
char *b;
int cook = 0;
int max_cookies = 8;
if (cookie == NULL)
return 0;
b = cookie->data;
do {
b = cookie_find(b, "", domain, path); // next matching cookie
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(bstr, "Cookie: ");
cook = 1;
} else
print_buffer(bstr, "; ");
print_buffer(bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(bstr, "=%s", cookie_get(buffer, b, 6));
b = cookie_nextfield(b);
}
} while (b != NULL && max_cookies > 0);
if (cook)
print_buffer(bstr, H_CRLF);
return cook;
}
/* Self-test entry for append_cookie_header(): build the request Cookie line
into dst (always NUL-terminated). Returns the number of cookies emitted. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size) {
buff_struct bstr = {dst, dst_size, 0};
assertf(dst != NULL && dst_size > 0);
dst[0] = '\0';
return append_cookie_header(&bstr, cookie, domain, path);
}
// envoi d'une requète
int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
const char *xsend, const char *adr, const char *fil,
@@ -999,8 +1202,8 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
if (xsend)
print_buffer(&bstr, "%s", xsend); // éventuelles autres lignes
// tester proxy authentication
if (retour->req.proxy.active) {
// for https, auth rides the CONNECT (the tunneled GET would leak it)
if (retour->req.proxy.active && strncmp(adr, "https://", 8) != 0) {
if (link_has_authorization(retour->req.proxy.name)) { // et hop, authentification proxy!
const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name);
@@ -1048,34 +1251,9 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
search_tag + strlen(POSTTOK) + 1))));
}
}
// gestion cookies?
// send stored cookies matching this host/path
if (cookie) {
char buffer[8192];
char *b = cookie->data;
int cook = 0;
int max_cookies = 8;
do {
b = cookie_find(b, "", jump_identification_const(adr), fil); // prochain cookie satisfaisant aux conditions
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(&bstr, "Cookie: $Version=1; ");
cook = 1;
} else
print_buffer(&bstr, "; ");
print_buffer(&bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(&bstr, "=%s", cookie_get(buffer, b, 6));
print_buffer(&bstr, "; $Path=%s", cookie_get(buffer, b, 2));
b = cookie_nextfield(b);
}
} while(b != NULL && max_cookies > 0);
if (cook) { // on a envoyé un (ou plusieurs) cookie?
print_buffer(&bstr, H_CRLF);
#if DEBUG_COOK
printf("Header:\n%s\n", bstr.buffer);
#endif
}
append_cookie_header(&bstr, cookie, jump_identification_const(adr), fil);
}
// gérer le keep-alive (garder socket)
if (retour->req.http11 && !retour->req.nokeepalive) {
@@ -1808,6 +1986,24 @@ int check_readinput_t(T_SOC soc, int timeout) {
return 0;
}
// wait until the socket is writable, up to timeout seconds
int check_writeinput_t(T_SOC soc, int timeout) {
if (soc != INVALID_SOCKET) {
fd_set fds;
struct timeval tv;
const int isoc = (int) soc;
assertf(isoc == soc);
FD_ZERO(&fds);
FD_SET(isoc, &fds);
tv.tv_sec = timeout;
tv.tv_usec = 0;
select(isoc + 1, NULL, &fds, NULL, &tv);
return FD_ISSET(isoc, &fds) ? 1 : 0;
} else
return 0;
}
// idem, sauf qu'ici on peut choisir la taille max de données à recevoir
// SI bufl==0 alors le buffer est censé être de 8kos, et on recoit par bloc de lignes
// en éliminant les cr (ex: header), arrêt si double-lf
@@ -5468,9 +5664,10 @@ HTSEXT_API httrackp *hts_create_opt(void) {
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->referer, "");
StringCopy(opt->from, "");
opt->savename_83 = 0; // noms longs par défaut
opt->savename_83 = HTS_SAVENAME_83_LONG; // long names by default
opt->savename_type = 0; // avec structure originale
opt->savename_delayed = 2; // hard delayed type (default)
opt->savename_delayed =
HTS_SAVENAME_DELAYED_HARD; // always delay the type check (default)
opt->delayed_cached = HTS_TRUE;
opt->mimehtml = HTS_FALSE;
opt->parsejava = HTSPARSE_DEFAULT; // parser classes
@@ -5495,7 +5692,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->parseall = HTS_TRUE;
opt->parsedebug = HTS_FALSE;
opt->norecatch = HTS_FALSE;
opt->verbosedisplay = 0; // pas d'animation texte
opt->verbosedisplay = HTS_VERBOSE_NONE; // no text animation
opt->sizehack = HTS_FALSE;
opt->urlhack = HTS_TRUE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER);

View File

@@ -182,6 +182,11 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode, const char *xsend
const char *adr, const char *fil,
const char *referer_adr, const char *referer_fil,
htsblk * retour);
/* Build the request "Cookie:" header line for stored cookies matching
domain/path into dst (NUL-terminated). Exposed for the -#Q self-test;
wraps the same logic http_sendhead() uses. Returns cookies emitted. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size);
//int newhttp(char* iadr,char* err=NULL);
T_SOC newhttp(httrackp * opt, const char *iadr, htsblk * retour, int port,
@@ -193,6 +198,17 @@ HTS_INLINE void deletesoc_r(htsblk * r);
htsblk http_test(httrackp * opt, const char *adr, const char *fil, char *loc);
int check_readinput(htsblk * r);
int check_readinput_t(T_SOC soc, int timeout);
int check_writeinput_t(T_SOC soc, int timeout);
/* Open an HTTP CONNECT tunnel through the active proxy for an https request:
`retour->soc` must already be TCP-connected to the proxy, and `adr` is the
origin authority (url_adr, e.g. "https://host:port"). Sends the CONNECT
request (with Proxy-Authorization when the proxy carries credentials) and
reads the proxy's status line, so the caller's TLS handshake then runs
end-to-end with the origin. Blocks up to `timeout` seconds. Returns 1 on a
2xx tunnel, 0 on failure (retour->msg/statuscode set). */
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout);
void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * retour,
char *rcvd);
void treatfirstline(htsblk * retour, const char *rcvd);

View File

@@ -184,10 +184,11 @@ int url_savename(lien_adrfilsave *const afs,
/* 8-3 ? */
switch (opt->savename_83) {
case 1: // 8-3
case HTS_SAVENAME_83_DOS: // 8-3
max_char = 8;
break;
case 2: // Level 2 File names may be up to 31 characters.
case HTS_SAVENAME_83_ISO9660: // Level 2 File names may be up to 31
// characters.
max_char = 31;
break;
default:
@@ -324,7 +325,7 @@ int url_savename(lien_adrfilsave *const afs,
}
/* replace shtml to html.. */
if (opt->savename_delayed == 2)
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD)
is_html = -1; /* ALWAYS delay type */
else
is_html = ishtml(opt, fil);
@@ -363,7 +364,9 @@ int url_savename(lien_adrfilsave *const afs,
) {
// tester type avec requète HEAD si on ne connait pas le type du fichier
if (!((opt->check_type == 1) && (fil[strlen(fil) - 1] == '/'))) // slash doit être html?
if (opt->savename_delayed == 2 || (ishtest = ishtml(opt, fil)) < 0) { // on ne sait pas si c'est un html ou un fichier..
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD ||
(ishtest = ishtml(opt, fil)) <
0) { // unsure whether it's html or a file
// lire dans le cache
htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test uniquement
@@ -393,11 +396,12 @@ int url_savename(lien_adrfilsave *const afs,
}
#endif
//
} else if (opt->savename_delayed != 2 && is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
Lookup mimetype not only by extension,
but also by filename */
/* Note: "foo.cgi => text/html" means that foo.cgi shall have the text/html MIME file type,
that is, ".html" */
} else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD &&
is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
Lookup mimetype not only by extension,
but also by filename */
/* Note: "foo.cgi => text/html" means that foo.cgi shall have the
text/html MIME file type, that is, ".html" */
char BIGSTK mime[1024];
mime[0] = ext[0] = '\0';
@@ -408,9 +412,13 @@ int url_savename(lien_adrfilsave *const afs,
}
}
}
// note: if savename_delayed is enabled, the naming will be temporary (and slightly invalid!)
// note: if we are about to stop (opt->state.stop), back_add() will fail later
else if (opt->savename_delayed != 0 && !opt->state.stop) {
// note: if savename_delayed is enabled, the naming will be temporary
// (and slightly invalid!)
//
// note: if we are about to stop (opt->state.stop), back_add() will
// fail later
else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
!opt->state.stop) {
// Check if the file is ready in backing. We basically take the same logic as later.
// FIXME: we should cleanup and factorize this unholy mess
if (headers != NULL && headers->status >= 0 && !is_redirect) {
@@ -698,7 +706,7 @@ int url_savename(lien_adrfilsave *const afs,
}
// restaurer
opt->state._hts_in_html_parsing = hihp;
} // caché?
} // caché?
}
}
}
@@ -1190,7 +1198,8 @@ int url_savename(lien_adrfilsave *const afs,
// Not used anymore unless non-delayed types.
// de même en cas de manque d'extension on en place une de manière forcée..
// cela évite les /chez/toto et les /chez/toto/index.html incompatibles
if (opt->savename_type != -1 && opt->savename_delayed != 2) {
if (opt->savename_type != -1 &&
opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD) {
char *a = afs->save + strlen(afs->save) - 1;
while((a > afs->save) && (*a != '.') && (*a != '/'))
@@ -1236,31 +1245,21 @@ int url_savename(lien_adrfilsave *const afs,
size_t i;
for(i = 0 ; afs->save[i] != '\0' ; i++) {
unsigned char c = (unsigned char) afs->save[i];
if (c < 32 // control
|| c == 127 // unwise
|| c == '~' // unix unwise
|| c == '\\' // windows separator
|| c == ':' // windows forbidden
|| c == '*' // windows forbidden
|| c == '?' // windows forbidden
|| c == '\"' // windows forbidden
|| c == '<' // windows forbidden
|| c == '>' // windows forbidden
|| c == '|' // windows forbidden
//|| c == '@' // ?
||
(
opt->savename_83 == 2 // CDROM
&&
(
c == '-'
|| c == '='
|| c == '+'
)
)
)
{
afs->save[i] = '_';
if (c < 32 // control
|| c == 127 // unwise
|| c == '~' // unix unwise
|| c == '\\' // windows separator
|| c == ':' // windows forbidden
|| c == '*' // windows forbidden
|| c == '?' // windows forbidden
|| c == '\"' // windows forbidden
|| c == '<' // windows forbidden
|| c == '>' // windows forbidden
|| c == '|' // windows forbidden
//|| c == '@' // ?
|| (opt->savename_83 == HTS_SAVENAME_83_ISO9660 // CDROM
&& (c == '-' || c == '=' || c == '+'))) {
afs->save[i] = '_';
}
}
}
@@ -1521,7 +1520,8 @@ int url_savename(lien_adrfilsave *const afs,
char *a = afs->save + strlen(afs->save) - 1;
char *b;
int n = 2;
char collisionSeparator = ((opt->savename_83 != 2) ? '-' : '_');
char collisionSeparator =
((opt->savename_83 != HTS_SAVENAME_83_ISO9660) ? '-' : '_');
tempo[0] = '\0';

View File

@@ -342,17 +342,44 @@ typedef enum hts_seeker {
HTS_SEEKER_UP = 1 << 1 /**< may ascend to parent directories */
} hts_seeker;
/* Link-following scope, stored in the low byte of opt->travel. */
/* opt->travel: link-following scope in the low byte, flags OR'd in above it. */
typedef enum hts_travel_scope {
HTS_TRAVEL_SAME_ADDRESS = 0, /**< stay on the same address (host) */
HTS_TRAVEL_SAME_DOMAIN = 1, /**< stay on the same principal domain */
HTS_TRAVEL_SAME_TLD = 2, /**< stay on the same TLD (e.g. .com) */
HTS_TRAVEL_EVERYWHERE = 7 /**< follow links anywhere on the web */
HTS_TRAVEL_EVERYWHERE = 7, /**< follow links anywhere on the web */
HTS_TRAVEL_TEST_ALL = 1 << 8 /**< also test forbidden URLs (-t) */
} hts_travel_scope;
/* Flags OR'd into opt->travel above the scope value. */
#define HTS_TRAVEL_SCOPE_MASK 0xff /**< mask selecting the scope value */
#define HTS_TRAVEL_TEST_ALL (1 << 8) /**< also test forbidden URLs (-t) */
/* Mask selecting the scope value out of opt->travel. */
#define HTS_TRAVEL_SCOPE_MASK 0xff
/* Text progress display detail (opt->verbosedisplay). */
typedef enum hts_verbosedisplay {
HTS_VERBOSE_NONE = 0, /**< no animated progress display (default) */
HTS_VERBOSE_SIMPLE = 1, /**< minimal single-line progress */
HTS_VERBOSE_FULL = 2 /**< full animated progress */
} hts_verbosedisplay;
/* Delayed file-type resolution policy (opt->savename_delayed). */
typedef enum hts_savename_delayed {
HTS_SAVENAME_DELAYED_NONE = 0, /**< resolve the type immediately */
HTS_SAVENAME_DELAYED_SOFT = 1, /**< delay the type check when unknown */
HTS_SAVENAME_DELAYED_HARD = 2 /**< always delay the type check (default) */
} hts_savename_delayed;
/* Saved-name length layout (opt->savename_83). */
typedef enum hts_savename_83 {
HTS_SAVENAME_83_LONG = 0, /**< long file names (default) */
HTS_SAVENAME_83_DOS = 1, /**< DOS 8.3 names (ISO9660 level 1) */
HTS_SAVENAME_83_ISO9660 = 2 /**< ISO9660 level 2 names (up to 31 chars) */
} hts_savename_83;
/* Host-banning triggers (opt->hostcontrol bitmask). */
typedef enum hts_hostcontrol {
HTS_HOSTCONTROL_BAN_TIMEOUT = 1 << 0, /**< ban a timing-out host */
HTS_HOSTCONTROL_BAN_SLOW = 1 << 1 /**< ban a too-slow host */
} hts_hostcontrol;
#ifndef HTS_DEF_FWSTRUCT_lien_buffers
#define HTS_DEF_FWSTRUCT_lien_buffers
@@ -386,7 +413,7 @@ struct httrackp {
hts_urlmode
urlmode; /**< saved-link rewriting style (relative, absolute, etc.) */
hts_boolean no_type_change; // do not change file type according to MIME
int debug; /**< debug logging level */
hts_log_type debug; /**< debug logging level */
int getmode; /**< what to fetch (HTML, images, ...) bitmask */
FILE *log; /**< informational log stream; NULL mutes it */
FILE *errlog; /**< error log stream; NULL mutes it */
@@ -410,11 +437,12 @@ struct httrackp {
// int aff_progress; // progress bar
hts_boolean shell; /**< driven by a shell over stdin/stdout pipes */
t_proxy proxy; /**< proxy configuration */
int savename_83; /**< force 8.3 (DOS) file names */
hts_savename_83
savename_83; /**< saved-name length layout (long/DOS/ISO9660) */
int savename_type; /**< saved-name layout (original tree, flat, ...) */
String
savename_userdef; /**< user-defined name template (e.g. %h%p/%n%q.%t) */
int savename_delayed; // delayed type check
hts_savename_delayed savename_delayed; /**< delayed type-check policy */
hts_boolean
delayed_cached; // delayed type check can be cached to speedup updates
hts_boolean mimehtml; /**< produce a single MIME/MHTML archive */
@@ -430,7 +458,7 @@ struct httrackp {
hts_boolean makestat; /**< maintain a transfer-statistics log */
hts_boolean maketrack; /**< maintain an operations-statistics log */
int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */
int hostcontrol; /**< drop hosts that are too slow, etc. */
int hostcontrol; /**< ban slow/timing-out hosts; see hts_hostcontrol bits */
hts_boolean errpage; /**< generate an error page on 404 and similar */
hts_boolean
check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves
@@ -455,7 +483,7 @@ struct httrackp {
parseall; /**< parse aggressively, including unknown tags with links */
hts_boolean parsedebug; /**< parser debug mode */
hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */
int verbosedisplay; /**< animated text progress display */
hts_verbosedisplay verbosedisplay; /**< animated text progress display */
String footer; /**< footer/info line injected into pages */
int maxcache; /**< in-memory cache backing limit (bytes) */
// int maxcache_anticipate; // maximum links to anticipate (upper bound)

View File

@@ -3722,7 +3722,8 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
//case -1: can_retry=1; break;
case STATUSCODE_TIMEOUT:
if (opt->hostcontrol) { // timeout et retry épuisés
if ((opt->hostcontrol & 1) && (heap(ptr)->retry <= 0)) {
if ((opt->hostcontrol & HTS_HOSTCONTROL_BAN_TIMEOUT) &&
(heap(ptr)->retry <= 0)) {
hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil());
host_ban(opt, ptr, sback, jump_identification_const(urladr()));
hts_log_print(opt, LOG_DEBUG,
@@ -3735,7 +3736,7 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
break;
case STATUSCODE_SLOW:
if ((opt->hostcontrol) && (heap(ptr)->retry <= 0)) { // too slow
if (opt->hostcontrol & 2) {
if (opt->hostcontrol & HTS_HOSTCONTROL_BAN_SLOW) {
hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil());
host_ban(opt, ptr, sback, jump_identification_const(urladr()));
hts_log_print(opt, LOG_DEBUG,
@@ -4261,10 +4262,10 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
char com[256];
linput(stdin, com, 200);
if (opt->verbosedisplay == 2)
opt->verbosedisplay = 1;
if (opt->verbosedisplay == HTS_VERBOSE_FULL)
opt->verbosedisplay = HTS_VERBOSE_SIMPLE;
else
opt->verbosedisplay = 2;
opt->verbosedisplay = HTS_VERBOSE_FULL;
/* Info for wrappers */
hts_log_print(opt, LOG_INFO, "engine: change-options");
RUN_CALLBACK0(opt, chopt);
@@ -4374,7 +4375,7 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
printf("%c\x0d", ("/-\\|")[roll]);
fflush(stdout);
}
} else if (opt->verbosedisplay == 1) {
} else if (opt->verbosedisplay == HTS_VERBOSE_SIMPLE) {
if (b >= 0) {
if (back[b].r.statuscode == HTTP_OK)
printf("%d/%d: %s%s (" LLintP " bytes) - OK\33[K\r", ptr, opt->lien_tot,
@@ -4465,8 +4466,8 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
char in_error_msg[32];
// resolve unresolved type
if (opt->savename_delayed != 0 && *forbidden_url == 0 && IS_DELAYED_EXT(afs->save)
&& !opt->state.stop) {
if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
*forbidden_url == 0 && IS_DELAYED_EXT(afs->save) && !opt->state.stop) {
int loops;
int continue_loop;
@@ -4850,7 +4851,7 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
}
}
} // delayed type check ?
} // delayed type check ?
ENGINE_SAVE_CONTEXT_BASE();

View File

@@ -288,7 +288,7 @@ static void __cdecl htsshow_uninit(t_hts_callbackarg * carg) {
}
static int __cdecl htsshow_start(t_hts_callbackarg * carg, httrackp * opt) {
use_show = 0;
if (opt->verbosedisplay == 2) {
if (opt->verbosedisplay == HTS_VERBOSE_FULL) {
use_show = 1;
vt_clear();
}
@@ -852,7 +852,7 @@ static void sig_doback(int blind) { // mettre en backing
if (global_opt != NULL) {
// suppress logging and asking lousy questions
global_opt->quiet = 1;
global_opt->verbosedisplay = 0;
global_opt->verbosedisplay = HTS_VERBOSE_NONE;
}
if (!blind)

15
tests/01_engine-cookies.test Executable file
View File

@@ -0,0 +1,15 @@
#!/bin/bash
#
# Issue #151 guard: the request Cookie header must be bare RFC 6265 name=value
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#Q' selftest.
set -eu
# A trailing token is required; a bare '-#Q' falls through to the usage screen.
out=$(httrack -#Q run)
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "cookie-header: OK" || {
echo "expected 'cookie-header: OK', got: $out" >&2
exit 1
}

17
tests/01_engine-copyopt.test Executable file
View File

@@ -0,0 +1,17 @@
#!/bin/bash
#
# Regression guard for the unsigned-enum sentinel trap: copy_htsopt's
# `if (from->X > -1)` guard is always false for unsigned hts_boolean fields, so
# they silently stop being copied. Driven by the in-process 'httrack -#9' test.
# Keep POSIX-portable (harness runs it via $(BASH), a plain /bin/sh on macOS).
set -eu
# A trailing token is required; a bare '-#9' falls through to the usage screen.
out=$(httrack -#9 run)
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "copy-htsopt: OK" || {
echo "expected 'copy-htsopt: OK', got: $out" >&2
exit 1
}

View File

@@ -89,4 +89,37 @@ grep -q NEWCONTENT "$(find "$out" -path '*/a.html' -print -quit)" || {
exit 1
}
# --- 3. an empty quoted arg survives the doit.log round-trip (#106) ----------
# -%F "" (empty footer) records an empty "" token in doit.log; -r2 follows it so
# a "drop the empty token" bug shifts -r2 into -%F's slot (the reprise then sees
# -%F -r2 and panics "%F needs to be followed by ..."), making the bug visible
# rather than a harmless run off the end of argv.
out2="$tmp/out2"
rc=0
"$bin" "$url" -O "$out2" --quiet -n -%v0 -%F "" -r2 >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: initial mirror with empty footer exited $rc"
exit 1
}
# precondition: the writer put the empty token on disk for the reader to reload.
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer not recorded as -%F \"\" -r2 in doit.log"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
# no-url reprise: the reader rebuilds argv from doit.log and rewrites doit.log
# from it. The empty token surviving in the regenerated file proves the reader
# kept it (a drop/swallow would panic above or rewrite -%F without the "").
rc=0
"$bin" -O "$out2" --quiet >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: empty-footer reprise exited $rc (empty token dropped from doit.log?)"
exit 1
}
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer did not survive the doit.log reload round-trip"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
exit 0

View File

@@ -0,0 +1,136 @@
#!/bin/bash
#
# Issue #85: an https crawl must go through the configured proxy (CONNECT
# tunnel), not bypass it and hit the origin directly. Fully local: a self-signed
# TLS origin plus a logging CONNECT proxy, so no network access is needed.
set -euo pipefail
: "${top_srcdir:=..}"
if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
if ! command -v python3 >/dev/null 2>&1 || ! command -v openssl >/dev/null 2>&1; then
echo "python3/openssl missing, skipping"
exit 77
fi
server="$top_srcdir/tests/proxy-https-server.py"
tmpdir=$(mktemp -d)
pids=
cleanup() {
for pid in $pids; do
kill "$pid" 2>/dev/null || true
done
rm -rf "$tmpdir"
}
trap cleanup EXIT
# self-signed cert for the local TLS origin (httrack does not verify certs)
openssl req -x509 -newkey rsa:2048 -keyout "$tmpdir/key.pem" \
-out "$tmpdir/cert.pem" -days 2 -nodes -subj "/CN=127.0.0.1" \
>/dev/null 2>&1
cat "$tmpdir/key.pem" "$tmpdir/cert.pem" >"$tmpdir/both.pem"
# start_server <logdir> <mode>: launches a proxy+origin pair, sets $origin_port
# and $proxy_port from its announced ephemeral ports.
start_server() {
local dir="$1" mode="$2" ports
mkdir -p "$dir"
ports="$dir/ports.txt"
python3 "$server" "$tmpdir/both.pem" "$dir" "$mode" \
>"$ports" 2>"$dir/server.err" &
pids="$pids $!"
for _ in $(seq 1 100); do
grep -q "^ready" "$ports" 2>/dev/null && break
sleep 0.1
done
grep -q "^ready" "$ports" 2>/dev/null || {
echo "server ($mode) did not start" >&2
cat "$dir/server.err" >&2
exit 1
}
origin_port=$(awk '/^ORIGIN/{print $2}' "$ports")
proxy_port=$(awk '/^PROXY/{print $2}' "$ports")
}
# Run httrack, but kill it after a deadline so a hang (e.g. a missing bound on
# the proxy response) surfaces as the kill code $HANG_RC instead of stalling the
# whole job. A portable stand-in for `timeout`, which macOS lacks.
HANG_RC=137 # 128 + SIGKILL
run_crawl() {
local out="$1" proxy="$2" port="$3"
rm -rf "$out"
httrack "https://127.0.0.1:${port}/" --proxy "$proxy" \
-O "$out" -r1 -s0 --timeout=10 >"$out.log" 2>&1 &
local pid=$!
(sleep 60 && kill -9 "$pid" 2>/dev/null) &
local guard=$!
local rc=0
wait "$pid" 2>/dev/null || rc=$?
kill "$guard" 2>/dev/null || true
wait "$guard" 2>/dev/null || true
return "$rc"
}
# --- working proxy ----------------------------------------------------------
ok="$tmpdir/ok"
start_server "$ok" ok
# 1. page retrieved AND the proxy saw a CONNECT to the origin
run_crawl "$ok/out" "127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out" || {
echo "FAIL: origin page not downloaded through proxy" >&2
cat "$ok/out.log" >&2
exit 1
}
grep -q "^CONNECT 127.0.0.1:${origin_port} " "$ok/proxy.log" || {
echo "FAIL: proxy never received a CONNECT (https bypassed the proxy)" >&2
cat "$ok/proxy.log" >&2
exit 1
}
echo "OK: https tunneled through proxy via CONNECT"
# 2. authenticated proxy: creds ride the CONNECT, and NEVER reach the origin
: >"$ok/proxy.log"
: >"$ok/origin-headers.log"
run_crawl "$ok/out2" "user:secret@127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out2" || {
echo "FAIL: origin page not downloaded through authenticated proxy" >&2
exit 1
}
got=$(awk '/^AUTH Basic /{print $3}' "$ok/proxy.log" | head -1)
# base64("user:secret"); compared as a literal to stay portable (no base64 -d,
# which differs between GNU and BSD)
test "$got" == "dXNlcjpzZWNyZXQ=" || {
echo "FAIL: Proxy-Authorization not carried on CONNECT (got '$got')" >&2
cat "$ok/proxy.log" >&2
exit 1
}
if grep -qi "proxy-authorization" "$ok/origin-headers.log"; then
echo "FAIL: proxy credentials leaked to the origin through the tunnel" >&2
cat "$ok/origin-headers.log" >&2
exit 1
fi
echo "OK: proxy credentials carried on CONNECT, not leaked to origin"
# --- hostile proxy ----------------------------------------------------------
# A proxy that answers 200 then streams headers forever must not hang the crawl:
# the client bounds the response. run_crawl kills a hung httrack after 60s, so a
# missing bound surfaces as $HANG_RC here.
flood="$tmpdir/flood"
start_server "$flood" flood
rc=0
run_crawl "$flood/out" "127.0.0.1:${proxy_port}" "$origin_port" || rc=$?
test "$rc" -ne "$HANG_RC" || {
echo "FAIL: crawl hung on a flooding proxy (bounded read missing)" >&2
exit 1
}
grep -rq "ORIGIN-PAGE-85" "$flood/out" 2>/dev/null && {
echo "FAIL: flooding proxy unexpectedly served the page" >&2
exit 1
}
echo "OK: bounded proxy response, no hang on a flooding proxy"

View File

@@ -2,6 +2,7 @@
# explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would
# silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT =
@@ -24,6 +25,8 @@ TESTS = \
01_engine-cache-golden.test \
01_engine-charset.test \
01_engine-cmdline.test \
01_engine-cookies.test \
01_engine-copyopt.test \
01_engine-doitlog.test \
01_engine-entities.test \
01_engine-filter.test \
@@ -42,6 +45,7 @@ TESTS = \
11_crawl-international.test \
11_crawl-longurl.test \
11_crawl-parsing.test \
12_crawl_https.test
12_crawl_https.test \
13_crawl_proxy_https.test
CLEANFILES = check-network_sh.cache

151
tests/proxy-https-server.py Normal file
View File

@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Local CONNECT proxy + self-signed HTTPS origin for the issue #85 test.
Starts a TLS origin server and an HTTP proxy that honours CONNECT, on ephemeral
ports. Every request line the proxy receives (and any Proxy-Authorization) is
appended to the proxy log; every header the origin receives over the tunnel is
appended to the origin log. That lets the test assert both that an https crawl
tunneled through the proxy and that proxy credentials never leaked to the origin.
Proxy modes (argv[3], default "ok"):
ok - honour CONNECT and tunnel to the origin
flood - answer 200 then stream headers forever with no blank line, to exercise
the client's bound on the proxy response (must not hang the crawl)
Usage: proxy-https-server.py <cert.pem> <logdir> [mode]
Prints "ORIGIN <port>", "PROXY <port>", then "ready" (one per line) on stdout.
"""
import http.server
import os
import socket
import socketserver
import ssl
import sys
import threading
ORIGIN_BODY = b"<html><body>ORIGIN-PAGE-85</body></html>"
PROXY_LOG = "proxy.log"
ORIGIN_LOG = "origin-headers.log"
def make_origin(logdir):
class Origin(http.server.BaseHTTPRequestHandler):
def do_GET(self):
with open(os.path.join(logdir, ORIGIN_LOG), "a") as handle:
for key in self.headers.keys():
handle.write(key + "\n")
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.send_header("Content-Length", str(len(ORIGIN_BODY)))
self.end_headers()
self.wfile.write(ORIGIN_BODY)
def log_message(self, *args):
pass
return Origin
def start_origin(certfile, logdir):
httpd = socketserver.TCPServer(("127.0.0.1", 0), make_origin(logdir))
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile)
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
port = httpd.socket.getsockname()[1]
threading.Thread(target=httpd.serve_forever, daemon=True).start()
return port
def pipe(src, dst):
try:
while True:
data = src.recv(65536)
if not data:
break
dst.sendall(data)
except OSError:
pass
finally:
for sock in (src, dst):
try:
sock.shutdown(socket.SHUT_RDWR)
except OSError:
pass
def handle_client(conn, logdir, mode):
rfile = conn.makefile("rb")
request_line = rfile.readline().decode("latin-1").strip()
auth = None
while True:
line = rfile.readline().decode("latin-1")
if line in ("\r\n", "\n", ""):
break
key, _, value = line.partition(":")
if key.strip().lower() == "proxy-authorization":
auth = value.strip()
with open(os.path.join(logdir, PROXY_LOG), "a") as handle:
handle.write(request_line + "\n")
if auth is not None:
handle.write("AUTH " + auth + "\n")
parts = request_line.split()
if not (len(parts) >= 2 and parts[0] == "CONNECT"):
conn.sendall(b"HTTP/1.0 501 Not Implemented\r\n\r\n")
conn.close()
return
if mode == "flood":
# 200, then an endless header stream with no terminating blank line: the
# client must bound this and give up, not hang.
try:
conn.sendall(b"HTTP/1.0 200 Connection established\r\n")
while True:
conn.sendall(b"X-Pad: 0123456789\r\n")
except OSError:
pass
conn.close()
return
host, _, port = parts[1].partition(":")
try:
upstream = socket.create_connection((host, int(port or 443)))
except OSError:
conn.sendall(b"HTTP/1.0 502 Bad Gateway\r\n\r\n")
conn.close()
return
conn.sendall(b"HTTP/1.0 200 Connection established\r\n\r\n")
threading.Thread(target=pipe, args=(conn, upstream), daemon=True).start()
pipe(upstream, conn)
def start_proxy(logdir, mode):
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(16)
port = srv.getsockname()[1]
def serve():
while True:
conn, _ = srv.accept()
threading.Thread(
target=handle_client, args=(conn, logdir, mode), daemon=True
).start()
threading.Thread(target=serve, daemon=True).start()
return port
def main():
certfile, logdir = sys.argv[1], sys.argv[2]
mode = sys.argv[3] if len(sys.argv) > 3 else "ok"
for name in (PROXY_LOG, ORIGIN_LOG):
open(os.path.join(logdir, name), "w").close()
origin_port = start_origin(certfile, logdir)
proxy_port = start_proxy(logdir, mode)
print("ORIGIN %d" % origin_port, flush=True)
print("PROXY %d" % proxy_port, flush=True)
print("ready", flush=True)
threading.Event().wait()
if __name__ == "__main__":
main()