Compare commits

..

7 Commits

Author SHA1 Message Date
Xavier Roche
c52a524a63 htslib: bound the proxy CONNECT response; harden + cover review findings
Follow-up to the CONNECT-tunnel change, from an adversarial review (the proxy
response is hostile input: a malicious or MITM proxy controls every byte).

- Bound the response read so a proxy cannot stall the single-threaded back_wait
  crawl: proxy_getline now fails on an over-long line instead of consuming it
  forever, the header drain is capped at 64 lines, and the send loop gives up
  rather than spin against a socket that reports writable but never accepts.
- Size `authority` to hold any url_adr host (HTS_URLMAXSIZE*2) so an oversized
  hostname can't trip the abort-on-overflow buff helpers; grow `req` to match.
- Reject control bytes in the CONNECT authority as a local backstop; today the
  CR/LF defense lives entirely upstream (escape_remove_control / header-line
  splitting).
- Test: the origin now records the headers it receives, and the test asserts
  Proxy-Authorization never reaches the origin through the tunnel (the previous
  assertions couldn't see a leak). Added a flooding-proxy scenario that proves
  the crawl terminates instead of hanging on an unbounded response.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 09:52:10 +02:00
Xavier Roche
1907621d37 htslib: tunnel https through the proxy via CONNECT (#85)
httrack opened https connections straight to the origin even when a proxy
was configured, so --proxy was silently ignored for https and the crawler
used the real IP. http_xfopen bypassed the proxy for any https:// URL,
because the absolute-URI proxy form it uses for http cannot carry https.

Connect to the proxy instead and, once the TCP connection is up, open an
HTTP CONNECT tunnel (http_proxy_tunnel) before the TLS handshake, so TLS
runs end-to-end with the origin. Proxy credentials now ride the CONNECT
request rather than the tunneled GET, where they would leak to the origin.
The exchange is a bounded blocking read inside the back_wait connect path:
no new async state, no struct/ABI change (the helpers stay visibility-hidden).

Verified end-to-end by 13_crawl_proxy_https.test: it crawls a local
self-signed https origin through a logging CONNECT proxy and asserts the
proxy saw the CONNECT and that credentials ride it. The assertion fails on
the pre-fix bypass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:43:56 +02:00
Xavier Roche
3b2d7afdaa Merge pull request #393 from xroche/fix/empty-footer-doitlog-106
Keep empty quoted args when reloading doit.log (#106)
2026-06-19 08:13:19 +02:00
Xavier Roche
6ee539619e htscoremain: keep empty quoted args when reloading doit.log (#106)
An empty footer (-%F "") is written to hts-cache/doit.log correctly as the
two-character token "", and next_token() unquotes it back to an empty string.
But the doit.log reload loop only re-inserted a token when strnotempty(lastp),
which dropped the empty one. With its argument gone, -%F absorbed the following
token (or had none), so a no-url --continue/--update reprise misparsed and
failed.

Track whether the token started with a quote (before next_token() strips it in
place) and keep it even when empty, so "" survives the round-trip. Whitespace
gaps still produce no token, so spacing behavior is unchanged.

01_engine-doitlog.test gains a scenario that mirrors with -%F "" -r2, then on
the no-url reprise checks the regenerated doit.log still round-trips the empty
token -- probing the reader's rebuilt argv, not just that the reprise didn't
crash. The trailing -r2 makes a dropped-token bug visible (it shifts into -%F's
slot and panics) rather than a harmless run off the end of argv. Reverting only
the guard makes the scenario fail (reprise exits 255).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:09:57 +02:00
Xavier Roche
fb098b27b4 Merge pull request #392 from xroche/fix/cookie-rfc6265-151
Drop $Version/$Path from the request Cookie header (#151)
2026-06-18 22:42:47 +02:00
Xavier Roche
5f6a3fb917 htslib: drop $Version/$Path from request Cookie header (#151)
The request "Cookie:" header was built in the obsolete RFC 2965 style,
emitting "$Version=1" before the first cookie and a "$Path=..." attribute
after every value:

  Cookie: $Version=1; name=value; $Path=/; has_js=1; $Path=/

Servers expecting RFC 6265 treat $Version and $Path as stray cookies and
reject or misread the request. Emit bare name=value pairs joined by "; ":

  Cookie: name=value; has_js=1

The cookie loop is factored out of http_sendhead into append_cookie_header
(same logic, same buffer), with a thin http_cookie_header_selftest wrapper
so the exact code path can be unit-tested. A new hidden "-#Q" subcommand
builds the header for two same-domain cookies plus one on a different
domain (which must be filtered out) and checks the output is the clean
RFC 6265 form with no $Version/$Path and no cross-domain leak; driven by
tests/01_engine-cookies.test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 22:12:28 +02:00
Xavier Roche
f9e676dbe3 Merge pull request #391 from xroche/feature/api-enum-callsites-savename83
htsopt: name the savename_83 enum and finish the call-site constant adoption
2026-06-18 21:43:34 +02:00
9 changed files with 646 additions and 39 deletions

View File

@@ -2532,8 +2532,26 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
#if HTS_USEOPENSSL
/* SSL mode */
if (back[i].r.ssl) {
int tunnel_ok = 1;
// https via proxy: CONNECT-tunnel before TLS (#85)
if (back[i].r.req.proxy.active && back[i].r.ssl_con == NULL) {
const int timeout = back[i].timeout > 0 ? back[i].timeout : 30;
tunnel_ok =
http_proxy_tunnel(opt, &back[i].r, back[i].url_adr, timeout);
if (!tunnel_ok) {
if (!strnotempty(back[i].r.msg))
strcpybuff(back[i].r.msg, "proxy CONNECT failed");
deletehttp(&back[i].r);
back[i].r.soc = INVALID_SOCKET;
back[i].r.statuscode = STATUSCODE_NON_FATAL;
back[i].status = STATUS_READY;
back_set_finished(sback, i);
}
}
// handshake not yet launched
if (!back[i].r.ssl_con) {
if (tunnel_ok && !back[i].r.ssl_con) {
SSL_CTX_set_options(openssl_ctx, SSL_OP_ALL);
// new session
back[i].r.ssl_con = SSL_new(openssl_ctx);
@@ -2551,7 +2569,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
back[i].r.statuscode = STATUSCODE_SSL_HANDSHAKE;
}
/* Error */
if (back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
if (tunnel_ok && back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
strcpybuff(back[i].r.msg, "bad SSL/TLS handshake");
deletehttp(&back[i].r);
back[i].r.soc = INVALID_SOCKET;

View File

@@ -953,9 +953,11 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
p = buff;
do {
int insert_after_argc;
int quoted; /* "" unquotes to empty but is still a real token (#106) */
// read next
lastp = p;
quoted = (p != NULL && *p == '"');
if (p) {
p = next_token(p, 1);
if (p) {
@@ -966,7 +968,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
/* Insert parameters BUT so that they can be in the same order */
if (lastp) {
if (strnotempty(lastp)) {
if (strnotempty(lastp) || quoted) {
insert_after_argc = argc - insert_after;
cmdl_ins(lastp, insert_after_argc, (argv + insert_after), x_argvblk,
x_argvblk_size, x_ptr);
@@ -3129,6 +3131,43 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
htsmain_free();
return err;
} break;
case 'Q': { // cookie request-header selftest: httrack -#Q
static t_cookie cookie;
char hdr[1024];
/* RFC 6265: bare name=value pairs, no $Version/$Path (#151). */
const char *expected = "Cookie: name=value; has_js=1" H_CRLF;
int err = 0;
const char *dom = "www.example.com";
int added;
cookie.max_len = (int) sizeof(cookie.data);
cookie.data[0] = '\0';
added = cookie_add(&cookie, "name", "value", dom, "/");
added |= cookie_add(&cookie, "has_js", "1", dom, "/");
/* different domain: must be filtered out */
added |= cookie_add(&cookie, "junk", "x", "other.org", "/");
if (added) {
printf("cookie-header: FAIL (cookie_add setup)\n");
htsmain_free();
return 1;
}
http_cookie_header_selftest(&cookie, dom, "/", hdr,
sizeof(hdr));
if (strcmp(hdr, expected) != 0)
err = 1;
if (strstr(hdr, "$Version") != NULL ||
strstr(hdr, "$Path") != NULL)
err = 1;
if (strstr(hdr, "junk") != NULL) // wrong-domain cookie leaked
err = 1;
printf("cookie-header: %s\n", err ? "FAIL" : "OK");
if (err)
printf(" got: %s\n", hdr);
htsmain_free();
return err;
} break;
case '!':
HTS_PANIC_PRINTF
("Option #! is disabled for security reasons");

View File

@@ -644,6 +644,165 @@ T_SOC http_fopen(httrackp * opt, const char *adr, const char *fil, htsblk * reto
return http_xfopen(opt, 0, 1, 1, NULL, adr, fil, retour);
}
// Read a CRLF line from a non-blocking socket (waits up to timeout per recv).
// Returns the line length (0 = empty), or -1 on timeout/EOF/error.
static int proxy_getline(T_SOC soc, char *s, int max, int timeout) {
int j = 0;
for (;;) {
unsigned char ch;
int n;
if (!check_readinput_t(soc, timeout))
return -1; // timed out waiting for data
n = (int) recv(soc, &ch, 1, 0);
if (n == 1) {
if (ch == 13) // CR
continue;
if (ch == 10) // LF: end of line
break;
if (j >= max - 1)
return -1; // line too long: bound the read against a hostile proxy
s[j++] = (char) ch;
} else if (n == 0) {
return -1; // connection closed
} else {
#ifdef _WIN32
if (WSAGetLastError() == WSAEWOULDBLOCK)
continue;
#else
if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
continue;
#endif
return -1;
}
}
s[j] = '\0';
return j;
}
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout) {
const T_SOC soc = retour->soc;
const char *const host = jump_identification_const(adr); // host[:port]
const char *const portsep = jump_toport_const(adr); // ":port" or NULL
char BIGSTK authority[HTS_URLMAXSIZE * 2];
char BIGSTK req[HTS_URLMAXSIZE * 4 + 1100];
char line[1024];
int code;
if (soc == INVALID_SOCKET)
return 0;
// CONNECT needs an explicit host:port; default the https port
authority[0] = '\0';
if (portsep != NULL)
strlcatbuff(authority, host, sizeof(authority)); // already host:port
else
snprintf(authority, sizeof(authority), "%s:%d", host, 443);
// backstop: never let a stray CR/LF in the host smuggle a second line into
// the CONNECT request (the host is already sanitized upstream)
{
const char *c;
for (c = authority; *c != '\0'; c++) {
if ((unsigned char) *c < ' ') {
strcpybuff(retour->msg, "proxy CONNECT: invalid host");
return 0;
}
}
}
snprintf(req, sizeof(req), "CONNECT %s HTTP/1.0" H_CRLF "Host: %s" H_CRLF,
authority, authority);
// creds go on the CONNECT, not the tunneled origin request
if (link_has_authorization(retour->req.proxy.name)) {
const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name);
char autorisation[1100];
char user_pass[256];
autorisation[0] = user_pass[0] = '\0';
strncatbuff(user_pass, astart, (int) (a - astart) - 1);
strcpybuff(user_pass, unescape_http(OPT_GET_BUFF(opt),
OPT_GET_BUFF_SIZE(opt), user_pass));
code64((unsigned char *) user_pass, (int) strlen(user_pass),
(unsigned char *) autorisation, 0);
strlcatbuff(req, "Proxy-Authorization: Basic ", sizeof(req));
strlcatbuff(req, autorisation, sizeof(req));
strlcatbuff(req, H_CRLF, sizeof(req));
}
strlcatbuff(req, H_CRLF, sizeof(req)); // end of request headers
// raw send: ssl is set, so sendc() would route to TLS
{
const char *p = req;
size_t remain = strlen(req);
int stalls = 0;
while (remain > 0) {
const int n = (int) send(soc, p, (int) remain, 0);
if (n > 0) {
p += n;
remain -= (size_t) n;
stalls = 0;
} else {
#ifdef _WIN32
const int wouldblock = (WSAGetLastError() == WSAEWOULDBLOCK);
#else
const int wouldblock =
(errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR);
#endif
// don't spin forever on a fatal error or an unwritable socket
if (!wouldblock || !check_writeinput_t(soc, timeout) ||
++stalls > 100) {
strcpybuff(retour->msg, "proxy CONNECT: write error");
return 0;
}
}
}
}
// proxy status line: "HTTP/1.x <code> ..."
if (proxy_getline(soc, line, sizeof(line), timeout) < 0) {
strcpybuff(retour->msg, "proxy CONNECT: no response");
return 0;
}
if (sscanf(line, "HTTP/%*d.%*d %d", &code) < 1)
code = 0;
if (code < 200 || code >= 300) {
snprintf(retour->msg, sizeof(retour->msg), "proxy CONNECT refused: %s",
strnotempty(line) ? line : "(no status)");
return 0;
}
// drain headers to the blank line; cap the count so a flooding proxy can't
// stall the crawl
{
int headers = 0;
for (;;) {
const int n = proxy_getline(soc, line, sizeof(line), timeout);
if (n < 0) {
strcpybuff(retour->msg, "proxy CONNECT: truncated response");
return 0;
}
if (n == 0)
break; // blank line: tunnel ready
if (++headers > 64) {
strcpybuff(retour->msg, "proxy CONNECT: too many response headers");
return 0;
}
}
}
return 1;
}
// ouverture d'une liaison http, envoi d'une requète
// mode: 0 GET 1 HEAD [2 POST]
// treat: traiter header?
@@ -680,14 +839,14 @@ T_SOC http_xfopen(httrackp * opt, int mode, int treat, int waitconnect,
/* connexion */
if (retour) {
if ((!(retour->req.proxy.active))
|| ((strcmp(adr, "file://") == 0)
|| (strncmp(adr, "https://", 8) == 0)
)
) { /* pas de proxy, ou non utilisable ici */
/* no proxy, or proxy not usable here (local file) */
if ((!(retour->req.proxy.active)) || (strcmp(adr, "file://") == 0)) {
soc = newhttp(opt, adr, retour, -1, waitconnect);
} else {
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port, waitconnect); // ouvrir sur le proxy à la place
// to the proxy; https tunnels to the origin via CONNECT in back_wait
// (#85)
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port,
waitconnect);
}
} else {
soc = newhttp(opt, adr, NULL, -1, waitconnect);
@@ -874,6 +1033,50 @@ static void print_buffer(buff_struct*const str, const char *format, ...) {
assertf(str->pos < str->capacity);
}
/* Append the request "Cookie:" header line for every stored cookie matching
domain/path. RFC 6265 form: bare "name=value" pairs joined by "; ", no
$Version/$Path attributes (those are RFC 2965 syntax that modern servers
reject, issue #151). Returns the number of cookies emitted. */
static int append_cookie_header(buff_struct *bstr, t_cookie *cookie,
const char *domain, const char *path) {
char buffer[8192];
char *b;
int cook = 0;
int max_cookies = 8;
if (cookie == NULL)
return 0;
b = cookie->data;
do {
b = cookie_find(b, "", domain, path); // next matching cookie
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(bstr, "Cookie: ");
cook = 1;
} else
print_buffer(bstr, "; ");
print_buffer(bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(bstr, "=%s", cookie_get(buffer, b, 6));
b = cookie_nextfield(b);
}
} while (b != NULL && max_cookies > 0);
if (cook)
print_buffer(bstr, H_CRLF);
return cook;
}
/* Self-test entry for append_cookie_header(): build the request Cookie line
into dst (always NUL-terminated). Returns the number of cookies emitted. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size) {
buff_struct bstr = {dst, dst_size, 0};
assertf(dst != NULL && dst_size > 0);
dst[0] = '\0';
return append_cookie_header(&bstr, cookie, domain, path);
}
// envoi d'une requète
int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
const char *xsend, const char *adr, const char *fil,
@@ -999,8 +1202,8 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
if (xsend)
print_buffer(&bstr, "%s", xsend); // éventuelles autres lignes
// tester proxy authentication
if (retour->req.proxy.active) {
// for https, auth rides the CONNECT (the tunneled GET would leak it)
if (retour->req.proxy.active && strncmp(adr, "https://", 8) != 0) {
if (link_has_authorization(retour->req.proxy.name)) { // et hop, authentification proxy!
const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name);
@@ -1048,34 +1251,9 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
search_tag + strlen(POSTTOK) + 1))));
}
}
// gestion cookies?
// send stored cookies matching this host/path
if (cookie) {
char buffer[8192];
char *b = cookie->data;
int cook = 0;
int max_cookies = 8;
do {
b = cookie_find(b, "", jump_identification_const(adr), fil); // prochain cookie satisfaisant aux conditions
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(&bstr, "Cookie: $Version=1; ");
cook = 1;
} else
print_buffer(&bstr, "; ");
print_buffer(&bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(&bstr, "=%s", cookie_get(buffer, b, 6));
print_buffer(&bstr, "; $Path=%s", cookie_get(buffer, b, 2));
b = cookie_nextfield(b);
}
} while(b != NULL && max_cookies > 0);
if (cook) { // on a envoyé un (ou plusieurs) cookie?
print_buffer(&bstr, H_CRLF);
#if DEBUG_COOK
printf("Header:\n%s\n", bstr.buffer);
#endif
}
append_cookie_header(&bstr, cookie, jump_identification_const(adr), fil);
}
// gérer le keep-alive (garder socket)
if (retour->req.http11 && !retour->req.nokeepalive) {
@@ -1808,6 +1986,24 @@ int check_readinput_t(T_SOC soc, int timeout) {
return 0;
}
// wait until the socket is writable, up to timeout seconds
int check_writeinput_t(T_SOC soc, int timeout) {
if (soc != INVALID_SOCKET) {
fd_set fds;
struct timeval tv;
const int isoc = (int) soc;
assertf(isoc == soc);
FD_ZERO(&fds);
FD_SET(isoc, &fds);
tv.tv_sec = timeout;
tv.tv_usec = 0;
select(isoc + 1, NULL, &fds, NULL, &tv);
return FD_ISSET(isoc, &fds) ? 1 : 0;
} else
return 0;
}
// idem, sauf qu'ici on peut choisir la taille max de données à recevoir
// SI bufl==0 alors le buffer est censé être de 8kos, et on recoit par bloc de lignes
// en éliminant les cr (ex: header), arrêt si double-lf

View File

@@ -182,6 +182,11 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode, const char *xsend
const char *adr, const char *fil,
const char *referer_adr, const char *referer_fil,
htsblk * retour);
/* Build the request "Cookie:" header line for stored cookies matching
domain/path into dst (NUL-terminated). Exposed for the -#Q self-test;
wraps the same logic http_sendhead() uses. Returns cookies emitted. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size);
//int newhttp(char* iadr,char* err=NULL);
T_SOC newhttp(httrackp * opt, const char *iadr, htsblk * retour, int port,
@@ -193,6 +198,17 @@ HTS_INLINE void deletesoc_r(htsblk * r);
htsblk http_test(httrackp * opt, const char *adr, const char *fil, char *loc);
int check_readinput(htsblk * r);
int check_readinput_t(T_SOC soc, int timeout);
int check_writeinput_t(T_SOC soc, int timeout);
/* Open an HTTP CONNECT tunnel through the active proxy for an https request:
`retour->soc` must already be TCP-connected to the proxy, and `adr` is the
origin authority (url_adr, e.g. "https://host:port"). Sends the CONNECT
request (with Proxy-Authorization when the proxy carries credentials) and
reads the proxy's status line, so the caller's TLS handshake then runs
end-to-end with the origin. Blocks up to `timeout` seconds. Returns 1 on a
2xx tunnel, 0 on failure (retour->msg/statuscode set). */
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout);
void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * retour,
char *rcvd);
void treatfirstline(htsblk * retour, const char *rcvd);

15
tests/01_engine-cookies.test Executable file
View File

@@ -0,0 +1,15 @@
#!/bin/bash
#
# Issue #151 guard: the request Cookie header must be bare RFC 6265 name=value
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#Q' selftest.
set -eu
# A trailing token is required; a bare '-#Q' falls through to the usage screen.
out=$(httrack -#Q run)
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "cookie-header: OK" || {
echo "expected 'cookie-header: OK', got: $out" >&2
exit 1
}

View File

@@ -89,4 +89,37 @@ grep -q NEWCONTENT "$(find "$out" -path '*/a.html' -print -quit)" || {
exit 1
}
# --- 3. an empty quoted arg survives the doit.log round-trip (#106) ----------
# -%F "" (empty footer) records an empty "" token in doit.log; -r2 follows it so
# a "drop the empty token" bug shifts -r2 into -%F's slot (the reprise then sees
# -%F -r2 and panics "%F needs to be followed by ..."), making the bug visible
# rather than a harmless run off the end of argv.
out2="$tmp/out2"
rc=0
"$bin" "$url" -O "$out2" --quiet -n -%v0 -%F "" -r2 >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: initial mirror with empty footer exited $rc"
exit 1
}
# precondition: the writer put the empty token on disk for the reader to reload.
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer not recorded as -%F \"\" -r2 in doit.log"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
# no-url reprise: the reader rebuilds argv from doit.log and rewrites doit.log
# from it. The empty token surviving in the regenerated file proves the reader
# kept it (a drop/swallow would panic above or rewrite -%F without the "").
rc=0
"$bin" -O "$out2" --quiet >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: empty-footer reprise exited $rc (empty token dropped from doit.log?)"
exit 1
}
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer did not survive the doit.log reload round-trip"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
exit 0

View File

@@ -0,0 +1,136 @@
#!/bin/bash
#
# Issue #85: an https crawl must go through the configured proxy (CONNECT
# tunnel), not bypass it and hit the origin directly. Fully local: a self-signed
# TLS origin plus a logging CONNECT proxy, so no network access is needed.
set -euo pipefail
: "${top_srcdir:=..}"
if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
if ! command -v python3 >/dev/null 2>&1 || ! command -v openssl >/dev/null 2>&1; then
echo "python3/openssl missing, skipping"
exit 77
fi
server="$top_srcdir/tests/proxy-https-server.py"
tmpdir=$(mktemp -d)
pids=
cleanup() {
for pid in $pids; do
kill "$pid" 2>/dev/null || true
done
rm -rf "$tmpdir"
}
trap cleanup EXIT
# self-signed cert for the local TLS origin (httrack does not verify certs)
openssl req -x509 -newkey rsa:2048 -keyout "$tmpdir/key.pem" \
-out "$tmpdir/cert.pem" -days 2 -nodes -subj "/CN=127.0.0.1" \
>/dev/null 2>&1
cat "$tmpdir/key.pem" "$tmpdir/cert.pem" >"$tmpdir/both.pem"
# start_server <logdir> <mode>: launches a proxy+origin pair, sets $origin_port
# and $proxy_port from its announced ephemeral ports.
start_server() {
local dir="$1" mode="$2" ports
mkdir -p "$dir"
ports="$dir/ports.txt"
python3 "$server" "$tmpdir/both.pem" "$dir" "$mode" \
>"$ports" 2>"$dir/server.err" &
pids="$pids $!"
for _ in $(seq 1 100); do
grep -q "^ready" "$ports" 2>/dev/null && break
sleep 0.1
done
grep -q "^ready" "$ports" 2>/dev/null || {
echo "server ($mode) did not start" >&2
cat "$dir/server.err" >&2
exit 1
}
origin_port=$(awk '/^ORIGIN/{print $2}' "$ports")
proxy_port=$(awk '/^PROXY/{print $2}' "$ports")
}
# Run httrack, but kill it after a deadline so a hang (e.g. a missing bound on
# the proxy response) surfaces as the kill code $HANG_RC instead of stalling the
# whole job. A portable stand-in for `timeout`, which macOS lacks.
HANG_RC=137 # 128 + SIGKILL
run_crawl() {
local out="$1" proxy="$2" port="$3"
rm -rf "$out"
httrack "https://127.0.0.1:${port}/" --proxy "$proxy" \
-O "$out" -r1 -s0 --timeout=10 >"$out.log" 2>&1 &
local pid=$!
(sleep 60 && kill -9 "$pid" 2>/dev/null) &
local guard=$!
local rc=0
wait "$pid" 2>/dev/null || rc=$?
kill "$guard" 2>/dev/null || true
wait "$guard" 2>/dev/null || true
return "$rc"
}
# --- working proxy ----------------------------------------------------------
ok="$tmpdir/ok"
start_server "$ok" ok
# 1. page retrieved AND the proxy saw a CONNECT to the origin
run_crawl "$ok/out" "127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out" || {
echo "FAIL: origin page not downloaded through proxy" >&2
cat "$ok/out.log" >&2
exit 1
}
grep -q "^CONNECT 127.0.0.1:${origin_port} " "$ok/proxy.log" || {
echo "FAIL: proxy never received a CONNECT (https bypassed the proxy)" >&2
cat "$ok/proxy.log" >&2
exit 1
}
echo "OK: https tunneled through proxy via CONNECT"
# 2. authenticated proxy: creds ride the CONNECT, and NEVER reach the origin
: >"$ok/proxy.log"
: >"$ok/origin-headers.log"
run_crawl "$ok/out2" "user:secret@127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out2" || {
echo "FAIL: origin page not downloaded through authenticated proxy" >&2
exit 1
}
got=$(awk '/^AUTH Basic /{print $3}' "$ok/proxy.log" | head -1)
# base64("user:secret"); compared as a literal to stay portable (no base64 -d,
# which differs between GNU and BSD)
test "$got" == "dXNlcjpzZWNyZXQ=" || {
echo "FAIL: Proxy-Authorization not carried on CONNECT (got '$got')" >&2
cat "$ok/proxy.log" >&2
exit 1
}
if grep -qi "proxy-authorization" "$ok/origin-headers.log"; then
echo "FAIL: proxy credentials leaked to the origin through the tunnel" >&2
cat "$ok/origin-headers.log" >&2
exit 1
fi
echo "OK: proxy credentials carried on CONNECT, not leaked to origin"
# --- hostile proxy ----------------------------------------------------------
# A proxy that answers 200 then streams headers forever must not hang the crawl:
# the client bounds the response. run_crawl kills a hung httrack after 60s, so a
# missing bound surfaces as $HANG_RC here.
flood="$tmpdir/flood"
start_server "$flood" flood
rc=0
run_crawl "$flood/out" "127.0.0.1:${proxy_port}" "$origin_port" || rc=$?
test "$rc" -ne "$HANG_RC" || {
echo "FAIL: crawl hung on a flooding proxy (bounded read missing)" >&2
exit 1
}
grep -rq "ORIGIN-PAGE-85" "$flood/out" 2>/dev/null && {
echo "FAIL: flooding proxy unexpectedly served the page" >&2
exit 1
}
echo "OK: bounded proxy response, no hang on a flooding proxy"

View File

@@ -2,6 +2,7 @@
# explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would
# silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT =
@@ -24,6 +25,7 @@ TESTS = \
01_engine-cache-golden.test \
01_engine-charset.test \
01_engine-cmdline.test \
01_engine-cookies.test \
01_engine-copyopt.test \
01_engine-doitlog.test \
01_engine-entities.test \
@@ -43,6 +45,7 @@ TESTS = \
11_crawl-international.test \
11_crawl-longurl.test \
11_crawl-parsing.test \
12_crawl_https.test
12_crawl_https.test \
13_crawl_proxy_https.test
CLEANFILES = check-network_sh.cache

151
tests/proxy-https-server.py Normal file
View File

@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Local CONNECT proxy + self-signed HTTPS origin for the issue #85 test.
Starts a TLS origin server and an HTTP proxy that honours CONNECT, on ephemeral
ports. Every request line the proxy receives (and any Proxy-Authorization) is
appended to the proxy log; every header the origin receives over the tunnel is
appended to the origin log. That lets the test assert both that an https crawl
tunneled through the proxy and that proxy credentials never leaked to the origin.
Proxy modes (argv[3], default "ok"):
ok - honour CONNECT and tunnel to the origin
flood - answer 200 then stream headers forever with no blank line, to exercise
the client's bound on the proxy response (must not hang the crawl)
Usage: proxy-https-server.py <cert.pem> <logdir> [mode]
Prints "ORIGIN <port>", "PROXY <port>", then "ready" (one per line) on stdout.
"""
import http.server
import os
import socket
import socketserver
import ssl
import sys
import threading
ORIGIN_BODY = b"<html><body>ORIGIN-PAGE-85</body></html>"
PROXY_LOG = "proxy.log"
ORIGIN_LOG = "origin-headers.log"
def make_origin(logdir):
class Origin(http.server.BaseHTTPRequestHandler):
def do_GET(self):
with open(os.path.join(logdir, ORIGIN_LOG), "a") as handle:
for key in self.headers.keys():
handle.write(key + "\n")
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.send_header("Content-Length", str(len(ORIGIN_BODY)))
self.end_headers()
self.wfile.write(ORIGIN_BODY)
def log_message(self, *args):
pass
return Origin
def start_origin(certfile, logdir):
httpd = socketserver.TCPServer(("127.0.0.1", 0), make_origin(logdir))
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile)
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
port = httpd.socket.getsockname()[1]
threading.Thread(target=httpd.serve_forever, daemon=True).start()
return port
def pipe(src, dst):
try:
while True:
data = src.recv(65536)
if not data:
break
dst.sendall(data)
except OSError:
pass
finally:
for sock in (src, dst):
try:
sock.shutdown(socket.SHUT_RDWR)
except OSError:
pass
def handle_client(conn, logdir, mode):
rfile = conn.makefile("rb")
request_line = rfile.readline().decode("latin-1").strip()
auth = None
while True:
line = rfile.readline().decode("latin-1")
if line in ("\r\n", "\n", ""):
break
key, _, value = line.partition(":")
if key.strip().lower() == "proxy-authorization":
auth = value.strip()
with open(os.path.join(logdir, PROXY_LOG), "a") as handle:
handle.write(request_line + "\n")
if auth is not None:
handle.write("AUTH " + auth + "\n")
parts = request_line.split()
if not (len(parts) >= 2 and parts[0] == "CONNECT"):
conn.sendall(b"HTTP/1.0 501 Not Implemented\r\n\r\n")
conn.close()
return
if mode == "flood":
# 200, then an endless header stream with no terminating blank line: the
# client must bound this and give up, not hang.
try:
conn.sendall(b"HTTP/1.0 200 Connection established\r\n")
while True:
conn.sendall(b"X-Pad: 0123456789\r\n")
except OSError:
pass
conn.close()
return
host, _, port = parts[1].partition(":")
try:
upstream = socket.create_connection((host, int(port or 443)))
except OSError:
conn.sendall(b"HTTP/1.0 502 Bad Gateway\r\n\r\n")
conn.close()
return
conn.sendall(b"HTTP/1.0 200 Connection established\r\n\r\n")
threading.Thread(target=pipe, args=(conn, upstream), daemon=True).start()
pipe(upstream, conn)
def start_proxy(logdir, mode):
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(16)
port = srv.getsockname()[1]
def serve():
while True:
conn, _ = srv.accept()
threading.Thread(
target=handle_client, args=(conn, logdir, mode), daemon=True
).start()
threading.Thread(target=serve, daemon=True).start()
return port
def main():
certfile, logdir = sys.argv[1], sys.argv[2]
mode = sys.argv[3] if len(sys.argv) > 3 else "ok"
for name in (PROXY_LOG, ORIGIN_LOG):
open(os.path.join(logdir, name), "w").close()
origin_port = start_origin(certfile, logdir)
proxy_port = start_proxy(logdir, mode)
print("ORIGIN %d" % origin_port, flush=True)
print("PROXY %d" % proxy_port, flush=True)
print("ready", flush=True)
threading.Event().wait()
if __name__ == "__main__":
main()