Follow-up to the CONNECT-tunnel change, from an adversarial review (the proxy
response is hostile input: a malicious or MITM proxy controls every byte).
- Bound the response read so a proxy cannot stall the single-threaded back_wait
crawl: proxy_getline now fails on an over-long line instead of consuming it
forever, the header drain is capped at 64 lines, and the send loop gives up
rather than spin against a socket that reports writable but never accepts.
- Size `authority` to hold any url_adr host (HTS_URLMAXSIZE*2) so an oversized
hostname can't trip the abort-on-overflow buff helpers; grow `req` to match.
- Reject control bytes in the CONNECT authority as a local backstop; today the
CR/LF defense lives entirely upstream (escape_remove_control / header-line
splitting).
- Test: the origin now records the headers it receives, and the test asserts
Proxy-Authorization never reaches the origin through the tunnel (the previous
assertions couldn't see a leak). Added a flooding-proxy scenario that proves
the crawl terminates instead of hanging on an unbounded response.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
httrack opened https connections straight to the origin even when a proxy
was configured, so --proxy was silently ignored for https and the crawler
used the real IP. http_xfopen bypassed the proxy for any https:// URL,
because the absolute-URI proxy form it uses for http cannot carry https.
Connect to the proxy instead and, once the TCP connection is up, open an
HTTP CONNECT tunnel (http_proxy_tunnel) before the TLS handshake, so TLS
runs end-to-end with the origin. Proxy credentials now ride the CONNECT
request rather than the tunneled GET, where they would leak to the origin.
The exchange is a bounded blocking read inside the back_wait connect path:
no new async state, no struct/ABI change (the helpers stay visibility-hidden).
Verified end-to-end by 13_crawl_proxy_https.test: it crawls a local
self-signed https origin through a logging CONNECT proxy and asserts the
proxy saw the CONNECT and that credentials ride it. The assertion fails on
the pre-fix bypass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>