Compare commits

27 Commits

Author SHA1 Message Date
Xavier Roche
fe7041ddbf Address review: keep empty-PATH parity, fold the CI script list
Review of the array refactor flagged one behaviour divergence: splitting
PATH with `IFS=: read -ra` keeps empty fields (from doubled or leading
colons) as "" elements, where the old `echo $PATH | tr : ' '` word-split
dropped them, so the search loop would probe /htsserver. Skip the empty
fields to restore exact parity.

Also reflow the CI SHELL_SCRIPTS list as a folded block scalar, one
entry per line and sorted, so it reads cleanly; the folded value is the
same space-separated string.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 12:39:31 +02:00
Xavier Roche
f5543df1af ci: lint every shell script with shellcheck and shfmt
The lint job only covered a handful of scripts; bootstrap, build.sh, the
generators, webhttrack, the CGI search helper and the crawl/run-all test
harnesses went unchecked, and shfmt ran on three files. Now both linters
run over the whole tracked shell tree, listed once in a job-level env var
so the two steps stay in sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:37:09 +02:00
Xavier Roche
fee30aa95d Make every shell script shellcheck-clean
Fix the shellcheck findings the shfmt pass left behind, all proven
behaviour-preserving:

- Quote single-value expansions, drop the redundant ${} in arithmetic,
  add read -r, and use printf '%s' instead of variables in format
  strings, across the generators, crawl-test.sh, run-all-tests.sh and
  search.sh.
- crawl-test.sh / webhttrack: turn the deliberately word-split search
  lists into bash arrays (space-safe, no scattered disables) and replace
  the numeric trap signal lists with names, dropping the un-trappable
  KILL/STOP that bash silently ignored anyway.
- search.sh: drop the bogus \" escapes that made grep search for a
  literal-quoted pattern.

The generators are exercised by hand and ship their committed output
(htscodepages.h, htsentities.h); a differential run on synthetic input
confirms byte-identical output before and after. crawl-test.sh and
webhttrack were run end to end against a local server / a faked install,
the latter also proving the array search now survives spaces in paths.
SC2153/SC2120 false positives carry a scoped disable with a reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:35:55 +02:00
Xavier Roche
f9f4700ee1 Reformat every shell script with shfmt -i 4
Mechanical pass: run shfmt -i 4 over the whole tracked shell tree (the
test harness .test files, the regen generators, webhttrack, the CGI
search helper, and the build/dist scripts) so they share one style.
shfmt also normalised backticks to $(...) and $[..] to $((..)).

No behaviour change: arithmetic is preserved exactly, non-ASCII bytes
are untouched, and the full make check suite still passes. The tab
indented .test files become 4-space indented, hence the wide diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:24:01 +02:00
Xavier Roche
f030fa21e3 Merge pull request #401 from xroche/fix/relative-path-dotdot-137-162
Test the relative-link engine; collapse ../ in file:// URLs
2026-06-20 11:15:53 +02:00
Xavier Roche
bdd1c1bc2c Test the relative-link engine; collapse ../ in file:// URLs
The ../-handling tickets #137 (embedded ../ in a URL) and #162 (cross-host
"too many ../") do not reproduce on master or the released 3.49.x: the engine
has resolved embedded, cross-host, out-of-scope and above-root ../ correctly
since the 2012 import, and the released binary behaves identically. #137's
actual breakage was a JS-generated iframe URL (httrack can't rewrite
dynamically-built links); #162 is a long-gone Windows path quirk.

The area was nearly untested, though, despite feeding both link rewriting and
crawl-scope decisions: two trivial lienrelatif asserts, none for
ident_url_relatif. Add a wide regression net via two hidden debug probes
(-#l lienrelatif, -#i ident_url_relatif, mirroring -#1 fil_simplifie) driving
tens of cases in tests/01_engine-relative.test (embedded/cross-host/sibling/
ancestor/above-root ../, query stripping, scheme handling), plus the missing
fil_simplifie edge cases (absolute paths, root clamp, query freeze) in
01_engine-simplify.test. Expected values are computed by hand, not echoed.

While covering it, fixed one real gap: the file:// branch of
ident_url_absolute skipped the fil_simplifie its http sibling runs, so file://
URLs kept their ../ in adrfil->fil while the save path was already collapsed
(htsname.c:1343). Collapsing it matches the other schemes, contains traversal
at the file:// root, and dedups a/../b against b.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:14:28 +02:00
Xavier Roche
56665a268f Merge pull request #400 from xroche/fix/css-url-paren-163
Encode parens in rewritten CSS url() so the value isn't truncated (#163)
2026-06-20 10:02:32 +02:00
Xavier Roche
2e948b9acd htsparse: percent-encode parens in rewritten CSS url() (#163)
A source url(...) whose target encodes '(' ')' as %28/%29 was rewritten
with literal parens, because they are RFC2396 "mark" characters that the
URI escaper (escape_uri_utf, mode 30) leaves alone. In an unquoted CSS
url(...) the literal ')' closes the token early, so the browser mis-parses
the value and drops the background image.

Re-escape '(' and ')' back to %28/%29 when emitting the link, gated on the
url() context (ending_p == ')'). The UA decodes them to the saved-on-disk
name, so the reference still resolves. Quoted url("...") and ordinary HTML
attributes keep their parens, matching prior behavior.
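
The re-escaping step can be sketched as below. This is a minimal illustration, not the htsparse code: css_url_escape_parens and its signature are hypothetical, and the caller is assumed to have already decided that the link sits in an unquoted url() context.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not httrack's actual code): copy `src` into `dst`,
   writing '(' as %28 and ')' as %29 so that an unquoted CSS url(...) token
   is not closed early. `dstsize` includes room for the terminating NUL.
   Returns 0 on success, -1 if `dst` is too small. */
static int css_url_escape_parens(const char *src, char *dst, size_t dstsize) {
  size_t o = 0;

  for (; *src != '\0'; src++) {
    const char *rep = NULL;

    if (*src == '(')
      rep = "%28";
    else if (*src == ')')
      rep = "%29";

    if (rep != NULL) {
      if (o + 3 >= dstsize) /* three bytes plus the final NUL must fit */
        return -1;
      memcpy(dst + o, rep, 3);
      o += 3;
    } else {
      if (o + 1 >= dstsize) /* one byte plus the final NUL must fit */
        return -1;
      dst[o++] = *src;
    }
  }
  if (o >= dstsize)
    return -1;
  dst[o] = '\0';
  return 0;
}
```

Existing %XX sequences pass through untouched, so an already-encoded %20 stays as it is.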

Test in 01_engine-parse.test crawls a CSS fixture whose url() references a
%20%28...%29 name and asserts the rewrite keeps the parens encoded;
negative control confirmed (literal-paren output fails it).

Closes #163

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 10:01:17 +02:00
Xavier Roche
cae11499f1 Merge pull request #399 from xroche/fix/js-string-falsepos-218
htsparse: don't treat XHR.open's method argument as a URL (#218)
2026-06-19 20:36:26 +02:00
Xavier Roche
02c7f4ebf6 htsparse: don't treat XHR.open's method argument as a URL (#218)
The JavaScript URL detector matched `.open(` for window.open("url",...)
and captured the first argument as a link. XMLHttpRequest.open(method,
url) puts the HTTP method first, so `xhr.open("GET", "ajax_info.txt")`
turned "GET" into a bogus link, rewritten to "GET.html" on a live server.

Reject a first argument that is exactly an HTTP method, mirroring the
existing ensure_not_mime guard. window.open(url) is unaffected; the real
XHR url (the second argument) is still picked up by the dirty parser.
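
A guard of this kind might look like the sketch below. The function name, the method list and the case-sensitive comparison are assumptions made for illustration; the commit only states that a first argument that is exactly an HTTP method is rejected.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: return 1 when `arg` is exactly an HTTP method name,
   so the caller can refuse to treat it as a link. The list of methods and
   the case-sensitive match are assumptions, not taken from httrack. */
static int looks_like_http_method(const char *arg) {
  static const char *const methods[] = {"GET",     "POST",  "HEAD",
                                        "PUT",     "DELETE", "OPTIONS",
                                        "PATCH",   "TRACE", "CONNECT"};
  size_t i;

  for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++) {
    if (strcmp(arg, methods[i]) == 0)
      return 1;
  }
  return 0;
}
```

Because the match is exact, a real file name such as "GET.html" or "ajax_info.txt" is still accepted as a link.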

Closes #218

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 20:27:04 +02:00
Xavier Roche
9070b44a70 Merge pull request #398 from xroche/fix/html-underflow-396
htsparse: fix buffer underflow reading *(html-1) at offset 0 (#396)
2026-06-19 19:55:40 +02:00
Xavier Roche
799c045061 htsparse: don't read *(html-1) before the parse buffer (#396)
The link detector's word-boundary guards dereference *(html-1) to check
the byte preceding a matched token. When the token sits at the very start
of the parse buffer (html == r->adr), that reads one byte before the
allocation: a heap-buffer-overflow under ASan, silent on a normal build.
A stylesheet beginning with a url() token is enough to hit it.

Route the three reachable guards (url(), location=, the makeindex /title
check) through html_prevc(), which returns a space sentinel at the buffer
start. Space is the right value for these tests: a token at offset 0 is at
a word boundary, so it stays a valid match. The other *(html-1) sites only
run after html has advanced past an opening tag or quote.
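
The sentinel idea can be sketched as follows. The real html_prevc() signature is not shown in this diff, so the name and parameters below are hypothetical.

```c
/* Hypothetical sketch of a bounds-aware "previous byte" accessor.
   `start` is the first byte of the parse buffer, `p` the current position
   (start <= p). At the buffer start there is no previous byte, so a space
   is returned: a token at offset 0 then counts as sitting at a word
   boundary. */
static char prev_char_or_space(const char *start, const char *p) {
  return p > start ? p[-1] : ' ';
}
```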

Covers it with an offset-0 url() fixture in 01_engine-parse.test; without
the fix it aborts at htsparse.c:1386 under the CI sanitizer job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:44:25 +02:00
Xavier Roche
fb1ee3bf2e Merge pull request #397 from xroche/fix/css-import-94
CSS @import: capture URLs that carry a media/supports/layer condition (#94)
2026-06-19 19:30:21 +02:00
Xavier Roche
6a08ca7d39 htsparse: bound the URL-end scan against a missing closing delimiter
Reviewing the @import change, ASan flagged a pre-existing heap overflow:
when a quoted/parenthesized link token has no closing delimiter before the
buffer ends (truncated input such as `@import "x`, `@import "`, `url("x`),
the scan stops at the terminating NUL, then `c += ndelim` steps past it and
`while (*c == ' ')` / the terminator test read out of bounds. Such input
aborts under ASan on master.

Skip the URL-end scan and capture when no closing delimiter was found
(`*c == '\0'` right after the scan); c never advances past the NUL.
Well-formed tokens are unaffected.
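
In isolation, the bounded scan amounts to something like this sketch. The function is hypothetical and only shows the rule "stop at the NUL and report that no delimiter was found"; it is not the htsparse loop itself.

```c
#include <stddef.h>

/* Hypothetical sketch: scan from `c` for the closing delimiter `delim`.
   Returns a pointer to the delimiter, or NULL when the terminating NUL is
   reached first (truncated input). The caller must skip the capture on
   NULL and never step past the NUL. */
static const char *find_closing_delim(const char *c, char delim) {
  while (*c != '\0' && *c != delim)
    c++;
  return *c == delim ? c : NULL;
}
```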

01_engine-parse.test gains a truncated-@import fixture (the valid sibling
import is still captured, the unterminated one is not) that trips the
overflow under the CI ASan job, plus a check that an @import's trailing
media/supports/layer condition survives the rewrite verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:25:39 +02:00
Xavier Roche
a8b491e509 htsparse: capture conditional CSS @import URLs (#94)
A bare-string @import carrying a media/supports/layer condition, e.g.
`@import "theme.css" screen;`, was dropped. The detector required the closing
quote to be immediately followed by the statement terminator, so the trailing
condition aborted the capture. The `url(...)` form already worked because it
terminates at the paren.

Two coupled defects in the inscript/CSS detector:
- accept a whitespace-separated trailing condition after a quoted @import URL;
- bound the captured URL at its last content char (b) instead of recomputing
  from the terminator. The old `c -= (ndelim + 1)` mishandled spaces skipped
  before the terminator, leaving the closing quote inside the range so the
  bogus-link guard aborted. That also silently broke `foo="url" ;` (a space
  before the semicolon) for every quoted detection, not only @import.

01_engine-parse.test gains a CSS @import section that crawls a .css directly;
the conditioned cases are negative controls that fail without the fix.

Closes #94

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:46:31 +02:00
Xavier Roche
a8e4bb3b81 Merge pull request #395 from xroche/fix/xmlns-false-links-191
Don't crawl xmlns namespace declarations
2026-06-19 18:28:23 +02:00
Xavier Roche
0145ec37a3 htsparse: don't crawl xmlns namespace declarations (#191)
The "dirty parsing" heuristic accepts any tag attribute whose value looks
like a URL unless the attribute is on the no-detect list. xmlns and
xmlns:prefix declarations carry namespace URIs (xmlns:og="http://ogp.me/ns#",
etc.) that are not resources, so httrack queued and fetched them, stalling
the crawl on unrelated spec URLs. Reject xmlns/xmlns:prefix where the
no-detect list is already consulted.
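
A check covering both forms could be as small as the sketch below. The function name is hypothetical, and the case-sensitive comparison is an assumption; the commit does not say how httrack compares attribute names.

```c
#include <string.h>

/* Hypothetical sketch: return 1 for "xmlns" and for "xmlns:prefix"
   attribute names, 0 otherwise. Case-sensitive by assumption. */
static int is_xmlns_attr(const char *name) {
  if (strncmp(name, "xmlns", 5) != 0)
    return 0;
  return name[5] == '\0' || name[5] == ':';
}
```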

01_engine-parse.test grows a fixture with each form (default and prefixed) as
the last attribute of its element, since the heuristic only inspects an
attribute whose value is immediately followed by '>'; the targets are local
file:// gifs so a regression actually downloads them (verified: reverting the
guard fetches all three).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:24:55 +02:00
Xavier Roche
a80fab38ba Merge pull request #394 from xroche/fix/proxy-https-connect-85
Tunnel https through the proxy via CONNECT (#85)
2026-06-19 18:03:31 +02:00
Xavier Roche
c52a524a63 htslib: bound the proxy CONNECT response; harden + cover review findings
Follow-up to the CONNECT-tunnel change, from an adversarial review (the proxy
response is hostile input: a malicious or MITM proxy controls every byte).

- Bound the response read so a proxy cannot stall the single-threaded back_wait
  crawl: proxy_getline now fails on an over-long line instead of consuming it
  forever, the header drain is capped at 64 lines, and the send loop gives up
  rather than spin against a socket that reports writable but never accepts.
- Size `authority` to hold any url_adr host (HTS_URLMAXSIZE*2) so an oversized
  hostname can't trip the abort-on-overflow buff helpers; grow `req` to match.
- Reject control bytes in the CONNECT authority as a local backstop; today the
  CR/LF defense lives entirely upstream (escape_remove_control / header-line
  splitting).
- Test: the origin now records the headers it receives, and the test asserts
  Proxy-Authorization never reaches the origin through the tunnel (the previous
  assertions couldn't see a leak). Added a flooding-proxy scenario that proves
  the crawl terminates instead of hanging on an unbounded response.
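
The two read bounds from the first item (fail on an over-long line, cap the header drain at 64 lines) can be sketched against an in-memory buffer. Everything here is illustrative: the real proxy_getline reads from a socket, and the function names and the 256-byte line buffer are assumptions.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_HEADER_LINES = 64 }; /* cap taken from the commit message */

/* Hypothetical sketch: copy one line (without its CR/LF) from `in` into
   `line`. Returns the number of input bytes consumed, or -1 if the line
   does not fit in `linesize` or no newline is found. */
static long bounded_getline(const char *in, size_t inlen, char *line,
                            size_t linesize) {
  size_t i;

  if (linesize == 0)
    return -1;
  for (i = 0; i < inlen; i++) {
    if (in[i] == '\n') {
      size_t n = i;

      if (n > 0 && in[n - 1] == '\r')
        n--;
      memcpy(line, in, n);
      line[n] = '\0';
      return (long) (i + 1);
    }
    if (i + 1 >= linesize) /* over-long line: fail instead of consuming */
      return -1;
  }
  return -1; /* incomplete line */
}

/* Drain header lines up to the blank line that ends them. Returns 0 on
   success, -1 on an over-long line, a missing terminator, or more than
   MAX_HEADER_LINES lines. */
static int drain_headers(const char *in, size_t inlen) {
  char line[256]; /* assumed line limit */
  int count;

  for (count = 0; count < MAX_HEADER_LINES; count++) {
    const long used = bounded_getline(in, inlen, line, sizeof(line));

    if (used < 0)
      return -1;
    in += used;
    inlen -= (size_t) used;
    if (line[0] == '\0') /* blank line: end of headers */
      return 0;
  }
  return -1; /* too many header lines */
}
```

Either limit on its own leaves a hole: a line cap without a count cap still lets a proxy send short lines forever, and the reverse lets it send one endless line.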

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 09:52:10 +02:00
Xavier Roche
1907621d37 htslib: tunnel https through the proxy via CONNECT (#85)
httrack opened https connections straight to the origin even when a proxy
was configured, so --proxy was silently ignored for https and the crawler
used the real IP. http_xfopen bypassed the proxy for any https:// URL,
because the absolute-URI proxy form it uses for http cannot carry https.

Connect to the proxy instead and, once the TCP connection is up, open an
HTTP CONNECT tunnel (http_proxy_tunnel) before the TLS handshake, so TLS
runs end-to-end with the origin. Proxy credentials now ride the CONNECT
request rather than the tunneled GET, where they would leak to the origin.
The exchange is a bounded blocking read inside the back_wait connect path:
no new async state, no struct/ABI change (the helpers stay visibility-hidden).
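
For reference, the request that opens such a tunnel has roughly the shape built below. This is a sketch only: format_connect is a hypothetical name, and the HTTP/1.1 request line, the Host header and the Basic scheme are assumptions about what http_proxy_tunnel sends. The point it illustrates is that the credentials appear in the CONNECT request and nowhere else.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch: format a CONNECT request for `host`:`port` into
   `req`. `basic_credentials` is an already base64-encoded "user:pass"
   string, or NULL when the proxy needs no authentication.
   Returns 0 on success, -1 if `req` is too small. */
static int format_connect(char *req, size_t reqsize, const char *host,
                          int port, const char *basic_credentials) {
  int n;

  if (basic_credentials != NULL)
    n = snprintf(req, reqsize,
                 "CONNECT %s:%d HTTP/1.1\r\n"
                 "Host: %s:%d\r\n"
                 "Proxy-Authorization: Basic %s\r\n"
                 "\r\n",
                 host, port, host, port, basic_credentials);
  else
    n = snprintf(req, reqsize,
                 "CONNECT %s:%d HTTP/1.1\r\n"
                 "Host: %s:%d\r\n"
                 "\r\n",
                 host, port, host, port);
  return (n < 0 || (size_t) n >= reqsize) ? -1 : 0;
}
```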

Verified end-to-end by 13_crawl_proxy_https.test: it crawls a local
self-signed https origin through a logging CONNECT proxy and asserts the
proxy saw the CONNECT and that credentials ride it. The assertion fails on
the pre-fix bypass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:43:56 +02:00
Xavier Roche
3b2d7afdaa Merge pull request #393 from xroche/fix/empty-footer-doitlog-106
Keep empty quoted args when reloading doit.log (#106)
2026-06-19 08:13:19 +02:00
Xavier Roche
6ee539619e htscoremain: keep empty quoted args when reloading doit.log (#106)
An empty footer (-%F "") is written to hts-cache/doit.log correctly as the
two-character token "", and next_token() unquotes it back to an empty string.
But the doit.log reload loop only re-inserted a token when strnotempty(lastp),
which dropped the empty one. With its argument gone, -%F absorbed the following
token (or had none), so a no-url --continue/--update reprise misparsed and
failed.

Track whether the token started with a quote (before next_token() strips it in
place) and keep it even when empty, so "" survives the round-trip. Whitespace
gaps still produce no token, so spacing behavior is unchanged.
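
The round-trip rule (a quoted token is kept even when empty, a run of spaces is not a token) is shown by the simplified splitter below. It is not next_token(): split_args is hypothetical and handles no escapes.

```c
#include <stddef.h>

/* Hypothetical sketch: split `line` in place on spaces into at most `max`
   tokens. A token that starts with a double quote runs to the closing
   quote and is kept even when empty, so "" yields an empty-string
   argument. Whitespace gaps never yield a token. Returns the count. */
static int split_args(char *line, char **argv, int max) {
  int argc = 0;
  char *p = line;

  while (*p != '\0' && argc < max) {
    char *start;

    while (*p == ' ') /* skip the gap between tokens */
      p++;
    if (*p == '\0')
      break;
    if (*p == '"') {
      start = ++p;
      while (*p != '\0' && *p != '"')
        p++;
    } else {
      start = p;
      while (*p != '\0' && *p != ' ')
        p++;
    }
    if (*p != '\0')
      *p++ = '\0';
    argv[argc++] = start; /* kept even when start is an empty string */
  }
  return argc;
}
```

With the line `-%F "" -r2` this yields three arguments, the second one empty, so -r2 does not slide into the slot that belongs to -%F.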

01_engine-doitlog.test gains a scenario that mirrors with -%F "" -r2, then on
the no-url reprise checks the regenerated doit.log still round-trips the empty
token -- probing the reader's rebuilt argv, not just that the reprise didn't
crash. The trailing -r2 makes a dropped-token bug visible (it shifts into -%F's
slot and panics) rather than a harmless run off the end of argv. Reverting only
the guard makes the scenario fail (reprise exits 255).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:09:57 +02:00
Xavier Roche
fb098b27b4 Merge pull request #392 from xroche/fix/cookie-rfc6265-151
Drop $Version/$Path from the request Cookie header (#151)
2026-06-18 22:42:47 +02:00
Xavier Roche
5f6a3fb917 htslib: drop $Version/$Path from request Cookie header (#151)
The request "Cookie:" header was built in the obsolete RFC 2965 style,
emitting "$Version=1" before the first cookie and a "$Path=..." attribute
after every value:

  Cookie: $Version=1; name=value; $Path=/; has_js=1; $Path=/

Servers expecting RFC 6265 treat $Version and $Path as stray cookies and
reject or misread the request. Emit bare name=value pairs joined by "; ":

  Cookie: name=value; has_js=1

The cookie loop is factored out of http_sendhead into append_cookie_header
(same logic, same buffer), with a thin http_cookie_header_selftest wrapper
so the exact code path can be unit-tested. A new hidden "-#Q" subcommand
builds the header for two same-domain cookies plus one on a different
domain (which must be filtered out) and checks the output is the clean
RFC 6265 form with no $Version/$Path and no cross-domain leak; driven by
tests/01_engine-cookies.test.
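
The target format is simple enough to sketch on its own. The code below is not append_cookie_header: the struct, the function name and the trailing CRLF are assumptions, and the domain/path filtering the real loop performs is left out.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical cookie representation, for illustration only. */
struct cookie_pair {
  const char *name;
  const char *value;
};

/* Hypothetical sketch: build an RFC 6265 style request header, i.e. bare
   name=value pairs joined by "; ", with no $Version or $Path attributes.
   Writes an empty string when there are no cookies.
   Returns 0 on success, -1 if `out` is too small. */
static int build_cookie_header(const struct cookie_pair *cookies,
                               size_t count, char *out, size_t outsize) {
  size_t used = 0;
  size_t i;

  if (outsize == 0)
    return -1;
  out[0] = '\0';
  if (count == 0)
    return 0;
  for (i = 0; i < count; i++) {
    const int w = snprintf(out + used, outsize - used, "%s%s=%s",
                           i == 0 ? "Cookie: " : "; ", cookies[i].name,
                           cookies[i].value);

    if (w < 0 || (size_t) w >= outsize - used)
      return -1;
    used += (size_t) w;
  }
  if (outsize - used < 3) /* CR, LF and the terminating NUL */
    return -1;
  out[used++] = '\r';
  out[used++] = '\n';
  out[used] = '\0';
  return 0;
}
```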

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 22:12:28 +02:00
Xavier Roche
f9e676dbe3 Merge pull request #391 from xroche/feature/api-enum-callsites-savename83
htsopt: name the savename_83 enum and finish the call-site constant adoption
2026-06-18 21:43:34 +02:00
Xavier Roche
1b440c44b5 htsopt: name savename_83 enum and adopt enum constants at call sites
Type opt->savename_83 as a new hts_savename_83 enum (LONG/DOS/ISO9660 =
0/1/2) and replace the remaining magic-number literals for the already-
typed verbosedisplay and savename_delayed fields with their named enum
constants across the engine.

Behavior-preserving: every constant equals the literal it replaces, and a
C enum is int-sized, so struct layout is unchanged (sizeof(httrackp) and
offsetof(savename_83) are identical to origin/master, no soname bump). The
-L option block is deliberately reflowed to clang-format style, which is
what made the savename_83 retype tractable. Bitmask fields (travel/seeker/
getmode/parsejava/hostcontrol) intentionally stay int with named bit enums,
per the existing flags-as-enum split.
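
As a compact illustration of the retype: the enum values below follow this message (LONG/DOS/ISO9660 = 0/1/2) and the mapping follows the -L option block in the diff, where the command-line digit is not the stored value. The helper function savename_83_from_option is hypothetical; the engine does this inline.

```c
/* Values as stated in the commit message. */
typedef enum {
  HTS_SAVENAME_83_LONG = 0,
  HTS_SAVENAME_83_DOS = 1,
  HTS_SAVENAME_83_ISO9660 = 2
} hts_savename_83;

/* Hypothetical helper mirroring the -L switch: -L0 selects 8-3 names,
   -L1 selects long names, anything else selects ISO9660. */
static hts_savename_83 savename_83_from_option(int n) {
  switch (n) {
  case 0:
    return HTS_SAVENAME_83_DOS;
  case 1:
    return HTS_SAVENAME_83_LONG;
  default:
    return HTS_SAVENAME_83_ISO9660;
  }
}
```

Note the inversion for 0 and 1: the option digit and the enum value only coincide for 2.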

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 21:03:33 +02:00
Xavier Roche
ac6dd1a570 Merge pull request #390 from xroche/fix/copy-htsopt-unsigned-enum-guards
copy_htsopt silently drops boolean option fields
2026-06-18 20:46:00 +02:00
37 changed files with 1651 additions and 622 deletions

@@ -320,6 +320,21 @@ jobs:
 lint:
 name: lint (shellcheck, shfmt)
 runs-on: ubuntu-24.04
+# Every tracked shell script; the globs expand at run time. Kept here so the
+# shellcheck and shfmt steps below cannot drift apart.
+env:
+SHELL_SCRIPTS: >-
+.githooks/pre-commit
+bootstrap
+build.sh
+html/div/search.sh
+man/makeman.sh
+src/htsbasiccharsets.sh
+src/htsentities.sh
+src/webhttrack
+tests/*.sh
+tests/*.test
+tools/mkdeb.sh
 steps:
 - uses: actions/checkout@v6
@@ -332,12 +347,11 @@ jobs:
 sudo apt-get install -y --no-install-recommends shellcheck shfmt
 shfmt --version
-# Lint the scripts we maintain; the legacy scripts are a separate cleanup.
 - name: shellcheck
-run: shellcheck man/makeman.sh tools/mkdeb.sh .githooks/pre-commit tests/*.test tests/check-network.sh
+run: shellcheck $SHELL_SCRIPTS
 - name: shfmt
-run: shfmt -d -i 4 man/makeman.sh tools/mkdeb.sh .githooks/pre-commit
+run: shfmt -d -i 4 $SHELL_SCRIPTS
 # Check clang-format on CHANGED LINES ONLY. The engine predates clang-format
 # (it was shaped by an old Visual Studio formatter) and does not round-trip,

@@ -1,8 +1,7 @@
#!/bin/sh
# Simple indexing test using HTTrack
# A "real" script/program would use advanced search, and
# A "real" script/program would use advanced search, and
# use dichotomy to find the word in the index.txt file
# This script is really basic and NOT optimized, and
# should not be used for professional purpose :)
@@ -11,50 +10,49 @@ TESTSITE="http://localhost/"
# Create an index if necessary
if ! test -f "index.txt"; then
echo "Building the index .."
rm -rf test
httrack --display "$TESTSITE" -%I -O test
mv test/index.txt ./
echo "Building the index .."
rm -rf test
httrack --display "$TESTSITE" -%I -O test
mv test/index.txt ./
fi
# Convert crlf to lf
if test "`head index.txt -n 1 | tr '\r' '#' | grep -c '#'`" = "1"; then
echo "Converting index to Unix LF style (not CR/LF) .."
mv -f index.txt index.txt.old
cat index.txt.old|tr -d '\r' > index.txt
if test "$(head index.txt -n 1 | tr '\r' '#' | grep -c '#')" = "1"; then
echo "Converting index to Unix LF style (not CR/LF) .."
mv -f index.txt index.txt.old
tr -d '\r' <index.txt.old >index.txt
fi
keyword=-
while test -n "$keyword"; do
printf "Enter a keyword: "
read keyword
printf "Enter a keyword: "
read -r keyword
if test -n "$keyword"; then
FOUNDK="`grep -niE \"^$keyword\" index.txt`"
if test -n "$keyword"; then
FOUNDK="$(grep -niE "^$keyword" index.txt)"
if test -n "$FOUNDK"; then
if ! test `echo "$FOUNDK"|wc -l` = "1"; then
# Multiple matches
printf "Found multiple keywords: "
echo "$FOUNDK"|cut -f2 -d':'|tr '\n' ' '
echo ""
echo "Use keyword$ to find only one"
else
# One match
N=`echo "$FOUNDK"|cut -f1 -d':'`
PM=`tail +$N index.txt|grep -nE "\("|head -n 1`
if ! echo "$PM"|grep "ignored">/dev/null; then
M=`echo $PM|cut -f1 -d':'`
echo "Found in:"
cat index.txt | tail "+$N" | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
else
echo "keyword ignored (too many hits)"
fi
fi
else
echo "not found"
fi
if test -n "$FOUNDK"; then
if ! test "$(echo "$FOUNDK" | wc -l)" = "1"; then
# Multiple matches
printf "Found multiple keywords: "
echo "$FOUNDK" | cut -f2 -d':' | tr '\n' ' '
echo ""
echo "Use keyword$ to find only one"
else
# One match
N=$(echo "$FOUNDK" | cut -f1 -d':')
PM=$(tail "+$N" index.txt | grep -nE "\(" | head -n 1)
if ! echo "$PM" | grep "ignored" >/dev/null; then
M=$(echo "$PM" | cut -f1 -d':')
echo "Found in:"
tail "+$N" index.txt | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
else
echo "keyword ignored (too many hits)"
fi
fi
else
echo "not found"
fi
fi
fi
done

@@ -2532,8 +2532,26 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 #if HTS_USEOPENSSL
 /* SSL mode */
 if (back[i].r.ssl) {
+int tunnel_ok = 1;
+// https via proxy: CONNECT-tunnel before TLS (#85)
+if (back[i].r.req.proxy.active && back[i].r.ssl_con == NULL) {
+const int timeout = back[i].timeout > 0 ? back[i].timeout : 30;
+tunnel_ok =
+http_proxy_tunnel(opt, &back[i].r, back[i].url_adr, timeout);
+if (!tunnel_ok) {
+if (!strnotempty(back[i].r.msg))
+strcpybuff(back[i].r.msg, "proxy CONNECT failed");
+deletehttp(&back[i].r);
+back[i].r.soc = INVALID_SOCKET;
+back[i].r.statuscode = STATUSCODE_NON_FATAL;
+back[i].status = STATUS_READY;
+back_set_finished(sback, i);
+}
+}
 // handshake not yet launched
-if (!back[i].r.ssl_con) {
+if (tunnel_ok && !back[i].r.ssl_con) {
 SSL_CTX_set_options(openssl_ctx, SSL_OP_ALL);
 // new session
 back[i].r.ssl_con = SSL_new(openssl_ctx);
@@ -2551,7 +2569,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 back[i].r.statuscode = STATUSCODE_SSL_HANDSHAKE;
 }
 /* Error */
-if (back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
+if (tunnel_ok && back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
 strcpybuff(back[i].r.msg, "bad SSL/TLS handshake");
 deletehttp(&back[i].r);
 back[i].r.soc = INVALID_SOCKET;
@@ -3838,7 +3856,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 /* funny log for commandline users */
 //if (!opt->quiet) {
 // petite animation
-if (opt->verbosedisplay == 1) {
+if (opt->verbosedisplay == HTS_VERBOSE_SIMPLE) {
 if (back[i].status == STATUS_READY) {
 if (back[i].r.statuscode == HTTP_OK)
 printf("* %s%s (" LLintP " bytes) - OK" VT_CLREOL "\r",

@@ -3,57 +3,59 @@
# Change this to download files
if false; then
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
fi
# Produce code
printf "/** GENERATED FILE ($0), DO NOT EDIT **/\n\n"
for i in *.TXT ; do
echo "processing $i" >&2
grep -vE "^(#|$)" $i | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' | \
(
unset arr
while read LINE ; do
from=$[$(echo $LINE | cut -f1 -d' ')]
if ! test -n "$from"; then
echo "error with $i" >&2
exit 1
elif test $from -ge 256; then
echo "out-of-range ($LINE) with $i" >&2
exit 1
fi
to=$(echo $LINE | cut -f2 -d' ')
arr[$from]=$to
done
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf "/* Table for $i */\nstatic const hts_UCS4 table_${name}[256] = {\n "
i=0
while test "$i" -lt 256; do
if test "$i" -gt 0; then
printf ", "
if test $[${i}%8] -eq 0; then
printf "\n "
fi
fi
value=${arr[$i]:-0}
printf "0x%04x" $value
i=$[${i}+1]
done
printf " };\n\n"
)
echo "processed $i" >&2
printf '/** GENERATED FILE (%s), DO NOT EDIT **/\n\n' "$0"
for i in *.TXT; do
echo "processing $i" >&2
grep -vE "^(#|$)" "$i" | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' |
(
unset arr
while read -r LINE; do
from=$(($(echo "$LINE" | cut -f1 -d' ')))
if ! test -n "$from"; then
echo "error with $i" >&2
exit 1
elif test $from -ge 256; then
echo "out-of-range ($LINE) with $i" >&2
exit 1
fi
to=$(echo "$LINE" | cut -f2 -d' ')
arr[from]=$to
done
# shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf '/* Table for %s */\nstatic const hts_UCS4 table_%s[256] = {\n ' "$i" "$name"
idx=0
while test "$idx" -lt 256; do
if test "$idx" -gt 0; then
printf ", "
if test $((idx % 8)) -eq 0; then
printf "\n "
fi
fi
value=${arr[$idx]:-0}
printf "0x%04x" "$value"
idx=$((idx + 1))
done
printf " };\n\n"
)
echo "processed $i" >&2
done
# Indexes
printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n"
for i in *.TXT ; do
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf " { \"$(echo $name | tr -d '_')\", table_${name} },\n"
for i in *.TXT; do
# shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf ' { "%s", table_%s },\n' "$(echo "$name" | tr -d '_')" "$name"
done
printf " { NULL, NULL }\n};\n"

@@ -3342,7 +3342,8 @@ int back_fill(struct_back * sback, httrackp * opt, cache_back * cache,
 int ptr, int numero_passe) {
 int n = back_pluggable_sockets(sback, opt);
-if (opt->savename_delayed == 2 && !opt->delayed_cached) /* cancel (always delayed) */
+if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
+!opt->delayed_cached) /* cancel (always delayed) */
 return 0;
 if (n > 0) {
 int p;
@@ -3846,7 +3847,7 @@ int htsAddLink(htsmoduleStruct * str, char *link) {
 a = opt->savename_type;
 b = opt->savename_83;
 opt->savename_type = 0;
-opt->savename_83 = 0;
+opt->savename_83 = HTS_SAVENAME_83_LONG;
 // note: adr,fil peuvent être patchés
 r =
 url_savename(&afs, NULL, NULL, NULL, opt, sback, cache, hashptr, ptr, numero_passe,

@@ -612,12 +612,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
 /* Terminal is a tty, may ask questions and display funny information */
 if (isatty(1)) {
 opt->quiet = 0;
-opt->verbosedisplay = 1;
+opt->verbosedisplay = HTS_VERBOSE_SIMPLE;
 }
 /* Not a tty, no stdin input or funny output! */
 else {
 opt->quiet = 1;
-opt->verbosedisplay = 0;
+opt->verbosedisplay = HTS_VERBOSE_NONE;
 }
 #endif
@@ -953,9 +953,11 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
 p = buff;
 do {
 int insert_after_argc;
+int quoted; /* "" unquotes to empty but is still a real token (#106) */
 // read next
 lastp = p;
+quoted = (p != NULL && *p == '"');
 if (p) {
 p = next_token(p, 1);
 if (p) {
@@ -966,7 +968,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
 /* Insert parameters BUT so that they can be in the same order */
 if (lastp) {
-if (strnotempty(lastp)) {
+if (strnotempty(lastp) || quoted) {
 insert_after_argc = argc - insert_after;
 cmdl_ins(lastp, insert_after_argc, (argv + insert_after), x_argvblk,
 x_argvblk_size, x_ptr);
@@ -1815,24 +1817,22 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
com++;
}
break;
case 'L':
{
sscanf(com + 1, "%d", &opt->savename_83);
switch (opt->savename_83) {
case 0: // 8-3 (ISO9660 L1)
opt->savename_83 = 1;
break;
case 1:
opt->savename_83 = 0;
break;
default: // 2 == ISO9660 (ISO9660 L2)
opt->savename_83 = 2;
break;
}
while(isdigit((unsigned char) *(com + 1)))
com++;
case 'L': {
sscanf(com + 1, "%d", (int *) &opt->savename_83);
switch (opt->savename_83) {
case 0: // 8-3 (ISO9660 L1)
opt->savename_83 = HTS_SAVENAME_83_DOS;
break;
case 1:
opt->savename_83 = HTS_SAVENAME_83_LONG;
break;
default: // 2 == ISO9660 (ISO9660 L2)
opt->savename_83 = HTS_SAVENAME_83_ISO9660;
break;
}
break;
while (isdigit((unsigned char) *(com + 1)))
com++;
} break;
case 's':
if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", (int *) &opt->robots);
@@ -1989,7 +1989,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
 }
 break; // url hack
 case 'v':
-opt->verbosedisplay = 2;
+opt->verbosedisplay = HTS_VERBOSE_FULL;
 if (isdigit((unsigned char) *(com + 1))) {
 sscanf(com + 1, "%d", (int *) &opt->verbosedisplay);
 while(isdigit((unsigned char) *(com + 1)))
@@ -2004,7 +2004,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
 }
 break;
 case 'N':
-opt->savename_delayed = 2;
+opt->savename_delayed = HTS_SAVENAME_DELAYED_HARD;
 if (isdigit((unsigned char) *(com + 1))) {
 sscanf(com + 1, "%d", (int *) &opt->savename_delayed);
 while(isdigit((unsigned char) *(com + 1)))
@@ -2787,6 +2787,47 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 0;
}
break;
case 'l': /* lienrelatif: relative link from curr_fil to link */
if (na + 2 >= argc) {
HTS_PANIC_PRINTF(
"Option #l needs a link and a current-file path");
printf(
"Example: '-#l' 'host/dir/img.gif' 'host/dir/p.html'\n");
htsmain_free();
return -1;
} else {
char s[HTS_URLMAXSIZE * 2];
if (lienrelatif(s, sizeof(s), argv[na + 1], argv[na + 2]) ==
0)
printf("relative=%s\n", s);
else
printf("relative=<ERROR>\n");
htsmain_free();
return 0;
}
break;
case 'i': /* ident_url_relatif: resolve a link -> adr/fil */
if (na + 3 >= argc) {
HTS_PANIC_PRINTF(
"Option #i needs a link, an origin address and file");
printf("Example: '-#i' '../img.gif' 'www.foo.com' "
"'/d/p.html'\n");
htsmain_free();
return -1;
} else {
lien_adrfil af;
const int r = ident_url_relatif(argv[na + 1], argv[na + 2],
argv[na + 3], &af);
if (r == 0)
printf("adr=%s fil=%s\n", af.adr, af.fil);
else
printf("error=%d\n", r);
htsmain_free();
return 0;
}
break;
case '2': // mimedefs
if (na + 1 >= argc) {
HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL");
@@ -3131,6 +3172,43 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
htsmain_free();
return err;
} break;
case 'Q': { // cookie request-header selftest: httrack -#Q
static t_cookie cookie;
char hdr[1024];
/* RFC 6265: bare name=value pairs, no $Version/$Path (#151). */
const char *expected = "Cookie: name=value; has_js=1" H_CRLF;
int err = 0;
const char *dom = "www.example.com";
int added;
cookie.max_len = (int) sizeof(cookie.data);
cookie.data[0] = '\0';
added = cookie_add(&cookie, "name", "value", dom, "/");
added |= cookie_add(&cookie, "has_js", "1", dom, "/");
/* different domain: must be filtered out */
added |= cookie_add(&cookie, "junk", "x", "other.org", "/");
if (added) {
printf("cookie-header: FAIL (cookie_add setup)\n");
htsmain_free();
return 1;
}
http_cookie_header_selftest(&cookie, dom, "/", hdr,
sizeof(hdr));
if (strcmp(hdr, expected) != 0)
err = 1;
if (strstr(hdr, "$Version") != NULL ||
strstr(hdr, "$Path") != NULL)
err = 1;
if (strstr(hdr, "junk") != NULL) // wrong-domain cookie leaked
err = 1;
printf("cookie-header: %s\n", err ? "FAIL" : "OK");
if (err)
printf(" got: %s\n", hdr);
htsmain_free();
return err;
} break;
case '!':
HTS_PANIC_PRINTF
("Option #! is disabled for security reasons");

View File

@@ -33,43 +33,43 @@ EOF
else
GET "${url}"
fi
) \
| grep -E '^<!ENTITY [a-zA-Z0-9_]' \
| sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/'\
| ( \
read A
while test -n "$A"; do
ent="${A%% *}"
code=$(echo "$A"|cut -f2 -d' ')
# compute hash
hash=0
i=0
a=1664525
c=1013904223
m="$[1 << 32]"
while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}"|hexdump -v -e '/1 "%d"')"
hash="$[((${hash}*${a})%(${m})+${d}+${c})%(${m})]"
i=$[${i}+1]
done
echo -e " /* $A */"
echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};"
echo -e " }"
echo -e " break;"
) |
grep -E '^<!ENTITY [a-zA-Z0-9_]' |
sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
(
read -r A
while test -n "$A"; do
ent="${A%% *}"
code=$(echo "$A" | cut -f2 -d' ')
# compute hash
hash=0
i=0
a=1664525
c=1013904223
m="$((1 << 32))"
while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
hash="$((((hash * a) % (m) + d + c) % (m)))"
i=$((i + 1))
done
echo -e " /* $A */"
echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};"
echo -e " }"
echo -e " break;"
# next
read A
done
)
# next
read -r A
done
)
cat <<EOF
}
/* unknown */
return -1;
}
EOF
) > ${dest}
) >${dest}

View File

@@ -644,6 +644,165 @@ T_SOC http_fopen(httrackp * opt, const char *adr, const char *fil, htsblk * reto
return http_xfopen(opt, 0, 1, 1, NULL, adr, fil, retour);
}
// Read a CRLF line from a non-blocking socket (waits up to timeout per recv).
// Returns the line length (0 = empty), or -1 on timeout/EOF/error.
static int proxy_getline(T_SOC soc, char *s, int max, int timeout) {
int j = 0;
for (;;) {
unsigned char ch;
int n;
if (!check_readinput_t(soc, timeout))
return -1; // timed out waiting for data
n = (int) recv(soc, &ch, 1, 0);
if (n == 1) {
if (ch == 13) // CR
continue;
if (ch == 10) // LF: end of line
break;
if (j >= max - 1)
return -1; // line too long: bound the read against a hostile proxy
s[j++] = (char) ch;
} else if (n == 0) {
return -1; // connection closed
} else {
#ifdef _WIN32
if (WSAGetLastError() == WSAEWOULDBLOCK)
continue;
#else
if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
continue;
#endif
return -1;
}
}
s[j] = '\0';
return j;
}
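Review note: the line reader above drops CR, stops at LF, and refuses lines that would not fit. The same rules can be exercised without a socket. The sketch below reads from an in-memory string instead of `recv()`; `read_crlf_line` is a hypothetical stand-in, not the patch's function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory analogue of proxy_getline(): copies one line from
   *cursor into out (at most max - 1 bytes), skipping CR and stopping at LF.
   Returns the line length, or -1 if the input ends first or the line is too
   long. On success *cursor is advanced past the LF. */
static int read_crlf_line(const char **cursor, char *out, int max) {
  const char *p;
  int j = 0;

  assert(cursor != NULL && *cursor != NULL && out != NULL && max > 0);
  p = *cursor;
  for (;;) {
    const char ch = *p;
    if (ch == '\0')
      return -1; /* input ended before the LF (like EOF on the socket) */
    p++;
    if (ch == '\r')
      continue;
    if (ch == '\n')
      break;
    if (j >= max - 1)
      return -1; /* line too long: refuse rather than truncate */
    out[j++] = ch;
  }
  out[j] = '\0';
  *cursor = p;
  return j;
}
```

A return of 0 is the blank line that ends the proxy's header block, which is the condition the tunnel code waits for.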
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout) {
const T_SOC soc = retour->soc;
const char *const host = jump_identification_const(adr); // host[:port]
const char *const portsep = jump_toport_const(adr); // ":port" or NULL
char BIGSTK authority[HTS_URLMAXSIZE * 2];
char BIGSTK req[HTS_URLMAXSIZE * 4 + 1100];
char line[1024];
int code;
if (soc == INVALID_SOCKET)
return 0;
// CONNECT needs an explicit host:port; default the https port
authority[0] = '\0';
if (portsep != NULL)
strlcatbuff(authority, host, sizeof(authority)); // already host:port
else
snprintf(authority, sizeof(authority), "%s:%d", host, 443);
// backstop: never let a stray CR/LF in the host smuggle a second line into
// the CONNECT request (the host is already sanitized upstream)
{
const char *c;
for (c = authority; *c != '\0'; c++) {
if ((unsigned char) *c < ' ') {
strcpybuff(retour->msg, "proxy CONNECT: invalid host");
return 0;
}
}
}
snprintf(req, sizeof(req), "CONNECT %s HTTP/1.0" H_CRLF "Host: %s" H_CRLF,
authority, authority);
// creds go on the CONNECT, not the tunneled origin request
if (link_has_authorization(retour->req.proxy.name)) {
const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name);
char autorisation[1100];
char user_pass[256];
autorisation[0] = user_pass[0] = '\0';
strncatbuff(user_pass, astart, (int) (a - astart) - 1);
strcpybuff(user_pass, unescape_http(OPT_GET_BUFF(opt),
OPT_GET_BUFF_SIZE(opt), user_pass));
code64((unsigned char *) user_pass, (int) strlen(user_pass),
(unsigned char *) autorisation, 0);
strlcatbuff(req, "Proxy-Authorization: Basic ", sizeof(req));
strlcatbuff(req, autorisation, sizeof(req));
strlcatbuff(req, H_CRLF, sizeof(req));
}
strlcatbuff(req, H_CRLF, sizeof(req)); // end of request headers
// raw send: ssl is set, so sendc() would route to TLS
{
const char *p = req;
size_t remain = strlen(req);
int stalls = 0;
while (remain > 0) {
const int n = (int) send(soc, p, (int) remain, 0);
if (n > 0) {
p += n;
remain -= (size_t) n;
stalls = 0;
} else {
#ifdef _WIN32
const int wouldblock = (WSAGetLastError() == WSAEWOULDBLOCK);
#else
const int wouldblock =
(errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR);
#endif
// don't spin forever on a fatal error or an unwritable socket
if (!wouldblock || !check_writeinput_t(soc, timeout) ||
++stalls > 100) {
strcpybuff(retour->msg, "proxy CONNECT: write error");
return 0;
}
}
}
}
// proxy status line: "HTTP/1.x <code> ..."
if (proxy_getline(soc, line, sizeof(line), timeout) < 0) {
strcpybuff(retour->msg, "proxy CONNECT: no response");
return 0;
}
if (sscanf(line, "HTTP/%*d.%*d %d", &code) < 1)
code = 0;
if (code < 200 || code >= 300) {
snprintf(retour->msg, sizeof(retour->msg), "proxy CONNECT refused: %s",
strnotempty(line) ? line : "(no status)");
return 0;
}
// drain headers to the blank line; cap the count so a flooding proxy can't
// stall the crawl
{
int headers = 0;
for (;;) {
const int n = proxy_getline(soc, line, sizeof(line), timeout);
if (n < 0) {
strcpybuff(retour->msg, "proxy CONNECT: truncated response");
return 0;
}
if (n == 0)
break; // blank line: tunnel ready
if (++headers > 64) {
strcpybuff(retour->msg, "proxy CONNECT: too many response headers");
return 0;
}
}
}
return 1;
}
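Review note: the request-building half of the tunnel is easy to check in isolation. A minimal sketch, assuming the caller already has a bare host and a numeric port; `build_connect_request` is a hypothetical helper, and credentials are left out on purpose.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "CONNECT host:port" plus a Host header into
   dst. Returns 1 on success, 0 if the host is empty, contains a control
   character (CR/LF smuggling), the port is out of range, or dst is too
   small. dst is left empty on failure. */
static int build_connect_request(char *dst, size_t size, const char *host,
                                 int port) {
  const unsigned char *c;
  int n;

  assert(dst != NULL && size > 0 && host != NULL);
  dst[0] = '\0';
  if (strlen(host) == 0 || port <= 0 || port > 65535)
    return 0;
  for (c = (const unsigned char *) host; *c != '\0'; c++) {
    if (*c < ' ')
      return 0; /* same backstop as the patch: no control characters */
  }
  n = snprintf(dst, size, "CONNECT %s:%d HTTP/1.0\r\nHost: %s:%d\r\n\r\n",
               host, port, host, port);
  if (n < 0 || (size_t) n >= size) {
    dst[0] = '\0'; /* truncated: do not send a partial request */
    return 0;
  }
  return 1;
}
```

Rejecting a truncated request outright is a design choice of this sketch; the patch sizes its buffer so that the request always fits.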
// open an http connection and send a request
// mode: 0 GET 1 HEAD [2 POST]
// treat: process the header?
@@ -680,14 +839,14 @@ T_SOC http_xfopen(httrackp * opt, int mode, int treat, int waitconnect,
/* connexion */
if (retour) {
if ((!(retour->req.proxy.active))
|| ((strcmp(adr, "file://") == 0)
|| (strncmp(adr, "https://", 8) == 0)
)
) { /* pas de proxy, ou non utilisable ici */
/* no proxy, or proxy not usable here (local file) */
if ((!(retour->req.proxy.active)) || (strcmp(adr, "file://") == 0)) {
soc = newhttp(opt, adr, retour, -1, waitconnect);
} else {
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port, waitconnect); // ouvrir sur le proxy à la place
// to the proxy; https tunnels to the origin via CONNECT in back_wait
// (#85)
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port,
waitconnect);
}
} else {
soc = newhttp(opt, adr, NULL, -1, waitconnect);
@@ -874,6 +1033,50 @@ static void print_buffer(buff_struct*const str, const char *format, ...) {
assertf(str->pos < str->capacity);
}
/* Append the request "Cookie:" header line for every stored cookie matching
domain/path. RFC 6265 form: bare "name=value" pairs joined by "; ", no
$Version/$Path attributes (those are RFC 2965 syntax that modern servers
reject, issue #151). Returns 1 if a Cookie line was emitted, 0 otherwise. */
static int append_cookie_header(buff_struct *bstr, t_cookie *cookie,
const char *domain, const char *path) {
char buffer[8192];
char *b;
int cook = 0;
int max_cookies = 8;
if (cookie == NULL)
return 0;
b = cookie->data;
do {
b = cookie_find(b, "", domain, path); // next matching cookie
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(bstr, "Cookie: ");
cook = 1;
} else
print_buffer(bstr, "; ");
print_buffer(bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(bstr, "=%s", cookie_get(buffer, b, 6));
b = cookie_nextfield(b);
}
} while (b != NULL && max_cookies > 0);
if (cook)
print_buffer(bstr, H_CRLF);
return cook;
}
/* Self-test entry for append_cookie_header(): build the request Cookie line
into dst (always NUL-terminated). Returns 1 if a Cookie line was emitted,
0 otherwise. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size) {
buff_struct bstr = {dst, dst_size, 0};
assertf(dst != NULL && dst_size > 0);
dst[0] = '\0';
return append_cookie_header(&bstr, cookie, domain, path);
}
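Review note: the RFC 6265 shape is a single `Cookie:` line with bare `name=value` pairs joined by `"; "`. A minimal sketch of that formatting over plain arrays follows; `format_cookie_header` and `MAX_COOKIES` are hypothetical names. Unlike the patch's function, this one returns the count of pairs written.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_COOKIES 8 /* same cap as the patch */

/* Hypothetical formatter: writes "Cookie: n1=v1; n2=v2\r\n" into dst.
   Returns the number of pairs written, or 0 (with dst empty) when there is
   nothing to send or the buffer is too small. */
static int format_cookie_header(char *dst, size_t size,
                                const char *const *names,
                                const char *const *values, int count) {
  size_t pos = 0;
  int i, n, emitted = 0;

  assert(dst != NULL && size > 0);
  dst[0] = '\0';
  for (i = 0; i < count && emitted < MAX_COOKIES; i++) {
    n = snprintf(dst + pos, size - pos, "%s%s=%s",
                 emitted == 0 ? "Cookie: " : "; ", names[i], values[i]);
    if (n < 0 || (size_t) n >= size - pos) {
      dst[0] = '\0';
      return 0;
    }
    pos += (size_t) n;
    emitted++;
  }
  if (emitted > 0) {
    n = snprintf(dst + pos, size - pos, "\r\n");
    if (n < 0 || (size_t) n >= size - pos) {
      dst[0] = '\0';
      return 0;
    }
  }
  return emitted;
}
```

The expected string in the `-#Q` self-test, `Cookie: name=value; has_js=1` followed by CRLF, is exactly what this produces for the two matching cookies.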
// send a request
int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
const char *xsend, const char *adr, const char *fil,
@@ -999,8 +1202,8 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
if (xsend)
print_buffer(&bstr, "%s", xsend); // any additional header lines
// check proxy authentication
if (retour->req.proxy.active) {
// for https, auth rides the CONNECT (the tunneled GET would leak it)
if (retour->req.proxy.active && strncmp(adr, "https://", 8) != 0) {
if (link_has_authorization(retour->req.proxy.name)) { // proxy authentication
const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name);
@@ -1048,34 +1251,9 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
search_tag + strlen(POSTTOK) + 1))));
}
}
// gestion cookies?
// send stored cookies matching this host/path
if (cookie) {
char buffer[8192];
char *b = cookie->data;
int cook = 0;
int max_cookies = 8;
do {
b = cookie_find(b, "", jump_identification_const(adr), fil); // prochain cookie satisfaisant aux conditions
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(&bstr, "Cookie: $Version=1; ");
cook = 1;
} else
print_buffer(&bstr, "; ");
print_buffer(&bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(&bstr, "=%s", cookie_get(buffer, b, 6));
print_buffer(&bstr, "; $Path=%s", cookie_get(buffer, b, 2));
b = cookie_nextfield(b);
}
} while(b != NULL && max_cookies > 0);
if (cook) { // on a envoyé un (ou plusieurs) cookie?
print_buffer(&bstr, H_CRLF);
#if DEBUG_COOK
printf("Header:\n%s\n", bstr.buffer);
#endif
}
append_cookie_header(&bstr, cookie, jump_identification_const(adr), fil);
}
// handle keep-alive (keep the socket open)
if (retour->req.http11 && !retour->req.nokeepalive) {
@@ -1808,6 +1986,24 @@ int check_readinput_t(T_SOC soc, int timeout) {
return 0;
}
// wait until the socket is writable, up to timeout seconds
int check_writeinput_t(T_SOC soc, int timeout) {
if (soc != INVALID_SOCKET) {
fd_set fds;
struct timeval tv;
const int isoc = (int) soc;
assertf(isoc == soc);
FD_ZERO(&fds);
FD_SET(isoc, &fds);
tv.tv_sec = timeout;
tv.tv_usec = 0;
select(isoc + 1, NULL, &fds, NULL, &tv);
return FD_ISSET(isoc, &fds) ? 1 : 0;
} else
return 0;
}
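Review note: a POSIX-only sketch of the same writable wait, with one deliberate difference. It checks the return value of `select()` before trusting the set, so an error or timeout is reported as "not writable". `wait_writable` is a hypothetical name, and the Windows path is not covered.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical POSIX variant of check_writeinput_t(): returns 1 if fd
   becomes writable within timeout_sec seconds, 0 on timeout, error, or an
   unusable descriptor. */
static int wait_writable(int fd, int timeout_sec) {
  fd_set wfds;
  struct timeval tv;
  int r;

  assert(timeout_sec >= 0);
  if (fd < 0 || fd >= FD_SETSIZE)
    return 0; /* FD_SET is undefined outside [0, FD_SETSIZE) */
  FD_ZERO(&wfds);
  FD_SET(fd, &wfds);
  tv.tv_sec = timeout_sec;
  tv.tv_usec = 0;
  r = select(fd + 1, NULL, &wfds, NULL, &tv);
  return (r > 0 && FD_ISSET(fd, &wfds)) ? 1 : 0;
}
```

The write end of a fresh pipe is a convenient descriptor for trying this out, since it is writable immediately.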
// same, except that the maximum size of data to receive can be chosen here
// if bufl==0, the buffer is assumed to be 8 KB and data is received line by line,
// stripping CRs (e.g. headers) and stopping at a double LF
@@ -2409,6 +2605,8 @@ int ident_url_absolute(const char *url, lien_adrfil *adrfil) {
for(i = 0; adrfil->fil[i] != '\0'; i++)
if (adrfil->fil[i] == '\\')
adrfil->fil[i] = '/';
// collapse ../ like the http branch above (path-traversal safety)
fil_simplifie(adrfil->fil);
}
// no hostname
@@ -5468,9 +5666,10 @@ HTSEXT_API httrackp *hts_create_opt(void) {
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->referer, "");
StringCopy(opt->from, "");
opt->savename_83 = 0; // noms longs par défaut
opt->savename_83 = HTS_SAVENAME_83_LONG; // long names by default
opt->savename_type = 0; // with the original structure
opt->savename_delayed = 2; // hard delayed type (default)
opt->savename_delayed =
HTS_SAVENAME_DELAYED_HARD; // always delay the type check (default)
opt->delayed_cached = HTS_TRUE;
opt->mimehtml = HTS_FALSE;
opt->parsejava = HTSPARSE_DEFAULT; // parser classes
@@ -5495,7 +5694,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->parseall = HTS_TRUE;
opt->parsedebug = HTS_FALSE;
opt->norecatch = HTS_FALSE;
opt->verbosedisplay = 0; // pas d'animation texte
opt->verbosedisplay = HTS_VERBOSE_NONE; // no text animation
opt->sizehack = HTS_FALSE;
opt->urlhack = HTS_TRUE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER);

View File

@@ -182,6 +182,11 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode, const char *xsend
const char *adr, const char *fil,
const char *referer_adr, const char *referer_fil,
htsblk * retour);
/* Build the request "Cookie:" header line for stored cookies matching
domain/path into dst (NUL-terminated). Exposed for the -#Q self-test;
wraps the same logic http_sendhead() uses. Returns 1 if a Cookie line was
emitted, 0 otherwise. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size);
//int newhttp(char* iadr,char* err=NULL);
T_SOC newhttp(httrackp * opt, const char *iadr, htsblk * retour, int port,
@@ -193,6 +198,17 @@ HTS_INLINE void deletesoc_r(htsblk * r);
htsblk http_test(httrackp * opt, const char *adr, const char *fil, char *loc);
int check_readinput(htsblk * r);
int check_readinput_t(T_SOC soc, int timeout);
int check_writeinput_t(T_SOC soc, int timeout);
/* Open an HTTP CONNECT tunnel through the active proxy for an https request:
`retour->soc` must already be TCP-connected to the proxy, and `adr` is the
origin authority (url_adr, e.g. "https://host:port"). Sends the CONNECT
request (with Proxy-Authorization when the proxy carries credentials) and
reads the proxy's status line, so the caller's TLS handshake then runs
end-to-end with the origin. Blocks up to `timeout` seconds. Returns 1 on a
2xx tunnel, 0 on failure (retour->msg set). */
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout);
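Review note: the "1 on a 2xx tunnel" contract depends on how the status line is parsed. A minimal sketch using the same `sscanf` pattern as the implementation follows; `parse_status_code` and `is_tunnel_ready` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: extracts the status code from "HTTP/x.y <code> ...".
   Returns 0 when the line does not look like an HTTP status line. */
static int parse_status_code(const char *line) {
  int code = 0;

  assert(line != NULL);
  /* %*d skips the major and minor version without storing them */
  if (sscanf(line, "HTTP/%*d.%*d %d", &code) < 1)
    return 0;
  return code;
}

/* A CONNECT succeeded only if the proxy answered with a 2xx status. */
static int is_tunnel_ready(const char *line) {
  const int code = parse_status_code(line);
  return code >= 200 && code < 300;
}
```

A malformed or empty line yields code 0, which falls outside the 2xx range and is therefore treated as a refusal.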
void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * retour,
char *rcvd);
void treatfirstline(htsblk * retour, const char *rcvd);

View File

@@ -184,10 +184,11 @@ int url_savename(lien_adrfilsave *const afs,
/* 8-3 ? */
switch (opt->savename_83) {
case 1: // 8-3
case HTS_SAVENAME_83_DOS: // 8-3
max_char = 8;
break;
case 2: // Level 2 File names may be up to 31 characters.
case HTS_SAVENAME_83_ISO9660: // Level 2 File names may be up to 31
// characters.
max_char = 31;
break;
default:
@@ -324,7 +325,7 @@ int url_savename(lien_adrfilsave *const afs,
}
/* replace shtml to html.. */
if (opt->savename_delayed == 2)
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD)
is_html = -1; /* ALWAYS delay type */
else
is_html = ishtml(opt, fil);
@@ -363,7 +364,9 @@ int url_savename(lien_adrfilsave *const afs,
) {
// test the type with a HEAD request if the file type is unknown
if (!((opt->check_type == 1) && (fil[strlen(fil) - 1] == '/'))) // must a trailing slash be html?
if (opt->savename_delayed == 2 || (ishtest = ishtml(opt, fil)) < 0) { // on ne sait pas si c'est un html ou un fichier..
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD ||
(ishtest = ishtml(opt, fil)) <
0) { // unsure whether it's html or a file
// read from the cache
htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test only
@@ -393,11 +396,12 @@ int url_savename(lien_adrfilsave *const afs,
}
#endif
//
} else if (opt->savename_delayed != 2 && is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
Lookup mimetype not only by extension,
but also by filename */
/* Note: "foo.cgi => text/html" means that foo.cgi shall have the text/html MIME file type,
that is, ".html" */
} else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD &&
is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
Lookup mimetype not only by extension,
but also by filename */
/* Note: "foo.cgi => text/html" means that foo.cgi shall have the
text/html MIME file type, that is, ".html" */
char BIGSTK mime[1024];
mime[0] = ext[0] = '\0';
@@ -408,9 +412,13 @@ int url_savename(lien_adrfilsave *const afs,
}
}
}
// note: if savename_delayed is enabled, the naming will be temporary (and slightly invalid!)
// note: if we are about to stop (opt->state.stop), back_add() will fail later
else if (opt->savename_delayed != 0 && !opt->state.stop) {
// note: if savename_delayed is enabled, the naming will be temporary
// (and slightly invalid!)
//
// note: if we are about to stop (opt->state.stop), back_add() will
// fail later
else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
!opt->state.stop) {
// Check if the file is ready in backing. We basically take the same logic as later.
// FIXME: we should cleanup and factorize this unholy mess
if (headers != NULL && headers->status >= 0 && !is_redirect) {
@@ -698,7 +706,7 @@ int url_savename(lien_adrfilsave *const afs,
}
// restaurer
opt->state._hts_in_html_parsing = hihp;
} // caché?
} // caché?
}
}
}
@@ -1190,7 +1198,8 @@ int url_savename(lien_adrfilsave *const afs,
// Not used anymore unless non-delayed types.
// likewise, if the extension is missing, force one..
// this avoids incompatible /chez/toto and /chez/toto/index.html
if (opt->savename_type != -1 && opt->savename_delayed != 2) {
if (opt->savename_type != -1 &&
opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD) {
char *a = afs->save + strlen(afs->save) - 1;
while((a > afs->save) && (*a != '.') && (*a != '/'))
@@ -1236,31 +1245,21 @@ int url_savename(lien_adrfilsave *const afs,
size_t i;
for(i = 0 ; afs->save[i] != '\0' ; i++) {
unsigned char c = (unsigned char) afs->save[i];
if (c < 32 // control
|| c == 127 // unwise
|| c == '~' // unix unwise
|| c == '\\' // windows separator
|| c == ':' // windows forbidden
|| c == '*' // windows forbidden
|| c == '?' // windows forbidden
|| c == '\"' // windows forbidden
|| c == '<' // windows forbidden
|| c == '>' // windows forbidden
|| c == '|' // windows forbidden
//|| c == '@' // ?
||
(
opt->savename_83 == 2 // CDROM
&&
(
c == '-'
|| c == '='
|| c == '+'
)
)
)
{
afs->save[i] = '_';
if (c < 32 // control
|| c == 127 // unwise
|| c == '~' // unix unwise
|| c == '\\' // windows separator
|| c == ':' // windows forbidden
|| c == '*' // windows forbidden
|| c == '?' // windows forbidden
|| c == '\"' // windows forbidden
|| c == '<' // windows forbidden
|| c == '>' // windows forbidden
|| c == '|' // windows forbidden
//|| c == '@' // ?
|| (opt->savename_83 == HTS_SAVENAME_83_ISO9660 // CDROM
&& (c == '-' || c == '=' || c == '+'))) {
afs->save[i] = '_';
}
}
}
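Review note: the character filter in this hunk can be read as a small predicate over the name layout. The sketch below restates it under hypothetical names; `name_layout` only mirrors the values of `hts_savename_83` and is not the real type.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of hts_savename_83 (values copied from the patch). */
typedef enum { NAME_LONG = 0, NAME_DOS = 1, NAME_ISO9660 = 2 } name_layout;

/* True if c must be replaced by '_' in a saved name. */
static int is_forbidden_char(unsigned char c, name_layout layout) {
  if (c < 32 || c == 127)
    return 1; /* control characters and DEL */
  if (strchr("~\\:*?\"<>|", c) != NULL)
    return 1; /* unwise on unix or forbidden on windows */
  /* CD-ROM layout additionally rejects '-', '=' and '+' */
  return layout == NAME_ISO9660 && (c == '-' || c == '=' || c == '+');
}

/* Replace every forbidden character in s, in place. */
static void sanitize_save_name(char *s, name_layout layout) {
  size_t i;

  assert(s != NULL);
  for (i = 0; s[i] != '\0'; i++) {
    if (is_forbidden_char((unsigned char) s[i], layout))
      s[i] = '_';
  }
}
```

The path separator `/` is left alone, as in the patch, so the helper can be applied to a whole save path.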
@@ -1521,7 +1520,8 @@ int url_savename(lien_adrfilsave *const afs,
char *a = afs->save + strlen(afs->save) - 1;
char *b;
int n = 2;
char collisionSeparator = ((opt->savename_83 != 2) ? '-' : '_');
char collisionSeparator =
((opt->savename_83 != HTS_SAVENAME_83_ISO9660) ? '-' : '_');
tempo[0] = '\0';

View File

@@ -368,6 +368,13 @@ typedef enum hts_savename_delayed {
HTS_SAVENAME_DELAYED_HARD = 2 /**< always delay the type check (default) */
} hts_savename_delayed;
/* Saved-name length layout (opt->savename_83). */
typedef enum hts_savename_83 {
HTS_SAVENAME_83_LONG = 0, /**< long file names (default) */
HTS_SAVENAME_83_DOS = 1, /**< DOS 8.3 names (ISO9660 level 1) */
HTS_SAVENAME_83_ISO9660 = 2 /**< ISO9660 level 2 names (up to 31 chars) */
} hts_savename_83;
/* Host-banning triggers (opt->hostcontrol bitmask). */
typedef enum hts_hostcontrol {
HTS_HOSTCONTROL_BAN_TIMEOUT = 1 << 0, /**< ban a timing-out host */
@@ -430,7 +437,8 @@ struct httrackp {
// int aff_progress; // progress bar
hts_boolean shell; /**< driven by a shell over stdin/stdout pipes */
t_proxy proxy; /**< proxy configuration */
int savename_83; /**< force 8.3 (DOS) file names */
hts_savename_83
savename_83; /**< saved-name length layout (long/DOS/ISO9660) */
int savename_type; /**< saved-name layout (original tree, flat, ...) */
String
savename_userdef; /**< user-defined name template (e.g. %h%p/%n%q.%t) */

View File

@@ -296,6 +296,48 @@ static const char *html_inline_safe(const char *src, char *dst, size_t size) {
return dst;
}
/* Byte before html, or a space sentinel at the buffer start where html[-1]
would underflow; space reads as the word boundary the guards want there. */
static HTS_INLINE char html_prevc(const char *html, const char *start) {
return html > start ? html[-1] : ' ';
}
/* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
argument is a method, not a URL: #218). Case-insensitive. */
static int is_http_method(const char *s, size_t len) {
static const char *const methods[] = {"GET", "POST", "PUT",
"DELETE", "HEAD", "OPTIONS",
"PATCH", "TRACE", NULL};
int i;
for (i = 0; methods[i] != NULL; i++) {
if (strlen(methods[i]) == len && strfield(s, methods[i]) == (int) len)
return 1;
}
return 0;
}
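Review note: the method check is an exact-length, case-insensitive comparison. A self-contained sketch that does not depend on `strfield()` follows; `is_method_token` is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical standalone version: true if the len bytes at s are exactly
   one HTTP method name, compared case-insensitively. */
static int is_method_token(const char *s, size_t len) {
  static const char *const methods[] = {"GET",     "POST",  "PUT",
                                        "DELETE",  "HEAD",  "OPTIONS",
                                        "PATCH",   "TRACE", NULL};
  int i;

  assert(s != NULL);
  for (i = 0; methods[i] != NULL; i++) {
    size_t k;
    if (strlen(methods[i]) != len)
      continue; /* exact length only: "GETX" is not a method */
    for (k = 0; k < len; k++) {
      /* a NUL in s never equals a method letter, so this stops in bounds */
      if (toupper((unsigned char) s[k]) != methods[i][k])
        break;
    }
    if (k == len)
      return 1;
  }
  return 0;
}
```

Passing the token length explicitly is what lets the parser test the quoted first argument of `.open("GET", "/x")` without copying it.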
/* Percent-encode '(' and ')' in a link emitted into an unquoted url(...) (CSS
or JS): a literal ')' closes the token early and the UA mis-parses the value
(#163). The UA decodes %28/%29 back to the saved-on-disk name. */
static void escape_url_parens(char *const s, const size_t size) {
char BIGSTK buff[HTS_URLMAXSIZE * 2];
size_t i, j;
for (i = 0, j = 0; s[i] != '\0' && j + 3 < size && j + 3 < sizeof(buff);
i++) {
if (s[i] == '(' || s[i] == ')') {
buff[j++] = '%';
buff[j++] = '2';
buff[j++] = s[i] == '(' ? '8' : '9';
} else {
buff[j++] = s[i];
}
}
buff[j] = '\0';
strlcpybuff(s, buff, size);
}
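Review note: a standalone sketch of the parenthesis escaping, with one deliberate difference from the patch. It reports failure instead of silently truncating when the result would not fit. `escape_parens` and the 1024-byte scratch buffer are assumptions of this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical variant: rewrites '(' as %28 and ')' as %29 in s, in place.
   size is the capacity of s. Returns 1 on success, 0 (s unchanged) if the
   escaped string would not fit. */
static int escape_parens(char *s, size_t size) {
  char buff[1024];
  size_t i, j = 0;

  assert(s != NULL && size > 0);
  for (i = 0; s[i] != '\0'; i++) {
    const int paren = (s[i] == '(' || s[i] == ')');
    const size_t need = paren ? 3 : 1;
    /* keep one byte free for the terminating NUL */
    if (j + need >= size || j + need >= sizeof(buff))
      return 0;
    if (paren) {
      buff[j++] = '%';
      buff[j++] = '2';
      buff[j++] = (s[i] == '(') ? '8' : '9';
    } else {
      buff[j++] = s[i];
    }
  }
  buff[j] = '\0';
  memcpy(s, buff, j + 1);
  return 1;
}
```

For a saved name such as `a(1).png`, the link written inside an unquoted `url(...)` becomes `a%281%29.png`, which the browser decodes back to the on-disk name.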
/* Main parser */
int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
char catbuff[CATBUFF_SIZE];
@@ -556,7 +598,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (opt->getmode & HTS_GETMODE_HTML) {
p = strfield(html, "title");
if (p) {
if (*(html - 1) == '/')
if (html_prevc(html, r->adr) == '/')
p = 0; // /title
} else {
if (strfield(html, "/html"))
@@ -1341,6 +1383,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int can_avoid_quotes = 0;
char quotes_replacement = '\0';
int ensure_not_mime = 0;
// .open(method,url): reject an HTTP-method first arg (#218)
int ensure_not_method = 0;
// @import: the quoted token is the URL; a trailing
// media/supports/layer condition is not part of it
int is_import = 0;
if (inscript_tag)
expected_end = ";\"\'"; // see a href="javascript:doc.location='foo'"
@@ -1357,9 +1404,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (!nc)
nc = strfield(html, ":location"); // javascript:location="doc"
if (!nc) { // location="doc"
if ((nc = strfield(html, "location"))
&& !isspace(*(html - 1))
)
if ((nc = strfield(html, "location")) &&
!isspace(html_prevc(html, r->adr)))
nc = 0;
}
if (!nc)
@@ -1369,6 +1415,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
expected = '('; // parenthesis
expected_end = "),"; // end: comma or parenthesis
ensure_not_mime = 1; // ensure the url is not a mime type
ensure_not_method = 1; // xhr.open: don't grab method
}
if (!nc)
if ((nc = strfield(html, ".replace"))) { // window.replace("url")
@@ -1380,7 +1427,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
expected = '('; // parenthesis
expected_end = ")"; // end: parenthesis
}
if (!nc && (nc = strfield(html, "url")) && (!isalnum(*(html - 1))) && *(html - 1) != '_') { // url(url)
if (!nc && (nc = strfield(html, "url")) &&
(!isalnum(html_prevc(html, r->adr))) &&
html_prevc(html, r->adr) != '_') { // url(url)
expected = '('; // parenthesis
expected_end = ")"; // end: parenthesis
can_avoid_quotes = 1;
@@ -1390,6 +1439,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((nc = strfield(html, "import"))) { // import "url"
if (is_space(*(html + nc))) {
expected = 0; // no char expected
is_import = 1;
} else
nc = 0;
}
@@ -1407,6 +1457,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) {
const char *b, *c;
int ndelim = 1;
int valid_url = 0;
if ((*a == 34) || (*a == '\''))
a++;
@@ -1421,12 +1472,20 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
b++;
}
c = b--;
c += ndelim;
while(*c == ' ')
c++;
if ((strchr(expected_end, *c)) || (*c == '\n')
|| (*c == '\r')) {
c -= (ndelim + 1);
// no closing delimiter here (truncated input):
// Don't scan past the buffer NUL or capture it.
if (*c != '\0') {
c += ndelim;
while (*c == ' ')
c++;
valid_url =
(strchr(expected_end, *c)) || (*c == '\n') ||
(*c == '\r') ||
(is_import && *(b + 1 + ndelim) == ' ');
}
if (valid_url) {
// URL end = last char (b), not the delimiter
c = b;
if ((int) (c - a + 1)) {
if (ensure_not_mime) {
int i = 0;
@@ -1442,6 +1501,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
i++;
}
}
// XHR.open's "GET" etc. is a method, not a URL
if (a != NULL && ensure_not_method &&
is_http_method(a, (size_t) (c - a + 1))) {
a = NULL;
}
// Check for bogus links (Vasiliy)
if (a != NULL) {
const size_t size = c - a + 1;
@@ -1485,7 +1549,6 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
}
}
}
}
}
@@ -1692,6 +1755,24 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
hts_nodetect[i -
1]);
}
// xmlns / xmlns:prefix declare
// XML namespaces, not resources
// (#191)
else {
const int xl = strfield(
intag_startattr, "xmlns");
const char xc =
intag_startattr[xl];
if (xl &&
(xc == ':' || xc == '=' ||
is_space(xc))) {
url_ok = 0;
hts_log_print(
opt, LOG_DEBUG,
"dirty parsing: xmlns "
"namespace avoided");
}
}
}
}
@@ -2967,6 +3048,10 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
/* Never escape high-chars (we don't know the encoding!!) */
inplace_escape_uri_utf(tempo, sizeof(tempo));
// unquoted url() (CSS/JS): keep parens escaped
if (ending_p == ')')
escape_url_parens(tempo, sizeof(tempo));
//if (!no_esc_utf)
// escape_uri(tempo); // escape with %xx
//else {
@@ -4262,10 +4347,10 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
char com[256];
linput(stdin, com, 200);
if (opt->verbosedisplay == 2)
opt->verbosedisplay = 1;
if (opt->verbosedisplay == HTS_VERBOSE_FULL)
opt->verbosedisplay = HTS_VERBOSE_SIMPLE;
else
opt->verbosedisplay = 2;
opt->verbosedisplay = HTS_VERBOSE_FULL;
/* Info for wrappers */
hts_log_print(opt, LOG_INFO, "engine: change-options");
RUN_CALLBACK0(opt, chopt);
@@ -4375,7 +4460,7 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
printf("%c\x0d", ("/-\\|")[roll]);
fflush(stdout);
}
} else if (opt->verbosedisplay == 1) {
} else if (opt->verbosedisplay == HTS_VERBOSE_SIMPLE) {
if (b >= 0) {
if (back[b].r.statuscode == HTTP_OK)
printf("%d/%d: %s%s (" LLintP " bytes) - OK\33[K\r", ptr, opt->lien_tot,
@@ -4466,8 +4551,8 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
char in_error_msg[32];
// resolve unresolved type
if (opt->savename_delayed != 0 && *forbidden_url == 0 && IS_DELAYED_EXT(afs->save)
&& !opt->state.stop) {
if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
*forbidden_url == 0 && IS_DELAYED_EXT(afs->save) && !opt->state.stop) {
int loops;
int continue_loop;
@@ -4851,7 +4936,7 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
}
}
} // delayed type check ?
} // delayed type check ?
ENGINE_SAVE_CONTEXT_BASE();

View File

@@ -288,7 +288,7 @@ static void __cdecl htsshow_uninit(t_hts_callbackarg * carg) {
}
static int __cdecl htsshow_start(t_hts_callbackarg * carg, httrackp * opt) {
use_show = 0;
if (opt->verbosedisplay == 2) {
if (opt->verbosedisplay == HTS_VERBOSE_FULL) {
use_show = 1;
vt_clear();
}
@@ -852,7 +852,7 @@ static void sig_doback(int blind) { // mettre en backing
if (global_opt != NULL) {
// suppress logging and asking lousy questions
global_opt->quiet = 1;
global_opt->verbosedisplay = 0;
global_opt->verbosedisplay = HTS_VERBOSE_NONE;
}
if (!blind)

View File

@@ -4,131 +4,140 @@
# Initializes the htsserver GUI frontend and launches the default browser
BROWSEREXE=
SRCHBROWSEREXE="x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition"
SRCHBROWSEREXE=(x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition)
# shellcheck disable=SC2153 # BROWSER is the standard freedesktop env var, not a typo
if test -n "${BROWSER}"; then
# sensible-browser will f up if BROWSER is not set
SRCHBROWSEREXE="xdg-open sensible-browser ${SRCHBROWSEREXE}"
# sensible-browser will f up if BROWSER is not set
SRCHBROWSEREXE=(xdg-open sensible-browser "${SRCHBROWSEREXE[@]}")
fi
# Patch for Darwin/Mac by Ross Williams
if test "`uname -s`" == "Darwin"; then
# Darwin/Mac OS X uses a system 'open' command to find
# the default browser. The -W flag causes it to wait for
# the browser to exit
BROWSEREXE="/usr/bin/open -W"
if test "$(uname -s)" == "Darwin"; then
# Darwin/Mac OS X uses a system 'open' command to find
# the default browser. The -W flag causes it to wait for
# the browser to exit
BROWSEREXE="/usr/bin/open -W"
fi
BINWD=`dirname "$0"`
SRCHPATH="$BINWD /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin ${HOME}/usr/bin ${HOME}/bin"
SRCHPATH="$SRCHPATH "`echo $PATH | tr ":" " "`
SRCHDISTPATH="$BINWD/../share $BINWD/.. /usr/share /usr/local /usr /local /usr/local/share ${HOME}/usr ${HOME}/usr/share /opt/local/share /sw ${HOME}/usr/local ${HOME}/usr/share"
BINWD=$(dirname "$0")
SRCHPATH=("$BINWD" /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin "${HOME}/usr/bin" "${HOME}/bin")
IFS=':' read -ra pathdirs <<<"$PATH"
for d in "${pathdirs[@]}"; do
# drop empty PATH fields, matching the old echo|tr word-split
test -n "$d" && SRCHPATH+=("$d")
done
SRCHDISTPATH=("$BINWD/../share" "$BINWD/.." /usr/share /usr/local /usr /local /usr/local/share "${HOME}/usr" "${HOME}/usr/share" /opt/local/share /sw "${HOME}/usr/local" "${HOME}/usr/share")
###
# And now some famous cuisine
function log {
echo "$0($$): $@" >&2
return 0
echo "$0($$): $*" >&2
return 0
}
function launch_browser {
log "Launching $1"
browser=$1
url=$2
log "Spawning browser.."
${browser} "${url}"
# note: browser can hiddenly use the -remote feature of
# mozilla and therefore return immediately
log "Browser (or helper) exited"
log "Launching $1"
browser=$1
url=$2
log "Spawning browser.."
${browser} "${url}"
# note: browser can hiddenly use the -remote feature of
# mozilla and therefore return immediately
log "Browser (or helper) exited"
}
# First ensure that we can launch the server
BINPATH=
for i in ${SRCHPATH}; do
! test -n "${BINPATH}" && test -x ${i}/htsserver && BINPATH=${i}
for i in "${SRCHPATH[@]}"; do
! test -n "${BINPATH}" && test -x "${i}/htsserver" && BINPATH="${i}"
done
for i in ${SRCHDISTPATH}; do
! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack"
for i in "${SRCHDISTPATH[@]}"; do
! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack"
done
test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1
test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1
test -f ${DISTPATH}/lang.def || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
test -f ${DISTPATH}/lang.indexes || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
test -d ${DISTPATH}/lang || ! log "Could not find ${DISTPATH}/lang" || exit 1
test -d ${DISTPATH}/html || ! log "Could not find ${DISTPATH}/html" || exit 1
test -f "${DISTPATH}/lang.def" || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
test -f "${DISTPATH}/lang.indexes" || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
test -d "${DISTPATH}/lang" || ! log "Could not find ${DISTPATH}/lang" || exit 1
test -d "${DISTPATH}/html" || ! log "Could not find ${DISTPATH}/html" || exit 1
# Locale
HTSLANG="${LC_MESSAGES}"
! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}"
! test -n "${HTSLANG}" && HTSLANG="${LANG}"
HTSLANG="`echo $LANG | cut -f1 -d'.' | cut -f1 -d'_'`"
LANGN=`grep -E "^${HTSLANG}:" ${DISTPATH}/lang.indexes | cut -f2 -d':'`
HTSLANG="$(echo "$LANG" | cut -f1 -d'.' | cut -f1 -d'_')"
LANGN=$(grep -E "^${HTSLANG}:" "${DISTPATH}/lang.indexes" | cut -f2 -d':')
! test -n "${LANGN}" && LANGN=1
# Find the browser
# note: not all systems have sensible-browser or www-browser alternative
# therefore, we have to look a bit further if sensible-browser could not be found
for i in ${SRCHBROWSEREXE}; do
for j in ${SRCHPATH}; do
if test -x ${j}/${i}; then
BROWSEREXE=${j}/${i}
fi
test -n "$BROWSEREXE" && break
done
test -n "$BROWSEREXE" && break
for i in "${SRCHBROWSEREXE[@]}"; do
for j in "${SRCHPATH[@]}"; do
if test -x "${j}/${i}"; then
BROWSEREXE="${j}/${i}"
fi
test -n "$BROWSEREXE" && break
done
test -n "$BROWSEREXE" && break
done
test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1
# "browse" command
if test "$1" = "browse"; then
if test -f "${HOME}/.httrack.ini"; then
INDEXF=`cat ${HOME}/.httrack.ini | tr '\r' '\n' | grep -E "^path=" | cut -f2- -d'='`
if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then
INDEXF="${INDEXF}/index.html"
else
INDEXF=""
fi
fi
if ! test -n "$INDEXF"; then
INDEXF="${HOME}/websites/index.html"
fi
launch_browser "${BROWSEREXE}" "file://${INDEXF}"
exit $?
if test -f "${HOME}/.httrack.ini"; then
INDEXF=$(tr '\r' '\n' <"${HOME}/.httrack.ini" | grep -E "^path=" | cut -f2- -d'=')
if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then
INDEXF="${INDEXF}/index.html"
else
INDEXF=""
fi
fi
if ! test -n "$INDEXF"; then
INDEXF="${HOME}/websites/index.html"
fi
launch_browser "${BROWSEREXE}" "file://${INDEXF}"
exit $?
fi
# Create a temporary filename
TMPSRVFILE="$(mktemp ${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX)" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
TMPSRVFILE="$(mktemp "${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX")" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
# Launch htsserver binary and setup the server
(${BINPATH}/htsserver "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" $@; echo SRVURL=error) > ${TMPSRVFILE}&
(
"${BINPATH}/htsserver" "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" "$@"
echo SRVURL=error
) >"${TMPSRVFILE}" &
# Find the generated SRVURL
SRVURL=
MAXCOUNT=60
while ! test -n "$SRVURL"; do
MAXCOUNT=$[$MAXCOUNT - 1]
test $MAXCOUNT -gt 0 || exit 1
test $MAXCOUNT -lt 50 && echo "waiting for server to reply.."
SRVURL=`grep -E URL= ${TMPSRVFILE} | cut -f2- -d=`
test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1
test -n "$SRVURL" || sleep 1
MAXCOUNT=$((MAXCOUNT - 1))
test $MAXCOUNT -gt 0 || exit 1
test $MAXCOUNT -lt 50 && echo "waiting for server to reply.."
SRVURL=$(grep -E URL= "${TMPSRVFILE}" | cut -f2- -d=)
test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1
test -n "$SRVURL" || sleep 1
done
# Cleanup function
# shellcheck disable=SC2120 # $1 is an optional "signal caught" marker; bare calls are intentional
function cleanup {
test -n "$1" && log "Nasty signal caught, cleaning up.."
# Do not kill if browser exited (chrome bug issue) ; server will die itself
test -n "$1" && test -f ${TMPSRVFILE} && SRVPID=`grep -E PID= ${TMPSRVFILE} | cut -f2- -d=`
test -n "${SRVPID}" && kill -9 ${SRVPID}
test -f ${TMPSRVFILE} && rm ${TMPSRVFILE}
test -n "$1" && log "..Done"
return 0
test -n "$1" && log "Nasty signal caught, cleaning up.."
# Do not kill if browser exited (chrome bug issue) ; server will die itself
test -n "$1" && test -f "${TMPSRVFILE}" && SRVPID=$(grep -E PID= "${TMPSRVFILE}" | cut -f2- -d=)
test -n "${SRVPID}" && kill -9 "${SRVPID}"
test -f "${TMPSRVFILE}" && rm "${TMPSRVFILE}"
test -n "$1" && log "..Done"
return 0
}
# Cleanup in case of emergency
trap "cleanup now; exit" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
trap "cleanup now; exit" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# Got SRVURL, launch browser
launch_browser "${BROWSEREXE}" "${SRVURL}"
# That's all, folks!
trap "" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
trap "" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
cleanup
exit 0
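
Note on the PATH split above: the snippet below is a minimal standalone sketch, not code from webhttrack, of why the new loop skips empty fields. `split_path` is a hypothetical helper name. It assumes bash and shows that `IFS=':' read -ra` keeps the empty fields produced by leading or doubled colons, so they have to be filtered out to match the old `echo $PATH | tr ":" " "` word-split.

```shell
#!/bin/bash
# Hypothetical helper (not part of webhttrack): print each non-empty
# field of a PATH-like string on its own line.
split_path() {
	local raw=$1 d
	local -a fields=()
	# read -ra keeps "" elements for leading or doubled colons
	IFS=':' read -ra fields <<<"$raw"
	for d in "${fields[@]}"; do
		# skip the empty fields, as the array refactor does
		test -n "$d" && printf '%s\n' "$d"
	done
	return 0
}

# A leading colon, a doubled colon and a directory containing a space
split_path ":/usr/bin::/opt/my tools/bin"
```

Unlike the old unquoted expansion, a directory containing a space stays a single element here.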

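
Note on the trap change above: the commit message says the numeric lists were replaced with signal names and that the un-trappable KILL/STOP entries were dropped. The following is a small sketch of the same cleanup-on-signal pattern under simplified assumptions: `demo` and the `.trapdemo` file name are made up for illustration, and only three signals are listed.

```shell
#!/bin/bash
# Hypothetical demo, not webhttrack code: trap by signal name and
# remove a temporary file before exiting.
demo() (
	tmpfile=$(mktemp "${TMPDIR:-/tmp}/.trapdemo.XXXXXXXX") || exit 1
	cleanup() { rm -f "$tmpfile"; }
	# Names instead of numbers; KILL and STOP cannot be trapped, so
	# listing them would have no effect.
	trap 'cleanup; echo "caught TERM"; exit 1' HUP INT TERM
	# Send TERM to this subshell to exercise the handler
	kill -TERM "$BASHPID"
	echo "not reached"
)

demo || echo "status=$?"
```

Signal names are also portable across platforms where the numbers differ, which a numeric list cannot guarantee.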

@@ -6,11 +6,11 @@ set -euo pipefail
# charset -> UTF-8 conversion (hts_convertStringToUTF8).
# -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8.
conv() {
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
}
# crash probe: malformed input must exit cleanly, not abort.
runs() {
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
}
# the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9.

tests/01_engine-cookies.test Executable file

@@ -0,0 +1,15 @@
#!/bin/bash
#
# Issue #151 guard: the request Cookie header must be bare RFC 6265 name=value
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#Q' selftest.
set -eu
# A trailing token is required; a bare '-#Q' falls through to the usage screen.
out=$(httrack -#Q run)
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "cookie-header: OK" || {
echo "expected 'cookie-header: OK', got: $out" >&2
exit 1
}


@@ -89,4 +89,37 @@ grep -q NEWCONTENT "$(find "$out" -path '*/a.html' -print -quit)" || {
exit 1
}
# --- 3. an empty quoted arg survives the doit.log round-trip (#106) ----------
# -%F "" (empty footer) records an empty "" token in doit.log; -r2 follows it so
# a "drop the empty token" bug shifts -r2 into -%F's slot (the reprise then sees
# -%F -r2 and panics "%F needs to be followed by ..."), making the bug visible
# rather than a harmless run off the end of argv.
out2="$tmp/out2"
rc=0
"$bin" "$url" -O "$out2" --quiet -n -%v0 -%F "" -r2 >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: initial mirror with empty footer exited $rc"
exit 1
}
# precondition: the writer put the empty token on disk for the reader to reload.
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer not recorded as -%F \"\" -r2 in doit.log"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
# no-url reprise: the reader rebuilds argv from doit.log and rewrites doit.log
# from it. The empty token surviving in the regenerated file proves the reader
# kept it (a drop/swallow would panic above or rewrite -%F without the "").
rc=0
"$bin" -O "$out2" --quiet >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: empty-footer reprise exited $rc (empty token dropped from doit.log?)"
exit 1
}
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer did not survive the doit.log reload round-trip"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
exit 0


@@ -6,11 +6,11 @@ set -euo pipefail
# HTML entity unescaping (hts_unescapeEntitiesWithCharset).
# -#6 <string> prints the string with entities decoded (UTF-8 output).
ent() {
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
}
# crash probe: malformed input must exit cleanly, not abort.
runs() {
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
}
# named entities


@@ -7,10 +7,10 @@ set -euo pipefail
# -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
match() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
}
nomatch() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
}
# bare star matches everything
@@ -67,7 +67,7 @@ nomatch '*[\[]' 'a'
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x'


@@ -7,10 +7,10 @@ set -euo pipefail
# -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
mime() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
}
unknown() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
}
mime '/a/b.html' 'text/html'


@@ -154,4 +154,173 @@ grep -Eq "style=\"background-image:url\('ibgs\.gif'\)\"" "$saved2" ||
grep -q 'title="file://' "$saved2" ||
! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
# xmlns / xmlns:prefix decls must not be crawled (#191). Local file:// targets so a
# regression downloads them; each is the LAST attr (heuristic only scans a value before '>').
site3="$tmp/xmlns"
mkdir -p "$site3"
for f in ns og rdfs real; do gif "$site3/$f.gif"; done
cat >"$site3/index.html" <<EOF
<html xmlns="file://$site3/ns.gif"><body>
<svg xmlns:og="file://$site3/og.gif"></svg>
<div class="c" xmlns:rdfs="file://$site3/rdfs.gif"></div>
<a href="file://$site3/real.gif">real link</a>
</body></html>
EOF
out3="$tmp/xmlns-out"
crawl "$site3/index.html" "$out3"
# the real link is still captured
found "real.gif" "$out3"
# namespace-declaration targets must not be fetched (default + prefixed forms)
notfound "ns.gif" "$out3"
notfound "og.gif" "$out3"
notfound "rdfs.gif" "$out3"
# CSS @import (#94): every form's target is captured, crawling the .css directly.
# The "cond"/"sup"/"spc" cases carry a trailing media/supports/layer condition (or
# a space before ';'); they are the negative controls: without the parser fix the
# URL is dropped, so a regression fails these found() checks.
site4="$tmp/cssimport"
mkdir -p "$site4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do printf 'body{}\n' >"$site4/$f.css"; done
cat >"$site4/main.css" <<'EOF'
@import url(nq.css);
@import url("dqu.css");
@import url('squ.css');
@import "dqs.css";
@import 'sqs.css';
@import url(med.css) screen and (min-width: 400px);
@import "cond.css" screen;
@import "sup.css" supports(display: flex);
@import url(lay.css) layer(base);
@import "spc.css" ;
EOF
out4="$tmp/cssimport-out"
crawl "$site4/main.css" "$out4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do found "$f.css" "$out4"; done
# Over-capture guard: the trailing condition is not part of the URL, so it must
# survive the rewrite verbatim. A regression that grabs it would mangle these.
m4=$(find "$out4" -type f -path '*/file/*' -name main.css -print -quit)
test -n "$m4" || ! echo "FAIL: saved main.css not found" || exit 1
for cond in '@import "cond.css" screen;' 'supports(display: flex)' 'layer(base)'; do
grep -Fq "$cond" "$m4" ||
! echo "FAIL #94: '$cond' altered on rewrite (condition captured as URL?)" || exit 1
done
# Malformed input: an unterminated @import quote (truncated CSS) must not crash or
# capture a bogus link; a valid sibling import is still captured. Guards a heap
# overflow on the URL-end scan that aborts under ASan (CI sanitizer job).
site5="$tmp/cssimport-trunc"
mkdir -p "$site5"
printf 'body{}\n' >"$site5/good.css"
printf '@import "good.css";\n@import "trunc' >"$site5/main.css"
out5="$tmp/cssimport-trunc-out"
crawl "$site5/main.css" "$out5"
found "good.css" "$out5"
notfound "trunc" "$out5"
# Offset-0 underflow (#396): a token at the buffer start makes the detector's
# word-boundary guard read *(html-1) one byte early (aborts under ASan). The
# url() target is still captured; here it just must not underflow.
site6="$tmp/parse-off0"
mkdir -p "$site6"
printf 'body{}\n' >"$site6/off0.css"
printf 'url(off0.css)\n' >"$site6/main.css"
out6="$tmp/parse-off0-out"
crawl "$site6/main.css" "$out6"
found "off0.css" "$out6"
# XMLHttpRequest.open(method, url) (#218): the first argument is an HTTP method,
# not a URL. Without the fix "GET" is captured as a link and fetched (the offline
# fixture saves a bare file named GET; a live server mangles it to GET.html).
# window.open(url) detection must be unaffected.
site7="$tmp/xhropen"
mkdir -p "$site7"
gif "$site7/winopen.gif"
cat >"$site7/index.html" <<EOF
<html><body><script>
var x = new XMLHttpRequest();
x.open("GET", "ajax_info.txt");
var y = new XMLHttpRequest();
y.open("Post", "submit.cgi");
window.open("file://$site7/winopen.gif");
</script></body></html>
EOF
out7="$tmp/xhropen-out"
crawl "$site7/index.html" "$out7"
# negative control: without the fix a file named exactly GET is downloaded
notfound "GET" "$out7"
# methods are matched case-insensitively (XHR spec normalizes them): a mixed-case
# method is rejected too, so a file named Post must not appear either
notfound "Post" "$out7"
# regression guard: window.open(url) is still detected, so its absolute URL is
# rewritten to a local link. The rewrite only happens if the parser saw it, so
# these two assertions fail if .open detection broke (not a trivial --near save).
saved7=$(savedhtml "$out7")
test -n "$saved7" || ! echo "FAIL: saved xhr page not found" || exit 1
grep -Fq 'window.open("winopen.gif")' "$saved7" ||
! echo "FAIL #218: window.open(url) no longer detected/rewritten" || exit 1
! grep -Fq 'window.open("file://' "$saved7" ||
! echo "FAIL #218: window.open URL left absolute (not rewritten)" || exit 1
# Parens in an unquoted url(...) (#163): the source %28/%29 decode to literal
# '(' ')' in the saved name, but a literal ')' in the rewritten url() closes the
# token early, so they must stay encoded. Negative control: without the fix the
# %281%29 greps fail (parens are RFC2396 "mark" chars the escaper leaves alone).
site8="$tmp/cssparens"
mkdir -p "$site8"
for f in 'img (1).gif' 'a(b)c(1).gif' 'q (4).gif'; do gif "$site8/$f"; done
cat >"$site8/style.css" <<'EOF'
.a { background: url(img%20%281%29.gif); }
.b { background: url(a%28b%29c%281%29.gif); }
.c { background: url("q%20%284%29.gif"); }
EOF
out8="$tmp/cssparens-out"
crawl "$site8/style.css" "$out8"
found "img (1).gif" "$out8"
found "a(b)c(1).gif" "$out8"
found "q (4).gif" "$out8"
css8=$(find "$out8" -type f -path '*/file/*' -name style.css -print -quit)
test -n "$css8" || ! echo "FAIL: saved style.css not found" || exit 1
grep -Fq 'url(img%20%281%29.gif)' "$css8" ||
! echo "FAIL #163: parens in unquoted url() not percent-encoded on rewrite" || exit 1
grep -Fq 'url(a%28b%29c%281%29.gif)' "$css8" ||
! echo "FAIL #163: not every paren in a url() was percent-encoded" || exit 1
grep -Fq 'url("q%20%284%29.gif")' "$css8" ||
! echo "FAIL #163: quoted url() altered or parens left literal on rewrite" || exit 1
# The url() detector is not CSS-specific: <script> and inline style= get the
# same encoding, but ordinary href/src (ending_p is the quote, not ')') keep
# literal parens -- the attribute checks guard the gate against over-firing.
site9="$tmp/urlparens"
mkdir -p "$site9"
for f in 'js (1).gif' 'inl (2).gif' 'asrc (3).gif' 'ahref (4).gif'; do gif "$site9/$f"; done
cat >"$site9/index.html" <<EOF
<html><body>
<script>var bg = "url(js%20%281%29.gif)";</script>
<div style="background-image:url(inl%20%282%29.gif)"></div>
<img src="asrc%20%283%29.gif">
<a href="ahref%20%284%29.gif">link</a>
</body></html>
EOF
out9="$tmp/urlparens-out"
crawl "$site9/index.html" "$out9"
saved9=$(savedhtml "$out9")
test -n "$saved9" || ! echo "FAIL: saved urlparens page not found" || exit 1
# rewrite-only: the JS-string asset is not queued for download
grep -Fq 'url(js%20%281%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in <script> url() not percent-encoded" || exit 1
found "inl (2).gif" "$out9"
grep -Fq 'url(inl%20%282%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in inline style url() not percent-encoded" || exit 1
found "asrc (3).gif" "$out9"
found "ahref (4).gif" "$out9"
grep -Fq 'src="asrc%20(3).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain src attribute were wrongly encoded" || exit 1
grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain href attribute were wrongly encoded" || exit 1
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
exit 0

tests/01_engine-relative.test Executable file

@@ -0,0 +1,68 @@
#!/bin/bash
#
# lienrelatif (build relative path) + ident_url_relatif (resolve a link, collapse
# ./ and ../). Regression net for #137/#162; expected values hand-computed.
set -euo pipefail
# relative path from <curr>'s directory to <link>
rel() {
local got
got=$(httrack -O /dev/null -#l "$1" "$2")
test "$got" == "relative=$3" ||
{
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
exit 1
}
}
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
ident() {
local got
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
test "$got" == "$4" ||
{
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"
exit 1
}
}
### lienrelatif
rel 'dir/page.html' 'dir/index.html' 'page.html'
rel 'dir/page.html' 'dir/page.html' 'page.html' # self-link
rel 'a.html' 'dir/index.html' '../a.html'
rel 'x.html' 'a/b/c/index.html' '../../../x.html'
rel 'h/a/x.jpg' 'h/a/sub/page.html' '../x.jpg'
rel 'a/b/c/x.html' 'index.html' 'a/b/c/x.html'
rel 'h/sub/x.jpg' 'h/page.html' 'sub/x.jpg'
rel 'h/dir2/x.jpg' 'h/dir1/page.html' '../dir2/x.jpg' # sibling dir
rel 'h/bc/x.jpg' 'h/b/page.html' '../bc/x.jpg' # b/bc prefix trap
rel 'h/b/x.jpg' 'h/bc/page.html' '../b/x.jpg'
rel 'h2/img/x.jpg' 'h1/p/page.html' '../../h2/img/x.jpg' # cross-host
rel 'img.cdn/photo.jpg' 'www.site/articles/2020/post.html' '../../../img.cdn/photo.jpg'
rel 'h/a/' 'h/a/sub/page.html' '../' # link is ancestor dir
rel 'x.html' 'page.html' 'x.html'
rel 'dir/page.html?x=1' 'dir/index.html?y=2' 'page.html' # ? stripped
### ident_url_relatif
ident 'img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/sub/img.gif'
ident '/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/img.gif'
# embedded ../ collapses (#137)
ident '../img.gif' 'www.foo.com' '/dir/sub/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/articles/2020/logo.png'
ident '../../pix/sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/pix/logo.png'
ident '../../../../x.gif' 'www.foo.com' '/a/b/page.html' 'adr=www.foo.com fil=/x.gif' # above-root clamp
ident '?page=2' 'www.foo.com' '/dir/index.html?old=1' 'adr=www.foo.com fil=/dir/index.html?page=2'
ident 'http://other.com/a/b/../c/index.html' 'www.foo.com' '/p.html' 'adr=other.com fil=/a/c/index.html'
# file:// collapses ../ like the other schemes; traversal contained, // authority kept
ident 'file:///var/data/pix/sub/../logo.png' 'www.foo.com' '/p.html' 'adr=file:// fil=/var/data/pix/logo.png'
ident 'file:///a/b/c/../../d/e.gif' 'www.foo.com' '/p.html' 'adr=file:// fil=/a/d/e.gif'
ident 'file:///a/../../b' 'www.foo.com' '/p.html' 'adr=file:// fil=/b'
ident 'file://srv/share/../x' 'www.foo.com' '/p.html' 'adr=file:// fil=//srv/x'
ident 'mailto:foo@bar.com' 'www.foo.com' '/p.html' 'error=-1' # unsupported scheme
ident 'javascript:void(0)' 'www.foo.com' '/p.html' 'error=-1'
echo "OK"
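
Note on the `rel` cases above: the sketch below is a standalone bash approximation of the rule those assertions describe (strip the query, drop the common leading directories, climb with `../`, then descend). It is not HTTrack's `lienrelatif`; `relative_link` is a hypothetical name, and the argument order mirrors the `rel <link> <curr>` calls in the test.

```shell
#!/bin/bash
# Sketch only, NOT HTTrack's lienrelatif: relative path from the
# directory of <curr> ($2) to <link> ($1).
relative_link() {
	local link=${1%%\?*} curr=${2%%\?*} # '?' and what follows is ignored
	local file result="" i=0 j
	local -a ldirs=() cdirs=()
	file=${link##*/} # empty when the link is a directory ('h/a/')
	[[ $link == */* ]] && IFS=/ read -ra ldirs <<<"${link%/*}"
	[[ $curr == */* ]] && IFS=/ read -ra cdirs <<<"${curr%/*}"
	# skip the directories both paths share (whole segments, so
	# 'b' and 'bc' are different)
	while ((i < ${#ldirs[@]} && i < ${#cdirs[@]})) &&
		[[ ${ldirs[i]} == "${cdirs[i]}" ]]; do
		i=$((i + 1))
	done
	# climb out of what is left of curr, then descend into link
	for ((j = i; j < ${#cdirs[@]}; j++)); do result+=../; done
	for ((j = i; j < ${#ldirs[@]}; j++)); do result+=${ldirs[j]}/; done
	printf '%s\n' "$result$file"
}

relative_link 'h/bc/x.jpg' 'h/b/page.html'
```

Comparing whole segments rather than string prefixes is what keeps the "b/bc prefix trap" case from collapsing to a wrong path.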


@@ -5,7 +5,7 @@ set -euo pipefail
# path simplify engine (fil_simplifie): collapses ./ and ../ segments.
simp() {
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
}
simp './foo/bar/' 'foo/bar/'
@@ -26,3 +26,17 @@ simp './a/../../b' 'b'
# empty segments ('//') are not dot-segments and are preserved, per RFC 3986
simp 'a//b' 'a//b'
simp 'a//b/../c' 'a//c'
# absolute paths keep the leading '/'; above-root '..' is clamped to it
simp '/a/../b' '/b'
simp '/a/../../b' '/b'
simp '/../x' '/x'
# collapses to nothing -> './' (relative) or '/' (absolute)
simp '..' './'
simp 'a/..' './'
simp '/' '/'
simp 'a/b/..' 'a/' # trailing bare '..'
simp 'a/../b?x=../y' 'b?x=../y' # '?' freezes simplification
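
Note on the `simp` cases above: the following bash sketch implements only the rules these assertions spell out (drop `.`, pop on `..`, clamp above the top, keep empty segments, freeze at `?`). It is not HTTrack's `fil_simplifie`, and `simplify_path` is a hypothetical name used for illustration.

```shell
#!/bin/bash
# Sketch only, NOT HTTrack's fil_simplifie: collapse ./ and ../ segments.
simplify_path() {
	local path=$1 query="" lead="" trail="" joined seg rest
	local -a segs=() out=()
	# '?' freezes simplification: set the query aside untouched
	if [[ $path == *\?* ]]; then
		rest=${path#*\?}
		query="?$rest"
		path=${path%%\?*}
	fi
	# remember a leading '/' (absolute path) and strip it before splitting
	if [[ $path == /* ]]; then
		lead=/
		path=${path#/}
	fi
	# a trailing '/', '.' or '..' means the result names a directory
	case $path in */ | . | .. | */. | */..) trail=/ ;; esac
	IFS=/ read -ra segs <<<"$path"
	for seg in "${segs[@]}"; do
		case $seg in
		.) ;;
		..)
			# pop one segment; with nothing left to pop, clamp
			if ((${#out[@]} > 0)); then
				unset "out[$((${#out[@]} - 1))]"
			fi
			;;
		*) out+=("$seg") ;; # also keeps empty segments from '//'
		esac
	done
	joined=$(
		IFS=/
		printf '%s' "${out[*]}"
	)
	test -n "$joined" && joined+=$trail
	joined=$lead$joined
	# a path that collapses to nothing becomes './'
	printf '%s%s\n' "${joined:-./}" "$query"
}

simplify_path '/a/../../b'
```

The query is split off before any segment handling, which is why `b?x=../y` keeps its `../` intact.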


@@ -21,9 +21,15 @@ test "$out" == "strsafe: OK" || exit 1
# the bounded macro aborts (non-zero exit), so don't let set -e trip on it
err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
case "$err" in
*"strsafe: NOT aborted"*) echo "over-capacity write was NOT caught" >&2; exit 1 ;;
*"overflow while copying"*) ;;
*) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;;
*"strsafe: NOT aborted"*)
echo "over-capacity write was NOT caught" >&2
exit 1
;;
*"overflow while copying"*) ;;
*)
echo "expected htssafe overflow abort, got: $err" >&2
exit 1
;;
esac
# Same guarantee for the htsbuff builder. The source is exactly the buffer
@@ -32,7 +38,13 @@ esac
# aborted"). Match the specific htsbuff abort message, not just any assert.
err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
case "$err" in
*"strsafe: NOT aborted"*) echo "htsbuff over-capacity write was NOT caught" >&2; exit 1 ;;
*"htsbuff append overflow"*) ;;
*) echo "expected htsbuff overflow abort, got: $err" >&2; exit 1 ;;
*"strsafe: NOT aborted"*)
echo "htsbuff over-capacity write was NOT caught" >&2
exit 1
;;
*"htsbuff append overflow"*) ;;
*)
echo "expected htsbuff overflow abort, got: $err" >&2
exit 1
;;
esac


@@ -3,6 +3,6 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 5 httrack http://ut.httrack.com/simple/basic.html
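
Note on the `cmd || ! echo ... || exit 77` line above: the idiom relies on `!` turning the successful `echo` into a failure, so the chain falls through to `exit`. The sketch below reproduces it in isolation. `check` is a made-up stand-in for `check-network.sh`, and treating 77 as a "skipped" status is an assumption based on the usual Automake test-harness convention.

```shell
#!/bin/bash
# Hypothetical stand-in for check-network.sh: succeeds only for "up".
check() { test "$1" = "up"; }

run_case() {
	# Subshell, so that `exit` ends this case only and not the demo.
	(
		# check fails -> echo prints and succeeds -> `!` makes that a
		# failure -> the last `||` branch runs and exits with 77.
		check "$1" || ! echo "skipping online unit tests" || exit 77
		echo "running tests"
	)
}

run_case up && echo "status=$?"
run_case down || echo "status=$?"
```

Because the whole chain is a single `||` list, it also stays safe under `set -e`: a failing `check` never aborts the script before the message is printed.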


@@ -3,10 +3,10 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/cookies/third.html \
--found ut.httrack.com/cookies/second.html \
--found ut.httrack.com/cookies/entrance.html \
httrack http://ut.httrack.com/cookies/entrance.php
--found ut.httrack.com/cookies/third.html \
--found ut.httrack.com/cookies/second.html \
--found ut.httrack.com/cookies/entrance.html \
httrack http://ut.httrack.com/cookies/entrance.php


@@ -3,21 +3,21 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests
bash crawl-test.sh \
--errors 1 --files 5 \
--found 'café.ut.httrack.com/unicode-links/café3860.html' \
--found 'café.ut.httrack.com/unicode-links/café30f4.html' \
--found 'café.ut.httrack.com/unicode-links/café5e1f.html' \
--found 'café.ut.httrack.com/unicode-links/café7b30.html' \
httrack 'http://ut.httrack.com/unicode-links/idna.html' \
'+*.ut.httrack.com/*' --robots=0
--errors 1 --files 5 \
--found 'café.ut.httrack.com/unicode-links/café3860.html' \
--found 'café.ut.httrack.com/unicode-links/café30f4.html' \
--found 'café.ut.httrack.com/unicode-links/café5e1f.html' \
--found 'café.ut.httrack.com/unicode-links/café7b30.html' \
httrack 'http://ut.httrack.com/unicode-links/idna.html' \
'+*.ut.httrack.com/*' --robots=0
# unicode tests (bogus links)
bash crawl-test.sh \
--errors 0 --files 1 \
--found 'ut.httrack.com/unicode-links/idna_bogus.html' \
httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \
'-*' --robots=0
--errors 0 --files 1 \
--found 'ut.httrack.com/unicode-links/idna_bogus.html' \
httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \
'-*' --robots=0


@@ -3,67 +3,67 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests
bash crawl-test.sh \
--errors 1 --files 10 \
--found ut.httrack.com/unicode-links/caf%a91bce.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café463e.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/caféae52.html \
--found ut.httrack.com/unicode-links/caféc009.html \
--found ut.httrack.com/unicode-links/utf8.html \
httrack http://ut.httrack.com/unicode-links/utf8.html
--errors 1 --files 10 \
--found ut.httrack.com/unicode-links/caf%a91bce.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café463e.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/caféae52.html \
--found ut.httrack.com/unicode-links/caféc009.html \
--found ut.httrack.com/unicode-links/utf8.html \
httrack http://ut.httrack.com/unicode-links/utf8.html
bash crawl-test.sh \
--errors 4 --files 7 \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caf%e939bd.html \
--found ut.httrack.com/unicode-links/caf%e9ae52.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/default.html \
httrack http://ut.httrack.com/unicode-links/default.html
--errors 4 --files 7 \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caf%e939bd.html \
--found ut.httrack.com/unicode-links/caf%e9ae52.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/default.html \
httrack http://ut.httrack.com/unicode-links/default.html
bash crawl-test.sh \
--errors 2 --files 9 \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café647f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/iso88591.html \
httrack http://ut.httrack.com/unicode-links/iso88591.html
--errors 2 --files 9 \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café647f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/iso88591.html \
httrack http://ut.httrack.com/unicode-links/iso88591.html
bash crawl-test.sh \
--errors 4 --files 9 \
--found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/cafébf43.html \
--found ut.httrack.com/unicode-links/cafédcd8.html \
--found ut.httrack.com/unicode-links/café2461.html \
--found ut.httrack.com/unicode-links/caf%a8%a61bce.html \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/gb18030.html \
httrack http://ut.httrack.com/unicode-links/gb18030.html
--errors 4 --files 9 \
--found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/cafébf43.html \
--found ut.httrack.com/unicode-links/cafédcd8.html \
--found ut.httrack.com/unicode-links/café2461.html \
--found ut.httrack.com/unicode-links/caf%a8%a61bce.html \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/gb18030.html \
httrack http://ut.httrack.com/unicode-links/gb18030.html


@@ -3,10 +3,10 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=42&can=1
# we expect only 2 errors because the other links are too long (to be modified if suitable)
bash crawl-test.sh --errors 2 --files 1 \
--found ut.httrack.com/overflow/longquerywithaccents.html \
httrack http://ut.httrack.com/overflow/longquerywithaccents.php
--found ut.httrack.com/overflow/longquerywithaccents.html \
httrack http://ut.httrack.com/overflow/longquerywithaccents.php


@@ -3,45 +3,45 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=4&can=1
bash crawl-test.sh --errors 0 --files 4 \
--found ut.httrack.com/parsing/back5e1f.gif \
--found ut.httrack.com/parsing/events.html \
--found ut.httrack.com/parsing/fade230f4.gif \
--found ut.httrack.com/parsing/fade3860.gif \
httrack http://ut.httrack.com/parsing/events.html
--found ut.httrack.com/parsing/back5e1f.gif \
--found ut.httrack.com/parsing/events.html \
--found ut.httrack.com/parsing/fade230f4.gif \
--found ut.httrack.com/parsing/fade3860.gif \
httrack http://ut.httrack.com/parsing/events.html
# http://code.google.com/p/httrack/issues/detail?id=2&can=1
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/background-image.css \
--found ut.httrack.com/parsing/background-image.html \
--found ut.httrack.com/parsing/fade.gif \
httrack http://ut.httrack.com/parsing/background-image.html
--found ut.httrack.com/parsing/background-image.css \
--found ut.httrack.com/parsing/background-image.html \
--found ut.httrack.com/parsing/fade.gif \
httrack http://ut.httrack.com/parsing/background-image.html
# javascript parsing
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/back.gif \
--found ut.httrack.com/parsing/fade.gif \
--found ut.httrack.com/parsing/javascript.html \
httrack http://ut.httrack.com/parsing/javascript.html
--found ut.httrack.com/parsing/back.gif \
--found ut.httrack.com/parsing/fade.gif \
--found ut.httrack.com/parsing/javascript.html \
httrack http://ut.httrack.com/parsing/javascript.html
# handling of + before query string
bash crawl-test.sh --errors 0 --files 6 \
--found ut.httrack.com/parsing/escaping.html \
--found "ut.httrack.com/parsing/foo bar30f4.html" \
--found "ut.httrack.com/parsing/foo bar5e1f.html" \
--found "ut.httrack.com/parsing/foo+bar+plus3860.html" \
--found "ut.httrack.com/parsing/foo barae52.html" \
--found "ut.httrack.com/parsing/foo bar7b30.html" \
httrack http://ut.httrack.com/parsing/escaping.html
--found ut.httrack.com/parsing/escaping.html \
--found "ut.httrack.com/parsing/foo bar30f4.html" \
--found "ut.httrack.com/parsing/foo bar5e1f.html" \
--found "ut.httrack.com/parsing/foo+bar+plus3860.html" \
--found "ut.httrack.com/parsing/foo barae52.html" \
--found "ut.httrack.com/parsing/foo bar7b30.html" \
httrack http://ut.httrack.com/parsing/escaping.html
# handling of # encoded in filename
# see http://code.google.com/p/httrack/issues/detail?id=25
bash crawl-test.sh --errors 2 --files 4 \
--found "ut.httrack.com/parsing/escaping2.html" \
--found "ut.httrack.com/parsing/++foo++bar++plus++.html" \
--found "ut.httrack.com/parsing/foo#bar#.html" \
--found "ut.httrack.com/parsing/foo bar.html" \
httrack http://ut.httrack.com/parsing/escaping2.html
--found "ut.httrack.com/parsing/escaping2.html" \
--found "ut.httrack.com/parsing/++foo++bar++plus++.html" \
--found "ut.httrack.com/parsing/foo#bar#.html" \
--found "ut.httrack.com/parsing/foo bar.html" \
httrack http://ut.httrack.com/parsing/escaping2.html

@@ -3,11 +3,11 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping"
exit 77
echo "no https support compiled, skipping"
exit 77
fi
bash crawl-test.sh --errors 0 --files 5 httrack https://ut.httrack.com/simple/basic.html
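
The `|| ! echo ... || exit 77` chain at the top of these tests reads as a skip guard: when the check fails, the message is printed, the negated `echo` keeps the chain going, and the script exits with 77, the status automake's test harness reports as SKIP. A minimal sketch, assuming a hypothetical `check` function in place of check-network.sh and a subshell so that `exit` does not end the demo:

```shell
#!/bin/bash
# Hypothetical stand-in for check-network.sh: fails when $1 is "down".
check() { test "$1" != "down"; }

# guard <state>: runs the skip chain inside a subshell, so `exit 77` only
# terminates the subshell and its status can be inspected by the caller.
guard() (
    check "$1" || ! echo "skipping online unit tests" || exit 77
    echo "running tests"
)

guard up && echo "status: 0"
guard down || echo "status: $?"
```

The first call prints "running tests" because a successful check short-circuits the whole chain; the second prints the skip message and reports 77.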

@@ -0,0 +1,136 @@
#!/bin/bash
#
# Issue #85: an https crawl must go through the configured proxy (CONNECT
# tunnel), not bypass it and hit the origin directly. Fully local: a self-signed
# TLS origin plus a logging CONNECT proxy, so no network access is needed.
set -euo pipefail
: "${top_srcdir:=..}"
if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
if ! command -v python3 >/dev/null 2>&1 || ! command -v openssl >/dev/null 2>&1; then
echo "python3/openssl missing, skipping"
exit 77
fi
server="$top_srcdir/tests/proxy-https-server.py"
tmpdir=$(mktemp -d)
pids=
cleanup() {
for pid in $pids; do
kill "$pid" 2>/dev/null || true
done
rm -rf "$tmpdir"
}
trap cleanup EXIT
# self-signed cert for the local TLS origin (httrack does not verify certs)
openssl req -x509 -newkey rsa:2048 -keyout "$tmpdir/key.pem" \
-out "$tmpdir/cert.pem" -days 2 -nodes -subj "/CN=127.0.0.1" \
>/dev/null 2>&1
cat "$tmpdir/key.pem" "$tmpdir/cert.pem" >"$tmpdir/both.pem"
# start_server <logdir> <mode>: launches a proxy+origin pair, sets $origin_port
# and $proxy_port from its announced ephemeral ports.
start_server() {
local dir="$1" mode="$2" ports
mkdir -p "$dir"
ports="$dir/ports.txt"
python3 "$server" "$tmpdir/both.pem" "$dir" "$mode" \
>"$ports" 2>"$dir/server.err" &
pids="$pids $!"
for _ in $(seq 1 100); do
grep -q "^ready" "$ports" 2>/dev/null && break
sleep 0.1
done
grep -q "^ready" "$ports" 2>/dev/null || {
echo "server ($mode) did not start" >&2
cat "$dir/server.err" >&2
exit 1
}
origin_port=$(awk '/^ORIGIN/{print $2}' "$ports")
proxy_port=$(awk '/^PROXY/{print $2}' "$ports")
}
# Run httrack, but kill it after a deadline so a hang (e.g. a missing bound on
# the proxy response) surfaces as the kill code $HANG_RC instead of stalling the
# whole job. A portable stand-in for `timeout`, which macOS lacks.
HANG_RC=137 # 128 + SIGKILL
run_crawl() {
local out="$1" proxy="$2" port="$3"
rm -rf "$out"
httrack "https://127.0.0.1:${port}/" --proxy "$proxy" \
-O "$out" -r1 -s0 --timeout=10 >"$out.log" 2>&1 &
local pid=$!
(sleep 60 && kill -9 "$pid" 2>/dev/null) &
local guard=$!
local rc=0
wait "$pid" 2>/dev/null || rc=$?
kill "$guard" 2>/dev/null || true
wait "$guard" 2>/dev/null || true
return "$rc"
}
# --- working proxy ----------------------------------------------------------
ok="$tmpdir/ok"
start_server "$ok" ok
# 1. page retrieved AND the proxy saw a CONNECT to the origin
run_crawl "$ok/out" "127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out" || {
echo "FAIL: origin page not downloaded through proxy" >&2
cat "$ok/out.log" >&2
exit 1
}
grep -q "^CONNECT 127.0.0.1:${origin_port} " "$ok/proxy.log" || {
echo "FAIL: proxy never received a CONNECT (https bypassed the proxy)" >&2
cat "$ok/proxy.log" >&2
exit 1
}
echo "OK: https tunneled through proxy via CONNECT"
# 2. authenticated proxy: creds ride the CONNECT, and NEVER reach the origin
: >"$ok/proxy.log"
: >"$ok/origin-headers.log"
run_crawl "$ok/out2" "user:secret@127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out2" || {
echo "FAIL: origin page not downloaded through authenticated proxy" >&2
exit 1
}
got=$(awk '/^AUTH Basic /{print $3}' "$ok/proxy.log" | head -1)
# base64("user:secret"); compared as a literal to stay portable (no base64 -d,
# which differs between GNU and BSD)
test "$got" == "dXNlcjpzZWNyZXQ=" || {
echo "FAIL: Proxy-Authorization not carried on CONNECT (got '$got')" >&2
cat "$ok/proxy.log" >&2
exit 1
}
if grep -qi "proxy-authorization" "$ok/origin-headers.log"; then
echo "FAIL: proxy credentials leaked to the origin through the tunnel" >&2
cat "$ok/origin-headers.log" >&2
exit 1
fi
echo "OK: proxy credentials carried on CONNECT, not leaked to origin"
# --- hostile proxy ----------------------------------------------------------
# A proxy that answers 200 then streams headers forever must not hang the crawl:
# the client bounds the response. run_crawl kills a hung httrack after 60s, so a
# missing bound surfaces as $HANG_RC here.
flood="$tmpdir/flood"
start_server "$flood" flood
rc=0
run_crawl "$flood/out" "127.0.0.1:${proxy_port}" "$origin_port" || rc=$?
test "$rc" -ne "$HANG_RC" || {
echo "FAIL: crawl hung on a flooding proxy (bounded read missing)" >&2
exit 1
}
grep -rq "ORIGIN-PAGE-85" "$flood/out" 2>/dev/null && {
echo "FAIL: flooding proxy unexpectedly served the page" >&2
exit 1
}
echo "OK: bounded proxy response, no hang on a flooding proxy"
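
The deadline guard in run_crawl can be tried in isolation. The sketch below wraps the same wait/kill sequence in a hypothetical `run_with_deadline` helper (the name and the short deadlines are assumptions for the demo, not part of the test). It also redirects the guard's output so a leftover `sleep` does not keep a pipe open:

```shell
#!/bin/bash
HANG_RC=137 # 128 + SIGKILL (signal 9)

# run_with_deadline <seconds> <command...>
# Runs the command in the background, SIGKILLs it if it outlives the deadline,
# and returns the command's exit status (HANG_RC when it had to be killed).
run_with_deadline() {
    local deadline="$1"
    shift
    "$@" &
    local pid=$!
    (sleep "$deadline" && kill -9 "$pid" 2>/dev/null) >/dev/null 2>&1 &
    local guard=$!
    local rc=0
    wait "$pid" 2>/dev/null || rc=$?
    # The command finished (or was killed): the guard is no longer needed.
    kill "$guard" 2>/dev/null || true
    wait "$guard" 2>/dev/null || true
    return "$rc"
}

rc=0
run_with_deadline 2 true || rc=$?
echo "fast command: rc=$rc"

rc=0
run_with_deadline 1 sleep 30 || rc=$?
echo "hung command: rc=$rc"
```

The fast command reports 0 and the hung one reports 137, which is the distinction the flooding-proxy check relies on.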

@@ -2,6 +2,7 @@
# explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would
# silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT =
@@ -24,6 +25,7 @@ TESTS = \
01_engine-cache-golden.test \
01_engine-charset.test \
01_engine-cmdline.test \
01_engine-cookies.test \
01_engine-copyopt.test \
01_engine-doitlog.test \
01_engine-entities.test \
@@ -33,6 +35,7 @@ TESTS = \
01_engine-mime.test \
01_engine-parse.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-simplify.test \
01_engine-strsafe.test \
02_manpage-regen.test \
@@ -43,6 +46,7 @@ TESTS = \
11_crawl-international.test \
11_crawl-longurl.test \
11_crawl-parsing.test \
12_crawl_https.test
12_crawl_https.test \
13_crawl_proxy_https.test
CLEANFILES = check-network_sh.cache

@@ -6,39 +6,39 @@
# do not enable online tests (./configure --disable-online-unit-tests)
if test "$ONLINE_UNIT_TESTS" == "no"; then
echo "online tests are disabled" >&2
exit 1
echo "online tests are disabled" >&2
exit 1
# enable online tests (--enable-online-unit-tests)
elif test "$ONLINE_UNIT_TESTS" == "yes"; then
exit 0
exit 0
# check if online tests are reachable
else
# test url
url=http://ut.httrack.com/enabled
# test url
url=http://ut.httrack.com/enabled
# cache file name
cache=check-network_sh.cache
# cache file name
cache=check-network_sh.cache
# cached result ?
if test -f $cache ; then
if grep -q "ok" $cache ; then
exit 0
else
echo "online tests are disabled (cached)" >&2
exit 1
fi
# cached result ?
if test -f $cache; then
if grep -q "ok" $cache; then
exit 0
else
echo "online tests are disabled (cached)" >&2
exit 1
fi
# fetch single file
elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null ; then
echo "ok" > $cache
exit 0
else
echo "error" > $cache
echo "online tests are disabled (auto)" >&2
exit 1
fi
# fetch single file
elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null; then
echo "ok" >$cache
exit 0
else
echo "error" >$cache
echo "online tests are disabled (auto)" >&2
exit 1
fi
fi

@@ -2,185 +2,184 @@
#
function warning {
echo "** $*" >&2
return 0
echo "** $*" >&2
return 0
}
function die {
warning "$*"
exit 1
warning "$*"
exit 1
}
function debug {
if test -n "$verbose"; then
echo "$*" >&2
fi
if test -n "$verbose"; then
echo "$*" >&2
fi
}
function info {
printf "[$*] ..\t" >&2
printf '[%s] ..\t' "$*" >&2
}
function result {
echo "$*" >&2
echo "$*" >&2
}
function cleanup {
debug "cleaning function called"
if test -n "$tmpdir"; then
if test -d "$tmpdir"; then
if test -z "$nopurge"; then
debug "cleaning up $tmpdir"
rm -rf "$tmpdir"
fi
debug "cleaning function called"
if test -n "$tmpdir"; then
if test -d "$tmpdir"; then
if test -z "$nopurge"; then
debug "cleaning up $tmpdir"
rm -rf "$tmpdir"
fi
fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi
}
function usage {
cat << EOF
cat <<EOF
usage: $0
EOF
}
function assert_equals {
info "$1"
if test ! "$2" == "$3"; then
result "expected '$2', got '$3'"
exit 1
else
result "OK ($2)"
fi
info "$1"
if test ! "$2" == "$3"; then
result "expected '$2', got '$3'"
exit 1
else
result "OK ($2)"
fi
}
function start-crawl {
# parse args
pos=1
while test "$#" -ge "$pos" ; do
case "${!pos}" in
--debug)
verbose=1
;;
--no-purge|--summary|--print-files)
;;
--errors|--files|--found|--not-found|--directory)
pos=$[${pos}+1]
test "$#" -ge "$pos" || warning "missing argument" || return 1
;;
httrack)
pos=$[${pos}+1]
break;
;;
*)
warning "unrecognized option ${!pos}"
return 1
;;
esac
pos=$[${pos}+1]
done
debug "remaining args: ${@:${pos}}"
# parse args
pos=1
while test "$#" -ge "$pos"; do
case "${!pos}" in
--debug)
verbose=1
;;
--no-purge | --summary | --print-files) ;;
--errors | --files | --found | --not-found | --directory)
pos=$((pos + 1))
test "$#" -ge "$pos" || warning "missing argument" || return 1
;;
httrack)
pos=$((pos + 1))
break
;;
*)
warning "unrecognized option ${!pos}"
return 1
;;
esac
pos=$((pos + 1))
done
debug "remaining args: ${*:pos}"
# ut/ won't exceed 2 minutes
moreargs="--quiet --max-time=120 --timeout=30 --connection-per-second=5"
# ut/ won't exceed 2 minutes
moreargs=(--quiet --max-time=120 --timeout=30 --connection-per-second=5)
# proxy environment ?
if test -n "$http_proxy"; then
moreargs="$moreargs --proxy $http_proxy"
fi
# proxy environment ?
if test -n "${http_proxy:-}"; then
moreargs+=(--proxy "$http_proxy")
fi
test -n "$tmpdir" || ! warning "no tmpdir" || return 1
tmp="${tmpdir}/crawl"
rm -rf "$tmp"
mkdir "$tmp" || ! warning "could not create $tmp" || return 1
which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" ${moreargs} ${@:${pos}}
info "running httrack ${@:${pos}}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" ${moreargs} ${@:${pos}} >"${log}" 2>&1 &
crawlpid="$!"
debug "started crawler on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" \
| sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break;
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
test -n "$tmpdir" || ! warning "no tmpdir" || return 1
tmp="${tmpdir}/crawl"
rm -rf "$tmp"
else
tmpdir=
fi
mkdir "$tmp" || ! warning "could not create $tmp" || return 1
which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" "${moreargs[@]}" "${@:pos}"
info "running httrack ${*:pos}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" "${moreargs[@]}" "${@:pos}" >"${log}" 2>&1 &
crawlpid="$!"
debug "started crawler on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking for absence of $1"
# the file must NOT exist for this check to pass
if test ! -f "${tmp}/$1"; then
result "OK"
else
result "unexpectedly found"
exit 1
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" |
sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
rm -rf "$tmp"
else
tmpdir=
fi
}
# check args
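
The switch of `moreargs` from a string to an array matters as soon as one value contains whitespace. A small sketch of the difference, assuming a hypothetical proxy value with a space in it and a helper `count_args` that only reports how many arguments it received:

```shell
#!/bin/bash
# count_args prints the number of arguments it was called with.
count_args() { echo "$#"; }

proxy="user:my secret@127.0.0.1:3128" # hypothetical value containing a space

# String accumulation: the unquoted expansion is split on whitespace, so the
# proxy value falls apart into two words.
as_string="--quiet --proxy $proxy"
# shellcheck disable=SC2086
count_args $as_string # prints 4

# Array accumulation: one element per argument, whatever it contains.
as_array=(--quiet)
as_array+=(--proxy "$proxy")
count_args "${as_array[@]}" # prints 3
```

Quoting `"${as_array[@]}"` expands to exactly one word per element, which is why the array form needs no shellcheck disables.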
@@ -195,7 +194,7 @@ tmpdir=
crawlpid=
nopurge=
verbose=
trap "cleanup" 0 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
trap cleanup EXIT HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# working directory
tmpdir="${tmptopdir}/httrack_ut.$$"
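
Signal names make the trap list readable without a lookup table. The sketch below prints the names behind a few of the old numbers, then shows an EXIT trap removing a temporary directory. The `with_tmpdir` function is hypothetical and only illustrates the pattern; note that its handler does not exit on HUP/INT/TERM, it just cleans up.

```shell
#!/bin/bash
# Map a few of the old numeric entries to names. 1, 2, 9 and 15 mean the same
# on Linux and the BSDs; other numbers vary by platform.
for n in 1 2 9 15; do
    printf '%s -> %s\n' "$n" "$(kill -l "$n")"
done

# with_tmpdir runs in a subshell (note the parentheses around the body), so
# its EXIT trap fires as soon as the function finishes. It prints the
# directory it created.
with_tmpdir() (
    dir=$(mktemp -d)
    cleanup() { rm -rf "$dir"; }
    trap cleanup EXIT HUP INT TERM
    echo "$dir"
)

dir=$(with_tmpdir)
if test -d "$dir"; then
    echo "still present: $dir"
else
    echo "removed by the EXIT trap"
fi
```

Signal 9 maps to KILL, which cannot be trapped, so listing it in a trap has no effect.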

tests/proxy-https-server.py

@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Local CONNECT proxy + self-signed HTTPS origin for the issue #85 test.
Starts a TLS origin server and an HTTP proxy that honours CONNECT, on ephemeral
ports. Every request line the proxy receives (and any Proxy-Authorization) is
appended to the proxy log; every header the origin receives over the tunnel is
appended to the origin log. That lets the test assert both that an https crawl
tunneled through the proxy and that proxy credentials never leaked to the origin.
Proxy modes (argv[3], default "ok"):
ok - honour CONNECT and tunnel to the origin
flood - answer 200 then stream headers forever with no blank line, to exercise
the client's bound on the proxy response (must not hang the crawl)
Usage: proxy-https-server.py <cert.pem> <logdir> [mode]
Prints "ORIGIN <port>", "PROXY <port>", then "ready" (one per line) on stdout.
"""
import http.server
import os
import socket
import socketserver
import ssl
import sys
import threading
ORIGIN_BODY = b"<html><body>ORIGIN-PAGE-85</body></html>"
PROXY_LOG = "proxy.log"
ORIGIN_LOG = "origin-headers.log"
def make_origin(logdir):
class Origin(http.server.BaseHTTPRequestHandler):
def do_GET(self):
with open(os.path.join(logdir, ORIGIN_LOG), "a") as handle:
for key in self.headers.keys():
handle.write(key + "\n")
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.send_header("Content-Length", str(len(ORIGIN_BODY)))
self.end_headers()
self.wfile.write(ORIGIN_BODY)
def log_message(self, *args):
pass
return Origin
def start_origin(certfile, logdir):
httpd = socketserver.TCPServer(("127.0.0.1", 0), make_origin(logdir))
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile)
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
port = httpd.socket.getsockname()[1]
threading.Thread(target=httpd.serve_forever, daemon=True).start()
return port
def pipe(src, dst):
try:
while True:
data = src.recv(65536)
if not data:
break
dst.sendall(data)
except OSError:
pass
finally:
for sock in (src, dst):
try:
sock.shutdown(socket.SHUT_RDWR)
except OSError:
pass
def handle_client(conn, logdir, mode):
rfile = conn.makefile("rb")
request_line = rfile.readline().decode("latin-1").strip()
auth = None
while True:
line = rfile.readline().decode("latin-1")
if line in ("\r\n", "\n", ""):
break
key, _, value = line.partition(":")
if key.strip().lower() == "proxy-authorization":
auth = value.strip()
with open(os.path.join(logdir, PROXY_LOG), "a") as handle:
handle.write(request_line + "\n")
if auth is not None:
handle.write("AUTH " + auth + "\n")
parts = request_line.split()
if not (len(parts) >= 2 and parts[0] == "CONNECT"):
conn.sendall(b"HTTP/1.0 501 Not Implemented\r\n\r\n")
conn.close()
return
if mode == "flood":
# 200, then an endless header stream with no terminating blank line: the
# client must bound this and give up, not hang.
try:
conn.sendall(b"HTTP/1.0 200 Connection established\r\n")
while True:
conn.sendall(b"X-Pad: 0123456789\r\n")
except OSError:
pass
conn.close()
return
host, _, port = parts[1].partition(":")
try:
upstream = socket.create_connection((host, int(port or 443)))
except OSError:
conn.sendall(b"HTTP/1.0 502 Bad Gateway\r\n\r\n")
conn.close()
return
conn.sendall(b"HTTP/1.0 200 Connection established\r\n\r\n")
threading.Thread(target=pipe, args=(conn, upstream), daemon=True).start()
pipe(upstream, conn)
def start_proxy(logdir, mode):
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(16)
port = srv.getsockname()[1]
def serve():
while True:
conn, _ = srv.accept()
threading.Thread(
target=handle_client, args=(conn, logdir, mode), daemon=True
).start()
threading.Thread(target=serve, daemon=True).start()
return port
def main():
certfile, logdir = sys.argv[1], sys.argv[2]
mode = sys.argv[3] if len(sys.argv) > 3 else "ok"
for name in (PROXY_LOG, ORIGIN_LOG):
open(os.path.join(logdir, name), "w").close()
origin_port = start_origin(certfile, logdir)
proxy_port = start_proxy(logdir, mode)
print("ORIGIN %d" % origin_port, flush=True)
print("PROXY %d" % proxy_port, flush=True)
print("ready", flush=True)
threading.Event().wait()
if __name__ == "__main__":
main()
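
To see how the test reads the proxy log this server writes, the sketch below fabricates a two-line log and applies the same `awk` extraction. The request line (port 8443, HTTP/1.0) is a made-up sample, not captured output, and the script assumes a `base64` utility on PATH, used for encoding only.

```shell
#!/bin/bash
# Basic credentials are base64("user:password"); printf avoids a trailing
# newline sneaking into the encoded value.
token=$(printf '%s' "user:secret" | base64)

# Hypothetical proxy.log: the request line, then the "AUTH <value>" line the
# server appends when a Proxy-Authorization header was present.
log=$(mktemp)
printf 'CONNECT 127.0.0.1:8443 HTTP/1.0\nAUTH Basic %s\n' "$token" >"$log"

# Same extraction as the test: third field of the first AUTH line.
got=$(awk '/^AUTH Basic /{print $3}' "$log" | head -1)
rm -f "$log"

if test "$got" == "dXNlcjpzZWNyZXQ="; then
    echo "credentials match"
else
    echo "unexpected token: $got"
fi
```

Comparing against the literal keeps the real test free of any decoding step, which is the portability point made in its comment.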

@@ -2,19 +2,19 @@
#
error=0
for i in *.test ; do
if bash $i ; then
echo "$i: passed" >&2
else
echo "$i: ERROR" >&2
error=$[${error}+1]
fi
for i in *.test; do
if bash "$i"; then
echo "$i: passed" >&2
else
echo "$i: ERROR" >&2
error=$((error + 1))
fi
done
if test "$error" -eq 0; then
echo "all tests passed" >&2
echo "all tests passed" >&2
else
echo "${error} test(s) failed" >&2
echo "${error} test(s) failed" >&2
fi
exit $error
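
The `$[...]` form replaced above is an obsolete bash spelling of arithmetic expansion; `$((...))` is the POSIX one, and variables inside it need no `$` or braces. A minimal sketch of the counting loop, using a hypothetical `count_failures` helper fed with exit statuses instead of real test scripts:

```shell
#!/bin/bash
# count_failures <status...>: counts the non-zero statuses, the way the loop
# above counts failing *.test scripts.
count_failures() {
    local error=0 status
    for status in "$@"; do
        # (exit N) is a subshell whose only job is to return status N.
        if ! (exit "$status"); then
            error=$((error + 1))
        fi
    done
    echo "$error"
}

failed=$(count_failures 0 77 0 1)
echo "${failed} test(s) failed"
```

As written, this loop counts any non-zero status as a failure, so a test that exits 77 to signal a skip is counted too.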