Compare commits

43 Commits

Author SHA1 Message Date
Xavier Roche
fe7041ddbf Address review: keep empty-PATH parity, fold the CI script list
Review of the array refactor flagged one behaviour divergence: splitting
PATH with `IFS=: read -ra` keeps empty fields (from doubled or leading
colons) as "" elements, where the old `echo $PATH | tr : ' '` word-split
dropped them, so the search loop would probe /htsserver. Skip the empty
fields to restore exact parity.

Also reflow the CI SHELL_SCRIPTS list as a folded block scalar, one
entry per line and sorted, so it reads cleanly; the folded value is the
same space-separated string.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 12:39:31 +02:00
Xavier Roche
f5543df1af ci: lint every shell script with shellcheck and shfmt
The lint job only covered a handful of scripts; bootstrap, build.sh, the
generators, webhttrack, the CGI search helper and the crawl/run-all test
harnesses went unchecked, and shfmt ran on three files. Now both linters
run over the whole tracked shell tree, listed once in a job-level env var
so the two steps stay in sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:37:09 +02:00
Xavier Roche
fee30aa95d Make every shell script shellcheck-clean
Fix the shellcheck findings the shfmt pass left behind, all proven
behaviour-preserving:

- Quote single-value expansions, drop the redundant ${} in arithmetic,
  add read -r, and use printf '%s' instead of variables in format
  strings, across the generators, crawl-test.sh, run-all-tests.sh and
  search.sh.
- crawl-test.sh / webhttrack: turn the deliberately word-split search
  lists into bash arrays (space-safe, no scattered disables) and replace
  the numeric trap signal lists with names, dropping the un-trappable
  KILL/STOP that bash silently ignored anyway.
- search.sh: drop the bogus \" escapes that made grep search for a
  literal-quoted pattern.

The generators are exercised by hand and ship their committed output
(htscodepages.h, htsentities.h); a differential run on synthetic input
confirms byte-identical output before and after. crawl-test.sh and
webhttrack were run end to end against a local server / a faked install,
the latter also proving the array search now survives spaces in paths.
SC2153/SC2120 false positives carry a scoped disable with a reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:35:55 +02:00
Xavier Roche
f9f4700ee1 Reformat every shell script with shfmt -i 4
Mechanical pass: run shfmt -i 4 over the whole tracked shell tree (the
test harness .test files, the regen generators, webhttrack, the CGI
search helper, and the build/dist scripts) so they share one style.
shfmt also normalised backticks to $(...) and $[..] to $((..)).

No behaviour change: arithmetic is preserved exactly, non-ASCII bytes
are untouched, and the full make check suite still passes. The tab
indented .test files become 4-space indented, hence the wide diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:24:01 +02:00
Xavier Roche
f030fa21e3 Merge pull request #401 from xroche/fix/relative-path-dotdot-137-162
Test the relative-link engine; collapse ../ in file:// URLs
2026-06-20 11:15:53 +02:00
Xavier Roche
bdd1c1bc2c Test the relative-link engine; collapse ../ in file:// URLs
The ../-handling tickets #137 (embedded ../ in a URL) and #162 (cross-host
"too many ../") do not reproduce on master or the released 3.49.x: the engine
has resolved embedded, cross-host, out-of-scope and above-root ../ correctly
since the 2012 import, and the released binary behaves identically. #137's
actual breakage was a JS-generated iframe URL (httrack can't rewrite
dynamically-built links); #162 is a long-gone Windows path quirk.

The area was nearly untested, though, despite feeding both link rewriting and
crawl-scope decisions: two trivial lienrelatif asserts, none for
ident_url_relatif. Add a wide regression net via two hidden debug probes
(-#l lienrelatif, -#i ident_url_relatif, mirroring -#1 fil_simplifie) driving
tens of cases in tests/01_engine-relative.test (embedded/cross-host/sibling/
ancestor/above-root ../, query stripping, scheme handling), plus the missing
fil_simplifie edge cases (absolute paths, root clamp, query freeze) in
01_engine-simplify.test. Expected values are computed by hand, not echoed.

While covering it, fixed one real gap: the file:// branch of
ident_url_absolute skipped the fil_simplifie its http sibling runs, so file://
URLs kept their ../ in adrfil->fil while the save path was already collapsed
(htsname.c:1343). Collapsing it matches the other schemes, contains traversal
at the file:// root, and dedups a/../b against b.
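
The collapsing behaviour described above can be sketched as follows. This is a minimal illustration, not httrack's fil_simplifie: the name collapse_dotdot is hypothetical, the sketch assumes an absolute path, and it ignores query strings entirely.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not httrack's code): collapse "." and ".." segments
   of an absolute path in place, clamping ".." at the root. */
void collapse_dotdot(char *path) {
    const char *in = path; /* read cursor, always on a '/' or the NUL */
    char *out = path;      /* write cursor, never ahead of the read cursor */

    if (*in != '/')
        return; /* assumption: only absolute paths are handled */
    while (*in == '/') {
        const char *seg = in + 1;
        const size_t len = strcspn(seg, "/");

        if (len == 2 && seg[0] == '.' && seg[1] == '.') {
            /* drop the last emitted segment; at the root there is none */
            while (out > path) {
                out--;
                if (*out == '/')
                    break;
            }
        } else if (len == 1 && seg[0] == '.') {
            /* "." names the current directory: emit nothing */
        } else if (len > 0 || seg[0] == '\0') {
            /* ordinary segment, or a trailing slash (empty last segment) */
            *out++ = '/';
            memmove(out, seg, len);
            out += len;
        }
        in = seg + len;
    }
    if (out == path)
        *out++ = '/'; /* everything collapsed: keep the root */
    *out = '\0';
}
```

With this sketch, "/a/../b" and "/b" normalise to the same string, and "/../../x" stays inside the root as "/x".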

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:14:28 +02:00
Xavier Roche
56665a268f Merge pull request #400 from xroche/fix/css-url-paren-163
Encode parens in rewritten CSS url() so the value isn't truncated (#163)
2026-06-20 10:02:32 +02:00
Xavier Roche
2e948b9acd htsparse: percent-encode parens in rewritten CSS url() (#163)
A source url(...) whose target encodes '(' ')' as %28/%29 was rewritten
with literal parens, because they are RFC2396 "mark" characters that the
URI escaper (escape_uri_utf, mode 30) leaves alone. In an unquoted CSS
url(...) the literal ')' closes the token early, so the browser mis-parses
the value and drops the background image.

Re-escape '(' and ')' back to %28/%29 when emitting the link, gated on the
url() context (ending_p == ')'). The UA decodes them to the saved-on-disk
name, so the reference still resolves. Quoted url("...") and ordinary HTML
attributes keep their parens, matching prior behavior.
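
A rough sketch of the re-escaping step, under stated assumptions: escape_css_parens is a hypothetical name, and the real change is gated on the url() context inside the parser, which this standalone helper does not model.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not httrack's code): copy src to dst, writing '(' and
   ')' as %28 and %29. Returns 0 on success, -1 if dst is too small (dst is
   then left truncated but NUL-terminated). */
int escape_css_parens(const char *src, char *dst, size_t dst_size) {
    size_t j = 0;
    size_t i;

    if (dst_size == 0)
        return -1;
    for (i = 0; src[i] != '\0'; i++) {
        const char c = src[i];
        const size_t need = (c == '(' || c == ')') ? 3 : 1;

        if (need >= dst_size - j) { /* always keep one byte for the NUL */
            dst[j] = '\0';
            return -1;
        }
        if (c == '(')
            memcpy(dst + j, "%28", 3);
        else if (c == ')')
            memcpy(dst + j, "%29", 3);
        else
            dst[j] = c;
        j += need;
    }
    dst[j] = '\0';
    return 0;
}
```

A caller would apply this only when the link is emitted inside an unquoted url(...), leaving quoted values and HTML attributes untouched.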

Test in 01_engine-parse.test crawls a CSS fixture whose url() references a
%20%28...%29 name and asserts the rewrite keeps the parens encoded;
negative control confirmed (literal-paren output fails it).

Closes #163

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 10:01:17 +02:00
Xavier Roche
cae11499f1 Merge pull request #399 from xroche/fix/js-string-falsepos-218
htsparse: don't treat XHR.open's method argument as a URL (#218)
2026-06-19 20:36:26 +02:00
Xavier Roche
02c7f4ebf6 htsparse: don't treat XHR.open's method argument as a URL (#218)
The JavaScript URL detector matched `.open(` for window.open("url",...)
and captured the first argument as a link. XMLHttpRequest.open(method,
url) puts the HTTP method first, so `xhr.open("GET", "ajax_info.txt")`
turned "GET" into a bogus link, rewritten to "GET.html" on a live server.

Reject a first argument that is exactly an HTTP method, mirroring the
existing ensure_not_mime guard. window.open(url) is unaffected; the real
XHR url (the second argument) is still picked up by the dirty parser.
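
The guard can be pictured like this. It is a sketch only: is_http_method is a hypothetical name, the method list is an assumption, and the comparison is case-sensitive because the text above says "exactly an HTTP method".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical guard (not httrack's code): return 1 when the candidate
   string of length len is exactly a well-known HTTP method name. */
int is_http_method(const char *arg, size_t len) {
    static const char *const methods[] = {"GET",     "HEAD",    "POST",
                                          "PUT",     "DELETE",  "CONNECT",
                                          "OPTIONS", "TRACE",   "PATCH"};
    size_t i;

    for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++) {
        if (strlen(methods[i]) == len && memcmp(arg, methods[i], len) == 0)
            return 1;
    }
    return 0;
}
```

A detector would call this on the first argument of `.open(` and skip the capture when it returns 1, so "GET" is rejected while "ajax_info.txt" is still eligible.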

Closes #218

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 20:27:04 +02:00
Xavier Roche
9070b44a70 Merge pull request #398 from xroche/fix/html-underflow-396
htsparse: fix buffer underflow reading *(html-1) at offset 0 (#396)
2026-06-19 19:55:40 +02:00
Xavier Roche
799c045061 htsparse: don't read *(html-1) before the parse buffer (#396)
The link detector's word-boundary guards dereference *(html-1) to check
the byte preceding a matched token. When the token sits at the very start
of the parse buffer (html == r->adr), that reads one byte before the
allocation: a heap-buffer-overflow under ASan, silent on a normal build.
A stylesheet beginning with a url() token is enough to hit it.

Route the three reachable guards (url(), location=, the makeindex /title
check) through html_prevc(), which returns a space sentinel at the buffer
start. Space is the right value for these tests: a token at offset 0 is at
a word boundary, so it stays a valid match. The other *(html-1) sites only
run after html has advanced past an opening tag or quote.
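
The sentinel idea can be sketched as below. The names are hypothetical and the word-boundary test is an assumption made for illustration; only the "return a space at the buffer start" behaviour comes from the description above.

```c
#include <ctype.h>

/* Hypothetical equivalent of the accessor described above: the byte before
   p, or a space when p is the first byte of the buffer. */
char prev_char_or_space(const char *buf_start, const char *p) {
    return p > buf_start ? p[-1] : ' ';
}

/* Assumed boundary rule for illustration: a token is at a word boundary
   when the preceding byte is neither alphanumeric nor an underscore. */
int at_word_boundary(const char *buf_start, const char *p) {
    const unsigned char c = (unsigned char) prev_char_or_space(buf_start, p);

    return !isalnum(c) && c != '_';
}
```

A token at offset 0 therefore counts as being at a boundary without any read before the allocation.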

Cover it with an offset-0 url() fixture in 01_engine-parse.test; without
the fix it aborts at htsparse.c:1386 under the CI sanitizer job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:44:25 +02:00
Xavier Roche
fb1ee3bf2e Merge pull request #397 from xroche/fix/css-import-94
CSS @import: capture URLs that carry a media/supports/layer condition (#94)
2026-06-19 19:30:21 +02:00
Xavier Roche
6a08ca7d39 htsparse: bound the URL-end scan against a missing closing delimiter
Reviewing the @import change, ASan flagged a pre-existing heap overflow:
when a quoted/parenthesized link token has no closing delimiter before the
buffer ends (truncated input such as `@import "x`, `@import "`, `url("x`),
the scan stops at the terminating NUL, then `c += ndelim` steps past it and
`while (*c == ' ')` / the terminator test read out of bounds. Such input
aborts under ASan on master.

Skip the URL-end scan and capture when no closing delimiter was found
(`*c == '\0'` right after the scan); c never advances past the NUL.
Well-formed tokens are unaffected.
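
In outline, the bounded scan behaves like the sketch below. capture_delimited is a hypothetical helper operating on a NUL-terminated string; it is not the parser's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: s points just past an opening delimiter. Copy the
   text up to the closing delimiter into out and return 1, or return 0
   without capturing when the input ends first. */
int capture_delimited(const char *s, char delim, char *out, size_t out_size) {
    const char *c = s;
    size_t len;

    while (*c != '\0' && *c != delim)
        c++;
    if (*c == '\0')
        return 0; /* no closing delimiter: never step past the NUL */
    len = (size_t) (c - s);
    if (len >= out_size)
        return 0; /* would not fit: treat as no capture */
    memcpy(out, s, len);
    out[len] = '\0';
    return 1;
}
```

Truncated input such as `@import "x` yields no capture, while a well-formed `"theme.css" screen;` still yields theme.css.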

01_engine-parse.test gains a truncated-@import fixture (the valid sibling
import is still captured, the unterminated one is not) that trips the
overflow under the CI ASan job, plus a check that an @import's trailing
media/supports/layer condition survives the rewrite verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:25:39 +02:00
Xavier Roche
a8b491e509 htsparse: capture conditional CSS @import URLs (#94)
A bare-string @import carrying a media/supports/layer condition, e.g.
`@import "theme.css" screen;`, was dropped. The detector required the closing
quote to be immediately followed by the statement terminator, so the trailing
condition aborted the capture. The `url(...)` form already worked because it
terminates at the paren.

Two coupled defects in the inscript/CSS detector:
- accept a whitespace-separated trailing condition after a quoted @import URL;
- bound the captured URL at its last content char (b) instead of recomputing
  from the terminator. The old `c -= (ndelim + 1)` mishandled spaces skipped
  before the terminator, leaving the closing quote inside the range so the
  bogus-link guard aborted. That also silently broke `foo="url" ;` (a space
  before the semicolon) for every quoted detection, not only @import.

01_engine-parse.test gains a CSS @import section that crawls a .css directly;
the conditioned cases are negative controls that fail without the fix.

Closes #94

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:46:31 +02:00
Xavier Roche
a8e4bb3b81 Merge pull request #395 from xroche/fix/xmlns-false-links-191
Don't crawl xmlns namespace declarations
2026-06-19 18:28:23 +02:00
Xavier Roche
0145ec37a3 htsparse: don't crawl xmlns namespace declarations (#191)
The "dirty parsing" heuristic accepts any tag attribute whose value looks
like a URL unless the attribute is on the no-detect list. xmlns and
xmlns:prefix declarations carry namespace URIs (xmlns:og="http://ogp.me/ns#",
etc.) that are not resources, so httrack queued and fetched them, stalling
the crawl on unrelated spec URLs. Reject xmlns/xmlns:prefix where the
no-detect list is already consulted.
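
A sketch of the attribute test, with a hypothetical name and an assumed case-insensitive match (the commit does not say how case is handled):

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical check (not httrack's code): return 1 for "xmlns" and for
   "xmlns:<prefix>", matched case-insensitively; 0 for anything else. */
int is_xmlns_attr(const char *name, size_t len) {
    static const char prefix[] = "xmlns";
    size_t i;

    if (len < 5)
        return 0;
    for (i = 0; i < 5; i++) {
        if (tolower((unsigned char) name[i]) != prefix[i])
            return 0;
    }
    /* exactly "xmlns", or "xmlns" followed by a colon and a prefix */
    return len == 5 || name[5] == ':';
}
```

An attribute such as xmlnsfoo is deliberately not matched, so only real namespace declarations are excluded from link detection.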

01_engine-parse.test grows a fixture with each form (default and prefixed) as
the last attribute of its element, since the heuristic only inspects an
attribute whose value is immediately followed by '>'; the targets are local
file:// gifs so a regression actually downloads them (verified: reverting the
guard fetches all three).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:24:55 +02:00
Xavier Roche
a80fab38ba Merge pull request #394 from xroche/fix/proxy-https-connect-85
Tunnel https through the proxy via CONNECT (#85)
2026-06-19 18:03:31 +02:00
Xavier Roche
c52a524a63 htslib: bound the proxy CONNECT response; harden + cover review findings
Follow-up to the CONNECT-tunnel change, from an adversarial review (the proxy
response is hostile input: a malicious or MITM proxy controls every byte).

- Bound the response read so a proxy cannot stall the single-threaded back_wait
  crawl: proxy_getline now fails on an over-long line instead of consuming it
  forever, the header drain is capped at 64 lines, and the send loop gives up
  rather than spin against a socket that reports writable but never accepts.
- Size `authority` to hold any url_adr host (HTS_URLMAXSIZE*2) so an oversized
  hostname can't trip the abort-on-overflow buff helpers; grow `req` to match.
- Reject control bytes in the CONNECT authority as a local backstop; today the
  CR/LF defense lives entirely upstream (escape_remove_control / header-line
  splitting).
- Test: the origin now records the headers it receives, and the test asserts
  Proxy-Authorization never reaches the origin through the tunnel (the previous
  assertions couldn't see a leak). Added a flooding-proxy scenario that proves
  the crawl terminates instead of hanging on an unbounded response.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 09:52:10 +02:00
Xavier Roche
1907621d37 htslib: tunnel https through the proxy via CONNECT (#85)
httrack opened https connections straight to the origin even when a proxy
was configured, so --proxy was silently ignored for https and the crawler
used the real IP. http_xfopen bypassed the proxy for any https:// URL,
because the absolute-URI proxy form it uses for http cannot carry https.

Connect to the proxy instead and, once the TCP connection is up, open an
HTTP CONNECT tunnel (http_proxy_tunnel) before the TLS handshake, so TLS
runs end-to-end with the origin. Proxy credentials now ride the CONNECT
request rather than the tunneled GET, where they would leak to the origin.
The exchange is a bounded blocking read inside the back_wait connect path:
no new async state, no struct/ABI change (the helpers stay visibility-hidden).
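
For orientation, a CONNECT request of the kind described above can be assembled as in this sketch. The helper name, the HTTP/1.1 version string and the pre-encoded Basic credentials parameter are assumptions; httrack's http_proxy_tunnel may differ on all three.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch: write a CONNECT request for "host:port" into out.
   basic_credentials is an already base64-encoded "user:pass" or NULL.
   Returns the request length, or -1 on a control byte in the authority,
   an empty authority, or a too-small buffer. */
int build_connect_request(const char *authority, const char *basic_credentials,
                          char *out, size_t out_size) {
    const unsigned char *p;
    int n;

    if (authority[0] == '\0')
        return -1;
    for (p = (const unsigned char *) authority; *p != '\0'; p++) {
        if (*p < 0x20 || *p == 0x7f)
            return -1; /* CR, LF and other control bytes could split headers */
    }
    if (basic_credentials != NULL) {
        n = snprintf(out, out_size,
                     "CONNECT %s HTTP/1.1\r\nHost: %s\r\n"
                     "Proxy-Authorization: Basic %s\r\n\r\n",
                     authority, authority, basic_credentials);
    } else {
        n = snprintf(out, out_size, "CONNECT %s HTTP/1.1\r\nHost: %s\r\n\r\n",
                     authority, authority);
    }
    if (n < 0 || (size_t) n >= out_size)
        return -1;
    return n;
}
```

The credentials appear only in this request to the proxy; the GET sent later through the tunnel would carry no Proxy-Authorization header.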

Verified end-to-end by 13_crawl_proxy_https.test: it crawls a local
self-signed https origin through a logging CONNECT proxy and asserts the
proxy saw the CONNECT and that credentials ride it. The assertion fails on
the pre-fix bypass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:43:56 +02:00
Xavier Roche
3b2d7afdaa Merge pull request #393 from xroche/fix/empty-footer-doitlog-106
Keep empty quoted args when reloading doit.log (#106)
2026-06-19 08:13:19 +02:00
Xavier Roche
6ee539619e htscoremain: keep empty quoted args when reloading doit.log (#106)
An empty footer (-%F "") is written to hts-cache/doit.log correctly as the
two-character token "", and next_token() unquotes it back to an empty string.
But the doit.log reload loop only re-inserted a token when strnotempty(lastp),
which dropped the empty one. With its argument gone, -%F absorbed the following
token (or had none), so a no-url --continue/--update reprise misparsed and
failed.

Track whether the token started with a quote (before next_token() strips it in
place) and keep it even when empty, so "" survives the round-trip. Whitespace
gaps still produce no token, so spacing behavior is unchanged.
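
The round-trip rule can be illustrated with a small tokenizer. This is a sketch under assumptions, not the doit.log reader: the function name is hypothetical, backslash escapes are not handled, and a quote in the middle of an unquoted token is treated as an ordinary character.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical tokenizer: split line in place on whitespace, honouring
   double quotes, and store up to max_args token pointers in args.
   A quoted token is kept even when it is empty. Returns the token count. */
size_t split_keep_empty_quoted(char *line, char **args, size_t max_args) {
    size_t count = 0;
    char *p = line;

    while (count < max_args) {
        char *tok;
        int was_quoted;

        while (*p != '\0' && isspace((unsigned char) *p))
            p++; /* whitespace gaps never produce a token */
        if (*p == '\0')
            break;
        was_quoted = (*p == '"'); /* remember this before stripping quotes */
        if (was_quoted) {
            tok = ++p;
            while (*p != '\0' && *p != '"')
                p++;
        } else {
            tok = p;
            while (*p != '\0' && !isspace((unsigned char) *p))
                p++;
        }
        if (*p != '\0')
            *p++ = '\0'; /* terminate the token, consume the delimiter */
        if (was_quoted || tok[0] != '\0')
            args[count++] = tok;
    }
    return count;
}
```

On the line `-%F "" -r2` this yields three tokens, the second one empty, so -r2 does not slide into the footer argument's slot.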

01_engine-doitlog.test gains a scenario that mirrors with -%F "" -r2, then on
the no-url reprise checks the regenerated doit.log still round-trips the empty
token -- probing the reader's rebuilt argv, not just that the reprise didn't
crash. The trailing -r2 makes a dropped-token bug visible (it shifts into -%F's
slot and panics) rather than a harmless run off the end of argv. Reverting only
the guard makes the scenario fail (reprise exits 255).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:09:57 +02:00
Xavier Roche
fb098b27b4 Merge pull request #392 from xroche/fix/cookie-rfc6265-151
Drop $Version/$Path from the request Cookie header (#151)
2026-06-18 22:42:47 +02:00
Xavier Roche
5f6a3fb917 htslib: drop $Version/$Path from request Cookie header (#151)
The request "Cookie:" header was built in the obsolete RFC 2965 style,
emitting "$Version=1" before the first cookie and a "$Path=..." attribute
after every value:

  Cookie: $Version=1; name=value; $Path=/; has_js=1; $Path=/

Servers expecting RFC 6265 treat $Version and $Path as stray cookies and
reject or misread the request. Emit bare name=value pairs joined by "; ":

  Cookie: name=value; has_js=1

The cookie loop is factored out of http_sendhead into append_cookie_header
(same logic, same buffer), with a thin http_cookie_header_selftest wrapper
so the exact code path can be unit-tested. A new hidden "-#Q" subcommand
builds the header for two same-domain cookies plus one on a different
domain (which must be filtered out) and checks the output is the clean
RFC 6265 form with no $Version/$Path and no cross-domain leak; driven by
tests/01_engine-cookies.test.
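
A compact sketch of the resulting header shape follows. The struct, the function name and the exact-match domain rule are simplifying assumptions; append_cookie_header itself works on httrack's own cookie store and matching rules.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cookie record used only for this illustration. */
struct demo_cookie {
    const char *domain;
    const char *name;
    const char *value;
};

/* Write "Cookie: n1=v1; n2=v2" for the cookies whose domain equals host.
   Returns the number of cookies emitted (out is empty when it is 0), or -1
   if out is too small. No $Version or $Path attributes are produced. */
int build_cookie_header(const struct demo_cookie *jar, size_t count,
                        const char *host, char *out, size_t out_size) {
    size_t len = 0;
    int emitted = 0;
    size_t i;

    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < count; i++) {
        int n;

        if (strcmp(jar[i].domain, host) != 0)
            continue; /* another domain's cookie must not leak */
        n = snprintf(out + len, out_size - len, "%s%s=%s",
                     emitted == 0 ? "Cookie: " : "; ", jar[i].name,
                     jar[i].value);
        if (n < 0 || (size_t) n >= out_size - len)
            return -1;
        len += (size_t) n;
        emitted++;
    }
    return emitted;
}
```

Given two cookies for the requested host and one for another domain, the output is the bare RFC 6265 form with only the two matching pairs.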

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 22:12:28 +02:00
Xavier Roche
f9e676dbe3 Merge pull request #391 from xroche/feature/api-enum-callsites-savename83
htsopt: name the savename_83 enum and finish the call-site constant adoption
2026-06-18 21:43:34 +02:00
Xavier Roche
1b440c44b5 htsopt: name savename_83 enum and adopt enum constants at call sites
Type opt->savename_83 as a new hts_savename_83 enum (LONG/DOS/ISO9660 =
0/1/2) and replace the remaining magic-number literals for the already-
typed verbosedisplay and savename_delayed fields with their named enum
constants across the engine.

Behavior-preserving: every constant equals the literal it replaces, and a
C enum is int-sized, so struct layout is unchanged (sizeof(httrackp) and
offsetof(savename_83) are identical to origin/master, no soname bump). The
-L option block is deliberately reflowed to clang-format style, which is
what made the savename_83 retype tractable. Bitmask fields (travel/seeker/
getmode/parsejava/hostcontrol) intentionally stay int with named bit enums,
per the existing flags-as-enum split.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 21:03:33 +02:00
Xavier Roche
ac6dd1a570 Merge pull request #390 from xroche/fix/copy-htsopt-unsigned-enum-guards
copy_htsopt silently drops boolean option fields
2026-06-18 20:46:00 +02:00
Xavier Roche
4549ec3695 htsopt: fix copy_htsopt dropping unsigned-enum fields
copy_htsopt() copies each field only when it is not the "-1 means unset"
sentinel, written as `if (from->X > -1)`. The boolean/enum option
migrations turned nearlink, errpage and parseall into hts_boolean, which
GCC backs with unsigned int. `unsigned > -1` is always false, so those
three fields silently stopped being copied.

Cast to int at the guard to restore the signed sentinel test. Add a
hidden `httrack -#9` self-test that drives copy_htsopt over distinct
boolean values plus an int positive control (tests/01_engine-copyopt.test);
it fails on the unfixed guard.
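
The pitfall is easy to reproduce in isolation. In this sketch the function names are hypothetical, and the converted constant is spelled out explicitly so the example shows what the compiler does with `x > -1` rather than relying on the implicit conversion.

```c
/* What `x > -1` computes when x is unsigned: the usual arithmetic
   conversions turn -1 into UINT_MAX first, so no value can be greater. */
int is_set_unsigned_guard(unsigned int x) {
    const unsigned int converted_minus_one = (unsigned int) -1;

    return x > converted_minus_one; /* always 0 */
}

/* The fix described above: cast back to int so -1 remains a signed
   "unset" sentinel and 0/1 pass the test. */
int is_set_signed_guard(unsigned int x) {
    return (int) x > -1;
}
```

With the unsigned guard, a field holding 0 or 1 is never considered set, which is exactly how the three fields stopped being copied.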

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 20:25:42 +02:00
Xavier Roche
ac56c31b24 Merge pull request #389 from xroche/fix/travel-test-all-enum
htsopt: fold HTS_TRAVEL_TEST_ALL into the hts_travel_scope enum
2026-06-18 18:40:33 +02:00
Xavier Roche
ee6beeeb7d htsopt: fold HTS_TRAVEL_TEST_ALL into the hts_travel_scope enum
The -t "test all" flag was a stray #define sitting next to the scope
enum; make it an enum constant so the named travel values live in one
place. The mask (HTS_TRAVEL_SCOPE_MASK) stays a #define: it selects the
scope out of opt->travel, it is not a member of the value set.

Name and value (1 << 8) are unchanged, so every use site compiles
identically and opt->travel stays plain int. No ABI change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 18:29:23 +02:00
Xavier Roche
6788bda380 Merge pull request #388 from xroche/feature/api-enum-fields-2
htsopt: type debug, savename_delayed and verbosedisplay as named enums
2026-06-18 18:25:44 +02:00
Xavier Roche
7ead8d595e htsopt: type three more option fields as named enums
debug becomes hts_log_type (it already stored LOG_* values; the int
declaration was a latent type hole), savename_delayed becomes a new
hts_savename_delayed { NONE, SOFT, HARD }, and verbosedisplay becomes a
new hts_verbosedisplay { NONE, SIMPLE, FULL }. hostcontrol stays int but
its bits are now named by a new hts_hostcontrol flags enum, matching the
existing getmode/seeker/travel/htsparsejava_flags pattern.

A C enum is int-sized, so struct layout, field offsets and
sizeof(httrackp) are unchanged: no ABI break, no soname bump. The three
sscanf("%d", ...) sites that fill these fields now write through an int*
(size-identical) to keep the format type exact.

These enums are unsigned-backed (all enumerators non-negative), so the
non-negative debug comparisons (debug < level, debug > LOG_INFO, etc.)
now compile to unsigned jumps. debug is never negative, never sscanf'd
and never tested against a negative bound, so the result is unchanged;
disassembly is otherwise byte-identical bar instruction scheduling.

savename_83 is left as int on purpose: its sscanf sits in the -L parser
block whose old indentation does not round-trip through clang-format.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 18:11:19 +02:00
Xavier Roche
93f502990c Merge pull request #387 from xroche/feature/api-bool-returns
Return hts_boolean from the yes/no library functions
2026-06-18 17:38:48 +02:00
Xavier Roche
0f4b2596b2 htslib: return hts_boolean from the yes/no library functions
The exported API had many functions returning int where the int is really a
yes/no answer. Type the 14 genuinely-boolean ones as hts_boolean
(catch_url, dir_exists, is_dyntype, may_unknown, hts_findnext,
hts_findisdir/isfile/issystem, hts_has_stopped, hts_addurl, hts_resetaddurl,
hts_log, get_httptype_sized, guess_httptype_sized) and the three boolean int
parameters likewise (get_httptype_sized's flag, unescape_http_unharm's no_high,
hts_request_stop's force).

hts_boolean moves from htsopt.h to htsglobal.h so the library header, which only
forward-declares httrackp and does not include htsopt.h, can see the type.

The audit deliberately left alone the functions whose name suggests a boolean
but whose value is not 0/1: hts_is_testing returns 0..5, hts_is_exiting and
is_knowntype/is_userknowntype are tri-state, structcheck and the *_utf8 wrappers
are POSIX 0/-1, hts_findgetsize is a size, hts_main is an exit code, and
copy_htsopt returns 0 for success (a bool would read backwards). hts_setpause
and hts_is_parsing keep int params because they gate on '>= 0', not 0/1.

Not an ABI break: int -> int-sized enum is the same calling convention for both
return values (eax) and parameters, and enum<->int is implicit for callers, so
already-compiled consumers keep working. Verified by comparing per-object
disassembly against master: 39 of 45 objects byte-identical, htslib differs only
in __LINE__ immediates, and the five caller/definer objects differ only in
register allocation and return-block merging (no control-flow or value change).
make check passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 09:19:36 +02:00
Xavier Roche
4a676bb5e1 Merge pull request #386 from xroche/feature/api-boolean-enum
Type the boolean option fields as a named enum
2026-06-18 09:04:14 +02:00
Xavier Roche
36b4e834b8 htsopt: type the boolean option fields as a named enum
The httrackp option fields that are pure on/off toggles were declared as bare
int. Introduce a two-value enum, hts_boolean { HTS_FALSE, HTS_TRUE }, and use it
as the type of the 38 boolean fields so each one documents its nature at the
declaration. The hts_create_opt() defaults block now reads HTS_TRUE/HTS_FALSE.

An enum is used rather than C bool on purpose: a C enum is int-sized and
represented like int, so the struct layout, every field offset and
sizeof(httrackp) are unchanged (verified: 141648 bytes before and after). The
size_httrackp guard value still holds and there is no soname bump. A bool field
would be one byte and would repack the whole struct.
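
The layout argument can be checked with a toy struct. This is only a sketch: the types are hypothetical stand-ins for httrackp, and the enum size is implementation-defined in ISO C, so the result holds for compilers that back such an enum with an int-sized type (GCC and Clang by default, without -fshort-enums).

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum demo_boolean { DEMO_FALSE, DEMO_TRUE } demo_boolean;

/* The same four fields, with the two flags typed three different ways. */
struct opts_int {
    int a;
    int f1;
    int f2;
    int b;
};
struct opts_enum {
    int a;
    demo_boolean f1;
    demo_boolean f2;
    int b;
};
struct opts_bool {
    int a;
    bool f1;
    bool f2;
    int b;
};

/* 1 when switching the flags from int to the enum moves nothing. */
int enum_keeps_layout(void) {
    return sizeof(struct opts_int) == sizeof(struct opts_enum) &&
           offsetof(struct opts_int, f2) == offsetof(struct opts_enum, f2) &&
           offsetof(struct opts_int, b) == offsetof(struct opts_enum, b);
}

/* 1 when switching the flags from int to bool moves a later field. */
int bool_changes_layout(void) {
    return offsetof(struct opts_int, f2) != offsetof(struct opts_bool, f2);
}
```

On a typical 64-bit Linux toolchain both functions return 1, which matches the reasoning given for choosing the enum.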

Scope is httrackp only; fields that look boolean but are not were left as int
(savename_delayed is tri-state, hostcontrol is a bitmask), as was is_update in
the separate lien_back struct. The four CLI sites that sscanf("%d") into a
boolean field now cast to int* to keep the read well-defined.

Value-preserving: built against origin/master and compared per-object
disassembly. 40 of 45 objects are byte-identical; the five that differ
(htscore/htslib/htsname/htsparse/htswizard) differ only in instruction selection
from the int->enum field types, with every hts_create_opt default confirmed
unchanged. make check passes. Runtime assignments and tests on these fields are
left as plain 0/1, which compile identically.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 07:34:36 +02:00
Xavier Roche
bbb423f025 Merge pull request #385 from xroche/feature/api-enum-types
Give the option fields named enum types and flag macros
2026-06-18 07:06:59 +02:00
Xavier Roche
eed46e0b09 htsopt: give the option fields named enum types and flag macros
The per-mirror option fields in the installed htsopt.h carried bare ints whose
values were scattered magic numbers, decoded only by reading the parser. Type
the four single-valued fields as enums (urlmode -> hts_urlmode, cache ->
hts_cachemode, wizard -> hts_wizard, robots -> hts_robots) and name the bitmask
bits as enums too (hts_getmode, hts_seeker, hts_travel_scope, plus
HTS_TRAVEL_SCOPE_MASK / HTS_TRAVEL_TEST_ALL), following the existing
htsparsejava_flags pattern where the flag bits are an enum but the field stays
int. Replace the magic numbers at every use site with the named values.

This is not an ABI break: a C enum is int-sized and represented identically, so
the struct layout, field offsets and sizeof(httrackp) are unchanged and the
size_httrackp guard value still holds. No soname bump.

The substitution is value-preserving and was verified by comparing per-object
disassembly between this branch and origin/master: 98 of 103 objects are
byte-identical, the htscore/htscoremain/htsparse objects have identical opcode
sequences (the only deltas are __LINE__ immediates moved by clang-format
wrapping long lines), and htslib/htswizard differ only in instruction selection
from the int->enum field types, with every hts_create_opt default confirmed
unchanged. make check passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-17 23:59:38 +02:00
Xavier Roche
fa57f0148f Merge pull request #384 from xroche/cleanup/dead-decls
Drop dead and duplicate function declarations
2026-06-17 22:15:13 +02:00
Xavier Roche
76260d5e6e src: drop dead and duplicate function declarations
Four declarations named functions that have no definition anywhere, so
they were never exported (absent from libhttrack.so) and any caller
would fail to link: htswrap_set_userdef and htswrap_get_userdef (the
live path is the CHAIN_FUNCTION ARGUMENT with CALLBACKARG_USERDEF),
antislash_unescaped, and the internal liens_record. escape_remove_control
was additionally declared twice in httrack-library.h; the documented
declaration stays, the bare duplicate goes.

Header-only cleanup. The exported symbol set is unchanged (verified with
nm -D), so this is not an ABI break and needs no soname bump.

Found while documenting the public API (#382).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-17 22:11:30 +02:00
Xavier Roche
5d0913dfce Merge pull request #383 from xroche/fix/mtime-local-precision
Fix mtime_local sub-second precision loss on POSIX
2026-06-17 22:06:42 +02:00
Xavier Roche
9b7601a987 htslib: fix mtime_local sub-second precision on POSIX
mtime_local() returns milliseconds since the epoch, but the POSIX
branch divided tv_usec (microseconds) by 1000000 instead of 1000,
dropping the entire millisecond term. The clock only advanced at
whole-second boundaries, so every sub-second delta the callers compute
(request/connect timing, transfer-rate smoothing) read as zero. The
Windows ftime() branch was already correct.
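
The arithmetic is small enough to show side by side. Both functions are hypothetical illustrations taking the seconds and microseconds of a timeval-style timestamp; they are not mtime_local itself.

```c
/* Correct conversion: 1000 microseconds make one millisecond. */
long long to_millis(long long sec, long long usec) {
    return sec * 1000 + usec / 1000;
}

/* The faulty form described above: dividing by 1000000 turns any
   sub-second microsecond count into 0, so only whole seconds remain. */
long long to_millis_buggy(long long sec, long long usec) {
    return sec * 1000 + usec / 1000000;
}
```

For a timestamp of 1 s and 500000 us, the first returns 1500 and the second 1000, so two samples taken within the same second differ by zero in the faulty version.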

Found while documenting the public API (#382).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-17 22:03:16 +02:00
Xavier Roche
4ec38c4e66 Merge pull request #382 from xroche/docs/api-httrack-library
Document the public C API with contract comments
2026-06-17 21:17:24 +02:00
46 changed files with 2135 additions and 945 deletions

@@ -320,6 +320,21 @@ jobs:
 lint:
 name: lint (shellcheck, shfmt)
 runs-on: ubuntu-24.04
+# Every tracked shell script; the globs expand at run time. Kept here so the
+# shellcheck and shfmt steps below cannot drift apart.
+env:
+SHELL_SCRIPTS: >-
+.githooks/pre-commit
+bootstrap
+build.sh
+html/div/search.sh
+man/makeman.sh
+src/htsbasiccharsets.sh
+src/htsentities.sh
+src/webhttrack
+tests/*.sh
+tests/*.test
+tools/mkdeb.sh
 steps:
 - uses: actions/checkout@v6
@@ -332,12 +347,11 @@ jobs:
 sudo apt-get install -y --no-install-recommends shellcheck shfmt
 shfmt --version
-# Lint the scripts we maintain; the legacy scripts are a separate cleanup.
 - name: shellcheck
-run: shellcheck man/makeman.sh tools/mkdeb.sh .githooks/pre-commit tests/*.test tests/check-network.sh
+run: shellcheck $SHELL_SCRIPTS
 - name: shfmt
-run: shfmt -d -i 4 man/makeman.sh tools/mkdeb.sh .githooks/pre-commit
+run: shfmt -d -i 4 $SHELL_SCRIPTS
 # Check clang-format on CHANGED LINES ONLY. The engine predates clang-format
 # (it was shaped by an old Visual Studio formatter) and does not round-trip,

@@ -1,8 +1,7 @@
 #!/bin/sh
 # Simple indexing test using HTTrack
 # A "real" script/program would use advanced search, and
 # use dichotomy to find the word in the index.txt file
 # This script is really basic and NOT optimized, and
 # should not be used for professional purpose :)
@@ -11,50 +10,49 @@ TESTSITE="http://localhost/"
 # Create an index if necessary
 if ! test -f "index.txt"; then
 echo "Building the index .."
 rm -rf test
 httrack --display "$TESTSITE" -%I -O test
 mv test/index.txt ./
 fi
 # Convert crlf to lf
-if test "`head index.txt -n 1 | tr '\r' '#' | grep -c '#'`" = "1"; then
+if test "$(head index.txt -n 1 | tr '\r' '#' | grep -c '#')" = "1"; then
 echo "Converting index to Unix LF style (not CR/LF) .."
 mv -f index.txt index.txt.old
-cat index.txt.old|tr -d '\r' > index.txt
+tr -d '\r' <index.txt.old >index.txt
 fi
 keyword=-
 while test -n "$keyword"; do
 printf "Enter a keyword: "
-read keyword
+read -r keyword
 if test -n "$keyword"; then
-FOUNDK="`grep -niE \"^$keyword\" index.txt`"
+FOUNDK="$(grep -niE "^$keyword" index.txt)"
 if test -n "$FOUNDK"; then
-if ! test `echo "$FOUNDK"|wc -l` = "1"; then
+if ! test "$(echo "$FOUNDK" | wc -l)" = "1"; then
 # Multiple matches
 printf "Found multiple keywords: "
-echo "$FOUNDK"|cut -f2 -d':'|tr '\n' ' '
+echo "$FOUNDK" | cut -f2 -d':' | tr '\n' ' '
 echo ""
 echo "Use keyword$ to find only one"
 else
 # One match
-N=`echo "$FOUNDK"|cut -f1 -d':'`
+N=$(echo "$FOUNDK" | cut -f1 -d':')
-PM=`tail +$N index.txt|grep -nE "\("|head -n 1`
+PM=$(tail "+$N" index.txt | grep -nE "\(" | head -n 1)
-if ! echo "$PM"|grep "ignored">/dev/null; then
+if ! echo "$PM" | grep "ignored" >/dev/null; then
-M=`echo $PM|cut -f1 -d':'`
+M=$(echo "$PM" | cut -f1 -d':')
 echo "Found in:"
-cat index.txt | tail "+$N" | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
+tail "+$N" index.txt | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
 else
 echo "keyword ignored (too many hits)"
 fi
 fi
 else
 echo "not found"
 fi
 fi
 done


@@ -2532,8 +2532,26 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
#if HTS_USEOPENSSL #if HTS_USEOPENSSL
/* SSL mode */ /* SSL mode */
if (back[i].r.ssl) { if (back[i].r.ssl) {
int tunnel_ok = 1;
// https via proxy: CONNECT-tunnel before TLS (#85)
if (back[i].r.req.proxy.active && back[i].r.ssl_con == NULL) {
const int timeout = back[i].timeout > 0 ? back[i].timeout : 30;
tunnel_ok =
http_proxy_tunnel(opt, &back[i].r, back[i].url_adr, timeout);
if (!tunnel_ok) {
if (!strnotempty(back[i].r.msg))
strcpybuff(back[i].r.msg, "proxy CONNECT failed");
deletehttp(&back[i].r);
back[i].r.soc = INVALID_SOCKET;
back[i].r.statuscode = STATUSCODE_NON_FATAL;
back[i].status = STATUS_READY;
back_set_finished(sback, i);
}
}
// handshake not yet launched // handshake not yet launched
if (!back[i].r.ssl_con) { if (tunnel_ok && !back[i].r.ssl_con) {
SSL_CTX_set_options(openssl_ctx, SSL_OP_ALL); SSL_CTX_set_options(openssl_ctx, SSL_OP_ALL);
// new session // new session
back[i].r.ssl_con = SSL_new(openssl_ctx); back[i].r.ssl_con = SSL_new(openssl_ctx);
@@ -2551,7 +2569,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
back[i].r.statuscode = STATUSCODE_SSL_HANDSHAKE; back[i].r.statuscode = STATUSCODE_SSL_HANDSHAKE;
} }
/* Error */ /* Error */
if (back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) { if (tunnel_ok && back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
strcpybuff(back[i].r.msg, "bad SSL/TLS handshake"); strcpybuff(back[i].r.msg, "bad SSL/TLS handshake");
deletehttp(&back[i].r); deletehttp(&back[i].r);
back[i].r.soc = INVALID_SOCKET; back[i].r.soc = INVALID_SOCKET;
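
Note on the two hunks above: the https-via-proxy path now calls `http_proxy_tunnel` before the TLS handshake, and the handshake only runs when `tunnel_ok` is set. The body of `http_proxy_tunnel` is not part of this compare view. As a rough illustration of what an HTTP CONNECT exchange generally looks like, here is a minimal sketch; `build_connect_request` and `connect_reply_ok` are hypothetical names, not HTTrack functions, and the real implementation may differ (proxy authentication, timeouts, and so on).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not HTTrack API): format a CONNECT request asking
   a proxy to open a raw tunnel to host:port.
   Returns the request length, or -1 if the buffer is too small. */
int build_connect_request(char *buf, size_t size, const char *host, int port) {
  const int n = snprintf(buf, size,
                         "CONNECT %s:%d HTTP/1.1\r\n"
                         "Host: %s:%d\r\n"
                         "\r\n",
                         host, port, host, port);
  return (n < 0 || (size_t) n >= size) ? -1 : n;
}

/* Hypothetical helper: the tunnel is usable only after a 2xx status line;
   the TLS handshake would then run over the same socket. */
int connect_reply_ok(const char *status_line) {
  int code = 0;
  if (sscanf(status_line, "HTTP/%*d.%*d %d", &code) != 1)
    return 0;
  return code >= 200 && code < 300;
}
```

A caller would send the formatted request on the proxy socket, read the status line, and start TLS only when `connect_reply_ok` returns non-zero, which mirrors the `tunnel_ok` gating in the diff.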
@@ -2779,7 +2797,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
if (strcmp(back[i].url_fil, "/robots.txt")) { if (strcmp(back[i].url_fil, "/robots.txt")) {
if (back[i].r.statuscode == HTTP_OK) { // 'OK' if (back[i].r.statuscode == HTTP_OK) { // 'OK'
if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_fil)) { // pas HTML if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_fil)) { // pas HTML
if (opt->getmode & 2) { // on peut ecrire des non html if (opt->getmode & HTS_GETMODE_NONHTML) {
int fcheck = 0; int fcheck = 0;
int last_errno = 0; int last_errno = 0;
@@ -2852,7 +2870,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
} }
} }
} }
} else { // on coupe tout! } else { // on coupe tout!
hts_log_print(opt, LOG_DEBUG, hts_log_print(opt, LOG_DEBUG,
"File cancelled (non HTML): %s%s", "File cancelled (non HTML): %s%s",
back[i].url_adr, back[i].url_fil); back[i].url_adr, back[i].url_fil);
@@ -3661,7 +3679,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
#endif #endif
if (sz >= 0) { if (sz >= 0) {
if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_sav)) { // pas HTML if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_sav)) { // pas HTML
if (opt->getmode & 2) { // on peut ecrire des non html **sinon ben euhh sera intercepté plus loin, donc rap sur ce qui va sortir** if (opt->getmode & HTS_GETMODE_NONHTML) {
filenote(&opt->state.strc, back[i].url_sav, NULL); // noter fichier comme connu filenote(&opt->state.strc, back[i].url_sav, NULL); // noter fichier comme connu
file_notify(opt, back[i].url_adr, back[i].url_fil, file_notify(opt, back[i].url_adr, back[i].url_fil,
back[i].url_sav, 0, 1, back[i].url_sav, 0, 1,
@@ -3838,7 +3856,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
/* funny log for commandline users */ /* funny log for commandline users */
//if (!opt->quiet) { //if (!opt->quiet) {
// petite animation // petite animation
if (opt->verbosedisplay == 1) { if (opt->verbosedisplay == HTS_VERBOSE_SIMPLE) {
if (back[i].status == STATUS_READY) { if (back[i].status == STATUS_READY) {
if (back[i].r.statuscode == HTTP_OK) if (back[i].r.statuscode == HTTP_OK)
printf("* %s%s (" LLintP " bytes) - OK" VT_CLREOL "\r", printf("* %s%s (" LLintP " bytes) - OK" VT_CLREOL "\r",


@@ -3,57 +3,59 @@
# Change this to download files # Change this to download files
if false; then if false; then
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
fi fi
# Produce code # Produce code
printf "/** GENERATED FILE ($0), DO NOT EDIT **/\n\n" printf '/** GENERATED FILE (%s), DO NOT EDIT **/\n\n' "$0"
for i in *.TXT ; do for i in *.TXT; do
echo "processing $i" >&2 echo "processing $i" >&2
grep -vE "^(#|$)" $i | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' | \ grep -vE "^(#|$)" "$i" | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' |
( (
unset arr unset arr
while read LINE ; do while read -r LINE; do
from=$[$(echo $LINE | cut -f1 -d' ')] from=$(($(echo "$LINE" | cut -f1 -d' ')))
if ! test -n "$from"; then if ! test -n "$from"; then
echo "error with $i" >&2 echo "error with $i" >&2
exit 1 exit 1
elif test $from -ge 256; then elif test $from -ge 256; then
echo "out-of-range ($LINE) with $i" >&2 echo "out-of-range ($LINE) with $i" >&2
exit 1 exit 1
fi fi
to=$(echo $LINE | cut -f2 -d' ') to=$(echo "$LINE" | cut -f2 -d' ')
arr[$from]=$to arr[from]=$to
done done
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/') # shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
printf "/* Table for $i */\nstatic const hts_UCS4 table_${name}[256] = {\n " name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
i=0 printf '/* Table for %s */\nstatic const hts_UCS4 table_%s[256] = {\n ' "$i" "$name"
while test "$i" -lt 256; do idx=0
if test "$i" -gt 0; then while test "$idx" -lt 256; do
printf ", " if test "$idx" -gt 0; then
if test $[${i}%8] -eq 0; then printf ", "
printf "\n " if test $((idx % 8)) -eq 0; then
fi printf "\n "
fi fi
value=${arr[$i]:-0} fi
printf "0x%04x" $value value=${arr[$idx]:-0}
i=$[${i}+1] printf "0x%04x" "$value"
done idx=$((idx + 1))
printf " };\n\n" done
) printf " };\n\n"
echo "processed $i" >&2 )
echo "processed $i" >&2
done done
# Indexes # Indexes
printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n" printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n"
for i in *.TXT ; do for i in *.TXT; do
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/') # shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
printf " { \"$(echo $name | tr -d '_')\", table_${name} },\n" name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf ' { "%s", table_%s },\n' "$(echo "$name" | tr -d '_')" "$name"
done done
printf " { NULL, NULL }\n};\n" printf " { NULL, NULL }\n};\n"


@@ -370,7 +370,7 @@ int cache_selftests(httrackp *opt, const char *dir) {
StringCopy(opt->path_html, base); StringCopy(opt->path_html, base);
StringCopy(opt->path_html_utf8, base); StringCopy(opt->path_html_utf8, base);
} }
opt->cache = 1; opt->cache = HTS_CACHE_PRIORITY;
/* pass 1: create everything in a single write session */ /* pass 1: create everything in a single write session */
selftest_open_for_write(&cache, opt); selftest_open_for_write(&cache, opt);
@@ -547,7 +547,7 @@ static void golden_setup(httrackp *opt, const char *dir) {
StringCopy(opt->path_log, base); StringCopy(opt->path_log, base);
StringCopy(opt->path_html, base); StringCopy(opt->path_html, base);
StringCopy(opt->path_html_utf8, base); StringCopy(opt->path_html_utf8, base);
opt->cache = 1; opt->cache = HTS_CACHE_PRIORITY;
} }
int cache_golden_selftest(httrackp *opt, const char *dir, int regen) { int cache_golden_selftest(httrackp *opt, const char *dir, int regen) {


@@ -135,7 +135,8 @@ HTSEXT_API T_SOC catch_url_init(int *port, /* 128 bytes */ char *adr) {
// returns 0 if error // returns 0 if error
// url: buffer where URL must be stored - or ip:port in case of failure // url: buffer where URL must be stored - or ip:port in case of failure
// data: 32Kb // data: 32Kb
HTSEXT_API int catch_url(T_SOC soc, char *url, char *method, char *data) { HTSEXT_API hts_boolean catch_url(T_SOC soc, char *url, char *method,
char *data) {
int retour = 0; int retour = 0;
// connexion (accept) // connexion (accept)


@@ -1835,9 +1835,10 @@ int httpmirror(char *url1, httrackp * opt) {
a++; // sauter espace(s) a++; // sauter espace(s)
if (strnotempty(a)) { if (strnotempty(a)) {
#ifdef IGNORE_RESTRICTIVE_ROBOTS #ifdef IGNORE_RESTRICTIVE_ROBOTS
if (strcmp(a, "/") != 0 || opt->robots >= 3) if (strcmp(a, "/") != 0 ||
opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
#endif #endif
{ /* ignoring disallow: / */ { /* ignoring disallow: / */
if ((strlen(buff) + strlen(a) + 8) < sizeof(buff)) { if ((strlen(buff) + strlen(a) + 8) < sizeof(buff)) {
strcatbuff(buff, a); strcatbuff(buff, a);
strcatbuff(buff, "\n"); strcatbuff(buff, "\n");
@@ -1932,10 +1933,10 @@ int httpmirror(char *url1, httrackp * opt) {
"Warning: store %s without scan: %s", r.contenttype, "Warning: store %s without scan: %s", r.contenttype,
savename()); savename());
} else { } else {
if ((opt->getmode & 2) != 0) { // ok autorisé if ((opt->getmode & HTS_GETMODE_NONHTML) != 0) {
hts_log_print(opt, LOG_DEBUG, "Store %s: %s", r.contenttype, hts_log_print(opt, LOG_DEBUG, "Store %s: %s", r.contenttype,
savename()); savename());
} else { // lien non autorisé! (ex: cgi-bin en html) } else { // lien non autorisé! (ex: cgi-bin en html)
hts_log_print(opt, LOG_DEBUG, hts_log_print(opt, LOG_DEBUG,
"non-html file ignored after upload at %s : %s", "non-html file ignored after upload at %s : %s",
urladr(), urlfil()); urladr(), urlfil());
@@ -2052,7 +2053,7 @@ int httpmirror(char *url1, httrackp * opt) {
ptr++; ptr++;
// faut-il sauter le(s) lien(s) suivant(s)? (fichiers images à passer après les html) // faut-il sauter le(s) lien(s) suivant(s)? (fichiers images à passer après les html)
if (opt->getmode & 4) { // sauver les non html après if (opt->getmode & HTS_GETMODE_HTML_FIRST) {
// sauter les fichiers selon la passe // sauter les fichiers selon la passe
if (!numero_passe) { if (!numero_passe) {
while((ptr < opt->lien_tot) ? (heap(ptr)->pass2) : 0) while((ptr < opt->lien_tot) ? (heap(ptr)->pass2) : 0)
@@ -2584,7 +2585,7 @@ static int mkdir_compat(const char *pathname) {
/* path must end with "/" or with the finename (/tmp/bar/ or /tmp/bar/foo.zip) */ /* path must end with "/" or with the finename (/tmp/bar/ or /tmp/bar/foo.zip) */
/* Note: preserve errno */ /* Note: preserve errno */
HTSEXT_API int dir_exists(const char *path) { HTSEXT_API hts_boolean dir_exists(const char *path) {
const int err = errno; const int err = errno;
STRUCT_STAT st; STRUCT_STAT st;
char BIGSTK file[HTS_URLMAXSIZE * 2]; char BIGSTK file[HTS_URLMAXSIZE * 2];
@@ -3341,7 +3342,8 @@ int back_fill(struct_back * sback, httrackp * opt, cache_back * cache,
int ptr, int numero_passe) { int ptr, int numero_passe) {
int n = back_pluggable_sockets(sback, opt); int n = back_pluggable_sockets(sback, opt);
if (opt->savename_delayed == 2 && !opt->delayed_cached) /* cancel (always delayed) */ if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
!opt->delayed_cached) /* cancel (always delayed) */
return 0; return 0;
if (n > 0) { if (n > 0) {
int p; int p;
@@ -3645,7 +3647,7 @@ HTSEXT_API int hts_setpause(httrackp * opt, int p) {
} }
// ask for termination // ask for termination
HTSEXT_API int hts_request_stop(httrackp * opt, int force) { HTSEXT_API int hts_request_stop(httrackp *opt, hts_boolean force) {
if (opt != NULL) { if (opt != NULL) {
hts_log_print(opt, LOG_ERROR, "Exit requested by shell or user"); hts_log_print(opt, LOG_ERROR, "Exit requested by shell or user");
hts_mutexlock(&opt->state.lock); hts_mutexlock(&opt->state.lock);
@@ -3655,7 +3657,7 @@ HTSEXT_API int hts_request_stop(httrackp * opt, int force) {
return 0; return 0;
} }
HTSEXT_API int hts_has_stopped(httrackp * opt) { HTSEXT_API hts_boolean hts_has_stopped(httrackp *opt) {
int ended; int ended;
hts_mutexlock(&opt->state.lock); hts_mutexlock(&opt->state.lock);
ended = opt->state.is_ended; ended = opt->state.is_ended;
@@ -3677,12 +3679,12 @@ HTSEXT_API int hts_has_stopped(httrackp * opt) {
//} //}
// ajout d'URL // ajout d'URL
// -1 : erreur // -1 : erreur
HTSEXT_API int hts_addurl(httrackp * opt, char **url) { HTSEXT_API hts_boolean hts_addurl(httrackp *opt, char **url) {
if (url) if (url)
opt->state._hts_addurl = url; opt->state._hts_addurl = url;
return (opt->state._hts_addurl != NULL); return (opt->state._hts_addurl != NULL);
} }
HTSEXT_API int hts_resetaddurl(httrackp * opt) { HTSEXT_API hts_boolean hts_resetaddurl(httrackp *opt) {
opt->state._hts_addurl = NULL; opt->state._hts_addurl = NULL;
return (opt->state._hts_addurl != NULL); return (opt->state._hts_addurl != NULL);
} }
@@ -3701,7 +3703,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->maxsoc > 0) if (from->maxsoc > 0)
to->maxsoc = from->maxsoc; to->maxsoc = from->maxsoc;
if (from->nearlink > -1) /* hts_boolean/enum fields are unsigned (GCC), so a bare `> -1` unset-guard
is always false; cast to int to keep the -1 "unset" sentinel test. */
if ((int) from->nearlink > -1)
to->nearlink = from->nearlink; to->nearlink = from->nearlink;
if (from->timeout > -1) if (from->timeout > -1)
@@ -3728,18 +3732,18 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->hostcontrol > -1) if (from->hostcontrol > -1)
to->hostcontrol = from->hostcontrol; to->hostcontrol = from->hostcontrol;
if (from->errpage > -1) if ((int) from->errpage > -1)
to->errpage = from->errpage; to->errpage = from->errpage;
if (from->parseall > -1) if ((int) from->parseall > -1)
to->parseall = from->parseall; to->parseall = from->parseall;
// test all: bit 8 de travel // test all: bit 8 de travel
if (from->travel > -1) { if (from->travel > -1) {
if (from->travel & 256) if (from->travel & HTS_TRAVEL_TEST_ALL)
to->travel |= 256; to->travel |= HTS_TRAVEL_TEST_ALL;
else else
to->travel &= 255; to->travel &= HTS_TRAVEL_SCOPE_MASK;
} }
return 0; return 0;
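
Note on the `(int)` casts in `copy_htsopt` above: the comment says the `hts_boolean`/enum fields are unsigned under GCC, so a bare `> -1` test can never pass. The sketch below reproduces the effect with a plain `unsigned int` stand-in, so it does not depend on how a given compiler lays out enums. `demo_flag`, `is_set_bare` and `is_set_cast` are illustrative names only.

```c
#include <assert.h>

/* Stand-in for an unsigned option field (assumption: the real fields are
   hts_boolean/enum types that GCC treats as unsigned, per the comment). */
typedef unsigned int demo_flag;

/* Bare guard: -1 is converted to UINT_MAX before the comparison,
   so no value of v can be greater and the result is always 0. */
int is_set_bare(demo_flag v) { return v > -1; }

/* Cast guard, as in the patched code: the comparison is done on signed
   ints, so 0 and 1 both count as "set". */
int is_set_cast(demo_flag v) { return (int) v > -1; }
```

With the bare guard the copy was silently skipped for every value; with the cast, only a field holding the `-1` sentinel is skipped.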
@@ -3843,7 +3847,7 @@ int htsAddLink(htsmoduleStruct * str, char *link) {
a = opt->savename_type; a = opt->savename_type;
b = opt->savename_83; b = opt->savename_83;
opt->savename_type = 0; opt->savename_type = 0;
opt->savename_83 = 0; opt->savename_83 = HTS_SAVENAME_83_LONG;
// note: adr,fil peuvent être patchés // note: adr,fil peuvent être patchés
r = r =
url_savename(&afs, NULL, NULL, NULL, opt, sback, cache, hashptr, ptr, numero_passe, url_savename(&afs, NULL, NULL, NULL, opt, sback, cache, hashptr, ptr, numero_passe,


@@ -369,10 +369,6 @@ char *readfile_or(const char *fil, const char *defaultdata);
void check_rate(TStamp stat_timestart, int maxrate); void check_rate(TStamp stat_timestart, int maxrate);
#endif #endif
// links
int liens_record(char *adr, char *fil, char *save, char *former_adr,
char *former_fil, char *codebase);
/* Backing (download-slot) scheduler. Operate on the back[] ring (struct_back). /* Backing (download-slot) scheduler. Operate on the back[] ring (struct_back).
Not thread-safe; call from the single crawl loop. */ Not thread-safe; call from the single crawl loop. */


@@ -612,12 +612,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
/* Terminal is a tty, may ask questions and display funny information */ /* Terminal is a tty, may ask questions and display funny information */
if (isatty(1)) { if (isatty(1)) {
opt->quiet = 0; opt->quiet = 0;
opt->verbosedisplay = 1; opt->verbosedisplay = HTS_VERBOSE_SIMPLE;
} }
/* Not a tty, no stdin input or funny output! */ /* Not a tty, no stdin input or funny output! */
else { else {
opt->quiet = 1; opt->quiet = 1;
opt->verbosedisplay = 0; opt->verbosedisplay = HTS_VERBOSE_NONE;
} }
#endif #endif
@@ -953,9 +953,11 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
p = buff; p = buff;
do { do {
int insert_after_argc; int insert_after_argc;
int quoted; /* "" unquotes to empty but is still a real token (#106) */
// read next // read next
lastp = p; lastp = p;
quoted = (p != NULL && *p == '"');
if (p) { if (p) {
p = next_token(p, 1); p = next_token(p, 1);
if (p) { if (p) {
@@ -966,7 +968,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
/* Insert parameters BUT so that they can be in the same order */ /* Insert parameters BUT so that they can be in the same order */
if (lastp) { if (lastp) {
if (strnotempty(lastp)) { if (strnotempty(lastp) || quoted) {
insert_after_argc = argc - insert_after; insert_after_argc = argc - insert_after;
cmdl_ins(lastp, insert_after_argc, (argv + insert_after), x_argvblk, cmdl_ins(lastp, insert_after_argc, (argv + insert_after), x_argvblk,
x_argvblk_size, x_ptr); x_argvblk_size, x_ptr);
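
Note on the `quoted` flag above (#106): a token written as `""` unquotes to an empty string, but it is still an argument and has to be inserted. The sketch below shows the same rule in a deliberately simplified tokenizer. It is not HTTrack's `next_token`, whose body is not shown here; `demo_count_tokens` is a hypothetical name, and escapes inside quotes are ignored.

```c
#include <assert.h>

/* Count space-separated tokens. A quoted token counts even when it is
   empty (""), which is the behaviour the `quoted` flag restores. */
int demo_count_tokens(const char *s) {
  int count = 0;
  while (*s != '\0') {
    while (*s == ' ') /* skip separators */
      s++;
    if (*s == '\0')
      break;
    if (*s == '"') { /* quoted token: runs up to the closing quote */
      s++;
      while (*s != '\0' && *s != '"')
        s++;
      if (*s == '"')
        s++;
    } else { /* bare token: runs up to the next space */
      while (*s != '\0' && *s != ' ')
        s++;
    }
    count++;
  }
  return count;
}
```

For example, the input `a "" b` counts three tokens here, while a check based only on "is the unquoted text non-empty" would count two.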
@@ -1431,7 +1433,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
StringBuff(opt->path_log), "hts-in_progress.lock"))) { // fichier lock? StringBuff(opt->path_log), "hts-in_progress.lock"))) { // fichier lock?
//char s[32]; //char s[32];
opt->cache = 1; // cache prioritaire opt->cache = HTS_CACHE_PRIORITY; // cache prioritaire
if (opt->quiet == 0) { if (opt->quiet == 0) {
if ((fexist if ((fexist
(fconcat (fconcat
@@ -1465,7 +1467,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
(fconcat (fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_html), "index.html"))) { (OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_html), "index.html"))) {
//char s[32]; //char s[32];
opt->cache = 2; // cache vient après test de validité opt->cache = HTS_CACHE_TEST_UPDATE;
if (opt->quiet == 0) { if (opt->quiet == 0) {
if ((fexist if ((fexist
(fconcat (fconcat
@@ -1558,25 +1560,25 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 0; // déja fait normalement return 0; // déja fait normalement
// //
case 'g': // récupérer un (ou plusieurs) fichiers isolés case 'g': // récupérer un (ou plusieurs) fichiers isolés
opt->wizard = 2; // le wizard on peut plus s'en passer.. opt->wizard = HTS_WIZARD_AUTO;
//opt->wizard=0; // pas de wizard //opt->wizard=0; // pas de wizard
opt->cache = 0; // ni de cache opt->cache = HTS_CACHE_NONE; // ni de cache
opt->makeindex = 0; // ni d'index opt->makeindex = 0; // ni d'index
httrack_logmode = 1; // erreurs à l'écran httrack_logmode = 1; // erreurs à l'écran
opt->savename_type = 1003; // mettre dans le répertoire courant opt->savename_type = 1003; // mettre dans le répertoire courant
opt->depth = 0; // ne pas explorer la page opt->depth = 0; // ne pas explorer la page
opt->accept_cookie = 0; // pas de cookies opt->accept_cookie = 0; // pas de cookies
opt->robots = 0; // pas de robots opt->robots = HTS_ROBOTS_NEVER; // pas de robots
break; break;
case 'w': case 'w':
opt->wizard = 2; // wizard 'soft' (ne pose pas de questions) opt->wizard = HTS_WIZARD_AUTO;
opt->travel = 0; opt->travel = HTS_TRAVEL_SAME_ADDRESS;
opt->seeker = 1; opt->seeker = HTS_SEEKER_DOWN;
break; break;
case 'W': case 'W':
opt->wizard = 1; // Wizard-Help (pose des questions) opt->wizard = HTS_WIZARD_ASK; // Wizard-Help (pose des questions)
opt->travel = 0; opt->travel = HTS_TRAVEL_SAME_ADDRESS;
opt->seeker = 1; opt->seeker = HTS_SEEKER_DOWN;
break; break;
case 'r': // n'est plus le recurse get bestial mais wizard itou! case 'r': // n'est plus le recurse get bestial mais wizard itou!
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
@@ -1598,19 +1600,23 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
// note: les tests opt->depth sont pour éviter de faire // note: les tests opt->depth sont pour éviter de faire
// un miroir du web (:-O) accidentelement ;-) // un miroir du web (:-O) accidentelement ;-)
case 'a': /*if (opt->depth==9999) opt->depth=3; */ case 'a': /*if (opt->depth==9999) opt->depth=3; */
opt->travel = 0 + (opt->travel & 256); opt->travel =
HTS_TRAVEL_SAME_ADDRESS + (opt->travel & HTS_TRAVEL_TEST_ALL);
break; break;
case 'd': /*if (opt->depth==9999) opt->depth=3; */ case 'd': /*if (opt->depth==9999) opt->depth=3; */
opt->travel = 1 + (opt->travel & 256); opt->travel =
HTS_TRAVEL_SAME_DOMAIN + (opt->travel & HTS_TRAVEL_TEST_ALL);
break; break;
case 'l': /*if (opt->depth==9999) opt->depth=3; */ case 'l': /*if (opt->depth==9999) opt->depth=3; */
opt->travel = 2 + (opt->travel & 256); opt->travel =
HTS_TRAVEL_SAME_TLD + (opt->travel & HTS_TRAVEL_TEST_ALL);
break; break;
case 'e': /*if (opt->depth==9999) opt->depth=3; */ case 'e': /*if (opt->depth==9999) opt->depth=3; */
opt->travel = 7 + (opt->travel & 256); opt->travel =
HTS_TRAVEL_EVERYWHERE + (opt->travel & HTS_TRAVEL_TEST_ALL);
break; break;
case 't': case 't':
opt->travel |= 256; opt->travel |= HTS_TRAVEL_TEST_ALL;
break; break;
case 'n': case 'n':
opt->nearlink = 1; opt->nearlink = 1;
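
Note on the `-a`/`-d`/`-l`/`-e`/`-t` cases above: `opt->travel` packs a scope value in the low bits and a "test all" flag in bit 8, and each scope option rewrites the scope while preserving the flag. The sketch below uses the numeric values visible on the old side of this diff (0, 1, 2, 7, 255, 256). The `DEMO_` names are stand-ins, since the header that defines the real `HTS_TRAVEL_*` constants is not shown here.

```c
#include <assert.h>

enum {
  DEMO_TRAVEL_SAME_ADDRESS = 0,
  DEMO_TRAVEL_SAME_DOMAIN = 1,
  DEMO_TRAVEL_SAME_TLD = 2,
  DEMO_TRAVEL_EVERYWHERE = 7,
  DEMO_TRAVEL_SCOPE_MASK = 255, /* low bits: scope */
  DEMO_TRAVEL_TEST_ALL = 256    /* bit 8: test all links */
};

/* Replace the scope, keep the test-all flag (same shape as the -a/-d/-l/-e cases). */
int demo_set_scope(int travel, int scope) {
  return scope + (travel & DEMO_TRAVEL_TEST_ALL);
}

/* Extract the scope part. */
int demo_scope(int travel) { return travel & DEMO_TRAVEL_SCOPE_MASK; }

/* Non-zero when the test-all flag is set. */
int demo_test_all(int travel) {
  return (travel & DEMO_TRAVEL_TEST_ALL) != 0;
}
```

Starting from "everywhere plus test-all" (263), switching to the same-domain scope gives 257: the scope becomes 1 and the flag survives.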
@@ -1620,16 +1626,16 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
break; break;
// //
case 'U': case 'U':
opt->seeker = 2; opt->seeker = HTS_SEEKER_UP;
break; break;
case 'D': case 'D':
opt->seeker = 1; opt->seeker = HTS_SEEKER_DOWN;
break; break;
case 'S': case 'S':
opt->seeker = 0; opt->seeker = 0;
break; break;
case 'B': case 'B':
opt->seeker = 3; opt->seeker = HTS_SEEKER_DOWN | HTS_SEEKER_UP;
break; break;
// //
case 'Y': case 'Y':
@@ -1659,12 +1665,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
//case 'A': opt->urlmode=1; break; //case 'A': opt->urlmode=1; break;
//case 'R': opt->urlmode=2; break; //case 'R': opt->urlmode=2; break;
case 'K': case 'K':
opt->urlmode = 0; opt->urlmode = HTS_URLMODE_ABSOLUTE;
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->urlmode); sscanf(com + 1, "%d", (int *) &opt->urlmode);
if (opt->urlmode == 0) { // in fact K0 ==> K2 if (opt->urlmode == HTS_URLMODE_ABSOLUTE) { // in fact K0 ==> K2
// and K ==> K0 // and K ==> K0
opt->urlmode = 2; opt->urlmode = HTS_URLMODE_RELATIVE;
} }
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
@@ -1779,7 +1785,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
break; break;
// //
case 'b': case 'b':
sscanf(com + 1, "%d", &opt->accept_cookie); sscanf(com + 1, "%d", (int *) &opt->accept_cookie);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
break; break;
@@ -1811,53 +1817,51 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
com++; com++;
} }
break; break;
case 'L': case 'L': {
{ sscanf(com + 1, "%d", (int *) &opt->savename_83);
sscanf(com + 1, "%d", &opt->savename_83); switch (opt->savename_83) {
switch (opt->savename_83) { case 0: // 8-3 (ISO9660 L1)
case 0: // 8-3 (ISO9660 L1) opt->savename_83 = HTS_SAVENAME_83_DOS;
opt->savename_83 = 1; break;
break; case 1:
case 1: opt->savename_83 = HTS_SAVENAME_83_LONG;
opt->savename_83 = 0; break;
break; default: // 2 == ISO9660 (ISO9660 L2)
default: // 2 == ISO9660 (ISO9660 L2) opt->savename_83 = HTS_SAVENAME_83_ISO9660;
opt->savename_83 = 2; break;
break;
}
while(isdigit((unsigned char) *(com + 1)))
com++;
} }
break; while (isdigit((unsigned char) *(com + 1)))
com++;
} break;
case 's': case 's':
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->robots); sscanf(com + 1, "%d", (int *) &opt->robots);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
} else } else
opt->robots = 1; opt->robots = HTS_ROBOTS_SOMETIMES;
#if DEBUG_ROBOTS #if DEBUG_ROBOTS
printf("robots.txt mode set to %d\n", opt->robots); printf("robots.txt mode set to %d\n", opt->robots);
#endif #endif
break; break;
case 'o': case 'o':
sscanf(com + 1, "%d", &opt->errpage); sscanf(com + 1, "%d", (int *) &opt->errpage);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
break; break;
case 'u': case 'u':
sscanf(com + 1, "%d", &opt->check_type); sscanf(com + 1, "%d", (int *) &opt->check_type);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
break; break;
// //
case 'C': case 'C':
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->cache); sscanf(com + 1, "%d", (int *) &opt->cache);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
} else } else
opt->cache = 1; opt->cache = HTS_CACHE_PRIORITY;
break; break;
case 'k': case 'k':
opt->all_in_cache = 1; opt->all_in_cache = 1;
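
Note on the `sscanf(..., (int *) &opt->field)` calls above: the patch keeps scanning directly into the option fields and adds a pointer cast now that they are enum-typed. One possible alternative, shown only as a sketch and not what this change does, is to scan into a local `int` and then map it, as the `-L` case already does with its `switch`. The `DEMO_83_*` values follow the old side of the diff (0, 1, 2); `demo_parse_L` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

enum demo_savename_83 {
  DEMO_83_LONG = 0,
  DEMO_83_DOS = 1,
  DEMO_83_ISO9660 = 2
};

/* Parse the digits following -L and map them to the internal value:
   L0 -> DOS 8-3, L1 -> long names, anything else -> ISO9660.
   Returns 1 on success; returns 0 and leaves *out untouched when no
   number could be read. */
int demo_parse_L(const char *digits, enum demo_savename_83 *out) {
  int n;
  if (sscanf(digits, "%d", &n) != 1)
    return 0;
  switch (n) {
  case 0:
    *out = DEMO_83_DOS;
    break;
  case 1:
    *out = DEMO_83_LONG;
    break;
  default:
    *out = DEMO_83_ISO9660;
    break;
  }
  return 1;
}
```

Going through a local keeps `sscanf` writing to a real `int` object whatever size the compiler picks for the enum, at the cost of one extra assignment per option.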
@@ -1913,7 +1917,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
case 'I': case 'I':
opt->kindex = 1; opt->kindex = 1;
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->kindex); sscanf(com + 1, "%d", (int *) &opt->kindex);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
} }
@@ -1985,9 +1989,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
} }
break; // url hack break; // url hack
case 'v': case 'v':
opt->verbosedisplay = 2; opt->verbosedisplay = HTS_VERBOSE_FULL;
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->verbosedisplay); sscanf(com + 1, "%d", (int *) &opt->verbosedisplay);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
} }
@@ -2000,9 +2004,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
} }
break; break;
case 'N': case 'N':
opt->savename_delayed = 2; opt->savename_delayed = HTS_SAVENAME_DELAYED_HARD;
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->savename_delayed); sscanf(com + 1, "%d", (int *) &opt->savename_delayed);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
} }
@@ -2045,7 +2049,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
// preserve: no footer, original links // preserve: no footer, original links
case 'p': case 'p':
StringClear(opt->footer); StringClear(opt->footer);
opt->urlmode = 4; opt->urlmode = HTS_URLMODE_KEEP_ORIGINAL;
break; break;
case 'L': // URL list case 'L': // URL list
if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) { if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
@@ -2783,6 +2787,47 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 0; return 0;
} }
break; break;
case 'l': /* lienrelatif: relative link from curr_fil to link */
if (na + 2 >= argc) {
HTS_PANIC_PRINTF(
"Option #l needs a link and a current-file path");
printf(
"Example: '-#l' 'host/dir/img.gif' 'host/dir/p.html'\n");
htsmain_free();
return -1;
} else {
char s[HTS_URLMAXSIZE * 2];
if (lienrelatif(s, sizeof(s), argv[na + 1], argv[na + 2]) ==
0)
printf("relative=%s\n", s);
else
printf("relative=<ERROR>\n");
htsmain_free();
return 0;
}
break;
case 'i': /* ident_url_relatif: resolve a link -> adr/fil */
if (na + 3 >= argc) {
HTS_PANIC_PRINTF(
"Option #i needs a link, an origin address and file");
printf("Example: '-#i' '../img.gif' 'www.foo.com' "
"'/d/p.html'\n");
htsmain_free();
return -1;
} else {
lien_adrfil af;
const int r = ident_url_relatif(argv[na + 1], argv[na + 2],
argv[na + 3], &af);
if (r == 0)
printf("adr=%s fil=%s\n", af.adr, af.fil);
else
printf("error=%d\n", r);
htsmain_free();
return 0;
}
break;
case '2': // mimedefs case '2': // mimedefs
if (na + 1 >= argc) { if (na + 1 >= argc) {
HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL"); HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL");
@@ -3092,6 +3137,78 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
htsmain_free(); htsmain_free();
return 0; return 0;
break; break;
case '9': { // copy_htsopt selftest: httrack -#9
httrackp *from = hts_create_opt();
httrackp *to = hts_create_opt();
int err = 0;
/* from-values differ from both the to-values and the
hts_create_opt() defaults (nearlink FALSE, errpage/parseall
TRUE), so a copy that no-ops or just resets to defaults is
caught too, not only the unsigned-guard bug. */
from->retry = 7; /* int field: positive control */
to->retry = 0;
from->nearlink = HTS_TRUE;
to->nearlink = HTS_FALSE;
from->errpage = HTS_FALSE;
to->errpage = HTS_TRUE;
from->parseall = HTS_FALSE;
to->parseall = HTS_TRUE;
copy_htsopt(from, to);
if (to->retry != 7)
err = 1;
if (to->nearlink != HTS_TRUE)
err = 1;
if (to->errpage != HTS_FALSE)
err = 1;
if (to->parseall != HTS_FALSE)
err = 1;
hts_free_opt(from);
hts_free_opt(to);
printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");
htsmain_free();
return err;
} break;
case 'Q': { // cookie request-header selftest: httrack -#Q
static t_cookie cookie;
char hdr[1024];
/* RFC 6265: bare name=value pairs, no $Version/$Path (#151). */
const char *expected = "Cookie: name=value; has_js=1" H_CRLF;
int err = 0;
const char *dom = "www.example.com";
int added;
cookie.max_len = (int) sizeof(cookie.data);
cookie.data[0] = '\0';
added = cookie_add(&cookie, "name", "value", dom, "/");
added |= cookie_add(&cookie, "has_js", "1", dom, "/");
/* different domain: must be filtered out */
added |= cookie_add(&cookie, "junk", "x", "other.org", "/");
if (added) {
printf("cookie-header: FAIL (cookie_add setup)\n");
htsmain_free();
return 1;
}
http_cookie_header_selftest(&cookie, dom, "/", hdr,
sizeof(hdr));
if (strcmp(hdr, expected) != 0)
err = 1;
if (strstr(hdr, "$Version") != NULL ||
strstr(hdr, "$Path") != NULL)
err = 1;
if (strstr(hdr, "junk") != NULL) // wrong-domain cookie leaked
err = 1;
printf("cookie-header: %s\n", err ? "FAIL" : "OK");
if (err)
printf(" got: %s\n", hdr);
htsmain_free();
return err;
} break;
case '!': case '!':
HTS_PANIC_PRINTF HTS_PANIC_PRINTF
("Option #! is disabled for security reasons"); ("Option #! is disabled for security reasons");
@@ -3610,12 +3727,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
printf("Mirror launched on %s by HTTrack Website Copier/" printf("Mirror launched on %s by HTTrack Website Copier/"
HTTRACK_VERSION "%s " HTTRACK_AFF_AUTHORS "" LF, t, HTTRACK_VERSION "%s " HTTRACK_AFF_AUTHORS "" LF, t,
hts_get_version_info(opt)); hts_get_version_info(opt));
if (opt->wizard == 0) { if (opt->wizard == HTS_WIZARD_NONE) {
printf printf
("mirroring %s with %d levels, %d sockets,t=%d,s=%d,logm=%d,lnk=%d,mdg=%d\n", ("mirroring %s with %d levels, %d sockets,t=%d,s=%d,logm=%d,lnk=%d,mdg=%d\n",
url, opt->depth, opt->maxsoc, opt->travel, opt->seeker, url, opt->depth, opt->maxsoc, opt->travel, opt->seeker,
httrack_logmode, opt->urlmode, opt->getmode); httrack_logmode, opt->urlmode, opt->getmode);
} else { // the magic wizard } else { // the magic wizard
printf("mirroring %s with the wizard help..\n", url); printf("mirroring %s with the wizard help..\n", url);
} }
} }
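Review note on the `-#9` self-test: its comment mentions an "unsigned-guard bug" without showing it. One plausible reading (an assumption, since `copy_htsopt()` is not part of this hunk) is a "copy only if set" guard of the form `value > -1` applied to a field whose type is unsigned. The sketch below uses a hypothetical `demo_opt` struct and `demo_copy_guarded()` helper to show how such a guard silently skips the copy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option struct: one signed field, one unsigned field. */
typedef struct demo_opt {
  int retry;             /* signed: the "positive control" */
  unsigned int nearlink; /* stands in for a flag with an unsigned type */
} demo_opt;

/* Hypothetical copy with "only copy if set" guards (value > -1). */
static void demo_copy_guarded(const demo_opt *from, demo_opt *to) {
  assert(from != NULL && to != NULL);
  if (from->retry > -1) /* signed compare: true for any value >= 0 */
    to->retry = from->retry;
  /* Against an unsigned operand, -1 is converted to UINT_MAX (written out
     explicitly here), so the guard is never true and the field is left
     untouched. */
  if (from->nearlink > (unsigned int) -1)
    to->nearlink = from->nearlink;
}
```

This is why the test sets source values that differ from both the destination and the defaults: a skipped copy leaves the destination value in place, and the mismatch is detected.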

@@ -33,43 +33,43 @@ EOF
else else
GET "${url}" GET "${url}"
fi fi
) \ ) |
| grep -E '^<!ENTITY [a-zA-Z0-9_]' \ grep -E '^<!ENTITY [a-zA-Z0-9_]' |
| sed \ sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \ -e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \ -e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/'\ -e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
| ( \ (
read A read -r A
while test -n "$A"; do while test -n "$A"; do
ent="${A%% *}" ent="${A%% *}"
code=$(echo "$A"|cut -f2 -d' ') code=$(echo "$A" | cut -f2 -d' ')
# compute hash # compute hash
hash=0 hash=0
i=0 i=0
a=1664525 a=1664525
c=1013904223 c=1013904223
m="$[1 << 32]" m="$((1 << 32))"
while test "$i" -lt ${#ent}; do while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}"|hexdump -v -e '/1 "%d"')" d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
hash="$[((${hash}*${a})%(${m})+${d}+${c})%(${m})]" hash="$((((hash * a) % (m) + d + c) % (m)))"
i=$[${i}+1] i=$((i + 1))
done done
echo -e " /* $A */" echo -e " /* $A */"
echo -e " case ${hash}u:" echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {" echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};" echo -e " return ${code};"
echo -e " }" echo -e " }"
echo -e " break;" echo -e " break;"
# next # next
read A read -r A
done done
) )
cat <<EOF cat <<EOF
} }
/* unknown */ /* unknown */
return -1; return -1;
} }
EOF EOF
) > ${dest} ) >${dest}
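Review note on the generator script: the shell loop computes a linear-congruential style hash over the entity name, with `a=1664525`, `c=1013904223` and modulus `2^32`. A minimal C sketch of the same arithmetic follows; the name `entity_hash` is hypothetical, and it assumes ASCII entity names so that each byte value matches what `hexdump` prints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hash the first len bytes of ent with hash = (hash * a + d + c) mod 2^32.
   uint32_t arithmetic wraps modulo 2^32, which replaces the explicit
   "% m" steps of the shell version. */
static uint32_t entity_hash(const char *ent, size_t len) {
  const uint32_t a = 1664525u;
  const uint32_t c = 1013904223u;
  uint32_t hash = 0;
  size_t i;

  assert(ent != NULL);
  for (i = 0; i < len; i++) {
    const uint32_t d = (unsigned char) ent[i]; /* byte value of the char */
    hash = hash * a + d + c;
  }
  return hash;
}
```

The generated `case ${hash}u:` labels would then be matched by switching on `entity_hash(ent, len)`, with the length check guarding against collisions between names of different lengths.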

@@ -242,6 +242,14 @@ Please visit our Website: http://www.httrack.com
#define HTS_NOPARAM "(none)" #define HTS_NOPARAM "(none)"
#define HTS_NOPARAM2 "\"(none)\"" #define HTS_NOPARAM2 "\"(none)\""
/* Boolean flag for option fields and API yes/no returns. An enum (not C bool)
so it stays int-sized: option fields keep the httrackp layout/ABI, and a
return type stays compatible with the int it replaces. */
#ifndef HTS_DEF_DEFSTRUCT_hts_boolean
#define HTS_DEF_DEFSTRUCT_hts_boolean
typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean;
#endif
/* Larger/smaller of two values. Macros: arguments are evaluated twice. */ /* Larger/smaller of two values. Macros: arguments are evaluated twice. */
#define maximum(A,B) ( (A) > (B) ? (A) : (B) ) #define maximum(A,B) ( (A) > (B) ? (A) : (B) )
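Review note on `hts_boolean`: the comment's claim is that switching a field from `int` to this enum keeps the struct layout. The C standard leaves the enum's underlying type implementation-defined, so the sketch below is a check under the assumption of a mainstream compiler without options such as `-fshort-enums`. The two `demo_opt_*` structs are hypothetical stand-ins for a before/after `httrackp`.

```c
#include <assert.h>
#include <stddef.h>

/* Same definition as in the diff. */
typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean;

/* Hypothetical "before" layout: plain int flags. */
typedef struct demo_opt_before {
  int retry;
  int nearlink;
  int errpage;
} demo_opt_before;

/* Hypothetical "after" layout: the flags become hts_boolean. */
typedef struct demo_opt_after {
  int retry;
  hts_boolean nearlink;
  hts_boolean errpage;
} demo_opt_after;

/* Returns 1 when the enum is int-sized and the field offsets are unchanged. */
static int demo_layout_preserved(void) {
  return sizeof(hts_boolean) == sizeof(int) &&
         sizeof(demo_opt_after) == sizeof(demo_opt_before) &&
         offsetof(demo_opt_after, errpage) ==
             offsetof(demo_opt_before, errpage);
}
```

A C `bool` field would typically be one byte, which is the layout change the enum avoids.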

@@ -644,6 +644,165 @@ T_SOC http_fopen(httrackp * opt, const char *adr, const char *fil, htsblk * reto
return http_xfopen(opt, 0, 1, 1, NULL, adr, fil, retour); return http_xfopen(opt, 0, 1, 1, NULL, adr, fil, retour);
} }
// Read a CRLF line from a non-blocking socket (waits up to timeout per recv).
// Returns the line length (0 = empty), or -1 on timeout/EOF/error or when the
// line does not fit in max bytes.
static int proxy_getline(T_SOC soc, char *s, int max, int timeout) {
int j = 0;
for (;;) {
unsigned char ch;
int n;
if (!check_readinput_t(soc, timeout))
return -1; // timed out waiting for data
n = (int) recv(soc, &ch, 1, 0);
if (n == 1) {
if (ch == 13) // CR
continue;
if (ch == 10) // LF: end of line
break;
if (j >= max - 1)
return -1; // line too long: bound the read against a hostile proxy
s[j++] = (char) ch;
} else if (n == 0) {
return -1; // connection closed
} else {
#ifdef _WIN32
if (WSAGetLastError() == WSAEWOULDBLOCK)
continue;
#else
if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
continue;
#endif
return -1;
}
}
s[j] = '\0';
return j;
}
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout) {
const T_SOC soc = retour->soc;
const char *const host = jump_identification_const(adr); // host[:port]
const char *const portsep = jump_toport_const(adr); // ":port" or NULL
char BIGSTK authority[HTS_URLMAXSIZE * 2];
char BIGSTK req[HTS_URLMAXSIZE * 4 + 1100];
char line[1024];
int code;
if (soc == INVALID_SOCKET)
return 0;
// CONNECT needs an explicit host:port; default the https port
authority[0] = '\0';
if (portsep != NULL)
strlcatbuff(authority, host, sizeof(authority)); // already host:port
else
snprintf(authority, sizeof(authority), "%s:%d", host, 443);
// backstop: never let a stray CR/LF in the host smuggle a second line into
// the CONNECT request (the host is already sanitized upstream)
{
const char *c;
for (c = authority; *c != '\0'; c++) {
if ((unsigned char) *c < ' ') {
strcpybuff(retour->msg, "proxy CONNECT: invalid host");
return 0;
}
}
}
snprintf(req, sizeof(req), "CONNECT %s HTTP/1.0" H_CRLF "Host: %s" H_CRLF,
authority, authority);
// creds go on the CONNECT, not the tunneled origin request
if (link_has_authorization(retour->req.proxy.name)) {
const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name);
char autorisation[1100];
char user_pass[256];
autorisation[0] = user_pass[0] = '\0';
strncatbuff(user_pass, astart, (int) (a - astart) - 1);
strcpybuff(user_pass, unescape_http(OPT_GET_BUFF(opt),
OPT_GET_BUFF_SIZE(opt), user_pass));
code64((unsigned char *) user_pass, (int) strlen(user_pass),
(unsigned char *) autorisation, 0);
strlcatbuff(req, "Proxy-Authorization: Basic ", sizeof(req));
strlcatbuff(req, autorisation, sizeof(req));
strlcatbuff(req, H_CRLF, sizeof(req));
}
strlcatbuff(req, H_CRLF, sizeof(req)); // end of request headers
// raw send: ssl is set, so sendc() would route to TLS
{
const char *p = req;
size_t remain = strlen(req);
int stalls = 0;
while (remain > 0) {
const int n = (int) send(soc, p, (int) remain, 0);
if (n > 0) {
p += n;
remain -= (size_t) n;
stalls = 0;
} else {
#ifdef _WIN32
const int wouldblock = (WSAGetLastError() == WSAEWOULDBLOCK);
#else
const int wouldblock =
(errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR);
#endif
// don't spin forever on a fatal error or an unwritable socket
if (!wouldblock || !check_writeinput_t(soc, timeout) ||
++stalls > 100) {
strcpybuff(retour->msg, "proxy CONNECT: write error");
return 0;
}
}
}
}
// proxy status line: "HTTP/1.x <code> ..."
if (proxy_getline(soc, line, sizeof(line), timeout) < 0) {
strcpybuff(retour->msg, "proxy CONNECT: no response");
return 0;
}
if (sscanf(line, "HTTP/%*d.%*d %d", &code) < 1)
code = 0;
if (code < 200 || code >= 300) {
snprintf(retour->msg, sizeof(retour->msg), "proxy CONNECT refused: %s",
strnotempty(line) ? line : "(no status)");
return 0;
}
// drain headers to the blank line; cap the count so a flooding proxy can't
// stall the crawl
{
int headers = 0;
for (;;) {
const int n = proxy_getline(soc, line, sizeof(line), timeout);
if (n < 0) {
strcpybuff(retour->msg, "proxy CONNECT: truncated response");
return 0;
}
if (n == 0)
break; // blank line: tunnel ready
if (++headers > 64) {
strcpybuff(retour->msg, "proxy CONNECT: too many response headers");
return 0;
}
}
}
return 1;
}
// ouverture d'une liaison http, envoi d'une requète // ouverture d'une liaison http, envoi d'une requète
// mode: 0 GET 1 HEAD [2 POST] // mode: 0 GET 1 HEAD [2 POST]
// treat: traiter header? // treat: traiter header?
@@ -680,14 +839,14 @@ T_SOC http_xfopen(httrackp * opt, int mode, int treat, int waitconnect,
/* connexion */ /* connexion */
if (retour) { if (retour) {
if ((!(retour->req.proxy.active)) /* no proxy, or proxy not usable here (local file) */
|| ((strcmp(adr, "file://") == 0) if ((!(retour->req.proxy.active)) || (strcmp(adr, "file://") == 0)) {
|| (strncmp(adr, "https://", 8) == 0)
)
) { /* pas de proxy, ou non utilisable ici */
soc = newhttp(opt, adr, retour, -1, waitconnect); soc = newhttp(opt, adr, retour, -1, waitconnect);
} else { } else {
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port, waitconnect); // ouvrir sur le proxy à la place // to the proxy; https tunnels to the origin via CONNECT in back_wait
// (#85)
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port,
waitconnect);
} }
} else { } else {
soc = newhttp(opt, adr, NULL, -1, waitconnect); soc = newhttp(opt, adr, NULL, -1, waitconnect);
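Review note on the CONNECT request: the tunnel code defaults the port to 443 when the authority has none, then rejects any control character before formatting the request. A minimal sketch of those two steps follows. `demo_build_connect` is a hypothetical helper, it assumes `H_CRLF` is `"\r\n"`, it detects a port by looking for `':'` (so IPv6 literals are out of scope), and it leaves out the `Proxy-Authorization` header.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "CONNECT host:port HTTP/1.0" plus a Host header into req.
   Returns 1 on success, 0 on a control character or a too-small buffer. */
static int demo_build_connect(const char *host, char *req, size_t req_size) {
  char authority[512];
  const char *c;
  int n;

  assert(host != NULL && req != NULL && req_size > 0);
  if (strchr(host, ':') != NULL) /* already host:port */
    n = snprintf(authority, sizeof(authority), "%s", host);
  else /* default the https port */
    n = snprintf(authority, sizeof(authority), "%s:%d", host, 443);
  if (n < 0 || (size_t) n >= sizeof(authority))
    return 0;

  /* Backstop against header injection: no CR, LF or other control bytes. */
  for (c = authority; *c != '\0'; c++) {
    if ((unsigned char) *c < ' ')
      return 0;
  }

  n = snprintf(req, req_size, "CONNECT %s HTTP/1.0\r\nHost: %s\r\n\r\n",
               authority, authority);
  return n >= 0 && (size_t) n < req_size;
}
```

Rejecting control bytes before formatting means a hostile host string cannot add a second header line to the request sent to the proxy.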
@@ -874,6 +1033,50 @@ static void print_buffer(buff_struct*const str, const char *format, ...) {
assertf(str->pos < str->capacity); assertf(str->pos < str->capacity);
} }
/* Append the request "Cookie:" header line for every stored cookie matching
domain/path. RFC 6265 form: bare "name=value" pairs joined by "; ", no
$Version/$Path attributes (those are RFC 2965 syntax that modern servers
reject, issue #151). Returns 1 if at least one cookie was emitted, else 0. */
static int append_cookie_header(buff_struct *bstr, t_cookie *cookie,
const char *domain, const char *path) {
char buffer[8192];
char *b;
int cook = 0;
int max_cookies = 8;
if (cookie == NULL)
return 0;
b = cookie->data;
do {
b = cookie_find(b, "", domain, path); // next matching cookie
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(bstr, "Cookie: ");
cook = 1;
} else
print_buffer(bstr, "; ");
print_buffer(bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(bstr, "=%s", cookie_get(buffer, b, 6));
b = cookie_nextfield(b);
}
} while (b != NULL && max_cookies > 0);
if (cook)
print_buffer(bstr, H_CRLF);
return cook;
}
/* Self-test entry for append_cookie_header(): build the request Cookie line
into dst (always NUL-terminated). Returns 1 if any cookie was emitted, else 0. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size) {
buff_struct bstr = {dst, dst_size, 0};
assertf(dst != NULL && dst_size > 0);
dst[0] = '\0';
return append_cookie_header(&bstr, cookie, domain, path);
}
// envoi d'une requète // envoi d'une requète
int http_sendhead(httrackp * opt, t_cookie * cookie, int mode, int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
const char *xsend, const char *adr, const char *fil, const char *xsend, const char *adr, const char *fil,
@@ -999,8 +1202,8 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
if (xsend) if (xsend)
print_buffer(&bstr, "%s", xsend); // éventuelles autres lignes print_buffer(&bstr, "%s", xsend); // éventuelles autres lignes
// tester proxy authentication // for https, auth rides the CONNECT (the tunneled GET would leak it)
if (retour->req.proxy.active) { if (retour->req.proxy.active && strncmp(adr, "https://", 8) != 0) {
if (link_has_authorization(retour->req.proxy.name)) { // et hop, authentification proxy! if (link_has_authorization(retour->req.proxy.name)) { // et hop, authentification proxy!
const char *a = jump_identification_const(retour->req.proxy.name); const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name); const char *astart = jump_protocol_const(retour->req.proxy.name);
@@ -1048,34 +1251,9 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
search_tag + strlen(POSTTOK) + 1)))); search_tag + strlen(POSTTOK) + 1))));
} }
} }
// gestion cookies? // send stored cookies matching this host/path
if (cookie) { if (cookie) {
char buffer[8192]; append_cookie_header(&bstr, cookie, jump_identification_const(adr), fil);
char *b = cookie->data;
int cook = 0;
int max_cookies = 8;
do {
b = cookie_find(b, "", jump_identification_const(adr), fil); // prochain cookie satisfaisant aux conditions
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(&bstr, "Cookie: $Version=1; ");
cook = 1;
} else
print_buffer(&bstr, "; ");
print_buffer(&bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(&bstr, "=%s", cookie_get(buffer, b, 6));
print_buffer(&bstr, "; $Path=%s", cookie_get(buffer, b, 2));
b = cookie_nextfield(b);
}
} while(b != NULL && max_cookies > 0);
if (cook) { // on a envoyé un (ou plusieurs) cookie?
print_buffer(&bstr, H_CRLF);
#if DEBUG_COOK
printf("Header:\n%s\n", bstr.buffer);
#endif
}
} }
// gérer le keep-alive (garder socket) // gérer le keep-alive (garder socket)
if (retour->req.http11 && !retour->req.nokeepalive) { if (retour->req.http11 && !retour->req.nokeepalive) {
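Review note on the cookie change: the new header carries bare `name=value` pairs joined by `"; "`, with no `$Version` or `$Path` attributes. The sketch below shows that output format in isolation. `demo_cookie` and `demo_cookie_header` are hypothetical (the real `t_cookie` storage and the domain/path matching are not modelled), and `H_CRLF` is assumed to be `"\r\n"`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cookie record; the real t_cookie layout is not shown here. */
typedef struct demo_cookie {
  const char *name;
  const char *value;
} demo_cookie;

/* Write "Cookie: n1=v1; n2=v2\r\n" into dst. Returns the number of pairs
   written, or 0 (with dst emptied) if there is nothing to send or the
   buffer is too small. */
static int demo_cookie_header(const demo_cookie *cookies, size_t count,
                              char *dst, size_t dst_size) {
  size_t pos = 0;
  size_t i;
  int emitted = 0;
  int n;

  assert(dst != NULL && dst_size > 0);
  dst[0] = '\0';
  for (i = 0; i < count && emitted < 8; i++) { /* same cap of 8 as the diff */
    n = snprintf(dst + pos, dst_size - pos, "%s%s=%s",
                 emitted == 0 ? "Cookie: " : "; ", cookies[i].name,
                 cookies[i].value);
    if (n < 0 || (size_t) n >= dst_size - pos) {
      dst[0] = '\0';
      return 0;
    }
    pos += (size_t) n;
    emitted++;
  }
  if (emitted > 0) {
    n = snprintf(dst + pos, dst_size - pos, "\r\n");
    if (n < 0 || (size_t) n >= dst_size - pos) {
      dst[0] = '\0';
      return 0;
    }
  }
  return emitted;
}
```

With the two cookies used by the `-#Q` self-test, this yields the same line the test expects: `Cookie: name=value; has_js=1` followed by CRLF.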
@@ -1808,6 +1986,24 @@ int check_readinput_t(T_SOC soc, int timeout) {
return 0; return 0;
} }
// wait until the socket is writable, up to timeout seconds
int check_writeinput_t(T_SOC soc, int timeout) {
if (soc != INVALID_SOCKET) {
fd_set fds;
struct timeval tv;
const int isoc = (int) soc;
assertf(isoc == soc);
FD_ZERO(&fds);
FD_SET(isoc, &fds);
tv.tv_sec = timeout;
tv.tv_usec = 0;
select(isoc + 1, NULL, &fds, NULL, &tv);
return FD_ISSET(isoc, &fds) ? 1 : 0;
} else
return 0;
}
// idem, sauf qu'ici on peut choisir la taille max de données à recevoir // idem, sauf qu'ici on peut choisir la taille max de données à recevoir
// SI bufl==0 alors le buffer est censé être de 8kos, et on recoit par bloc de lignes // SI bufl==0 alors le buffer est censé être de 8kos, et on recoit par bloc de lignes
// en éliminant les cr (ex: header), arrêt si double-lf // en éliminant les cr (ex: header), arrêt si double-lf
@@ -2409,6 +2605,8 @@ int ident_url_absolute(const char *url, lien_adrfil *adrfil) {
for(i = 0; adrfil->fil[i] != '\0'; i++) for(i = 0; adrfil->fil[i] != '\0'; i++)
if (adrfil->fil[i] == '\\') if (adrfil->fil[i] == '\\')
adrfil->fil[i] = '/'; adrfil->fil[i] = '/';
// collapse ../ like the http branch above (path-traversal safety)
fil_simplifie(adrfil->fil);
} }
// no hostname // no hostname
@@ -2580,8 +2778,8 @@ HTSEXT_API TStamp mtime_local(void) {
assert(! "gettimeofday"); assert(! "gettimeofday");
} }
return (TStamp) (((TStamp) tv.tv_sec * (TStamp) 1000) return (TStamp) (((TStamp) tv.tv_sec * (TStamp) 1000) +
+ ((TStamp) tv.tv_usec / (TStamp) 1000000)); ((TStamp) tv.tv_usec / (TStamp) 1000));
#else #else
struct timeb B; struct timeb B;
ftime(&B); ftime(&B);
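Review note on the `mtime_local()` fix: a millisecond timestamp needs the microsecond part divided by 1000. Dividing by 1000000 yields whole seconds, which is always 0 for a valid `tv_usec`, so the old code dropped the sub-second part. The two hypothetical helpers below contrast the formulas using plain 64-bit integers instead of `TStamp`.

```c
#include <assert.h>
#include <stdint.h>

/* Corrected conversion: seconds plus microseconds, expressed in ms. */
static int64_t demo_to_millis(int64_t sec, int64_t usec) {
  assert(usec >= 0 && usec < 1000000);
  return sec * 1000 + usec / 1000;
}

/* Previous formula: usec / 1000000 is 0 for every valid usec, so the
   result only ever changes once per second. */
static int64_t demo_to_millis_old(int64_t sec, int64_t usec) {
  assert(usec >= 0 && usec < 1000000);
  return sec * 1000 + usec / 1000000;
}
```

For `sec = 2` and `usec = 345678`, the corrected helper gives 2345 ms where the old formula gives 2000 ms.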
@@ -3646,8 +3844,9 @@ HTSEXT_API char *unescape_http(char *const catbuff, const size_t size, const cha
// DOES NOT DECODE %25 (part of CHAR_DELIM) // DOES NOT DECODE %25 (part of CHAR_DELIM)
// no_high & 1: decode high chars // no_high & 1: decode high chars
// no_high & 2: decode space // no_high & 2: decode space
HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size, HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size,
const char *s, const int no_high) { const char *s,
const hts_boolean no_high) {
size_t i, j; size_t i, j;
RUNTIME_TIME_CHECK_SIZE(size); RUNTIME_TIME_CHECK_SIZE(size);
@@ -3931,8 +4130,8 @@ void hts_replace(char *s, char from, char to) {
// guess a local file's mime type (e.g. fil="toto.gif" -> s="image/gif") // guess a local file's mime type (e.g. fil="toto.gif" -> s="image/gif")
// returns 1 if a type was written to s, 0 otherwise // returns 1 if a type was written to s, 0 otherwise
int guess_httptype_sized(httrackp *opt, char *s, size_t ssize, hts_boolean guess_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil) { const char *fil) {
return get_httptype_sized(opt, s, ssize, fil, 1); return get_httptype_sized(opt, s, ssize, fil, 1);
} }
@@ -3945,8 +4144,8 @@ void guess_httptype(httrackp * opt, char *s, const char *fil) {
// write the mime type for fil into s (capacity ssize) // write the mime type for fil into s (capacity ssize)
// flag: 1 to always return a type (the "application/..." / octet-stream // flag: 1 to always return a type (the "application/..." / octet-stream
// fallback) returns 1 if a type was written to s, 0 otherwise // fallback) returns 1 if a type was written to s, 0 otherwise
HTSEXT_API int get_httptype_sized(httrackp *opt, char *s, size_t ssize, HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil, int flag) { const char *fil, hts_boolean flag) {
// userdef overrides get_httptype (a rule with an empty value, e.g. "--assume // userdef overrides get_httptype (a rule with an empty value, e.g. "--assume
// cgi=", matches but writes nothing: report it as "no type" like the old // cgi=", matches but writes nothing: report it as "no type" like the old
// code, whose callers tested strnotempty(s)) // code, whose callers tested strnotempty(s))
@@ -4196,7 +4395,7 @@ HTSEXT_API int is_userknowntype(httrackp * opt, const char *fil) {
// page dynamique? // page dynamique?
// is_dyntype(get_ext("foo.asp")) // is_dyntype(get_ext("foo.asp"))
HTSEXT_API int is_dyntype(const char *fil) { HTSEXT_API hts_boolean is_dyntype(const char *fil) {
int j = 0; int j = 0;
if (!fil) if (!fil)
@@ -4214,7 +4413,7 @@ HTSEXT_API int is_dyntype(const char *fil) {
// types critiques qui ne doivent pas être changés car renvoyés par des serveurs qui ne // types critiques qui ne doivent pas être changés car renvoyés par des serveurs qui ne
// connaissent pas le type // connaissent pas le type
int may_unknown(httrackp * opt, const char *st) { hts_boolean may_unknown(httrackp *opt, const char *st) {
int j = 0; int j = 0;
// types média // types média
@@ -5236,7 +5435,8 @@ HTSEXT_API int hts_uninit_module(void) {
} }
// legacy. do not use // legacy. do not use
HTSEXT_API int hts_log(httrackp * opt, const char *prefix, const char *msg) { HTSEXT_API hts_boolean hts_log(httrackp *opt, const char *prefix,
const char *msg) {
if (opt->log != NULL) { if (opt->log != NULL) {
fspc(opt, opt->log, prefix); fspc(opt, opt->log, prefix);
fprintf(opt->log, "%s" LF, msg); fprintf(opt->log, "%s" LF, msg);
@@ -5434,69 +5634,72 @@ HTSEXT_API httrackp *hts_create_opt(void) {
/* default settings */ /* default settings */
opt->wizard = 2; // wizard automatique opt->wizard = HTS_WIZARD_AUTO; // wizard automatique
opt->quiet = 0; // questions opt->quiet = HTS_FALSE;
// //
opt->travel = 0; // même adresse opt->travel = HTS_TRAVEL_SAME_ADDRESS; // même adresse
opt->depth = 9999; // mirror total par défaut opt->depth = 9999; // mirror total par défaut
opt->extdepth = 0; // mais pas à l'extérieur opt->extdepth = 0; // mais pas à l'extérieur
opt->seeker = 1; // down opt->seeker = HTS_SEEKER_DOWN; // down
opt->urlmode = 2; // relatif par défaut opt->urlmode = HTS_URLMODE_RELATIVE; // relatif par défaut
opt->no_type_change = 0; // change file types opt->no_type_change = HTS_FALSE;
opt->debug = LOG_NOTICE; // small log opt->debug = LOG_NOTICE; // small log
opt->getmode = 3; // linear scan opt->getmode = HTS_GETMODE_HTML | HTS_GETMODE_NONHTML;
opt->maxsite = -1; // taille max site (aucune) opt->maxsite = -1; // taille max site (aucune)
opt->maxfile_nonhtml = -1; // taille max fichier non html opt->maxfile_nonhtml = -1; // taille max fichier non html
opt->maxfile_html = -1; // idem pour html opt->maxfile_html = -1; // idem pour html
opt->maxsoc = 4; // nbre socket max opt->maxsoc = 4; // nbre socket max
opt->fragment = -1; // pas de fragmentation opt->fragment = -1; // pas de fragmentation
opt->nearlink = 0; // ne pas prendre les liens non-html "adjacents" opt->nearlink = HTS_FALSE;
opt->makeindex = 1; // faire un index opt->makeindex = HTS_TRUE;
opt->kindex = 0; // index 'keyword' opt->kindex = HTS_FALSE;
opt->delete_old = 1; // effacer anciens fichiers opt->delete_old = HTS_TRUE;
opt->background_on_suspend = 1; // Background the process if Control Z calls signal suspend. opt->background_on_suspend = HTS_TRUE;
opt->makestat = 0; // pas de fichier de stats opt->makestat = HTS_FALSE;
opt->maketrack = 0; // ni de tracking opt->maketrack = HTS_FALSE;
opt->timeout = 120; // timeout par défaut (2 minutes) opt->timeout = 120; // timeout par défaut (2 minutes)
opt->cache = 1; // cache prioritaire opt->cache = HTS_CACHE_PRIORITY; // cache prioritaire
opt->shell = 0; // pas de shell par defaut opt->shell = HTS_FALSE;
opt->proxy.active = 0; // pas de proxy opt->proxy.active = 0; // pas de proxy
opt->user_agent_send = 1; // envoyer un user-agent opt->user_agent_send = HTS_TRUE;
StringCopy(opt->user_agent, StringCopy(opt->user_agent,
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"); "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->referer, ""); StringCopy(opt->referer, "");
StringCopy(opt->from, ""); StringCopy(opt->from, "");
opt->savename_83 = 0; // noms longs par défaut opt->savename_83 = HTS_SAVENAME_83_LONG; // long names by default
opt->savename_type = 0; // avec structure originale opt->savename_type = 0; // avec structure originale
opt->savename_delayed = 2; // hard delayed type (default) opt->savename_delayed =
opt->delayed_cached = 1; // cached delayed type (default) HTS_SAVENAME_DELAYED_HARD; // always delay the type check (default)
opt->mimehtml = 0; // pas MIME-html opt->delayed_cached = HTS_TRUE;
opt->mimehtml = HTS_FALSE;
opt->parsejava = HTSPARSE_DEFAULT; // parser classes opt->parsejava = HTSPARSE_DEFAULT; // parser classes
opt->hostcontrol = 0; // PAS de control host pour timeout et traffic jammer opt->hostcontrol = 0; // PAS de control host pour timeout et traffic jammer
opt->retry = 2; // 2 retry par défaut opt->retry = 2; // 2 retry par défaut
opt->errpage = 1; // copier ou générer une page d'erreur en cas d'erreur (404 etc.) opt->errpage = HTS_TRUE;
opt->check_type = 1; // vérifier type si inconnu (cgi,asp..) SAUF / considéré comme html // d'erreur (404 etc.)
opt->all_in_cache = 0; // ne pas tout stocker en cache opt->check_type = HTS_TRUE;
opt->robots = 2; // traiter les robots.txt // considéré comme html
opt->external = 0; // liens externes normaux opt->all_in_cache = HTS_FALSE;
opt->passprivacy = 0; // mots de passe dans les fichiers opt->robots = HTS_ROBOTS_ALWAYS; // traiter les robots.txt
opt->includequery = 1; // include query-string par défaut opt->external = HTS_FALSE;
opt->mirror_first_page = 0; // pas mode mirror links opt->passprivacy = HTS_FALSE;
opt->accept_cookie = 1; // gérer les cookies opt->includequery = HTS_TRUE;
opt->mirror_first_page = HTS_FALSE;
opt->accept_cookie = HTS_TRUE;
opt->cookie = NULL; opt->cookie = NULL;
opt->http10 = 0; // laisser http/1.1 opt->http10 = HTS_FALSE;
opt->nokeepalive = 0; // pas keep-alive opt->nokeepalive = HTS_FALSE;
opt->nocompression = 0; // pas de compression opt->nocompression = HTS_FALSE;
opt->tolerant = 0; // ne pas accepter content-length incorrect opt->tolerant = HTS_FALSE;
opt->parseall = 1; // tout parser (tags inconnus, par exemple) opt->parseall = HTS_TRUE;
opt->parsedebug = 0; // pas de mode débuggage opt->parsedebug = HTS_FALSE;
opt->norecatch = 0; // ne pas reprendre les fichiers effacés par l'utilisateur opt->norecatch = HTS_FALSE;
opt->verbosedisplay = 0; // pas d'animation texte opt->verbosedisplay = HTS_VERBOSE_NONE; // no text animation
opt->sizehack = 0; // size hack opt->sizehack = HTS_FALSE;
opt->urlhack = 1; // url hack (normalizer) opt->urlhack = HTS_TRUE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER); StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
opt->ftp_proxy = 1; // proxy http pour ftp opt->ftp_proxy = HTS_TRUE;
opt->convert_utf8 = 1; // convert html to UTF-8 opt->convert_utf8 = HTS_TRUE;
StringCopy(opt->filelist, ""); StringCopy(opt->filelist, "");
StringCopy(opt->lang_iso, "en, *"); StringCopy(opt->lang_iso, "en, *");
StringCopy(opt->accept, StringCopy(opt->accept,
@@ -5507,9 +5710,9 @@ HTSEXT_API httrackp *hts_create_opt(void) {
// //
opt->log = stdout; opt->log = stdout;
opt->errlog = stderr; opt->errlog = stderr;
opt->flush = 1; // flush sur les fichiers log opt->flush = HTS_TRUE;
//opt->aff_progress=0; // opt->aff_progress=0;
opt->keyboard = 0; opt->keyboard = HTS_FALSE;
// //
StringCopy(opt->path_html, ""); StringCopy(opt->path_html, "");
StringCopy(opt->path_html_utf8, ""); StringCopy(opt->path_html_utf8, "");
@@ -5526,10 +5729,10 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->waittime = -1; // wait until.. hh*3600+mm*60+ss opt->waittime = -1; // wait until.. hh*3600+mm*60+ss
// //
opt->exec = ""; opt->exec = "";
opt->is_update = 0; // not an update (yet) opt->is_update = HTS_FALSE;
opt->dir_topindex = 0; // do not built top index (yet) opt->dir_topindex = HTS_FALSE;
// //
opt->bypass_limits = 0; // enforce limits by default opt->bypass_limits = HTS_FALSE;
opt->state.stop = 0; // stopper opt->state.stop = 0; // stopper
opt->state.exit_xh = 0; // abort opt->state.exit_xh = 0; // abort
// //

@@ -182,6 +182,11 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode, const char *xsend
const char *adr, const char *fil, const char *adr, const char *fil,
const char *referer_adr, const char *referer_fil, const char *referer_adr, const char *referer_fil,
htsblk * retour); htsblk * retour);
/* Build the request "Cookie:" header line for stored cookies matching
domain/path into dst (NUL-terminated). Exposed for the -#Q self-test;
wraps the same logic http_sendhead() uses. Returns 1 if any cookie was
emitted, else 0. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size);
//int newhttp(char* iadr,char* err=NULL); //int newhttp(char* iadr,char* err=NULL);
T_SOC newhttp(httrackp * opt, const char *iadr, htsblk * retour, int port, T_SOC newhttp(httrackp * opt, const char *iadr, htsblk * retour, int port,
@@ -193,6 +198,17 @@ HTS_INLINE void deletesoc_r(htsblk * r);
htsblk http_test(httrackp * opt, const char *adr, const char *fil, char *loc); htsblk http_test(httrackp * opt, const char *adr, const char *fil, char *loc);
int check_readinput(htsblk * r); int check_readinput(htsblk * r);
int check_readinput_t(T_SOC soc, int timeout); int check_readinput_t(T_SOC soc, int timeout);
int check_writeinput_t(T_SOC soc, int timeout);
/* Open an HTTP CONNECT tunnel through the active proxy for an https request:
`retour->soc` must already be TCP-connected to the proxy, and `adr` is the
origin authority (url_adr, e.g. "https://host:port"). Sends the CONNECT
request (with Proxy-Authorization when the proxy carries credentials) and
reads the proxy's status line, so the caller's TLS handshake then runs
end-to-end with the origin. Blocks up to `timeout` seconds. Returns 1 on a
2xx tunnel, 0 on failure (retour->msg/statuscode set). */
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout);
void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * retour, void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * retour,
char *rcvd); char *rcvd);
void treatfirstline(htsblk * retour, const char *rcvd); void treatfirstline(htsblk * retour, const char *rcvd);
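Review note on the `http_proxy_tunnel()` contract ("returns 1 on a 2xx tunnel"): the implementation parses the proxy's status line with `sscanf` and accepts any code from 200 to 299. A minimal sketch of that acceptance test, using a hypothetical `demo_connect_ok()` helper:

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if line is an HTTP status line with a 2xx code, else 0. */
static int demo_connect_ok(const char *line) {
  int code = 0;

  assert(line != NULL);
  /* %*d skips the major/minor version digits without storing them, so the
     return value is 1 only when the status code itself was parsed. */
  if (sscanf(line, "HTTP/%*d.%*d %d", &code) < 1)
    code = 0;
  return code >= 200 && code < 300;
}
```

A malformed or empty line leaves `code` at 0 and is therefore treated as a refusal, the same outcome as an explicit 4xx or 5xx reply.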

@@ -184,10 +184,11 @@ int url_savename(lien_adrfilsave *const afs,
/* 8-3 ? */ /* 8-3 ? */
switch (opt->savename_83) { switch (opt->savename_83) {
case 1: // 8-3 case HTS_SAVENAME_83_DOS: // 8-3
max_char = 8; max_char = 8;
break; break;
case 2: // Level 2 File names may be up to 31 characters. case HTS_SAVENAME_83_ISO9660: // Level 2 File names may be up to 31
// characters.
max_char = 31; max_char = 31;
break; break;
default: default:
@@ -324,7 +325,7 @@ int url_savename(lien_adrfilsave *const afs,
} }
/* replace shtml to html.. */ /* replace shtml to html.. */
if (opt->savename_delayed == 2) if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD)
is_html = -1; /* ALWAYS delay type */ is_html = -1; /* ALWAYS delay type */
else else
is_html = ishtml(opt, fil); is_html = ishtml(opt, fil);
@@ -363,7 +364,9 @@ int url_savename(lien_adrfilsave *const afs,
) { ) {
// tester type avec requète HEAD si on ne connait pas le type du fichier // tester type avec requète HEAD si on ne connait pas le type du fichier
if (!((opt->check_type == 1) && (fil[strlen(fil) - 1] == '/'))) // slash doit être html? if (!((opt->check_type == 1) && (fil[strlen(fil) - 1] == '/'))) // slash doit être html?
if (opt->savename_delayed == 2 || (ishtest = ishtml(opt, fil)) < 0) { // on ne sait pas si c'est un html ou un fichier.. if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD ||
(ishtest = ishtml(opt, fil)) <
0) { // unsure whether it's html or a file
// lire dans le cache // lire dans le cache
htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test uniquement htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test uniquement
@@ -393,11 +396,12 @@ int url_savename(lien_adrfilsave *const afs,
} }
#endif #endif
// //
} else if (opt->savename_delayed != 2 && is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER. } else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD &&
Lookup mimetype not only by extension, is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
but also by filename */ Lookup mimetype not only by extension,
/* Note: "foo.cgi => text/html" means that foo.cgi shall have the text/html MIME file type, but also by filename */
that is, ".html" */ /* Note: "foo.cgi => text/html" means that foo.cgi shall have the
text/html MIME file type, that is, ".html" */
char BIGSTK mime[1024]; char BIGSTK mime[1024];
mime[0] = ext[0] = '\0'; mime[0] = ext[0] = '\0';
@@ -408,9 +412,13 @@ int url_savename(lien_adrfilsave *const afs,
} }
} }
} }
// note: if savename_delayed is enabled, the naming will be temporary (and slightly invalid!) // note: if savename_delayed is enabled, the naming will be temporary
// note: if we are about to stop (opt->state.stop), back_add() will fail later // (and slightly invalid!)
else if (opt->savename_delayed != 0 && !opt->state.stop) { //
// note: if we are about to stop (opt->state.stop), back_add() will
// fail later
else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
!opt->state.stop) {
// Check if the file is ready in backing. We basically take the same logic as later. // Check if the file is ready in backing. We basically take the same logic as later.
// FIXME: we should cleanup and factorize this unholy mess // FIXME: we should cleanup and factorize this unholy mess
if (headers != NULL && headers->status >= 0 && !is_redirect) { if (headers != NULL && headers->status >= 0 && !is_redirect) {
@@ -698,7 +706,7 @@ int url_savename(lien_adrfilsave *const afs,
} }
// restaurer // restaurer
opt->state._hts_in_html_parsing = hihp; opt->state._hts_in_html_parsing = hihp;
} // caché? } // caché?
} }
} }
} }
@@ -1190,7 +1198,8 @@ int url_savename(lien_adrfilsave *const afs,
// Not used anymore unless non-delayed types. // Not used anymore unless non-delayed types.
// de même en cas de manque d'extension on en place une de manière forcée.. // de même en cas de manque d'extension on en place une de manière forcée..
// cela évite les /chez/toto et les /chez/toto/index.html incompatibles // cela évite les /chez/toto et les /chez/toto/index.html incompatibles
if (opt->savename_type != -1 && opt->savename_delayed != 2) { if (opt->savename_type != -1 &&
opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD) {
char *a = afs->save + strlen(afs->save) - 1; char *a = afs->save + strlen(afs->save) - 1;
while((a > afs->save) && (*a != '.') && (*a != '/')) while((a > afs->save) && (*a != '.') && (*a != '/'))
@@ -1236,31 +1245,21 @@ int url_savename(lien_adrfilsave *const afs,
size_t i; size_t i;
for(i = 0 ; afs->save[i] != '\0' ; i++) { for(i = 0 ; afs->save[i] != '\0' ; i++) {
unsigned char c = (unsigned char) afs->save[i]; unsigned char c = (unsigned char) afs->save[i];
if (c < 32 // control if (c < 32 // control
|| c == 127 // unwise || c == 127 // unwise
|| c == '~' // unix unwise || c == '~' // unix unwise
|| c == '\\' // windows separator || c == '\\' // windows separator
|| c == ':' // windows forbidden || c == ':' // windows forbidden
|| c == '*' // windows forbidden || c == '*' // windows forbidden
|| c == '?' // windows forbidden || c == '?' // windows forbidden
|| c == '\"' // windows forbidden || c == '\"' // windows forbidden
|| c == '<' // windows forbidden || c == '<' // windows forbidden
|| c == '>' // windows forbidden || c == '>' // windows forbidden
|| c == '|' // windows forbidden || c == '|' // windows forbidden
//|| c == '@' // ? //|| c == '@' // ?
|| || (opt->savename_83 == HTS_SAVENAME_83_ISO9660 // CDROM
( && (c == '-' || c == '=' || c == '+'))) {
opt->savename_83 == 2 // CDROM afs->save[i] = '_';
&&
(
c == '-'
|| c == '='
|| c == '+'
)
)
)
{
afs->save[i] = '_';
} }
} }
} }
@@ -1521,7 +1520,8 @@ int url_savename(lien_adrfilsave *const afs,
char *a = afs->save + strlen(afs->save) - 1; char *a = afs->save + strlen(afs->save) - 1;
char *b; char *b;
int n = 2; int n = 2;
char collisionSeparator = ((opt->savename_83 != 2) ? '-' : '_'); char collisionSeparator =
((opt->savename_83 != HTS_SAVENAME_83_ISO9660) ? '-' : '_');
tempo[0] = '\0'; tempo[0] = '\0';

@@ -285,6 +285,102 @@ typedef enum htsparsejava_flags {
HTSPARSE_NO_AGGRESSIVE = 8 // don't aggressively parse .js or .java HTSPARSE_NO_AGGRESSIVE = 8 // don't aggressively parse .js or .java
} htsparsejava_flags; } htsparsejava_flags;
/* Link-rewriting style for saved pages (opt->urlmode). */
#ifndef HTS_DEF_DEFSTRUCT_hts_urlmode
#define HTS_DEF_DEFSTRUCT_hts_urlmode
typedef enum hts_urlmode {
HTS_URLMODE_ABSOLUTE = 0, /**< absolute URL (http://host/path) everywhere */
HTS_URLMODE_ABSOLUTE_FILE = 1, /**< legacy file: form, unused */
HTS_URLMODE_RELATIVE = 2, /**< relative link (default) */
HTS_URLMODE_ABSOLUTE_URI = 3, /**< absolute URI from root (/path) */
HTS_URLMODE_KEEP_ORIGINAL = 4, /**< keep the original link, do not rewrite */
HTS_URLMODE_TRANSPARENT_PROXY = 5 /**< transparent-proxy URL */
} hts_urlmode;
#endif
/* Cache policy for updates and retries (opt->cache). */
#ifndef HTS_DEF_DEFSTRUCT_hts_cachemode
#define HTS_DEF_DEFSTRUCT_hts_cachemode
typedef enum hts_cachemode {
HTS_CACHE_NONE = 0, /**< no cache */
HTS_CACHE_PRIORITY = 1, /**< cache takes priority over the network */
HTS_CACHE_TEST_UPDATE = 2 /**< check for update before reuse (default) */
} hts_cachemode;
#endif
/* Interactive wizard level (opt->wizard). */
#ifndef HTS_DEF_DEFSTRUCT_hts_wizard
#define HTS_DEF_DEFSTRUCT_hts_wizard
typedef enum hts_wizard {
HTS_WIZARD_NONE = 0, /**< no wizard */
HTS_WIZARD_ASK = 1, /**< wizard asks questions */
HTS_WIZARD_AUTO = 2 /**< wizard runs without asking */
} hts_wizard;
#endif
/* robots.txt / meta-robots obedience level (opt->robots). */
#ifndef HTS_DEF_DEFSTRUCT_hts_robots
#define HTS_DEF_DEFSTRUCT_hts_robots
typedef enum hts_robots {
HTS_ROBOTS_NEVER = 0, /**< ignore robots rules */
HTS_ROBOTS_SOMETIMES = 1, /**< partial obedience (default) */
HTS_ROBOTS_ALWAYS = 2, /**< obey robots rules */
HTS_ROBOTS_ALWAYS_STRICT = 3 /**< obey even strict rules */
} hts_robots;
#endif
/* What to fetch (opt->getmode bitmask). */
typedef enum hts_getmode {
HTS_GETMODE_HTML = 1 << 0, /**< save HTML files */
HTS_GETMODE_NONHTML = 1 << 1, /**< save non-HTML files */
HTS_GETMODE_HTML_FIRST = 1 << 2 /**< fetch HTML first, then the other files */
} hts_getmode;
/* Allowed directions in the directory tree (opt->seeker bitmask). */
typedef enum hts_seeker {
HTS_SEEKER_DOWN = 1 << 0, /**< may descend into subdirectories */
HTS_SEEKER_UP = 1 << 1 /**< may ascend to parent directories */
} hts_seeker;
/* opt->travel: link-following scope in the low byte, flags OR'd in above it. */
typedef enum hts_travel_scope {
HTS_TRAVEL_SAME_ADDRESS = 0, /**< stay on the same address (host) */
HTS_TRAVEL_SAME_DOMAIN = 1, /**< stay on the same principal domain */
HTS_TRAVEL_SAME_TLD = 2, /**< stay on the same TLD (e.g. .com) */
HTS_TRAVEL_EVERYWHERE = 7, /**< follow links anywhere on the web */
HTS_TRAVEL_TEST_ALL = 1 << 8 /**< also test forbidden URLs (-t) */
} hts_travel_scope;
/* Mask selecting the scope value out of opt->travel. */
#define HTS_TRAVEL_SCOPE_MASK 0xff
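The packing described for opt->travel (scope in the low byte, flags OR'd in above it) can be sketched as below. This is a minimal illustration, not code from the codebase: the enum values and the mask are copied from the header above, while `travel_scope()` and `travel_tests_all()` are hypothetical helper names introduced only for this example.

```c
/* Values copied from the hts_travel_scope enum in the diff above. */
typedef enum hts_travel_scope {
  HTS_TRAVEL_SAME_ADDRESS = 0,
  HTS_TRAVEL_SAME_DOMAIN = 1,
  HTS_TRAVEL_SAME_TLD = 2,
  HTS_TRAVEL_EVERYWHERE = 7,
  HTS_TRAVEL_TEST_ALL = 1 << 8
} hts_travel_scope;

#define HTS_TRAVEL_SCOPE_MASK 0xff

/* Hypothetical helper: keep only the scope stored in the low byte. */
static int travel_scope(int travel) {
  return travel & HTS_TRAVEL_SCOPE_MASK;
}

/* Hypothetical helper: 1 if the TEST_ALL flag is set, else 0. */
static int travel_tests_all(int travel) {
  return (travel & HTS_TRAVEL_TEST_ALL) != 0;
}
```

With these helpers, a value such as `HTS_TRAVEL_SAME_DOMAIN | HTS_TRAVEL_TEST_ALL` yields the scope `HTS_TRAVEL_SAME_DOMAIN` and a set flag, which is why a plain `==` comparison against a scope constant is only safe after masking.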
/* Text progress display detail (opt->verbosedisplay). */
typedef enum hts_verbosedisplay {
HTS_VERBOSE_NONE = 0, /**< no animated progress display (default) */
HTS_VERBOSE_SIMPLE = 1, /**< minimal single-line progress */
HTS_VERBOSE_FULL = 2 /**< full animated progress */
} hts_verbosedisplay;
/* Delayed file-type resolution policy (opt->savename_delayed). */
typedef enum hts_savename_delayed {
HTS_SAVENAME_DELAYED_NONE = 0, /**< resolve the type immediately */
HTS_SAVENAME_DELAYED_SOFT = 1, /**< delay the type check when unknown */
HTS_SAVENAME_DELAYED_HARD = 2 /**< always delay the type check (default) */
} hts_savename_delayed;
/* Saved-name length layout (opt->savename_83). */
typedef enum hts_savename_83 {
HTS_SAVENAME_83_LONG = 0, /**< long file names (default) */
HTS_SAVENAME_83_DOS = 1, /**< DOS 8.3 names (ISO9660 level 1) */
HTS_SAVENAME_83_ISO9660 = 2 /**< ISO9660 level 2 names (up to 31 chars) */
} hts_savename_83;
/* Host-banning triggers (opt->hostcontrol bitmask). */
typedef enum hts_hostcontrol {
HTS_HOSTCONTROL_BAN_TIMEOUT = 1 << 0, /**< ban a timing-out host */
HTS_HOSTCONTROL_BAN_SLOW = 1 << 1 /**< ban a too-slow host */
} hts_hostcontrol;
#ifndef HTS_DEF_FWSTRUCT_lien_buffers #ifndef HTS_DEF_FWSTRUCT_lien_buffers
#define HTS_DEF_FWSTRUCT_lien_buffers #define HTS_DEF_FWSTRUCT_lien_buffers
typedef struct lien_buffers lien_buffers; typedef struct lien_buffers lien_buffers;
@@ -308,15 +404,16 @@ typedef struct httrackp httrackp;
struct httrackp { struct httrackp {
size_t size_httrackp; /**< size of this structure (version/ABI guard) */ size_t size_httrackp; /**< size of this structure (version/ABI guard) */
/* */ /* */
int wizard; /**< interactive wizard level (none/full/light) */ hts_wizard wizard; /**< interactive wizard level (none/ask/auto) */
int flush; /**< fflush() log files after each write */ hts_boolean flush; /**< fflush() log files after each write */
int travel; /**< link-following scope (same domain, etc.) */ int travel; /**< link-following scope (same domain, etc.) */
int seeker; /**< allowed direction: go up and/or down the tree */ int seeker; /**< allowed direction: go up and/or down the tree */
int depth; /**< maximum recursion depth (-rN) */ int depth; /**< maximum recursion depth (-rN) */
int extdepth; /**< maximum recursion depth outside the start domain */ int extdepth; /**< maximum recursion depth outside the start domain */
int urlmode; /**< saved-link rewriting style (relative, absolute, etc.) */ hts_urlmode
int no_type_change; // do not change file type according to MIME urlmode; /**< saved-link rewriting style (relative, absolute, etc.) */
int debug; /**< debug logging level */ hts_boolean no_type_change; // do not change file type according to MIME
hts_log_type debug; /**< debug logging level */
int getmode; /**< what to fetch (HTML, images, ...) bitmask */ int getmode; /**< what to fetch (HTML, images, ...) bitmask */
FILE *log; /**< informational log stream; NULL mutes it */ FILE *log; /**< informational log stream; NULL mutes it */
FILE *errlog; /**< error log stream; NULL mutes it */ FILE *errlog; /**< error log stream; NULL mutes it */
@@ -325,28 +422,31 @@ struct httrackp {
LLint maxfile_html; /**< max bytes per HTML file */ LLint maxfile_html; /**< max bytes per HTML file */
int maxsoc; /**< max simultaneous sockets (-cN) */ int maxsoc; /**< max simultaneous sockets (-cN) */
LLint fragment; /**< split site after this many bytes */ LLint fragment; /**< split site after this many bytes */
int nearlink; /**< also fetch images/data adjacent to a page but off-site */ hts_boolean
int makeindex; /**< build a top-level index.html */ nearlink; /**< also fetch images/data adjacent to a page but off-site */
int kindex; /**< build a keyword index */ hts_boolean makeindex; /**< build a top-level index.html */
int delete_old; /**< delete locally obsolete files after update */ hts_boolean kindex; /**< build a keyword index */
hts_boolean delete_old; /**< delete locally obsolete files after update */
int timeout; /**< connection timeout in seconds */ int timeout; /**< connection timeout in seconds */
int rateout; /**< minimum transfer rate (bytes/s) before abort */ int rateout; /**< minimum transfer rate (bytes/s) before abort */
int maxtime; /**< max total mirror duration in seconds */ int maxtime; /**< max total mirror duration in seconds */
int maxrate; /**< max transfer rate cap (bytes/s) */ int maxrate; /**< max transfer rate cap (bytes/s) */
float maxconn; /**< max connections per second */ float maxconn; /**< max connections per second */
int waittime; /**< scheduled start time (wall-clock seconds) */ int waittime; /**< scheduled start time (wall-clock seconds) */
int cache; /**< cache generation mode */ hts_cachemode cache; /**< cache generation mode */
// int aff_progress; // progress bar // int aff_progress; // progress bar
int shell; /**< driven by a shell over stdin/stdout pipes */ hts_boolean shell; /**< driven by a shell over stdin/stdout pipes */
t_proxy proxy; /**< proxy configuration */ t_proxy proxy; /**< proxy configuration */
int savename_83; /**< force 8.3 (DOS) file names */ hts_savename_83
savename_83; /**< saved-name length layout (long/DOS/ISO9660) */
int savename_type; /**< saved-name layout (original tree, flat, ...) */ int savename_type; /**< saved-name layout (original tree, flat, ...) */
String String
savename_userdef; /**< user-defined name template (e.g. %h%p/%n%q.%t) */ savename_userdef; /**< user-defined name template (e.g. %h%p/%n%q.%t) */
int savename_delayed; // delayed type check hts_savename_delayed savename_delayed; /**< delayed type-check policy */
int delayed_cached; // delayed type check can be cached to speedup updates hts_boolean
int mimehtml; /**< produce a single MIME/MHTML archive */ delayed_cached; // delayed type check can be cached to speedup updates
int user_agent_send; /**< send a User-Agent header */ hts_boolean mimehtml; /**< produce a single MIME/MHTML archive */
hts_boolean user_agent_send; /**< send a User-Agent header */
String user_agent; /**< User-Agent value (e.g. httrack/1.0) */ String user_agent; /**< User-Agent value (e.g. httrack/1.0) */
String referer; /**< Referer value to send */ String referer; /**< Referer value to send */
String from; /**< From value to send */ String from; /**< From value to send */
@@ -355,37 +455,39 @@ struct httrackp {
String path_html_utf8; /**< output directory for the mirror, UTF-8 form */ String path_html_utf8; /**< output directory for the mirror, UTF-8 form */
String path_bin; /**< directory for HTML templates */ String path_bin; /**< directory for HTML templates */
int retry; /**< extra retries on a failed transfer */ int retry; /**< extra retries on a failed transfer */
int makestat; /**< maintain a transfer-statistics log */ hts_boolean makestat; /**< maintain a transfer-statistics log */
int maketrack; /**< maintain an operations-statistics log */ hts_boolean maketrack; /**< maintain an operations-statistics log */
int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */ int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */
int hostcontrol; /**< drop hosts that are too slow, etc. */ int hostcontrol; /**< ban slow/timing-out hosts; see hts_hostcontrol bits */
int errpage; /**< generate an error page on 404 and similar */ hts_boolean errpage; /**< generate an error page on 404 and similar */
int check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves hts_boolean
*/ check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves
int all_in_cache; /**< keep all retrieved data in the cache */ */
int robots; /**< robots.txt handling level */ hts_boolean all_in_cache; /**< keep all retrieved data in the cache */
int external; /**< render external links as error pages */ hts_robots robots; /**< robots.txt handling level */
int passprivacy; /**< strip passwords from external links */ hts_boolean external; /**< render external links as error pages */
int includequery; /**< include the query string in saved names */ hts_boolean passprivacy; /**< strip passwords from external links */
int mirror_first_page; /**< only mirror the links of the first page */ hts_boolean includequery; /**< include the query string in saved names */
hts_boolean mirror_first_page; /**< only mirror the links of the first page */
String sys_com; /**< system command to run */ String sys_com; /**< system command to run */
int sys_com_exec; /**< actually execute sys_com */ hts_boolean sys_com_exec; /**< actually execute sys_com */
int accept_cookie; /**< accept and send cookies */ hts_boolean accept_cookie; /**< accept and send cookies */
t_cookie *cookie; /**< cookie store */ t_cookie *cookie; /**< cookie store */
int http10; /**< force HTTP/1.0 */ hts_boolean http10; /**< force HTTP/1.0 */
int nokeepalive; /**< disable keep-alive */ hts_boolean nokeepalive; /**< disable keep-alive */
int nocompression; /**< disable content compression */ hts_boolean nocompression; /**< disable content compression */
int sizehack; /**< treat same-size response as "updated" */ hts_boolean sizehack; /**< treat same-size response as "updated" */
int urlhack; // force "url normalization" to avoid loops hts_boolean urlhack; // force "url normalization" to avoid loops
int tolerant; /**< accept an incorrect Content-Length */ hts_boolean tolerant; /**< accept an incorrect Content-Length */
int parseall; /**< parse aggressively, including unknown tags with links */ hts_boolean
int parsedebug; /**< parser debug mode */ parseall; /**< parse aggressively, including unknown tags with links */
int norecatch; /**< do not re-fetch files the user deleted locally */ hts_boolean parsedebug; /**< parser debug mode */
int verbosedisplay; /**< animated text progress display */ hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */
hts_verbosedisplay verbosedisplay; /**< animated text progress display */
String footer; /**< footer/info line injected into pages */ String footer; /**< footer/info line injected into pages */
int maxcache; /**< in-memory cache backing limit (bytes) */ int maxcache; /**< in-memory cache backing limit (bytes) */
// int maxcache_anticipate; // maximum links to anticipate (upper bound) // int maxcache_anticipate; // maximum links to anticipate (upper bound)
int ftp_proxy; /**< use the HTTP proxy for FTP too */ hts_boolean ftp_proxy; /**< use the HTTP proxy for FTP too */
String filelist; /**< file listing URLs to include */ String filelist; /**< file listing URLs to include */
String urllist; /**< file listing filters to include */ String urllist; /**< file listing filters to include */
htsfilters filters; /**< filter pointers (+/-pattern rules) */ htsfilters filters; /**< filter pointers (+/-pattern rules) */
@@ -399,20 +501,20 @@ struct httrackp {
String headers; // Additional headers String headers; // Additional headers
String mimedefs; // ext1=mimetype1\next2=mimetype2.. String mimedefs; // ext1=mimetype1\next2=mimetype2..
String mod_blacklist; /**< blacklisted modules */ String mod_blacklist; /**< blacklisted modules */
int convert_utf8; // filenames UTF-8 conversion (3.46) hts_boolean convert_utf8; // filenames UTF-8 conversion (3.46)
// //
int maxlink; /**< max number of links */ int maxlink; /**< max number of links */
int maxfilter; /**< max number of filters */ int maxfilter; /**< max number of filters */
// //
const char *exec; /**< path of the running executable */ const char *exec; /**< path of the running executable */
// //
int quiet; /**< suppress non-wizard questions */ hts_boolean quiet; /**< suppress non-wizard questions */
int keyboard; /**< poll stdin for keyboard input */ hts_boolean keyboard; /**< poll stdin for keyboard input */
int bypass_limits; // bypass built-in limits hts_boolean bypass_limits; // bypass built-in limits
int background_on_suspend; // background process on suspend signal hts_boolean background_on_suspend; // background process on suspend signal
// //
int is_update; /**< this run is an update (show "File updated...") */ hts_boolean is_update; /**< this run is an update (show "File updated...") */
int dir_topindex; /**< rebuild the top index afterwards */ hts_boolean dir_topindex; /**< rebuild the top index afterwards */
// //
// callbacks // callbacks
t_hts_htmlcheck_callbacks t_hts_htmlcheck_callbacks

@@ -296,6 +296,48 @@ static const char *html_inline_safe(const char *src, char *dst, size_t size) {
return dst; return dst;
} }
/* Byte before html, or a space sentinel at the buffer start where html[-1]
would underflow; space reads as the word boundary the guards want there. */
static HTS_INLINE char html_prevc(const char *html, const char *start) {
return html > start ? html[-1] : ' ';
}
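The sentinel idea behind html_prevc can be sketched in isolation as follows. This is only an illustration under assumptions: `prev_char()` mirrors the helper above, and `at_word_start()` is a hypothetical name for the kind of guard the parser applies before matching a keyword such as `url`.

```c
#include <ctype.h>

/* Byte before p, or a space when p is already at the buffer start,
   so that p[-1] is never read out of bounds. */
static char prev_char(const char *p, const char *start) {
  return p > start ? p[-1] : ' ';
}

/* Hypothetical guard: true if the token at p does not continue an
   identifier (previous byte is neither alphanumeric nor '_'). */
static int at_word_start(const char *p, const char *start) {
  const unsigned char c = (unsigned char) prev_char(p, start);
  return !isalnum(c) && c != '_';
}
```

Because the sentinel is a space, a keyword at offset 0 is treated like one that follows whitespace, which matches the comment's note that space reads as a word boundary.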
/* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
argument is a method, not a URL: #218). Case-insensitive. */
static int is_http_method(const char *s, size_t len) {
static const char *const methods[] = {"GET", "POST", "PUT",
"DELETE", "HEAD", "OPTIONS",
"PATCH", "TRACE", NULL};
int i;
for (i = 0; methods[i] != NULL; i++) {
if (strlen(methods[i]) == len && strfield(s, methods[i]) == (int) len)
return 1;
}
return 0;
}
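A standalone sketch of the same length-plus-case-insensitive check is shown below. It is an assumption-laden stand-in rather than the function above: `method_token_matches()` is a hypothetical name, and it compares with `tolower()` from the C library instead of the project's `strfield()` so that it compiles on its own.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: 1 if [s, s+len) is exactly one HTTP method
   name, ignoring case; 0 otherwise. */
static int method_token_matches(const char *s, size_t len) {
  static const char *const methods[] = {"GET",   "POST",    "PUT",
                                        "DELETE", "HEAD",   "OPTIONS",
                                        "PATCH", "TRACE",   NULL};
  size_t i, k;

  for (i = 0; methods[i] != NULL; i++) {
    if (strlen(methods[i]) != len)
      continue; /* the length must match exactly, not just a prefix */
    for (k = 0; k < len; k++) {
      if (tolower((unsigned char) s[k]) !=
          tolower((unsigned char) methods[i][k]))
        break;
    }
    if (k == len)
      return 1;
  }
  return 0;
}
```

The exact-length test is what keeps a URL such as `/api/get` or a longer token such as `GETS` from being rejected as a method.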
/* Percent-encode '(' and ')' in a link emitted into an unquoted url(...) (CSS
or JS): a literal ')' closes the token early and the UA mis-parses the value
(#163). The UA decodes %28/%29 back to the saved-on-disk name. */
static void escape_url_parens(char *const s, const size_t size) {
char BIGSTK buff[HTS_URLMAXSIZE * 2];
size_t i, j;
for (i = 0, j = 0; s[i] != '\0' && j + 3 < size && j + 3 < sizeof(buff);
i++) {
if (s[i] == '(' || s[i] == ')') {
buff[j++] = '%';
buff[j++] = '2';
buff[j++] = s[i] == '(' ? '8' : '9';
} else {
buff[j++] = s[i];
}
}
buff[j] = '\0';
strlcpybuff(s, buff, size);
}
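The parenthesis escaping can be illustrated with a self-contained variant. This sketch is not the function above: `escape_parens()` is a hypothetical name, it writes into a separate destination buffer, and it reports failure instead of truncating when the output does not fit, which is a design choice made only for this example.

```c
#include <stddef.h>

/* Hypothetical variant: copy src to dst, replacing '(' with %28 and
   ')' with %29. Returns 0 on success, -1 if dst is too small (dst is
   then set to the empty string when it has room for it). */
static int escape_parens(const char *src, char *dst, size_t dst_size) {
  size_t j = 0;

  if (dst_size == 0)
    return -1;
  for (; *src != '\0'; src++) {
    const int paren = (*src == '(' || *src == ')');
    const size_t need = paren ? 3 : 1;

    if (j + need >= dst_size) { /* keep one byte for the NUL */
      dst[0] = '\0';
      return -1;
    }
    if (paren) {
      dst[j++] = '%';
      dst[j++] = '2';
      dst[j++] = (*src == '(') ? '8' : '9';
    } else {
      dst[j++] = *src;
    }
  }
  dst[j] = '\0';
  return 0;
}
```

For an input such as `a(1).png` the output is `a%281%29.png`, which no longer contains a literal `)` that could close an unquoted `url(...)` token early.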
/* Main parser */ /* Main parser */
int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) { int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
char catbuff[CATBUFF_SIZE]; char catbuff[CATBUFF_SIZE];
@@ -349,7 +391,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
#endif #endif
// Now, parsing // Now, parsing
if ((opt->getmode & 1) && (ptr > 0)) { // fetch the html files to disk if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
// create the local html file // create the local html file
HT_ADD_FOP; // write the file progressively HT_ADD_FOP; // write the file progressively
} }
@@ -553,10 +595,10 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (opt->depth == heap(ptr)->depth) { // always record the first links if (opt->depth == heap(ptr)->depth) { // always record the first links
if (!in_media) { if (!in_media) {
if (opt->makeindex && (ptr > 0)) { if (opt->makeindex && (ptr > 0)) {
if (opt->getmode & 1) { // write permission if (opt->getmode & HTS_GETMODE_HTML) {
p = strfield(html, "title"); p = strfield(html, "title");
if (p) { if (p) {
if (*(html - 1) == '/') if (html_prevc(html, r->adr) == '/')
p = 0; // /title p = 0; // /title
} else { } else {
if (strfield(html, "/html")) if (strfield(html, "/html"))
@@ -704,7 +746,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
} }
} }
if (opt->getmode & 1) { // save html if (opt->getmode & HTS_GETMODE_HTML) { // save html
p = 0; p = 0;
switch (emited_footer) { switch (emited_footer) {
case 0: case 0:
@@ -740,7 +782,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (strchr(r->adr, '\r')) if (strchr(r->adr, '\r'))
eol = "\r\n"; eol = "\r\n";
if (StringNotEmpty(opt->footer) || opt->urlmode != 4) { /* != preserve */ if (StringNotEmpty(opt->footer) ||
opt->urlmode != HTS_URLMODE_KEEP_ORIGINAL) {
if (StringNotEmpty(opt->footer)) { if (StringNotEmpty(opt->footer)) {
char BIGSTK tempo[1024 + HTS_URLMAXSIZE * 2]; char BIGSTK tempo[1024 + HTS_URLMAXSIZE * 2];
char gmttime[256]; char gmttime[256];
@@ -1340,6 +1383,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int can_avoid_quotes = 0; int can_avoid_quotes = 0;
char quotes_replacement = '\0'; char quotes_replacement = '\0';
int ensure_not_mime = 0; int ensure_not_mime = 0;
// .open(method,url): reject an HTTP-method first arg (#218)
int ensure_not_method = 0;
// @import: the quoted token is the URL; a trailing
// media/supports/layer condition is not part of it
int is_import = 0;
if (inscript_tag) if (inscript_tag)
expected_end = ";\"\'"; // see a href="javascript:doc.location='foo'" expected_end = ";\"\'"; // see a href="javascript:doc.location='foo'"
@@ -1356,9 +1404,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (!nc) if (!nc)
nc = strfield(html, ":location"); // javascript:location="doc" nc = strfield(html, ":location"); // javascript:location="doc"
if (!nc) { // location="doc" if (!nc) { // location="doc"
if ((nc = strfield(html, "location")) if ((nc = strfield(html, "location")) &&
&& !isspace(*(html - 1)) !isspace(html_prevc(html, r->adr)))
)
nc = 0; nc = 0;
} }
if (!nc) if (!nc)
@@ -1368,6 +1415,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
expected = '('; // parenthesis expected = '('; // parenthesis
expected_end = "),"; // end: comma or parenthesis expected_end = "),"; // end: comma or parenthesis
ensure_not_mime = 1; //* ensure the url is not a mime type */ ensure_not_mime = 1; //* ensure the url is not a mime type */
ensure_not_method = 1; // xhr.open: don't grab method
} }
if (!nc) if (!nc)
if ((nc = strfield(html, ".replace"))) { // window.replace("url") if ((nc = strfield(html, ".replace"))) { // window.replace("url")
@@ -1379,7 +1427,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
expected = '('; // parenthesis expected = '('; // parenthesis
expected_end = ")"; // end: parenthesis expected_end = ")"; // end: parenthesis
} }
if (!nc && (nc = strfield(html, "url")) && (!isalnum(*(html - 1))) && *(html - 1) != '_') { // url(url) if (!nc && (nc = strfield(html, "url")) &&
(!isalnum(html_prevc(html, r->adr))) &&
html_prevc(html, r->adr) != '_') { // url(url)
expected = '('; // parenthesis expected = '('; // parenthesis
expected_end = ")"; // end: parenthesis expected_end = ")"; // end: parenthesis
can_avoid_quotes = 1; can_avoid_quotes = 1;
@@ -1389,6 +1439,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((nc = strfield(html, "import"))) { // import "url" if ((nc = strfield(html, "import"))) { // import "url"
if (is_space(*(html + nc))) { if (is_space(*(html + nc))) {
expected = 0; // no char expected expected = 0; // no char expected
is_import = 1;
} else } else
nc = 0; nc = 0;
} }
@@ -1406,6 +1457,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) { if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) {
const char *b, *c; const char *b, *c;
int ndelim = 1; int ndelim = 1;
int valid_url = 0;
if ((*a == 34) || (*a == '\'')) if ((*a == 34) || (*a == '\''))
a++; a++;
@@ -1420,12 +1472,20 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
b++; b++;
} }
c = b--; c = b--;
c += ndelim; // no closing delimiter here (truncated input):
while(*c == ' ') // Don't scan past the buffer NUL or capture it.
c++; if (*c != '\0') {
if ((strchr(expected_end, *c)) || (*c == '\n') c += ndelim;
|| (*c == '\r')) { while (*c == ' ')
c -= (ndelim + 1); c++;
valid_url =
(strchr(expected_end, *c)) || (*c == '\n') ||
(*c == '\r') ||
(is_import && *(b + 1 + ndelim) == ' ');
}
if (valid_url) {
// URL end = last char (b), not the delimiter
c = b;
if ((int) (c - a + 1)) { if ((int) (c - a + 1)) {
if (ensure_not_mime) { if (ensure_not_mime) {
int i = 0; int i = 0;
@@ -1441,6 +1501,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
i++; i++;
} }
} }
// XHR.open's "GET" etc. is a method, not a URL
if (a != NULL && ensure_not_method &&
is_http_method(a, (size_t) (c - a + 1))) {
a = NULL;
}
// Check for bogus links (Vasiliy) // Check for bogus links (Vasiliy)
if (a != NULL) { if (a != NULL) {
const size_t size = c - a + 1; const size_t size = c - a + 1;
@@ -1484,7 +1549,6 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
} }
} }
} }
} }
} }
} }
@@ -1691,6 +1755,24 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
hts_nodetect[i - hts_nodetect[i -
1]); 1]);
} }
// xmlns / xmlns:prefix declare
// XML namespaces, not resources
// (#191)
else {
const int xl = strfield(
intag_startattr, "xmlns");
const char xc =
intag_startattr[xl];
if (xl &&
(xc == ':' || xc == '=' ||
is_space(xc))) {
url_ok = 0;
hts_log_print(
opt, LOG_DEBUG,
"dirty parsing: xmlns "
"namespace avoided");
}
}
} }
} }
@@ -1746,7 +1828,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
// write codebase first, flush before code // write codebase first, flush before code
if ((p_type == -1) || (p_type == -2)) { if ((p_type == -1) || (p_type == -2)) {
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
HT_add_adr; // refresh HT_add_adr; // refresh
} }
lastsaved = html; // last written+1 lastsaved = html; // last written+1
@@ -1837,7 +1919,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
// do not flush after code if the codebase must be written first! // do not flush after code if the codebase must be written first!
if ((p_type != -1) && (p_type != 2) && (p_type != -2)) { if ((p_type != -1) && (p_type != 2) && (p_type != -2)) {
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
HT_add_adr; // refresh HT_add_adr; // refresh
} }
lastsaved = html; // last written+1 lastsaved = html; // last written+1
@@ -1914,7 +1996,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (*html != '#') { // Not empty+unique # if (*html != '#') { // Not empty+unique #
if (eadr - html == 1) { // 1=link empty with delim (end_adr-start_adr) if (eadr - html == 1) { // 1=link empty with delim (end_adr-start_adr)
if (quote) { if (quote) {
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
HT_ADD("#"); // We add this for a <href=""> HT_ADD("#"); // We add this for a <href="">
} }
} }
@@ -2569,7 +2651,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((p_type == 2) || (p_type == -2)) { // base href or codebase, not a link if ((p_type == 2) || (p_type == -2)) { // base href or codebase, not a link
hts_log_print(opt, LOG_DEBUG, "Code/Codebase: %s%s", hts_log_print(opt, LOG_DEBUG, "Code/Codebase: %s%s",
afs.af.adr, afs.af.fil); afs.af.adr, afs.af.fil);
} else if ((opt->getmode & 4) == 0) { } else if ((opt->getmode & HTS_GETMODE_HTML_FIRST) ==
0) {
hts_log_print(opt, LOG_DEBUG, "Record: %s%s -> %s", hts_log_print(opt, LOG_DEBUG, "Record: %s%s -> %s",
afs.af.adr, afs.af.fil, afs.save); afs.af.adr, afs.af.fil, afs.save);
} else { } else {
@@ -2592,8 +2675,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
lastsaved = eadr - 1 + 1; // skip " lastsaved = eadr - 1 + 1; // skip "
} }
/* */ /* */
else if (opt->urlmode == 0) { // absolute URL in all cases else if (opt->urlmode == HTS_URLMODE_ABSOLUTE) {
if ((opt->getmode & 1) && (ptr > 0)) { // write the html files if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
if (!link_has_authority(afs.af.adr)) { if (!link_has_authority(afs.af.adr)) {
HT_ADD("http://"); HT_ADD("http://");
} else { } else {
@@ -2620,12 +2703,14 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
} }
lastsaved = eadr - 1; // last written+1 (well, a ++ is applied afterwards) lastsaved = eadr - 1; // last written+1 (well, a ++ is applied afterwards)
/* */ /* */
} else if (opt->urlmode == 4) { // do nothing! } else if (opt->urlmode == HTS_URLMODE_KEEP_ORIGINAL) {
/* */ /* */
/* leave the link 'as is' */ /* leave the link 'as is' */
/* Otherwise, depends on internal/external */ /* Otherwise, depends on internal/external */
} else if (forbidden_url == 1) { // the link will not be loaded, external reference! } else if (forbidden_url ==
if ((opt->getmode & 1) && (ptr > 0)) { 1) { // the link will not be loaded, external
// reference!
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
if (p_type != -1) { // not just the file name (not a java class) if (p_type != -1) { // not just the file name (not a java class)
if (!opt->external) { if (!opt->external) {
if (!link_has_authority(afs.af.adr)) { if (!link_has_authority(afs.af.adr)) {
@@ -2674,7 +2759,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
'/') ? 1 : (ishtml(opt, afs.af.fil)))) { '/') ? 1 : (ishtml(opt, afs.af.fil)))) {
case 1: case 1:
case -2: // html or directory case -2: // html or directory
if (opt->getmode & 1) { // save html if (opt->getmode & HTS_GETMODE_HTML) {
patch_it = 1; // redirect patch_it = 1; // redirect
add_url = 1; // with link? add_url = 1; // with link?
cat_name = "external.html"; cat_name = "external.html";
@@ -2847,7 +2932,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
} }
// write codebase="path" // write codebase="path"
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) &&
(ptr > 0)) {
char BIGSTK tempo4[HTS_URLMAXSIZE * 2]; char BIGSTK tempo4[HTS_URLMAXSIZE * 2];
tempo4[0] = '\0'; tempo4[0] = '\0';
@@ -2875,9 +2961,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
lastsaved = eadr - 1; lastsaved = eadr - 1;
} }
/* /*
else if (opt->urlmode==1) { // ABSOLUTE, the least common case else if (opt->urlmode==1) { // ABSOLUTE, the least
common case
// DOES NOT WORK!! (and is useless) // DOES NOT WORK!! (and is useless)
if ((opt->getmode & 1) && (ptr>0)) { // write the html files if ((opt->getmode & 1) && (ptr>0)) { // write the html
files
// write the modified, absolute link // write the modified, absolute link
HT_ADD("file:"); HT_ADD("file:");
if (*save=='/') if (*save=='/')
@@ -2885,7 +2973,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
else else
HT_ADD(save) HT_ADD(save)
} }
lastsaved=eadr-1; // last written+1 (well, a ++ is applied afterwards) lastsaved=eadr-1; // last written+1 (well, a ++
is applied afterwards)
} }
*/ */
else if (opt->mimehtml) { else if (opt->mimehtml) {
@@ -2895,18 +2984,18 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
make_content_id(afs.af.adr, afs.af.fil, cid, sizeof(cid)); make_content_id(afs.af.adr, afs.af.fil, cid, sizeof(cid));
HT_ADD_HTMLESCAPED(cid); HT_ADD_HTMLESCAPED(cid);
lastsaved = eadr - 1; // last written+1 (well, a ++ is applied afterwards) lastsaved = eadr - 1; // last written+1 (well, a ++ is applied afterwards)
} else if (opt->urlmode == 3) { // absolute URI / } else if (opt->urlmode == HTS_URLMODE_ABSOLUTE_URI) {
if ((opt->getmode & 1) && (ptr > 0)) { // write the html files if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
HT_ADD_HTMLESCAPED(afs.af.fil); HT_ADD_HTMLESCAPED(afs.af.fil);
} }
lastsaved = eadr - 1; // last written+1 (well, a ++ is applied afterwards) lastsaved = eadr - 1; // last written+1 (well, a ++ is applied afterwards)
} else if (opt->urlmode == 5) { // transparent proxy URL } else if (opt->urlmode == HTS_URLMODE_TRANSPARENT_PROXY) {
char BIGSTK tempo[HTS_URLMAXSIZE * 2]; char BIGSTK tempo[HTS_URLMAXSIZE * 2];
const char *uri; const char *uri;
int i; int i;
char *pos; char *pos;
if ((opt->getmode & 1) && (ptr > 0)) { // write the html files if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
if (!link_has_authority(afs.af.adr)) { if (!link_has_authority(afs.af.adr)) {
HT_ADD("http://"); HT_ADD("http://");
} else { } else {
@@ -2947,7 +3036,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
HT_ADD_HTMLESCAPED(tempo); HT_ADD_HTMLESCAPED(tempo);
} }
lastsaved = eadr - 1; // dernier écrit+1 (enfin euh apres on fait un ++ alors hein) lastsaved = eadr - 1; // dernier écrit+1 (enfin euh apres on fait un ++ alors hein)
} else if (opt->urlmode == 2) { // RELATIF } else if (opt->urlmode == HTS_URLMODE_RELATIVE) {
char BIGSTK tempo[HTS_URLMAXSIZE * 2]; char BIGSTK tempo[HTS_URLMAXSIZE * 2];
tempo[0] = '\0'; tempo[0] = '\0';
@@ -2959,6 +3048,10 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
/* Never escape high-chars (we don't know the encoding!!) */ /* Never escape high-chars (we don't know the encoding!!) */
inplace_escape_uri_utf(tempo, sizeof(tempo)); inplace_escape_uri_utf(tempo, sizeof(tempo));
// unquoted url() (CSS/JS): keep parens escaped
if (ending_p == ')')
escape_url_parens(tempo, sizeof(tempo));
//if (!no_esc_utf) //if (!no_esc_utf)
// escape_uri(tempo); // escape with %xx // escape_uri(tempo); // escape with %xx
//else { //else {
@@ -3009,7 +3102,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
} }
// érire codebase="chemin" // érire codebase="chemin"
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) &&
(ptr > 0)) {
char BIGSTK tempo4[HTS_URLMAXSIZE * 2]; char BIGSTK tempo4[HTS_URLMAXSIZE * 2];
tempo4[0] = '\0'; tempo4[0] = '\0';
@@ -3027,7 +3121,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
//lastsaved=adr; // dernier écrit+1 //lastsaved=adr; // dernier écrit+1
} }
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
// convert to local codepage - NOT, already converted into %NN, and passed to the remote server so we do not have anything to do // convert to local codepage - NOT, already converted into %NN, and passed to the remote server so we do not have anything to do
//if (str->page_charset_ != NULL && *str->page_charset_ != '\0') { //if (str->page_charset_ != NULL && *str->page_charset_ != '\0') {
// char *const local_save = hts_convertStringFromUTF8(tempo, strlen(tempo), str->page_charset_); // char *const local_save = hts_convertStringFromUTF8(tempo, strlen(tempo), str->page_charset_);
@@ -3061,7 +3155,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
"Error building relative link %s and %s", "Error building relative link %s and %s",
afs.save, relativesavename()); afs.save, relativesavename());
} }
} // sinon le lien sera écrit normalement } // sinon le lien sera écrit normalement
#if 0 #if 0
if (fexist(save)) { // le fichier existe.. if (fexist(save)) { // le fichier existe..
@@ -3089,7 +3183,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
opt->maxlink); opt->maxlink);
hts_log_print(opt, LOG_INFO, hts_log_print(opt, LOG_INFO,
"To avoid that: use #L option for more links (example: -#L1000000)"); "To avoid that: use #L option for more links (example: -#L1000000)");
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
if (fp) { if (fp) {
fclose(fp); fclose(fp);
fp = NULL; fp = NULL;
@@ -3101,9 +3195,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int pass_fix, dejafait = 0; int pass_fix, dejafait = 0;
// Calculer la priorité de ce lien // Calculer la priorité de ce lien
if ((opt->getmode & 4) == 0) { // traiter html après if ((opt->getmode & HTS_GETMODE_HTML_FIRST) == 0) {
pass_fix = 0; pass_fix = 0;
} else { // vérifier que ce n'est pas un !html } else { // vérifier que ce n'est pas un !html
if (!ishtml(opt, afs.af.fil)) if (!ishtml(opt, afs.af.fil))
pass_fix = 1; // priorité inférieure (traiter après) pass_fix = 1; // priorité inférieure (traiter après)
else else
@@ -3167,7 +3261,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (checkrobots(_ROBOTS, afs.af.adr, "") == -1) { // robots.txt ? if (checkrobots(_ROBOTS, afs.af.adr, "") == -1) { // robots.txt ?
// enregistrer robots.txt (MACRO) // enregistrer robots.txt (MACRO)
if (!hts_record_link(opt, afs.af.adr, "/robots.txt", "", "", "", NULL)) { if (!hts_record_link(opt, afs.af.adr, "/robots.txt", "", "", "", NULL)) {
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) &&
(ptr > 0)) {
if (fp) { if (fp) {
fclose(fp); fclose(fp);
fp = NULL; fp = NULL;
@@ -3206,7 +3301,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
// enregistrer // enregistrer
if (!hts_record_link(opt, afs.af.adr, afs.af.fil, afs.save, if (!hts_record_link(opt, afs.af.adr, afs.af.fil, afs.save,
former.adr, former.fil, codebase)) { former.adr, former.fil, codebase)) {
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) &&
(ptr > 0)) {
if (fp) { if (fp) {
fclose(fp); fclose(fp);
fp = NULL; fp = NULL;
@@ -3351,7 +3447,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
} }
// ---------- // ----------
// écrire peu à peu // écrire peu à peu
if ((opt->getmode & 1) && (ptr > 0)) if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0))
HT_add_adr; HT_add_adr;
lastsaved = html; // dernier écrit+1 lastsaved = html; // dernier écrit+1
// ---------- // ----------
@@ -3411,7 +3507,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
opt->state._hts_in_html_parsing = 0; // flag opt->state._hts_in_html_parsing = 0; // flag
opt->state._hts_cancel = 0; // pas de cancel opt->state._hts_cancel = 0; // pas de cancel
if ((opt->getmode & 1) && (ptr > 0)) { if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
{ {
char *cAddr = TypedArrayElts(output_buffer); char *cAddr = TypedArrayElts(output_buffer);
int cSize = (int) TypedArraySize(output_buffer); int cSize = (int) TypedArraySize(output_buffer);
@@ -3443,7 +3539,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
// //
} // if !error } // if !error
if (opt->getmode & 1) { if (opt->getmode & HTS_GETMODE_HTML) {
if (fp) { if (fp) {
fclose(fp); fclose(fp);
fp = NULL; fp = NULL;
@@ -3711,7 +3807,8 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
//case -1: can_retry=1; break; //case -1: can_retry=1; break;
case STATUSCODE_TIMEOUT: case STATUSCODE_TIMEOUT:
if (opt->hostcontrol) { // timeout et retry épuisés if (opt->hostcontrol) { // timeout et retry épuisés
if ((opt->hostcontrol & 1) && (heap(ptr)->retry <= 0)) { if ((opt->hostcontrol & HTS_HOSTCONTROL_BAN_TIMEOUT) &&
(heap(ptr)->retry <= 0)) {
hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil()); hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil());
host_ban(opt, ptr, sback, jump_identification_const(urladr())); host_ban(opt, ptr, sback, jump_identification_const(urladr()));
hts_log_print(opt, LOG_DEBUG, hts_log_print(opt, LOG_DEBUG,
@@ -3724,7 +3821,7 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
break; break;
case STATUSCODE_SLOW: case STATUSCODE_SLOW:
if ((opt->hostcontrol) && (heap(ptr)->retry <= 0)) { // too slow if ((opt->hostcontrol) && (heap(ptr)->retry <= 0)) { // too slow
if (opt->hostcontrol & 2) { if (opt->hostcontrol & HTS_HOSTCONTROL_BAN_SLOW) {
hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil()); hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil());
host_ban(opt, ptr, sback, jump_identification_const(urladr())); host_ban(opt, ptr, sback, jump_identification_const(urladr()));
hts_log_print(opt, LOG_DEBUG, hts_log_print(opt, LOG_DEBUG,
@@ -4250,10 +4347,10 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
char com[256]; char com[256];
linput(stdin, com, 200); linput(stdin, com, 200);
if (opt->verbosedisplay == 2) if (opt->verbosedisplay == HTS_VERBOSE_FULL)
opt->verbosedisplay = 1; opt->verbosedisplay = HTS_VERBOSE_SIMPLE;
else else
opt->verbosedisplay = 2; opt->verbosedisplay = HTS_VERBOSE_FULL;
/* Info for wrappers */ /* Info for wrappers */
hts_log_print(opt, LOG_INFO, "engine: change-options"); hts_log_print(opt, LOG_INFO, "engine: change-options");
RUN_CALLBACK0(opt, chopt); RUN_CALLBACK0(opt, chopt);
@@ -4363,7 +4460,7 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
printf("%c\x0d", ("/-\\|")[roll]); printf("%c\x0d", ("/-\\|")[roll]);
fflush(stdout); fflush(stdout);
} }
} else if (opt->verbosedisplay == 1) { } else if (opt->verbosedisplay == HTS_VERBOSE_SIMPLE) {
if (b >= 0) { if (b >= 0) {
if (back[b].r.statuscode == HTTP_OK) if (back[b].r.statuscode == HTTP_OK)
printf("%d/%d: %s%s (" LLintP " bytes) - OK\33[K\r", ptr, opt->lien_tot, printf("%d/%d: %s%s (" LLintP " bytes) - OK\33[K\r", ptr, opt->lien_tot,
@@ -4454,8 +4551,8 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
char in_error_msg[32]; char in_error_msg[32];
// resolve unresolved type // resolve unresolved type
if (opt->savename_delayed != 0 && *forbidden_url == 0 && IS_DELAYED_EXT(afs->save) if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
&& !opt->state.stop) { *forbidden_url == 0 && IS_DELAYED_EXT(afs->save) && !opt->state.stop) {
int loops; int loops;
int continue_loop; int continue_loop;
@@ -4839,7 +4936,7 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
} }
} }
} // delayed type check ? } // delayed type check ?
ENGINE_SAVE_CONTEXT_BASE(); ENGINE_SAVE_CONTEXT_BASE();
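
Review note: the hunks in this file replace bare bit tests such as `opt->getmode & 1` with named constants such as `HTS_GETMODE_HTML`. The pattern can be sketched in isolation as below. This is only an illustration: the numeric values are assumed to match the literals on the old side of the diff (1, 2, 4), and the `SK_` enum names and both helper functions are hypothetical, not part of HTTrack.

```c
/* Assumed values: copied from the literals the old code tested,
   not from the real HTTrack headers. */
enum sketch_getmode {
  SK_GETMODE_HTML = 1,      /* old test: getmode & 1 */
  SK_GETMODE_NONHTML = 2,   /* old test: getmode & 2 */
  SK_GETMODE_HTML_FIRST = 4 /* old test: getmode & 4 */
};

/* Hypothetical helper mirroring the repeated condition
   (opt->getmode & HTS_GETMODE_HTML) && (ptr > 0). */
static int should_write_html(int getmode, int ptr) {
  return (getmode & SK_GETMODE_HTML) != 0 && ptr > 0;
}

/* Hypothetical helper mirroring (opt->getmode & HTS_GETMODE_NONHTML) == 0,
   inverted so that it reads as a positive question. */
static int wants_nonhtml(int getmode) {
  return (getmode & SK_GETMODE_NONHTML) != 0;
}
```

Because the flags are independent bits, a mode value such as `SK_GETMODE_HTML | SK_GETMODE_NONHTML` passes both checks. The named form makes each call site say which bit it depends on, which is the readability gain this change is after.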


@@ -1213,7 +1213,7 @@ HTSEXT_API find_handle hts_findfirst(char *path) {
return NULL; return NULL;
} }
HTSEXT_API int hts_findnext(find_handle find) { HTSEXT_API hts_boolean hts_findnext(find_handle find) {
if (find) { if (find) {
#ifdef _WIN32 #ifdef _WIN32
if ((FindNextFileA(find->handle, &find->hdata))) if ((FindNextFileA(find->handle, &find->hdata)))
@@ -1273,7 +1273,7 @@ HTSEXT_API int hts_findgetsize(find_handle find) {
return -1; return -1;
} }
HTSEXT_API int hts_findisdir(find_handle find) { HTSEXT_API hts_boolean hts_findisdir(find_handle find) {
if (find) { if (find) {
if (!hts_findissystem(find)) { if (!hts_findissystem(find)) {
#ifdef _WIN32 #ifdef _WIN32
@@ -1287,7 +1287,7 @@ HTSEXT_API int hts_findisdir(find_handle find) {
} }
return 0; return 0;
} }
HTSEXT_API int hts_findisfile(find_handle find) { HTSEXT_API hts_boolean hts_findisfile(find_handle find) {
if (find) { if (find) {
if (!hts_findissystem(find)) { if (!hts_findissystem(find)) {
#ifdef _WIN32 #ifdef _WIN32
@@ -1301,7 +1301,7 @@ HTSEXT_API int hts_findisfile(find_handle find) {
} }
return 0; return 0;
} }
HTSEXT_API int hts_findissystem(find_handle find) { HTSEXT_API hts_boolean hts_findissystem(find_handle find) {
if (find) { if (find) {
#ifdef _WIN32 #ifdef _WIN32
if (find->hdata. if (find->hdata.


@@ -108,15 +108,15 @@ HTSEXT_API int hts_buildtopindex(httrackp * opt, const char *path,
// Portable directory find functions // Portable directory find functions
// Directory find functions // Directory find functions
HTSEXT_API find_handle hts_findfirst(char *path); HTSEXT_API find_handle hts_findfirst(char *path);
HTSEXT_API int hts_findnext(find_handle find); HTSEXT_API hts_boolean hts_findnext(find_handle find);
HTSEXT_API int hts_findclose(find_handle find); HTSEXT_API int hts_findclose(find_handle find);
// //
HTSEXT_API char *hts_findgetname(find_handle find); HTSEXT_API char *hts_findgetname(find_handle find);
HTSEXT_API int hts_findgetsize(find_handle find); HTSEXT_API int hts_findgetsize(find_handle find);
HTSEXT_API int hts_findisdir(find_handle find); HTSEXT_API hts_boolean hts_findisdir(find_handle find);
HTSEXT_API int hts_findisfile(find_handle find); HTSEXT_API hts_boolean hts_findisfile(find_handle find);
HTSEXT_API int hts_findissystem(find_handle find); HTSEXT_API hts_boolean hts_findissystem(find_handle find);
#endif #endif


@@ -178,7 +178,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// -------------------- PHASE 1 -------------------- // -------------------- PHASE 1 --------------------
/* Doit-on traiter les non html? */ /* Doit-on traiter les non html? */
if ((opt->getmode & 2) == 0) { // non on ne doit pas if ((opt->getmode & HTS_GETMODE_NONHTML) == 0) { // non on ne doit pas
if (!ishtml(opt, fil)) { // non il ne faut pas if (!ishtml(opt, fil)) { // non il ne faut pas
//adr[0]='\0'; // ne pas traiter ce lien, pas traiter //adr[0]='\0'; // ne pas traiter ce lien, pas traiter
forbidden_url = 1; // interdire récupération du lien forbidden_url = 1; // interdire récupération du lien
@@ -266,11 +266,11 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
test2 = test2 =
(strchr(tempo2 + ((*tempo2 == '/') ? 1 : 0), '/') != NULL); (strchr(tempo2 + ((*tempo2 == '/') ? 1 : 0), '/') != NULL);
if ((test1) && (test2)) { // on ne peut que descendre if ((test1) && (test2)) { // on ne peut que descendre
if ((opt->seeker & 1) == 0) { // interdiction de descendre if ((opt->seeker & HTS_SEEKER_DOWN) == 0) {
forbidden_url = 1; forbidden_url = 1;
hts_log_print(opt, LOG_DEBUG, "lower link canceled: %s%s", adr, hts_log_print(opt, LOG_DEBUG, "lower link canceled: %s%s", adr,
fil); fil);
} else { // autorisé à priori - NEW } else { // autorisé à priori - NEW
if (!heap(ptr)->link_import) { // ne résulte pas d'un 'moved' if (!heap(ptr)->link_import) { // ne résulte pas d'un 'moved'
forbidden_url = 0; forbidden_url = 0;
hts_log_print(opt, LOG_DEBUG, "lower link authorized: %s%s", hts_log_print(opt, LOG_DEBUG, "lower link authorized: %s%s",
@@ -278,7 +278,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
} }
} }
} else if ((test1) || (test2)) { // on peut descendre pour accéder au lien } else if ((test1) || (test2)) { // on peut descendre pour accéder au lien
if ((opt->seeker & 1) != 0) { // on peut descendre - NEW if ((opt->seeker & HTS_SEEKER_DOWN) != 0) {
if (!heap(ptr)->link_import) { // ne résulte pas d'un 'moved' if (!heap(ptr)->link_import) { // ne résulte pas d'un 'moved'
forbidden_url = 0; forbidden_url = 0;
hts_log_print(opt, LOG_DEBUG, "lower link authorized: %s%s", hts_log_print(opt, LOG_DEBUG, "lower link authorized: %s%s",
@@ -290,11 +290,11 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// up // up
if ((!strncmp(tempo, "../", 3)) && (!strncmp(tempo2, "../", 3))) { // impossible sans monter if ((!strncmp(tempo, "../", 3)) && (!strncmp(tempo2, "../", 3))) { // impossible sans monter
if ((opt->seeker & 2) == 0) { // interdiction de monter if ((opt->seeker & HTS_SEEKER_UP) == 0) {
forbidden_url = 1; forbidden_url = 1;
hts_log_print(opt, LOG_DEBUG, "upper link canceled: %s%s", adr, hts_log_print(opt, LOG_DEBUG, "upper link canceled: %s%s", adr,
fil); fil);
} else { // autorisé à monter - NEW } else { // autorisé à monter - NEW
if (!heap(ptr)->link_import) { // ne résulte pas d'un 'moved' if (!heap(ptr)->link_import) { // ne résulte pas d'un 'moved'
forbidden_url = 0; forbidden_url = 0;
hts_log_print(opt, LOG_DEBUG, "upper link authorized: %s%s", hts_log_print(opt, LOG_DEBUG, "upper link authorized: %s%s",
@@ -302,13 +302,13 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
} }
} }
} else if ((!strncmp(tempo, "../", 3)) || (!strncmp(tempo2, "../", 3))) { // Possible en montant } else if ((!strncmp(tempo, "../", 3)) || (!strncmp(tempo2, "../", 3))) { // Possible en montant
if ((opt->seeker & 2) != 0) { // autorisé à monter - NEW if ((opt->seeker & HTS_SEEKER_UP) != 0) {
if (!heap(ptr)->link_import) { // ne résulte pas d'un 'moved' if (!heap(ptr)->link_import) { // ne résulte pas d'un 'moved'
forbidden_url = 0; forbidden_url = 0;
hts_log_print(opt, LOG_DEBUG, "upper link authorized: %s%s", hts_log_print(opt, LOG_DEBUG, "upper link authorized: %s%s",
adr, fil); adr, fil);
} }
} // sinon autorisé en descente } // sinon autorisé en descente
} }
} else { } else {
@@ -345,83 +345,81 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
//if (!opt->wizard) { // mode non wizard //if (!opt->wizard) { // mode non wizard
// doit-on traiter ce lien?.. vérifier droits de sortie // doit-on traiter ce lien?.. vérifier droits de sortie
switch ((opt->travel & 255)) { switch ((opt->travel & HTS_TRAVEL_SCOPE_MASK)) {
case 0: case HTS_TRAVEL_SAME_ADDRESS:
if (!opt->wizard) // mode non wizard if (!opt->wizard) // mode non wizard
forbidden_url = 1; forbidden_url = 1;
break; // interdicton de sortir au dela de l'adresse break; // interdicton de sortir au dela de l'adresse
case 1:{ // sortie sur le même dom.xxx case HTS_TRAVEL_SAME_DOMAIN: {
size_t i = strlen(adr) - 1; size_t i = strlen(adr) - 1;
size_t j = strlen(urladr()) - 1; size_t j = strlen(urladr()) - 1;
if ((i > 0) && (j > 0)) { if ((i > 0) && (j > 0)) {
while((i > 0) && (adr[i] != '.')) while ((i > 0) && (adr[i] != '.'))
i--;
while((j > 0) && (urladr()[j] != '.'))
j--;
if ((i > 0) && (j > 0)) {
i--;
j--;
while((i > 0) && (adr[i] != '.'))
i--;
while((j > 0) && (urladr()[j] != '.'))
j--;
}
}
if ((i > 0) && (j > 0)) {
if (!strfield2(adr + i, urladr() + j)) { // !=
if (!opt->wizard) { // mode non wizard
//printf("refused: %s\n",adr);
forbidden_url = 1; // pas même domaine
hts_log_print(opt, LOG_DEBUG,
"foreign domain link canceled: %s%s", adr, fil);
}
} else {
if (opt->wizard) { // mode wizard
forbidden_url = 0; // même domaine
hts_log_print(opt, LOG_DEBUG, "same domain link authorized: %s%s",
adr, fil);
}
}
} else
forbidden_url = 1;
}
break;
case 2:{ // sortie sur le même .xxx
size_t i = strlen(adr) - 1;
size_t j = strlen(urladr()) - 1;
while((i > 0) && (adr[i] != '.'))
i--; i--;
while((j > 0) && (urladr()[j] != '.')) while ((j > 0) && (urladr()[j] != '.'))
j--; j--;
if ((i > 0) && (j > 0)) { if ((i > 0) && (j > 0)) {
if (!strfield2(adr + i, urladr() + j)) { // !- i--;
if (!opt->wizard) { // mode non wizard j--;
//printf("refused: %s\n",adr); while ((i > 0) && (adr[i] != '.'))
forbidden_url = 1; // pas même .xx i--;
hts_log_print(opt, LOG_DEBUG, while ((j > 0) && (urladr()[j] != '.'))
"foreign location link canceled: %s%s", adr, fil); j--;
} }
} else {
if (opt->wizard) { // mode wizard
forbidden_url = 0; // même domaine
hts_log_print(opt, LOG_DEBUG,
"same location link authorized: %s%s", adr, fil);
}
}
} else
forbidden_url = 1;
} }
break; if ((i > 0) && (j > 0)) {
case 7: // everywhere!! if (!strfield2(adr + i, urladr() + j)) { // !=
if (!opt->wizard) { // mode non wizard
// printf("refused: %s\n",adr);
forbidden_url = 1; // pas même domaine
hts_log_print(opt, LOG_DEBUG, "foreign domain link canceled: %s%s",
adr, fil);
}
} else {
if (opt->wizard) { // mode wizard
forbidden_url = 0; // même domaine
hts_log_print(opt, LOG_DEBUG, "same domain link authorized: %s%s",
adr, fil);
}
}
} else
forbidden_url = 1;
} break;
case HTS_TRAVEL_SAME_TLD: {
size_t i = strlen(adr) - 1;
size_t j = strlen(urladr()) - 1;
while ((i > 0) && (adr[i] != '.'))
i--;
while ((j > 0) && (urladr()[j] != '.'))
j--;
if ((i > 0) && (j > 0)) {
if (!strfield2(adr + i, urladr() + j)) { // !-
if (!opt->wizard) { // mode non wizard
// printf("refused: %s\n",adr);
forbidden_url = 1; // pas même .xx
hts_log_print(opt, LOG_DEBUG,
"foreign location link canceled: %s%s", adr, fil);
}
} else {
if (opt->wizard) { // mode wizard
forbidden_url = 0; // même domaine
hts_log_print(opt, LOG_DEBUG, "same location link authorized: %s%s",
adr, fil);
}
}
} else
forbidden_url = 1;
} break;
case HTS_TRAVEL_EVERYWHERE:
if (opt->wizard) { // mode wizard if (opt->wizard) { // mode wizard
forbidden_url = 0; forbidden_url = 0;
break; break;
} }
} // switch } // switch
// ANCIENNE POS -- récupérer les liens à côtés d'un lien (nearlink) // ANCIENNE POS -- récupérer les liens à côtés d'un lien (nearlink)
@@ -583,7 +581,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// on doit poser la question.. peut on la poser? // on doit poser la question.. peut on la poser?
// (oui je sais quel preuve de délicatesse, merci merci) // (oui je sais quel preuve de délicatesse, merci merci)
if ((question) && (ptr > 0) && (!force_mirror)) { if ((question) && (ptr > 0) && (!force_mirror)) {
if (opt->wizard == 2) { // éliminer tous les liens non répertoriés comme autorisés (ou inconnus) if (opt->wizard == HTS_WIZARD_AUTO) {
question = 0; question = 0;
forbidden_url = 1; forbidden_url = 1;
hts_log_print(opt, LOG_DEBUG, hts_log_print(opt, LOG_DEBUG,
@@ -600,8 +598,8 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
printf("robots.txt forbidden: %s%s\n", adr, fil); printf("robots.txt forbidden: %s%s\n", adr, fil);
#endif #endif
// question résolue, par les filtres, et mode robot non strict // question résolue, par les filtres, et mode robot non strict
if ((!question) && (filters_answer) && (opt->robots == 1) if ((!question) && (filters_answer) &&
&& (forbidden_url != 1)) { (opt->robots == HTS_ROBOTS_SOMETIMES) && (forbidden_url != 1)) {
r = 0; // annuler interdiction des robots r = 0; // annuler interdiction des robots
if (!forbidden_url) { if (!forbidden_url) {
hts_log_print(opt, LOG_DEBUG, hts_log_print(opt, LOG_DEBUG,
@@ -685,7 +683,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
io_flush; io_flush;
} else { // lien primaire: autoriser répertoire entier } else { // lien primaire: autoriser répertoire entier
if (!force_mirror) { if (!force_mirror) {
if ((opt->seeker & 1) == 0) { // interdiction de descendre if ((opt->seeker & HTS_SEEKER_DOWN) == 0) {
n = 7; n = 7;
} else { } else {
n = 5; // autoriser miroir répertoires descendants (lien primaire) n = 5; // autoriser miroir répertoires descendants (lien primaire)
@@ -712,7 +710,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
switch (n) { switch (n) {
case -1: // sauter tout le reste case -1: // sauter tout le reste
forbidden_url = 1; forbidden_url = 1;
opt->wizard = 2; // sauter tout le reste opt->wizard = HTS_WIZARD_AUTO; // sauter tout le reste
break; break;
case 0: // forbid the same link: adr/fil case 0: // forbid the same link: adr/fil
forbidden_url = 1; forbidden_url = 1;
@@ -796,7 +794,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
break; break;
case 5: // allow the whole directory and its children case 5: // allow the whole directory and its children
if ((opt->seeker & 2) == 0) { // not allowed to go up if ((opt->seeker & HTS_SEEKER_UP) == 0) { // not allowed to go up
size_t i = strlen(fil) - 1; size_t i = strlen(fil) - 1;
while((fil[i] != '/') && (i > 0)) while((fil[i] != '/') && (i > 0))
@@ -872,7 +870,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// lien non autorisé, peut-on juste le tester? // lien non autorisé, peut-on juste le tester?
if (just_test_it) { if (just_test_it) {
if (forbidden_url == 1) { if (forbidden_url == 1) {
if (opt->travel & 256) { // tester tout de même if (opt->travel & HTS_TRAVEL_TEST_ALL) { // tester tout de même
if (strfield(adr, "ftp://") == 0) { // PAS ftp! if (strfield(adr, "ftp://") == 0) { // PAS ftp!
forbidden_url = 1; // oui oui toujours interdit (note: sert à rien car ==1 mais c pour comprendre) forbidden_url = 1; // oui oui toujours interdit (note: sert à rien car ==1 mais c pour comprendre)
*just_test_it = 1; // mais on teste *just_test_it = 1; // mais on teste
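
Review note: the `HTS_TRAVEL_SAME_DOMAIN` case above walks each host name back to its second-to-last dot and compares the two suffixes. A minimal standalone sketch of that scan follows. It assumes `strfield2` behaves as a case-insensitive whole-string comparison, and `two_label_suffix`, `equal_nocase` and `same_domain` are hypothetical names introduced only for this illustration. Unlike the diffed code, it scans each host on its own rather than coupling the two indices, and it guards against an empty string.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return the suffix that starts at the second-to-last '.', or NULL when the
   host has fewer than two dots after its first character. */
static const char *two_label_suffix(const char *host) {
  size_t i = strlen(host);
  if (i == 0)
    return NULL; /* avoid the size_t underflow of strlen() - 1 */
  i--;
  while (i > 0 && host[i] != '.') /* back up to the last dot */
    i--;
  if (i == 0)
    return NULL;
  i--;
  while (i > 0 && host[i] != '.') /* back up to the previous dot */
    i--;
  return (i > 0) ? host + i : NULL;
}

/* Case-insensitive equality; stands in for the assumed strfield2 behaviour. */
static int equal_nocase(const char *a, const char *b) {
  while (*a != '\0' && *b != '\0') {
    if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
      return 0;
    a++;
    b++;
  }
  return *a == *b;
}

/* 1 when both hosts share the same two-label suffix, else 0. */
static int same_domain(const char *a, const char *b) {
  const char *sa = two_label_suffix(a);
  const char *sb = two_label_suffix(b);
  return sa != NULL && sb != NULL && equal_nocase(sa, sb);
}
```

In this sketch a bare host such as `example.com` has no second dot, so `two_label_suffix` returns NULL and the comparison fails. That corresponds to the `else forbidden_url = 1;` branch in the diffed code, where either index reaching 0 rejects the link.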


@@ -206,7 +206,8 @@ HTSEXT_API htsErrorCallback hts_get_error_callback(void);
/* Logging */ /* Logging */
/** Legacy: write prefix then msg to opt->log. Returns 0 if written, 1 if /** Legacy: write prefix then msg to opt->log. Returns 0 if written, 1 if
opt->log is NULL. Prefer hts_log_print(). */ opt->log is NULL. Prefer hts_log_print(). */
HTSEXT_API int hts_log(httrackp * opt, const char *prefix, const char *msg); HTSEXT_API hts_boolean hts_log(httrackp *opt, const char *prefix,
const char *msg);
/** printf-style log at level @p type (an hts_log_type, optionally |LOG_ERRNO). /** printf-style log at level @p type (an hts_log_type, optionally |LOG_ERRNO).
Forwards to the registered log callback, and when the level is <= opt->debug Forwards to the registered log callback, and when the level is <= opt->debug
@@ -254,15 +255,6 @@ HTSEXT_API int htswrap_add(httrackp * opt, const char *name, void *fct);
or 0 if none or unknown. */ or 0 if none or unknown. */
HTSEXT_API uintptr_t htswrap_read(httrackp * opt, const char *name); HTSEXT_API uintptr_t htswrap_read(httrackp * opt, const char *name);
/** @warning No implementation is linked into the library; calling this fails to
link. For per-callback user data use the CHAIN_FUNCTION() ARGUMENT and
CALLBACKARG_USERDEF() instead. */
HTSEXT_API int htswrap_set_userdef(httrackp * opt, void *userdef);
/** @warning No implementation is linked into the library; calling this fails to
link. Read per-callback user data with CALLBACKARG_USERDEF() instead. */
HTSEXT_API void *htswrap_get_userdef(httrackp * opt);
/* Internal library allocators, if a different libc is being used by the client */ /* Internal library allocators, if a different libc is being used by the client */
/** strdup() through the library allocator. Returns a heap copy freed with /** strdup() through the library allocator. Returns a heap copy freed with
hts_free(), or NULL on failure. */ hts_free(), or NULL on failure. */
@@ -322,7 +314,8 @@ HTSEXT_API T_SOC catch_url_init(int *port, char *adr);
"ip:port". The buffers are caller-allocated and not bounds-checked: @p data "ip:port". The buffers are caller-allocated and not bounds-checked: @p data
must be CATCH_URL_DATA_SIZE bytes, and @p url / @p method must fit the must be CATCH_URL_DATA_SIZE bytes, and @p url / @p method must fit the
captured request line. */ captured request line. */
HTSEXT_API int catch_url(T_SOC soc, char *url, char *method, char *data); HTSEXT_API hts_boolean catch_url(T_SOC soc, char *url, char *method,
char *data);
/* State */ /* State */
/** Whether the engine is parsing HTML. Returns 0 if not, otherwise the percent /** Whether the engine is parsing HTML. Returns 0 if not, otherwise the percent
@@ -343,10 +336,10 @@ HTSEXT_API int hts_is_exiting(httrackp * opt);
caller-owned, NULL-terminated array of strings; the engine stores the caller-owned, NULL-terminated array of strings; the engine stores the
pointer without copying, so the array and its strings must stay valid until pointer without copying, so the array and its strings must stay valid until
the engine consumes them. @return nonzero if a list is now set. */ the engine consumes them. @return nonzero if a list is now set. */
HTSEXT_API int hts_addurl(httrackp * opt, char **url); HTSEXT_API hts_boolean hts_addurl(httrackp *opt, char **url);
/** Clear any pending add-URL list set by hts_addurl(). Always returns 0. */ /** Clear any pending add-URL list set by hts_addurl(). Always returns 0. */
HTSEXT_API int hts_resetaddurl(httrackp * opt); HTSEXT_API hts_boolean hts_resetaddurl(httrackp *opt);
/** Apply the runtime-tunable options from @p from onto @p to, to adjust a live /** Apply the runtime-tunable options from @p from onto @p to, to adjust a live
mirror. Only fields set to a non-sentinel value are copied; the rest of @p mirror. Only fields set to a non-sentinel value are copied; the rest of @p
@@ -365,7 +358,7 @@ HTSEXT_API int hts_setpause(httrackp * opt, int);
lock, so it is safe to call from another thread). @p force is currently lock, so it is safe to call from another thread). @p force is currently
ignored. ignored.
@return 0; no-op if @p opt is NULL. */ @return 0; no-op if @p opt is NULL. */
HTSEXT_API int hts_request_stop(httrackp * opt, int force); HTSEXT_API int hts_request_stop(httrackp *opt, hts_boolean force);
/** Queue a single in-progress file, by URL, to be cancelled by the engine. /** Queue a single in-progress file, by URL, to be cancelled by the engine.
@p url is copied internally. Takes the state lock, so it is thread-safe. @p url is copied internally. Takes the state lock, so it is thread-safe.
@@ -382,7 +375,7 @@ HTSEXT_API void hts_cancel_parsing(httrackp * opt);
/** Nonzero once the mirror has fully ended. Read under the engine state lock, /** Nonzero once the mirror has fully ended. Read under the engine state lock,
so safe to poll from another thread. Wait for this before hts_free_opt(). */ so safe to poll from another thread. Wait for this before hts_free_opt(). */
HTSEXT_API int hts_has_stopped(httrackp * opt); HTSEXT_API hts_boolean hts_has_stopped(httrackp *opt);
/* Tools */ /* Tools */
/** Ensure the directory chain leading to @p path exists, creating missing /** Ensure the directory chain leading to @p path exists, creating missing
@@ -399,7 +392,7 @@ HTSEXT_API int structcheck_utf8(const char *path);
/** Whether the directory containing @p path exists. The basename is stripped /** Whether the directory containing @p path exists. The basename is stripped
first, so passing a file path tests its parent directory. @return 1 if it is first, so passing a file path tests its parent directory. @return 1 if it is
a directory, 0 otherwise. */ a directory, 0 otherwise. */
HTSEXT_API int dir_exists(const char *path); HTSEXT_API hts_boolean dir_exists(const char *path);
/** Write the HTTP reason phrase for @p statuscode into @p msg, a caller buffer /** Write the HTTP reason phrase for @p statuscode into @p msg, a caller buffer
of at least 64 bytes. For an unknown code a non-empty @p msg is kept, of at least 64 bytes. For an unknown code a non-empty @p msg is kept,
@@ -582,20 +575,15 @@ HTSEXT_API char *unescape_http(char *const catbuff, const size_t size, const cha
must-avoid escapes are kept encoded, and %25 is never decoded). @p no_high & must-avoid escapes are kept encoded, and %25 is never decoded). @p no_high &
1 also decodes high (>= 128) bytes; @p no_high & 2 also decodes an escaped 1 also decodes high (>= 128) bytes; @p no_high & 2 also decodes an escaped
space. Returns @p catbuff. */ space. Returns @p catbuff. */
HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size, const char *s, const int no_high); HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size,
const char *s, const hts_boolean no_high);
/** @warning No implementation is linked into the library; calling this fails to
link. */
HTSEXT_API char *antislash_unescaped(char *catbuff, const char *s);
HTSEXT_API void escape_remove_control(char *s);
/** Determine the MIME type of local file name @p fil into @p s (capacity /** Determine the MIME type of local file name @p fil into @p s (capacity
@p ssize): user --assume rules, then ".html", then the built-in extension @p ssize): user --assume rules, then ".html", then the built-in extension
table. @p flag != 0 forces a fallback type. @return 1 if a type was written, table. @p flag != 0 forces a fallback type. @return 1 if a type was written,
0 otherwise. */ 0 otherwise. */
HTSEXT_API int get_httptype_sized(httrackp *opt, char *s, size_t ssize, HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil, int flag); const char *fil, hts_boolean flag);
/** @deprecated Use get_httptype_sized(). Assumes @p s has at least /** @deprecated Use get_httptype_sized(). Assumes @p s has at least
HTS_MIMETYPE_SIZE capacity. */ HTS_MIMETYPE_SIZE capacity. */
@@ -615,7 +603,7 @@ HTSEXT_API int is_userknowntype(httrackp * opt, const char *fil);
/** 1 if @p fil, an extension such as "asp" or "php" (not a full filename), is a /** 1 if @p fil, an extension such as "asp" or "php" (not a full filename), is a
known dynamic-page type, else 0. */ known dynamic-page type, else 0. */
HTSEXT_API int is_dyntype(const char *fil); HTSEXT_API hts_boolean is_dyntype(const char *fil);
/** Extract the extension of @p fil (text after the last '.', stopping at '?') /** Extract the extension of @p fil (text after the last '.', stopping at '?')
into caller scratch @p catbuff (capacity @p size) and return it. Returns "" into caller scratch @p catbuff (capacity @p size) and return it. Returns ""
@@ -625,12 +613,12 @@ HTSEXT_API const char *get_ext(char *catbuff, size_t size, const char *fil);
/** 1 if MIME type @p st must not be reclassified or renamed (hypertext types /** 1 if MIME type @p st must not be reclassified or renamed (hypertext types
and a built-in keep-list of commonly mislabeled types), else 0. */ and a built-in keep-list of commonly mislabeled types), else 0. */
HTSEXT_API int may_unknown(httrackp * opt, const char *st); HTSEXT_API hts_boolean may_unknown(httrackp *opt, const char *st);
/** Guess the MIME type of local file @p fil into @p s (capacity @p ssize), /** Guess the MIME type of local file @p fil into @p s (capacity @p ssize),
always producing a type. @return 1 if a type was written. */ always producing a type. @return 1 if a type was written. */
HTSEXT_API int guess_httptype_sized(httrackp *opt, char *s, size_t ssize, HTSEXT_API hts_boolean guess_httptype_sized(httrackp *opt, char *s,
const char *fil); size_t ssize, const char *fil);
/** @deprecated Use guess_httptype_sized(). Assumes @p s has at least /** @deprecated Use guess_httptype_sized(). Assumes @p s has at least
HTS_MIMETYPE_SIZE capacity. */ HTS_MIMETYPE_SIZE capacity. */
@@ -692,7 +680,7 @@ HTSEXT_API find_handle hts_findfirst(char *path);
/** Advance to the next directory entry. Returns 1 if an entry is available, 0 /** Advance to the next directory entry. Returns 1 if an entry is available, 0
at end of directory. */ at end of directory. */
HTSEXT_API int hts_findnext(find_handle find); HTSEXT_API hts_boolean hts_findnext(find_handle find);
/** Close the iteration and free @p find. Always returns 0; NULL is accepted. */ /** Close the iteration and free @p find. Always returns 0; NULL is accepted. */
HTSEXT_API int hts_findclose(find_handle find); HTSEXT_API int hts_findclose(find_handle find);
@@ -707,16 +695,16 @@ HTSEXT_API int hts_findgetsize(find_handle find);
/** 1 if the current entry is a directory, else 0 (a system/special entry, see /** 1 if the current entry is a directory, else 0 (a system/special entry, see
hts_findissystem(), reports 0). */ hts_findissystem(), reports 0). */
HTSEXT_API int hts_findisdir(find_handle find); HTSEXT_API hts_boolean hts_findisdir(find_handle find);
/** 1 if the current entry is a regular file, else 0 (a system/special entry, /** 1 if the current entry is a regular file, else 0 (a system/special entry,
see hts_findissystem(), reports 0). */ see hts_findissystem(), reports 0). */
HTSEXT_API int hts_findisfile(find_handle find); HTSEXT_API hts_boolean hts_findisfile(find_handle find);
/** 1 if the current entry is a special/system entry to skip: "." or "..", on /** 1 if the current entry is a special/system entry to skip: "." or "..", on
POSIX also device/fifo/socket nodes, on Windows also system, hidden or POSIX also device/fifo/socket nodes, on Windows also system, hidden or
temporary entries. Else 0. */ temporary entries. Else 0. */
HTSEXT_API int hts_findissystem(find_handle find); HTSEXT_API hts_boolean hts_findissystem(find_handle find);
/* UTF-8 aware FILE API */ /* UTF-8 aware FILE API */
/* On non-Windows these macros resolve directly to the POSIX calls. On Windows /* On non-Windows these macros resolve directly to the POSIX calls. On Windows


@@ -288,7 +288,7 @@ static void __cdecl htsshow_uninit(t_hts_callbackarg * carg) {
} }
static int __cdecl htsshow_start(t_hts_callbackarg * carg, httrackp * opt) { static int __cdecl htsshow_start(t_hts_callbackarg * carg, httrackp * opt) {
use_show = 0; use_show = 0;
if (opt->verbosedisplay == 2) { if (opt->verbosedisplay == HTS_VERBOSE_FULL) {
use_show = 1; use_show = 1;
vt_clear(); vt_clear();
} }
@@ -852,7 +852,7 @@ static void sig_doback(int blind) { // mettre en backing
if (global_opt != NULL) { if (global_opt != NULL) {
// suppress logging and asking lousy questions // suppress logging and asking lousy questions
global_opt->quiet = 1; global_opt->quiet = 1;
global_opt->verbosedisplay = 0; global_opt->verbosedisplay = HTS_VERBOSE_NONE;
} }
if (!blind) if (!blind)


@@ -4,131 +4,140 @@
# Initializes the htsserver GUI frontend and launch the default browser # Initializes the htsserver GUI frontend and launch the default browser
BROWSEREXE= BROWSEREXE=
SRCHBROWSEREXE="x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition" SRCHBROWSEREXE=(x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition)
# shellcheck disable=SC2153 # BROWSER is the standard freedesktop env var, not a typo
if test -n "${BROWSER}"; then if test -n "${BROWSER}"; then
# sensible-browser will f up if BROWSER is not set # sensible-browser will f up if BROWSER is not set
SRCHBROWSEREXE="xdg-open sensible-browser ${SRCHBROWSEREXE}" SRCHBROWSEREXE=(xdg-open sensible-browser "${SRCHBROWSEREXE[@]}")
fi fi
# Patch for Darwin/Mac by Ross Williams # Patch for Darwin/Mac by Ross Williams
if test "`uname -s`" == "Darwin"; then if test "$(uname -s)" == "Darwin"; then
# Darwin/Mac OS X uses a system 'open' command to find # Darwin/Mac OS X uses a system 'open' command to find
# the default browser. The -W flag causes it to wait for # the default browser. The -W flag causes it to wait for
# the browser to exit # the browser to exit
BROWSEREXE="/usr/bin/open -W" BROWSEREXE="/usr/bin/open -W"
fi fi
BINWD=`dirname "$0"` BINWD=$(dirname "$0")
SRCHPATH="$BINWD /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin ${HOME}/usr/bin ${HOME}/bin" SRCHPATH=("$BINWD" /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin "${HOME}/usr/bin" "${HOME}/bin")
SRCHPATH="$SRCHPATH "`echo $PATH | tr ":" " "` IFS=':' read -ra pathdirs <<<"$PATH"
SRCHDISTPATH="$BINWD/../share $BINWD/.. /usr/share /usr/local /usr /local /usr/local/share ${HOME}/usr ${HOME}/usr/share /opt/local/share /sw ${HOME}/usr/local ${HOME}/usr/share" for d in "${pathdirs[@]}"; do
# drop empty PATH fields, matching the old echo|tr word-split
test -n "$d" && SRCHPATH+=("$d")
done
SRCHDISTPATH=("$BINWD/../share" "$BINWD/.." /usr/share /usr/local /usr /local /usr/local/share "${HOME}/usr" "${HOME}/usr/share" /opt/local/share /sw "${HOME}/usr/local" "${HOME}/usr/share")
### ###
# And now some famous cuisine # And now some famous cuisine
function log { function log {
echo "$0($$): $@" >&2 echo "$0($$): $*" >&2
return 0 return 0
} }
function launch_browser { function launch_browser {
log "Launching $1" log "Launching $1"
browser=$1 browser=$1
url=$2 url=$2
log "Spawning browser.." log "Spawning browser.."
${browser} "${url}" ${browser} "${url}"
# note: browser can hiddenly use the -remote feature of # note: browser can hiddenly use the -remote feature of
# mozilla and therefore return immediately # mozilla and therefore return immediately
log "Browser (or helper) exited" log "Browser (or helper) exited"
} }
# First ensure that we can launch the server # First ensure that we can launch the server
BINPATH= BINPATH=
for i in ${SRCHPATH}; do for i in "${SRCHPATH[@]}"; do
! test -n "${BINPATH}" && test -x ${i}/htsserver && BINPATH=${i} ! test -n "${BINPATH}" && test -x "${i}/htsserver" && BINPATH="${i}"
done done
for i in ${SRCHDISTPATH}; do for i in "${SRCHDISTPATH[@]}"; do
! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack" ! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack"
done done
test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1 test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1
test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1 test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1
test -f ${DISTPATH}/lang.def || ! log "Could not find ${DISTPATH}/lang.def" || exit 1 test -f "${DISTPATH}/lang.def" || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
test -f ${DISTPATH}/lang.indexes || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1 test -f "${DISTPATH}/lang.indexes" || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
test -d ${DISTPATH}/lang || ! log "Could not find ${DISTPATH}/lang" || exit 1 test -d "${DISTPATH}/lang" || ! log "Could not find ${DISTPATH}/lang" || exit 1
test -d ${DISTPATH}/html || ! log "Could not find ${DISTPATH}/html" || exit 1 test -d "${DISTPATH}/html" || ! log "Could not find ${DISTPATH}/html" || exit 1
# Locale # Locale
HTSLANG="${LC_MESSAGES}" HTSLANG="${LC_MESSAGES}"
! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}" ! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}"
! test -n "${HTSLANG}" && HTSLANG="${LANG}" ! test -n "${HTSLANG}" && HTSLANG="${LANG}"
HTSLANG="`echo $LANG | cut -f1 -d'.' | cut -f1 -d'_'`" HTSLANG="$(echo "$LANG" | cut -f1 -d'.' | cut -f1 -d'_')"
LANGN=`grep -E "^${HTSLANG}:" ${DISTPATH}/lang.indexes | cut -f2 -d':'` LANGN=$(grep -E "^${HTSLANG}:" "${DISTPATH}/lang.indexes" | cut -f2 -d':')
! test -n "${LANGN}" && LANGN=1 ! test -n "${LANGN}" && LANGN=1
# Find the browser # Find the browser
# note: not all systems have sensible-browser or www-browser alternative # note: not all systems have sensible-browser or www-browser alternative
# therefore, we have to find a bit more if sensible-browser could not be found # therefore, we have to find a bit more if sensible-browser could not be found
for i in ${SRCHBROWSEREXE}; do for i in "${SRCHBROWSEREXE[@]}"; do
for j in ${SRCHPATH}; do for j in "${SRCHPATH[@]}"; do
if test -x ${j}/${i}; then if test -x "${j}/${i}"; then
BROWSEREXE=${j}/${i} BROWSEREXE="${j}/${i}"
fi fi
test -n "$BROWSEREXE" && break test -n "$BROWSEREXE" && break
done done
test -n "$BROWSEREXE" && break test -n "$BROWSEREXE" && break
done done
test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1 test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1
# "browse" command # "browse" command
if test "$1" = "browse"; then if test "$1" = "browse"; then
if test -f "${HOME}/.httrack.ini"; then if test -f "${HOME}/.httrack.ini"; then
INDEXF=`cat ${HOME}/.httrack.ini | tr '\r' '\n' | grep -E "^path=" | cut -f2- -d'='` INDEXF=$(tr '\r' '\n' <"${HOME}/.httrack.ini" | grep -E "^path=" | cut -f2- -d'=')
if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then
INDEXF="${INDEXF}/index.html" INDEXF="${INDEXF}/index.html"
else else
INDEXF="" INDEXF=""
fi fi
fi fi
if ! test -n "$INDEXF"; then if ! test -n "$INDEXF"; then
INDEXF="${HOME}/websites/index.html" INDEXF="${HOME}/websites/index.html"
fi fi
launch_browser "${BROWSEREXE}" "file://${INDEXF}" launch_browser "${BROWSEREXE}" "file://${INDEXF}"
exit $? exit $?
fi fi
# Create a temporary filename # Create a temporary filename
TMPSRVFILE="$(mktemp ${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX)" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1 TMPSRVFILE="$(mktemp "${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX")" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
# Launch htsserver binary and setup the server # Launch htsserver binary and setup the server
(${BINPATH}/htsserver "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" $@; echo SRVURL=error) > ${TMPSRVFILE}& (
"${BINPATH}/htsserver" "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" "$@"
echo SRVURL=error
) >"${TMPSRVFILE}" &
# Find the generated SRVURL # Find the generated SRVURL
SRVURL= SRVURL=
MAXCOUNT=60 MAXCOUNT=60
while ! test -n "$SRVURL"; do while ! test -n "$SRVURL"; do
MAXCOUNT=$[$MAXCOUNT - 1] MAXCOUNT=$((MAXCOUNT - 1))
test $MAXCOUNT -gt 0 || exit 1 test $MAXCOUNT -gt 0 || exit 1
test $MAXCOUNT -lt 50 && echo "waiting for server to reply.." test $MAXCOUNT -lt 50 && echo "waiting for server to reply.."
SRVURL=`grep -E URL= ${TMPSRVFILE} | cut -f2- -d=` SRVURL=$(grep -E URL= "${TMPSRVFILE}" | cut -f2- -d=)
test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1 test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1
test -n "$SRVURL" || sleep 1 test -n "$SRVURL" || sleep 1
done done
# Cleanup function # Cleanup function
# shellcheck disable=SC2120 # $1 is an optional "signal caught" marker; bare calls are intentional
function cleanup { function cleanup {
test -n "$1" && log "Nasty signal caught, cleaning up.." test -n "$1" && log "Nasty signal caught, cleaning up.."
# Do not kill if browser exited (chrome bug issue) ; server will die itself # Do not kill if browser exited (chrome bug issue) ; server will die itself
test -n "$1" && test -f ${TMPSRVFILE} && SRVPID=`grep -E PID= ${TMPSRVFILE} | cut -f2- -d=` test -n "$1" && test -f "${TMPSRVFILE}" && SRVPID=$(grep -E PID= "${TMPSRVFILE}" | cut -f2- -d=)
test -n "${SRVPID}" && kill -9 ${SRVPID} test -n "${SRVPID}" && kill -9 "${SRVPID}"
test -f ${TMPSRVFILE} && rm ${TMPSRVFILE} test -f "${TMPSRVFILE}" && rm "${TMPSRVFILE}"
test -n "$1" && log "..Done" test -n "$1" && log "..Done"
return 0 return 0
} }
# Cleanup in case of emergency # Cleanup in case of emergency
trap "cleanup now; exit" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25 trap "cleanup now; exit" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# Got SRVURL, launch browser # Got SRVURL, launch browser
launch_browser "${BROWSEREXE}" "${SRVURL}" launch_browser "${BROWSEREXE}" "${SRVURL}"
# That's all, folks! # That's all, folks!
trap "" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25 trap "" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
cleanup cleanup
exit 0 exit 0
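
The empty-field handling in the PATH split can be tried in isolation. The following is a minimal sketch, not code from webhttrack: `split_path` and the `dirs` array are hypothetical names chosen for illustration. It shows that `IFS=':' read -ra` keeps the empty fields produced by leading or doubled colons, and that an explicit `test -n` filter is what drops them.

```bash
#!/bin/bash
# Hypothetical helper (not part of webhttrack): split a colon-separated list
# into the global array "dirs", skipping empty fields.
split_path() {
  local raw=$1 d
  local -a fields
  dirs=()
  # read -ra keeps "" elements for leading or doubled colons
  IFS=':' read -ra fields <<<"$raw"
  for d in "${fields[@]}"; do
    # drop the empty fields so no caller ever probes "/some-binary"
    test -n "$d" && dirs+=("$d")
  done
  return 0
}

# A leading colon, a doubled colon and a directory containing a space
split_path ":/usr/bin::/opt/my tools/bin"
printf '%s\n' "${dirs[@]}"
```

Because the result is an array, the directory with a space stays a single element, which the old space-separated string could not represent.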


@@ -6,11 +6,11 @@ set -euo pipefail
# charset -> UTF-8 conversion (hts_convertStringToUTF8). # charset -> UTF-8 conversion (hts_convertStringToUTF8).
# -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8. # -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8.
conv() { conv() {
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1 test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
} }
# crash probe: malformed input must exit cleanly, not abort. # crash probe: malformed input must exit cleanly, not abort.
runs() { runs() {
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1 httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
} }
# the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9. # the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9.

15
tests/01_engine-cookies.test Executable file

@@ -0,0 +1,15 @@
#!/bin/bash
#
# Issue #151 guard: the request Cookie header must be bare RFC 6265 name=value
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#Q' selftest.
set -eu
# A trailing token is required; a bare '-#Q' falls through to the usage screen.
out=$(httrack -#Q run)
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "cookie-header: OK" || {
echo "expected 'cookie-header: OK', got: $out" >&2
exit 1
}

17
tests/01_engine-copyopt.test Executable file

@@ -0,0 +1,17 @@
#!/bin/bash
#
# Regression guard for the unsigned-enum sentinel trap: copy_htsopt's
# `if (from->X > -1)` guard is always false for unsigned hts_boolean fields, so
# they silently stop being copied. Driven by the in-process 'httrack -#9' test.
# Keep POSIX-portable (harness runs it via $(BASH), a plain /bin/sh on macOS).
set -eu
# A trailing token is required; a bare '-#9' falls through to the usage screen.
out=$(httrack -#9 run)
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "copy-htsopt: OK" || {
echo "expected 'copy-htsopt: OK', got: $out" >&2
exit 1
}


@@ -89,4 +89,37 @@ grep -q NEWCONTENT "$(find "$out" -path '*/a.html' -print -quit)" || {
exit 1 exit 1
} }
# --- 3. an empty quoted arg survives the doit.log round-trip (#106) ----------
# -%F "" (empty footer) records an empty "" token in doit.log; -r2 follows it so
# a "drop the empty token" bug shifts -r2 into -%F's slot (the reprise then sees
# -%F -r2 and panics "%F needs to be followed by ..."), making the bug visible
# rather than a harmless run off the end of argv.
out2="$tmp/out2"
rc=0
"$bin" "$url" -O "$out2" --quiet -n -%v0 -%F "" -r2 >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: initial mirror with empty footer exited $rc"
exit 1
}
# precondition: the writer put the empty token on disk for the reader to reload.
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer not recorded as -%F \"\" -r2 in doit.log"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
# no-url reprise: the reader rebuilds argv from doit.log and rewrites doit.log
# from it. The empty token surviving in the regenerated file proves the reader
# kept it (a drop/swallow would panic above or rewrite -%F without the "").
rc=0
"$bin" -O "$out2" --quiet >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: empty-footer reprise exited $rc (empty token dropped from doit.log?)"
exit 1
}
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer did not survive the doit.log reload round-trip"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
exit 0 exit 0


@@ -6,11 +6,11 @@ set -euo pipefail
# HTML entity unescaping (hts_unescapeEntitiesWithCharset). # HTML entity unescaping (hts_unescapeEntitiesWithCharset).
# -#6 <string> prints the string with entities decoded (UTF-8 output). # -#6 <string> prints the string with entities decoded (UTF-8 output).
ent() { ent() {
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1 test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
} }
# crash probe: malformed input must exit cleanly, not abort. # crash probe: malformed input must exit cleanly, not abort.
runs() { runs() {
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1 httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
} }
# named entities # named entities


@@ -7,10 +7,10 @@ set -euo pipefail
# -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...". # -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
match() { match() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1 test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
} }
nomatch() { nomatch() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1 test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
} }
# bare star matches everything # bare star matches everything
@@ -67,7 +67,7 @@ nomatch '*[\[]' 'a'
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed # filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy) # by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change. # behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']' match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']' match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x' nomatch '*[\[\]]' '[]x'


@@ -7,10 +7,10 @@ set -euo pipefail
# -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'". # -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
mime() { mime() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1 test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
} }
unknown() { unknown() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1 test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
} }
mime '/a/b.html' 'text/html' mime '/a/b.html' 'text/html'


@@ -154,4 +154,173 @@ grep -Eq "style=\"background-image:url\('ibgs\.gif'\)\"" "$saved2" ||
grep -q 'title="file://' "$saved2" || grep -q 'title="file://' "$saved2" ||
! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1 ! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
# xmlns / xmlns:prefix decls must not be crawled (#191). Local file:// targets so a
# regression downloads them; each is the LAST attr (heuristic only scans a value before '>').
site3="$tmp/xmlns"
mkdir -p "$site3"
for f in ns og rdfs real; do gif "$site3/$f.gif"; done
cat >"$site3/index.html" <<EOF
<html xmlns="file://$site3/ns.gif"><body>
<svg xmlns:og="file://$site3/og.gif"></svg>
<div class="c" xmlns:rdfs="file://$site3/rdfs.gif"></div>
<a href="file://$site3/real.gif">real link</a>
</body></html>
EOF
out3="$tmp/xmlns-out"
crawl "$site3/index.html" "$out3"
# the real link is still captured
found "real.gif" "$out3"
# namespace-declaration targets must not be fetched (default + prefixed forms)
notfound "ns.gif" "$out3"
notfound "og.gif" "$out3"
notfound "rdfs.gif" "$out3"
# CSS @import (#94): every form's target is captured, crawling the .css directly.
# The "cond"/"sup"/"spc" cases carry a trailing media/supports/layer condition (or
# a space before ';'); they are the negative controls: without the parser fix the
# URL is dropped, so a regression fails these found() checks.
site4="$tmp/cssimport"
mkdir -p "$site4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do printf 'body{}\n' >"$site4/$f.css"; done
cat >"$site4/main.css" <<'EOF'
@import url(nq.css);
@import url("dqu.css");
@import url('squ.css');
@import "dqs.css";
@import 'sqs.css';
@import url(med.css) screen and (min-width: 400px);
@import "cond.css" screen;
@import "sup.css" supports(display: flex);
@import url(lay.css) layer(base);
@import "spc.css" ;
EOF
out4="$tmp/cssimport-out"
crawl "$site4/main.css" "$out4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do found "$f.css" "$out4"; done
# Over-capture guard: the trailing condition is not part of the URL, so it must
# survive the rewrite verbatim. A regression that grabs it would mangle these.
m4=$(find "$out4" -type f -path '*/file/*' -name main.css -print -quit)
test -n "$m4" || ! echo "FAIL: saved main.css not found" || exit 1
for cond in '@import "cond.css" screen;' 'supports(display: flex)' 'layer(base)'; do
grep -Fq "$cond" "$m4" ||
! echo "FAIL #94: '$cond' altered on rewrite (condition captured as URL?)" || exit 1
done
# Malformed input: an unterminated @import quote (truncated CSS) must not crash or
# capture a bogus link; a valid sibling import is still captured. Guards a heap
# overflow on the URL-end scan that aborts under ASan (CI sanitizer job).
site5="$tmp/cssimport-trunc"
mkdir -p "$site5"
printf 'body{}\n' >"$site5/good.css"
printf '@import "good.css";\n@import "trunc' >"$site5/main.css"
out5="$tmp/cssimport-trunc-out"
crawl "$site5/main.css" "$out5"
found "good.css" "$out5"
notfound "trunc" "$out5"
# Offset-0 underflow (#396): a token at the buffer start makes the detector's
# word-boundary guard read *(html-1) one byte early (aborts under ASan). The
# url() target is still captured; here it just must not underflow.
site6="$tmp/parse-off0"
mkdir -p "$site6"
printf 'body{}\n' >"$site6/off0.css"
printf 'url(off0.css)\n' >"$site6/main.css"
out6="$tmp/parse-off0-out"
crawl "$site6/main.css" "$out6"
found "off0.css" "$out6"
# XMLHttpRequest.open(method, url) (#218): the first argument is an HTTP method,
# not a URL. Without the fix "GET" is captured as a link and fetched (the offline
# fixture saves a bare file named GET; a live server mangles it to GET.html).
# window.open(url) detection must be unaffected.
site7="$tmp/xhropen"
mkdir -p "$site7"
gif "$site7/winopen.gif"
cat >"$site7/index.html" <<EOF
<html><body><script>
var x = new XMLHttpRequest();
x.open("GET", "ajax_info.txt");
var y = new XMLHttpRequest();
y.open("Post", "submit.cgi");
window.open("file://$site7/winopen.gif");
</script></body></html>
EOF
out7="$tmp/xhropen-out"
crawl "$site7/index.html" "$out7"
# negative control: without the fix a file named exactly GET is downloaded
notfound "GET" "$out7"
# methods are matched case-insensitively (XHR spec normalizes them): a mixed-case
# method is rejected too, so a file named Post must not appear either
notfound "Post" "$out7"
# regression guard: window.open(url) is still detected, so its absolute URL is
# rewritten to a local link. The rewrite only happens if the parser saw it, so
# these two assertions fail if .open detection broke (not a trivial --near save).
saved7=$(savedhtml "$out7")
test -n "$saved7" || ! echo "FAIL: saved xhr page not found" || exit 1
grep -Fq 'window.open("winopen.gif")' "$saved7" ||
! echo "FAIL #218: window.open(url) no longer detected/rewritten" || exit 1
! grep -Fq 'window.open("file://' "$saved7" ||
! echo "FAIL #218: window.open URL left absolute (not rewritten)" || exit 1
# Parens in an unquoted url(...) (#163): the source %28/%29 decode to literal
# '(' ')' in the saved name, but a literal ')' in the rewritten url() closes the
# token early, so they must stay encoded. Negative control: without the fix the
# %281%29 greps fail (parens are RFC2396 "mark" chars the escaper leaves alone).
site8="$tmp/cssparens"
mkdir -p "$site8"
for f in 'img (1).gif' 'a(b)c(1).gif' 'q (4).gif'; do gif "$site8/$f"; done
cat >"$site8/style.css" <<'EOF'
.a { background: url(img%20%281%29.gif); }
.b { background: url(a%28b%29c%281%29.gif); }
.c { background: url("q%20%284%29.gif"); }
EOF
out8="$tmp/cssparens-out"
crawl "$site8/style.css" "$out8"
found "img (1).gif" "$out8"
found "a(b)c(1).gif" "$out8"
found "q (4).gif" "$out8"
css8=$(find "$out8" -type f -path '*/file/*' -name style.css -print -quit)
test -n "$css8" || ! echo "FAIL: saved style.css not found" || exit 1
grep -Fq 'url(img%20%281%29.gif)' "$css8" ||
! echo "FAIL #163: parens in unquoted url() not percent-encoded on rewrite" || exit 1
grep -Fq 'url(a%28b%29c%281%29.gif)' "$css8" ||
! echo "FAIL #163: not every paren in a url() was percent-encoded" || exit 1
grep -Fq 'url("q%20%284%29.gif")' "$css8" ||
! echo "FAIL #163: quoted url() altered or parens left literal on rewrite" || exit 1
# The url() detector is not CSS-specific: <script> and inline style= get the
# same encoding, but ordinary href/src (ending_p is the quote, not ')') keep
# literal parens -- the attribute checks guard the gate against over-firing.
site9="$tmp/urlparens"
mkdir -p "$site9"
for f in 'js (1).gif' 'inl (2).gif' 'asrc (3).gif' 'ahref (4).gif'; do gif "$site9/$f"; done
cat >"$site9/index.html" <<EOF
<html><body>
<script>var bg = "url(js%20%281%29.gif)";</script>
<div style="background-image:url(inl%20%282%29.gif)"></div>
<img src="asrc%20%283%29.gif">
<a href="ahref%20%284%29.gif">link</a>
</body></html>
EOF
out9="$tmp/urlparens-out"
crawl "$site9/index.html" "$out9"
saved9=$(savedhtml "$out9")
test -n "$saved9" || ! echo "FAIL: saved urlparens page not found" || exit 1
# rewrite-only: the JS-string asset is not queued for download
grep -Fq 'url(js%20%281%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in <script> url() not percent-encoded" || exit 1
found "inl (2).gif" "$out9"
grep -Fq 'url(inl%20%282%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in inline style url() not percent-encoded" || exit 1
found "asrc (3).gif" "$out9"
found "ahref (4).gif" "$out9"
grep -Fq 'src="asrc%20(3).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain src attribute were wrongly encoded" || exit 1
grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain href attribute were wrongly encoded" || exit 1
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
exit 0 exit 0
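
The assertions in this test rely on the chain `condition || ! echo "FAIL: ..." || exit 1`. A minimal sketch of how that chain evaluates follows; `check` is a hypothetical wrapper written for illustration and does not exist in the test suite. When the condition fails, `echo` prints the message and succeeds, the `!` inverts that success into a failure, and the final `||` branch therefore runs.

```bash
#!/bin/bash
# Hypothetical wrapper (not in the test suite): run a command and, if it
# fails, print a message and report failure using the "|| ! echo ||" chain.
check() {
  local msg=$1
  shift
  # 1. "$@" succeeds  -> the chain stops here with status 0
  # 2. "$@" fails     -> echo prints, "!" turns its success into failure
  # 3. that failure   -> the last branch runs and returns 1
  "$@" || ! echo "FAIL: $msg" || return 1
}

check "true should pass" true
check "false should be reported" false || echo "check returned non-zero"
```

One caveat of the idiom: if `echo` itself fails (for example on a closed stdout), the negation yields success and the final branch is skipped.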

68
tests/01_engine-relative.test Executable file

@@ -0,0 +1,68 @@
#!/bin/bash
#
# lienrelatif (build relative path) + ident_url_relatif (resolve a link, collapse
# ./ and ../). Regression net for #137/#162; expected values hand-computed.
set -euo pipefail
# relative path from <curr>'s directory to <link>
rel() {
local got
got=$(httrack -O /dev/null -#l "$1" "$2")
test "$got" == "relative=$3" ||
{
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
exit 1
}
}
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
ident() {
local got
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
test "$got" == "$4" ||
{
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"
exit 1
}
}
### lienrelatif
rel 'dir/page.html' 'dir/index.html' 'page.html'
rel 'dir/page.html' 'dir/page.html' 'page.html' # self-link
rel 'a.html' 'dir/index.html' '../a.html'
rel 'x.html' 'a/b/c/index.html' '../../../x.html'
rel 'h/a/x.jpg' 'h/a/sub/page.html' '../x.jpg'
rel 'a/b/c/x.html' 'index.html' 'a/b/c/x.html'
rel 'h/sub/x.jpg' 'h/page.html' 'sub/x.jpg'
rel 'h/dir2/x.jpg' 'h/dir1/page.html' '../dir2/x.jpg' # sibling dir
rel 'h/bc/x.jpg' 'h/b/page.html' '../bc/x.jpg' # b/bc prefix trap
rel 'h/b/x.jpg' 'h/bc/page.html' '../b/x.jpg'
rel 'h2/img/x.jpg' 'h1/p/page.html' '../../h2/img/x.jpg' # cross-host
rel 'img.cdn/photo.jpg' 'www.site/articles/2020/post.html' '../../../img.cdn/photo.jpg'
rel 'h/a/' 'h/a/sub/page.html' '../' # link is ancestor dir
rel 'x.html' 'page.html' 'x.html'
rel 'dir/page.html?x=1' 'dir/index.html?y=2' 'page.html' # ? stripped
### ident_url_relatif
ident 'img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/sub/img.gif'
ident '/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/img.gif'
# embedded ../ collapses (#137)
ident '../img.gif' 'www.foo.com' '/dir/sub/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/articles/2020/logo.png'
ident '../../pix/sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/pix/logo.png'
ident '../../../../x.gif' 'www.foo.com' '/a/b/page.html' 'adr=www.foo.com fil=/x.gif' # above-root clamp
ident '?page=2' 'www.foo.com' '/dir/index.html?old=1' 'adr=www.foo.com fil=/dir/index.html?page=2'
ident 'http://other.com/a/b/../c/index.html' 'www.foo.com' '/p.html' 'adr=other.com fil=/a/c/index.html'
# file:// collapses ../ like the other schemes; traversal contained, // authority kept
ident 'file:///var/data/pix/sub/../logo.png' 'www.foo.com' '/p.html' 'adr=file:// fil=/var/data/pix/logo.png'
ident 'file:///a/b/c/../../d/e.gif' 'www.foo.com' '/p.html' 'adr=file:// fil=/a/d/e.gif'
ident 'file:///a/../../b' 'www.foo.com' '/p.html' 'adr=file:// fil=/b'
ident 'file://srv/share/../x' 'www.foo.com' '/p.html' 'adr=file:// fil=//srv/x'
ident 'mailto:foo@bar.com' 'www.foo.com' '/p.html' 'error=-1' # unsupported scheme
ident 'javascript:void(0)' 'www.foo.com' '/p.html' 'error=-1'
echo "OK"


@@ -5,7 +5,7 @@ set -euo pipefail
# path simplify engine (fil_simplifie): collapses ./ and ../ segments. # path simplify engine (fil_simplifie): collapses ./ and ../ segments.
simp() { simp() {
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1 test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
} }
simp './foo/bar/' 'foo/bar/' simp './foo/bar/' 'foo/bar/'
@@ -26,3 +26,17 @@ simp './a/../../b' 'b'
# empty segments ('//') are not dot-segments and are preserved, per RFC 3986 # empty segments ('//') are not dot-segments and are preserved, per RFC 3986
simp 'a//b' 'a//b' simp 'a//b' 'a//b'
simp 'a//b/../c' 'a//c'
# absolute paths keep the leading '/'; above-root '..' is clamped to it
simp '/a/../b' '/b'
simp '/a/../../b' '/b'
simp '/../x' '/x'
# collapses to nothing -> './' (relative) or '/' (absolute)
simp '..' './'
simp 'a/..' './'
simp '/' '/'
simp 'a/b/..' 'a/' # trailing bare '..'
simp 'a/../b?x=../y' 'b?x=../y' # '?' freezes simplification
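
The rules these assertions pin down (drop `.`, pop on `..`, clamp above the root, keep empty segments, leave everything after `?` untouched) can be illustrated with a small stand-alone function. This is a sketch under stated assumptions: `simplify` is a hypothetical bash function written to match the expected values listed in this test, and it is not HTTrack's `fil_simplifie` nor a description of how that C function is implemented.

```bash
#!/bin/bash
# Sketch only: hypothetical simplify() reproducing the cases asserted in this
# test. It is NOT HTTrack's fil_simplifie.
simplify() {
  local path=$1 query="" lead="" last="" seg out
  local -a parts=() stack=()

  # '?' freezes simplification: set the query aside untouched
  if [[ $path == *\?* ]]; then
    query="?${path#*\?}"
    path=${path%%\?*}
  fi
  # absolute paths keep their leading '/'
  if [[ $path == /* ]]; then
    lead="/"
    path=${path#/}
  fi

  IFS='/' read -ra parts <<<"$path"
  for seg in "${parts[@]}"; do
    last=$seg
    case "$seg" in
      .) ;; # current directory: drop
      ..)   # parent: pop one segment, or clamp when nothing is left
        if ((${#stack[@]} > 0)); then
          unset "stack[$((${#stack[@]} - 1))]"
        fi
        ;;
      *) stack+=("$seg") ;; # regular and empty ('//') segments are kept
    esac
  done

  if ((${#stack[@]} == 0)); then
    # collapsed to nothing: '/' when absolute, './' when relative
    out=${lead:-./}
  else
    out=$lead$(IFS=/; printf '%s' "${stack[*]}")
    # a trailing '/', '.' or '..' means the result names a directory
    if [[ $path == */ || $last == "." || $last == ".." ]]; then
      out+="/"
    fi
  fi
  printf '%s\n' "$out$query"
}

simplify '/a/../../b' # prints /b
simplify 'a/b/..'     # prints a/
```

Inputs outside the listed cases (for instance `..` directly after an empty segment) are not covered by the assertions, so the sketch's behavior there is an arbitrary choice.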


@@ -21,9 +21,15 @@ test "$out" == "strsafe: OK" || exit 1
 # the bounded macro aborts (non-zero exit), so don't let set -e trip on it
 err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
 case "$err" in
-*"strsafe: NOT aborted"*) echo "over-capacity write was NOT caught" >&2; exit 1 ;;
-*"overflow while copying"*) ;;
-*) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;;
+*"strsafe: NOT aborted"*)
+echo "over-capacity write was NOT caught" >&2
+exit 1
+;;
+*"overflow while copying"*) ;;
+*)
+echo "expected htssafe overflow abort, got: $err" >&2
+exit 1
+;;
 esac
 # Same guarantee for the htsbuff builder. The source is exactly the buffer
# Same guarantee for the htsbuff builder. The source is exactly the buffer # Same guarantee for the htsbuff builder. The source is exactly the buffer
@@ -32,7 +38,13 @@ esac
 # aborted"). Match the specific htsbuff abort message, not just any assert.
 err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
 case "$err" in
-*"strsafe: NOT aborted"*) echo "htsbuff over-capacity write was NOT caught" >&2; exit 1 ;;
-*"htsbuff append overflow"*) ;;
-*) echo "expected htsbuff overflow abort, got: $err" >&2; exit 1 ;;
+*"strsafe: NOT aborted"*)
+echo "htsbuff over-capacity write was NOT caught" >&2
+exit 1
+;;
+*"htsbuff append overflow"*) ;;
+*)
+echo "expected htsbuff overflow abort, got: $err" >&2
+exit 1
+;;
 esac
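
Both checks above rely on the same pattern: run a command that is expected to fail, keep `set -e` from ending the script, and classify the captured output with `case`. A minimal sketch of that pattern follows. `fake_abort`, its message and its exit code 134 are stand-ins invented for the example; they are not httrack output.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-in for a command that reports on stderr and exits non-zero.
fake_abort() {
  echo "fatal: overflow while copying" >&2
  return 134
}

# The `||` branch keeps set -e from aborting and records the exit status.
rc=0
err=$(fake_abort 2>&1) || rc=$?

# Classify on the message, most specific pattern first.
case "$err" in
*"overflow while copying"*) verdict="caught" ;;
*) verdict="unexpected: $err" ;;
esac

echo "rc=$rc verdict=$verdict"
```

The test in the diff uses `|| true` because it only cares about the message; recording `$?` as shown here is an optional variation for cases where the status matters too.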


@@ -3,6 +3,6 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 5 httrack http://ut.httrack.com/simple/basic.html bash crawl-test.sh --errors 0 --files 5 httrack http://ut.httrack.com/simple/basic.html
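
The guard line used by every online test reads oddly at first: `check || ! echo msg || exit 77`. If the check succeeds, the chain short-circuits. If it fails, `echo` prints the message and succeeds, `!` turns that success into a failure, and `exit 77` runs; 77 is the status the Automake test harness reports as a skip. The sketch below replays the chain with a hypothetical `probe` function in place of `check-network.sh`, inside a subshell so the `exit` does not end the demo.

```shell
#!/bin/bash
# probe is a hypothetical stand-in for check-network.sh: it returns its argument.
probe() { return "$1"; }

# Run the chain in a subshell so `exit 77` only ends the subshell.
run_chain() {
  (
    probe "$1" || ! echo "skipping online unit tests" || exit 77
    echo "running tests"
  )
}

status=0
run_chain 0 || status=$?
echo "status=$status" # running tests, then status=0

status=0
run_chain 1 || status=$?
echo "status=$status" # skipping online unit tests, then status=77
```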


@@ -3,10 +3,10 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 3 \ bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/cookies/third.html \ --found ut.httrack.com/cookies/third.html \
--found ut.httrack.com/cookies/second.html \ --found ut.httrack.com/cookies/second.html \
--found ut.httrack.com/cookies/entrance.html \ --found ut.httrack.com/cookies/entrance.html \
httrack http://ut.httrack.com/cookies/entrance.php httrack http://ut.httrack.com/cookies/entrance.php


@@ -3,21 +3,21 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests # unicode tests
bash crawl-test.sh \ bash crawl-test.sh \
--errors 1 --files 5 \ --errors 1 --files 5 \
--found 'café.ut.httrack.com/unicode-links/café3860.html' \ --found 'café.ut.httrack.com/unicode-links/café3860.html' \
--found 'café.ut.httrack.com/unicode-links/café30f4.html' \ --found 'café.ut.httrack.com/unicode-links/café30f4.html' \
--found 'café.ut.httrack.com/unicode-links/café5e1f.html' \ --found 'café.ut.httrack.com/unicode-links/café5e1f.html' \
--found 'café.ut.httrack.com/unicode-links/café7b30.html' \ --found 'café.ut.httrack.com/unicode-links/café7b30.html' \
httrack 'http://ut.httrack.com/unicode-links/idna.html' \ httrack 'http://ut.httrack.com/unicode-links/idna.html' \
'+*.ut.httrack.com/*' --robots=0 '+*.ut.httrack.com/*' --robots=0
# unicode tests (bogus links) # unicode tests (bogus links)
bash crawl-test.sh \ bash crawl-test.sh \
--errors 0 --files 1 \ --errors 0 --files 1 \
--found 'ut.httrack.com/unicode-links/idna_bogus.html' \ --found 'ut.httrack.com/unicode-links/idna_bogus.html' \
httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \ httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \
'-*' --robots=0 '-*' --robots=0


@@ -3,67 +3,67 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests # unicode tests
bash crawl-test.sh \ bash crawl-test.sh \
--errors 1 --files 10 \ --errors 1 --files 10 \
--found ut.httrack.com/unicode-links/caf%a91bce.html \ --found ut.httrack.com/unicode-links/caf%a91bce.html \
--found ut.httrack.com/unicode-links/café30f4.html \ --found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \ --found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café463e.html \ --found ut.httrack.com/unicode-links/café463e.html \
--found ut.httrack.com/unicode-links/café5e1f.html \ --found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \ --found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \ --found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/café9fa8.html \ --found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/caféae52.html \ --found ut.httrack.com/unicode-links/caféae52.html \
--found ut.httrack.com/unicode-links/caféc009.html \ --found ut.httrack.com/unicode-links/caféc009.html \
--found ut.httrack.com/unicode-links/utf8.html \ --found ut.httrack.com/unicode-links/utf8.html \
httrack http://ut.httrack.com/unicode-links/utf8.html httrack http://ut.httrack.com/unicode-links/utf8.html
bash crawl-test.sh \ bash crawl-test.sh \
--errors 4 --files 7 \ --errors 4 --files 7 \
--found ut.httrack.com/unicode-links/café3860.html \ --found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café9fa8.html \ --found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/café30f4.html \ --found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \ --found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \ --found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \ --found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caf%e939bd.html \ --found ut.httrack.com/unicode-links/caf%e939bd.html \
--found ut.httrack.com/unicode-links/caf%e9ae52.html \ --found ut.httrack.com/unicode-links/caf%e9ae52.html \
--found ut.httrack.com/unicode-links/caféaec2.html \ --found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \ --found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/default.html \ --found ut.httrack.com/unicode-links/default.html \
httrack http://ut.httrack.com/unicode-links/default.html httrack http://ut.httrack.com/unicode-links/default.html
bash crawl-test.sh \ bash crawl-test.sh \
--errors 2 --files 9 \ --errors 2 --files 9 \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \ --found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \ --found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café30f4.html \ --found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \ --found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café5e1f.html \ --found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café647f.html \ --found ut.httrack.com/unicode-links/café647f.html \
--found ut.httrack.com/unicode-links/café7b30.html \ --found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \ --found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caféaec2.html \ --found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \ --found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/iso88591.html \ --found ut.httrack.com/unicode-links/iso88591.html \
httrack http://ut.httrack.com/unicode-links/iso88591.html httrack http://ut.httrack.com/unicode-links/iso88591.html
bash crawl-test.sh \ bash crawl-test.sh \
--errors 4 --files 9 \ --errors 4 --files 9 \
--found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \ --found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \ --found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café8007.html \ --found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/cafébf43.html \ --found ut.httrack.com/unicode-links/cafébf43.html \
--found ut.httrack.com/unicode-links/cafédcd8.html \ --found ut.httrack.com/unicode-links/cafédcd8.html \
--found ut.httrack.com/unicode-links/café2461.html \ --found ut.httrack.com/unicode-links/café2461.html \
--found ut.httrack.com/unicode-links/caf%a8%a61bce.html \ --found ut.httrack.com/unicode-links/caf%a8%a61bce.html \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \ --found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/café7b30.html \ --found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café30f4.html \ --found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \ --found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café3860.html \ --found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/gb18030.html \ --found ut.httrack.com/unicode-links/gb18030.html \
httrack http://ut.httrack.com/unicode-links/gb18030.html httrack http://ut.httrack.com/unicode-links/gb18030.html


@@ -3,10 +3,10 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=42&can=1 # http://code.google.com/p/httrack/issues/detail?id=42&can=1
# we expect 2 errors only because other links are too long (to be modified if suitable) # we expect 2 errors only because other links are too long (to be modified if suitable)
bash crawl-test.sh --errors 2 --files 1 \ bash crawl-test.sh --errors 2 --files 1 \
--found ut.httrack.com/overflow/longquerywithaccents.html \ --found ut.httrack.com/overflow/longquerywithaccents.html \
httrack http://ut.httrack.com/overflow/longquerywithaccents.php httrack http://ut.httrack.com/overflow/longquerywithaccents.php


@@ -3,45 +3,45 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=4&can=1 # http://code.google.com/p/httrack/issues/detail?id=4&can=1
bash crawl-test.sh --errors 0 --files 4 \ bash crawl-test.sh --errors 0 --files 4 \
--found ut.httrack.com/parsing/back5e1f.gif \ --found ut.httrack.com/parsing/back5e1f.gif \
--found ut.httrack.com/parsing/events.html \ --found ut.httrack.com/parsing/events.html \
--found ut.httrack.com/parsing/fade230f4.gif \ --found ut.httrack.com/parsing/fade230f4.gif \
--found ut.httrack.com/parsing/fade3860.gif \ --found ut.httrack.com/parsing/fade3860.gif \
httrack http://ut.httrack.com/parsing/events.html httrack http://ut.httrack.com/parsing/events.html
# http://code.google.com/p/httrack/issues/detail?id=2&can=1 # http://code.google.com/p/httrack/issues/detail?id=2&can=1
bash crawl-test.sh --errors 0 --files 3 \ bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/background-image.css \ --found ut.httrack.com/parsing/background-image.css \
--found ut.httrack.com/parsing/background-image.html \ --found ut.httrack.com/parsing/background-image.html \
--found ut.httrack.com/parsing/fade.gif \ --found ut.httrack.com/parsing/fade.gif \
httrack http://ut.httrack.com/parsing/background-image.html httrack http://ut.httrack.com/parsing/background-image.html
# javascript parsing # javascript parsing
bash crawl-test.sh --errors 0 --files 3 \ bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/back.gif \ --found ut.httrack.com/parsing/back.gif \
--found ut.httrack.com/parsing/fade.gif \ --found ut.httrack.com/parsing/fade.gif \
--found ut.httrack.com/parsing/javascript.html \ --found ut.httrack.com/parsing/javascript.html \
httrack http://ut.httrack.com/parsing/javascript.html httrack http://ut.httrack.com/parsing/javascript.html
# handling of + before query string # handling of + before query string
bash crawl-test.sh --errors 0 --files 6 \ bash crawl-test.sh --errors 0 --files 6 \
--found ut.httrack.com/parsing/escaping.html \ --found ut.httrack.com/parsing/escaping.html \
--found "ut.httrack.com/parsing/foo bar30f4.html" \ --found "ut.httrack.com/parsing/foo bar30f4.html" \
--found "ut.httrack.com/parsing/foo bar5e1f.html" \ --found "ut.httrack.com/parsing/foo bar5e1f.html" \
--found "ut.httrack.com/parsing/foo+bar+plus3860.html" \ --found "ut.httrack.com/parsing/foo+bar+plus3860.html" \
--found "ut.httrack.com/parsing/foo barae52.html" \ --found "ut.httrack.com/parsing/foo barae52.html" \
--found "ut.httrack.com/parsing/foo bar7b30.html" \ --found "ut.httrack.com/parsing/foo bar7b30.html" \
httrack http://ut.httrack.com/parsing/escaping.html httrack http://ut.httrack.com/parsing/escaping.html
# handling of # encoded in filename # handling of # encoded in filename
# see http://code.google.com/p/httrack/issues/detail?id=25 # see http://code.google.com/p/httrack/issues/detail?id=25
bash crawl-test.sh --errors 2 --files 4 \ bash crawl-test.sh --errors 2 --files 4 \
--found "ut.httrack.com/parsing/escaping2.html" \ --found "ut.httrack.com/parsing/escaping2.html" \
--found "ut.httrack.com/parsing/++foo++bar++plus++.html" \ --found "ut.httrack.com/parsing/++foo++bar++plus++.html" \
--found "ut.httrack.com/parsing/foo#bar#.html" \ --found "ut.httrack.com/parsing/foo#bar#.html" \
--found "ut.httrack.com/parsing/foo bar.html" \ --found "ut.httrack.com/parsing/foo bar.html" \
httrack http://ut.httrack.com/parsing/escaping2.html httrack http://ut.httrack.com/parsing/escaping2.html


@@ -3,11 +3,11 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
if test "${HTTPS_SUPPORT:-}" == "no"; then if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping" echo "no https support compiled, skipping"
exit 77 exit 77
fi fi
bash crawl-test.sh --errors 0 --files 5 httrack https://ut.httrack.com/simple/basic.html bash crawl-test.sh --errors 0 --files 5 httrack https://ut.httrack.com/simple/basic.html


@@ -0,0 +1,136 @@
#!/bin/bash
#
# Issue #85: an https crawl must go through the configured proxy (CONNECT
# tunnel), not bypass it and hit the origin directly. Fully local: a self-signed
# TLS origin plus a logging CONNECT proxy, so no network access is needed.
set -euo pipefail
: "${top_srcdir:=..}"
if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
if ! command -v python3 >/dev/null 2>&1 || ! command -v openssl >/dev/null 2>&1; then
echo "python3/openssl missing, skipping"
exit 77
fi
server="$top_srcdir/tests/proxy-https-server.py"
tmpdir=$(mktemp -d)
pids=
cleanup() {
for pid in $pids; do
kill "$pid" 2>/dev/null || true
done
rm -rf "$tmpdir"
}
trap cleanup EXIT
# self-signed cert for the local TLS origin (httrack does not verify certs)
openssl req -x509 -newkey rsa:2048 -keyout "$tmpdir/key.pem" \
-out "$tmpdir/cert.pem" -days 2 -nodes -subj "/CN=127.0.0.1" \
>/dev/null 2>&1
cat "$tmpdir/key.pem" "$tmpdir/cert.pem" >"$tmpdir/both.pem"
# start_server <logdir> <mode>: launches a proxy+origin pair, sets $origin_port
# and $proxy_port from its announced ephemeral ports.
start_server() {
local dir="$1" mode="$2" ports
mkdir -p "$dir"
ports="$dir/ports.txt"
python3 "$server" "$tmpdir/both.pem" "$dir" "$mode" \
>"$ports" 2>"$dir/server.err" &
pids="$pids $!"
for _ in $(seq 1 100); do
grep -q "^ready" "$ports" 2>/dev/null && break
sleep 0.1
done
grep -q "^ready" "$ports" 2>/dev/null || {
echo "server ($mode) did not start" >&2
cat "$dir/server.err" >&2
exit 1
}
origin_port=$(awk '/^ORIGIN/{print $2}' "$ports")
proxy_port=$(awk '/^PROXY/{print $2}' "$ports")
}
# Run httrack, but kill it after a deadline so a hang (e.g. a missing bound on
# the proxy response) surfaces as the kill code $HANG_RC instead of stalling the
# whole job. A portable stand-in for `timeout`, which macOS lacks.
HANG_RC=137 # 128 + SIGKILL
run_crawl() {
local out="$1" proxy="$2" port="$3"
rm -rf "$out"
httrack "https://127.0.0.1:${port}/" --proxy "$proxy" \
-O "$out" -r1 -s0 --timeout=10 >"$out.log" 2>&1 &
local pid=$!
(sleep 60 && kill -9 "$pid" 2>/dev/null) &
local guard=$!
local rc=0
wait "$pid" 2>/dev/null || rc=$?
kill "$guard" 2>/dev/null || true
wait "$guard" 2>/dev/null || true
return "$rc"
}
# --- working proxy ----------------------------------------------------------
ok="$tmpdir/ok"
start_server "$ok" ok
# 1. page retrieved AND the proxy saw a CONNECT to the origin
run_crawl "$ok/out" "127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out" || {
echo "FAIL: origin page not downloaded through proxy" >&2
cat "$ok/out.log" >&2
exit 1
}
grep -q "^CONNECT 127.0.0.1:${origin_port} " "$ok/proxy.log" || {
echo "FAIL: proxy never received a CONNECT (https bypassed the proxy)" >&2
cat "$ok/proxy.log" >&2
exit 1
}
echo "OK: https tunneled through proxy via CONNECT"
# 2. authenticated proxy: creds ride the CONNECT, and NEVER reach the origin
: >"$ok/proxy.log"
: >"$ok/origin-headers.log"
run_crawl "$ok/out2" "user:secret@127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out2" || {
echo "FAIL: origin page not downloaded through authenticated proxy" >&2
exit 1
}
got=$(awk '/^AUTH Basic /{print $3}' "$ok/proxy.log" | head -1)
# base64("user:secret"); compared as a literal to stay portable (no base64 -d,
# which differs between GNU and BSD)
test "$got" == "dXNlcjpzZWNyZXQ=" || {
echo "FAIL: Proxy-Authorization not carried on CONNECT (got '$got')" >&2
cat "$ok/proxy.log" >&2
exit 1
}
if grep -qi "proxy-authorization" "$ok/origin-headers.log"; then
echo "FAIL: proxy credentials leaked to the origin through the tunnel" >&2
cat "$ok/origin-headers.log" >&2
exit 1
fi
echo "OK: proxy credentials carried on CONNECT, not leaked to origin"
# --- hostile proxy ----------------------------------------------------------
# A proxy that answers 200 then streams headers forever must not hang the crawl:
# the client bounds the response. run_crawl kills a hung httrack after 60s, so a
# missing bound surfaces as $HANG_RC here.
flood="$tmpdir/flood"
start_server "$flood" flood
rc=0
run_crawl "$flood/out" "127.0.0.1:${proxy_port}" "$origin_port" || rc=$?
test "$rc" -ne "$HANG_RC" || {
echo "FAIL: crawl hung on a flooding proxy (bounded read missing)" >&2
exit 1
}
grep -rq "ORIGIN-PAGE-85" "$flood/out" 2>/dev/null && {
echo "FAIL: flooding proxy unexpectedly served the page" >&2
exit 1
}
echo "OK: bounded proxy response, no hang on a flooding proxy"
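
`run_crawl` in the script above is a hand-rolled deadline: start the command in the background, start a second background job that sleeps and then sends SIGKILL, and wait for the command. The sketch below isolates that pattern under a hypothetical name, `run_with_deadline`, with short delays so it finishes quickly. Redirecting the guard's output to `/dev/null` is an addition of this sketch (so a leftover `sleep` cannot hold a pipe open); the script in the diff does not do that.

```shell
#!/bin/bash
# run_with_deadline <seconds> <command...>: hypothetical generalisation of the
# run_crawl pattern. Returns the command's status, or 137 (128 + SIGKILL) if the
# guard had to kill it.
run_with_deadline() {
  local deadline="$1"
  shift
  "$@" &
  local pid=$!
  # Guard: after the deadline, kill the command if it is still running.
  (sleep "$deadline" && kill -9 "$pid" 2>/dev/null) >/dev/null 2>&1 &
  local guard=$!
  local rc=0
  wait "$pid" 2>/dev/null || rc=$?
  # The command is done: retire the guard and reap it.
  kill "$guard" 2>/dev/null || true
  wait "$guard" 2>/dev/null || true
  return "$rc"
}

rc=0
run_with_deadline 2 true || rc=$?
echo "fast command: rc=$rc" # rc=0

rc=0
run_with_deadline 1 sleep 5 || rc=$?
echo "hung command: rc=$rc" # rc=137
```

Killing the guard subshell does not kill the `sleep` it started; that child simply runs out its timer in the background, which is harmless here but worth knowing when the deadline is long.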


@@ -2,6 +2,7 @@
# explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would # explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would
# silently drop it from the dist tarball and break "make distcheck". # silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
fixtures/cache-golden/hts-cache/new.zip fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT = TESTS_ENVIRONMENT =
@@ -24,6 +25,8 @@ TESTS = \
01_engine-cache-golden.test \ 01_engine-cache-golden.test \
01_engine-charset.test \ 01_engine-charset.test \
01_engine-cmdline.test \ 01_engine-cmdline.test \
01_engine-cookies.test \
01_engine-copyopt.test \
01_engine-doitlog.test \ 01_engine-doitlog.test \
01_engine-entities.test \ 01_engine-entities.test \
01_engine-filter.test \ 01_engine-filter.test \
@@ -32,6 +35,7 @@ TESTS = \
01_engine-mime.test \ 01_engine-mime.test \
01_engine-parse.test \ 01_engine-parse.test \
01_engine-rcfile.test \ 01_engine-rcfile.test \
01_engine-relative.test \
01_engine-simplify.test \ 01_engine-simplify.test \
01_engine-strsafe.test \ 01_engine-strsafe.test \
02_manpage-regen.test \ 02_manpage-regen.test \
@@ -42,6 +46,7 @@ TESTS = \
 11_crawl-international.test \
 11_crawl-longurl.test \
 11_crawl-parsing.test \
-12_crawl_https.test
+12_crawl_https.test \
+13_crawl_proxy_https.test
 CLEANFILES = check-network_sh.cache


@@ -6,39 +6,39 @@
 # do not enable online tests (./configure --disable-online-unit-tests)
 if test "$ONLINE_UNIT_TESTS" == "no"; then
 echo "online tests are disabled" >&2
 exit 1
 # enable online tests (--enable-online-unit-tests)
 elif test "$ONLINE_UNIT_TESTS" == "yes"; then
 exit 0
 # check if online tests are reachable
 else
 # test url
 url=http://ut.httrack.com/enabled
 # cache file name
 cache=check-network_sh.cache
 # cached result ?
-if test -f $cache ; then
-if grep -q "ok" $cache ; then
+if test -f $cache; then
+if grep -q "ok" $cache; then
 exit 0
 else
 echo "online tests are disabled (cached)" >&2
 exit 1
 fi
 # fetch single file
-elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null ; then
-echo "ok" > $cache
+elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null; then
+echo "ok" >$cache
 exit 0
 else
-echo "error" > $cache
+echo "error" >$cache
 echo "online tests are disabled (auto)" >&2
 exit 1
 fi
 fi


@@ -2,185 +2,184 @@
# #
function warning { function warning {
echo "** $*" >&2 echo "** $*" >&2
return 0 return 0
} }
function die { function die {
warning "$*" warning "$*"
exit 1 exit 1
} }
function debug { function debug {
if test -n "$verbose"; then if test -n "$verbose"; then
echo "$*" >&2 echo "$*" >&2
fi fi
} }
 function info {
-printf "[$*] ..\t" >&2
+printf '[%s] ..\t' "$*" >&2
 }
function result { function result {
echo "$*" >&2 echo "$*" >&2
} }
function cleanup { function cleanup {
debug "cleaning function called" debug "cleaning function called"
if test -n "$tmpdir"; then if test -n "$tmpdir"; then
if test -d "$tmpdir"; then if test -d "$tmpdir"; then
if test -z "$nopurge"; then if test -z "$nopurge"; then
debug "cleaning up $tmpdir" debug "cleaning up $tmpdir"
rm -rf "$tmpdir" rm -rf "$tmpdir"
fi fi
fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi
} }
function usage { function usage {
cat << EOF cat <<EOF
usage: $0 usage: $0
EOF EOF
} }
function assert_equals { function assert_equals {
info "$1" info "$1"
if test ! "$2" == "$3"; then if test ! "$2" == "$3"; then
result "expected '$2', got '$3'" result "expected '$2', got '$3'"
exit 1 exit 1
else else
result "OK ($2)" result "OK ($2)"
fi fi
} }
 function start-crawl {
 # parse args
 pos=1
-while test "$#" -ge "$pos" ; do
+while test "$#" -ge "$pos"; do
 case "${!pos}" in
 --debug)
 verbose=1
 ;;
---no-purge|--summary|--print-files)
-;;
---errors|--files|--found|--not-found|--directory)
-pos=$[${pos}+1]
+--no-purge | --summary | --print-files) ;;
+--errors | --files | --found | --not-found | --directory)
+pos=$((pos + 1))
 test "$#" -ge "$pos" || warning "missing argument" || return 1
 ;;
 httrack)
-pos=$[${pos}+1]
-break;
+pos=$((pos + 1))
+break
 ;;
 *)
 warning "unrecognized option ${!pos}"
 return 1
 ;;
 esac
-pos=$[${pos}+1]
+pos=$((pos + 1))
 done
-debug "remaining args: ${@:${pos}}"
+debug "remaining args: ${*:pos}"
 # ut/ won't exceed 2 minutes
-moreargs="--quiet --max-time=120 --timeout=30 --connection-per-second=5"
+moreargs=(--quiet --max-time=120 --timeout=30 --connection-per-second=5)
 # proxy environment ?
-if test -n "$http_proxy"; then
-moreargs="$moreargs --proxy $http_proxy"
+if test -n "${http_proxy:-}"; then
+moreargs+=(--proxy "$http_proxy")
 fi
 test -n "$tmpdir" || ! warning "no tmpdir" || return 1
 tmp="${tmpdir}/crawl"
rm -rf "$tmp"
mkdir "$tmp" || ! warning "could not create $tmp" || return 1
which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" ${moreargs} ${@:${pos}}
info "running httrack ${@:${pos}}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" ${moreargs} ${@:${pos}} >"${log}" 2>&1 &
crawlpid="$!"
debug "started crawler on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "unexpectedly found"
exit 1
else
result "OK"
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" \
| sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break;
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
rm -rf "$tmp" rm -rf "$tmp"
else mkdir "$tmp" || ! warning "could not create $tmp" || return 1
tmpdir=
fi which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" "${moreargs[@]}" "${@:pos}"
info "running httrack ${*:pos}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" "${moreargs[@]}" "${@:pos}" >"${log}" 2>&1 &
crawlpid="$!"
debug "started crawler on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking for $1"
if test -f "${tmp}/$1"; then
result "unexpectedly found"
exit 1
else
result "OK"
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" |
sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
rm -rf "$tmp"
else
tmpdir=
fi
} }
# check args # check args
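
Two changes in this hunk are easy to check in isolation: `moreargs` moving from a space-separated string to a bash array, and `info` passing its text as a `printf` argument instead of as the format. The sketch below shows the difference each one makes. `count_args` and the proxy value are hypothetical; the `%a9` label is borrowed from the `--found` names used by the unicode tests, which is presumably where a stray `%` would have reached the old format string.

```shell
#!/bin/bash
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() { echo "$#"; }

proxy="user name@127.0.0.1:8080" # hypothetical value containing a space

# String accumulation: the unquoted expansion is re-split on whitespace.
moreargs_str="--quiet --proxy $proxy"
# shellcheck disable=SC2086
count_args $moreargs_str # 4: the proxy value was split in two

# Array accumulation: each element stays exactly one argument.
moreargs=(--quiet)
moreargs+=(--proxy "$proxy")
count_args "${moreargs[@]}" # 3

# Data goes in an argument, never in the format: '%' is then printed verbatim.
label='caf%a91bce.html'
printf '[%s] ..\n' "$label" # [caf%a91bce.html] ..
```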
@@ -195,7 +194,7 @@ tmpdir=
 crawlpid=
 nopurge=
 verbose=
-trap "cleanup" 0 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
+trap cleanup EXIT HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
 # working directory
 tmpdir="${tmptopdir}/httrack_ut.$$"
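
The trap line now lists signals by name and drops KILL and STOP, which cannot be trapped. Names also read the same on every platform, whereas several signal numbers differ between Linux and the BSDs. The sketch below is a minimal check that a name-based trap runs its cleanup on a normal exit; the child script and its variable names are made up for the example.

```shell
#!/bin/bash
# Run a child bash that creates a scratch directory, installs a cleanup trap by
# signal name, prints the directory path, then exits normally.
scratch=$(bash -c '
  dir=$(mktemp -d)
  cleanup() { rm -rf "$dir"; }
  trap cleanup EXIT HUP INT TERM
  echo "$dir"
')

# The EXIT trap should have removed the directory before the child finished.
if test -d "$scratch"; then
  echo "leaked: $scratch"
else
  echo "cleaned up"
fi
```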

tests/proxy-https-server.py Normal file

@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Local CONNECT proxy + self-signed HTTPS origin for the issue #85 test.
Starts a TLS origin server and an HTTP proxy that honours CONNECT, on ephemeral
ports. Every request line the proxy receives (and any Proxy-Authorization) is
appended to the proxy log; every header the origin receives over the tunnel is
appended to the origin log. That lets the test assert both that an https crawl
tunneled through the proxy and that proxy credentials never leaked to the origin.
Proxy modes (argv[3], default "ok"):
ok - honour CONNECT and tunnel to the origin
flood - answer 200 then stream headers forever with no blank line, to exercise
the client's bound on the proxy response (must not hang the crawl)
Usage: proxy-https-server.py <cert.pem> <logdir> [mode]
Prints "ORIGIN <port>", "PROXY <port>", then "ready" (one per line) on stdout.
"""
import http.server
import os
import socket
import socketserver
import ssl
import sys
import threading
ORIGIN_BODY = b"<html><body>ORIGIN-PAGE-85</body></html>"
PROXY_LOG = "proxy.log"
ORIGIN_LOG = "origin-headers.log"
def make_origin(logdir):
class Origin(http.server.BaseHTTPRequestHandler):
def do_GET(self):
with open(os.path.join(logdir, ORIGIN_LOG), "a") as handle:
for key in self.headers.keys():
handle.write(key + "\n")
self.send_response(200)
self.send_header("Content-Type", "text/html")
self.send_header("Content-Length", str(len(ORIGIN_BODY)))
self.end_headers()
self.wfile.write(ORIGIN_BODY)
def log_message(self, *args):
pass
return Origin
def start_origin(certfile, logdir):
httpd = socketserver.TCPServer(("127.0.0.1", 0), make_origin(logdir))
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile)
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
port = httpd.socket.getsockname()[1]
threading.Thread(target=httpd.serve_forever, daemon=True).start()
return port
def pipe(src, dst):
try:
while True:
data = src.recv(65536)
if not data:
break
dst.sendall(data)
except OSError:
pass
finally:
for sock in (src, dst):
try:
sock.shutdown(socket.SHUT_RDWR)
except OSError:
pass
def handle_client(conn, logdir, mode):
    rfile = conn.makefile("rb")
    request_line = rfile.readline().decode("latin-1").strip()
    auth = None
    while True:
        line = rfile.readline().decode("latin-1")
        if line in ("\r\n", "\n", ""):
            break
        key, _, value = line.partition(":")
        if key.strip().lower() == "proxy-authorization":
            auth = value.strip()
    with open(os.path.join(logdir, PROXY_LOG), "a") as handle:
        handle.write(request_line + "\n")
        if auth is not None:
            handle.write("AUTH " + auth + "\n")
    parts = request_line.split()
    if not (len(parts) >= 2 and parts[0] == "CONNECT"):
        conn.sendall(b"HTTP/1.0 501 Not Implemented\r\n\r\n")
        conn.close()
        return
    if mode == "flood":
        # 200, then an endless header stream with no terminating blank line: the
        # client must bound this and give up, not hang.
        try:
            conn.sendall(b"HTTP/1.0 200 Connection established\r\n")
            while True:
                conn.sendall(b"X-Pad: 0123456789\r\n")
        except OSError:
            pass
        conn.close()
        return
    host, _, port = parts[1].partition(":")
    try:
        upstream = socket.create_connection((host, int(port or 443)))
    except OSError:
        conn.sendall(b"HTTP/1.0 502 Bad Gateway\r\n\r\n")
        conn.close()
        return
    conn.sendall(b"HTTP/1.0 200 Connection established\r\n\r\n")
    threading.Thread(target=pipe, args=(conn, upstream), daemon=True).start()
    pipe(upstream, conn)
def start_proxy(logdir, mode):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))
    srv.listen(16)
    port = srv.getsockname()[1]

    def serve():
        while True:
            conn, _ = srv.accept()
            threading.Thread(
                target=handle_client, args=(conn, logdir, mode), daemon=True
            ).start()

    threading.Thread(target=serve, daemon=True).start()
    return port
def main():
    certfile, logdir = sys.argv[1], sys.argv[2]
    mode = sys.argv[3] if len(sys.argv) > 3 else "ok"
    for name in (PROXY_LOG, ORIGIN_LOG):
        open(os.path.join(logdir, name), "w").close()
    origin_port = start_origin(certfile, logdir)
    proxy_port = start_proxy(logdir, mode)
    print("ORIGIN %d" % origin_port, flush=True)
    print("PROXY %d" % proxy_port, flush=True)
    print("ready", flush=True)
    threading.Event().wait()


if __name__ == "__main__":
    main()
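
The docstring above describes a small stdout handshake ("ORIGIN <port>", "PROXY <port>", then "ready") and two log files that a test can inspect. As a rough illustration only, a harness could consume them along the following lines. The helper names `parse_handshake`, `tunneled` and `credentials_leaked` are hypothetical and do not appear in the repository; the log formats are assumed from the script itself (one header name per line in the origin log, the request line plus an optional "AUTH ..." line in the proxy log).

```python
import io


def parse_handshake(stream):
    """Read handshake lines until 'ready'; return (origin_port, proxy_port).

    Hypothetical helper: `stream` is any iterable of text lines, for example
    the stdout pipe of the server process.
    """
    ports = {}
    for raw in stream:
        line = raw.strip()
        if line == "ready":
            break
        name, _, value = line.partition(" ")
        if name in ("ORIGIN", "PROXY"):
            ports[name] = int(value)
    else:
        # The loop ended without seeing 'ready': the server exited early.
        raise RuntimeError("server exited before printing 'ready'")
    if set(ports) != {"ORIGIN", "PROXY"}:
        raise RuntimeError("handshake is missing a port line")
    return ports["ORIGIN"], ports["PROXY"]


def tunneled(proxy_log_text):
    """True if the proxy log records at least one CONNECT request line."""
    return any(
        line.split()[:1] == ["CONNECT"] for line in proxy_log_text.splitlines()
    )


def credentials_leaked(origin_log_text):
    """True if the origin ever received a Proxy-Authorization header.

    Assumes the origin log holds one header name per line, as do_GET writes it.
    """
    names = [name.strip().lower() for name in origin_log_text.splitlines()]
    return "proxy-authorization" in names


# Example with in-memory stand-ins for the process output and log files.
origin_port, proxy_port = parse_handshake(
    io.StringIO("ORIGIN 4443\nPROXY 8080\nready\n")
)
print(origin_port, proxy_port)
print(tunneled("CONNECT 127.0.0.1:4443 HTTP/1.1\nAUTH Basic abc\n"))
print(credentials_leaked("Host\nUser-Agent\n"))
```

In a real test the stream would presumably be the stdout of the server subprocess, and the two log texts would be read from `proxy.log` and `origin-headers.log` in the log directory after the crawl finishes.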

@@ -2,19 +2,19 @@
 #
 error=0
-for i in *.test ; do
-if bash $i ; then
+for i in *.test; do
+if bash "$i"; then
 echo "$i: passed" >&2
 else
 echo "$i: ERROR" >&2
-error=$[${error}+1]
+error=$((error + 1))
 fi
 done
 if test "$error" -eq 0; then
 echo "all tests passed" >&2
 else
 echo "${error} test(s) failed" >&2
 fi
 exit $error