Compare commits

60 Commits

Author SHA1 Message Date
Xavier Roche
1611dbcabf Trust a declared Content-Type over a binary URL extension (#409)
PR #408 stopped a bogus or missing html-ish wire type from clobbering a URL
extension that maps to a specific non-HTML type (the #267 mangle). But it
treated an explicitly declared text/html the same as a missing type, so a
binary-looking URL that legitimately serves HTML, such as a login or error
interstitial or a soft-404 at a .pdf or .jpg link, was saved under the binary
extension with HTML inside and would not render locally.

The response body is the only true discriminator, but under the default delayed
type check the save name is committed from the headers while the body is still
downloading, so it cannot be sniffed at naming time. Instead, keep the URL
extension only when the server sent no Content-Type at all (a missing header is
defaulted to text/html upstream and must not be trusted); an explicitly declared
type, even text/html, now wins. This trades the rare case of a real binary
explicitly mislabeled text/html (now named .html) for the common interstitial
and soft-404 case.
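
As an illustration of the naming rule described above, here is a minimal Python sketch. It is not the engine's C code: the tables `SPECIFIC_EXT` and `MIME_TO_EXT` and the function `save_extension` are hypothetical, and `content_type is None` stands in for "no Content-Type header was received".

```python
# Hypothetical lookup tables, for illustration only.
SPECIFIC_EXT = {"pdf", "jpg", "png"}  # URL extensions mapping to a specific non-HTML type
MIME_TO_EXT = {
    "text/html": "html",
    "application/pdf": "pdf",
    "image/jpeg": "jpg",
    "image/png": "png",
}


def save_extension(url_ext, content_type):
    """Pick the saved-file extension.

    url_ext: extension taken from the URL ("" if none).
    content_type: declared type, or None when the header was missing.
    """
    if content_type is None:
        # Missing header: do not trust the text/html default,
        # keep a URL extension that maps to a specific type.
        return url_ext if url_ext in SPECIFIC_EXT else "html"
    # An explicitly declared type wins, even text/html.
    return MIME_TO_EXT.get(content_type, url_ext or "html")
```

Under these assumptions a `.pdf` URL that declares text/html is named `.html`, while the same URL with no Content-Type at all keeps `.pdf`.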

Whether a Content-Type header was actually received cannot be recovered after
parsing, since treatfirstline defaults a missing header to text/html, so it is
recorded as a new hts_boolean contenttype_given on htsblk. That grows the
installed struct, an incompatible ABI change: soname bumped 3 -> 4, and the
Debian runtime package renamed libhttrack3 -> libhttrack4 to match.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:18:16 +02:00
Xavier Roche
099501ee50 Make lintian actually gate the Debian package build (#410)
The deb CI job and mkdeb.sh ran lintian via debuild with
--fail-on=error,warning and were believed to gate on it. They did not:
debuild only reports lintian, it does not propagate lintian's exit status,
so a package that lintian flags with errors or warnings still built green.
This was demonstrated by a SONAME bump landing without the matching
libhttrackN package rename: lintian emitted shared-library-is-multi-arch-foreign
and package-name-doesnt-match-sonames, yet the job passed.

Disable debuild's lintian run and run lintian ourselves on the produced
.changes, under set -e, so any error or warning fails the build. Two CI-only
adjustments keep a clean package green: --profile debian, because the Ubuntu
runners' vendor data would otherwise reject the Debian "unstable" distribution,
and --suppress-tags newer-standards-version, which only reflects the runner's
lintian being older than the buildds'. The long-standing script-not-executable
hint on the sample search.sh gets an override.

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:13:12 +02:00
Xavier Roche
1b9eefa3b4 Merge pull request #408 from xroche/fix/267-delayed-ext-mangle
Stop mangling saved-file extensions under the default delayed type check
2026-06-20 18:28:27 +02:00
Xavier Roche
9c8d3a41eb tests: tighten the type-matrix guards
Add two assertions surfaced by review of the override path: control.php
must not survive its rename to control.html (a dual-write regression
would leave both), and gen.php?id=5 (a query/extension-less URL served
image/png) must keep its .png and not be mangled to .html. Both exercise
the "override still fires" direction that the suppression cases don't.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:25:45 +02:00
Xavier Roche
ae77cd9d6d Honor --assume under the default delayed type check (-%N2)
Under HARD savename-delayed (the default), url_savename() forced
is_html=-1 before consulting the user's --assume rules, so a type the
user pinned was lost to the delayed name and never applied (#56). Skip
the forced delay when is_userknowntype() matches: ishtml() already
consults the user type, so the immediate naming path applies it. Files
with no --assume rule are unaffected -- is_userknowntype() is false and
the delay still fires.

tests/16_local-assume.test crawls a .png served as image/png but assumed
text/html and checks it is saved .html; it fails without this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:12:01 +02:00
Xavier Roche
51b8dcd81c Keep a known URL extension against a bogus html/empty Content-Type
Under the default delayed type check (-%N2), url_savename() rewrote a
saved file's extension from the wire Content-Type, gated only by
!may_unknown2(). text/html is not in the keep-list, so a response
labeled text/html -- or a typeless one, which is coerced to text/html --
clobbered the URL's own extension: a PNG served as text/html or with no
Content-Type was saved as .html, and .htm was normalized to .html (#29).
The bytes stayed intact; only the name was silently wrong.

wire_patches_ext() now lets the wire type override the extension only
when the type is patchable and doing so would not clobber a URL
extension that already maps to a specific, non-HTML type. A generator or
extension-less URL still becomes .html; a .png stays .png.

tests/15_local-types.test locks this with a deterministic offline crawl
of a content-type/extension matrix (tests/local-server.py); it fails on
the unfixed engine. Addresses the #267 mangle family (incl. #29).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 18:07:08 +02:00
Xavier Roche
bcce664143 Merge pull request #364 from xroche/feature/local-test-server
tests: offline local test server prototype (cookies + HTTPS)
2026-06-20 16:41:26 +02:00
Xavier Roche
7a24add87c tests: add offline local test server prototype (cookies + HTTPS)
Replace the network dependency for crawl tests with a self-contained Python
stdlib server (http.server + ssl) that httrack crawls over loopback. The server
binds an ephemeral port and prints it on stdout; local-crawl.sh discovers the
port, substitutes the BASEURL token into the httrack arguments, runs the crawl,
and audits the mirror under the discovered host-root directory.
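
The ephemeral-port handshake can be sketched with the Python stdlib alone. This is a simplified stand-in, not tests/local-server.py itself: the handler, `start_server` and `fetch` are hypothetical names, and the real server's routes and output format are not shown in this log.

```python
import http.client
import http.server
import threading


class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>ok</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging so only the port line is printed


def start_server():
    # Port 0 asks the kernel for a free ephemeral port on loopback.
    server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, port


def fetch(port, path="/"):
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


if __name__ == "__main__":
    server, port = start_server()
    print(port, flush=True)  # a wrapper script can read the port from stdout
    print(fetch(port)[0])
    server.shutdown()
```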

This prototype migrates two cases off ut.httrack.com:

- 13_local-cookies.test drives the cookie chain (entrance/second/third)
  reimplemented as Python handlers from the old ut/cookies/*.php fixtures. A
  missing or wrong cookie answers 500, so a clean 3-files/0-errors run proves
  the cookie jar is replayed across links.
- 14_local-https.test crawls over HTTPS using a shipped long-dated self-signed
  cert. httrack does not verify certs, so the cert is accepted as-is and the
  real TLS path runs offline.

The group skips (exit 77) when python3 is missing, mirroring check-network.sh.
Fixtures and the cert are listed explicitly in EXTRA_DIST (automake does not
expand globs); make distcheck passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 16:35:13 +02:00
Xavier Roche
2308e7bafd Merge pull request #407 from xroche/fix/mkdeb-orig-artifact-rev2
mkdeb: cut a Debian revision >= 2 without bypassing the tool
2026-06-20 15:46:57 +02:00
Xavier Roche
ef5691fc47 mkdeb: reuse a frozen orig tarball for a Debian revision >= 2
mkdeb.sh regenerated the upstream orig from a fresh `git archive HEAD | make
dist` on every run. That is right for a -1 release, but a Debian revision >= 2
reuses the orig frozen in the archive at -1: the .dsc pins it by checksum, and
a regenerated orig (different mtimes, and content drift whenever the release
tooling shipped in EXTRA_DIST changes) gets rejected by dak. The -2 upload had
to bypass mkdeb.sh and stitch the package by hand.

Derive the upstream version and Debian revision from debian/changelog and let
the revision pick the orig: revision 1 builds a fresh tarball as before;
revision >= 2 reuses the one passed with --orig FILE, untouched. The --orig
requirement is enforced only for a signed (upload-bound) build: an unsigned
build is a throwaway (CI, local lintian) that can never reach the archive, so
it still regenerates the orig as before rather than demanding a frozen one.
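
The selection logic reads roughly as follows in Python. This is a sketch of the decision only, not mkdeb.sh: the function names are hypothetical, the revision is assumed to be a plain integer with no epoch, and reusing a supplied orig on an unsigned build is an assumption the message does not spell out.

```python
def split_debian_version(version):
    """Split a changelog version such as '3.49.8-2' into ('3.49.8', 2)."""
    upstream, sep, revision = version.rpartition("-")
    if not sep:
        raise ValueError("no Debian revision in " + version)
    return upstream, int(revision)


def choose_orig(version, signed, orig_file=None):
    """Return ('reuse', path) or ('regenerate', tarball_name)."""
    upstream, revision = split_debian_version(version)
    if revision >= 2 and orig_file is not None:
        return ("reuse", orig_file)  # frozen orig, used untouched
    if revision >= 2 and signed:
        # Only an upload-bound build must supply the frozen orig.
        raise ValueError("revision >= 2 signed build needs --orig FILE")
    return ("regenerate", "httrack_%s.orig.tar.gz" % upstream)
```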

Two guards close the gap the old code left implicit: the regenerate path
asserts the built tarball matches the changelog version (catching a
configure.ac/changelog skew), and the overlay step confirms the orig unpacks
to httrack-<ver>/ before dropping debian/ on top.

Validated end to end by reusing the official 3.49.8 orig to build 3.49.8-2:
the resulting .dsc pins the frozen orig's checksum byte for byte.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 15:44:12 +02:00
Xavier Roche
0a6eb73903 mkdeb: emit the orig website artifact on a Debian revision >= 2
The release-artifacts step signs and checksums httrack_<ver>.orig.tar.gz in
$outdir, but $outdir is populated by `dcmd cp` from the .changes, which lists
only the files in the upload. dpkg-genchanges omits the orig from a revision
>= 2 .changes (it is already in the archive), so the orig never reached
$outdir and `gpg --detach-sign` failed with "No such file or directory",
aborting a -2 (or later) release after the source package was already built.

Copy the orig from the build tree into $outdir before signing so the website
artifacts are produced regardless of the Debian revision. The upload is
unaffected: dput uploads the .changes-referenced files, not the extra orig.

CI didn't catch this because the deb job builds unsigned and the artifact
block is gated on a signing key.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 15:12:03 +02:00
Xavier Roche
fdb243e5a2 Merge pull request #406 from xroche/debian/libhttrack3-rename
debian: rename libhttrack2 to libhttrack3 to follow the SONAME
2026-06-20 15:04:12 +02:00
Xavier Roche
f8546e146d debian: drop the dead libhttrack-swf1.files and fix the overrides comment
Two packaging nits surfaced while reviewing the libhttrack3 rename, both
debian/-only:

- debian/libhttrack-swf1.files listed libhtsswf.so.1* but there is no
  libhttrack-swf1 package in debian/control and the swf module is no longer
  built (lib_LTLIBRARIES is just libhttrack/libhtsjava). dh_movefiles only
  consults built packages, so the list was dead. Remove it.

- libhttrack3.lintian-overrides claimed the ABI is tracked via "a strict
  =version dependency", but dh_makeshlibs --version-info emits the
  conservative (>= upstream-version) form, which is the correct choice for a
  soname-versioned library; a = ${binary:Version} shlibs dependency draws
  lintian's distant-prerequisite-in-shlibs. Correct the comment to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:59:00 +02:00
Xavier Roche
b7f602f2eb debian: rename libhttrack2 to libhttrack3 to follow the SONAME
The 3.49.8 ABI bump moved the soname to libhttrack.so.3, but the packaging
still globbed .so.2 in debian/libhttrack2.files, so the runtime libraries
matched nothing there and fell through into the catch-all httrack package;
libhttrack2 shipped no library (lintian package-name-doesnt-match-sonames).

Rename the binary package to libhttrack3, take over the misplaced libraries
from httrack and the old libhttrack2 via Breaks/Replaces, and switch the
.files globs to a .so.3* wildcard so a future soname bump no longer silently
misplaces the libraries. Ships as 3.49.8-2; new binary name goes through NEW.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:46:14 +02:00
Xavier Roche
550100b56a Merge pull request #405 from xroche/feature/mkdeb-sbuild
mkdeb: optional --sbuild clean-room build gate
2026-06-20 14:43:43 +02:00
Xavier Roche
33ddb27243 mk-sbuild-chroot: suggest a concrete usermod for the subuid range
Compute a start past every range already in /etc/subuid+subgid and print the
canonical sudo usermod --add-subuids/--add-subgids command, instead of a raw
file append the user has to adjust by hand.
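
A possible way to compute such a start, sketched in Python rather than the script's shell. The function names, the 100000 floor and the 65536-entry range size are assumptions; only the `usermod --add-subuids/--add-subgids` options come from the message.

```python
def next_free_subid(lines, floor=100000):
    """Return the first id past every 'name:start:count' range in lines."""
    end = floor
    for line in lines:
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue  # blank, comment, or malformed line
        try:
            start, count = int(parts[1]), int(parts[2])
        except ValueError:
            continue
        end = max(end, start + count)
    return end


def usermod_command(user, start, count=65536):
    last = start + count - 1
    return "sudo usermod --add-subuids %d-%d --add-subgids %d-%d %s" % (
        start, last, start, last, user)
```

Feeding it the concatenated lines of /etc/subuid and /etc/subgid yields one start that is free in both files.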

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:13:06 +02:00
Xavier Roche
4606dfbf66 mk-sbuild-chroot: require a subuid/subgid range up front
The unshare backend maps a whole UID range, not just the caller's, because the
base install creates system users. Without an /etc/subuid+subgid entry the
install crashes (dpkg SIGSEGV) instead of failing cleanly. Check for the range
before bootstrapping and point at the one-line fix; skip the check for root,
which uses mode=root.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:07:10 +02:00
Xavier Roche
a6f1b9a3dd mk-sbuild-chroot: only treat an active $chroot_mode line as configured
The idempotency guard matched chroot_mode.*unshare anywhere in ~/.sbuildrc,
including a commented-out line, so --write-sbuildrc would silently skip the
append and leave the unshare backend unconfigured. Anchor the match to an
active assignment.
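
The difference between the loose and the anchored match can be shown with a small Python sketch. The exact pattern used by the script is not quoted in this log, so the regular expression below is an assumption about what "an active assignment" looks like in ~/.sbuildrc.

```python
import re

# Only whitespace may precede the assignment on its line, so a leading '#' fails.
ACTIVE = re.compile(r"""^\s*\$chroot_mode\s*=\s*["']unshare["']""", re.MULTILINE)
LOOSE = re.compile(r"chroot_mode.*unshare")  # the old, too-permissive shape


def is_configured(sbuildrc_text):
    return ACTIVE.search(sbuildrc_text) is not None
```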

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 14:02:42 +02:00
Xavier Roche
fb35d6a0f1 tools: add mk-sbuild-chroot.sh to set up the --sbuild gate
The --sbuild gate needs an sbuild chroot, which was only documented as loose
commands. This adds a companion script that bootstraps one with the rootless
unshare backend (mmdebstrap into ~/.cache/sbuild/<dist>-<arch>.tar.zst, where
sbuild finds it by name), idempotent unless --force, optionally writing the
unshare mode into ~/.sbuildrc. mkdeb.sh's --sbuild help now points at it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 13:43:34 +02:00
Xavier Roche
8a270fec03 mkdeb: add an optional --sbuild clean-room build gate
With source-only uploads the archive's buildds are the first place the package
is built in a clean environment, so an undeclared Build-Depends or any FTBFS
only shows up after the upload. --sbuild rebuilds the freshly produced .dsc in a
minimal chroot holding only the declared Build-Depends, reproducing the buildd
environment; a failure aborts the release before the upload. It runs after the
source package is built and before the upstream-tarball release artifacts are
signed. Logs and the clean-built debs land in <outdir>/sbuild.

The distribution comes from the changelog (UNRELEASED falls back to unstable),
and the flag fails fast if sbuild isn't installed. Off by default; needs an
sbuild chroot for the target suite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 13:37:20 +02:00
Xavier Roche
0cbd5279f2 Merge pull request #404 from xroche/release/3.49.8
Curate the 3.49-8 release notes
2026-06-20 13:06:13 +02:00
Xavier Roche
05306ee4fd Curate the 3.49-8 release notes
Round out the 3.49-8 entry in history.txt and the debian changelog with the
user-facing work landed since 3.49-7: the HTTPS-proxy CONNECT tunnel, wider
srcset parsing, the crawler and parser fixes (CSS @import, xmlns, relative
paths, RFC 6265 cookies, doit.log reload), the parser and engine buffer-copy
security hardening, and brief summary lines for the API, build, CI and test
work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 13:02:51 +02:00
Xavier Roche
1d0fc0a566 Merge pull request #403 from xroche/chore/clang-format-separate-defs
Separate definition blocks in the public headers
2026-06-20 12:56:23 +02:00
Xavier Roche
a4452592b4 Separate definition blocks and canonicalize the public headers
Set SeparateDefinitionBlocks: Always in .clang-format so clang-format keeps
a blank line between adjacent definitions, then reformat the installed
(DevIncludes) headers in full. Several of them packed struct/typedef/macro
definitions with no separation and carried non-canonical spacing (char*,
__attribute__ ((x)), padded inner parens), which made them hard to read;
this brings them to the repo's clang-format-19 canonical form and inserts
the separating blank lines.

Headers only, no semantic change: out-of-tree build is clean and make check
passes (21 pass, 7 network skip, 0 fail). htsconfig.h is UTF-8 and its
French comments survive byte-for-byte (clang-format only reflowed them to 80
columns). The new option also governs future touched-line formatting of the
engine sources.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 12:52:19 +02:00
Xavier Roche
62c2364b59 Merge pull request #402 from xroche/chore/lint-all-shell-scripts
Lint every shell script with shfmt and shellcheck
2026-06-20 12:42:19 +02:00
Xavier Roche
fe7041ddbf Address review: keep empty-PATH parity, fold the CI script list
Review of the array refactor flagged one behaviour divergence: splitting
PATH with `IFS=: read -ra` keeps empty fields (from doubled or leading
colons) as "" elements, where the old `echo $PATH | tr : ' '` word-split
dropped them, so the search loop would probe /htsserver. Skip the empty
fields to restore exact parity.
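
The same pitfall exists in most languages that split on a separator. A Python analogue (the scripts themselves are bash; `search_dirs` and `candidates` are hypothetical names):

```python
def search_dirs(path):
    """Split a PATH-style string, skipping empty fields."""
    # "a::b".split(":") yields ["a", "", "b"]; the empty field would
    # otherwise turn into a probe of "/htsserver" below.
    return [d for d in path.split(":") if d]


def candidates(path, name="htsserver"):
    return [d + "/" + name for d in search_dirs(path)]
```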

Also reflow the CI SHELL_SCRIPTS list as a folded block scalar, one
entry per line and sorted, so it reads cleanly; the folded value is the
same space-separated string.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 12:39:31 +02:00
Xavier Roche
f5543df1af ci: lint every shell script with shellcheck and shfmt
The lint job only covered a handful of scripts; bootstrap, build.sh, the
generators, webhttrack, the CGI search helper and the crawl/run-all test
harnesses went unchecked, and shfmt ran on three files. Now both linters
run over the whole tracked shell tree, listed once in a job-level env var
so the two steps stay in sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:37:09 +02:00
Xavier Roche
fee30aa95d Make every shell script shellcheck-clean
Fix the shellcheck findings the shfmt pass left behind, all proven
behaviour-preserving:

- Quote single-value expansions, drop the redundant ${} in arithmetic,
  add read -r, and use printf '%s' instead of variables in format
  strings, across the generators, crawl-test.sh, run-all-tests.sh and
  search.sh.
- crawl-test.sh / webhttrack: turn the deliberately word-split search
  lists into bash arrays (space-safe, no scattered disables) and replace
  the numeric trap signal lists with names, dropping the un-trappable
  KILL/STOP that bash silently ignored anyway.
- search.sh: drop the bogus \" escapes that made grep search for a
  literal-quoted pattern.

The generators are exercised by hand and ship their committed output
(htscodepages.h, htsentities.h); a differential run on synthetic input
confirms byte-identical output before and after. crawl-test.sh and
webhttrack were run end to end against a local server / a faked install,
the latter also proving the array search now survives spaces in paths.
SC2153/SC2120 false positives carry a scoped disable with a reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:35:55 +02:00
Xavier Roche
f9f4700ee1 Reformat every shell script with shfmt -i 4
Mechanical pass: run shfmt -i 4 over the whole tracked shell tree (the
test harness .test files, the regen generators, webhttrack, the CGI
search helper, and the build/dist scripts) so they share one style.
shfmt also normalised backticks to $(...) and $[..] to $((..)).

No behaviour change: arithmetic is preserved exactly, non-ASCII bytes
are untouched, and the full make check suite still passes. The tab
indented .test files become 4-space indented, hence the wide diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:24:01 +02:00
Xavier Roche
f030fa21e3 Merge pull request #401 from xroche/fix/relative-path-dotdot-137-162
Test the relative-link engine; collapse ../ in file:// URLs
2026-06-20 11:15:53 +02:00
Xavier Roche
bdd1c1bc2c Test the relative-link engine; collapse ../ in file:// URLs
The ../-handling tickets #137 (embedded ../ in a URL) and #162 (cross-host
"too many ../") do not reproduce on master or the released 3.49.x: the engine
has resolved embedded, cross-host, out-of-scope and above-root ../ correctly
since the 2012 import, and the released binary behaves identically. #137's
actual breakage was a JS-generated iframe URL (httrack can't rewrite
dynamically-built links); #162 is a long-gone Windows path quirk.

The area was nearly untested, though, despite feeding both link rewriting and
crawl-scope decisions: two trivial lienrelatif asserts, none for
ident_url_relatif. Add a wide regression net via two hidden debug probes
(-#l lienrelatif, -#i ident_url_relatif, mirroring -#1 fil_simplifie) driving
tens of cases in tests/01_engine-relative.test (embedded/cross-host/sibling/
ancestor/above-root ../, query stripping, scheme handling), plus the missing
fil_simplifie edge cases (absolute paths, root clamp, query freeze) in
01_engine-simplify.test. Expected values are computed by hand, not echoed.

While covering it, fixed one real gap: the file:// branch of
ident_url_absolute skipped the fil_simplifie its http sibling runs, so file://
URLs kept their ../ in adrfil->fil while the save path was already collapsed
(htsname.c:1343). Collapsing it matches the other schemes, contains traversal
at the file:// root, and dedups a/../b against b.
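
For readers unfamiliar with the collapse step, here is a minimal Python sketch of dot-segment removal with a root clamp and an untouched query string. It only illustrates the behavior named above; it is not a port of fil_simplifie, and its handling of edge cases (trailing slashes, doubled slashes) is an assumption.

```python
def collapse_dotdot(path):
    """Collapse ./ and ../ in an absolute path, never climbing above the root."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    query = ""
    if "?" in path:
        path, rest = path.split("?", 1)
        query = "?" + rest  # the query is kept as-is, even if it contains ../
    out = []
    for seg in path.split("/")[1:]:
        if seg == "..":
            if out:
                out.pop()  # at the root there is nothing to pop: clamp
        elif seg != ".":
            out.append(seg)
    return "/" + "/".join(out) + query
```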

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:14:28 +02:00
Xavier Roche
56665a268f Merge pull request #400 from xroche/fix/css-url-paren-163
Encode parens in rewritten CSS url() so the value isn't truncated (#163)
2026-06-20 10:02:32 +02:00
Xavier Roche
2e948b9acd htsparse: percent-encode parens in rewritten CSS url() (#163)
A source url(...) whose target encodes '(' ')' as %28/%29 was rewritten
with literal parens, because they are RFC2396 "mark" characters that the
URI escaper (escape_uri_utf, mode 30) leaves alone. In an unquoted CSS
url(...) the literal ')' closes the token early, so the browser mis-parses
the value and drops the background image.

Re-escape '(' and ')' back to %28/%29 when emitting the link, gated on the
url() context (ending_p == ')'). The UA decodes them to the saved-on-disk
name, so the reference still resolves. Quoted url("...") and ordinary HTML
attributes keep their parens, matching prior behavior.
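
In Python terms the gate looks like this. The function name is hypothetical; `ending` plays the role of `ending_p` in the message.

```python
def escape_for_context(link, ending):
    """Re-encode parens only when the link closes with ')' (unquoted CSS url())."""
    if ending == ")":
        return link.replace("(", "%28").replace(")", "%29")
    return link  # quoted url("...") and HTML attributes keep literal parens
```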

Test in 01_engine-parse.test crawls a CSS fixture whose url() references a
%20%28...%29 name and asserts the rewrite keeps the parens encoded;
negative control confirmed (literal-paren output fails it).

Closes #163

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 10:01:17 +02:00
Xavier Roche
cae11499f1 Merge pull request #399 from xroche/fix/js-string-falsepos-218
htsparse: don't treat XHR.open's method argument as a URL (#218)
2026-06-19 20:36:26 +02:00
Xavier Roche
02c7f4ebf6 htsparse: don't treat XHR.open's method argument as a URL (#218)
The JavaScript URL detector matched `.open(` for window.open("url",...)
and captured the first argument as a link. XMLHttpRequest.open(method,
url) puts the HTTP method first, so `xhr.open("GET", "ajax_info.txt")`
turned "GET" into a bogus link, rewritten to "GET.html" on a live server.

Reject a first argument that is exactly an HTTP method, mirroring the
existing ensure_not_mime guard. window.open(url) is unaffected; the real
XHR url (the second argument) is still picked up by the dirty parser.
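
A rough Python model of the guard. The regular expression, the method list and the case-sensitive comparison are assumptions made for illustration; the real detector is hand-written C.

```python
import re

# Hypothetical list of method names; the engine's own list is not shown here.
HTTP_METHODS = {"GET", "POST", "HEAD", "PUT", "DELETE", "OPTIONS", "PATCH"}

OPEN_CALL = re.compile(r"""\.open\(\s*(["'])(.*?)\1""")


def links_from_open_calls(js):
    """Return first arguments of .open(...) calls that are not HTTP methods."""
    links = []
    for m in OPEN_CALL.finditer(js):
        first_arg = m.group(2)
        if first_arg in HTTP_METHODS:
            continue  # XMLHttpRequest.open(method, url): not a link
        links.append(first_arg)
    return links
```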

Closes #218

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 20:27:04 +02:00
Xavier Roche
9070b44a70 Merge pull request #398 from xroche/fix/html-underflow-396
htsparse: fix buffer underflow reading *(html-1) at offset 0 (#396)
2026-06-19 19:55:40 +02:00
Xavier Roche
799c045061 htsparse: don't read *(html-1) before the parse buffer (#396)
The link detector's word-boundary guards dereference *(html-1) to check
the byte preceding a matched token. When the token sits at the very start
of the parse buffer (html == r->adr), that reads one byte before the
allocation: a heap-buffer-overflow under ASan, silent on a normal build.
A stylesheet beginning with a url() token is enough to hit it.

Route the three reachable guards (url(), location=, the makeindex /title
check) through html_prevc(), which returns a space sentinel at the buffer
start. Space is the right value for these tests: a token at offset 0 is at
a word boundary, so it stays a valid match. The other *(html-1) sites only
run after html has advanced past an opening tag or quote.
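
The sentinel idea translates to any language with index arithmetic. In the Python analogue below, an unguarded `buf[i - 1]` at `i == 0` would silently wrap to the last character instead of reading out of bounds, which is a different failure but the same class of mistake. `prev_char` and `url_token_at` are hypothetical, and the alphanumeric boundary test is an assumption about what the guards check.

```python
def prev_char(buf, i):
    """Character before index i, or a space sentinel at the buffer start."""
    return " " if i == 0 else buf[i - 1]


def url_token_at(buf, i):
    """True when 'url(' starts at i and is not glued to a preceding word."""
    return buf.startswith("url(", i) and not prev_char(buf, i).isalnum()
```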

An offset-0 url() fixture in 01_engine-parse.test covers it; without
the fix it aborts at htsparse.c:1386 under the CI sanitizer job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:44:25 +02:00
Xavier Roche
fb1ee3bf2e Merge pull request #397 from xroche/fix/css-import-94
CSS @import: capture URLs that carry a media/supports/layer condition (#94)
2026-06-19 19:30:21 +02:00
Xavier Roche
6a08ca7d39 htsparse: bound the URL-end scan against a missing closing delimiter
Reviewing the @import change, ASan flagged a pre-existing heap overflow:
when a quoted/parenthesized link token has no closing delimiter before the
buffer ends (truncated input such as `@import "x`, `@import "`, `url("x`),
the scan stops at the terminating NUL, then `c += ndelim` steps past it and
`while (*c == ' ')` / the terminator test read out of bounds. Such input
aborts under ASan on master.

Skip the URL-end scan and capture when no closing delimiter was found
(`*c == '\0'` right after the scan); c never advances past the NUL.
Well-formed tokens are unaffected.
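
The shape of the fix, sketched in Python where a missing delimiter shows up as a failed search rather than a NUL byte. `scan_quoted` is a hypothetical helper, not the parser's code.

```python
def scan_quoted(text, start):
    """text[start] is the opening quote; return the enclosed string, or None
    when the input ends before a closing quote is found."""
    quote = text[start]
    end = text.find(quote, start + 1)
    if end == -1:
        return None  # truncated input: skip the capture instead of scanning on
    return text[start + 1:end]
```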

01_engine-parse.test gains a truncated-@import fixture (the valid sibling
import is still captured, the unterminated one is not) that trips the
overflow under the CI ASan job, plus a check that an @import's trailing
media/supports/layer condition survives the rewrite verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:25:39 +02:00
Xavier Roche
a8b491e509 htsparse: capture conditional CSS @import URLs (#94)
A bare-string @import carrying a media/supports/layer condition, e.g.
`@import "theme.css" screen;`, was dropped. The detector required the closing
quote to be immediately followed by the statement terminator, so the trailing
condition aborted the capture. The `url(...)` form already worked because it
terminates at the paren.

Two coupled fixes in the inscript/CSS detector:
- accept a whitespace-separated trailing condition after a quoted @import URL;
- bound the captured URL at its last content char (b) instead of recomputing
  from the terminator. The old `c -= (ndelim + 1)` mishandled spaces skipped
  before the terminator, leaving the closing quote inside the range so the
  bogus-link guard aborted. That also silently broke `foo="url" ;` (a space
  before the semicolon) for every quoted detection, not only @import.
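
As a compact illustration of the accepted syntax, a regular-expression sketch in Python. The engine does not use a regex; `IMPORT_RE` and `css_imports` are hypothetical and cover only the quoted-string form.

```python
import re

# Quoted URL, optional whitespace-separated condition, then the terminator.
IMPORT_RE = re.compile(r'@import\s+(["\'])(?P<url>[^"\']*)\1\s*(?P<cond>[^;]*);')


def css_imports(css):
    """Return (url, condition) pairs for quoted @import statements."""
    return [(m["url"], m["cond"].strip()) for m in IMPORT_RE.finditer(css)]
```

The URL range ends at its last content character, so a space before the semicolon or a trailing condition does not leak into the capture.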

01_engine-parse.test gains a CSS @import section that crawls a .css directly;
the conditioned cases are negative controls that fail without the fix.

Closes #94

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:46:31 +02:00
Xavier Roche
a8e4bb3b81 Merge pull request #395 from xroche/fix/xmlns-false-links-191
Don't crawl xmlns namespace declarations
2026-06-19 18:28:23 +02:00
Xavier Roche
0145ec37a3 htsparse: don't crawl xmlns namespace declarations (#191)
The "dirty parsing" heuristic accepts any tag attribute whose value looks
like a URL unless the attribute is on the no-detect list. xmlns and
xmlns:prefix declarations carry namespace URIs (xmlns:og="http://ogp.me/ns#",
etc.) that are not resources, so httrack queued and fetched them, stalling
the crawl on unrelated spec URLs. Reject xmlns/xmlns:prefix where the
no-detect list is already consulted.
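
The rejection itself is a one-line predicate. A Python sketch, with a hypothetical function name and an assumed case-insensitive comparison:

```python
def is_namespace_attr(name):
    """True for xmlns and xmlns:prefix attributes, whose values are not resources."""
    name = name.lower()
    return name == "xmlns" or name.startswith("xmlns:")
```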

01_engine-parse.test grows a fixture with each form (default and prefixed) as
the last attribute of its element, since the heuristic only inspects an
attribute whose value is immediately followed by '>'; the targets are local
file:// gifs so a regression actually downloads them (verified: reverting the
guard fetches all three).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:24:55 +02:00
Xavier Roche
a80fab38ba Merge pull request #394 from xroche/fix/proxy-https-connect-85
Tunnel https through the proxy via CONNECT (#85)
2026-06-19 18:03:31 +02:00
Xavier Roche
c52a524a63 htslib: bound the proxy CONNECT response; harden + cover review findings
Follow-up to the CONNECT-tunnel change, from an adversarial review (the proxy
response is hostile input: a malicious or MITM proxy controls every byte).

- Bound the response read so a proxy cannot stall the single-threaded back_wait
  crawl: proxy_getline now fails on an over-long line instead of consuming it
  forever, the header drain is capped at 64 lines, and the send loop gives up
  rather than spin against a socket that reports writable but never accepts.
- Size `authority` to hold any url_adr host (HTS_URLMAXSIZE*2) so an oversized
  hostname can't trip the abort-on-overflow buff helpers; grow `req` to match.
- Reject control bytes in the CONNECT authority as a local backstop; today the
  CR/LF defense lives entirely upstream (escape_remove_control / header-line
  splitting).
- Test: the origin now records the headers it receives, and the test asserts
  Proxy-Authorization never reaches the origin through the tunnel (the previous
  assertions couldn't see a leak). Added a flooding-proxy scenario that proves
  the crawl terminates instead of hanging on an unbounded response.
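
The bounded read can be modeled in a few lines of Python. Only the 64-line cap comes from the message; the per-line limit `MAX_LINE`, the function names and the error type are assumptions.

```python
import io

MAX_LINE = 1024        # hypothetical per-line cap
MAX_HEADER_LINES = 64  # cap on the header drain, as stated above


def read_line(stream, limit=MAX_LINE):
    """Read one line, failing on an over-long or unterminated one."""
    line = stream.readline(limit + 1)
    if len(line) > limit or not line.endswith(b"\n"):
        raise ValueError("over-long or truncated line")
    return line.rstrip(b"\r\n")


def read_connect_response(stream):
    """Return the status code once the header block has been drained."""
    parts = read_line(stream).split(b" ", 2)
    if len(parts) < 2 or not parts[0].startswith(b"HTTP/"):
        raise ValueError("not an HTTP status line")
    code = int(parts[1])
    for _ in range(MAX_HEADER_LINES):
        if read_line(stream) == b"":
            return code  # blank line ends the headers
    raise ValueError("too many header lines")
```

Every path either returns or raises after a bounded amount of input, so a flooding proxy cannot hold the reader forever.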

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 09:52:10 +02:00
Xavier Roche
1907621d37 htslib: tunnel https through the proxy via CONNECT (#85)
httrack opened https connections straight to the origin even when a proxy
was configured, so --proxy was silently ignored for https and the crawler
used the real IP. http_xfopen bypassed the proxy for any https:// URL,
because the absolute-URI proxy form it uses for http cannot carry https.

Connect to the proxy instead and, once the TCP connection is up, open an
HTTP CONNECT tunnel (http_proxy_tunnel) before the TLS handshake, so TLS
runs end-to-end with the origin. Proxy credentials now ride the CONNECT
request rather than the tunneled GET, where they would leak to the origin.
The exchange is a bounded blocking read inside the back_wait connect path:
no new async state, no struct/ABI change (the helpers stay visibility-hidden).
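
What rides the CONNECT request can be sketched as below. This is not http_proxy_tunnel: the function name, the HTTP/1.1 version string, the Host line and the Basic scheme are assumptions chosen for a self-contained example.

```python
import base64


def connect_request(host, port, proxy_user=None, proxy_pass=None):
    """Build the bytes of a CONNECT request for host:port."""
    authority = "%s:%d" % (host, port)
    lines = ["CONNECT %s HTTP/1.1" % authority, "Host: %s" % authority]
    if proxy_user is not None:
        raw = ("%s:%s" % (proxy_user, proxy_pass or "")).encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        # Credentials go to the proxy here, never inside the tunneled GET.
        lines.append("Proxy-Authorization: Basic %s" % token)
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

After a 2xx reply the TLS handshake starts on the same socket, so the proxy only relays encrypted bytes.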

Verified end-to-end by 13_crawl_proxy_https.test: it crawls a local
self-signed https origin through a logging CONNECT proxy and asserts the
proxy saw the CONNECT and that credentials ride it. The assertion fails on
the pre-fix bypass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:43:56 +02:00
Xavier Roche
3b2d7afdaa Merge pull request #393 from xroche/fix/empty-footer-doitlog-106
Keep empty quoted args when reloading doit.log (#106)
2026-06-19 08:13:19 +02:00
Xavier Roche
6ee539619e htscoremain: keep empty quoted args when reloading doit.log (#106)
An empty footer (-%F "") is written to hts-cache/doit.log correctly as the
two-character token "", and next_token() unquotes it back to an empty string.
But the doit.log reload loop only re-inserted a token when strnotempty(lastp),
which dropped the empty one. With its argument gone, -%F absorbed the following
token (or had none), so a no-url --continue/--update reprise misparsed and
failed.

Track whether the token started with a quote (before next_token() strips it in
place) and keep it even when empty, so "" survives the round-trip. Whitespace
gaps still produce no token, so spacing behavior is unchanged.

01_engine-doitlog.test gains a scenario that mirrors with -%F "" -r2, then on
the no-url reprise checks the regenerated doit.log still round-trips the empty
token -- probing the reader's rebuilt argv, not just that the reprise didn't
crash. The trailing -r2 makes a dropped-token bug visible (it shifts into -%F's
slot and panics) rather than a harmless run off the end of argv. Reverting only
the guard makes the scenario fail (reprise exits 255).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 08:09:57 +02:00
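
The round-trip problem fixed in 6ee539619e above can be sketched with a small stand-alone tokenizer. This is not httrack's next_token() or its reload loop: split_args and its parameters are hypothetical, and the quoting rules are simplified to plain double quotes. It shows the idea of the fix, namely recording whether a token started with a quote before the quotes are stripped, and keeping such a token even when it ends up empty.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tokenizer (not httrack's next_token): split 'line' in place
   into at most 'max' tokens, honoring double quotes. Returns the number of
   tokens stored in 'argv'. */
static size_t split_args(char *line, char **argv, size_t max) {
  size_t argc = 0;
  char *p = line;

  assert(line != NULL && argv != NULL);
  while (argc < max) {
    char *out;
    int in_quotes = 0;
    int started_quoted;

    /* Whitespace gaps never produce a token. */
    while (isspace((unsigned char) *p))
      p++;
    if (*p == '\0')
      break;
    /* Record this before the quotes are stripped below. */
    started_quoted = (*p == '"');
    argv[argc] = out = p;
    /* Copy the token over itself, dropping the quote characters. */
    while (*p != '\0' && (in_quotes || !isspace((unsigned char) *p))) {
      if (*p == '"')
        in_quotes = !in_quotes;
      else
        *out++ = *p;
      p++;
    }
    if (*p != '\0')
      p++; /* step over the separator before terminating the token */
    *out = '\0';
    /* A non-empty-only test here would drop ""; a token that started with
       a quote is kept even when it is empty. */
    if (started_quoted || argv[argc][0] != '\0')
      argc++;
  }
  return argc;
}
```

With the line `-%F "" -r2`, the sketch yields three tokens, the second being the empty string, so -r2 is not shifted into the slot of the -%F argument.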
Xavier Roche
fb098b27b4 Merge pull request #392 from xroche/fix/cookie-rfc6265-151
Drop $Version/$Path from the request Cookie header (#151)
2026-06-18 22:42:47 +02:00
Xavier Roche
5f6a3fb917 htslib: drop $Version/$Path from request Cookie header (#151)
The request "Cookie:" header was built in the obsolete RFC 2965 style,
emitting "$Version=1" before the first cookie and a "$Path=..." attribute
after every value:

  Cookie: $Version=1; name=value; $Path=/; has_js=1; $Path=/

Servers expecting RFC 6265 treat $Version and $Path as stray cookies and
reject or misread the request. Emit bare name=value pairs joined by "; ":

  Cookie: name=value; has_js=1

The cookie loop is factored out of http_sendhead into append_cookie_header
(same logic, same buffer), with a thin http_cookie_header_selftest wrapper
so the exact code path can be unit-tested. A new hidden "-#Q" subcommand
builds the header for two same-domain cookies plus one on a different
domain (which must be filtered out) and checks the output is the clean
RFC 6265 form with no $Version/$Path and no cross-domain leak; driven by
tests/01_engine-cookies.test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 22:12:28 +02:00
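
To make the header change in 5f6a3fb917 above concrete, the following sketch builds a request Cookie header in the RFC 6265 form. It is not append_cookie_header: the struct cookie layout, the function name build_cookie_header and the exact-match domain filter are assumptions for illustration (real cookie matching also considers subdomains and paths).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cookie record; httrack stores its cookies differently. */
struct cookie {
  const char *domain;
  const char *name;
  const char *value;
};

/* Build an RFC 6265 style request header: bare name=value pairs joined by
   "; ", with no $Version and no $Path. Only cookies whose domain equals
   'host' are sent (a deliberate simplification). Returns the header
   length, 0 when no cookie matches, or -1 if 'buf' is too small. */
static int build_cookie_header(char *buf, size_t size, const char *host,
                               const struct cookie *jar, size_t count) {
  size_t len = 0, i;
  int sent = 0;

  assert(buf != NULL && size > 0 && host != NULL);
  buf[0] = '\0';
  for (i = 0; i < count; i++) {
    int n;

    if (strcmp(jar[i].domain, host) != 0)
      continue; /* never leak a cookie to another domain */
    n = snprintf(buf + len, size - len, "%s%s=%s", sent ? "; " : "Cookie: ",
                 jar[i].name, jar[i].value);
    if (n < 0 || (size_t) n >= size - len)
      return -1;
    len += (size_t) n;
    sent = 1;
  }
  return (int) len;
}
```

For two example.com cookies and one cookie on another domain, this produces `Cookie: name=value; has_js=1`, the same shape as the second header shown in the commit message.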
Xavier Roche
f9e676dbe3 Merge pull request #391 from xroche/feature/api-enum-callsites-savename83
htsopt: name the savename_83 enum and finish the call-site constant adoption
2026-06-18 21:43:34 +02:00
Xavier Roche
1b440c44b5 htsopt: name savename_83 enum and adopt enum constants at call sites
Type opt->savename_83 as a new hts_savename_83 enum (LONG/DOS/ISO9660 =
0/1/2) and replace the remaining magic-number literals for the already-
typed verbosedisplay and savename_delayed fields with their named enum
constants across the engine.

Behavior-preserving: every constant equals the literal it replaces, and a
C enum is int-sized, so struct layout is unchanged (sizeof(httrackp) and
offsetof(savename_83) are identical to origin/master, no soname bump). The
-L option block is deliberately reflowed to clang-format style, which is
what made the savename_83 retype tractable. Bitmask fields (travel/seeker/
getmode/parsejava/hostcontrol) intentionally stay int with named bit enums,
per the existing flags-as-enum split.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 21:03:33 +02:00
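
The layout claim in 1b440c44b5 above (same sizeof, same offsetof after retyping an int field as an enum) can be checked with a sketch like the one below. The structs are cut-down stand-ins, not httrackp, and every identifier is made up for the example. Strictly speaking the size of an enum type is implementation-defined in C; the check holds on the common ABIs where GCC and Clang give such an enum the size of int, which is the situation the commit describes.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative enum mirroring the LONG/DOS/ISO9660 = 0/1/2 values named in
   the commit; the identifiers are invented for this sketch. */
typedef enum {
  DEMO_SAVENAME_83_LONG = 0,
  DEMO_SAVENAME_83_DOS = 1,
  DEMO_SAVENAME_83_ISO9660 = 2
} demo_savename_83;

/* Two cut-down option structs (not httrackp): same fields, but the middle
   one is an int before the change and the enum after it. */
struct demo_opt_before {
  char flag;
  int savename_83;
  long tail;
};

struct demo_opt_after {
  char flag;
  demo_savename_83 savename_83;
  long tail;
};

/* Returns 1 when retyping the field left the struct size and the field
   offsets untouched, 0 otherwise. */
static int demo_layout_unchanged(void) {
  /* Every constant equals the literal it replaces. */
  assert(DEMO_SAVENAME_83_LONG == 0 && DEMO_SAVENAME_83_DOS == 1 &&
         DEMO_SAVENAME_83_ISO9660 == 2);
  return sizeof(struct demo_opt_before) == sizeof(struct demo_opt_after) &&
         offsetof(struct demo_opt_before, savename_83) ==
             offsetof(struct demo_opt_after, savename_83) &&
         offsetof(struct demo_opt_before, tail) ==
             offsetof(struct demo_opt_after, tail);
}
```

A build that used something like -fshort-enums would make the enum one byte wide and this check would then report a layout change, which is why the result is tied to the compiler and ABI in use.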
Xavier Roche
ac6dd1a570 Merge pull request #390 from xroche/fix/copy-htsopt-unsigned-enum-guards
copy_htsopt silently drops boolean option fields
2026-06-18 20:46:00 +02:00
Xavier Roche
4549ec3695 htsopt: fix copy_htsopt dropping unsigned-enum fields
copy_htsopt() copies each field only when it is not the "-1 means unset"
sentinel, written as `if (from->X > -1)`. The boolean/enum option
migrations turned nearlink, errpage and parseall into hts_boolean, which
GCC backs with unsigned int. `unsigned > -1` is always false, so those
three fields silently stopped being copied.

Cast to int at the guard to restore the signed sentinel test. Add a
hidden `httrack -#9` self-test that drives copy_htsopt over distinct
boolean values plus an int positive control (tests/01_engine-copyopt.test);
it fails on the unfixed guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 20:25:42 +02:00
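
The guard bug fixed in 4549ec3695 above comes from C's usual arithmetic conversions and can be reproduced in isolation. In this sketch the field is modeled as a plain unsigned int, standing in for an hts_boolean that the compiler backs with an unsigned type as the commit describes; the function names are hypothetical.

```c
#include <assert.h> /* used by the checks in the usage note */

/* Pre-fix guard. With an unsigned operand, -1 is converted to UINT_MAX
   before the comparison, so the test can never be true and the field is
   never copied. */
static int is_set_unfixed(unsigned int v) { return v > -1; }

/* Fixed guard: casting to int first restores the signed "-1 means unset"
   sentinel test. */
static int is_set_fixed(unsigned int v) { return (int) v > -1; }
```

Calling is_set_unfixed(0) and is_set_unfixed(1) both yield 0, while is_set_fixed(0) and is_set_fixed(1) both yield 1. Compilers typically flag the unfixed form with a signed/unsigned comparison or always-false warning when such warnings are enabled.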
Xavier Roche
ac56c31b24 Merge pull request #389 from xroche/fix/travel-test-all-enum
htsopt: fold HTS_TRAVEL_TEST_ALL into the hts_travel_scope enum
2026-06-18 18:40:33 +02:00
Xavier Roche
ee6beeeb7d htsopt: fold HTS_TRAVEL_TEST_ALL into the hts_travel_scope enum
The -t "test all" flag was a stray #define sitting next to the scope
enum; make it an enum constant so the named travel values live in one
place. The mask (HTS_TRAVEL_SCOPE_MASK) stays a #define: it selects the
scope out of opt->travel, it is not a member of the value set.

Name and value (1 << 8) are unchanged, so every use site compiles
identically and opt->travel stays plain int. No ABI change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 18:29:23 +02:00
Xavier Roche
6788bda380 Merge pull request #388 from xroche/feature/api-enum-fields-2
htsopt: type debug, savename_delayed and verbosedisplay as named enums
2026-06-18 18:25:44 +02:00
Xavier Roche
7ead8d595e htsopt: type three more option fields as named enums
debug becomes hts_log_type (it already stored LOG_* values; the int
declaration was a latent type hole), savename_delayed becomes a new
hts_savename_delayed { NONE, SOFT, HARD }, and verbosedisplay becomes a
new hts_verbosedisplay { NONE, SIMPLE, FULL }. hostcontrol stays int but
its bits are now named by a new hts_hostcontrol flags enum, matching the
existing getmode/seeker/travel/htsparsejava_flags pattern.

A C enum is int-sized, so struct layout, field offsets and
sizeof(httrackp) are unchanged: no ABI break, no soname bump. The three
sscanf("%d", ...) sites that fill these fields now write through an int*
(size-identical) to keep the format type exact.

These enums are unsigned-backed (all enumerators non-negative), so the
non-negative debug comparisons (debug < level, debug > LOG_INFO, etc.)
now compile to unsigned jumps. debug is never negative, never sscanf'd
and never tested against a negative bound, so the result is unchanged;
disassembly is otherwise byte-identical bar instruction scheduling.

savename_83 is left as int on purpose: its sscanf sits in the -L parser
block whose old indentation does not round-trip through clang-format.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 18:11:19 +02:00
Xavier Roche
93f502990c Merge pull request #387 from xroche/feature/api-bool-returns
Return hts_boolean from the yes/no library functions
2026-06-18 17:38:48 +02:00
Xavier Roche
0f4b2596b2 htslib: return hts_boolean from the yes/no library functions
The exported API had many functions returning int where the int is really a
yes/no answer. Type the 14 genuinely-boolean ones as hts_boolean
(catch_url, dir_exists, is_dyntype, may_unknown, hts_findnext,
hts_findisdir/isfile/issystem, hts_has_stopped, hts_addurl, hts_resetaddurl,
hts_log, get_httptype_sized, guess_httptype_sized) and the three boolean int
parameters likewise (get_httptype_sized's flag, unescape_http_unharm's no_high,
hts_request_stop's force).

hts_boolean moves from htsopt.h to htsglobal.h so the library header, which only
forward-declares httrackp and does not include htsopt.h, can see the type.

The audit deliberately left alone the functions whose name suggests a boolean
but whose value is not 0/1: hts_is_testing returns 0..5, hts_is_exiting and
is_knowntype/is_userknowntype are tri-state, structcheck and the *_utf8 wrappers
are POSIX 0/-1, hts_findgetsize is a size, hts_main is an exit code, and
copy_htsopt returns 0 for success (a bool would read backwards). hts_setpause
and hts_is_parsing keep int params because they gate on '>= 0', not 0/1.

Not an ABI break: int -> int-sized enum is the same calling convention for both
return values (eax) and parameters, and enum<->int is implicit for callers, so
already-compiled consumers keep working. Verified by comparing per-object
disassembly against master: 39 of 45 objects byte-identical, htslib differs only
in __LINE__ immediates, and the five caller/definer objects differ only in
register allocation and return-block merging (no control-flow or value change).
make check passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 09:19:36 +02:00
Xavier Roche
4a676bb5e1 Merge pull request #386 from xroche/feature/api-boolean-enum
Type the boolean option fields as a named enum
2026-06-18 09:04:14 +02:00
78 changed files with 3574 additions and 1378 deletions

View File

@@ -16,6 +16,7 @@ BasedOnStyle: LLVM
 SpaceAfterCStyleCast: true # "(int) x", overwhelmingly dominant (542 vs 7)
 SortIncludes: false # C include order can be significant; never reorder
 IncludeBlocks: Preserve # do not merge/reflow include groups
+SeparateDefinitionBlocks: Always # blank line between definitions (readability)
 # Stated explicitly for robustness against base-style drift (these match LLVM):
 IndentWidth: 2

5
.flake8 Normal file
View File

@@ -0,0 +1,5 @@
+[flake8]
+# Match black's formatting so the two tools don't fight.
+max-line-length = 88
+# E203/W503 conflict with black's slice and line-break style.
+extend-ignore = E203, W503

View File

@@ -227,7 +227,8 @@ jobs:
 # Validate the Debian packaging via the same script maintainers release with.
 # One amd64/gcc run is enough: packaging (control/rules/manifest/lintian/quilt
 # source build) is arch- and compiler-independent, and the build matrix above
-# already covers compile portability. lintian runs with --fail-on=error.
+# already covers compile portability. mkdeb.sh runs lintian as an explicit gate
+# (debuild does not propagate lintian's exit) with --fail-on=error,warning.
 deb:
 name: deb package (lintian)
 runs-on: ubuntu-24.04
@@ -320,6 +321,21 @@ jobs:
 lint:
 name: lint (shellcheck, shfmt)
 runs-on: ubuntu-24.04
+# Every tracked shell script; the globs expand at run time. Kept here so the
+# shellcheck and shfmt steps below cannot drift apart.
+env:
+SHELL_SCRIPTS: >-
+.githooks/pre-commit
+bootstrap
+build.sh
+html/div/search.sh
+man/makeman.sh
+src/htsbasiccharsets.sh
+src/htsentities.sh
+src/webhttrack
+tests/*.sh
+tests/*.test
+tools/mkdeb.sh
 steps:
 - uses: actions/checkout@v6
@@ -332,12 +348,11 @@ jobs:
 sudo apt-get install -y --no-install-recommends shellcheck shfmt
 shfmt --version
-# Lint the scripts we maintain; the legacy scripts are a separate cleanup.
 - name: shellcheck
-run: shellcheck man/makeman.sh tools/mkdeb.sh .githooks/pre-commit tests/*.test tests/check-network.sh
+run: shellcheck $SHELL_SCRIPTS
 - name: shfmt
-run: shfmt -d -i 4 man/makeman.sh tools/mkdeb.sh .githooks/pre-commit
+run: shfmt -d -i 4 $SHELL_SCRIPTS
 # Check clang-format on CHANGED LINES ONLY. The engine predates clang-format
 # (it was shaped by an old Visual Studio formatter) and does not round-trip,

View File

@@ -29,9 +29,9 @@ AC_CONFIG_SRCDIR(src/httrack.c)
 AC_CONFIG_MACRO_DIR([m4])
 AC_CONFIG_HEADERS(config.h)
 AM_INIT_AUTOMAKE([subdir-objects])
-# 3:0:0: htsblk layout changed (contenttype/charset/contentencoding widened to
-# 128), an incompatible ABI break, so bump current and reset revision/age.
-VERSION_INFO="3:0:0"
+# 4:0:0: htsblk gained the contenttype_given field, an incompatible ABI break,
+# so bump current and reset revision/age.
+VERSION_INFO="4:0:0"
 AM_MAINTAINER_MODE
 AC_USE_SYSTEM_EXTENSIONS

30
debian/changelog vendored
View File

@@ -1,6 +1,32 @@
+httrack (3.49.8-3) unstable; urgency=medium
+* Rename libhttrack3 to libhttrack4 to follow the SONAME bump to
+libhttrack.so.4: htsblk gained a contenttype_given field, an
+incompatible ABI change (VERSION_INFO 3 -> 4). The .files wildcard
+now tracks .so.4* so the runtime libraries land in the right
+package. New binary package, via NEW.
+-- Xavier Roche <xavier@debian.org> Sat, 20 Jun 2026 19:46:16 +0200
+httrack (3.49.8-2) unstable; urgency=medium
+* Rename libhttrack2 to libhttrack3 to follow the SONAME, which the 3.49.8
+ABI bump moved to libhttrack.so.3 (package-name-doesnt-match-sonames). In
+3.49.8-1 the libhttrack2.files glob still matched .so.2, so the runtime
+libraries fell through into the httrack package and libhttrack2 shipped no
+library. The new .files uses a .so.3* wildcard so a future SONAME bump no
+longer silently misplaces the libraries. New binary package, via NEW.
+* Drop the stale debian/libhttrack-swf1.files: the swf module is no longer
+built and no libhttrack-swf1 package exists.
+-- Xavier Roche <xavier@debian.org> Sat, 20 Jun 2026 14:42:13 +0200
 httrack (3.49.8-1) unstable; urgency=medium
-* New upstream release.
+* New upstream release: HTTPS-proxy CONNECT tunnelling and wider srcset
+parsing, a batch of crawler and parser fixes (CSS @import, xmlns
+namespaces, relative paths, RFC 6265 cookies), and security hardening of
+the parser and of buffer copies throughout the engine.
 * Drop the OpenSSL linking exception from the license: OpenSSL 3.0+ is
 Apache-2.0 and GPL-compatible, so it is no longer needed. httrack is now
 plain GPL-3.0-or-later. Updated debian/copyright accordingly.
@@ -14,7 +40,7 @@ httrack (3.49.8-1) unstable; urgency=medium
 the QA debcheck page. Depend on firefox-esr | chromium | www-browser
 instead.
--- Xavier Roche <xavier@debian.org> Sun, 07 Jun 2026 14:29:24 +0200
+-- Xavier Roche <xavier@debian.org> Sat, 20 Jun 2026 13:02:08 +0200
 httrack (3.49.7-2) unstable; urgency=medium

6
debian/control vendored
View File

@@ -58,13 +58,13 @@ Description: webhttrack common files
 This package is the common files of webhttrack, website copier and
 mirroring utility
-Package: libhttrack2
+Package: libhttrack4
 Architecture: any
 Multi-Arch: same
 Section: libs
-Replaces: libhttrack1
-Conflicts: libhttrack1
 Depends: ${misc:Depends}, ${shlibs:Depends}
+Replaces: libhttrack3, httrack (<< 3.49.8-3~)
+Breaks: libhttrack3, httrack (<< 3.49.8-3~)
 Description: Httrack website copier library
 This package is the library part of httrack, website copier and mirroring
 utility

View File

@@ -4,3 +4,6 @@
 # so the path lives in the display pointer, not the override -- match with '*'.
 httrack-doc: extra-license-file *
 httrack-doc: package-contains-documentation-outside-usr-share-doc *
+# search.sh is a sample CGI shipped alongside the HTML manual, not meant to be
+# run from the package tree; it stays non-executable by design.
+httrack-doc: script-not-executable *

View File

@@ -1,2 +0,0 @@
-usr/lib/*/libhtsswf.so.1.0.0
-usr/lib/*/libhtsswf.so.1

View File

@@ -1,5 +0,0 @@
-usr/lib/*/libhttrack.so.2.0.49
-usr/lib/*/libhttrack.so.2
-usr/lib/*/libhtsjava.so.2.0.49
-usr/lib/*/libhtsjava.so.2
-usr/share/httrack/templates

View File

@@ -1,3 +0,0 @@
-# The shared libraries ship without a versioned symbols control file (ABI is
-# tracked via the SONAME and a strict =version dependency, see debian/rules).
-libhttrack2: no-symbols-control-file usr/lib/*

3
debian/libhttrack4.files vendored Normal file
View File

@@ -0,0 +1,3 @@
+usr/lib/*/libhttrack.so.4*
+usr/lib/*/libhtsjava.so.4*
+usr/share/httrack/templates

3
debian/libhttrack4.lintian-overrides vendored Normal file
View File

@@ -0,0 +1,3 @@
+# The shared libraries ship without a versioned symbols control file (ABI is
+# tracked via the SONAME plus a >= upstream-version dependency, see debian/rules).
+libhttrack4: no-symbols-control-file usr/lib/*

2
debian/rules vendored
View File

@@ -135,7 +135,7 @@ binary-arch: build install
 dh_makeshlibs -a -X/usr/lib/$(DEB_HOST_MULTIARCH)/httrack/libtest --version-info
 dh_installdeb -a
 # we depend on the current version (ABI may change)
-dh_shlibdeps -a -ldebian/libhttrack2/usr/lib/$(DEB_HOST_MULTIARCH)
+dh_shlibdeps -a -ldebian/libhttrack4/usr/lib/$(DEB_HOST_MULTIARCH)
 dh_gencontrol -a
 dh_md5sums -a
 dh_builddeb -a

View File

@@ -5,12 +5,31 @@ HTTrack Website Copier release history:
 This file lists all changes and fixes that have been made for HTTrack
 3.49-8
++ New: tunnel HTTPS downloads through the configured HTTP proxy via CONNECT (#85)
++ New: parse every candidate URL in <img> and <source> srcset lists (#326)
 + Changed: dropped the obsolete OpenSSL linking exception (OpenSSL 3.0+ is Apache-2.0 and GPL-compatible); httrack is now plain GPLv3-or-later
-+ Fixed: link libhtsjava and the libtest examples directly against libc
++ Fixed: several out-of-bounds reads in the HTML/CSS parser on hostile input (#94, #396)
++ Fixed: stored XSS via an unescaped URL in the generated page footer (#165)
++ Fixed: hardened buffer copies throughout the engine against overflow
++ Fixed: capture conditional CSS @import URLs (#94)
++ Fixed: don't crawl xmlns namespace declarations as links (#191)
++ Fixed: don't mistake the method argument of XMLHttpRequest.open for a URL (#218)
++ Fixed: percent-encode parentheses when rewriting CSS url() targets (#163)
++ Fixed: collapse ../ in file:// URLs and widen relative-link handling (#137, #162)
++ Fixed: drop the obsolete $Version/$Path attributes from the request Cookie header, per RFC 6265 (#151)
++ Fixed: keep empty quoted arguments when reloading doit.log for --update/--continue (#106)
++ Fixed: raise the User-Agent and custom-header length limits (#152)
++ Fixed: abort on a long log path (lock-file buffer too small) (#183)
++ Fixed: race in lazy mutex initialization (#297)
++ Fixed: sub-second mtime precision when comparing local files on POSIX (#383)
++ Fixed: modernize OpenSSL TLS initialization for the 3.x to 4.x transition (#308)
 + Fixed: in-place changes made by the postprocess callback were not applied (Roman Sęk)
 + Fixed: "preffered" typo in the help text and man page (yosinn1-blip)
 + Fixed: corrections and updates of the Russian translation (German Aizek)
 + Fixed: corrections and updates of the Danish translation (scootergrisen)
++ Fixed: link libhtsjava and the libtest examples directly against libc
++ New: documented the public library API headers and typed the option fields as named enums
++ Fixed: numerous build, packaging, CI and test-coverage improvements (out-of-tree builds, sanitizer/distcheck CI, shell and Python linting, AppStream metainfo)
 3.49-7
 + Fixed: keep generated config.h architecture-independent (Debian #1133728)

View File

@@ -1,8 +1,7 @@
 #!/bin/sh
 # Simple indexing test using HTTrack
 # A "real" script/program would use advanced search, and
 # use dichotomy to find the word in the index.txt file
 # This script is really basic and NOT optimized, and
 # should not be used for professional purpose :)
@@ -11,50 +10,49 @@ TESTSITE="http://localhost/"
 # Create an index if necessary
 if ! test -f "index.txt"; then
 echo "Building the index .."
 rm -rf test
 httrack --display "$TESTSITE" -%I -O test
 mv test/index.txt ./
 fi
 # Convert crlf to lf
-if test "`head index.txt -n 1 | tr '\r' '#' | grep -c '#'`" = "1"; then
+if test "$(head index.txt -n 1 | tr '\r' '#' | grep -c '#')" = "1"; then
 echo "Converting index to Unix LF style (not CR/LF) .."
 mv -f index.txt index.txt.old
-cat index.txt.old|tr -d '\r' > index.txt
+tr -d '\r' <index.txt.old >index.txt
 fi
 keyword=-
 while test -n "$keyword"; do
 printf "Enter a keyword: "
-read keyword
+read -r keyword
 if test -n "$keyword"; then
-FOUNDK="`grep -niE \"^$keyword\" index.txt`"
+FOUNDK="$(grep -niE "^$keyword" index.txt)"
 if test -n "$FOUNDK"; then
-if ! test `echo "$FOUNDK"|wc -l` = "1"; then
+if ! test "$(echo "$FOUNDK" | wc -l)" = "1"; then
 # Multiple matches
 printf "Found multiple keywords: "
-echo "$FOUNDK"|cut -f2 -d':'|tr '\n' ' '
+echo "$FOUNDK" | cut -f2 -d':' | tr '\n' ' '
 echo ""
 echo "Use keyword$ to find only one"
 else
 # One match
-N=`echo "$FOUNDK"|cut -f1 -d':'`
-PM=`tail +$N index.txt|grep -nE "\("|head -n 1`
-if ! echo "$PM"|grep "ignored">/dev/null; then
-M=`echo $PM|cut -f1 -d':'`
+N=$(echo "$FOUNDK" | cut -f1 -d':')
+PM=$(tail "+$N" index.txt | grep -nE "\(" | head -n 1)
+if ! echo "$PM" | grep "ignored" >/dev/null; then
+M=$(echo "$PM" | cut -f1 -d':')
 echo "Found in:"
-cat index.txt | tail "+$N" | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
+tail "+$N" index.txt | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
 else
 echo "keyword ignored (too many hits)"
 fi
 fi
 else
 echo "not found"
 fi
 fi
 done

View File

@@ -48,112 +48,115 @@ Please visit our Website: http://www.httrack.com
 /* Abort (with the failed byte count) when a growth allocation fails. The
 array macros never return an out-of-memory error; they assert and abort. */
 static void hts_record_assert_memory_failed(const size_t size) {
-fprintf(stderr, "memory allocation failed (%lu bytes)", \
-(long int) size); \
-assertf(! "memory allocation failed"); \
+fprintf(stderr, "memory allocation failed (%lu bytes)", (long int) size);
+assertf(!"memory allocation failed");
 }
 /** Dynamic array of T elements. **/
 #define TypedArray(T) \
 struct { \
 /** Elements. **/ \
 union { \
 /** Typed. **/ \
-T* elts; \
+T *elts; \
 /** Opaque. **/ \
-void* ptr; \
+void *ptr; \
 } data; \
 /** Count. **/ \
 size_t size; \
 /** Capacity. **/ \
 size_t capa; \
 }
 /** Initializer for an empty array (no backing store, size and capacity 0). **/
-#define EMPTY_TYPED_ARRAY { { NULL }, 0, 0 }
+#define EMPTY_TYPED_ARRAY {{NULL}, 0, 0}
 /** Array size, in elements. **/
 #define TypedArraySize(A) ((A).size)
 /** Array capacity, in elements. **/
 #define TypedArrayCapa(A) ((A).capa)
 /**
 * Remaining free space, in elements.
 * Macro, first element evaluated multiple times.
 **/
-#define TypedArrayRoom(A) ( TypedArrayCapa(A) - TypedArraySize(A) )
+#define TypedArrayRoom(A) (TypedArrayCapa(A) - TypedArraySize(A))
 /** Array elements, of type T*. **/
 #define TypedArrayElts(A) ((A).data.elts)
 /** Array pointer, of type void*. **/
 #define TypedArrayPtr(A) ((A).data.ptr)
 /** Size of T. **/
 #define TypedArrayWidth(A) (sizeof(*TypedArrayElts(A)))
 /** Nth element of the array, as an lvalue. No bounds check; N must be
 < TypedArraySize(A). **/
 #define TypedArrayNth(A, N) (TypedArrayElts(A)[N])
 /**
 * Tail of the array (outside the array).
 * The returned pointer points to the beginning of TypedArrayRoom(A)
 * free elements.
 **/
 #define TypedArrayTail(A) (TypedArrayNth(A, TypedArraySize(A)))
 /**
 * Ensure at least 'ROOM' elements can be put in the remaining space.
 * After a call to this macro, TypedArrayRoom(A) is guaranteed to be at
 * least equal to 'ROOM'.
 **/
-#define TypedArrayEnsureRoom(A, ROOM) do { \
-const size_t room_ = (ROOM); \
-while (TypedArrayRoom(A) < room_) { \
-TypedArrayCapa(A) = TypedArrayCapa(A) < 16 ? 16 : TypedArrayCapa(A) * 2; \
-} \
-TypedArrayPtr(A) = realloc(TypedArrayPtr(A), \
-TypedArrayCapa(A)*TypedArrayWidth(A)); \
-if (TypedArrayPtr(A) == NULL) { \
-hts_record_assert_memory_failed(TypedArrayCapa(A)*TypedArrayWidth(A)); \
-} \
-} while(0)
+#define TypedArrayEnsureRoom(A, ROOM) \
+do { \
+const size_t room_ = (ROOM); \
+while (TypedArrayRoom(A) < room_) { \
+TypedArrayCapa(A) = TypedArrayCapa(A) < 16 ? 16 : TypedArrayCapa(A) * 2; \
+} \
+TypedArrayPtr(A) = \
+realloc(TypedArrayPtr(A), TypedArrayCapa(A) * TypedArrayWidth(A)); \
+if (TypedArrayPtr(A) == NULL) { \
+hts_record_assert_memory_failed(TypedArrayCapa(A) * TypedArrayWidth(A)); \
+} \
+} while (0)
 /** Add an element. Macro, first element evaluated multiple times. **/
-#define TypedArrayAdd(A, E) do { \
-TypedArrayEnsureRoom(A, 1); \
-assertf(TypedArraySize(A) < TypedArrayCapa(A)); \
-TypedArrayTail(A) = (E); \
-TypedArraySize(A)++; \
-} while(0)
+#define TypedArrayAdd(A, E) \
+do { \
+TypedArrayEnsureRoom(A, 1); \
+assertf(TypedArraySize(A) < TypedArrayCapa(A)); \
+TypedArrayTail(A) = (E); \
+TypedArraySize(A)++; \
+} while (0)
 /**
 * Add 'COUNT' elements from 'PTR'.
 * Macro, first element evaluated multiple times.
 **/
-#define TypedArrayAppend(A, PTR, COUNT) do { \
-const size_t count_ = (COUNT); \
-/* This 1-case is to benefit from type safety. */ \
-if (count_ == 1) { \
-TypedArrayAdd(A, *(PTR)); \
-} else { \
-const void *const source_ = (PTR); \
-TypedArrayEnsureRoom(A, count_); \
-assertf(count_ <= TypedArrayRoom(A)); \
-memcpy(&TypedArrayTail(A), source_, count_ * TypedArrayWidth(A)); \
-TypedArraySize(A) += count_; \
-} \
-} while(0)
+#define TypedArrayAppend(A, PTR, COUNT) \
+do { \
+const size_t count_ = (COUNT); \
+/* This 1-case is to benefit from type safety. */ \
+if (count_ == 1) { \
+TypedArrayAdd(A, *(PTR)); \
+} else { \
+const void *const source_ = (PTR); \
+TypedArrayEnsureRoom(A, count_); \
+assertf(count_ <= TypedArrayRoom(A)); \
+memcpy(&TypedArrayTail(A), source_, count_ *TypedArrayWidth(A)); \
+TypedArraySize(A) += count_; \
+} \
+} while (0)
 /** Clear an array, freeing memory and clearing size and capacity. **/
-#define TypedArrayFree(A) do { \
-if (TypedArrayPtr(A) != NULL) { \
-TypedArrayCapa(A) = TypedArraySize(A) = 0; \
-free(TypedArrayPtr(A)); \
-TypedArrayPtr(A) = NULL; \
-} \
-} while(0)
+#define TypedArrayFree(A) \
+do { \
+if (TypedArrayPtr(A) != NULL) { \
+TypedArrayCapa(A) = TypedArraySize(A) = 0; \
+free(TypedArrayPtr(A)); \
+TypedArrayPtr(A) = NULL; \
+} \
+} while (0)
 #endif

View File

@@ -2532,8 +2532,26 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 #if HTS_USEOPENSSL
 /* SSL mode */
 if (back[i].r.ssl) {
+int tunnel_ok = 1;
+// https via proxy: CONNECT-tunnel before TLS (#85)
+if (back[i].r.req.proxy.active && back[i].r.ssl_con == NULL) {
+const int timeout = back[i].timeout > 0 ? back[i].timeout : 30;
+tunnel_ok =
+http_proxy_tunnel(opt, &back[i].r, back[i].url_adr, timeout);
+if (!tunnel_ok) {
+if (!strnotempty(back[i].r.msg))
+strcpybuff(back[i].r.msg, "proxy CONNECT failed");
+deletehttp(&back[i].r);
+back[i].r.soc = INVALID_SOCKET;
+back[i].r.statuscode = STATUSCODE_NON_FATAL;
+back[i].status = STATUS_READY;
+back_set_finished(sback, i);
+}
+}
 // handshake not yet launched
-if (!back[i].r.ssl_con) {
+if (tunnel_ok && !back[i].r.ssl_con) {
 SSL_CTX_set_options(openssl_ctx, SSL_OP_ALL);
 // new session
 back[i].r.ssl_con = SSL_new(openssl_ctx);
@@ -2551,7 +2569,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 back[i].r.statuscode = STATUSCODE_SSL_HANDSHAKE;
 }
 /* Error */
-if (back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
+if (tunnel_ok && back[i].r.statuscode == STATUSCODE_SSL_HANDSHAKE) {
 strcpybuff(back[i].r.msg, "bad SSL/TLS handshake");
 deletehttp(&back[i].r);
 back[i].r.soc = INVALID_SOCKET;
@@ -3838,7 +3856,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
 /* funny log for commandline users */
 //if (!opt->quiet) {
 // petite animation
-if (opt->verbosedisplay == 1) {
+if (opt->verbosedisplay == HTS_VERBOSE_SIMPLE) {
 if (back[i].status == STATUS_READY) {
 if (back[i].r.statuscode == HTTP_OK)
 printf("* %s%s (" LLintP " bytes) - OK" VT_CLREOL "\r",

@@ -41,7 +41,7 @@ Please visit our Website: http://www.httrack.com
#ifdef _WIN32 #ifdef _WIN32
#if HTS_INET6==0 #if HTS_INET6 == 0
#include <winsock2.h> #include <winsock2.h>
#else #else
@@ -49,13 +49,14 @@ Please visit our Website: http://www.httrack.com
#define WIN32_LEAN_AND_MEAN #define WIN32_LEAN_AND_MEAN
// KB955045 (http://support.microsoft.com/kb/955045) // KB955045 (http://support.microsoft.com/kb/955045)
// To execute an application using this function on earlier versions of Windows // To execute an application using this function on earlier versions of Windows
// (Windows 2000, Windows NT, and Windows Me/98/95), then it is mandatary to #include Ws2tcpip.h // (Windows 2000, Windows NT, and Windows Me/98/95), then it is mandatary to
// and also Wspiapi.h. When the Wspiapi.h header file is included, the 'getaddrinfo' function is // #include Ws2tcpip.h and also Wspiapi.h. When the Wspiapi.h header file is
// #defined to the 'WspiapiGetAddrInfo' inline function in Wspiapi.h. // included, the 'getaddrinfo' function is #defined to the 'WspiapiGetAddrInfo'
// inline function in Wspiapi.h.
#include <ws2tcpip.h> #include <ws2tcpip.h>
#include <Wspiapi.h> #include <Wspiapi.h>
//#include <winsock2.h> // #include <winsock2.h>
//#include <tpipv6.h> // #include <tpipv6.h>
#endif #endif

@@ -3,57 +3,59 @@
# Change this to download files # Change this to download files
if false; then if false; then
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
fi fi
# Produce code # Produce code
printf "/** GENERATED FILE ($0), DO NOT EDIT **/\n\n" printf '/** GENERATED FILE (%s), DO NOT EDIT **/\n\n' "$0"
for i in *.TXT ; do for i in *.TXT; do
echo "processing $i" >&2 echo "processing $i" >&2
grep -vE "^(#|$)" $i | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' | \ grep -vE "^(#|$)" "$i" | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' |
( (
unset arr unset arr
while read LINE ; do while read -r LINE; do
from=$[$(echo $LINE | cut -f1 -d' ')] from=$(($(echo "$LINE" | cut -f1 -d' ')))
if ! test -n "$from"; then if ! test -n "$from"; then
echo "error with $i" >&2 echo "error with $i" >&2
exit 1 exit 1
elif test $from -ge 256; then elif test $from -ge 256; then
echo "out-of-range ($LINE) with $i" >&2 echo "out-of-range ($LINE) with $i" >&2
exit 1 exit 1
fi fi
to=$(echo $LINE | cut -f2 -d' ') to=$(echo "$LINE" | cut -f2 -d' ')
arr[$from]=$to arr[from]=$to
done done
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/') # shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
printf "/* Table for $i */\nstatic const hts_UCS4 table_${name}[256] = {\n " name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
i=0 printf '/* Table for %s */\nstatic const hts_UCS4 table_%s[256] = {\n ' "$i" "$name"
while test "$i" -lt 256; do idx=0
if test "$i" -gt 0; then while test "$idx" -lt 256; do
printf ", " if test "$idx" -gt 0; then
if test $[${i}%8] -eq 0; then printf ", "
printf "\n " if test $((idx % 8)) -eq 0; then
fi printf "\n "
fi fi
value=${arr[$i]:-0} fi
printf "0x%04x" $value value=${arr[$idx]:-0}
i=$[${i}+1] printf "0x%04x" "$value"
done idx=$((idx + 1))
printf " };\n\n" done
) printf " };\n\n"
echo "processed $i" >&2 )
echo "processed $i" >&2
done done
# Indexes # Indexes
printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n" printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n"
for i in *.TXT ; do for i in *.TXT; do
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/') # shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
printf " { \"$(echo $name | tr -d '_')\", table_${name} },\n" name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf ' { "%s", table_%s },\n' "$(echo "$name" | tr -d '_')" "$name"
done done
printf " { NULL, NULL }\n};\n" printf " { NULL, NULL }\n};\n"

@@ -68,14 +68,15 @@ struct t_cookie {
#ifdef HTS_INTERNAL_BYTECODE #ifdef HTS_INTERNAL_BYTECODE
/* cookies */ /* cookies */
int cookie_add(t_cookie * cookie, const char *cook_name, const char *cook_value, int cookie_add(t_cookie *cookie, const char *cook_name, const char *cook_value,
const char *domain, const char *path); const char *domain, const char *path);
int cookie_del(t_cookie * cookie, const char *cook_name, const char *domain, const char *path); int cookie_del(t_cookie *cookie, const char *cook_name, const char *domain,
const char *path);
int cookie_load(t_cookie * cookie, const char *path, const char *name); int cookie_load(t_cookie *cookie, const char *path, const char *name);
int cookie_save(t_cookie * cookie, const char *name); int cookie_save(t_cookie *cookie, const char *name);
void cookie_insert(char *s, size_t s_size, const char *ins); void cookie_insert(char *s, size_t s_size, const char *ins);
@@ -83,7 +84,8 @@ void cookie_delete(char *s, size_t s_size, size_t pos);
const char *cookie_get(char *buffer, const char *cookie_base, int param); const char *cookie_get(char *buffer, const char *cookie_base, int param);
char *cookie_find(char *s, const char *cook_name, const char *domain, const char *path); char *cookie_find(char *s, const char *cook_name, const char *domain,
const char *path);
char *cookie_nextfield(char *a); char *cookie_nextfield(char *a);
@@ -92,12 +94,13 @@ char *cookie_nextfield(char *a);
/** Register credentials (auth = base-64 user:pass) for the prefix derived from /** Register credentials (auth = base-64 user:pass) for the prefix derived from
adr (host) and fil (path). No-op returning 0 if cookie is NULL, allocation adr (host) and fil (path). No-op returning 0 if cookie is NULL, allocation
fails, or a matching prefix is already stored; returns 1 on insertion. */ fails, or a matching prefix is already stored; returns 1 on insertion. */
int bauth_add(t_cookie * cookie, const char *adr, const char *fil, const char *auth); int bauth_add(t_cookie *cookie, const char *adr, const char *fil,
const char *auth);
/** Return the stored base-64 credentials whose prefix matches adr+fil, or NULL /** Return the stored base-64 credentials whose prefix matches adr+fil, or NULL
if none (or cookie is NULL). Returned pointer aliases the jar's bauth_chain; if none (or cookie is NULL). Returned pointer aliases the jar's bauth_chain;
caller must not free it. */ caller must not free it. */
char *bauth_check(t_cookie * cookie, const char *adr, const char *fil); char *bauth_check(t_cookie *cookie, const char *adr, const char *fil);
/** Build the auth lookup key (host + path, query string stripped, truncated at /** Build the auth lookup key (host + path, query string stripped, truncated at
the last '/') from adr and fil into prefix; returns prefix. Caller must the last '/') from adr and fil into prefix; returns prefix. Caller must

@@ -135,7 +135,8 @@ HTSEXT_API T_SOC catch_url_init(int *port, /* 128 bytes */ char *adr) {
// returns 0 if error // returns 0 if error
// url: buffer where URL must be stored - or ip:port in case of failure // url: buffer where URL must be stored - or ip:port in case of failure
// data: 32Kb // data: 32Kb
HTSEXT_API int catch_url(T_SOC soc, char *url, char *method, char *data) { HTSEXT_API hts_boolean catch_url(T_SOC soc, char *url, char *method,
char *data) {
int retour = 0; int retour = 0;
// connexion (accept) // connexion (accept)

@@ -52,12 +52,12 @@ Please visit our Website: http://www.httrack.com
#define DEFAULT_FTP "index.txt" #define DEFAULT_FTP "index.txt"
// extension par défaut pour fichiers n'en ayant pas // extension par défaut pour fichiers n'en ayant pas
#define DEFAULT_EXT ".html" #define DEFAULT_EXT ".html"
#define DEFAULT_EXT_SHORT ".htm" #define DEFAULT_EXT_SHORT ".htm"
//#define DEFAULT_BIN_EXT ".bin" // #define DEFAULT_BIN_EXT ".bin"
//#define DEFAULT_BIN_EXT_SHORT ".bin" // #define DEFAULT_BIN_EXT_SHORT ".bin"
//#define DEFAULT_EXT ".txt" // #define DEFAULT_EXT ".txt"
//#define DEFAULT_EXT_SHORT ".txt" // #define DEFAULT_EXT_SHORT ".txt"
// éviter les /nul, /con.. // éviter les /nul, /con..
#define HTS_OVERRIDE_DOS_FOLDERS 1 #define HTS_OVERRIDE_DOS_FOLDERS 1
@@ -87,7 +87,8 @@ Please visit our Website: http://www.httrack.com
// fast cache (build hash table) // fast cache (build hash table)
#define HTS_FAST_CACHE 1 #define HTS_FAST_CACHE 1
// le > peut être considéré comme un tag de fermeture de commentaire (<!-- > est valide) // le > peut être considéré comme un tag de fermeture de commentaire (<!-- > est
// valide)
#define GT_ENDS_COMMENT 1 #define GT_ENDS_COMMENT 1
// always adds a '/' at the end if a '~' is encountered (/~smith -> /~smith/) // always adds a '/' at the end if a '~' is encountered (/~smith -> /~smith/)
@@ -97,7 +98,8 @@ Please visit our Website: http://www.httrack.com
#define HTS_STRIP_DOUBLE_SLASH 0 #define HTS_STRIP_DOUBLE_SLASH 0
// case-sensitive pour les dossiers et fichiers (0/1) // case-sensitive pour les dossiers et fichiers (0/1)
// [normalement 1, mais pose des problèmes (url malformée par exemple) et n'est pas très utile.. // [normalement 1, mais pose des problèmes (url malformée par exemple) et n'est
// pas très utile..
// ..et pas bcp respecté] // ..et pas bcp respecté]
// REMOVED // REMOVED
// #define HTS_CASSE 0 // #define HTS_CASSE 0

@@ -2585,7 +2585,7 @@ static int mkdir_compat(const char *pathname) {
/* path must end with "/" or with the finename (/tmp/bar/ or /tmp/bar/foo.zip) */ /* path must end with "/" or with the finename (/tmp/bar/ or /tmp/bar/foo.zip) */
/* Note: preserve errno */ /* Note: preserve errno */
HTSEXT_API int dir_exists(const char *path) { HTSEXT_API hts_boolean dir_exists(const char *path) {
const int err = errno; const int err = errno;
STRUCT_STAT st; STRUCT_STAT st;
char BIGSTK file[HTS_URLMAXSIZE * 2]; char BIGSTK file[HTS_URLMAXSIZE * 2];
@@ -3342,7 +3342,8 @@ int back_fill(struct_back * sback, httrackp * opt, cache_back * cache,
int ptr, int numero_passe) { int ptr, int numero_passe) {
int n = back_pluggable_sockets(sback, opt); int n = back_pluggable_sockets(sback, opt);
if (opt->savename_delayed == 2 && !opt->delayed_cached) /* cancel (always delayed) */ if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
!opt->delayed_cached) /* cancel (always delayed) */
return 0; return 0;
if (n > 0) { if (n > 0) {
int p; int p;
@@ -3646,7 +3647,7 @@ HTSEXT_API int hts_setpause(httrackp * opt, int p) {
} }
// ask for termination // ask for termination
HTSEXT_API int hts_request_stop(httrackp * opt, int force) { HTSEXT_API int hts_request_stop(httrackp *opt, hts_boolean force) {
if (opt != NULL) { if (opt != NULL) {
hts_log_print(opt, LOG_ERROR, "Exit requested by shell or user"); hts_log_print(opt, LOG_ERROR, "Exit requested by shell or user");
hts_mutexlock(&opt->state.lock); hts_mutexlock(&opt->state.lock);
@@ -3656,7 +3657,7 @@ HTSEXT_API int hts_request_stop(httrackp * opt, int force) {
return 0; return 0;
} }
HTSEXT_API int hts_has_stopped(httrackp * opt) { HTSEXT_API hts_boolean hts_has_stopped(httrackp *opt) {
int ended; int ended;
hts_mutexlock(&opt->state.lock); hts_mutexlock(&opt->state.lock);
ended = opt->state.is_ended; ended = opt->state.is_ended;
@@ -3678,12 +3679,12 @@ HTSEXT_API int hts_has_stopped(httrackp * opt) {
//} //}
// ajout d'URL // ajout d'URL
// -1 : erreur // -1 : erreur
HTSEXT_API int hts_addurl(httrackp * opt, char **url) { HTSEXT_API hts_boolean hts_addurl(httrackp *opt, char **url) {
if (url) if (url)
opt->state._hts_addurl = url; opt->state._hts_addurl = url;
return (opt->state._hts_addurl != NULL); return (opt->state._hts_addurl != NULL);
} }
HTSEXT_API int hts_resetaddurl(httrackp * opt) { HTSEXT_API hts_boolean hts_resetaddurl(httrackp *opt) {
opt->state._hts_addurl = NULL; opt->state._hts_addurl = NULL;
return (opt->state._hts_addurl != NULL); return (opt->state._hts_addurl != NULL);
} }
@@ -3702,7 +3703,9 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->maxsoc > 0) if (from->maxsoc > 0)
to->maxsoc = from->maxsoc; to->maxsoc = from->maxsoc;
if (from->nearlink > -1) /* hts_boolean/enum fields are unsigned (GCC), so a bare `> -1` unset-guard
is always false; cast to int to keep the -1 "unset" sentinel test. */
if ((int) from->nearlink > -1)
to->nearlink = from->nearlink; to->nearlink = from->nearlink;
if (from->timeout > -1) if (from->timeout > -1)
@@ -3729,10 +3732,10 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
if (from->hostcontrol > -1) if (from->hostcontrol > -1)
to->hostcontrol = from->hostcontrol; to->hostcontrol = from->hostcontrol;
if (from->errpage > -1) if ((int) from->errpage > -1)
to->errpage = from->errpage; to->errpage = from->errpage;
if (from->parseall > -1) if ((int) from->parseall > -1)
to->parseall = from->parseall; to->parseall = from->parseall;
// test all: bit 8 de travel // test all: bit 8 de travel
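Note on the `(int)` casts in the hunk above: when the left operand is unsigned, the `-1` in `x > -1` is converted to the largest unsigned value before the comparison, so the guard can never be true. The sketch below reproduces this with a plain `unsigned int`; `opt_flag`, `is_set_bare` and `is_set_cast` are hypothetical names, not HTTrack types or functions. Converting an out-of-range unsigned value back to `int` is implementation-defined in C; the sketch assumes the usual two's-complement result, where the sentinel reads back as -1.

```c
#include <assert.h>

/* Hypothetical stand-in for an option field with an unsigned underlying
   type, where (opt_flag) -1 marks the field as "unset". */
typedef unsigned int opt_flag;

/* Bare guard: -1 becomes UINT_MAX, and no unsigned value is greater than
   UINT_MAX, so this returns 0 for every input (compilers may warn here). */
static int is_set_bare(opt_flag v) {
  return v > -1;
}

/* Cast guard: compares as signed, so the sentinel reads as -1 and is
   rejected while ordinary values pass. */
static int is_set_cast(opt_flag v) {
  return (int) v > -1;
}
```

With the bare guard a copy routine would skip the field for every value, set or unset; with the cast it skips only the sentinel.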
@@ -3844,7 +3847,7 @@ int htsAddLink(htsmoduleStruct * str, char *link) {
a = opt->savename_type; a = opt->savename_type;
b = opt->savename_83; b = opt->savename_83;
opt->savename_type = 0; opt->savename_type = 0;
opt->savename_83 = 0; opt->savename_83 = HTS_SAVENAME_83_LONG;
// note: adr,fil peuvent être patchés // note: adr,fil peuvent être patchés
r = r =
url_savename(&afs, NULL, NULL, NULL, opt, sback, cache, hashptr, ptr, numero_passe, url_savename(&afs, NULL, NULL, NULL, opt, sback, cache, hashptr, ptr, numero_passe,

@@ -612,12 +612,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
/* Terminal is a tty, may ask questions and display funny information */ /* Terminal is a tty, may ask questions and display funny information */
if (isatty(1)) { if (isatty(1)) {
opt->quiet = 0; opt->quiet = 0;
opt->verbosedisplay = 1; opt->verbosedisplay = HTS_VERBOSE_SIMPLE;
} }
/* Not a tty, no stdin input or funny output! */ /* Not a tty, no stdin input or funny output! */
else { else {
opt->quiet = 1; opt->quiet = 1;
opt->verbosedisplay = 0; opt->verbosedisplay = HTS_VERBOSE_NONE;
} }
#endif #endif
@@ -953,9 +953,11 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
p = buff; p = buff;
do { do {
int insert_after_argc; int insert_after_argc;
int quoted; /* "" unquotes to empty but is still a real token (#106) */
// read next // read next
lastp = p; lastp = p;
quoted = (p != NULL && *p == '"');
if (p) { if (p) {
p = next_token(p, 1); p = next_token(p, 1);
if (p) { if (p) {
@@ -966,7 +968,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
/* Insert parameters BUT so that they can be in the same order */ /* Insert parameters BUT so that they can be in the same order */
if (lastp) { if (lastp) {
if (strnotempty(lastp)) { if (strnotempty(lastp) || quoted) {
insert_after_argc = argc - insert_after; insert_after_argc = argc - insert_after;
cmdl_ins(lastp, insert_after_argc, (argv + insert_after), x_argvblk, cmdl_ins(lastp, insert_after_argc, (argv + insert_after), x_argvblk,
x_argvblk_size, x_ptr); x_argvblk_size, x_ptr);
@@ -1815,24 +1817,22 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
com++; com++;
} }
break; break;
case 'L': case 'L': {
{ sscanf(com + 1, "%d", (int *) &opt->savename_83);
sscanf(com + 1, "%d", &opt->savename_83); switch (opt->savename_83) {
switch (opt->savename_83) { case 0: // 8-3 (ISO9660 L1)
case 0: // 8-3 (ISO9660 L1) opt->savename_83 = HTS_SAVENAME_83_DOS;
opt->savename_83 = 1; break;
break; case 1:
case 1: opt->savename_83 = HTS_SAVENAME_83_LONG;
opt->savename_83 = 0; break;
break; default: // 2 == ISO9660 (ISO9660 L2)
default: // 2 == ISO9660 (ISO9660 L2) opt->savename_83 = HTS_SAVENAME_83_ISO9660;
opt->savename_83 = 2; break;
break;
}
while(isdigit((unsigned char) *(com + 1)))
com++;
} }
break; while (isdigit((unsigned char) *(com + 1)))
com++;
} break;
case 's': case 's':
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", (int *) &opt->robots); sscanf(com + 1, "%d", (int *) &opt->robots);
@@ -1989,9 +1989,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
} }
break; // url hack break; // url hack
case 'v': case 'v':
opt->verbosedisplay = 2; opt->verbosedisplay = HTS_VERBOSE_FULL;
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->verbosedisplay); sscanf(com + 1, "%d", (int *) &opt->verbosedisplay);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
} }
@@ -2004,9 +2004,9 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
} }
break; break;
case 'N': case 'N':
opt->savename_delayed = 2; opt->savename_delayed = HTS_SAVENAME_DELAYED_HARD;
if (isdigit((unsigned char) *(com + 1))) { if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->savename_delayed); sscanf(com + 1, "%d", (int *) &opt->savename_delayed);
while(isdigit((unsigned char) *(com + 1))) while(isdigit((unsigned char) *(com + 1)))
com++; com++;
} }
@@ -2787,6 +2787,47 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 0; return 0;
} }
break; break;
case 'l': /* lienrelatif: relative link from curr_fil to link */
if (na + 2 >= argc) {
HTS_PANIC_PRINTF(
"Option #l needs a link and a current-file path");
printf(
"Example: '-#l' 'host/dir/img.gif' 'host/dir/p.html'\n");
htsmain_free();
return -1;
} else {
char s[HTS_URLMAXSIZE * 2];
if (lienrelatif(s, sizeof(s), argv[na + 1], argv[na + 2]) ==
0)
printf("relative=%s\n", s);
else
printf("relative=<ERROR>\n");
htsmain_free();
return 0;
}
break;
case 'i': /* ident_url_relatif: resolve a link -> adr/fil */
if (na + 3 >= argc) {
HTS_PANIC_PRINTF(
"Option #i needs a link, an origin address and file");
printf("Example: '-#i' '../img.gif' 'www.foo.com' "
"'/d/p.html'\n");
htsmain_free();
return -1;
} else {
lien_adrfil af;
const int r = ident_url_relatif(argv[na + 1], argv[na + 2],
argv[na + 3], &af);
if (r == 0)
printf("adr=%s fil=%s\n", af.adr, af.fil);
else
printf("error=%d\n", r);
htsmain_free();
return 0;
}
break;
case '2': // mimedefs case '2': // mimedefs
if (na + 1 >= argc) { if (na + 1 >= argc) {
HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL"); HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL");
@@ -3096,6 +3137,78 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
htsmain_free(); htsmain_free();
return 0; return 0;
break; break;
case '9': { // copy_htsopt selftest: httrack -#9
httrackp *from = hts_create_opt();
httrackp *to = hts_create_opt();
int err = 0;
/* from-values differ from both the to-values and the
hts_create_opt() defaults (nearlink FALSE, errpage/parseall
TRUE), so a copy that no-ops or just resets to defaults is
caught too, not only the unsigned-guard bug. */
from->retry = 7; /* int field: positive control */
to->retry = 0;
from->nearlink = HTS_TRUE;
to->nearlink = HTS_FALSE;
from->errpage = HTS_FALSE;
to->errpage = HTS_TRUE;
from->parseall = HTS_FALSE;
to->parseall = HTS_TRUE;
copy_htsopt(from, to);
if (to->retry != 7)
err = 1;
if (to->nearlink != HTS_TRUE)
err = 1;
if (to->errpage != HTS_FALSE)
err = 1;
if (to->parseall != HTS_FALSE)
err = 1;
hts_free_opt(from);
hts_free_opt(to);
printf("copy-htsopt: %s\n", err ? "FAIL" : "OK");
htsmain_free();
return err;
} break;
case 'Q': { // cookie request-header selftest: httrack -#Q
static t_cookie cookie;
char hdr[1024];
/* RFC 6265: bare name=value pairs, no $Version/$Path (#151). */
const char *expected = "Cookie: name=value; has_js=1" H_CRLF;
int err = 0;
const char *dom = "www.example.com";
int added;
cookie.max_len = (int) sizeof(cookie.data);
cookie.data[0] = '\0';
added = cookie_add(&cookie, "name", "value", dom, "/");
added |= cookie_add(&cookie, "has_js", "1", dom, "/");
/* different domain: must be filtered out */
added |= cookie_add(&cookie, "junk", "x", "other.org", "/");
if (added) {
printf("cookie-header: FAIL (cookie_add setup)\n");
htsmain_free();
return 1;
}
http_cookie_header_selftest(&cookie, dom, "/", hdr,
sizeof(hdr));
if (strcmp(hdr, expected) != 0)
err = 1;
if (strstr(hdr, "$Version") != NULL ||
strstr(hdr, "$Path") != NULL)
err = 1;
if (strstr(hdr, "junk") != NULL) // wrong-domain cookie leaked
err = 1;
printf("cookie-header: %s\n", err ? "FAIL" : "OK");
if (err)
printf(" got: %s\n", hdr);
htsmain_free();
return err;
} break;
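Note on the header shape the -#Q selftest above expects: bare name=value pairs joined by "; ", no $Version or $Path attributes, and nothing from another domain. The sketch below builds such a header from a small array. It is an illustration only: `demo_cookie` and `build_cookie_header` are hypothetical names, the jar layout is not HTTrack's `t_cookie`, and exact string equality stands in for real domain and path matching.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cookie record; not HTTrack's t_cookie layout. */
struct demo_cookie {
  const char *name;
  const char *value;
  const char *domain;
};

/* Write "Cookie: a=b; c=d\r\n" into out, using only the cookies whose domain
   equals host. Leaves out empty if nothing matches. Returns 0 on success,
   -1 if out is too small (out may then hold a partial header). */
static int build_cookie_header(char *out, size_t size,
                               const struct demo_cookie *jar, size_t count,
                               const char *host) {
  size_t used = 0, i;
  int matched = 0;

  assert(out != NULL && size > 0 && host != NULL);
  out[0] = '\0';
  for (i = 0; i < count; i++) {
    int n;

    if (strcmp(jar[i].domain, host) != 0)
      continue; /* cookie for another domain: never sent */
    n = snprintf(out + used, size - used, "%s%s=%s",
                 matched ? "; " : "Cookie: ", jar[i].name, jar[i].value);
    if (n < 0 || (size_t) n >= size - used)
      return -1;
    used += (size_t) n;
    matched = 1;
  }
  if (matched) {
    if (size - used < 3)
      return -1;
    memcpy(out + used, "\r\n", 3); /* CRLF plus the terminating NUL */
  }
  return 0;
}
```

Given the same three cookies as the selftest (two for www.example.com, one for other.org), this produces the string the selftest compares against, with the other.org entry filtered out.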
case '!': case '!':
HTS_PANIC_PRINTF HTS_PANIC_PRINTF
("Option #! is disabled for security reasons"); ("Option #! is disabled for security reasons");

@@ -82,9 +82,9 @@ typedef struct t_hts_callbackarg t_hts_callbackarg;
/* Entry points of a --wrapper plug-in: hts_plug(opt, argv) is called once to /* Entry points of a --wrapper plug-in: hts_plug(opt, argv) is called once to
install the wrapper (argv is the wrapper's argument string), hts_unplug(opt) install the wrapper (argv is the wrapper's argument string), hts_unplug(opt)
once to tear it down. Both return non-zero on success. */ once to tear it down. Both return non-zero on success. */
typedef int (*t_hts_plug) (httrackp * opt, const char *argv); typedef int (*t_hts_plug)(httrackp *opt, const char *argv);
typedef int (*t_hts_unplug) (httrackp * opt); typedef int (*t_hts_unplug)(httrackp *opt);
/* Engine callback prototypes. Each is one hook the engine fires at a defined /* Engine callback prototypes. Each is one hook the engine fires at a defined
point of a mirror; a wrapper installs the ones it cares about in the point of a mirror; a wrapper installs the ones it cares about in the
@@ -92,27 +92,27 @@ typedef int (*t_hts_unplug) (httrackp * opt);
returns are 1 to continue/accept, 0 to abort/refuse unless noted. */ returns are 1 to continue/accept, 0 to abort/refuse unless noted. */
/* Called once when the wrapper is installed; allocate per-run state here. */ /* Called once when the wrapper is installed; allocate per-run state here. */
typedef void (*t_hts_htmlcheck_init) (t_hts_callbackarg * carg); typedef void (*t_hts_htmlcheck_init)(t_hts_callbackarg *carg);
/* Called once when the wrapper is removed; release per-run state here. */ /* Called once when the wrapper is removed; release per-run state here. */
typedef void (*t_hts_htmlcheck_uninit) (t_hts_callbackarg * carg); typedef void (*t_hts_htmlcheck_uninit)(t_hts_callbackarg *carg);
/* Fired at the start of a mirror, after options are parsed. */ /* Fired at the start of a mirror, after options are parsed. */
typedef int (*t_hts_htmlcheck_start) (t_hts_callbackarg * carg, httrackp * opt); typedef int (*t_hts_htmlcheck_start)(t_hts_callbackarg *carg, httrackp *opt);
/* Fired at the end of a mirror. */ /* Fired at the end of a mirror. */
typedef int (*t_hts_htmlcheck_end) (t_hts_callbackarg * carg, httrackp * opt); typedef int (*t_hts_htmlcheck_end)(t_hts_callbackarg *carg, httrackp *opt);
/* Fired while options are being changed, to validate or adjust them. */ /* Fired while options are being changed, to validate or adjust them. */
typedef int (*t_hts_htmlcheck_chopt) (t_hts_callbackarg * carg, httrackp * opt); typedef int (*t_hts_htmlcheck_chopt)(t_hts_callbackarg *carg, httrackp *opt);
/* Rewrite hook over an in-memory page: the html and len arguments point at the /* Rewrite hook over an in-memory page: the html and len arguments point at the
buffer and its length (the callback may reallocate and resize it), buffer and its length (the callback may reallocate and resize it),
url_adresse and url_fichier name it. */ url_adresse and url_fichier name it. */
typedef int (*t_hts_htmlcheck_process) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_process)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, char **html, int *len, char **html, int *len,
const char *url_adresse, const char *url_adresse,
const char *url_fichier); const char *url_fichier);
/* Same shape as process, run before HTML parsing. */ /* Same shape as process, run before HTML parsing. */
typedef t_hts_htmlcheck_process t_hts_htmlcheck_preprocess; typedef t_hts_htmlcheck_process t_hts_htmlcheck_preprocess;
@@ -121,113 +121,111 @@ typedef t_hts_htmlcheck_process t_hts_htmlcheck_preprocess;
typedef t_hts_htmlcheck_process t_hts_htmlcheck_postprocess; typedef t_hts_htmlcheck_process t_hts_htmlcheck_postprocess;
/* Inspect a page (read-only html/len) without rewriting it. */ /* Inspect a page (read-only html/len) without rewriting it. */
typedef int (*t_hts_htmlcheck_check_html) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_check_html)(t_hts_callbackarg *carg,
httrackp * opt, char *html, int len, httrackp *opt, char *html, int len,
const char *url_adresse, const char *url_adresse,
const char *url_fichier); const char *url_fichier);
/* Answer an engine query identified by 'question'; returns the answer string /* Answer an engine query identified by 'question'; returns the answer string
(owned by the callback, must stay valid until the next call). */ (owned by the callback, must stay valid until the next call). */
typedef const char *(*t_hts_htmlcheck_query) (t_hts_callbackarg * carg, typedef const char *(*t_hts_htmlcheck_query)(t_hts_callbackarg *carg,
httrackp * opt, httrackp *opt,
const char *question); const char *question);
/* Second query channel, same contract as query. */ /* Second query channel, same contract as query. */
typedef const char *(*t_hts_htmlcheck_query2) (t_hts_callbackarg * carg, typedef const char *(*t_hts_htmlcheck_query2)(t_hts_callbackarg *carg,
httrackp * opt, httrackp *opt,
const char *question); const char *question);
/* Third query channel, same contract as query. */ /* Third query channel, same contract as query. */
typedef const char *(*t_hts_htmlcheck_query3) (t_hts_callbackarg * carg, typedef const char *(*t_hts_htmlcheck_query3)(t_hts_callbackarg *carg,
httrackp * opt, httrackp *opt,
const char *question); const char *question);
/* Per-tick progress hook: 'back' is the transfer slot array of 'back_max' /* Per-tick progress hook: 'back' is the transfer slot array of 'back_max'
entries, back_index the active one; lien_tot/lien_ntot and stats report entries, back_index the active one; lien_tot/lien_ntot and stats report
queue size and running totals, stat_time the elapsed time. */ queue size and running totals, stat_time the elapsed time. */
typedef int (*t_hts_htmlcheck_loop) (t_hts_callbackarg * carg, httrackp * opt, typedef int (*t_hts_htmlcheck_loop)(t_hts_callbackarg *carg, httrackp *opt,
lien_back * back, int back_max, lien_back *back, int back_max,
int back_index, int lien_tot, int back_index, int lien_tot, int lien_ntot,
int lien_ntot, int stat_time, int stat_time, hts_stat_struct *stats);
hts_stat_struct * stats);
/* Veto a link (adr host, fil path) after its transfer; status is the result. /* Veto a link (adr host, fil path) after its transfer; status is the result.
Return 0 to drop the link. */ Return 0 to drop the link. */
typedef int (*t_hts_htmlcheck_check_link) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_check_link)(t_hts_callbackarg *carg,
httrackp * opt, const char *adr, httrackp *opt, const char *adr,
const char *fil, int status); const char *fil, int status);
/* Veto a link by its MIME type before download; return 0 to skip it. */ /* Veto a link by its MIME type before download; return 0 to skip it. */
typedef int (*t_hts_htmlcheck_check_mime) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_check_mime)(t_hts_callbackarg *carg,
httrackp * opt, const char *adr, httrackp *opt, const char *adr,
const char *fil, const char *mime, const char *fil, const char *mime,
int status); int status);
/* Fired when the mirror pauses, waiting on 'lockfile' to be removed. */ /* Fired when the mirror pauses, waiting on 'lockfile' to be removed. */
typedef void (*t_hts_htmlcheck_pause) (t_hts_callbackarg * carg, httrackp * opt, typedef void (*t_hts_htmlcheck_pause)(t_hts_callbackarg *carg, httrackp *opt,
const char *lockfile); const char *lockfile);
/* Fired after a file is written to disk; 'file' is the local path. */ /* Fired after a file is written to disk; 'file' is the local path. */
typedef void (*t_hts_htmlcheck_filesave) (t_hts_callbackarg * carg, typedef void (*t_hts_htmlcheck_filesave)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, const char *file); const char *file);
/* Richer file-saved notification: source host/filename, local path, and flags /* Richer file-saved notification: source host/filename, local path, and flags
telling whether the file is new, modified, or left unchanged. */ telling whether the file is new, modified, or left unchanged. */
typedef void (*t_hts_htmlcheck_filesave2) (t_hts_callbackarg * carg, typedef void (*t_hts_htmlcheck_filesave2)(t_hts_callbackarg *carg,
httrackp * opt, const char *hostname, httrackp *opt, const char *hostname,
const char *filename, const char *filename,
const char *localfile, int is_new, const char *localfile, int is_new,
int is_modified, int not_updated); int is_modified, int not_updated);
/* Fired for each link parsed out of a page; 'link' may be edited in place. */ /* Fired for each link parsed out of a page; 'link' may be edited in place. */
typedef int (*t_hts_htmlcheck_linkdetected) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_linkdetected)(t_hts_callbackarg *carg,
httrackp * opt, char *link); httrackp *opt, char *link);
/* As linkdetected, plus tag_start, the markup the link was found in. */ /* As linkdetected, plus tag_start, the markup the link was found in. */
typedef int (*t_hts_htmlcheck_linkdetected2) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_linkdetected2)(t_hts_callbackarg *carg,
httrackp * opt, char *link, httrackp *opt, char *link,
const char *tag_start); const char *tag_start);
/* Fired on each transfer-status change of slot 'back'. */ /* Fired on each transfer-status change of slot 'back'. */
typedef int (*t_hts_htmlcheck_xfrstatus) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_xfrstatus)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, lien_back * back); lien_back *back);
/* Choose the local save path for a URL; write it into 'save'. adr/fil name the /* Choose the local save path for a URL; write it into 'save'. adr/fil name the
target, referer_adr/referer_fil the page that linked it. */ target, referer_adr/referer_fil the page that linked it. */
typedef int (*t_hts_htmlcheck_savename) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_savename)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, const char *adr_complete,
const char *adr_complete, const char *fil_complete,
const char *fil_complete, const char *referer_adr,
const char *referer_adr, const char *referer_fil, char *save);
const char *referer_fil, char *save);
/* Extended save-name hook, same signature as savename. */ /* Extended save-name hook, same signature as savename. */
typedef t_hts_htmlcheck_savename t_hts_htmlcheck_extsavename; typedef t_hts_htmlcheck_savename t_hts_htmlcheck_extsavename;
/* Inspect or edit the outgoing request headers in 'buff' before they are sent. /* Inspect or edit the outgoing request headers in 'buff' before they are sent.
*/ */
typedef int (*t_hts_htmlcheck_sendhead) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_sendhead)(t_hts_callbackarg *carg, httrackp *opt,
httrackp * opt, char *buff, char *buff, const char *adr,
const char *adr, const char *fil, const char *fil,
const char *referer_adr, const char *referer_adr,
const char *referer_fil, const char *referer_fil,
htsblk * outgoing); htsblk *outgoing);
/* Inspect the incoming response headers in 'buff' after they are received. */ /* Inspect the incoming response headers in 'buff' after they are received. */
typedef int (*t_hts_htmlcheck_receivehead) (t_hts_callbackarg * carg, typedef int (*t_hts_htmlcheck_receivehead)(t_hts_callbackarg *carg,
httrackp * opt, char *buff, httrackp *opt, char *buff,
const char *adr, const char *fil, const char *adr, const char *fil,
const char *referer_adr, const char *referer_adr,
const char *referer_fil, const char *referer_fil,
htsblk * incoming); htsblk *incoming);
/* External parser module hooks: detect claims a document type (return 1 to /* External parser module hooks: detect claims a document type (return 1 to
take it), parse then extracts its links. 'str' carries the document. */ take it), parse then extracts its links. 'str' carries the document. */
typedef int (*t_hts_htmlcheck_detect) (t_hts_callbackarg * carg, httrackp * opt, typedef int (*t_hts_htmlcheck_detect)(t_hts_callbackarg *carg, httrackp *opt,
htsmoduleStruct * str); htsmoduleStruct *str);
typedef int (*t_hts_htmlcheck_parse) (t_hts_callbackarg * carg, httrackp * opt, typedef int (*t_hts_htmlcheck_parse)(t_hts_callbackarg *carg, httrackp *opt,
htsmoduleStruct * str); htsmoduleStruct *str);
/* Callbacks */ /* Callbacks */
#ifndef HTS_DEF_FWSTRUCT_t_hts_htmlcheck_callbacks #ifndef HTS_DEF_FWSTRUCT_t_hts_htmlcheck_callbacks
@@ -237,10 +235,10 @@ typedef struct t_hts_htmlcheck_callbacks t_hts_htmlcheck_callbacks;
/* Declares one named callback slot: its function pointer (typed /* Declares one named callback slot: its function pointer (typed
t_hts_htmlcheck_<NAME>) paired with the carg passed to it. */ t_hts_htmlcheck_<NAME>) paired with the carg passed to it. */
#define DEFCALLBACK(NAME) \ #define DEFCALLBACK(NAME) \
struct NAME { \ struct NAME { \
t_hts_htmlcheck_ ##NAME fun; \ t_hts_htmlcheck_##NAME fun; \
t_hts_callbackarg *carg; \ t_hts_callbackarg *carg; \
} NAME } NAME
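Review note: the slot layout that DEFCALLBACK produces can be sketched in isolation as below. Every `demo_`-prefixed name is a hypothetical stand-in, not part of the HTTrack headers, and the "unset slot accepts the link" default is an assumption made for the example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for t_hts_callbackarg. */
typedef struct demo_callbackarg {
  int calls;
} demo_callbackarg;

/* Hook type named after its slot, as the macro's token pasting expects. */
typedef int (*demo_hook_linkdetected)(demo_callbackarg *carg, char *link);

/* Same shape as DEFCALLBACK: a member holding the function and its carg. */
#define DEMO_DEFCALLBACK(NAME) \
  struct NAME {                \
    demo_hook_##NAME fun;      \
    demo_callbackarg *carg;    \
  } NAME

typedef struct demo_callbacks {
  DEMO_DEFCALLBACK(linkdetected);
} demo_callbacks;

/* Example hook: counts calls and rejects empty links. */
static int count_links(demo_callbackarg *carg, char *link) {
  assert(carg != NULL && link != NULL);
  carg->calls++;
  return link[0] != '\0';
}

/* Fire the slot if set; an unset slot accepts the link (assumed default). */
static int demo_fire_linkdetected(demo_callbacks *cb, char *link) {
  assert(cb != NULL);
  if (cb->linkdetected.fun == NULL)
    return 1;
  return cb->linkdetected.fun(cb->linkdetected.carg, link);
}
```

Pairing each function pointer with its own carg lets one callback table carry per-hook state without globals.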
/* Generic, type-erased callback slot used where the hook type is opaque. */ /* Generic, type-erased callback slot used where the hook type is opaque. */
@@ -324,18 +322,18 @@ extern const t_hts_htmlcheck_callbacks default_callbacks;
/* Internal helpers for building an HTTP request/response into the engine's /* Internal helpers for building an HTTP request/response into the engine's
scratch buffer (opt->state.HTbuff): START resets it, PRINT appends; the scratch buffer (opt->state.HTbuff): START resets it, PRINT appends; the
PANIC variant records a fatal error message. */ PANIC variant records a fatal error message. */
#define HT_PRINT(A) strcatbuff(opt->state.HTbuff,A); #define HT_PRINT(A) strcatbuff(opt->state.HTbuff, A);
#define HT_REQUEST_START opt->state.HTbuff[0]='\0'; #define HT_REQUEST_START opt->state.HTbuff[0] = '\0';
#define HT_REQUEST_END #define HT_REQUEST_END
#define HTT_REQUEST_START opt->state.HTbuff[0]='\0'; #define HTT_REQUEST_START opt->state.HTbuff[0] = '\0';
#define HTT_REQUEST_END #define HTT_REQUEST_END
#define HTS_REQUEST_START opt->state.HTbuff[0]='\0'; #define HTS_REQUEST_START opt->state.HTbuff[0] = '\0';
#define HTS_REQUEST_END #define HTS_REQUEST_END
#define HTS_PANIC_PRINTF(S) strcpybuff(opt->state._hts_errmsg,S); #define HTS_PANIC_PRINTF(S) strcpybuff(opt->state._hts_errmsg, S);
#endif #endif


@@ -33,43 +33,43 @@ EOF
else else
GET "${url}" GET "${url}"
fi fi
) \ ) |
| grep -E '^<!ENTITY [a-zA-Z0-9_]' \ grep -E '^<!ENTITY [a-zA-Z0-9_]' |
| sed \ sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \ -e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \ -e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/'\ -e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
| ( \ (
read A read -r A
while test -n "$A"; do while test -n "$A"; do
ent="${A%% *}" ent="${A%% *}"
code=$(echo "$A"|cut -f2 -d' ') code=$(echo "$A" | cut -f2 -d' ')
# compute hash # compute hash
hash=0 hash=0
i=0 i=0
a=1664525 a=1664525
c=1013904223 c=1013904223
m="$[1 << 32]" m="$((1 << 32))"
while test "$i" -lt ${#ent}; do while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}"|hexdump -v -e '/1 "%d"')" d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
hash="$[((${hash}*${a})%(${m})+${d}+${c})%(${m})]" hash="$((((hash * a) % (m) + d + c) % (m)))"
i=$[${i}+1] i=$((i + 1))
done done
echo -e " /* $A */" echo -e " /* $A */"
echo -e " case ${hash}u:" echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {" echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};" echo -e " return ${code};"
echo -e " }" echo -e " }"
echo -e " break;" echo -e " break;"
# next # next
read A read -r A
done done
) )
cat <<EOF cat <<EOF
} }
/* unknown */ /* unknown */
return -1; return -1;
} }
EOF EOF
) > ${dest} ) >${dest}
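Review note: the per-character hash loop in the script above is a linear congruential step with a=1664525, c=1013904223 and m=2^32. A minimal C sketch of the same arithmetic follows; `entity_hash` is a hypothetical name, and the sketch assumes ASCII entity names, so each byte is non-negative as in the script's `hexdump` output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hash an entity name the way the generator script does:
   hash = (hash * a + byte + c) mod 2^32, starting from 0. */
static uint32_t entity_hash(const char *ent, size_t len) {
  const uint32_t a = 1664525u;
  const uint32_t c = 1013904223u;
  uint32_t hash = 0;
  size_t i;

  assert(ent != NULL);
  for (i = 0; i < len; i++) {
    /* uint32_t arithmetic wraps modulo 2^32, matching the script's "% m". */
    hash = hash * a + (unsigned char) ent[i] + c;
  }
  return hash;
}
```

The generated `case ${hash}u:` labels only check the length after a hash match, so two names of equal length with the same hash would be indistinguishable; the commented-out `strncmp` in the script is what would close that gap.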


@@ -43,10 +43,10 @@ Please visit our Website: http://www.httrack.com
configure.ac, decoupled from these). VERSION is the display form, VERSIONID configure.ac, decoupled from these). VERSION is the display form, VERSIONID
the dotted numeric form, AFF_VERSION the short form shown in footers, the dotted numeric form, AFF_VERSION the short form shown in footers,
LIB_VERSION the data/cache format generation. */ LIB_VERSION the data/cache format generation. */
#define HTTRACK_VERSION "3.49-8" #define HTTRACK_VERSION "3.49-8"
#define HTTRACK_VERSIONID "3.49.8" #define HTTRACK_VERSIONID "3.49.8"
#define HTTRACK_AFF_VERSION "3.x" #define HTTRACK_AFF_VERSION "3.x"
#define HTTRACK_LIB_VERSION "2.0" #define HTTRACK_LIB_VERSION "2.0"
#ifndef HTS_NOINCLUDES #ifndef HTS_NOINCLUDES
#include <stdio.h> #include <stdio.h>
@@ -71,11 +71,11 @@ Please visit our Website: http://www.httrack.com
varargs starting at arg. */ varargs starting at arg. */
#ifndef HTS_UNUSED #ifndef HTS_UNUSED
#ifdef __GNUC__ #ifdef __GNUC__
#define HTS_UNUSED __attribute__ ((unused)) #define HTS_UNUSED __attribute__((unused))
#define HTS_STATIC static __attribute__ ((unused)) #define HTS_STATIC static __attribute__((unused))
#define HTS_PRINTF_FUN(fmt, arg) __attribute__ ((format (printf, fmt, arg))) #define HTS_PRINTF_FUN(fmt, arg) __attribute__((format(printf, fmt, arg)))
#else #else
#define HTS_UNUSED #define HTS_UNUSED
#define HTS_STATIC static #define HTS_STATIC static
@@ -113,7 +113,7 @@ Please visit our Website: http://www.httrack.com
#ifndef HTS_LONGLONG #ifndef HTS_LONGLONG
#ifdef SIZEOF_LONG_LONG #ifdef SIZEOF_LONG_LONG
#if SIZEOF_LONG_LONG==8 #if SIZEOF_LONG_LONG == 8
#define HTS_LONGLONG 1 #define HTS_LONGLONG 1
#endif #endif
#endif #endif
@@ -204,12 +204,12 @@ Please visit our Website: http://www.httrack.com
#endif #endif
#define HTS_HTTRACKRC ".httrackrc" #define HTS_HTTRACKRC ".httrackrc"
#define HTS_HTTRACKCNF HTS_ETCPATH"/httrack.conf" #define HTS_HTTRACKCNF HTS_ETCPATH "/httrack.conf"
#ifdef DATADIR #ifdef DATADIR
#define HTS_HTTRACKDIR DATADIR"/httrack/" #define HTS_HTTRACKDIR DATADIR "/httrack/"
#else #else
#define HTS_HTTRACKDIR HTS_PREFIX"/share/httrack/" #define HTS_HTTRACKDIR HTS_PREFIX "/share/httrack/"
#endif #endif
#endif #endif
@@ -226,12 +226,17 @@ Please visit our Website: http://www.httrack.com
/* Copyright (C) 1998 Xavier Roche and other contributors */ /* Copyright (C) 1998 Xavier Roche and other contributors */
#define HTTRACK_AFF_AUTHORS "[XR&CO'2014]" #define HTTRACK_AFF_AUTHORS "[XR&CO'2014]"
#define HTS_DEFAULT_FOOTER "<!-- Mirrored from %s%s by HTTrack Website Copier/" HTTRACK_AFF_VERSION " " HTTRACK_AFF_AUTHORS ", %s -->" #define HTS_DEFAULT_FOOTER \
"<!-- Mirrored from %s%s by HTTrack Website Copier/" HTTRACK_AFF_VERSION \
" " HTTRACK_AFF_AUTHORS ", %s -->"
#define HTTRACK_WEB "http://www.httrack.com" #define HTTRACK_WEB "http://www.httrack.com"
#define HTS_UPDATE_WEBSITE "http://www.httrack.com/update.php3?Product=HTTrack&Version=" HTTRACK_VERSIONID "&VersionStr=" HTTRACK_VERSION "&Platform=%d&Language=%s" #define HTS_UPDATE_WEBSITE \
"http://www.httrack.com/" \
"update.php3?Product=HTTrack&Version=" HTTRACK_VERSIONID \
"&VersionStr=" HTTRACK_VERSION "&Platform=%d&Language=%s"
#define H_CRLF "\x0d\x0a" #define H_CRLF "\x0d\x0a"
#define CRLF "\x0d\x0a" #define CRLF "\x0d\x0a"
#ifdef _WIN32 #ifdef _WIN32
#define LF "\x0d\x0a" #define LF "\x0d\x0a"
#else #else
@@ -242,10 +247,19 @@ Please visit our Website: http://www.httrack.com
#define HTS_NOPARAM "(none)" #define HTS_NOPARAM "(none)"
#define HTS_NOPARAM2 "\"(none)\"" #define HTS_NOPARAM2 "\"(none)\""
/* Larger/smaller of two values. Macros: arguments are evaluated twice. */ /* Boolean flag for option fields and API yes/no returns. An enum (not C bool)
#define maximum(A,B) ( (A) > (B) ? (A) : (B) ) so it stays int-sized: option fields keep the httrackp layout/ABI, and a
return type stays compatible with the int it replaces. */
#ifndef HTS_DEF_DEFSTRUCT_hts_boolean
#define HTS_DEF_DEFSTRUCT_hts_boolean
#define minimum(A,B) ( (A) < (B) ? (A) : (B) ) typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean;
#endif
/* Larger/smaller of two values. Macros: arguments are evaluated twice. */
#define maximum(A, B) ((A) > (B) ? (A) : (B))
#define minimum(A, B) ((A) < (B) ? (A) : (B))
/* True when A is a non-NULL, non-empty string. */ /* True when A is a non-NULL, non-empty string. */
#define strnotempty(A) (((A) != NULL && (A)[0] != '\0')) #define strnotempty(A) (((A) != NULL && (A)[0] != '\0'))
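Review note: the "arguments are evaluated twice" caveat on maximum/minimum can be seen with an argument that has a side effect. The helper names below are hypothetical and exist only to show the pitfall and the usual workaround.

```c
#include <assert.h>
#include <stddef.h>

#define maximum(A, B) ((A) > (B) ? (A) : (B))
#define minimum(A, B) ((A) < (B) ? (A) : (B))

/* Pitfall: expands to ((*counter)++ > other ? (*counter)++ : other),
   so the increment runs twice whenever the first argument wins. */
static int max_with_side_effect(int *counter, int other) {
  assert(counter != NULL);
  return maximum((*counter)++, other);
}

/* Workaround: hoist the side effect so it is evaluated exactly once. */
static int max_hoisted(int *counter, int other) {
  int value;

  assert(counter != NULL);
  value = (*counter)++;
  return maximum(value, other);
}
```

Starting from a counter of 5 and comparing against 3, the first helper returns 6 and leaves the counter at 7, while the hoisted version returns 5 and leaves it at 6.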
@@ -270,10 +284,10 @@ Please visit our Website: http://www.httrack.com
#endif #endif
#else #else
/* See <http://gcc.gnu.org/wiki/Visibility> */ /* See <http://gcc.gnu.org/wiki/Visibility> */
#if ( ( defined(__GNUC__) && ( __GNUC__ >= 4 ) ) \ #if ((defined(__GNUC__) && (__GNUC__ >= 4)) || \
|| ( defined(HAVE_VISIBILITY) && HAVE_VISIBILITY ) ) (defined(HAVE_VISIBILITY) && HAVE_VISIBILITY))
#define HTSEXT_API __attribute__ ((visibility ("default"))) #define HTSEXT_API __attribute__((visibility("default")))
#else #else
#define HTSEXT_API #define HTSEXT_API
#endif #endif
@@ -327,8 +341,8 @@ typedef __int64 LLint;
typedef __int64 TStamp; typedef __int64 TStamp;
#define LLintP "%I64d" #define LLintP "%I64d"
#elif (defined(_LP64) || defined(__x86_64__) \ #elif (defined(_LP64) || defined(__x86_64__) || defined(__powerpc64__) || \
|| defined(__powerpc64__) || defined(__64BIT__)) defined(__64BIT__))
typedef long int LLint; typedef long int LLint;
@@ -392,16 +406,17 @@ typedef int T_SOC;
/* Permission bits for created folders and files (mkdir and chmod). /* Permission bits for created folders and files (mkdir and chmod).
PROTECT_FOLDER is owner-only. With HTS_ACCESS set (the default) the ACCESS_ PROTECT_FOLDER is owner-only. With HTS_ACCESS set (the default) the ACCESS_
modes also grant group/other read; otherwise they stay owner-only. */ modes also grant group/other read; otherwise they stay owner-only. */
#define HTS_PROTECT_FOLDER (S_IRUSR|S_IWUSR|S_IXUSR) #define HTS_PROTECT_FOLDER (S_IRUSR | S_IWUSR | S_IXUSR)
#if HTS_ACCESS #if HTS_ACCESS
#define HTS_ACCESS_FILE (S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH) #define HTS_ACCESS_FILE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)
#define HTS_ACCESS_FOLDER (S_IRUSR|S_IWUSR|S_IXUSR|S_IRGRP|S_IXGRP|S_IROTH|S_IXOTH) #define HTS_ACCESS_FOLDER \
(S_IRUSR | S_IWUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH)
#else #else
#define HTS_ACCESS_FILE (S_IRUSR|S_IWUSR) #define HTS_ACCESS_FILE (S_IRUSR | S_IWUSR)
#define HTS_ACCESS_FOLDER (S_IRUSR|S_IWUSR|S_IXUSR) #define HTS_ACCESS_FOLDER (S_IRUSR | S_IWUSR | S_IXUSR)
#endif #endif
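Review note: for readers more used to octal modes, the permission macros above correspond to the values checked below. This is a sketch for POSIX systems, where the S_I* bit values are standardized; the `DEMO_` and `demo_` names are hypothetical and the `shared` flag stands in for the HTS_ACCESS switch.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

#define DEMO_PROTECT_FOLDER (S_IRUSR | S_IWUSR | S_IXUSR)

/* Mode for a created file: group/other read only when sharing is on. */
static mode_t demo_file_mode(int shared) {
  return shared ? (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)
                : (S_IRUSR | S_IWUSR);
}

/* Mode for a created folder: adds the search (x) bit alongside each read bit. */
static mode_t demo_folder_mode(int shared) {
  return shared ? (S_IRUSR | S_IWUSR | S_IXUSR | S_IRGRP | S_IXGRP | S_IROTH |
                   S_IXOTH)
                : (S_IRUSR | S_IWUSR | S_IXUSR);
}

/* Octal equivalents of the four cases plus the owner-only folder. */
static void demo_check_modes(void) {
  assert(DEMO_PROTECT_FOLDER == 0700);
  assert(demo_file_mode(1) == 0644);
  assert(demo_file_mode(0) == 0600);
  assert(demo_folder_mode(1) == 0755);
  assert(demo_folder_mode(0) == 0700);
}
```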
/* Sanity-check that the required preprocessor switches are defined */ /* Sanity-check that the required preprocessor switches are defined */
@@ -419,7 +434,11 @@ typedef int T_SOC;
#endif #endif
/* fflush on stdout */ /* fflush on stdout */
#define io_flush { fflush(stdout); fflush(stdin); } #define io_flush \
{ \
fflush(stdout); \
fflush(stdin); \
}
/* HTSLib */ /* HTSLib */
@@ -439,7 +458,7 @@ typedef int T_SOC;
#ifdef _DEBUG #ifdef _DEBUG
// trace mallocs // trace mallocs
//#define HTS_TRACE_MALLOC // #define HTS_TRACE_MALLOC
#ifdef HTS_TRACE_MALLOC #ifdef HTS_TRACE_MALLOC
typedef unsigned long int t_htsboundary; typedef unsigned long int t_htsboundary;
@@ -516,7 +535,13 @@ static const t_htsboundary htsboundary = 0xDEADBEEF;
#if _HTS_WIDE #if _HTS_WIDE
extern FILE *DEBUG_fp; extern FILE *DEBUG_fp;
#define DEBUG_W(A) { if (DEBUG_fp==NULL) DEBUG_fp=fopen("bug.out","wb"); fprintf(DEBUG_fp,":>"A); fflush(DEBUG_fp); } #define DEBUG_W(A) \
{ \
if (DEBUG_fp == NULL) \
DEBUG_fp = fopen("bug.out", "wb"); \
fprintf(DEBUG_fp, ":>" A); \
fflush(DEBUG_fp); \
}
#undef _ #undef _
#define _ , #define _ ,
#endif #endif


@@ -644,6 +644,165 @@ T_SOC http_fopen(httrackp * opt, const char *adr, const char *fil, htsblk * reto
return http_xfopen(opt, 0, 1, 1, NULL, adr, fil, retour); return http_xfopen(opt, 0, 1, 1, NULL, adr, fil, retour);
} }
// Read a CRLF line from a non-blocking socket (waits up to timeout per recv).
// Returns the line length (0 = empty), or -1 on timeout/EOF/error.
static int proxy_getline(T_SOC soc, char *s, int max, int timeout) {
int j = 0;
for (;;) {
unsigned char ch;
int n;
if (!check_readinput_t(soc, timeout))
return -1; // timed out waiting for data
n = (int) recv(soc, &ch, 1, 0);
if (n == 1) {
if (ch == 13) // CR
continue;
if (ch == 10) // LF: end of line
break;
if (j >= max - 1)
return -1; // line too long: bound the read against a hostile proxy
s[j++] = (char) ch;
} else if (n == 0) {
return -1; // connection closed
} else {
#ifdef _WIN32
if (WSAGetLastError() == WSAEWOULDBLOCK)
continue;
#else
if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
continue;
#endif
return -1;
}
}
s[j] = '\0';
return j;
}
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout) {
const T_SOC soc = retour->soc;
const char *const host = jump_identification_const(adr); // host[:port]
const char *const portsep = jump_toport_const(adr); // ":port" or NULL
char BIGSTK authority[HTS_URLMAXSIZE * 2];
char BIGSTK req[HTS_URLMAXSIZE * 4 + 1100];
char line[1024];
int code;
if (soc == INVALID_SOCKET)
return 0;
// CONNECT needs an explicit host:port; default the https port
authority[0] = '\0';
if (portsep != NULL)
strlcatbuff(authority, host, sizeof(authority)); // already host:port
else
snprintf(authority, sizeof(authority), "%s:%d", host, 443);
// backstop: never let a stray CR/LF in the host smuggle a second line into
// the CONNECT request (the host is already sanitized upstream)
{
const char *c;
for (c = authority; *c != '\0'; c++) {
if ((unsigned char) *c < ' ') {
strcpybuff(retour->msg, "proxy CONNECT: invalid host");
return 0;
}
}
}
snprintf(req, sizeof(req), "CONNECT %s HTTP/1.0" H_CRLF "Host: %s" H_CRLF,
authority, authority);
// creds go on the CONNECT, not the tunneled origin request
if (link_has_authorization(retour->req.proxy.name)) {
const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name);
char autorisation[1100];
char user_pass[256];
autorisation[0] = user_pass[0] = '\0';
strncatbuff(user_pass, astart, (int) (a - astart) - 1);
strcpybuff(user_pass, unescape_http(OPT_GET_BUFF(opt),
OPT_GET_BUFF_SIZE(opt), user_pass));
code64((unsigned char *) user_pass, (int) strlen(user_pass),
(unsigned char *) autorisation, 0);
strlcatbuff(req, "Proxy-Authorization: Basic ", sizeof(req));
strlcatbuff(req, autorisation, sizeof(req));
strlcatbuff(req, H_CRLF, sizeof(req));
}
strlcatbuff(req, H_CRLF, sizeof(req)); // end of request headers
// raw send: ssl is set, so sendc() would route to TLS
{
const char *p = req;
size_t remain = strlen(req);
int stalls = 0;
while (remain > 0) {
const int n = (int) send(soc, p, (int) remain, 0);
if (n > 0) {
p += n;
remain -= (size_t) n;
stalls = 0;
} else {
#ifdef _WIN32
const int wouldblock = (WSAGetLastError() == WSAEWOULDBLOCK);
#else
const int wouldblock =
(errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR);
#endif
// don't spin forever on a fatal error or an unwritable socket
if (!wouldblock || !check_writeinput_t(soc, timeout) ||
++stalls > 100) {
strcpybuff(retour->msg, "proxy CONNECT: write error");
return 0;
}
}
}
}
// proxy status line: "HTTP/1.x <code> ..."
if (proxy_getline(soc, line, sizeof(line), timeout) < 0) {
strcpybuff(retour->msg, "proxy CONNECT: no response");
return 0;
}
if (sscanf(line, "HTTP/%*d.%*d %d", &code) < 1)
code = 0;
if (code < 200 || code >= 300) {
snprintf(retour->msg, sizeof(retour->msg), "proxy CONNECT refused: %s",
strnotempty(line) ? line : "(no status)");
return 0;
}
// drain headers to the blank line; cap the count so a flooding proxy can't
// stall the crawl
{
int headers = 0;
for (;;) {
const int n = proxy_getline(soc, line, sizeof(line), timeout);
if (n < 0) {
strcpybuff(retour->msg, "proxy CONNECT: truncated response");
return 0;
}
if (n == 0)
break; // blank line: tunnel ready
if (++headers > 64) {
strcpybuff(retour->msg, "proxy CONNECT: too many response headers");
return 0;
}
}
}
return 1;
}
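Review note: two steps of http_proxy_tunnel() are easy to exercise without a socket: building the CONNECT authority and accepting the proxy status line. The sketch below mirrors them under stated assumptions. The function names are hypothetical, and the port test is a simplification (a bare `strchr` for ':') rather than the engine's jump_toport_const(), so it does not handle IPv6 literals or credentials in the host.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "host:port" for CONNECT, defaulting to 443.
   Returns 1 on success, 0 on truncation or a control character. */
static int build_authority(const char *host, char *dst, size_t dst_size) {
  const char *c;
  int n;

  assert(host != NULL && dst != NULL && dst_size > 0);
  if (strchr(host, ':') != NULL)
    n = snprintf(dst, dst_size, "%s", host); /* already host:port */
  else
    n = snprintf(dst, dst_size, "%s:%d", host, 443);
  if (n < 0 || (size_t) n >= dst_size)
    return 0;
  for (c = dst; *c != '\0'; c++) {
    if ((unsigned char) *c < ' ')
      return 0; /* a CR/LF here could smuggle a second request line */
  }
  return 1;
}

/* Parse "HTTP/1.x <code> ..." and return the code, or 0 if malformed. */
static int parse_status(const char *line) {
  int code = 0;

  assert(line != NULL);
  if (sscanf(line, "HTTP/%*d.%*d %d", &code) < 1)
    return 0;
  return code;
}

/* The tunnel is usable only on a 2xx answer. */
static int tunnel_accepted(const char *line) {
  const int code = parse_status(line);

  return code >= 200 && code < 300;
}
```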
// open an http connection, send a request // open an http connection, send a request
// mode: 0 GET 1 HEAD [2 POST] // mode: 0 GET 1 HEAD [2 POST]
// treat: process the header? // treat: process the header?
@@ -680,14 +839,14 @@ T_SOC http_xfopen(httrackp * opt, int mode, int treat, int waitconnect,
/* connexion */ /* connexion */
if (retour) { if (retour) {
if ((!(retour->req.proxy.active)) /* no proxy, or proxy not usable here (local file) */
|| ((strcmp(adr, "file://") == 0) if ((!(retour->req.proxy.active)) || (strcmp(adr, "file://") == 0)) {
|| (strncmp(adr, "https://", 8) == 0)
)
) { /* pas de proxy, ou non utilisable ici */
soc = newhttp(opt, adr, retour, -1, waitconnect); soc = newhttp(opt, adr, retour, -1, waitconnect);
} else { } else {
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port, waitconnect); // ouvrir sur le proxy à la place // to the proxy; https tunnels to the origin via CONNECT in back_wait
// (#85)
soc = newhttp(opt, retour->req.proxy.name, retour, retour->req.proxy.port,
waitconnect);
} }
} else { } else {
soc = newhttp(opt, adr, NULL, -1, waitconnect); soc = newhttp(opt, adr, NULL, -1, waitconnect);
@@ -874,6 +1033,50 @@ static void print_buffer(buff_struct*const str, const char *format, ...) {
assertf(str->pos < str->capacity); assertf(str->pos < str->capacity);
} }
/* Append the request "Cookie:" header line for every stored cookie matching
domain/path. RFC 6265 form: bare "name=value" pairs joined by "; ", no
$Version/$Path attributes (those are RFC 2965 syntax that modern servers
reject, issue #151). Returns the number of cookies emitted. */
static int append_cookie_header(buff_struct *bstr, t_cookie *cookie,
const char *domain, const char *path) {
char buffer[8192];
char *b;
int cook = 0;
int max_cookies = 8;
if (cookie == NULL)
return 0;
b = cookie->data;
do {
b = cookie_find(b, "", domain, path); // next matching cookie
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(bstr, "Cookie: ");
cook = 1;
} else
print_buffer(bstr, "; ");
print_buffer(bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(bstr, "=%s", cookie_get(buffer, b, 6));
b = cookie_nextfield(b);
}
} while (b != NULL && max_cookies > 0);
if (cook)
print_buffer(bstr, H_CRLF);
return cook;
}
/* Self-test entry for append_cookie_header(): build the request Cookie line
into dst (always NUL-terminated). Returns the number of cookies emitted. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size) {
buff_struct bstr = {dst, dst_size, 0};
assertf(dst != NULL && dst_size > 0);
dst[0] = '\0';
return append_cookie_header(&bstr, cookie, domain, path);
}
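Review note: the header shape append_cookie_header() emits (bare `name=value` pairs joined by `"; "`, at most eight cookies, CRLF at the end) can be sketched without the engine's cookie store. The `demo_cookie` array below is a hypothetical replacement for t_cookie, and returning -1 on a full buffer is an assumption of this sketch, not the behaviour of print_buffer().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_COOKIES 8 /* same cap as max_cookies in the patch */

typedef struct demo_cookie {
  const char *name;
  const char *value;
} demo_cookie;

/* Write "Cookie: a=1; b=2\r\n" into dst (always NUL-terminated).
   Returns the number of cookies emitted, or -1 if dst is too small. */
static int build_cookie_header(const demo_cookie *cookies, size_t count,
                               char *dst, size_t dst_size) {
  size_t pos = 0;
  size_t i;
  int n;

  assert(dst != NULL && dst_size > 0);
  assert(cookies != NULL || count == 0);
  dst[0] = '\0';
  if (count > DEMO_MAX_COOKIES)
    count = DEMO_MAX_COOKIES;
  for (i = 0; i < count; i++) {
    n = snprintf(dst + pos, dst_size - pos, "%s%s=%s",
                 i == 0 ? "Cookie: " : "; ", cookies[i].name,
                 cookies[i].value);
    if (n < 0 || (size_t) n >= dst_size - pos) {
      dst[0] = '\0';
      return -1;
    }
    pos += (size_t) n;
  }
  if (count > 0) {
    n = snprintf(dst + pos, dst_size - pos, "\r\n");
    if (n < 0 || (size_t) n >= dst_size - pos) {
      dst[0] = '\0';
      return -1;
    }
  }
  return (int) count;
}
```

No `$Version` or `$Path` attributes appear in the output, which is the point of the change for issue #151.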
// send a request // send a request
int http_sendhead(httrackp * opt, t_cookie * cookie, int mode, int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
const char *xsend, const char *adr, const char *fil, const char *xsend, const char *adr, const char *fil,
@@ -999,8 +1202,8 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
if (xsend) if (xsend)
print_buffer(&bstr, "%s", xsend); // éventuelles autres lignes print_buffer(&bstr, "%s", xsend); // éventuelles autres lignes
// tester proxy authentication // for https, auth rides the CONNECT (the tunneled GET would leak it)
if (retour->req.proxy.active) { if (retour->req.proxy.active && strncmp(adr, "https://", 8) != 0) {
if (link_has_authorization(retour->req.proxy.name)) { // et hop, authentification proxy! if (link_has_authorization(retour->req.proxy.name)) { // et hop, authentification proxy!
const char *a = jump_identification_const(retour->req.proxy.name); const char *a = jump_identification_const(retour->req.proxy.name);
const char *astart = jump_protocol_const(retour->req.proxy.name); const char *astart = jump_protocol_const(retour->req.proxy.name);
@@ -1048,34 +1251,9 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
search_tag + strlen(POSTTOK) + 1)))); search_tag + strlen(POSTTOK) + 1))));
} }
} }
// gestion cookies? // send stored cookies matching this host/path
if (cookie) { if (cookie) {
char buffer[8192]; append_cookie_header(&bstr, cookie, jump_identification_const(adr), fil);
char *b = cookie->data;
int cook = 0;
int max_cookies = 8;
do {
b = cookie_find(b, "", jump_identification_const(adr), fil); // prochain cookie satisfaisant aux conditions
if (b != NULL) {
max_cookies--;
if (!cook) {
print_buffer(&bstr, "Cookie: $Version=1; ");
cook = 1;
} else
print_buffer(&bstr, "; ");
print_buffer(&bstr, "%s", cookie_get(buffer, b, 5));
print_buffer(&bstr, "=%s", cookie_get(buffer, b, 6));
print_buffer(&bstr, "; $Path=%s", cookie_get(buffer, b, 2));
b = cookie_nextfield(b);
}
} while(b != NULL && max_cookies > 0);
if (cook) { // on a envoyé un (ou plusieurs) cookie?
print_buffer(&bstr, H_CRLF);
#if DEBUG_COOK
printf("Header:\n%s\n", bstr.buffer);
#endif
}
} }
// handle keep-alive (keep the socket open) // handle keep-alive (keep the socket open)
if (retour->req.http11 && !retour->req.nokeepalive) { if (retour->req.http11 && !retour->req.nokeepalive) {
@@ -1218,6 +1396,8 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
void treatfirstline(htsblk * retour, const char *rcvd) { void treatfirstline(htsblk * retour, const char *rcvd) {
const char *a = rcvd; const char *a = rcvd;
retour->contenttype_given = HTS_FALSE; /* set when a Content-Type is seen */
// example: // example:
// HTTP/1.0 200 OK // HTTP/1.0 200 OK
if (*a) { if (*a) {
@@ -1416,6 +1596,8 @@ void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * ret
strcpybuff(retour->contenttype, tempo); strcpybuff(retour->contenttype, tempo);
else else
strcpybuff(retour->contenttype, "application/octet-stream-unknown"); // erreur strcpybuff(retour->contenttype, "application/octet-stream-unknown"); // erreur
retour->contenttype_given =
HTS_TRUE; /* the server declared a Content-Type */
} }
} else if ((p = strfield(rcvd, "Content-Range:")) != 0) { } else if ((p = strfield(rcvd, "Content-Range:")) != 0) {
// Content-Range: bytes 0-70870/70871 // Content-Range: bytes 0-70870/70871
@@ -1808,6 +1990,24 @@ int check_readinput_t(T_SOC soc, int timeout) {
return 0; return 0;
} }
// wait until the socket is writable, up to timeout seconds
int check_writeinput_t(T_SOC soc, int timeout) {
if (soc != INVALID_SOCKET) {
fd_set fds;
struct timeval tv;
const int isoc = (int) soc;
assertf(isoc == soc);
FD_ZERO(&fds);
FD_SET(isoc, &fds);
tv.tv_sec = timeout;
tv.tv_usec = 0;
select(isoc + 1, NULL, &fds, NULL, &tv);
return FD_ISSET(isoc, &fds) ? 1 : 0;
} else
return 0;
}
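Review note: a standalone POSIX sketch of the same wait-for-writable idea follows. Unlike check_writeinput_t() above, this variant also checks the return value of select(), so an error or timeout is reported as "not writable" instead of relying on the fd_set contents. That difference is a choice made for the sketch, and the function names are hypothetical.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait up to timeout_sec seconds for fd to become writable.
   Returns 1 if writable, 0 on timeout, error, or an unusable fd. */
static int wait_writable(int fd, int timeout_sec) {
  fd_set wfds;
  struct timeval tv;
  int rc;

  assert(timeout_sec >= 0);
  if (fd < 0 || fd >= FD_SETSIZE)
    return 0;
  FD_ZERO(&wfds);
  FD_SET(fd, &wfds);
  tv.tv_sec = timeout_sec;
  tv.tv_usec = 0;
  rc = select(fd + 1, NULL, &wfds, NULL, &tv);
  if (rc <= 0)
    return 0; /* timeout (0) or error (-1) */
  return FD_ISSET(fd, &wfds) ? 1 : 0;
}

/* Demo: the write end of a fresh pipe has buffer space, so it is writable. */
static int demo_pipe_is_writable(void) {
  int fds[2];
  int ok;

  if (pipe(fds) != 0)
    return 0;
  ok = wait_writable(fds[1], 1);
  close(fds[0]);
  close(fds[1]);
  return ok;
}
```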
// same, except that here the maximum amount of data to receive can be chosen // same, except that here the maximum amount of data to receive can be chosen
// IF bufl==0 the buffer is assumed to be 8 KB, and data is received line by line // IF bufl==0 the buffer is assumed to be 8 KB, and data is received line by line
// stripping CRs (e.g. headers), stopping at a double LF // stripping CRs (e.g. headers), stopping at a double LF
@@ -2409,6 +2609,8 @@ int ident_url_absolute(const char *url, lien_adrfil *adrfil) {
for(i = 0; adrfil->fil[i] != '\0'; i++) for(i = 0; adrfil->fil[i] != '\0'; i++)
if (adrfil->fil[i] == '\\') if (adrfil->fil[i] == '\\')
adrfil->fil[i] = '/'; adrfil->fil[i] = '/';
// collapse ../ like the http branch above (path-traversal safety)
fil_simplifie(adrfil->fil);
} }
// no hostname // no hostname
@@ -3646,8 +3848,9 @@ HTSEXT_API char *unescape_http(char *const catbuff, const size_t size, const cha
// DOES NOT DECODE %25 (part of CHAR_DELIM) // DOES NOT DECODE %25 (part of CHAR_DELIM)
// no_high & 1: decode high chars // no_high & 1: decode high chars
// no_high & 2: decode space // no_high & 2: decode space
HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size, HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size,
const char *s, const int no_high) { const char *s,
const hts_boolean no_high) {
size_t i, j; size_t i, j;
RUNTIME_TIME_CHECK_SIZE(size); RUNTIME_TIME_CHECK_SIZE(size);
@@ -3931,8 +4134,8 @@ void hts_replace(char *s, char from, char to) {
// guess a local file's mime type (e.g. fil="toto.gif" -> s="image/gif") // guess a local file's mime type (e.g. fil="toto.gif" -> s="image/gif")
// returns 1 if a type was written to s, 0 otherwise // returns 1 if a type was written to s, 0 otherwise
int guess_httptype_sized(httrackp *opt, char *s, size_t ssize, hts_boolean guess_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil) { const char *fil) {
return get_httptype_sized(opt, s, ssize, fil, 1); return get_httptype_sized(opt, s, ssize, fil, 1);
} }
@@ -3945,8 +4148,8 @@ void guess_httptype(httrackp * opt, char *s, const char *fil) {
// write the mime type for fil into s (capacity ssize) // write the mime type for fil into s (capacity ssize)
// flag: 1 to always return a type (the "application/..." / octet-stream // flag: 1 to always return a type (the "application/..." / octet-stream
// fallback). Returns 1 if a type was written to s, 0 otherwise // fallback). Returns 1 if a type was written to s, 0 otherwise
HTSEXT_API int get_httptype_sized(httrackp *opt, char *s, size_t ssize, HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil, int flag) { const char *fil, hts_boolean flag) {
// userdef overrides get_httptype (a rule with an empty value, e.g. "--assume // userdef overrides get_httptype (a rule with an empty value, e.g. "--assume
// cgi=", matches but writes nothing: report it as "no type" like the old // cgi=", matches but writes nothing: report it as "no type" like the old
// code, whose callers tested strnotempty(s)) // code, whose callers tested strnotempty(s))
@@ -4196,7 +4399,7 @@ HTSEXT_API int is_userknowntype(httrackp * opt, const char *fil) {
// dynamic page? // dynamic page?
// is_dyntype(get_ext("foo.asp")) // is_dyntype(get_ext("foo.asp"))
HTSEXT_API int is_dyntype(const char *fil) { HTSEXT_API hts_boolean is_dyntype(const char *fil) {
int j = 0; int j = 0;
if (!fil) if (!fil)
@@ -4214,7 +4417,7 @@ HTSEXT_API int is_dyntype(const char *fil) {
// critical types that must not be changed, as they are sent by servers that // critical types that must not be changed, as they are sent by servers that
// do not know the type // do not know the type
int may_unknown(httrackp * opt, const char *st) { hts_boolean may_unknown(httrackp *opt, const char *st) {
int j = 0; int j = 0;
// media types // media types
@@ -5236,7 +5439,8 @@ HTSEXT_API int hts_uninit_module(void) {
} }
// legacy. do not use // legacy. do not use
HTSEXT_API int hts_log(httrackp * opt, const char *prefix, const char *msg) { HTSEXT_API hts_boolean hts_log(httrackp *opt, const char *prefix,
const char *msg) {
if (opt->log != NULL) { if (opt->log != NULL) {
fspc(opt, opt->log, prefix); fspc(opt, opt->log, prefix);
fprintf(opt->log, "%s" LF, msg); fprintf(opt->log, "%s" LF, msg);
@@ -5466,9 +5670,10 @@ HTSEXT_API httrackp *hts_create_opt(void) {
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"); "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->referer, ""); StringCopy(opt->referer, "");
StringCopy(opt->from, ""); StringCopy(opt->from, "");
opt->savename_83 = 0; // noms longs par défaut opt->savename_83 = HTS_SAVENAME_83_LONG; // long names by default
opt->savename_type = 0; // avec structure originale opt->savename_type = 0; // avec structure originale
opt->savename_delayed = 2; // hard delayed type (default) opt->savename_delayed =
HTS_SAVENAME_DELAYED_HARD; // always delay the type check (default)
opt->delayed_cached = HTS_TRUE; opt->delayed_cached = HTS_TRUE;
opt->mimehtml = HTS_FALSE; opt->mimehtml = HTS_FALSE;
opt->parsejava = HTSPARSE_DEFAULT; // parser classes opt->parsejava = HTSPARSE_DEFAULT; // parser classes
@@ -5493,7 +5698,7 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->parseall = HTS_TRUE; opt->parseall = HTS_TRUE;
opt->parsedebug = HTS_FALSE; opt->parsedebug = HTS_FALSE;
opt->norecatch = HTS_FALSE; opt->norecatch = HTS_FALSE;
opt->verbosedisplay = 0; // pas d'animation texte opt->verbosedisplay = HTS_VERBOSE_NONE; // no text animation
opt->sizehack = HTS_FALSE; opt->sizehack = HTS_FALSE;
opt->urlhack = HTS_TRUE; opt->urlhack = HTS_TRUE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER); StringCopy(opt->footer, HTS_DEFAULT_FOOTER);


@@ -182,6 +182,11 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode, const char *xsend
const char *adr, const char *fil, const char *adr, const char *fil,
const char *referer_adr, const char *referer_fil, const char *referer_adr, const char *referer_fil,
htsblk * retour); htsblk * retour);
/* Build the request "Cookie:" header line for stored cookies matching
domain/path into dst (NUL-terminated). Exposed for the -#Q self-test;
wraps the same logic http_sendhead() uses. Returns cookies emitted. */
int http_cookie_header_selftest(t_cookie *cookie, const char *domain,
const char *path, char *dst, size_t dst_size);
//int newhttp(char* iadr,char* err=NULL); //int newhttp(char* iadr,char* err=NULL);
T_SOC newhttp(httrackp * opt, const char *iadr, htsblk * retour, int port, T_SOC newhttp(httrackp * opt, const char *iadr, htsblk * retour, int port,
@@ -193,6 +198,17 @@ HTS_INLINE void deletesoc_r(htsblk * r);
htsblk http_test(httrackp * opt, const char *adr, const char *fil, char *loc); htsblk http_test(httrackp * opt, const char *adr, const char *fil, char *loc);
int check_readinput(htsblk * r); int check_readinput(htsblk * r);
int check_readinput_t(T_SOC soc, int timeout); int check_readinput_t(T_SOC soc, int timeout);
int check_writeinput_t(T_SOC soc, int timeout);
/* Open an HTTP CONNECT tunnel through the active proxy for an https request:
`retour->soc` must already be TCP-connected to the proxy, and `adr` is the
origin authority (url_adr, e.g. "https://host:port"). Sends the CONNECT
request (with Proxy-Authorization when the proxy carries credentials) and
reads the proxy's status line, so the caller's TLS handshake then runs
end-to-end with the origin. Blocks up to `timeout` seconds. Returns 1 on a
2xx tunnel, 0 on failure (retour->msg/statuscode set). */
int http_proxy_tunnel(httrackp *opt, htsblk *retour, const char *adr,
int timeout);
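The comment above describes the CONNECT exchange only in prose. As a rough illustration, and not HTTrack's actual implementation, the request text and the 2xx check could look like the sketch below. The names `sketch_build_connect` and `sketch_tunnel_ok` are hypothetical. The sketch assumes `adr` always carries an explicit port and leaves out Proxy-Authorization, socket I/O, and the timeout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a CONNECT request for the origin authority.
   Assumes adr looks like "https://host:port" (port present).
   Returns the request length, or -1 if dst is too small. */
static int sketch_build_connect(char *dst, size_t dst_size, const char *adr) {
  const char *authority = adr;
  int n;

  assert(dst != NULL && adr != NULL);
  /* Strip the scheme prefix so only "host:port" remains. */
  if (strncmp(adr, "https://", 8) == 0)
    authority = adr + 8;
  n = snprintf(dst, dst_size, "CONNECT %s HTTP/1.1\r\nHost: %s\r\n\r\n",
               authority, authority);
  if (n < 0 || (size_t) n >= dst_size)
    return -1; /* formatting error or truncated output */
  return n;
}

/* Hypothetical helper: 1 when the proxy's status line reports a 2xx code,
   0 otherwise (including an unparsable line). */
static int sketch_tunnel_ok(const char *status_line) {
  int major, minor, code;

  assert(status_line != NULL);
  if (sscanf(status_line, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
    return 0;
  return code >= 200 && code < 300;
}
```

Only after a 2xx answer would the caller start the TLS handshake on the same socket, which is why the declared function returns 1 strictly for that case.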
void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * retour, void treathead(t_cookie * cookie, const char *adr, const char *fil, htsblk * retour,
char *rcvd); char *rcvd);
void treatfirstline(htsblk * retour, const char *rcvd); void treatfirstline(htsblk * retour, const char *rcvd);


@@ -69,41 +69,41 @@ typedef struct hash_struct hash_struct;
#define HTS_DEF_FWSTRUCT_htsmoduleStruct #define HTS_DEF_FWSTRUCT_htsmoduleStruct
typedef struct htsmoduleStruct htsmoduleStruct; typedef struct htsmoduleStruct htsmoduleStruct;
#endif #endif
typedef int (*t_htsAddLink) (htsmoduleStruct * str, char *link); typedef int (*t_htsAddLink)(htsmoduleStruct *str, char *link);
/** Per-object context passed to a parser module for one downloaded file. /** Per-object context passed to a parser module for one downloaded file.
Field access classes are noted; engine owns all pointers unless stated. */ Field access classes are noted; engine owns all pointers unless stated. */
struct htsmoduleStruct { struct htsmoduleStruct {
/* Read-only elements */ /* Read-only elements */
const char *filename; /* filename (C:\My Web Sites\...) */ const char *filename; /* filename (C:\My Web Sites\...) */
int size; /* size of filename (should be > 0) */ int size; /* size of filename (should be > 0) */
const char *mime; /* MIME type of the object */ const char *mime; /* MIME type of the object */
const char *url_host; /* incoming hostname (www.foo.com) */ const char *url_host; /* incoming hostname (www.foo.com) */
const char *url_file; /* incoming filename (/bar/bar.gny) */ const char *url_file; /* incoming filename (/bar/bar.gny) */
/* Write-only */ /* Write-only */
const char *wrapper_name; /* name of wrapper (static string) */ const char *wrapper_name; /* name of wrapper (static string) */
char *err_msg; /* if an error occurred, the error message (max. 1KB) */ char *err_msg; /* if an error occurred, the error message (max. 1KB) */
/* Read/Write */ /* Read/Write */
int relativeToHtmlLink; /* set this to 1 if all urls you pass to addLink int relativeToHtmlLink; /* set this to 1 if all urls you pass to addLink
are in fact relative to the html file where your are in fact relative to the html file where your
module was originally */ module was originally */
/* Callbacks */ /* Callbacks */
t_htsAddLink addLink; /* call this function when links are t_htsAddLink addLink; /* call this function when links are
being detected. it if not your responsability to decide being detected. it if not your responsability to
if the engine will keep them, or not. */ decide if the engine will keep them, or not. */
/* Optional */ /* Optional */
char *localLink; /* if non null, the engine will write there the local char *localLink; /* if non null, the engine will write there the local
relative filename of the link added by addLink(), or relative filename of the link added by addLink(), or
the absolute path if the link was refused by the wizard */ the absolute path if the link was refused by the wizard */
int localLinkSize; /* size of the optional buffer */ int localLinkSize; /* size of the optional buffer */
/* User-defined */ /* User-defined */
void *userdef; /* can be used by callback routines void *userdef; /* can be used by callback routines
*/ */
/* The parser httrackp structure (may be used) */ /* The parser httrackp structure (may be used) */
httrackp *opt; httrackp *opt;
@@ -117,7 +117,6 @@ struct htsmoduleStruct {
int *ptr_; int *ptr_;
const char *page_charset_; const char *page_charset_;
/* Internal use - please don't touch */ /* Internal use - please don't touch */
}; };
#ifdef __cplusplus #ifdef __cplusplus
@@ -126,11 +125,11 @@ extern "C" {
/** Module lifecycle hooks. Init/PlugInit return 1 on success, 0 on failure; /** Module lifecycle hooks. Init/PlugInit return 1 on success, 0 on failure;
Exit returns its own status (ignored by the engine). */ Exit returns its own status (ignored by the engine). */
typedef int (*t_htsWrapperInit) (char *fn, char *args); typedef int (*t_htsWrapperInit)(char *fn, char *args);
typedef int (*t_htsWrapperExit) (void); typedef int (*t_htsWrapperExit)(void);
typedef int (*t_htsWrapperPlugInit) (char *args); typedef int (*t_htsWrapperPlugInit)(char *args);
/* Library internal definitions */ /* Library internal definitions */
#ifdef HTS_INTERNAL_BYTECODE #ifdef HTS_INTERNAL_BYTECODE
@@ -138,7 +137,7 @@ typedef int (*t_htsWrapperPlugInit) (char *args);
/** Capabilities string ("-noV6", "-nossl", ...) followed by "+name" for each /** Capabilities string ("-noV6", "-nossl", ...) followed by "+name" for each
loaded module. Returned pointer aliases opt->state.HTbuff; do not free, and loaded module. Returned pointer aliases opt->state.HTbuff; do not free, and
it is overwritten by the next call. */ it is overwritten by the next call. */
HTSEXT_API const char *hts_get_version_info(httrackp * opt); HTSEXT_API const char *hts_get_version_info(httrackp *opt);
/** Static capabilities string set by htspe_init(); valid for the process /** Static capabilities string set by htspe_init(); valid for the process
lifetime, do not free. */ lifetime, do not free. */
@@ -154,7 +153,7 @@ extern void htspe_uninit(void);
/** Run the external-parser callbacks for the object described by str. /** Run the external-parser callbacks for the object described by str.
Returns the parse callback result (>=0) on a handled object, or -1 if no Returns the parse callback result (>=0) on a handled object, or -1 if no
module claimed it or its wrapper_name is blacklisted. */ module claimed it or its wrapper_name is blacklisted. */
extern int hts_parse_externals(htsmoduleStruct * str); extern int hts_parse_externals(htsmoduleStruct *str);
/** Nonzero if IPv6 support was compiled in (== HTS_INET6). */ /** Nonzero if IPv6 support was compiled in (== HTS_INET6). */
extern int V6_is_available; extern int V6_is_available;


@@ -138,6 +138,34 @@ static void cleanEndingSpaceOrDot(char *s) {
} }
} }
/* Should the wire Content-Type override the URL's own extension when naming the
saved file? True when the type is patchable (may_unknown2) and either the URL
extension implies no specific type or the server declared a disagreeing one.
A URL extension mapping to a specific non-HTML type is kept only when the
server sent NO Content-Type (the #267 mangle guard): a typeless .png stays
.png, but a .pdf explicitly served as text/html is named .html. */
static int wire_patches_ext(httrackp *opt, const char *wiremime,
const char *file, int contenttype_given) {
char urlmime[256];
if (may_unknown2(opt, wiremime, file))
return 0; /* type kept verbatim (keep-list / bogus-multiple) */
urlmime[0] = '\0';
/* type implied by the URL extension, only when confidently known (flag 0) */
if (!get_httptype_sized(opt, urlmime, sizeof(urlmime), file, 0))
return 1; /* URL ext implies no known type: trust the wire type */
if (strfield2(wiremime, urlmime))
return 0; /* wire agrees with the ext: keep it (no .htm->.html churn) */
/* wire disagrees with a specific non-HTML URL ext. Keep the ext only when
the server sent NO Content-Type: a missing type is defaulted to text/html
upstream and must not clobber e.g. a .png. An explicitly declared type is
trusted, so a binary-looking URL that really serves HTML (login/error
interstitial, soft-404) is named .html instead of kept as .pdf/.jpg. */
if (!is_hypertext_mime(opt, urlmime, file) && !contenttype_given)
return 0;
return 1;
}
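The decision order in `wire_patches_ext()` may be easier to follow as a standalone sketch. The version below is a simplified illustration, not the real function: `sketch_ext_mime` is a hypothetical stand-in for the engine's MIME lookup with a tiny hard-coded table, the `may_unknown2()` keep-list check is omitted, and types are compared with a plain `strcmp` rather than the engine's own comparison.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the URL-extension lookup: returns the type implied
   by the extension, or NULL when the extension implies no known type. */
static const char *sketch_ext_mime(const char *file) {
  const char *dot = strrchr(file, '.');

  if (dot == NULL)
    return NULL;
  if (strcmp(dot, ".png") == 0)
    return "image/png";
  if (strcmp(dot, ".pdf") == 0)
    return "application/pdf";
  if (strcmp(dot, ".html") == 0 || strcmp(dot, ".htm") == 0)
    return "text/html";
  return NULL;
}

/* Returns 1 when the wire Content-Type should rename the saved file,
   0 when the URL's own extension is kept. */
static int sketch_wire_patches_ext(const char *wiremime, const char *file,
                                   int contenttype_given) {
  const char *urlmime;

  assert(wiremime != NULL && file != NULL);
  urlmime = sketch_ext_mime(file);
  if (urlmime == NULL)
    return 1; /* extension implies nothing: trust the wire type */
  if (strcmp(wiremime, urlmime) == 0)
    return 0; /* wire agrees with the extension: keep it */
  /* Disagreement with a specific non-HTML extension: keep the extension only
     when no Content-Type header was received at all. */
  if (strcmp(urlmime, "text/html") != 0 && !contenttype_given)
    return 0;
  return 1; /* an explicitly declared type wins */
}
```

Under these assumptions, a `.png` URL with no Content-Type header keeps its extension, while a `.pdf` URL explicitly served as `text/html` is renamed, matching the two cases named in the comment.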
// forme le nom du fichier à sauver (save) à partir de fil et adr // forme le nom du fichier à sauver (save) à partir de fil et adr
// système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html) // système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html)
int url_savename(lien_adrfilsave *const afs, int url_savename(lien_adrfilsave *const afs,
@@ -184,10 +212,11 @@ int url_savename(lien_adrfilsave *const afs,
/* 8-3 ? */ /* 8-3 ? */
switch (opt->savename_83) { switch (opt->savename_83) {
case 1: // 8-3 case HTS_SAVENAME_83_DOS: // 8-3
max_char = 8; max_char = 8;
break; break;
case 2: // Level 2 File names may be up to 31 characters. case HTS_SAVENAME_83_ISO9660: // Level 2 File names may be up to 31
// characters.
max_char = 31; max_char = 31;
break; break;
default: default:
@@ -324,7 +353,10 @@ int url_savename(lien_adrfilsave *const afs,
} }
/* replace shtml to html.. */ /* replace shtml to html.. */
if (opt->savename_delayed == 2) /* HARD delays every type, except one the user pinned with --assume: honor it
immediately (ishtml() consults the user type), no delayed name (#56) */
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD &&
!is_userknowntype(opt, fil))
is_html = -1; /* ALWAYS delay type */ is_html = -1; /* ALWAYS delay type */
else else
is_html = ishtml(opt, fil); is_html = ishtml(opt, fil);
@@ -363,7 +395,9 @@ int url_savename(lien_adrfilsave *const afs,
) { ) {
// tester type avec requète HEAD si on ne connait pas le type du fichier // tester type avec requète HEAD si on ne connait pas le type du fichier
if (!((opt->check_type == 1) && (fil[strlen(fil) - 1] == '/'))) // slash doit être html? if (!((opt->check_type == 1) && (fil[strlen(fil) - 1] == '/'))) // slash doit être html?
if (opt->savename_delayed == 2 || (ishtest = ishtml(opt, fil)) < 0) { // on ne sait pas si c'est un html ou un fichier.. if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD ||
(ishtest = ishtml(opt, fil)) <
0) { // unsure whether it's html or a file
// lire dans le cache // lire dans le cache
htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test uniquement htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test uniquement
@@ -377,7 +411,8 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(r.cdispo)) { /* filename given */ if (strnotempty(r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */ ext_chg = 2; /* change filename */
strcpybuff(ext, r.cdispo); strcpybuff(ext, r.cdispo);
} else if (!may_unknown2(opt, r.contenttype, fil)) { // on peut patcher à priori? } else if (wire_patches_ext(opt, r.contenttype, fil,
r.contenttype_given)) {
if (give_mimext(s, sizeof(s), if (give_mimext(s, sizeof(s),
r.contenttype)) { // recognized extension r.contenttype)) { // recognized extension
ext_chg = 1; ext_chg = 1;
@@ -393,11 +428,12 @@ int url_savename(lien_adrfilsave *const afs,
} }
#endif #endif
// //
} else if (opt->savename_delayed != 2 && is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER. } else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD &&
Lookup mimetype not only by extension, is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
but also by filename */ Lookup mimetype not only by extension,
/* Note: "foo.cgi => text/html" means that foo.cgi shall have the text/html MIME file type, but also by filename */
that is, ".html" */ /* Note: "foo.cgi => text/html" means that foo.cgi shall have the
text/html MIME file type, that is, ".html" */
char BIGSTK mime[1024]; char BIGSTK mime[1024];
mime[0] = ext[0] = '\0'; mime[0] = ext[0] = '\0';
@@ -408,16 +444,22 @@ int url_savename(lien_adrfilsave *const afs,
} }
} }
} }
// note: if savename_delayed is enabled, the naming will be temporary (and slightly invalid!) // note: if savename_delayed is enabled, the naming will be temporary
// note: if we are about to stop (opt->state.stop), back_add() will fail later // (and slightly invalid!)
else if (opt->savename_delayed != 0 && !opt->state.stop) { //
// note: if we are about to stop (opt->state.stop), back_add() will
// fail later
else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
!opt->state.stop) {
// Check if the file is ready in backing. We basically take the same logic as later. // Check if the file is ready in backing. We basically take the same logic as later.
// FIXME: we should cleanup and factorize this unholy mess // FIXME: we should cleanup and factorize this unholy mess
if (headers != NULL && headers->status >= 0 && !is_redirect) { if (headers != NULL && headers->status >= 0 && !is_redirect) {
if (strnotempty(headers->r.cdispo)) { /* filename given */ if (strnotempty(headers->r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */ ext_chg = 2; /* change filename */
strcpybuff(ext, headers->r.cdispo); strcpybuff(ext, headers->r.cdispo);
} else if (!may_unknown2(opt, headers->r.contenttype, headers->url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type) } else if (wire_patches_ext(opt, headers->r.contenttype,
headers->url_fil,
headers->r.contenttype_given)) {
char s[16]; char s[16];
if (give_mimext( if (give_mimext(
s, sizeof(s), s, sizeof(s),
@@ -645,7 +687,9 @@ int url_savename(lien_adrfilsave *const afs,
if (strnotempty(back[b].r.cdispo)) { /* filename given */ if (strnotempty(back[b].r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */ ext_chg = 2; /* change filename */
strcpybuff(ext, back[b].r.cdispo); strcpybuff(ext, back[b].r.cdispo);
} else if (!may_unknown2(opt, back[b].r.contenttype, back[b].url_fil)) { // on peut patcher à priori? (pas interdit ou pas de type) } else if (wire_patches_ext(opt, back[b].r.contenttype,
back[b].url_fil,
back[b].r.contenttype_given)) {
if (give_mimext( if (give_mimext(
s, sizeof(s), s, sizeof(s),
back[b].r.contenttype)) { // recognized extension back[b].r.contenttype)) { // recognized extension
@@ -698,7 +742,7 @@ int url_savename(lien_adrfilsave *const afs,
} }
// restaurer // restaurer
opt->state._hts_in_html_parsing = hihp; opt->state._hts_in_html_parsing = hihp;
} // caché? } // caché?
} }
} }
} }
@@ -1190,7 +1234,8 @@ int url_savename(lien_adrfilsave *const afs,
// Not used anymore unless non-delayed types. // Not used anymore unless non-delayed types.
// de même en cas de manque d'extension on en place une de manière forcée.. // de même en cas de manque d'extension on en place une de manière forcée..
// cela évite les /chez/toto et les /chez/toto/index.html incompatibles // cela évite les /chez/toto et les /chez/toto/index.html incompatibles
if (opt->savename_type != -1 && opt->savename_delayed != 2) { if (opt->savename_type != -1 &&
opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD) {
char *a = afs->save + strlen(afs->save) - 1; char *a = afs->save + strlen(afs->save) - 1;
while((a > afs->save) && (*a != '.') && (*a != '/')) while((a > afs->save) && (*a != '.') && (*a != '/'))
@@ -1236,31 +1281,21 @@ int url_savename(lien_adrfilsave *const afs,
size_t i; size_t i;
for(i = 0 ; afs->save[i] != '\0' ; i++) { for(i = 0 ; afs->save[i] != '\0' ; i++) {
unsigned char c = (unsigned char) afs->save[i]; unsigned char c = (unsigned char) afs->save[i];
if (c < 32 // control if (c < 32 // control
|| c == 127 // unwise || c == 127 // unwise
|| c == '~' // unix unwise || c == '~' // unix unwise
|| c == '\\' // windows separator || c == '\\' // windows separator
|| c == ':' // windows forbidden || c == ':' // windows forbidden
|| c == '*' // windows forbidden || c == '*' // windows forbidden
|| c == '?' // windows forbidden || c == '?' // windows forbidden
|| c == '\"' // windows forbidden || c == '\"' // windows forbidden
|| c == '<' // windows forbidden || c == '<' // windows forbidden
|| c == '>' // windows forbidden || c == '>' // windows forbidden
|| c == '|' // windows forbidden || c == '|' // windows forbidden
//|| c == '@' // ? //|| c == '@' // ?
|| || (opt->savename_83 == HTS_SAVENAME_83_ISO9660 // CDROM
( && (c == '-' || c == '=' || c == '+'))) {
opt->savename_83 == 2 // CDROM afs->save[i] = '_';
&&
(
c == '-'
|| c == '='
|| c == '+'
)
)
)
{
afs->save[i] = '_';
} }
} }
} }
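The character filter in the hunk above can be exercised on its own. The following is a minimal sketch of the same replacement rule, assuming a hypothetical helper name `sketch_sanitize` and a plain `iso9660` flag standing in for the `opt->savename_83 == HTS_SAVENAME_83_ISO9660` test.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replace characters that are unsafe in saved file names
   with '_'. When iso9660 is nonzero, '-', '=' and '+' are replaced as well. */
static void sketch_sanitize(char *save, int iso9660) {
  size_t i;

  assert(save != NULL);
  for (i = 0; save[i] != '\0'; i++) {
    unsigned char c = (unsigned char) save[i];

    if (c < 32 || c == 127                      /* control characters */
        || strchr("~\\:*?\"<>|", c) != NULL     /* unix unwise, windows forbidden */
        || (iso9660 && (c == '-' || c == '=' || c == '+'))) {
      save[i] = '_';
    }
  }
}
```

The path separator `/` is deliberately left alone, as in the original condition, since the save name still contains directory components at this point.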
@@ -1521,7 +1556,8 @@ int url_savename(lien_adrfilsave *const afs,
char *a = afs->save + strlen(afs->save) - 1; char *a = afs->save + strlen(afs->save) - 1;
char *b; char *b;
int n = 2; int n = 2;
char collisionSeparator = ((opt->savename_83 != 2) ? '-' : '_'); char collisionSeparator =
((opt->savename_83 != HTS_SAVENAME_83_ISO9660) ? '-' : '_');
tempo[0] = '\0'; tempo[0] = '\0';


@@ -112,10 +112,10 @@ struct SOCaddr {
/** Pointer to the port field (network byte order) for the active family. /** Pointer to the port field (network byte order) for the active family.
Asserts on NULL or an unset/unknown family. */ Asserts on NULL or an unset/unknown family. */
static HTS_INLINE HTS_UNUSED in_port_t* SOCaddr_sinport_(SOCaddr *const addr, static HTS_INLINE HTS_UNUSED in_port_t *
const char *file, const int line) { SOCaddr_sinport_(SOCaddr *const addr, const char *file, const int line) {
assertf_(addr != NULL, file, line); assertf_(addr != NULL, file, line);
switch(addr->m_addr.sa.sa_family) { switch (addr->m_addr.sa.sa_family) {
case AF_INET: case AF_INET:
return &addr->m_addr.in.sin_port; return &addr->m_addr.in.sin_port;
break; break;
@@ -125,7 +125,7 @@ static HTS_INLINE HTS_UNUSED in_port_t* SOCaddr_sinport_(SOCaddr *const addr,
break; break;
#endif #endif
default: default:
assertf_(! "invalid structure", file, line); assertf_(!"invalid structure", file, line);
return 0; return 0;
break; break;
} }
@@ -133,10 +133,11 @@ static HTS_INLINE HTS_UNUSED in_port_t* SOCaddr_sinport_(SOCaddr *const addr,
/** Length of the active sockaddr (sockaddr_in or sockaddr_in6), or 0 if the /** Length of the active sockaddr (sockaddr_in or sockaddr_in6), or 0 if the
family is unset/unknown. The 0 case doubles as the "not valid" test. */ family is unset/unknown. The 0 case doubles as the "not valid" test. */
static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_size_(const SOCaddr*const addr, static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_size_(const SOCaddr *const addr,
const char *file, const int line) { const char *file,
const int line) {
assertf_(addr != NULL, file, line); assertf_(addr != NULL, file, line);
switch(addr->m_addr.sa.sa_family) { switch (addr->m_addr.sa.sa_family) {
case AF_INET: case AF_INET:
return sizeof(addr->m_addr.in); return sizeof(addr->m_addr.in);
break; break;
@@ -152,8 +153,8 @@ static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_size_(const SOCaddr*const addr,
} }
/** Reset to the unset state (family AF_UNSPEC), making the address invalid. */ /** Reset to the unset state (family AF_UNSPEC), making the address invalid. */
static HTS_INLINE HTS_UNUSED void SOCaddr_clear_(SOCaddr*const addr, static HTS_INLINE HTS_UNUSED void
const char *file, const int line) { SOCaddr_clear_(SOCaddr *const addr, const char *file, const int line) {
assertf_(addr != NULL, file, line); assertf_(addr != NULL, file, line);
addr->m_addr.sa.sa_family = AF_UNSPEC; addr->m_addr.sa.sa_family = AF_UNSPEC;
} }
@@ -191,14 +192,16 @@ static HTS_INLINE HTS_UNUSED void SOCaddr_clear_(SOCaddr*const addr,
/** Set the port (host-order argument, stored network-order) on the active /** Set the port (host-order argument, stored network-order) on the active
* family. */ * family. */
#define SOCaddr_initport(server, port) do { \ #define SOCaddr_initport(server, port) \
SOCaddr_sinport(server) = htons((in_port_t) (port)); \ do { \
} while(0) SOCaddr_sinport(server) = htons((in_port_t) (port)); \
} while (0)
/** Initialize as an all-zero IPv4 wildcard (INADDR_ANY) address; returns its /** Initialize as an all-zero IPv4 wildcard (INADDR_ANY) address; returns its
sockaddr length. */ sockaddr length. */
static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_initany_(SOCaddr*const addr, static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_initany_(SOCaddr *const addr,
const char *file, const int line) { const char *file,
const int line) {
assertf_(addr != NULL, file, line); assertf_(addr != NULL, file, line);
memset(&addr->m_addr.in, 0, sizeof(addr->m_addr.in)); memset(&addr->m_addr.in, 0, sizeof(addr->m_addr.in));
addr->m_addr.in.sin_family = AF_INET; addr->m_addr.in.sin_family = AF_INET;
@@ -206,17 +209,20 @@ static HTS_INLINE HTS_UNUSED socklen_t SOCaddr_initany_(SOCaddr*const addr,
} }
/** Initialize server as an IPv4 wildcard (INADDR_ANY) address. */ /** Initialize server as an IPv4 wildcard (INADDR_ANY) address. */
#define SOCaddr_initany(server) do { \ #define SOCaddr_initany(server) \
SOCaddr_initany_(&(server), __FILE__, __LINE__); \ do { \
} while(0) SOCaddr_initany_(&(server), __FILE__, __LINE__); \
} while (0)
/** Populate server from data. data_size selects the source form: a full /** Populate server from data. data_size selects the source form: a full
sockaddr_in / sockaddr_in6, or a raw 4-byte (IPv4) / 16-byte (IPv6) address sockaddr_in / sockaddr_in6, or a raw 4-byte (IPv4) / 16-byte (IPv6) address
with port zeroed. Any other size leaves an AF_INET shell. Returns the with port zeroed. Any other size leaves an AF_INET shell. Returns the
resulting sockaddr length. */ resulting sockaddr length. */
static HTS_UNUSED socklen_t SOCaddr_copyaddr_(SOCaddr*const server, static HTS_UNUSED socklen_t SOCaddr_copyaddr_(SOCaddr *const server,
const void *data, const size_t data_size, const void *data,
const char *file, const int line) { const size_t data_size,
const char *file,
const int line) {
assertf_(server != NULL, file, line); assertf_(server != NULL, file, line);
assertf_(data != NULL, file, line); assertf_(data != NULL, file, line);
@@ -248,32 +254,35 @@ static HTS_UNUSED socklen_t SOCaddr_copyaddr_(SOCaddr*const server,
/** Copy hpaddr (length hpsize) into server, writing the result length into the /** Copy hpaddr (length hpsize) into server, writing the result length into the
lvalue server_len (int). See SOCaddr_copyaddr_ for accepted forms. */ lvalue server_len (int). See SOCaddr_copyaddr_ for accepted forms. */
#define SOCaddr_copyaddr(server, server_len, hpaddr, hpsize) do { \ #define SOCaddr_copyaddr(server, server_len, hpaddr, hpsize) \
server_len = (int) SOCaddr_copyaddr_(&(server), hpaddr, hpsize, __FILE__, __LINE__); \ do { \
} while(0) server_len = (int) SOCaddr_copyaddr_(&(server), hpaddr, hpsize, __FILE__, \
__LINE__); \
} while (0)
/** Like SOCaddr_copyaddr but discards the result length. */ /** Like SOCaddr_copyaddr but discards the result length. */
#define SOCaddr_copyaddr2(server, hpaddr, hpsize) do { \ #define SOCaddr_copyaddr2(server, hpaddr, hpsize) \
(void) SOCaddr_copyaddr_(&(server), hpaddr, hpsize, __FILE__, __LINE__); \ do { \
} while(0) (void) SOCaddr_copyaddr_(&(server), hpaddr, hpsize, __FILE__, __LINE__); \
} while (0)
/** Copy one SOCaddr (src) into another (dest), preserving family and port. */ /** Copy one SOCaddr (src) into another (dest), preserving family and port. */
#define SOCaddr_copy_SOCaddr(dest, src) do { \ #define SOCaddr_copy_SOCaddr(dest, src) \
SOCaddr_copyaddr_(&(dest), &(src).m_addr.sa, SOCaddr_size(src), __FILE__, __LINE__); \ do { \
} while(0) SOCaddr_copyaddr_(&(dest), &(src).m_addr.sa, SOCaddr_size(src), __FILE__, \
__LINE__); \
} while (0)
/** Write the numeric (dotted/colon) host of ss into namebuf (capacity /** Write the numeric (dotted/colon) host of ss into namebuf (capacity
namebuflen), scope id stripped. On failure namebuf becomes "". */ namebuflen), scope id stripped. On failure namebuf becomes "". */
static HTS_UNUSED void SOCaddr_inetntoa_(char *namebuf, size_t namebuflen, static HTS_UNUSED void SOCaddr_inetntoa_(char *namebuf, size_t namebuflen,
SOCaddr *const ss, SOCaddr *const ss, const char *file,
const char *file, const int line) { const int line) {
assertf_(namebuf != NULL, file, line); assertf_(namebuf != NULL, file, line);
assertf_(ss != NULL, file, line); assertf_(ss != NULL, file, line);
if (getnameinfo(&ss->m_addr.sa, sizeof(ss->m_addr), if (getnameinfo(&ss->m_addr.sa, sizeof(ss->m_addr), namebuf, namebuflen, NULL,
namebuf, namebuflen, 0, NI_NUMERICHOST) == 0) {
NULL, 0,
NI_NUMERICHOST) == 0) {
/* remove scope id(s) */ /* remove scope id(s) */
char *const pos = strchr(namebuf, '%'); char *const pos = strchr(namebuf, '%');
if (pos != NULL) { if (pos != NULL) {
@@ -285,11 +294,12 @@ static HTS_UNUSED void SOCaddr_inetntoa_(char *namebuf, size_t namebuflen,
} }
/** Numeric host of ss into namebuf (capacity namebuflen); "" on failure. */ /** Numeric host of ss into namebuf (capacity namebuflen); "" on failure. */
#define SOCaddr_inetntoa(namebuf, namebuflen, ss) \ #define SOCaddr_inetntoa(namebuf, namebuflen, ss) \
SOCaddr_inetntoa_(namebuf, namebuflen, &(ss), __FILE__, __LINE__) SOCaddr_inetntoa_(namebuf, namebuflen, &(ss), __FILE__, __LINE__)
/** Single-char family tag: '1' for IPv4, '2' otherwise (used in the cache). */ /** Single-char family tag: '1' for IPv4, '2' otherwise (used in the cache). */
#define SOCaddr_getproto(ss) ( SOCaddr_size(ss) == sizeof(struct sockaddr_in) ? '1' : '2') #define SOCaddr_getproto(ss) \
(SOCaddr_size(ss) == sizeof(struct sockaddr_in) ? '1' : '2')
/** Length type for socket APIs (getsockname, accept, ...). */ /** Length type for socket APIs (getsockname, accept, ...). */
typedef socklen_t SOClen; typedef socklen_t SOClen;


@@ -72,6 +72,7 @@ typedef struct String String;
#endif #endif
#ifndef HTS_DEF_STRUCT_String #ifndef HTS_DEF_STRUCT_String
#define HTS_DEF_STRUCT_String #define HTS_DEF_STRUCT_String
struct String { struct String {
char *buffer_; char *buffer_;
size_t length_; size_t length_;
@@ -80,7 +81,7 @@ struct String {
#endif #endif
/* Defines */ /* Defines */
#define CATBUFF_SIZE (STRING_SIZE*2*2) #define CATBUFF_SIZE (STRING_SIZE * 2 * 2)
#define STRING_SIZE 2048 #define STRING_SIZE 2048
@@ -108,7 +109,7 @@ struct htsfilters {
}; };
/* User callbacks chain */ /* User callbacks chain */
typedef int (*htscallbacksfncptr) (void); typedef int (*htscallbacksfncptr)(void);
typedef struct htscallbacks htscallbacks; typedef struct htscallbacks htscallbacks;
@@ -179,6 +180,7 @@ typedef struct lien_url lien_url;
#ifndef HTS_DEF_DEFSTRUCT_hts_log_type #ifndef HTS_DEF_DEFSTRUCT_hts_log_type
#define HTS_DEF_DEFSTRUCT_hts_log_type #define HTS_DEF_DEFSTRUCT_hts_log_type
typedef enum hts_log_type { typedef enum hts_log_type {
LOG_PANIC, LOG_PANIC,
LOG_ERROR, LOG_ERROR,
@@ -278,16 +280,17 @@ struct htslibhandles {
/* Javascript parser flags */ /* Javascript parser flags */
typedef enum htsparsejava_flags { typedef enum htsparsejava_flags {
HTSPARSE_NONE = 0, // don't parse HTSPARSE_NONE = 0, // don't parse
HTSPARSE_DEFAULT = 1, // parse default (all) HTSPARSE_DEFAULT = 1, // parse default (all)
HTSPARSE_NO_CLASS = 2, // don't parse .java HTSPARSE_NO_CLASS = 2, // don't parse .java
HTSPARSE_NO_JAVASCRIPT = 4, // don't parse .js HTSPARSE_NO_JAVASCRIPT = 4, // don't parse .js
HTSPARSE_NO_AGGRESSIVE = 8 // don't aggressively parse .js or .java HTSPARSE_NO_AGGRESSIVE = 8 // don't aggressively parse .js or .java
} htsparsejava_flags; } htsparsejava_flags;
/* Link-rewriting style for saved pages (opt->urlmode). */ /* Link-rewriting style for saved pages (opt->urlmode). */
#ifndef HTS_DEF_DEFSTRUCT_hts_urlmode #ifndef HTS_DEF_DEFSTRUCT_hts_urlmode
#define HTS_DEF_DEFSTRUCT_hts_urlmode #define HTS_DEF_DEFSTRUCT_hts_urlmode
typedef enum hts_urlmode { typedef enum hts_urlmode {
HTS_URLMODE_ABSOLUTE = 0, /**< absolute URL (http://host/path) everywhere */ HTS_URLMODE_ABSOLUTE = 0, /**< absolute URL (http://host/path) everywhere */
HTS_URLMODE_ABSOLUTE_FILE = 1, /**< legacy file: form, unused */ HTS_URLMODE_ABSOLUTE_FILE = 1, /**< legacy file: form, unused */
@@ -301,6 +304,7 @@ typedef enum hts_urlmode {
/* Cache policy for updates and retries (opt->cache). */ /* Cache policy for updates and retries (opt->cache). */
#ifndef HTS_DEF_DEFSTRUCT_hts_cachemode #ifndef HTS_DEF_DEFSTRUCT_hts_cachemode
#define HTS_DEF_DEFSTRUCT_hts_cachemode #define HTS_DEF_DEFSTRUCT_hts_cachemode
typedef enum hts_cachemode { typedef enum hts_cachemode {
HTS_CACHE_NONE = 0, /**< no cache */ HTS_CACHE_NONE = 0, /**< no cache */
HTS_CACHE_PRIORITY = 1, /**< cache takes priority over the network */ HTS_CACHE_PRIORITY = 1, /**< cache takes priority over the network */
@@ -311,6 +315,7 @@ typedef enum hts_cachemode {
/* Interactive wizard level (opt->wizard). */ /* Interactive wizard level (opt->wizard). */
#ifndef HTS_DEF_DEFSTRUCT_hts_wizard #ifndef HTS_DEF_DEFSTRUCT_hts_wizard
#define HTS_DEF_DEFSTRUCT_hts_wizard #define HTS_DEF_DEFSTRUCT_hts_wizard
typedef enum hts_wizard { typedef enum hts_wizard {
HTS_WIZARD_NONE = 0, /**< no wizard */ HTS_WIZARD_NONE = 0, /**< no wizard */
HTS_WIZARD_ASK = 1, /**< wizard asks questions */ HTS_WIZARD_ASK = 1, /**< wizard asks questions */
@@ -321,6 +326,7 @@ typedef enum hts_wizard {
/* robots.txt / meta-robots obedience level (opt->robots). */ /* robots.txt / meta-robots obedience level (opt->robots). */
#ifndef HTS_DEF_DEFSTRUCT_hts_robots #ifndef HTS_DEF_DEFSTRUCT_hts_robots
#define HTS_DEF_DEFSTRUCT_hts_robots #define HTS_DEF_DEFSTRUCT_hts_robots
typedef enum hts_robots { typedef enum hts_robots {
HTS_ROBOTS_NEVER = 0, /**< ignore robots rules */ HTS_ROBOTS_NEVER = 0, /**< ignore robots rules */
HTS_ROBOTS_SOMETIMES = 1, /**< partial obedience (default) */ HTS_ROBOTS_SOMETIMES = 1, /**< partial obedience (default) */
@@ -342,24 +348,44 @@ typedef enum hts_seeker {
HTS_SEEKER_UP = 1 << 1 /**< may ascend to parent directories */ HTS_SEEKER_UP = 1 << 1 /**< may ascend to parent directories */
} hts_seeker; } hts_seeker;
/* Link-following scope, stored in the low byte of opt->travel. */ /* opt->travel: link-following scope in the low byte, flags OR'd in above it. */
typedef enum hts_travel_scope { typedef enum hts_travel_scope {
HTS_TRAVEL_SAME_ADDRESS = 0, /**< stay on the same address (host) */ HTS_TRAVEL_SAME_ADDRESS = 0, /**< stay on the same address (host) */
HTS_TRAVEL_SAME_DOMAIN = 1, /**< stay on the same principal domain */ HTS_TRAVEL_SAME_DOMAIN = 1, /**< stay on the same principal domain */
HTS_TRAVEL_SAME_TLD = 2, /**< stay on the same TLD (e.g. .com) */ HTS_TRAVEL_SAME_TLD = 2, /**< stay on the same TLD (e.g. .com) */
HTS_TRAVEL_EVERYWHERE = 7 /**< follow links anywhere on the web */ HTS_TRAVEL_EVERYWHERE = 7, /**< follow links anywhere on the web */
HTS_TRAVEL_TEST_ALL = 1 << 8 /**< also test forbidden URLs (-t) */
} hts_travel_scope; } hts_travel_scope;
/* Flags OR'd into opt->travel above the scope value. */ /* Mask selecting the scope value out of opt->travel. */
#define HTS_TRAVEL_SCOPE_MASK 0xff /**< mask selecting the scope value */ #define HTS_TRAVEL_SCOPE_MASK 0xff
#define HTS_TRAVEL_TEST_ALL (1 << 8) /**< also test forbidden URLs (-t) */
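Because `opt->travel` packs a scope value in its low byte and flags above it, reading it takes a mask for the scope and a bit test for the flag. The sketch below illustrates that layout with hypothetical `SKETCH_` names that mirror the values shown in the diff. It is an illustration of the encoding, not code from HTTrack.

```c
#include <assert.h>

/* Values mirror the enum in the diff; the SKETCH_ names are hypothetical. */
enum {
  SKETCH_SAME_ADDRESS = 0,
  SKETCH_SAME_DOMAIN = 1,
  SKETCH_SAME_TLD = 2,
  SKETCH_EVERYWHERE = 7,
  SKETCH_TEST_ALL = 1 << 8 /* flag stored above the scope byte */
};
#define SKETCH_SCOPE_MASK 0xff

/* Combine a scope value and the optional test-all flag into one int. */
static int sketch_travel_make(int scope, int test_all) {
  assert((scope & ~SKETCH_SCOPE_MASK) == 0); /* scope must fit the low byte */
  return scope | (test_all ? SKETCH_TEST_ALL : 0);
}

/* Extract the scope value, ignoring any flags. */
static int sketch_travel_scope(int travel) {
  return travel & SKETCH_SCOPE_MASK;
}

/* Nonzero when the test-all flag is set. */
static int sketch_travel_tests_all(int travel) {
  return (travel & SKETCH_TEST_ALL) != 0;
}
```

Comparing the raw field against a scope constant would fail as soon as a flag is set, which is the reason for always masking first.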
/* Boolean option flag. An enum (not C bool) so the option fields stay int-sized /* Text progress display detail (opt->verbosedisplay). */
and the httrackp layout/ABI is unchanged. */ typedef enum hts_verbosedisplay {
#ifndef HTS_DEF_DEFSTRUCT_hts_boolean HTS_VERBOSE_NONE = 0, /**< no animated progress display (default) */
#define HTS_DEF_DEFSTRUCT_hts_boolean HTS_VERBOSE_SIMPLE = 1, /**< minimal single-line progress */
typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean; HTS_VERBOSE_FULL = 2 /**< full animated progress */
#endif } hts_verbosedisplay;
/* Delayed file-type resolution policy (opt->savename_delayed). */
typedef enum hts_savename_delayed {
HTS_SAVENAME_DELAYED_NONE = 0, /**< resolve the type immediately */
HTS_SAVENAME_DELAYED_SOFT = 1, /**< delay the type check when unknown */
HTS_SAVENAME_DELAYED_HARD = 2 /**< always delay the type check (default) */
} hts_savename_delayed;
/* Saved-name length layout (opt->savename_83). */
typedef enum hts_savename_83 {
HTS_SAVENAME_83_LONG = 0, /**< long file names (default) */
HTS_SAVENAME_83_DOS = 1, /**< DOS 8.3 names (ISO9660 level 1) */
HTS_SAVENAME_83_ISO9660 = 2 /**< ISO9660 level 2 names (up to 31 chars) */
} hts_savename_83;
/* Host-banning triggers (opt->hostcontrol bitmask). */
typedef enum hts_hostcontrol {
HTS_HOSTCONTROL_BAN_TIMEOUT = 1 << 0, /**< ban a timing-out host */
HTS_HOSTCONTROL_BAN_SLOW = 1 << 1 /**< ban a too-slow host */
} hts_hostcontrol;
#ifndef HTS_DEF_FWSTRUCT_lien_buffers #ifndef HTS_DEF_FWSTRUCT_lien_buffers
#define HTS_DEF_FWSTRUCT_lien_buffers #define HTS_DEF_FWSTRUCT_lien_buffers
@@ -386,101 +412,102 @@ struct httrackp {
/* */ /* */
hts_wizard wizard; /**< interactive wizard level (none/ask/auto) */ hts_wizard wizard; /**< interactive wizard level (none/ask/auto) */
hts_boolean flush; /**< fflush() log files after each write */ hts_boolean flush; /**< fflush() log files after each write */
int travel; /**< link-following scope (same domain, etc.) */ int travel; /**< link-following scope (same domain, etc.) */
int seeker; /**< allowed direction: go up and/or down the tree */ int seeker; /**< allowed direction: go up and/or down the tree */
int depth; /**< maximum recursion depth (-rN) */ int depth; /**< maximum recursion depth (-rN) */
int extdepth; /**< maximum recursion depth outside the start domain */ int extdepth; /**< maximum recursion depth outside the start domain */
hts_urlmode hts_urlmode
urlmode; /**< saved-link rewriting style (relative, absolute, etc.) */ urlmode; /**< saved-link rewriting style (relative, absolute, etc.) */
hts_boolean no_type_change; // do not change file type according to MIME hts_boolean no_type_change; // do not change file type according to MIME
int debug; /**< debug logging level */ hts_log_type debug; /**< debug logging level */
int getmode; /**< what to fetch (HTML, images, ...) bitmask */ int getmode; /**< what to fetch (HTML, images, ...) bitmask */
FILE *log; /**< informational log stream; NULL mutes it */ FILE *log; /**< informational log stream; NULL mutes it */
FILE *errlog; /**< error log stream; NULL mutes it */ FILE *errlog; /**< error log stream; NULL mutes it */
LLint maxsite; /**< max total bytes for the whole mirror */ LLint maxsite; /**< max total bytes for the whole mirror */
LLint maxfile_nonhtml; /**< max bytes per non-HTML file */ LLint maxfile_nonhtml; /**< max bytes per non-HTML file */
LLint maxfile_html; /**< max bytes per HTML file */ LLint maxfile_html; /**< max bytes per HTML file */
int maxsoc; /**< max simultaneous sockets (-cN) */ int maxsoc; /**< max simultaneous sockets (-cN) */
LLint fragment; /**< split site after this many bytes */ LLint fragment; /**< split site after this many bytes */
hts_boolean hts_boolean
nearlink; /**< also fetch images/data adjacent to a page but off-site */ nearlink; /**< also fetch images/data adjacent to a page but off-site */
hts_boolean makeindex; /**< build a top-level index.html */ hts_boolean makeindex; /**< build a top-level index.html */
hts_boolean kindex; /**< build a keyword index */ hts_boolean kindex; /**< build a keyword index */
hts_boolean delete_old; /**< delete locally obsolete files after update */ hts_boolean delete_old; /**< delete locally obsolete files after update */
int timeout; /**< connection timeout in seconds */ int timeout; /**< connection timeout in seconds */
int rateout; /**< minimum transfer rate (bytes/s) before abort */ int rateout; /**< minimum transfer rate (bytes/s) before abort */
int maxtime; /**< max total mirror duration in seconds */ int maxtime; /**< max total mirror duration in seconds */
int maxrate; /**< max transfer rate cap (bytes/s) */ int maxrate; /**< max transfer rate cap (bytes/s) */
float maxconn; /**< max connections per second */ float maxconn; /**< max connections per second */
int waittime; /**< scheduled start time (wall-clock seconds) */ int waittime; /**< scheduled start time (wall-clock seconds) */
hts_cachemode cache; /**< cache generation mode */ hts_cachemode cache; /**< cache generation mode */
// int aff_progress; // progress bar // int aff_progress; // progress bar
hts_boolean shell; /**< driven by a shell over stdin/stdout pipes */ hts_boolean shell; /**< driven by a shell over stdin/stdout pipes */
t_proxy proxy; /**< proxy configuration */ t_proxy proxy; /**< proxy configuration */
int savename_83; /**< force 8.3 (DOS) file names */ hts_savename_83
savename_83; /**< saved-name length layout (long/DOS/ISO9660) */
int savename_type; /**< saved-name layout (original tree, flat, ...) */ int savename_type; /**< saved-name layout (original tree, flat, ...) */
String String
savename_userdef; /**< user-defined name template (e.g. %h%p/%n%q.%t) */ savename_userdef; /**< user-defined name template (e.g. %h%p/%n%q.%t) */
int savename_delayed; // delayed type check hts_savename_delayed savename_delayed; /**< delayed type-check policy */
hts_boolean hts_boolean
delayed_cached; // delayed type check can be cached to speedup updates delayed_cached; // delayed type check can be cached to speedup updates
hts_boolean mimehtml; /**< produce a single MIME/MHTML archive */ hts_boolean mimehtml; /**< produce a single MIME/MHTML archive */
hts_boolean user_agent_send; /**< send a User-Agent header */ hts_boolean user_agent_send; /**< send a User-Agent header */
String user_agent; /**< User-Agent value (e.g. httrack/1.0) */ String user_agent; /**< User-Agent value (e.g. httrack/1.0) */
String referer; /**< Referer value to send */ String referer; /**< Referer value to send */
String from; /**< From value to send */ String from; /**< From value to send */
String path_log; /**< directory for cache and logs */ String path_log; /**< directory for cache and logs */
String path_html; /**< output directory for the mirror */ String path_html; /**< output directory for the mirror */
String path_html_utf8; /**< output directory for the mirror, UTF-8 form */ String path_html_utf8; /**< output directory for the mirror, UTF-8 form */
String path_bin; /**< directory for HTML templates */ String path_bin; /**< directory for HTML templates */
int retry; /**< extra retries on a failed transfer */ int retry; /**< extra retries on a failed transfer */
hts_boolean makestat; /**< maintain a transfer-statistics log */ hts_boolean makestat; /**< maintain a transfer-statistics log */
hts_boolean maketrack; /**< maintain an operations-statistics log */ hts_boolean maketrack; /**< maintain an operations-statistics log */
int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */ int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */
int hostcontrol; /**< drop hosts that are too slow, etc. */ int hostcontrol; /**< ban slow/timing-out hosts; see hts_hostcontrol bits */
hts_boolean errpage; /**< generate an error page on 404 and similar */ hts_boolean errpage; /**< generate an error page on 404 and similar */
hts_boolean hts_boolean
check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves
*/ */
hts_boolean all_in_cache; /**< keep all retrieved data in the cache */ hts_boolean all_in_cache; /**< keep all retrieved data in the cache */
hts_robots robots; /**< robots.txt handling level */ hts_robots robots; /**< robots.txt handling level */
hts_boolean external; /**< render external links as error pages */ hts_boolean external; /**< render external links as error pages */
hts_boolean passprivacy; /**< strip passwords from external links */ hts_boolean passprivacy; /**< strip passwords from external links */
hts_boolean includequery; /**< include the query string in saved names */ hts_boolean includequery; /**< include the query string in saved names */
hts_boolean mirror_first_page; /**< only mirror the links of the first page */ hts_boolean mirror_first_page; /**< only mirror the links of the first page */
String sys_com; /**< system command to run */ String sys_com; /**< system command to run */
hts_boolean sys_com_exec; /**< actually execute sys_com */ hts_boolean sys_com_exec; /**< actually execute sys_com */
hts_boolean accept_cookie; /**< accept and send cookies */ hts_boolean accept_cookie; /**< accept and send cookies */
t_cookie *cookie; /**< cookie store */ t_cookie *cookie; /**< cookie store */
hts_boolean http10; /**< force HTTP/1.0 */ hts_boolean http10; /**< force HTTP/1.0 */
hts_boolean nokeepalive; /**< disable keep-alive */ hts_boolean nokeepalive; /**< disable keep-alive */
hts_boolean nocompression; /**< disable content compression */ hts_boolean nocompression; /**< disable content compression */
hts_boolean sizehack; /**< treat same-size response as "updated" */ hts_boolean sizehack; /**< treat same-size response as "updated" */
hts_boolean urlhack; // force "url normalization" to avoid loops hts_boolean urlhack; // force "url normalization" to avoid loops
hts_boolean tolerant; /**< accept an incorrect Content-Length */ hts_boolean tolerant; /**< accept an incorrect Content-Length */
hts_boolean hts_boolean
parseall; /**< parse aggressively, including unknown tags with links */ parseall; /**< parse aggressively, including unknown tags with links */
hts_boolean parsedebug; /**< parser debug mode */ hts_boolean parsedebug; /**< parser debug mode */
hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */ hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */
int verbosedisplay; /**< animated text progress display */ hts_verbosedisplay verbosedisplay; /**< animated text progress display */
String footer; /**< footer/info line injected into pages */ String footer; /**< footer/info line injected into pages */
int maxcache; /**< in-memory cache backing limit (bytes) */ int maxcache; /**< in-memory cache backing limit (bytes) */
// int maxcache_anticipate; // maximum links to anticipate (upper bound) // int maxcache_anticipate; // maximum links to anticipate (upper bound)
hts_boolean ftp_proxy; /**< use the HTTP proxy for FTP too */ hts_boolean ftp_proxy; /**< use the HTTP proxy for FTP too */
String filelist; /**< file listing URLs to include */ String filelist; /**< file listing URLs to include */
String urllist; /**< file listing filters to include */ String urllist; /**< file listing filters to include */
htsfilters filters; /**< filter pointers (+/-pattern rules) */ htsfilters filters; /**< filter pointers (+/-pattern rules) */
hash_struct *hash; // hash structure hash_struct *hash; // hash structure
lien_url **liens; // links lien_url **liens; // links
int lien_tot; // top index of "links" heap (always out-of-range) int lien_tot; // top index of "links" heap (always out-of-range)
lien_buffers *liensbuf; // links buffers lien_buffers *liensbuf; // links buffers
robots_wizard *robotsptr; // robots ptr robots_wizard *robotsptr; // robots ptr
String lang_iso; /**< Accept-Language value (en, fr, ...) */ String lang_iso; /**< Accept-Language value (en, fr, ...) */
String accept; // Accept: String accept; // Accept:
String headers; // Additional headers String headers; // Additional headers
String mimedefs; // ext1=mimetype1\next2=mimetype2.. String mimedefs; // ext1=mimetype1\next2=mimetype2..
String mod_blacklist; /**< blacklisted modules */ String mod_blacklist; /**< blacklisted modules */
hts_boolean convert_utf8; // filenames UTF-8 conversion (3.46) hts_boolean convert_utf8; // filenames UTF-8 conversion (3.46)
// //
int maxlink; /**< max number of links */ int maxlink; /**< max number of links */
int maxfilter; /**< max number of filters */ int maxfilter; /**< max number of filters */
@@ -566,17 +593,17 @@ typedef struct htsrequest htsrequest;
struct htsrequest { struct htsrequest {
short int user_agent_send; /**< send a User-Agent header */ short int user_agent_send; /**< send a User-Agent header */
short int http11; /**< sign the request as HTTP/1.1 rather than HTTP/1.0 */ short int http11; /**< sign the request as HTTP/1.1 rather than HTTP/1.0 */
short int nokeepalive; /**< disable keep-alive */ short int nokeepalive; /**< disable keep-alive */
short int range_used; /**< a Range header is in use */ short int range_used; /**< a Range header is in use */
short int nocompression; /**< disable compression */ short int nocompression; /**< disable compression */
short int flush_garbage; // recycled short int flush_garbage; // recycled
const char *user_agent; /**< User-Agent value */ const char *user_agent; /**< User-Agent value */
const char *referer; /**< Referer value */ const char *referer; /**< Referer value */
const char *from; /**< From value */ const char *from; /**< From value */
const char *lang_iso; /**< Accept-Language value */ const char *lang_iso; /**< Accept-Language value */
const char *accept; /**< Accept value */ const char *accept; /**< Accept value */
const char *headers; /**< extra request headers */ const char *headers; /**< extra request headers */
htsrequest_proxy proxy; /**< proxy for this request */ htsrequest_proxy proxy; /**< proxy for this request */
}; };
/* Result of a connection / header fetch. */ /* Result of a connection / header fetch. */
@@ -608,8 +635,8 @@ struct htsblk {
short int is_file; /**< 1 if a file descriptor rather than a socket */ short int is_file; /**< 1 if a file descriptor rather than a socket */
T_SOC soc; /**< socket id */ T_SOC soc; /**< socket id */
SOCaddr address; /**< peer IP address */ SOCaddr address; /**< peer IP address */
int address_size; // IP address structure length (unused internally) int address_size; // IP address structure length (unused internally)
FILE *fp; /**< file handle for file:// */ FILE *fp; /**< file handle for file:// */
#if HTS_USEOPENSSL #if HTS_USEOPENSSL
short int ssl; /**< nonzero if this is an SSL connection (https) */ short int ssl; /**< nonzero if this is an SSL connection (https) */
// BIO* ssl_soc; // SSL structure // BIO* ssl_soc; // SSL structure
@@ -624,6 +651,8 @@ struct htsblk {
int debugid; /**< connection debug id */ int debugid; /**< connection debug id */
/* */ /* */
htsrequest req; /**< parameters used for the request */ htsrequest req; /**< parameters used for the request */
/* a Content-Type header was received (else contenttype holds a default) */
hts_boolean contenttype_given;
/*char digest[32+2]; // md5 digest generated by the engine ("" if none) */ /*char digest[32+2]; // md5 digest generated by the engine ("" if none) */
}; };
@@ -691,7 +720,7 @@ struct lien_back {
LLint chunk_blocksize; /**< data size declared by the chunk */ LLint chunk_blocksize; /**< data size declared by the chunk */
LLint compressed_size; /**< compressed size (stats only) */ LLint compressed_size; /**< compressed size (stats only) */
// //
//int links_index; // to access liens[links_index] // int links_index; // to access liens[links_index]
// //
char info[256]; /**< status text, e.g. for FTP */ char info[256]; /**< status text, e.g. for FTP */
int stop_ftp; /**< stop flag for FTP */ int stop_ftp; /**< stop flag for FTP */

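Reviewer sketch (not part of the commit): the header hunk above keeps the scope value in the low byte of `opt->travel` and the `-t` flag in bit 8. The constants are copied from the hunk; the helper names `make_travel`, `travel_scope` and `travel_tests_all` are hypothetical and exist only to illustrate how the mask and the flag would combine.

```c
#include <assert.h>

/* Values as they appear in the header hunk. */
enum {
  HTS_TRAVEL_EVERYWHERE = 7,   /* follow links anywhere on the web */
  HTS_TRAVEL_TEST_ALL = 1 << 8 /* also test forbidden URLs (-t) */
};
#define HTS_TRAVEL_SCOPE_MASK 0xff

/* Hypothetical helper: pack a scope value and the -t flag into one int. */
static int make_travel(int scope, int test_all) {
  /* The scope must fit entirely under the mask. */
  assert((scope & ~HTS_TRAVEL_SCOPE_MASK) == 0);
  return scope | (test_all ? HTS_TRAVEL_TEST_ALL : 0);
}

/* Hypothetical helper: extract the scope value from opt->travel. */
static int travel_scope(int travel) {
  return travel & HTS_TRAVEL_SCOPE_MASK;
}

/* Hypothetical helper: nonzero when the -t flag bit is set. */
static int travel_tests_all(int travel) {
  return (travel & HTS_TRAVEL_TEST_ALL) != 0;
}
```

Because the flag sits above the mask, comparing `opt->travel` directly against a scope constant would fail whenever `-t` is set; masking first avoids that.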

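Reviewer sketch (not part of the commit): the parser hunks that follow add `is_http_method` so that the first argument of `.open(method, url)` is not captured as a link (#218). The standalone version below is an approximation: it replaces HTTrack's `strfield` with a hypothetical `equals_nocase` helper, and the name `is_http_method_sketch` is likewise invented for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strfield(): case-insensitive compare of
   exactly len bytes of s against word. */
static int equals_nocase(const char *s, const char *word, size_t len) {
  size_t i;
  for (i = 0; i < len; i++) {
    if (tolower((unsigned char) s[i]) != tolower((unsigned char) word[i]))
      return 0;
  }
  return 1;
}

/* True if [s, s+len) is exactly one of the listed HTTP method tokens. */
static int is_http_method_sketch(const char *s, size_t len) {
  static const char *const methods[] = {"GET",    "POST", "PUT",
                                        "DELETE", "HEAD", "OPTIONS",
                                        "PATCH",  "TRACE", NULL};
  int i;
  assert(s != NULL);
  for (i = 0; methods[i] != NULL; i++) {
    /* Length check first, so "GETTER" or "GE" never match "GET". */
    if (strlen(methods[i]) == len && equals_nocase(s, methods[i], len))
      return 1;
  }
  return 0;
}
```

The length test matters: a URL such as `getdata.php` starts with a method name but is longer than it, so it is still treated as a link.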
@@ -296,6 +296,48 @@ static const char *html_inline_safe(const char *src, char *dst, size_t size) {
return dst; return dst;
} }
/* Byte before html, or a space sentinel at the buffer start where html[-1]
would underflow; space reads as the word boundary the guards want there. */
static HTS_INLINE char html_prevc(const char *html, const char *start) {
return html > start ? html[-1] : ' ';
}
/* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
argument is a method, not a URL: #218). Case-insensitive. */
static int is_http_method(const char *s, size_t len) {
static const char *const methods[] = {"GET", "POST", "PUT",
"DELETE", "HEAD", "OPTIONS",
"PATCH", "TRACE", NULL};
int i;
for (i = 0; methods[i] != NULL; i++) {
if (strlen(methods[i]) == len && strfield(s, methods[i]) == (int) len)
return 1;
}
return 0;
}
/* Percent-encode '(' and ')' in a link emitted into an unquoted url(...) (CSS
or JS): a literal ')' closes the token early and the UA mis-parses the value
(#163). The UA decodes %28/%29 back to the saved-on-disk name. */
static void escape_url_parens(char *const s, const size_t size) {
char BIGSTK buff[HTS_URLMAXSIZE * 2];
size_t i, j;
for (i = 0, j = 0; s[i] != '\0' && j + 3 < size && j + 3 < sizeof(buff);
i++) {
if (s[i] == '(' || s[i] == ')') {
buff[j++] = '%';
buff[j++] = '2';
buff[j++] = s[i] == '(' ? '8' : '9';
} else {
buff[j++] = s[i];
}
}
buff[j] = '\0';
strlcpybuff(s, buff, size);
}
/* Main parser */ /* Main parser */
int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) { int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
char catbuff[CATBUFF_SIZE]; char catbuff[CATBUFF_SIZE];
@@ -556,7 +598,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (opt->getmode & HTS_GETMODE_HTML) { if (opt->getmode & HTS_GETMODE_HTML) {
p = strfield(html, "title"); p = strfield(html, "title");
if (p) { if (p) {
if (*(html - 1) == '/') if (html_prevc(html, r->adr) == '/')
p = 0; // /title p = 0; // /title
} else { } else {
if (strfield(html, "/html")) if (strfield(html, "/html"))
@@ -1341,6 +1383,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int can_avoid_quotes = 0; int can_avoid_quotes = 0;
char quotes_replacement = '\0'; char quotes_replacement = '\0';
int ensure_not_mime = 0; int ensure_not_mime = 0;
// .open(method,url): reject an HTTP-method first arg (#218)
int ensure_not_method = 0;
// @import: the quoted token is the URL; a trailing
// media/supports/layer condition is not part of it
int is_import = 0;
if (inscript_tag) if (inscript_tag)
expected_end = ";\"\'"; // see a href="javascript:doc.location='foo'" expected_end = ";\"\'"; // see a href="javascript:doc.location='foo'"
@@ -1357,9 +1404,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (!nc) if (!nc)
nc = strfield(html, ":location"); // javascript:location="doc" nc = strfield(html, ":location"); // javascript:location="doc"
if (!nc) { // location="doc" if (!nc) { // location="doc"
if ((nc = strfield(html, "location")) if ((nc = strfield(html, "location")) &&
&& !isspace(*(html - 1)) !isspace(html_prevc(html, r->adr)))
)
nc = 0; nc = 0;
} }
if (!nc) if (!nc)
@@ -1369,6 +1415,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
expected = '('; // parenthesis expected = '('; // parenthesis
expected_end = "),"; // end: comma or parenthesis expected_end = "),"; // end: comma or parenthesis
ensure_not_mime = 1; //* ensure the url is not a mime type */ ensure_not_mime = 1; //* ensure the url is not a mime type */
ensure_not_method = 1; // xhr.open: don't grab method
} }
if (!nc) if (!nc)
if ((nc = strfield(html, ".replace"))) { // window.replace("url") if ((nc = strfield(html, ".replace"))) { // window.replace("url")
@@ -1380,7 +1427,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
expected = '('; // parenthesis expected = '('; // parenthesis
expected_end = ")"; // end: parenthesis expected_end = ")"; // end: parenthesis
} }
if (!nc && (nc = strfield(html, "url")) && (!isalnum(*(html - 1))) && *(html - 1) != '_') { // url(url) if (!nc && (nc = strfield(html, "url")) &&
(!isalnum(html_prevc(html, r->adr))) &&
html_prevc(html, r->adr) != '_') { // url(url)
expected = '('; // parenthesis expected = '('; // parenthesis
expected_end = ")"; // end: parenthesis expected_end = ")"; // end: parenthesis
can_avoid_quotes = 1; can_avoid_quotes = 1;
@@ -1390,6 +1439,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((nc = strfield(html, "import"))) { // import "url" if ((nc = strfield(html, "import"))) { // import "url"
if (is_space(*(html + nc))) { if (is_space(*(html + nc))) {
expected = 0; // no char expected expected = 0; // no char expected
is_import = 1;
} else } else
nc = 0; nc = 0;
} }
@@ -1407,6 +1457,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) { if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) {
const char *b, *c; const char *b, *c;
int ndelim = 1; int ndelim = 1;
int valid_url = 0;
if ((*a == 34) || (*a == '\'')) if ((*a == 34) || (*a == '\''))
a++; a++;
@@ -1421,12 +1472,20 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
b++; b++;
} }
c = b--; c = b--;
c += ndelim; // no closing delimiter here (truncated input):
while(*c == ' ') // Don't scan past the buffer NUL or capture it.
c++; if (*c != '\0') {
if ((strchr(expected_end, *c)) || (*c == '\n') c += ndelim;
|| (*c == '\r')) { while (*c == ' ')
c -= (ndelim + 1); c++;
valid_url =
(strchr(expected_end, *c)) || (*c == '\n') ||
(*c == '\r') ||
(is_import && *(b + 1 + ndelim) == ' ');
}
if (valid_url) {
// URL end = last char (b), not the delimiter
c = b;
if ((int) (c - a + 1)) { if ((int) (c - a + 1)) {
if (ensure_not_mime) { if (ensure_not_mime) {
int i = 0; int i = 0;
@@ -1442,6 +1501,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
i++; i++;
} }
} }
// XHR.open's "GET" etc. is a method, not a URL
if (a != NULL && ensure_not_method &&
is_http_method(a, (size_t) (c - a + 1))) {
a = NULL;
}
// Check for bogus links (Vasiliy) // Check for bogus links (Vasiliy)
if (a != NULL) { if (a != NULL) {
const size_t size = c - a + 1; const size_t size = c - a + 1;
@@ -1485,7 +1549,6 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
} }
} }
} }
} }
} }
} }
@@ -1692,6 +1755,24 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
hts_nodetect[i - hts_nodetect[i -
1]); 1]);
} }
// xmlns / xmlns:prefix declare
// XML namespaces, not resources
// (#191)
else {
const int xl = strfield(
intag_startattr, "xmlns");
const char xc =
intag_startattr[xl];
if (xl &&
(xc == ':' || xc == '=' ||
is_space(xc))) {
url_ok = 0;
hts_log_print(
opt, LOG_DEBUG,
"dirty parsing: xmlns "
"namespace avoided");
}
}
} }
} }
@@ -2967,6 +3048,10 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
/* Never escape high-chars (we don't know the encoding!!) */ /* Never escape high-chars (we don't know the encoding!!) */
inplace_escape_uri_utf(tempo, sizeof(tempo)); inplace_escape_uri_utf(tempo, sizeof(tempo));
// unquoted url() (CSS/JS): keep parens escaped
if (ending_p == ')')
escape_url_parens(tempo, sizeof(tempo));
//if (!no_esc_utf) //if (!no_esc_utf)
// escape_uri(tempo); // escape with %xx // escape_uri(tempo); // escape with %xx
//else { //else {
@@ -3722,7 +3807,8 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
//case -1: can_retry=1; break; //case -1: can_retry=1; break;
case STATUSCODE_TIMEOUT: case STATUSCODE_TIMEOUT:
if (opt->hostcontrol) { // timeout and retries exhausted if (opt->hostcontrol) { // timeout and retries exhausted
if ((opt->hostcontrol & 1) && (heap(ptr)->retry <= 0)) { if ((opt->hostcontrol & HTS_HOSTCONTROL_BAN_TIMEOUT) &&
(heap(ptr)->retry <= 0)) {
hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil()); hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil());
host_ban(opt, ptr, sback, jump_identification_const(urladr())); host_ban(opt, ptr, sback, jump_identification_const(urladr()));
hts_log_print(opt, LOG_DEBUG, hts_log_print(opt, LOG_DEBUG,
@@ -3735,7 +3821,7 @@ int hts_mirror_check_moved(htsmoduleStruct * str,
break; break;
case STATUSCODE_SLOW: case STATUSCODE_SLOW:
if ((opt->hostcontrol) && (heap(ptr)->retry <= 0)) { // too slow if ((opt->hostcontrol) && (heap(ptr)->retry <= 0)) { // too slow
if (opt->hostcontrol & 2) { if (opt->hostcontrol & HTS_HOSTCONTROL_BAN_SLOW) {
hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil()); hts_log_print(opt, LOG_DEBUG, "Link banned: %s%s", urladr(), urlfil());
host_ban(opt, ptr, sback, jump_identification_const(urladr())); host_ban(opt, ptr, sback, jump_identification_const(urladr()));
hts_log_print(opt, LOG_DEBUG, hts_log_print(opt, LOG_DEBUG,
@@ -4261,10 +4347,10 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
char com[256]; char com[256];
linput(stdin, com, 200); linput(stdin, com, 200);
if (opt->verbosedisplay == 2) if (opt->verbosedisplay == HTS_VERBOSE_FULL)
opt->verbosedisplay = 1; opt->verbosedisplay = HTS_VERBOSE_SIMPLE;
else else
opt->verbosedisplay = 2; opt->verbosedisplay = HTS_VERBOSE_FULL;
/* Info for wrappers */ /* Info for wrappers */
hts_log_print(opt, LOG_INFO, "engine: change-options"); hts_log_print(opt, LOG_INFO, "engine: change-options");
RUN_CALLBACK0(opt, chopt); RUN_CALLBACK0(opt, chopt);
@@ -4374,7 +4460,7 @@ int hts_mirror_wait_for_next_file(htsmoduleStruct * str,
printf("%c\x0d", ("/-\\|")[roll]); printf("%c\x0d", ("/-\\|")[roll]);
fflush(stdout); fflush(stdout);
} }
} else if (opt->verbosedisplay == 1) { } else if (opt->verbosedisplay == HTS_VERBOSE_SIMPLE) {
if (b >= 0) { if (b >= 0) {
if (back[b].r.statuscode == HTTP_OK) if (back[b].r.statuscode == HTTP_OK)
printf("%d/%d: %s%s (" LLintP " bytes) - OK\33[K\r", ptr, opt->lien_tot, printf("%d/%d: %s%s (" LLintP " bytes) - OK\33[K\r", ptr, opt->lien_tot,
@@ -4465,8 +4551,8 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
char in_error_msg[32]; char in_error_msg[32];
// resolve unresolved type // resolve unresolved type
if (opt->savename_delayed != 0 && *forbidden_url == 0 && IS_DELAYED_EXT(afs->save) if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
&& !opt->state.stop) { *forbidden_url == 0 && IS_DELAYED_EXT(afs->save) && !opt->state.stop) {
int loops; int loops;
int continue_loop; int continue_loop;
@@ -4850,7 +4936,7 @@ int hts_wait_delayed(htsmoduleStruct * str, lien_adrfilsave *afs,
} }
} }
} // delayed type check ? } // delayed type check ?
ENGINE_SAVE_CONTEXT_BASE(); ENGINE_SAVE_CONTEXT_BASE();

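Reviewer sketch (not part of the commit): the parser diff above adds `escape_url_parens`, which percent-encodes `(` and `)` before a link is written into an unquoted `url(...)` (#163). The variant below is a simplified illustration, not the committed code: it writes into a caller-supplied buffer instead of the `BIGSTK` scratch buffer plus `strlcpybuff`, and the name `escape_parens_sketch` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical standalone variant: copy src into dst (capacity size),
   replacing '(' with %28 and ')' with %29. Stops early instead of
   overflowing when dst is too small. Returns dst. */
static char *escape_parens_sketch(const char *src, char *dst, size_t size) {
  size_t i, j = 0;
  assert(src != NULL && dst != NULL && size > 0);
  /* j + 3 < size leaves room for a 3-byte escape plus the final NUL. */
  for (i = 0; src[i] != '\0' && j + 3 < size; i++) {
    if (src[i] == '(' || src[i] == ')') {
      dst[j++] = '%';
      dst[j++] = '2';
      dst[j++] = src[i] == '(' ? '8' : '9';
    } else {
      dst[j++] = src[i];
    }
  }
  dst[j] = '\0';
  return dst;
}
```

As in the committed helper, the loop bound is conservative: it reserves three bytes for every character, so a plain character near the end of a nearly full buffer is dropped rather than risking a split escape sequence.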

@@ -48,7 +48,7 @@ Please visit our Website: http://www.httrack.com
/** Assert error callback. **/ /** Assert error callback. **/
#ifndef HTS_DEF_FWSTRUCT_htsErrorCallback #ifndef HTS_DEF_FWSTRUCT_htsErrorCallback
#define HTS_DEF_FWSTRUCT_htsErrorCallback #define HTS_DEF_FWSTRUCT_htsErrorCallback
typedef void (*htsErrorCallback) (const char *msg, const char *file, int line); typedef void (*htsErrorCallback)(const char *msg, const char *file, int line);
#ifdef __cplusplus #ifdef __cplusplus
extern "C" { extern "C" {
#endif #endif
@@ -58,12 +58,13 @@ HTSEXT_API htsErrorCallback hts_get_error_callback(void);
#endif #endif
#endif #endif
#define HTSSAFE_ABORT_FUNCTION(A,B,C) do { \ #define HTSSAFE_ABORT_FUNCTION(A, B, C) \
htsErrorCallback callback = hts_get_error_callback(); \ do { \
if (callback != NULL) { \ htsErrorCallback callback = hts_get_error_callback(); \
callback(A,B,C); \ if (callback != NULL) { \
} \ callback(A, B, C); \
} while(0) } \
} while (0)
#endif #endif
@@ -75,7 +76,8 @@ HTSEXT_API htsErrorCallback hts_get_error_callback(void);
/** /**
* Fatal assertion check. * Fatal assertion check.
*/ */
#define assertf__(exp, sexp, file, line) (void) ( (exp) || (abortf_(sexp, file, line), 0) ) #define assertf__(exp, sexp, file, line) \
(void) ((exp) || (abortf_(sexp, file, line), 0))
/** /**
* Fatal assertion check. * Fatal assertion check.
@@ -106,12 +108,13 @@ static HTS_UNUSED void abortf_(const char *exp, const char *file, int line) {
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
/* Note: char[] and const char[] are compatible */ /* Note: char[] and const char[] are compatible */
#define HTS_IS_CHAR_BUFFER(VAR) ( __builtin_types_compatible_p ( typeof (VAR), char[] ) ) #define HTS_IS_CHAR_BUFFER(VAR) \
(__builtin_types_compatible_p(typeof(VAR), char[]))
#else #else
/* Note: a bit lame as char[8] won't be seen. */ /* Note: a bit lame as char[8] won't be seen. */
#define HTS_IS_CHAR_BUFFER(VAR) ( sizeof(VAR) != sizeof(char*) ) #define HTS_IS_CHAR_BUFFER(VAR) (sizeof(VAR) != sizeof(char *))
#endif #endif
#define HTS_IS_NOT_CHAR_BUFFER(VAR) ( ! HTS_IS_CHAR_BUFFER(VAR) ) #define HTS_IS_NOT_CHAR_BUFFER(VAR) (!HTS_IS_CHAR_BUFFER(VAR))
/* Compile-time checks. */ /* Compile-time checks. */
static HTS_UNUSED void htssafe_compile_time_check_(void) { static HTS_UNUSED void htssafe_compile_time_check_(void) {
@@ -201,60 +204,74 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
*/ */
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
#define strncatbuff(A, B, N) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \ #define strncatbuff(A, B, N) \
strncat_safe_(A, sizeof(A), B, \ __builtin_choose_expr( \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \ HTS_IS_CHAR_BUFFER(A), \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__), \ strncat_safe_(A, sizeof(A), B, \
strncatbuff_ptr_((A), (B), (N)) ) HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \
"overflow while appending '" #B "' to '" #A "'", __FILE__, \
__LINE__), \
strncatbuff_ptr_((A), (B), (N)))
#else #else
#define strncatbuff(A, B, N) \ #define strncatbuff(A, B, N) \
( HTS_IS_NOT_CHAR_BUFFER(A) \ (HTS_IS_NOT_CHAR_BUFFER(A) \
? strncat(A, B, N) \ ? strncat(A, B, N) \
: strncat_safe_(A, sizeof(A), B, \ : strncat_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) ) "overflow while appending '" #B "' to '" #A "'", \
__FILE__, __LINE__))
#endif #endif
/** /**
* Append characters of "B" to "A". * Append characters of "B" to "A".
* If "A" is a char[] variable whose size is not sizeof(char*), then the size * If "A" is a char[] variable whose size is not sizeof(char*), then the size
* is assumed to be the capacity of this array. * is assumed to be the capacity of this array.
*/ */
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
#define strcatbuff(A, B) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \ #define strcatbuff(A, B) \
strncat_safe_(A, sizeof(A), B, \ __builtin_choose_expr( \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \ HTS_IS_CHAR_BUFFER(A), \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__), \ strncat_safe_(A, sizeof(A), B, \
strcatbuff_ptr_((A), (B)) ) HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
(size_t) -1, \
"overflow while appending '" #B "' to '" #A "'", __FILE__, \
__LINE__), \
strcatbuff_ptr_((A), (B)))
#else #else
#define strcatbuff(A, B) \ #define strcatbuff(A, B) \
( HTS_IS_NOT_CHAR_BUFFER(A) \ (HTS_IS_NOT_CHAR_BUFFER(A) \
? strcat(A, B) \ ? strcat(A, B) \
: strncat_safe_(A, sizeof(A), B, \ : strncat_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) ) (size_t) -1, \
"overflow while appending '" #B "' to '" #A "'", \
__FILE__, __LINE__))
#endif #endif
/** /**
* Copy characters from "B" to "A". * Copy characters from "B" to "A".
* If "A" is a char[] variable whose size is not sizeof(char*), then the size * If "A" is a char[] variable whose size is not sizeof(char*), then the size
* is assumed to be the capacity of this array. * is assumed to be the capacity of this array.
*/ */
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
#define strcpybuff(A, B) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \ #define strcpybuff(A, B) \
strcpy_safe_(A, sizeof(A), B, \ __builtin_choose_expr( \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \ HTS_IS_CHAR_BUFFER(A), \
"overflow while copying '" #B "' to '"#A"'", __FILE__, __LINE__), \ strcpy_safe_(A, sizeof(A), B, \
strcpybuff_ptr_((A), (B)) ) HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
"overflow while copying '" #B "' to '" #A "'", __FILE__, \
__LINE__), \
strcpybuff_ptr_((A), (B)))
#else #else
#define strcpybuff(A, B) \ #define strcpybuff(A, B) \
( HTS_IS_NOT_CHAR_BUFFER(A) \ (HTS_IS_NOT_CHAR_BUFFER(A) \
? strcpy(A, B) \ ? strcpy(A, B) \
: strcpy_safe_(A, sizeof(A), B, \ : strcpy_safe_(A, sizeof(A), B, \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \ HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
"overflow while copying '" #B "' to '"#A"'", __FILE__, __LINE__) ) "overflow while copying '" #B "' to '" #A "'", __FILE__, \
__LINE__))
#endif #endif
/* /*
@@ -268,10 +285,10 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
/** /**
* Append characters of "B" to "A", "A" having a maximum capacity of "S". * Append characters of "B" to "A", "A" having a maximum capacity of "S".
*/ */
#define strlcatbuff(A, B, S) \ #define strlcatbuff(A, B, S) \
strncat_safe_(A, S, B, \ strncat_safe_(A, S, B, HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \ (size_t) -1, "overflow while appending '" #B "' to '" #A "'", \
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) __FILE__, __LINE__)
/** /**
* Append at most "N" characters of "B" to "A", "A" having a maximum capacity * Append at most "N" characters of "B" to "A", "A" having a maximum capacity
@@ -285,17 +302,18 @@ static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
/** /**
* Copy characters of "B" to "A", "A" having a maximum capacity of "S". * Copy characters of "B" to "A", "A" having a maximum capacity of "S".
*/ */
#define strlcpybuff(A, B, S) \ #define strlcpybuff(A, B, S) \
strcpy_safe_(A, S, B, \ strcpy_safe_(A, S, B, HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \ "overflow while copying '" #B "' to '" #A "'", __FILE__, \
"overflow while copying '" #B "' to '"#A"'", __FILE__, __LINE__) __LINE__)
/** strnlen replacement (autotools). **/ /** strnlen replacement (autotools). **/
#if ( ! defined(_WIN32) && ! defined(HAVE_STRNLEN) ) #if (!defined(_WIN32) && !defined(HAVE_STRNLEN))
static HTS_UNUSED size_t strnlen(const char *s, size_t maxlen) { static HTS_UNUSED size_t strnlen(const char *s, size_t maxlen) {
size_t i; size_t i;
for(i = 0 ; i < maxlen && s[i] != '\0' ; i++) ; for (i = 0; i < maxlen && s[i] != '\0'; i++)
;
return i; return i;
} }
#endif #endif
@@ -304,13 +322,14 @@ static HTS_UNUSED size_t strnlen(const char *s, size_t maxlen) {
Aborts if source is NULL or has no NUL within that capacity. The sentinel Aborts if source is NULL or has no NUL within that capacity. The sentinel
sizeof_source == (size_t)-1 means "capacity unknown", and falls back to the sizeof_source == (size_t)-1 means "capacity unknown", and falls back to the
unbounded strlen (used when the source is a pointer rather than an array). */ unbounded strlen (used when the source is a pointer rather than an array). */
static HTS_INLINE HTS_UNUSED size_t strlen_safe_(const char *source, const size_t sizeof_source, static HTS_INLINE HTS_UNUSED size_t strlen_safe_(const char *source,
const size_t sizeof_source,
const char *file, int line) { const char *file, int line) {
size_t size; size_t size;
assertf_( source != NULL, file, line ); assertf_(source != NULL, file, line);
size = sizeof_source != (size_t) -1 size = sizeof_source != (size_t) -1 ? strnlen(source, sizeof_source)
? strnlen(source, sizeof_source) : strlen(source); : strlen(source);
assertf_( size < sizeof_source, file, line ); assertf_(size < sizeof_source, file, line);
return size; return size;
} }
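
Note on the bounded-length rule documented above: the sketch below restates it as a standalone function, assuming a caller that prefers an error value over an abort. `demo_bounded_len` is a hypothetical name, not part of HTTrack, and it returns `(size_t) -1` where `strlen_safe_` would call `assertf_`.

```c
#include <stddef.h>

/* Hypothetical helper (not HTTrack API): length of s when a NUL lies within
   its first cap bytes; (size_t) -1 when s is NULL or no NUL is found there.
   Passing cap == (size_t) -1 means "capacity unknown": the scan then stops
   only at the NUL, like a plain strlen. */
static size_t demo_bounded_len(const char *s, size_t cap) {
  size_t i;

  if (s == NULL)
    return (size_t) -1;
  /* Same loop shape as the strnlen replacement in the diff. */
  for (i = 0; i < cap && s[i] != '\0'; i++)
    ;
  /* i == cap: the scan ran off the declared capacity without seeing a NUL. */
  return i < cap ? i : (size_t) -1;
}
```

The strict `i < cap` test mirrors `assertf_(size < sizeof_source, ...)`: a source that fills its whole array leaves no room for the terminator and is rejected.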
@@ -319,10 +338,10 @@ static HTS_INLINE HTS_UNUSED size_t strlen_safe_(const char *source, const size_
source's capacity or (size_t)-1 if unknown. Aborts if the result (existing source's capacity or (size_t)-1 if unknown. Aborts if the result (existing
dest length + appended bytes + NUL) would not fit sizeof_dest: this NEVER dest length + appended bytes + NUL) would not fit sizeof_dest: this NEVER
truncates. Always NUL-terminates on success. */ truncates. Always NUL-terminates on success. */
static HTS_INLINE HTS_UNUSED char* strncat_safe_(char *const dest, const size_t sizeof_dest, static HTS_INLINE HTS_UNUSED char *
const char *const source, const size_t sizeof_source, strncat_safe_(char *const dest, const size_t sizeof_dest,
const size_t n, const char *const source, const size_t sizeof_source,
const char *exp, const char *file, int line) { const size_t n, const char *exp, const char *file, int line) {
const size_t source_len = strlen_safe_(source, sizeof_source, file, line); const size_t source_len = strlen_safe_(source, sizeof_source, file, line);
const size_t dest_len = strlen_safe_(dest, sizeof_dest, file, line); const size_t dest_len = strlen_safe_(dest, sizeof_dest, file, line);
/* note: "size_t is an unsigned integral type" ((size_t) -1 is positive) */ /* note: "size_t is an unsigned integral type" ((size_t) -1 is positive) */
@@ -337,12 +356,14 @@ static HTS_INLINE HTS_UNUSED char* strncat_safe_(char *const dest, const size_t
/* Core bounded copy: empties dest then appends all of source via /* Core bounded copy: empties dest then appends all of source via
strncat_safe_. sizeof_dest is dest's total capacity (NUL included). Aborts strncat_safe_. sizeof_dest is dest's total capacity (NUL included). Aborts
(no truncation) if source plus its NUL would not fit. */ (no truncation) if source plus its NUL would not fit. */
static HTS_INLINE HTS_UNUSED char* strcpy_safe_(char *const dest, const size_t sizeof_dest, static HTS_INLINE HTS_UNUSED char *
const char *const source, const size_t sizeof_source, strcpy_safe_(char *const dest, const size_t sizeof_dest,
const char *exp, const char *file, int line) { const char *const source, const size_t sizeof_source,
const char *exp, const char *file, int line) {
assertf_(sizeof_dest != 0, file, line); assertf_(sizeof_dest != 0, file, line);
dest[0] = '\0'; dest[0] = '\0';
return strncat_safe_(dest, sizeof_dest, source, sizeof_source, (size_t) -1, exp, file, line); return strncat_safe_(dest, sizeof_dest, source, sizeof_source, (size_t) -1,
exp, file, line);
} }
/** /**
@@ -360,9 +381,9 @@ static HTS_INLINE HTS_UNUSED char* strcpy_safe_(char *const dest, const size_t s
* htsbuff_ptr(). The buffer is kept NUL-terminated; htsbuff_str() returns it. * htsbuff_ptr(). The buffer is kept NUL-terminated; htsbuff_str() returns it.
*/ */
typedef struct { typedef struct {
char *buf; /* backing buffer (kept NUL-terminated) */ char *buf; /* backing buffer (kept NUL-terminated) */
size_t cap; /* total capacity of buf, including the NUL */ size_t cap; /* total capacity of buf, including the NUL */
size_t len; /* current length, excluding the NUL */ size_t len; /* current length, excluding the NUL */
} htsbuff; } htsbuff;
static HTS_INLINE HTS_UNUSED htsbuff htsbuff_ptr_(char *buf, size_t cap) { static HTS_INLINE HTS_UNUSED htsbuff htsbuff_ptr_(char *buf, size_t cap) {
@@ -384,23 +405,29 @@ static HTS_INLINE HTS_UNUSED htsbuff htsbuff_ptr_(char *buf, size_t cap) {
#if (defined(__GNUC__) && !defined(__cplusplus)) #if (defined(__GNUC__) && !defined(__cplusplus))
/* 0 for an array, a -1 array-size compile error for a pointer. */ /* 0 for an array, a -1 array-size compile error for a pointer. */
#define htsbuff_must_be_array_(A) \ #define htsbuff_must_be_array_(A) \
(sizeof(char[1 - 2 * !!__builtin_types_compatible_p(typeof(A), typeof(&(A)[0]))]) - 1) (sizeof(char[1 - 2 * !!__builtin_types_compatible_p(typeof(A), \
typeof(&(A)[0]))]) - \
1)
#define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR) + htsbuff_must_be_array_(ARR)) #define htsbuff_array(ARR) \
htsbuff_ptr_((ARR), sizeof(ARR) + htsbuff_must_be_array_(ARR))
#else #else
#define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR)) #define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR))
#endif #endif
/** Builder over pointer P of known capacity N (N includes the NUL). */ /** Builder over pointer P of known capacity N (N includes the NUL). */
#define htsbuff_ptr(P, N) htsbuff_ptr_((P), (N)) #define htsbuff_ptr(P, N) htsbuff_ptr_((P), (N))
/** Append at most n characters of s (stopping at its NUL). Aborts on overflow. */ /** Append at most n characters of s (stopping at its NUL). Aborts on overflow.
static HTS_INLINE HTS_UNUSED void htsbuff_catn(htsbuff *b, const char *s, size_t n) { */
static HTS_INLINE HTS_UNUSED void htsbuff_catn(htsbuff *b, const char *s,
size_t n) {
const size_t add = strnlen(s, n); const size_t add = strnlen(s, n);
/* Overflow-safe: keep the (potentially huge) 'add' alone on one side. The /* Overflow-safe: keep the (potentially huge) 'add' alone on one side. The
maintained invariant len < cap makes 'cap - len' >= 1 (no underflow), so maintained invariant len < cap makes 'cap - len' >= 1 (no underflow), so
'add < cap - len' cannot wrap the way 'len + add < cap' could. */ 'add < cap - len' cannot wrap the way 'len + add < cap' could. */
assertf__(add < b->cap - b->len, "htsbuff append overflow", __FILE__, __LINE__); assertf__(add < b->cap - b->len, "htsbuff append overflow", __FILE__,
__LINE__);
memcpy(b->buf + b->len, s, add); memcpy(b->buf + b->len, s, add);
b->len += add; b->len += add;
b->buf[b->len] = '\0'; b->buf[b->len] = '\0';
@@ -433,15 +460,21 @@ static HTS_INLINE HTS_UNUSED const char *htsbuff_str(const htsbuff *b) {
added bounds checking. freet() also NULLs the freed pointer and tolerates added bounds checking. freet() also NULLs the freed pointer and tolerates
NULL. memcpybuff() despite the name is a raw memcpy: the caller owns the NULL. memcpybuff() despite the name is a raw memcpy: the caller owns the
bounds. */ bounds. */
#define malloct(A) malloc(A) #define malloct(A) malloc(A)
#define calloct(A,B) calloc((A), (B)) #define calloct(A, B) calloc((A), (B))
#define freet(A) do { if ((A) != NULL) { free(A); (A) = NULL; } } while(0) #define freet(A) \
do { \
if ((A) != NULL) { \
free(A); \
(A) = NULL; \
} \
} while (0)
#define strdupt(A) strdup(A) #define strdupt(A) strdup(A)
#define realloct(A,B) realloc(A, B) #define realloct(A, B) realloc(A, B)
#define memcpybuff(A, B, N) memcpy((A), (B), (N)) #define memcpybuff(A, B, N) memcpy((A), (B), (N))
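
Note on the `htsbuff_catn` overflow check: the comment in that function explains why the test is written `add < cap - len` rather than `len + add < cap`. A minimal sketch of the same check follows, under the assumption that returning 0 is acceptable where the real code aborts. `demo_buff`, `demo_buff_init` and `demo_buff_catn` are hypothetical stand-ins, not the HTTrack definitions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for htsbuff. Invariant: len < cap and buf[len] is
   always the terminating NUL. */
typedef struct {
  char *buf;
  size_t cap; /* total capacity, NUL included; must be >= 1 */
  size_t len; /* current length, NUL excluded */
} demo_buff;

static demo_buff demo_buff_init(char *buf, size_t cap) {
  demo_buff b;

  b.buf = buf;
  b.cap = cap;
  b.len = 0;
  buf[0] = '\0';
  return b;
}

/* Append at most n characters of s. Returns 1 on success, 0 when the result
   would not fit (the real helper aborts instead of returning). */
static int demo_buff_catn(demo_buff *b, const char *s, size_t n) {
  size_t add = 0;

  while (add < n && s[add] != '\0')
    add++;
  /* 'cap - len' is at least 1 thanks to the invariant, so it cannot
     underflow; 'len + add' could wrap if add were huge. */
  if (add >= b->cap - b->len)
    return 0;
  memcpy(b->buf + b->len, s, add);
  b->len += add;
  b->buf[b->len] = '\0';
  return 1;
}
```

A failed append leaves the buffer and its length untouched, so the invariant still holds for the next call.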


@@ -41,11 +41,11 @@ Please visit our Website: http://www.httrack.com
/* GCC extension */ /* GCC extension */
#ifndef HTS_UNUSED #ifndef HTS_UNUSED
#ifdef __GNUC__ #ifdef __GNUC__
#define HTS_UNUSED __attribute__ ((unused)) #define HTS_UNUSED __attribute__((unused))
#define HTS_STATIC static __attribute__ ((unused)) #define HTS_STATIC static __attribute__((unused))
#define HTS_PRINTF_FUN(fmt, arg) __attribute__ ((format (printf, fmt, arg))) #define HTS_PRINTF_FUN(fmt, arg) __attribute__((format(printf, fmt, arg)))
#else #else
#define HTS_UNUSED #define HTS_UNUSED
#define HTS_STATIC static #define HTS_STATIC static
@@ -60,6 +60,7 @@ typedef struct String String;
#endif #endif
#ifndef HTS_DEF_STRUCT_String #ifndef HTS_DEF_STRUCT_String
#define HTS_DEF_STRUCT_String #define HTS_DEF_STRUCT_String
/** /**
* Growable owned string. * Growable owned string.
* *
@@ -86,7 +87,7 @@ struct String {
/** Allocator **/ /** Allocator **/
#ifndef STRING_REALLOC #ifndef STRING_REALLOC
#define STRING_REALLOC(BUFF, SIZE) ( (char*) realloc(BUFF, SIZE) ) #define STRING_REALLOC(BUFF, SIZE) ((char *) realloc(BUFF, SIZE))
#define STRING_FREE(BUFF) free(BUFF) #define STRING_FREE(BUFF) free(BUFF)
#endif #endif
@@ -96,11 +97,11 @@ struct String {
#endif #endif
/** Initializer for an empty String (NULL buffer). Use to declare or reset. **/ /** Initializer for an empty String (NULL buffer). Use to declare or reset. **/
#define STRING_EMPTY { (char*) NULL, 0, 0 } #define STRING_EMPTY {(char *) NULL, 0, 0}
/** Read-only buffer pointer. NULL until the String has been written to. /** Read-only buffer pointer. NULL until the String has been written to.
Invalidated by any subsequent growing operation. **/ Invalidated by any subsequent growing operation. **/
#define StringBuff(BLK) ( (const char*) ((BLK).buffer_) ) #define StringBuff(BLK) ((const char *) ((BLK).buffer_))
/** Read/write buffer pointer. Same NULL/invalidation rules as StringBuff. **/ /** Read/write buffer pointer. Same NULL/invalidation rules as StringBuff. **/
#define StringBuffRW(BLK) ((BLK).buffer_) #define StringBuffRW(BLK) ((BLK).buffer_)
@@ -109,56 +110,60 @@ struct String {
#define StringLength(BLK) ((BLK).length_) #define StringLength(BLK) ((BLK).length_)
/** Non-zero if the String holds at least one byte. **/ /** Non-zero if the String holds at least one byte. **/
#define StringNotEmpty(BLK) ( StringLength(BLK) > 0 ) #define StringNotEmpty(BLK) (StringLength(BLK) > 0)
/** Allocated capacity in bytes, including room for the terminating NUL. **/ /** Allocated capacity in bytes, including room for the terminating NUL. **/
#define StringCapacity(BLK) ((BLK).capacity_) #define StringCapacity(BLK) ((BLK).capacity_)
/** Byte at POS (read). No bounds check; POS must be < StringLength. **/ /** Byte at POS (read). No bounds check; POS must be < StringLength. **/
#define StringSub(BLK, POS) ( StringBuff(BLK)[POS] ) #define StringSub(BLK, POS) (StringBuff(BLK)[POS])
/** Byte at POS (read/write). No bounds check; POS must be < StringLength. **/ /** Byte at POS (read/write). No bounds check; POS must be < StringLength. **/
#define StringSubRW(BLK, POS) ( StringBuffRW(BLK)[POS] ) #define StringSubRW(BLK, POS) (StringBuffRW(BLK)[POS])
/** Subcharacter (read/write) **/ /** Subcharacter (read/write) **/
#define StringSubRW(BLK, POS) ( StringBuffRW(BLK)[POS] ) #define StringSubRW(BLK, POS) (StringBuffRW(BLK)[POS])
/** Byte POS positions from the end (read). POS==1 is the last byte. **/ /** Byte POS positions from the end (read). POS==1 is the last byte. **/
#define StringRight(BLK, POS) ( StringBuff(BLK)[StringLength(BLK) - POS] ) #define StringRight(BLK, POS) (StringBuff(BLK)[StringLength(BLK) - POS])
/** Byte POS positions from the end (read/write). POS==1 is the last byte. **/ /** Byte POS positions from the end (read/write). POS==1 is the last byte. **/
#define StringRightRW(BLK, POS) ( StringBuffRW(BLK)[StringLength(BLK) - POS] ) #define StringRightRW(BLK, POS) (StringBuffRW(BLK)[StringLength(BLK) - POS])
/** Drop the last byte and re-terminate. Undefined if the String is empty /** Drop the last byte and re-terminate. Undefined if the String is empty
(no length check; would underflow). **/ (no length check; would underflow). **/
#define StringPopRight(BLK) do { \ #define StringPopRight(BLK) \
StringBuffRW(BLK)[--StringLength(BLK)] = '\0'; \ do { \
} while(0) StringBuffRW(BLK)[--StringLength(BLK)] = '\0'; \
} while (0)
/** Grow so capacity_ >= CAPACITY (total bytes, including the NUL). May realloc /** Grow so capacity_ >= CAPACITY (total bytes, including the NUL). May realloc
(invalidating prior buffer pointers); aborts via STRING_ASSERT on OOM. (invalidating prior buffer pointers); aborts via STRING_ASSERT on OOM.
Never shrinks. **/ Never shrinks. **/
#define StringRoomTotal(BLK, CAPACITY) do { \ #define StringRoomTotal(BLK, CAPACITY) \
const size_t capacity_ = (size_t) (CAPACITY); \ do { \
while ((BLK).capacity_ < capacity_) { \ const size_t capacity_ = (size_t) (CAPACITY); \
if ((BLK).capacity_ < 16) { \ while ((BLK).capacity_ < capacity_) { \
(BLK).capacity_ = 16; \ if ((BLK).capacity_ < 16) { \
} else { \ (BLK).capacity_ = 16; \
(BLK).capacity_ *= 2; \ } else { \
} \ (BLK).capacity_ *= 2; \
(BLK).buffer_ = STRING_REALLOC((BLK).buffer_, (BLK).capacity_); \ } \
STRING_ASSERT((BLK).buffer_ != NULL); \ (BLK).buffer_ = STRING_REALLOC((BLK).buffer_, (BLK).capacity_); \
} \ STRING_ASSERT((BLK).buffer_ != NULL); \
} while(0) } \
} while (0)
/** Reserve room for SIZE more bytes beyond the current length (plus the NUL). /** Reserve room for SIZE more bytes beyond the current length (plus the NUL).
May realloc, invalidating prior buffer pointers. **/ May realloc, invalidating prior buffer pointers. **/
#define StringRoom(BLK, SIZE) StringRoomTotal(BLK, StringLength(BLK) + (SIZE) + 1) #define StringRoom(BLK, SIZE) \
StringRoomTotal(BLK, StringLength(BLK) + (SIZE) + 1)
/** Reserve room for SIZE more bytes and return the (post-realloc) RW buffer, /** Reserve room for SIZE more bytes and return the (post-realloc) RW buffer,
for appending in place. Does not update length_; the caller must. **/ for appending in place. Does not update length_; the caller must. **/
#define StringBuffN(BLK, SIZE) StringBuffN_(&(BLK), SIZE) #define StringBuffN(BLK, SIZE) StringBuffN_(&(BLK), SIZE)
HTS_STATIC char *StringBuffN_(String * blk, int size) {
HTS_STATIC char *StringBuffN_(String *blk, int size) {
StringRoom(*blk, size); StringRoom(*blk, size);
return StringBuffRW(*blk); return StringBuffRW(*blk);
} }
@@ -166,40 +171,44 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
/** Zero the fields (NULL buffer, no allocation). Use on an uninitialized /** Zero the fields (NULL buffer, no allocation). Use on an uninitialized
String only; does NOT free an existing buffer (use StringFree to reset String only; does NOT free an existing buffer (use StringFree to reset
an owned one), so calling it on a live String leaks. **/ an owned one), so calling it on a live String leaks. **/
#define StringInit(BLK) do { \ #define StringInit(BLK) \
(BLK).buffer_ = NULL; \ do { \
(BLK).capacity_ = 0; \ (BLK).buffer_ = NULL; \
(BLK).length_ = 0; \ (BLK).capacity_ = 0; \
} while(0) (BLK).length_ = 0; \
} while (0)
/** Truncate to length 0, keeping the allocation. Forces a non-NULL buffer /** Truncate to length 0, keeping the allocation. Forces a non-NULL buffer
(allocates if empty) and writes the leading NUL, so StringBuff is "". **/ (allocates if empty) and writes the leading NUL, so StringBuff is "". **/
#define StringClear(BLK) do { \ #define StringClear(BLK) \
(BLK).length_ = 0; \ do { \
StringRoom(BLK, 0); \ (BLK).length_ = 0; \
(BLK).buffer_[0] = '\0'; \ StringRoom(BLK, 0); \
} while(0) (BLK).buffer_[0] = '\0'; \
} while (0)
/** Set length_ to SIZE, or to strlen(buffer_) if SIZE is negative. Caller /** Set length_ to SIZE, or to strlen(buffer_) if SIZE is negative. Caller
asserts SIZE fits the existing content; does not (re)allocate. **/ asserts SIZE fits the existing content; does not (re)allocate. **/
#define StringSetLength(BLK, SIZE) do { \ #define StringSetLength(BLK, SIZE) \
if (SIZE >= 0) { \ do { \
(BLK).length_ = SIZE; \ if (SIZE >= 0) { \
} else { \ (BLK).length_ = SIZE; \
(BLK).length_ = strlen((BLK).buffer_); \ } else { \
} \ (BLK).length_ = strlen((BLK).buffer_); \
} while(0) } \
} while (0)
/** Release the owned buffer and reset to the empty state (NULL buffer). /** Release the owned buffer and reset to the empty state (NULL buffer).
Idempotent; safe on an already-empty String. **/ Idempotent; safe on an already-empty String. **/
#define StringFree(BLK) do { \ #define StringFree(BLK) \
if ((BLK).buffer_ != NULL) { \ do { \
STRING_FREE((BLK).buffer_); \ if ((BLK).buffer_ != NULL) { \
(BLK).buffer_ = NULL; \ STRING_FREE((BLK).buffer_); \
} \ (BLK).buffer_ = NULL; \
(BLK).capacity_ = 0; \ } \
(BLK).length_ = 0; \ (BLK).capacity_ = 0; \
} while(0) (BLK).length_ = 0; \
} while (0)
/** Take ownership of a NUL-terminated heap string STR (the String will free /** Take ownership of a NUL-terminated heap string STR (the String will free
it). Frees any current buffer first. STR MUST have been allocated by an it). Frees any current buffer first. STR MUST have been allocated by an
@@ -207,48 +216,52 @@ HTS_STATIC char *StringBuffN_(String * blk, int size) {
freed or used by the caller afterwards. length_/capacity_ are set to freed or used by the caller afterwards. length_/capacity_ are set to
strlen(STR) (capacity_ here excludes the NUL, so the next append reallocs). strlen(STR) (capacity_ here excludes the NUL, so the next append reallocs).
**/ **/
#define StringSetBuffer(BLK, STR) do { \ #define StringSetBuffer(BLK, STR) \
size_t len__ = strlen( STR ); \ do { \
StringFree(BLK); \ size_t len__ = strlen(STR); \
(BLK).buffer_ = ( STR ); \ StringFree(BLK); \
(BLK).capacity_ = len__; \ (BLK).buffer_ = (STR); \
(BLK).length_ = len__; \ (BLK).capacity_ = len__; \
} while(0) (BLK).length_ = len__; \
} while (0)
/** Append SIZE raw bytes from STR (NULs allowed as data). Grows as needed and /** Append SIZE raw bytes from STR (NULs allowed as data). Grows as needed and
re-terminates with a NUL after the appended bytes. STR must not alias re-terminates with a NUL after the appended bytes. STR must not alias
BLK's buffer (a realloc would invalidate it). **/ BLK's buffer (a realloc would invalidate it). **/
#define StringMemcat(BLK, STR, SIZE) do { \ #define StringMemcat(BLK, STR, SIZE) \
const char* str_mc_ = (STR); \ do { \
const size_t size_mc_ = (size_t) (SIZE); \ const char *str_mc_ = (STR); \
StringRoom(BLK, size_mc_); \ const size_t size_mc_ = (size_t) (SIZE); \
if (size_mc_ > 0) { \ StringRoom(BLK, size_mc_); \
memcpy((BLK).buffer_ + (BLK).length_, str_mc_, size_mc_); \ if (size_mc_ > 0) { \
(BLK).length_ += size_mc_; \ memcpy((BLK).buffer_ + (BLK).length_, str_mc_, size_mc_); \
} \ (BLK).length_ += size_mc_; \
*((BLK).buffer_ + (BLK).length_) = '\0'; \ } \
} while(0) *((BLK).buffer_ + (BLK).length_) = '\0'; \
} while (0)
/** Replace content with SIZE raw bytes from STR (NULs allowed as data). /** Replace content with SIZE raw bytes from STR (NULs allowed as data).
Same non-aliasing requirement as StringMemcat. **/ Same non-aliasing requirement as StringMemcat. **/
#define StringMemcpy(BLK, STR, SIZE) do { \ #define StringMemcpy(BLK, STR, SIZE) \
(BLK).length_ = 0; \ do { \
StringMemcat(BLK, STR, SIZE); \ (BLK).length_ = 0; \
} while(0) StringMemcat(BLK, STR, SIZE); \
} while (0)
/** Append one byte and re-terminate. Grows as needed. **/ /** Append one byte and re-terminate. Grows as needed. **/
#define StringAddchar(BLK, c) do { \ #define StringAddchar(BLK, c) \
String * const s__ = &(BLK); \ do { \
char c__ = (c); \ String *const s__ = &(BLK); \
StringRoom(*s__, 1); \ char c__ = (c); \
StringBuffRW(*s__)[StringLength(*s__)++] = c__; \ StringRoom(*s__, 1); \
StringBuffRW(*s__)[StringLength(*s__) ] = 0; \ StringBuffRW(*s__)[StringLength(*s__)++] = c__; \
} while(0) StringBuffRW(*s__)[StringLength(*s__)] = 0; \
} while (0)
/** Hand the buffer to the caller and reset the String to empty (NULL buffer). /** Hand the buffer to the caller and reset the String to empty (NULL buffer).
The returned pointer is now owned by the caller, who must STRING_FREE() it. The returned pointer is now owned by the caller, who must STRING_FREE() it.
Returns NULL if the String was empty. **/ Returns NULL if the String was empty. **/
HTS_STATIC char *StringAcquire(String * blk) { HTS_STATIC char *StringAcquire(String *blk) {
char *buff = StringBuffRW(*blk); char *buff = StringBuffRW(*blk);
StringBuffRW(*blk) = NULL; StringBuffRW(*blk) = NULL;
@@ -259,7 +272,7 @@ HTS_STATIC char *StringAcquire(String * blk) {
/** Return an independent deep copy of *src (its own allocation). The caller /** Return an independent deep copy of *src (its own allocation). The caller
owns the result and must StringFree it. **/ owns the result and must StringFree it. **/
HTS_STATIC String StringDup(const String * src) { HTS_STATIC String StringDup(const String *src) {
String s = STRING_EMPTY; String s = STRING_EMPTY;
StringMemcat(s, StringBuff(*src), StringLength(*src)); StringMemcat(s, StringBuff(*src), StringLength(*src));
@@ -270,7 +283,7 @@ HTS_STATIC String StringDup(const String * src) {
ownership transfers and the caller keeps no dangling alias. Frees any ownership transfers and the caller keeps no dangling alias. Frees any
current buffer first. *str MUST be allocator-compatible (see current buffer first. *str MUST be allocator-compatible (see
StringSetBuffer). No-op if str or *str is NULL. **/ StringSetBuffer). No-op if str or *str is NULL. **/
HTS_STATIC void StringAttach(String * blk, char **str) { HTS_STATIC void StringAttach(String *blk, char **str) {
StringFree(*blk); StringFree(*blk);
if (str != NULL && *str != NULL) { if (str != NULL && *str != NULL) {
StringBuffRW(*blk) = *str; StringBuffRW(*blk) = *str;
@@ -281,43 +294,46 @@ HTS_STATIC void StringAttach(String * blk, char **str) {
/** Append the C string STR (up to its NUL). No-op if STR is NULL. STR must not /** Append the C string STR (up to its NUL). No-op if STR is NULL. STR must not
alias BLK's buffer. **/ alias BLK's buffer. **/
#define StringCat(BLK, STR) do { \ #define StringCat(BLK, STR) \
const char *const str__ = ( STR ); \ do { \
if (str__ != NULL) { \ const char *const str__ = (STR); \
const size_t size__ = strlen(str__); \ if (str__ != NULL) { \
StringMemcat(BLK, str__, size__); \ const size_t size__ = strlen(str__); \
} \ StringMemcat(BLK, str__, size__); \
} while(0) } \
} while (0)
/** Append at most SIZE leading bytes of the C string STR. No-op if STR is /** Append at most SIZE leading bytes of the C string STR. No-op if STR is
NULL. STR must not alias BLK's buffer. **/ NULL. STR must not alias BLK's buffer. **/
#define StringCatN(BLK, STR, SIZE) do { \ #define StringCatN(BLK, STR, SIZE) \
const char *str__ = ( STR ); \ do { \
if (str__ != NULL) { \ const char *str__ = (STR); \
size_t size__ = strlen(str__); \ if (str__ != NULL) { \
if (size__ > (SIZE)) { \ size_t size__ = strlen(str__); \
size__ = (SIZE); \ if (size__ > (SIZE)) { \
} \ size__ = (SIZE); \
StringMemcat(BLK, str__, size__); \ } \
} \ StringMemcat(BLK, str__, size__); \
} while(0) } \
} while (0)
/** Replace content with at most SIZE leading bytes of the C string STR. /** Replace content with at most SIZE leading bytes of the C string STR.
If STR is NULL, clears to "". STR must not alias BLK's buffer. **/ If STR is NULL, clears to "". STR must not alias BLK's buffer. **/
#define StringCopyN(BLK, STR, SIZE) do { \ #define StringCopyN(BLK, STR, SIZE) \
const char *str__ = ( STR ); \ do { \
const size_t usize__ = (SIZE); \ const char *str__ = (STR); \
(BLK).length_ = 0; \ const size_t usize__ = (SIZE); \
if (str__ != NULL) { \ (BLK).length_ = 0; \
size_t size__ = strlen(str__); \ if (str__ != NULL) { \
if (size__ > usize__ ) { \ size_t size__ = strlen(str__); \
size__ = usize__; \ if (size__ > usize__) { \
} \ size__ = usize__; \
StringMemcat(BLK, str__, size__); \ } \
} else { \ StringMemcat(BLK, str__, size__); \
StringClear(BLK); \ } else { \
} \ StringClear(BLK); \
} while(0) } \
} while (0)
/** Replace blk's content with a copy of String blk2. blk and blk2 must be /** Replace blk's content with a copy of String blk2. blk and blk2 must be
distinct Strings (use StringCopyOverlapped if they may be the same). **/ distinct Strings (use StringCopyOverlapped if they may be the same). **/
@@ -326,23 +342,25 @@ HTS_STATIC void StringAttach(String * blk, char **str) {
/** Replace content with a copy of the C string STR. If STR is NULL, clears to /** Replace content with a copy of the C string STR. If STR is NULL, clears to
"". STR must not alias BLK's buffer (use StringCopyOverlapped if it might). "". STR must not alias BLK's buffer (use StringCopyOverlapped if it might).
**/ **/
#define StringCopy(BLK, STR) do { \ #define StringCopy(BLK, STR) \
const char *str__ = ( STR ); \ do { \
if (str__ != NULL) { \ const char *str__ = (STR); \
size_t size__ = strlen(str__); \ if (str__ != NULL) { \
StringMemcpy(BLK, str__, size__); \ size_t size__ = strlen(str__); \
} else { \ StringMemcpy(BLK, str__, size__); \
StringClear(BLK); \ } else { \
} \ StringClear(BLK); \
} while(0) } \
} while (0)
/** Like StringCopy but safe when STR aliases BLK's own buffer: copies via a /** Like StringCopy but safe when STR aliases BLK's own buffer: copies via a
temporary, so a self-copy or overlap is well-defined. **/ temporary, so a self-copy or overlap is well-defined. **/
#define StringCopyOverlapped(BLK, STR) do { \ #define StringCopyOverlapped(BLK, STR) \
String s__ = STRING_EMPTY; \ do { \
StringCopy(s__, STR); \ String s__ = STRING_EMPTY; \
StringCopyS(BLK, s__); \ StringCopy(s__, STR); \
StringFree(s__); \ StringCopyS(BLK, s__); \
} while(0) StringFree(s__); \
} while (0)
#endif #endif
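
Note on the `StringRoomTotal` growth policy: the macro jumps to 16 bytes first, then doubles until the requested total fits. The sketch below isolates that arithmetic only, with no allocation. `demo_grown_capacity` is a hypothetical function, and like this sketch of the loop it has no guard against `size_t` overflow for very large requests.

```c
#include <stddef.h>

/* Hypothetical helper (not HTTrack API): capacity reached after applying the
   "16, then double" policy until it is >= wanted. Never shrinks: a current
   capacity that already covers wanted is returned unchanged. */
static size_t demo_grown_capacity(size_t current, size_t wanted) {
  while (current < wanted) {
    current = current < 16 ? 16 : current * 2;
  }
  return current;
}
```

`StringRoom(BLK, SIZE)` asks for `StringLength(BLK) + (SIZE) + 1` total bytes, so appending 1 byte to a 15-byte String corresponds to `demo_grown_capacity(16, 17)`, which doubles the capacity to 32.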


@@ -1213,7 +1213,7 @@ HTSEXT_API find_handle hts_findfirst(char *path) {
return NULL; return NULL;
} }
HTSEXT_API int hts_findnext(find_handle find) { HTSEXT_API hts_boolean hts_findnext(find_handle find) {
if (find) { if (find) {
#ifdef _WIN32 #ifdef _WIN32
if ((FindNextFileA(find->handle, &find->hdata))) if ((FindNextFileA(find->handle, &find->hdata)))
@@ -1273,7 +1273,7 @@ HTSEXT_API int hts_findgetsize(find_handle find) {
return -1; return -1;
} }
HTSEXT_API int hts_findisdir(find_handle find) { HTSEXT_API hts_boolean hts_findisdir(find_handle find) {
if (find) { if (find) {
if (!hts_findissystem(find)) { if (!hts_findissystem(find)) {
#ifdef _WIN32 #ifdef _WIN32
@@ -1287,7 +1287,7 @@ HTSEXT_API int hts_findisdir(find_handle find) {
} }
return 0; return 0;
} }
HTSEXT_API int hts_findisfile(find_handle find) { HTSEXT_API hts_boolean hts_findisfile(find_handle find) {
if (find) { if (find) {
if (!hts_findissystem(find)) { if (!hts_findissystem(find)) {
#ifdef _WIN32 #ifdef _WIN32
@@ -1301,7 +1301,7 @@ HTSEXT_API int hts_findisfile(find_handle find) {
} }
return 0; return 0;
} }
HTSEXT_API int hts_findissystem(find_handle find) { HTSEXT_API hts_boolean hts_findissystem(find_handle find) {
if (find) { if (find) {
#ifdef _WIN32 #ifdef _WIN32
if (find->hdata. if (find->hdata.


@@ -108,15 +108,15 @@ HTSEXT_API int hts_buildtopindex(httrackp * opt, const char *path,
// Portable directory find functions // Portable directory find functions
// Directory find functions // Directory find functions
HTSEXT_API find_handle hts_findfirst(char *path); HTSEXT_API find_handle hts_findfirst(char *path);
HTSEXT_API int hts_findnext(find_handle find); HTSEXT_API hts_boolean hts_findnext(find_handle find);
HTSEXT_API int hts_findclose(find_handle find); HTSEXT_API int hts_findclose(find_handle find);
// //
HTSEXT_API char *hts_findgetname(find_handle find); HTSEXT_API char *hts_findgetname(find_handle find);
HTSEXT_API int hts_findgetsize(find_handle find); HTSEXT_API int hts_findgetsize(find_handle find);
HTSEXT_API int hts_findisdir(find_handle find); HTSEXT_API hts_boolean hts_findisdir(find_handle find);
HTSEXT_API int hts_findisfile(find_handle find); HTSEXT_API hts_boolean hts_findisfile(find_handle find);
HTSEXT_API int hts_findissystem(find_handle find); HTSEXT_API hts_boolean hts_findissystem(find_handle find);
#endif #endif


@@ -57,17 +57,17 @@ extern "C" {
#endif #endif
/** Legacy no-op retained for ABI compatibility; always returns 1. */ /** Legacy no-op retained for ABI compatibility; always returns 1. */
HTSEXT_API int htswrap_init(void); // LEGACY HTSEXT_API int htswrap_init(void); // LEGACY
/** Legacy no-op retained for ABI compatibility; always returns 1. */ /** Legacy no-op retained for ABI compatibility; always returns 1. */
HTSEXT_API int htswrap_free(void); // LEGACY HTSEXT_API int htswrap_free(void); // LEGACY
#ifdef __cplusplus #ifdef __cplusplus
} }
#endif #endif
//HTSEXT_API int htswrap_add(httrackp * opt, const char *name, void *fct); // HTSEXT_API int htswrap_add(httrackp * opt, const char *name, void *fct);
//HTSEXT_API uintptr_t htswrap_read(httrackp * opt, const char *name); // HTSEXT_API uintptr_t htswrap_read(httrackp * opt, const char *name);
#endif #endif


@@ -73,6 +73,7 @@ typedef struct strc_int2bytes2 strc_int2bytes2;
#endif #endif
#ifndef HTS_DEF_DEFSTRUCT_hts_log_type #ifndef HTS_DEF_DEFSTRUCT_hts_log_type
#define HTS_DEF_DEFSTRUCT_hts_log_type #define HTS_DEF_DEFSTRUCT_hts_log_type
/** Log severity levels, most to least severe. A message is emitted only if its /** Log severity levels, most to least severe. A message is emitted only if its
level is <= opt->debug. LOG_ERRNO is a flag OR'd into the level to append level is <= opt->debug. LOG_ERRNO is a flag OR'd into the level to append
": <strerror(errno)>" to the message. */ ": <strerror(errno)>" to the message. */
@@ -97,7 +98,7 @@ typedef struct hts_stat_struct hts_stat_struct;
retain them. */ retain them. */
#ifndef HTS_DEF_FWSTRUCT_htsErrorCallback #ifndef HTS_DEF_FWSTRUCT_htsErrorCallback
#define HTS_DEF_FWSTRUCT_htsErrorCallback #define HTS_DEF_FWSTRUCT_htsErrorCallback
typedef void (*htsErrorCallback) (const char *msg, const char *file, int line); typedef void (*htsErrorCallback)(const char *msg, const char *file, int line);
#endif #endif
/* Helpers for plugging callbacks /* Helpers for plugging callbacks
@@ -111,29 +112,35 @@ requires: htsdefines.h */
* CALLBACKARG_USERDEF(). Allocates a t_hts_callbackarg with hts_malloc (not * CALLBACKARG_USERDEF(). Allocates a t_hts_callbackarg with hts_malloc (not
* checked for OOM); it is freed by hts_free_opt(). * checked for OOM); it is freed by hts_free_opt().
*/ */
#define CHAIN_FUNCTION(OPT, MEMBER, FUNCTION, ARGUMENT) do { \ #define CHAIN_FUNCTION(OPT, MEMBER, FUNCTION, ARGUMENT) \
t_hts_callbackarg *carg = (t_hts_callbackarg*) hts_malloc(sizeof(t_hts_callbackarg)); \ do { \
carg->userdef = ( ARGUMENT ); \ t_hts_callbackarg *carg = \
carg->prev.fun = (void*) ( OPT )->callbacks_fun-> MEMBER .fun; \ (t_hts_callbackarg *) hts_malloc(sizeof(t_hts_callbackarg)); \
carg->prev.carg = ( OPT )->callbacks_fun-> MEMBER .carg; \ carg->userdef = (ARGUMENT); \
( OPT )->callbacks_fun-> MEMBER .fun = ( FUNCTION ); \ carg->prev.fun = (void *) (OPT)->callbacks_fun->MEMBER.fun; \
( OPT )->callbacks_fun-> MEMBER .carg = carg; \ carg->prev.carg = (OPT)->callbacks_fun->MEMBER.carg; \
} while(0) (OPT)->callbacks_fun->MEMBER.fun = (FUNCTION); \
(OPT)->callbacks_fun->MEMBER.carg = carg; \
} while (0)
/* The following helpers are useful only if you know that an existing callback migh be existing before before the call to CHAIN_FUNCTION() /* The following helpers are useful only if you know that an existing callback
If your functions were added just after hts_create_opt(), no need to make the previous function check */ migh be existing before before the call to CHAIN_FUNCTION() If your functions
were added just after hts_create_opt(), no need to make the previous function
check */
/** Inside a chained callback, return the ARGUMENT pointer originally passed to /** Inside a chained callback, return the ARGUMENT pointer originally passed to
CHAIN_FUNCTION(), or NULL when CARG is NULL. */ CHAIN_FUNCTION(), or NULL when CARG is NULL. */
#define CALLBACKARG_USERDEF(CARG) ( ( (CARG) != NULL ) ? (CARG)->userdef : NULL ) #define CALLBACKARG_USERDEF(CARG) (((CARG) != NULL) ? (CARG)->userdef : NULL)
/** Return the callback of type NAME that this one chained over, cast to its /** Return the callback of type NAME that this one chained over, cast to its
function-pointer type, or NULL. Call it to forward to the prior handler. */ function-pointer type, or NULL. Call it to forward to the prior handler. */
#define CALLBACKARG_PREV_FUN(CARG, NAME) ( (t_hts_htmlcheck_ ##NAME) ( ( (CARG) != NULL ) ? (CARG)->prev.fun : NULL ) ) #define CALLBACKARG_PREV_FUN(CARG, NAME) \
((t_hts_htmlcheck_##NAME)(((CARG) != NULL) ? (CARG)->prev.fun : NULL))
/** Return the carg of the callback this one chained over (pass it when /** Return the carg of the callback this one chained over (pass it when
forwarding to the CALLBACKARG_PREV_FUN result), or NULL. */ forwarding to the CALLBACKARG_PREV_FUN result), or NULL. */
#define CALLBACKARG_PREV_CARG(CARG) ( ( (CARG) != NULL ) ? (CARG)->prev.carg : NULL ) #define CALLBACKARG_PREV_CARG(CARG) \
(((CARG) != NULL) ? (CARG)->prev.carg : NULL)
/* Functions */ /* Functions */
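The chaining helpers in the hunk above (CHAIN_FUNCTION, CALLBACKARG_USERDEF, CALLBACKARG_PREV_FUN, CALLBACKARG_PREV_CARG) can be illustrated with a minimal standalone sketch. Every `sketch_*` name below is a hypothetical stand-in, not a library type or function; only the save-previous-then-install shape follows the macros. Unlike CHAIN_FUNCTION, the sketch checks the allocation, and it frees the chain itself where the library comment says hts_free_opt() does so.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the library types; NOT the real httrack
   definitions. Only the fields the chaining macros touch are modeled. */
typedef struct sketch_carg sketch_carg;
typedef int (*sketch_fun)(sketch_carg *carg, int value);

struct sketch_carg {
  void *userdef; /* ARGUMENT given when chaining */
  struct {
    sketch_fun fun;    /* handler that was installed before this one */
    sketch_carg *carg; /* argument block of that handler */
  } prev;
};

typedef struct {
  sketch_fun fun;
  sketch_carg *carg;
} sketch_slot;

/* Same shape as CHAIN_FUNCTION(): remember the current handler, then install
   the new one on top. Returns 1 on success, 0 on allocation failure. */
static int sketch_chain(sketch_slot *slot, sketch_fun fun, void *userdef) {
  sketch_carg *carg = malloc(sizeof(*carg));
  if (carg == NULL)
    return 0;
  carg->userdef = userdef;
  carg->prev.fun = slot->fun;
  carg->prev.carg = slot->carg;
  slot->fun = fun;
  slot->carg = carg;
  return 1;
}

/* A chained handler: add its own userdef value, then forward to the handler
   it replaced (if any), passing that handler's own carg. */
static int sketch_add_handler(sketch_carg *carg, int value) {
  const int *delta = (carg != NULL) ? carg->userdef : NULL;
  sketch_fun prev = (carg != NULL) ? carg->prev.fun : NULL;
  if (delta != NULL)
    value += *delta;
  if (prev != NULL)
    return prev(carg->prev.carg, value);
  return value;
}

/* Release every argument block in the chain and clear the slot. */
static void sketch_free_chain(sketch_slot *slot) {
  sketch_carg *c = slot->carg;
  while (c != NULL) {
    sketch_carg *older = c->prev.carg;
    free(c);
    c = older;
  }
  slot->fun = NULL;
  slot->carg = NULL;
}
```

Chaining the handler twice with userdef values 1 and 10 and then calling `slot.fun(slot.carg, 5)` runs the newest handler first and forwards to the older one, so both contributions are applied.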
@@ -162,7 +169,7 @@ HTSEXT_API int hts_main(int argc, char **argv);
hts_main() to set options or plug callbacks on opt first. Blocks until the hts_main() to set options or plug callbacks on opt first. Blocks until the
mirror ends and returns the engine exit code. The caller keeps ownership of mirror ends and returns the engine exit code. The caller keeps ownership of
opt and must release it with hts_free_opt(). */ opt and must release it with hts_free_opt(). */
HTSEXT_API int hts_main2(int argc, char **argv, httrackp * opt); HTSEXT_API int hts_main2(int argc, char **argv, httrackp *opt);
/* Options handling */ /* Options handling */
/** Allocate and default-initialize an option set, preloading the bundled parser /** Allocate and default-initialize an option set, preloading the bundled parser
@@ -174,7 +181,7 @@ HTSEXT_API httrackp *hts_create_opt(void);
modules, DNS cache, owned strings, and the structure). NULL is accepted. The modules, DNS cache, owned strings, and the structure). NULL is accepted. The
pointer is invalid afterward. Do not call while a mirror is running on that pointer is invalid afterward. Do not call while a mirror is running on that
opt; wait until hts_has_stopped() is true. */ opt; wait until hts_has_stopped() is true. */
HTSEXT_API void hts_free_opt(httrackp * opt); HTSEXT_API void hts_free_opt(httrackp *opt);
/** Return sizeof(httrackp) as the library sees it, for caller-vs-library struct /** Return sizeof(httrackp) as the library sees it, for caller-vs-library struct
ABI mismatch checks. */ ABI mismatch checks. */
@@ -184,16 +191,16 @@ HTSEXT_API size_t hts_sizeof_opt(void);
Returns NULL if opt is NULL. The result aliases a single process-global Returns NULL if opt is NULL. The result aliases a single process-global
static: it is not thread-safe and is overwritten by the next call, so copy static: it is not thread-safe and is overwritten by the next call, so copy
out the fields you need. */ out the fields you need. */
HTSEXT_API const hts_stat_struct* hts_get_stats(httrackp * opt); HTSEXT_API const hts_stat_struct *hts_get_stats(httrackp *opt);
/** Legacy no-op retained for API compatibility. */ /** Legacy no-op retained for API compatibility. */
HTSEXT_API void set_wrappers(httrackp * opt); /* LEGACY */ HTSEXT_API void set_wrappers(httrackp *opt); /* LEGACY */
/** Load a plugin shared library and run its hts_plug(opt, argv) entry point. On /** Load a plugin shared library and run its hts_plug(opt, argv) entry point. On
success the handle is recorded in opt and unloaded by hts_free_opt(). success the handle is recorded in opt and unloaded by hts_free_opt().
@return 1 if loaded and hts_plug succeeded; 0 if loaded but hts_plug was @return 1 if loaded and hts_plug succeeded; 0 if loaded but hts_plug was
missing or refused; -1 if the library could not be loaded. */ missing or refused; -1 if the library could not be loaded. */
HTSEXT_API int plug_wrapper(httrackp * opt, const char *moduleName, HTSEXT_API int plug_wrapper(httrackp *opt, const char *moduleName,
const char *argv); const char *argv);
/** Install the process-global assertion/error callback (NULL clears it). Not /** Install the process-global assertion/error callback (NULL clears it). Not
@@ -206,17 +213,18 @@ HTSEXT_API htsErrorCallback hts_get_error_callback(void);
/* Logging */ /* Logging */
/** Legacy: write prefix then msg to opt->log. Returns 0 if written, 1 if /** Legacy: write prefix then msg to opt->log. Returns 0 if written, 1 if
opt->log is NULL. Prefer hts_log_print(). */ opt->log is NULL. Prefer hts_log_print(). */
HTSEXT_API int hts_log(httrackp * opt, const char *prefix, const char *msg); HTSEXT_API hts_boolean hts_log(httrackp *opt, const char *prefix,
const char *msg);
/** printf-style log at level @p type (an hts_log_type, optionally |LOG_ERRNO). /** printf-style log at level @p type (an hts_log_type, optionally |LOG_ERRNO).
Forwards to the registered log callback, and when the level is <= opt->debug Forwards to the registered log callback, and when the level is <= opt->debug
also to opt->log. @p format must be non-NULL. */ also to opt->log. @p format must be non-NULL. */
HTSEXT_API void hts_log_print(httrackp * opt, int type, const char *format, HTSEXT_API void hts_log_print(httrackp *opt, int type, const char *format, ...)
...) HTS_PRINTF_FUN(3, 4); HTS_PRINTF_FUN(3, 4);
/** va_list form of hts_log_print(). @p opt may be NULL (only the callback /** va_list form of hts_log_print(). @p opt may be NULL (only the callback
runs). Preserves errno. @p format must be non-NULL. */ runs). Preserves errno. @p format must be non-NULL. */
HTSEXT_API void hts_log_vprint(httrackp * opt, int type, const char *format, HTSEXT_API void hts_log_vprint(httrackp *opt, int type, const char *format,
va_list args); va_list args);
/** Install the process-global log callback invoked by hts_log_vprint() for /** Install the process-global log callback invoked by hts_log_vprint() for
@@ -230,7 +238,7 @@ hts_set_log_vprint_callback(void (*callback)(httrackp *opt, int type,
result is written into and aliases a 2048-byte scratch buffer inside opt: it result is written into and aliases a 2048-byte scratch buffer inside opt: it
is valid until that buffer is next used, and must not be freed. opt must be is valid until that buffer is next used, and must not be freed. opt must be
non-NULL. */ non-NULL. */
HTSEXT_API const char *hts_get_version_info(httrackp * opt); HTSEXT_API const char *hts_get_version_info(httrackp *opt);
/** Static build-features string (TLS, zlib, ipv6, and so on). Process-global /** Static build-features string (TLS, zlib, ipv6, and so on). Process-global
storage; do not free or modify. */ storage; do not free or modify. */
@@ -240,21 +248,22 @@ HTSEXT_API const char *hts_is_available(void);
HTSEXT_API const char *hts_version(void); HTSEXT_API const char *hts_version(void);
/* Wrapper functions */ /* Wrapper functions */
HTSEXT_API int htswrap_init(void); // DEPRECATED - DUMMY FUNCTION HTSEXT_API int htswrap_init(void); // DEPRECATED - DUMMY FUNCTION
HTSEXT_API int htswrap_free(void); // DEPRECATED - DUMMY FUNCTION HTSEXT_API int htswrap_free(void); // DEPRECATED - DUMMY FUNCTION
/** Register callback @p fct under @p name in opt's callback table (for example /** Register callback @p fct under @p name in opt's callback table (for example
"start", "check-html", "linkdetected"). Returns 1 on success, 0 if @p name "start", "check-html", "linkdetected"). Returns 1 on success, 0 if @p name
is not a known slot. Prefer CHAIN_FUNCTION(), which preserves any prior is not a known slot. Prefer CHAIN_FUNCTION(), which preserves any prior
callback. */ callback. */
HTSEXT_API int htswrap_add(httrackp * opt, const char *name, void *fct); HTSEXT_API int htswrap_add(httrackp *opt, const char *name, void *fct);
/** Return the function pointer registered under @p name in opt as a uintptr_t, /** Return the function pointer registered under @p name in opt as a uintptr_t,
or 0 if none or unknown. */ or 0 if none or unknown. */
HTSEXT_API uintptr_t htswrap_read(httrackp * opt, const char *name); HTSEXT_API uintptr_t htswrap_read(httrackp *opt, const char *name);
/* Internal library allocators, if a different libc is being used by the client */ /* Internal library allocators, if a different libc is being used by the client
*/
/** strdup() through the library allocator. Returns a heap copy freed with /** strdup() through the library allocator. Returns a heap copy freed with
hts_free(), or NULL on failure. */ hts_free(), or NULL on failure. */
HTSEXT_API char *hts_strdup(const char *string); HTSEXT_API char *hts_strdup(const char *string);
@@ -271,13 +280,13 @@ HTSEXT_API void *hts_realloc(void *const data, const size_t size);
HTSEXT_API void hts_free(void *data); HTSEXT_API void hts_free(void *data);
/* Other functions */ /* Other functions */
HTSEXT_API int hts_resetvar(void); // DEPRECATED - DUMMY FUNCTION HTSEXT_API int hts_resetvar(void); // DEPRECATED - DUMMY FUNCTION
/** (Re)build the top-level index.html aggregating every mirror project found /** (Re)build the top-level index.html aggregating every mirror project found
under @p path. @p binpath is the data root used to locate the under @p path. @p binpath is the data root used to locate the
templates/topindex-*.html files, falling back to built-in templates. Writes templates/topindex-*.html files, falling back to built-in templates. Writes
<path>/index.html. @return 1 on success, 0 on failure. */ <path>/index.html. @return 1 on success, 0 on failure. */
HTSEXT_API int hts_buildtopindex(httrackp * opt, const char *path, HTSEXT_API int hts_buildtopindex(httrackp *opt, const char *path,
const char *binpath); const char *binpath);
/** Scan every mirror project under @p path and return a CRLF-separated list: /** Scan every mirror project under @p path and return a CRLF-separated list:
@@ -313,20 +322,21 @@ HTSEXT_API T_SOC catch_url_init(int *port, char *adr);
"ip:port". The buffers are caller-allocated and not bounds-checked: @p data "ip:port". The buffers are caller-allocated and not bounds-checked: @p data
must be CATCH_URL_DATA_SIZE bytes, and @p url / @p method must fit the must be CATCH_URL_DATA_SIZE bytes, and @p url / @p method must fit the
captured request line. */ captured request line. */
HTSEXT_API int catch_url(T_SOC soc, char *url, char *method, char *data); HTSEXT_API hts_boolean catch_url(T_SOC soc, char *url, char *method,
char *data);
/* State */ /* State */
/** Whether the engine is parsing HTML. Returns 0 if not, otherwise the percent /** Whether the engine is parsing HTML. Returns 0 if not, otherwise the percent
done (at least 1). @p flag >= 0 also requests a progress refresh; pass a done (at least 1). @p flag >= 0 also requests a progress refresh; pass a
negative value to query without side effects. */ negative value to query without side effects. */
HTSEXT_API int hts_is_parsing(httrackp * opt, int flag); HTSEXT_API int hts_is_parsing(httrackp *opt, int flag);
/** Current background phase: 0 none, 1 testing links, 2 purge, 3, 4 scheduling, /** Current background phase: 0 none, 1 testing links, 2 purge, 3, 4 scheduling,
5 waiting for a slot. */ 5 waiting for a slot. */
HTSEXT_API int hts_is_testing(httrackp * opt); HTSEXT_API int hts_is_testing(httrackp *opt);
/** Nonzero once the engine has begun its exit sequence. */ /** Nonzero once the engine has begun its exit sequence. */
HTSEXT_API int hts_is_exiting(httrackp * opt); HTSEXT_API int hts_is_exiting(httrackp *opt);
/*HTSEXT_API int hts_setopt(httrackp* opt); DEPRECATED ; see copy_htsopt() */ /*HTSEXT_API int hts_setopt(httrackp* opt); DEPRECATED ; see copy_htsopt() */
@@ -334,46 +344,46 @@ HTSEXT_API int hts_is_exiting(httrackp * opt);
caller-owned, NULL-terminated array of strings; the engine stores the caller-owned, NULL-terminated array of strings; the engine stores the
pointer without copying, so the array and its strings must stay valid until pointer without copying, so the array and its strings must stay valid until
the engine consumes them. @return nonzero if a list is now set. */ the engine consumes them. @return nonzero if a list is now set. */
HTSEXT_API int hts_addurl(httrackp * opt, char **url); HTSEXT_API hts_boolean hts_addurl(httrackp *opt, char **url);
/** Clear any pending add-URL list set by hts_addurl(). Always returns 0. */ /** Clear any pending add-URL list set by hts_addurl(). Always returns 0. */
HTSEXT_API int hts_resetaddurl(httrackp * opt); HTSEXT_API hts_boolean hts_resetaddurl(httrackp *opt);
/** Apply the runtime-tunable options from @p from onto @p to, to adjust a live /** Apply the runtime-tunable options from @p from onto @p to, to adjust a live
mirror. Only fields set to a non-sentinel value are copied; the rest of @p mirror. Only fields set to a non-sentinel value are copied; the rest of @p
to is left untouched. The user-agent string is deep-copied. @return 0. */ to is left untouched. The user-agent string is deep-copied. @return 0. */
HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to); HTSEXT_API int copy_htsopt(const httrackp *from, httrackp *to);
/** Return the engine's last error message, or NULL. The string is owned by /** Return the engine's last error message, or NULL. The string is owned by
@p opt; do not free it, and use it only while @p opt lives. */ @p opt; do not free it, and use it only while @p opt lives. */
HTSEXT_API char *hts_errmsg(httrackp * opt); HTSEXT_API char *hts_errmsg(httrackp *opt);
/** Get or set the transfer-pause flag. @p p >= 0 sets it (nonzero means /** Get or set the transfer-pause flag. @p p >= 0 sets it (nonzero means
paused); a negative value queries. @return the current pause flag. */ paused); a negative value queries. @return the current pause flag. */
HTSEXT_API int hts_setpause(httrackp * opt, int); HTSEXT_API int hts_setpause(httrackp *opt, int);
/** Ask the running mirror to terminate (sets the stop flag under the state /** Ask the running mirror to terminate (sets the stop flag under the state
lock, so it is safe to call from another thread). @p force is currently lock, so it is safe to call from another thread). @p force is currently
ignored. ignored.
@return 0; no-op if @p opt is NULL. */ @return 0; no-op if @p opt is NULL. */
HTSEXT_API int hts_request_stop(httrackp * opt, int force); HTSEXT_API int hts_request_stop(httrackp *opt, hts_boolean force);
/** Queue a single in-progress file, by URL, to be cancelled by the engine. /** Queue a single in-progress file, by URL, to be cancelled by the engine.
@p url is copied internally. Takes the state lock, so it is thread-safe. @p url is copied internally. Takes the state lock, so it is thread-safe.
@return the underlying push result. */ @return the underlying push result. */
HTSEXT_API int hts_cancel_file_push(httrackp * opt, const char *url); HTSEXT_API int hts_cancel_file_push(httrackp *opt, const char *url);
/** Cancel the in-progress link-testing phase. Effective only while a test runs. /** Cancel the in-progress link-testing phase. Effective only while a test runs.
*/ */
HTSEXT_API void hts_cancel_test(httrackp * opt); HTSEXT_API void hts_cancel_test(httrackp *opt);
/** Cancel the in-progress HTML parsing. Effective only while parsing is active. /** Cancel the in-progress HTML parsing. Effective only while parsing is active.
*/ */
HTSEXT_API void hts_cancel_parsing(httrackp * opt); HTSEXT_API void hts_cancel_parsing(httrackp *opt);
/** Nonzero once the mirror has fully ended. Read under the engine state lock, /** Nonzero once the mirror has fully ended. Read under the engine state lock,
so safe to poll from another thread. Wait for this before hts_free_opt(). */ so safe to poll from another thread. Wait for this before hts_free_opt(). */
HTSEXT_API int hts_has_stopped(httrackp * opt); HTSEXT_API hts_boolean hts_has_stopped(httrackp *opt);
/* Tools */ /* Tools */
/** Ensure the directory chain leading to @p path exists, creating missing /** Ensure the directory chain leading to @p path exists, creating missing
@@ -390,7 +400,7 @@ HTSEXT_API int structcheck_utf8(const char *path);
/** Whether the directory containing @p path exists. The basename is stripped /** Whether the directory containing @p path exists. The basename is stripped
first, so passing a file path tests its parent directory. @return 1 if it is first, so passing a file path tests its parent directory. @return 1 if it is
a directory, 0 otherwise. */ a directory, 0 otherwise. */
HTSEXT_API int dir_exists(const char *path); HTSEXT_API hts_boolean dir_exists(const char *path);
/** Write the HTTP reason phrase for @p statuscode into @p msg, a caller buffer /** Write the HTTP reason phrase for @p statuscode into @p msg, a caller buffer
of at least 64 bytes. For an unknown code a non-empty @p msg is kept, of at least 64 bytes. For an unknown code a non-empty @p msg is kept,
@@ -414,19 +424,19 @@ HTSEXT_API void qsec2str(char *st, TStamp t);
is reused, and a given strc is not reentrant. Use one strc per is reused, and a given strc is not reentrant. Use one strc per
concurrently-live result. */ concurrently-live result. */
/** Format @p n as a decimal string into @p strc and return it. */ /** Format @p n as a decimal string into @p strc and return it. */
HTSEXT_API char *int2char(strc_int2bytes2 * strc, int n); HTSEXT_API char *int2char(strc_int2bytes2 *strc, int n);
/** Format byte count @p n as "<num><unit>" (B/KiB/MiB/GiB and so on) into /** Format byte count @p n as "<num><unit>" (B/KiB/MiB/GiB and so on) into
@p strc and return it. */ @p strc and return it. */
HTSEXT_API char *int2bytes(strc_int2bytes2 * strc, LLint n); HTSEXT_API char *int2bytes(strc_int2bytes2 *strc, LLint n);
/** Format a transfer rate @p n as "<num><unit>/s" into @p strc and return it. /** Format a transfer rate @p n as "<num><unit>/s" into @p strc and return it.
*/ */
HTSEXT_API char *int2bytessec(strc_int2bytes2 * strc, long int n); HTSEXT_API char *int2bytessec(strc_int2bytes2 *strc, long int n);
/** Split byte count @p n into number and unit, returning a 2-element array /** Split byte count @p n into number and unit, returning a 2-element array
{number, unit} stored inside @p strc. */ {number, unit} stored inside @p strc. */
HTSEXT_API char **int2bytes2(strc_int2bytes2 * strc, LLint n); HTSEXT_API char **int2bytes2(strc_int2bytes2 *strc, LLint n);
/** Skip any "user[:pass]@" identification prefix in a URL, returning a pointer /** Skip any "user[:pass]@" identification prefix in a URL, returning a pointer
into the argument past it (or past the protocol if none). The result aliases into the argument past it (or past the protocol if none). The result aliases
@@ -488,40 +498,50 @@ HTSEXT_API void unescape_amp(char *s);
/** Percent-escape only spaces (' ' becomes "%20"); copy everything else /** Percent-escape only spaces (' ' becomes "%20"); copy everything else
* verbatim. */ * verbatim. */
HTSEXT_API size_t escape_spc_url(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t escape_spc_url(const char *const src, char *const dest,
const size_t size);
/** Aggressively percent-escape @p src for use as a single URL path segment /** Aggressively percent-escape @p src for use as a single URL path segment
(reserved, delimiter, unwise, special, avoid and mark characters). */ (reserved, delimiter, unwise, special, avoid and mark characters). */
HTSEXT_API size_t escape_in_url(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t escape_in_url(const char *const src, char *const dest,
const size_t size);
/** Percent-escape @p src as a URI, escaping only what is necessary and keeping /** Percent-escape @p src as a URI, escaping only what is necessary and keeping
'/' and other reserved characters. */ '/' and other reserved characters. */
HTSEXT_API size_t escape_uri(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t escape_uri(const char *const src, char *const dest,
const size_t size);
/** Like escape_uri() for a UTF-8 URI: also escapes reserved characters other /** Like escape_uri() for a UTF-8 URI: also escapes reserved characters other
than '/'. */ than '/'. */
HTSEXT_API size_t escape_uri_utf(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t escape_uri_utf(const char *const src, char *const dest,
const size_t size);
/** Minimal "make safe" escape: percent-escapes only '"', ' ' and control /** Minimal "make safe" escape: percent-escapes only '"', ' ' and control
characters, leaving an already-formed URL otherwise intact. */ characters, leaving an already-formed URL otherwise intact. */
HTSEXT_API size_t escape_check_url(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t escape_check_url(const char *const src, char *const dest,
const size_t size);
/** Append-variant of escape_spc_url(): escapes @p src after the existing /** Append-variant of escape_spc_url(): escapes @p src after the existing
NUL-terminated content of @p dest. Returns the bytes appended (excluding the NUL-terminated content of @p dest. Returns the bytes appended (excluding the
NUL). */ NUL). */
HTSEXT_API size_t append_escape_spc_url(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t append_escape_spc_url(const char *const src, char *const dest,
const size_t size);
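As a rough illustration of the space-only escaping and its append variant described above, here is a standalone sketch. `sketch_escape_spc` and `sketch_append_escape_spc` are hypothetical names, not the library's implementation; the truncation behavior and the return value of the non-append form are assumptions made for the sketch, since the header comments shown here only state the return value of the append variant.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the library's escape_spc_url(): copy src into
   dest (capacity size, including the NUL), writing ' ' as "%20".
   Assumption: returns the bytes written, excluding the NUL, and truncates
   at a whole-character boundary when the output does not fit. */
static size_t sketch_escape_spc(const char *src, char *dest, size_t size) {
  size_t j = 0;
  assert(src != NULL && dest != NULL);
  if (size == 0)
    return 0;
  for (size_t i = 0; src[i] != '\0'; i++) {
    const size_t need = (src[i] == ' ') ? 3 : 1;
    if (j + need >= size) /* keep one byte for the NUL */
      break;
    if (src[i] == ' ')
      memcpy(dest + j, "%20", 3);
    else
      dest[j] = src[i];
    j += need;
  }
  dest[j] = '\0';
  return j;
}

/* Append variant: escape src after the existing NUL-terminated content of
   dest. size is the total capacity of dest. Returns the bytes appended
   (excluding the NUL), or 0 if dest is not terminated within size. */
static size_t sketch_append_escape_spc(const char *src, char *dest,
                                       size_t size) {
  const char *end;
  assert(src != NULL && dest != NULL);
  end = memchr(dest, '\0', size);
  if (end == NULL)
    return 0;
  return sketch_escape_spc(src, dest + (end - dest),
                           size - (size_t) (end - dest));
}
```

The append form simply locates the current terminator and delegates to the plain form with the remaining capacity, which is one plausible way to keep the two variants consistent.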
/** Append-variant of escape_in_url(). See append_escape_spc_url(). */ /** Append-variant of escape_in_url(). See append_escape_spc_url(). */
HTSEXT_API size_t append_escape_in_url(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t append_escape_in_url(const char *const src, char *const dest,
const size_t size);
/** Append-variant of escape_uri(). See append_escape_spc_url(). */ /** Append-variant of escape_uri(). See append_escape_spc_url(). */
HTSEXT_API size_t append_escape_uri(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t append_escape_uri(const char *const src, char *const dest,
const size_t size);
/** Append-variant of escape_uri_utf(). See append_escape_spc_url(). */ /** Append-variant of escape_uri_utf(). See append_escape_spc_url(). */
HTSEXT_API size_t append_escape_uri_utf(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t append_escape_uri_utf(const char *const src, char *const dest,
const size_t size);
/** Append-variant of escape_check_url(). See append_escape_spc_url(). */ /** Append-variant of escape_check_url(). See append_escape_spc_url(). */
HTSEXT_API size_t append_escape_check_url(const char *const src, char *const dest, const size_t size); HTSEXT_API size_t append_escape_check_url(const char *const src,
char *const dest, const size_t size);
/** In-place variant of escape_spc_url(): escapes the NUL-terminated string in /** In-place variant of escape_spc_url(): escapes the NUL-terminated string in
@p dest back into @p dest. */ @p dest back into @p dest. */
@@ -541,66 +561,73 @@ HTSEXT_API size_t inplace_escape_check_url(char *const dest, const size_t size);
/** Same escaping as escape_check_url() but returns @p dest instead of the byte /** Same escaping as escape_check_url() but returns @p dest instead of the byte
count. */ count. */
HTSEXT_API char *escape_check_url_addr(const char *const src, char *const dest, const size_t size); HTSEXT_API char *escape_check_url_addr(const char *const src, char *const dest,
const size_t size);
/** Build a MIME/MHTML content-id token in @p dest from @p adr and @p fil: /** Build a MIME/MHTML content-id token in @p dest from @p adr and @p fil:
escape_in_url() both, then replace every '%' with 'X' so the result is one escape_in_url() both, then replace every '%' with 'X' so the result is one
opaque token. */ opaque token. */
HTSEXT_API size_t make_content_id(const char *const adr, const char *const fil, char *const dest, const size_t size); HTSEXT_API size_t make_content_id(const char *const adr, const char *const fil,
char *const dest, const size_t size);
/** Low-level percent-escaper backing the escape_* family. @p mode selects the /** Low-level percent-escaper backing the escape_* family. @p mode selects the
character class to escape: 0 check_url, 1 in_url, 2 spc_url, 3 uri, character class to escape: 0 check_url, 1 in_url, 2 spc_url, 3 uri,
30 uri_utf. @p max_size is the dest capacity including the NUL. */ 30 uri_utf. @p max_size is the dest capacity including the NUL. */
HTSEXT_API size_t x_escape_http(const char *const s, char *const dest, const size_t max_size, const int mode); HTSEXT_API size_t x_escape_http(const char *const s, char *const dest,
const size_t max_size, const int mode);
/** Strip all control characters (byte value < 32) from @p s in place. */ /** Strip all control characters (byte value < 32) from @p s in place. */
HTSEXT_API void escape_remove_control(char *const s); HTSEXT_API void escape_remove_control(char *const s);
/** HTML-escape for text output: rewrite '&' to "&amp;" and pass every other /** HTML-escape for text output: rewrite '&' to "&amp;" and pass every other
byte through unchanged. */ byte through unchanged. */
HTSEXT_API size_t escape_for_html_print(const char *const s, char *const dest, const size_t size); HTSEXT_API size_t escape_for_html_print(const char *const s, char *const dest,
const size_t size);
/** Like escape_for_html_print() but also convert every high byte (>= 128) to a /** Like escape_for_html_print() but also convert every high byte (>= 128) to a
numeric entity "&#xNN;". */ numeric entity "&#xNN;". */
HTSEXT_API size_t escape_for_html_print_full(const char *const s, char *const dest, const size_t size); HTSEXT_API size_t escape_for_html_print_full(const char *const s,
char *const dest,
const size_t size);
/** Percent-decode @p s into @p catbuff (capacity @p size) and return @p /** Percent-decode @p s into @p catbuff (capacity @p size) and return @p
catbuff. Decodes every "%xx" hex escape. */ catbuff. Decodes every "%xx" hex escape. */
HTSEXT_API char *unescape_http(char *const catbuff, const size_t size, const char *const s); HTSEXT_API char *unescape_http(char *const catbuff, const size_t size,
const char *const s);
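The "decodes every %xx hex escape" contract of unescape_http() can be sketched as follows. `sketch_unescape` and `sketch_hexval` are hypothetical names, not the library's code; the handling of malformed escapes (copied through unchanged) and of overflow (silent truncation) is an assumption of this sketch, since the header comment does not specify either.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int sketch_hexval(char c) {
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  if (c >= 'A' && c <= 'F')
    return c - 'A' + 10;
  return -1;
}

/* Hypothetical sketch, not the library's unescape_http(): decode every
   "%xx" in s into catbuff (capacity size, including the NUL) and return
   catbuff. Assumption: a '%' not followed by two hex digits is copied
   unchanged, and output that does not fit is truncated. */
static char *sketch_unescape(char *catbuff, size_t size, const char *s) {
  size_t j = 0;
  assert(catbuff != NULL && s != NULL && size > 0);
  for (size_t i = 0; s[i] != '\0' && j + 1 < size; i++) {
    if (s[i] == '%') {
      /* s[i + 1] is at worst the terminator; s[i + 2] is read only when
         s[i + 1] was a hex digit, so neither read passes the NUL. */
      const int hi = sketch_hexval(s[i + 1]);
      const int lo = (hi >= 0) ? sketch_hexval(s[i + 2]) : -1;
      if (hi >= 0 && lo >= 0) {
        catbuff[j++] = (char) (hi * 16 + lo);
        i += 2;
        continue;
      }
    }
    catbuff[j++] = s[i];
  }
  catbuff[j] = '\0';
  return catbuff;
}
```

Returning the scratch buffer, as the documented function does, lets the call be used directly as an argument to another string function.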
/** Percent-decode @p s into @p catbuff, but only the escapes that are safe to /** Percent-decode @p s into @p catbuff, but only the escapes that are safe to
decode while keeping a valid URI (reserved, delimiter, unwise, control and decode while keeping a valid URI (reserved, delimiter, unwise, control and
must-avoid escapes are kept encoded, and %25 is never decoded). @p no_high & must-avoid escapes are kept encoded, and %25 is never decoded). @p no_high &
1 also decodes high (>= 128) bytes; @p no_high & 2 also decodes an escaped 1 also decodes high (>= 128) bytes; @p no_high & 2 also decodes an escaped
space. Returns @p catbuff. */ space. Returns @p catbuff. */
HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size, const char *s, const int no_high); HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size,
const char *s, const hts_boolean no_high);
/** Determine the MIME type of local file name @p fil into @p s (capacity /** Determine the MIME type of local file name @p fil into @p s (capacity
@p ssize): user --assume rules, then ".html", then the built-in extension @p ssize): user --assume rules, then ".html", then the built-in extension
table. @p flag != 0 forces a fallback type. @return 1 if a type was written, table. @p flag != 0 forces a fallback type. @return 1 if a type was written,
0 otherwise. */ 0 otherwise. */
HTSEXT_API int get_httptype_sized(httrackp *opt, char *s, size_t ssize, HTSEXT_API hts_boolean get_httptype_sized(httrackp *opt, char *s, size_t ssize,
const char *fil, int flag); const char *fil, hts_boolean flag);
/** @deprecated Use get_httptype_sized(). Assumes @p s has at least /** @deprecated Use get_httptype_sized(). Assumes @p s has at least
HTS_MIMETYPE_SIZE capacity. */ HTS_MIMETYPE_SIZE capacity. */
HTS_DEPRECATED("use get_httptype_sized(opt, s, ssize, fil, flag)") HTS_DEPRECATED("use get_httptype_sized(opt, s, ssize, fil, flag)")
HTSEXT_API void get_httptype(httrackp * opt, char *s, const char *fil, HTSEXT_API void get_httptype(httrackp *opt, char *s, const char *fil, int flag);
int flag);
/** Classify @p fil by its extension: 0 unknown, 1 known non-HTML, 2 known HTML. /** Classify @p fil by its extension: 0 unknown, 1 known non-HTML, 2 known HTML.
Consults the built-in table then user --assume rules. 0 for a NULL @p fil. Consults the built-in table then user --assume rules. 0 for a NULL @p fil.
*/ */
HTSEXT_API int is_knowntype(httrackp * opt, const char *fil); HTSEXT_API int is_knowntype(httrackp *opt, const char *fil);
/** Like is_knowntype() but consults only the user --assume rules: 0 no rule, /** Like is_knowntype() but consults only the user --assume rules: 0 no rule,
1 non-HTML, 2 HTML. */ 1 non-HTML, 2 HTML. */
HTSEXT_API int is_userknowntype(httrackp * opt, const char *fil); HTSEXT_API int is_userknowntype(httrackp *opt, const char *fil);
/** 1 if @p fil, an extension such as "asp" or "php" (not a full filename), is a /** 1 if @p fil, an extension such as "asp" or "php" (not a full filename), is a
known dynamic-page type, else 0. */ known dynamic-page type, else 0. */
HTSEXT_API int is_dyntype(const char *fil); HTSEXT_API hts_boolean is_dyntype(const char *fil);
/** Extract the extension of @p fil (text after the last '.', stopping at '?') /** Extract the extension of @p fil (text after the last '.', stopping at '?')
into caller scratch @p catbuff (capacity @p size) and return it. Returns "" into caller scratch @p catbuff (capacity @p size) and return it. Returns ""
@@ -610,18 +637,18 @@ HTSEXT_API const char *get_ext(char *catbuff, size_t size, const char *fil);
/** 1 if MIME type @p st must not be reclassified or renamed (hypertext types /** 1 if MIME type @p st must not be reclassified or renamed (hypertext types
and a built-in keep-list of commonly mislabeled types), else 0. */ and a built-in keep-list of commonly mislabeled types), else 0. */
HTSEXT_API int may_unknown(httrackp * opt, const char *st); HTSEXT_API hts_boolean may_unknown(httrackp *opt, const char *st);
/** Guess the MIME type of local file @p fil into @p s (capacity @p ssize), /** Guess the MIME type of local file @p fil into @p s (capacity @p ssize),
always producing a type. @return 1 if a type was written. */ always producing a type. @return 1 if a type was written. */
HTSEXT_API int guess_httptype_sized(httrackp *opt, char *s, size_t ssize, HTSEXT_API hts_boolean guess_httptype_sized(httrackp *opt, char *s,
const char *fil); size_t ssize, const char *fil);
/** @deprecated Use guess_httptype_sized(). Assumes @p s has at least /** @deprecated Use guess_httptype_sized(). Assumes @p s has at least
HTS_MIMETYPE_SIZE capacity. */ HTS_MIMETYPE_SIZE capacity. */
HTS_DEPRECATED("use guess_httptype_sized(opt, s, ssize, fil)") HTS_DEPRECATED("use guess_httptype_sized(opt, s, ssize, fil)")
HTSEXT_API void guess_httptype(httrackp * opt, char *s, const char *fil); HTSEXT_API void guess_httptype(httrackp *opt, char *s, const char *fil);
/* Ugly string tools */ /* Ugly string tools */
/* These take a caller scratch buffer catbuff of capacity size and return it. On /* These take a caller scratch buffer catbuff of capacity size and return it. On
@@ -630,11 +657,13 @@ HTSEXT_API void guess_httptype(httrackp * opt, char *s, const char *fil);
time), not a pointer. */ time), not a pointer. */
/** Concatenate @p a and @p b into @p catbuff (NULL or empty operands are /** Concatenate @p a and @p b into @p catbuff (NULL or empty operands are
* skipped). */ * skipped). */
HTSEXT_API char *concat(char *catbuff, size_t size, const char *a, const char *b); HTSEXT_API char *concat(char *catbuff, size_t size, const char *a,
const char *b);
/** Like concat(a, b) but convert '/' to the platform path separator (Windows). /** Like concat(a, b) but convert '/' to the platform path separator (Windows).
*/ */
HTSEXT_API char *fconcat(char *catbuff, size_t size, const char *a, const char *b); HTSEXT_API char *fconcat(char *catbuff, size_t size, const char *a,
const char *b);
/** Copy @p a into @p catbuff, converting '/' to the platform path separator /** Copy @p a into @p catbuff, converting '/' to the platform path separator
(Windows). */ (Windows). */
@@ -677,7 +706,7 @@ HTSEXT_API find_handle hts_findfirst(char *path);
/** Advance to the next directory entry. Returns 1 if an entry is available, 0 /** Advance to the next directory entry. Returns 1 if an entry is available, 0
at end of directory. */ at end of directory. */
HTSEXT_API int hts_findnext(find_handle find); HTSEXT_API hts_boolean hts_findnext(find_handle find);
/** Close the iteration and free @p find. Always returns 0; NULL is accepted. */ /** Close the iteration and free @p find. Always returns 0; NULL is accepted. */
HTSEXT_API int hts_findclose(find_handle find); HTSEXT_API int hts_findclose(find_handle find);
@@ -692,16 +721,16 @@ HTSEXT_API int hts_findgetsize(find_handle find);
/** 1 if the current entry is a directory, else 0 (a system/special entry, see /** 1 if the current entry is a directory, else 0 (a system/special entry, see
hts_findissystem(), reports 0). */ hts_findissystem(), reports 0). */
HTSEXT_API int hts_findisdir(find_handle find); HTSEXT_API hts_boolean hts_findisdir(find_handle find);
/** 1 if the current entry is a regular file, else 0 (a system/special entry, /** 1 if the current entry is a regular file, else 0 (a system/special entry,
see hts_findissystem(), reports 0). */ see hts_findissystem(), reports 0). */
HTSEXT_API int hts_findisfile(find_handle find); HTSEXT_API hts_boolean hts_findisfile(find_handle find);
/** 1 if the current entry is a special/system entry to skip: "." or "..", on /** 1 if the current entry is a special/system entry to skip: "." or "..", on
POSIX also device/fifo/socket nodes, on Windows also system, hidden or POSIX also device/fifo/socket nodes, on Windows also system, hidden or
temporary entries. Else 0. */ temporary entries. Else 0. */
HTSEXT_API int hts_findissystem(find_handle find); HTSEXT_API hts_boolean hts_findissystem(find_handle find);
/* UTF-8 aware FILE API */ /* UTF-8 aware FILE API */
/* On non-Windows these macros resolve directly to the POSIX calls. On Windows /* On non-Windows these macros resolve directly to the POSIX calls. On Windows
@@ -716,7 +745,7 @@ HTSEXT_API FILE *hts_fopen_utf8(const char *path, const char *mode);
#define STAT hts_stat_utf8 #define STAT hts_stat_utf8
typedef struct _stat STRUCT_STAT; typedef struct _stat STRUCT_STAT;
HTSEXT_API int hts_stat_utf8(const char *path, STRUCT_STAT * buf); HTSEXT_API int hts_stat_utf8(const char *path, STRUCT_STAT *buf);
#define UNLINK hts_unlink_utf8 #define UNLINK hts_unlink_utf8
HTSEXT_API int hts_unlink_utf8(const char *pathname); HTSEXT_API int hts_unlink_utf8(const char *pathname);
@@ -728,12 +757,12 @@ HTSEXT_API int hts_rename_utf8(const char *oldpath, const char *newpath);
HTSEXT_API int hts_mkdir_utf8(const char *pathname); HTSEXT_API int hts_mkdir_utf8(const char *pathname);
#define UTIME(A,B) hts_utime_utf8(A,B) #define UTIME(A, B) hts_utime_utf8(A, B)
typedef struct _utimbuf STRUCT_UTIMBUF; typedef struct _utimbuf STRUCT_UTIMBUF;
HTSEXT_API int hts_utime_utf8(const char *filename, HTSEXT_API int hts_utime_utf8(const char *filename,
const STRUCT_UTIMBUF * times); const STRUCT_UTIMBUF *times);
#else #else
#define FOPEN fopen #define FOPEN fopen
#define STAT stat #define STAT stat
@@ -745,7 +774,7 @@ typedef struct stat STRUCT_STAT;
typedef struct utimbuf STRUCT_UTIMBUF; typedef struct utimbuf STRUCT_UTIMBUF;
#define UTIME(A,B) utime(A,B) #define UTIME(A, B) utime(A, B)
#endif #endif
#define HTS_DEF_FILEAPI #define HTS_DEF_FILEAPI
#endif #endif
@@ -753,20 +782,21 @@ typedef struct utimbuf STRUCT_UTIMBUF;
/** Macro aimed to break at build-time if a size is not a sizeof() strictly /** Macro aimed to break at build-time if a size is not a sizeof() strictly
* greater than sizeof(char*). **/ * greater than sizeof(char*). **/
#undef COMPILE_TIME_CHECK_SIZE #undef COMPILE_TIME_CHECK_SIZE
#define COMPILE_TIME_CHECK_SIZE(A) (void) ((void (*)(char[A - sizeof(char*) - 1])) NULL) #define COMPILE_TIME_CHECK_SIZE(A) \
(void) ((void (*)(char[A - sizeof(char *) - 1])) NULL)
/** Macro aimed to break at compile-time if a size is not a sizeof() strictly /** Macro aimed to break at compile-time if a size is not a sizeof() strictly
* greater than sizeof(char*). **/ * greater than sizeof(char*). **/
#undef RUNTIME_TIME_CHECK_SIZE #undef RUNTIME_TIME_CHECK_SIZE
#define RUNTIME_TIME_CHECK_SIZE(A) assertf((A) != sizeof(void*)) #define RUNTIME_TIME_CHECK_SIZE(A) assertf((A) != sizeof(void *))
#define fconv(A,B,C) (COMPILE_TIME_CHECK_SIZE(B), fconv(A,B,C)) #define fconv(A, B, C) (COMPILE_TIME_CHECK_SIZE(B), fconv(A, B, C))
#define concat(A,B,C,D) (COMPILE_TIME_CHECK_SIZE(B), concat(A,B,C,D)) #define concat(A, B, C, D) (COMPILE_TIME_CHECK_SIZE(B), concat(A, B, C, D))
#define fconcat(A,B,C,D) (COMPILE_TIME_CHECK_SIZE(B), fconcat(A,B,C,D)) #define fconcat(A, B, C, D) (COMPILE_TIME_CHECK_SIZE(B), fconcat(A, B, C, D))
#define fslash(A,B,C) (COMPILE_TIME_CHECK_SIZE(B), fslash(A,B,C)) #define fslash(A, B, C) (COMPILE_TIME_CHECK_SIZE(B), fslash(A, B, C))
#ifdef __cplusplus #ifdef __cplusplus
} }


@@ -288,7 +288,7 @@ static void __cdecl htsshow_uninit(t_hts_callbackarg * carg) {
} }
static int __cdecl htsshow_start(t_hts_callbackarg * carg, httrackp * opt) { static int __cdecl htsshow_start(t_hts_callbackarg * carg, httrackp * opt) {
use_show = 0; use_show = 0;
if (opt->verbosedisplay == 2) { if (opt->verbosedisplay == HTS_VERBOSE_FULL) {
use_show = 1; use_show = 1;
vt_clear(); vt_clear();
} }
@@ -852,7 +852,7 @@ static void sig_doback(int blind) { // move to background
if (global_opt != NULL) { if (global_opt != NULL) {
// suppress logging and asking lousy questions // suppress logging and asking lousy questions
global_opt->quiet = 1; global_opt->quiet = 1;
global_opt->verbosedisplay = 0; global_opt->verbosedisplay = HTS_VERBOSE_NONE;
} }
if (!blind) if (!blind)


@@ -4,131 +4,140 @@
# Initializes the htsserver GUI frontend and launch the default browser # Initializes the htsserver GUI frontend and launch the default browser
BROWSEREXE= BROWSEREXE=
SRCHBROWSEREXE="x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition" SRCHBROWSEREXE=(x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition)
# shellcheck disable=SC2153 # BROWSER is the standard freedesktop env var, not a typo
if test -n "${BROWSER}"; then if test -n "${BROWSER}"; then
# sensible-browser will f up if BROWSER is not set # sensible-browser will f up if BROWSER is not set
SRCHBROWSEREXE="xdg-open sensible-browser ${SRCHBROWSEREXE}" SRCHBROWSEREXE=(xdg-open sensible-browser "${SRCHBROWSEREXE[@]}")
fi fi
# Patch for Darwin/Mac by Ross Williams # Patch for Darwin/Mac by Ross Williams
if test "`uname -s`" == "Darwin"; then if test "$(uname -s)" == "Darwin"; then
# Darwin/Mac OS X uses a system 'open' command to find # Darwin/Mac OS X uses a system 'open' command to find
# the default browser. The -W flag causes it to wait for # the default browser. The -W flag causes it to wait for
# the browser to exit # the browser to exit
BROWSEREXE="/usr/bin/open -W" BROWSEREXE="/usr/bin/open -W"
fi fi
BINWD=`dirname "$0"` BINWD=$(dirname "$0")
SRCHPATH="$BINWD /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin ${HOME}/usr/bin ${HOME}/bin" SRCHPATH=("$BINWD" /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin "${HOME}/usr/bin" "${HOME}/bin")
SRCHPATH="$SRCHPATH "`echo $PATH | tr ":" " "` IFS=':' read -ra pathdirs <<<"$PATH"
SRCHDISTPATH="$BINWD/../share $BINWD/.. /usr/share /usr/local /usr /local /usr/local/share ${HOME}/usr ${HOME}/usr/share /opt/local/share /sw ${HOME}/usr/local ${HOME}/usr/share" for d in "${pathdirs[@]}"; do
# drop empty PATH fields, matching the old echo|tr word-split
test -n "$d" && SRCHPATH+=("$d")
done
SRCHDISTPATH=("$BINWD/../share" "$BINWD/.." /usr/share /usr/local /usr /local /usr/local/share "${HOME}/usr" "${HOME}/usr/share" /opt/local/share /sw "${HOME}/usr/local" "${HOME}/usr/share")
### ###
# And now some famous cuisine # And now some famous cuisine
function log { function log {
echo "$0($$): $@" >&2 echo "$0($$): $*" >&2
return 0 return 0
} }
function launch_browser { function launch_browser {
log "Launching $1" log "Launching $1"
browser=$1 browser=$1
url=$2 url=$2
log "Spawning browser.." log "Spawning browser.."
${browser} "${url}" ${browser} "${url}"
# note: browser can hiddenly use the -remote feature of # note: browser can hiddenly use the -remote feature of
# mozilla and therefore return immediately # mozilla and therefore return immediately
log "Browser (or helper) exited" log "Browser (or helper) exited"
} }
# First ensure that we can launch the server # First ensure that we can launch the server
BINPATH= BINPATH=
for i in ${SRCHPATH}; do for i in "${SRCHPATH[@]}"; do
! test -n "${BINPATH}" && test -x ${i}/htsserver && BINPATH=${i} ! test -n "${BINPATH}" && test -x "${i}/htsserver" && BINPATH="${i}"
done done
for i in ${SRCHDISTPATH}; do for i in "${SRCHDISTPATH[@]}"; do
! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack" ! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack"
done done
test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1 test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1
test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1 test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1
test -f ${DISTPATH}/lang.def || ! log "Could not find ${DISTPATH}/lang.def" || exit 1 test -f "${DISTPATH}/lang.def" || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
test -f ${DISTPATH}/lang.indexes || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1 test -f "${DISTPATH}/lang.indexes" || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
test -d ${DISTPATH}/lang || ! log "Could not find ${DISTPATH}/lang" || exit 1 test -d "${DISTPATH}/lang" || ! log "Could not find ${DISTPATH}/lang" || exit 1
test -d ${DISTPATH}/html || ! log "Could not find ${DISTPATH}/html" || exit 1 test -d "${DISTPATH}/html" || ! log "Could not find ${DISTPATH}/html" || exit 1
# Locale # Locale
HTSLANG="${LC_MESSAGES}" HTSLANG="${LC_MESSAGES}"
! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}" ! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}"
! test -n "${HTSLANG}" && HTSLANG="${LANG}" ! test -n "${HTSLANG}" && HTSLANG="${LANG}"
HTSLANG="`echo $LANG | cut -f1 -d'.' | cut -f1 -d'_'`" HTSLANG="$(echo "$LANG" | cut -f1 -d'.' | cut -f1 -d'_')"
LANGN=`grep -E "^${HTSLANG}:" ${DISTPATH}/lang.indexes | cut -f2 -d':'` LANGN=$(grep -E "^${HTSLANG}:" "${DISTPATH}/lang.indexes" | cut -f2 -d':')
! test -n "${LANGN}" && LANGN=1 ! test -n "${LANGN}" && LANGN=1
# Find the browser # Find the browser
# note: not all systems have sensible-browser or www-browser alternative # note: not all systems have sensible-browser or www-browser alternative
# therefore, we have to search a bit more if sensible-browser could not be found # therefore, we have to search a bit more if sensible-browser could not be found
for i in ${SRCHBROWSEREXE}; do for i in "${SRCHBROWSEREXE[@]}"; do
for j in ${SRCHPATH}; do for j in "${SRCHPATH[@]}"; do
if test -x ${j}/${i}; then if test -x "${j}/${i}"; then
BROWSEREXE=${j}/${i} BROWSEREXE="${j}/${i}"
fi fi
test -n "$BROWSEREXE" && break test -n "$BROWSEREXE" && break
done done
test -n "$BROWSEREXE" && break test -n "$BROWSEREXE" && break
done done
test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1 test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1
# "browse" command # "browse" command
if test "$1" = "browse"; then if test "$1" = "browse"; then
if test -f "${HOME}/.httrack.ini"; then if test -f "${HOME}/.httrack.ini"; then
INDEXF=`cat ${HOME}/.httrack.ini | tr '\r' '\n' | grep -E "^path=" | cut -f2- -d'='` INDEXF=$(tr '\r' '\n' <"${HOME}/.httrack.ini" | grep -E "^path=" | cut -f2- -d'=')
if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then
INDEXF="${INDEXF}/index.html" INDEXF="${INDEXF}/index.html"
else else
INDEXF="" INDEXF=""
fi fi
fi fi
if ! test -n "$INDEXF"; then if ! test -n "$INDEXF"; then
INDEXF="${HOME}/websites/index.html" INDEXF="${HOME}/websites/index.html"
fi fi
launch_browser "${BROWSEREXE}" "file://${INDEXF}" launch_browser "${BROWSEREXE}" "file://${INDEXF}"
exit $? exit $?
fi fi
# Create a temporary filename # Create a temporary filename
TMPSRVFILE="$(mktemp ${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX)" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1 TMPSRVFILE="$(mktemp "${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX")" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
# Launch htsserver binary and setup the server # Launch htsserver binary and setup the server
(${BINPATH}/htsserver "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" $@; echo SRVURL=error) > ${TMPSRVFILE}& (
"${BINPATH}/htsserver" "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" "$@"
echo SRVURL=error
) >"${TMPSRVFILE}" &
# Find the generated SRVURL # Find the generated SRVURL
SRVURL= SRVURL=
MAXCOUNT=60 MAXCOUNT=60
while ! test -n "$SRVURL"; do while ! test -n "$SRVURL"; do
MAXCOUNT=$[$MAXCOUNT - 1] MAXCOUNT=$((MAXCOUNT - 1))
test $MAXCOUNT -gt 0 || exit 1 test $MAXCOUNT -gt 0 || exit 1
test $MAXCOUNT -lt 50 && echo "waiting for server to reply.." test $MAXCOUNT -lt 50 && echo "waiting for server to reply.."
SRVURL=`grep -E URL= ${TMPSRVFILE} | cut -f2- -d=` SRVURL=$(grep -E URL= "${TMPSRVFILE}" | cut -f2- -d=)
test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1 test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1
test -n "$SRVURL" || sleep 1 test -n "$SRVURL" || sleep 1
done done
# Cleanup function # Cleanup function
# shellcheck disable=SC2120 # $1 is an optional "signal caught" marker; bare calls are intentional
function cleanup { function cleanup {
test -n "$1" && log "Nasty signal caught, cleaning up.." test -n "$1" && log "Nasty signal caught, cleaning up.."
# Do not kill if browser exited (chrome bug issue) ; server will die itself # Do not kill if browser exited (chrome bug issue) ; server will die itself
test -n "$1" && test -f ${TMPSRVFILE} && SRVPID=`grep -E PID= ${TMPSRVFILE} | cut -f2- -d=` test -n "$1" && test -f "${TMPSRVFILE}" && SRVPID=$(grep -E PID= "${TMPSRVFILE}" | cut -f2- -d=)
test -n "${SRVPID}" && kill -9 ${SRVPID} test -n "${SRVPID}" && kill -9 "${SRVPID}"
test -f ${TMPSRVFILE} && rm ${TMPSRVFILE} test -f "${TMPSRVFILE}" && rm "${TMPSRVFILE}"
test -n "$1" && log "..Done" test -n "$1" && log "..Done"
return 0 return 0
} }
# Cleanup in case of emergency # Cleanup in case of emergency
trap "cleanup now; exit" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25 trap "cleanup now; exit" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# Got SRVURL, launch browser # Got SRVURL, launch browser
launch_browser "${BROWSEREXE}" "${SRVURL}" launch_browser "${BROWSEREXE}" "${SRVURL}"
# That's all, folks! # That's all, folks!
trap "" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25 trap "" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
cleanup cleanup
exit 0 exit 0
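
The htsserver launch in the hunk above changes a bare `$@` to `"$@"`. The sketch below shows what that quoting changes, using only POSIX shell. The helper names `count_args`, `forward_unquoted` and `forward_quoted` and the sample arguments are hypothetical, not part of webhttrack.

```sh
#!/bin/sh
# Hypothetical helper: report how many arguments it received.
count_args() {
  echo "$#"
}

# Forward the caller's arguments unquoted: the shell re-splits them on
# whitespace and drops empty strings.
forward_unquoted() {
  # shellcheck disable=SC2068 # deliberate, to show the re-splitting
  count_args $@
}

# Forward them quoted: each original argument arrives intact.
forward_quoted() {
  count_args "$@"
}

forward_unquoted path "my web sites" "" # prints 4: path, my, web, sites
forward_quoted path "my web sites" ""   # prints 3: path, "my web sites", ""
```

With the unquoted form, a mirror path containing spaces would reach the server as several arguments, and an intentionally empty argument would disappear.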


@@ -6,11 +6,11 @@ set -euo pipefail
# charset -> UTF-8 conversion (hts_convertStringToUTF8). # charset -> UTF-8 conversion (hts_convertStringToUTF8).
# -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8. # -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8.
conv() { conv() {
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1 test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
} }
# crash probe: malformed input must exit cleanly, not abort. # crash probe: malformed input must exit cleanly, not abort.
runs() { runs() {
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1 httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
} }
# the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9. # the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9.

tests/01_engine-cookies.test Executable file (15 lines added)

@@ -0,0 +1,15 @@
#!/bin/bash
#
# Issue #151 guard: the request Cookie header must be bare RFC 6265 name=value
# pairs, no $Version/$Path attributes. Driven by the 'httrack -#Q' selftest.
set -eu
# A trailing token is required; a bare '-#Q' falls through to the usage screen.
out=$(httrack -#Q run)
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "cookie-header: OK" || {
echo "expected 'cookie-header: OK', got: $out" >&2
exit 1
}
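
The test above compares the whole output with the success line instead of grepping for it, so a fall-through to the usage screen cannot pass. Here is a minimal sketch of that difference. `fake_selftest_ok` and `fake_selftest_usage` are made-up stand-ins for the real binary, and the usage text is invented for illustration.

```sh
#!/bin/sh
# Stand-ins for the real binary (assumptions for illustration only): one
# prints the success line, the other a usage screen that merely mentions it.
fake_selftest_ok() {
  echo "cookie-header: OK"
}
fake_selftest_usage() {
  echo "usage: prog [options]"
  echo "  selftest prints 'cookie-header: OK' on success"
}

# Exact match: succeeds only if the output is the success line, nothing else.
check_exact() {
  out=$("$@")
  test "$out" = "cookie-header: OK"
}

# Substring match: also accepts any output that merely contains the line.
check_substring() {
  "$@" | grep -q "cookie-header: OK"
}

check_exact fake_selftest_ok && echo "exact: ok passes"
check_exact fake_selftest_usage || echo "exact: usage rejected"
check_substring fake_selftest_usage && echo "substring: usage wrongly accepted"
```

The substring check is the weaker guard: any output that happens to quote the success line would satisfy it.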

tests/01_engine-copyopt.test Executable file (17 lines added)

@@ -0,0 +1,17 @@
#!/bin/bash
#
# Regression guard for the unsigned-enum sentinel trap: copy_htsopt's
# `if (from->X > -1)` guard is always false for unsigned hts_boolean fields, so
# they silently stop being copied. Driven by the in-process 'httrack -#9' test.
# Keep POSIX-portable (harness runs it via $(BASH), a plain /bin/sh on macOS).
set -eu
# A trailing token is required; a bare '-#9' falls through to the usage screen.
out=$(httrack -#9 run)
# Exact-match the success line so a fall-through to usage can't pass the test.
test "$out" = "copy-htsopt: OK" || {
echo "expected 'copy-htsopt: OK', got: $out" >&2
exit 1
}


@@ -89,4 +89,37 @@ grep -q NEWCONTENT "$(find "$out" -path '*/a.html' -print -quit)" || {
exit 1 exit 1
} }
# --- 3. an empty quoted arg survives the doit.log round-trip (#106) ----------
# -%F "" (empty footer) records an empty "" token in doit.log; -r2 follows it so
# a "drop the empty token" bug shifts -r2 into -%F's slot (the reprise then sees
# -%F -r2 and panics "%F needs to be followed by ..."), making the bug visible
# rather than a harmless run off the end of argv.
out2="$tmp/out2"
rc=0
"$bin" "$url" -O "$out2" --quiet -n -%v0 -%F "" -r2 >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: initial mirror with empty footer exited $rc"
exit 1
}
# precondition: the writer put the empty token on disk for the reader to reload.
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer not recorded as -%F \"\" -r2 in doit.log"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
# no-url reprise: the reader rebuilds argv from doit.log and rewrites doit.log
# from it. The empty token surviving in the regenerated file proves the reader
# kept it (a drop/swallow would panic above or rewrite -%F without the "").
rc=0
"$bin" -O "$out2" --quiet >/dev/null 2>&1 || rc=$?
test "$rc" -eq 0 || {
echo "FAIL: empty-footer reprise exited $rc (empty token dropped from doit.log?)"
exit 1
}
grep -q ' -%F "" -r2' "$out2/hts-cache/doit.log" || {
echo "FAIL: empty footer did not survive the doit.log reload round-trip"
grep -- '-%F' "$out2/hts-cache/doit.log" || true
exit 1
}
exit 0 exit 0
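
Section 3 above relies on `-r2` following `-%F ""`, so that a dropped empty token changes what the option parser sees. The toy parser below illustrates that shift. `parse` is hypothetical and much simpler than HTTrack's real option handling, which per the comment above panics rather than accepting `-r2` as a footer.

```sh
#!/bin/sh
# Hypothetical toy parser (not HTTrack code): -%F consumes the next token as
# its value, -rN sets a depth. Results are left in the globals footer/depth.
parse() {
  footer=unset
  depth=unset
  while [ $# -gt 0 ]; do
    case $1 in
    -%F)
      footer=${2-}
      # consume the value token only if there is one
      if [ $# -ge 2 ]; then shift; fi
      ;;
    -r*)
      depth=${1#-r}
      ;;
    esac
    shift
  done
  echo "footer='$footer' depth=$depth"
}

parse -%F "" -r2 # empty token kept:    footer='' depth=2
parse -%F -r2    # empty token dropped: footer='-r2' depth=unset
```

In the second call the depth option is swallowed as the footer value, which is the visible misparse the test is built to provoke.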


@@ -6,11 +6,11 @@ set -euo pipefail
# HTML entity unescaping (hts_unescapeEntitiesWithCharset). # HTML entity unescaping (hts_unescapeEntitiesWithCharset).
# -#6 <string> prints the string with entities decoded (UTF-8 output). # -#6 <string> prints the string with entities decoded (UTF-8 output).
ent() { ent() {
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1 test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
} }
# crash probe: malformed input must exit cleanly, not abort. # crash probe: malformed input must exit cleanly, not abort.
runs() { runs() {
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1 httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
} }
# named entities # named entities


@@ -7,10 +7,10 @@ set -euo pipefail
# -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...". # -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
match() { match() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1 test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
} }
nomatch() { nomatch() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1 test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
} }
# bare star matches everything # bare star matches everything
@@ -67,7 +67,7 @@ nomatch '*[\[]' 'a'
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed # filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy) # by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change. # behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']' match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']' match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x' nomatch '*[\[\]]' '[]x'


@@ -7,10 +7,10 @@ set -euo pipefail
# -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'". # -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
mime() { mime() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1 test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
} }
unknown() { unknown() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1 test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
} }
mime '/a/b.html' 'text/html' mime '/a/b.html' 'text/html'


@@ -154,4 +154,173 @@ grep -Eq "style=\"background-image:url\('ibgs\.gif'\)\"" "$saved2" ||
grep -q 'title="file://' "$saved2" || grep -q 'title="file://' "$saved2" ||
! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1 ! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
# xmlns / xmlns:prefix decls must not be crawled (#191). Local file:// targets so a
# regression downloads them; each is the LAST attr (heuristic only scans a value before '>').
site3="$tmp/xmlns"
mkdir -p "$site3"
for f in ns og rdfs real; do gif "$site3/$f.gif"; done
cat >"$site3/index.html" <<EOF
<html xmlns="file://$site3/ns.gif"><body>
<svg xmlns:og="file://$site3/og.gif"></svg>
<div class="c" xmlns:rdfs="file://$site3/rdfs.gif"></div>
<a href="file://$site3/real.gif">real link</a>
</body></html>
EOF
out3="$tmp/xmlns-out"
crawl "$site3/index.html" "$out3"
# the real link is still captured
found "real.gif" "$out3"
# namespace-declaration targets must not be fetched (default + prefixed forms)
notfound "ns.gif" "$out3"
notfound "og.gif" "$out3"
notfound "rdfs.gif" "$out3"
# CSS @import (#94): every form's target is captured, crawling the .css directly.
# The "cond"/"sup"/"spc" cases carry a trailing media/supports/layer condition (or
# a space before ';'); they are the negative controls: without the parser fix the
# URL is dropped, so a regression fails these found() checks.
site4="$tmp/cssimport"
mkdir -p "$site4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do printf 'body{}\n' >"$site4/$f.css"; done
cat >"$site4/main.css" <<'EOF'
@import url(nq.css);
@import url("dqu.css");
@import url('squ.css');
@import "dqs.css";
@import 'sqs.css';
@import url(med.css) screen and (min-width: 400px);
@import "cond.css" screen;
@import "sup.css" supports(display: flex);
@import url(lay.css) layer(base);
@import "spc.css" ;
EOF
out4="$tmp/cssimport-out"
crawl "$site4/main.css" "$out4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do found "$f.css" "$out4"; done
# Over-capture guard: the trailing condition is not part of the URL, so it must
# survive the rewrite verbatim. A regression that grabs it would mangle these.
m4=$(find "$out4" -type f -path '*/file/*' -name main.css -print -quit)
test -n "$m4" || ! echo "FAIL: saved main.css not found" || exit 1
for cond in '@import "cond.css" screen;' 'supports(display: flex)' 'layer(base)'; do
grep -Fq "$cond" "$m4" ||
! echo "FAIL #94: '$cond' altered on rewrite (condition captured as URL?)" || exit 1
done
# Malformed input: an unterminated @import quote (truncated CSS) must not crash or
# capture a bogus link; a valid sibling import is still captured. Guards a heap
# overflow on the URL-end scan that aborts under ASan (CI sanitizer job).
site5="$tmp/cssimport-trunc"
mkdir -p "$site5"
printf 'body{}\n' >"$site5/good.css"
printf '@import "good.css";\n@import "trunc' >"$site5/main.css"
out5="$tmp/cssimport-trunc-out"
crawl "$site5/main.css" "$out5"
found "good.css" "$out5"
notfound "trunc" "$out5"
# Offset-0 underflow (#396): a token at the buffer start makes the detector's
# word-boundary guard read *(html-1) one byte early (aborts under ASan). The
# url() target is still captured; here it just must not underflow.
site6="$tmp/parse-off0"
mkdir -p "$site6"
printf 'body{}\n' >"$site6/off0.css"
printf 'url(off0.css)\n' >"$site6/main.css"
out6="$tmp/parse-off0-out"
crawl "$site6/main.css" "$out6"
found "off0.css" "$out6"
# XMLHttpRequest.open(method, url) (#218): the first argument is an HTTP method,
# not a URL. Without the fix "GET" is captured as a link and fetched (the offline
# fixture saves a bare file named GET; a live server mangles it to GET.html).
# window.open(url) detection must be unaffected.
site7="$tmp/xhropen"
mkdir -p "$site7"
gif "$site7/winopen.gif"
cat >"$site7/index.html" <<EOF
<html><body><script>
var x = new XMLHttpRequest();
x.open("GET", "ajax_info.txt");
var y = new XMLHttpRequest();
y.open("Post", "submit.cgi");
window.open("file://$site7/winopen.gif");
</script></body></html>
EOF
out7="$tmp/xhropen-out"
crawl "$site7/index.html" "$out7"
# negative control: without the fix a file named exactly GET is downloaded
notfound "GET" "$out7"
# methods are matched case-insensitively (XHR spec normalizes them): a mixed-case
# method is rejected too, so a file named Post must not appear either
notfound "Post" "$out7"
# regression guard: window.open(url) is still detected, so its absolute URL is
# rewritten to a local link. The rewrite only happens if the parser saw it, so
# these two assertions fail if .open detection broke (not a trivial --near save).
saved7=$(savedhtml "$out7")
test -n "$saved7" || ! echo "FAIL: saved xhr page not found" || exit 1
grep -Fq 'window.open("winopen.gif")' "$saved7" ||
! echo "FAIL #218: window.open(url) no longer detected/rewritten" || exit 1
! grep -Fq 'window.open("file://' "$saved7" ||
! echo "FAIL #218: window.open URL left absolute (not rewritten)" || exit 1
# Parens in an unquoted url(...) (#163): the source %28/%29 decode to literal
# '(' ')' in the saved name, but a literal ')' in the rewritten url() closes the
# token early, so they must stay encoded. Negative control: without the fix the
# %281%29 greps fail (parens are RFC2396 "mark" chars the escaper leaves alone).
site8="$tmp/cssparens"
mkdir -p "$site8"
for f in 'img (1).gif' 'a(b)c(1).gif' 'q (4).gif'; do gif "$site8/$f"; done
cat >"$site8/style.css" <<'EOF'
.a { background: url(img%20%281%29.gif); }
.b { background: url(a%28b%29c%281%29.gif); }
.c { background: url("q%20%284%29.gif"); }
EOF
out8="$tmp/cssparens-out"
crawl "$site8/style.css" "$out8"
found "img (1).gif" "$out8"
found "a(b)c(1).gif" "$out8"
found "q (4).gif" "$out8"
css8=$(find "$out8" -type f -path '*/file/*' -name style.css -print -quit)
test -n "$css8" || ! echo "FAIL: saved style.css not found" || exit 1
grep -Fq 'url(img%20%281%29.gif)' "$css8" ||
! echo "FAIL #163: parens in unquoted url() not percent-encoded on rewrite" || exit 1
grep -Fq 'url(a%28b%29c%281%29.gif)' "$css8" ||
! echo "FAIL #163: not every paren in a url() was percent-encoded" || exit 1
grep -Fq 'url("q%20%284%29.gif")' "$css8" ||
! echo "FAIL #163: quoted url() altered or parens left literal on rewrite" || exit 1
# The url() detector is not CSS-specific: <script> and inline style= get the
# same encoding, but ordinary href/src (ending_p is the quote, not ')') keep
# literal parens -- the attribute checks guard the gate against over-firing.
site9="$tmp/urlparens"
mkdir -p "$site9"
for f in 'js (1).gif' 'inl (2).gif' 'asrc (3).gif' 'ahref (4).gif'; do gif "$site9/$f"; done
cat >"$site9/index.html" <<EOF
<html><body>
<script>var bg = "url(js%20%281%29.gif)";</script>
<div style="background-image:url(inl%20%282%29.gif)"></div>
<img src="asrc%20%283%29.gif">
<a href="ahref%20%284%29.gif">link</a>
</body></html>
EOF
out9="$tmp/urlparens-out"
crawl "$site9/index.html" "$out9"
saved9=$(savedhtml "$out9")
test -n "$saved9" || ! echo "FAIL: saved urlparens page not found" || exit 1
# rewrite-only: the JS-string asset is not queued for download
grep -Fq 'url(js%20%281%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in <script> url() not percent-encoded" || exit 1
found "inl (2).gif" "$out9"
grep -Fq 'url(inl%20%282%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in inline style url() not percent-encoded" || exit 1
found "asrc (3).gif" "$out9"
found "ahref (4).gif" "$out9"
grep -Fq 'src="asrc%20(3).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain src attribute were wrongly encoded" || exit 1
grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain href attribute were wrongly encoded" || exit 1
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
exit 0
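The encoding that the #163 checks above expect can be illustrated in isolation. This is a minimal sketch, assuming a hypothetical helper `encode_url_arg` that is not part of httrack: it percent-encodes spaces and parentheses in the argument of an unquoted `url()`, which is the form the greps look for. Plain `href`/`src` values would simply not be passed through it.

```bash
#!/bin/bash
# Hypothetical helper (an assumption for illustration, not httrack code):
# percent-encode the characters that would break an unquoted CSS url(...)
# token, then wrap the result in url().
encode_url_arg() {
    local s="$1"
    s="${s// /%20}"   # space
    s="${s//"("/%28}" # opening paren
    s="${s//")"/%29}" # closing paren
    printf 'url(%s)\n' "$s"
}

encode_url_arg 'img (1).gif'
```

The sketch leaves a literal `%` untouched, so it is only suitable for inputs that are not already percent-encoded.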

68
tests/01_engine-relative.test Executable file
@@ -0,0 +1,68 @@
#!/bin/bash
#
# lienrelatif (build relative path) + ident_url_relatif (resolve a link, collapse
# ./ and ../). Regression net for #137/#162; expected values hand-computed.
set -euo pipefail
# relative path from <curr>'s directory to <link>
rel() {
local got
got=$(httrack -O /dev/null -#l "$1" "$2")
test "$got" == "relative=$3" ||
{
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
exit 1
}
}
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
ident() {
local got
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
test "$got" == "$4" ||
{
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"
exit 1
}
}
### lienrelatif
rel 'dir/page.html' 'dir/index.html' 'page.html'
rel 'dir/page.html' 'dir/page.html' 'page.html' # self-link
rel 'a.html' 'dir/index.html' '../a.html'
rel 'x.html' 'a/b/c/index.html' '../../../x.html'
rel 'h/a/x.jpg' 'h/a/sub/page.html' '../x.jpg'
rel 'a/b/c/x.html' 'index.html' 'a/b/c/x.html'
rel 'h/sub/x.jpg' 'h/page.html' 'sub/x.jpg'
rel 'h/dir2/x.jpg' 'h/dir1/page.html' '../dir2/x.jpg' # sibling dir
rel 'h/bc/x.jpg' 'h/b/page.html' '../bc/x.jpg' # b/bc prefix trap
rel 'h/b/x.jpg' 'h/bc/page.html' '../b/x.jpg'
rel 'h2/img/x.jpg' 'h1/p/page.html' '../../h2/img/x.jpg' # cross-host
rel 'img.cdn/photo.jpg' 'www.site/articles/2020/post.html' '../../../img.cdn/photo.jpg'
rel 'h/a/' 'h/a/sub/page.html' '../' # link is ancestor dir
rel 'x.html' 'page.html' 'x.html'
rel 'dir/page.html?x=1' 'dir/index.html?y=2' 'page.html' # ? stripped
### ident_url_relatif
ident 'img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/sub/img.gif'
ident '/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/img.gif'
# embedded ../ collapses (#137)
ident '../img.gif' 'www.foo.com' '/dir/sub/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/articles/2020/logo.png'
ident '../../pix/sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/pix/logo.png'
ident '../../../../x.gif' 'www.foo.com' '/a/b/page.html' 'adr=www.foo.com fil=/x.gif' # above-root clamp
ident '?page=2' 'www.foo.com' '/dir/index.html?old=1' 'adr=www.foo.com fil=/dir/index.html?page=2'
ident 'http://other.com/a/b/../c/index.html' 'www.foo.com' '/p.html' 'adr=other.com fil=/a/c/index.html'
# file:// collapses ../ like the other schemes; traversal contained, // authority kept
ident 'file:///var/data/pix/sub/../logo.png' 'www.foo.com' '/p.html' 'adr=file:// fil=/var/data/pix/logo.png'
ident 'file:///a/b/c/../../d/e.gif' 'www.foo.com' '/p.html' 'adr=file:// fil=/a/d/e.gif'
ident 'file:///a/../../b' 'www.foo.com' '/p.html' 'adr=file:// fil=/b'
ident 'file://srv/share/../x' 'www.foo.com' '/p.html' 'adr=file:// fil=//srv/x'
ident 'mailto:foo@bar.com' 'www.foo.com' '/p.html' 'error=-1' # unsupported scheme
ident 'javascript:void(0)' 'www.foo.com' '/p.html' 'error=-1'
echo "OK"
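The `rel` expectations above follow a simple rule: drop the query string, skip the directory components both paths share, climb one `../` per remaining directory of the current page, then descend into the link. A minimal sketch of that rule, assuming a hypothetical pure-bash function `relpath` (it is not httrack's `lienrelatif`, only a model of the hand-computed values):

```bash
#!/bin/bash
# Hypothetical model of the expected values (not httrack's lienrelatif):
# relative path from the directory of <curr> to <link>, queries ignored.
relpath() {
    local link="${1%%[?]*}" curr="${2%%[?]*}"
    local ldir="" cdir="" file="${link##*/}"
    local -a l=() c=()
    local i=0 j out=""
    # directory part = everything before the last '/'
    [[ $link == */* ]] && ldir="${link%/*}"
    [[ $curr == */* ]] && cdir="${curr%/*}"
    [[ -n $ldir ]] && IFS=/ read -r -a l <<<"$ldir"
    [[ -n $cdir ]] && IFS=/ read -r -a c <<<"$cdir"
    # skip the common leading components (whole components, so b != bc)
    while ((i < ${#l[@]} && i < ${#c[@]})) && [[ ${l[i]} == "${c[i]}" ]]; do
        i=$((i + 1))
    done
    # climb out of what is left of <curr>, then descend into <link>
    for ((j = i; j < ${#c[@]}; j++)); do out+="../"; done
    for ((j = i; j < ${#l[@]}; j++)); do out+="${l[j]}/"; done
    printf '%s\n' "$out$file"
}

relpath 'h/bc/x.jpg' 'h/b/page.html'
```

Comparing whole components rather than string prefixes is what keeps the `b`/`bc` case from being mistaken for a shared directory.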

@@ -5,7 +5,7 @@ set -euo pipefail
# path simplify engine (fil_simplifie): collapses ./ and ../ segments.
simp() {
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
}
simp './foo/bar/' 'foo/bar/'
@@ -26,3 +26,17 @@ simp './a/../../b' 'b'
# empty segments ('//') are not dot-segments and are preserved, per RFC 3986
simp 'a//b' 'a//b'
simp 'a//b/../c' 'a//c'
# absolute paths keep the leading '/'; above-root '..' is clamped to it
simp '/a/../b' '/b'
simp '/a/../../b' '/b'
simp '/../x' '/x'
# collapses to nothing -> './' (relative) or '/' (absolute)
simp '..' './'
simp 'a/..' './'
simp '/' '/'
simp 'a/b/..' 'a/' # trailing bare '..'
simp 'a/../b?x=../y' 'b?x=../y' # '?' freezes simplification
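The `simp` cases above amount to a small stack walk: set the query aside, remember a leading `/`, push ordinary (and empty) segments, pop on `..`, and fall back to `./` when nothing is left. A minimal sketch under those assumptions, using a hypothetical function `simplify_path`; it models the listed expectations and is not the `fil_simplifie` implementation:

```bash
#!/bin/bash
# Hypothetical model of the simp expectations (not httrack's fil_simplifie).
simplify_path() {
    local path="$1" query="" prefix="" seg out n
    local -a stack=()
    # '?' freezes simplification: keep the query string untouched
    if [[ $path == *[?]* ]]; then
        query="?${path#*[?]}"
        path="${path%%[?]*}"
    fi
    # a leading '/' is kept, and above-root '..' is clamped to it
    if [[ $path == /* ]]; then
        prefix="/"
        path="${path#/}"
    fi
    # walk directory segments; empty segments ('//') are pushed as-is
    while [[ $path == */* ]]; do
        seg="${path%%/*}"
        path="${path#*/}"
        case "$seg" in
        .) ;;
        ..)
            n=${#stack[@]}
            if ((n > 0)); then unset "stack[n-1]"; fi
            ;;
        *) stack+=("$seg") ;;
        esac
    done
    # a trailing bare '.' or '..' acts on the directory stack as well
    case "$path" in
    .) path="" ;;
    ..)
        n=${#stack[@]}
        if ((n > 0)); then unset "stack[n-1]"; fi
        path=""
        ;;
    esac
    out="$prefix"
    for seg in "${stack[@]}"; do out+="$seg/"; done
    out+="$path"
    [[ -n $out ]] || out="./"
    printf '%s%s\n' "$out" "$query"
}

simplify_path 'a//b/../c'
```

Popping on an empty stack is a no-op, which gives the same clamping for relative paths (`./a/../../b`) and absolute ones (`/../x`).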

@@ -21,9 +21,15 @@ test "$out" == "strsafe: OK" || exit 1
# the bounded macro aborts (non-zero exit), so don't let set -e trip on it # the bounded macro aborts (non-zero exit), so don't let set -e trip on it
err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
case "$err" in case "$err" in
*"strsafe: NOT aborted"*) echo "over-capacity write was NOT caught" >&2; exit 1 ;; *"strsafe: NOT aborted"*)
*"overflow while copying"*) ;; echo "over-capacity write was NOT caught" >&2
*) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;; exit 1
;;
*"overflow while copying"*) ;;
*)
echo "expected htssafe overflow abort, got: $err" >&2
exit 1
;;
esac esac
# Same guarantee for the htsbuff builder. The source is exactly the buffer # Same guarantee for the htsbuff builder. The source is exactly the buffer
@@ -32,7 +38,13 @@ esac
# aborted"). Match the specific htsbuff abort message, not just any assert. # aborted"). Match the specific htsbuff abort message, not just any assert.
err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
case "$err" in case "$err" in
*"strsafe: NOT aborted"*) echo "htsbuff over-capacity write was NOT caught" >&2; exit 1 ;; *"strsafe: NOT aborted"*)
*"htsbuff append overflow"*) ;; echo "htsbuff over-capacity write was NOT caught" >&2
*) echo "expected htsbuff overflow abort, got: $err" >&2; exit 1 ;; exit 1
;;
*"htsbuff append overflow"*) ;;
*)
echo "expected htsbuff overflow abort, got: $err" >&2
exit 1
;;
esac esac

@@ -3,6 +3,6 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 5 httrack http://ut.httrack.com/simple/basic.html

@@ -3,10 +3,10 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/cookies/third.html \
--found ut.httrack.com/cookies/second.html \
--found ut.httrack.com/cookies/entrance.html \
httrack http://ut.httrack.com/cookies/entrance.php

@@ -3,21 +3,21 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests
bash crawl-test.sh \
--errors 1 --files 5 \
--found 'café.ut.httrack.com/unicode-links/café3860.html' \
--found 'café.ut.httrack.com/unicode-links/café30f4.html' \
--found 'café.ut.httrack.com/unicode-links/café5e1f.html' \
--found 'café.ut.httrack.com/unicode-links/café7b30.html' \
httrack 'http://ut.httrack.com/unicode-links/idna.html' \
'+*.ut.httrack.com/*' --robots=0
# unicode tests (bogus links)
bash crawl-test.sh \
--errors 0 --files 1 \
--found 'ut.httrack.com/unicode-links/idna_bogus.html' \
httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \
'-*' --robots=0

@@ -3,67 +3,67 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests
bash crawl-test.sh \
--errors 1 --files 10 \
--found ut.httrack.com/unicode-links/caf%a91bce.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café463e.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/caféae52.html \
--found ut.httrack.com/unicode-links/caféc009.html \
--found ut.httrack.com/unicode-links/utf8.html \
httrack http://ut.httrack.com/unicode-links/utf8.html
bash crawl-test.sh \
--errors 4 --files 7 \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caf%e939bd.html \
--found ut.httrack.com/unicode-links/caf%e9ae52.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/default.html \
httrack http://ut.httrack.com/unicode-links/default.html
bash crawl-test.sh \
--errors 2 --files 9 \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café647f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/iso88591.html \
httrack http://ut.httrack.com/unicode-links/iso88591.html
bash crawl-test.sh \
--errors 4 --files 9 \
--found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/cafébf43.html \
--found ut.httrack.com/unicode-links/cafédcd8.html \
--found ut.httrack.com/unicode-links/café2461.html \
--found ut.httrack.com/unicode-links/caf%a8%a61bce.html \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/gb18030.html \
httrack http://ut.httrack.com/unicode-links/gb18030.html

@@ -3,10 +3,10 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=42&can=1 # http://code.google.com/p/httrack/issues/detail?id=42&can=1
# we expect 2 errors only because other links are too longs (to be modified if suitable) # we expect 2 errors only because other links are too longs (to be modified if suitable)
bash crawl-test.sh --errors 2 --files 1 \ bash crawl-test.sh --errors 2 --files 1 \
--found ut.httrack.com/overflow/longquerywithaccents.html \ --found ut.httrack.com/overflow/longquerywithaccents.html \
httrack http://ut.httrack.com/overflow/longquerywithaccents.php httrack http://ut.httrack.com/overflow/longquerywithaccents.php

@@ -3,45 +3,45 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=4&can=1
bash crawl-test.sh --errors 0 --files 4 \
--found ut.httrack.com/parsing/back5e1f.gif \
--found ut.httrack.com/parsing/events.html \
--found ut.httrack.com/parsing/fade230f4.gif \
--found ut.httrack.com/parsing/fade3860.gif \
httrack http://ut.httrack.com/parsing/events.html
# http://code.google.com/p/httrack/issues/detail?id=2&can=1
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/background-image.css \
--found ut.httrack.com/parsing/background-image.html \
--found ut.httrack.com/parsing/fade.gif \
httrack http://ut.httrack.com/parsing/background-image.html
# javascript parsing
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/back.gif \
--found ut.httrack.com/parsing/fade.gif \
--found ut.httrack.com/parsing/javascript.html \
httrack http://ut.httrack.com/parsing/javascript.html
# handling of + before query string
bash crawl-test.sh --errors 0 --files 6 \
--found ut.httrack.com/parsing/escaping.html \
--found "ut.httrack.com/parsing/foo bar30f4.html" \
--found "ut.httrack.com/parsing/foo bar5e1f.html" \
--found "ut.httrack.com/parsing/foo+bar+plus3860.html" \
--found "ut.httrack.com/parsing/foo barae52.html" \
--found "ut.httrack.com/parsing/foo bar7b30.html" \
httrack http://ut.httrack.com/parsing/escaping.html
# handling of # encoded in filename
# see http://code.google.com/p/httrack/issues/detail?id=25
bash crawl-test.sh --errors 2 --files 4 \
--found "ut.httrack.com/parsing/escaping2.html" \
--found "ut.httrack.com/parsing/++foo++bar++plus++.html" \
--found "ut.httrack.com/parsing/foo#bar#.html" \
--found "ut.httrack.com/parsing/foo bar.html" \
httrack http://ut.httrack.com/parsing/escaping2.html

@@ -3,11 +3,11 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
bash crawl-test.sh --errors 0 --files 5 httrack https://ut.httrack.com/simple/basic.html

@@ -0,0 +1,136 @@
#!/bin/bash
#
# Issue #85: an https crawl must go through the configured proxy (CONNECT
# tunnel), not bypass it and hit the origin directly. Fully local: a self-signed
# TLS origin plus a logging CONNECT proxy, so no network access is needed.
set -euo pipefail
: "${top_srcdir:=..}"
if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
if ! command -v python3 >/dev/null 2>&1 || ! command -v openssl >/dev/null 2>&1; then
echo "python3/openssl missing, skipping"
exit 77
fi
server="$top_srcdir/tests/proxy-https-server.py"
tmpdir=$(mktemp -d)
pids=
cleanup() {
for pid in $pids; do
kill "$pid" 2>/dev/null || true
done
rm -rf "$tmpdir"
}
trap cleanup EXIT
# self-signed cert for the local TLS origin (httrack does not verify certs)
openssl req -x509 -newkey rsa:2048 -keyout "$tmpdir/key.pem" \
-out "$tmpdir/cert.pem" -days 2 -nodes -subj "/CN=127.0.0.1" \
>/dev/null 2>&1
cat "$tmpdir/key.pem" "$tmpdir/cert.pem" >"$tmpdir/both.pem"
# start_server <logdir> <mode>: launches a proxy+origin pair, sets $origin_port
# and $proxy_port from its announced ephemeral ports.
start_server() {
local dir="$1" mode="$2" ports
mkdir -p "$dir"
ports="$dir/ports.txt"
python3 "$server" "$tmpdir/both.pem" "$dir" "$mode" \
>"$ports" 2>"$dir/server.err" &
pids="$pids $!"
for _ in $(seq 1 100); do
grep -q "^ready" "$ports" 2>/dev/null && break
sleep 0.1
done
grep -q "^ready" "$ports" 2>/dev/null || {
echo "server ($mode) did not start" >&2
cat "$dir/server.err" >&2
exit 1
}
origin_port=$(awk '/^ORIGIN/{print $2}' "$ports")
proxy_port=$(awk '/^PROXY/{print $2}' "$ports")
}
# Run httrack, but kill it after a deadline so a hang (e.g. a missing bound on
# the proxy response) surfaces as the kill code $HANG_RC instead of stalling the
# whole job. A portable stand-in for `timeout`, which macOS lacks.
HANG_RC=137 # 128 + SIGKILL
run_crawl() {
local out="$1" proxy="$2" port="$3"
rm -rf "$out"
httrack "https://127.0.0.1:${port}/" --proxy "$proxy" \
-O "$out" -r1 -s0 --timeout=10 >"$out.log" 2>&1 &
local pid=$!
(sleep 60 && kill -9 "$pid" 2>/dev/null) &
local guard=$!
local rc=0
wait "$pid" 2>/dev/null || rc=$?
kill "$guard" 2>/dev/null || true
wait "$guard" 2>/dev/null || true
return "$rc"
}
# --- working proxy ----------------------------------------------------------
ok="$tmpdir/ok"
start_server "$ok" ok
# 1. page retrieved AND the proxy saw a CONNECT to the origin
run_crawl "$ok/out" "127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out" || {
echo "FAIL: origin page not downloaded through proxy" >&2
cat "$ok/out.log" >&2
exit 1
}
grep -q "^CONNECT 127.0.0.1:${origin_port} " "$ok/proxy.log" || {
echo "FAIL: proxy never received a CONNECT (https bypassed the proxy)" >&2
cat "$ok/proxy.log" >&2
exit 1
}
echo "OK: https tunneled through proxy via CONNECT"
# 2. authenticated proxy: creds ride the CONNECT, and NEVER reach the origin
: >"$ok/proxy.log"
: >"$ok/origin-headers.log"
run_crawl "$ok/out2" "user:secret@127.0.0.1:${proxy_port}" "$origin_port"
grep -rq "ORIGIN-PAGE-85" "$ok/out2" || {
echo "FAIL: origin page not downloaded through authenticated proxy" >&2
exit 1
}
got=$(awk '/^AUTH Basic /{print $3}' "$ok/proxy.log" | head -1)
# base64("user:secret"); compared as a literal to stay portable (no base64 -d,
# which differs between GNU and BSD)
test "$got" == "dXNlcjpzZWNyZXQ=" || {
echo "FAIL: Proxy-Authorization not carried on CONNECT (got '$got')" >&2
cat "$ok/proxy.log" >&2
exit 1
}
if grep -qi "proxy-authorization" "$ok/origin-headers.log"; then
echo "FAIL: proxy credentials leaked to the origin through the tunnel" >&2
cat "$ok/origin-headers.log" >&2
exit 1
fi
echo "OK: proxy credentials carried on CONNECT, not leaked to origin"
# --- hostile proxy ----------------------------------------------------------
# A proxy that answers 200 then streams headers forever must not hang the crawl:
# the client bounds the response. run_crawl kills a hung httrack after 60s, so a
# missing bound surfaces as $HANG_RC here.
flood="$tmpdir/flood"
start_server "$flood" flood
rc=0
run_crawl "$flood/out" "127.0.0.1:${proxy_port}" "$origin_port" || rc=$?
test "$rc" -ne "$HANG_RC" || {
echo "FAIL: crawl hung on a flooding proxy (bounded read missing)" >&2
exit 1
}
grep -rq "ORIGIN-PAGE-85" "$flood/out" 2>/dev/null && {
echo "FAIL: flooding proxy unexpectedly served the page" >&2
exit 1
}
echo "OK: bounded proxy response, no hang on a flooding proxy"
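The watchdog inside `run_crawl` is not specific to httrack and can be factored into a generic wrapper. A minimal sketch, assuming a hypothetical function `run_with_deadline <seconds> <command...>` that is not part of this test suite: it returns the command's own exit status, or 137 (128 + SIGKILL) when the deadline fires first.

```bash
#!/bin/bash
# Hypothetical generalisation of the run_crawl watchdog; a portable stand-in
# for timeout(1) built only from sleep, kill and wait.
run_with_deadline() {
    local secs="$1"
    shift
    "$@" &
    local pid=$!
    # guard: kill the command if it is still running after the deadline;
    # its streams are detached so a leftover sleep cannot hold a pipe open
    (sleep "$secs" && kill -9 "$pid" 2>/dev/null) </dev/null >/dev/null 2>&1 &
    local guard=$!
    local rc=0
    wait "$pid" 2>/dev/null || rc=$?
    # the command finished (or was killed): the guard is no longer needed
    kill "$guard" 2>/dev/null || true
    wait "$guard" 2>/dev/null || true
    return "$rc"
}

rc=0
run_with_deadline 1 sleep 20 || rc=$?
echo "exit status: $rc"
```

Callers under `set -e` need the `|| rc=$?` form shown here, as the hostile-proxy step above does, so that a non-zero status is captured instead of aborting the script.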

15
tests/13_local-cookies.test Executable file
@@ -0,0 +1,15 @@
#!/bin/bash
#
# Cookie chain against the local test server (replaces the old online
# ut/cookies/*.php fixtures). entrance.php sets cat/cake; second.php checks
# them and sets badger; third.php checks all three. A missing or wrong cookie
# returns 500, which would surface as an httrack error and a missing file, so a
# clean 3-files/0-errors run proves the cookie jar is replayed across links.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 --files 3 \
--found 'cookies/entrance.html' \
--found 'cookies/second.html' \
--found 'cookies/third.html' \
httrack 'BASEURL/cookies/entrance.php'

18
tests/14_local-https.test Executable file
@@ -0,0 +1,18 @@
#!/bin/bash
#
# HTTPS crawl against the local test server, using the shipped self-signed
# cert. httrack does not verify certs (htslib.c: SSL_CTX_new with no
# SSL_CTX_set_verify), so the self-signed cert is accepted as-is and this
# exercises the real TLS path offline. basic.html links to link.html with four
# distinct query strings, each saved under a hashed name -> 5 files.
: "${top_srcdir:=..}"
if test "$HTTPS_SUPPORT" == "no"; then
echo "no https support compiled, skipping"
exit 77
fi
bash "$top_srcdir/tests/local-crawl.sh" --tls --errors 0 --files 5 \
--found 'simple/basic.html' \
httrack 'BASEURL/simple/basic.html'

25
tests/15_local-types.test Normal file
@@ -0,0 +1,25 @@
#!/bin/bash
#
# Content-Type vs URL-extension naming (issue #267 family) under the default
# delayed type check (-%N2). Policy: a MISSING Content-Type must not clobber a
# URL extension that maps to a specific non-HTML type (.png/.pdf stay as-is);
# an explicitly DECLARED type is trusted, so a binary-looking URL that really
# serves HTML (text/html on .pdf/.jpg) is named .html. The "wrong" names are
# asserted absent so a regression in either direction fails here.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/notype.png' \
--found 'types/notype.pdf' --not-found 'types/notype.html' \
--found 'types/photo.png' \
--found 'types/doc.pdf' \
--found 'types/lie.html' --not-found 'types/lie.png' \
--found 'types/report.html' --not-found 'types/report.pdf' \
--found 'types/page.htm' --not-found 'types/page.html' \
--found 'types/script.js' \
--found 'types/style.css' \
--found 'types/data.json' \
--found 'types/control.html' --not-found 'types/control.php' \
--found 'types/gend61c.png' --not-found 'types/gend61c.html' \
httrack 'BASEURL/types/index.html'
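The naming policy this test pins down can be written as a small decision table. The sketch below is a deliberate simplification, not HTTrack's naming code: `save_ext` and `known_binary_ext` are hypothetical names, the extension and MIME lists are illustrative assumptions, and hashed names such as `gend61c.png` are out of scope.

```bash
#!/bin/bash
# Hypothetical table: URL extensions that map to a specific non-HTML type.
known_binary_ext() {
    case "$1" in
    png | jpg | gif | pdf) return 0 ;;
    *) return 1 ;;
    esac
}

# save_ext <url-extension> <declared-content-type, empty if header missing>
save_ext() {
    local ext="$1" ctype="$2"
    if [[ -z $ctype ]]; then
        # missing header: do not let the text/html default clobber a
        # specific non-HTML URL extension
        if known_binary_ext "$ext"; then echo "$ext"; else echo html; fi
        return
    fi
    case "$ctype" in
    text/html)
        # an explicitly declared type wins; an html-ish name is kept as is
        case "$ext" in htm | html) echo "$ext" ;; *) echo html ;; esac
        ;;
    image/png) echo png ;;
    application/pdf) echo pdf ;;
    *) echo "$ext" ;;
    esac
}

save_ext pdf ''          # header missing on a .pdf URL
save_ext pdf 'text/html' # interstitial served at a .pdf URL
```

The two calls mirror the `notype.pdf` and `report.pdf` fixtures: the first keeps `pdf`, the second yields `html`.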

@@ -0,0 +1,11 @@
#!/bin/bash
#
# --assume under the default delayed type check (-%N2), issue #56. A user type
# pinned with --assume must be honored immediately, not lost to the delayed
# name: photo.png served as image/png but assumed text/html is saved as .html.
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'types/photo.html' --not-found 'types/photo.png' \
httrack 'BASEURL/types/photo.png' --assume png=text/html

@@ -2,6 +2,9 @@
# explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would # explicitly: automake does not expand wildcards in EXTRA_DIST, so a glob would
# silently drop it from the dist tarball and break "make distcheck". # silently drop it from the dist tarball and break "make distcheck".
EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \ EXTRA_DIST = $(TESTS) crawl-test.sh run-all-tests.sh check-network.sh \
proxy-https-server.py \
local-crawl.sh local-server.py server.crt server.key \
server-root/simple/basic.html server-root/simple/link.html \
fixtures/cache-golden/hts-cache/new.zip fixtures/cache-golden/hts-cache/new.zip
TESTS_ENVIRONMENT = TESTS_ENVIRONMENT =
@@ -24,6 +27,8 @@ TESTS = \
01_engine-cache-golden.test \ 01_engine-cache-golden.test \
01_engine-charset.test \ 01_engine-charset.test \
01_engine-cmdline.test \ 01_engine-cmdline.test \
01_engine-cookies.test \
01_engine-copyopt.test \
01_engine-doitlog.test \ 01_engine-doitlog.test \
01_engine-entities.test \ 01_engine-entities.test \
01_engine-filter.test \ 01_engine-filter.test \
@@ -32,6 +37,7 @@ TESTS = \
01_engine-mime.test \ 01_engine-mime.test \
01_engine-parse.test \ 01_engine-parse.test \
01_engine-rcfile.test \ 01_engine-rcfile.test \
01_engine-relative.test \
01_engine-simplify.test \ 01_engine-simplify.test \
01_engine-strsafe.test \ 01_engine-strsafe.test \
02_manpage-regen.test \ 02_manpage-regen.test \
@@ -42,6 +48,11 @@ TESTS = \
11_crawl-international.test \ 11_crawl-international.test \
11_crawl-longurl.test \ 11_crawl-longurl.test \
11_crawl-parsing.test \ 11_crawl-parsing.test \
12_crawl_https.test 12_crawl_https.test \
13_crawl_proxy_https.test \
13_local-cookies.test \
14_local-https.test \
15_local-types.test \
16_local-assume.test
CLEANFILES = check-network_sh.cache CLEANFILES = check-network_sh.cache

@@ -6,39 +6,39 @@
# do not enable online tests (./configure --disable-online-unit-tests) # do not enable online tests (./configure --disable-online-unit-tests)
if test "$ONLINE_UNIT_TESTS" == "no"; then if test "$ONLINE_UNIT_TESTS" == "no"; then
echo "online tests are disabled" >&2 echo "online tests are disabled" >&2
exit 1 exit 1
# enable online tests (--enable-online-unit-tests) # enable online tests (--enable-online-unit-tests)
elif test "$ONLINE_UNIT_TESTS" == "yes"; then elif test "$ONLINE_UNIT_TESTS" == "yes"; then
exit 0 exit 0
# check if online tests are reachable # check if online tests are reachable
else else
# test url # test url
url=http://ut.httrack.com/enabled url=http://ut.httrack.com/enabled
# cache file name # cache file name
cache=check-network_sh.cache cache=check-network_sh.cache
# cached result ? # cached result ?
if test -f $cache ; then if test -f $cache; then
if grep -q "ok" $cache ; then if grep -q "ok" $cache; then
exit 0 exit 0
else else
echo "online tests are disabled (cached)" >&2 echo "online tests are disabled (cached)" >&2
exit 1 exit 1
fi fi
# fetch single file # fetch single file
elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null ; then elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null; then
echo "ok" > $cache echo "ok" >$cache
exit 0 exit 0
else else
echo "error" > $cache echo "error" >$cache
echo "online tests are disabled (auto)" >&2 echo "online tests are disabled (auto)" >&2
exit 1 exit 1
fi fi
fi fi

@@ -2,185 +2,184 @@
# #
function warning { function warning {
echo "** $*" >&2 echo "** $*" >&2
return 0 return 0
} }
function die { function die {
warning "$*" warning "$*"
exit 1 exit 1
} }
function debug { function debug {
if test -n "$verbose"; then if test -n "$verbose"; then
echo "$*" >&2 echo "$*" >&2
fi fi
} }
function info { function info {
printf "[$*] ..\t" >&2 printf '[%s] ..\t' "$*" >&2
} }
function result { function result {
echo "$*" >&2 echo "$*" >&2
} }
function cleanup { function cleanup {
debug "cleaning function called" debug "cleaning function called"
if test -n "$tmpdir"; then if test -n "$tmpdir"; then
if test -d "$tmpdir"; then if test -d "$tmpdir"; then
if test -z "$nopurge"; then if test -z "$nopurge"; then
debug "cleaning up $tmpdir" debug "cleaning up $tmpdir"
rm -rf "$tmpdir" rm -rf "$tmpdir"
fi fi
fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi
} }
function usage { function usage {
cat << EOF cat <<EOF
usage: $0 usage: $0
EOF EOF
} }
function assert_equals { function assert_equals {
info "$1" info "$1"
if test ! "$2" == "$3"; then if test ! "$2" == "$3"; then
result "expected '$2', got '$3'" result "expected '$2', got '$3'"
exit 1 exit 1
else else
result "OK ($2)" result "OK ($2)"
fi fi
} }
function start-crawl { function start-crawl {
# parse args # parse args
pos=1 pos=1
while test "$#" -ge "$pos" ; do while test "$#" -ge "$pos"; do
case "${!pos}" in case "${!pos}" in
--debug) --debug)
verbose=1 verbose=1
;; ;;
--no-purge|--summary|--print-files) --no-purge | --summary | --print-files) ;;
;; --errors | --files | --found | --not-found | --directory)
--errors|--files|--found|--not-found|--directory) pos=$((pos + 1))
pos=$[${pos}+1] test "$#" -ge "$pos" || warning "missing argument" || return 1
test "$#" -ge "$pos" || warning "missing argument" || return 1 ;;
;; httrack)
httrack) pos=$((pos + 1))
pos=$[${pos}+1] break
break; ;;
;; *)
*) warning "unrecognized option ${!pos}"
warning "unrecognized option ${!pos}" return 1
return 1 ;;
;; esac
esac pos=$((pos + 1))
pos=$[${pos}+1] done
done debug "remaining args: ${*:pos}"
debug "remaining args: ${@:${pos}}"
# ut/ won't exceed 2 minutes # ut/ won't exceed 2 minutes
moreargs="--quiet --max-time=120 --timeout=30 --connection-per-second=5" moreargs=(--quiet --max-time=120 --timeout=30 --connection-per-second=5)
# proxy environment ? # proxy environment ?
if test -n "$http_proxy"; then if test -n "${http_proxy:-}"; then
moreargs="$moreargs --proxy $http_proxy" moreargs+=(--proxy "$http_proxy")
fi fi
test -n "$tmpdir" || ! warning "no tmpdir" || return 1 test -n "$tmpdir" || ! warning "no tmpdir" || return 1
tmp="${tmpdir}/crawl" tmp="${tmpdir}/crawl"
rm -rf "$tmp"
mkdir "$tmp" || ! warning "could not create $tmp" || return 1
which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" ${moreargs} ${@:${pos}}
info "running httrack ${@:${pos}}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" ${moreargs} ${@:${pos}} >"${log}" 2>&1 &
crawlpid="$!"
debug "started crawler on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking absence of $1"
if test ! -f "${tmp}/$1" ; then
result "OK"
else
result "present"
exit 1
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" \
| sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break;
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
rm -rf "$tmp" rm -rf "$tmp"
else mkdir "$tmp" || ! warning "could not create $tmp" || return 1
tmpdir=
fi which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" "${moreargs[@]}" "${@:pos}"
info "running httrack ${*:pos}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" "${moreargs[@]}" "${@:pos}" >"${log}" 2>&1 &
crawlpid="$!"
debug "started crawler on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking absence of $1"
if test ! -f "${tmp}/$1"; then
result "OK"
else
result "present"
exit 1
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" |
sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
rm -rf "$tmp"
else
tmpdir=
fi
} }
# check args # check args
@@ -195,7 +194,7 @@ tmpdir=
crawlpid= crawlpid=
nopurge= nopurge=
verbose= verbose=
trap "cleanup" 0 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25 trap cleanup EXIT HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# working directory # working directory
tmpdir="${tmptopdir}/httrack_ut.$$" tmpdir="${tmptopdir}/httrack_ut.$$"

235
tests/local-crawl.sh Executable file

@@ -0,0 +1,235 @@
#!/bin/bash
#
# Launcher for httrack crawl tests against the local Python test server.
#
# Starts tests/local-server.py on an ephemeral port, discovers the port from
# the server's stdout, then runs httrack against http(s)://127.0.0.1:$PORT and
# audits the mirror. The server is always killed and the tmpdir removed on exit.
#
# The token BASEURL in any httrack argument is replaced with the discovered
# http(s)://127.0.0.1:$PORT base. --found/--directory paths are relative to the
# discovered host root (127.0.0.1_<port>/), since the random port leaks into
# the mirror directory name.
#
# Usage:
# bash local-crawl.sh [--tls] [--root DIR] \
# --errors N --files N --found PATH ... --directory PATH ... \
# httrack BASEURL/some/path [httrack-args...]
set -u
testdir=$(cd "$(dirname "$0")" && pwd)
server="${testdir}/local-server.py"
root="${LOCAL_SERVER_ROOT:-${testdir}/server-root}"
cert="${testdir}/server.crt"
key="${testdir}/server.key"
tls=
verbose=
tmpdir=
serverpid=
crawlpid=
function warning {
echo "** $*" >&2
return 0
}
function die {
warning "$*"
exit 1
}
function debug {
test -n "$verbose" && echo "$*" >&2
return 0
}
function info { printf "[%s] ..\t" "$*" >&2; }
function result { echo "$*" >&2; }
function cleanup {
if test -n "$crawlpid"; then
kill -9 "$crawlpid" 2>/dev/null
crawlpid=
fi
if test -n "$serverpid"; then
kill "$serverpid" 2>/dev/null
# Reap it so the port is released before we rm the tmpdir/log.
wait "$serverpid" 2>/dev/null
serverpid=
fi
if test -n "$tmpdir" && test -d "$tmpdir"; then
test -n "$nopurge" || rm -rf "$tmpdir"
fi
}
function assert_equals {
info "$1"
if test ! "$2" == "$3"; then
result "expected '$2', got '$3'"
exit 1
fi
result "OK ($2)"
}
nopurge=
trap cleanup EXIT HUP INT QUIT PIPE TERM
# python3 is required; mirror check-network.sh's skip-with-77 convention.
command -v python3 >/dev/null || ! echo "python3 not found; skipping local crawl tests" || exit 77
tmptopdir=${TMPDIR:-/tmp}
test -d "$tmptopdir" || mkdir -p "$tmptopdir" || die "no temporary directory; set TMPDIR"
tmpdir=$(mktemp -d "${tmptopdir}/httrack_local.XXXXXX") || die "could not create tmpdir"
# --- parse leading control flags --------------------------------------------
declare -a audit=()
scheme=http
pos=0
args=("$@")
nargs=$#
while test "$pos" -lt "$nargs"; do
case "${args[$pos]}" in
--debug) verbose=1 ;;
--no-purge)
nopurge=1
audit+=("--no-purge")
;;
--tls)
tls=1
scheme=https
;;
--root)
pos=$((pos + 1))
root="${args[$pos]}"
;;
--errors | --files)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
--found | --not-found | --directory)
audit+=("${args[$pos]}" "${args[$((pos + 1))]}")
pos=$((pos + 1))
;;
httrack)
pos=$((pos + 1))
break
;;
*) die "unrecognized option ${args[$pos]}" ;;
esac
pos=$((pos + 1))
done
# --- start the server --------------------------------------------------------
test -r "$server" || die "cannot read $server"
serverlog="${tmpdir}/server.log"
serverargs=(--root "$root")
if test -n "$tls"; then
serverargs+=(--tls --cert "$cert" --key "$key")
fi
debug "starting python3 $server ${serverargs[*]}"
python3 "$server" "${serverargs[@]}" >"$serverlog" 2>&1 &
serverpid=$!
# Wait for the "PORT <n>" line (server prints it once bound).
port=
for _ in $(seq 1 50); do
if test -s "$serverlog"; then
line=$(head -n1 "$serverlog")
if test "${line%% *}" == "PORT"; then
port="${line#PORT }"
break
fi
fi
kill -0 "$serverpid" 2>/dev/null || die "server exited early: $(cat "$serverlog")"
sleep 0.1
done
test -n "$port" || die "could not discover server port: $(cat "$serverlog")"
debug "server listening on ${scheme}://127.0.0.1:${port}"
baseurl="${scheme}://127.0.0.1:${port}"
# --- substitute BASEURL in the remaining (httrack) args ----------------------
declare -a hts=()
while test "$pos" -lt "$nargs"; do
hts+=("${args[$pos]//BASEURL/$baseurl}")
pos=$((pos + 1))
done
# --- run httrack -------------------------------------------------------------
which httrack >/dev/null || die "could not find httrack"
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || die "could not run httrack"
out="${tmpdir}/crawl"
mkdir "$out" || die "could not create $out"
# Localhost is fast; disable the rate/bandwidth safety limits but keep a
# max-time backstop so a hang cannot wedge the suite.
declare -a moreargs=(--quiet --max-time=120 --timeout=30 --disable-security-limits --robots=0)
log="${tmpdir}/log"
info "running httrack ${hts[*]}"
httrack -O "$out" --user-agent="httrack $ver local ($(uname -omrs))" "${moreargs[@]}" "${hts[@]}" >"$log" 2>&1 &
crawlpid=$!
wait "$crawlpid"
crawlres=$?
crawlpid=
# httrack exits 0 even on hard connect/DNS errors, so this is a backstop only;
# the real guard is the audit below (--errors 0 plus the host-root existence check).
test "$crawlres" -eq 0 || ! result "httrack exited $crawlres" || {
cat "$log" >&2
exit 1
}
result "OK"
grep -iE "^[0-9:]*[[:space:]]Error:" "${out}/hts-log.txt" >&2
# --- discover the single host root (127.0.0.1_<port> or 127.0.0.1) -----------
hostroot=
for cand in "${out}/127.0.0.1_${port}" "${out}/127.0.0.1"; do
if test -d "$cand"; then
hostroot="$cand"
break
fi
done
test -n "$hostroot" || die "could not find host root under $out"
debug "host root: $hostroot"
# --- audit -------------------------------------------------------------------
i=0
while test "$i" -lt "${#audit[@]}"; do
case "${audit[$i]}" in
--errors)
i=$((i + 1))
assert_equals "checking errors" "${audit[$i]}" \
"$(grep -iEc "^[0-9:]*[[:space:]]Error:" "${out}/hts-log.txt")"
;;
--files)
i=$((i + 1))
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${out}/hts-log.txt" |
sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "${audit[$i]}" "$nFiles"
;;
--found)
i=$((i + 1))
info "checking for ${audit[$i]}"
if test -f "${hostroot}/${audit[$i]}"; then result "OK"; else
result "not found"
exit 1
fi
;;
--not-found)
i=$((i + 1))
info "checking absence of ${audit[$i]}"
if test ! -f "${hostroot}/${audit[$i]}"; then result "OK"; else
result "present"
exit 1
fi
;;
--directory)
i=$((i + 1))
info "checking for dir ${audit[$i]}"
if test -d "${hostroot}/${audit[$i]}"; then result "OK"; else
result "not found"
exit 1
fi
;;
esac
i=$((i + 1))
done
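
The launcher's port discovery and BASEURL substitution can be sketched in Python as follows. This is only an illustration of the two steps, not part of the change set: the helper names `parse_port_line` and `substitute_baseurl` are hypothetical, and the sketch assumes the server's first stdout line has the form `PORT <n>`, as the script above expects.

```python
def parse_port_line(line):
    """Return the port from a 'PORT <n>' line, or None if the line does not match."""
    head, _, tail = line.strip().partition(" ")
    if head != "PORT" or not tail.isdigit():
        return None
    return int(tail)


def substitute_baseurl(args, scheme, port):
    """Replace every BASEURL token in the httrack arguments with the discovered base."""
    baseurl = f"{scheme}://127.0.0.1:{port}"
    return [arg.replace("BASEURL", baseurl) for arg in args]


# Hypothetical first line of the server log, then the arguments after "httrack".
port = parse_port_line("PORT 40123\n")
print(substitute_baseurl(["BASEURL/cookies/entrance.php", "--quiet"], "http", port))
```

Returning `None` for a non-matching line lets a caller keep polling, which corresponds to the retry loop in the shell script.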

250
tests/local-server.py Executable file

@@ -0,0 +1,250 @@
#!/usr/bin/env python3
"""Self-contained local web server for httrack's crawl tests.

Serves static fixtures from a docroot plus a handful of dynamic endpoints
(cookies, ...) so httrack can be exercised over loopback, deterministically and
offline, instead of crawling the live ut.httrack.com.

Binds to an ephemeral port (port 0) and prints the chosen port to stdout as
"PORT <n>\n" so a launcher can discover it. Pass --tls to wrap the socket with
the shipped self-signed test cert; httrack does not verify certs, so no CA
trust plumbing is needed.

stdlib only (http.server + ssl) -- no new build or runtime dependency.
"""
import argparse
import os
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote, unquote, urlsplit

# Cookie chain replicated from the old ut/cookies/*.php fixtures.
COOKIE_PATH = "/cookies/"
COOKIES = {
    "cat": "dog",
    "cake": "is a lie!",
    "badger": "mushroom, with 'ants'",
}

PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
\t<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
\t<title>Sample test</title>
</head>
<body>
{body}
</body>
</html>
"""


class Handler(SimpleHTTPRequestHandler):
    # Quieter logging; the launcher captures httrack's own log anyway.
    def log_message(self, fmt, *args):
        if os.environ.get("LOCAL_SERVER_VERBOSE"):
            super().log_message(fmt, *args)

    # --- helpers -----------------------------------------------------------
    def request_cookies(self):
        """Parse the Cookie header into {name: decoded-value}.
        Mirrors PHP's $_COOKIE: values are url-decoded, matching the encoding
        applied when the cookie was set (see set_cookie)."""
        jar = {}
        raw = self.headers.get("Cookie", "")
        for pair in raw.split(";"):
            pair = pair.strip()
            if "=" in pair:
                name, value = pair.split("=", 1)
                jar[name.strip()] = unquote(value.strip())
        return jar

    def set_cookie(self, name, value):
        """Queue a Set-Cookie header, url-encoding the value like PHP's
        setcookie() so spaces/quotes/commas stay a single token that httrack
        can store and replay verbatim."""
        self._set_cookies.append(f"{name}={quote(value)}; Path={COOKIE_PATH}")

    def send_html(self, body, status=200, extra_status=None):
        encoded = PAGE.format(body=body).encode("utf-8")
        self.send_response(status, extra_status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(encoded)))
        for cookie in self._set_cookies:
            self.send_header("Set-Cookie", cookie)
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(encoded)

    def fail_cookie(self, what):
        # The old PHPs answered 500 with the reason in the status line.
        self.send_html("", status=500, extra_status=f"The {what} is missing or invalid")

    # --- dynamic routes ----------------------------------------------------
    def route_entrance(self):
        self.set_cookie("cat", COOKIES["cat"])
        self.set_cookie("cake", COOKIES["cake"])
        self.send_html('\tThis is a <a href="second.php">link</a>')

    def route_second(self):
        jar = self.request_cookies()
        if jar.get("cat") != COOKIES["cat"]:
            return self.fail_cookie("cat")
        if jar.get("cake") != COOKIES["cake"]:
            return self.fail_cookie("cake")
        self.set_cookie("badger", COOKIES["badger"])
        self.send_html('\tThis is a <a href="third.php">link</a>')

    def route_third(self):
        jar = self.request_cookies()
        if jar.get("cat") != COOKIES["cat"]:
            return self.fail_cookie("cat")
        if jar.get("cake") != COOKIES["cake"]:
            return self.fail_cookie("cake")
        if jar.get("badger") != COOKIES["badger"]:
            return self.fail_cookie("badger")
        self.send_html("\tThis is a test.")

    def route_robots(self):
        body = b"User-agent: *\nDisallow:\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(body)

    # --- type/extension matrix (issue #267 family) -------------------------
    def send_raw(self, body, content_type):
        """Send a raw body with an explicit Content-Type, or none at all when
        content_type is None (to observe httrack's typeless-file naming)."""
        self.send_response(200)
        if content_type is not None:
            self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(body)

    # Fake-binary blobs for the image/pdf/typeless cases.
    FAKE_PNG = b"\x89PNG\r\n\x1a\n" + b"\x00" * 64
    FAKE_PDF = b"%PDF-1.4\n" + b"\x00" * 64

    # path -> (body, content_type); content_type None means no header at all.
    TYPE_MATRIX = {
        "/types/control.php": (b"<html><body>control</body></html>", "text/html"),
        "/types/photo.png": (FAKE_PNG, "image/png"),
        "/types/doc.pdf": (FAKE_PDF, "application/pdf"),
        "/types/notype.png": (FAKE_PNG, None),
        "/types/notype.pdf": (FAKE_PDF, None),
        "/types/lie.png": (FAKE_PNG, "text/html"),
        "/types/report.pdf": (b"<html><body>real page</body></html>", "text/html"),
        "/types/page.htm": (b"<html><body>htm page</body></html>", "text/html"),
        "/types/script.js": (b"var x = 1;\n", "application/javascript"),
        "/types/style.css": (b"body { color: red; }\n", "text/css"),
        "/types/data.json": (b'{"k": "v"}\n', "application/json"),
        "/types/gen.php": (FAKE_PNG, "image/png"),
    }

    def route_types_index(self):
        body = (
            '\t<a href="control.php">control</a>\n'
            '\t<img src="photo.png" />\n'
            '\t<a href="doc.pdf">doc</a>\n'
            '\t<img src="notype.png" />\n'
            '\t<a href="notype.pdf">notypepdf</a>\n'
            '\t<img src="lie.png" />\n'
            '\t<a href="report.pdf">report</a>\n'
            '\t<a href="page.htm">htm</a>\n'
            '\t<script src="script.js"></script>\n'
            '\t<link rel="stylesheet" href="style.css" />\n'
            '\t<a href="data.json">json</a>\n'
            '\t<img src="gen.php?id=5" />\n'
        )
        self.send_html(body)

    def route_types(self):
        path = urlsplit(self.path).path
        body, ctype = self.TYPE_MATRIX[path]
        self.send_raw(body, ctype)

    ROUTES = {
        "/cookies/entrance.php": route_entrance,
        "/cookies/second.php": route_second,
        "/cookies/third.php": route_third,
        "/robots.txt": route_robots,
        "/types/index.html": route_types_index,
        "/types/control.php": route_types,
        "/types/photo.png": route_types,
        "/types/doc.pdf": route_types,
        "/types/notype.png": route_types,
        "/types/notype.pdf": route_types,
        "/types/lie.png": route_types,
        "/types/report.pdf": route_types,
        "/types/page.htm": route_types,
        "/types/script.js": route_types,
        "/types/style.css": route_types,
        "/types/data.json": route_types,
        "/types/gen.php": route_types,
    }

    # --- dispatch ----------------------------------------------------------
    def dispatch(self):
        self._set_cookies = []
        path = urlsplit(self.path).path
        handler = self.ROUTES.get(path)
        if handler is not None:
            handler(self)
            return True
        return False

    def do_GET(self):
        if not self.dispatch():
            super().do_GET()

    def do_HEAD(self):
        if not self.dispatch():
            super().do_HEAD()


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--root", required=True, help="docroot for static files")
    parser.add_argument("--bind", default="127.0.0.1", help="bind address")
    parser.add_argument("--tls", action="store_true", help="serve HTTPS")
    parser.add_argument("--cert", help="TLS certificate (PEM)")
    parser.add_argument("--key", help="TLS private key (PEM)")
    args = parser.parse_args()
    root = os.path.abspath(args.root)

    def factory(*a, **kw):
        return Handler(*a, directory=root, **kw)

    httpd = ThreadingHTTPServer((args.bind, 0), factory)
    if args.tls:
        import ssl

        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.load_cert_chain(certfile=args.cert, keyfile=args.key)
        httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
    port = httpd.socket.getsockname()[1]
    # The launcher reads this line to discover the ephemeral port.
    print(f"PORT {port}", flush=True)
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass


if __name__ == "__main__":
    main()
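
The cookie fixtures depend on url-encoding so that values containing spaces, commas, and quotes survive as a single token. The round trip can be sketched in isolation as below, using only `urllib.parse`. The names `encode_cookie` and `decode_cookie_header` are hypothetical; they mirror `set_cookie` and `request_cookies` from the server without the HTTP handler around them.

```python
from urllib.parse import quote, unquote

COOKIE_PATH = "/cookies/"


def encode_cookie(name, value):
    """Build a Set-Cookie value with the cookie value url-encoded."""
    return f"{name}={quote(value)}; Path={COOKIE_PATH}"


def decode_cookie_header(raw):
    """Parse a Cookie request header into {name: url-decoded value}."""
    jar = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if "=" in pair:
            name, value = pair.split("=", 1)
            jar[name.strip()] = unquote(value.strip())
    return jar


header = encode_cookie("badger", "mushroom, with 'ants'")
# A client replays only the name=value part in its Cookie header.
replayed = header.split(";", 1)[0]
print(header)
print(decode_cookie_header(replayed))
```

Because the encoded value contains no literal `;`, `,`, or space, splitting the header on `;` cannot cut a value in half.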

151
tests/proxy-https-server.py Normal file

@@ -0,0 +1,151 @@
#!/usr/bin/env python3
"""Local CONNECT proxy + self-signed HTTPS origin for the issue #85 test.

Starts a TLS origin server and an HTTP proxy that honours CONNECT, on ephemeral
ports. Every request line the proxy receives (and any Proxy-Authorization) is
appended to the proxy log; every header the origin receives over the tunnel is
appended to the origin log. That lets the test assert both that an https crawl
tunneled through the proxy and that proxy credentials never leaked to the origin.

Proxy modes (argv[3], default "ok"):
  ok    - honour CONNECT and tunnel to the origin
  flood - answer 200 then stream headers forever with no blank line, to exercise
          the client's bound on the proxy response (must not hang the crawl)

Usage: proxy-https-server.py <cert.pem> <logdir> [mode]
Prints "ORIGIN <port>", "PROXY <port>", then "ready" (one per line) on stdout.
"""
import http.server
import os
import socket
import socketserver
import ssl
import sys
import threading

ORIGIN_BODY = b"<html><body>ORIGIN-PAGE-85</body></html>"
PROXY_LOG = "proxy.log"
ORIGIN_LOG = "origin-headers.log"


def make_origin(logdir):
    class Origin(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            with open(os.path.join(logdir, ORIGIN_LOG), "a") as handle:
                for key in self.headers.keys():
                    handle.write(key + "\n")
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(ORIGIN_BODY)))
            self.end_headers()
            self.wfile.write(ORIGIN_BODY)

        def log_message(self, *args):
            pass

    return Origin


def start_origin(certfile, logdir):
    httpd = socketserver.TCPServer(("127.0.0.1", 0), make_origin(logdir))
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile)
    httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
    port = httpd.socket.getsockname()[1]
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return port


def pipe(src, dst):
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        for sock in (src, dst):
            try:
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass


def handle_client(conn, logdir, mode):
    rfile = conn.makefile("rb")
    request_line = rfile.readline().decode("latin-1").strip()
    auth = None
    while True:
        line = rfile.readline().decode("latin-1")
        if line in ("\r\n", "\n", ""):
            break
        key, _, value = line.partition(":")
        if key.strip().lower() == "proxy-authorization":
            auth = value.strip()
    with open(os.path.join(logdir, PROXY_LOG), "a") as handle:
        handle.write(request_line + "\n")
        if auth is not None:
            handle.write("AUTH " + auth + "\n")
    parts = request_line.split()
    if not (len(parts) >= 2 and parts[0] == "CONNECT"):
        conn.sendall(b"HTTP/1.0 501 Not Implemented\r\n\r\n")
        conn.close()
        return
    if mode == "flood":
        # 200, then an endless header stream with no terminating blank line: the
        # client must bound this and give up, not hang.
        try:
            conn.sendall(b"HTTP/1.0 200 Connection established\r\n")
            while True:
                conn.sendall(b"X-Pad: 0123456789\r\n")
        except OSError:
            pass
        conn.close()
        return
    host, _, port = parts[1].partition(":")
    try:
        upstream = socket.create_connection((host, int(port or 443)))
    except OSError:
        conn.sendall(b"HTTP/1.0 502 Bad Gateway\r\n\r\n")
        conn.close()
        return
    conn.sendall(b"HTTP/1.0 200 Connection established\r\n\r\n")
    threading.Thread(target=pipe, args=(conn, upstream), daemon=True).start()
    pipe(upstream, conn)


def start_proxy(logdir, mode):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))
    srv.listen(16)
    port = srv.getsockname()[1]

    def serve():
        while True:
            conn, _ = srv.accept()
            threading.Thread(
                target=handle_client, args=(conn, logdir, mode), daemon=True
            ).start()

    threading.Thread(target=serve, daemon=True).start()
    return port


def main():
    certfile, logdir = sys.argv[1], sys.argv[2]
    mode = sys.argv[3] if len(sys.argv) > 3 else "ok"
    for name in (PROXY_LOG, ORIGIN_LOG):
        open(os.path.join(logdir, name), "w").close()
    origin_port = start_origin(certfile, logdir)
    proxy_port = start_proxy(logdir, mode)
    print("ORIGIN %d" % origin_port, flush=True)
    print("PROXY %d" % proxy_port, flush=True)
    print("ready", flush=True)
    threading.Event().wait()


if __name__ == "__main__":
    main()
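
The way the proxy derives the tunnel target from the CONNECT request line can be shown on its own. The sketch below isolates that parsing step; `parse_connect_target` is a hypothetical helper name, and like the proxy above it assumes a `host:port` authority with an IPv4 address or hostname (a bracketed IPv6 literal would not parse correctly).

```python
def parse_connect_target(request_line, default_port=443):
    """Return (host, port) for a CONNECT request line, or None for anything else."""
    parts = request_line.split()
    if not (len(parts) >= 2 and parts[0] == "CONNECT"):
        return None
    host, _, port = parts[1].partition(":")
    # An authority without an explicit port falls back to the HTTPS default.
    return host, int(port or default_port)


print(parse_connect_target("CONNECT 127.0.0.1:8443 HTTP/1.0"))
print(parse_connect_target("GET http://127.0.0.1/ HTTP/1.0"))
```

A `None` result corresponds to the branch where the proxy answers `501 Not Implemented`.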


@@ -2,19 +2,19 @@
# #
error=0 error=0
for i in *.test ; do for i in *.test; do
if bash $i ; then if bash "$i"; then
echo "$i: passed" >&2 echo "$i: passed" >&2
else else
echo "$i: ERROR" >&2 echo "$i: ERROR" >&2
error=$[${error}+1] error=$((error + 1))
fi fi
done done
if test "$error" -eq 0; then if test "$error" -eq 0; then
echo "all tests passed" >&2 echo "all tests passed" >&2
else else
echo "${error} test(s) failed" >&2 echo "${error} test(s) failed" >&2
fi fi
exit $error exit $error


@@ -0,0 +1,18 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Sample test</title>
</head>
<body>
This is a <a href="link.html?v=1">link</a>
This is a <a href='link.html?v=2'>link</a>
This is a <a href="./link.html?v=3">link</a>
This is a <a href=link.html?v=4>link</a>
</body>


@@ -0,0 +1,3 @@
This is a link.
Go back to <a href="basic.html">home</a>.

21
tests/server.crt Normal file

@@ -0,0 +1,21 @@
-----BEGIN CERTIFICATE-----
MIIDbzCCAlegAwIBAgIUdWkDDomnY3WW95UqJ+UOASuR/i0wDQYJKoZIhvcNAQEL
BQAwODESMBAGA1UEAwwJMTI3LjAuMC4xMSIwIAYDVQQKDBlIVFRyYWNrIGxvY2Fs
IHRlc3Qgc2VydmVyMCAXDTI2MDYxNTE0NDQxMFoYDzIwNTYwNjA3MTQ0NDEwWjA4
MRIwEAYDVQQDDAkxMjcuMC4wLjExIjAgBgNVBAoMGUhUVHJhY2sgbG9jYWwgdGVz
dCBzZXJ2ZXIwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDx78mogNhT
noWwRa51NeGtapQ1PfTYLlIMUzuloFXOsR1/ozRkFucqHNftF22wf0gg4VQJSBSf
3rwj79vsnt3nyaD03bTAafpHXkd+IJxQowiG8TfOJF0R/Qg9g7DCE66R9agQpMJC
SGxIin9p/4ld4Hn6869d4hNq4fHxNf/qkj2cnf8DYxrldz2FGsi6yMed4tzz2Am4
ZbPgwep+fy843ZdYrVIms9vJluNa9E+6Vpw9FwdjzQ/IBBMLvGaC2pDkc95YelaE
nQrAlTO/0l5vjc8XuTQFlo3DbUg+WEld/pxvCqsd/q1mqjL0WbxtXl2zCwGzAoJx
rjVEPfA8QSbtAgMBAAGjbzBtMB0GA1UdDgQWBBTHE0KKW8REV4HxajzVsIBxz3iL
9zAfBgNVHSMEGDAWgBTHE0KKW8REV4HxajzVsIBxz3iL9zAPBgNVHRMBAf8EBTAD
AQH/MBoGA1UdEQQTMBGHBH8AAAGCCWxvY2FsaG9zdDANBgkqhkiG9w0BAQsFAAOC
AQEAYlTEftrwGJBXuPmtxhmtw2HO/VTC4TGnq67hH5H+ptwgZJuuxCQ5KW6flTyp
FTyMhha33WD4EBL3wqqJsWr9Y4BXqi4G0lRqXBcC1oIUa2VYIDMER7kaY1qTSqE8
ARpwdB2BhvngAzDLc+4Jt4jQMRGr8fHAwxpDBoIZ1knbyzYNP73Bajse6/8YtxUu
nB2BsldjZnLvyHvRxUpWp92OyQih4jYSrlN6olDFlKDg7++kMhkHtJQW9a1t54VN
0ZXrB1ZRuHUUvGBq26x71riTWor7HNOSQaGeCMQjZNQkh5tfshNygUGSZVXTEwhG
xSrOL7NqBt2+EkVwf7LjGzjmBw==
-----END CERTIFICATE-----

28
tests/server.key Normal file

@@ -0,0 +1,28 @@
-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQDx78mogNhTnoWw
Ra51NeGtapQ1PfTYLlIMUzuloFXOsR1/ozRkFucqHNftF22wf0gg4VQJSBSf3rwj
79vsnt3nyaD03bTAafpHXkd+IJxQowiG8TfOJF0R/Qg9g7DCE66R9agQpMJCSGxI
in9p/4ld4Hn6869d4hNq4fHxNf/qkj2cnf8DYxrldz2FGsi6yMed4tzz2Am4ZbPg
wep+fy843ZdYrVIms9vJluNa9E+6Vpw9FwdjzQ/IBBMLvGaC2pDkc95YelaEnQrA
lTO/0l5vjc8XuTQFlo3DbUg+WEld/pxvCqsd/q1mqjL0WbxtXl2zCwGzAoJxrjVE
PfA8QSbtAgMBAAECggEACgNK4klq1T3IpKdNoBY5yoE7CbUQZBNkBpSPRxHgBezj
SVFfgrZGnOySrIJSt4JHtuynG2Hl+0ku74HRep/ck+eOsh5W3mZvGvMLnGxhwR3u
Or99osTIgU0VQTkpC0SLQ16FCnih0uJycNIikdLR7uuya1tt1OyIBzK7XlNGIywT
p85zJc7/6TfTC9eM7lqh7JGR7KplBxSvgZL1pUr7y4rNpKms6uzOvPND79CcKnbU
BBA9Tu4qdOkoOljsZKkvh3pihxyG9X6d8QTZ/uX3pkvliwSFBc+Sz9EootA3/4r5
gVWpQ2t/AY7fY4hqzLIX/HivVaPj3cWk1G+SHm0XNQKBgQD5I9rijqFvV/p6FmUl
FbnjJFFHHgZLivlGxAC5vOyJNQQaqdeDzg7yMotNmQTggVGjT6sjdosQb3n+ctuk
EhQnZSU5VkNKv1+PTR35WrRkaECCaqz3Pv79pV9GVcX3it7UuYjNiOeSPqINWe+X
49JwnJFz+qQ1BchAwOis4zkENwKBgQD4mShDaYLOO97VpgZj4cGxHHWyEK9CRQvp
I7HxRmfaWS3JHwb88lOmALEU6pAj5cYJPAznv8BnUWcVHalZbkQ1JWYtUJRqj6OI
Ym7rw/nm4Ay5ijbdEism173dSk3IjOe+PdAlxzsOuVzYdBTqElmeQWtBzhY9aHvX
r+A02C2j+wKBgHHDo6Gsi57yR5gUPd9vSlCkNtEIrss0DJv5yHMIB+KnaNZcE+NF
5qFF30Jxyz5RDtxJ9tXcvaeln8lG3XDQKI/MqfDCqTuqo5ImHrfMaW8oA70JxS2p
gHqGVzkg1aMxsIrmpcdk6olnPExocvWivGdbtzeEjhMALu8Sp6y6nUCFAoGBAK5h
KLgYw/OMVaQCIMthaa+l6f0s7PMMYe1453H6VBD6qz4/8HPwO7LfG1gzrUYxADgs
ElVh0UHn/On383nS+i9Ze5Hfyyvwc+LQQURKJPrJQMPJavCptPE7NmiKnYNHK6vr
yh0l4oxShAklbCJBGvICq4zuVfVfXDeQnDIVTfaPAoGBAMCrZqYdOUhUu+aUqxZq
qO/TTQxrxftU63jGUg+o042TdgI4KWLn07wvHJ8/E2OqF35eXenvcuKbNLI1l72J
4cp+3cUv8iAXThTRYEztr5CS/wta4o4CNN8zfjn5dV9AI4Hmt4V7EaGWpBcViGbj
n0Mhag+dO8DHuenqi1yfMrAt
-----END PRIVATE KEY-----

152
tools/mk-sbuild-chroot.sh Executable file

@@ -0,0 +1,152 @@
#!/usr/bin/env bash
#
# Bootstrap an sbuild chroot for the clean-room build gate (mkdeb.sh --sbuild).
#
# Uses the rootless unshare backend: no root, no schroot daemon. It builds a
# minimal buildd chroot tarball into ~/.cache/sbuild/<dist>-<arch>.tar.zst, where
# sbuild --dist=<dist> finds it automatically in unshare mode.
#
# Usage:
# tools/mk-sbuild-chroot.sh [options]
#
# Options:
# -d, --dist DIST suite to bootstrap (default: unstable)
# -a, --arch ARCH architecture (default: dpkg --print-architecture)
# -m, --mirror URL apt mirror (default: http://deb.debian.org/debian)
# --components LIST comma-separated components (default: main)
# -f, --force rebuild even if the tarball already exists
# --write-sbuildrc add "$chroot_mode = 'unshare';" to ~/.sbuildrc if absent
# -h, --help show this help
#
# One-time setup; refresh later with sbuild-update or by rerunning with --force.
# Requires mmdebstrap and the uidmap tools (newuidmap) for the unshare backend.
set -euo pipefail
readonly PROGNAME=${0##*/}
die() {
printf '%s: error: %s\n' "$PROGNAME" "$*" >&2
exit 1
}
info() {
printf '==> %s\n' "$*" >&2
}
usage() {
sed -n '2,/^set -euo/{/^set -euo/!p}' "$0" | sed 's/^# \{0,1\}//'
}
need() {
local tool
for tool in "$@"; do
command -v "$tool" >/dev/null 2>&1 || die "required tool not found: $tool"
done
}
main() {
local dist=unstable
local arch=""
local mirror=http://deb.debian.org/debian
local components=main
local force=0
local write_sbuildrc=0
while [[ $# -gt 0 ]]; do
case $1 in
-d | --dist)
[[ $# -ge 2 ]] || die "missing argument for $1"
dist=$2
shift 2
;;
-a | --arch)
[[ $# -ge 2 ]] || die "missing argument for $1"
arch=$2
shift 2
;;
-m | --mirror)
[[ $# -ge 2 ]] || die "missing argument for $1"
mirror=$2
shift 2
;;
--components)
[[ $# -ge 2 ]] || die "missing argument for $1"
components=$2
shift 2
;;
-f | --force)
force=1
shift
;;
--write-sbuildrc)
write_sbuildrc=1
shift
;;
-h | --help)
usage
exit 0
;;
*)
die "unknown option: $1 (try --help)"
;;
esac
done
need mmdebstrap dpkg
# Unshare needs the setuid uid/gid mappers; mmdebstrap fails cryptically without.
command -v newuidmap >/dev/null 2>&1 ||
die "newuidmap not found; install the uidmap package for the unshare backend"
# Unshare maps a whole UID range, not just the caller's: the base install
# creates system users, and without an /etc/subuid+subgid range the install
# crashes (dpkg SIGSEGV) instead of erroring cleanly. Root uses mode=root and
# needs no range.
if [[ $(id -u) -ne 0 ]]; then
local me
me=$(id -un)
if ! grep -qs "^$me:" /etc/subuid || ! grep -qs "^$me:" /etc/subgid; then
# Suggest a range starting past every allocation in either file.
local start
start=$(awk -F: '{e = $2 + $3; if (e > m) m = e} END {print (m ? m : 100000)}' \
/etc/subuid /etc/subgid 2>/dev/null)
die "no /etc/subuid+subgid range for $me; the unshare backend needs one:
sudo usermod --add-subuids $start-$((start + 65535)) --add-subgids $start-$((start + 65535)) $me"
fi
fi
: "${arch:=$(dpkg --print-architecture)}"
local cache=$HOME/.cache/sbuild
local tarball=$cache/${dist}-${arch}.tar.zst
if [[ -e $tarball && $force -eq 0 ]]; then
info "chroot already exists: $tarball (use --force to rebuild)"
else
info "bootstrapping $dist/$arch chroot into $tarball"
mkdir -p "$cache"
mmdebstrap --variant=buildd --arch="$arch" --components="$components" \
"$dist" "$tarball" "$mirror"
info "chroot ready: $tarball"
fi
local rc=$HOME/.sbuildrc
local mode_line="\$chroot_mode = 'unshare';"
# shellcheck disable=SC2016 # $chroot_mode is literal regex text, not a shell var.
if grep -qsE '^[[:space:]]*\$chroot_mode[[:space:]]*=.*unshare' "$rc"; then
: # already configured (active, non-commented line)
elif [[ $write_sbuildrc -eq 1 ]]; then
info "enabling the unshare backend in $rc"
printf '%s\n' "$mode_line" >>"$rc"
else
cat >&2 <<EOF
==> To use this chroot without passing --chroot-mode each time, add to $rc:
$mode_line
(or rerun with --write-sbuildrc). Then verify with:
sbuild --dist=$dist path/to/package.dsc
and build the release gate with:
tools/mkdeb.sh --source-only --sbuild
EOF
fi
}
main "$@"
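
The awk one-liner that suggests a free subordinate ID range takes the highest `start + count` over every entry in `/etc/subuid` and `/etc/subgid`, falling back to 100000 when there are none. A rough Python equivalent, for illustration only: `suggest_subid_start` is a hypothetical name, it takes the file lines as input instead of reading the files, and unlike awk it skips malformed lines where awk would treat them as zero.

```python
def suggest_subid_start(lines, default=100000):
    """Return the first ID past every 'name:start:count' allocation in lines."""
    highest = 0
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) < 3:
            continue
        try:
            end = int(fields[1]) + int(fields[2])
        except ValueError:
            continue
        highest = max(highest, end)
    return highest or default


# Hypothetical contents of /etc/subuid and /etc/subgid, concatenated.
entries = ["alice:100000:65536", "bob:165536:65536"]
start = suggest_subid_start(entries)
print(f"{start}-{start + 65535}")
```

The printed range has the same `start-(start + 65535)` shape that the script passes to `usermod --add-subuids` and `--add-subgids`.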


@@ -20,11 +20,27 @@
 # Options:
 # -k, --key KEYID GPG key for signing (default: $DEBSIGN_KEYID)
 # -o, --outdir DIR output directory (default: <repo>/dist)
+# --orig FILE reuse this upstream orig tarball instead of
+# regenerating it (required for a Debian revision
+# >= 2, whose orig is frozen in the archive)
 # -s, --source-only build only the source package
 # -u, --unsigned do not sign anything (implies no release sigs)
 # --no-release-artifacts skip the orig tarball .asc/.md5/.sha1
+# --sbuild additionally build the .dsc in a clean sbuild
+# chroot as a from-scratch verification gate
 # -h, --help show this help
 #
+# --sbuild reproduces the buildd environment: it builds the source package in a
+# minimal chroot holding only the declared Build-Depends, so an FTBFS or a
+# missing dependency fails here instead of on the archive's buildds (which, with
+# a source-only upload, are otherwise the first clean build). It needs an sbuild
+# chroot for the changelog's distribution; create one once with the companion
+# tools/mk-sbuild-chroot.sh (rootless unshare backend).
+#
+# The Debian revision in debian/changelog decides the orig: revision 1 builds a
+# fresh upstream tarball; revision >= 2 must reuse the orig frozen at revision 1
+# (the .dsc references it by checksum), so pass it with --orig.
+#
 # SOURCE_DATE_EPOCH is honored for reproducible output.
 set -euo pipefail
@@ -57,9 +73,11 @@ need() {
 main() {
 local key=${DEBSIGN_KEYID:-}
 local outdir=""
+local orig_in=""
 local source_only=0
 local unsigned=0
 local release_artifacts=1
+local sbuild=0
 while [[ $# -gt 0 ]]; do
 case $1 in
@@ -73,6 +91,11 @@ main() {
 outdir=$2
 shift 2
 ;;
+--orig)
+[[ $# -ge 2 ]] || die "missing argument for $1"
+orig_in=$2
+shift 2
+;;
 -s | --source-only)
 source_only=1
 shift
@@ -85,6 +108,10 @@ main() {
 release_artifacts=0
 shift
 ;;
+--sbuild)
+sbuild=1
+shift
+;;
 -h | --help)
 usage
 exit 0
@@ -95,7 +122,8 @@ main() {
 esac
 done
-need git autoreconf debuild dcmd
+need git autoreconf debuild dcmd dpkg-parsechangelog
+[[ $sbuild -eq 1 ]] && need sbuild
 if [[ $unsigned -eq 0 ]]; then
 need gpg
 [[ -n $key ]] || die "no signing key (pass --key or set DEBSIGN_KEYID, or use --unsigned)"
@@ -107,6 +135,11 @@ main() {
 mkdir -p "$outdir"
 outdir=$(cd "$outdir" && pwd)
+if [[ -n $orig_in ]]; then
+[[ -r $orig_in ]] || die "--orig file not readable: $orig_in"
+orig_in=$(cd "$(dirname "$orig_in")" && pwd)/$(basename "$orig_in")
+fi
 scratch=$(mktemp -d "${TMPDIR:-/tmp}/httrack-mkdeb.XXXXXX")
 trap 'rm -rf -- "$scratch"' EXIT
@@ -118,45 +151,65 @@ main() {
 git -C "$repo/src/coucal" archive --format=tar --prefix=src/coucal/ HEAD |
 tar -x -C "$export_dir"
-# Refresh build system and man page, then build the tarball. We build here
-# only because regen-man needs the compiled binaries; the test suite is not
-# run in this pass. debuild (below) runs the full suite once, with the online
-# tests enabled, so a check here would just be a slower, offline-only repeat.
-info "regenerating build system and man page"
-(
-cd "$export_dir"
-autoreconf -fi
-./configure --quiet
-make -s -j"$(nproc)"
-make -s -C man regen-man
-# Build the tarball from a clean tree so no object files leak into it.
-make -s clean
-make -s dist
-)
+# Upstream version and Debian revision drive the orig: revision 1 builds a
+# fresh tarball, revision >= 2 reuses the one frozen at -1 (the .dsc pins it
+# by checksum, so a regenerated orig with new mtimes would be rejected).
+local fullver ver rev
+fullver=$(cd "$export_dir" && dpkg-parsechangelog -S Version)
+ver=${fullver%-*}
+rev=${fullver##*-}
+local orig=httrack_${ver}.orig.tar.gz
+info "version $ver (Debian revision $rev)"
-local tarball ver
-local -a tarballs
-shopt -s nullglob
-tarballs=("$export_dir"/httrack-*.tar.gz)
-shopt -u nullglob
-[[ ${#tarballs[@]} -ge 1 ]] || die "make dist produced no tarball"
-tarball=${tarballs[0]##*/}
-ver=${tarball#httrack-}
-ver=${ver%.tar.gz}
-info "version $ver"
+# A signed build is upload-bound, so a revision >= 2 must reuse the frozen
+# orig (--orig); an unsigned build is a throwaway (CI, local) and may
+# regenerate it, since it can never reach the archive.
+if [[ -z $orig_in && $rev != 1 && $unsigned -eq 0 ]]; then
+die "Debian revision $rev needs --orig FILE (the orig is frozen from revision 1)"
+fi
+if [[ -n $orig_in ]]; then
+info "reusing upstream tarball $orig_in"
+cp -- "$orig_in" "$scratch/$orig"
+else
+# Refresh build system and man page, then build the tarball. We build
+# here only because regen-man needs the compiled binaries; the test
+# suite is not run in this pass. debuild (below) runs the full suite
+# once, online tests enabled, so a check here would just repeat it.
+info "regenerating build system and man page"
+(
+cd "$export_dir"
+autoreconf -fi
+./configure --quiet
+make -s -j"$(nproc)"
+make -s -C man regen-man
+# Build the tarball from a clean tree so no object files leak in.
+make -s clean
+make -s dist
+)
+local -a tarballs
+shopt -s nullglob
+tarballs=("$export_dir"/httrack-*.tar.gz)
+shopt -u nullglob
+[[ ${#tarballs[@]} -ge 1 ]] || die "make dist produced no tarball"
+local tarball=${tarballs[0]##*/}
+[[ $tarball == "httrack-$ver.tar.gz" ]] ||
+die "changelog version $ver disagrees with built tarball $tarball (configure.ac mismatch?)"
+cp -- "$export_dir/$tarball" "$scratch/$orig"
+fi
 # 3.0 (quilt): orig tarball is upstream-only; debian/ is overlaid on top.
-local orig=httrack_${ver}.orig.tar.gz
-cp -- "$export_dir/$tarball" "$scratch/$orig"
 (
 cd "$scratch"
 tar -xf "$orig"
+[[ -d httrack-$ver ]] || die "orig tarball does not unpack to httrack-$ver/"
 cp -a "$export_dir/debian" "httrack-$ver/debian"
 )
-# Build (debuild also runs lintian and signs). --fail-on aborts on a lintian
-# error or warning, so neither a release nor CI produces an unclean package.
-local -a debuild_opts=(--lintian-opts -I -i "--fail-on=error,warning")
+# Build and sign. debuild runs lintian too but does NOT propagate its exit
+# status, so a broken package would pass unnoticed; disable it here and run
+# lintian ourselves below as the real gate.
+local -a debuild_opts=(--no-lintian)
 local -a build_opts=()
 [[ $source_only -eq 1 ]] && build_opts+=(-S)
 if [[ $unsigned -eq 1 ]]; then
@@ -167,7 +220,8 @@ main() {
 info "building packages with debuild"
 (
 cd "$scratch/httrack-$ver"
-debuild "${build_opts[@]}" "${debuild_opts[@]}"
+# debuild options (--no-lintian) must precede the dpkg-buildpackage ones
+debuild "${debuild_opts[@]}" "${build_opts[@]}"
 )
 # Collect every file the .changes references (orig, dsc, debs, ddebs, buildinfo).
@@ -177,11 +231,49 @@ main() {
 changes=("$scratch"/*.changes)
 shopt -u nullglob
 [[ ${#changes[@]} -ge 1 ]] || die "debuild produced no .changes file"
+# The real lintian gate (debuild only reports, it does not fail on tags).
+# --profile debian: CI runners are Ubuntu, whose vendor data would wrongly
+# reject the Debian "unstable" distribution. newer-standards-version only
+# means the local lintian is older than the buildds', not a package
+# defect, so suppress it. set -e turns any error/warning tag into a failure.
+info "running lintian gate (--fail-on=error,warning)"
+lintian --profile debian -I -i --fail-on=error,warning \
+--suppress-tags newer-standards-version "${changes[@]}"
 dcmd cp -- "${changes[@]}" "$outdir/"
+# Clean-room build gate: rebuild the source package in a minimal chroot that
+# holds only the declared Build-Depends, the same way the buildds will. An
+# undeclared dependency or any FTBFS aborts the release here instead of
+# surfacing after a source-only upload. Logs and clean-built debs land in
+# $outdir/sbuild for inspection.
+if [[ $sbuild -eq 1 ]]; then
+local -a dscs
+shopt -s nullglob
+dscs=("$scratch"/*.dsc)
+shopt -u nullglob
+[[ ${#dscs[@]} -ge 1 ]] || die "no .dsc to sbuild"
+local dist
+dist=$(cd "$scratch/httrack-$ver" && dpkg-parsechangelog -S Distribution)
+[[ $dist == UNRELEASED ]] && dist=unstable
+info "clean-room build with sbuild (dist $dist)"
+local sbdir=$outdir/sbuild
+rm -rf -- "$sbdir"
+mkdir -p "$sbdir"
+(cd "$sbdir" && sbuild --dist="$dist" -- "${dscs[0]}")
+info "sbuild clean-room build passed; logs in $sbdir"
+fi
 # Release artifacts for the upstream tarball (detached sig + checksums).
+# A Debian revision >= 2 .changes omits the orig (it is already in the
+# archive), so dcmd above won't have copied it; place it from the build tree
+# so the website artifacts are produced regardless of the revision.
 if [[ $release_artifacts -eq 1 && $unsigned -eq 0 ]]; then
 info "signing upstream tarball"
+cp -- "$scratch/$orig" "$outdir/$orig"
 (
 cd "$outdir"
 gpg --armor --detach-sign --yes -u "$key" -- "$orig"