Compare commits

18 Commits

Author SHA1 Message Date
Xavier Roche
fe7041ddbf Address review: keep empty-PATH parity, fold the CI script list
Review of the array refactor flagged one behaviour divergence: splitting
PATH with `IFS=: read -ra` keeps empty fields (from doubled or leading
colons) as "" elements, where the old `echo $PATH | tr : ' '` word-split
dropped them, so the search loop would probe /htsserver. Skip the empty
fields to restore exact parity.

Also reflow the CI SHELL_SCRIPTS list as a folded block scalar, one
entry per line and sorted, so it reads cleanly; the folded value is the
same space-separated string.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 12:39:31 +02:00
Xavier Roche
f5543df1af ci: lint every shell script with shellcheck and shfmt
The lint job only covered a handful of scripts; bootstrap, build.sh, the
generators, webhttrack, the CGI search helper and the crawl/run-all test
harnesses went unchecked, and shfmt ran on three files. Now both linters
run over the whole tracked shell tree, listed once in a job-level env var
so the two steps stay in sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:37:09 +02:00
Xavier Roche
fee30aa95d Make every shell script shellcheck-clean
Fix the shellcheck findings the shfmt pass left behind, all proven
behaviour-preserving:

- Quote single-value expansions, drop the redundant ${} in arithmetic,
  add read -r, and use printf '%s' instead of variables in format
  strings, across the generators, crawl-test.sh, run-all-tests.sh and
  search.sh.
- crawl-test.sh / webhttrack: turn the deliberately word-split search
  lists into bash arrays (space-safe, no scattered disables) and replace
  the numeric trap signal lists with names, dropping the un-trappable
  KILL/STOP that bash silently ignored anyway.
- search.sh: drop the bogus \" escapes that made grep search for a
  literal-quoted pattern.

The generators are exercised by hand and ship their committed output
(htscodepages.h, htsentities.h); a differential run on synthetic input
confirms byte-identical output before and after. crawl-test.sh and
webhttrack were run end to end against a local server / a faked install,
the latter also proving the array search now survives spaces in paths.
SC2153/SC2120 false positives carry a scoped disable with a reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:35:55 +02:00
Xavier Roche
f9f4700ee1 Reformat every shell script with shfmt -i 4
Mechanical pass: run shfmt -i 4 over the whole tracked shell tree (the
test harness .test files, the regen generators, webhttrack, the CGI
search helper, and the build/dist scripts) so they share one style.
shfmt also normalised backticks to $(...) and $[..] to $((..)).

No behaviour change: arithmetic is preserved exactly, non-ASCII bytes
are untouched, and the full make check suite still passes. The tab
indented .test files become 4-space indented, hence the wide diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:24:01 +02:00
Xavier Roche
f030fa21e3 Merge pull request #401 from xroche/fix/relative-path-dotdot-137-162
Test the relative-link engine; collapse ../ in file:// URLs
2026-06-20 11:15:53 +02:00
Xavier Roche
bdd1c1bc2c Test the relative-link engine; collapse ../ in file:// URLs
The ../-handling tickets #137 (embedded ../ in a URL) and #162 (cross-host
"too many ../") do not reproduce on master or the released 3.49.x: the engine
has resolved embedded, cross-host, out-of-scope and above-root ../ correctly
since the 2012 import, and the released binary behaves identically. #137's
actual breakage was a JS-generated iframe URL (httrack can't rewrite
dynamically-built links); #162 is a long-gone Windows path quirk.

The area was nearly untested, though, despite feeding both link rewriting and
crawl-scope decisions: two trivial lienrelatif asserts, none for
ident_url_relatif. Add a wide regression net via two hidden debug probes
(-#l lienrelatif, -#i ident_url_relatif, mirroring -#1 fil_simplifie) driving
tens of cases in tests/01_engine-relative.test (embedded/cross-host/sibling/
ancestor/above-root ../, query stripping, scheme handling), plus the missing
fil_simplifie edge cases (absolute paths, root clamp, query freeze) in
01_engine-simplify.test. Expected values are computed by hand, not echoed.

While covering it, fixed one real gap: the file:// branch of
ident_url_absolute skipped the fil_simplifie its http sibling runs, so file://
URLs kept their ../ in adrfil->fil while the save path was already collapsed
(htsname.c:1343). Collapsing it matches the other schemes, contains traversal
at the file:// root, and dedups a/../b against b.
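
As a rough illustration of the collapse described above, here is a minimal sketch. It is not HTTrack's `fil_simplifie`: `collapse_dotdot` is a hypothetical helper that assumes an absolute path, ignores query strings, and also folds `./` and doubled slashes, which the real function may treat differently. It only shows the two properties the fix relies on: `..` never climbs above the root, and `a/../b` ends up identical to `b`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not HTTrack's fil_simplifie): collapse "." and ".."
   segments of an absolute path in place. ".." at the root is dropped, so the
   result never climbs above "/". */
static void collapse_dotdot(char *path) {
    const char *in = path; /* read cursor */
    char *out = path;      /* write cursor, never ahead of the read cursor */
    int dir = 0;           /* last segment was empty, "." or ".." */

    if (*in != '/')
        return; /* assumption: only absolute paths are handled */

    while (*in != '\0') {
        const char *seg;
        size_t len;

        while (*in == '/') /* skip the slash(es) before the segment */
            in++;
        seg = in;
        while (*in != '\0' && *in != '/')
            in++;
        len = (size_t) (in - seg);

        if (len == 0 || (len == 1 && seg[0] == '.')) {
            dir = 1; /* trailing slash or "./": nothing to emit */
        } else if (len == 2 && seg[0] == '.' && seg[1] == '.') {
            /* pop the last emitted segment, if any (clamp at the root) */
            if (out > path) {
                do {
                    out--;
                } while (out > path && *out != '/');
            }
            dir = 1;
        } else {
            *out++ = '/';
            memmove(out, seg, len);
            out += len;
            dir = 0;
        }
    }
    if (out == path || dir)
        *out++ = '/'; /* keep "/" for the root and for directory results */
    *out = '\0';
}
```

With this sketch, `/a/../b` and `/b` both end up as `/b`, and `/a/b/../../../etc/passwd` is clamped to `/etc/passwd` instead of escaping the root.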

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:14:28 +02:00
Xavier Roche
56665a268f Merge pull request #400 from xroche/fix/css-url-paren-163
Encode parens in rewritten CSS url() so the value isn't truncated (#163)
2026-06-20 10:02:32 +02:00
Xavier Roche
2e948b9acd htsparse: percent-encode parens in rewritten CSS url() (#163)
A source url(...) whose target encodes '(' ')' as %28/%29 was rewritten
with literal parens, because they are RFC2396 "mark" characters that the
URI escaper (escape_uri_utf, mode 30) leaves alone. In an unquoted CSS
url(...) the literal ')' closes the token early, so the browser mis-parses
the value and drops the background image.

Re-escape '(' and ')' back to %28/%29 when emitting the link, gated on the
url() context (ending_p == ')'). The UA decodes them to the saved-on-disk
name, so the reference still resolves. Quoted url("...") and ordinary HTML
attributes keep their parens, matching prior behavior.
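
To make the re-escaping step concrete, here is a minimal standalone sketch. It is not the engine code: `encode_url_parens`, its separate output buffer and its error return are assumptions made for illustration, whereas the helper added by this change rewrites the link buffer in place.

```c
#include <stddef.h>

/* Hypothetical helper: copy src into dst, writing '(' as %28 and ')' as %29.
   Returns 0 on success, -1 if dst (capacity `size`) is too small. */
static int encode_url_parens(const char *src, char *dst, size_t size) {
    size_t j = 0;

    for (; *src != '\0'; src++) {
        const int paren = (*src == '(' || *src == ')');
        const size_t need = paren ? 3 : 1;

        if (j + need >= size) /* always keep room for the final NUL */
            return -1;
        if (paren) {
            dst[j++] = '%';
            dst[j++] = '2';
            dst[j++] = (*src == '(') ? '8' : '9';
        } else {
            dst[j++] = *src;
        }
    }
    if (j >= size) /* only reachable with size == 0 */
        return -1;
    dst[j] = '\0';
    return 0;
}
```

In the sketch, a caller would apply this only when the link is about to be written inside an unquoted `url(...)`, which is the condition the commit gates on.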

Test in 01_engine-parse.test crawls a CSS fixture whose url() references a
%20%28...%29 name and asserts the rewrite keeps the parens encoded;
negative control confirmed (literal-paren output fails it).

Closes #163

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 10:01:17 +02:00
Xavier Roche
cae11499f1 Merge pull request #399 from xroche/fix/js-string-falsepos-218
htsparse: don't treat XHR.open's method argument as a URL (#218)
2026-06-19 20:36:26 +02:00
Xavier Roche
02c7f4ebf6 htsparse: don't treat XHR.open's method argument as a URL (#218)
The JavaScript URL detector matched `.open(` for window.open("url",...)
and captured the first argument as a link. XMLHttpRequest.open(method,
url) puts the HTTP method first, so `xhr.open("GET", "ajax_info.txt")`
turned "GET" into a bogus link, rewritten to "GET.html" on a live server.

Reject a first argument that is exactly an HTTP method, mirroring the
existing ensure_not_mime guard. window.open(url) is unaffected; the real
XHR url (the second argument) is still picked up by the dirty parser.
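
The guard can be pictured with the following minimal sketch. It is a standalone approximation, not the engine's helper: `is_http_method_token` is a hypothetical name, and it uses `toupper` for the case-insensitive comparison where the engine uses its own `strfield`.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if the len bytes at s are exactly an HTTP method
   name (case-insensitive), 0 otherwise. */
static int is_http_method_token(const char *s, size_t len) {
    static const char *const methods[] = {"GET",  "POST",    "PUT",   "DELETE",
                                          "HEAD", "OPTIONS", "PATCH", "TRACE"};
    size_t m, i;

    for (m = 0; m < sizeof(methods) / sizeof(methods[0]); m++) {
        if (strlen(methods[m]) != len)
            continue; /* the whole token must match, not just a prefix */
        for (i = 0; i < len; i++) {
            if (toupper((unsigned char) s[i]) != methods[m][i])
                break;
        }
        if (i == len)
            return 1;
    }
    return 0;
}
```

Because the length must match exactly, a real file name that merely starts with a method name, such as `GET.html`, is still treated as a link candidate.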

Closes #218

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 20:27:04 +02:00
Xavier Roche
9070b44a70 Merge pull request #398 from xroche/fix/html-underflow-396
htsparse: fix buffer underflow reading *(html-1) at offset 0 (#396)
2026-06-19 19:55:40 +02:00
Xavier Roche
799c045061 htsparse: don't read *(html-1) before the parse buffer (#396)
The link detector's word-boundary guards dereference *(html-1) to check
the byte preceding a matched token. When the token sits at the very start
of the parse buffer (html == r->adr), that reads one byte before the
allocation: a heap-buffer-overflow under ASan, silent on a normal build.
A stylesheet beginning with a url() token is enough to hit it.

Route the three reachable guards (url(), location=, the makeindex /title
check) through html_prevc(), which returns a space sentinel at the buffer
start. Space is the right value for these tests: a token at offset 0 is at
a word boundary, so it stays a valid match. The other *(html-1) sites only
run after html has advanced past an opening tag or quote.
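
A minimal sketch of the sentinel idea, under the assumption that `start` is the first byte of the parse buffer. `prev_char` and `url_token_at_boundary` are hypothetical names chosen for the example; only the space-sentinel behaviour is taken from the description above.

```c
#include <ctype.h>

/* Byte before p, or a space when p is the first byte of the buffer, so the
   caller never reads start[-1]. */
static char prev_char(const char *p, const char *start) {
    return p > start ? p[-1] : ' ';
}

/* Hypothetical guard: 1 if a token beginning at p sits on a word boundary,
   i.e. the preceding byte is neither alphanumeric nor '_'. */
static int url_token_at_boundary(const char *p, const char *start) {
    const char c = prev_char(p, start);

    return !isalnum((unsigned char) c) && c != '_';
}
```

A token at offset 0 therefore counts as being on a boundary, while the `url` inside `curl(` or `my_url(` does not.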

Covers it with an offset-0 url() fixture in 01_engine-parse.test; without
the fix it aborts at htsparse.c:1386 under the CI sanitizer job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:44:25 +02:00
Xavier Roche
fb1ee3bf2e Merge pull request #397 from xroche/fix/css-import-94
CSS @import: capture URLs that carry a media/supports/layer condition (#94)
2026-06-19 19:30:21 +02:00
Xavier Roche
6a08ca7d39 htsparse: bound the URL-end scan against a missing closing delimiter
Reviewing the @import change, ASan flagged a pre-existing heap overflow:
when a quoted/parenthesized link token has no closing delimiter before the
buffer ends (truncated input such as `@import "x`, `@import "`, `url("x`),
the scan stops at the terminating NUL, then `c += ndelim` steps past it and
`while (*c == ' ')` / the terminator test read out of bounds. Such input
aborts under ASan on master.

Skip the URL-end scan and capture when no closing delimiter was found
(`*c == '\0'` right after the scan); c never advances past the NUL.
Well-formed tokens are unaffected.
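
The shape of the bound can be sketched as follows. This is a simplified illustration, not the parser's loop: `quoted_length` is a hypothetical helper that handles a single quote character and reports truncation explicitly, so that no later step can advance past the terminating NUL.

```c
#include <stddef.h>

/* Hypothetical helper: a points just past an opening quote q. Returns the
   length of the quoted content, or -1 when the buffer ends before a closing
   quote is seen. */
static long quoted_length(const char *a, char q) {
    const char *c = a;

    while (*c != '\0' && *c != q)
        c++;
    if (*c == '\0')
        return -1; /* truncated input: stop here, never step past the NUL */
    return (long) (c - a);
}
```

A caller would skip the capture entirely when the result is negative, which corresponds to the "no closing delimiter was found" case above.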

01_engine-parse.test gains a truncated-@import fixture (the valid sibling
import is still captured, the unterminated one is not) that trips the
overflow under the CI ASan job, plus a check that an @import's trailing
media/supports/layer condition survives the rewrite verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:25:39 +02:00
Xavier Roche
a8b491e509 htsparse: capture conditional CSS @import URLs (#94)
A bare-string @import carrying a media/supports/layer condition, e.g.
`@import "theme.css" screen;`, was dropped. The detector required the closing
quote to be immediately followed by the statement terminator, so the trailing
condition aborted the capture. The `url(...)` form already worked because it
terminates at the paren.

Two coupled defects in the inscript/CSS detector:
- accept a whitespace-separated trailing condition after a quoted @import URL;
- bound the captured URL at its last content char (b) instead of recomputing
  from the terminator. The old `c -= (ndelim + 1)` mishandled spaces skipped
  before the terminator, leaving the closing quote inside the range so the
  bogus-link guard aborted. That also silently broke `foo="url" ;` (a space
  before the semicolon) for every quoted detection, not only @import.
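
The two points above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the detector itself: `import_url` is a hypothetical helper that starts at the opening quote of a bare-string @import, takes the URL as everything up to the closing quote, and accepts either a terminator or whitespace introducing a condition after it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stmt points at the opening quote of a bare-string
   @import. Copies the URL into out; returns 0 on success, -1 otherwise. */
static int import_url(const char *stmt, char *out, size_t size) {
    const char q = *stmt;
    const char *a, *b, *c;
    size_t len;

    if (q != '"' && q != '\'')
        return -1;
    a = stmt + 1; /* first URL character */
    for (b = a; *b != '\0' && *b != q; b++)
        ;
    if (*b == '\0')
        return -1; /* unterminated string */
    len = (size_t) (b - a); /* the URL ends at its last content character */
    c = b + 1;
    /* accept ';', end of line, or whitespace that introduces a condition */
    if (*c != ';' && *c != '\n' && *c != '\r' && *c != ' ' && *c != '\t')
        return -1;
    if (len == 0 || len >= size)
        return -1;
    memcpy(out, a, len);
    out[len] = '\0';
    return 0;
}
```

Bounding the URL at the closing quote, rather than walking back from the terminator, is what keeps skipped spaces and the trailing condition out of the captured range.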

01_engine-parse.test gains a CSS @import section that crawls a .css directly;
the conditioned cases are negative controls that fail without the fix.

Closes #94

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:46:31 +02:00
Xavier Roche
a8e4bb3b81 Merge pull request #395 from xroche/fix/xmlns-false-links-191
Don't crawl xmlns namespace declarations
2026-06-19 18:28:23 +02:00
Xavier Roche
0145ec37a3 htsparse: don't crawl xmlns namespace declarations (#191)
The "dirty parsing" heuristic accepts any tag attribute whose value looks
like a URL unless the attribute is on the no-detect list. xmlns and
xmlns:prefix declarations carry namespace URIs (xmlns:og="http://ogp.me/ns#",
etc.) that are not resources, so httrack queued and fetched them, stalling
the crawl on unrelated spec URLs. Reject xmlns/xmlns:prefix where the
no-detect list is already consulted.
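
A minimal sketch of the attribute test, assuming `attr` points at the start of the attribute name. `is_xmlns_attr` is a hypothetical standalone helper; the case-insensitive match is an assumption of the sketch, and the engine performs the equivalent check with its own `strfield`.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: 1 if attr starts with "xmlns" followed by ':', '=' or
   whitespace (so "xmlns" and "xmlns:prefix" match, "xmlnsfoo" does not). */
static int is_xmlns_attr(const char *attr) {
    static const char key[] = "xmlns";
    size_t i;

    for (i = 0; key[i] != '\0'; i++) {
        /* a shorter attr stops here on its NUL, so attr is never over-read */
        if (tolower((unsigned char) attr[i]) != key[i])
            return 0;
    }
    return attr[i] == ':' || attr[i] == '=' ||
           isspace((unsigned char) attr[i]);
}
```

Checking the byte after the name is what separates a namespace declaration from an unrelated attribute that merely shares the prefix.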

01_engine-parse.test grows a fixture with each form (default and prefixed) as
the last attribute of its element, since the heuristic only inspects an
attribute whose value is immediately followed by '>'; the targets are local
file:// gifs so a regression actually downloads them (verified: reverting the
guard fetches all three).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:24:55 +02:00
Xavier Roche
a80fab38ba Merge pull request #394 from xroche/fix/proxy-https-connect-85
Tunnel https through the proxy via CONNECT (#85)
2026-06-19 18:03:31 +02:00
27 changed files with 920 additions and 506 deletions

@@ -320,6 +320,21 @@ jobs:
  lint:
  name: lint (shellcheck, shfmt)
  runs-on: ubuntu-24.04
+ # Every tracked shell script; the globs expand at run time. Kept here so the
+ # shellcheck and shfmt steps below cannot drift apart.
+ env:
+ SHELL_SCRIPTS: >-
+ .githooks/pre-commit
+ bootstrap
+ build.sh
+ html/div/search.sh
+ man/makeman.sh
+ src/htsbasiccharsets.sh
+ src/htsentities.sh
+ src/webhttrack
+ tests/*.sh
+ tests/*.test
+ tools/mkdeb.sh
  steps:
  - uses: actions/checkout@v6
@@ -332,12 +347,11 @@ jobs:
  sudo apt-get install -y --no-install-recommends shellcheck shfmt
  shfmt --version
- # Lint the scripts we maintain; the legacy scripts are a separate cleanup.
  - name: shellcheck
- run: shellcheck man/makeman.sh tools/mkdeb.sh .githooks/pre-commit tests/*.test tests/check-network.sh
+ run: shellcheck $SHELL_SCRIPTS
  - name: shfmt
- run: shfmt -d -i 4 man/makeman.sh tools/mkdeb.sh .githooks/pre-commit
+ run: shfmt -d -i 4 $SHELL_SCRIPTS
  # Check clang-format on CHANGED LINES ONLY. The engine predates clang-format
  # (it was shaped by an old Visual Studio formatter) and does not round-trip,

@@ -1,8 +1,7 @@
  #!/bin/sh
  # Simple indexing test using HTTrack
  # A "real" script/program would use advanced search, and
  # use dichotomy to find the word in the index.txt file
  # This script is really basic and NOT optimized, and
  # should not be used for professional purpose :)
@@ -11,50 +10,49 @@ TESTSITE="http://localhost/"
  # Create an index if necessary
  if ! test -f "index.txt"; then
  echo "Building the index .."
  rm -rf test
  httrack --display "$TESTSITE" -%I -O test
  mv test/index.txt ./
  fi
  # Convert crlf to lf
- if test "`head index.txt -n 1 | tr '\r' '#' | grep -c '#'`" = "1"; then
+ if test "$(head index.txt -n 1 | tr '\r' '#' | grep -c '#')" = "1"; then
  echo "Converting index to Unix LF style (not CR/LF) .."
  mv -f index.txt index.txt.old
- cat index.txt.old|tr -d '\r' > index.txt
+ tr -d '\r' <index.txt.old >index.txt
  fi
  keyword=-
  while test -n "$keyword"; do
  printf "Enter a keyword: "
- read keyword
+ read -r keyword
  if test -n "$keyword"; then
- FOUNDK="`grep -niE \"^$keyword\" index.txt`"
+ FOUNDK="$(grep -niE "^$keyword" index.txt)"
  if test -n "$FOUNDK"; then
- if ! test `echo "$FOUNDK"|wc -l` = "1"; then
+ if ! test "$(echo "$FOUNDK" | wc -l)" = "1"; then
  # Multiple matches
  printf "Found multiple keywords: "
- echo "$FOUNDK"|cut -f2 -d':'|tr '\n' ' '
+ echo "$FOUNDK" | cut -f2 -d':' | tr '\n' ' '
  echo ""
  echo "Use keyword$ to find only one"
  else
  # One match
- N=`echo "$FOUNDK"|cut -f1 -d':'`
- PM=`tail +$N index.txt|grep -nE "\("|head -n 1`
- if ! echo "$PM"|grep "ignored">/dev/null; then
- M=`echo $PM|cut -f1 -d':'`
+ N=$(echo "$FOUNDK" | cut -f1 -d':')
+ PM=$(tail "+$N" index.txt | grep -nE "\(" | head -n 1)
+ if ! echo "$PM" | grep "ignored" >/dev/null; then
+ M=$(echo "$PM" | cut -f1 -d':')
  echo "Found in:"
- cat index.txt | tail "+$N" | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
+ tail "+$N" index.txt | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
  else
  echo "keyword ignored (too many hits)"
  fi
  fi
  else
  echo "not found"
  fi
  fi
  done

@@ -3,57 +3,59 @@
  # Change this to download files
  if false; then
  echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
  echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
  echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
  echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
  echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
  echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
  rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
  fi
  # Produce code
- printf "/** GENERATED FILE ($0), DO NOT EDIT **/\n\n"
- for i in *.TXT ; do
+ printf '/** GENERATED FILE (%s), DO NOT EDIT **/\n\n' "$0"
+ for i in *.TXT; do
  echo "processing $i" >&2
- grep -vE "^(#|$)" $i | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' | \
+ grep -vE "^(#|$)" "$i" | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' |
  (
  unset arr
- while read LINE ; do
- from=$[$(echo $LINE | cut -f1 -d' ')]
+ while read -r LINE; do
+ from=$(($(echo "$LINE" | cut -f1 -d' ')))
  if ! test -n "$from"; then
  echo "error with $i" >&2
  exit 1
  elif test $from -ge 256; then
  echo "out-of-range ($LINE) with $i" >&2
  exit 1
  fi
- to=$(echo $LINE | cut -f2 -d' ')
- arr[$from]=$to
+ to=$(echo "$LINE" | cut -f2 -d' ')
+ arr[from]=$to
  done
- name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
- printf "/* Table for $i */\nstatic const hts_UCS4 table_${name}[256] = {\n "
- i=0
- while test "$i" -lt 256; do
- if test "$i" -gt 0; then
- printf ", "
- if test $[${i}%8] -eq 0; then
- printf "\n "
- fi
- fi
- value=${arr[$i]:-0}
- printf "0x%04x" $value
- i=$[${i}+1]
- done
- printf " };\n\n"
- )
- echo "processed $i" >&2
+ # shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
+ name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
+ printf '/* Table for %s */\nstatic const hts_UCS4 table_%s[256] = {\n ' "$i" "$name"
+ idx=0
+ while test "$idx" -lt 256; do
+ if test "$idx" -gt 0; then
+ printf ", "
+ if test $((idx % 8)) -eq 0; then
+ printf "\n "
+ fi
+ fi
+ value=${arr[$idx]:-0}
+ printf "0x%04x" "$value"
+ idx=$((idx + 1))
+ done
+ printf " };\n\n"
+ )
+ echo "processed $i" >&2
  done
  # Indexes
  printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n"
- for i in *.TXT ; do
- name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
- printf " { \"$(echo $name | tr -d '_')\", table_${name} },\n"
+ for i in *.TXT; do
+ # shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
+ name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
+ printf ' { "%s", table_%s },\n' "$(echo "$name" | tr -d '_')" "$name"
  done
  printf " { NULL, NULL }\n};\n"

@@ -2787,6 +2787,47 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
  return 0;
  }
  break;
+ case 'l': /* lienrelatif: relative link from curr_fil to link */
+ if (na + 2 >= argc) {
+ HTS_PANIC_PRINTF(
+ "Option #l needs a link and a current-file path");
+ printf(
+ "Example: '-#l' 'host/dir/img.gif' 'host/dir/p.html'\n");
+ htsmain_free();
+ return -1;
+ } else {
+ char s[HTS_URLMAXSIZE * 2];
+ if (lienrelatif(s, sizeof(s), argv[na + 1], argv[na + 2]) ==
+ 0)
+ printf("relative=%s\n", s);
+ else
+ printf("relative=<ERROR>\n");
+ htsmain_free();
+ return 0;
+ }
+ break;
+ case 'i': /* ident_url_relatif: resolve a link -> adr/fil */
+ if (na + 3 >= argc) {
+ HTS_PANIC_PRINTF(
+ "Option #i needs a link, an origin address and file");
+ printf("Example: '-#i' '../img.gif' 'www.foo.com' "
+ "'/d/p.html'\n");
+ htsmain_free();
+ return -1;
+ } else {
+ lien_adrfil af;
+ const int r = ident_url_relatif(argv[na + 1], argv[na + 2],
+ argv[na + 3], &af);
+ if (r == 0)
+ printf("adr=%s fil=%s\n", af.adr, af.fil);
+ else
+ printf("error=%d\n", r);
+ htsmain_free();
+ return 0;
+ }
+ break;
  case '2': // mimedefs
  if (na + 1 >= argc) {
  HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL");

@@ -33,43 +33,43 @@ EOF
  else
  GET "${url}"
  fi
- ) \
- | grep -E '^<!ENTITY [a-zA-Z0-9_]' \
- | sed \
+ ) |
+ grep -E '^<!ENTITY [a-zA-Z0-9_]' |
+ sed \
  -e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
  -e 's/-->$//' \
- -e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/'\
- | ( \
- read A
+ -e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
+ (
+ read -r A
  while test -n "$A"; do
  ent="${A%% *}"
- code=$(echo "$A"|cut -f2 -d' ')
+ code=$(echo "$A" | cut -f2 -d' ')
  # compute hash
  hash=0
  i=0
  a=1664525
  c=1013904223
- m="$[1 << 32]"
+ m="$((1 << 32))"
  while test "$i" -lt ${#ent}; do
- d="$(echo -n "${ent:${i}:1}"|hexdump -v -e '/1 "%d"')"
- hash="$[((${hash}*${a})%(${m})+${d}+${c})%(${m})]"
- i=$[${i}+1]
+ d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
+ hash="$((((hash * a) % (m) + d + c) % (m)))"
+ i=$((i + 1))
  done
  echo -e " /* $A */"
  echo -e " case ${hash}u:"
  echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
  echo -e " return ${code};"
  echo -e " }"
  echo -e " break;"
  # next
- read A
+ read -r A
  done
  )
  cat <<EOF
  }
  /* unknown */
  return -1;
  }
  EOF
- ) > ${dest}
+ ) >${dest}

@@ -2605,6 +2605,8 @@ int ident_url_absolute(const char *url, lien_adrfil *adrfil) {
  for(i = 0; adrfil->fil[i] != '\0'; i++)
  if (adrfil->fil[i] == '\\')
  adrfil->fil[i] = '/';
+ // collapse ../ like the http branch above (path-traversal safety)
+ fil_simplifie(adrfil->fil);
  }
  // no hostname

@@ -296,6 +296,48 @@ static const char *html_inline_safe(const char *src, char *dst, size_t size) {
  return dst;
  }
+ /* Byte before html, or a space sentinel at the buffer start where html[-1]
+ would underflow; space reads as the word boundary the guards want there. */
+ static HTS_INLINE char html_prevc(const char *html, const char *start) {
+ return html > start ? html[-1] : ' ';
+ }
+ /* True if [s, s+len) is exactly an HTTP method token (XHR.open's first
+ argument is a method, not a URL: #218). Case-insensitive. */
+ static int is_http_method(const char *s, size_t len) {
+ static const char *const methods[] = {"GET", "POST", "PUT",
+ "DELETE", "HEAD", "OPTIONS",
+ "PATCH", "TRACE", NULL};
+ int i;
+ for (i = 0; methods[i] != NULL; i++) {
+ if (strlen(methods[i]) == len && strfield(s, methods[i]) == (int) len)
+ return 1;
+ }
+ return 0;
+ }
+ /* Percent-encode '(' and ')' in a link emitted into an unquoted url(...) (CSS
+ or JS): a literal ')' closes the token early and the UA mis-parses the value
+ (#163). The UA decodes %28/%29 back to the saved-on-disk name. */
+ static void escape_url_parens(char *const s, const size_t size) {
+ char BIGSTK buff[HTS_URLMAXSIZE * 2];
+ size_t i, j;
+ for (i = 0, j = 0; s[i] != '\0' && j + 3 < size && j + 3 < sizeof(buff);
+ i++) {
+ if (s[i] == '(' || s[i] == ')') {
+ buff[j++] = '%';
+ buff[j++] = '2';
+ buff[j++] = s[i] == '(' ? '8' : '9';
+ } else {
+ buff[j++] = s[i];
+ }
+ }
+ buff[j] = '\0';
+ strlcpybuff(s, buff, size);
+ }
  /* Main parser */
  int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  char catbuff[CATBUFF_SIZE];
@@ -556,7 +598,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  if (opt->getmode & HTS_GETMODE_HTML) {
  p = strfield(html, "title");
  if (p) {
- if (*(html - 1) == '/')
+ if (html_prevc(html, r->adr) == '/')
  p = 0; // /title
  } else {
  if (strfield(html, "/html"))
@@ -1341,6 +1383,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  int can_avoid_quotes = 0;
  char quotes_replacement = '\0';
  int ensure_not_mime = 0;
+ // .open(method,url): reject an HTTP-method first arg (#218)
+ int ensure_not_method = 0;
+ // @import: the quoted token is the URL; a trailing
+ // media/supports/layer condition is not part of it
+ int is_import = 0;
  if (inscript_tag)
  expected_end = ";\"\'"; // voir a href="javascript:doc.location='foo'"
@@ -1357,9 +1404,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  if (!nc)
  nc = strfield(html, ":location"); // javascript:location="doc"
  if (!nc) { // location="doc"
- if ((nc = strfield(html, "location"))
- && !isspace(*(html - 1))
- )
+ if ((nc = strfield(html, "location")) &&
+ !isspace(html_prevc(html, r->adr)))
  nc = 0;
  }
  if (!nc)
@@ -1369,6 +1415,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  expected = '('; // parenthèse
  expected_end = "),"; // fin: virgule ou parenthèse
  ensure_not_mime = 1; //* ensure the url is not a mime type */
+ ensure_not_method = 1; // xhr.open: don't grab method
  }
  if (!nc)
  if ((nc = strfield(html, ".replace"))) { // window.replace("url")
@@ -1380,7 +1427,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  expected = '('; // parenthèse
  expected_end = ")"; // fin: parenthèse
  }
- if (!nc && (nc = strfield(html, "url")) && (!isalnum(*(html - 1))) && *(html - 1) != '_') { // url(url)
+ if (!nc && (nc = strfield(html, "url")) &&
+ (!isalnum(html_prevc(html, r->adr))) &&
+ html_prevc(html, r->adr) != '_') { // url(url)
  expected = '('; // parenthèse
  expected_end = ")"; // fin: parenthèse
  can_avoid_quotes = 1;
@@ -1390,6 +1439,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  if ((nc = strfield(html, "import"))) { // import "url"
  if (is_space(*(html + nc))) {
  expected = 0; // no char expected
+ is_import = 1;
  } else
  nc = 0;
  }
@@ -1407,6 +1457,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) {
  const char *b, *c;
  int ndelim = 1;
+ int valid_url = 0;
  if ((*a == 34) || (*a == '\''))
  a++;
@@ -1421,12 +1472,20 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  b++;
  }
  c = b--;
- c += ndelim;
- while(*c == ' ')
- c++;
- if ((strchr(expected_end, *c)) || (*c == '\n')
- || (*c == '\r')) {
- c -= (ndelim + 1);
+ // no closing delimiter here (truncated input):
+ // Don't scan past the buffer NUL or capture it.
+ if (*c != '\0') {
+ c += ndelim;
+ while (*c == ' ')
+ c++;
+ valid_url =
+ (strchr(expected_end, *c)) || (*c == '\n') ||
+ (*c == '\r') ||
+ (is_import && *(b + 1 + ndelim) == ' ');
+ }
+ if (valid_url) {
+ // URL end = last char (b), not the delimiter
+ c = b;
  if ((int) (c - a + 1)) {
  if (ensure_not_mime) {
  int i = 0;
@@ -1442,6 +1501,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  i++;
  }
  }
+ // XHR.open's "GET" etc. is a method, not a URL
+ if (a != NULL && ensure_not_method &&
+ is_http_method(a, (size_t) (c - a + 1))) {
+ a = NULL;
+ }
  // Check for bogus links (Vasiliy)
  if (a != NULL) {
  const size_t size = c - a + 1;
@@ -1485,7 +1549,6 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  }
  }
  }
  }
  }
  }
@@ -1692,6 +1755,24 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  hts_nodetect[i -
  1]);
  }
+ // xmlns / xmlns:prefix declare
+ // XML namespaces, not resources
+ // (#191)
+ else {
+ const int xl = strfield(
+ intag_startattr, "xmlns");
+ const char xc =
+ intag_startattr[xl];
+ if (xl &&
+ (xc == ':' || xc == '=' ||
+ is_space(xc))) {
+ url_ok = 0;
+ hts_log_print(
+ opt, LOG_DEBUG,
+ "dirty parsing: xmlns "
+ "namespace avoided");
+ }
+ }
  }
  }
@@ -2967,6 +3048,10 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
  /* Never escape high-chars (we don't know the encoding!!) */
  inplace_escape_uri_utf(tempo, sizeof(tempo));
+ // unquoted url() (CSS/JS): keep parens escaped
+ if (ending_p == ')')
+ escape_url_parens(tempo, sizeof(tempo));
  //if (!no_esc_utf)
  // escape_uri(tempo); // escape with %xx
  //else {

@@ -4,131 +4,140 @@
  # Initializes the htsserver GUI frontend and launch the default browser
  BROWSEREXE=
- SRCHBROWSEREXE="x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition"
+ SRCHBROWSEREXE=(x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition)
+ # shellcheck disable=SC2153 # BROWSER is the standard freedesktop env var, not a typo
  if test -n "${BROWSER}"; then
  # sensible-browser will f up if BROWSER is not set
- SRCHBROWSEREXE="xdg-open sensible-browser ${SRCHBROWSEREXE}"
+ SRCHBROWSEREXE=(xdg-open sensible-browser "${SRCHBROWSEREXE[@]}")
  fi
  # Patch for Darwin/Mac by Ross Williams
- if test "`uname -s`" == "Darwin"; then
+ if test "$(uname -s)" == "Darwin"; then
  # Darwin/Mac OS X uses a system 'open' command to find
  # the default browser. The -W flag causes it to wait for
  # the browser to exit
  BROWSEREXE="/usr/bin/open -W"
  fi
- BINWD=`dirname "$0"`
- SRCHPATH="$BINWD /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin ${HOME}/usr/bin ${HOME}/bin"
- SRCHPATH="$SRCHPATH "`echo $PATH | tr ":" " "`
- SRCHDISTPATH="$BINWD/../share $BINWD/.. /usr/share /usr/local /usr /local /usr/local/share ${HOME}/usr ${HOME}/usr/share /opt/local/share /sw ${HOME}/usr/local ${HOME}/usr/share"
+ BINWD=$(dirname "$0")
+ SRCHPATH=("$BINWD" /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin "${HOME}/usr/bin" "${HOME}/bin")
+ IFS=':' read -ra pathdirs <<<"$PATH"
+ for d in "${pathdirs[@]}"; do
+ # drop empty PATH fields, matching the old echo|tr word-split
+ test -n "$d" && SRCHPATH+=("$d")
+ done
+ SRCHDISTPATH=("$BINWD/../share" "$BINWD/.." /usr/share /usr/local /usr /local /usr/local/share "${HOME}/usr" "${HOME}/usr/share" /opt/local/share /sw "${HOME}/usr/local" "${HOME}/usr/share")
  ###
  # And now some famous cuisine
  function log {
- echo "$0($$): $@" >&2
+ echo "$0($$): $*" >&2
  return 0
  }
  function launch_browser {
  log "Launching $1"
  browser=$1
  url=$2
  log "Spawning browser.."
  ${browser} "${url}"
  # note: browser can hiddenly use the -remote feature of
  # mozilla and therefore return immediately
log "Browser (or helper) exited" log "Browser (or helper) exited"
} }
# First ensure that we can launch the server # First ensure that we can launch the server
BINPATH= BINPATH=
for i in ${SRCHPATH}; do for i in "${SRCHPATH[@]}"; do
! test -n "${BINPATH}" && test -x ${i}/htsserver && BINPATH=${i} ! test -n "${BINPATH}" && test -x "${i}/htsserver" && BINPATH="${i}"
done done
for i in ${SRCHDISTPATH}; do for i in "${SRCHDISTPATH[@]}"; do
! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack" ! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack"
done done
test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1 test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1
test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1 test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1
test -f ${DISTPATH}/lang.def || ! log "Could not find ${DISTPATH}/lang.def" || exit 1 test -f "${DISTPATH}/lang.def" || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
test -f ${DISTPATH}/lang.indexes || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1 test -f "${DISTPATH}/lang.indexes" || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
test -d ${DISTPATH}/lang || ! log "Could not find ${DISTPATH}/lang" || exit 1 test -d "${DISTPATH}/lang" || ! log "Could not find ${DISTPATH}/lang" || exit 1
test -d ${DISTPATH}/html || ! log "Could not find ${DISTPATH}/html" || exit 1 test -d "${DISTPATH}/html" || ! log "Could not find ${DISTPATH}/html" || exit 1
# Locale # Locale
HTSLANG="${LC_MESSAGES}" HTSLANG="${LC_MESSAGES}"
! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}" ! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}"
! test -n "${HTSLANG}" && HTSLANG="${LANG}" ! test -n "${HTSLANG}" && HTSLANG="${LANG}"
HTSLANG="`echo $LANG | cut -f1 -d'.' | cut -f1 -d'_'`" HTSLANG="$(echo "$LANG" | cut -f1 -d'.' | cut -f1 -d'_')"
LANGN=`grep -E "^${HTSLANG}:" ${DISTPATH}/lang.indexes | cut -f2 -d':'` LANGN=$(grep -E "^${HTSLANG}:" "${DISTPATH}/lang.indexes" | cut -f2 -d':')
! test -n "${LANGN}" && LANGN=1 ! test -n "${LANGN}" && LANGN=1
# Find the browser # Find the browser
# note: not all systems have sensible-browser or www-browser alternative # note: not all systems have sensible-browser or www-browser alternative
# thefeore, we have to find a bit more if sensible-browser could not be found # thefeore, we have to find a bit more if sensible-browser could not be found
for i in ${SRCHBROWSEREXE}; do for i in "${SRCHBROWSEREXE[@]}"; do
for j in ${SRCHPATH}; do for j in "${SRCHPATH[@]}"; do
if test -x ${j}/${i}; then if test -x "${j}/${i}"; then
BROWSEREXE=${j}/${i} BROWSEREXE="${j}/${i}"
fi fi
test -n "$BROWSEREXE" && break test -n "$BROWSEREXE" && break
done done
test -n "$BROWSEREXE" && break test -n "$BROWSEREXE" && break
done done
test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1 test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1
# "browse" command # "browse" command
if test "$1" = "browse"; then if test "$1" = "browse"; then
if test -f "${HOME}/.httrack.ini"; then if test -f "${HOME}/.httrack.ini"; then
INDEXF=`cat ${HOME}/.httrack.ini | tr '\r' '\n' | grep -E "^path=" | cut -f2- -d'='` INDEXF=$(tr '\r' '\n' <"${HOME}/.httrack.ini" | grep -E "^path=" | cut -f2- -d'=')
if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then
INDEXF="${INDEXF}/index.html" INDEXF="${INDEXF}/index.html"
else else
INDEXF="" INDEXF=""
fi fi
fi fi
if ! test -n "$INDEXF"; then if ! test -n "$INDEXF"; then
INDEXF="${HOME}/websites/index.html" INDEXF="${HOME}/websites/index.html"
fi fi
launch_browser "${BROWSEREXE}" "file://${INDEXF}" launch_browser "${BROWSEREXE}" "file://${INDEXF}"
exit $? exit $?
fi fi
# Create a temporary filename # Create a temporary filename
TMPSRVFILE="$(mktemp ${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX)" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1 TMPSRVFILE="$(mktemp "${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX")" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
# Launch htsserver binary and setup the server # Launch htsserver binary and setup the server
(${BINPATH}/htsserver "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" $@; echo SRVURL=error) > ${TMPSRVFILE}& (
"${BINPATH}/htsserver" "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" "$@"
echo SRVURL=error
) >"${TMPSRVFILE}" &
# Find the generated SRVURL # Find the generated SRVURL
SRVURL= SRVURL=
MAXCOUNT=60 MAXCOUNT=60
while ! test -n "$SRVURL"; do while ! test -n "$SRVURL"; do
MAXCOUNT=$[$MAXCOUNT - 1] MAXCOUNT=$((MAXCOUNT - 1))
test $MAXCOUNT -gt 0 || exit 1 test $MAXCOUNT -gt 0 || exit 1
test $MAXCOUNT -lt 50 && echo "waiting for server to reply.." test $MAXCOUNT -lt 50 && echo "waiting for server to reply.."
SRVURL=`grep -E URL= ${TMPSRVFILE} | cut -f2- -d=` SRVURL=$(grep -E URL= "${TMPSRVFILE}" | cut -f2- -d=)
test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1 test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1
test -n "$SRVURL" || sleep 1 test -n "$SRVURL" || sleep 1
done done
# Cleanup function # Cleanup function
# shellcheck disable=SC2120 # $1 is an optional "signal caught" marker; bare calls are intentional
function cleanup { function cleanup {
test -n "$1" && log "Nasty signal caught, cleaning up.." test -n "$1" && log "Nasty signal caught, cleaning up.."
# Do not kill if browser exited (chrome bug issue) ; server will die itself # Do not kill if browser exited (chrome bug issue) ; server will die itself
test -n "$1" && test -f ${TMPSRVFILE} && SRVPID=`grep -E PID= ${TMPSRVFILE} | cut -f2- -d=` test -n "$1" && test -f "${TMPSRVFILE}" && SRVPID=$(grep -E PID= "${TMPSRVFILE}" | cut -f2- -d=)
test -n "${SRVPID}" && kill -9 ${SRVPID} test -n "${SRVPID}" && kill -9 "${SRVPID}"
test -f ${TMPSRVFILE} && rm ${TMPSRVFILE} test -f "${TMPSRVFILE}" && rm "${TMPSRVFILE}"
test -n "$1" && log "..Done" test -n "$1" && log "..Done"
return 0 return 0
} }
# Cleanup in case of emergency # Cleanup in case of emergency
trap "cleanup now; exit" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25 trap "cleanup now; exit" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# Got SRVURL, launch browser # Got SRVURL, launch browser
launch_browser "${BROWSEREXE}" "${SRVURL}" launch_browser "${BROWSEREXE}" "${SRVURL}"
# That's all, folks! # That's all, folks!
trap "" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25 trap "" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
cleanup cleanup
exit 0 exit 0

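The PATH handling in the webhttrack hunk above can be tried in isolation. The sketch below assumes bash and uses a hypothetical `split_path` function and `dirs` array (neither is in the script); it shows why the empty-field test is needed once `IFS=: read -ra` replaces the old `echo | tr` word-split.

```bash
#!/bin/bash
# Sketch: split a colon-separated list into the global array `dirs`,
# skipping the empty fields that doubled or leading colons produce.
# `split_path` and `dirs` are illustrative names only.
split_path() {
  local -a fields=()
  local d
  dirs=()
  IFS=':' read -ra fields <<<"$1"
  for d in "${fields[@]}"; do
    if test -n "$d"; then
      dirs+=("$d")
    fi
  done
}

split_path ":/usr/local/bin::/usr/bin"
printf '[%s]\n' "${dirs[@]}"
```

With that input, `read -ra` yields four fields, two of them empty. Only `/usr/local/bin` and `/usr/bin` reach `dirs`, so a later loop never probes a path like `/htsserver`.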
@@ -6,11 +6,11 @@ set -euo pipefail
# charset -> UTF-8 conversion (hts_convertStringToUTF8). # charset -> UTF-8 conversion (hts_convertStringToUTF8).
# -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8. # -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8.
conv() { conv() {
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1 test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
} }
# crash probe: malformed input must exit cleanly, not abort. # crash probe: malformed input must exit cleanly, not abort.
runs() { runs() {
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1 httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
} }
# the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9. # the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9.

@@ -6,11 +6,11 @@ set -euo pipefail
# HTML entity unescaping (hts_unescapeEntitiesWithCharset). # HTML entity unescaping (hts_unescapeEntitiesWithCharset).
# -#6 <string> prints the string with entities decoded (UTF-8 output). # -#6 <string> prints the string with entities decoded (UTF-8 output).
ent() { ent() {
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1 test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
} }
# crash probe: malformed input must exit cleanly, not abort. # crash probe: malformed input must exit cleanly, not abort.
runs() { runs() {
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1 httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
} }
# named entities # named entities

@@ -7,10 +7,10 @@ set -euo pipefail
# -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...". # -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
match() { match() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1 test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
} }
nomatch() { nomatch() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1 test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
} }
# bare star matches everything # bare star matches everything
@@ -67,7 +67,7 @@ nomatch '*[\[]' 'a'
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed # filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy) # by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change. # behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']' match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']' match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x' nomatch '*[\[\]]' '[]x'

@@ -7,10 +7,10 @@ set -euo pipefail
# -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'". # -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
mime() { mime() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1 test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
} }
unknown() { unknown() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1 test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
} }
mime '/a/b.html' 'text/html' mime '/a/b.html' 'text/html'

@@ -154,4 +154,173 @@ grep -Eq "style=\"background-image:url\('ibgs\.gif'\)\"" "$saved2" ||
grep -q 'title="file://' "$saved2" || grep -q 'title="file://' "$saved2" ||
! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1 ! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
# xmlns / xmlns:prefix decls must not be crawled (#191). Local file:// targets so a
# regression downloads them; each is the LAST attr (heuristic only scans a value before '>').
site3="$tmp/xmlns"
mkdir -p "$site3"
for f in ns og rdfs real; do gif "$site3/$f.gif"; done
cat >"$site3/index.html" <<EOF
<html xmlns="file://$site3/ns.gif"><body>
<svg xmlns:og="file://$site3/og.gif"></svg>
<div class="c" xmlns:rdfs="file://$site3/rdfs.gif"></div>
<a href="file://$site3/real.gif">real link</a>
</body></html>
EOF
out3="$tmp/xmlns-out"
crawl "$site3/index.html" "$out3"
# the real link is still captured
found "real.gif" "$out3"
# namespace-declaration targets must not be fetched (default + prefixed forms)
notfound "ns.gif" "$out3"
notfound "og.gif" "$out3"
notfound "rdfs.gif" "$out3"
# CSS @import (#94): every form's target is captured, crawling the .css directly.
# The "cond"/"sup"/"spc" cases carry a trailing media/supports/layer condition (or
# a space before ';'); they are the negative controls: without the parser fix the
# URL is dropped, so a regression fails these found() checks.
site4="$tmp/cssimport"
mkdir -p "$site4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do printf 'body{}\n' >"$site4/$f.css"; done
cat >"$site4/main.css" <<'EOF'
@import url(nq.css);
@import url("dqu.css");
@import url('squ.css');
@import "dqs.css";
@import 'sqs.css';
@import url(med.css) screen and (min-width: 400px);
@import "cond.css" screen;
@import "sup.css" supports(display: flex);
@import url(lay.css) layer(base);
@import "spc.css" ;
EOF
out4="$tmp/cssimport-out"
crawl "$site4/main.css" "$out4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do found "$f.css" "$out4"; done
# Over-capture guard: the trailing condition is not part of the URL, so it must
# survive the rewrite verbatim. A regression that grabs it would mangle these.
m4=$(find "$out4" -type f -path '*/file/*' -name main.css -print -quit)
test -n "$m4" || ! echo "FAIL: saved main.css not found" || exit 1
for cond in '@import "cond.css" screen;' 'supports(display: flex)' 'layer(base)'; do
grep -Fq "$cond" "$m4" ||
! echo "FAIL #94: '$cond' altered on rewrite (condition captured as URL?)" || exit 1
done
# Malformed input: an unterminated @import quote (truncated CSS) must not crash or
# capture a bogus link; a valid sibling import is still captured. Guards a heap
# overflow on the URL-end scan that aborts under ASan (CI sanitizer job).
site5="$tmp/cssimport-trunc"
mkdir -p "$site5"
printf 'body{}\n' >"$site5/good.css"
printf '@import "good.css";\n@import "trunc' >"$site5/main.css"
out5="$tmp/cssimport-trunc-out"
crawl "$site5/main.css" "$out5"
found "good.css" "$out5"
notfound "trunc" "$out5"
# Offset-0 underflow (#396): a token at the buffer start makes the detector's
# word-boundary guard read *(html-1) one byte early (aborts under ASan). The
# url() target is still captured; here it just must not underflow.
site6="$tmp/parse-off0"
mkdir -p "$site6"
printf 'body{}\n' >"$site6/off0.css"
printf 'url(off0.css)\n' >"$site6/main.css"
out6="$tmp/parse-off0-out"
crawl "$site6/main.css" "$out6"
found "off0.css" "$out6"
# XMLHttpRequest.open(method, url) (#218): the first argument is an HTTP method,
# not a URL. Without the fix "GET" is captured as a link and fetched (the offline
# fixture saves a bare file named GET; a live server mangles it to GET.html).
# window.open(url) detection must be unaffected.
site7="$tmp/xhropen"
mkdir -p "$site7"
gif "$site7/winopen.gif"
cat >"$site7/index.html" <<EOF
<html><body><script>
var x = new XMLHttpRequest();
x.open("GET", "ajax_info.txt");
var y = new XMLHttpRequest();
y.open("Post", "submit.cgi");
window.open("file://$site7/winopen.gif");
</script></body></html>
EOF
out7="$tmp/xhropen-out"
crawl "$site7/index.html" "$out7"
# negative control: without the fix a file named exactly GET is downloaded
notfound "GET" "$out7"
# methods are matched case-insensitively (XHR spec normalizes them): a mixed-case
# method is rejected too, so a file named Post must not appear either
notfound "Post" "$out7"
# regression guard: window.open(url) is still detected, so its absolute URL is
# rewritten to a local link. The rewrite only happens if the parser saw it, so
# these two assertions fail if .open detection broke (not a trivial --near save).
saved7=$(savedhtml "$out7")
test -n "$saved7" || ! echo "FAIL: saved xhr page not found" || exit 1
grep -Fq 'window.open("winopen.gif")' "$saved7" ||
! echo "FAIL #218: window.open(url) no longer detected/rewritten" || exit 1
! grep -Fq 'window.open("file://' "$saved7" ||
! echo "FAIL #218: window.open URL left absolute (not rewritten)" || exit 1
# Parens in an unquoted url(...) (#163): the source %28/%29 decode to literal
# '(' ')' in the saved name, but a literal ')' in the rewritten url() closes the
# token early, so they must stay encoded. Negative control: without the fix the
# %281%29 greps fail (parens are RFC2396 "mark" chars the escaper leaves alone).
site8="$tmp/cssparens"
mkdir -p "$site8"
for f in 'img (1).gif' 'a(b)c(1).gif' 'q (4).gif'; do gif "$site8/$f"; done
cat >"$site8/style.css" <<'EOF'
.a { background: url(img%20%281%29.gif); }
.b { background: url(a%28b%29c%281%29.gif); }
.c { background: url("q%20%284%29.gif"); }
EOF
out8="$tmp/cssparens-out"
crawl "$site8/style.css" "$out8"
found "img (1).gif" "$out8"
found "a(b)c(1).gif" "$out8"
found "q (4).gif" "$out8"
css8=$(find "$out8" -type f -path '*/file/*' -name style.css -print -quit)
test -n "$css8" || ! echo "FAIL: saved style.css not found" || exit 1
grep -Fq 'url(img%20%281%29.gif)' "$css8" ||
! echo "FAIL #163: parens in unquoted url() not percent-encoded on rewrite" || exit 1
grep -Fq 'url(a%28b%29c%281%29.gif)' "$css8" ||
! echo "FAIL #163: not every paren in a url() was percent-encoded" || exit 1
grep -Fq 'url("q%20%284%29.gif")' "$css8" ||
! echo "FAIL #163: quoted url() altered or parens left literal on rewrite" || exit 1
# The url() detector is not CSS-specific: <script> and inline style= get the
# same encoding, but ordinary href/src (ending_p is the quote, not ')') keep
# literal parens -- the attribute checks guard the gate against over-firing.
site9="$tmp/urlparens"
mkdir -p "$site9"
for f in 'js (1).gif' 'inl (2).gif' 'asrc (3).gif' 'ahref (4).gif'; do gif "$site9/$f"; done
cat >"$site9/index.html" <<EOF
<html><body>
<script>var bg = "url(js%20%281%29.gif)";</script>
<div style="background-image:url(inl%20%282%29.gif)"></div>
<img src="asrc%20%283%29.gif">
<a href="ahref%20%284%29.gif">link</a>
</body></html>
EOF
out9="$tmp/urlparens-out"
crawl "$site9/index.html" "$out9"
saved9=$(savedhtml "$out9")
test -n "$saved9" || ! echo "FAIL: saved urlparens page not found" || exit 1
# rewrite-only: the JS-string asset is not queued for download
grep -Fq 'url(js%20%281%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in <script> url() not percent-encoded" || exit 1
found "inl (2).gif" "$out9"
grep -Fq 'url(inl%20%282%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in inline style url() not percent-encoded" || exit 1
found "asrc (3).gif" "$out9"
found "ahref (4).gif" "$out9"
grep -Fq 'src="asrc%20(3).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain src attribute were wrongly encoded" || exit 1
grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain href attribute were wrongly encoded" || exit 1
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
exit 0 exit 0

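The #163 assertions above expect literal parentheses to come back as `%28` and `%29` inside an unquoted `url(...)`. Here is a minimal bash sketch of that encoding step. `encode_parens` is a hypothetical name, and it only illustrates the expected output; it is not the C `escape_url_parens` routine.

```bash
#!/bin/bash
# Hypothetical helper: percent-encode literal parentheses so the value can
# sit inside an unquoted url(...) without closing the token early.
encode_parens() {
  local s=$1
  s=${s//\(/%28}
  s=${s//\)/%29}
  printf '%s\n' "$s"
}

encode_parens 'a(b)c(1).gif'  # prints a%28b%29c%281%29.gif
encode_parens 'img%20(1).gif' # prints img%20%281%29.gif
```

Other characters, including the already-encoded space, pass through untouched. That matches the expectation that plain `src`/`href` values are handled by a different path.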
tests/01_engine-relative.test Executable file

@@ -0,0 +1,68 @@
#!/bin/bash
#
# lienrelatif (build relative path) + ident_url_relatif (resolve a link, collapse
# ./ and ../). Regression net for #137/#162; expected values hand-computed.
set -euo pipefail
# relative path from <curr>'s directory to <link>
rel() {
local got
got=$(httrack -O /dev/null -#l "$1" "$2")
test "$got" == "relative=$3" ||
{
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
exit 1
}
}
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
ident() {
local got
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
test "$got" == "$4" ||
{
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"
exit 1
}
}
### lienrelatif
rel 'dir/page.html' 'dir/index.html' 'page.html'
rel 'dir/page.html' 'dir/page.html' 'page.html' # self-link
rel 'a.html' 'dir/index.html' '../a.html'
rel 'x.html' 'a/b/c/index.html' '../../../x.html'
rel 'h/a/x.jpg' 'h/a/sub/page.html' '../x.jpg'
rel 'a/b/c/x.html' 'index.html' 'a/b/c/x.html'
rel 'h/sub/x.jpg' 'h/page.html' 'sub/x.jpg'
rel 'h/dir2/x.jpg' 'h/dir1/page.html' '../dir2/x.jpg' # sibling dir
rel 'h/bc/x.jpg' 'h/b/page.html' '../bc/x.jpg' # b/bc prefix trap
rel 'h/b/x.jpg' 'h/bc/page.html' '../b/x.jpg'
rel 'h2/img/x.jpg' 'h1/p/page.html' '../../h2/img/x.jpg' # cross-host
rel 'img.cdn/photo.jpg' 'www.site/articles/2020/post.html' '../../../img.cdn/photo.jpg'
rel 'h/a/' 'h/a/sub/page.html' '../' # link is ancestor dir
rel 'x.html' 'page.html' 'x.html'
rel 'dir/page.html?x=1' 'dir/index.html?y=2' 'page.html' # ? stripped
### ident_url_relatif
ident 'img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/sub/img.gif'
ident '/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/img.gif'
# embedded ../ collapses (#137)
ident '../img.gif' 'www.foo.com' '/dir/sub/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/articles/2020/logo.png'
ident '../../pix/sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/pix/logo.png'
ident '../../../../x.gif' 'www.foo.com' '/a/b/page.html' 'adr=www.foo.com fil=/x.gif' # above-root clamp
ident '?page=2' 'www.foo.com' '/dir/index.html?old=1' 'adr=www.foo.com fil=/dir/index.html?page=2'
ident 'http://other.com/a/b/../c/index.html' 'www.foo.com' '/p.html' 'adr=other.com fil=/a/c/index.html'
# file:// collapses ../ like the other schemes; traversal contained, // authority kept
ident 'file:///var/data/pix/sub/../logo.png' 'www.foo.com' '/p.html' 'adr=file:// fil=/var/data/pix/logo.png'
ident 'file:///a/b/c/../../d/e.gif' 'www.foo.com' '/p.html' 'adr=file:// fil=/a/d/e.gif'
ident 'file:///a/../../b' 'www.foo.com' '/p.html' 'adr=file:// fil=/b'
ident 'file://srv/share/../x' 'www.foo.com' '/p.html' 'adr=file:// fil=//srv/x'
ident 'mailto:foo@bar.com' 'www.foo.com' '/p.html' 'error=-1' # unsupported scheme
ident 'javascript:void(0)' 'www.foo.com' '/p.html' 'error=-1'
echo "OK"

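The `rel` expectations above follow a simple rule: drop the shared leading directories, climb out of what remains of the current page's directory, then descend into the link's directory. The bash sketch below is an independent illustration of that rule, not HTTrack's `lienrelatif`. The `relpath` name is hypothetical, and it only covers slash-separated relative paths like the ones in the test.

```bash
#!/bin/bash
# Sketch of a relative-path builder (hypothetical, not HTTrack's lienrelatif):
# prints the path from the directory of $2 to $1, ignoring any ?query part.
relpath() {
  local link=${1%%\?*} curr=${2%%\?*}
  local ldir="" cdir="" out="" i=0 j
  local file=${link##*/}
  local -a l=() c=()
  case "$link" in */*) ldir=${link%/*} ;; esac
  case "$curr" in */*) cdir=${curr%/*} ;; esac
  IFS='/' read -ra l <<<"$ldir"
  IFS='/' read -ra c <<<"$cdir"
  # length of the common leading run of directory components
  while ((i < ${#l[@]} && i < ${#c[@]})) && test "${l[i]}" == "${c[i]}"; do
    i=$((i + 1))
  done
  # climb out of what is left of the current directory, then descend
  for ((j = i; j < ${#c[@]}; j++)); do out+="../"; done
  for ((j = i; j < ${#l[@]}; j++)); do out+="${l[j]}/"; done
  printf '%s\n' "$out$file"
}

relpath 'h/bc/x.jpg' 'h/b/page.html' # prints ../bc/x.jpg
relpath 'h/a/' 'h/a/sub/page.html'   # prints ../
```

Comparing whole components rather than string prefixes is what avoids the `b`/`bc` trap that the test file calls out.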
@@ -5,7 +5,7 @@ set -euo pipefail
# path simplify engine (fil_simplifie): collapses ./ and ../ segments. # path simplify engine (fil_simplifie): collapses ./ and ../ segments.
simp() { simp() {
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1 test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
} }
simp './foo/bar/' 'foo/bar/' simp './foo/bar/' 'foo/bar/'
@@ -26,3 +26,17 @@ simp './a/../../b' 'b'
# empty segments ('//') are not dot-segments and are preserved, per RFC 3986 # empty segments ('//') are not dot-segments and are preserved, per RFC 3986
simp 'a//b' 'a//b' simp 'a//b' 'a//b'
simp 'a//b/../c' 'a//c'
# absolute paths keep the leading '/'; above-root '..' is clamped to it
simp '/a/../b' '/b'
simp '/a/../../b' '/b'
simp '/../x' '/x'
# collapses to nothing -> './' (relative) or '/' (absolute)
simp '..' './'
simp 'a/..' './'
simp '/' '/'
simp 'a/b/..' 'a/' # trailing bare '..'
simp 'a/../b?x=../y' 'b?x=../y' # '?' freezes simplification

@@ -21,9 +21,15 @@ test "$out" == "strsafe: OK" || exit 1
# the bounded macro aborts (non-zero exit), so don't let set -e trip on it # the bounded macro aborts (non-zero exit), so don't let set -e trip on it
err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
case "$err" in case "$err" in
*"strsafe: NOT aborted"*) echo "over-capacity write was NOT caught" >&2; exit 1 ;; *"strsafe: NOT aborted"*)
*"overflow while copying"*) ;; echo "over-capacity write was NOT caught" >&2
*) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;; exit 1
;;
*"overflow while copying"*) ;;
*)
echo "expected htssafe overflow abort, got: $err" >&2
exit 1
;;
esac esac
# Same guarantee for the htsbuff builder. The source is exactly the buffer # Same guarantee for the htsbuff builder. The source is exactly the buffer
@@ -32,7 +38,13 @@ esac
# aborted"). Match the specific htsbuff abort message, not just any assert. # aborted"). Match the specific htsbuff abort message, not just any assert.
err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
case "$err" in case "$err" in
*"strsafe: NOT aborted"*) echo "htsbuff over-capacity write was NOT caught" >&2; exit 1 ;; *"strsafe: NOT aborted"*)
*"htsbuff append overflow"*) ;; echo "htsbuff over-capacity write was NOT caught" >&2
*) echo "expected htsbuff overflow abort, got: $err" >&2; exit 1 ;; exit 1
;;
*"htsbuff append overflow"*) ;;
*)
echo "expected htsbuff overflow abort, got: $err" >&2
exit 1
;;
esac esac

@@ -3,6 +3,6 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 5 httrack http://ut.httrack.com/simple/basic.html bash crawl-test.sh --errors 0 --files 5 httrack http://ut.httrack.com/simple/basic.html

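These tests lean on the chain `cmd || ! echo msg || exit N`. When `cmd` fails, `echo` runs and succeeds, the `!` turns that success into a failure, and so the final `exit` fires. When `cmd` succeeds, the rest is skipped. Assuming the usual Automake test driver, exit status 77 marks the test as skipped rather than failed. A small sketch of the same chain as a function follows; `expect` is a hypothetical name, not a helper from this suite.

```bash
#!/bin/bash
# Sketch of the "cmd || ! echo msg || exit N" chain as a reusable function.
# `expect` is an illustrative name; it returns 1 instead of exiting so the
# caller decides what to do.
expect() {
  "$@" || ! echo "FAIL: $*" >&2 || return 1
}

expect test -d /                             # passes silently
expect test 1 -eq 2 || echo "caught failure" # reports FAIL on stderr first
```

Keeping the message and the exit in one list means the check stays a single statement, which works the same with or without `set -e`.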
@@ -3,10 +3,10 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 3 \ bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/cookies/third.html \ --found ut.httrack.com/cookies/third.html \
--found ut.httrack.com/cookies/second.html \ --found ut.httrack.com/cookies/second.html \
--found ut.httrack.com/cookies/entrance.html \ --found ut.httrack.com/cookies/entrance.html \
httrack http://ut.httrack.com/cookies/entrance.php httrack http://ut.httrack.com/cookies/entrance.php

@@ -3,21 +3,21 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests # unicode tests
bash crawl-test.sh \ bash crawl-test.sh \
--errors 1 --files 5 \ --errors 1 --files 5 \
--found 'café.ut.httrack.com/unicode-links/café3860.html' \ --found 'café.ut.httrack.com/unicode-links/café3860.html' \
--found 'café.ut.httrack.com/unicode-links/café30f4.html' \ --found 'café.ut.httrack.com/unicode-links/café30f4.html' \
--found 'café.ut.httrack.com/unicode-links/café5e1f.html' \ --found 'café.ut.httrack.com/unicode-links/café5e1f.html' \
--found 'café.ut.httrack.com/unicode-links/café7b30.html' \ --found 'café.ut.httrack.com/unicode-links/café7b30.html' \
httrack 'http://ut.httrack.com/unicode-links/idna.html' \ httrack 'http://ut.httrack.com/unicode-links/idna.html' \
'+*.ut.httrack.com/*' --robots=0 '+*.ut.httrack.com/*' --robots=0
# unicode tests (bogus links) # unicode tests (bogus links)
bash crawl-test.sh \ bash crawl-test.sh \
--errors 0 --files 1 \ --errors 0 --files 1 \
--found 'ut.httrack.com/unicode-links/idna_bogus.html' \ --found 'ut.httrack.com/unicode-links/idna_bogus.html' \
httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \ httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \
'-*' --robots=0 '-*' --robots=0

@@ -3,67 +3,67 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests # unicode tests
bash crawl-test.sh \ bash crawl-test.sh \
--errors 1 --files 10 \ --errors 1 --files 10 \
--found ut.httrack.com/unicode-links/caf%a91bce.html \ --found ut.httrack.com/unicode-links/caf%a91bce.html \
--found ut.httrack.com/unicode-links/café30f4.html \ --found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \ --found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café463e.html \ --found ut.httrack.com/unicode-links/café463e.html \
--found ut.httrack.com/unicode-links/café5e1f.html \ --found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \ --found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \ --found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/café9fa8.html \ --found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/caféae52.html \ --found ut.httrack.com/unicode-links/caféae52.html \
--found ut.httrack.com/unicode-links/caféc009.html \ --found ut.httrack.com/unicode-links/caféc009.html \
--found ut.httrack.com/unicode-links/utf8.html \ --found ut.httrack.com/unicode-links/utf8.html \
httrack http://ut.httrack.com/unicode-links/utf8.html httrack http://ut.httrack.com/unicode-links/utf8.html
bash crawl-test.sh \ bash crawl-test.sh \
--errors 4 --files 7 \ --errors 4 --files 7 \
--found ut.httrack.com/unicode-links/café3860.html \ --found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café9fa8.html \ --found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/café30f4.html \ --found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \ --found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \ --found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \ --found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caf%e939bd.html \ --found ut.httrack.com/unicode-links/caf%e939bd.html \
--found ut.httrack.com/unicode-links/caf%e9ae52.html \ --found ut.httrack.com/unicode-links/caf%e9ae52.html \
--found ut.httrack.com/unicode-links/caféaec2.html \ --found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \ --found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/default.html \ --found ut.httrack.com/unicode-links/default.html \
httrack http://ut.httrack.com/unicode-links/default.html httrack http://ut.httrack.com/unicode-links/default.html
bash crawl-test.sh \ bash crawl-test.sh \
--errors 2 --files 9 \ --errors 2 --files 9 \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \ --found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \ --found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café30f4.html \ --found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \ --found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café5e1f.html \ --found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café647f.html \ --found ut.httrack.com/unicode-links/café647f.html \
--found ut.httrack.com/unicode-links/café7b30.html \ --found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \ --found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caféaec2.html \ --found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \ --found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/iso88591.html \ --found ut.httrack.com/unicode-links/iso88591.html \
httrack http://ut.httrack.com/unicode-links/iso88591.html httrack http://ut.httrack.com/unicode-links/iso88591.html
bash crawl-test.sh \ bash crawl-test.sh \
--errors 4 --files 9 \ --errors 4 --files 9 \
--found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \ --found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \ --found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café8007.html \ --found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/cafébf43.html \ --found ut.httrack.com/unicode-links/cafébf43.html \
--found ut.httrack.com/unicode-links/cafédcd8.html \ --found ut.httrack.com/unicode-links/cafédcd8.html \
--found ut.httrack.com/unicode-links/café2461.html \ --found ut.httrack.com/unicode-links/café2461.html \
--found ut.httrack.com/unicode-links/caf%a8%a61bce.html \ --found ut.httrack.com/unicode-links/caf%a8%a61bce.html \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \ --found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/café7b30.html \ --found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café30f4.html \ --found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \ --found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café3860.html \ --found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/gb18030.html \ --found ut.httrack.com/unicode-links/gb18030.html \
httrack http://ut.httrack.com/unicode-links/gb18030.html httrack http://ut.httrack.com/unicode-links/gb18030.html


@@ -3,10 +3,10 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=42&can=1 # http://code.google.com/p/httrack/issues/detail?id=42&can=1
# we expect 2 errors only because other links are too longs (to be modified if suitable) # we expect 2 errors only because other links are too longs (to be modified if suitable)
bash crawl-test.sh --errors 2 --files 1 \ bash crawl-test.sh --errors 2 --files 1 \
--found ut.httrack.com/overflow/longquerywithaccents.html \ --found ut.httrack.com/overflow/longquerywithaccents.html \
httrack http://ut.httrack.com/overflow/longquerywithaccents.php httrack http://ut.httrack.com/overflow/longquerywithaccents.php


@@ -3,45 +3,45 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=4&can=1 # http://code.google.com/p/httrack/issues/detail?id=4&can=1
bash crawl-test.sh --errors 0 --files 4 \ bash crawl-test.sh --errors 0 --files 4 \
--found ut.httrack.com/parsing/back5e1f.gif \ --found ut.httrack.com/parsing/back5e1f.gif \
--found ut.httrack.com/parsing/events.html \ --found ut.httrack.com/parsing/events.html \
--found ut.httrack.com/parsing/fade230f4.gif \ --found ut.httrack.com/parsing/fade230f4.gif \
--found ut.httrack.com/parsing/fade3860.gif \ --found ut.httrack.com/parsing/fade3860.gif \
httrack http://ut.httrack.com/parsing/events.html httrack http://ut.httrack.com/parsing/events.html
# http://code.google.com/p/httrack/issues/detail?id=2&can=1 # http://code.google.com/p/httrack/issues/detail?id=2&can=1
bash crawl-test.sh --errors 0 --files 3 \ bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/background-image.css \ --found ut.httrack.com/parsing/background-image.css \
--found ut.httrack.com/parsing/background-image.html \ --found ut.httrack.com/parsing/background-image.html \
--found ut.httrack.com/parsing/fade.gif \ --found ut.httrack.com/parsing/fade.gif \
httrack http://ut.httrack.com/parsing/background-image.html httrack http://ut.httrack.com/parsing/background-image.html
# javascript parsing # javascript parsing
bash crawl-test.sh --errors 0 --files 3 \ bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/back.gif \ --found ut.httrack.com/parsing/back.gif \
--found ut.httrack.com/parsing/fade.gif \ --found ut.httrack.com/parsing/fade.gif \
--found ut.httrack.com/parsing/javascript.html \ --found ut.httrack.com/parsing/javascript.html \
httrack http://ut.httrack.com/parsing/javascript.html httrack http://ut.httrack.com/parsing/javascript.html
# handling of + before query string # handling of + before query string
bash crawl-test.sh --errors 0 --files 6 \ bash crawl-test.sh --errors 0 --files 6 \
--found ut.httrack.com/parsing/escaping.html \ --found ut.httrack.com/parsing/escaping.html \
--found "ut.httrack.com/parsing/foo bar30f4.html" \ --found "ut.httrack.com/parsing/foo bar30f4.html" \
--found "ut.httrack.com/parsing/foo bar5e1f.html" \ --found "ut.httrack.com/parsing/foo bar5e1f.html" \
--found "ut.httrack.com/parsing/foo+bar+plus3860.html" \ --found "ut.httrack.com/parsing/foo+bar+plus3860.html" \
--found "ut.httrack.com/parsing/foo barae52.html" \ --found "ut.httrack.com/parsing/foo barae52.html" \
--found "ut.httrack.com/parsing/foo bar7b30.html" \ --found "ut.httrack.com/parsing/foo bar7b30.html" \
httrack http://ut.httrack.com/parsing/escaping.html httrack http://ut.httrack.com/parsing/escaping.html
# handling of # encoded in filename # handling of # encoded in filename
# see http://code.google.com/p/httrack/issues/detail?id=25 # see http://code.google.com/p/httrack/issues/detail?id=25
bash crawl-test.sh --errors 2 --files 4 \ bash crawl-test.sh --errors 2 --files 4 \
--found "ut.httrack.com/parsing/escaping2.html" \ --found "ut.httrack.com/parsing/escaping2.html" \
--found "ut.httrack.com/parsing/++foo++bar++plus++.html" \ --found "ut.httrack.com/parsing/++foo++bar++plus++.html" \
--found "ut.httrack.com/parsing/foo#bar#.html" \ --found "ut.httrack.com/parsing/foo#bar#.html" \
--found "ut.httrack.com/parsing/foo bar.html" \ --found "ut.httrack.com/parsing/foo bar.html" \
httrack http://ut.httrack.com/parsing/escaping2.html httrack http://ut.httrack.com/parsing/escaping2.html


@@ -3,11 +3,11 @@
set -euo pipefail set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77 bash check-network.sh || ! echo "skipping online unit tests" || exit 77
if test "${HTTPS_SUPPORT:-}" == "no"; then if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping" echo "no https support compiled, skipping"
exit 77 exit 77
fi fi
bash crawl-test.sh --errors 0 --files 5 httrack https://ut.httrack.com/simple/basic.html bash crawl-test.sh --errors 0 --files 5 httrack https://ut.httrack.com/simple/basic.html


@@ -35,6 +35,7 @@ TESTS = \
01_engine-mime.test \ 01_engine-mime.test \
01_engine-parse.test \ 01_engine-parse.test \
01_engine-rcfile.test \ 01_engine-rcfile.test \
01_engine-relative.test \
01_engine-simplify.test \ 01_engine-simplify.test \
01_engine-strsafe.test \ 01_engine-strsafe.test \
02_manpage-regen.test \ 02_manpage-regen.test \


@@ -6,39 +6,39 @@
# do not enable online tests (./configure --disable-online-unit-tests) # do not enable online tests (./configure --disable-online-unit-tests)
if test "$ONLINE_UNIT_TESTS" == "no"; then if test "$ONLINE_UNIT_TESTS" == "no"; then
echo "online tests are disabled" >&2 echo "online tests are disabled" >&2
exit 1 exit 1
# enable online tests (--enable-online-unit-tests) # enable online tests (--enable-online-unit-tests)
elif test "$ONLINE_UNIT_TESTS" == "yes"; then elif test "$ONLINE_UNIT_TESTS" == "yes"; then
exit 0 exit 0
# check if online tests are reachable # check if online tests are reachable
else else
# test url # test url
url=http://ut.httrack.com/enabled url=http://ut.httrack.com/enabled
# cache file name # cache file name
cache=check-network_sh.cache cache=check-network_sh.cache
# cached result ? # cached result ?
if test -f $cache ; then if test -f $cache; then
if grep -q "ok" $cache ; then if grep -q "ok" $cache; then
exit 0 exit 0
else else
echo "online tests are disabled (cached)" >&2 echo "online tests are disabled (cached)" >&2
exit 1 exit 1
fi fi
# fetch single file # fetch single file
elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null ; then elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null; then
echo "ok" > $cache echo "ok" >$cache
exit 0 exit 0
else else
echo "error" > $cache echo "error" >$cache
echo "online tests are disabled (auto)" >&2 echo "online tests are disabled (auto)" >&2
exit 1 exit 1
fi fi
fi fi


@@ -2,185 +2,184 @@
# #
function warning { function warning {
echo "** $*" >&2 echo "** $*" >&2
return 0 return 0
} }
function die { function die {
warning "$*" warning "$*"
exit 1 exit 1
} }
function debug { function debug {
if test -n "$verbose"; then if test -n "$verbose"; then
echo "$*" >&2 echo "$*" >&2
fi fi
} }
function info { function info {
printf "[$*] ..\t" >&2 printf '[%s] ..\t' "$*" >&2
} }
function result { function result {
echo "$*" >&2 echo "$*" >&2
} }
function cleanup { function cleanup {
debug "cleaning function called" debug "cleaning function called"
if test -n "$tmpdir"; then if test -n "$tmpdir"; then
if test -d "$tmpdir"; then if test -d "$tmpdir"; then
if test -z "$nopurge"; then if test -z "$nopurge"; then
debug "cleaning up $tmpdir" debug "cleaning up $tmpdir"
rm -rf "$tmpdir" rm -rf "$tmpdir"
fi fi
fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi
} }
function usage { function usage {
cat << EOF cat <<EOF
usage: $0 usage: $0
EOF EOF
} }
function assert_equals { function assert_equals {
info "$1" info "$1"
if test ! "$2" == "$3"; then if test ! "$2" == "$3"; then
result "expected '$2', got '$3'" result "expected '$2', got '$3'"
exit 1 exit 1
else else
result "OK ($2)" result "OK ($2)"
fi fi
} }
function start-crawl { function start-crawl {
# parse args # parse args
pos=1 pos=1
while test "$#" -ge "$pos" ; do while test "$#" -ge "$pos"; do
case "${!pos}" in case "${!pos}" in
--debug) --debug)
verbose=1 verbose=1
;; ;;
--no-purge|--summary|--print-files) --no-purge | --summary | --print-files) ;;
;;
--errors|--files|--found|--not-found|--directory) --errors | --files | --found | --not-found | --directory)
pos=$[${pos}+1] pos=$((pos + 1))
test "$#" -ge "$pos" || warning "missing argument" || return 1 test "$#" -ge "$pos" || warning "missing argument" || return 1
;; ;;
httrack) httrack)
pos=$[${pos}+1] pos=$((pos + 1))
break; break
;; ;;
*) *)
warning "unrecognized option ${!pos}" warning "unrecognized option ${!pos}"
return 1 return 1
;; ;;
esac esac
pos=$[${pos}+1] pos=$((pos + 1))
done done
debug "remaining args: ${@:${pos}}" debug "remaining args: ${*:pos}"
# ut/ won't exceed 2 minutes # ut/ won't exceed 2 minutes
moreargs="--quiet --max-time=120 --timeout=30 --connection-per-second=5" moreargs=(--quiet --max-time=120 --timeout=30 --connection-per-second=5)
# proxy environment ? # proxy environment ?
if test -n "$http_proxy"; then if test -n "${http_proxy:-}"; then
moreargs="$moreargs --proxy $http_proxy" moreargs+=(--proxy "$http_proxy")
fi fi
test -n "$tmpdir" || ! warning "no tmpdir" || return 1 test -n "$tmpdir" || ! warning "no tmpdir" || return 1
tmp="${tmpdir}/crawl" tmp="${tmpdir}/crawl"
rm -rf "$tmp"
mkdir "$tmp" || ! warning "could not create $tmp" || return 1
which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" ${moreargs} ${@:${pos}}
info "running httrack ${@:${pos}}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" ${moreargs} ${@:${pos}} >"${log}" 2>&1 &
crawlpid="$!"
debug "started cralwer on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" \
| sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break;
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
rm -rf "$tmp" rm -rf "$tmp"
else mkdir "$tmp" || ! warning "could not create $tmp" || return 1
tmpdir=
fi which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" "${moreargs[@]}" "${@:pos}"
info "running httrack ${*:pos}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" "${moreargs[@]}" "${@:pos}" >"${log}" 2>&1 &
crawlpid="$!"
debug "started cralwer on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking for $1"
if test -f "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" |
sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
rm -rf "$tmp"
else
tmpdir=
fi
} }
# check args # check args
@@ -195,7 +194,7 @@ tmpdir=
crawlpid= crawlpid=
nopurge= nopurge=
verbose= verbose=
trap "cleanup" 0 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25 trap cleanup EXIT HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# working directory # working directory
tmpdir="${tmptopdir}/httrack_ut.$$" tmpdir="${tmptopdir}/httrack_ut.$$"


@@ -2,19 +2,19 @@
# #
error=0 error=0
for i in *.test ; do for i in *.test; do
if bash $i ; then if bash "$i"; then
echo "$i: passed" >&2 echo "$i: passed" >&2
else else
echo "$i: ERROR" >&2 echo "$i: ERROR" >&2
error=$[${error}+1] error=$((error + 1))
fi fi
done done
if test "$error" -eq 0; then if test "$error" -eq 0; then
echo "all tests passed" >&2 echo "all tests passed" >&2
else else
echo "${error} test(s) failed" >&2 echo "${error} test(s) failed" >&2
fi fi
exit $error exit $error