Compare commits

..

9 Commits

Author SHA1 Message Date
Xavier Roche
fe7041ddbf Address review: keep empty-PATH parity, fold the CI script list
Review of the array refactor flagged one behaviour divergence: splitting
PATH with `IFS=: read -ra` keeps empty fields (from doubled or leading
colons) as "" elements, where the old `echo $PATH | tr : ' '` word-split
dropped them, so the search loop would probe /htsserver at the filesystem
root. Skip the empty fields to restore exact parity.

Also reflow the CI SHELL_SCRIPTS list as a folded block scalar, one
entry per line and sorted, so it reads cleanly; the folded value is the
same space-separated string.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 12:39:31 +02:00
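The empty-field divergence described in fe7041ddbf can be reproduced in isolation. The following is a minimal bash sketch, assuming a made-up PATH-like string (`sample_path`) and illustrative variable names that do not appear in webhttrack:

```bash
#!/usr/bin/env bash
# Hypothetical PATH-like value with a leading and a doubled colon.
sample_path=':/usr/bin::/bin'

# Old style: colons become spaces and word splitting drops the empty fields.
# shellcheck disable=SC2207
old_fields=($(echo "$sample_path" | tr ':' ' '))

# Array style: read -a keeps each empty field as a "" element.
IFS=':' read -ra raw_fields <<<"$sample_path"

# Parity fix: copy only the non-empty fields.
kept_fields=()
for d in "${raw_fields[@]}"; do
    if test -n "$d"; then
        kept_fields+=("$d")
    fi
done

echo "old=${#old_fields[@]} raw=${#raw_fields[@]} kept=${#kept_fields[@]}"
```

With this input the raw array should hold two more elements than the old word-split list, and the filtered array should match the old count again.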
Xavier Roche
f5543df1af ci: lint every shell script with shellcheck and shfmt
The lint job only covered a handful of scripts; bootstrap, build.sh, the
generators, webhttrack, the CGI search helper and the crawl/run-all test
harnesses went unchecked, and shfmt ran on three files. Now both linters
run over the whole tracked shell tree, listed once in a job-level env var
so the two steps stay in sync.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:37:09 +02:00
Xavier Roche
fee30aa95d Make every shell script shellcheck-clean
Fix the shellcheck findings the shfmt pass left behind, all proven
behaviour-preserving:

- Quote single-value expansions, drop the redundant ${} in arithmetic,
  add read -r, and use printf '%s' instead of variables in format
  strings, across the generators, crawl-test.sh, run-all-tests.sh and
  search.sh.
- crawl-test.sh / webhttrack: turn the deliberately word-split search
  lists into bash arrays (space-safe, no scattered disables) and replace
  the numeric trap signal lists with names, dropping the un-trappable
  KILL/STOP that bash silently ignored anyway.
- search.sh: drop the bogus \" escapes that made grep search for a
  literal-quoted pattern.

The generators are exercised by hand and ship their committed output
(htscodepages.h, htsentities.h); a differential run on synthetic input
confirms byte-identical output before and after. crawl-test.sh and
webhttrack were run end to end against a local server / a faked install,
the latter also proving the array search now survives spaces in paths.
SC2153/SC2120 false positives carry a scoped disable with a reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:35:55 +02:00
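The printf change listed in fee30aa95d matters whenever an interpolated value may contain a `%`. Here is a small bash sketch, assuming a hypothetical file name (`label`) that is not one of the real charset tables:

```bash
#!/usr/bin/env bash
# Hypothetical file name that happens to contain a printf directive.
label='CP%s.TXT'

# Variable inside the format string: printf interprets the embedded %s
# (no argument is supplied for it, so it expands to nothing).
# shellcheck disable=SC2059
unsafe=$(printf "/* Table for $label */")

# Fixed format string: the value is passed as data and printed verbatim.
safe=$(printf '/* Table for %s */' "$label")

echo "unsafe: $unsafe"
echo "safe:   $safe"
```

For the real inputs, which are plain ASCII names without `%`, both forms give the same bytes. That is consistent with the byte-identical differential run reported in the commit.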
Xavier Roche
f9f4700ee1 Reformat every shell script with shfmt -i 4
Mechanical pass: run shfmt -i 4 over the whole tracked shell tree (the
test harness .test files, the regen generators, webhttrack, the CGI
search helper, and the build/dist scripts) so they share one style.
shfmt also normalised backticks to $(...) and $[..] to $((..)).

No behaviour change: arithmetic is preserved exactly, non-ASCII bytes
are untouched, and the full make check suite still passes. The tab
indented .test files become 4-space indented, hence the wide diff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:24:01 +02:00
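The two normalisations mentioned in f9f4700ee1 are spelling changes only. The bash sketch below uses an arbitrary counter value and a hypothetical path, and compares the legacy and normalised forms side by side:

```bash
#!/usr/bin/env bash
i=9

# Legacy spellings: bash still accepts them, but both are deprecated.
# shellcheck disable=SC2006,SC2007
legacy_mod=$[${i}%8]
# shellcheck disable=SC2006
legacy_dir=`dirname "/usr/local/bin/httrack"`

# Normalised spellings, as a formatter pass would leave them.
modern_mod=$((${i} % 8))
modern_dir=$(dirname "/usr/local/bin/httrack")

echo "mod: $legacy_mod / $modern_mod"
echo "dir: $legacy_dir / $modern_dir"
```

Both pairs should print identical values, which is why the pass can be treated as mechanical.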
Xavier Roche
f030fa21e3 Merge pull request #401 from xroche/fix/relative-path-dotdot-137-162
Test the relative-link engine; collapse ../ in file:// URLs
2026-06-20 11:15:53 +02:00
Xavier Roche
bdd1c1bc2c Test the relative-link engine; collapse ../ in file:// URLs
The ../-handling tickets #137 (embedded ../ in a URL) and #162 (cross-host
"too many ../") do not reproduce on master or the released 3.49.x: the engine
has resolved embedded, cross-host, out-of-scope and above-root ../ correctly
since the 2012 import, and the released binary behaves identically. #137's
actual breakage was a JS-generated iframe URL (httrack can't rewrite
dynamically-built links); #162 is a long-gone Windows path quirk.

The area was nearly untested, though, despite feeding both link rewriting and
crawl-scope decisions: two trivial lienrelatif asserts, none for
ident_url_relatif. Add a wide regression net via two hidden debug probes
(-#l lienrelatif, -#i ident_url_relatif, mirroring -#1 fil_simplifie) driving
tens of cases in tests/01_engine-relative.test (embedded/cross-host/sibling/
ancestor/above-root ../, query stripping, scheme handling), plus the missing
fil_simplifie edge cases (absolute paths, root clamp, query freeze) in
01_engine-simplify.test. Expected values are computed by hand, not echoed.

While covering it, fixed one real gap: the file:// branch of
ident_url_absolute skipped the fil_simplifie its http sibling runs, so file://
URLs kept their ../ in adrfil->fil while the save path was already collapsed
(htsname.c:1343). Collapsing it matches the other schemes, contains traversal
at the file:// root, and dedups a/../b against b.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 11:14:28 +02:00
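What "collapsing ../" means for a path can be sketched outside the engine. The helper below is a hypothetical bash function (`collapse_dotdot`), not httrack's fil_simplifie: it assumes an absolute path, does not treat query strings specially, and drops any trailing slash.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not the engine's code): collapse "." and ".." segments
# of an absolute path, clamping ".." at the root instead of escaping it.
collapse_dotdot() {
    local path=$1 seg
    local -a parts out=()
    local IFS='/'
    read -ra parts <<<"$path"
    for seg in "${parts[@]}"; do
        case $seg in
        '' | .) ;; # empty or "." segment: nothing to keep
        ..)
            # step back one level, unless already at the root
            if ((${#out[@]} > 0)); then
                unset 'out[${#out[@]}-1]'
            fi
            ;;
        *) out+=("$seg") ;;
        esac
    done
    # "${out[*]}" joins the kept segments with the first IFS character, "/"
    printf '/%s\n' "${out[*]}"
}

collapse_dotdot '/dir/sub/../img.gif'
collapse_dotdot '/a/../b'
collapse_dotdot '/../../x'
```

The second call shows the deduplication case from the commit message (a/../b against b), and the third shows traversal being contained at the root.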
Xavier Roche
56665a268f Merge pull request #400 from xroche/fix/css-url-paren-163
Encode parens in rewritten CSS url() so the value isn't truncated (#163)
2026-06-20 10:02:32 +02:00
Xavier Roche
2e948b9acd htsparse: percent-encode parens in rewritten CSS url() (#163)
A source url(...) whose target encodes '(' ')' as %28/%29 was rewritten
with literal parens, because they are RFC2396 "mark" characters that the
URI escaper (escape_uri_utf, mode 30) leaves alone. In an unquoted CSS
url(...) the literal ')' closes the token early, so the browser mis-parses
the value and drops the background image.

Re-escape '(' and ')' back to %28/%29 when emitting the link, gated on the
url() context (ending_p == ')'). The UA decodes them to the saved-on-disk
name, so the reference still resolves. Quoted url("...") and ordinary HTML
attributes keep their parens, matching prior behavior.

Test in 01_engine-parse.test crawls a CSS fixture whose url() references a
%20%28...%29 name and asserts the rewrite keeps the parens encoded;
negative control confirmed (literal-paren output fails it).

Closes #163

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-20 10:01:17 +02:00
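The re-escaping step from 2e948b9acd can be illustrated in shell. This is a sketch only: `escape_parens` is a hypothetical helper that mirrors the idea of the commit, not the C function added to htsparse, and it assumes the link has already been through the usual URI escaping.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percent-encode literal parentheses in a link so it can
# sit inside an unquoted CSS url(...) without closing the token early.
escape_parens() {
    local s=$1 open='(' close=')'
    s=${s//"$open"/%28}
    s=${s//"$close"/%29}
    printf '%s\n' "$s"
}

link='img%20(1).gif'
echo "url($(escape_parens "$link"))"
```

In the engine the equivalent step is gated on the url() context, so quoted url("...") values and ordinary HTML attributes keep their literal parens. The sketch leaves that decision to the caller.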
Xavier Roche
cae11499f1 Merge pull request #399 from xroche/fix/js-string-falsepos-218
htsparse: don't treat XHR.open's method argument as a URL (#218)
2026-06-19 20:36:26 +02:00
27 changed files with 738 additions and 494 deletions

@@ -320,6 +320,21 @@ jobs:
lint:
name: lint (shellcheck, shfmt)
runs-on: ubuntu-24.04
# Every tracked shell script; the globs expand at run time. Kept here so the
# shellcheck and shfmt steps below cannot drift apart.
env:
SHELL_SCRIPTS: >-
.githooks/pre-commit
bootstrap
build.sh
html/div/search.sh
man/makeman.sh
src/htsbasiccharsets.sh
src/htsentities.sh
src/webhttrack
tests/*.sh
tests/*.test
tools/mkdeb.sh
steps:
- uses: actions/checkout@v6
@@ -332,12 +347,11 @@ jobs:
sudo apt-get install -y --no-install-recommends shellcheck shfmt
shfmt --version
# Lint the scripts we maintain; the legacy scripts are a separate cleanup.
- name: shellcheck
run: shellcheck man/makeman.sh tools/mkdeb.sh .githooks/pre-commit tests/*.test tests/check-network.sh
run: shellcheck $SHELL_SCRIPTS
- name: shfmt
run: shfmt -d -i 4 man/makeman.sh tools/mkdeb.sh .githooks/pre-commit
run: shfmt -d -i 4 $SHELL_SCRIPTS
# Check clang-format on CHANGED LINES ONLY. The engine predates clang-format
# (it was shaped by an old Visual Studio formatter) and does not round-trip,

@@ -1,8 +1,7 @@
#!/bin/sh
# Simple indexing test using HTTrack
# A "real" script/program would use advanced search, and
# A "real" script/program would use advanced search, and
# use dichotomy to find the word in the index.txt file
# This script is really basic and NOT optimized, and
# should not be used for professional purpose :)
@@ -11,50 +10,49 @@ TESTSITE="http://localhost/"
# Create an index if necessary
if ! test -f "index.txt"; then
echo "Building the index .."
rm -rf test
httrack --display "$TESTSITE" -%I -O test
mv test/index.txt ./
echo "Building the index .."
rm -rf test
httrack --display "$TESTSITE" -%I -O test
mv test/index.txt ./
fi
# Convert crlf to lf
if test "`head index.txt -n 1 | tr '\r' '#' | grep -c '#'`" = "1"; then
echo "Converting index to Unix LF style (not CR/LF) .."
mv -f index.txt index.txt.old
cat index.txt.old|tr -d '\r' > index.txt
if test "$(head index.txt -n 1 | tr '\r' '#' | grep -c '#')" = "1"; then
echo "Converting index to Unix LF style (not CR/LF) .."
mv -f index.txt index.txt.old
tr -d '\r' <index.txt.old >index.txt
fi
keyword=-
while test -n "$keyword"; do
printf "Enter a keyword: "
read keyword
printf "Enter a keyword: "
read -r keyword
if test -n "$keyword"; then
FOUNDK="`grep -niE \"^$keyword\" index.txt`"
if test -n "$keyword"; then
FOUNDK="$(grep -niE "^$keyword" index.txt)"
if test -n "$FOUNDK"; then
if ! test `echo "$FOUNDK"|wc -l` = "1"; then
# Multiple matches
printf "Found multiple keywords: "
echo "$FOUNDK"|cut -f2 -d':'|tr '\n' ' '
echo ""
echo "Use keyword$ to find only one"
else
# One match
N=`echo "$FOUNDK"|cut -f1 -d':'`
PM=`tail +$N index.txt|grep -nE "\("|head -n 1`
if ! echo "$PM"|grep "ignored">/dev/null; then
M=`echo $PM|cut -f1 -d':'`
echo "Found in:"
cat index.txt | tail "+$N" | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
else
echo "keyword ignored (too many hits)"
fi
fi
else
echo "not found"
fi
if test -n "$FOUNDK"; then
if ! test "$(echo "$FOUNDK" | wc -l)" = "1"; then
# Multiple matches
printf "Found multiple keywords: "
echo "$FOUNDK" | cut -f2 -d':' | tr '\n' ' '
echo ""
echo "Use keyword$ to find only one"
else
# One match
N=$(echo "$FOUNDK" | cut -f1 -d':')
PM=$(tail "+$N" index.txt | grep -nE "\(" | head -n 1)
if ! echo "$PM" | grep "ignored" >/dev/null; then
M=$(echo "$PM" | cut -f1 -d':')
echo "Found in:"
tail "+$N" index.txt | head -n "$M" | grep -E "[0-9]* " | cut -f2 -d' '
else
echo "keyword ignored (too many hits)"
fi
fi
else
echo "not found"
fi
fi
fi
done

@@ -3,57 +3,59 @@
# Change this to download files
if false; then
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/CP*.TXT" | lftp
echo "mget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8*.TXT" | lftp
rm -f CP932.TXT CP936.TXT CP949.TXT CP950.TXT
fi
# Produce code
printf "/** GENERATED FILE ($0), DO NOT EDIT **/\n\n"
for i in *.TXT ; do
echo "processing $i" >&2
grep -vE "^(#|$)" $i | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' | \
(
unset arr
while read LINE ; do
from=$[$(echo $LINE | cut -f1 -d' ')]
if ! test -n "$from"; then
echo "error with $i" >&2
exit 1
elif test $from -ge 256; then
echo "out-of-range ($LINE) with $i" >&2
exit 1
fi
to=$(echo $LINE | cut -f2 -d' ')
arr[$from]=$to
done
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf "/* Table for $i */\nstatic const hts_UCS4 table_${name}[256] = {\n "
i=0
while test "$i" -lt 256; do
if test "$i" -gt 0; then
printf ", "
if test $[${i}%8] -eq 0; then
printf "\n "
fi
fi
value=${arr[$i]:-0}
printf "0x%04x" $value
i=$[${i}+1]
done
printf " };\n\n"
)
echo "processed $i" >&2
printf '/** GENERATED FILE (%s), DO NOT EDIT **/\n\n' "$0"
for i in *.TXT; do
echo "processing $i" >&2
grep -vE "^(#|$)" "$i" | grep -E "^0x" | sed -e 's/[[:space:]]/ /g' | cut -f1,2 -d' ' |
(
unset arr
while read -r LINE; do
from=$(($(echo "$LINE" | cut -f1 -d' ')))
if ! test -n "$from"; then
echo "error with $i" >&2
exit 1
elif test $from -ge 256; then
echo "out-of-range ($LINE) with $i" >&2
exit 1
fi
to=$(echo "$LINE" | cut -f2 -d' ')
arr[from]=$to
done
# shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf '/* Table for %s */\nstatic const hts_UCS4 table_%s[256] = {\n ' "$i" "$name"
idx=0
while test "$idx" -lt 256; do
if test "$idx" -gt 0; then
printf ", "
if test $((idx % 8)) -eq 0; then
printf "\n "
fi
fi
value=${arr[$idx]:-0}
printf "0x%04x" "$value"
idx=$((idx + 1))
done
printf " };\n\n"
)
echo "processed $i" >&2
done
# Indexes
printf "static const struct {\n const char *name;\n const hts_UCS4 *table;\n} table_mappings[] = {\n"
for i in *.TXT ; do
name=$(echo $i | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf " { \"$(echo $name | tr -d '_')\", table_${name} },\n"
for i in *.TXT; do
# shellcheck disable=SC2018,SC2019 # charset filenames are ASCII; keep C-locale A-Z/a-z
name=$(echo "$i" | tr 'A-Z' 'a-z' | tr '-' '_' | sed -e 's/\.txt//' -e 's/8859/iso_8859/')
printf ' { "%s", table_%s },\n' "$(echo "$name" | tr -d '_')" "$name"
done
printf " { NULL, NULL }\n};\n"

@@ -2787,6 +2787,47 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 0;
}
break;
case 'l': /* lienrelatif: relative link from curr_fil to link */
if (na + 2 >= argc) {
HTS_PANIC_PRINTF(
"Option #l needs a link and a current-file path");
printf(
"Example: '-#l' 'host/dir/img.gif' 'host/dir/p.html'\n");
htsmain_free();
return -1;
} else {
char s[HTS_URLMAXSIZE * 2];
if (lienrelatif(s, sizeof(s), argv[na + 1], argv[na + 2]) ==
0)
printf("relative=%s\n", s);
else
printf("relative=<ERROR>\n");
htsmain_free();
return 0;
}
break;
case 'i': /* ident_url_relatif: resolve a link -> adr/fil */
if (na + 3 >= argc) {
HTS_PANIC_PRINTF(
"Option #i needs a link, an origin address and file");
printf("Example: '-#i' '../img.gif' 'www.foo.com' "
"'/d/p.html'\n");
htsmain_free();
return -1;
} else {
lien_adrfil af;
const int r = ident_url_relatif(argv[na + 1], argv[na + 2],
argv[na + 3], &af);
if (r == 0)
printf("adr=%s fil=%s\n", af.adr, af.fil);
else
printf("error=%d\n", r);
htsmain_free();
return 0;
}
break;
case '2': // mimedefs
if (na + 1 >= argc) {
HTS_PANIC_PRINTF("Option #2 needs to be followed by an URL");

@@ -33,43 +33,43 @@ EOF
else
GET "${url}"
fi
) \
| grep -E '^<!ENTITY [a-zA-Z0-9_]' \
| sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/'\
| ( \
read A
while test -n "$A"; do
ent="${A%% *}"
code=$(echo "$A"|cut -f2 -d' ')
# compute hash
hash=0
i=0
a=1664525
c=1013904223
m="$[1 << 32]"
while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}"|hexdump -v -e '/1 "%d"')"
hash="$[((${hash}*${a})%(${m})+${d}+${c})%(${m})]"
i=$[${i}+1]
done
echo -e " /* $A */"
echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};"
echo -e " }"
echo -e " break;"
) |
grep -E '^<!ENTITY [a-zA-Z0-9_]' |
sed \
-e 's/<!ENTITY //' -e "s/[[:space:]][[:space:]]*/ /g" \
-e 's/-->$//' \
-e 's/\([^ ]*\) CDATA "&#\([^\"]*\);" -- \(.*\)/\1 \2 \3/' |
(
read -r A
while test -n "$A"; do
ent="${A%% *}"
code=$(echo "$A" | cut -f2 -d' ')
# compute hash
hash=0
i=0
a=1664525
c=1013904223
m="$((1 << 32))"
while test "$i" -lt ${#ent}; do
d="$(echo -n "${ent:${i}:1}" | hexdump -v -e '/1 "%d"')"
hash="$((((hash * a) % (m) + d + c) % (m)))"
i=$((i + 1))
done
echo -e " /* $A */"
echo -e " case ${hash}u:"
echo -e " if (len == ${#ent} /* && strncmp(ent, \"${ent}\") == 0 */) {"
echo -e " return ${code};"
echo -e " }"
echo -e " break;"
# next
read A
done
)
# next
read -r A
done
)
cat <<EOF
}
/* unknown */
return -1;
}
EOF
) > ${dest}
) >${dest}

@@ -2605,6 +2605,8 @@ int ident_url_absolute(const char *url, lien_adrfil *adrfil) {
for(i = 0; adrfil->fil[i] != '\0'; i++)
if (adrfil->fil[i] == '\\')
adrfil->fil[i] = '/';
// collapse ../ like the http branch above (path-traversal safety)
fil_simplifie(adrfil->fil);
}
// no hostname

@@ -317,6 +317,27 @@ static int is_http_method(const char *s, size_t len) {
return 0;
}
/* Percent-encode '(' and ')' in a link emitted into an unquoted url(...) (CSS
or JS): a literal ')' closes the token early and the UA mis-parses the value
(#163). The UA decodes %28/%29 back to the saved-on-disk name. */
static void escape_url_parens(char *const s, const size_t size) {
char BIGSTK buff[HTS_URLMAXSIZE * 2];
size_t i, j;
for (i = 0, j = 0; s[i] != '\0' && j + 3 < size && j + 3 < sizeof(buff);
i++) {
if (s[i] == '(' || s[i] == ')') {
buff[j++] = '%';
buff[j++] = '2';
buff[j++] = s[i] == '(' ? '8' : '9';
} else {
buff[j++] = s[i];
}
}
buff[j] = '\0';
strlcpybuff(s, buff, size);
}
/* Main parser */
int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
char catbuff[CATBUFF_SIZE];
@@ -3027,6 +3048,10 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
/* Never escape high-chars (we don't know the encoding!!) */
inplace_escape_uri_utf(tempo, sizeof(tempo));
// unquoted url() (CSS/JS): keep parens escaped
if (ending_p == ')')
escape_url_parens(tempo, sizeof(tempo));
//if (!no_esc_utf)
// escape_uri(tempo); // escape with %xx
//else {

@@ -4,131 +4,140 @@
# Initializes the htsserver GUI frontend and launch the default browser
BROWSEREXE=
SRCHBROWSEREXE="x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition"
SRCHBROWSEREXE=(x-www-browser www-browser iceape mozilla firefox-developer-edition firefox icecat iceweasel abrowser firebird galeon konqueror midori opera google-chrome chrome chromium chromium-browser netscape firefox-developer-edition)
# shellcheck disable=SC2153 # BROWSER is the standard freedesktop env var, not a typo
if test -n "${BROWSER}"; then
# sensible-browser will f up if BROWSER is not set
SRCHBROWSEREXE="xdg-open sensible-browser ${SRCHBROWSEREXE}"
# sensible-browser will f up if BROWSER is not set
SRCHBROWSEREXE=(xdg-open sensible-browser "${SRCHBROWSEREXE[@]}")
fi
# Patch for Darwin/Mac by Ross Williams
if test "`uname -s`" == "Darwin"; then
# Darwin/Mac OS X uses a system 'open' command to find
# the default browser. The -W flag causes it to wait for
# the browser to exit
BROWSEREXE="/usr/bin/open -W"
if test "$(uname -s)" == "Darwin"; then
# Darwin/Mac OS X uses a system 'open' command to find
# the default browser. The -W flag causes it to wait for
# the browser to exit
BROWSEREXE="/usr/bin/open -W"
fi
BINWD=`dirname "$0"`
SRCHPATH="$BINWD /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin ${HOME}/usr/bin ${HOME}/bin"
SRCHPATH="$SRCHPATH "`echo $PATH | tr ":" " "`
SRCHDISTPATH="$BINWD/../share $BINWD/.. /usr/share /usr/local /usr /local /usr/local/share ${HOME}/usr ${HOME}/usr/share /opt/local/share /sw ${HOME}/usr/local ${HOME}/usr/share"
BINWD=$(dirname "$0")
SRCHPATH=("$BINWD" /usr/local/bin /usr/share/bin /usr/bin /usr/lib/httrack /usr/local/lib/httrack /usr/local/share/httrack /opt/local/bin /sw/bin "${HOME}/usr/bin" "${HOME}/bin")
IFS=':' read -ra pathdirs <<<"$PATH"
for d in "${pathdirs[@]}"; do
# drop empty PATH fields, matching the old echo|tr word-split
test -n "$d" && SRCHPATH+=("$d")
done
SRCHDISTPATH=("$BINWD/../share" "$BINWD/.." /usr/share /usr/local /usr /local /usr/local/share "${HOME}/usr" "${HOME}/usr/share" /opt/local/share /sw "${HOME}/usr/local" "${HOME}/usr/share")
###
# And now some famous cuisine
function log {
echo "$0($$): $@" >&2
return 0
echo "$0($$): $*" >&2
return 0
}
function launch_browser {
log "Launching $1"
browser=$1
url=$2
log "Spawning browser.."
${browser} "${url}"
# note: browser can hiddenly use the -remote feature of
# mozilla and therefore return immediately
log "Browser (or helper) exited"
log "Launching $1"
browser=$1
url=$2
log "Spawning browser.."
${browser} "${url}"
# note: browser can hiddenly use the -remote feature of
# mozilla and therefore return immediately
log "Browser (or helper) exited"
}
# First ensure that we can launch the server
BINPATH=
for i in ${SRCHPATH}; do
! test -n "${BINPATH}" && test -x ${i}/htsserver && BINPATH=${i}
for i in "${SRCHPATH[@]}"; do
! test -n "${BINPATH}" && test -x "${i}/htsserver" && BINPATH="${i}"
done
for i in ${SRCHDISTPATH}; do
! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack"
for i in "${SRCHDISTPATH[@]}"; do
! test -n "${DISTPATH}" && test -f "${i}/httrack/lang.def" && DISTPATH="${i}/httrack"
done
test -n "${BINPATH}" || ! log "Could not find htsserver" || exit 1
test -n "${DISTPATH}" || ! log "Could not find httrack directory" || exit 1
test -f ${DISTPATH}/lang.def || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
test -f ${DISTPATH}/lang.indexes || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
test -d ${DISTPATH}/lang || ! log "Could not find ${DISTPATH}/lang" || exit 1
test -d ${DISTPATH}/html || ! log "Could not find ${DISTPATH}/html" || exit 1
test -f "${DISTPATH}/lang.def" || ! log "Could not find ${DISTPATH}/lang.def" || exit 1
test -f "${DISTPATH}/lang.indexes" || ! log "Could not find ${DISTPATH}/lang.indexes" || exit 1
test -d "${DISTPATH}/lang" || ! log "Could not find ${DISTPATH}/lang" || exit 1
test -d "${DISTPATH}/html" || ! log "Could not find ${DISTPATH}/html" || exit 1
# Locale
HTSLANG="${LC_MESSAGES}"
! test -n "${HTSLANG}" && HTSLANG="${LC_ALL}"
! test -n "${HTSLANG}" && HTSLANG="${LANG}"
HTSLANG="`echo $LANG | cut -f1 -d'.' | cut -f1 -d'_'`"
LANGN=`grep -E "^${HTSLANG}:" ${DISTPATH}/lang.indexes | cut -f2 -d':'`
HTSLANG="$(echo "$LANG" | cut -f1 -d'.' | cut -f1 -d'_')"
LANGN=$(grep -E "^${HTSLANG}:" "${DISTPATH}/lang.indexes" | cut -f2 -d':')
! test -n "${LANGN}" && LANGN=1
# Find the browser
# note: not all systems have sensible-browser or www-browser alternative
# thefeore, we have to find a bit more if sensible-browser could not be found
for i in ${SRCHBROWSEREXE}; do
for j in ${SRCHPATH}; do
if test -x ${j}/${i}; then
BROWSEREXE=${j}/${i}
fi
test -n "$BROWSEREXE" && break
done
test -n "$BROWSEREXE" && break
for i in "${SRCHBROWSEREXE[@]}"; do
for j in "${SRCHPATH[@]}"; do
if test -x "${j}/${i}"; then
BROWSEREXE="${j}/${i}"
fi
test -n "$BROWSEREXE" && break
done
test -n "$BROWSEREXE" && break
done
test -n "$BROWSEREXE" || ! log "Could not find any suitable browser" || exit 1
# "browse" command
if test "$1" = "browse"; then
if test -f "${HOME}/.httrack.ini"; then
INDEXF=`cat ${HOME}/.httrack.ini | tr '\r' '\n' | grep -E "^path=" | cut -f2- -d'='`
if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then
INDEXF="${INDEXF}/index.html"
else
INDEXF=""
fi
fi
if ! test -n "$INDEXF"; then
INDEXF="${HOME}/websites/index.html"
fi
launch_browser "${BROWSEREXE}" "file://${INDEXF}"
exit $?
if test -f "${HOME}/.httrack.ini"; then
INDEXF=$(tr '\r' '\n' <"${HOME}/.httrack.ini" | grep -E "^path=" | cut -f2- -d'=')
if test -n "${INDEXF}" -a -d "${INDEXF}" -a -f "${INDEXF}/index.html"; then
INDEXF="${INDEXF}/index.html"
else
INDEXF=""
fi
fi
if ! test -n "$INDEXF"; then
INDEXF="${HOME}/websites/index.html"
fi
launch_browser "${BROWSEREXE}" "file://${INDEXF}"
exit $?
fi
# Create a temporary filename
TMPSRVFILE="$(mktemp ${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX)" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
TMPSRVFILE="$(mktemp "${TMPDIR:-/tmp}/.webhttrack.XXXXXXXX")" || ! log "Could not create the temporary file ${TMPSRVFILE}" || exit 1
# Launch htsserver binary and setup the server
(${BINPATH}/htsserver "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" $@; echo SRVURL=error) > ${TMPSRVFILE}&
(
"${BINPATH}/htsserver" "${DISTPATH}/" --ppid "$$" path "${HOME}/websites" lang "${LANGN}" "$@"
echo SRVURL=error
) >"${TMPSRVFILE}" &
# Find the generated SRVURL
SRVURL=
MAXCOUNT=60
while ! test -n "$SRVURL"; do
MAXCOUNT=$[$MAXCOUNT - 1]
test $MAXCOUNT -gt 0 || exit 1
test $MAXCOUNT -lt 50 && echo "waiting for server to reply.."
SRVURL=`grep -E URL= ${TMPSRVFILE} | cut -f2- -d=`
test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1
test -n "$SRVURL" || sleep 1
MAXCOUNT=$((MAXCOUNT - 1))
test $MAXCOUNT -gt 0 || exit 1
test $MAXCOUNT -lt 50 && echo "waiting for server to reply.."
SRVURL=$(grep -E URL= "${TMPSRVFILE}" | cut -f2- -d=)
test ! "$SRVURL" = "error" || ! log "Could not spawn htsserver" || exit 1
test -n "$SRVURL" || sleep 1
done
# Cleanup function
# shellcheck disable=SC2120 # $1 is an optional "signal caught" marker; bare calls are intentional
function cleanup {
test -n "$1" && log "Nasty signal caught, cleaning up.."
# Do not kill if browser exited (chrome bug issue) ; server will die itself
test -n "$1" && test -f ${TMPSRVFILE} && SRVPID=`grep -E PID= ${TMPSRVFILE} | cut -f2- -d=`
test -n "${SRVPID}" && kill -9 ${SRVPID}
test -f ${TMPSRVFILE} && rm ${TMPSRVFILE}
test -n "$1" && log "..Done"
return 0
test -n "$1" && log "Nasty signal caught, cleaning up.."
# Do not kill if browser exited (chrome bug issue) ; server will die itself
test -n "$1" && test -f "${TMPSRVFILE}" && SRVPID=$(grep -E PID= "${TMPSRVFILE}" | cut -f2- -d=)
test -n "${SRVPID}" && kill -9 "${SRVPID}"
test -f "${TMPSRVFILE}" && rm "${TMPSRVFILE}"
test -n "$1" && log "..Done"
return 0
}
# Cleanup in case of emergency
trap "cleanup now; exit" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
trap "cleanup now; exit" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# Got SRVURL, launch browser
launch_browser "${BROWSEREXE}" "${SRVURL}"
# That's all, folks!
trap "" 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
trap "" HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
cleanup
exit 0

@@ -6,11 +6,11 @@ set -euo pipefail
# charset -> UTF-8 conversion (hts_convertStringToUTF8).
# -#3 <charset> <string> prints the string re-decoded from <charset> as UTF-8.
conv() {
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
test "$(httrack -O /dev/null -#3 "$1" "$2")" == "$3" || exit 1
}
# crash probe: malformed input must exit cleanly, not abort.
runs() {
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
httrack -O /dev/null -#3 "$1" "$2" >/dev/null 2>&1 || exit 1
}
# the source bytes below are UTF-8 (this file is UTF-8); "café" is 0x63 61 66 C3 A9.

@@ -6,11 +6,11 @@ set -euo pipefail
# HTML entity unescaping (hts_unescapeEntitiesWithCharset).
# -#6 <string> prints the string with entities decoded (UTF-8 output).
ent() {
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
test "$(httrack -O /dev/null -#6 "$1")" == "$2" || exit 1
}
# crash probe: malformed input must exit cleanly, not abort.
runs() {
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
httrack -O /dev/null -#6 "$1" >/dev/null 2>&1 || exit 1
}
# named entities

@@ -7,10 +7,10 @@ set -euo pipefail
# -#0 <filter> <string> prints "<string> does match <filter>" or "... does NOT match ...".
match() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does match $1" || exit 1
}
nomatch() {
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
test "$(httrack -O /dev/null -#0 "$1" "$2")" == "$2 does NOT match $1" || exit 1
}
# bare star matches everything
@@ -67,7 +67,7 @@ nomatch '*[\[]' 'a'
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x'

@@ -7,10 +7,10 @@ set -euo pipefail
# -#2 <path> prints "<path> is '<mime>'" then "and its local type is '.<ext>'".
mime() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is '$2'" || exit 1
}
unknown() {
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
test "$(httrack -O /dev/null -#2 "$1" | head -1)" == "$1 is of an unknown MIME type" || exit 1
}
mime '/a/b.html' 'text/html'

@@ -264,4 +264,63 @@ grep -Fq 'window.open("winopen.gif")' "$saved7" ||
! grep -Fq 'window.open("file://' "$saved7" ||
! echo "FAIL #218: window.open URL left absolute (not rewritten)" || exit 1
# Parens in an unquoted url(...) (#163): the source %28/%29 decode to literal
# '(' ')' in the saved name, but a literal ')' in the rewritten url() closes the
# token early, so they must stay encoded. Negative control: without the fix the
# %281%29 greps fail (parens are RFC2396 "mark" chars the escaper leaves alone).
site8="$tmp/cssparens"
mkdir -p "$site8"
for f in 'img (1).gif' 'a(b)c(1).gif' 'q (4).gif'; do gif "$site8/$f"; done
cat >"$site8/style.css" <<'EOF'
.a { background: url(img%20%281%29.gif); }
.b { background: url(a%28b%29c%281%29.gif); }
.c { background: url("q%20%284%29.gif"); }
EOF
out8="$tmp/cssparens-out"
crawl "$site8/style.css" "$out8"
found "img (1).gif" "$out8"
found "a(b)c(1).gif" "$out8"
found "q (4).gif" "$out8"
css8=$(find "$out8" -type f -path '*/file/*' -name style.css -print -quit)
test -n "$css8" || ! echo "FAIL: saved style.css not found" || exit 1
grep -Fq 'url(img%20%281%29.gif)' "$css8" ||
! echo "FAIL #163: parens in unquoted url() not percent-encoded on rewrite" || exit 1
grep -Fq 'url(a%28b%29c%281%29.gif)' "$css8" ||
! echo "FAIL #163: not every paren in a url() was percent-encoded" || exit 1
grep -Fq 'url("q%20%284%29.gif")' "$css8" ||
! echo "FAIL #163: quoted url() altered or parens left literal on rewrite" || exit 1
# The url() detector is not CSS-specific: <script> and inline style= get the
# same encoding, but ordinary href/src (ending_p is the quote, not ')') keep
# literal parens -- the attribute checks guard the gate against over-firing.
site9="$tmp/urlparens"
mkdir -p "$site9"
for f in 'js (1).gif' 'inl (2).gif' 'asrc (3).gif' 'ahref (4).gif'; do gif "$site9/$f"; done
cat >"$site9/index.html" <<EOF
<html><body>
<script>var bg = "url(js%20%281%29.gif)";</script>
<div style="background-image:url(inl%20%282%29.gif)"></div>
<img src="asrc%20%283%29.gif">
<a href="ahref%20%284%29.gif">link</a>
</body></html>
EOF
out9="$tmp/urlparens-out"
crawl "$site9/index.html" "$out9"
saved9=$(savedhtml "$out9")
test -n "$saved9" || ! echo "FAIL: saved urlparens page not found" || exit 1
# rewrite-only: the JS-string asset is not queued for download
grep -Fq 'url(js%20%281%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in <script> url() not percent-encoded" || exit 1
found "inl (2).gif" "$out9"
grep -Fq 'url(inl%20%282%29.gif)' "$saved9" ||
! echo "FAIL #163: parens in inline style url() not percent-encoded" || exit 1
found "asrc (3).gif" "$out9"
found "ahref (4).gif" "$out9"
grep -Fq 'src="asrc%20(3).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain src attribute were wrongly encoded" || exit 1
grep -Fq 'href="ahref%20(4).gif"' "$saved9" ||
! echo "FAIL #163: parens in a plain href attribute were wrongly encoded" || exit 1
! grep -Eq '(src|href)="[^"]*%28' "$saved9" ||
! echo "FAIL #163: gate over-fired onto a non-url() attribute link" || exit 1
exit 0

tests/01_engine-relative.test Executable file

@@ -0,0 +1,68 @@
#!/bin/bash
#
# lienrelatif (build relative path) + ident_url_relatif (resolve a link, collapse
# ./ and ../). Regression net for #137/#162; expected values hand-computed.
set -euo pipefail
# relative path from <curr>'s directory to <link>
rel() {
local got
got=$(httrack -O /dev/null -#l "$1" "$2")
test "$got" == "relative=$3" ||
{
echo "FAIL rel($1, $2): got '$got' want 'relative=$3'"
exit 1
}
}
# resolve <link> against origin <adr>/<fil> -> adr=.. fil=..
ident() {
local got
got=$(httrack -O /dev/null -#i "$1" "$2" "$3")
test "$got" == "$4" ||
{
echo "FAIL ident($1, $2, $3): got '$got' want '$4'"
exit 1
}
}
### lienrelatif
rel 'dir/page.html' 'dir/index.html' 'page.html'
rel 'dir/page.html' 'dir/page.html' 'page.html' # self-link
rel 'a.html' 'dir/index.html' '../a.html'
rel 'x.html' 'a/b/c/index.html' '../../../x.html'
rel 'h/a/x.jpg' 'h/a/sub/page.html' '../x.jpg'
rel 'a/b/c/x.html' 'index.html' 'a/b/c/x.html'
rel 'h/sub/x.jpg' 'h/page.html' 'sub/x.jpg'
rel 'h/dir2/x.jpg' 'h/dir1/page.html' '../dir2/x.jpg' # sibling dir
rel 'h/bc/x.jpg' 'h/b/page.html' '../bc/x.jpg' # b/bc prefix trap
rel 'h/b/x.jpg' 'h/bc/page.html' '../b/x.jpg'
rel 'h2/img/x.jpg' 'h1/p/page.html' '../../h2/img/x.jpg' # cross-host
rel 'img.cdn/photo.jpg' 'www.site/articles/2020/post.html' '../../../img.cdn/photo.jpg'
rel 'h/a/' 'h/a/sub/page.html' '../' # link is ancestor dir
rel 'x.html' 'page.html' 'x.html'
rel 'dir/page.html?x=1' 'dir/index.html?y=2' 'page.html' # ? stripped
### ident_url_relatif
ident 'img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/dir/sub/img.gif'
ident '/img.gif' 'www.foo.com' '/dir/page.html' 'adr=www.foo.com fil=/img.gif'
# embedded ../ collapses (#137)
ident '../img.gif' 'www.foo.com' '/dir/sub/page.html' 'adr=www.foo.com fil=/dir/img.gif'
ident 'sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/articles/2020/logo.png'
ident '../../pix/sub/../logo.png' 'www.foo.com' '/articles/2020/post.html' 'adr=www.foo.com fil=/pix/logo.png'
ident '../../../../x.gif' 'www.foo.com' '/a/b/page.html' 'adr=www.foo.com fil=/x.gif' # above-root clamp
ident '?page=2' 'www.foo.com' '/dir/index.html?old=1' 'adr=www.foo.com fil=/dir/index.html?page=2'
ident 'http://other.com/a/b/../c/index.html' 'www.foo.com' '/p.html' 'adr=other.com fil=/a/c/index.html'
# file:// collapses ../ like the other schemes; traversal contained, // authority kept
ident 'file:///var/data/pix/sub/../logo.png' 'www.foo.com' '/p.html' 'adr=file:// fil=/var/data/pix/logo.png'
ident 'file:///a/b/c/../../d/e.gif' 'www.foo.com' '/p.html' 'adr=file:// fil=/a/d/e.gif'
ident 'file:///a/../../b' 'www.foo.com' '/p.html' 'adr=file:// fil=/b'
ident 'file://srv/share/../x' 'www.foo.com' '/p.html' 'adr=file:// fil=//srv/x'
ident 'mailto:foo@bar.com' 'www.foo.com' '/p.html' 'error=-1' # unsupported scheme
ident 'javascript:void(0)' 'www.foo.com' '/p.html' 'error=-1'
echo "OK"
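Review note: the `rel` expectations above can be reproduced with a small pure-bash routine, which may help when hand-checking new cases. This is a minimal sketch under stated assumptions, not HTTrack's `lienrelatif`: the name `relpath` is hypothetical, it only handles the forms exercised above (query strings dropped, whole-component prefix matching), and it takes its arguments in the same order as `rel`.

```shell
#!/bin/bash
# Hypothetical helper (not HTTrack code): print the relative path that leads
# from <curr>'s directory ($2) to <link> ($1).
relpath() {
  local link="${1%%[?]*}" curr="${2%%[?]*}" # drop any ?query first
  local ldir='' cdir='' lfile="${link##*/}" out='' i=0 j
  local -a ldirs cdirs
  if [[ $link == */* ]]; then ldir=${link%/*}; fi
  if [[ $curr == */* ]]; then cdir=${curr%/*}; fi
  IFS=/ read -ra ldirs <<<"$ldir"
  IFS=/ read -ra cdirs <<<"$cdir"
  # Skip the directories both paths share, comparing whole components so
  # that 'b' is never treated as a prefix of 'bc'.
  while ((i < ${#ldirs[@]} && i < ${#cdirs[@]})) &&
    [[ ${ldirs[i]} == "${cdirs[i]}" ]]; do
    i=$((i + 1))
  done
  # Climb out of what is left of <curr>'s directory ...
  for ((j = i; j < ${#cdirs[@]}; j++)); do out+='../'; done
  # ... then descend into what is left of <link>'s directory.
  for ((j = i; j < ${#ldirs[@]}; j++)); do out+="${ldirs[j]}/"; done
  printf '%s\n' "$out$lfile"
}

relpath 'h/bc/x.jpg' 'h/b/page.html'
relpath 'h/a/' 'h/a/sub/page.html'
```

Comparing whole path components rather than raw string prefixes is what keeps the `b`/`bc` case from going wrong.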

View File

@@ -5,7 +5,7 @@ set -euo pipefail
# path simplify engine (fil_simplifie): collapses ./ and ../ segments.
simp() {
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
test "$(httrack -O /dev/null -#1 "$1")" == "simplified=$2" || exit 1
}
simp './foo/bar/' 'foo/bar/'
@@ -26,3 +26,17 @@ simp './a/../../b' 'b'
# empty segments ('//') are not dot-segments and are preserved, per RFC 3986
simp 'a//b' 'a//b'
simp 'a//b/../c' 'a//c'
# absolute paths keep the leading '/'; above-root '..' is clamped to it
simp '/a/../b' '/b'
simp '/a/../../b' '/b'
simp '/../x' '/x'
# collapses to nothing -> './' (relative) or '/' (absolute)
simp '..' './'
simp 'a/..' './'
simp '/' '/'
simp 'a/b/..' 'a/' # trailing bare '..'
simp 'a/../b?x=../y' 'b?x=../y' # '?' freezes simplification
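Review note: the rules these `simp` cases encode (leading `/` kept, above-root `..` clamped, empty segments preserved, everything after `?` frozen, a trailing `.` or `..` leaving a directory) can be sketched in pure bash. This is a hypothetical illustration of the expected behaviour only, not the `fil_simplifie` implementation; the function name `simplify` is an assumption.

```shell
#!/bin/bash
# Hypothetical sketch (not fil_simplifie): collapse ./ and ../ segments.
simplify() {
  local in=$1 path query='' root='' trail='' seg
  local -a segs out=()
  path=${in%%[?]*} # only the part before '?' is simplified
  if [[ $in == *[?]* ]]; then query="?${in#*[?]}"; fi
  if [[ $path == /* ]]; then
    root=/ # absolute path: remember the leading slash
    path=${path#/}
  fi
  IFS=/ read -ra segs <<<"$path"
  for seg in "${segs[@]}"; do
    case "$seg" in
    .) trail=/ ;;
    ..)
      # pop one segment; at the top there is nothing to pop (clamp)
      if ((${#out[@]} > 0)); then unset "out[$((${#out[@]} - 1))]"; fi
      trail=/
      ;;
    *)
      # ordinary segment, including the empty one produced by '//'
      out+=("$seg")
      trail=''
      ;;
    esac
  done
  if [[ $path == */ ]]; then trail=/; fi
  if ((${#out[@]} == 0)); then
    printf '%s\n' "${root:-./}$query" # collapsed to nothing
    return
  fi
  local IFS=/ # join the remaining segments with '/'
  printf '%s\n' "$root${out[*]}$trail$query"
}

simplify 'a//b/../c'
simplify '/a/../../b'
simplify 'a/../b?x=../y'
```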

View File

@@ -21,9 +21,15 @@ test "$out" == "strsafe: OK" || exit 1
# the bounded macro aborts (non-zero exit), so don't let set -e trip on it
err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1) || true
case "$err" in
*"strsafe: NOT aborted"*) echo "over-capacity write was NOT caught" >&2; exit 1 ;;
*"overflow while copying"*) ;;
*) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;;
*"strsafe: NOT aborted"*)
echo "over-capacity write was NOT caught" >&2
exit 1
;;
*"overflow while copying"*) ;;
*)
echo "expected htssafe overflow abort, got: $err" >&2
exit 1
;;
esac
# Same guarantee for the htsbuff builder. The source is exactly the buffer
@@ -32,7 +38,13 @@ esac
# aborted"). Match the specific htsbuff abort message, not just any assert.
err=$(httrack -#8 overflow-buff "abcd" 2>&1) || true
case "$err" in
*"strsafe: NOT aborted"*) echo "htsbuff over-capacity write was NOT caught" >&2; exit 1 ;;
*"htsbuff append overflow"*) ;;
*) echo "expected htsbuff overflow abort, got: $err" >&2; exit 1 ;;
*"strsafe: NOT aborted"*)
echo "htsbuff over-capacity write was NOT caught" >&2
exit 1
;;
*"htsbuff append overflow"*) ;;
*)
echo "expected htsbuff overflow abort, got: $err" >&2
exit 1
;;
esac

View File

@@ -3,6 +3,6 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 5 httrack http://ut.httrack.com/simple/basic.html
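Review note: the `cmd || ! echo ... || exit 77` chain used by the online tests is compact but easy to misread. When the probe fails, `echo` runs and succeeds, the `!` turns that success into a failure, and the chain falls through to `exit 77`, which Automake-style test harnesses report as a skip. A minimal sketch follows; `guard` is a hypothetical wrapper, and the subshell is only there so that both outcomes can be observed without ending the calling shell.

```shell
#!/bin/bash
# Hypothetical wrapper (not in the test suite): run a probe command and
# apply the same "message, then exit 77" chain inside a subshell.
guard() {
  ("$@" || ! echo "skipping online unit tests" || exit 77)
}

rc=0
guard true || rc=$?
echo "probe succeeded: status $rc"

rc=0
guard false || rc=$?
echo "probe failed: status $rc"
```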

View File

@@ -3,10 +3,10 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/cookies/third.html \
--found ut.httrack.com/cookies/second.html \
--found ut.httrack.com/cookies/entrance.html \
httrack http://ut.httrack.com/cookies/entrance.php
--found ut.httrack.com/cookies/third.html \
--found ut.httrack.com/cookies/second.html \
--found ut.httrack.com/cookies/entrance.html \
httrack http://ut.httrack.com/cookies/entrance.php

View File

@@ -3,21 +3,21 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests
bash crawl-test.sh \
--errors 1 --files 5 \
--found 'café.ut.httrack.com/unicode-links/café3860.html' \
--found 'café.ut.httrack.com/unicode-links/café30f4.html' \
--found 'café.ut.httrack.com/unicode-links/café5e1f.html' \
--found 'café.ut.httrack.com/unicode-links/café7b30.html' \
httrack 'http://ut.httrack.com/unicode-links/idna.html' \
'+*.ut.httrack.com/*' --robots=0
--errors 1 --files 5 \
--found 'café.ut.httrack.com/unicode-links/café3860.html' \
--found 'café.ut.httrack.com/unicode-links/café30f4.html' \
--found 'café.ut.httrack.com/unicode-links/café5e1f.html' \
--found 'café.ut.httrack.com/unicode-links/café7b30.html' \
httrack 'http://ut.httrack.com/unicode-links/idna.html' \
'+*.ut.httrack.com/*' --robots=0
# unicode tests (bogus links)
bash crawl-test.sh \
--errors 0 --files 1 \
--found 'ut.httrack.com/unicode-links/idna_bogus.html' \
httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \
'-*' --robots=0
--errors 0 --files 1 \
--found 'ut.httrack.com/unicode-links/idna_bogus.html' \
httrack 'http://ut.httrack.com/unicode-links/idna_bogus.html' \
'-*' --robots=0

View File

@@ -3,67 +3,67 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# unicode tests
bash crawl-test.sh \
--errors 1 --files 10 \
--found ut.httrack.com/unicode-links/caf%a91bce.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café463e.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/caféae52.html \
--found ut.httrack.com/unicode-links/caféc009.html \
--found ut.httrack.com/unicode-links/utf8.html \
httrack http://ut.httrack.com/unicode-links/utf8.html
--errors 1 --files 10 \
--found ut.httrack.com/unicode-links/caf%a91bce.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café463e.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/caféae52.html \
--found ut.httrack.com/unicode-links/caféc009.html \
--found ut.httrack.com/unicode-links/utf8.html \
httrack http://ut.httrack.com/unicode-links/utf8.html
bash crawl-test.sh \
--errors 4 --files 7 \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caf%e939bd.html \
--found ut.httrack.com/unicode-links/caf%e9ae52.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/default.html \
httrack http://ut.httrack.com/unicode-links/default.html
--errors 4 --files 7 \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café9fa8.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caf%e939bd.html \
--found ut.httrack.com/unicode-links/caf%e9ae52.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/default.html \
httrack http://ut.httrack.com/unicode-links/default.html
bash crawl-test.sh \
--errors 2 --files 9 \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café647f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/iso88591.html \
httrack http://ut.httrack.com/unicode-links/iso88591.html
--errors 2 --files 9 \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café647f.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/caféaec2.html \
--found ut.httrack.com/unicode-links/caféfad6.html \
--found ut.httrack.com/unicode-links/iso88591.html \
httrack http://ut.httrack.com/unicode-links/iso88591.html
bash crawl-test.sh \
--errors 4 --files 9 \
--found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/cafébf43.html \
--found ut.httrack.com/unicode-links/cafédcd8.html \
--found ut.httrack.com/unicode-links/café2461.html \
--found ut.httrack.com/unicode-links/caf%a8%a61bce.html \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/gb18030.html \
httrack http://ut.httrack.com/unicode-links/gb18030.html
--errors 4 --files 9 \
--found ut.httrack.com/unicode-links/caf%a8%a6c72a.html \
--found ut.httrack.com/unicode-links/caf%a9bf59.html \
--found ut.httrack.com/unicode-links/café8007.html \
--found ut.httrack.com/unicode-links/cafébf43.html \
--found ut.httrack.com/unicode-links/cafédcd8.html \
--found ut.httrack.com/unicode-links/café2461.html \
--found ut.httrack.com/unicode-links/caf%a8%a61bce.html \
--found ut.httrack.com/unicode-links/caf%a9ae52.html \
--found ut.httrack.com/unicode-links/café7b30.html \
--found ut.httrack.com/unicode-links/café30f4.html \
--found ut.httrack.com/unicode-links/café5e1f.html \
--found ut.httrack.com/unicode-links/café3860.html \
--found ut.httrack.com/unicode-links/gb18030.html \
httrack http://ut.httrack.com/unicode-links/gb18030.html

View File

@@ -3,10 +3,10 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=42&can=1
# we expect only 2 errors because the other links are too long (to be modified if suitable)
bash crawl-test.sh --errors 2 --files 1 \
--found ut.httrack.com/overflow/longquerywithaccents.html \
httrack http://ut.httrack.com/overflow/longquerywithaccents.php
--found ut.httrack.com/overflow/longquerywithaccents.html \
httrack http://ut.httrack.com/overflow/longquerywithaccents.php

View File

@@ -3,45 +3,45 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
# http://code.google.com/p/httrack/issues/detail?id=4&can=1
bash crawl-test.sh --errors 0 --files 4 \
--found ut.httrack.com/parsing/back5e1f.gif \
--found ut.httrack.com/parsing/events.html \
--found ut.httrack.com/parsing/fade230f4.gif \
--found ut.httrack.com/parsing/fade3860.gif \
httrack http://ut.httrack.com/parsing/events.html
--found ut.httrack.com/parsing/back5e1f.gif \
--found ut.httrack.com/parsing/events.html \
--found ut.httrack.com/parsing/fade230f4.gif \
--found ut.httrack.com/parsing/fade3860.gif \
httrack http://ut.httrack.com/parsing/events.html
# http://code.google.com/p/httrack/issues/detail?id=2&can=1
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/background-image.css \
--found ut.httrack.com/parsing/background-image.html \
--found ut.httrack.com/parsing/fade.gif \
httrack http://ut.httrack.com/parsing/background-image.html
--found ut.httrack.com/parsing/background-image.css \
--found ut.httrack.com/parsing/background-image.html \
--found ut.httrack.com/parsing/fade.gif \
httrack http://ut.httrack.com/parsing/background-image.html
# javascript parsing
bash crawl-test.sh --errors 0 --files 3 \
--found ut.httrack.com/parsing/back.gif \
--found ut.httrack.com/parsing/fade.gif \
--found ut.httrack.com/parsing/javascript.html \
httrack http://ut.httrack.com/parsing/javascript.html
--found ut.httrack.com/parsing/back.gif \
--found ut.httrack.com/parsing/fade.gif \
--found ut.httrack.com/parsing/javascript.html \
httrack http://ut.httrack.com/parsing/javascript.html
# handling of + before query string
bash crawl-test.sh --errors 0 --files 6 \
--found ut.httrack.com/parsing/escaping.html \
--found "ut.httrack.com/parsing/foo bar30f4.html" \
--found "ut.httrack.com/parsing/foo bar5e1f.html" \
--found "ut.httrack.com/parsing/foo+bar+plus3860.html" \
--found "ut.httrack.com/parsing/foo barae52.html" \
--found "ut.httrack.com/parsing/foo bar7b30.html" \
httrack http://ut.httrack.com/parsing/escaping.html
--found ut.httrack.com/parsing/escaping.html \
--found "ut.httrack.com/parsing/foo bar30f4.html" \
--found "ut.httrack.com/parsing/foo bar5e1f.html" \
--found "ut.httrack.com/parsing/foo+bar+plus3860.html" \
--found "ut.httrack.com/parsing/foo barae52.html" \
--found "ut.httrack.com/parsing/foo bar7b30.html" \
httrack http://ut.httrack.com/parsing/escaping.html
# handling of # encoded in filename
# see http://code.google.com/p/httrack/issues/detail?id=25
bash crawl-test.sh --errors 2 --files 4 \
--found "ut.httrack.com/parsing/escaping2.html" \
--found "ut.httrack.com/parsing/++foo++bar++plus++.html" \
--found "ut.httrack.com/parsing/foo#bar#.html" \
--found "ut.httrack.com/parsing/foo bar.html" \
httrack http://ut.httrack.com/parsing/escaping2.html
--found "ut.httrack.com/parsing/escaping2.html" \
--found "ut.httrack.com/parsing/++foo++bar++plus++.html" \
--found "ut.httrack.com/parsing/foo#bar#.html" \
--found "ut.httrack.com/parsing/foo bar.html" \
httrack http://ut.httrack.com/parsing/escaping2.html

View File

@@ -3,11 +3,11 @@
set -euo pipefail
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
bash check-network.sh || ! echo "skipping online unit tests" || exit 77
if test "${HTTPS_SUPPORT:-}" == "no"; then
echo "no https support compiled, skipping"
exit 77
echo "no https support compiled, skipping"
exit 77
fi
bash crawl-test.sh --errors 0 --files 5 httrack https://ut.httrack.com/simple/basic.html

View File

@@ -35,6 +35,7 @@ TESTS = \
01_engine-mime.test \
01_engine-parse.test \
01_engine-rcfile.test \
01_engine-relative.test \
01_engine-simplify.test \
01_engine-strsafe.test \
02_manpage-regen.test \

View File

@@ -6,39 +6,39 @@
# do not enable online tests (./configure --disable-online-unit-tests)
if test "$ONLINE_UNIT_TESTS" == "no"; then
echo "online tests are disabled" >&2
exit 1
echo "online tests are disabled" >&2
exit 1
# enable online tests (--enable-online-unit-tests)
elif test "$ONLINE_UNIT_TESTS" == "yes"; then
exit 0
exit 0
# check if online tests are reachable
else
# test url
url=http://ut.httrack.com/enabled
# test url
url=http://ut.httrack.com/enabled
# cache file name
cache=check-network_sh.cache
# cache file name
cache=check-network_sh.cache
# cached result ?
if test -f $cache ; then
if grep -q "ok" $cache ; then
exit 0
else
echo "online tests are disabled (cached)" >&2
exit 1
fi
# cached result ?
if test -f $cache; then
if grep -q "ok" $cache; then
exit 0
else
echo "online tests are disabled (cached)" >&2
exit 1
fi
# fetch single file
elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null ; then
echo "ok" > $cache
exit 0
else
echo "error" > $cache
echo "online tests are disabled (auto)" >&2
exit 1
fi
# fetch single file
elif bash crawl-test.sh --errors 0 --files 1 httrack --timeout=3 --max-time=3 "$url" 2>/dev/null >/dev/null; then
echo "ok" >$cache
exit 0
else
echo "error" >$cache
echo "online tests are disabled (auto)" >&2
exit 1
fi
fi

View File

@@ -2,185 +2,184 @@
#
function warning {
echo "** $*" >&2
return 0
echo "** $*" >&2
return 0
}
function die {
warning "$*"
exit 1
warning "$*"
exit 1
}
function debug {
if test -n "$verbose"; then
echo "$*" >&2
fi
if test -n "$verbose"; then
echo "$*" >&2
fi
}
function info {
printf "[$*] ..\t" >&2
printf '[%s] ..\t' "$*" >&2
}
function result {
echo "$*" >&2
echo "$*" >&2
}
function cleanup {
debug "cleaning function called"
if test -n "$tmpdir"; then
if test -d "$tmpdir"; then
if test -z "$nopurge"; then
debug "cleaning up $tmpdir"
rm -rf "$tmpdir"
fi
debug "cleaning function called"
if test -n "$tmpdir"; then
if test -d "$tmpdir"; then
if test -z "$nopurge"; then
debug "cleaning up $tmpdir"
rm -rf "$tmpdir"
fi
fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi
fi
if test -n "$crawlpid"; then
debug "killing $crawlpid"
kill -9 "$crawlpid"
crawlpid=
fi
}
function usage {
cat << EOF
cat <<EOF
usage: $0
EOF
}
function assert_equals {
info "$1"
if test ! "$2" == "$3"; then
result "expected '$2', got '$3'"
exit 1
else
result "OK ($2)"
fi
info "$1"
if test ! "$2" == "$3"; then
result "expected '$2', got '$3'"
exit 1
else
result "OK ($2)"
fi
}
function start-crawl {
# parse args
pos=1
while test "$#" -ge "$pos" ; do
case "${!pos}" in
--debug)
verbose=1
;;
--no-purge|--summary|--print-files)
;;
--errors|--files|--found|--not-found|--directory)
pos=$[${pos}+1]
test "$#" -ge "$pos" || warning "missing argument" || return 1
;;
httrack)
pos=$[${pos}+1]
break;
;;
*)
warning "unrecognized option ${!pos}"
return 1
;;
esac
pos=$[${pos}+1]
done
debug "remaining args: ${@:${pos}}"
# parse args
pos=1
while test "$#" -ge "$pos"; do
case "${!pos}" in
--debug)
verbose=1
;;
--no-purge | --summary | --print-files) ;;
--errors | --files | --found | --not-found | --directory)
pos=$((pos + 1))
test "$#" -ge "$pos" || warning "missing argument" || return 1
;;
httrack)
pos=$((pos + 1))
break
;;
*)
warning "unrecognized option ${!pos}"
return 1
;;
esac
pos=$((pos + 1))
done
debug "remaining args: ${*:pos}"
# ut/ won't exceed 2 minutes
moreargs="--quiet --max-time=120 --timeout=30 --connection-per-second=5"
# ut/ won't exceed 2 minutes
moreargs=(--quiet --max-time=120 --timeout=30 --connection-per-second=5)
# proxy environment ?
if test -n "$http_proxy"; then
moreargs="$moreargs --proxy $http_proxy"
fi
# proxy environment ?
if test -n "${http_proxy:-}"; then
moreargs+=(--proxy "$http_proxy")
fi
test -n "$tmpdir" || ! warning "no tmpdir" || return 1
tmp="${tmpdir}/crawl"
rm -rf "$tmp"
mkdir "$tmp" || ! warning "could not create $tmp" || return 1
which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" ${moreargs} ${@:${pos}}
info "running httrack ${@:${pos}}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" ${moreargs} ${@:${pos}} >"${log}" 2>&1 &
crawlpid="$!"
debug "started crawler on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking for $1"
if test -f "${tmp}/$1" ; then
result "unexpectedly found"
exit 1
else
result "OK"
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1" ; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" \
| sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break;
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
test -n "$tmpdir" || ! warning "no tmpdir" || return 1
tmp="${tmpdir}/crawl"
rm -rf "$tmp"
else
tmpdir=
fi
mkdir "$tmp" || ! warning "could not create $tmp" || return 1
which httrack >/dev/null || ! warning "could not find httrack" || return 1
ver=$(httrack -O /dev/null --version | sed -e 's/HTTrack version //')
test -n "$ver" || ! warning "could not run httrack" || return 1
# start crawl
log="${tmp}/log"
debug starting httrack -O "${tmp}" "${moreargs[@]}" "${@:pos}"
info "running httrack ${*:pos}"
httrack -O "${tmp}" --user-agent="httrack $ver ut ($(uname -omrs))" "${moreargs[@]}" "${@:pos}" >"${log}" 2>&1 &
crawlpid="$!"
debug "started crawler on pid $crawlpid"
wait "$crawlpid"
result="$?"
crawlpid=
test "$result" -eq 0 || ! result "error code $result" || return 1
result "OK"
grep -iE "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt" >&2
# now audit
while test "$#" -gt 0; do
case "$1" in
--no-purge)
nopurge=1
;;
--summary)
grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt"
;;
--print-files)
find "${tmp}" -mindepth 1 -type f
;;
--errors)
shift
assert_equals "checking errors" "$1" "$(grep -iEc "^[0-9\:]*[[:space:]]Error:" "${tmp}/hts-log.txt")"
;;
--found)
shift
info "checking for $1"
if test -f "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--not-found)
shift
info "checking for $1"
if test -f "${tmp}/$1"; then
result "unexpectedly found"
exit 1
else
result "OK"
fi
;;
--directory)
shift
info "checking for $1"
if test -d "${tmp}/$1"; then
result "OK"
else
result "not found"
exit 1
fi
;;
--files)
shift
nFiles=$(grep -E "^HTTrack Website Copier/[^ ]* mirror complete in " "${tmp}/hts-log.txt" |
sed -e 's/.*[[:space:]]\([^ ]*\)[[:space:]]files written.*/\1/g')
assert_equals "checking files" "$1" "$nFiles"
;;
httrack)
break
;;
esac
shift
done
# cleanup
if test -z "$nopurge"; then
rm -rf "$tmp"
else
tmpdir=
fi
}
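Review note: the move from a space-separated `moreargs` string to a bash array matters as soon as one value contains whitespace. The sketch below is a standalone illustration with hypothetical names (`count_args`, `value`), not code from crawl-test.sh: the unquoted string is re-split on spaces, while the quoted array expansion keeps each element as one argument.

```shell
#!/bin/bash
# Hypothetical demo: count how many arguments a command really receives.
count_args() { echo "$#"; }

value='two words' # hypothetical option value containing a space

as_string="--opt $value"  # old style: one flat string
as_array=(--opt "$value") # new style: one element per argument

# shellcheck disable=SC2086 # left unquoted on purpose to show the split
count_args $as_string       # prints 3: --opt, two, words
count_args "${as_array[@]}" # prints 2: --opt, 'two words'
```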
# check args
@@ -195,7 +194,7 @@ tmpdir=
crawlpid=
nopurge=
verbose=
trap "cleanup" 0 1 2 3 4 5 6 7 8 9 11 13 14 15 16 19 24 25
trap cleanup EXIT HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV PIPE ALRM TERM STKFLT XCPU XFSZ
# working directory
tmpdir="${tmptopdir}/httrack_ut.$$"
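Review note: naming the signals in `trap` makes the intent readable and makes it obvious that KILL and STOP, which cannot be caught, are absent. The sketch below is a hypothetical standalone demo (`run_with_cleanup` and `workdir` are assumed names), showing the EXIT trap removing a temporary directory. Some of the names in the full list above, such as STKFLT, are Linux-specific as far as I know, so the demo sticks to widely available ones.

```shell
#!/bin/bash
# Hypothetical demo (not part of crawl-test.sh): create a scratch directory
# in a subshell and rely on a named-signal trap to remove it.
run_with_cleanup() {
  (
    workdir=$(mktemp -d) || exit 1
    cleanup() { rm -rf "$workdir"; }
    # EXIT covers normal termination; the rest cover common interruptions.
    trap cleanup EXIT HUP INT TERM
    echo "$workdir"
  )
}

dir=$(run_with_cleanup)
if test -e "$dir"; then
  echo "leaked $dir"
else
  echo "cleaned up"
fi
```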

View File

@@ -2,19 +2,19 @@
#
error=0
for i in *.test ; do
if bash $i ; then
echo "$i: passed" >&2
else
echo "$i: ERROR" >&2
error=$[${error}+1]
fi
for i in *.test; do
if bash "$i"; then
echo "$i: passed" >&2
else
echo "$i: ERROR" >&2
error=$((error + 1))
fi
done
if test "$error" -eq 0; then
echo "all tests passed" >&2
echo "all tests passed" >&2
else
echo "${error} test(s) failed" >&2
echo "${error} test(s) failed" >&2
fi
exit $error