Compare commits

3 Commits

Author SHA1 Message Date
Xavier Roche
2eac19655b Widen the savename self-test and cover Content-Disposition end to end (#477)
* tests: widen the savename self-test and cover Content-Disposition end to end

-#test=savename grows key=value knobs (cdispo=, statuscode=, status=, adr=,
strip=, urlhack= and its negatives, n83=, type=) plus repeatable
prior=adr|fil|sav rows that register an already-crawled link, so the .test
can pin the still-downloading mime path, redirect delayed naming, dedup and
collision suffixing, 8-3 modes, --strip-query dedup and hostile fils
(traversal, control chars, oversized names) - the regression net for the
upcoming resolve_extension work.

A new cdispo/ endpoint in local-server.py and 32_local-cdispo.test give the
Content-Disposition branches their first end-to-end coverage, including a
traversal filename reduced to its last component.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* review follow-ups: isolate the wire cdispo strip, add over-match negatives

The review found the e2e traversal row masked by two independent layers
(the wire parser and url_savename both strip path components), so a new
-#test=header self-test pins treathead's Content-Disposition parse alone.
Three negative rows keep dedup honest: a kept query key that differs, a
distinct URL under urlhack, and a same-basename-different-directory prior
must all produce a fresh name, not a false match. route_cdispo now reuses
send_raw via an extra_headers argument.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
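
The "strip path components" behaviour that both layers apply can be illustrated in isolation. The following is a minimal sketch, not HTTrack's actual code: last_component() is a hypothetical helper, and treating both '/' and '\' as separators is an assumption made for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of HTTrack): return a pointer to the last
   path component of a Content-Disposition filename. Both '/' and '\\' are
   treated as separators, which is an assumption of this sketch. */
static const char *last_component(const char *name) {
  const char *base = name;
  const char *p;

  assert(name != NULL);
  for (p = name; *p != '\0'; p++) {
    if (*p == '/' || *p == '\\')
      base = p + 1; /* restart just after each separator */
  }
  return base;
}
```

With this helper, "../../evil.pdf" reduces to "evil.pdf", which is the outcome the traversal rows assert. Because two layers each perform a reduction of this kind, an end-to-end test alone cannot tell which one did the work.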

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:42:32 +02:00
Xavier Roche
83c231d50e Fold the type-probe's triplicated extension decision into one function (#476)
* htsname: extract resolve_extension() from the type-probe block

The cache, backing-headers and live-probe paths each carried the same
cdispo / wire_patches_ext / give_mimext decision, per the block's own
FIXME ("factorize this unholy mess"); fold the three copies into one
static resolve_extension(). Drop the dead DEFAULT_BIN_EXT arms (the
define has been commented out in htsconfig.h for years) and the ishtest
variable only they read.

-#test=savename gains an optional Content-Disposition argument, and the
engine test now pins the filename-replacement path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* tests: pin cdispo reserved-char sanitization in the savename table

Review follow-up: adds a row where the expected name differs from the
Content-Disposition input, exercising the hostile-cdispo cleaning path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 08:38:30 +02:00
Xavier Roche
9d29b8329b Phase 0 follow-ups: residual unescape OOB, dead java plugin, FTP test strength (#475)
* htsencoding: bound the raw UTF-8 flush in hts_unescapeUrlSpecial

The completed-sequence flush memcpy ends with a 'continue' that skips the
per-byte NUL-reserve guard, so a raw multi-byte character landing at the
exact end of dest let the trailing NUL write dest[max] (1-byte OOB, found
by the post-#474 review pass; ASan-verified via the extended
-#test=unescape-bounds).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
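
The class of bug described above (a multi-byte flush that bypasses the per-byte guard reserving room for the NUL) can be sketched as follows. This is an illustration under stated assumptions, not the hts_unescapeUrlSpecial code: flush_seq() and its parameters are hypothetical names, and dest is assumed to hold exactly max bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: append the n bytes of a completed sequence at
   offset *j, but only if a trailing NUL still fits inside dest[0..max-1].
   Returns 1 on success, 0 if the sequence was rejected. */
static int flush_seq(char *dest, size_t max, size_t *j, const char *seq,
                     size_t n) {
  assert(dest != NULL && j != NULL && seq != NULL);
  assert(*j < max); /* there is always room for at least the NUL */
  if (n >= max - *j)
    return 0; /* n bytes plus the NUL would not fit */
  memcpy(dest + *j, seq, n);
  *j += n;
  dest[*j] = '\0'; /* *j <= max - 1 is guaranteed by the check above */
  return 1;
}
```

The point of the check is that the multi-byte path reserves the terminator itself instead of relying on a guard that a 'continue' can skip.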

* Resurrect the java .class parser on modern Unix builds

The plugin was dead three ways on a current Linux build: hts_plug() was
compiled hidden (-fvisibility=hidden, EXTERNAL_FUNCTION expanded to nothing
on ELF), hts_create_opt() dlopens libhtsjava.so.2 which no longer exists
since the soname moved to .so.3, and JAVA_HEADER.magic is 'unsigned long'
(8 bytes on LP64) under a 10-byte fread, so major/count came from
uninitialized bytes and the 0xCAFEBABE check never matched.

EXTERNAL_FUNCTION now forces default visibility on ELF, the dlopen name is
derived from VERSION_INFO at configure time, and the header fields are
fixed-width. 31_local-javaclass.test crawls a generated .class and asserts
a resource named only in its constant pool is fetched; it fails if any of
the three regresses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
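
To see why the header field widths matter: a .class file starts with a 4-byte magic followed by three 2-byte big-endian fields (minor, major, constant_pool_count), 10 bytes in total. The sketch below is not the commit's fix, which keeps the fread and makes the struct fields fixed-width; it decodes the bytes explicitly instead, a swapped-in technique that sidesteps both struct layout and host endianness. class_header and parse_class_header() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width view of the first 10 bytes of a .class file. */
typedef struct {
  uint32_t magic; /* expected to be 0xCAFEBABE */
  uint16_t minor;
  uint16_t major;
  uint16_t count; /* constant_pool_count */
} class_header;

/* Decode the big-endian header from buf; returns 1 if at least 10 bytes
   are available and the magic matches, 0 otherwise. */
static int parse_class_header(const unsigned char *buf, size_t len,
                              class_header *h) {
  assert(buf != NULL && h != NULL);
  if (len < 10)
    return 0;
  h->magic = ((uint32_t) buf[0] << 24) | ((uint32_t) buf[1] << 16) |
             ((uint32_t) buf[2] << 8) | (uint32_t) buf[3];
  h->minor = (uint16_t) ((buf[4] << 8) | buf[5]);
  h->major = (uint16_t) ((buf[6] << 8) | buf[7]);
  h->count = (uint16_t) ((buf[8] << 8) | buf[9]);
  return h->magic == 0xCAFEBABEu;
}
```

With an 8-byte magic field, the same 10 bytes would fill the magic and leave only 2 bytes for everything after it, which matches the symptom described above.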

* htsftp: assert nonzero buffer sizes, harden the userpass self-test

ftp_split_userpass underflows its size-1 math on a zero size; assert the
precondition now that the function is public in htsftp.h. The self-test
gains a tight-size run with guard bytes and exact-content checks, which the
256-byte buffers alone could not fail on an off-by-one.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
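
The size-1 underflow and the guard-byte testing idea can both be shown on a small stand-in. This is a sketch under assumptions, not ftp_split_userpass itself: copy_bounded() is a hypothetical function that only shares the "size - 1" arithmetic with it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bounded copy: writes at most size - 1 bytes of src, then a
   NUL. With size == 0, size - 1 would wrap to SIZE_MAX, so the
   precondition is asserted up front. */
static void copy_bounded(char *dst, size_t size, const char *src) {
  size_t n;

  assert(dst != NULL && src != NULL);
  assert(size > 0);
  n = strlen(src);
  if (n > size - 1)
    n = size - 1; /* truncate, leaving room for the terminator */
  memcpy(dst, src, n);
  dst[n] = '\0';
}
```

A tight-size check fills a 6-byte array with a sentinel such as 0x7f, calls copy_bounded(buf, 4, "abcdef"), and then compares all 6 bytes: the first four must be "abc" plus NUL and the last two must still be the sentinel. An off-by-one write corrupts a guard byte and fails the comparison, which a roomy 256-byte buffer would never reveal.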

* configure: use the dylib plugin name on Darwin

libtool names the module libhtsjava.N.dylib there, so the .so.N form can
never load; caught by 31_local-javaclass.test on the macOS CI job (the old
hardcoded .so.2 was just as dead, silently).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

* tests: cover the %xx-encoded UTF-8 flush path in unescape-bounds

The raw-byte cases never take the utfBufferJ = lastJ rollback branch, so a
wrong flush offset there would have passed (review finding).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

---------

Signed-off-by: Xavier Roche <roche@httrack.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 23:04:02 +02:00
7 changed files with 300 additions and 86 deletions

@@ -167,8 +167,24 @@ static int wire_patches_ext(httrackp *opt, const char *wiremime,
return 1;
}
// forme le nom du fichier à sauver (save) à partir de fil et adr
// système intelligent, qui renomme en cas de besoin (exemple: deux INDEX.HTML et index.html)
/* Wire-metadata name change: a Content-Disposition filename wins (returns 2),
else the declared type's ext when wire_patches_ext() allows (returns 1),
else 0. ext receives the new extension or replacement filename. */
static int resolve_extension(httrackp *opt, const char *cdispo,
const char *contenttype, const char *fil,
char *ext, size_t ext_size) {
if (strnotempty(cdispo)) {
strlcpybuff(ext, cdispo, ext_size);
return 2;
}
if (wire_patches_ext(opt, contenttype, fil) &&
give_mimext(ext, ext_size, contenttype))
return 1;
return 0;
}
// Build the local save name (save) from adr/fil; renames on collision
// (e.g. INDEX.HTML vs index.html).
int url_savename(lien_adrfilsave *const afs,
lien_adrfil *const former,
const char *referer_adr, const char *referer_fil,
@@ -405,45 +421,23 @@ int url_savename(lien_adrfilsave *const afs,
// if the check_type option is enabled
if (is_html < 0 && opt->check_type && !ext_chg) {
int ishtest = 0;
if (protocol != PROTOCOL_FILE
&& protocol != PROTOCOL_FTP
) {
// probe the type with a HEAD request when the file type is unknown
if (!((opt->check_type == 1) && (fil[strlen(fil) - 1] == '/'))) // must a trailing slash be html?
if (opt->savename_delayed == HTS_SAVENAME_DELAYED_HARD ||
(ishtest = ishtml(opt, fil)) <
0) { // unsure whether it's html or a file
ishtml(opt, fil) < 0) { // unsure whether it's html or a file
// read from the cache
htsblk r = cache_read_including_broken(opt, cache, adr, fil); // test only
if (r.statuscode != -1) { // pas d'erreur de lecture cache
char s[32];
s[0] = '\0';
if (r.statuscode != -1) { // cache entry read OK
hts_log_print(opt, LOG_DEBUG, "Testing link type (from cache) %s%s",
adr_complete, fil_complete);
if (!HTTP_IS_REDIRECT(r.statuscode)) {
if (strnotempty(r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, r.cdispo);
} else if (wire_patches_ext(opt, r.contenttype, fil)) {
if (give_mimext(s, sizeof(s),
r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
}
ext_chg = resolve_extension(opt, r.cdispo, r.contenttype, fil,
ext, sizeof(ext));
}
#ifdef DEFAULT_BIN_EXT
// no extension and potentially bogus
else if (ishtest == -2) {
ext_chg = 1;
strcpybuff(ext, DEFAULT_BIN_EXT + 1);
}
#endif
//
} else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_HARD &&
is_userknowntype(opt, fil)) { /* PATCH BY BRIAN SCHRÖDER.
Lookup mimetype not only by extension,
@@ -467,22 +461,11 @@ int url_savename(lien_adrfilsave *const afs,
// fail later
else if (opt->savename_delayed != HTS_SAVENAME_DELAYED_NONE &&
!opt->state.stop) {
// Check if the file is ready in backing. We basically take the same logic as later.
// FIXME: we should cleanup and factorize this unholy mess
// Check if the file is ready in backing.
if (headers != NULL && headers->status >= 0 && !is_redirect) {
if (strnotempty(headers->r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, headers->r.cdispo);
} else if (wire_patches_ext(opt, headers->r.contenttype,
headers->url_fil)) {
char s[16];
if (give_mimext(
s, sizeof(s),
headers->r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
}
ext_chg = resolve_extension(opt, headers->r.cdispo,
headers->r.contenttype,
headers->url_fil, ext, sizeof(ext));
}
else if (mime_type != NULL) {
ext[0] = '\0';
@@ -500,13 +483,6 @@ int url_savename(lien_adrfilsave *const afs,
if (!may_unknown2(opt, mime_type, fil)) {
ext_chg = 1;
}
#ifdef DEFAULT_BIN_EXT
// no extension and potentially bogus
else if (ishtml(opt, fil) == -2) {
ext_chg = 1;
strcpybuff(ext, DEFAULT_BIN_EXT + 1);
}
#endif
} else {
ext_chg = 0;
}
@@ -696,30 +672,10 @@ int url_savename(lien_adrfilsave *const afs,
// free the backing slot
}
{ // pas d'erreur, changer type?
char s[16];
s[0] = '\0';
if (strnotempty(back[b].r.cdispo)) { /* filename given */
ext_chg = 2; /* change filename */
strcpybuff(ext, back[b].r.cdispo);
} else if (wire_patches_ext(opt, back[b].r.contenttype,
back[b].url_fil)) {
if (give_mimext(
s, sizeof(s),
back[b].r.contenttype)) { // recognized extension
ext_chg = 1;
strcpybuff(ext, s);
}
}
#ifdef DEFAULT_BIN_EXT
// no extension and potentially bogus
else if (ishtest == -2) {
ext_chg = 1;
strcpybuff(ext, DEFAULT_BIN_EXT + 1);
}
#endif
}
// no error: change the type?
ext_chg = resolve_extension(
opt, back[b].r.cdispo, back[b].r.contenttype,
back[b].url_fil, ext, sizeof(ext));
}
// END: if not moved, force the type?

@@ -1093,31 +1093,113 @@ static int st_resolve(httrackp *opt, int argc, char **argv) {
return 0;
}
/* Extra args are key=value: adr= cdispo= statuscode= status= strip= urlhack=
no-www= no-slash= no-query= n83= type=, plus repeatable prior=adr|fil|sav
registering an already-crawled link (dedup/collision paths). */
/* Parse raw response-header lines and print the naming-relevant fields. */
static int st_header(httrackp *opt, int argc, char **argv) {
htsblk r;
int i;
(void) opt;
if (argc < 1) {
fprintf(stderr, "header: needs at least one raw header line\n");
return 1;
}
memset(&r, 0, sizeof(r));
for (i = 0; i < argc; i++) {
char BIGSTK line[HTS_URLMAXSIZE * 2];
strcpybuff(line, argv[i]);
treathead(NULL, "www.example.com", "/", &r, line);
}
printf("contenttype=%s cdispo=%s\n", r.contenttype, r.cdispo);
return 0;
}
static int st_savename(httrackp *opt, int argc, char **argv) {
lien_adrfilsave afs;
cache_back cache;
struct_back *sback;
hash_struct hash;
lien_back headers;
const char *adr = "www.example.com";
const char *cdispo = NULL;
int statuscode = HTTP_OK, status = 0;
int i;
if (argc < 2) {
fprintf(stderr, "savename: needs a fil and a content-type\n");
return 1;
}
/* knobs first: hash_init and the prior links depend on them */
for (i = 2; i < argc; i++) {
const char *const a = argv[i];
if (strncmp(a, "adr=", 4) == 0)
adr = a + 4;
else if (strncmp(a, "cdispo=", 7) == 0)
cdispo = a + 7;
else if (strncmp(a, "statuscode=", 11) == 0)
statuscode = atoi(a + 11);
else if (strncmp(a, "status=", 7) == 0)
status = atoi(a + 7);
else if (strncmp(a, "strip=", 6) == 0)
StringCopy(opt->strip_query, a + 6);
else if (strncmp(a, "urlhack=", 8) == 0)
opt->urlhack = atoi(a + 8) ? HTS_TRUE : HTS_FALSE;
else if (strncmp(a, "no-www=", 7) == 0)
opt->no_www_dedup = atoi(a + 7) ? HTS_TRUE : HTS_FALSE;
else if (strncmp(a, "no-slash=", 9) == 0)
opt->no_slash_dedup = atoi(a + 9) ? HTS_TRUE : HTS_FALSE;
else if (strncmp(a, "no-query=", 9) == 0)
opt->no_query_dedup = atoi(a + 9) ? HTS_TRUE : HTS_FALSE;
else if (strncmp(a, "n83=", 4) == 0)
opt->savename_83 = atoi(a + 4);
else if (strncmp(a, "type=", 5) == 0)
opt->savename_type = atoi(a + 5);
else if (strncmp(a, "prior=", 6) != 0) {
fprintf(stderr, "savename: unknown arg '%s'\n", a);
return 1;
}
}
memset(&afs, 0, sizeof(afs));
strcpybuff(afs.af.adr, "www.example.com");
strcpybuff(afs.af.adr, adr);
strcpybuff(afs.af.fil, argv[0]);
memset(&cache, 0, sizeof(cache));
cache.hashtable = (void *) coucal_new(0);
sback = back_new(opt, opt->maxsoc * 32 + 1024);
/* same wiring as hts_mirror (htscore.c) */
hash_init(opt, &hash, opt->urlhack);
hash.liens = (const lien_url *const *const *) &opt->liens;
opt->hash = &hash;
hts_record_init(opt);
for (i = 2; i < argc; i++) {
if (strncmp(argv[i], "prior=", 6) == 0) {
char *dup = strdupt(argv[i] + 6);
char *const p1 = strchr(dup, '|');
char *const p2 = p1 != NULL ? strchr(p1 + 1, '|') : NULL;
if (p2 == NULL) {
fprintf(stderr, "savename: prior needs adr|fil|sav\n");
return 1;
}
*p1 = *p2 = '\0';
if (!hts_record_link(opt, dup, p1 + 1, p2 + 1, "", "", NULL))
return 1;
freet(dup);
}
}
memset(&headers, 0, sizeof(headers));
headers.status = 0;
headers.r.statuscode = HTTP_OK;
headers.status = status;
headers.r.statuscode = statuscode;
strcpybuff(headers.r.contenttype, argv[1]);
if (cdispo != NULL)
strcpybuff(headers.r.cdispo, cdispo);
strcpybuff(headers.url_fil, argv[0]);
url_savename(&afs, NULL, NULL, NULL, opt, sback, &cache, &hash, 0, 0,
@@ -1924,8 +2006,10 @@ static const struct selftest_entry {
st_relative},
{"resolve", "<link> <adr> <fil>", "resolve a link against an origin",
st_resolve},
{"savename", "<fil> <content-type>", "local save-name for a URL",
st_savename},
{"header", "<raw-header-line> ...", "response header-line parsing",
st_header},
{"savename", "<fil> <content-type> [key=value ...]",
"local save-name for a URL", st_savename},
{"cache", "<dir>", "cache read/write round-trip self-test", st_cache},
{"cache-golden", "<dir> [regen]", "frozen cache-format read self-test",
st_cache_golden},

@@ -0,0 +1,29 @@
#!/bin/bash
#
set -euo pipefail
# Response header-line parsing (treathead via -#test=header <raw-line> ...).
# Isolates the wire layer from url_savename, which strips traversal on its own.
hdr() {
local want="$1"
shift
out="$(httrack -O /dev/null -#test=header "$@" | grep '^contenttype=')"
test "$out" == "$want" || {
echo "FAIL: $* -> '$out' (want '$want')"
exit 1
}
}
hdr 'contenttype=application/pdf cdispo=' 'Content-Type: application/pdf'
# filename= is honored quoted or bare.
hdr 'contenttype= cdispo=report.pdf' \
'Content-Disposition: attachment; filename="report.pdf"'
hdr 'contenttype= cdispo=report.pdf' \
'Content-Disposition: attachment; filename=report.pdf'
# Path components in the filename are dropped on the wire (RFC 2616).
hdr 'contenttype= cdispo=evil.pdf' \
'Content-Disposition: attachment; filename="../../evil.pdf"'

@@ -3,13 +3,30 @@
set -euo pipefail
# Local save-name extension resolution (url_savename via -#test=savename <fil> <content-type>).
# Asserts on the basename of "savename: <path>".
# Local save-name resolution (url_savename via -#test=savename <fil> <content-type> [key=value ...]).
# name() asserts on the basename, full() on the whole path; prior= registers an
# already-crawled link whose sav is rooted under the -O path (/dev/null here).
run() {
httrack -O /dev/null -#test=savename "$@" | sed -n 's/^savename: //p'
}
name() {
out="$(httrack -O /dev/null -#test=savename "$1" "$2" | sed -n 's/^savename: //p')"
test "${out##*/}" == "$3" || {
echo "FAIL: '$1' '$2' -> '$out' (want '$3')"
local fil="$1" ctype="$2" want="$3"
shift 3
out="$(run "$fil" "$ctype" "$@")"
test "${out##*/}" == "$want" || {
echo "FAIL: '$fil' '$ctype' $* -> '$out' (want '$want')"
exit 1
}
}
full() {
local fil="$1" ctype="$2" want="$3"
shift 3
out="$(run "$fil" "$ctype" "$@")"
test "$out" == "$want" || {
echo "FAIL: '$fil' '$ctype' $* -> '$out' (want '$want')"
exit 1
}
}
@@ -39,3 +56,86 @@ name '/types/data.json' 'application/json' 'data.json'
# Agreeing type must not rewrite the extension's casing (no strip-and-reappend).
name '/x.JPG' 'image/jpeg' 'x.JPG'
# A Content-Disposition filename replaces the URL name outright.
name '/x.php' 'application/pdf' 'report.pdf' cdispo=report.pdf
name '/download' 'text/html' 'setup.exe' cdispo=setup.exe
# Reserved characters in a hostile Content-Disposition name are sanitized.
name '/x.php' 'application/pdf' 'set_up.exe' 'cdispo=set:up.exe'
# The md5-of-query suffix lands inside a Content-Disposition name too.
name '/x.php?id=1' 'application/pdf' 'report681a.pdf' cdispo=report.pdf
# Still-downloading path (status=-1): mime drives the ext, cdispo is ignored
# there (the deliberately unfolded 4th resolve_extension variant).
name '/x.pdf' 'text/html' 'x.html' status=-1
name '/x.html' 'text/html' 'x.html' status=-1
name '/x.php' 'application/pdf' 'x.pdf' status=-1 cdispo=report.pdf
# A redirect answer resolves nothing: delayed placeholder name.
name '/x.php' 'text/html' 'x.0.delayed' statuscode=301
# Root and query-only URLs get index + the md5-of-query suffix.
name '/' 'text/html' 'index.html'
name '/?a=1' 'text/html' 'index3872.html'
# Same URL crawled before: reuse its sav verbatim (case preserved).
full '/X.PHP' 'text/html' 'www.example.com/CASE.HTML' \
'prior=www.example.com|/X.PHP|www.example.com/CASE.HTML'
# Another URL owns the name: collision suffix -2, then -3, case-insensitively.
name '/x.php' 'text/html' 'x-2.html' \
'prior=www.example.com|/other.html|/dev/null/www.example.com/x.html'
name '/x.php' 'text/html' 'x-3.html' \
'prior=www.example.com|/o1.html|/dev/null/www.example.com/x.html' \
'prior=www.example.com|/o2.html|/dev/null/www.example.com/x-2.html'
name '/INDEX.HTML' 'text/html' 'INDEX-2.HTML' \
'prior=www.example.com|/index.html|/dev/null/www.example.com/index.html'
# Same basename in another directory is NOT a collision.
name '/x.php' 'text/html' 'x.html' \
'prior=www.example.com|/sub/x.html|/dev/null/www.example.com/sub/x.html'
# 8-3 modes: DOS truncates every component to 8+3, ISO9660 level 2 to 31.
full '/directory-long/verylongfilename.html' 'text/html' \
'/dev/null/EXAMPLE/DIRECTOR/VERYLONG.HTM' n83=1
full '/directory-long/verylongfilename.html' 'text/html' \
'/dev/null/EXAMPLE_C/DIRECTORY_LONG/VERYLONGFILENAME.HTM' n83=2
name '/verylongfilename.php' 'text/html' 'VERYLO-2.HTM' n83=1 \
'prior=www.example.com|/other.html|/dev/null/EXAMPLE/VERYLONG.HTM'
# urlhack dedup (#271): // collapse and www-strip map to the prior link's sav;
# the per-feature negatives opt out and take a fresh name.
full '/a//b.php' 'text/html' '/dev/null/www.example.com/a/PRIOR.html' \
'prior=www.example.com|/a/b.php|/dev/null/www.example.com/a/PRIOR.html'
full '/a//b.php' 'text/html' '/dev/null/www.example.com/a/b.html' no-slash=1 \
'prior=www.example.com|/a/b.php|/dev/null/www.example.com/a/PRIOR.html'
full '/w.php' 'text/html' '/dev/null/www.example.com/W-PRIOR.html' adr=example.com \
'prior=www.example.com|/w.php|/dev/null/www.example.com/W-PRIOR.html'
full '/w.php' 'text/html' '/dev/null/example.com/w.html' adr=example.com no-www=1 \
'prior=www.example.com|/w.php|/dev/null/www.example.com/W-PRIOR.html'
# Distinct URLs must stay distinct under urlhack (no over-normalization).
full '/a//b.php' 'text/html' '/dev/null/www.example.com/a/b.html' \
'prior=www.example.com|/a/c.php|/dev/null/www.example.com/a/C-PRIOR.html'
# --strip-query (#112): stripped key dedups onto the prior sav; without the
# option the same URLs stay distinct.
full '/page.php?id=3&sid=42' 'text/html' '/dev/null/www.example.com/PAGE-PRIOR.html' \
strip=sid 'prior=www.example.com|/page.php?id=3|/dev/null/www.example.com/PAGE-PRIOR.html'
full '/page.php?id=3&sid=42' 'text/html' '/dev/null/www.example.com/page475b.html' \
'prior=www.example.com|/page.php?id=3|/dev/null/www.example.com/PAGE-PRIOR.html'
# A kept key that differs must still block the dedup (no over-stripping).
full '/page.php?id=3&sid=42' 'text/html' '/dev/null/www.example.com/page475b.html' \
strip=sid 'prior=www.example.com|/page.php?id=4|/dev/null/www.example.com/PAGE-PRIOR.html'
# Hostile fils stay rooted under the mirror: ../ (raw or %2e-encoded) drops out,
# control characters become spaces, oversized names cap at 210 chars (the cap
# can chop the extension off entirely).
full '/../../etc/passwd' 'text/html' '/dev/null/www.example.com///etc/passwd.html'
full '/%2e%2e/%2e%2e/etc/passwd' 'text/html' '/dev/null/www.example.com///etc/passwd.html'
full '/x.php' 'application/pdf' '/dev/null/www.example.com///evil.exe' 'cdispo=../../evil.exe'
name $'/evil\rname\t.php' 'text/html' 'evil name .html'
name "/$(printf 'a%.0s' {1..300}).php" 'text/html' "$(printf 'a%.0s' {1..210})"

@@ -0,0 +1,17 @@
#!/bin/bash
#
# Content-Disposition names the saved file: the attachment filename replaces
# the URL-derived name, and a traversal filename is reduced to its last
# component, inside the mirror.
set -euo pipefail
: "${top_srcdir:=..}"
bash "$top_srcdir/tests/local-crawl.sh" --errors 0 \
--found 'cdispo/report.pdf' \
--file-matches 'cdispo/report.pdf' '%PDF' \
--not-found 'cdispo/fetch.pdf' \
--found 'cdispo/evil.pdf' \
--not-found 'evil.pdf' \
httrack 'BASEURL/cdispo/index.html'

@@ -38,6 +38,7 @@ TESTS = \
01_engine-ftp-line.test \
01_engine-ftp-userpass.test \
01_engine-hashtable.test \
01_engine-header.test \
01_engine-idna.test \
01_engine-escape-room.test \
01_engine-inplace-escape.test \
@@ -91,6 +92,7 @@ TESTS = \
28_local-pause.test \
29_local-redirect-fragment.test \
30_local-fragment-link.test \
31_local-javaclass.test
31_local-javaclass.test \
32_local-cdispo.test
CLEANFILES = check-network_sh.cache

@@ -134,12 +134,14 @@ class Handler(SimpleHTTPRequestHandler):
# --- type/extension matrix (issue #267 family) -------------------------
def send_raw(self, body, content_type):
def send_raw(self, body, content_type, extra_headers=()):
"""Send a raw body with an explicit Content-Type, or none at all when
content_type is None (to observe httrack's typeless-file naming)."""
self.send_response(200)
if content_type is not None:
self.send_header("Content-Type", content_type)
for name, value in extra_headers:
self.send_header(name, value)
self.send_header("Content-Length", str(len(body)))
self.end_headers()
if self.command != "HEAD":
@@ -354,6 +356,27 @@ class Handler(SimpleHTTPRequestHandler):
if self.command != "HEAD":
self.wfile.write(body)
# Content-Disposition naming: the attachment filename replaces the
# URL-derived name; path components in it are stripped (RFC 2616).
CDISPO_NAMES = {
"/cdispo/fetch.php": "report.pdf",
"/cdispo/evil.php": "../../evil.pdf",
}
def route_cdispo_index(self):
self.send_html(
'\t<a href="fetch.php">report</a>\n' '\t<a href="evil.php">evil</a>\n'
)
def route_cdispo(self):
filename = self.CDISPO_NAMES[urlsplit(self.path).path]
cdispo = 'attachment; filename="%s"' % filename
self.send_raw(
self.FAKE_PDF,
"application/pdf",
extra_headers=[("Content-Disposition", cdispo)],
)
# 302 whose Location carries a #fragment (#204): the fragment is a UA anchor
# that must be dropped before the target is fetched. A leaked '#' reaches the
# strict-server guard below and 400s.
@@ -406,6 +429,9 @@ class Handler(SimpleHTTPRequestHandler):
"/mimex/index.html": route_mimex_index,
"/mimex/blob.pdf": route_mimex_blob,
"/mimex/real.html": route_mimex_real,
"/cdispo/index.html": route_cdispo_index,
"/cdispo/fetch.php": route_cdispo,
"/cdispo/evil.php": route_cdispo,
"/redir/index.html": route_redir_index,
"/redir/go.php": route_redir_go,
"/redir/target.html": route_redir_target,