Mirror of https://github.com/xroche/httrack.git, synced 2026-06-14 22:33:54 +03:00
Compare commits: docs/rfc26 ... cleanup/ht (15 commits)
| Author | SHA1 | Date |
|---|---|---|
| | c4ef18f5a5 | |
| | d76dad47f7 | |
| | 9c6ff54040 | |
| | 4a057514b9 | |
| | 055e17b057 | |
| | d7bb97d697 | |
| | d741188980 | |
| | ca810ef7e3 | |
| | 1bf90ce294 | |
| | 583817dcd4 | |
| | 5351e96d71 | |
| | a0bf50f6b1 | |
| | 794404bba2 | |
| | 82d08aaeaf | |
| | 459f06e758 | |
@@ -23,7 +23,7 @@ http://www.httrack.com/

 ## Compile trunk release
 ```sh
-git clone https://github.com/xroche/httrack.git --recurse
+git clone https://github.com/xroche/httrack.git --recurse-submodules
 cd httrack
 ./configure --prefix=$HOME/usr && make -j8 && make install
 ```
@@ -181,17 +181,17 @@ used for some time.
|
||||
|
||||
<p align=justify> The rest of this manual is dedicated to detailing what
|
||||
you find in the help message and providing examples - lots and lots of
|
||||
examples... Here is what you get (page by page - use <enter> to move to
|
||||
examples... Here is what you get (page by page - use <enter> to move to
|
||||
the next page in the real program) if you type 'httrack --help':
|
||||
|
||||
<pre>
|
||||
>httrack --help
|
||||
HTTrack version 3.03BETAo4 (compiled Jul 1 2001)
|
||||
usage: ./httrack <URLs [-option] [+<FILTERs>] [-<FILTERs>]
|
||||
usage: ./httrack <URLs> [-option] [+<FILTERs>] [-<FILTERs>]
|
||||
with options listed below: (* is the default value)
|
||||
|
||||
General options:
|
||||
O path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path <param>)
|
||||
O path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path <param>)
|
||||
%O top path if no path defined (-O path_mirror[,path_cache_and_logfiles])
|
||||
|
||||
Action options:
|
||||
@@ -202,7 +202,7 @@ Action options:
|
||||
Y mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)
|
||||
|
||||
Proxy options:
|
||||
P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
|
||||
P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
|
||||
%f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])
|
||||
|
||||
Limits options:
|
||||
@@ -227,7 +227,7 @@ Links options:
|
||||
%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
|
||||
n get non-html files 'near' an html file (ex: an image located outside) (--near)
|
||||
t test all URLs (even forbidden ones) (--test)
|
||||
%L <file add all URL located in this text file (one URL per line) (--list <param>)
|
||||
%L <file> add all URL located in this text file (one URL per line) (--list <param>)
|
||||
|
||||
Build options:
|
||||
NN structure type (0 *original structure, 1+: see below) (--structure[=N])
|
||||
@@ -248,12 +248,12 @@ Spider options:
|
||||
%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
|
||||
%B tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
|
||||
%s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
|
||||
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
|
||||
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
|
||||
|
||||
Browser ID:
|
||||
F user-agent field (-F "user-agent name") (--user-agent <param>)
|
||||
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
|
||||
%l preferred language (-%l "fr, en, jp, *" (--language <param>)
|
||||
F user-agent field (-F "user-agent name") (--user-agent <param>)
|
||||
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
|
||||
%l preferred language (-%l "fr, en, jp, *" (--language <param>)
|
||||
|
||||
Log, index, cache
|
||||
C create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
|
||||
@@ -303,8 +303,8 @@ Guru options: (do NOT use)
|
||||
#! Execute a shell command (-#! "echo hello")
|
||||
|
||||
Command-line specific options:
|
||||
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
|
||||
%U run the engine with another id when called as root (-%U smith) (--user <param>)
|
||||
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
|
||||
%U run the engine with another id when called as root (-%U smith) (--user <param>)
|
||||
|
||||
Details: Option N
|
||||
N0 Site-structure (default)
|
||||
@@ -340,14 +340,14 @@ Details: User-defined option N
|
||||
%[param] param variable in query string
|
||||
|
||||
Shortcuts:
|
||||
--mirror <URLs *make a mirror of site(s) (default)
|
||||
--get <URLs get the files indicated, do not seek other URLs (-qg)
|
||||
--list <text file add all URL located in this text file (-%L)
|
||||
--mirrorlinks <URLs mirror all links in 1st level pages (-Y)
|
||||
--testlinks <URLs test links in pages (-r1p0C0I0t)
|
||||
--spider <URLs spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
|
||||
--testsite <URLs identical to --spider
|
||||
--skeleton <URLs make a mirror, but gets only html files (-p1)
|
||||
--mirror <URLs> *make a mirror of site(s) (default)
|
||||
--get <URLs> get the files indicated, do not seek other URLs (-qg)
|
||||
--list <text file> add all URL located in this text file (-%L)
|
||||
--mirrorlinks <URLs> mirror all links in 1st level pages (-Y)
|
||||
--testlinks <URLs> test links in pages (-r1p0C0I0t)
|
||||
--spider <URLs> spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
|
||||
--testsite <URLs> identical to --spider
|
||||
--skeleton <URLs> make a mirror, but gets only html files (-p1)
|
||||
--update update a mirror, without confirmation (-iC2)
|
||||
--continue continue a mirror, without confirmation (-iC1)
|
||||
|
||||
@@ -387,13 +387,13 @@ with examples... I will be here a while...
|
||||
<hr>
|
||||
<h2> Syntax </h2>
|
||||
|
||||
<pre><b><i>httrack <URLs> [-option] [+<FILTERs>] [-<FILTERs>] </i></b></pre>
|
||||
<pre><b><i>httrack <URLs> [-option] [+<FILTERs>] [-<FILTERs>] </i></b></pre>
|
||||
|
||||
<p align=justify> The syntax of httrack is quite simple. You specify
|
||||
the URLs you wish to start the process from (<URLS>), any options you
|
||||
the URLs you wish to start the process from (<URLS>), any options you
|
||||
might want to add ([-option], any filters specifying places you should
|
||||
([+<FILTERs>]) and should not ([-<FILTERs>]) go, and end the command
|
||||
line by pressing <enter>. Httrack then goes off and does your bidding.
|
||||
([+<FILTERs>]) and should not ([-<FILTERs>]) go, and end the command
|
||||
line by pressing <enter>. Httrack then goes off and does your bidding.
|
||||
For example:
|
||||
|
||||
<pre><b><i>
|
||||
@@ -425,7 +425,7 @@ site. Specifically, the defauls are:
|
||||
pN priority mode: (* p3) *3 save all files
|
||||
D *can only go down into subdirs
|
||||
a *stay on the same address
|
||||
--mirror <URLs> *make a mirror of site(s) (default)
|
||||
--mirror <URLs> *make a mirror of site(s) (default)
|
||||
</pre>
|
||||
|
||||
<p align=justify> Here's what all of that means:
|
||||
@@ -542,7 +542,7 @@ subdirectories of the starting directory to be investigated.
|
||||
search started are to be collected. Other sites they point to are not
|
||||
to be imaged.
|
||||
|
||||
<pre><b><i> --mirror <URLs> *make a mirror of site(s) (default) </i></b></pre>
|
||||
<pre><b><i> --mirror <URLs> *make a mirror of site(s) (default) </i></b></pre>
|
||||
|
||||
<p align=justify> This indicates that the program should try to make a
|
||||
copy of the site as well as it can.
|
||||
@@ -921,7 +921,7 @@ Links options:
|
||||
%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use)
|
||||
n get non-html files 'near' an html file (ex: an image located outside)
|
||||
t test all URLs (even forbidden ones)
|
||||
%L <file> add all URL located in this text file (one URL per line)
|
||||
%L <file> add all URL located in this text file (one URL per line)
|
||||
</i></b></pre>
|
||||
|
||||
<p align=justify> The links options allow you to control what links are
|
||||
@@ -1183,7 +1183,7 @@ Spider options:
|
||||
%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies)
|
||||
%B tolerant requests (accept bogus responses on some servers, but not standard!)
|
||||
%s update hacks: various hacks to limit re-transfers when updating
|
||||
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
|
||||
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3=text/html) (--assume <param>)
|
||||
</i></b></pre>
|
||||
|
||||
<p align=justify> By default, cookies are universally accepted and
|
||||
@@ -1387,7 +1387,7 @@ web servers leave footprints in the browser.
|
||||
Browser ID:
|
||||
F user-agent field (-F "user-agent name")
|
||||
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]"
|
||||
%l preferred language (-%l "fr, en, jp, *" (--language <param>)
|
||||
%l preferred language (-%l "fr, en, jp, *" (--language <param>)
|
||||
</i></b></pre>
|
||||
|
||||
<p align=justify> The user-agent field is used by browsers to determine
|
||||
@@ -1799,7 +1799,7 @@ based authentication)
|
||||
|
||||
<pre><b><i>
|
||||
Command-line specific options:
|
||||
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
|
||||
V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
|
||||
</i></b></pre>
|
||||
|
||||
<p align=justify> This option is very nice for a wide array of actions
|
||||
@@ -1811,7 +1811,7 @@ httrack http://www.shoesizes.com/bob/ -O /tmp/shoesizes -V "/bin/echo \$0"
|
||||
</i></b></pre>
|
||||
|
||||
<pre>
|
||||
%U run the engine with another id when called as root (-%U smith) (--user <param>)
|
||||
%U run the engine with another id when called as root (-%U smith) (--user <param>)
|
||||
</pre>
|
||||
|
||||
<p align=justify> Change the UID of the owner when running as r00t
|
||||
@@ -1856,14 +1856,14 @@ of other options that are commonly used.
|
||||
|
||||
<pre><b><i>
|
||||
Shortcuts:
|
||||
--mirror <URLs> *make a mirror of site(s) (default)
|
||||
--get <URLs> get the files indicated, do not seek other URLs (-qg)
|
||||
--list <text file> add all URL located in this text file (-%L)
|
||||
--mirrorlinks <URLs> mirror all links in 1st level pages (-Y)
|
||||
--testlinks <URLs> test links in pages (-r1p0C0I0t)
|
||||
--spider <URLs> spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
|
||||
--testsite <URLs> identical to --spider
|
||||
--skeleton <URLs> make a mirror, but gets only html files (-p1)
|
||||
--mirror <URLs> *make a mirror of site(s) (default)
|
||||
--get <URLs> get the files indicated, do not seek other URLs (-qg)
|
||||
--list <text file> add all URL located in this text file (-%L)
|
||||
--mirrorlinks <URLs> mirror all links in 1st level pages (-Y)
|
||||
--testlinks <URLs> test links in pages (-r1p0C0I0t)
|
||||
--spider <URLs> spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
|
||||
--testsite <URLs> identical to --spider
|
||||
--skeleton <URLs> make a mirror, but gets only html files (-p1)
|
||||
--update update a mirror, without confirmation (-iC2)
|
||||
--continue continue a mirror, without confirmation (-iC1)
|
||||
--catchurl create a temporary proxy to capture an URL or a form post URL
|
||||
@@ -2019,15 +2019,15 @@ are in reverse priority order. Here's an example:
|
||||
<td>no characters must be present after</a></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> <b> <filter>*[< NN]</b></td>
|
||||
<td> <b> <filter>*[< NN]</b></td>
|
||||
<td> size less than NN Kbytes</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> <b> <filter>*[> PP]</b></td>
|
||||
<td> <b> <filter>*[> PP]</b></td>
|
||||
<td> size more than PP Kbytes</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> <b> <filter>*[< NN > PP]</b></td>
|
||||
<td> <b> <filter>*[< NN > PP]</b></td>
|
||||
<td> size less than NN Kbytes and more than PP Kbytes</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
@@ -7,7 +7,7 @@ uk
 LANGUAGE_AUTHOR
 Andrij Shevchuk (http://programy.com.ua, http://vic-info.com.ua) \r\n
 LANGUAGE_CHARSET
-ISO-8859-5
+windows-1251
 LANGUAGE_WINDOWSID
 Ukrainian
 OK
@@ -271,8 +271,11 @@ int optalias_check(int argc, const char *const *argv, int n_arg,
   *return_argc = 1;
   if (argv[n_arg][0] == '-')
     if (argv[n_arg][1] == '-') {
-      char command[1000];
-      char param[1000];
+      /* sized to HTS_CDLMAXSIZE: a long-form option value (--user-agent,
+         --headers, ...) is copied into param, and the value is bounded by the
+         general per-argument check in htscoremain.c (HTS_CDLMAXSIZE) */
+      char command[HTS_CDLMAXSIZE];
+      char param[HTS_CDLMAXSIZE];
       char addcommand[256];

       /* */
@@ -201,8 +201,8 @@ HTSEXT_API int catch_url(T_SOC soc, char *url, char *method, char *data) {
 while(strnotempty(line)) {
   socinput(soc, line, 1000);
   treathead(NULL, NULL, NULL, &blkretour, line); // traiter
-  strcatbuff(data, line);
-  strcatbuff(data, "\r\n");
+  strlcatbuff(data, line, CATCH_URL_DATA_SIZE);
+  strlcatbuff(data, "\r\n", CATCH_URL_DATA_SIZE);
 }
 // CR/LF final de l'en tête inutile car déja placé via la ligne vide juste au dessus
 //strcatbuff(data,"\r\n");
@@ -40,6 +40,9 @@ Please visit our Website: http://www.httrack.com
 /* Library internal definictions */
 #ifdef HTS_INTERNAL_BYTECODE

+// Capacity contract for the catch_url() 'data' buffer (32 Kb).
+#define CATCH_URL_DATA_SIZE 32768
+
 // Fonctions
 void socinput(T_SOC soc, char *s, int max);

@@ -140,6 +140,93 @@ static void basic_selftests(void) {
|
||||
md5selftest();
|
||||
}
|
||||
|
||||
/* Self-tests for the htssafe.h bounded string ops (driven by httrack -#8).
|
||||
Returns 0 if every bounded operation behaved correctly, 1 otherwise.
|
||||
The abort-on-overflow guarantee is checked separately by the -#8 "overflow"
|
||||
sub-mode (it aborts the process by design). */
|
||||
static int string_safety_selftests(void) {
|
||||
char buf[8];
|
||||
|
||||
/* strcpybuff into a sized array: exact copy */
|
||||
strcpybuff(buf, "abc");
|
||||
if (strcmp(buf, "abc") != 0)
|
||||
return 1;
|
||||
|
||||
/* strcatbuff append within capacity */
|
||||
strcatbuff(buf, "de");
|
||||
if (strcmp(buf, "abcde") != 0)
|
||||
return 1;
|
||||
|
||||
/* strncatbuff appends at most N source chars */
|
||||
strcpybuff(buf, "ab");
|
||||
strncatbuff(buf, "cdef", 2);
|
||||
if (strcmp(buf, "abcd") != 0)
|
||||
return 1;
|
||||
|
||||
/* strlcpybuff: explicit-capacity copy into a pointer destination, the form
|
||||
the migration moves toward */
|
||||
{
|
||||
char storage[8];
|
||||
char *const p = storage;
|
||||
|
||||
strlcpybuff(p, "hello", sizeof(storage));
|
||||
if (strcmp(p, "hello") != 0)
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* strcpybuff into a pointer destination: routes through the unchecked
|
||||
strcpybuff_ptr_ fallback (the path the -#8 warning flags). The warning is
|
||||
intentional here; we only verify the fallback still copies correctly. */
|
||||
#if defined(__GNUC__)
|
||||
#pragma GCC diagnostic push
|
||||
#pragma GCC diagnostic ignored "-Wattribute-warning"
|
||||
#endif
|
||||
{
|
||||
char storage[8];
|
||||
char *const p = storage;
|
||||
|
||||
strcpybuff(p, "ptr");
|
||||
if (strcmp(p, "ptr") != 0)
|
||||
return 1;
|
||||
}
|
||||
#if defined(__GNUC__)
|
||||
#pragma GCC diagnostic pop
|
||||
#endif
|
||||
|
||||
/* htsbuff: bounded builder over a fixed array (append, truncating append,
|
||||
reset, and length tracking) */
|
||||
{
|
||||
char dst[8];
|
||||
htsbuff b = htsbuff_array(dst);
|
||||
|
||||
htsbuff_cat(&b, "ab");
|
||||
htsbuff_cat(&b, "cd");
|
||||
if (strcmp(htsbuff_str(&b), "abcd") != 0 || b.len != 4)
|
||||
return 1;
|
||||
|
||||
htsbuff_catn(&b, "efghij", 2); /* append at most 2 */
|
||||
if (strcmp(htsbuff_str(&b), "abcdef") != 0)
|
||||
return 1;
|
||||
|
||||
htsbuff_cpy(&b, "xyz"); /* reset */
|
||||
if (strcmp(htsbuff_str(&b), "xyz") != 0 || b.len != 3)
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* boundary: filling to exactly cap-1 must succeed (one more aborts, which the
|
||||
-#8 overflow-buff mode checks) */
|
||||
{
|
||||
char d2[4];
|
||||
htsbuff c = htsbuff_array(d2);
|
||||
|
||||
htsbuff_cat(&c, "abc");
|
||||
if (strcmp(htsbuff_str(&c), "abc") != 0 || c.len != 3)
|
||||
return 1;
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int hts_main_internal(int argc, char **argv, httrackp * opt);
|
||||
|
||||
// Main, récupère les paramètres et appelle le robot
|
||||
@@ -1787,10 +1874,6 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
   HTS_PANIC_PRINTF("Empty string given");
   htsmain_free();
   return -1;
-} else if (strlen(argv[na]) >= 256) {
-  HTS_PANIC_PRINTF("Header line string too long");
-  htsmain_free();
-  return -1;
 }
 StringCat(opt->headers, argv[na]);
 StringCat(opt->headers, "\r\n"); /* separator */
@@ -2441,6 +2524,35 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
   htsmain_free();
   return 0;
   break;
+case '8': /* string-safety selftest: httrack -#8 [overflow <bigstr>] */
+  if (na + 1 < argc
+      && strncmp(argv[na + 1], "overflow", 8) == 0) {
+    /* Deliberately exceed a sized buffer: the bounded op must
+       abort. The source comes from argv so its length is opaque
+       to the compiler (no static -Wstringop-overflow, genuine
+       runtime check). "overflow-buff" exercises htsbuff. */
+    char small[4];
+    const char *const src =
+      (na + 2 < argc) ? argv[na + 2] : "overflowing";
+
+    if (strcmp(argv[na + 1], "overflow-buff") == 0) {
+      htsbuff b = htsbuff_array(small);
+
+      htsbuff_cat(&b, src);
+    } else {
+      strcpybuff(small, src);
+    }
+    printf("strsafe: NOT aborted\n"); /* must be unreachable */
+    htsmain_free();
+    return 1;
+  } else {
+    const int err = string_safety_selftests();
+
+    printf("strsafe: %s\n", err ? "FAIL" : "OK");
+    htsmain_free();
+    return err;
+  }
+  break;
 case '7': // hashtable selftest: httrack -#7 nb_entries
   basic_selftests();
   if (++na < argc) {
@@ -2691,11 +2803,6 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
   return -1;
 } else {
   na++;
-  if (strlen(argv[na]) >= 126) {
-    HTS_PANIC_PRINTF("User-agent length too long");
-    htsmain_free();
-    return -1;
-  }
   StringCopy(opt->user_agent, argv[na]);
   if (StringNotEmpty(opt->user_agent))
     opt->user_agent_send = 1;
@@ -409,7 +409,7 @@ void help_catchurl(const char *dest_path) {
 if (soc != INVALID_SOCKET) {
   char BIGSTK url[HTS_URLMAXSIZE * 2];
   char method[32];
-  char BIGSTK data[32768];
+  char BIGSTK data[CATCH_URL_DATA_SIZE];

   url[0] = method[0] = data[0] = '\0';
   //
@@ -121,6 +121,7 @@ const char *hts_detect[] = {
   "lowsrc",
   "profile", // element META
   "src",
+  "srcset", // HTML5 responsive images (<img>, <source>)
   "swurl",
   "url",
   "usemap",
@@ -877,7 +878,7 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,
                   const char *xsend, const char *adr, const char *fil,
                   const char *referer_adr, const char *referer_fil,
                   htsblk * retour) {
-  char BIGSTK buffer_head_request[8192];
+  char BIGSTK buffer_head_request[16384];
   buff_struct bstr = { buffer_head_request, sizeof(buffer_head_request), 0 };

   //int use_11=0; // HTTP 1.1 utilisé
@@ -532,6 +532,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
int valid_p = 0; // force to take p even if == 0
|
||||
int ending_p = '\0'; // ending quote?
|
||||
int archivetag_p = 0; // avoid multiple-archives with commas
|
||||
int srcset_p = 0; // srcset="url1 480w, url2 2x": list of URLs
|
||||
int unquoted_script = 0;
|
||||
INSCRIPT inscript_state_pos_prev = inscript_state_pos;
|
||||
|
||||
@@ -1050,6 +1051,12 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
if (strcmp(hts_detect[i], "archive") == 0) {
|
||||
archivetag_p = 1;
|
||||
}
|
||||
/* srcset: a comma-list of candidate URLs, each split
|
||||
out and rewritten below (#235, #236) */
|
||||
else if (strcmp(hts_detect[i], "srcset") == 0
|
||||
|| strcmp(hts_detect[i], "data-srcset") == 0) {
|
||||
srcset_p = 1;
|
||||
}
|
||||
}
|
||||
i++;
|
||||
}
|
||||
@@ -1815,6 +1822,14 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
html++; // sauter # pour usemap etc
|
||||
}
|
||||
}
|
||||
srcset_next:
|
||||
/* srcset: skip leading whitespace/commas before each candidate;
|
||||
the skipped bytes flush verbatim below */
|
||||
if (srcset_p) {
|
||||
while(html < r->adr + r->size
|
||||
&& (is_realspace(*html) || *html == ','))
|
||||
INCREMENT_CURRENT_ADR(1);
|
||||
}
|
||||
eadr = html;
|
||||
|
||||
// ne pas flusher après code si on doit écrire le codebase avant!
|
||||
@@ -1844,6 +1859,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
if ((*eadr == quote && (!quoteinscript || *(eadr - 1) == '\\')) // end quote
|
||||
|| (noquote && (*eadr == '\"' || *eadr == '\'')) // end at any quote
|
||||
|| (!noquote && quote == '\0' && is_realspace(*eadr)) // unquoted href
|
||||
|| srcset_p // whitespace ends a srcset candidate URL
|
||||
) // si pas d'attente de quote spéciale ou si quote atteinte
|
||||
ok = 0;
|
||||
} else if (ending_p && (*eadr == ending_p))
|
||||
@@ -1872,6 +1888,16 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
break; // \" ou \' point d'arrêt
|
||||
case '?': /*quote_adr=adr; */
|
||||
break; // noter position query
|
||||
case ',':
|
||||
if (srcset_p) {
|
||||
/* split only on a trailing comma; one inside the URL
|
||||
(data: URI, CDN path) is kept, per the WHATWG algo */
|
||||
const char *const n = eadr + 1;
|
||||
|
||||
if (n >= r->adr + r->size || is_space(*n) || *n == ',')
|
||||
ok = 0;
|
||||
}
|
||||
break;
|
||||
}
|
||||
}
|
||||
//}
|
||||
@@ -3250,6 +3276,28 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
}
|
||||
// adr=eadr-1; // ** sauter
|
||||
|
||||
/* srcset candidate loop: skip the descriptor and comma, then
|
||||
re-enter the capture for the next URL. Backward goto, not a loop:
|
||||
the per-candidate body is this whole block. */
|
||||
if (srcset_p && ok == 0) {
|
||||
const char *const endp = r->adr + r->size;
|
||||
const char *q = html;
|
||||
while(q < endp && *q != '\0' && *q != ',' && *q != quote
|
||||
&& *q != '<' && *q != '>' && (unsigned char) *q >= 32)
|
||||
q++; // skip the descriptor
|
||||
if (q < endp && *q == ',') {
|
||||
q++;
|
||||
while(q < endp && (is_realspace(*q) || *q == ','))
|
||||
q++; // skip whitespace and empty candidates
|
||||
if (q < endp && *q != '\0' && *q != ',' && *q != quote
|
||||
&& *q != '<' && *q != '>' && (unsigned char) *q >= 32) {
|
||||
INCREMENT_CURRENT_ADR(q - html); // keep the automate in sync
|
||||
ok = 1;
|
||||
goto srcset_next;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/* We skipped bytes and skip the " : reset state */
|
||||
/*if (inscript) {
|
||||
inscript_state_pos = INSCRIPT_START;
|
||||
|
||||
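The srcset hunks above end a candidate URL at whitespace or at a trailing comma, and keep a comma that sits inside the URL (data: URI, CDN path). The splitting rule on its own can be sketched as below. This is a minimal standalone illustration, not HTTrack code: `srcset_next_url` and its signature are hypothetical, and the sketch ignores quoting, link rewriting, and the parser state that htsparse() maintains.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of HTTrack): copy the next candidate URL
   from *pos into url (capacity cap, silently truncated if too small) and
   advance *pos past the URL and its descriptor.
   Returns 1 if a URL was found, 0 at the end of the list. */
static int srcset_next_url(const char **pos, char *url, size_t cap) {
  const char *p = *pos;
  size_t n = 0;

  if (cap == 0)
    return 0;

  /* skip leading whitespace and empty candidates */
  while (*p != '\0' && (isspace((unsigned char) *p) || *p == ','))
    p++;
  if (*p == '\0') {
    *pos = p;
    return 0;
  }

  /* the URL runs up to the next whitespace */
  while (*p != '\0' && !isspace((unsigned char) *p)) {
    /* a comma ends the URL only when it is trailing, that is when it is
       followed by whitespace, another comma, or the end of the string */
    if (*p == ','
        && (p[1] == '\0' || isspace((unsigned char) p[1]) || p[1] == ','))
      break;
    if (n + 1 < cap)
      url[n++] = *p;
    p++;
  }
  url[n] = '\0';

  /* skip the descriptor ("480w", "2x") up to the separating comma */
  while (*p != '\0' && *p != ',')
    p++;
  *pos = p;
  return 1;
}
```

Called in a loop on `"a.png 480w, data:image/png;base64,AAAA 2x"`, the sketch yields `a.png` and then the whole data: URI, because the comma inside the URI is followed by a non-space character.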
145
src/htssafe.h
@@ -123,41 +123,111 @@ static HTS_UNUSED void htssafe_compile_time_check_(void) {
|
||||
(void) check_pointer;
|
||||
}
|
||||
|
||||
/*
|
||||
* Pointer-destination diagnostics for the buff() macros (GCC/Clang, C only).
|
||||
*
|
||||
* strcpybuff()/strcatbuff()/strncatbuff() bounds-check only when the
|
||||
* destination is a sized char[] array (HTS_IS_CHAR_BUFFER). For a bare char*
|
||||
* the capacity is unknown, so the macro silently falls back to plain
|
||||
* strcpy()/strcat()/strncat() while still looking like a checked call.
|
||||
*
|
||||
* These stubs route that pointer case through __builtin_choose_expr() so the
|
||||
* 'warning' attribute fires only at pointer-destination sites; array sites use
|
||||
* the bounded *_safe_ helpers and stay quiet. The warning names the
|
||||
* explicit-size replacement (strlcpybuff/strlcatbuff). Diagnostic only: no
|
||||
* runtime or ABI change, built only on GCC/Clang in C mode. Other compilers
|
||||
* (MSVC, ...) keep the previous behavior via the #else branches.
|
||||
*/
|
||||
#if (defined(__GNUC__) && !defined(__cplusplus))
|
||||
#if defined(__has_attribute)
|
||||
#if __has_attribute(warning)
|
||||
#define HTS_BUFF_PTR_ATTR(msg) __attribute__((unused, noinline, warning(msg)))
|
||||
#endif
|
||||
#endif
|
||||
#ifndef HTS_BUFF_PTR_ATTR
|
||||
/* 'warning' attribute unavailable: keep noinline so the migration can still
|
||||
grep for these symbols, but no compile-time diagnostic is emitted. */
|
||||
#define HTS_BUFF_PTR_ATTR(msg) __attribute__((unused, noinline))
|
||||
#endif
|
||||
|
||||
HTS_BUFF_PTR_ATTR("strcpybuff() destination is a pointer (capacity unknown): "
|
||||
"NOT bounds-checked; use strlcpybuff(dst, src, size)")
|
||||
static char *strcpybuff_ptr_(char *dest, const char *src) {
|
||||
return strcpy(dest, src);
|
||||
}
|
||||
|
||||
HTS_BUFF_PTR_ATTR("strcatbuff() destination is a pointer (capacity unknown): "
|
||||
"NOT bounds-checked; use strlcatbuff(dst, src, size)")
|
||||
static char *strcatbuff_ptr_(char *dest, const char *src) {
|
||||
return strcat(dest, src);
|
||||
}
|
||||
|
||||
HTS_BUFF_PTR_ATTR("strncatbuff() destination is a pointer (capacity unknown): "
|
||||
"NOT bounds-checked; use strlcatbuff(dst, src, size)")
|
||||
static char *strncatbuff_ptr_(char *dest, const char *src, size_t n) {
|
||||
return strncat(dest, src, n);
|
||||
}
|
||||
#endif
|
||||
|
||||
/**
|
||||
* Append at most N characters from "B" to "A".
|
||||
* If "A" is a char[] variable whose size is not sizeof(char*), then the size
|
||||
* is assumed to be the capacity of this array.
|
||||
*/
|
||||
#if (defined(__GNUC__) && !defined(__cplusplus))
|
||||
#define strncatbuff(A, B, N) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \
|
||||
strncat_safe_(A, sizeof(A), B, \
|
||||
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \
|
||||
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__), \
|
||||
strncatbuff_ptr_((A), (B), (N)) )
|
||||
#else
|
||||
#define strncatbuff(A, B, N) \
|
||||
( HTS_IS_NOT_CHAR_BUFFER(A) \
|
||||
? strncat(A, B, N) \
|
||||
: strncat_safe_(A, sizeof(A), B, \
|
||||
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), N, \
|
||||
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) )
|
||||
#endif
|
||||
|
||||
/**
|
||||
* Append characters of "B" to "A".
|
||||
* If "A" is a char[] variable whose size is not sizeof(char*), then the size
|
||||
* is assumed to be the capacity of this array.
|
||||
*/
|
||||
#if (defined(__GNUC__) && !defined(__cplusplus))
|
||||
#define strcatbuff(A, B) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \
|
||||
strncat_safe_(A, sizeof(A), B, \
|
||||
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \
|
||||
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__), \
|
||||
strcatbuff_ptr_((A), (B)) )
|
||||
#else
|
||||
#define strcatbuff(A, B) \
|
||||
( HTS_IS_NOT_CHAR_BUFFER(A) \
|
||||
? strcat(A, B) \
|
||||
: strncat_safe_(A, sizeof(A), B, \
|
||||
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), (size_t) -1, \
|
||||
"overflow while appending '" #B "' to '"#A"'", __FILE__, __LINE__) )
|
||||
#endif
|
||||
|
||||
/**
|
||||
* Copy characters from "B" to "A".
|
||||
* If "A" is a char[] variable whose size is not sizeof(char*), then the size
|
||||
* is assumed to be the capacity of this array.
|
||||
*/
|
||||
#if (defined(__GNUC__) && !defined(__cplusplus))
|
||||
#define strcpybuff(A, B) __builtin_choose_expr( HTS_IS_CHAR_BUFFER(A), \
|
||||
strcpy_safe_(A, sizeof(A), B, \
|
||||
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
|
||||
"overflow while copying '" #B "' to '"#A"'", __FILE__, __LINE__), \
|
||||
strcpybuff_ptr_((A), (B)) )
|
||||
#else
|
||||
#define strcpybuff(A, B) \
|
||||
( HTS_IS_NOT_CHAR_BUFFER(A) \
|
||||
? strcpy(A, B) \
|
||||
: strcpy_safe_(A, sizeof(A), B, \
|
||||
HTS_IS_NOT_CHAR_BUFFER(B) ? (size_t) -1 : sizeof(B), \
|
||||
"overflow while copying '" #B "' to '"#A"'", __FILE__, __LINE__) )
|
||||
#endif
|
||||
|
||||
/**
|
||||
* Append characters of "B" to "A", "A" having a maximum capacity of "S".
|
||||
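The hunk above dispatches on whether the destination is a sized array or a bare pointer, using `__builtin_choose_expr()`. The sketch below shows that dispatch in isolation, assuming GCC or Clang in C mode. `IS_ARRAY`, `COPY`, `copy_bounded` and `copy_unbounded` are hypothetical names for illustration; they stand in for `HTS_IS_CHAR_BUFFER`, `strcpybuff`, `strcpy_safe_` and `strcpybuff_ptr_`, whose real definitions differ (the real bounded helper aborts on overflow rather than returning an error).

```c
#include <stddef.h>
#include <string.h>

/* 1 when A is a true array, 0 when it is a pointer (GCC/Clang builtins).
   An array type is never compatible with the type of &A[0]. */
#define IS_ARRAY(A) \
  (!__builtin_types_compatible_p(__typeof__(A), __typeof__(&(A)[0])))

/* Hypothetical bounded copy: 0 on success, -1 if src does not fit in cap. */
static int copy_bounded(char *dst, size_t cap, const char *src) {
  const size_t n = strlen(src);

  if (n >= cap)
    return -1;
  memcpy(dst, src, n + 1);
  return 0;
}

/* Hypothetical unchecked fallback: capacity unknown, plain strcpy().
   Returns 1 so callers can tell that no bounds check took place. */
static int copy_unbounded(char *dst, const char *src) {
  strcpy(dst, src);
  return 1;
}

/* Selected at compile time; the branch that is not chosen is not evaluated. */
#define COPY(A, B) __builtin_choose_expr(IS_ARRAY(A), \
    copy_bounded((A), sizeof(A), (B)), \
    copy_unbounded((A), (B)))
```

With `char buf[8]`, `COPY(buf, "abc")` takes the bounded path; with `char *p`, `COPY(p, "abc")` takes the unchecked path, which is the call site the diff attaches its `warning` attribute to.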
@@ -217,6 +287,81 @@ static HTS_INLINE HTS_UNUSED char* strcpy_safe_(char *const dest, const size_t s
|
||||
return strncat_safe_(dest, sizeof_dest, source, sizeof_source, (size_t) -1, exp, file, line);
|
||||
}
|
||||
|
||||
/**
|
||||
* htsbuff: a non-owning bounded string builder over a fixed buffer.
|
||||
*
|
||||
* Companion to the strcpybuff()/strcatbuff() macros for the common case of a
|
||||
* cursor walking a buffer of known capacity (building a name into a fixed
|
||||
* array, assembling a status line, etc.). It tracks the write position, bounds
|
||||
* every write against the real capacity, and aborts on overflow (same contract
|
||||
 * as the *_safe_ helpers), so the error-prone manual "p += strlen(p)" dance
 * goes away.
 *
 * Build one from an in-scope array with htsbuff_array() (capacity via sizeof,
 * so pass an array, not a pointer), or from a pointer of known capacity with
 * htsbuff_ptr(). The buffer is kept NUL-terminated; htsbuff_str() returns it.
 */
typedef struct {
  char *buf; /* backing buffer (kept NUL-terminated) */
  size_t cap; /* total capacity of buf, including the NUL */
  size_t len; /* current length, excluding the NUL */
} htsbuff;

static HTS_INLINE HTS_UNUSED htsbuff htsbuff_ptr_(char *buf, size_t cap) {
  htsbuff b;
  b.buf = buf;
  b.cap = cap;
  b.len = 0;
  assertf(cap != 0);
  buf[0] = '\0';
  return b;
}

/**
 * Builder over the in-scope array ARR (capacity = sizeof(ARR)).
 * On GCC/Clang this rejects a non-array (e.g. a char* pointer), whose sizeof
 * would be the pointer size and silently wrong; use htsbuff_ptr() for pointers.
 * On other compilers there is no such guard, so pass only true arrays there.
 */
#if (defined(__GNUC__) && !defined(__cplusplus))
/* 0 for an array, a -1 array-size compile error for a pointer. */
#define htsbuff_must_be_array_(A) \
  (sizeof(char[1 - 2 * !!__builtin_types_compatible_p(typeof(A), typeof(&(A)[0]))]) - 1)
#define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR) + htsbuff_must_be_array_(ARR))
#else
#define htsbuff_array(ARR) htsbuff_ptr_((ARR), sizeof(ARR))
#endif
/** Builder over pointer P of known capacity N (N includes the NUL). */
#define htsbuff_ptr(P, N) htsbuff_ptr_((P), (N))

/** Append at most n characters of s (stopping at its NUL). Aborts on overflow. */
static HTS_INLINE HTS_UNUSED void htsbuff_catn(htsbuff *b, const char *s, size_t n) {
  const size_t add = strnlen(s, n);
  /* Overflow-safe: keep the (potentially huge) 'add' alone on one side. The
     maintained invariant len < cap makes 'cap - len' >= 1 (no underflow), so
     'add < cap - len' cannot wrap the way 'len + add < cap' could. */
  assertf__(add < b->cap - b->len, "htsbuff append overflow", __FILE__, __LINE__);
  memcpy(b->buf + b->len, s, add);
  b->len += add;
  b->buf[b->len] = '\0';
}

/** Append s. Aborts on overflow. */
static HTS_INLINE HTS_UNUSED void htsbuff_cat(htsbuff *b, const char *s) {
  htsbuff_catn(b, s, (size_t) -1);
}

/** Reset content to s. Aborts on overflow. */
static HTS_INLINE HTS_UNUSED void htsbuff_cpy(htsbuff *b, const char *s) {
  b->len = 0;
  htsbuff_catn(b, s, (size_t) -1);
}

/** Current NUL-terminated content. */
static HTS_INLINE HTS_UNUSED const char *htsbuff_str(const htsbuff *b) {
  return b->buf;
}

#define malloct(A) malloc(A)
#define calloct(A,B) calloc((A), (B))
#define freet(A) do { if ((A) != NULL) { free(A); (A) = NULL; } } while(0)

71
tests/01_engine-cmdline.test
Executable file
@@ -0,0 +1,71 @@
#!/bin/bash
#

# Offline command-line option tests (no network). The -F user-agent and -%X
# raw-header values used to be rejected past 126 / 256 bytes (#152); they are
# now bounded only by the general per-argument cap (HTS_CDLMAXSIZE). A value up
# to that cap is accepted on both the short (-F, -%X) and long (--user-agent,
# --headers) forms, and an over-cap value is refused cleanly rather than
# overrunning a fixed scratch buffer.

set -u

tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_cmdline.XXXXXX") || exit 1
trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM

echo '<html><body>hello</body></html>' >"$tmp/index.html"

# a string of N repeated 'A' characters
nchars() {
  printf 'A%.0s' $(seq 1 "$1")
}

# crawl the local fixture with the given extra args; leaves the exit status in RC
run() {
  local out="$1"
  shift
  rm -rf "$out"
  mkdir -p "$out"
  httrack "file://$tmp/index.html" -O "$out" --quiet -n "$@" >"$out/.log" 2>&1
  RC=$?
}

# assert the value was accepted: clean exit and the fixture was mirrored
accepted() {
  { test "$RC" -eq 0 && test -n "$(find "$1" -type f -path '*/index.html' -print -quit)"; } ||
    ! echo "FAIL: $2 (exit $RC)" || exit 1
}

# assert the value was refused cleanly: a normal error exit, never a crash
# (a SIGABRT from an overflowed scratch buffer would surface as exit 134)
refused() {
  { test "$RC" -ne 0 && test "$RC" -ne 134; } ||
    ! echo "FAIL: $1 (exit $RC)" || exit 1
}

# a value past the old 126/256 caps but within the cap is accepted, on both the
# short and long form of each option
long=$(nchars 900)
run "$tmp/ua-s" -F "$long"
accepted "$tmp/ua-s" "#152: long -F user-agent rejected or crashed"
run "$tmp/ua-l" --user-agent "$long"
accepted "$tmp/ua-l" "#152: long --user-agent rejected or crashed"
run "$tmp/hd-s" "-%X" "X-A: $long"
accepted "$tmp/hd-s" "#152: long -%X header rejected or crashed"
run "$tmp/hd-l" --headers "X-B: $long"
accepted "$tmp/hd-l" "#152: long --headers rejected or crashed"

# a value just under the cap (>1000) must not overflow the long-form alias
# scratch buffer (the param[] copy in optalias_check)
run "$tmp/ua-n" --user-agent "$(nchars 1010)"
accepted "$tmp/ua-n" "#152: near-cap --user-agent overflowed the param[] buffer"

# a value over the cap is refused cleanly (graceful error, not a SIGABRT), on
# both forms
over=$(nchars 1100)
run "$tmp/ov-s" -F "$over"
refused "#152: over-cap -F not refused cleanly"
run "$tmp/ov-l" --user-agent "$over"
refused "#152: over-cap --user-agent not refused cleanly"

exit 0

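The assertion helpers in the test above rely on two shell details: the `cond || ! echo ... || exit 1` chain, which only reaches `exit 1` when the condition fails (the `!` turns the successful `echo` into a failure), and the convention that a process killed by a signal is reported as 128 plus the signal number. The sketch below illustrates both in isolation. `refused_rc` is a hypothetical stand-in for `refused()`, taking the status as an argument and using `return` instead of `exit`; it assumes SIGABRT is signal 6, as on Linux.

```sh
# Hypothetical variant of refused(): status passed as $1, 'return' instead of
# 'exit' so the helper can be called several times in one shell.
refused_rc() {
  { test "$1" -ne 0 && test "$1" -ne 134; } ||
    ! echo "FAIL: $2 (exit $1)" || return 1
}

# Produce a real SIGABRT death: the child shell signals itself. The shell
# reports that as 128 + 6 (assuming SIGABRT is signal 6).
abort_rc=0
sh -c 'kill -ABRT $$' 2>/dev/null || abort_rc=$?

# A normal error exit counts as a clean refusal.
refused_rc 1 "normal error exit" && echo "exit 1: refused cleanly"
# An abort or a success is not a clean refusal; the helper prints FAIL first.
refused_rc "$abort_rc" "abort" || echo "exit $abort_rc: treated as a crash"
refused_rc 0 "success" || echo "exit 0: not a refusal"
```

Excluding only 134 keeps the check narrow: it targets the abort raised by the bounded-buffer asserts and does not try to classify every possible signal.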
@@ -47,3 +47,25 @@ match '*foo*bar' 'foozbar'

# '?' is the query-string marker, not a single-char wildcard
nomatch 'a?c' 'abc'

# backslash escapes a metacharacter inside a class so it is matched literally.
# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
# both X and '\'. These assertions pin that behavior.
match '*[\*]' '*'
match '*[\*]' "\\"
nomatch '*[\*]' 'a'
match '*[\\]' "\\"
nomatch '*[\\]' 'a'
match '*[\[]' '['
match '*[\[]' "\\"
nomatch '*[\[]' 'a'

# A literal ']' cannot be a class member: the class parser stops at the first
# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
# by a trailing literal ']'. These assertions document the current (buggy)
# behavior so any future matcher fix is a deliberate, visible change.
nomatch '*[\[\]]' '[' # not matched, despite the docs
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
nomatch '*[\[\]]' '[]x'

155
tests/01_engine-parse.test
Executable file
@@ -0,0 +1,155 @@
#!/bin/bash
#

# Offline HTML parser tests: each section crawls a file:// fixture (no network)
# and checks which assets the parser captured and how it rewrote the links.

set -u

tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_parse.XXXXXX") || exit 1
trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM

# a minimal valid 1x1 GIF, reused for every referenced asset
gif() {
  printf 'GIF89a\1\0\1\0\200\0\0\0\0\0\377\377\377!\371\4\1\0\0\0\0,\0\0\0\0\1\0\1\0\0\2\2D\1\0;' >"$1"
}

# crawl <fixture-html> into <out> with link rewriting on, no extra fetching
crawl() {
  local html="$1" out="$2"
  rm -rf "$out"
  mkdir -p "$out"
  httrack "file://$html" -O "$out" --quiet --near -n >"$out/.log" 2>&1
}

# assert a file with the given basename was saved somewhere under <out>
found() {
  test -n "$(find "$2" -type f -name "$1" -print -quit)" ||
    ! echo "FAIL: expected '$1' to be downloaded under $2" || exit 1
}

# assert NO file with the given basename was saved (e.g. a descriptor token must
# not be mistaken for a URL)
notfound() {
  test -z "$(find "$2" -type f -name "$1" -print -quit)" ||
    ! echo "FAIL: '$1' should not have been downloaded under $2" || exit 1
}

# the mirrored fixture page (under "file/"), not HTTrack's own landing index
savedhtml() {
  find "$1" -type f -path '*/file/*' -name index.html -print -quit
}

# srcset on <img> and <source> (#235, #236): every candidate captured and
# rewritten, descriptors preserved, following attributes left intact.
site="$tmp/srcset"
mkdir -p "$site"
for f in a b c d e f g h i j v dz; do gif "$site/$f.gif"; done
# unquoted heredoc: $site expands in the absolute-URL candidate
cat >"$site/index.html" <<EOF
<html><body>
<img src="a.gif" srcset="b.gif 480w, c.gif 800w">
<picture><source srcset="d.gif 1x, c.gif 2x"><img src="a.gif"></picture>
<img srcset="e.gif, f.gif">
<img srcset="g.gif 2x" alt="trailing attr after srcset">
<img srcset=" h.gif 2x , i.gif ">
<video><source src="v.gif"></video>
<img srcset="file://$site/j.gif 2x">
<img srcset="data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x, dz.gif 2x">
<img srcset="">
<a href="a.gif">plain link still works</a>
</body></html>
EOF
out="$tmp/srcset-out"
crawl "$site/index.html" "$out"

# every candidate downloads, incl. unique tails (catches first-only parsing),
# whitespace-padded (h,i), <source src> (v), absolute (j), post-data: URI (dz)
for f in a b c d e f g h i j v dz; do found "$f.gif" "$out"; done

# the width/density descriptors are not URLs and must not be fetched
notfound "480w" "$out"
notfound "800w" "$out"
notfound "2x" "$out"

saved=$(savedhtml "$out")
test -n "$saved" || ! echo "FAIL: saved index.html not found" || exit 1

# descriptors must survive the rewrite (no "b.gif 480w" mangled into a path)
grep -Eq 'srcset="[^"]*480w[^"]*800w' "$saved" ||
  ! echo "FAIL: srcset width descriptors lost/reordered in rewritten HTML" || exit 1
grep -Eq 'srcset="[^"]*1x[^"]*2x' "$saved" ||
  ! echo "FAIL: srcset density descriptors lost/reordered in rewritten HTML" || exit 1
# the descriptor-less comma form keeps both candidates and the separator verbatim
grep -Eq 'srcset="e\.gif, f\.gif"' "$saved" ||
  ! echo "FAIL: comma-separated srcset without descriptors was altered" || exit 1
# an attribute following srcset in the same tag must be left intact
grep -q 'alt="trailing attr after srcset"' "$saved" ||
  ! echo "FAIL: srcset swallowed a following attribute" || exit 1

# a comma inside a URL (data: URI, CDN path) is part of the URL, not a split
# point (WHATWG): the data: URI stays verbatim; the next candidate (dz) downloads
grep -Fq 'data:image/gif;base64,R0lGODlhAQABAAAAACw= 1x' "$saved" ||
  ! echo "FAIL: a comma inside a data: URI srcset candidate was mis-split" || exit 1

# real rewrite, not passthrough: the absolute file:// candidate becomes local
# (a flat fixture can't show this; the footer comment's file:// is not in srcset)
grep -Eq 'srcset="j\.gif 2x"' "$saved" ||
  ! echo "FAIL: absolute file:// srcset URL was not rewritten to a local link" || exit 1
! grep -Eq 'srcset="[^"]*file://' "$saved" ||
  ! echo "FAIL: a file:// URL survived inside a rewritten srcset attribute" || exit 1

# xlink:href (#298) and CSS background-image (#237): detected and rewritten to
# local. background-image is covered in both an external <style> block and an
# inline style attribute, with the URL unquoted, double-quoted and single-quoted
# (the quote style is preserved on rewrite). No-detect attributes (title, alt,
# ...) are left untouched. Asserted by rewrite (deterministic), not download.
# data-* (#201/#203) is omitted: its detection is currently nondeterministic and
# can't be locked yet.
site2="$tmp/attrs"
mkdir -p "$site2"
for f in xl ibg ibgs cex cexd cexs tt; do gif "$site2/$f.gif"; done
cat >"$site2/index.html" <<EOF
<html><head><style>
.a { background-image: url(file://$site2/cex.gif); }
.b { background-image: url("file://$site2/cexd.gif"); }
.c { background-image: url('file://$site2/cexs.gif'); }
</style></head><body>
<a xlink:href="file://$site2/xl.gif">xlink:href (#298)</a>
<div style="background-image:url(file://$site2/ibg.gif)"></div>
<div style="background-image:url('file://$site2/ibgs.gif')"></div>
<span title="file://$site2/tt.gif">excluded attribute</span>
</body></html>
EOF
out2="$tmp/attrs-out"
crawl "$site2/index.html" "$out2"
saved2=$(savedhtml "$out2")
test -n "$saved2" || ! echo "FAIL: saved attrs page not found" || exit 1

# detected attributes: the absolute URL is rewritten to a local link
grep -Eq 'xlink:href="xl\.gif"' "$saved2" ||
  ! echo "FAIL #298: xlink:href not detected/rewritten" || exit 1

# #237 external <style> block, each quoting form, quote style preserved
grep -Eq 'url\(cex\.gif\)' "$saved2" ||
  ! echo "FAIL #237: unquoted background-image in <style> not rewritten" || exit 1
grep -Eq 'url\("cexd\.gif"\)' "$saved2" ||
  ! echo "FAIL #237: double-quoted background-image in <style> not rewritten" || exit 1
grep -Eq "url\('cexs\.gif'\)" "$saved2" ||
  ! echo "FAIL #237: single-quoted background-image in <style> not rewritten" || exit 1

# #237 inline style attribute, unquoted and single-quoted url()
grep -Eq 'style="background-image:url\(ibg\.gif\)"' "$saved2" ||
  ! echo "FAIL #237: inline unquoted background-image not rewritten" || exit 1
grep -Eq "style=\"background-image:url\('ibgs\.gif'\)\"" "$saved2" ||
  ! echo "FAIL #237: inline single-quoted background-image not rewritten" || exit 1

# no file:// URL survived inside any rewritten background-image
! grep -Eq 'background-image:[^;"]*file://' "$saved2" ||
  ! echo "FAIL #237: a file:// URL survived inside a rewritten background-image" || exit 1

# excluded attribute: title is on the no-detect list, so its value is left as-is
grep -q 'title="file://' "$saved2" ||
  ! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1

exit 0

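The srcset assertions above pair a positive pattern (descriptors still present, in order) with a negative one (no `file://` left inside the attribute). The sketch below runs the same two patterns against a hand-written sample to show how they behave. The sample HTML is hypothetical and only imitates what a rewritten page might look like; it is not real httrack output.

```sh
# Hypothetical rewritten page; in the real test this file is written by httrack.
sample=$(mktemp) || exit 1
cat >"$sample" <<'EOF'
<img src="a.gif" srcset="b.gif 480w, c.gif 800w">
<img srcset="j.gif 2x">
EOF

# Positive check: both width descriptors survive, in their original order.
if grep -Eq 'srcset="[^"]*480w[^"]*800w' "$sample"; then
  descriptors=kept
else
  descriptors=lost
fi

# Negative check: no absolute file:// URL remains inside any srcset value.
if grep -Eq 'srcset="[^"]*file://' "$sample"; then
  absolute=survived
else
  absolute=rewritten
fi

echo "descriptors: $descriptors, absolute URL: $absolute"
rm -f "$sample"
```

Because `[^"]*` cannot cross the closing quote, each pattern is confined to a single attribute value, so a `file://` elsewhere on the page does not trip the negative check.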
34
tests/01_engine-strsafe.test
Executable file
@@ -0,0 +1,34 @@
#!/bin/bash
#

# htssafe.h bounded string operations (driven by 'httrack -#8').

# Success path: every bounded op (strcpybuff/strcatbuff/strncatbuff/strlcpybuff)
# must behave correctly. Like the other -# debug modes, a trailing token is
# required (a bare '-#8' falls through to the usage screen).
out=$(httrack -#8 run)
test $? -eq 0 || exit 1
test "$out" == "strsafe: OK" || exit 1

# Overflow path: an over-capacity write into a sized buffer must be caught by
# the bounded macro and abort the process, not be silently truncated/completed.
# Assert the htssafe abort signature specifically, so the test cannot pass for
# an unrelated reason (e.g. the -#8 mode being gone and falling through to the
# usage screen, which also exits non-zero).
err=$(httrack -#8 overflow "this string is far too long for the buffer" 2>&1)
case "$err" in
  *"strsafe: NOT aborted"*) echo "over-capacity write was NOT caught" >&2; exit 1 ;;
  *"overflow while copying"*) ;;
  *) echo "expected htssafe overflow abort, got: $err" >&2; exit 1 ;;
esac

# Same guarantee for the htsbuff builder. The source is exactly the buffer
# capacity (4 bytes into a 4-byte buffer), so this also pins the boundary: a
# '<=' off-by-one in the capacity check would let it through (and print "NOT
# aborted"). Match the specific htsbuff abort message, not just any assert.
err=$(httrack -#8 overflow-buff "abcd" 2>&1)
case "$err" in
  *"strsafe: NOT aborted"*) echo "htsbuff over-capacity write was NOT caught" >&2; exit 1 ;;
  *"htsbuff append overflow"*) ;;
  *) echo "expected htsbuff overflow abort, got: $err" >&2; exit 1 ;;
esac
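The overflow checks above use a three-way `case` so that only the expected abort signature passes; a non-zero exit for any other reason, such as the usage screen, still fails. The sketch below isolates that classification. `classify` and the sample strings fed to it are hypothetical; only the two matched substrings are taken from the test.

```sh
# Hypothetical classifier mirroring the test's three-way 'case' on captured output.
classify() {
  case "$1" in
    *"strsafe: NOT aborted"*) echo "not-caught" ;;   # the write went through
    *"htsbuff append overflow"*) echo "caught" ;;    # the specific abort message
    *) echo "unrelated" ;;                           # failed for some other reason
  esac
}

# Made-up sample outputs, for illustration only.
classify "assertion failed: htsbuff append overflow"
classify "strsafe: NOT aborted"
classify "usage: httrack <URLs> [-option]"
```

Branch order matters: the "NOT aborted" pattern is tested first, so output that happens to contain both strings is still reported as a failure.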
@@ -9,6 +9,25 @@ TESTS_ENVIRONMENT += HTTPS_SUPPORT=$(HTTPS_SUPPORT)
TESTS_ENVIRONMENT += top_srcdir=$(top_srcdir)

TEST_EXTENSIONS = .test
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
TESTS = \
	00_runnable.test \
	01_engine-charset.test \
	01_engine-cmdline.test \
	01_engine-entities.test \
	01_engine-filter.test \
	01_engine-hashtable.test \
	01_engine-idna.test \
	01_engine-mime.test \
	01_engine-parse.test \
	01_engine-simplify.test \
	01_engine-strsafe.test \
	02_manpage-regen.test \
	10_crawl-simple.test \
	11_crawl-cookies.test \
	11_crawl-idna.test \
	11_crawl-international.test \
	11_crawl-longurl.test \
	11_crawl-parsing.test \
	12_crawl_https.test

CLEANFILES = check-network_sh.cache

@@ -472,7 +472,7 @@ TESTS_ENVIRONMENT = PATH=$(top_builddir)/src$(PATH_SEPARATOR)$$PATH \
	ONLINE_UNIT_TESTS=$(ONLINE_UNIT_TESTS) \
	HTTPS_SUPPORT=$(HTTPS_SUPPORT) top_srcdir=$(top_srcdir)
TEST_EXTENSIONS = .test
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-cmdline.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
CLEANFILES = check-network_sh.cache
all: all-am
