Compare commits

8 Commits

Author SHA1 Message Date
Xavier Roche
36b4e834b8 htsopt: type the boolean option fields as a named enum
The httrackp option fields that are pure on/off toggles were declared as bare
int. Introduce a two-value enum, hts_boolean { HTS_FALSE, HTS_TRUE }, and use it
as the type of the 38 boolean fields so each one documents its nature at the
declaration. The hts_create_opt() defaults block now reads HTS_TRUE/HTS_FALSE.

An enum is used rather than C bool on purpose: a C enum is int-sized and
represented like int, so the struct layout, every field offset and
sizeof(httrackp) are unchanged (verified: 141648 bytes before and after). The
size_httrackp guard value still holds and there is no soname bump. A bool field
would be one byte and would repack the whole struct.
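
The layout argument can be illustrated with a small sketch. The type and struct names below (demo_boolean, opts_int, opts_enum, opts_bool) are hypothetical stand-ins, not the real httrackp. ISO C leaves the size of an enum type implementation-defined, so the equality is an assumption that holds on common GCC, Clang and MSVC ABIs (without options such as -fshort-enums), not a guarantee of the language.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for hts_boolean; not the real header. */
typedef enum demo_boolean { DEMO_FALSE = 0, DEMO_TRUE = 1 } demo_boolean;

/* Toggles declared as bare int (the "before" shape). */
struct opts_int { int flush; int quiet; int depth; };

/* The same toggles typed as the enum (the "after" shape). */
struct opts_enum { demo_boolean flush; demo_boolean quiet; int depth; };

/* The same toggles as C bool, for comparison. */
struct opts_bool { bool flush; bool quiet; int depth; };

/* Returns 1 when the enum-typed struct has the same size and offsets
   as the int-typed one. */
int enum_layout_matches_int(void) {
  return sizeof(struct opts_enum) == sizeof(struct opts_int)
      && offsetof(struct opts_enum, quiet) == offsetof(struct opts_int, quiet)
      && offsetof(struct opts_enum, depth) == offsetof(struct opts_int, depth);
}

/* Returns 1 when the bool-typed struct places a field at a different
   offset than the int-typed one. */
int bool_layout_differs(void) {
  return offsetof(struct opts_bool, quiet) != offsetof(struct opts_int, quiet);
}

/* Aborts if either expectation fails on the current compiler/ABI. */
void demo_check_layout(void) {
  assert(enum_layout_matches_int());
  assert(bool_layout_differs());
}
```

Calling demo_check_layout() at start-up (or turning the comparisons into static assertions) is one way to catch a toolchain where the assumption does not hold.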

Scope is httrackp only; fields that look boolean but are not were left as int
(savename_delayed is tri-state, hostcontrol is a bitmask), as was is_update in
the separate lien_back struct. The four CLI sites that sscanf("%d") into a
boolean field now cast to int* to keep the read well-defined.
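
For comparison, here is a minimal sketch of a different way to keep such a read well-defined: parse into a plain int temporary, then assign to the enum field. This is not what the commit does (it casts the field address to int*). The names parse_toggle and demo_boolean are hypothetical. Unlike the cast, this helper also normalizes any non-zero input to the TRUE value, which changes behavior for inputs other than 0 and 1.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for hts_boolean; not the real header. */
typedef enum demo_boolean { DEMO_FALSE = 0, DEMO_TRUE = 1 } demo_boolean;

/* Parse a decimal number from text into an enum-typed toggle through an
   int temporary. Returns 1 on success; returns 0 and leaves *out
   untouched when no number could be read. */
int parse_toggle(const char *text, demo_boolean *out) {
  int value = 0;

  if (sscanf(text, "%d", &value) != 1)
    return 0;
  *out = value ? DEMO_TRUE : DEMO_FALSE; /* normalizes non-zero to TRUE */
  return 1;
}
```

The temporary avoids any dependence on the enum's underlying type, at the cost of one extra assignment per call site.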

Value-preserving: built against origin/master and compared per-object
disassembly. 40 of 45 objects are byte-identical; the five that differ
(htscore/htslib/htsname/htsparse/htswizard) differ only in instruction selection
from the int->enum field types, with every hts_create_opt default confirmed
unchanged. make check passes. Runtime assignments and tests on these fields are
left as plain 0/1, which compile identically.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-18 07:34:36 +02:00
Xavier Roche
bbb423f025 Merge pull request #385 from xroche/feature/api-enum-types
Give the option fields named enum types and flag macros
2026-06-18 07:06:59 +02:00
Xavier Roche
eed46e0b09 htsopt: give the option fields named enum types and flag macros
The per-mirror option fields in the installed htsopt.h carried bare ints whose
values were scattered magic numbers, decoded only by reading the parser. Type
the four single-valued fields as enums (urlmode -> hts_urlmode, cache ->
hts_cachemode, wizard -> hts_wizard, robots -> hts_robots) and name the bitmask
bits as enums too (hts_getmode, hts_seeker, hts_travel_scope, plus
HTS_TRAVEL_SCOPE_MASK / HTS_TRAVEL_TEST_ALL), following the existing
htsparsejava_flags pattern where the flag bits are an enum but the field stays
int. Replace the magic numbers at every use site with the named values.
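
As an illustration of the "flag bits are an enum but the field stays int" pattern, the sketch below packs a scope value and the test-all flag into one int, as the travel field does. The enum values and the two macros are copied from this diff and redeclared locally so the sketch compiles without htsopt.h. The three helper functions are hypothetical and do not exist in the library.

```c
#include <assert.h>

/* Redeclared locally with the values shown in this diff. */
typedef enum hts_travel_scope {
  HTS_TRAVEL_SAME_ADDRESS = 0,
  HTS_TRAVEL_SAME_DOMAIN = 1,
  HTS_TRAVEL_SAME_TLD = 2,
  HTS_TRAVEL_EVERYWHERE = 7
} hts_travel_scope;
#define HTS_TRAVEL_SCOPE_MASK 0xff
#define HTS_TRAVEL_TEST_ALL (1 << 8)

/* Hypothetical helper: replace the scope, keep the test-all flag. */
int travel_set_scope(int travel, hts_travel_scope scope) {
  return ((int) scope & HTS_TRAVEL_SCOPE_MASK) | (travel & HTS_TRAVEL_TEST_ALL);
}

/* Hypothetical helper: extract the scope stored in the low byte. */
int travel_scope(int travel) {
  return travel & HTS_TRAVEL_SCOPE_MASK;
}

/* Hypothetical helper: 1 if the test-all flag is set, else 0. */
int travel_tests_all(int travel) {
  return (travel & HTS_TRAVEL_TEST_ALL) != 0;
}
```

Because the scope occupies only the low byte, combining it with the flag by | or by + (as the CLI parser in this diff does) gives the same result.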

This is not an ABI break: a C enum is int-sized and represented identically, so
the struct layout, field offsets and sizeof(httrackp) are unchanged and the
size_httrackp guard value still holds. No soname bump.

The substitution is value-preserving and was verified by comparing per-object
disassembly between this branch and origin/master: 98 of 103 objects are
byte-identical, the htscore/htscoremain/htsparse objects have identical opcode
sequences (the only deltas are __LINE__ immediates moved by clang-format
wrapping long lines), and htslib/htswizard differ only in instruction selection
from the int->enum field types, with every hts_create_opt default confirmed
unchanged. make check passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-17 23:59:38 +02:00
Xavier Roche
fa57f0148f Merge pull request #384 from xroche/cleanup/dead-decls
Drop dead and duplicate function declarations
2026-06-17 22:15:13 +02:00
Xavier Roche
76260d5e6e src: drop dead and duplicate function declarations
Four declarations named functions that have no definition anywhere, so
they were never exported (absent from libhttrack.so) and any caller
would fail to link: htswrap_set_userdef and htswrap_get_userdef (the
live path is the CHAIN_FUNCTION ARGUMENT with CALLBACKARG_USERDEF),
antislash_unescaped, and the internal liens_record. escape_remove_control
was additionally declared twice in httrack-library.h; the documented
declaration stays, the bare duplicate goes.

Header-only cleanup. The exported symbol set is unchanged (verified with
nm -D), so this is not an ABI break and needs no soname bump.

Found while documenting the public API (#382).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-17 22:11:30 +02:00
Xavier Roche
5d0913dfce Merge pull request #383 from xroche/fix/mtime-local-precision
Fix mtime_local sub-second precision loss on POSIX
2026-06-17 22:06:42 +02:00
Xavier Roche
9b7601a987 htslib: fix mtime_local sub-second precision on POSIX
mtime_local() returns milliseconds since the epoch, but the POSIX
branch divided tv_usec (microseconds) by 1000000 instead of 1000,
dropping the entire millisecond term. The clock only advanced at
whole-second boundaries, so every sub-second delta the callers compute
(request/connect timing, transfer-rate smoothing) read as zero. The
Windows ftime() branch was already correct.
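
The arithmetic can be sketched in isolation. The type demo_stamp and both function names are hypothetical stand-ins (the real code uses TStamp and reads a struct timeval); only the two divisors come from the diff.

```c
#include <assert.h>

typedef long long demo_stamp; /* hypothetical stand-in for TStamp */

/* Fixed conversion: microseconds / 1000 gives the millisecond term. */
demo_stamp ms_fixed(long sec, long usec) {
  return (demo_stamp) sec * 1000 + (demo_stamp) usec / 1000;
}

/* Old conversion: usec is always below 1000000, so the integer
   division always yields 0 and the millisecond term disappears. */
demo_stamp ms_buggy(long sec, long usec) {
  return (demo_stamp) sec * 1000 + (demo_stamp) usec / 1000000;
}
```

For 1 s and 500000 us, ms_fixed returns 1500 while ms_buggy returns 1000, so two calls within the same second differ by zero under the old code.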

Found while documenting the public API (#382).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-17 22:03:16 +02:00
Xavier Roche
4ec38c4e66 Merge pull request #382 from xroche/docs/api-httrack-library
Document the public C API with contract comments
2026-06-17 21:17:24 +02:00
10 changed files with 349 additions and 271 deletions

@@ -2779,7 +2779,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
if (strcmp(back[i].url_fil, "/robots.txt")) {
if (back[i].r.statuscode == HTTP_OK) { // 'OK'
if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_fil)) { // not HTML
if (opt->getmode & 2) { // non-HTML files may be written
if (opt->getmode & HTS_GETMODE_NONHTML) {
int fcheck = 0;
int last_errno = 0;
@@ -2852,7 +2852,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
}
}
}
} else { // cancel everything!
} else { // cancel everything!
hts_log_print(opt, LOG_DEBUG,
"File cancelled (non HTML): %s%s",
back[i].url_adr, back[i].url_fil);
@@ -3661,7 +3661,7 @@ void back_wait(struct_back * sback, httrackp * opt, cache_back * cache,
#endif
if (sz >= 0) {
if (!is_hypertext_mime(opt, back[i].r.contenttype, back[i].url_sav)) { // not HTML
if (opt->getmode & 2) { // non-HTML files may be written **otherwise, well, it will be caught further on, so report what is about to be output**
if (opt->getmode & HTS_GETMODE_NONHTML) {
filenote(&opt->state.strc, back[i].url_sav, NULL); // record the file as known
file_notify(opt, back[i].url_adr, back[i].url_fil,
back[i].url_sav, 0, 1,

@@ -370,7 +370,7 @@ int cache_selftests(httrackp *opt, const char *dir) {
StringCopy(opt->path_html, base);
StringCopy(opt->path_html_utf8, base);
}
opt->cache = 1;
opt->cache = HTS_CACHE_PRIORITY;
/* pass 1: create everything in a single write session */
selftest_open_for_write(&cache, opt);
@@ -547,7 +547,7 @@ static void golden_setup(httrackp *opt, const char *dir) {
StringCopy(opt->path_log, base);
StringCopy(opt->path_html, base);
StringCopy(opt->path_html_utf8, base);
opt->cache = 1;
opt->cache = HTS_CACHE_PRIORITY;
}
int cache_golden_selftest(httrackp *opt, const char *dir, int regen) {

@@ -1835,9 +1835,10 @@ int httpmirror(char *url1, httrackp * opt) {
a++; // skip space(s)
if (strnotempty(a)) {
#ifdef IGNORE_RESTRICTIVE_ROBOTS
if (strcmp(a, "/") != 0 || opt->robots >= 3)
if (strcmp(a, "/") != 0 ||
opt->robots >= HTS_ROBOTS_ALWAYS_STRICT)
#endif
{ /* ignoring disallow: / */
{ /* ignoring disallow: / */
if ((strlen(buff) + strlen(a) + 8) < sizeof(buff)) {
strcatbuff(buff, a);
strcatbuff(buff, "\n");
@@ -1932,10 +1933,10 @@ int httpmirror(char *url1, httrackp * opt) {
"Warning: store %s without scan: %s", r.contenttype,
savename());
} else {
if ((opt->getmode & 2) != 0) { // ok, allowed
if ((opt->getmode & HTS_GETMODE_NONHTML) != 0) {
hts_log_print(opt, LOG_DEBUG, "Store %s: %s", r.contenttype,
savename());
} else { // link not allowed! (e.g. cgi-bin as html)
} else { // link not allowed! (e.g. cgi-bin as html)
hts_log_print(opt, LOG_DEBUG,
"non-html file ignored after upload at %s : %s",
urladr(), urlfil());
@@ -2052,7 +2053,7 @@ int httpmirror(char *url1, httrackp * opt) {
ptr++;
// should the next link(s) be skipped? (image files to fetch after the html)
if (opt->getmode & 4) { // save non-html files afterwards
if (opt->getmode & HTS_GETMODE_HTML_FIRST) {
// skip files depending on the pass
if (!numero_passe) {
while((ptr < opt->lien_tot) ? (heap(ptr)->pass2) : 0)
@@ -3736,10 +3737,10 @@ HTSEXT_API int copy_htsopt(const httrackp * from, httrackp * to) {
// test all: bit 8 of travel
if (from->travel > -1) {
if (from->travel & 256)
to->travel |= 256;
if (from->travel & HTS_TRAVEL_TEST_ALL)
to->travel |= HTS_TRAVEL_TEST_ALL;
else
to->travel &= 255;
to->travel &= HTS_TRAVEL_SCOPE_MASK;
}
return 0;

@@ -369,10 +369,6 @@ char *readfile_or(const char *fil, const char *defaultdata);
void check_rate(TStamp stat_timestart, int maxrate);
#endif
// links
int liens_record(char *adr, char *fil, char *save, char *former_adr,
char *former_fil, char *codebase);
/* Backing (download-slot) scheduler. Operate on the back[] ring (struct_back).
Not thread-safe; call from the single crawl loop. */

@@ -1431,7 +1431,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
StringBuff(opt->path_log), "hts-in_progress.lock"))) { // lock file?
//char s[32];
opt->cache = 1; // cache takes priority
opt->cache = HTS_CACHE_PRIORITY; // cache takes priority
if (opt->quiet == 0) {
if ((fexist
(fconcat
@@ -1465,7 +1465,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
(fconcat
(OPT_GET_BUFF(opt), OPT_GET_BUFF_SIZE(opt), StringBuff(opt->path_html), "index.html"))) {
//char s[32];
opt->cache = 2; // cache is used after a validity test
opt->cache = HTS_CACHE_TEST_UPDATE;
if (opt->quiet == 0) {
if ((fexist
(fconcat
@@ -1558,25 +1558,25 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
return 0; // normally already done
//
case 'g': // fetch one (or more) isolated files
opt->wizard = 2; // can no longer do without the wizard..
opt->wizard = HTS_WIZARD_AUTO;
//opt->wizard=0; // no wizard
opt->cache = 0; // nor cache
opt->cache = HTS_CACHE_NONE; // nor cache
opt->makeindex = 0; // nor index
httrack_logmode = 1; // errors on screen
opt->savename_type = 1003; // put in the current directory
opt->depth = 0; // do not explore the page
opt->accept_cookie = 0; // no cookies
opt->robots = 0; // no robots
opt->robots = HTS_ROBOTS_NEVER; // no robots
break;
case 'w':
opt->wizard = 2; // 'soft' wizard (asks no questions)
opt->travel = 0;
opt->seeker = 1;
opt->wizard = HTS_WIZARD_AUTO;
opt->travel = HTS_TRAVEL_SAME_ADDRESS;
opt->seeker = HTS_SEEKER_DOWN;
break;
case 'W':
opt->wizard = 1; // Wizard-Help (asks questions)
opt->travel = 0;
opt->seeker = 1;
opt->wizard = HTS_WIZARD_ASK; // Wizard-Help (asks questions)
opt->travel = HTS_TRAVEL_SAME_ADDRESS;
opt->seeker = HTS_SEEKER_DOWN;
break;
case 'r': // no longer the brute-force recursive get, but a wizard too!
if (isdigit((unsigned char) *(com + 1))) {
@@ -1598,19 +1598,23 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
// note: the opt->depth tests are there to avoid
// accidentally mirroring the whole web (:-O) ;-)
case 'a': /*if (opt->depth==9999) opt->depth=3; */
opt->travel = 0 + (opt->travel & 256);
opt->travel =
HTS_TRAVEL_SAME_ADDRESS + (opt->travel & HTS_TRAVEL_TEST_ALL);
break;
case 'd': /*if (opt->depth==9999) opt->depth=3; */
opt->travel = 1 + (opt->travel & 256);
opt->travel =
HTS_TRAVEL_SAME_DOMAIN + (opt->travel & HTS_TRAVEL_TEST_ALL);
break;
case 'l': /*if (opt->depth==9999) opt->depth=3; */
opt->travel = 2 + (opt->travel & 256);
opt->travel =
HTS_TRAVEL_SAME_TLD + (opt->travel & HTS_TRAVEL_TEST_ALL);
break;
case 'e': /*if (opt->depth==9999) opt->depth=3; */
opt->travel = 7 + (opt->travel & 256);
opt->travel =
HTS_TRAVEL_EVERYWHERE + (opt->travel & HTS_TRAVEL_TEST_ALL);
break;
case 't':
opt->travel |= 256;
opt->travel |= HTS_TRAVEL_TEST_ALL;
break;
case 'n':
opt->nearlink = 1;
@@ -1620,16 +1624,16 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
break;
//
case 'U':
opt->seeker = 2;
opt->seeker = HTS_SEEKER_UP;
break;
case 'D':
opt->seeker = 1;
opt->seeker = HTS_SEEKER_DOWN;
break;
case 'S':
opt->seeker = 0;
break;
case 'B':
opt->seeker = 3;
opt->seeker = HTS_SEEKER_DOWN | HTS_SEEKER_UP;
break;
//
case 'Y':
@@ -1659,12 +1663,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
//case 'A': opt->urlmode=1; break;
//case 'R': opt->urlmode=2; break;
case 'K':
opt->urlmode = 0;
opt->urlmode = HTS_URLMODE_ABSOLUTE;
if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->urlmode);
if (opt->urlmode == 0) { // in fact K0 ==> K2
sscanf(com + 1, "%d", (int *) &opt->urlmode);
if (opt->urlmode == HTS_URLMODE_ABSOLUTE) { // in fact K0 ==> K2
// and K ==> K0
opt->urlmode = 2;
opt->urlmode = HTS_URLMODE_RELATIVE;
}
while(isdigit((unsigned char) *(com + 1)))
com++;
@@ -1779,7 +1783,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
break;
//
case 'b':
sscanf(com + 1, "%d", &opt->accept_cookie);
sscanf(com + 1, "%d", (int *) &opt->accept_cookie);
while(isdigit((unsigned char) *(com + 1)))
com++;
break;
@@ -1831,33 +1835,33 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
break;
case 's':
if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->robots);
sscanf(com + 1, "%d", (int *) &opt->robots);
while(isdigit((unsigned char) *(com + 1)))
com++;
} else
opt->robots = 1;
opt->robots = HTS_ROBOTS_SOMETIMES;
#if DEBUG_ROBOTS
printf("robots.txt mode set to %d\n", opt->robots);
#endif
break;
case 'o':
sscanf(com + 1, "%d", &opt->errpage);
sscanf(com + 1, "%d", (int *) &opt->errpage);
while(isdigit((unsigned char) *(com + 1)))
com++;
break;
case 'u':
sscanf(com + 1, "%d", &opt->check_type);
sscanf(com + 1, "%d", (int *) &opt->check_type);
while(isdigit((unsigned char) *(com + 1)))
com++;
break;
//
case 'C':
if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->cache);
sscanf(com + 1, "%d", (int *) &opt->cache);
while(isdigit((unsigned char) *(com + 1)))
com++;
} else
opt->cache = 1;
opt->cache = HTS_CACHE_PRIORITY;
break;
case 'k':
opt->all_in_cache = 1;
@@ -1913,7 +1917,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
case 'I':
opt->kindex = 1;
if (isdigit((unsigned char) *(com + 1))) {
sscanf(com + 1, "%d", &opt->kindex);
sscanf(com + 1, "%d", (int *) &opt->kindex);
while(isdigit((unsigned char) *(com + 1)))
com++;
}
@@ -2045,7 +2049,7 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
// preserve: no footer, original links
case 'p':
StringClear(opt->footer);
opt->urlmode = 4;
opt->urlmode = HTS_URLMODE_KEEP_ORIGINAL;
break;
case 'L': // URL list
if ((na + 1 >= argc) || (argv[na + 1][0] == '-')) {
@@ -3610,12 +3614,12 @@ static int hts_main_internal(int argc, char **argv, httrackp * opt) {
printf("Mirror launched on %s by HTTrack Website Copier/"
HTTRACK_VERSION "%s " HTTRACK_AFF_AUTHORS "" LF, t,
hts_get_version_info(opt));
if (opt->wizard == 0) {
if (opt->wizard == HTS_WIZARD_NONE) {
printf
("mirroring %s with %d levels, %d sockets,t=%d,s=%d,logm=%d,lnk=%d,mdg=%d\n",
url, opt->depth, opt->maxsoc, opt->travel, opt->seeker,
httrack_logmode, opt->urlmode, opt->getmode);
} else { // the magic wizard
} else { // the magic wizard
printf("mirroring %s with the wizard help..\n", url);
}
}

@@ -2580,8 +2580,8 @@ HTSEXT_API TStamp mtime_local(void) {
assert(! "gettimeofday");
}
return (TStamp) (((TStamp) tv.tv_sec * (TStamp) 1000)
+ ((TStamp) tv.tv_usec / (TStamp) 1000000));
return (TStamp) (((TStamp) tv.tv_sec * (TStamp) 1000) +
((TStamp) tv.tv_usec / (TStamp) 1000));
#else
struct timeb B;
ftime(&B);
@@ -5434,34 +5434,34 @@ HTSEXT_API httrackp *hts_create_opt(void) {
/* default settings */
opt->wizard = 2; // automatic wizard
opt->quiet = 0; // questions
//
opt->travel = 0; // same address
opt->wizard = HTS_WIZARD_AUTO; // automatic wizard
opt->quiet = HTS_FALSE;
//
opt->travel = HTS_TRAVEL_SAME_ADDRESS; // same address
opt->depth = 9999; // full mirror by default
opt->extdepth = 0; // but not outside
opt->seeker = 1; // down
opt->urlmode = 2; // relative by default
opt->no_type_change = 0; // change file types
opt->seeker = HTS_SEEKER_DOWN; // down
opt->urlmode = HTS_URLMODE_RELATIVE; // relative by default
opt->no_type_change = HTS_FALSE;
opt->debug = LOG_NOTICE; // small log
opt->getmode = 3; // linear scan
opt->getmode = HTS_GETMODE_HTML | HTS_GETMODE_NONHTML;
opt->maxsite = -1; // max site size (none)
opt->maxfile_nonhtml = -1; // max non-html file size
opt->maxfile_html = -1; // same for html
opt->maxsoc = 4; // max number of sockets
opt->fragment = -1; // no fragmentation
opt->nearlink = 0; // do not fetch "adjacent" non-html links
opt->makeindex = 1; // build an index
opt->kindex = 0; // 'keyword' index
opt->delete_old = 1; // delete old files
opt->background_on_suspend = 1; // Background the process if Control Z calls signal suspend.
opt->makestat = 0; // no stats file
opt->maketrack = 0; // nor tracking
opt->nearlink = HTS_FALSE;
opt->makeindex = HTS_TRUE;
opt->kindex = HTS_FALSE;
opt->delete_old = HTS_TRUE;
opt->background_on_suspend = HTS_TRUE;
opt->makestat = HTS_FALSE;
opt->maketrack = HTS_FALSE;
opt->timeout = 120; // default timeout (2 minutes)
opt->cache = 1; // cache takes priority
opt->shell = 0; // no shell by default
opt->cache = HTS_CACHE_PRIORITY; // cache takes priority
opt->shell = HTS_FALSE;
opt->proxy.active = 0; // no proxy
opt->user_agent_send = 1; // send a user-agent
opt->user_agent_send = HTS_TRUE;
StringCopy(opt->user_agent,
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)");
StringCopy(opt->referer, "");
@@ -5469,34 +5469,36 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->savename_83 = 0; // long names by default
opt->savename_type = 0; // with the original structure
opt->savename_delayed = 2; // hard delayed type (default)
opt->delayed_cached = 1; // cached delayed type (default)
opt->mimehtml = 0; // no MIME-html
opt->delayed_cached = HTS_TRUE;
opt->mimehtml = HTS_FALSE;
opt->parsejava = HTSPARSE_DEFAULT; // parse classes
opt->hostcontrol = 0; // NO host control for timeout and traffic jammer
opt->retry = 2; // 2 retries by default
opt->errpage = 1; // copy or generate an error page on error (404 etc.)
opt->check_type = 1; // check type if unknown (cgi,asp..) EXCEPT / which is treated as html
opt->all_in_cache = 0; // do not store everything in the cache
opt->robots = 2; // process robots.txt
opt->external = 0; // normal external links
opt->passprivacy = 0; // passwords in files
opt->includequery = 1; // include query-string by default
opt->mirror_first_page = 0; // not in mirror-links mode
opt->accept_cookie = 1; // handle cookies
opt->errpage = HTS_TRUE;
// error page (404 etc.)
opt->check_type = HTS_TRUE;
// treated as html
opt->all_in_cache = HTS_FALSE;
opt->robots = HTS_ROBOTS_ALWAYS; // process robots.txt
opt->external = HTS_FALSE;
opt->passprivacy = HTS_FALSE;
opt->includequery = HTS_TRUE;
opt->mirror_first_page = HTS_FALSE;
opt->accept_cookie = HTS_TRUE;
opt->cookie = NULL;
opt->http10 = 0; // keep http/1.1
opt->nokeepalive = 0; // keep-alive stays enabled
opt->nocompression = 0; // compression stays enabled
opt->tolerant = 0; // do not accept an incorrect content-length
opt->parseall = 1; // parse everything (unknown tags, for example)
opt->parsedebug = 0; // no debug mode
opt->norecatch = 0; // re-fetch files deleted by the user
opt->http10 = HTS_FALSE;
opt->nokeepalive = HTS_FALSE;
opt->nocompression = HTS_FALSE;
opt->tolerant = HTS_FALSE;
opt->parseall = HTS_TRUE;
opt->parsedebug = HTS_FALSE;
opt->norecatch = HTS_FALSE;
opt->verbosedisplay = 0; // no text animation
opt->sizehack = 0; // size hack
opt->urlhack = 1; // url hack (normalizer)
opt->sizehack = HTS_FALSE;
opt->urlhack = HTS_TRUE;
StringCopy(opt->footer, HTS_DEFAULT_FOOTER);
opt->ftp_proxy = 1; // http proxy for ftp
opt->convert_utf8 = 1; // convert html to UTF-8
opt->ftp_proxy = HTS_TRUE;
opt->convert_utf8 = HTS_TRUE;
StringCopy(opt->filelist, "");
StringCopy(opt->lang_iso, "en, *");
StringCopy(opt->accept,
@@ -5507,9 +5509,9 @@ HTSEXT_API httrackp *hts_create_opt(void) {
//
opt->log = stdout;
opt->errlog = stderr;
opt->flush = 1; // flush the log files
//opt->aff_progress=0;
opt->keyboard = 0;
opt->flush = HTS_TRUE;
// opt->aff_progress=0;
opt->keyboard = HTS_FALSE;
//
StringCopy(opt->path_html, "");
StringCopy(opt->path_html_utf8, "");
@@ -5526,10 +5528,10 @@ HTSEXT_API httrackp *hts_create_opt(void) {
opt->waittime = -1; // wait until.. hh*3600+mm*60+ss
//
opt->exec = "";
opt->is_update = 0; // not an update (yet)
opt->dir_topindex = 0; // do not built top index (yet)
opt->is_update = HTS_FALSE;
opt->dir_topindex = HTS_FALSE;
//
opt->bypass_limits = 0; // enforce limits by default
opt->bypass_limits = HTS_FALSE;
opt->state.stop = 0; // stop
opt->state.exit_xh = 0; // abort
//

@@ -285,6 +285,82 @@ typedef enum htsparsejava_flags {
HTSPARSE_NO_AGGRESSIVE = 8 // don't aggressively parse .js or .java
} htsparsejava_flags;
/* Link-rewriting style for saved pages (opt->urlmode). */
#ifndef HTS_DEF_DEFSTRUCT_hts_urlmode
#define HTS_DEF_DEFSTRUCT_hts_urlmode
typedef enum hts_urlmode {
HTS_URLMODE_ABSOLUTE = 0, /**< absolute URL (http://host/path) everywhere */
HTS_URLMODE_ABSOLUTE_FILE = 1, /**< legacy file: form, unused */
HTS_URLMODE_RELATIVE = 2, /**< relative link (default) */
HTS_URLMODE_ABSOLUTE_URI = 3, /**< absolute URI from root (/path) */
HTS_URLMODE_KEEP_ORIGINAL = 4, /**< keep the original link, do not rewrite */
HTS_URLMODE_TRANSPARENT_PROXY = 5 /**< transparent-proxy URL */
} hts_urlmode;
#endif
/* Cache policy for updates and retries (opt->cache). */
#ifndef HTS_DEF_DEFSTRUCT_hts_cachemode
#define HTS_DEF_DEFSTRUCT_hts_cachemode
typedef enum hts_cachemode {
HTS_CACHE_NONE = 0, /**< no cache */
HTS_CACHE_PRIORITY = 1, /**< cache takes priority over the network */
HTS_CACHE_TEST_UPDATE = 2 /**< check for update before reuse (default) */
} hts_cachemode;
#endif
/* Interactive wizard level (opt->wizard). */
#ifndef HTS_DEF_DEFSTRUCT_hts_wizard
#define HTS_DEF_DEFSTRUCT_hts_wizard
typedef enum hts_wizard {
HTS_WIZARD_NONE = 0, /**< no wizard */
HTS_WIZARD_ASK = 1, /**< wizard asks questions */
HTS_WIZARD_AUTO = 2 /**< wizard runs without asking */
} hts_wizard;
#endif
/* robots.txt / meta-robots obedience level (opt->robots). */
#ifndef HTS_DEF_DEFSTRUCT_hts_robots
#define HTS_DEF_DEFSTRUCT_hts_robots
typedef enum hts_robots {
HTS_ROBOTS_NEVER = 0, /**< ignore robots rules */
HTS_ROBOTS_SOMETIMES = 1, /**< partial obedience */
HTS_ROBOTS_ALWAYS = 2, /**< obey robots rules (default) */
HTS_ROBOTS_ALWAYS_STRICT = 3 /**< obey even strict rules */
} hts_robots;
#endif
/* What to fetch (opt->getmode bitmask). */
typedef enum hts_getmode {
HTS_GETMODE_HTML = 1 << 0, /**< save HTML files */
HTS_GETMODE_NONHTML = 1 << 1, /**< save non-HTML files */
HTS_GETMODE_HTML_FIRST = 1 << 2 /**< fetch HTML first, then the other files */
} hts_getmode;
/* Allowed directions in the directory tree (opt->seeker bitmask). */
typedef enum hts_seeker {
HTS_SEEKER_DOWN = 1 << 0, /**< may descend into subdirectories */
HTS_SEEKER_UP = 1 << 1 /**< may ascend to parent directories */
} hts_seeker;
/* Link-following scope, stored in the low byte of opt->travel. */
typedef enum hts_travel_scope {
HTS_TRAVEL_SAME_ADDRESS = 0, /**< stay on the same address (host) */
HTS_TRAVEL_SAME_DOMAIN = 1, /**< stay on the same principal domain */
HTS_TRAVEL_SAME_TLD = 2, /**< stay on the same TLD (e.g. .com) */
HTS_TRAVEL_EVERYWHERE = 7 /**< follow links anywhere on the web */
} hts_travel_scope;
/* Flags OR'd into opt->travel above the scope value. */
#define HTS_TRAVEL_SCOPE_MASK 0xff /**< mask selecting the scope value */
#define HTS_TRAVEL_TEST_ALL (1 << 8) /**< also test forbidden URLs (-t) */
/* Boolean option flag. An enum (not C bool) so the option fields stay int-sized
and the httrackp layout/ABI is unchanged. */
#ifndef HTS_DEF_DEFSTRUCT_hts_boolean
#define HTS_DEF_DEFSTRUCT_hts_boolean
typedef enum hts_boolean { HTS_FALSE = 0, HTS_TRUE = 1 } hts_boolean;
#endif
#ifndef HTS_DEF_FWSTRUCT_lien_buffers
#define HTS_DEF_FWSTRUCT_lien_buffers
typedef struct lien_buffers lien_buffers;
@@ -308,14 +384,15 @@ typedef struct httrackp httrackp;
struct httrackp {
size_t size_httrackp; /**< size of this structure (version/ABI guard) */
/* */
int wizard; /**< interactive wizard level (none/full/light) */
int flush; /**< fflush() log files after each write */
hts_wizard wizard; /**< interactive wizard level (none/ask/auto) */
hts_boolean flush; /**< fflush() log files after each write */
int travel; /**< link-following scope (same domain, etc.) */
int seeker; /**< allowed direction: go up and/or down the tree */
int depth; /**< maximum recursion depth (-rN) */
int extdepth; /**< maximum recursion depth outside the start domain */
int urlmode; /**< saved-link rewriting style (relative, absolute, etc.) */
int no_type_change; // do not change file type according to MIME
hts_urlmode
urlmode; /**< saved-link rewriting style (relative, absolute, etc.) */
hts_boolean no_type_change; // do not change file type according to MIME
int debug; /**< debug logging level */
int getmode; /**< what to fetch (HTML, images, ...) bitmask */
FILE *log; /**< informational log stream; NULL mutes it */
@@ -325,28 +402,30 @@ struct httrackp {
LLint maxfile_html; /**< max bytes per HTML file */
int maxsoc; /**< max simultaneous sockets (-cN) */
LLint fragment; /**< split site after this many bytes */
int nearlink; /**< also fetch images/data adjacent to a page but off-site */
int makeindex; /**< build a top-level index.html */
int kindex; /**< build a keyword index */
int delete_old; /**< delete locally obsolete files after update */
hts_boolean
nearlink; /**< also fetch images/data adjacent to a page but off-site */
hts_boolean makeindex; /**< build a top-level index.html */
hts_boolean kindex; /**< build a keyword index */
hts_boolean delete_old; /**< delete locally obsolete files after update */
int timeout; /**< connection timeout in seconds */
int rateout; /**< minimum transfer rate (bytes/s) before abort */
int maxtime; /**< max total mirror duration in seconds */
int maxrate; /**< max transfer rate cap (bytes/s) */
float maxconn; /**< max connections per second */
int waittime; /**< scheduled start time (wall-clock seconds) */
int cache; /**< cache generation mode */
hts_cachemode cache; /**< cache generation mode */
// int aff_progress; // progress bar
int shell; /**< driven by a shell over stdin/stdout pipes */
hts_boolean shell; /**< driven by a shell over stdin/stdout pipes */
t_proxy proxy; /**< proxy configuration */
int savename_83; /**< force 8.3 (DOS) file names */
int savename_type; /**< saved-name layout (original tree, flat, ...) */
String
savename_userdef; /**< user-defined name template (e.g. %h%p/%n%q.%t) */
int savename_delayed; // delayed type check
int delayed_cached; // delayed type check can be cached to speedup updates
int mimehtml; /**< produce a single MIME/MHTML archive */
int user_agent_send; /**< send a User-Agent header */
hts_boolean
delayed_cached; // delayed type check can be cached to speedup updates
hts_boolean mimehtml; /**< produce a single MIME/MHTML archive */
hts_boolean user_agent_send; /**< send a User-Agent header */
String user_agent; /**< User-Agent value (e.g. httrack/1.0) */
String referer; /**< Referer value to send */
String from; /**< From value to send */
@@ -355,37 +434,39 @@ struct httrackp {
String path_html_utf8; /**< output directory for the mirror, UTF-8 form */
String path_bin; /**< directory for HTML templates */
int retry; /**< extra retries on a failed transfer */
int makestat; /**< maintain a transfer-statistics log */
int maketrack; /**< maintain an operations-statistics log */
hts_boolean makestat; /**< maintain a transfer-statistics log */
hts_boolean maketrack; /**< maintain an operations-statistics log */
int parsejava; /**< Java/JS parsing mode; see htsparsejava_flags */
int hostcontrol; /**< drop hosts that are too slow, etc. */
int errpage; /**< generate an error page on 404 and similar */
int check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves
*/
int all_in_cache; /**< keep all retrieved data in the cache */
int robots; /**< robots.txt handling level */
int external; /**< render external links as error pages */
int passprivacy; /**< strip passwords from external links */
int includequery; /**< include the query string in saved names */
int mirror_first_page; /**< only mirror the links of the first page */
hts_boolean errpage; /**< generate an error page on 404 and similar */
hts_boolean
check_type; /**< probe unknown-type links (cgi/asp/dir) and follow moves
*/
hts_boolean all_in_cache; /**< keep all retrieved data in the cache */
hts_robots robots; /**< robots.txt handling level */
hts_boolean external; /**< render external links as error pages */
hts_boolean passprivacy; /**< strip passwords from external links */
hts_boolean includequery; /**< include the query string in saved names */
hts_boolean mirror_first_page; /**< only mirror the links of the first page */
String sys_com; /**< system command to run */
int sys_com_exec; /**< actually execute sys_com */
int accept_cookie; /**< accept and send cookies */
hts_boolean sys_com_exec; /**< actually execute sys_com */
hts_boolean accept_cookie; /**< accept and send cookies */
t_cookie *cookie; /**< cookie store */
int http10; /**< force HTTP/1.0 */
int nokeepalive; /**< disable keep-alive */
int nocompression; /**< disable content compression */
int sizehack; /**< treat same-size response as "updated" */
int urlhack; // force "url normalization" to avoid loops
int tolerant; /**< accept an incorrect Content-Length */
int parseall; /**< parse aggressively, including unknown tags with links */
int parsedebug; /**< parser debug mode */
int norecatch; /**< do not re-fetch files the user deleted locally */
hts_boolean http10; /**< force HTTP/1.0 */
hts_boolean nokeepalive; /**< disable keep-alive */
hts_boolean nocompression; /**< disable content compression */
hts_boolean sizehack; /**< treat same-size response as "updated" */
hts_boolean urlhack; // force "url normalization" to avoid loops
hts_boolean tolerant; /**< accept an incorrect Content-Length */
hts_boolean
parseall; /**< parse aggressively, including unknown tags with links */
hts_boolean parsedebug; /**< parser debug mode */
hts_boolean norecatch; /**< do not re-fetch files the user deleted locally */
int verbosedisplay; /**< animated text progress display */
String footer; /**< footer/info line injected into pages */
int maxcache; /**< in-memory cache backing limit (bytes) */
// int maxcache_anticipate; // maximum links to anticipate (upper bound)
int ftp_proxy; /**< use the HTTP proxy for FTP too */
hts_boolean ftp_proxy; /**< use the HTTP proxy for FTP too */
String filelist; /**< file listing URLs to include */
String urllist; /**< file listing filters to include */
htsfilters filters; /**< filter pointers (+/-pattern rules) */
@@ -399,20 +480,20 @@ struct httrackp {
String headers; // Additional headers
String mimedefs; // ext1=mimetype1\next2=mimetype2..
String mod_blacklist; /**< blacklisted modules */
int convert_utf8; // filenames UTF-8 conversion (3.46)
hts_boolean convert_utf8; // filenames UTF-8 conversion (3.46)
//
int maxlink; /**< max number of links */
int maxfilter; /**< max number of filters */
//
const char *exec; /**< path of the running executable */
//
int quiet; /**< suppress non-wizard questions */
int keyboard; /**< poll stdin for keyboard input */
int bypass_limits; // bypass built-in limits
int background_on_suspend; // background process on suspend signal
hts_boolean quiet; /**< suppress non-wizard questions */
hts_boolean keyboard; /**< poll stdin for keyboard input */
hts_boolean bypass_limits; // bypass built-in limits
hts_boolean background_on_suspend; // background process on suspend signal
//
int is_update; /**< this run is an update (show "File updated...") */
int dir_topindex; /**< rebuild the top index afterwards */
hts_boolean is_update; /**< this run is an update (show "File updated...") */
hts_boolean dir_topindex; /**< rebuild the top index afterwards */
//
// callbacks
t_hts_htmlcheck_callbacks
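
The hunk above swaps `int` for `hts_boolean` on the on/off option fields. As a rough sketch of why that can leave the struct layout alone, the snippet below compares two miniature structs. The enum follows the two-value definition given in the commit message; `opts_before`, `opts_after`, `layout_is_unchanged()` and `enumerators_match_int()` are hypothetical stand-ins and are not part of the real `httrackp`. The C standard leaves the underlying type of an enum implementation-defined, so the check is expected to hold on common ABIs where an enum is int-sized, not everywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Two-value enum as described in the commit message. */
typedef enum hts_boolean { HTS_FALSE, HTS_TRUE } hts_boolean;

/* Hypothetical miniature option structs; NOT the real httrackp. */
struct opts_before {
  int tolerant;
  char tag;
  int quiet;
};

struct opts_after {
  hts_boolean tolerant;
  char tag;
  hts_boolean quiet;
};

/* Returns 1 when the size and the offset of the last field agree. */
int layout_is_unchanged(void) {
  return sizeof(struct opts_before) == sizeof(struct opts_after) &&
         offsetof(struct opts_before, quiet) ==
             offsetof(struct opts_after, quiet);
}

/* Plain 0/1 assignments keep working because the enumerators are 0 and 1. */
int enumerators_match_int(void) {
  hts_boolean flag = 1; /* same value as HTS_TRUE */
  assert(HTS_FALSE == 0);
  return flag == HTS_TRUE;
}
```

A build that packs enums into the smallest fitting type (for example GCC with `-fshort-enums`) would make `layout_is_unchanged()` return 0, which is the situation the commit's size guard is meant to catch.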


@@ -349,7 +349,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
#endif
// Now, parsing
if ((opt->getmode & 1) && (ptr > 0)) { // save the HTML files to disk
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
// create the local HTML file
HT_ADD_FOP; // write the file incrementally
}
@@ -553,7 +553,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (opt->depth == heap(ptr)->depth) { // always record the first links
if (!in_media) {
if (opt->makeindex && (ptr > 0)) {
if (opt->getmode & 1) { // writing is allowed
if (opt->getmode & HTS_GETMODE_HTML) {
p = strfield(html, "title");
if (p) {
if (*(html - 1) == '/')
@@ -704,7 +704,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
}
if (opt->getmode & 1) { // save HTML
if (opt->getmode & HTS_GETMODE_HTML) { // save HTML
p = 0;
switch (emited_footer) {
case 0:
@@ -740,7 +740,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (strchr(r->adr, '\r'))
eol = "\r\n";
if (StringNotEmpty(opt->footer) || opt->urlmode != 4) { /* != preserve */
if (StringNotEmpty(opt->footer) ||
opt->urlmode != HTS_URLMODE_KEEP_ORIGINAL) {
if (StringNotEmpty(opt->footer)) {
char BIGSTK tempo[1024 + HTS_URLMAXSIZE * 2];
char gmttime[256];
@@ -1746,7 +1747,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
// write codebase first, flush before code
if ((p_type == -1) || (p_type == -2)) {
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
HT_add_adr; // refresh
}
lastsaved = html; // last written + 1
@@ -1837,7 +1838,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
// do not flush after code if the codebase must be written first!
if ((p_type != -1) && (p_type != 2) && (p_type != -2)) {
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
HT_add_adr; // refresh
}
lastsaved = html; // last written + 1
@@ -1914,7 +1915,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (*html != '#') { // Not empty+unique #
if (eadr - html == 1) { // 1=link empty with delim (end_adr-start_adr)
if (quote) {
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
HT_ADD("#"); // We add this for a <href="">
}
}
@@ -2569,7 +2570,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((p_type == 2) || (p_type == -2)) { // base href or codebase, not a link
hts_log_print(opt, LOG_DEBUG, "Code/Codebase: %s%s",
afs.af.adr, afs.af.fil);
} else if ((opt->getmode & 4) == 0) {
} else if ((opt->getmode & HTS_GETMODE_HTML_FIRST) ==
0) {
hts_log_print(opt, LOG_DEBUG, "Record: %s%s -> %s",
afs.af.adr, afs.af.fil, afs.save);
} else {
@@ -2592,8 +2594,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
lastsaved = eadr - 1 + 1; // skip the "
}
/* */
else if (opt->urlmode == 0) { // absolute URL in every case
if ((opt->getmode & 1) && (ptr > 0)) { // write the HTML files
else if (opt->urlmode == HTS_URLMODE_ABSOLUTE) {
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
if (!link_has_authority(afs.af.adr)) {
HT_ADD("http://");
} else {
@@ -2620,12 +2622,14 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
lastsaved = eadr - 1; // last written + 1 (well, a ++ is applied afterwards)
/* */
} else if (opt->urlmode == 4) { // do nothing!
} else if (opt->urlmode == HTS_URLMODE_KEEP_ORIGINAL) {
/* */
/* leave the link 'as is' */
/* Otherwise, depends on internal/external */
} else if (forbidden_url == 1) { // the link will not be fetched, external reference!
if ((opt->getmode & 1) && (ptr > 0)) {
} else if (forbidden_url ==
1) { // the link will not be fetched, external
// reference!
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
if (p_type != -1) { // not just the file name (not a Java class)
if (!opt->external) {
if (!link_has_authority(afs.af.adr)) {
@@ -2674,7 +2678,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
'/') ? 1 : (ishtml(opt, afs.af.fil)))) {
case 1:
case -2: // HTML or directory
if (opt->getmode & 1) { // save HTML
if (opt->getmode & HTS_GETMODE_HTML) {
patch_it = 1; // redirect
add_url = 1; // with link?
cat_name = "external.html";
@@ -2847,7 +2851,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
// write codebase="path"
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) &&
(ptr > 0)) {
char BIGSTK tempo4[HTS_URLMAXSIZE * 2];
tempo4[0] = '\0';
@@ -2875,9 +2880,11 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
lastsaved = eadr - 1;
}
/*
else if (opt->urlmode==1) { // ABSOLUTE, this is the least common case
else if (opt->urlmode==1) { // ABSOLUTE, this is the least
common case
// DOES NOT WORK!! (and is useless)
if ((opt->getmode & 1) && (ptr>0)) { // write the HTML files
if ((opt->getmode & 1) && (ptr>0)) { // write the HTML
files
// write the modified link, absolute
HT_ADD("file:");
if (*save=='/')
@@ -2885,7 +2892,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
else
HT_ADD(save)
}
lastsaved=eadr-1; // last written + 1 (well, a ++ is applied afterwards)
lastsaved=eadr-1; // last written + 1 (well, a ++ is
applied afterwards)
}
*/
else if (opt->mimehtml) {
@@ -2895,18 +2903,18 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
make_content_id(afs.af.adr, afs.af.fil, cid, sizeof(cid));
HT_ADD_HTMLESCAPED(cid);
lastsaved = eadr - 1; // last written + 1 (well, a ++ is applied afterwards)
} else if (opt->urlmode == 3) { // absolute URI /
if ((opt->getmode & 1) && (ptr > 0)) { // write the HTML files
} else if (opt->urlmode == HTS_URLMODE_ABSOLUTE_URI) {
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
HT_ADD_HTMLESCAPED(afs.af.fil);
}
lastsaved = eadr - 1; // last written + 1 (well, a ++ is applied afterwards)
} else if (opt->urlmode == 5) { // transparent proxy URL
} else if (opt->urlmode == HTS_URLMODE_TRANSPARENT_PROXY) {
char BIGSTK tempo[HTS_URLMAXSIZE * 2];
const char *uri;
int i;
char *pos;
if ((opt->getmode & 1) && (ptr > 0)) { // write the HTML files
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
if (!link_has_authority(afs.af.adr)) {
HT_ADD("http://");
} else {
@@ -2947,7 +2955,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
HT_ADD_HTMLESCAPED(tempo);
}
lastsaved = eadr - 1; // last written + 1 (well, a ++ is applied afterwards)
} else if (opt->urlmode == 2) { // RELATIVE
} else if (opt->urlmode == HTS_URLMODE_RELATIVE) {
char BIGSTK tempo[HTS_URLMAXSIZE * 2];
tempo[0] = '\0';
@@ -3009,7 +3017,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
// write codebase="path"
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) &&
(ptr > 0)) {
char BIGSTK tempo4[HTS_URLMAXSIZE * 2];
tempo4[0] = '\0';
@@ -3027,7 +3036,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
//lastsaved=adr; // last written + 1
}
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
// convert to local codepage - NOT, already converted into %NN, and passed to the remote server so we do not have anything to do
//if (str->page_charset_ != NULL && *str->page_charset_ != '\0') {
// char *const local_save = hts_convertStringFromUTF8(tempo, strlen(tempo), str->page_charset_);
@@ -3061,7 +3070,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
"Error building relative link %s and %s",
afs.save, relativesavename());
}
} // otherwise the link will be written normally
} // otherwise the link will be written normally
#if 0
if (fexist(save)) { // the file exists..
@@ -3089,7 +3098,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
opt->maxlink);
hts_log_print(opt, LOG_INFO,
"To avoid that: use #L option for more links (example: -#L1000000)");
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
if (fp) {
fclose(fp);
fp = NULL;
@@ -3101,9 +3110,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int pass_fix, dejafait = 0;
// Compute the priority of this link
if ((opt->getmode & 4) == 0) { // process HTML afterwards
if ((opt->getmode & HTS_GETMODE_HTML_FIRST) == 0) {
pass_fix = 0;
} else { // check that this is not a non-HTML file
} else { // check that this is not a non-HTML file
if (!ishtml(opt, afs.af.fil))
pass_fix = 1; // lower priority (process later)
else
@@ -3167,7 +3176,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if (checkrobots(_ROBOTS, afs.af.adr, "") == -1) { // robots.txt ?
// record robots.txt (MACRO)
if (!hts_record_link(opt, afs.af.adr, "/robots.txt", "", "", "", NULL)) {
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) &&
(ptr > 0)) {
if (fp) {
fclose(fp);
fp = NULL;
@@ -3206,7 +3216,8 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
// record
if (!hts_record_link(opt, afs.af.adr, afs.af.fil, afs.save,
former.adr, former.fil, codebase)) {
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) &&
(ptr > 0)) {
if (fp) {
fclose(fp);
fp = NULL;
@@ -3351,7 +3362,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
// ----------
// write incrementally
if ((opt->getmode & 1) && (ptr > 0))
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0))
HT_add_adr;
lastsaved = html; // last written + 1
// ----------
@@ -3411,7 +3422,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
opt->state._hts_in_html_parsing = 0; // flag
opt->state._hts_cancel = 0; // no cancel
if ((opt->getmode & 1) && (ptr > 0)) {
if ((opt->getmode & HTS_GETMODE_HTML) && (ptr > 0)) {
{
char *cAddr = TypedArrayElts(output_buffer);
int cSize = (int) TypedArraySize(output_buffer);
@@ -3443,7 +3454,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
//
} // if !error
if (opt->getmode & 1) {
if (opt->getmode & HTS_GETMODE_HTML) {
if (fp) {
fclose(fp);
fp = NULL;
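
The hunks above replace the literals `1` and `4` on `opt->getmode`, and `0`, `2`, `3`, `4`, `5` on `opt->urlmode`, with named constants. The sketch below restates that mapping as it can be read off the old/new line pairs. The numeric values are inferred from this diff, not copied from the installed header, and the helpers `should_write_html()` and `should_write_html_old()` are hypothetical names introduced only to compare the two spellings of the guard.

```c
#include <assert.h>

/* Values inferred from the old/new line pairs in the diff (& 1, & 2, & 4).
   The real definitions live in the project's headers. */
#define HTS_GETMODE_HTML 1
#define HTS_GETMODE_NONHTML 2
#define HTS_GETMODE_HTML_FIRST 4

/* urlmode values as paired in the diff (== 0, == 2, == 3, == 4, == 5). */
enum {
  HTS_URLMODE_ABSOLUTE = 0,
  HTS_URLMODE_RELATIVE = 2,
  HTS_URLMODE_ABSOLUTE_URI = 3,
  HTS_URLMODE_KEEP_ORIGINAL = 4,
  HTS_URLMODE_TRANSPARENT_PROXY = 5
};

/* Hypothetical helper: the guard that recurs in the parser, new spelling. */
int should_write_html(int getmode, int ptr) {
  return (getmode & HTS_GETMODE_HTML) != 0 && ptr > 0;
}

/* Hypothetical helper: the same guard with the old magic number. */
int should_write_html_old(int getmode, int ptr) {
  return (getmode & 1) && (ptr > 0);
}

/* Both spellings agree for every combination of the three flag bits. */
void check_guards_agree(void) {
  int mode;
  for (mode = 0; mode < 8; mode++) {
    assert(should_write_html(mode, 1) == should_write_html_old(mode, 1));
    assert(should_write_html(mode, 0) == should_write_html_old(mode, 0));
  }
}
```

Because the macros expand to the same integers, the rewritten conditions test the same bits as before, which is consistent with the commit message's claim that the change is value-preserving.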


@@ -178,7 +178,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// -------------------- PHASE 1 --------------------
/* Should non-HTML files be processed? */
if ((opt->getmode & 2) == 0) { // no, they should not
if ((opt->getmode & HTS_GETMODE_NONHTML) == 0) { // no, they should not
if (!ishtml(opt, fil)) { // no, this one must not be
//adr[0]='\0'; // do not process this link
forbidden_url = 1; // forbid fetching the link
@@ -266,11 +266,11 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
test2 =
(strchr(tempo2 + ((*tempo2 == '/') ? 1 : 0), '/') != NULL);
if ((test1) && (test2)) { // we can only go down
if ((opt->seeker & 1) == 0) { // going down is forbidden
if ((opt->seeker & HTS_SEEKER_DOWN) == 0) {
forbidden_url = 1;
hts_log_print(opt, LOG_DEBUG, "lower link canceled: %s%s", adr,
fil);
} else { // allowed a priori - NEW
} else { // allowed a priori - NEW
if (!heap(ptr)->link_import) { // does not result from a 'moved'
forbidden_url = 0;
hts_log_print(opt, LOG_DEBUG, "lower link authorized: %s%s",
@@ -278,7 +278,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
}
}
} else if ((test1) || (test2)) { // we may go down to reach the link
if ((opt->seeker & 1) != 0) { // going down is allowed - NEW
if ((opt->seeker & HTS_SEEKER_DOWN) != 0) {
if (!heap(ptr)->link_import) { // does not result from a 'moved'
forbidden_url = 0;
hts_log_print(opt, LOG_DEBUG, "lower link authorized: %s%s",
@@ -290,11 +290,11 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// up
if ((!strncmp(tempo, "../", 3)) && (!strncmp(tempo2, "../", 3))) { // impossible without going up
if ((opt->seeker & 2) == 0) { // going up is forbidden
if ((opt->seeker & HTS_SEEKER_UP) == 0) {
forbidden_url = 1;
hts_log_print(opt, LOG_DEBUG, "upper link canceled: %s%s", adr,
fil);
} else { // allowed to go up - NEW
} else { // allowed to go up - NEW
if (!heap(ptr)->link_import) { // does not result from a 'moved'
forbidden_url = 0;
hts_log_print(opt, LOG_DEBUG, "upper link authorized: %s%s",
@@ -302,13 +302,13 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
}
}
} else if ((!strncmp(tempo, "../", 3)) || (!strncmp(tempo2, "../", 3))) { // possible by going up
if ((opt->seeker & 2) != 0) { // allowed to go up - NEW
if ((opt->seeker & HTS_SEEKER_UP) != 0) {
if (!heap(ptr)->link_import) { // does not result from a 'moved'
forbidden_url = 0;
hts_log_print(opt, LOG_DEBUG, "upper link authorized: %s%s",
adr, fil);
}
} // otherwise allowed when going down
} // otherwise allowed when going down
}
} else {
@@ -345,83 +345,81 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
//if (!opt->wizard) { // non-wizard mode
// should this link be processed?.. check the rights to leave
switch ((opt->travel & 255)) {
case 0:
switch ((opt->travel & HTS_TRAVEL_SCOPE_MASK)) {
case HTS_TRAVEL_SAME_ADDRESS:
if (!opt->wizard) // non-wizard mode
forbidden_url = 1;
break; // leaving the address is forbidden
case 1:{ // may leave to the same dom.xxx
size_t i = strlen(adr) - 1;
size_t j = strlen(urladr()) - 1;
case HTS_TRAVEL_SAME_DOMAIN: {
size_t i = strlen(adr) - 1;
size_t j = strlen(urladr()) - 1;
if ((i > 0) && (j > 0)) {
while((i > 0) && (adr[i] != '.'))
i--;
while((j > 0) && (urladr()[j] != '.'))
j--;
if ((i > 0) && (j > 0)) {
i--;
j--;
while((i > 0) && (adr[i] != '.'))
i--;
while((j > 0) && (urladr()[j] != '.'))
j--;
}
}
if ((i > 0) && (j > 0)) {
if (!strfield2(adr + i, urladr() + j)) { // !=
if (!opt->wizard) { // non-wizard mode
//printf("refused: %s\n",adr);
forbidden_url = 1; // not the same domain
hts_log_print(opt, LOG_DEBUG,
"foreign domain link canceled: %s%s", adr, fil);
}
} else {
if (opt->wizard) { // wizard mode
forbidden_url = 0; // same domain
hts_log_print(opt, LOG_DEBUG, "same domain link authorized: %s%s",
adr, fil);
}
}
} else
forbidden_url = 1;
}
break;
case 2:{ // may leave to the same .xxx
size_t i = strlen(adr) - 1;
size_t j = strlen(urladr()) - 1;
while((i > 0) && (adr[i] != '.'))
if ((i > 0) && (j > 0)) {
while ((i > 0) && (adr[i] != '.'))
i--;
while((j > 0) && (urladr()[j] != '.'))
while ((j > 0) && (urladr()[j] != '.'))
j--;
if ((i > 0) && (j > 0)) {
if (!strfield2(adr + i, urladr() + j)) { // !-
if (!opt->wizard) { // non-wizard mode
//printf("refused: %s\n",adr);
forbidden_url = 1; // not the same .xx
hts_log_print(opt, LOG_DEBUG,
"foreign location link canceled: %s%s", adr, fil);
}
} else {
if (opt->wizard) { // wizard mode
forbidden_url = 0; // same domain
hts_log_print(opt, LOG_DEBUG,
"same location link authorized: %s%s", adr, fil);
}
}
} else
forbidden_url = 1;
i--;
j--;
while ((i > 0) && (adr[i] != '.'))
i--;
while ((j > 0) && (urladr()[j] != '.'))
j--;
}
}
break;
case 7: // everywhere!!
if ((i > 0) && (j > 0)) {
if (!strfield2(adr + i, urladr() + j)) { // !=
if (!opt->wizard) { // non-wizard mode
// printf("refused: %s\n",adr);
forbidden_url = 1; // not the same domain
hts_log_print(opt, LOG_DEBUG, "foreign domain link canceled: %s%s",
adr, fil);
}
} else {
if (opt->wizard) { // wizard mode
forbidden_url = 0; // same domain
hts_log_print(opt, LOG_DEBUG, "same domain link authorized: %s%s",
adr, fil);
}
}
} else
forbidden_url = 1;
} break;
case HTS_TRAVEL_SAME_TLD: {
size_t i = strlen(adr) - 1;
size_t j = strlen(urladr()) - 1;
while ((i > 0) && (adr[i] != '.'))
i--;
while ((j > 0) && (urladr()[j] != '.'))
j--;
if ((i > 0) && (j > 0)) {
if (!strfield2(adr + i, urladr() + j)) { // !-
if (!opt->wizard) { // non-wizard mode
// printf("refused: %s\n",adr);
forbidden_url = 1; // not the same .xx
hts_log_print(opt, LOG_DEBUG,
"foreign location link canceled: %s%s", adr, fil);
}
} else {
if (opt->wizard) { // wizard mode
forbidden_url = 0; // same domain
hts_log_print(opt, LOG_DEBUG, "same location link authorized: %s%s",
adr, fil);
}
}
} else
forbidden_url = 1;
} break;
case HTS_TRAVEL_EVERYWHERE:
if (opt->wizard) { // wizard mode
forbidden_url = 0;
break;
}
} // switch
} // switch
// FORMER POSITION -- fetch the links next to a link (nearlink)
@@ -583,7 +581,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// we must ask the question.. can we ask it?
// (yes I know, what a display of tact, thank you, thank you)
if ((question) && (ptr > 0) && (!force_mirror)) {
if (opt->wizard == 2) { // discard all links not listed as allowed (or unknown)
if (opt->wizard == HTS_WIZARD_AUTO) {
question = 0;
forbidden_url = 1;
hts_log_print(opt, LOG_DEBUG,
@@ -600,8 +598,8 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
printf("robots.txt forbidden: %s%s\n", adr, fil);
#endif
// question resolved by the filters, and robots mode is not strict
if ((!question) && (filters_answer) && (opt->robots == 1)
&& (forbidden_url != 1)) {
if ((!question) && (filters_answer) &&
(opt->robots == HTS_ROBOTS_SOMETIMES) && (forbidden_url != 1)) {
r = 0; // cancel the robots prohibition
if (!forbidden_url) {
hts_log_print(opt, LOG_DEBUG,
@@ -685,7 +683,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
io_flush;
} else { // primary link: allow the whole directory
if (!force_mirror) {
if ((opt->seeker & 1) == 0) { // going down is forbidden
if ((opt->seeker & HTS_SEEKER_DOWN) == 0) {
n = 7;
} else {
n = 5; // allow mirroring descendant directories (primary link)
@@ -712,7 +710,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
switch (n) {
case -1: // skip everything else
forbidden_url = 1;
opt->wizard = 2; // skip everything else
opt->wizard = HTS_WIZARD_AUTO; // skip everything else
break;
case 0: // forbid the same link: adr/fil
forbidden_url = 1;
@@ -796,7 +794,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
break;
case 5: // allow the whole directory and its children
if ((opt->seeker & 2) == 0) { // not allowed to go up
if ((opt->seeker & HTS_SEEKER_UP) == 0) { // not allowed to go up
size_t i = strlen(fil) - 1;
while((fil[i] != '/') && (i > 0))
@@ -872,7 +870,7 @@ static int hts_acceptlink_(httrackp * opt, int ptr,
// link not allowed, can we just test it?
if (just_test_it) {
if (forbidden_url == 1) {
if (opt->travel & 256) { // test it anyway
if (opt->travel & HTS_TRAVEL_TEST_ALL) { // test it anyway
if (strfield(adr, "ftp://") == 0) { // NOT ftp!
forbidden_url = 1; // yes, still forbidden (note: redundant since it is already 1, kept for clarity)
*just_test_it = 1; // but we test it
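
The hunks above show that `opt->travel` packs two things into one integer: a scope value in the low byte (`& 255`) and a separate "test everything" bit (`& 256`), while `opt->seeker` is a two-bit direction mask. A minimal sketch of decoding these fields follows. The numeric values are inferred from the old/new line pairs in this diff, and the helpers `travel_scope()`, `travel_tests_all()` and `seeker_allows()` are hypothetical names, not functions from the library.

```c
#include <assert.h>

/* Values inferred from the old/new line pairs in the diff. */
#define HTS_TRAVEL_SCOPE_MASK 255
#define HTS_TRAVEL_TEST_ALL 256
#define HTS_SEEKER_DOWN 1
#define HTS_SEEKER_UP 2

enum {
  HTS_TRAVEL_SAME_ADDRESS = 0,
  HTS_TRAVEL_SAME_DOMAIN = 1,
  HTS_TRAVEL_SAME_TLD = 2,
  HTS_TRAVEL_EVERYWHERE = 7
};

/* Hypothetical helper: extract the scope stored in the low byte. */
int travel_scope(int travel) {
  return travel & HTS_TRAVEL_SCOPE_MASK;
}

/* Hypothetical helper: is the "test forbidden links anyway" bit set? */
int travel_tests_all(int travel) {
  return (travel & HTS_TRAVEL_TEST_ALL) != 0;
}

/* Hypothetical helper: may the crawl move in the given direction? */
int seeker_allows(int seeker, int direction) {
  assert(direction == HTS_SEEKER_DOWN || direction == HTS_SEEKER_UP);
  return (seeker & direction) != 0;
}
```

For example, a `travel` value of `HTS_TRAVEL_SAME_DOMAIN | HTS_TRAVEL_TEST_ALL` decodes to the same-domain scope with link testing enabled, and masking first is what lets the `switch` in the diff match on the scope constants alone.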


@@ -254,15 +254,6 @@ HTSEXT_API int htswrap_add(httrackp * opt, const char *name, void *fct);
or 0 if none or unknown. */
HTSEXT_API uintptr_t htswrap_read(httrackp * opt, const char *name);
/** @warning No implementation is linked into the library; calling this fails to
link. For per-callback user data use the CHAIN_FUNCTION() ARGUMENT and
CALLBACKARG_USERDEF() instead. */
HTSEXT_API int htswrap_set_userdef(httrackp * opt, void *userdef);
/** @warning No implementation is linked into the library; calling this fails to
link. Read per-callback user data with CALLBACKARG_USERDEF() instead. */
HTSEXT_API void *htswrap_get_userdef(httrackp * opt);
/* Internal library allocators, if a different libc is being used by the client */
/** strdup() through the library allocator. Returns a heap copy freed with
hts_free(), or NULL on failure. */
@@ -584,12 +575,6 @@ HTSEXT_API char *unescape_http(char *const catbuff, const size_t size, const cha
space. Returns @p catbuff. */
HTSEXT_API char *unescape_http_unharm(char *const catbuff, const size_t size, const char *s, const int no_high);
/** @warning No implementation is linked into the library; calling this fails to
link. */
HTSEXT_API char *antislash_unescaped(char *catbuff, const char *s);
HTSEXT_API void escape_remove_control(char *s);
/** Determine the MIME type of local file name @p fil into @p s (capacity
@p ssize): user --assume rules, then ".html", then the built-in extension
table. @p flag != 0 forces a fallback type. @return 1 if a type was written,