mirror of https://github.com/xroche/httrack.git synced 2026-07-02 23:24:03 +03:00

Xavier Roche 600001b282 Split -%u URL Hacks into independent www/slash/query toggles (#271)

-%u (--urlhack) bundled three dedup normalizations under one switch:
www.host == host, redundant // collapse, and query-argument reordering.
A mirror that needed one but not another (e.g. keep www. distinct) had to
turn the whole umbrella off. Add three opt-out sub-options, defaulting to
the umbrella so existing -%u/-%u0 behavior is unchanged:

  --keep-www-prefix      keep www.foo.com distinct from foo.com   (-%j)
  --keep-double-slashes  keep redundant // in the path            (-%o)
  --keep-query-order     keep query-argument order significant    (-%y)

The split is resolved once in hash_init() into norm_host/norm_slash/
norm_query and threaded through the dedup hash (htshash.c), the savename
lookup key (htsname.c) and the redirect-loop diagnostic (htsparse.c) so
all three stay consistent. fil_normalized() gains an internal
fil_normalized_ex(do_slash, do_query) core; the public
fil_normalized()/fil_normalized_filtered() keep their signatures.
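
To make the three toggles concrete, here is a minimal C sketch of a dedup key builder driven by independent host/slash/query switches. It is not HTTrack's code: `norm_flags`, `dedup_key()` and `cmp_str()` are hypothetical names chosen for illustration, and the exact normalization rules in htshash.c may well differ. Only the flag names echo norm_host/norm_slash/norm_query from the message above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag set, for illustration only. */
typedef struct {
  int norm_host;  /* treat www.host and host as the same key */
  int norm_slash; /* collapse redundant "//" in the path */
  int norm_query; /* make query-argument order insignificant */
} norm_flags;

/* qsort comparator over an array of char* elements. */
static int cmp_str(const void *a, const void *b) {
  return strcmp(*(char *const *) a, *(char *const *) b);
}

/* Build a dedup key for url into out; outsz must exceed strlen(url).
   Returns out. A sketch under stated assumptions, not the real engine. */
static char *dedup_key(const char *url, norm_flags f, char *out, size_t outsz) {
  const char *p = url;
  size_t n = 0;

  assert(out != NULL && outsz > strlen(url));

  /* The scheme is always dropped, so http and https share one key. */
  if (strncmp(p, "http://", 7) == 0)
    p += 7;
  else if (strncmp(p, "https://", 8) == 0)
    p += 8;
  if (f.norm_host && strncmp(p, "www.", 4) == 0)
    p += 4;

  /* Copy host and path up to the query, optionally collapsing "//". */
  for (; *p != '\0' && *p != '?'; p++) {
    if (f.norm_slash && *p == '/' && n > 0 && out[n - 1] == '/')
      continue;
    out[n++] = *p;
  }
  out[n] = '\0';
  if (*p != '?')
    return out;
  out[n++] = '?';
  p++;

  if (!f.norm_query) {
    strcpy(out + n, p); /* argument order stays significant */
    return out;
  }

  /* Split the query on '&', sort the arguments, and join them again. */
  {
    size_t len = strlen(p), count = 0, i;
    char *copy = malloc(len + 1);
    char **args = malloc((len + 1) * sizeof(*args));
    char *tok;

    assert(copy != NULL && args != NULL);
    strcpy(copy, p);
    for (tok = strtok(copy, "&"); tok != NULL; tok = strtok(NULL, "&"))
      args[count++] = tok;
    qsort(args, count, sizeof(*args), cmp_str);
    for (i = 0; i < count; i++) {
      size_t alen = strlen(args[i]);
      if (i > 0)
        out[n++] = '&';
      memcpy(out + n, args[i], alen);
      n += alen;
    }
    out[n] = '\0';
    free(args);
    free(copy);
  }
  return out;
}
```

With all three flags set, `http://www.foo.com//a?b=2&a=1` and `https://foo.com/a?a=1&b=2` produce the same key in this sketch; clearing `norm_host` alone keeps the `www.` form distinct, which is the kind of independence the opt-outs are meant to give.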

Normalization (slash/query) now follows urlhack and its sub-flags
uniformly, while --strip-query stays orthogonal. So with urlhack off,
strip-query strips keys without sorting the remainder; the url_savename
urlhack-off branch is moved to the same do_slash=0/do_query=0 normalizer
the hash uses, so a URL is always looked up under the key it was stored
with (a self-lookup mismatch this otherwise introduced).

http/https are always merged in the dedup key (the scheme is stripped
regardless of -%u), so that part of the request needs no toggle.

The opt-outs are spelled positively (--keep-*) because httrack's generic
--no<opt> prefix only appends the disabling "0" for parametered options,
not "single" booleans, so --nowww-dedup would silently no-op.

opt grows three hts_boolean fields appended at the struct tail (offsets
stable, no soname bump, matching the strip_query addition in #112).
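
The "offsets stable" argument can be checked mechanically. The sketch below uses two hypothetical structs, `opt_before` and `opt_after`, with made-up field names and an `int` stand-in for the boolean type; it does not reproduce the real httrackopt layout. It only illustrates that appending members at the tail leaves every earlier offset unchanged, assuming callers obtain the structure from the library rather than allocating it by size.

```c
#include <assert.h>
#include <stddef.h>

typedef int sketch_boolean; /* stand-in type, an assumption of this sketch */

/* Layout before the change (hypothetical fields). */
struct opt_before {
  int urlhack;
  sketch_boolean strip_query;
};

/* Layout after the change: three members appended at the tail. */
struct opt_after {
  int urlhack;
  sketch_boolean strip_query;
  sketch_boolean keep_www_prefix;
  sketch_boolean keep_double_slashes;
  sketch_boolean keep_query_order;
};

/* Returns 1 when every pre-existing member keeps its offset. */
static int offsets_stable(void) {
  assert(sizeof(struct opt_after) > sizeof(struct opt_before));
  return offsetof(struct opt_before, urlhack) ==
             offsetof(struct opt_after, urlhack) &&
         offsetof(struct opt_before, strip_query) ==
             offsetof(struct opt_after, strip_query);
}
```

Code compiled against the older layout still reads the older members at the same offsets; only the total size grows.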

Tested by a -#test=urlhack engine self-test (hash_url_equals over each
flag combination) plus a -%u0 + --strip-query crawl case exercising the
urlhack-off savename branch.

Closes #271

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>

2026-06-27 20:18:29 +02:00

.githooks

Add an opt-in pre-commit hook that auto-formats changed C lines

2026-06-14 12:55:17 +02:00

.github/workflows

ci: add a MemorySanitizer job for the offline engine self-tests (#433)

2026-06-27 08:40:22 +02:00

debian

debian: override embedded-library for bundled minizip, lint under debian:sid (#419)

2026-06-22 22:27:18 +02:00

html

Make every shell script shellcheck-clean

2026-06-20 11:35:55 +02:00

lang

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

libtest

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

man

Split -%u URL Hacks into independent www/slash/query toggles (#271)

2026-06-27 20:18:29 +02:00

src

Split -%u URL Hacks into independent www/slash/query toggles (#271)

2026-06-27 20:18:29 +02:00

templates

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

tests

Split -%u URL Hacks into independent www/slash/query toggles (#271)

2026-06-27 20:18:29 +02:00

tools

Make lintian actually gate the Debian package build (#410)

2026-06-20 20:13:12 +02:00

.clang-format

Separate definition blocks and canonicalize the public headers

2026-06-20 12:52:19 +02:00

.flake8

tests: add offline local test server prototype (cookies + HTTPS)

2026-06-20 16:35:13 +02:00

.gitignore

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

.gitmodules

Changed .gitmodules submodule pull URL to https, but keeping the push URL (.git/modules/src/coucal/config) to ssh (closes #64)

2015-03-20 14:08:24 +01:00

AGENTS.md

Replace single-letter -# self-tests with a named -#test=NAME registry (#427)

2026-06-26 08:05:59 +02:00

AUTHORS

AUTHORS should be 644

2013-09-13 16:08:40 +00:00

bootstrap

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

build.sh

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

ChangeLog

build: symlink ChangeLog and NEWS to history.txt

2026-06-08 20:40:27 +02:00

CLAUDE.md

Add AGENTS.md operational checklist for AI-assisted contributions

2026-06-16 04:01:29 +02:00

CODE_OF_CONDUCT.md

Add contributor governance: CONTRIBUTING, COC, SECURITY, DCO

2026-06-14 13:41:19 +02:00

configure.ac

Fall back to the next address when a connect fails or stalls (#418)

2026-06-22 20:56:18 +02:00

CONTRIBUTING.md

Add AGENTS.md operational checklist for AI-assisted contributions

2026-06-16 04:01:29 +02:00

COPYING

GPL v3

2012-03-24 12:03:55 +00:00

gpl-fr.txt

Converted in UTF-8

2012-05-08 16:14:10 +00:00

greetings.txt

Flush

2013-06-09 14:45:30 +00:00

history.txt

Release 3.49.9 (#414)

2026-06-21 18:12:07 +02:00

httrack-doc.html

httrack 3.30.1

2012-03-19 12:51:31 +00:00

INSTALL

Regenerate committed autotools/libtool files

2026-06-14 20:32:07 +02:00

INSTALL.Linux

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

lang.def

Fixed typos

2023-01-14 17:21:57 +01:00

lang.indexes

Added Uzbek Latin version (Zafar Shamsiddinov)

2015-08-06 20:41:52 +02:00

license.txt

license: drop the obsolete OpenSSL linking exception

2026-06-07 14:29:33 +02:00

Makefile.am

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

NEWS

build: symlink ChangeLog and NEWS to history.txt

2026-06-08 20:40:27 +02:00

README

Reword the ethical-use notice in source headers

2026-06-14 18:38:17 +02:00

README.md

build: stop tracking generated autotools files; add bootstrap/build.sh

2026-06-16 22:48:04 +02:00

SECURITY.md

Add obfuscated personal email as alternate security contact

2026-06-14 13:47:15 +02:00

README.md

HTTrack Website Copier - Development Repository

About

Copy websites to your computer (Offline browser)

HTTrack is an offline browser utility that lets you download a World Wide Web site from the Internet to a local directory, recursively building all directories and fetching HTML, images, and other files from the server to your computer.

HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online.

HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system.

WinHTTrack is the Windows 2000/XP/Vista/Seven release of HTTrack, and WebHTTrack the Linux/Unix/BSD release.

Website

Main Website: http://www.httrack.com/

Compile trunk release

A git checkout ships only the autotools sources, so ./bootstrap (which runs autoreconf) regenerates configure first; this needs autoconf, automake and libtool. Released tarballs already include configure, so building from a tarball skips ./bootstrap.

git clone https://github.com/xroche/httrack.git --recurse-submodules
cd httrack
./bootstrap
./configure --prefix=$HOME/usr && make -j8 && make install

Or use the one-shot wrapper (bootstrap + configure + make), which forwards its arguments to configure:

./build.sh --prefix=$HOME/usr

Languages

C 76.4%

HTML 17.3%

Shell 4.2%

Python 0.7%

M4 0.5%

Other 0.9%