Compare commits

1 Commits

Author SHA1 Message Date
Xavier Roche
d4e1b72a4b Parse robots.txt with RFC 9309 Allow/Disallow precedence
The robots.txt handler only did substring Disallow matching against a flat
token blob: no Allow:, no path wildcards. Sites using "Disallow: /" plus
"Allow: /public/" were over-blocked, since Allow was never parsed.

Move the body parsing into robots_parse() (htsrobots.c) so both the crawler and
a self-test can feed it raw robots.txt. Rules are stored Allow/Disallow-tagged
and consulted with RFC 9309 precedence: the longest matching path pattern
wins, Allow breaking ties. Pattern matching supports '*' (any run) and a
trailing '$' (end-of-path anchor) via a linear two-pointer matcher with a
single resumable star position, so hostile patterns cannot trigger
exponential backtracking. Path matching is now case-sensitive per the RFC.
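
The matching and precedence rules described above can be sketched roughly as
follows. This is an illustrative sketch only, not the code in htsrobots.c: the
names sketch_rule, sketch_match and sketch_allowed are hypothetical, the rule
layout is assumed, and "longest" is taken to mean the pattern's byte length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule record; the real robots_wizard layout is not shown. */
typedef struct {
  int allow;           /* 1 = Allow, 0 = Disallow */
  const char *pattern; /* may contain '*' and a trailing '$' */
} sketch_rule;

/* Return 1 if path matches pattern. '*' matches any run of characters and a
   trailing '$' anchors the end of the path; without '$' the pattern only has
   to match a prefix. Comparison is byte-exact, hence case-sensitive. */
static int sketch_match(const char *pat, const char *path) {
  size_t plen = strlen(pat);
  const int anchored = plen > 0 && pat[plen - 1] == '$';
  const size_t none = (size_t) -1;
  size_t p = 0, s = 0;          /* positions in pattern and path */
  size_t star_p = none, star_s = 0; /* last '*' seen and where it resumes */

  if (anchored)
    plen--; /* the '$' itself is not matched against the path */
  for (;;) {
    if (p == plen && (!anchored || path[s] == '\0'))
      return 1; /* pattern consumed (and path too, when anchored) */
    if (p < plen && pat[p] == '*') {
      star_p = p++; /* remember the star; first try matching zero chars */
      star_s = s;
    } else if (p < plen && path[s] != '\0' && pat[p] == path[s]) {
      p++;
      s++;
    } else if (star_p != none && path[star_s] != '\0') {
      p = star_p + 1; /* mismatch: let the last star absorb one more char */
      s = ++star_s;
    } else {
      return 0;
    }
  }
}

/* Return 1 if path may be fetched under the given rules, 0 otherwise.
   The longest matching pattern wins and Allow wins a tie; when no rule
   matches, the path is allowed. Empty patterns are skipped. */
static int sketch_allowed(const sketch_rule *rules, size_t n,
                          const char *path) {
  size_t best_len = 0;
  int verdict = 1;
  size_t i;

  for (i = 0; i < n; i++) {
    const size_t len = strlen(rules[i].pattern);

    if (len == 0 || !sketch_match(rules[i].pattern, path))
      continue;
    if (len > best_len || (len == best_len && rules[i].allow)) {
      best_len = len;
      verdict = rules[i].allow;
    }
  }
  return verdict;
}
```

With the rules "Disallow: /" and "Allow: /public/", this sketch allows
/public/x (the Allow pattern is longer) and blocks /private. Only the most
recent '*' is ever resumed, so the work is bounded by roughly the product of
the pattern and path lengths rather than growing exponentially.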

robots_wizard is internal (not in DevIncludes_DATA, no HTSEXT_API; htsopt.h
holds only an opaque pointer), so the in-memory format changed without an ABI
break. Sitemap:/Crawl-delay: lines are tolerated but ignored, as before.

New -#test=robots self-test plus tests/01_engine-robots.test cover the
Allow-over-Disallow longest match, the equal-length Allow tie, '*'/'$'
wildcards, and httrack-group selection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-29 08:57:00 +02:00

@@ -20,14 +20,6 @@ if ! command -v python3 >/dev/null 2>&1; then
     echo "python3 missing, skipping"
     exit 77
 fi
-# The fixture needs a second loopback IP (dead 127.0.0.2 + live 127.0.0.1) for
-# the fallback to have a target; GNU/Hurd has only 127.0.0.1, so skip there.
-case "$(uname -s)" in
-GNU | GNU/*)
-    echo "GNU/Hurd: single loopback IP, connect-fallback fixture unbuildable, skipping"
-    exit 77
-    ;;
-esac
 server="$top_srcdir/tests/local-server.py"
 root="$top_srcdir/tests/server-root"