Compare commits

..

3 Commits

Author SHA1 Message Date
Xavier Roche
6a08ca7d39 htsparse: bound the URL-end scan against a missing closing delimiter
Reviewing the @import change, ASan flagged a pre-existing heap overflow:
when a quoted/parenthesized link token has no closing delimiter before the
buffer ends (truncated input such as `@import "x`, `@import "`, `url("x`),
the scan stops at the terminating NUL, then `c += ndelim` steps past it and
`while (*c == ' ')` / the terminator test read out of bounds. Such input
aborts under ASan on master.

Skip the URL-end scan and capture when no closing delimiter was found
(`*c == '\0'` right after the scan); c never advances past the NUL.
Well-formed tokens are unaffected.

01_engine-parse.test gains a truncated-@import fixture (the valid sibling
import is still captured, the unterminated one is not) that trips the
overflow under the CI ASan job, plus a check that an @import's trailing
media/supports/layer condition survives the rewrite verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 19:25:39 +02:00
Xavier Roche
a8b491e509 htsparse: capture conditional CSS @import URLs (#94)
A bare-string @import carrying a media/supports/layer condition, e.g.
`@import "theme.css" screen;`, was dropped. The detector required the closing
quote to be immediately followed by the statement terminator, so the trailing
condition aborted the capture. The `url(...)` form already worked because it
terminates at the paren.

Two coupled defects in the inscript/CSS detector:
- accept a whitespace-separated trailing condition after a quoted @import URL;
- bound the captured URL at its last content char (b) instead of recomputing
  from the terminator. The old `c -= (ndelim + 1)` mishandled spaces skipped
  before the terminator, leaving the closing quote inside the range so the
  bogus-link guard aborted. That also silently broke `foo="url" ;` (a space
  before the semicolon) for every quoted detection, not only @import.

01_engine-parse.test gains a CSS @import section that crawls a .css directly;
the conditioned cases are negative controls that fail without the fix.

Closes #94

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
2026-06-19 18:46:31 +02:00
Xavier Roche
a8e4bb3b81 Merge pull request #395 from xroche/fix/xmlns-false-links-191
Don't crawl xmlns namespace declarations
2026-06-19 18:28:23 +02:00
2 changed files with 63 additions and 7 deletions

View File

@@ -1341,6 +1341,9 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
int can_avoid_quotes = 0;
char quotes_replacement = '\0';
int ensure_not_mime = 0;
// @import: the quoted token is the URL; a trailing
// media/supports/layer condition is not part of it
int is_import = 0;
if (inscript_tag)
expected_end = ";\"\'"; // voir a href="javascript:doc.location='foo'"
@@ -1390,6 +1393,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((nc = strfield(html, "import"))) { // import "url"
if (is_space(*(html + nc))) {
expected = 0; // no char expected
is_import = 1;
} else
nc = 0;
}
@@ -1407,6 +1411,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
if ((*a == 34) || (*a == '\'') || (can_avoid_quotes)) {
const char *b, *c;
int ndelim = 1;
int valid_url = 0;
if ((*a == 34) || (*a == '\''))
a++;
@@ -1421,12 +1426,20 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
b++;
}
c = b--;
c += ndelim;
while(*c == ' ')
c++;
if ((strchr(expected_end, *c)) || (*c == '\n')
|| (*c == '\r')) {
c -= (ndelim + 1);
// no closing delimiter here (truncated input):
// Don't scan past the buffer NUL or capture it.
if (*c != '\0') {
c += ndelim;
while (*c == ' ')
c++;
valid_url =
(strchr(expected_end, *c)) || (*c == '\n') ||
(*c == '\r') ||
(is_import && *(b + 1 + ndelim) == ' ');
}
if (valid_url) {
// URL end = last char (b), not the delimiter
c = b;
if ((int) (c - a + 1)) {
if (ensure_not_mime) {
int i = 0;
@@ -1485,7 +1498,6 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
}
}
}
}
}
}

View File

@@ -176,4 +176,48 @@ notfound "ns.gif" "$out3"
notfound "og.gif" "$out3"
notfound "rdfs.gif" "$out3"
# CSS @import (#94): every form's target is captured, crawling the .css directly.
# The "cond"/"sup"/"spc" cases carry a trailing media/supports/layer condition (or
# a space before ';'); they are the negative controls: without the parser fix the
# URL is dropped, so a regression fails these found() checks.
site4="$tmp/cssimport"
mkdir -p "$site4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do printf 'body{}\n' >"$site4/$f.css"; done
cat >"$site4/main.css" <<'EOF'
@import url(nq.css);
@import url("dqu.css");
@import url('squ.css');
@import "dqs.css";
@import 'sqs.css';
@import url(med.css) screen and (min-width: 400px);
@import "cond.css" screen;
@import "sup.css" supports(display: flex);
@import url(lay.css) layer(base);
@import "spc.css" ;
EOF
out4="$tmp/cssimport-out"
crawl "$site4/main.css" "$out4"
for f in nq dqu squ dqs sqs med cond sup lay spc; do found "$f.css" "$out4"; done
# Over-capture guard: the trailing condition is not part of the URL, so it must
# survive the rewrite verbatim. A regression that grabs it would mangle these.
m4=$(find "$out4" -type f -path '*/file/*' -name main.css -print -quit)
test -n "$m4" || ! echo "FAIL: saved main.css not found" || exit 1
for cond in '@import "cond.css" screen;' 'supports(display: flex)' 'layer(base)'; do
grep -Fq "$cond" "$m4" ||
! echo "FAIL #94: '$cond' altered on rewrite (condition captured as URL?)" || exit 1
done
# Malformed input: an unterminated @import quote (truncated CSS) must not crash or
# capture a bogus link; a valid sibling import is still captured. Guards a heap
# overflow on the URL-end scan that aborts under ASan (CI sanitizer job).
site5="$tmp/cssimport-trunc"
mkdir -p "$site5"
printf 'body{}\n' >"$site5/good.css"
printf '@import "good.css";\n@import "trunc' >"$site5/main.css"
out5="$tmp/cssimport-trunc-out"
crawl "$site5/main.css" "$out5"
found "good.css" "$out5"
notfound "trunc" "$out5"
exit 0