mirror of https://github.com/xroche/httrack.git
synced 2026-06-13 22:04:07 +03:00
Compare commits: fix/doc-la ... parser/src (7 Commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 9c167108a9 | | |
| | 5351e96d71 | | |
| | 9d39a57576 | | |
| | e3d4ec01f7 | | |
| | a0bf50f6b1 | | |
| | 794404bba2 | | |
| | 82d08aaeaf | | |
@@ -118,11 +118,11 @@ The command-line version
 <br>
 <br>
 <li>Add the URLs, separated by a blank space</li>
-<br><small><tt>httrack www.someweb.com/foo/</tt></small>
+<br><small><tt>httrack www.example.com/foo/</tt></small>
 <br>
 <br>
 <li>If you need, add some options (see the <a href="options.html">option list</a>)</li>
-<br><small><tt>httrack www.someweb.com/foo/ -O "/webs" -N4 -P proxy.myhost.com:3128</tt></small>
+<br><small><tt>httrack www.example.com/foo/ -O "/webs" -N4 -P proxy.myhost.com:3128</tt></small>
 <br>
 <br>
 <li>Launch the command line, and wait until the mirror is finishing</li>
@@ -303,43 +303,43 @@ Okay, let me explain how to precisely control the capture process.<br>
 Let's take an example:<br>
 <br>
 Imagine you want to capture the following site:<br>
-<tt>www.someweb.com/gallery/flowers/</tt><br>
+<tt>www.example.com/gallery/flowers/</tt><br>
 <br>
-HTTrack, by default, will capture all links encountered in <tt>www.someweb.com/gallery/flowers/</tt> or in lower directories, like
-<tt>www.someweb.com/gallery/flowers/roses/</tt>.<br>
+HTTrack, by default, will capture all links encountered in <tt>www.example.com/gallery/flowers/</tt> or in lower directories, like
+<tt>www.example.com/gallery/flowers/roses/</tt>.<br>
 It will not follow links to other websites, because this behaviour might cause to capture the Web entirely!<br>
-It will not follow links located in higher directories, too (for example, <tt>www.someweb.com/gallery/flowers/</tt> itself) because this
+It will not follow links located in higher directories, too (for example, <tt>www.example.com/gallery/flowers/</tt> itself) because this
 might cause to capture too much data.<br>
 <br>
 This is the <b><u>default behaviour</b></u> of HTTrack, BUT, of course, if you want, you can tell HTTrack to capture other directorie(s), website(s)!..
 <br>
-In our example, we might want also to capture all links in <tt>www.someweb.com/gallery/trees/</tt>, and in <tt>www.someweb.com/photos/</tt><br>
+In our example, we might want also to capture all links in <tt>www.example.com/gallery/trees/</tt>, and in <tt>www.example.com/photos/</tt><br>
 <br>
 This can easily done by using filters: go to the Option panel, select the 'Scan rules' tab, and enter this line:
 (you can leave a blank space between each rules, instead of entering a carriage return)<br>
-<tt>+www.someweb.com/gallery/trees/*<br>
-+www.someweb.com/photos/*</tt><br>
+<tt>+www.example.com/gallery/trees/*<br>
++www.example.com/photos/*</tt><br>
 <br>
-This means "accept all links begining with <tt>www.someweb.com/gallery/trees/</tt> and <tt>www.someweb.com/photos/</tt>"
+This means "accept all links begining with <tt>www.example.com/gallery/trees/</tt> and <tt>www.example.com/photos/</tt>"
 - the <tt>+</tt> means "accept" and the final <tt>*</tt> means "any character will match after the previous ones".
 Remember the <tt>*.doc</tt> or <tt>*.zip</tt> encountered when you want to select all files from a certain type on your computer:
 it is almost the same here, except the begining "+"<br>
 <br>
-Now, we might want to exclude all links in <tt>www.someweb.com/gallery/trees/hugetrees/</tt>, because with the previous filter,
+Now, we might want to exclude all links in <tt>www.example.com/gallery/trees/hugetrees/</tt>, because with the previous filter,
 we accepted too many files. Here again, you can add a filter rule to refuse these links. Modify the previous filters to:<br>
-<tt>+www.someweb.com/gallery/trees/*<br>
-+www.someweb.com/photos/*<br>
--www.someweb.com/gallery/trees/hugetrees/*</tt><br>
+<tt>+www.example.com/gallery/trees/*<br>
++www.example.com/photos/*<br>
+-www.example.com/gallery/trees/hugetrees/*</tt><br>
 <br>
 You have noticed the <tt>-</tt> in the begining of the third rule: this means "refuse links matching the rule"
-; and the rule is "any files begining with <tt>www.someweb.com/gallery/trees/hugetrees/</tt><br>
+; and the rule is "any files begining with <tt>www.example.com/gallery/trees/hugetrees/</tt><br>

 Voila! With these three rules, you have precisely defined what you wanted to capture.<br>
 <br>
 A more complex example?<br>
 <br>
-Imagine that you want to accept all jpg files (files with .jpg type) that have "blue" in the name and located in www.someweb.com<br>
-<tt>+www.someweb.com/*blue*.jpg</tt><br>
+Imagine that you want to accept all jpg files (files with .jpg type) that have "blue" in the name and located in www.example.com<br>
+<tt>+www.example.com/*blue*.jpg</tt><br>
 <br>
 More detailed information can be found <a href="filters.html">here</a>!<br>
 <br>
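Note on the scan rules quoted in the hunk above: they can be read as an ordered list of `+` (accept) and `-` (refuse) patterns in which `*` matches any run of characters. The C sketch below illustrates that reading only; it is not HTTrack's filter engine. It supports nothing but `*`, it assumes that the last matching rule wins (the FAQ text further down says a rule declared later has higher priority), and the names `wild_match` and `filter_decide` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `s` matches `pat`, where '*' in
   `pat` matches any run of characters (including none). */
static int wild_match(const char *pat, const char *s) {
  const char *star = NULL;   /* position of the last '*' seen in pat */
  const char *retry = NULL;  /* where in s to resume after backtracking */

  while (*s != '\0') {
    if (*pat == '*') {
      star = pat++;
      retry = s;
    } else if (*pat == *s) {
      pat++;
      s++;
    } else if (star != NULL) {
      pat = star + 1;        /* let the last '*' absorb one more char */
      s = ++retry;
    } else {
      return 0;
    }
  }
  while (*pat == '*')
    pat++;
  return *pat == '\0';
}

/* Hypothetical helper: applies the rules in order. Each rule starts with
   '+' (accept) or '-' (refuse). Returns 1 for accept, 0 for refuse, or
   `fallback` when no rule matches. Assumption: the last match wins. */
static int filter_decide(const char *const *rules, size_t n,
                         const char *url, int fallback) {
  int verdict = fallback;
  size_t i;

  for (i = 0; i < n; i++) {
    const char *r = rules[i];
    if ((r[0] == '+' || r[0] == '-') && wild_match(r + 1, url))
      verdict = (r[0] == '+');
  }
  return verdict;
}
```

With the three rules from the hunk, a link under `gallery/trees/` is accepted by the first rule, while a link under `gallery/trees/hugetrees/` matches both the first and the third rule and is refused because the third one comes later.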
@@ -440,7 +440,7 @@ This will cause a performance loss, but will increase the compatibility with som

 <a NAME="QT1">Q: <strong>Only the first page is caught. What's wrong?</a></strong></br>
 A: <em>First, check the <tt>hts-log.txt</tt> file (and/or <tt>hts-err.txt</tt> error log file) - this can give you precious information.<br>
-The problem can be a website that redirects you to another site (for example, <tt>www.someweb.com</tt> to <tt>public.someweb.com</tt>) :
+The problem can be a website that redirects you to another site (for example, <tt>www.example.com</tt> to <tt>public.example.com</tt>) :
 in this case, use filters to accept this site<br>
 This can be, also, a problem in the HTTrack options (link depth too low, for example)</em>

@@ -485,10 +485,10 @@ You may also want to capture files that are forbidden by default by the <a href=
 In these cases, HTTrack does not capture these links automatically, you have to tell it to do so.
 <br><br>
 <ul><li>Either use the <a href="filters.html">filters</a>.<br>
-Example: You are downloading <tt>http://www.someweb.com/foo/</tt> and can not get .jpg images located
-in <tt>http://www.someweb.com/bar/</tt> (for example, http://www.someweb.com/bar/blue.jpg)<br>
-Then, add the filter rule <tt>+www.someweb.com/bar/*.jpg</tt> to accept all .jpg files from this location<br>
-You can, also, accept all files from the /bar folder with <tt>+www.someweb.com/bar/*</tt>, or only html files with <tt>+www.someweb.com/bar/*.html</tt> and so on..<br><br>
+Example: You are downloading <tt>http://www.example.com/foo/</tt> and can not get .jpg images located
+in <tt>http://www.example.com/bar/</tt> (for example, http://www.example.com/bar/blue.jpg)<br>
+Then, add the filter rule <tt>+www.example.com/bar/*.jpg</tt> to accept all .jpg files from this location<br>
+You can, also, accept all files from the /bar folder with <tt>+www.example.com/bar/*</tt>, or only html files with <tt>+www.example.com/bar/*.html</tt> and so on..<br><br>
 </li><li>
 If the problems are related to robots.txt rules, that do not let you access some folders (check in the logs if you are not sure),
 you may want to disable the default robots.txt rules in the options. (but only disable this option with great care,
@@ -509,8 +509,8 @@ and rescan the website as described before. HTTrack will be obliged to recatch t
 <a NAME="Q1bb">Q: <strong>FTP links are not caught! What's happening?</strong><br>
 A: <em>FTP files might be seen as external links, especially if they are located in outside domain. You have either to accept all external links (See the links options, -n option) or
 only specific files (see <a href="filters.html">filters</a> section). <br>
-Example: You are downloading <tt>http://www.someweb.com/foo/</tt> and can not get ftp://ftp.someweb.com files<br>
-Then, add the filter rule <tt>+ftp.someweb.com/*</tt> to accept all files from this (ftp) location<br>
+Example: You are downloading <tt>http://www.example.com/foo/</tt> and can not get ftp://ftp.example.com files<br>
+Then, add the filter rule <tt>+ftp.example.com/*</tt> to accept all files from this (ftp) location<br>
 </em>
 <br>

@@ -551,10 +551,10 @@ Note: In some rare cases, duplicate data files can be found when the website red

 <a NAME="Q1b2">Q: <strong>I'm downloading too many files! What can I do?</strong><br>
 A: <em>This is often the case when you use too large a filter, for example <tt>+*.html</tt>, which asks the
-engine to catch all .html pages (even ones on other sites!). In this case, try to use more specific filters, like <tt>+www.someweb.com/specificfolder/*.html</tt><br>
-If you still have too many files, use filters to avoid somes files. For example, if you have too many files from www.someweb.com/big/,
-use <tt>-www.someweb.com/big/*</tt> to avoid all files from this folder. Remember that the default behaviour of the engine, when
-mirroring http://www.someweb.com/big/index.html, is to catch everything in http://www.someweb.com/big/. Filters are your friends,
+engine to catch all .html pages (even ones on other sites!). In this case, try to use more specific filters, like <tt>+www.example.com/specificfolder/*.html</tt><br>
+If you still have too many files, use filters to avoid somes files. For example, if you have too many files from www.example.com/big/,
+use <tt>-www.example.com/big/*</tt> to avoid all files from this folder. Remember that the default behaviour of the engine, when
+mirroring http://www.example.com/big/index.html, is to catch everything in http://www.example.com/big/. Filters are your friends,
 use them!
 </em>
 <br>
@@ -562,7 +562,7 @@ use them!

 <a NAME="Q1b22">Q: <strong>The engine turns crazy, getting thousands of files! What's going on?</strong><br>
 A: <em>This can happen if a loop occurs in some bogus website. For example, a page that refers to itself, with a timestamp
-in the query string (e.g. <tt>http://www.someweb.com/foo.asp?ts=2000/10/10,09:45:17:147</tt>).
+in the query string (e.g. <tt>http://www.example.com/foo.asp?ts=2000/10/10,09:45:17:147</tt>).
 These are really annoying, as it is VERY difficult to detect the loop (the timestamp might be a page number).
 To limit the problem: set a recurse level (for example to 6), or avoid the bogus pages (use the filters)
 </em>
@@ -571,7 +571,7 @@ To limit the problem: set a recurse level (for example to 6), or avoid the bogus

 <a NAME="Q1b3">Q: <strong>File are sometimes renamed (the type is changed)! Why?</strong><br>
 A: <em>By default, HTTrack tries to know the type of remote files. This is useful when links like
-<tt>http://www.someweb.com/foo.cgi?id=1</tt> can be either HTML pages, images or anything else.
+<tt>http://www.example.com/foo.cgi?id=1</tt> can be either HTML pages, images or anything else.
 Locally, foo.cgi will not be recognized as an html page, or as an image, by your browser. HTTrack has to rename the file
 as foo.html or foo.gif so that it can be viewed.<br>
 </em>
@@ -730,8 +730,8 @@ but this is a smart bug..
 the domain, too. How to retrieve them?</strong><br>
 A: <em>If you just want to retrieve files that can be reached through links, just activate
 the 'get file near links' option. But if you want to retrieve html pages too, you can both
-use wildcards or explicit addresses ; e.g. add <tt>www.someweb.com/*</tt> to accept all
-files and pages from www.someweb.com.<br>
+use wildcards or explicit addresses ; e.g. add <tt>www.example.com/*</tt> to accept all
+files and pages from www.example.com.<br>
 <br>
 </em></a><a NAME="Q6">Q: <strong>I have forgotten some URLs of files during a long
 mirror.. Should I redo all?</strong><br>
@@ -744,7 +744,7 @@ A: <em>You can use different methods. You can use the 'get files near a link' op
 files are in a foreign domain. You can use, too, a filter adress: adding <tt>+*.zip</tt>
 in the URL list (or in the filter list) will accept all ZIP files, even if these files are
 outside the address. <br>
-Example : <tt>httrack www.someweb.com/someaddress.html +*.zip</tt> will allow
+Example : <tt>httrack www.example.com/someaddress.html +*.zip</tt> will allow
 you to retrieve all zip files that are linked on the site.</em><br>
 <br>
 </a><a NAME="Q8">Q: <strong>There are ZIP files in a page, but I don't want to transfer
@@ -771,7 +771,7 @@ them on filters!</strong><br>
 A: <em>By default, HTTrack retrieves all types of files on authorized links. To avoid
 that, define filters like </a><a NAME="Q7"><tt>-* +<website>/*.html
 +<website>/*.htm +<website>/ +*.<type wanted></tt></a><a NAME="Q10"><br>
-Example: <tt>httrack www.someweb.com/index.html -* +www.someweb.com/*.htm* +www.someweb.com/*.gif +www.someweb.com/*.jpg</tt><br>
+Example: <tt>httrack www.example.com/index.html -* +www.example.com/*.htm* +www.example.com/*.gif +www.example.com/*.jpg</tt><br>
 <br>
 </em><a NAME="Q10">Q: <strong>When I use filters, I get too many files!</strong><br>
 A: <em>You might use too large a filter, for example <tt>*.html</tt> will get ALL html
@@ -779,13 +779,13 @@ files identified. If you want to get all files on an address, use <tt>www.<ad
 If you want to get ONLY files defined by your filters, use something like <tt>-* +www.foo.com/*</tt>, because
 <tt>+www.foo.com/*</tt> will only accept selected links without forbidding other ones!<br>
 There are lots of possibilities using filters.<br>
-Example:<tt>httrack www.someweb.com +*.someweb.com/*.htm*</tt><br>
+Example:<tt>httrack www.example.com +*.example.com/*.htm*</tt><br>
 <br>
 </em></a><a NAME="Q11">Q: <strong>When I use filters, I can't access another domain, but I
 have filtered it!</strong><br>
-A: <em>You may have done a mistake declaring filters, for example <tt>+www.someweb.com/*
--*someweb* </tt></em>will not work, because -*someweb* has an upper priority (because it has
-been declared after +www.someweb.com)<br>
+A: <em>You may have done a mistake declaring filters, for example <tt>+www.example.com/*
+-*example* </tt></em>will not work, because -*example* has an upper priority (because it has
+been declared after +www.example.com)<br>
 <br>
 </a><a NAME="Q12">Q: <strong>Must I add a '+' or '-' in the filter list when I want
 to use filters?</strong><br>
@@ -800,7 +800,7 @@ filter list) and accept only html files and the file(s) you want to retrieve (BU
 forget to add <tt>+<website>*.html</tt> in the filter list, or pages will not be
 scanned! Add the name of files you want with a <tt>*/</tt> before ; i.e. if you want to
 retrieve file.zip, add <tt>*/file.zip</tt>)<br>
-Example:<tt>httrack www.someweb.com +www.someweb.com/*.htm* +thefileiwant.zip</tt><br>
+Example:<tt>httrack www.example.com +www.example.com/*.htm* +thefileiwant.zip</tt><br>
 <br>
 </em>

@@ -828,7 +828,7 @@ A: <em>Yes. See the URL capture abilities (--catchurl for command-line release,
 A: <em>Yes. See the shell system command option (-V option for command-line release)</em>

 <br><br><a NAME="QM6">Q: <strong>Can I use username/password authentication on a site?</strong></a><br>
-A: <em>Yes. Use user:password@your_url (example: <tt>http://foo:bar@www.someweb.com/private/mybox.html</tt>)</em>
+A: <em>Yes. Use user:password@your_url (example: <tt>http://foo:bar@www.example.com/private/mybox.html</tt>)</em>

 <br><br><a NAME="QM7">Q: <strong>Can I use username/password authentication for a proxy?</strong></a><br>
 A: <em>Yes. Use user:password@your_proxy_name as your proxy name (example: <tt>smith:foo@proxy.mycorp.com</tt>)</em>
@@ -332,7 +332,7 @@ Details: User-defined option N
 %N Name of file, including file type (ex: image.gif)
 %t File type (ex: gif)
 %p Path [without ending /] (ex: /someimages)
-%h Host name (ex: www.someweb.com) (--http-10)
+%h Host name (ex: www.example.com) (--http-10)
 %M URL MD5 (128 bits, 32 ascii bytes)
 %Q query string MD5 (128 bits, 32 ascii bytes)
 %q small query string MD5 (16 bits, 4 ascii bytes) (--include-query-string)
@@ -356,17 +356,17 @@ Shortcuts:

 --http10 force http/1.0 requests (-%h)

-example: httrack www.someweb.com/bob/
-means: mirror site www.someweb.com/bob/ and only this site
+example: httrack www.example.com/bob/
+means: mirror site www.example.com/bob/ and only this site

-example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
+example: httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
 means: mirror the two sites together (with shared links) and accept any .jpg files on .com sites

-example: httrack www.someweb.com/bob/bobby.html +* -r6
+example: httrack www.example.com/bob/bobby.html +* -r6
 means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web

-example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
-runs the spider on www.someweb.com/bob/bobby.html using a proxy
+example: httrack www.example.com/bob/bobby.html --spider -P proxy.myhost.com:8080
+runs the spider on www.example.com/bob/bobby.html using a proxy

 example: httrack --update
 updates a mirror in the current folder
@@ -2054,8 +2054,8 @@ generated automatically using the interface)
 <td>This will accept all zip files in .com addresses</td>
 </tr>
 <tr>
-<td><b>-*someweb*/*.tar*</b></td>
-<td>This will refuse all tar (or tar.gz etc.) files in hosts containing someweb</td>
+<td><b>-*example*/*.tar*</b></td>
+<td>This will refuse all tar (or tar.gz etc.) files in hosts containing example</td>
 </tr>
 <tr>
 <td><b>+*/*somepage*</b></td>
@@ -109,8 +109,8 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>

 <i>You have to know that once you have defined
 starts links, the default mode is to mirror these links - i.e. if one of your start page is
-www.someweb.com/test/index.html, all links starting with www.someweb.com/test/ will be
-accepted. But links directly in www.someweb.com/.. will not be accepted, however, because
+www.example.com/test/index.html, all links starting with www.example.com/test/ will be
+accepted. But links directly in www.example.com/.. will not be accepted, however, because
 they are in a higher strcuture. This prevent HTTrack from mirroring the whole site. (All
 files in structure levels equal or lower than the primary links will be retrieved.)<br>
 </i>
@@ -278,8 +278,8 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
 <td>This will refuse/accept all zip files in .com addresses</td>
 </tr>
 <tr>
-<td nowrap><tt>*someweb*/*.tar*</tt></td>
-<td>This will refuse/accept all tar (or tar.gz etc.) files in hosts containing someweb</td>
+<td nowrap><tt>*example*/*.tar*</tt></td>
+<td>This will refuse/accept all tar (or tar.gz etc.) files in hosts containing example</td>
 </tr>
 <tr>
 <td nowrap><tt>*/*somepage*</tt></td>
@@ -289,13 +289,13 @@ See also: The <a href="faq.html#VF1">FAQ</a><br>
 <td nowrap><tt>*.html</tt></td>
 <td>This will refuse/accept all html files. <br>
 Warning! With this filter you will accept ALL html files, even those in other addresses.
-(causing a global (!) web mirror..) Use www.someweb.com/*.html to accept all html files from
+(causing a global (!) web mirror..) Use www.example.com/*.html to accept all html files from
 a web.</td>
 </tr>
 <tr>
 <td nowrap><tt>*.html*[]</tt></td>
 <td>Identical to <tt>*.html</tt>, but the link must not have any supplemental characters
-at the end (links with parameters, like <tt>www.someweb.com/index.html?page=10</tt>, will be
+at the end (links with parameters, like <tt>www.example.com/index.html?page=10</tt>, will be
 refused)</td>
 </tr>
 </table>
@@ -123,12 +123,12 @@ mirrored site, and resume interrupted downloads.</p>


 <p style="margin-left:11%; margin-top: 1em"><b>httrack
-www.someweb.com/bob/</b></p>
+www.example.com/bob/</b></p>

 <p style="margin-left:22%;">mirror site
-www.someweb.com/bob/ and only this site</p>
+www.example.com/bob/ and only this site</p>

-<p style="margin-left:11%;"><b>httrack www.someweb.com/bob/
+<p style="margin-left:11%;"><b>httrack www.example.com/bob/
 www.anothertest.com/mike/ +*.com/*.jpg <br>
 -mime:application/*</b></p>

@@ -137,18 +137,18 @@ www.anothertest.com/mike/ +*.com/*.jpg <br>
 sites</p>

 <p style="margin-left:11%;"><b>httrack
-www.someweb.com/bob/bobby.html +* -r6</b></p>
+www.example.com/bob/bobby.html +* -r6</b></p>

 <p style="margin-left:22%;">means get all files starting
 from bobby.html, with 6 link-depth, and possibility of going
 everywhere on the web</p>

 <p style="margin-left:11%;"><b>httrack
-www.someweb.com/bob/bobby.html --spider -P <br>
+www.example.com/bob/bobby.html --spider -P <br>
 proxy.myhost.com:8080</b></p>

 <p style="margin-left:22%;">runs the spider on
-www.someweb.com/bob/bobby.html using a proxy</p>
+www.example.com/bob/bobby.html using a proxy</p>

 <p style="margin-left:11%;"><b>httrack --update</b></p>

@@ -1877,7 +1877,7 @@ User-defined option N</b> <br>
 %N Name of file, including file type (ex: image.gif) <br>
 %t File type (ex: gif) <br>
 %p Path [without ending /] (ex: /someimages) <br>
-%h Host name (ex: www.someweb.com) <br>
+%h Host name (ex: www.example.com) <br>
 %M URL MD5 (128 bits, 32 ascii bytes) <br>
 %Q query string MD5 (128 bits, 32 ascii bytes) <br>
 %k full query string <br>
@@ -131,16 +131,16 @@ This is the default primary scanning option, the engine does not go out of domai

 d stay on the same principal domain
 This option lets the engine go on all sites that exist on the same principal domain.
-Example: a link located at www.someweb.com that goes to members.someweb.com will be followed.
+Example: a link located at www.example.com that goes to members.example.com will be followed.

 l stay on the same location (.com, etc.)
 This option lets the engine go on all sites that exist on the same location.
-Example: a link located at www.someweb.com that goes to www.anyotherweb.com will be followed.
+Example: a link located at www.example.com that goes to www.anyotherweb.com will be followed.
 Warning: this is a potentially dangerous option, limit the recurse depth with r option.

 e go everywhere on the web
 This option lets the engine go on any sites.
-Example: a link located at www.someweb.com that goes to www.anyotherweb.org will be followed.
+Example: a link located at www.example.com that goes to www.anyotherweb.org will be followed.
 Warning: this is a potentially dangerous option, limit the recurse depth with r option.

 n get non-html files 'near' an html file (ex: an image located outside)
@@ -117,7 +117,7 @@ h4 { margin: 0; font-weight: bold; font-size: 1.18em; }
 <li>HTML Footer</li>
 <br><small>Enter here the optionnal text that will be included as a comment in each HTML file to make archiving easier
 <br>The string entered is generally an HTML comment (<tt><!-- HTML comment --></tt>) with optionnal %s, which will be transformed into a specific string information:
-<br>%s #1 : host name (for example, www.someweb.com)
+<br>%s #1 : host name (for example, www.example.com)
 <br>%s #2 : file name (for example, /index.html)
 <br>%s #3 : date of the mirror
 <br><b>Example</b>: <tt><!-- Page mirrored from %s, file %s. Archive date: %s --></tt>
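Note on the HTML footer text in the hunk above: the three `%s` placeholders are filled in order with the host name, the file name and the mirror date. A minimal C sketch of that substitution follows. It is an illustration only, not HTTrack's code; the function name `format_footer` is hypothetical, and the date string used in the usage line is an assumed format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: fills the documentation's example footer with the
   host name, file name and mirror date, in that order.
   Returns 1 on success, 0 if the result did not fit in `out`. */
static int format_footer(char *out, size_t outsize, const char *host,
                         const char *file, const char *date) {
  /* Template copied from the documentation example; it is kept as a
     fixed string so that no user-supplied format reaches snprintf. */
  static const char tmpl[] =
    "<!-- Page mirrored from %s, file %s. Archive date: %s -->";
  int n = snprintf(out, outsize, tmpl, host, file, date);

  return n >= 0 && (size_t) n < outsize;
}
```

Calling `format_footer(buf, sizeof(buf), "www.example.com", "/index.html", "13 June 2026")` would leave the comment `<!-- Page mirrored from www.example.com, file /index.html. Archive date: 13 June 2026 -->` in `buf`.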
@@ -13,3 +13,9 @@ regen-man: makeman.sh $(top_builddir)/src/httrack$(EXEEXT)
 README='$(top_srcdir)/README' $(SHELL) $(srcdir)/makeman.sh \
 '$(top_builddir)/src/httrack$(EXEEXT)' > $(srcdir)/httrack.1
 .PHONY: regen-man
+
+# Render html/httrack.man.html from httrack.1. Needs the groff html device
+# (Debian: full "groff" package, not "groff-base"). Run by hand: make -C man regen-man-html
+regen-man-html: httrack.1
+groff -t -man -Thtml $(srcdir)/httrack.1 > $(top_srcdir)/html/httrack.man.html
+.PHONY: regen-man-html
@@ -551,6 +551,12 @@ regen-man: makeman.sh $(top_builddir)/src/httrack$(EXEEXT)
 '$(top_builddir)/src/httrack$(EXEEXT)' > $(srcdir)/httrack.1
 .PHONY: regen-man
+
+# Render html/httrack.man.html from httrack.1. Needs the groff html device
+# (Debian: full "groff" package, not "groff-base"). Run by hand: make -C man regen-man-html
+regen-man-html: httrack.1
+groff -t -man -Thtml $(srcdir)/httrack.1 > $(top_srcdir)/html/httrack.man.html
+.PHONY: regen-man-html

 # Tell versions [3.59,3.63) of GNU make to not export all variables.
 # Otherwise a system limit (for SysV at least) may be exceeded.
 .NOEXPORT:
@@ -2,7 +2,7 @@
 .\" groff -man -Tascii httrack.1
 .\"
 .\" This file is generated by man/makeman.sh; do not edit by hand.
-.TH httrack 1 "07 June 2026" "httrack website copier"
+.TH httrack 1 "13 June 2026" "httrack website copier"
 .SH NAME
 httrack \- offline browser : copy websites to a local directory
 .SH SYNOPSIS
@@ -98,15 +98,15 @@ httrack \- offline browser : copy websites to a local directory
 allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads.
 .SH EXAMPLES
 .TP
-.B httrack www.someweb.com/bob/
-mirror site www.someweb.com/bob/ and only this site
+.B httrack www.example.com/bob/
+mirror site www.example.com/bob/ and only this site
 .TP
-.B httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg \-mime:application/*
+.B httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg \-mime:application/*
 mirror the two sites together (with shared links) and accept any .jpg files on .com sites
 .TP
-.B httrack www.someweb.com/bob/bobby.html +* \-r6
+.B httrack www.example.com/bob/bobby.html +* \-r6
 .TP
-.B httrack www.someweb.com/bob/bobby.html \-\-spider \-P proxy.myhost.com:8080
+.B httrack www.example.com/bob/bobby.html \-\-spider \-P proxy.myhost.com:8080
 .TP
 .B httrack \-\-update
 .TP
@@ -411,7 +411,7 @@ File type (ex: gif)
 .IP \-%p
 Path [without ending /] (ex: /someimages)
 .IP \-%h
-Host name (ex: www.someweb.com)
+Host name (ex: www.example.com)
 .IP \-%M
 URL MD5 (128 bits, 32 ascii bytes)
 .IP \-%Q
@@ -712,7 +712,7 @@ void help(const char *app, int more) {
 infomsg(" '%N' Name of file, including file type (ex: image.gif)");
 infomsg(" '%t' File type (ex: gif)");
 infomsg(" '%p' Path [without ending /] (ex: /someimages)");
-infomsg(" '%h' Host name (ex: www.someweb.com)");
+infomsg(" '%h' Host name (ex: www.example.com)");
 infomsg(" '%M' URL MD5 (128 bits, 32 ascii bytes)");
 infomsg(" '%Q' query string MD5 (128 bits, 32 ascii bytes)");
 infomsg(" '%k' full query string");
@@ -767,21 +767,21 @@ void help(const char *app, int more) {
 infomsg("Details: Option %W: External callbacks prototypes");
 infomsg("see htsdefines.h");
 infomsg("");
-infomsg("example: httrack www.someweb.com/bob/");
-infomsg("means: mirror site www.someweb.com/bob/ and only this site");
+infomsg("example: httrack www.example.com/bob/");
+infomsg("means: mirror site www.example.com/bob/ and only this site");
 infomsg("");
 infomsg
-("example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*");
+("example: httrack www.example.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*");
 infomsg
 ("means: mirror the two sites together (with shared links) and accept any .jpg files on .com sites");
 infomsg("");
-infomsg("example: httrack www.someweb.com/bob/bobby.html +* -r6");
+infomsg("example: httrack www.example.com/bob/bobby.html +* -r6");
 infomsg
 ("means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web");
 infomsg("");
 infomsg
-("example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080");
-infomsg("runs the spider on www.someweb.com/bob/bobby.html using a proxy");
+("example: httrack www.example.com/bob/bobby.html --spider -P proxy.myhost.com:8080");
+infomsg("runs the spider on www.example.com/bob/bobby.html using a proxy");
 infomsg("");
 infomsg("example: httrack --update");
 infomsg("updates a mirror in the current folder");
@@ -121,6 +121,7 @@ const char *hts_detect[] = {
 "lowsrc",
 "profile", // element META
 "src",
+"srcset", // HTML5 responsive images (<img>, <source>)
 "swurl",
 "url",
 "usemap",
@@ -895,9 +896,9 @@ int http_sendhead(httrackp * opt, t_cookie * cookie, int mode,

 // possibilité non documentée: >post: et >postfile:
 // si présence d'un tag >post: alors executer un POST
-// exemple: http://www.someweb.com/test.cgi?foo>post:posteddata=10&foo=5
+// exemple: http://www.example.com/test.cgi?foo>post:posteddata=10&foo=5
 // si présence d'un tag >postfile: alors envoyer en tête brut contenu dans le fichier en question
-// exemple: http://www.someweb.com/test.cgi?foo>postfile:post0.txt
+// exemple: http://www.example.com/test.cgi?foo>postfile:post0.txt
 search_tag = strstr(fil, POSTTOK ":");
 if (!search_tag) {
 search_tag = strstr(fil, POSTTOK "file:");
@@ -532,6 +532,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
 int valid_p = 0; // force to take p even if == 0
 int ending_p = '\0'; // ending quote?
 int archivetag_p = 0; // avoid multiple-archives with commas
+int srcset_p = 0; // srcset="url1 480w, url2 2x": list of URLs
 int unquoted_script = 0;
 INSCRIPT inscript_state_pos_prev = inscript_state_pos;

@@ -1050,6 +1051,14 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
if (strcmp(hts_detect[i], "archive") == 0) {
|
||||
archivetag_p = 1;
|
||||
}
|
||||
/* srcset holds a comma-separated list of candidate
|
||||
URLs, each with an optional 480w/2x descriptor
|
||||
(issues #235, #236); each URL is captured and
|
||||
rewritten in turn below. */
|
||||
else if (strcmp(hts_detect[i], "srcset") == 0
|
||||
|| strcmp(hts_detect[i], "data-srcset") == 0) {
|
||||
srcset_p = 1;
|
||||
}
|
||||
}
|
||||
i++;
|
||||
}
|
||||
@@ -1815,6 +1824,15 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
html++; // skip # for usemap etc
|
||||
}
|
||||
}
|
||||
srcset_next:
|
||||
/* srcset: strip whitespace before each candidate URL so a value
|
||||
like " a.gif 2x" (or the URL following a comma) is captured from
|
||||
its first real byte. The opening quote was already consumed by
|
||||
the lead-in above; the skipped bytes flush verbatim below. */
|
||||
if (srcset_p) {
|
||||
while(html < r->adr + r->size && is_realspace(*html))
|
||||
INCREMENT_CURRENT_ADR(1);
|
||||
}
|
||||
eadr = html;
|
||||
|
||||
// do not flush after code if the codebase must be written first!
|
||||
@@ -1844,6 +1862,7 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
if ((*eadr == quote && (!quoteinscript || *(eadr - 1) == '\\')) // end quote
|
||||
|| (noquote && (*eadr == '\"' || *eadr == '\'')) // end at any quote
|
||||
|| (!noquote && quote == '\0' && is_realspace(*eadr)) // unquoted href
|
||||
|| srcset_p // whitespace ends a srcset candidate URL
|
||||
) // if no special quote is expected, or if the quote was reached
|
||||
ok = 0;
|
||||
} else if (ending_p && (*eadr == ending_p))
|
||||
@@ -1872,6 +1891,10 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
break; // \" or \' stop point
|
||||
case '?': /*quote_adr=adr; */
|
||||
break; // note the query position
|
||||
case ',':
|
||||
if (srcset_p) // comma separates srcset candidates
|
||||
ok = 0;
|
||||
break;
|
||||
}
|
||||
}
|
||||
//}
|
||||
@@ -3250,6 +3273,33 @@ int htsparse(htsmoduleStruct * str, htsmoduleStructExtended * stre) {
|
||||
}
|
||||
// adr=eadr-1; // ** skip
|
||||
|
||||
/* srcset (issues #235, #236): a srcset value is a comma-separated
|
||||
list of "URL [descriptor]" entries. The URL just handled stopped
|
||||
at whitespace or a comma; advance over the optional descriptor
|
||||
(480w, 2x) and the comma to the next candidate URL, then re-run
|
||||
the capture so every candidate is rewritten and queued. The
|
||||
descriptor and comma flush verbatim through 'lastsaved' when the
|
||||
label re-enters. 'html' currently sits on the byte that ended the
|
||||
URL token. */
|
||||
if (srcset_p && ok == 0) {
|
||||
const char *const endp = r->adr + r->size; // end of the buffer
|
||||
const char *q = html;
|
||||
while(q < endp && *q != '\0' && *q != ',' && *q != quote
|
||||
&& *q != '<' && *q != '>' && (unsigned char) *q >= 32)
|
||||
q++; // skip the optional descriptor
|
||||
if (q < endp && *q == ',') {
|
||||
q++; // skip the comma between candidates
|
||||
while(q < endp && is_space(*q))
|
||||
q++; // skip whitespace before the next URL
|
||||
if (q < endp && *q != '\0' && *q != ',' && *q != quote
|
||||
&& *q != '<' && *q != '>' && (unsigned char) *q >= 32) {
|
||||
INCREMENT_CURRENT_ADR(q - html); // keep the automate in sync
|
||||
ok = 1;
|
||||
goto srcset_next;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/* We skipped bytes and skip the " : reset state */
|
||||
/*if (inscript) {
|
||||
inscript_state_pos = INSCRIPT_START;
|
||||
|
||||
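The srcset handling described in the parser comments above (skip leading whitespace, end the URL at whitespace or a comma, step over the optional descriptor, then resume at the next candidate) can be sketched in isolation. This is a minimal standalone illustration, not HTTrack code: `srcset_each_url` and `srcset_cb` are hypothetical names, and the sketch deliberately mirrors the patch's simplification that any comma separates candidates.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical callback type (not an HTTrack API): receives one candidate
   URL as a pointer/length pair into the original srcset value. */
typedef void (*srcset_cb)(const char *url, size_t len, void *arg);

/* Hypothetical helper: walks a srcset value such as "a.gif 480w, b.gif 2x"
   and calls 'cb' (if non-NULL) once per candidate URL. Returns the number
   of candidate URLs found. */
static size_t srcset_each_url(const char *value, srcset_cb cb, void *arg) {
  const char *p = value;
  size_t count = 0;

  assert(value != NULL);
  while (*p != '\0') {
    const char *start;

    /* skip whitespace before the candidate URL */
    while (*p != '\0' && isspace((unsigned char) *p))
      p++;
    /* the URL token ends at whitespace or at a comma */
    start = p;
    while (*p != '\0' && *p != ',' && !isspace((unsigned char) *p))
      p++;
    if (p > start) {
      if (cb != NULL)
        cb(start, (size_t) (p - start), arg);
      count++;
    }
    /* skip the optional descriptor (480w, 2x) up to the next comma */
    while (*p != '\0' && *p != ',')
      p++;
    /* skip the comma between candidates */
    if (*p == ',')
      p++;
  }
  return count;
}
```

Descriptors are never reported as URLs here, which is the property the `notfound "480w"` and `notfound "2x"` assertions in the new test check against the real parser.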
@@ -47,3 +47,25 @@ match '*foo*bar' 'foozbar'
|
||||
|
||||
# '?' is the query-string marker, not a single-char wildcard
|
||||
nomatch 'a?c' 'abc'
|
||||
|
||||
# backslash escapes a metacharacter inside a class so it is matched literally.
|
||||
# Quirk: the decoder also adds the backslash itself to the set, so '\X' matches
|
||||
# both X and '\'. These assertions pin that behavior.
|
||||
match '*[\*]' '*'
|
||||
match '*[\*]' "\\"
|
||||
nomatch '*[\*]' 'a'
|
||||
match '*[\\]' "\\"
|
||||
nomatch '*[\\]' 'a'
|
||||
match '*[\[]' '['
|
||||
match '*[\[]' "\\"
|
||||
nomatch '*[\[]' 'a'
|
||||
|
||||
# A literal ']' cannot be a class member: the class parser stops at the first
|
||||
# ']', escaped or not. So '*[\[\]]' does NOT mean "the [ or ] character" as the
|
||||
# filter guide claims (GitHub #148); it parses as the class {'[','\'} followed
|
||||
# by a trailing literal ']'. These assertions document the current (buggy)
|
||||
# behavior so any future matcher fix is a deliberate, visible change.
|
||||
nomatch '*[\[\]]' '[' # not matched, despite the docs
|
||||
match '*[\[\]]' ']' # only via the empty class-match + trailing ']'
|
||||
match '*[\[\]]' '[]' # one of {'[','\'} then the trailing ']'
|
||||
nomatch '*[\[\]]' '[]x'
|
||||
|
||||
147
tests/01_engine-parse.test
Executable file
@@ -0,0 +1,147 @@
|
||||
#!/bin/bash
|
||||
#
|
||||
|
||||
# Offline HTML parser / link-rewrite tests. Each section crawls a small on-disk
|
||||
# HTML tree over file:// (no network), so these run unconditionally like the
|
||||
# other 01_engine tests, and asserts which assets the parser captured and how
|
||||
# it rewrote the links.
|
||||
|
||||
set -u
|
||||
|
||||
tmp=$(mktemp -d "${TMPDIR:-/tmp}/httrack_parse.XXXXXX") || exit 1
|
||||
trap 'rm -rf "$tmp"' EXIT HUP INT QUIT PIPE TERM
|
||||
|
||||
# a minimal valid 1x1 GIF, reused for every referenced asset
|
||||
gif() {
|
||||
printf 'GIF89a\1\0\1\0\200\0\0\0\0\0\377\377\377!\371\4\1\0\0\0\0,\0\0\0\0\1\0\1\0\0\2\2D\1\0;' >"$1"
|
||||
}
|
||||
|
||||
# crawl <fixture-html> into <out> with link rewriting on, no extra fetching
|
||||
crawl() {
|
||||
local html="$1" out="$2"
|
||||
rm -rf "$out"
|
||||
mkdir -p "$out"
|
||||
httrack "file://$html" -O "$out" --quiet --near -n >"$out/.log" 2>&1
|
||||
}
|
||||
|
||||
# assert a file with the given basename was saved somewhere under <out>
|
||||
found() {
|
||||
test -n "$(find "$2" -type f -name "$1" -print -quit)" ||
|
||||
! echo "FAIL: expected '$1' to be downloaded under $2" || exit 1
|
||||
}
|
||||
|
||||
# assert NO file with the given basename was saved (e.g. a descriptor token must
|
||||
# not be mistaken for a URL)
|
||||
notfound() {
|
||||
test -z "$(find "$2" -type f -name "$1" -print -quit)" ||
|
||||
! echo "FAIL: '$1' should not have been downloaded under $2" || exit 1
|
||||
}
|
||||
|
||||
# the saved copy of the crawled fixture page. A file:// mirror stores it under
|
||||
# the generated "file/" host directory; the top-level index.html is HTTrack's
|
||||
# own landing page and must not be matched here.
|
||||
savedhtml() {
|
||||
find "$1" -type f -path '*/file/*' -name index.html -print -quit
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# srcset on <img> and <source> (issues #235, #236). Every candidate URL in a
|
||||
# srcset list must be captured and queued, the 480w/2x descriptors preserved,
|
||||
# the listed URLs actually rewritten to their local copies, and srcset must not
|
||||
# swallow the attributes that follow it in the same tag.
|
||||
site="$tmp/srcset"
|
||||
mkdir -p "$site"
|
||||
for f in a b c d e f g h i j v; do gif "$site/$f.gif"; done
|
||||
# heredoc is unquoted so $site expands in the absolute-URL candidate below; the
|
||||
# fixture contains no other '$' or backticks.
|
||||
cat >"$site/index.html" <<EOF
|
||||
<html><body>
|
||||
<img src="a.gif" srcset="b.gif 480w, c.gif 800w">
|
||||
<picture><source srcset="d.gif 1x, c.gif 2x"><img src="a.gif"></picture>
|
||||
<img srcset="e.gif,f.gif">
|
||||
<img srcset="g.gif 2x" alt="trailing attr after srcset">
|
||||
<img srcset=" h.gif 2x , i.gif ">
|
||||
<video><source src="v.gif"></video>
|
||||
<img srcset="file://$site/j.gif 2x">
|
||||
<a href="a.gif">plain link still works</a>
|
||||
</body></html>
|
||||
EOF
|
||||
out="$tmp/srcset-out"
|
||||
crawl "$site/index.html" "$out"
|
||||
|
||||
# every src=/href= and every srcset candidate must be downloaded, including the
|
||||
# unique tail-only URLs (catches first-candidate-only parsing), the
|
||||
# whitespace-padded list (h,i), the <source src> form (v), and the
|
||||
# absolute-URL candidate (j)
|
||||
for f in a b c d e f g h i j v; do found "$f.gif" "$out"; done
|
||||
|
||||
# the width/density descriptors are not URLs and must not be fetched
|
||||
notfound "480w" "$out"
|
||||
notfound "800w" "$out"
|
||||
notfound "2x" "$out"
|
||||
|
||||
saved=$(savedhtml "$out")
|
||||
test -n "$saved" || ! echo "FAIL: saved index.html not found" || exit 1
|
||||
|
||||
# descriptors must survive the rewrite (no "b.gif 480w" mangled into a path)
|
||||
grep -Eq 'srcset="[^"]*480w[^"]*800w' "$saved" ||
|
||||
! echo "FAIL: srcset width descriptors lost/reordered in rewritten HTML" || exit 1
|
||||
grep -Eq 'srcset="[^"]*1x[^"]*2x' "$saved" ||
|
||||
! echo "FAIL: srcset density descriptors lost/reordered in rewritten HTML" || exit 1
|
||||
# the no-space comma form is preserved verbatim (the rewrite flushes separators
|
||||
# byte-for-byte rather than reserializing the list)
|
||||
grep -Eq 'srcset="e\.gif,f\.gif"' "$saved" ||
|
||||
! echo "FAIL: comma-separated srcset without descriptors was altered" || exit 1
|
||||
# an attribute following srcset in the same tag must be left intact
|
||||
grep -q 'alt="trailing attr after srcset"' "$saved" ||
|
||||
! echo "FAIL: srcset swallowed a following attribute" || exit 1
|
||||
|
||||
# rewrite must be real, not passthrough: the absolute file:// candidate must be
|
||||
# replaced by a local reference. A flat fixture hides this (local name ==
|
||||
# original name), so the absolute URL is the discriminating case. The candidate
|
||||
# must become the bare local "j.gif", and no srcset value may still carry a
|
||||
# file:// URL. (HTTrack's footer provenance comment legitimately mentions the
|
||||
# source file:// URL, so the check is scoped to the srcset attribute.)
|
||||
grep -Eq 'srcset="j\.gif 2x"' "$saved" ||
|
||||
! echo "FAIL: absolute file:// srcset URL was not rewritten to a local link" || exit 1
|
||||
! grep -Eq 'srcset="[^"]*file://' "$saved" ||
|
||||
! echo "FAIL: a file:// URL survived inside a rewritten srcset attribute" || exit 1
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Attribute URL detection and rewrite for xlink:href (#298) and inline
|
||||
# background-image (#237). Both are detected reliably and rewritten to the
|
||||
# local copy. Detection is asserted by rewrite, not by download: an absolute
|
||||
# file:// URL becomes a local reference when detected, and stays file:// when
|
||||
# not (download depends on scope/--near, rewrite does not). The no-detect
|
||||
# exclusion list (title, alt, class, ...) must leave its values untouched.
|
||||
#
|
||||
# Note: generic detection of arbitrary data-* attributes (#201/#203) is NOT
|
||||
# covered here. Its behavior is currently order/context dependent (the same
|
||||
# data-* attribute is rewritten in one crawl and left alone in another), so it
|
||||
# cannot be locked as a regression test until that nondeterminism is fixed.
|
||||
site2="$tmp/attrs"
|
||||
mkdir -p "$site2"
|
||||
for f in xl ibg tt; do gif "$site2/$f.gif"; done
|
||||
cat >"$site2/index.html" <<EOF
|
||||
<html><body>
|
||||
<a xlink:href="file://$site2/xl.gif">xlink:href (#298)</a>
|
||||
<div style="background-image:url(file://$site2/ibg.gif)"></div>
|
||||
<span title="file://$site2/tt.gif">excluded attribute</span>
|
||||
</body></html>
|
||||
EOF
|
||||
out2="$tmp/attrs-out"
|
||||
crawl "$site2/index.html" "$out2"
|
||||
saved2=$(savedhtml "$out2")
|
||||
test -n "$saved2" || ! echo "FAIL: saved attrs page not found" || exit 1
|
||||
|
||||
# detected attributes: the absolute URL is rewritten to a local link
|
||||
grep -Eq 'xlink:href="xl\.gif"' "$saved2" ||
|
||||
! echo "FAIL #298: xlink:href not detected/rewritten" || exit 1
|
||||
grep -Eq 'style="background-image:url\(ibg\.gif\)"' "$saved2" ||
|
||||
! echo "FAIL #237: inline background-image url() not detected/rewritten" || exit 1
|
||||
|
||||
# excluded attribute: title is on the no-detect list, so its value is left as-is
|
||||
grep -q 'title="file://' "$saved2" ||
|
||||
! echo "FAIL: a no-detect attribute (title) was wrongly rewritten" || exit 1
|
||||
|
||||
exit 0
|
||||
@@ -9,6 +9,6 @@ TESTS_ENVIRONMENT += HTTPS_SUPPORT=$(HTTPS_SUPPORT)
|
||||
TESTS_ENVIRONMENT += top_srcdir=$(top_srcdir)
|
||||
|
||||
TEST_EXTENSIONS = .test
|
||||
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
|
||||
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
|
||||
|
||||
CLEANFILES = check-network_sh.cache
|
||||
|
||||
@@ -472,7 +472,7 @@ TESTS_ENVIRONMENT = PATH=$(top_builddir)/src$(PATH_SEPARATOR)$$PATH \
|
||||
ONLINE_UNIT_TESTS=$(ONLINE_UNIT_TESTS) \
|
||||
HTTPS_SUPPORT=$(HTTPS_SUPPORT) top_srcdir=$(top_srcdir)
|
||||
TEST_EXTENSIONS = .test
|
||||
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
|
||||
TESTS = 00_runnable.test 01_engine-charset.test 01_engine-entities.test 01_engine-filter.test 01_engine-hashtable.test 01_engine-idna.test 01_engine-mime.test 01_engine-parse.test 01_engine-simplify.test 02_manpage-regen.test 10_crawl-simple.test 11_crawl-cookies.test 11_crawl-idna.test 11_crawl-international.test 11_crawl-longurl.test 11_crawl-parsing.test 12_crawl_https.test
|
||||
CLEANFILES = check-network_sh.cache
|
||||
all: all-am
|
||||
|
||||
|
||||