wget: Download Web Page

wget \
     --recursive \
     --level=1 \
     --convert-links \
     --page-requisites \
     --adjust-extension \
     --no-clobber \
     --random-wait \
     --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36" \
     --restrict-file-names=windows \
     --no-parent \
         http://www.buddhanet.net/xmedfile.htm
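One caveat worth noting: per the man-page excerpt below, --random-wait varies the delay between 0.5 and 1.5 times the value given with --wait, so without --wait it has little to scale. A variant that spaces requests roughly 1-3 seconds apart might look like this (--wait=2 is an illustrative value, not part of the original command):

```shell
# Sketch: pair --wait with --random-wait so the randomization has a
# base interval to work from. --wait=2 is an assumed example value.
wget \
     --recursive \
     --level=1 \
     --convert-links \
     --page-requisites \
     --adjust-extension \
     --no-clobber \
     --wait=2 \
     --random-wait \
     --no-parent \
         http://www.buddhanet.net/xmedfile.htm
```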

$ cat wget-crawl-site.sh
#!/bin/bash

# shellcheck disable=SC2034
read -r -d '' DOC <<EOF
wget
       -r
       --recursive
           Turn on recursive retrieving.    The default maximum depth is 5.

       -l depth
       --level=depth
           Specify recursion maximum depth level depth.

       -k
       --convert-links
           After the download is complete, convert the links in the document to make them suitable for local
           viewing.  This affects not only the visible hyperlinks, but any part of the document that links
           to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML
           content, etc.

       -p
       --page-requisites
           This option causes Wget to download all the files that are necessary to properly display a given
           HTML page.  This includes such things as inlined images, sounds, and referenced stylesheets.

       -E
       --adjust-extension
           If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with
           the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the
           local filename.  This is useful, for instance, when you're mirroring a remote site that uses .asp
           pages, but you want the mirrored pages to be viewable on your stock Apache server.

           As of version 1.12, Wget will also ensure that any downloaded files of type text/css end in the
           suffix .css.

           As of version 1.19.2, Wget will also ensure that any downloaded files with a "Content-Encoding"
           of br, compress, deflate or gzip end in the suffix .br, .Z, .zlib and .gz respectively.

       -nc
       --no-clobber
           If a file is downloaded more than once in the same directory, Wget's behavior depends on a few
           options, including -nc.  In certain cases, the local file will be clobbered, or overwritten, upon
           repeated download.  In other cases it will be preserved.

           When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory
           will result in the original copy of file being preserved and the second copy being named file.1.
           If that file is downloaded yet again, the third copy will be named file.2, and so on.  (This is
           also the behavior with -nd, even if -r or -p are in effect.)  When -nc is specified, this
           behavior is suppressed, and Wget will refuse to download newer copies of file.

           When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result
           in the new copy simply overwriting the old.  Adding -nc will prevent this behavior, instead
           causing the original version to be preserved and any newer copies on the server to be ignored.

       --random-wait
           Some web sites may perform log analysis to identify retrieval programs such as Wget by looking
           for statistically significant similarities in the time between requests. This option causes the
           time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using
           the --wait option, in order to mask Wget's presence from such analysis.

       -e command
       --execute command
           Execute command as if it were a part of .wgetrc.  A command thus invoked will be executed after
           the commands in .wgetrc, thus taking precedence over them.  If you need to specify more than one
           wgetrc command, use multiple instances of -e.

       -e robots=off
           Ignore the robots.txt rules for the domain being crawled.

       -U agent-string
       --user-agent=agent-string
           Identify as agent-string to the HTTP server.

           The HTTP protocol allows the clients to identify themselves using a "User-Agent" header field.
           This enables distinguishing the WWW software, usually for statistical purposes or for tracing of
           protocol violations.  Wget normally identifies as Wget/version, version being the current version
           number of Wget.

           However, some sites have been known to impose the policy of tailoring the output according to the
           "User-Agent"-supplied information.  While this is not such a bad idea in theory, it has been
           abused by servers denying information to clients other than (historically) Netscape or, more
           frequently, Microsoft Internet Explorer.  This option allows you to change the "User-Agent" line
           issued by Wget.  Use of this option is discouraged, unless you really know what you are doing.

           Specifying empty user agent with --user-agent="" instructs Wget not to send the "User-Agent"
           header in HTTP requests.

       --restrict-file-names=modes
           Change which characters found in remote URLs must be escaped during generation of local
           filenames.

           The modes are a comma-separated set of text values. The acceptable values are unix, windows,
           nocontrol, ascii, lowercase, and uppercase.

           When "windows" is given, Wget escapes the characters \, |, /, :, ?, ", *, <, >, and the control
           characters in the ranges 0--31 and 128--159.  In addition to this, Wget in Windows mode uses +
           instead of : to separate host and port in local file names, and uses @ instead of ? to separate
           the query portion of the file name from the rest.  Therefore, a URL that would be saved as
           www.xemacs.org:4300/search.pl?input=blah in Unix mode would be saved as
           www.xemacs.org+4300/search.pl@input=blah in Windows mode.  This mode is the default on Windows.

       --no-parent
           Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option,
           since it guarantees that only the files below a certain hierarchy will be downloaded.

       -D domain-list
       --domains=domain-list
           Set domains to be followed.  domain-list is a comma-separated list of domains.  Note that it does
           not turn on -H.
EOF

#echo -e "$DOC\n"

set -x
wget \
     --recursive \
     --level=1 \
     --convert-links \
     --page-requisites \
     --adjust-extension \
     --no-clobber \
     --random-wait \
     --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36" \
     --restrict-file-names=windows \
     --no-parent \
     --domains buddhanet.net \
         http://www.buddhanet.net/xmedfile.htm
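The --restrict-file-names=windows mapping quoted in the heredoc (':' between host and port becomes '+', and the '?' before the query string becomes '@') can be illustrated with plain bash string substitution. This is a toy reproduction of the example filename from the man text, not Wget's actual code:

```shell
#!/bin/bash
# Toy illustration of the windows-mode filename mapping described
# in the man-page excerpt above.
url='www.xemacs.org:4300/search.pl?input=blah'
host=${url%%/*}          # www.xemacs.org:4300
rest=${url#*/}           # search.pl?input=blah
# ':' -> '+' in the host part, '?' -> '@' before the query string.
local_name="${host/:/+}/${rest/\?/@}"
echo "$local_name"       # prints www.xemacs.org+4300/search.pl@input=blah
```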
