Using Wget (to grab an entire site)

Updated Aug 16, 2016

Here's a simple command that makes a mirrored clone of a site (-m = --mirror, -k = --convert-links):

wget -mk --wait=9 --limit-rate=200K http://www.example.com/
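
The same command spelled out with long options, which behaves identically but is easier to read:

wget --mirror --convert-links --wait=9 --limit-rate=200K http://www.example.com/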

Here's a more complex wget command; each option is explained below, and a short-option equivalent follows the list. For full details, see the manual: http://www.gnu.org/software/wget/manual/wget.html

wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains example.com \
     --no-parent \
     --wait=9 \
     --limit-rate=200K \
     --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" \
     --reject=mov,pdf \
     --directory-prefix=./LOCAL-DIR \
     www.example.com/tutorials/html/

To resume a download that only partially finished, use wget -c (--continue).
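
For example, to resume a partially downloaded file (the URL below is just a placeholder):

wget -c http://www.example.com/files/large-archive.zip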

--recursive: download the entire Web site.

--no-clobber: don't overwrite any existing files (useful when an interrupted download is resumed).

--page-requisites: get all the elements that compose the page (images, CSS and so on).

--html-extension: save files with the .html extension (newer versions of wget call this option --adjust-extension).

--convert-links: convert links so that they work locally, off-line.

--restrict-file-names=windows: modify filenames so that they will work in Windows as well.

--domains example.com: don't follow links outside example.com.

--no-parent: don't ascend to the parent directory when retrieving recursively.

The next two options throttle the download so that you don't get blacklisted by the site:

--wait: wait the specified number of seconds between retrievals

--limit-rate: cap the download speed so you don't hog the server's bandwidth

--user-agent: identify as a regular browser by sending a browser-style User-Agent string instead of wget's default

--reject: skip the listed file extensions (here, mov and pdf)
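
For reference, here is the complex command from above rewritten with wget's short options where they exist; it behaves identically, and options without a short form stay in long form:

wget -r -nc -p -E -k -np \
     --restrict-file-names=windows \
     -D example.com \
     -w 9 \
     --limit-rate=200K \
     -U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" \
     -R mov,pdf \
     -P ./LOCAL-DIR \
     www.example.com/tutorials/html/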
