wget

You occasionally have to copy a web site. The best way is to copy the source with rsync. But if you’re in the position of somebody like The Internet Archive and don’t have source access, or the site has logic and databases you can’t recreate, you’ll need to suck it down with wget.

wget -nv -nH -E -r -l 1 -k -p http://www.some.site

Explanation of Options

-nv no-verbose. Turn off verbose output without being completely quiet (use -q for that), so error messages and basic information still get printed.

-nH no-host-directories. Disable generation of host-prefixed directories. By default, invoking wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.

-E html-extension (spelled adjust-extension in wget 1.12 and later). When you’re mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server, this option will cause the suffix .html to be appended to the local filename.

-r recursive. Turn on recursive retrieving.

-l depth level. Specify the maximum recursion depth. The default maximum depth is 5.

-k convert-links. After the download is complete, convert the links in the document to make them suitable for local viewing.

-p page-requisites. This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
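The short options above are compact but hard to read back later. The sketch below spells the same command out with long options. It assumes GNU wget 1.12 or later, where -E is written --adjust-extension (older releases use --html-extension). The function name mirror_one_level is made up for illustration, and it only prints the command so it can be inspected first.

```shell
#!/bin/sh
# Print (not run) the long-option form of:
#   wget -nv -nH -E -r -l 1 -k -p URL
# mirror_one_level is a hypothetical helper name, not part of wget.
# Assumes wget 1.12+; on older releases use --html-extension instead.
mirror_one_level() {
    url=$1
    echo wget --no-verbose --no-host-directories --adjust-extension \
        --recursive --level=1 --convert-links --page-requisites "$url"
}
```

Remove the echo to run the download instead of printing it.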

wget -nv -m -np -nH -p http://www.site.org/files

-m (mirror) tells wget(1) to turn on options suitable for mirroring; it is equivalent to -r -N -l inf --no-remove-listing (older releases wrote the last option as -nr).

-np (no-parent) prevents wget from ascending to the parent directory.

-nH disables generation of host-prefixed directories, so you will not get a directory called www.site.org.

-p causes wget(1) to download all the files that are necessary to properly display a given HTML page.

Adapted from http://www.htdig.org/howto-mirror.html
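To make the effect of -nH concrete, here is a rough sketch of where a downloaded file lands with and without it. The helper local_path is hypothetical and deliberately simplified: it ignores ports, query strings, and the index.html that wget writes for directory URLs.

```shell
#!/bin/sh
# Rough illustration of the local path wget -r uses for a URL.
# local_path is a hypothetical helper, not a wget feature.
# Second argument "yes" mimics -nH (no host-prefixed directory).
local_path() {
    url=$1
    strip_host=$2
    rest=${url#*://}          # drop the scheme: www.site.org/files/doc.html
    if [ "$strip_host" = yes ]; then
        rest=${rest#*/}       # drop the host directory: files/doc.html
    fi
    echo "$rest"
}
```

For example, local_path http://www.site.org/files/doc.html no prints www.site.org/files/doc.html, while passing yes prints files/doc.html.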

You can also check for broken links via:

wget --recursive -nd -nv --delete-after http://www.some.site/ > wget.out 2>&1
grep -B 1 404 wget.out
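The grep above works because, in -nv mode, wget logs a failed URL on one line (ending in a colon) and the error on the next. The sketch below relies on the same assumption about the log layout to print only the broken URLs. The function name list_404s is made up for illustration.

```shell
#!/bin/sh
# List the URLs that returned 404, given a log written by `wget -nv`.
# Assumes each failure is logged as the URL followed by ":" on one line,
# then a line containing "ERROR 404" (the layout grep -B 1 relies on).
# list_404s is a hypothetical helper name.
list_404s() {
    awk '/ERROR 404/ { sub(/:$/, "", prev); print prev } { prev = $0 }' "$1"
}
```

Run it as list_404s wget.out once the wget command has finished.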

Authentication

wget --http-user=XXX --http-password=XXX https://some.site/some.file
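Passing the password on the command line leaves it in the shell history and the process list. One alternative is a netrc file, which wget reads from ~/.netrc. The sketch below appends an entry to such a file. The helper add_netrc_entry is hypothetical, and the host, user, and password values are placeholders.

```shell
#!/bin/sh
# Append a login entry to a netrc-style file and make it private.
# add_netrc_entry is a hypothetical helper; all values are placeholders.
add_netrc_entry() {
    file=$1
    host=$2
    user=$3
    pass=$4
    printf 'machine %s login %s password %s\n' "$host" "$user" "$pass" >> "$file"
    chmod 600 "$file"   # readable by the owner only
}
```

For example, add_netrc_entry "$HOME/.netrc" some.site XXX XXX, after which wget https://some.site/some.file should pick up the credentials without any --http-user option.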

Notes

When re-serving a mirrored web site, you may need to edit the mime.types file (in /etc/apache2, for example): find the text/html line and add the extra extension, such as cfm (for ColdFusion), so that Apache serves those files as HTML too.
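The mime.types edit can be scripted. This is a minimal sketch, assuming the file has a single line starting with text/html followed by whitespace and a list of extensions. The helper name add_html_extension is made up, and the path to the file is left to you since it varies between systems.

```shell
#!/bin/sh
# Add an extension to the text/html line of a mime.types file.
# Keeps a backup copy as FILE.bak.
# add_html_extension is a hypothetical helper name.
add_html_extension() {
    file=$1
    ext=$2
    # Match the text/html line and append the extension at its end.
    sed -i.bak "/^text\/html[[:space:]]/ s/\$/ $ext/" "$file"
}
```

For example, add_html_extension /path/to/mime.types cfm, then reload Apache so it rereads the file.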

The renaming option doesn’t work as well as one would like: wget doesn’t handle image maps and Flash links, so for pages that use them you have to stick with the original names.


Last modified April 14, 2026: Old site imports (677647f)