wget


I sometimes use wget to test an HTTP request when all I want is to read the response. With this syntax wget prints the result directly to stdout and does not store a file:

wget -qO- "http://localhost:8080/sabnzbd/api?mode=qstatus&output=json"
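If you do this often, the call can be wrapped in a small shell function. The following is only a sketch: the name fetch_body is made up, and the --tries and --timeout values are arbitrary choices so that a dead server fails quickly instead of hanging.

```shell
# fetch_body URL: print the response body on stdout and nothing else.
# (fetch_body is a hypothetical helper name, not part of wget.)
fetch_body() {
    # -q silences progress output; -O- writes the document to stdout.
    # --tries=1 and --timeout=10 are assumed values, adjust to taste.
    wget -q -O- --tries=1 --timeout=10 "$1"
}

# Usage, with the URL from the example above:
#   fetch_body "http://localhost:8080/sabnzbd/api?mode=qstatus&output=json" \
#       || echo "request failed" >&2
```

Because wget exits non-zero on network or HTTP errors, the function's exit status can be used directly in scripts.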


 
Although rsync is the preferred method for mirroring web content, wget is still the only route when you want to take a preprocessed, interpreted, database-driven site and render it statically.
Note: Edit the mime.types file in /etc/apache2 by finding the text/html line and adding cfm (for ColdFusion), since you must tell Apache that those files are HTML too.
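As a rough sketch of that edit, the snippet below appends cfm to the text/html line with sed. It works on a scratch copy whose contents are made up for illustration; on a real system you would point it at /etc/apache2/mime.types after backing it up, and the layout of that file may differ from the sample.

```shell
# Scratch file standing in for /etc/apache2/mime.types (sample contents are assumed).
demo=$(mktemp)
printf 'text/css\tcss\ntext/html\thtml htm shtml\ntext/plain\ttxt\n' > "$demo"

# Append "cfm" to the text/html line unless it is already listed there.
if ! grep -Eq '^text/html[[:space:]](.*[[:space:]])?cfm([[:space:]]|$)' "$demo"; then
    # -i.bak edits in place and keeps the original as "$demo.bak".
    sed -i.bak '/^text\/html[[:space:]]/ s/$/ cfm/' "$demo"
fi

# Show the resulting line.
grep '^text/html' "$demo"
```

Apache has to be reloaded before it picks up the changed mapping.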

The rename doesn't work as well as you'd like, because wget doesn't handle image maps and Flash links, so you just have to stick with the original names.


Example

wget -nv -nH -E -r -l 1 -k -p http://www.ohio.edu

Explanation of Options

-nv no-verbose. Turn off verbose output without being completely quiet (use -q for that), which means that error messages and basic information still get printed.

-nH no-host-directories. Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.

-E html-extension (named adjust-extension in newer Wget releases). When you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server, this option will cause the suffix .html to be appended to the local filename.

-r recursive. Turn on recursive retrieving.

-l depth (level). Specify the maximum recursion depth. The default maximum depth is 5.

-k convert-links. After the download is complete, convert the links in the document to make them suitable for local viewing.

-p page-requisites. This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
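For scripts, the same command can be written with long option names, which are easier to read back later. This is a sketch only: mirror_one_level is a made-up function name, and --adjust-extension is the current long name for -E (older Wget releases call it --html-extension instead).

```shell
# mirror_one_level URL: the one-level mirror from the example above,
# spelled out with long options. (Hypothetical helper name.)
mirror_one_level() {
    wget --no-verbose \
         --no-host-directories \
         --adjust-extension \
         --recursive --level=1 \
         --convert-links \
         --page-requisites \
         "$1"
}

# Usage:
#   mirror_one_level http://www.ohio.edu
```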


wget -nv -m -np -nH -p http://www.site.org/files

The -nv option turns off verbose output without making wget completely quiet.

The -m option tells wget(1) to turn on options suitable for mirroring (as in -r -N -l inf --no-remove-listing; older releases spelled the last option -nr).

The -np (no-parent) option will prevent ascending to the parent directory.

The option -nH disables generation of host-prefixed directories, so you will not get a directory called www.site.org.

And last, -p causes wget(1) to download all the files that are necessary to properly display a given HTML page.

Adapted from http://www.htdig.org/howto-mirror.html


You can also check for broken links via:

wget --recursive -nd -nv --delete-after http://www.finance.ohiou.edu/ > wget.out 2>&1
grep -B 1 404 wget.out
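A plain grep for 404 also matches byte counts or URLs that happen to contain 404. The sketch below narrows the match and prints only the failing URLs. It assumes the -nv log prints each failing URL on its own line ending in a colon, followed by a line containing "ERROR 404"; check your own wget.out first, since the layout may vary between versions. The function name list_404s is made up.

```shell
# list_404s LOGFILE: print each URL that wget reported as "ERROR 404".
# Assumes the URL is on the line directly before the error line.
list_404s() {
    awk '
        /ERROR 404/ { sub(/:$/, "", prev); print prev }  # strip the trailing colon
        { prev = $0 }                                    # remember the previous line
    ' "$1" | sort -u
}

# Usage:
#   list_404s wget.out
```

The sort -u at the end collapses a URL that is linked from several pages into a single entry.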