Dmitry Leskov
 

Retrieving a Web Site for Offline View Using Wget

I volunteered to make a few minor CSS tweaks on a third-party Web site, implemented in Java and hosted on Heroku. Given the tiny scope of my task, figuring out how to build the entire thing and run it on a staging instance or locally would have been overkill, so I looked for a way to create a static local mirror of the site. That turned out to be less straightforward than running wget --mirror Home-Page-URL.

First, the Web site in question has its stylesheets and other static files served from a CDN (content delivery network). It also relies on third-party services for Web fonts, video streaming, and chat.

Second, it has some really big downloads. Fortunately, they are served from a separate subdomain, so they can simply be left out of the mirror.

Third, it has a blog section, also in a subdomain, which uses a completely different stylesheet that I did not need to touch.

To cut a long story short, here is the wget command line that worked for me:

wget --mirror \
  --page-requisites \
  --convert-links \
  --span-hosts \
  --domains domain-list \
  --reject pattern-list \
  Home-Page-URL

and here is the explanation:

--mirror
Turn on recursion with unlimited depth and time-stamping (currently equivalent to -r -N -l inf --no-remove-listing).
--page-requisites
Also download files required to view the web page: images, stylesheets and so on.
--convert-links
Edit links in the downloaded documents so as to enable offline viewing. This includes links to page requisites. As a result, links to files that were also downloaded point to the local copies, and all other links are replaced with complete URLs.
--span-hosts
Permit recursion and retrieval of page requisites to span across hosts. (Use with caution, or you may end up downloading the entire Internet.)
--domains domain-list
Restrict the list of domains to download files from. In my case, those were the “www” subdomain of the Web site being mirrored and the domain of the CDN serving its static files.

Example: --domains www.example.com,somecdn.net

--reject pattern-list
Do not mirror certain files. pattern-list is a comma-separated list of file name suffixes or patterns.

Example: --reject mp3,ogg
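
Putting the pieces together, the full command might look like the sketch below. The domain and suffix lists are the examples given above; the home page URL https://www.example.com/ and the run helper with its DRY_RUN variable are placeholders introduced here for illustration, not part of wget or of the original setup. The helper only prints the command by default, so the option list can be reviewed before any download starts.

```shell
#!/bin/sh
# run: hypothetical helper for this sketch.
# Prints the command unless DRY_RUN=0, in which case it executes it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Placeholder URL; substitute the home page of the site being mirrored.
run wget --mirror \
  --page-requisites \
  --convert-links \
  --span-hosts \
  --domains www.example.com,somecdn.net \
  --reject mp3,ogg \
  https://www.example.com/
```

Run the script with DRY_RUN=0 to start the actual download. If the reject list contains wildcard patterns rather than plain suffixes, quote it so that the shell does not expand it first.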


Talkback

  1. Jerzy
    24-Jul-2013
    3:28 pm
    1

    Thank you, this helped me to figure out that I need to add --span-hosts and --domains options to my (similar) wget command.
