Retrieving a Web Site for Offline View Using Wget
I volunteered to do a few minor CSS tweaks on a third-party Web site, implemented in Java and hosted on Heroku. Given the tiny scope of my task, figuring out how to build the entire thing and run it on a staging instance or locally would have been overkill. So I looked for a way to create a static local mirror of the site. That turned out to be less straightforward than running wget --mirror Home-Page-URL.
First, the Web site in question has its stylesheets and other static files served from a CDN (content distribution network). It also relies on third-party services for Web fonts, video streaming, and chat.
Second, it has some really big downloads. Fortunately, they are served from a subdomain.
Third, it has a blog section, also in a subdomain, which uses a completely different stylesheet that I did not need to touch.
To cut a long story short, here is the wget command line that worked for me:
wget --mirror \
--page-requisites \
--convert-links \
--span-hosts \
--domains domain-list \
--reject pattern-list \
Home-Page-URL
and here is the explanation:
--mirror
- Enable recursion with unlimited depth and time-stamping. This is shorthand for -r -N -l inf --no-remove-listing.
--page-requisites
- Also download files required to view the web page: images, stylesheets and so on.
--convert-links
- Edit links in the downloaded documents so as to enable offline viewing. This includes links to page requisites. As a result, links to files that were also downloaded point to the local copies, while all other links are replaced with complete URLs.
--span-hosts
- Permit recursion and retrieval of page requisites to span across hosts. (Use with caution, or you may end up downloading the entire Internet.)
--domains domain-list
- Restrict the list of domains to download files from. In my case, those were the “www” subdomain of the Web site being mirrored and the domain of the CDN serving its static files.
Example: --domains www.example.com,somecdn.net
--reject pattern-list
- Do not mirror certain files. pattern-list is a comma-separated list of file name suffixes or patterns.
Example: --reject mp3,ogg
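To see how the pieces fit together, here is a minimal sketch that fills in the command line with the example values from above. The URL, the domain list, and the reject list are placeholders, not the real site; the mirror_cmd function is a hypothetical helper that only prints the resulting command (a dry run) and downloads nothing.

```shell
#!/bin/sh
# Placeholder values (assumptions): substitute your own site, CDN, and suffixes.
SITE_URL="https://www.example.com/"
DOMAINS="www.example.com,somecdn.net"
REJECT="mp3,ogg"

# Hypothetical helper: print the full wget command instead of running it.
mirror_cmd() {
    # printf repeats the format for every argument, separating them with spaces.
    printf '%s ' wget --mirror --page-requisites --convert-links \
        --span-hosts --domains "$DOMAINS" --reject "$REJECT"
    printf '%s\n' "$SITE_URL"
}

mirror_cmd
```

Once the printed command looks right, it can be pasted into a terminal to start the actual download. Keeping the big-download and blog subdomains out of DOMAINS is what stops --span-hosts from following links into them.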
24-Jul-2013
3:28 pm
Thank you, this helped me to figure out that I need to add --span-hosts and --domains options to my (similar) wget command.