February 23rd, 2013
I’ve volunteered to do a few minor CSS tweaks on a third-party Web site, implemented in Java and hosted on Heroku. Given the tiny scope of my task, figuring out how to build the entire thing and run it on a staging instance or locally would have been overkill. So I sought a way to create a static local mirror of the site. That turned out to be less straightforward than running
wget --mirror Home-Page-URL
First, the Web site in question has its stylesheets and other static files served from a CDN (content distribution network). It also relies on third-party services for Web fonts, video streaming, and chat.
Second, it has some really big downloads. Fortunately, they are served from a subdomain.
Third, it has a blog section, also in a subdomain, which uses a completely different stylesheet that I did not need to touch.
To cut a long story short, here is the wget command line that worked for me:
wget --mirror \
     --page-requisites \
     --convert-links \
     --span-hosts \
     --domains domain-list \
     --reject pattern-list \
     Home-Page-URL
and here is the explanation:
- --mirror: Enable infinite recursion and time-stamping.
- --page-requisites: Also download the files required to view each page: images, stylesheets, and so on.
- --convert-links: Edit links in the downloaded documents to enable offline viewing, including links to page requisites. As a result, links to files that were also downloaded point to the local copies, and all other links are replaced with complete URLs.
- --span-hosts: Permit recursion and retrieval of page requisites to span across hosts. (Use with caution, or you’ll download the entire Internet.)
- --domains domain-list: Restrict the list of domains to download files from; domain-list is comma-separated. In my case, those were the “www” subdomain of the Web site being mirrored and the domain of the CDN serving its static files.
- --reject pattern-list: Do not mirror certain files; pattern-list is a comma-separated list of file name suffixes or patterns.
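For concreteness, here is a filled-in sketch of the command above. All the specifics are made up for illustration: www.example.com stands in for the site’s “www” subdomain, cdn.example.net for its CDN, and the reject patterns for the big downloads I wanted to skip. The command is built as a bash array and printed rather than run, so it can be inspected (and the domains swapped in) first.

```shell
#!/usr/bin/env bash
# Hypothetical mirroring command; domains and patterns are placeholders.
cmd=(wget --mirror
     --page-requisites
     --convert-links
     --span-hosts
     --domains www.example.com,cdn.example.net
     --reject '*.zip,*.dmg,*.tar.gz'
     https://www.example.com/)

# Print the assembled command line instead of executing it.
printf '%s\n' "${cmd[*]}"
```

Dropping the big-download subdomain from --domains, as I did, works too: --span-hosts only follows hosts that pass the --domains filter, so anything served from an unlisted subdomain is skipped automatically.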