2008-10-12

Mirroring a Plone site

I needed an offline backup copy of a Plone 2.5 site. Something like what
wget -m http://site/
does for simple static sites. But wget doesn't work well with Plone sites: it mirrors too many search results and alternative URLs, and the folderish behaviour of content breaks the mirror.

Google's best advice was this, but it didn't work for me. I hope Google will find this page next time I search...

I ended up with the following which gave me a usable mirror:
wget -o wget.log --html-extension --restrict-file-names=windows --convert-links --recursive --level=inf --page-requisites --wait=0 --quota=inf --reject="*_form, *@*, sitemap, RSS" --exclude-directories="search, author" http://site/
Notes:
  • web servers can serve any URL as any content type, but in a file system the extension tells what is HTML and what isn't, so --html-extension adds .html where it is missing
  • --restrict-file-names=windows limits file names to the Windows-safe subset, so the mirror works in as many places as possible
  • --convert-links fixes the links that were broken by the changes above, and it also gets around the fact that ordinary file systems can't do what Plone does with both foo.png and foo.png/view
  • recursion without limits puts a high load on Plone, so be careful
  • all forms and special views, sitemaps and RSS feeds are left out - links to them will point to the original URLs
  • searches and author overviews are left out too
  • the mirror ends up in the folder site/
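The same command is easier to read and tweak with one option per line. This is only a sketch of the invocation above, wrapped in a plain sh function: the name mirror_plone and the WGET variable (which lets me swap in echo for a dry run) are my own additions, and http://site/ is the same placeholder as before.

```shell
#!/bin/sh
# Sketch: the wget invocation from this post, one option group per line.
# mirror_plone and the WGET override are made-up helpers, not part of wget.
mirror_plone() {
  url=$1
  set -- -o wget.log                          # log to wget.log instead of the terminal
  set -- "$@" --html-extension                # save HTML pages with an .html suffix
  set -- "$@" --restrict-file-names=windows   # avoid characters Windows can't store
  set -- "$@" --convert-links                 # rewrite links to match the saved files
  set -- "$@" --recursive --level=inf         # follow links without a depth limit
  set -- "$@" --page-requisites               # also fetch images, style sheets and the like
  set -- "$@" --wait=0 --quota=inf            # no delay between requests, no size cap
  set -- "$@" --reject="*_form, *@*, sitemap, RSS"     # skip forms, special views, feeds
  set -- "$@" --exclude-directories="search, author"   # skip searches and author pages
  # Run wget, or whatever command WGET names (for example echo).
  "${WGET:-wget}" "$@" "$url"
}

# Usage (hits the live site, so mind the load on Plone):
# mirror_plone http://site/
```

Setting WGET=echo before calling the function prints the full command line instead of running it, which is handy for checking the options first.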
By the way, the fuzzy logic of zip confuses me each time, so for reference I package the mirror with:
zip -r mirror-site.zip site/

One more thing... Raw content objects (for example unscaled images) can be retrieved with WebDAV. On Linux I found it easiest to use FUSE and wdfs:
wdfs -f http://site:1980/folder /mnt/point
It can be unmounted manually with:
fusermount -u /mnt/point
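Mount, copy and unmount can be strung together in one go. The following is a rough sketch rather than something I have run against every setup: copy_raw_content is a made-up function name, the URL and paths are placeholders, and I am assuming that wdfs goes into the background when -f (the FUSE "stay in foreground" flag) is left out.

```shell
#!/bin/sh
# Sketch: mount a Plone folder over WebDAV, copy everything out, unmount again.
# copy_raw_content is a hypothetical helper; URL and paths are placeholders.
copy_raw_content() {
  url=$1 mnt=$2 dest=$3
  mkdir -p "$mnt" "$dest" || return 1
  # No -f here: in the foreground wdfs would block the rest of the script.
  wdfs "$url" "$mnt" || return 1
  # Copy the contents of the mount point, including dot files.
  cp -R "$mnt"/. "$dest"/
  status=$?
  # Always unmount, even if the copy failed.
  fusermount -u "$mnt"
  return $status
}

# Usage:
# copy_raw_content http://site:1980/folder /mnt/point raw-content
```

The function returns the exit status of the copy, so a failed cp is still visible after the unmount.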

2 comments:

kiilerix said...

Encolpe Degoute suggested using httrack, but I haven't seen any reports about how well it actually works.

Anonymous said...

Thx for sharing. I used your wget... for an old Plone 2.0.5 site I had to archive as static HTML.

I had to manually run wget path_to/*.css for style sheets imported with @import url(...), and wget for various images loaded in the CSS, but then everything worked fine...