kiilerix: Mirroring a Plone site

I needed an offline backup copy of a Plone 2.5 site. Something like what

wget -m http://site/

does for simple static sites. But wget doesn't work well with Plone sites: too many search results and alternative URLs get mirrored, and the folderish behaviour of content breaks the mirror.

Google's best advice was this, but it didn't work for me. I hope Google will find this page next time I search...

I ended up with the following, which gave me a usable mirror:

wget -o wget.log --html-extension --restrict-file-names=windows --convert-links --recursive --level=inf --page-requisites --wait=0 --quota=inf --reject="*_form, *@*, sitemap, RSS" --exclude-directories="search, author" http://site/

Notes:

web servers can let any URL be of any type, but in the file system it is the extension added by html-extension that tells what is HTML and what isn't
restrict-file-names with the windows subset makes the mirror work in as many places as possible
convert-links fixes links that were broken by the changes above, and it also works around the fact that ordinary file systems can't do what Plone does with both foo.png and foo.png/view
recursion without limits puts a high load on Plone, so be careful
all forms and special views, sitemaps and RSS feeds are left out - they will point to the original URL
searches and author overviews are left out too
the mirror ends up in the folder site/
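
For easier reading, the same invocation can be sketched as a small shell function with one option group per line. This is only a sketch: the function name mirror_plone and the WGET override variable are made up for illustration and are not part of wget or Plone. The options themselves are exactly the ones from the command above.

```shell
#!/bin/sh
# Sketch only: mirror_plone and WGET are illustrative names (assumptions),
# the wget options are the ones used in the post.
mirror_plone() {
    site_url=$1

    set -- -o wget.log                         # write wget's messages to wget.log
    set -- "$@" --html-extension               # give saved HTML pages an .html suffix
    set -- "$@" --restrict-file-names=windows  # escape characters Windows cannot store
    set -- "$@" --convert-links                # rewrite links to point into the mirror
    set -- "$@" --recursive --level=inf        # follow links with no depth limit
    set -- "$@" --page-requisites              # also fetch images, CSS and scripts
    set -- "$@" --wait=0 --quota=inf           # no delay, no download quota
    set -- "$@" "--reject=*_form, *@*, sitemap, RSS"    # skip forms, special views, feeds
    set -- "$@" "--exclude-directories=search, author"  # skip searches and author pages

    # WGET may be set to another command (for example echo) for a dry run.
    "${WGET:-wget}" "$@" "$site_url"
}
```

Calling mirror_plone http://site/ runs the real download, while WGET=echo mirror_plone http://site/ only prints the arguments, which is a cheap way to check them before loading the Plone server.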

By the way, the fuzzy logic of zip confuses me every time, so for reference, I package the mirror with:

zip -r mirror-site.zip site/
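
The part that is easy to get backwards is the argument order: zip takes the archive name first and the things to pack after it. A minimal sketch that packs a mirror and then checks the archive could look like this. The function name package_mirror is made up for illustration, and the sketch assumes the Info-ZIP zip and unzip tools are installed.

```shell
#!/bin/sh
# Sketch only: package_mirror is an illustrative name (assumption).
# Usage: package_mirror <mirror-directory> <archive.zip>
package_mirror() {
    mirror_dir=$1
    archive=$2

    # Archive name first, then the directory to add recursively (-r), quietly (-q).
    zip -r -q "$archive" "$mirror_dir" || return 1

    # unzip -t tests the archive's integrity without extracting it.
    unzip -t -q "$archive"
}
```

For the mirror above this would be called as package_mirror site/ mirror-site.zip.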

One more thing... Raw content objects (for example unscaled images) can be retrieved with WebDAV. On Linux I found it easiest to use FUSE and wdfs:

wdfs -f http://site:1980/folder /mnt/point

It can be unmounted manually with:

fusermount -u /mnt/point
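
Put together, mounting, copying and unmounting could be sketched as one function. This is a sketch under assumptions: the name fetch_raw and its arguments are made up for illustration, and it leaves out -f on the assumption that wdfs, like other FUSE file systems, then goes to the background and returns once the mount is up.

```shell
#!/bin/sh
# Sketch only: fetch_raw is an illustrative name (assumption).
# Usage: fetch_raw <webdav-url> <mount-point> <destination-directory>
fetch_raw() {
    dav_url=$1
    mount_point=$2
    dest=$3

    mkdir -p "$mount_point" "$dest" || return 1

    # Without -f, wdfs is assumed to return as soon as the mount is ready.
    wdfs "$dav_url" "$mount_point" || return 1

    # Copy everything below the mount point, hidden files included.
    cp -R "$mount_point"/. "$dest"/
    status=$?

    # Always unmount, even if the copy failed.
    fusermount -u "$mount_point"
    return $status
}
```

With the values from above this would be fetch_raw http://site:1980/folder /mnt/point raw-content, where raw-content is just an example destination directory.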

2 comments:

kiilerix said...: Encolpe Degoute suggested using httrack, but I haven't seen any reports about how well it actually works.; October 15, 2008 at 12:56 AM
Anonymous said...: Thx for sharing I used your wget... for an old Plone 2.0.5 site I had to archive as static html.

I had to manually do wget path_to/*.css for style sheets imported with @import url(...), and wget for various images loaded in the CSS, but then everything worked fine...; July 31, 2009 at 5:04 PM

kiilerix

2008-10-12
