For simple static sites, wget -m http://site/ does the job. But wget doesn't work well with Plone sites: too many search results and alternative URLs get mirrored, and the folderish behaviour of content breaks the mirror.
Google's best advice was this, but it didn't work for me. I hope Google will find this page the next time I search...
I ended up with the following, which gave me a usable mirror:
wget -o wget.log --html-extension --restrict-file-names=windows --convert-links --recursive --level=inf --page-requisites --wait=0 --quota=inf --reject="*_form, *@*, sitemap, RSS" --exclude-directories="search, author" http://site/
Notes:
- web servers can serve any URL as any type, but in a file system the extension has to tell what is HTML and what isn't; html-extension adds .html where it is missing
- restrict-file-names with the windows subset makes the mirror work in as many places as possible
- convert-links fixes the links that were broken by the changes above, and it also works around the fact that ordinary file systems can't do what Plone does with both foo.png and foo.png/view
- recursion without limits puts a high load on Plone, so be careful
- all forms and special views, sitemaps and RSS feeds are left out - links to them will point to the original URL
- searches and author overviews are left out too
- the mirror ends up in the folder site/
zip -r mirror-site.zip site/
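The two steps above can be wrapped in a small script. This is only a sketch: SITE_HOST, ZIP_NAME and the DRY_RUN switch are names made up for illustration, and the wget options are simply the ones from this post. By default it only prints the commands, so nothing is fetched until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: mirror a Plone site with the wget options from this post, then zip it.
# SITE_HOST, ZIP_NAME and DRY_RUN are illustrative names, not part of wget.
SITE_HOST="${SITE_HOST:-site}"
ZIP_NAME="${ZIP_NAME:-mirror-$SITE_HOST.zip}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

mirror_site() {
    run wget -o wget.log \
        --html-extension \
        --restrict-file-names=windows \
        --convert-links \
        --recursive --level=inf \
        --page-requisites \
        --wait=0 --quota=inf \
        --reject="*_form, *@*, sitemap, RSS" \
        --exclude-directories="search, author" \
        "http://$SITE_HOST/"
}

zip_mirror() {
    # wget stores the mirror in a directory named after the host.
    run zip -r "$ZIP_NAME" "$SITE_HOST/"
}

# wget exits non-zero as soon as a single request fails, which is common on a
# big recursive mirror, so a failure is reported but not treated as fatal.
mirror_site || echo "wget reported problems, see wget.log" >&2
zip_mirror
```

Run it with DRY_RUN=0 SITE_HOST=www.example.org to do the real thing (the host name here is a placeholder). Remember that unlimited recursion puts a high load on Plone.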
One more thing... Raw content objects (for example unscaled images) can be retrieved with WebDAV. On Linux I found it easiest to use FUSE and wdfs:
wdfs -f http://site:1980/folder /mnt/point
It can be unmounted manually with:
fusermount -u /mnt/point
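If the goal is just to grab a copy of the raw objects, mounting, copying and unmounting could be scripted roughly like this. It is a sketch under assumptions: DAV_URL, MOUNT_POINT, DEST and DRY_RUN are made-up names, and wdfs is started without -f (the FUSE foreground flag) so that the script can carry on after mounting. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch: mount the Plone WebDAV folder, copy the raw objects, then unmount.
# DAV_URL, MOUNT_POINT, DEST and DRY_RUN are illustrative placeholders.
DAV_URL="${DAV_URL:-http://site:1980/folder}"
MOUNT_POINT="${MOUNT_POINT:-/mnt/point}"
DEST="${DEST:-raw-content}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

copy_raw_content() {
    # Without -f the FUSE process goes to the background once mounted.
    run wdfs "$DAV_URL" "$MOUNT_POINT" || return 1
    run mkdir -p "$DEST"
    # "dir/." copies the contents of the mount, not the mount directory itself.
    run cp -R "$MOUNT_POINT/." "$DEST/"
    status=$?
    # Unmount even if the copy failed, then report the copy's status.
    run fusermount -u "$MOUNT_POINT"
    return "$status"
}

copy_raw_content
```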
2 comments:
Encolpe Degoute suggested using httrack, but I haven't seen any reports about how well it actually works.
Thx for sharing. I used your wget... for an old Plone 2.0.5 site I had to archive as static HTML.
I had to manually run wget path_to/*.css for style sheets imported with @import url(...), and wget for various images loaded in the CSS, but then everything worked fine...
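The manual step from the comment above could be helped along by listing every @import url(...) target found in the mirror. This is a rough sketch, not something tested against a real Plone 2.0.5 site: list_css_imports and MIRROR_DIR are made-up names, and it assumes grep and sed with -o and -E support, as found on a usual Linux system.

```shell
#!/bin/sh
# Sketch: list the targets of @import url(...) rules found under a mirror directory.
# list_css_imports and MIRROR_DIR are illustrative names.
list_css_imports() {
    # $1: directory to scan recursively (HTML and CSS files alike)
    grep -rhoE '@import[[:space:]]+url\([^)]*\)' "$1" 2>/dev/null |
        sed -E \
            -e 's/^@import[[:space:]]+url\(//' \
            -e 's/\)$//' \
            -e "s/^[\"']//" \
            -e "s/[\"']\$//" |
        sort -u
}

list_css_imports "${MIRROR_DIR:-site}"
```

The output is one style sheet reference per line. Absolute URLs can be fed straight to wget with wget -i -, while relative ones first have to be resolved against the page they came from. Images referenced inside the CSS would need a similar pass.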