2008-10-12

Mirroring a Plone site

I needed an offline backup copy of a Plone 2.5 site. Something like what
wget -m http://site/
does for simple static sites. But wget doesn't work well with Plone sites: too many search results and alternative URLs get mirrored, and the folderish behaviour of content breaks the mirror.

Google's best advice was this, but it didn't work for me. I hope Google will find this page next time I search...

I ended up with the following which gave me a usable mirror:
wget -o wget.log --html-extension --restrict-file-names=windows --convert-links --recursive --level=inf --page-requisites --wait=0 --quota=inf --reject="*_form, *@*, sitemap, RSS" --exclude-directories="search, author" http://site/
Notes:
  • web servers can serve any content type at any URL, but in the file system the extension tells what is HTML and what isn't, so html-extension adds .html where needed
  • restrict-file-names with the windows subset makes the mirror work in as many places as possible
  • convert-links fixes the links that were broken by the changes above, and it also works around the fact that ordinary file systems can't hold both foo.png and foo.png/view the way Plone can
  • recursion without limits puts a high load on Plone, so be careful
  • all forms and special views, sitemaps and RSS feeds are left out - they will point to the original URL
  • searches and author overviews are left out too
  • the mirror ends up in the folder site/
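To check which URLs the reject list would leave out, here is a rough Python sketch of the matching rule as the wget manual describes it: list elements containing wildcard characters are treated as patterns, the rest as file name suffixes. It is an approximation for experimenting, not wget's actual code; the function name is_rejected and the example URLs are made up for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# The reject list from the wget command above.
REJECT = ["*_form", "*@*", "sitemap", "RSS"]


def is_rejected(url, reject=REJECT):
    """Approximate wget's --reject test on the last path component of url."""
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    for element in reject:
        if any(ch in element for ch in "*?[]"):
            # Elements with wildcard characters are treated as patterns.
            if fnmatchcase(name, element):
                return True
        elif name.endswith(element):
            # All other elements are treated as file name suffixes.
            return True
    return False


if __name__ == "__main__":
    # Hypothetical URLs, only to show the rule in action.
    for url in ["http://site/folder/login_form", "http://site/folder/foo.png"]:
        print(url, "rejected" if is_rejected(url) else "kept")
```

Note that the suffix rule means an element like sitemap also rejects any file name that merely ends in "sitemap".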
By the way, the fuzzy logic of zip confuses me every time, so for reference, I package the mirror with:
zip -r mirror-site.zip site/
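If remembering the zip argument order is the problem, the same packaging can probably be done from Python with shutil.make_archive. This is a minimal sketch; the function name package_mirror and its default arguments are my own choices, assuming the mirror sits in a folder called site/ below the current directory.

```python
import shutil


def package_mirror(mirror_dir="site", archive_name="mirror-site", parent="."):
    """Zip mirror_dir (a folder inside parent) and return the archive path.

    The result is archive_name + ".zip", and paths inside the archive
    start with mirror_dir, as they do with "zip -r".
    """
    return shutil.make_archive(archive_name, "zip",
                               root_dir=parent, base_dir=mirror_dir)


if __name__ == "__main__":
    print(package_mirror())
```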

One more thing... Raw content objects (for example unscaled images) can be retrieved with WebDAV. On Linux I found it easiest to use FUSE and wdfs:
wdfs -f http://site:1980/folder /mnt/point
It can be unmounted manually with:
fusermount -u /mnt/point

2008-08-31

Scripting the inbox

For some reason I wanted to do some mass-managing of my inbox. If it had been in the file system I would have created a shell script - or used Python. But my mail reader focuses on being user friendly for normal users, not on scripting for power users. A wise decision; they can't cover all use cases and preferences for language of choice etc.

With Python as my favorite tool imaplib seemed like the obvious choice. imaplib is however not that intuitive to use - unless you know http://tools.ietf.org/html/rfc3501 by heart...

Making it work was a bit too hard, so here comes the script I came up with - it can easily be customized for other tasks:

import imaplib
M = imaplib.IMAP4()
print 'logging in:', M.login('username', 'password')
print 'selecting folder:', M.select('HomeInbox')
_type, data = M.search(None, 'ALL')
# or to search for string in field: (None, 'Subject', 'Re:')
message_ids = set()
for num in data[0].split():
    # fetch seems quite obscure - fetching one field at a time to make it "simpler"
    date = M.fetch(num, '(BODY[HEADER.FIELDS (DATE)])')[1][0][1].strip()
    subject = M.fetch(num, '(BODY[HEADER.FIELDS (SUBJECT)])')[1][0][1].strip()
    message_id = M.fetch(num, '(BODY[HEADER.FIELDS (Message-ID)])')[1][0][1].strip()
    if message_id in message_ids:
        print 'deleting', (num, date, subject, message_id)
        #M.store(num, '+FLAGS', '\\Deleted') # uncomment to delete
    else:
        message_ids.add(message_id)
M.expunge()
M.close()
M.logout()


As it stands, the script finds all duplicates based on Message-ID and only reports them; uncomment the store line to actually delete them.
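The three fetch calls per message could probably be combined into one. Here is a sketch of that idea, written for Python 3 (unlike the Python 2 script above): one FETCH item asks for all three header fields, and the email package parses the reply. The names FETCH_ITEM, parse_header_block and find_duplicates are my own, and I have only thought this through against sample header bytes, not against a live server.

```python
from email.parser import BytesParser

# One FETCH item asking for all three header fields at once.
# BODY.PEEK avoids setting the \Seen flag as a side effect.
FETCH_ITEM = "(BODY.PEEK[HEADER.FIELDS (DATE SUBJECT MESSAGE-ID)])"


def parse_header_block(raw):
    """Parse the raw header bytes returned by fetch into a dict of strings."""
    msg = BytesParser().parsebytes(raw, headersonly=True)
    return {name: str(msg[name] or "").strip()
            for name in ("Date", "Subject", "Message-ID")}


def find_duplicates(messages):
    """Given (num, raw_header_bytes) pairs, return the nums whose
    Message-ID was already seen on an earlier message."""
    seen = set()
    duplicates = []
    for num, raw in messages:
        message_id = parse_header_block(raw)["Message-ID"]
        if not message_id:
            continue  # never treat messages without a Message-ID as duplicates
        if message_id in seen:
            duplicates.append(num)
        else:
            seen.add(message_id)
    return duplicates

# With a logged-in imaplib.IMAP4 connection M, the pairs could come from:
#   nums = M.search(None, 'ALL')[1][0].split()
#   messages = [(num, M.fetch(num, FETCH_ITEM)[1][0][1]) for num in nums]
```

Skipping messages without a Message-ID is a deliberate choice in this sketch; otherwise they would all look like duplicates of each other.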

2008-03-29

Quick'n'dirty temporary configuration of internet gateway

Scenario: one Fedora 8 host has wireless internet access, and I want to give another Fedora 8 host internet access through a cable connected to the first.

First connect the cables - and be aware that NetworkManager might start playing tricks when it detects a link on the wired interfaces.

On the host with internet access find a free IP range and configure the network interface:
ifconfig eth0 10.0.0.1 up
Enable IP forwarding:
sysctl net.ipv4.ip_forward=1
And masquerade the forwarded traffic:
iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -j MASQUERADE
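Whether a range is "free" can be checked with Python's ipaddress module: the candidate must not overlap any network the gateway already uses. A small sketch follows; the function name is_free_range is made up, and 192.168.2.0/24 is only my guess at the wireless network, based on the nameserver address used further down.

```python
import ipaddress


def is_free_range(candidate, networks_in_use):
    """Return True if candidate overlaps none of the networks already in use."""
    net = ipaddress.ip_network(candidate)
    return not any(net.overlaps(ipaddress.ip_network(used))
                   for used in networks_in_use)


if __name__ == "__main__":
    # Assumption: the wireless side is 192.168.2.0/24.
    print(is_free_range("10.0.0.0/8", ["192.168.2.0/24"]))
```

The networks actually in use can be read from the output of ifconfig or route -n on the gateway host.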

On the other host configure the network interface:
ifconfig eth0 10.0.0.2 up
Set default gateway:
route add default gw 10.0.0.1
Configure DNS with the settings from the host with internet access:
echo "nameserver 192.168.2.1" > /etc/resolv.conf

THAT was dirty!

I always forget the masquerading part and have difficulties finding it - now I know where to find it ;-)