Combee on Palm OS

Wednesday, 27 April 2005  11:32 AM

My SXSW schedule application uses an unusual data structure to hold all of the events.  The first year I did the program in 2002, I used a standard Palm OS DB approach, and while that worked OK, I had some major problems.  First, I had to bundle the DBs with the program, so it could be beamed to another device easily but since I expanded those DBs into separate databases when the application started, my application would take up almost twice as much memory as needed.  Second, using a Palm OS DB had a lot of overhead that wasn't really useful for a read-only database, which meant a larger data footprint.  For a festival with over 1300 events, it was important to keep the schedule program small so it would fit on people's devices and be easy to download.

When I was redesigning the application for 2003, I decided to change the data representation entirely.  Rather than putting all of this data into individual PDB records, I'd produce several large chunks of data and include them as resources in the application.  I ended up with a set of resources that looked like this

EVENT_SCHEDULE_RESOURCE_TYPE 'EvSc'
FILM_DETAILS_RESOURCE_TYPE 'FlDt'
PANEL_DETAILS_RESOURCE_TYPE 'PnDt'
MUSIC_DETAILS_RESOURCE_TYPE 'MsDt'
VENUE_DETAILS_RESOURCE_TYPE 'VnDt'

The SXSW 2005 application has six 'EvSc' resources, and one of each of the different details resources.  The 'EvSc' #1 resource is a master index of all events at the conference.  For 2005, this was 62647 bytes, so it's likely that it will need to be changed for 2006 to handle the increasing number of events.  This resource is a list of packed entries with this structure:

UInt32 eventID;
UInt32 timestamp;
UInt16 venueOffset;
UInt16 detailsOffset;
UInt8 eventType;
char name[];

Each entry starts with a 32-bit unique ID that we derive from key bits of event data, all sent through a CRC32 routine to produce a hashed value.  So far, we've not had any conflicts between different events, but there's a check in the data generation code to flag us if one occurs.  The timestamp is the start time for the event, stored as a 32-bit seconds value.  The venueOffset and detailsOffset fields store the offset from the start of the appropriate details resource to get more information about the event.  We end with a one-character event type indicator (film, band, or one of the three panel types), and they store a tokenized event name where common words are replaced with values in the ASCII 1-31 range to get a little data compression.

'EvSc' #2 is a fixed-length record index into the 'EvSc' #1 resource.  It's 12696 bytes long; each record is 6 bytes, giving 2116 events.  An index record has a 4-byte eventID followed by a 2-byte offset into the 'EvSc' #1 resource for the event details.  This lets us find any event from its unique ID by doing a binary search in this index, then using the offset to find the event details structure.

'EvSc' #3 is an index to all events, sorted by time.  It has the same structure as 'EvSc' #2, but it's a little shorter because there are some events that are in the database but were not scheduled.  'EvSc' #4 is a small 20-byte resource that points to the first entry for each day of the festival, allowing us to show the event list on a day-by-day basis.

'EvSc' #5/#6 are the same as #3/#4, but the sort order is first sorted by venue, then sorted by day-of-festival, making it easy for us to show the events grouped by where they are happening.

The details resources are similar to my 'EvSc' #1 resource.  They're all packed records, so to find the start of one of the entries, you need the 16-bit offset value stored in the venueOffset or detailsOffset field of the original event details structure.  A panel record has a length value (in minutes), a web ID value used to construct the URL for browsing the web, and a tokenized synopsis string.  Film records store length and strings for the director, synopsis, category, and screening type. Musical performances have web IDs, a flag indicating whether they have a music file posted online, and strings for their home town, label, and category.  Finally, each venue stores strings for venue name, address, and age policy.

In summary, we built our own read-only compressed database engine with a fixed set of indices.  It is not very flexible, but it does work quickly, and with the right set of C++ classes, it's been easy to use in my code.  We hit some resource size limits in 2005, requiring us to tokenize strings to get back under 64K for some entries, but I've got some clear ideas on how to change the structure further if necessary in the future.