Round Robin Data Storage in MySQL

If you want to store large amount of volatile data (e.g. log file entries) in a database with a constant storage memory footprint and no maintenance to purge the old entries, a round robin database is the best solution. But how to implement it in MySQL?

Some of the heaviest tables in our database are tables which do some event logging. Let’s look at a table which log the most recent visitors on a member profile. Look at that ugly output:

mysql> SHOW TABLE STATUS like "profile_visits_log" \G;
*************************** 1. row **********************
           Name: profile_visits_log
           Rows: 6'226'066
 Avg_row_length: 21
    Data_length: 130'747'386
   Index_length: 393'205'760

It grows and grows… And the queries get slower and slower.

This is the case because we are keeping a lot of old, unused data. For example, on my member profile there’s data back to September 2007:

So we have to get rid of that old data. For example we can do it like Xing and store the n most recent entries for each user. Drop the rest:

Now, how do we do that in the most elegant way?

Note: We want always the 5 last entries. Even if they date back from half a year ago. So purging via WHERE last_update < DATE_SUB(NOW(), INTERVAL 7 DAYS is not an solution.

Here’s two options:

Delete old posts by a cron job

The old-fashioned way is to run a separate cron job every night to purge the old data:

Here’s some pseudo code:

foreach (tilllate-member-profile)
{
     get all visits of this profile
     delete all visits except the most recent 10 ones
}

Problems with this method

Separate, asynchronous cron job needed -> Logic is spread across the application which increases maintenance cost.
Complex job causing a lot of heavy queries all at once -> Long locking time.

Use the Round Robin Strategy

Here’s an idea I got from the RRDTool. The idea is to have a sequential-number going from 0 to 4 and then wrapping back to 0. With that “system storage footprint remains constant over time“. This is also know as “Circular buffer”.

Additional advantage: With the Round Robin Event Logging Strategy you have the whole code at the same place (The principle of cohesion). You don’t need a separate cron job in charge of the purging.

We’re working with the following table:

CREATE TABLE  `test`.`profile_visits_log_rrd` (
  `uid` int(10) unsigned NOT NULL default '0',
       # this is the uid of the visited member profile
  `sequence_id` tinyint(3) unsigned NOT NULL default '0',
       # This is the sequence number.
       # It goes from 0 to 4. Once at 4 it should
       # wrap and restart at 0',
  `last_update` datetime NOT NULL,
  `visitor_uid` varchar(45) NOT NULL,
  PRIMARY KEY (`uid`,`sequence_id`),
  KEY `uid-last_update` (`uid`,`last_update`)
) ;

Here’s some sample data:

uid	sequence_id	last_update	visitor_uid
4	0	2007-12-30 21:02:14	6755474
4	1	2007-12-31 09:48:25	9887412
4	2	2007-12-31 10:59:22	208573
4	3	2007-12-31 11:51:21	757983
4	4	2008-01-01 13:54:08	3548743
5	0	2007-12-30 11:30:00	1400251
5	1	2007-12-31 16:17:36	620098
5	2	2008-01-01 13:55:25	3548743
5	3	2008-01-01 19:01:57	940873
5	4	2008-01-01 20:23:54	1270881

For each user you have only the last 5 entries.

Reading out data for a user

SELECT * FROM profile_visits_log_rrd m WHERE uid=5
ORDER BY last_update DESC

uid	sequence_id	last_update	visitor_uid
5	2	2008-01-01 20:23:54	1270881
5	1	2008-01-01 19:01:57	940873
5	0	2008-01-01 13:55:25	3548743
5	4	2007-12-31 16:17:36	620098
5	3	2007-12-30 11:30:00	1400251

It uses the index `uid-last_update` and is therefore efficient.

Adding new data for a user

Imagine, at 2008-01-01 22:03:23 there’s a new visitor visiting the member profile of user ‘5’.

You don’t want a new row so you need to replace the oldest one. In the case above it is row

uid	sequence_id	last_update	visitor_uid
5	3	2007-12-30 11:30:00	1400251

But to replace that row you gotta find out the primary key of that row. In this case it is uid==5 AND sequence_id==3.

You got the uid. So first you have to:

Step 1: Get the next sequence_id

SELECT (`sequence_id`+1) % 5 as next_sequence_id
 FROM profile_visits_log_rrd m
 WHERE uid=5
 ORDER BY last_update DESC
 LIMIT 1;

The price is low: Can use the index uid-last_update and therefore there’s no data access. No filesort. No temporary table.

In the example above this returns ‘3’. (We’re using modulo to make sure the number wraps when it goes over ‘4’)

Step 2: Add the data

REPLACE INTO profile_visits_log_rrd SET
  uid = 5,
  sequence_id = 3,
  last_update = '2008-01-01 22:03:23';

Maybe a INSERT ... ON DUPLICATE KEY is cheaper. But anyway: It uses indexes all the time

Add atomicity

To have atomar operations you need to put everything in a stored procedure (thanks, Leo):

LOCK TABLE member_gold_visitor_rrd WRITE;
SET @next_sequence_id = 0;
SELECT (@next_sequence_id:=((`sequence_id`+1) % 5))
    as next_sequence_id
 FROM member_gold_visitor_rrd
 WHERE uid=5
 ORDER BY last_update DESC
 LIMIT 1;
#For debug only:SELECT @next_sequence_id;
REPLACE INTO member_gold_visitor_rrd SET
  uid = 5,
  sequence_id = @next_sequence_id,
  last_update = now(),
  visitor_uid = 1400254;
UNLOCK TABLES;

Benefits of this technique

You don’t need a separate cron job (Cohesion!)
There’s no long table locking
Your tables don’t grow with time. Just with the number of users. (Constant memory footprint)
You can abstract this logging via a component to make it transparent for your developers. (As a matter of fact we have extended Zend Framework with such a class)

Better suggestions?

Does anyone have other ideas on how to solve this problem elegantly?

Maybe MySQL is not the right tool for that kind of data? – Let me know what you think

There’s a PostgreSQL implementation for that issue.