If you want to store large amount of volatile data (e.g. log file entries) in a database with a constant storage memory footprint and no maintenance to purge the old entries, a round robin database is the best solution. But how to implement it in MySQL?
Some of the heaviest tables in our database are tables which do some event logging. Let’s look at a table which log the most recent visitors on a member profile. Look at that ugly output:
mysql> SHOW TABLE STATUS like "profile_visits_log" \G;
*************************** 1. row **********************
Name: profile_visits_log
Rows: 6'226'066
Avg_row_length: 21
Data_length: 130'747'386
Index_length: 393'205'760
It grows and grows… And the queries get slower and slower.
This is the case because we are keeping a lot of old, unused data. For example, on my member profile there’s data back to September 2007:
So we have to get rid of that old data. For example we can do it like Xing and store the n most recent entries for each user. Drop the rest:
Now, how do we do that in the most elegant way?
Note: We want always the 5 last entries. Even if they date back from half a year ago. So purging via WHERE last_update < DATE_SUB(NOW(), INTERVAL 7 DAYS is not an solution.
Here’s two options:
Delete old posts by a cron job
The old-fashioned way is to run a separate cron job every night to purge the old data:
Here’s some pseudo code:
foreach (tilllate-member-profile)
{
get all visits of this profile
delete all visits except the most recent 10 ones
}
Problems with this method
- Separate, asynchronous cron job needed -> Logic is spread across the application which increases maintenance cost.
- Complex job causing a lot of heavy queries all at once -> Long locking time.
Use the Round Robin Strategy
Here’s an idea I got from the RRDTool
. The idea is to have a sequential-number going from 0 to 4 and then wrapping back to 0. With that “system storage footprint remains constant over time
“. This is also know as “Circular buffer”.
Additional advantage: With the Round Robin Event Logging Strategy you have the whole code at the same place (The principle of cohesion). You don’t need a separate cron job in charge of the purging.
We’re working with the following table:
CREATE TABLE `test`.`profile_visits_log_rrd` (
`uid` int(10) unsigned NOT NULL default '0',
# this is the uid of the visited member profile
`sequence_id` tinyint(3) unsigned NOT NULL default '0',
# This is the sequence number.
# It goes from 0 to 4. Once at 4 it should
# wrap and restart at 0',
`last_update` datetime NOT NULL,
`visitor_uid` varchar(45) NOT NULL,
PRIMARY KEY (`uid`,`sequence_id`),
KEY `uid-last_update` (`uid`,`last_update`)
) ;
Here’s some sample data:
| uid | sequence_id | last_update | visitor_uid |
|---|---|---|---|
| 4 | 0 | 2007-12-30 21:02:14 | 6755474 |
| 4 | 1 | 2007-12-31 09:48:25 | 9887412 |
| 4 | 2 | 2007-12-31 10:59:22 | 208573 |
| 4 | 3 | 2007-12-31 11:51:21 | 757983 |
| 4 | 4 | 2008-01-01 13:54:08 | 3548743 |
| 5 | 0 | 2007-12-30 11:30:00 | 1400251 |
| 5 | 1 | 2007-12-31 16:17:36 | 620098 |
| 5 | 2 | 2008-01-01 13:55:25 | 3548743 |
| 5 | 3 | 2008-01-01 19:01:57 | 940873 |
| 5 | 4 | 2008-01-01 20:23:54 | 1270881 |
For each user you have only the last 5 entries.
Reading out data for a user
SELECT * FROM profile_visits_log_rrd m WHERE uid=5 ORDER BY last_update DESC
| uid | sequence_id | last_update | visitor_uid |
|---|---|---|---|
| 5 | 2 | 2008-01-01 20:23:54 | 1270881 |
| 5 | 1 | 2008-01-01 19:01:57 | 940873 |
| 5 | 0 | 2008-01-01 13:55:25 | 3548743 |
| 5 | 4 | 2007-12-31 16:17:36 | 620098 |
| 5 | 3 | 2007-12-30 11:30:00 | 1400251 |
It uses the index `uid-last_update` and is therefore efficient.
Adding new data for a user
Imagine, at 2008-01-01 22:03:23 there’s a new visitor visiting the member profile of user ‘5’.
You don’t want a new row so you need to replace the oldest one. In the case above it is row
| uid | sequence_id | last_update | visitor_uid |
|---|---|---|---|
| 5 | 3 | 2007-12-30 11:30:00 | 1400251 |
But to replace that row you gotta find out the primary key of that row. In this case it is uid==5 AND sequence_id==3.
You got the uid. So first you have to:
Step 1: Get the next sequence_id
SELECT (`sequence_id`+1) % 5 as next_sequence_id
FROM profile_visits_log_rrd m
WHERE uid=5
ORDER BY last_update DESC
LIMIT 1;
SELECT (`sequence_id`+1) % 5 as next_sequence_id FROM profile_visits_log_rrd m WHERE uid=5 ORDER BY last_update DESC LIMIT 1;
The price is low: Can use the index uid-last_update and therefore there’s no data access. No filesort. No temporary table.
In the example above this returns ‘3’. (We’re using modulo to make sure the number wraps when it goes over ‘4’)
Step 2: Add the data
REPLACE INTO profile_visits_log_rrd SET uid = 5, sequence_id = 3, last_update = '2008-01-01 22:03:23';
Maybe a INSERT ... ON DUPLICATE KEY is cheaper. But anyway: It uses indexes all the time
Add atomicity
To have atomar operations you need to put everything in a stored procedure (thanks, Leo):
LOCK TABLE member_gold_visitor_rrd WRITE;
SET @next_sequence_id = 0;
SELECT (@next_sequence_id:=((`sequence_id`+1) % 5))
as next_sequence_id
FROM member_gold_visitor_rrd
WHERE uid=5
ORDER BY last_update DESC
LIMIT 1;
#For debug only:SELECT @next_sequence_id;
REPLACE INTO member_gold_visitor_rrd SET
uid = 5,
sequence_id = @next_sequence_id,
last_update = now(),
visitor_uid = 1400254;
UNLOCK TABLES;
Benefits of this technique
- You don’t need a separate cron job (Cohesion!)
- There’s no long table locking
- Your tables don’t grow with time. Just with the number of users. (Constant memory footprint)
- You can abstract this logging via a component to make it transparent for your developers. (As a matter of fact we have extended Zend Framework with such a class)
Better suggestions?
Does anyone have other ideas on how to solve this problem elegantly?
Maybe MySQL is not the right tool for that kind of data? – Let me know what you think
There’s a PostgreSQL implementation
for that issue.



Pingback: Round Robin Data Storage in MySQL - Tilllate Techblog - To take away