Monday, May 28, 2012

MySQL Replication Challenges

If you've got a server running MySQL that's creaking under the weight of database reads, the obvious solution is to set up replication.  MySQL replication itself seems to work fine, but one thing I didn't appreciate at first was how much your code probably needs to change too (and not just which server it reads data from).

I imagine that most web pages, if they're anything like mine (oh dear), implicitly rely on the database being updated immediately.  My web pages usually have a similar structure:-

1: Read any GET/POST values and update the database
2: Display the newly updated data as part of the web page

This seems fine, and it is if your database is updated immediately (since stage 2 is reading the data that has just been changed by stage 1).  If you're using replication, then you'll be updating the master database in stage 1, but then reading the data from a slave for stage 2.  However, if the slave hasn't caught up with the master by the time stage 2 runs (and the lag is often only a fraction of a second, though it can stretch to seconds or longer), then stage 2 will display old data, and your users will cock their heads to one side wondering what's going on and why the system has ignored their commands.
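To make that concrete, here's the kind of change I mean, sketched in Python with the mysql-connector-python driver (the connection details, table and column names are all made up, and this isn't my actual code): once stage 1 has written to the master, stage 2 reads the row back from the master rather than trusting the slave to have caught up.  Requests that haven't written anything can still read from the slave.

import mysql.connector

# Hypothetical connections; in practice these would come from your config.
master = mysql.connector.connect(host="db-master", user="app", password="secret", database="app")
slave = mysql.connector.connect(host="db-slave", user="app", password="secret", database="app")

def handle_request(page_id, new_title):
    # Stage 1: apply the submitted GET/POST values on the master.
    cur = master.cursor()
    cur.execute("UPDATE pages SET title = %s WHERE id = %s", (new_title, page_id))
    master.commit()

    # Stage 2: we've just written, so read the row back from the master.
    # A request that made no writes could send this read to the slave instead.
    cur = master.cursor()
    cur.execute("SELECT title FROM pages WHERE id = %s", (page_id,))
    return cur.fetchone()[0]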

Another similar problem I experienced was with an automatic routine that transferred data from one table to another.  For various good reasons, this routine would read the first 100 records (from the slave, obv), append them to another table, and then delete the originals (on the master).  And this worked fine until there was a small glitch in the communication between the master and slave.  Glitches in communication aren't normally a problem, since you can specify how long the master retains its binary logs, so you can turn a slave off for a few days, turn it back on, and it will (eventually) bring itself up to date.

However, due to the glitch, the slave wasn't updating itself.  The routine read the first 100 records from one table (on the slave), wrote them to another table (on the master), and then deleted the records from the original table (on the master).  Since it was reading the records from the stalled slave, which never saw those deletes, the next 100 records it read were the same 100 records as last time, because as far as the slave was concerned they were still there.

Being an automatic routine that runs overnight, this process duplicated identical records at a hysterical rate, ate up all the disk space, and brought the whole system down within just a few minutes.  Oh dear.
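One way to make a routine like that safer, sketched again in Python with made-up table names (not the actual code), is to do the read, the copy and the delete all on the master, inside one transaction and keyed by primary key, so a stalled slave can never feed the same batch back in.

def archive_batch(master):
    # mysql-connector-python leaves autocommit off by default, so the copy
    # and the delete below commit (or roll back) together.
    cur = master.cursor()

    # Read the batch from the master, not the slave, and remember the ids.
    cur.execute("SELECT id FROM live_table ORDER BY id LIMIT 100")
    ids = [row[0] for row in cur.fetchall()]
    if not ids:
        master.rollback()
        return 0

    placeholders = ", ".join(["%s"] * len(ids))
    # Copy exactly those rows, then delete exactly those rows, by primary key.
    cur.execute("INSERT INTO archive_table SELECT * FROM live_table WHERE id IN (" + placeholders + ")", ids)
    cur.execute("DELETE FROM live_table WHERE id IN (" + placeholders + ")", ids)
    master.commit()
    return len(ids)

Even just moving the SELECT onto the master connection would have avoided the runaway duplication; the transaction and the explicit id list are belt and braces.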

So, in summary, replication technically works very well, but you can't just plug it in and turn it on and expect all your problems to be solved.  You probably need to think carefully about when to read from the slave or master, and what will happen if the slave is too far behind the master.
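On that last point, you can at least ask the slave how far behind it thinks it is before deciding to read from it.  This sketch (same hypothetical Python setup as above) uses SHOW SLAVE STATUS, whose Seconds_Behind_Master column is a number of seconds while replication is running and NULL when it has stopped, and falls back to the master when the answer looks bad.

def connection_for_reads(master, slave, max_lag_seconds=5):
    # SHOW SLAVE STATUS needs the REPLICATION CLIENT (or SUPER) privilege.
    cur = slave.cursor(dictionary=True)
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()
    lag = status["Seconds_Behind_Master"] if status else None
    if lag is None or lag > max_lag_seconds:
        # Replication is stopped or too far behind: fall back to the master.
        return master
    return slave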
