Tuesday, 8 January 2008

MySQL (5.0): Recovering failed circular replication

There is a lot of information out there about how to setup circular replication but nothing about how to recover it when all else fails. This article will cover a quick and easy method I use. Depending on the size of your database and the interconnects between servers this method may not be suitable due to the need to copy all replicated databases from one good server to all other servers in the replication circle which requires a certain amount of down time respectively.

OK, so one server or more is showing Slave_IO_Running and/or Slave_SQL_Running as No and there is some error about a failed query when you run "show slave status;" and no amount of effort to fix it is working. First DO NOT PANIC. It is broken, OK, tell yourself that and realise that trying to fix something when you are in a panicked state is only liable to make the situation worse, hell I'd guarantee it, so relax.


The checklist

Below is our checklist of things to do, with more detailed information about each step afterward.
  1. Make sure ALL connections to MySQL are closed
  2. Back everything up (/var/lib/mysql)
  3. Pick the most up-to-date database
  4. Run stop slave; on all servers
  5. Run reset master; and reset slave; on all servers
  6. Shutdown all servers (e.g. /etc/rc.d/mysqld stop)
  7. Copy database decided on from step 3; files/ibdata etc to other servers
  8. Start servers (e.g. /etc/rc.d/mysqld start)
  9. Run start slave; on all servers

Step 1 - Make sure ALL connections to MySQL are closed

If you have anything still connected to MySQL when your fixing replication your going to find your worse off than where you started. In an ideal world you would know EXACTLY what is connected to your MySQL server. But if in doubt, or even better if you like to second check then fire up the MySQL client on all servers and run show full processlist; all you want to see there is rows with system user and one for your connected client which should show show full processlist in the Info column. If there is anything else, get rid before proceeding to Step 2.

Don't forget things like Apache running PHP scripts and PHP scripts under cron control for example which may be waiting to start a new connection without your knowledge.

An easy way to be sure nothing has modified your database is to run show master status; and note down the values of the File and Position columns. As you progress though the recovery check these values now and then to be sure they have not changed. The only time they should change is after step 8.


Step 2 - Back everything up (/var/lib/mysql)

This should be obvious, but as a reminder I've put it in. Before I fix anything in the way that is laid out in this article I make a copy of the MySQL data directory by simply doing (You may want to stop the MySQL server before doing this.):

cp -a /var/lib/mysql /var/lib/mysql-[yyyymmdd].

This will allow me to move the old directory back into play (after the whole procedure obviously) by:
  1. Stopping the MySQL server
  2. mv /var/lib/mysql /var/lib/mysql-tmp
  3. mv /var/lib/mysql-[yyyymmdd] /var/lib/mysql
  4. Starting MySQL server

You may need to do this to check data in the old version of the database. I would NOT start slave; if you do this it will most likely break replication in the other servers and you won't be able to switch back to the new database and your have to start all over again :(


Step 3 - Pick the most up-to-date database

When replication fails beyond repair you need to select a server who's database to use. This is completely your call. However if it is one server that has hiccuped use its masters database as every node after the failed server would have it's events replicated all the way around to its master.

At this point it may be a good idea to get a list of the databases that you are replicating. I use a quick call to show master status; to get a list of databases in usable format. I.e. db0,db1,db2 in the Binlog_Do_Db column. We will use this in step 7.


Step 4 - Run stop slave; on all servers

This step is pretty self explanatory. We stop the slave process from running in preparation for the next step. We "should" get away without stopping the slave process I'll admit, but it's better to be safe than sorry.


Step 5 - Run reset master; and reset slave; on all servers

This step purges bin logs and resets bin log position counters (relay-log.info and master.info) for the master and slave. This way when we copy our best database to all other servers they will all be at a common starting point.


Step 6 - Shutdown all servers (e.g. /etc/rc.d/mysqld stop)

Pretty simply stop all servers. We do this because the next step involves copying from one server to all other servers, so we don't want MySQL modifying anything mid copy.


Step 7 - Copy database decided on from step 3; files/ibdata etc to other servers

Now we need to copy the database you chose in step 3 to all other servers.

I use scp for this like so (salt to your taste):

scp -r /var/lib/mysql/{db0,db1,db2} /var/lib/mysql/ib* server1:/var/lib/mysql &

Note, if you do not replicate any InnoDB tables remove the /var/lib/mysql/ib* bit. I also background the process just in case I lose my ssh connection (i.e. to the dreaded felis silvestris catus bug), it also means you can run scp for each server from the same console. To check up on the scp processes I simply use watch "ps aux | grep scp". When all processes are gone from the list your good to go to step 8.


Step 8 - Start servers (e.g. /etc/rc.d/mysqld start)

Start the servers. This can be done in any order you feel like doing it in.


Step 9 - Run start slave; on all servers

You may have MySQL setup to start the slave automatically, if not however now is the time to start the slave processes on all servers. Give the servers time to get going as they have by default 60 second timeouts so may not connect to there master straight away, especially if you haven't actually started it before hand. If your impatient however simple run stop slave; and start slave; on each server going from master to slave around the circle.


Final thoughts

OK so that may have seemed a bit long winded, but I'll guarantee that after doing it once successfully your see it for the relatively simple process it is. Off course it doesn't help when your expected to keep 100% up time.

I never explained how to restore around a failed node which is a common problem. I thought it would simply confuse things to much even though it is a pretty simple thing to slot in around step 6. I will follow up with another blog entry to cover it another day. If you want it sooner feel free to poke me via email.

Well I hope all that helps someone, especially the lads in my team which should now have no excuse for dodging the night watch ;)

5 comments:

Anonymous said...

Just wanted to say this entry helped us out of a pinch. Thanks!

Gavving said...

Lifesaver post!

Anonymous said...

Have you ever run into issues with auto-incrementing the same table on 2 servers at the same time?

Andrew Rose said...

This should help you with your auto incrementing problems.

Paul said...

This worked but before you can start the slaves you may need to "CHANGE MASTER TO..."