OK, so one server or more is showing
Slave_IO_Running
and/or Slave_SQL_Running
as No
and there is some error about a failed query when you run "show slave status;
" and no amount of effort to fix it is working. First DO NOT PANIC. It is broken, OK, tell yourself that and realise that trying to fix something when you are in a panicked state is only liable to make the situation worse, hell I'd guarantee it, so relax.The checklist
Below is our checklist of things to do, with more detailed information about each step afterward.
- Make sure ALL connections to MySQL are closed
- Back everything up (/var/lib/mysql)
- Pick the most up-to-date database
- Run
stop slave;
on all servers - Run
reset master;
andreset slave;
on all servers - Shutdown all servers (
e.g. /etc/rc.d/mysqld stop
) - Copy database decided on from step 3; files/ibdata etc to other servers
- Start servers (e.g.
/etc/rc.d/mysqld start
) - Run
start slave;
on all servers
Step 1 - Make sure ALL connections to MySQL are closed
If you have anything still connected to MySQL when your fixing replication your going to find your worse off than where you started. In an ideal world you would know EXACTLY what is connected to your MySQL server. But if in doubt, or even better if you like to second check then fire up the MySQL client on all servers and run
show full processlist;
all you want to see there is rows with system user
and one for your connected client which should show show full processlist
in the Info column. If there is anything else, get rid before proceeding to Step 2.Don't forget things like Apache running PHP scripts and PHP scripts under cron control for example which may be waiting to start a new connection without your knowledge.
An easy way to be sure nothing has modified your database is to run
show master status;
and note down the values of the File
and Position
columns. As you progress though the recovery check these values now and then to be sure they have not changed. The only time they should change is after step 8.Step 2 - Back everything up (/var/lib/mysql)
This should be obvious, but as a reminder I've put it in. Before I fix anything in the way that is laid out in this article I make a copy of the MySQL data directory by simply doing (You may want to stop the MySQL server before doing this.):
cp -a /var/lib/mysql /var/lib/mysql-[yyyymmdd]
.This will allow me to move the old directory back into play (after the whole procedure obviously) by:
- Stopping the MySQL server
mv /var/lib/mysql /var/lib/mysql-tmp
mv /var/lib/mysql-[yyyymmdd] /var/lib/mysql
- Starting MySQL server
You may need to do this to check data in the old version of the database. I would NOT
start slave;
if you do this it will most likely break replication in the other servers and you won't be able to switch back to the new database and your have to start all over again :(Step 3 - Pick the most up-to-date database
When replication fails beyond repair you need to select a server who's database to use. This is completely your call. However if it is one server that has hiccuped use its masters database as every node after the failed server would have it's events replicated all the way around to its master.
At this point it may be a good idea to get a list of the databases that you are replicating. I use a quick call to
show master status;
to get a list of databases in usable format. I.e. db0,db1,db2
in the Binlog_Do_Db
column. We will use this in step 7.Step 4 - Run stop slave; on all servers
This step is pretty self explanatory. We stop the slave process from running in preparation for the next step. We "should" get away without stopping the slave process I'll admit, but it's better to be safe than sorry.
Step 5 - Run reset master; and reset slave; on all servers
This step purges bin logs and resets bin log position counters (relay-log.info and master.info) for the master and slave. This way when we copy our best database to all other servers they will all be at a common starting point.
Step 6 - Shutdown all servers (e.g. /etc/rc.d/mysqld stop)
Pretty simply stop all servers. We do this because the next step involves copying from one server to all other servers, so we don't want MySQL modifying anything mid copy.
Step 7 - Copy database decided on from step 3; files/ibdata etc to other servers
Now we need to copy the database you chose in step 3 to all other servers.
I use scp for this like so (salt to your taste):
scp -r /var/lib/mysql/{db0,db1,db2} /var/lib/mysql/ib* server1:/var/lib/mysql &
Note, if you do not replicate any InnoDB tables remove the
/var/lib/mysql/ib*
bit. I also background the process just in case I lose my ssh connection (i.e. to the dreaded felis silvestris catus bug), it also means you can run scp
for each server from the same console. To check up on the scp processes I simply use watch "ps aux | grep scp"
. When all processes are gone from the list your good to go to step 8.Step 8 - Start servers (e.g. /etc/rc.d/mysqld start)
Start the servers. This can be done in any order you feel like doing it in.
Step 9 - Run
start slave;
on all serversYou may have MySQL setup to start the slave automatically, if not however now is the time to start the slave processes on all servers. Give the servers time to get going as they have by default 60 second timeouts so may not connect to there master straight away, especially if you haven't actually started it before hand. If your impatient however simple run
stop slave;
and start slave;
on each server going from master to slave around the circle.Final thoughts
OK so that may have seemed a bit long winded, but I'll guarantee that after doing it once successfully your see it for the relatively simple process it is. Off course it doesn't help when your expected to keep 100% up time.
I never explained how to restore around a failed node which is a common problem. I thought it would simply confuse things to much even though it is a pretty simple thing to slot in around step 6. I will follow up with another blog entry to cover it another day. If you want it sooner feel free to poke me via email.
Well I hope all that helps someone, especially the lads in my team which should now have no excuse for dodging the night watch ;)
5 comments:
Just wanted to say this entry helped us out of a pinch. Thanks!
Lifesaver post!
Have you ever run into issues with auto-incrementing the same table on 2 servers at the same time?
This should help you with your auto incrementing problems.
This worked but before you can start the slaves you may need to "CHANGE MASTER TO..."
Post a comment