Wednesday, 30 January 2008

LAMP: The mailing lists you *SHOULD* be subscribed to.

The LAMP stack consists of some complicated software, and this software from time to time will develop faults and security flaws. How do you keep yourself informed? Hope that the issues crop up on Digg, Slashdot? Well the best way is to join the Announce lists for each of the LAMP stack components.

The Announce lists are used by the developers of the different components of the LAMP stack to keep users informed of important events, like when a security flaw emerges or a new version of there software is released, etc.

The lists

The first one is dependent on the GNU/Linux distribution your running, in my case it is Archlinux, so I've subscribed to:

Next in the LAMP stack is Apache, and you can find its Announce list here:

For MySQL:

And PHP:

And lastly, considering that SSH is the primary form of entry to my systems (and most likely yours) I am subscribed to the OpenSSL and OpenSSH mailing lists:

Final points

Now, subscribing to these lists in no way secures you from attack, but it should give you a fighting chance. When bugs / security issues do crop up you will need to take the necessary action, which should be as simple as upgrading the relevant software. But be prepared to build replacement packages from source if your GNU/Linux distribution is slow to release an updated package.

Tuesday, 8 January 2008

MySQL (5.0): Recovering failed circular replication

There is a lot of information out there about how to setup circular replication but nothing about how to recover it when all else fails. This article will cover a quick and easy method I use. Depending on the size of your database and the interconnects between servers this method may not be suitable due to the need to copy all replicated databases from one good server to all other servers in the replication circle which requires a certain amount of down time respectively.

OK, so one server or more is showing Slave_IO_Running and/or Slave_SQL_Running as No and there is some error about a failed query when you run "show slave status;" and no amount of effort to fix it is working. First DO NOT PANIC. It is broken, OK, tell yourself that and realise that trying to fix something when you are in a panicked state is only liable to make the situation worse, hell I'd guarantee it, so relax.

The checklist

Below is our checklist of things to do, with more detailed information about each step afterward.
  1. Make sure ALL connections to MySQL are closed
  2. Back everything up (/var/lib/mysql)
  3. Pick the most up-to-date database
  4. Run stop slave; on all servers
  5. Run reset master; and reset slave; on all servers
  6. Shutdown all servers (e.g. /etc/rc.d/mysqld stop)
  7. Copy database decided on from step 3; files/ibdata etc to other servers
  8. Start servers (e.g. /etc/rc.d/mysqld start)
  9. Run start slave; on all servers

Step 1 - Make sure ALL connections to MySQL are closed

If you have anything still connected to MySQL when your fixing replication your going to find your worse off than where you started. In an ideal world you would know EXACTLY what is connected to your MySQL server. But if in doubt, or even better if you like to second check then fire up the MySQL client on all servers and run show full processlist; all you want to see there is rows with system user and one for your connected client which should show show full processlist in the Info column. If there is anything else, get rid before proceeding to Step 2.

Don't forget things like Apache running PHP scripts and PHP scripts under cron control for example which may be waiting to start a new connection without your knowledge.

An easy way to be sure nothing has modified your database is to run show master status; and note down the values of the File and Position columns. As you progress though the recovery check these values now and then to be sure they have not changed. The only time they should change is after step 8.

Step 2 - Back everything up (/var/lib/mysql)

This should be obvious, but as a reminder I've put it in. Before I fix anything in the way that is laid out in this article I make a copy of the MySQL data directory by simply doing (You may want to stop the MySQL server before doing this.):

cp -a /var/lib/mysql /var/lib/mysql-[yyyymmdd].

This will allow me to move the old directory back into play (after the whole procedure obviously) by:
  1. Stopping the MySQL server
  2. mv /var/lib/mysql /var/lib/mysql-tmp
  3. mv /var/lib/mysql-[yyyymmdd] /var/lib/mysql
  4. Starting MySQL server

You may need to do this to check data in the old version of the database. I would NOT start slave; if you do this it will most likely break replication in the other servers and you won't be able to switch back to the new database and your have to start all over again :(

Step 3 - Pick the most up-to-date database

When replication fails beyond repair you need to select a server who's database to use. This is completely your call. However if it is one server that has hiccuped use its masters database as every node after the failed server would have it's events replicated all the way around to its master.

At this point it may be a good idea to get a list of the databases that you are replicating. I use a quick call to show master status; to get a list of databases in usable format. I.e. db0,db1,db2 in the Binlog_Do_Db column. We will use this in step 7.

Step 4 - Run stop slave; on all servers

This step is pretty self explanatory. We stop the slave process from running in preparation for the next step. We "should" get away without stopping the slave process I'll admit, but it's better to be safe than sorry.

Step 5 - Run reset master; and reset slave; on all servers

This step purges bin logs and resets bin log position counters ( and for the master and slave. This way when we copy our best database to all other servers they will all be at a common starting point.

Step 6 - Shutdown all servers (e.g. /etc/rc.d/mysqld stop)

Pretty simply stop all servers. We do this because the next step involves copying from one server to all other servers, so we don't want MySQL modifying anything mid copy.

Step 7 - Copy database decided on from step 3; files/ibdata etc to other servers

Now we need to copy the database you chose in step 3 to all other servers.

I use scp for this like so (salt to your taste):

scp -r /var/lib/mysql/{db0,db1,db2} /var/lib/mysql/ib* server1:/var/lib/mysql &

Note, if you do not replicate any InnoDB tables remove the /var/lib/mysql/ib* bit. I also background the process just in case I lose my ssh connection (i.e. to the dreaded felis silvestris catus bug), it also means you can run scp for each server from the same console. To check up on the scp processes I simply use watch "ps aux | grep scp". When all processes are gone from the list your good to go to step 8.

Step 8 - Start servers (e.g. /etc/rc.d/mysqld start)

Start the servers. This can be done in any order you feel like doing it in.

Step 9 - Run start slave; on all servers

You may have MySQL setup to start the slave automatically, if not however now is the time to start the slave processes on all servers. Give the servers time to get going as they have by default 60 second timeouts so may not connect to there master straight away, especially if you haven't actually started it before hand. If your impatient however simple run stop slave; and start slave; on each server going from master to slave around the circle.

Final thoughts

OK so that may have seemed a bit long winded, but I'll guarantee that after doing it once successfully your see it for the relatively simple process it is. Off course it doesn't help when your expected to keep 100% up time.

I never explained how to restore around a failed node which is a common problem. I thought it would simply confuse things to much even though it is a pretty simple thing to slot in around step 6. I will follow up with another blog entry to cover it another day. If you want it sooner feel free to poke me via email.

Well I hope all that helps someone, especially the lads in my team which should now have no excuse for dodging the night watch ;)