Thursday, 6 December 2007

MySQL: Unlimited updates in one query - ish.

One of the major parts of my job is/was developing the interface to our MIS that handles hundreds of connections on a nightly basis from other MIS systems which push data in a CSV format. These CSV files contain information that then needs to be pushed into our MIS (thousands of inserts and updates, plus some deletes).

Now, creating the insert statements are a breeze with the "insert into tbl0(...) values(...),(...),..." MySQL syntax. I can push thousands of inserts in a single query, the only limit being the MySQL packet size, hence the 'ish' in the title. But what about updates? before now I've resorted to just running the updates as single queries, which lets face it, is slow. So lets look at how we can speed this up.

The Meat

First create a test table:

create table test0(sid int, staffid varchar(255), name varchar(255), unique(sid, staffid));

Insert some data:

insert into test0(sid, staffid, name) values(1, 'ar', 'Andrew'), (1, 'jm', 'John'), (1, 'sl', 'Seb');

Now we want to turn:

update test1 set name = 'Andrew 2' where sid = 1 and staffid = 'ar';
update test1 set name = 'John 2' where sid = 1 and staffid = 'jm';
update test1 set name = 'Seb 2' where sid = 1 and staffid = 'sl';

Into a single query, and we do this like so:

insert into test0(sid, staffid, name) values(1, 'ar', 'Andrew 2'),
(1, 'jm', 'John 2'),(1, 'sl', 'Seb 2') on duplicate key update = values(name);

`test0` should now have been updated with the new values. The key to all this is using "on duplicate key update".

I should point out, for the uninitiated that you need a unique key for this to work. In this case it is (sid, staffid) plus there is no need to update the unique keys.

Initial tests show a two thirds increase in speed when we are:
  1. Hitting MySQL from a PHP script.
  2. Breaking the updates into 1024k chunks.
  3. Connecting to MySQL on the localhost
If you are connecting to MySQL over a network you should see even better speed improvements because of the effects of network latency.

Drawbacks and Unknowns
  1. I've not tested this against multiple update queries sent in one go. i.e. "update tbl0 set col1 = 1 where col0 = 0; update tbl0 set col1 = 2 where col0 =1; ...".
  2. Increased memory use from the size of the insert statement. However I took the chunks down to 500k and even 250k and lost no more than 5-10% performance.
  3. If your using PHP watch the memory limit. Also keep your chunks below the max allowed packet size.