Thursday, 6 December 2007

MySQL: Unlimited updates in one query - ish.

One of the major parts of my job is/was developing the interface to our MIS that handles hundreds of connections on a nightly basis from other MIS systems which push data in a CSV format. These CSV files contain information that then needs to be pushed into our MIS (thousands of inserts and updates, plus some deletes).

Now, creating the insert statements are a breeze with the "insert into tbl0(...) values(...),(...),..." MySQL syntax. I can push thousands of inserts in a single query, the only limit being the MySQL packet size, hence the 'ish' in the title. But what about updates? before now I've resorted to just running the updates as single queries, which lets face it, is slow. So lets look at how we can speed this up.

The Meat

First create a test table:

create table test0(sid int, staffid varchar(255), name varchar(255), unique(sid, staffid));

Insert some data:

insert into test0(sid, staffid, name) values(1, 'ar', 'Andrew'), (1, 'jm', 'John'), (1, 'sl', 'Seb');

Now we want to turn:

update test1 set name = 'Andrew 2' where sid = 1 and staffid = 'ar';
update test1 set name = 'John 2' where sid = 1 and staffid = 'jm';
update test1 set name = 'Seb 2' where sid = 1 and staffid = 'sl';

Into a single query, and we do this like so:

insert into test0(sid, staffid, name) values(1, 'ar', 'Andrew 2'),
(1, 'jm', 'John 2'),(1, 'sl', 'Seb 2') on duplicate key update = values(name);

`test0` should now have been updated with the new values. The key to all this is using "on duplicate key update".

I should point out, for the uninitiated that you need a unique key for this to work. In this case it is (sid, staffid) plus there is no need to update the unique keys.

Initial tests show a two thirds increase in speed when we are:
  1. Hitting MySQL from a PHP script.
  2. Breaking the updates into 1024k chunks.
  3. Connecting to MySQL on the localhost
If you are connecting to MySQL over a network you should see even better speed improvements because of the effects of network latency.

Drawbacks and Unknowns
  1. I've not tested this against multiple update queries sent in one go. i.e. "update tbl0 set col1 = 1 where col0 = 0; update tbl0 set col1 = 2 where col0 =1; ...".
  2. Increased memory use from the size of the insert statement. However I took the chunks down to 500k and even 250k and lost no more than 5-10% performance.
  3. If your using PHP watch the memory limit. Also keep your chunks below the max allowed packet size.


George said...

I know this is an old post, but I thought it was worth a quick comment anyhow :)

Why are you performing individual inserts and updates? Would it not be more efficient to bulk load the CSV(s) in to a temporary table and then use a SET based method instead of an RBAR one?

Andrew Rose said...

Hi George. Yeap, row by row is the way we have to do things. We deal with data from many heterogeneous systems that don't export data in the exact same ways. This data requires sanitizing and lookups into internal hash tables for their respective keys. Also some rows require actions to be performed i.e. new student or staff records may require new user accounts creating. We also need to derive certain data due to limitations of the external systems.

Using the MySQL "LOAD DATA INFILE" functionality was one of my first attempts(many years ago), which doesn't require any temp tables. We grew out of this method very quickly. If the data you take in is known to be clean and matches with your own systems schema then by all means use something like this. We don't have that luxury :)