Saturday, March 03, 2007

importance of UPSERT database operation

This week I ran into upserts. I have to create a script/tool for uploading data into databases. The source data format is pretty "loose" and allows the users a lot of flexibility in how they generate it. The upside of this approach is that you are not imposing lot of rules when users are generating source data. BTW lot of rules piss people off , things like you can not have comments in between, you can not skip lines, you can not have empty lines, you can only have a certain format etc. So we tend to make it easy for the people to generate data. However there is a downside too.

Downside is , you need upsert kind of operation. If the data that user supplied is already there then do not try to insert it again , just update with whatever the user has supplied. Now doing this kind of operation in application layer would be tough.

To avoid duplicate data insertion from your program you first need to find out if the user supplied data already exists or not. So you either fire a query to do lookup on supplied fields or you load all the keys and fields in the beginning. Both are expensive operations atleast for large tables. In first case you waste a lot of time and in second case you read and keep carrying a lot of data. what we need is some in built mechanism from database.

Surprisingly, mysql provides quite a number of options on what to do when a duplicate row of data (some unique key violation) happens.
  • You can ignore the insert using INSERT IGNORE INTO
  • you can replace the row with new data using ( need to find more)
  • you can update a counter using INSERT INTO ON DUPLICATE UPDATE

All sort of things are possible with mysql. This article gives more information. Oracle and DB2 provide a MERGE operation. You can decide the matching on existing columns and then either insert or update. All this is very database specific and puritans will shout no "portability" but what the heck! Do you change your database every night? These features are powerful features and meant to be used. I am pretty comfortable using them , let the database care about the data, I do not want to do all the house keeping in my application code.

If you are using hibernate then you have to fire custom SQL queries on JDBC connection. You can define a <sql-insert> element for your entity but I am not sure what version of hibernate it works with plus I do not want to override the entity insert in all the cases. Only the data upload case is special. I also need to find out a way to provide named parameters as part of custom SQL queries in hibernate.
© Life of a third world developer
Maira Gall