In the previous post, we took a look at implementing soft deletes, so that we aren't destroying our data. But what about updates? Updates are pretty destructive too, since they are essentially deleting a row and replacing it. At the very least, most would want an audit log of some sort to know who updated that particular row. In healthcare, if someone updates the patient's name, that is a big deal. We need to make sure that track the history of any given entity.
If you haven't read the previous post on soft deletes, go do that now. We'll build on top of our understanding of soft deletes to implement soft updates.
Going beyond soft deletes with soft updates
To handle soft updates, we'll introduce a few more concepts to our schema.
- An integer
versioncolumn to track versions of an entity
- A history table will track old values of rows, referencing the original table.
- A trigger to copy the "deleted" row on an update.
1. The version column
version column defaults to 1 and will increment with each update.
version_on column tracks the timestamp for each update.
version_by column tracks the user responsible for updating the row.
created_on is mostly convenience, since we could get created on by querying for the
version_on of the entity where
version = 1.
In this case, we'll use a
is_removed boolean column to track deletes since we have the time of the deletion with tracked in the
CREATE TABLE customer ( customer_id serial PRIMARY KEY, name varchar(256) NOT NULL, version int DEFAULT(1) NOT NULL, version_on timestamptz NOT NULL DEFAULT current_timestamp, version_by varchar(64) NOT NULL, created_on timestamptz NOT NULL DEFAULT current_timestamp, is_removed boolean NOT NULL DEFAULT(false) );
2. The history table
The history table should track our versioning columns, as well as any columns that can change from the original table. We don't need a unique constraint here, but we do want a foreign key to reference the original table.
CREATE TABLE customer_history ( customer_id int, name text NOT NULL, version int, version_on timestamptz NOT NULL DEFAULT current_timestamp, version_by varchar(64) NOT NULL, is_removed boolean NOT NULL PRIMARY KEY(customer_id, version) ); ALTER TABLE customer_history ADD CONSTRAINT fk_customer_history_customer FOREIGN KEY ( customer_id ) REFERENCES customer ( customer_id ) ON UPDATE CASCADE ON DELETE RESTRICT; CREATE INDEX fki_customer_history_customer ON customer_history (customer_id);
3. The trigger
On each update to our customer table, we want to trigger an insert to the history table. In postgres, there are two parts to this.
- The trigger function
- The trigger itself
The trigger function defines what to do. In our case, we want to increment the
version and set the
version_on for the new row in customer. Then we insert the old row into our history table, and return the
NEW row. The ability in postgres to modify the
NEW row before it is inserted is really nice, since we can keep
version_on details hidden from most queries.
CREATE OR REPLACE FUNCTION tr_update_customer() RETURNS TRIGGER AS $$ BEGIN NEW.version := OLD.version + 1; NEW.version_on := now(); INSERT INTO customer_history (customer_id, name, version, version_on, version_by, is_removed) VALUES (OLD.customer_id, OLD.name, OLD.version, OLD.version_on, OLD.version_by, OLD.is_removed); RETURN NEW; END; $$ LANGUAGE plpgsql;
The trigger itself defines when to execute our trigger function. In our case, we want to run the trigger before an
UPDATE on the customer table. Then, for each row, run the trigger.
CREATE TRIGGER tr_update_customer BEFORE UPDATE ON customer FOR EACH ROW EXECUTE PROCEDURE tr_update_customer();
Updates as usual
Updating a table is almost the same as usual, but don't forget to include the
version_by. We can prevent conflicts by ensuring that the version passed in is still the current version. In this case, no rows would be updated. Either the application code needs to handle this, throwing a conflict error, or the update can be put into a stored procedure that throws an error.
UPDATE customer SET name = 'sabre', version_by = 'Jo Bennett' WHERE name = 'dunder mifflin' AND version = 1;
customer table should now look like this:
|2||vance refrigeration||1||August, 04 2015 11:46:57||Bob Vance||August, 04 2015 11:46:57||false|
|1||sabre||2||August, 04 2015 11:52:45||Jo Bennett||August, 04 2015 11:46:57||false|
customer_history table should look like this:
|1||dunder mifflin||1||August, 04 2015 11:46:57||David Wallace||false|
So, we have our history saved and we can see who created Dunder Mifflin, as well as who updated it and when. As always, here is the SQL Fiddle.
Note for this fiddle: SQL fiddle splits queries on semi-colons, but I needed semi-colons inside the functions. So, I had to use
// to delimit each query in the schema section.
To find the most current version of an entity is simple. In our example, the
customer table always represents the most current version of the entity. We still need to check for
SELECT * FROM customer WHERE is_removed IS FALSE;
To find the first version of an entity, we want to select from
customer_history where the
version = 1.
WITH original_customer AS ( SELECT customer_id, name, version, version_on, is_removed FROM customer WHERE version = 1 UNION ALL SELECT customer_id, name, version, version_on, is_removed FROM customer_history ch WHERE version = 1 ) SELECT * FROM original_customer
To get the version of an entity at a specific point in time requires a little more work, but it is still possible. We want to select the latest version of customer from either the
customer_history tables where that version of the customer existed before the given time.
WITH customer_in_time AS ( SELECT DISTINCT ON(customer_id) customer_id, name, version, version_on, is_removed FROM ( SELECT customer_id, name, version, version_on, is_removed FROM customer WHERE version_on < '2015-08-05 06:48:52-05' UNION ALL SELECT customer_id, name, version, version_on, is_removed FROM customer_history ch WHERE version_on < '2015-08-05 06:48:52-05' ORDER BY version_on DESC ) c ORDER BY c.customer_id, c.version DESC ) SELECT * FROM customer_in_time
DISTINCT ON and
ORDER BY here allows us to easily get the latest version of a customer from the resulting rows matching the
If changes to your data really matter, consider using soft updates in addition to soft deletes. For those of us in healthcare, this is a necessity. Of course, you can also consider alternatives like Event Sourcing and Datomic. CQRS is the idea of storing all the changes to an entity rather than the current state of the entity. Datomic is an immutable data store, so you get soft everything for free. I'm very interested in the potential of Datomic and systems like Datomic for this very reason (for which they might be part of a future blog post). Nevertheless, we're using an RDBMS for better or worse, and even though it takes effort to keep history in a traditional RDBMS, it's worth it for us.
After posting to Hacker News, there were a lot of great suggestions and references to other implementations and examples. Check them out now at Hacker News.
As I have time to review some of these, I'll either update this post or write some follow up articles.
- Removed exercises and added some example queries instead, including how to make use of the history table
- Corrected reference of CQRS to be Event Sourcing.
is_removedto history table. Necessary because one could undo a deletion.
UPDATEquery to use
versionto prevent concurrency issues.