Auto-Increment IDs Are Not Strictly Monotonic: Here Are Some Solutions

Intro

Most people assume that Auto-Increment IDs are always increasing and complete. You may be surprised to learn that they are guaranteed neither to always increase nor to be free of gaps.

Monotonically increasing means IDs only ever go up. Here’s an example of how that can fail: first you see rows 1, 2, 3, 4, 6, and then later you see 1, 2, 3, 4, 5, 6. “5” appears in the table after “6” does. I call this a “2 before 1” issue.

And by complete I mean there won’t be gaps. That’s also not true: you can get gaps in your IDs, e.g. 1, 2, 3, 4, 6 where ‘5’ will never appear.

The obvious time it can bite you is if you’re using bookmarks to save a process’s position in a large table. If you’ve saved your position as “6” (meaning you’ve read all the IDs before “6”) and then “5” appears, your bookmarked process has moved on and will never read “5”.
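
As a minimal sketch of what such a bookmarked reader might run (the events table name here is hypothetical, and the bookmark is the highest ID read so far):

  -- Read the next batch after the saved bookmark of 6; a row with id 5 that
  -- commits later will never be matched by this query again
  SELECT id, data
  FROM events
  WHERE id > 6        -- the saved bookmark
  ORDER BY id
  LIMIT 1000;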

Isolation Levels

The first key thing to note is that auto-increment IDs are not part of database transactions: they are not rolled back if a transaction rolls back, and they do not follow transaction isolation rules.

Unfortunately, even Serializable isolation can’t help us here. Both the “2 before 1” issue and the gaps issue can occur even under Serializable, the strictest isolation level offered by MySQL and Postgres.

Because Serializable is the strictest level, people wrongly assume transactions are executed one after the other (serially) and don’t overlap. This isn’t true. What Serializable gives us is a total ordering of transactions, not serial execution: everything is consistent as if all transactions were executed in some serial order, but they can still be executed in parallel under the hood. That serial order isn’t known in advance and isn’t necessarily the transaction start order; the database effectively chooses the order based on how it wants to interleave transactions for best performance.

A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins.

From: (Second Informal Review Draft) ISO/IEC 9075:1992, Database Language SQL, July 30, 1992

Taken too literally, a transaction with a single SELECT could return an empty set despite the table containing rows, because we could always place that transaction at the beginning of the database’s history when everything was empty. In reality databases do a pretty good job of returning what you would expect in most cases and don’t take this definition that far.

So what actually happens in our 2 before 1 case under Serializable mode (and every other isolation level)?

  1. Both INSERTs, A and B, are running in parallel (allowed under Serializable).
  2. A takes position 5 from the sequence number generator and B takes position 6.
  3. Both send a commit request.
  4. B’s commit is processed first.
  5. A’s commit is processed.

Between steps 4) and 5) is when you could read the whole table and see 6 but not 5.

Note also that if A rolled back at 5) we would have gaps in our auto-increment field and would never see ID 5 in the table.
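
To make the interleaving concrete, here’s a rough sketch of the two sessions (an events table with a serial id column is assumed):

  -- Session A
  BEGIN;
  INSERT INTO events (data) VALUES ('a');   -- takes id 5 from the sequence

  -- Session B, running concurrently
  BEGIN;
  INSERT INTO events (data) VALUES ('b');   -- takes id 6 from the sequence
  COMMIT;                                   -- id 6 is now visible to readers

  -- A reader at this point sees ..., 4, 6 but not 5

  -- Session A
  COMMIT;   -- id 5 appears after id 6 (a ROLLBACK here would leave a permanent gap at 5)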

One nuance to reiterate here is that the sequence number generator is not part of the transaction. If it were then Serializable mode would not be able to overlap these transactions.

What we actually want is Linearizable, which is stricter than Serializable. Linearizable means operations behave as if executed one at a time in real-time order: the results of one statement are visible before the next statement starts. Offering Linearizability is a niche requirement that isn’t worth the trade-offs for most databases (some newer ones like CockroachDB aside) and can be simulated with database locks anyway.

Solutions

1. Locks and Batching

This is my recommended solution as it’s simple and easy to test for correctness.

The solution is to take out an advisory lock to ensure each insert is done in isolation. This does indeed solve the 2 before 1 problem. But now we have a lock that slows everything down.

The solution to the performance problem is to batch inserts. Rather than take out 1 lock and insert 1 row, we take out 1 lock and insert 1k rows. We only pay the performance penalty of one lock, which is insignificant compared to the time to transfer and store 1k rows. This brings performance back up close to that of parallel inserts.
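
Here’s a rough sketch in Postgres. The lock key 42 and the events table are placeholders; every writer to the table must agree to take the same advisory lock:

  BEGIN;
  SELECT pg_advisory_xact_lock(42);           -- serialise writers on an agreed key
  INSERT INTO events (data)
  VALUES ('row 1'), ('row 2'), ('row 3');     -- in practice ~1k rows per batch
  COMMIT;                                     -- the lock is released automatically at commit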

All rows inserted in a batch are either all accepted and inserted at effectively the exact same time, or they’re all rejected leaving a large gap.

2. No Sequence Number Generator

Another approach, sketched below, is to:

  1. Begin the transaction
  2. Insert a batch of commands without locking, and without filling in the ID field
  3. Take an advisory lock out
  4. Fill in the blanks of all rows from the application level or in a stored proc
  5. Release the advisory lock
  6. Commit the transaction
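
A very rough sketch of these steps in Postgres, assuming a hypothetical events table with a nullable position column. Here the positions are handed out from a sequence (events_position_seq, also hypothetical) only while the lock is held; the application or a stored proc could equally compute them itself, and in practice you would restrict the UPDATE to the rows you just inserted:

  -- 1. begin the transaction
  BEGIN;

  -- 2. batch insert without the lock, leaving the position column blank
  INSERT INTO events (data) VALUES ('row 1'), ('row 2'), ('row 3');

  -- 3. take the advisory lock only for the cheap part
  SELECT pg_advisory_xact_lock(42);

  -- 4. fill in the blanks while holding the lock
  UPDATE events SET position = nextval('events_position_seq') WHERE position IS NULL;

  -- 5. and 6. the lock is released automatically when the transaction commits
  COMMIT;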

Similarly, you can write to a temporary table first, then take the lock and copy the rows across. Copying within the database is much faster than writing to the database over a connection.

This approach assumes most of the time is spent on the batch insert, which may be true if your batch sizes need to be very large. If so, this method would be faster than the previous one, albeit more complicated to implement. In the real world it usually isn’t worth the effort.

3. Storage Engine ID

Some databases have IDs for the actual block storage location of rows: ROWID in Oracle, ctid in Postgres. I don’t recommend using them as they don’t come with many guarantees.

4. Transaction IDs on the Write Side

I didn’t consider this approach because of the trickiness of txids, but then I was proven wrong by my colleague Bruno Medeiros, who was already using it successfully in production!

You can use the transaction IDs of the currently running transactions plus a sequence number to order things. The table would have two columns: tx_id (the transaction ID of the writing transaction) and position (an auto-increment sequence number). The bookmark must contain both tx_id and position. The rows are ordered by tx_id first, then by position. However, we can only read rows whose tx_id is less than the minimum txid of the currently executing transactions.

Then you can pull all records out in a deterministic order without missing any with the following Postgres-compatible SQL:

...
  WHERE (
    tx_id > bookmark.tx_id
    OR (tx_id = bookmark.tx_id AND position > bookmark.position)
  )
  AND tx_id < txid_snapshot_xmin(txid_current_snapshot())
  ORDER BY tx_id, position
-- txid_snapshot_xmin(txid_current_snapshot()) returns the lowest txid of currently running (uncommitted) transactions
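
On the write side, a minimal sketch might look like this (the events table is hypothetical; each writer stamps its rows with its own transaction ID via txid_current()):

  CREATE TABLE events (
    tx_id    bigint    NOT NULL,   -- transaction ID of the writing transaction
    position bigserial NOT NULL,   -- auto-increment sequence number
    data     text
  );
  CREATE INDEX events_tx_id_position_idx ON events (tx_id, position);

  INSERT INTO events (tx_id, data)
  VALUES (txid_current(), 'row 1'),
         (txid_current(), 'row 2');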

Downsides of this approach are:

  • You don’t get a single incrementing sequence number. That’s usually not a big deal though.
  • Transaction IDs are a global concept in the database, so if any writing transaction (including ones not involving this table) takes a long time, the read side of this process will be blocked for that long.
  • Txids are becoming more commonly used, but they are still very rarely used at the application level. You may run into txid sync issues if you swap primary DBs, do an upgrade, etc.

5. Transaction IDs on the Read Side

Let’s say you have a table that is not employing any of these techniques. Is it possible to read from it using bookmarks and without missing any rows? Yes, using transaction IDs.

Here’s the steps:

  1. Keep a bookmark of the max(xmin) (which we’ll call previous_max_xmin) of the read rows
  2. Read the next batch of rows where xmin > previous_max_xmin and xmin < txid_snapshot_xmin(txid_current_snapshot())
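
A rough sketch of step 2 in Postgres, ignoring the wrap-around caveat discussed below (the events table is hypothetical; the xmin::text::bigint cast is needed because xmin is a system column of type xid):

  SELECT *, xmin::text::bigint AS row_xid
  FROM events
  WHERE xmin::text::bigint > 12345    -- previous_max_xmin bookmark (placeholder value)
    AND xmin::text::bigint < txid_snapshot_xmin(txid_current_snapshot())
  ORDER BY xmin::text::bigint;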

The downsides are very similar to the previous solution: any long-running transactions will block the reader. A bigger problem with this method is that xmin is a system column and is not indexed, so for large tables this query will not perform well, which is why the previous solution added its own txid column. You also have to be careful with xid wrap-around, which can occur on these system columns.

Note that if you are doing updates on your rows, you will read any updated rows even if you have read those rows previously. To additionally track DELETEs use xmax and keep a bookmark.

I haven’t tested this method myself so proceed with caution and do some solid testing.

6. Wait for Current Transactions to Finish

This takes advantage of the fact that once all currently running transactions have finished, every ID gap visible before they finished will either have been filled or will never be filled.

  1. Read highest auto-increment ID (high_id)
  2. Read the highest currently-running txid (high_txid)
  3. Wait for all transactions up to and including high_txid to complete (i.e. current min xid is greater than previous high_txid)
  4. It is now safe to read all IDs lower than high_id
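
A rough sketch of these steps in Postgres (the events table and the placeholder values are hypothetical):

  -- 1. highest auto-increment ID visible right now
  SELECT max(id) AS high_id FROM events;

  -- 2. highest txid that could currently be running
  --    (txid_snapshot_xmax is the first unassigned txid, so subtract 1)
  SELECT txid_snapshot_xmax(txid_current_snapshot()) - 1 AS high_txid;

  -- 3. poll until the oldest still-running txid has moved past high_txid
  SELECT txid_snapshot_xmin(txid_current_snapshot()) > 12345 AS safe;   -- 12345 = high_txid

  -- 4. once safe, everything up to high_id is stable and can be read
  SELECT * FROM events WHERE id <= 67890 ORDER BY id;                   -- 67890 = high_id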

The only downside is that you will be delayed by as much as your slowest-running transaction. Note that read-only transactions don’t consume an xid, so one strategy is to pull slow-running queries out of transactions that do writes.

This could be optimised further by only employing the wait when you actually hit an out-of-order ID or a gap. Something like:

  1. Read a batch after bookmark.sequence, and also read the highest currently-running txid (high_txid)
  2. All IDs that are in order without gaps after bookmark.sequence are good to read; read them and go back to step 1)
  3. If we hit an out-of-order ID or a gap in the sequence numbers in the batch, drop all the rows after the last good row and wait for all transactions up to and including high_txid to complete (i.e. the current min xid is greater than the previous high_txid)
  4. It’s now safe to read from the gap onwards

7. Write-Ahead Log

As a last resort you can tail the Write-Ahead Log to get every change made to the database in a serial order. This would also allow you to replicate tables that have updates and deletes, unlike the above methods, which are for insert-only tables.
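
As a rough sketch of what tailing looks like in Postgres, using logical decoding with the built-in test_decoding plugin (the slot name is arbitrary):

  -- Create a logical replication slot; Postgres retains WAL until the slot consumes it
  SELECT pg_create_logical_replication_slot('my_slot', 'test_decoding');

  -- Pull the next decoded changes (inserts, updates and deletes, in commit order)
  SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, NULL);

  -- Drop the slot when finished, otherwise the retained WAL can fill up storage
  SELECT pg_drop_replication_slot('my_slot');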

I say this method is a last resort for the following reasons:

  • You’re now working at a low level in the database, which can be tricky and possibly database-version dependent.
  • The ability to test is reduced with this method
  • If you have a high-throughput database and you do not keep up with the changes you can cause storage to fill up
  • For RDS Postgres 11 and prior you have to read the WAL off the primary database.

Summary

In summary, if you need the guarantee that IDs are monotonically increasing then use an advisory lock with batching to improve performance. This doesn’t guarantee that the sequence numbers have no gaps, but does guarantee they are monotonically increasing.

If you have an existing table where you don’t want to change the write-side then my preference is “6. Wait for Current Transactions to Finish”. This allows you to replicate the table without missing rows as long as the table is insert-only.

As a last resort look at tailing the Write-Ahead Log.

updated_at 19-05-2021