Jun 04, 2026

Why PostgreSQL Never Overwrites Rows: Understanding MVCC Explained

Think about what happens when you update a row in a database. The intuitive mental model is that PostgreSQL finds the row on disk, changes the value, and saves it back. Clean, simple, done. That mental model is completely wrong, and understanding why changes how you think about everything from query performance to VACUUM to mysterious table bloat.

PostgreSQL never overwrites a row. Not for updates. Not even for deletes. Every change creates a new version of the row while leaving the old one in place. This is MVCC — Multi-Version Concurrency Control — and it is the reason PostgreSQL can let thousands of readers and writers operate simultaneously without locking each other out.

The Secret Fields Inside Every Row

Every row stored in PostgreSQL carries hidden metadata that most developers never see. Two of these fields are the heart of the whole system.

The first is xmin — the transaction ID of the transaction that inserted this row version. When you insert a row, PostgreSQL takes the current transaction's ID and stamps it into the row. That stamp says: This version of the row was born here.

The second is xmax — the transaction ID of the transaction that deleted or replaced this row version. A freshly inserted row has xmax set to zero, meaning nobody has deleted it yet. When a row is deleted or updated, PostgreSQL writes the deleting transaction's ID into xmax. That stamp says: This version of the row died here.

There is also a third field called t_ctid, which is a physical pointer. Every row points to itself by default. When a row is updated, the old version's t_ctid is changed to point to the new version sitting somewhere else on disk. This forms a chain — a linked list of row versions threading through time.

What Happens When You Run UPDATE

When you run an UPDATE, PostgreSQL does not touch the old row's actual data at all. Instead, it writes a completely new physical row somewhere on the page with the updated values. The new row gets xmin set to the current transaction ID and xmax set to zero — brand new, not yet deleted.

Then PostgreSQL goes back to the old row and sets its xmax to the current transaction ID and updates its t_ctid to point to the new row. The old row now says: I was deleted by this transaction, and my replacement lives over there.

A DELETE is even simpler. No new row is written at all. PostgreSQL just sets xmax on the existing row to the deleting transaction's ID and leaves everything else alone. The row is still physically there on disk. It has just been marked as dead.

How a Snapshot Decides What You Can See

When a transaction starts, PostgreSQL takes a snapshot of the world. A snapshot is three things: the lowest transaction ID that was active at that moment, the highest transaction ID that had been assigned, and a list of every transaction that was in flight between those two numbers. Together, these three pieces of data define a window in time — a precise picture of which transactions had been committed and which had not.

When your query reads a row, it runs that row through a visibility check. The logic goes roughly like this: is xmin committed, and is that commit visible to my snapshot? If the transaction that inserted this row was still running when I took my snapshot, I cannot see the row — it might not commit. If xmin was committed after my snapshot was taken, I also cannot see it. If xmin committed before my snapshot and the transaction is not in my in-progress list, the row exists for me.

Then the same question is asked about xmax. If xmax is zero, the row has not been deleted — I can see it. If xmax is set to a transaction that aborted, the deletion never happened — I can still see the row. If xmax is set to a transaction that was still running when I took my snapshot, the deletion is not committed as far as I am concerned — I can still see the row. Only if xmax is set to a transaction that committed and that commit is visible to my snapshot does the row disappear from my view.

This is why two transactions can query the same table at the same time and see completely different data. They are not looking at the same thing. Each one has its own snapshot and its own window of visibility, and the heap contains enough information to answer both questions independently.

The Hint Bits Shortcut

Checking whether a transaction committed sounds expensive — it requires consulting the commit log, a persistent file on disk that stores two bits per transaction recording whether it committed or aborted. Do this for every row in every query, and the overhead adds up.

PostgreSQL cheats. Once it has looked up the commit status of a transaction and confirmed that xmin committed, it writes that fact directly into the row's infomask — a small set of status flags in the tuple header. These are called hint bits. The next time any backend looks at that row, it sees the flag and skips the commit log lookup entirely. No lock is needed to set a hint bit because it is an idempotent operation — multiple backends can write the same flag simultaneously without corrupting anything.

The Dead Tuple Problem

Now you can see the problem. Every update leaves behind an old row version. Every delete leaves behind a ghost. Over time, a table that sees heavy write activity accumulates enormous numbers of dead tuples — rows that no living transaction will ever see again, but which are still physically sitting on disk, taking up space.

This is where VACUUM comes in. VACUUM's job is to find dead tuples and mark their space as reusable. But it cannot simply remove any tuple with a non-zero xmax. It has to be certain that no currently running transaction could possibly need to see the old version. To determine this, PostgreSQL computes the oldest transaction ID that any open snapshot could care about. Any tuple whose xmax is older than that threshold is provably invisible to everyone alive and can be safely reclaimed.

If VACUUM falls behind — if writes are heavy and autovacuum cannot keep up — dead tuples accumulate, pages fill with invisible garbage, and queries start reading more data than they need. This is table bloat, and it is a direct consequence of MVCC's approach to immutability.

The Upside of All This Complexity

The reason PostgreSQL goes to all this trouble is concurrency. Because old row versions are never overwritten, a long-running read transaction can sit comfortably reading a snapshot from five minutes ago while write transactions hammer the same tables at full speed. Nobody blocks anybody. Readers and writers do not compete for the same locks. The database can serve hundreds of simultaneous connections with minimal coordination overhead.

The cost is storage and the need for VACUUM. The benefit is that your reporting query does not slow down your order processing pipeline, and your order processing pipeline does not make your reporting query see partial results. That trade-off is at the core of every PostgreSQL installation, running quietly underneath every query you have ever sent it.