Log manager

We’ve already seen that to increase its performances, a database stores data in memory buffers. But if the server crashes when the transaction is being committed, you’ll lose the data still in memory during the crash, which breaks the Durability of a transaction.

You can write everything on disk but if the server crashes, you’ll end up with the data half written on disk, which breaks the Atomicity of a transaction.

Any modification written by a transaction must be undone or finished.

To deal with this problem, there are 2 ways:

Shadow copies/pages: Each transaction creates its own copy of the database (or just a part of the database) and works on this copy. In case of error, the copy is removed. In case of success, the database switches instantly the data from the copy with a filesystem trick then it removes the “old” data.
Transaction log: A transaction log is a storage space. Before each write on disk, the database writes an info on the transaction log so that in case of crash/cancel of a transaction, the database knows how to remove (or finish) the unfinished transaction.

WAL

The shadow copies/pages creates a huge disk overhead when used on large databases involving many transactions. That’s why modern databases use a transaction log. The transaction log must be stored on a stable storage. I won’t go deeper on storage technologies but using (at least) RAID disks is mandatory to prevent from a disk failure.

Most databases (at least Oracle, SQL Server, DB2, PostgreSQL, MySQL and SQLite) deal with the transaction log using the Write-Ahead Logging protocol (WAL). The WAL protocol is a set of 3 rules:

1) Each modification into the database produces a log record, and the log record must be written into the transaction log before the data is written on disk.
2) The log records must be written in order; a log record A that happens before a log record B must but written before B
3) When a transaction is committed, the commit order must be written on the transaction log before the transaction ends up successfully.

This job is done by a log manager. An easy way to see it is that between the cache manager and the data access manager (that writes data on disk) the log manager writes every update/delete/create/commit/rollback on the transaction log before they’re written on disk. Easy, right?

WRONG ANSWER! After all we’ve been through, you should know that everything related to a database is cursed by the “database effect”. More seriously, the problem is to find a way to write logs while keeping good performances. If the writes on the transaction log are too slow they will slow down everything.

ARIES

In 1992, IBM researchers “invented” an enhanced version of WAL called ARIES. ARIES is more or less used by most modern databases. The logic might not be the same but the concepts behind ARIES are used everywhere. I put the quotes on invented because, according to this MIT course, the IBM researchers did “nothing more than writing the good practices of transaction recovery”. Since I was 5 when the ARIES paper was published, I don’t care about this old gossip from bitter researchers. In fact, I only put this info to give you a break before we start this last technical part. I’ve read a huge part of the research paper on ARIES and I find it very interesting! In this part I’ll only give you an overview of ARIES but I strongly recommend to read the paper if you want a real knowledge.

ARIES stands for Algorithms for Recovery and Isolation Exploiting Semantics.

The aim of this technique is double:

1) Having good performances when writing logs
2) Having a fast and reliable recovery

There are multiple reasons a database has to rollback a transaction:

Because the user cancelled it
Because of server or network failures
Because the transaction has broken the integrity of the database (for example you have a UNIQUE constraint on a column and the transaction adds a duplicate)
Because of deadlocks

Sometimes (for example, in case of network failure), the database can recover the transaction.

How is that possible? To answer this question, we need to understand the information stored in a log record.

The logs

Each operation (add/remove/modify) during a transaction produces a log. This log record is composed of:

LSN: A unique Log Sequence Number. This LSN is given in a chronological order*. This means that if an operation A happened before an operation B the LSN of log A will be lower than the LSN of log B.
TransID: the id of the transaction that produced the operation.
PageID: the location on disk of the modified data. The minimum amount of data on disk is a page so the location of the data is the location of the page that contains the data.
PrevLSN: A link to the previous log record produced by the same transaction.
UNDO: a way to remove the effect of the operation

For example, if the operation is an update, the UNDO will store either the value/state of the updated element before the update (physical UNDO) or the reverse operation to go back at the previous state (logical UNDO)**.

REDO: a way replay the operation

Likewise, there are 2 ways to do that. Either you store the value/state of the element after the operation or the operation itself to replay it.

…: (FYI, an ARIES log has 2 others fields: the UndoNxtLSN and the Type).

Moreover, each page on disk (that stores the data, not the log) has id of the log record (LSN) of the last operation that modified the data.

*The way the LSN is given is more complicated because it is linked to the way the logs are stored. But the idea remains the same.

**ARIES uses only logical UNDO because it’s a real mess to deal with physical UNDO.

Note: From my little knowledge, only PostgreSQL is not using an UNDO. It uses instead a garbage collector daemon that removes the old versions of data. This is linked to the implementation of the data versioning in PostgreSQL.

To give you a better idea, here is a visual and simplified example of the log records produced by the query “UPDATE FROM PERSON SET AGE = 18;”. Let’s say this query is executed in transaction 18.

Each log has a unique LSN. The logs that are linked belong to the same transaction. The logs are linked in a chronological order (the last log of the linked list is the log of the last operation).

Log Buffer

To avoid that log writing becomes a major bottleneck, a log buffer is used.

When the query executor asks for a modification:

1) The cache manager stores the modification in its buffer.
2) The log manager stores the associated log in its buffer.
3) At this step, the query executor considers the operation is done (and therefore can ask for other modifications)
4) Then (later) the log manager writes the log on the transaction log. The decision when to write the log is done by an algorithm.
5) Then (later) the cache manager writes the modification on disk. The decision when to write data on disk is done by an algorithm.

When a transaction is committed, it means that for every operation in the transaction the steps 1, 2, 3,4,5 are done. Writing in the transaction log is fast since it’s just “adding a log somewhere in the transaction log” whereas writing data on disk is more complicated because it’s “writing the data in a way that it’s fast to read them”.

STEAL and FORCE policies

For performance reasons the step 5 might be done after the commit because in case of crashes it’s still possible to recover the transaction with the REDO logs. This is called a NO-FORCE policy.

A database can choose a FORCE policy (i.e. step 5 must be done before the commit) to lower the workload during the recovery.

Another issue is to choose whether the data are written step-by-step on disk (STEAL policy) or if the buffer manager needs to wait until the commit order to write everything at once (NO-STEAL). The choice between STEAL and NO-STEAL depends on what you want: fast writing with a long recovery using UNDO logs or fast recovery?

Here is a summary of the impact of these policies on recovery:

STEAL/NO-FORCE needs UNDO and REDO: highest performances but gives more complex logs and recovery processes (like ARIES). This is the choice made by most databases. Note: I read this fact on multiple research papers and courses but I couldn’t find it (explicitly) on the official documentations.
STEAL/ FORCE needs only UNDO.
NO-STEAL/NO-FORCE needs only REDO.
NO-STEAL/FORCE needs nothing: worst performances and a huge amount of ram is needed.

The recovery part

Ok, so we have nice logs, let’s use them!

Let’s say the new intern has crashed the database (rule n°1: it’s always the intern’s fault). You restart the database and the recovery process begins.

ARIES recovers from a crash in three passes:

1) The Analysis pass: The recovery process reads the full transaction log* to recreate the timeline of what was happening during the crash. It determines which transactions to rollback (all the transactions without a commit order are rolled back) and which data needed to be written on disk at the time of the crash.
2) The Redo pass: This pass starts from a log record determined during analysis, and uses the REDO to update the database to the state it was before the crash.

During the redo phase, the REDO logs are processed in a chronological order (using the LSN).

For each log, the recovery process reads the LSN of the page on disk containing the data to modify.

If LSN(page_on_disk)>=LSN(log_record), it means that the data has already been written on disk before the crash (but the value was overwritten by an operation that happened after the log and before the crash) so nothing is done.

If LSN(page_on_disk)<LSN(log_record) then the page on disk is updated.

The redo is done even for the transactions that are going to be rolled back because it simplifies the recovery process (but I’m sure modern databases don’t do that).

3) The Undo pass: This pass rolls back all transactions that were incomplete at the time of the crash. The rollback starts with the last logs of each transaction and processes the UNDO logs in an anti-chronological order (using the PrevLSN of the log records).

During the recovery, the transaction log must be warned of the actions made by the recovery process so that the data written on disk are synchronized with what’s written in the transaction log. A solution could be to remove the log records of the transactions that are being undone but that’s very difficult. Instead, ARIES writes compensation logs in the transaction log that delete logically the log records of the transactions being removed.

When a transaction is cancelled “manually” or by the lock manager (to stop a deadlock) or just because of a network failure, then the analysis pass is not needed. Indeed, the information about what to REDO and UNDO is available in 2 in-memory tables:

a transaction table (stores the state of all current transactions)
a dirty page table (stores which data need to be written on disk).

These tables are updated by the cache manager and the transaction manager for each new transaction event. Since they are in-memory, they are destroyed when the database crashes.

The job of the analysis phase is to recreate both tables after a crash using the information in the transaction log. *To speed up the analysis pass, ARIES provides the notion of checkpoint. The idea is to write on disk from time to time the content of the transaction table and the dirty page table and the last LSN at the time of this write so that during the analysis pass, only the logs after this LSN are analyzed.

5.2.4 Log manager