How Microsoft made Azure SQL more available by making recovery faster

0
109


microsoft-azure-SQL.jpg
Image: traffic_analyzer/Getty Images

Databases are wonderful things, until they go wrong. Then you’re left waiting for a recovery to unwind long transactions and replay log files. The more complex the transactions, the longer recovery can take, leaving you relying on alternative solutions to keep your business running. Slow recovery is expensive, costing time and money.

So how can we get our databases back quickly, with minimal disruption to business processes? Sometimes that requires going back to basic research, or in this case to a 2019 Microsoft Research paper entitled “Constant Time Recovery in Azure SQL Database”. The paper is fascinating, promising a way of recovering databases in constant time with smaller log files. While it’s clearly aimed at cloud services, where keeping recovery storage to a minimum has distinct economic value, it’s a technique that can work anywhere. The benefits certainly seem clear, with most recoveries taking less than three minutes instead of hours.

The old way: ARIES

Most databases, including SQL Server and Azure SQL, depend on the ARIES protocol. The Algorithm for Recovery and Isolation Exploiting Semantics is a fundamental tool for modern databases, using write-ahead logging to provide a way of recovering database state. ARIES uses three types of log, undo-only, redo-only, and undo-redo. These record each update, with the type of log determining what type of recovery can be attempted, with the aim of either recovering to a before or after state for each update operation on your database, handling changes to your records and tables.

Both SQL Server and Azure SQL use the transaction log to replay operations from the last known good state, analyzing the log to determine the state of all recorded transactions, redoing committed transactions in order from oldest to newest, while undoing uncommitted transactions in reverse order, from newest to oldest. Once this process is complete, your database is back online and ready to use. There’s a lot for SQL Server to handle in an ARIES recovery, as a complete recovery can require two or three passes through the logs, especially when you’re dealing with long transactions that require multiple updates.

SEE: Resource and data recovery policy (TechRepublic Premium)

There are ways to improve performance here, usually involving taking advantage of parallel processing techniques, but they require significant hardware investments, needing high-powered servers. That might be fine for on-premises databases where you’re prepared to have the necessary hardware in hand, but it’s not really practical in the cloud where not only are database running on optimized hardware, but they can also often be orders of magnitude bigger than on your own hardware. There are other issues around cloud platforms’ use of multi-tenancy, where compute resources are shared on hardware running at a higher load. The resulting constraints on cloud operations mean failover and recovery are far more likely than on premises, so having a faster recovery process is essential.

And now for the new: ADR

Microsoft’s new database recovery technique is intended to avoid the problems that come with cloud recovery, keeping downtime to a minimum while not requiring extra resources, no matter how big the workload.

The new constant time recovery algorithm mixes ARIES with SQL Server’s Multi-version Concurrency Control. Designed to manage concurrent transactions, MVCC uses a temporary database to store different versions of rows as data updates, marshalling changes based on transaction IDs and time stamps. This store is deleted every time the database is restarted as it’s only needed to maintain isolation between concurrent transactions. Old rows roll off the database, keeping storage requirements low.

This store is the basis for constant time recovery, which both Azure SQL and SQL Server now calls Accelerated Database Recovery (ADR). Instead of waiting for transaction logs to roll back, you now get nearly instantaneous transaction rollback, with no effects from long-running transactions. And as the underlying system is based on the MVCC store, it’s kept small, or as Microsoft describes it, “aggressively truncated.” This version of the MVCC is part of the user database, and is referred to as the Persisted Version Store (the original MVCC remains a separate store and continues to support Azure SQL’s concurrency features).

Implementing Constant Time Recovery

While the recovery process is much like that used by ARIES, it’s a lot quicker as it doesn’t need to process the entire transaction log. Instead, the data in the new PVS is used to provide an instant replay of transactions between the last checkpoint and the oldest uncommitted transaction. This leaves only a short period of transactions to redo. At the same time a new in-memory log stream is used to replay non-versioned operations, like cache management and locks, and at the same time keeping the transaction log small. This secondary log stream (the sLog) is copied to the transaction log at checkpoints, keeping it small but ensuring that its data is there if you need to replicate an entire database as part of a business recovery process.

If you’re using Azure SQL Database and Aure SQL Managed Instance, then you’re already using ADR. It’s available for other SQL Server-based databases too, including SQL server 2019 and Azure Synapse SQL. If you’re using SQL Server 2019 ADR might not be necessary for all your operations, but it’s certainly worth considering if you have long transactions, large transaction logs, or where database recovery has caused significant outages in the past. There are some cases, like when you’re using database mirroring, that it’s not supported.

To turn ADR on, use the SQL Server command line to toggle ADR on or off, and to define the filegroup used to host its PVS data. The process can take some time, affecting normal operations as it locks your database while it runs, preventing new sessions from starting until setup is complete. Once the lock is released, your database will be protected by ADR.

It’s a simple change that has a big effect, avoiding significant down times in complex data applications. But like all simple changes, that one line of code depends on a lot of work by Microsoft’s SQL Server team and their Microsoft Research counterparts. This is one occasion where we can easily see how research becomes a product, and how it can have a big impact not only on Azure but in our own data centers.



Source link

Leave a reply

Please enter your comment!
Please enter your name here