How to handle ACID transactions in a distributed system?
Scenario
We are developing a trip application. The requirement is that the application users would provide the origin and the destination locations. Based on the inputs the application’s logic hasto book flights, cabs, and hotel rooms for the entire trip. Let’s say if any of the subsequent booking fails, then we’d have to cancel the whole trip.
Monolith
It is easy to ensure this in a monolith. You will create all the bookings in a single transaction. If we fail to save either or them the whole transaction is rolled back. Our monolith code could be something as follows:
ActiveRecord::Base.transaction do
@flight_ticket = FilghtTicket.new(...)
@flight_ticket.save!
@cab = Cab.new(...)
@cab.save!
@hotel_reservation = HotelReservation.new(...)
@hotel_reservation.save!
end
Microservices
In a microservice architecture, we would be having separate services to handle booking flight tickets, booking a cab and making a hotel reservation. The state is spread all around the system. Using a transaction is no more an option as each of the services have their own database.
Approaches to solving the problem
Spanner
Google’s globally distributed SQL database.
Two Phase Commit
As the name suggests there are 2 phases for saving the data to the database
Prepare phase:
- Co-ordinator proposes to the all the microservices if they can process a particular request and the microservices vote on whether they can do it or not and block the resources required to execute the request in case they can process it.
Commit/Abort phase:
- If all the microservices return true then the coordinator asks the microservice to process the request. They respond back as done to the Coordinator once it is complete.
- Even if one of the microservice returns false, the coordinator will send an abort request to all the microservices so that they can release the resources.
Disadvantages of this approach
- It is not good at scaling.
- Holding resources is not a good option
- The coordinator is a single point of failure
Distributed Sagas
Let’s discuss this approach in more detail in the next section.
Distributed Sagas
This is a pattern for managing failures in distributed systems. A saga represents a high-level business process. This process consists of several low-level requests each updating data in a single service. Let’s take our example of creating a trip, here the creation of the trip is a saga and booking a flight ticket, a cab, and a hotel reservation are the low-level requests, each updating data in a single service. Each of these requests has a compensating request like “cancel flight ticket”, “cancel the cab” and “cancel hotel reservation”. The compensating request restores the state of the system to the state before the request was executed. When any of the low-level requests fail, the compensating requests for the low-level requests that have run successfully are executed.
If a saga completes execution then either all the low-level transactions have completed successfully or a set of requests and corresponding compensating requests have been executed.
Core Principles
- A request can abort
- Requests must be idempotent
- How many ever times we make the same request, the system should be as if the request was executed only once.
- Reason: The service might receive the same request multiple times.
- Compensating requests should semantically undo the effect of a request ex:
- Refunding a charged amount
- Sending a follow-up email to correct the first email
- A compensating request cannot abort
- A compensating request must be idempotent
- A compensating request must be commutative with respect to the request.
- Even if it receives a compensating request before an actual request, the net effect should be as if the actual request and compensating request have executed in order.
Disadvantages of using Distributed Sagas
- Developers must design compensation transactions explicitly to undo changes made earlier in the saga.
- All the changes are visible to the systems during the execution of the saga i.e the flight reservation is visible even before the execution of the saga completes.
- Lack of isolation, this could lead to Lost update or Dirty reads