Distributed Tracing
- What is tracing?
- What is Distributed Tracing?
- Why do we need tracing?
- Microservices vs Monolithic Example: E-commerce
- Why are we not using distributed tracing?
- What is Jaeger?
- Span
- Trace
- Showing the OpenTracing Hot R.O.D. Demo
- Conclusion
- Req/Res Visualisation
- Helps in Debugging
- Root Cause Analysis
Key Takeaways
- Introducing the concept of tracing in distributed systems.
What is Tracing?
In software engineering, tracing involves a specialized use of logging to record information about a program's execution. - Wikipedia
What is Distributed Tracing?
Distributed tracing is a method used to profile and monitor applications, especially those built from many services. It helps pinpoint where failures occur and what causes poor performance. It is also called distributed request tracing.
Why do we need distributed tracing?
Let’s take the example of an E-commerce website.
Given below is the architecture of a monolithic e-commerce app with User management, Order management, Inventory management, etc.
In a monolithic application, tracing can help to debug and analyze the performance of the application since the request coming to the app is processed in one location.
Now, let’s consider a refactoring of the whole monolithic app to different microservices.
After a while, to increase sales, we need to add a recommendation engine.
Soon after the recommendation engine is integrated, the system starts slowing down. It is tempting to blame the recommendation engine because it was added most recently, but what about the existing services? We have no mechanism to identify the bottlenecks. The immediate action would be to collect request and response metrics in each service.
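As a rough illustration of collecting per-service timing, here is a minimal Python sketch. The `timed` helper, the `durations` store and the service names are assumptions made for this example, and the sleeps stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: service name -> list of observed durations (seconds).
durations = defaultdict(list)


@contextmanager
def timed(service):
    """Record how long the wrapped block takes under the given service name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[service].append(time.perf_counter() - start)


def slowest_service():
    """Return the service with the highest average recorded duration."""
    return max(durations, key=lambda s: sum(durations[s]) / len(durations[s]))


# Simulated request handling; the sleeps stand in for real work.
with timed("inventory"):
    time.sleep(0.01)
with timed("recommendation"):
    time.sleep(0.05)

print(slowest_service())
```

Metrics like these show which service is slow on average, but they do not show how a single request travels from one service to the next. That gap is what distributed tracing fills.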
Why are we not using distributed tracing?
Instrumentation is complex
It is very complex and time-consuming to implement distributed tracing in a project.
Vendor Lock-in
Switching from one vendor's distributed tracing implementation to another requires a lot of code changes.
Inconsistent APIs
Tracing APIs differ from one language and library to another, whereas tracing semantics should not depend on the language.
OpenTracing is an API specification which overcomes the aforementioned limitations and makes the integration of distributed tracing seamless within microservices. Jaeger is one implementation of OpenTracing.
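The idea behind a vendor-neutral API can be sketched in a few lines of Python. This is not the OpenTracing API itself: the `Tracer` interface, its `record` method and both backends are hypothetical names invented for this example. The point is only that application code depends on an interface, so the backend can be swapped without touching the application.

```python
from abc import ABC, abstractmethod


class Tracer(ABC):
    """Hypothetical vendor-neutral interface that application code depends on."""

    @abstractmethod
    def record(self, operation: str, duration_ms: float) -> None:
        ...


class InMemoryTracer(Tracer):
    """One hypothetical backend: keeps the records in a list."""

    def __init__(self):
        self.records = []

    def record(self, operation, duration_ms):
        self.records.append((operation, duration_ms))


class PrintingTracer(Tracer):
    """Another hypothetical backend: writes the records to stdout."""

    def record(self, operation, duration_ms):
        print(f"{operation} took {duration_ms:.1f} ms")


def place_order(tracer: Tracer) -> str:
    # Application code only knows the interface, never a concrete vendor.
    tracer.record("place_order", 12.5)
    return "ok"


# Swapping the backend needs no change to place_order.
place_order(InMemoryTracer())
place_order(PrintingTracer())
```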
Jaeger
Jaeger was developed in Go by Uber to meet its distributed tracing requirements. It is inspired by Dapper and OpenZipkin, and it has instrumentation libraries for many popular languages.
Before going further, let’s discuss two important concepts: span and trace.
Span
A span represents an individual unit of work done in a distributed system. It contains an operation name, start time and end time of the operation.
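A span as described above can be modelled as a small record. The following Python sketch is an assumption for illustration, not Jaeger's actual data model; the field and method names are made up.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """A unit of work: an operation name plus start and end timestamps."""

    operation_name: str
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None

    def finish(self) -> None:
        """Mark the operation as done by recording the end timestamp."""
        self.end_time = time.time()

    @property
    def duration(self) -> float:
        """Elapsed seconds between start and end; only valid once finished."""
        if self.end_time is None:
            raise ValueError("span has not been finished yet")
        return self.end_time - self.start_time


span = Span("find_nearest_driver")
# ... the traced work would happen here ...
span.finish()
print(span.operation_name, span.duration)
```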
Trace
A trace is a data/execution path through the system. It can be thought of as a directed acyclic graph of spans.
In the figure above, the Service 1 span stretches horizontally across the whole request. Beneath it are child spans from the same service as well as from other services, each performing a different operation. A trace is the combination of all these spans across services.
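The parent-child structure of a trace can be sketched as follows. This is a simplified illustration that treats the trace as a tree, with each span pointing to its parent. The `SpanRecord` type, the IDs, and the service and operation names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str


def render(spans):
    """Return the spans as indented lines, children nested under their parent."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines = []

    def walk(parent_id, depth):
        for s in children.get(parent_id, []):
            lines.append("  " * depth + f"{s.service}: {s.operation}")
            walk(s.span_id, depth + 1)

    walk(None, 0)  # start from the root span, which has no parent
    return lines


spans = [
    SpanRecord("t1", "a", None, "service-1", "handle_request"),
    SpanRecord("t1", "b", "a", "service-1", "validate"),
    SpanRecord("t1", "c", "a", "service-2", "lookup"),
    SpanRecord("t1", "d", "c", "service-3", "query"),
]
print("\n".join(render(spans)))
```

Because every span carries the same trace ID and a reference to its parent, spans reported independently by different services can be stitched back into one picture of the request.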
Example:
Hot R.O.D. - Rides on Demand app
The HotROD application consists of a frontend service, a route service, a customer service and a driver service, and it uses MySQL and Redis as data stores. We can see the dependency graph of the system below.
Next, let’s request a cab from the frontend.
We have four customers, and clicking one of the four buttons summons a cab to that customer’s location. Once the request for a cab is sent to the backend, it responds with the cab’s license plate number and the expected time of arrival, which the app displays as shown in the image below.
Now you can go to the Jaeger UI and explore how the request made above flows from one service to another.
Let’s zoom in and see what is happening in our MySQL span.
Looking into the span, we can see whether it was recorded on the client or the server side of the call, the query being executed on the database, and the logs emitted by the system. This gives us enough information to analyse and debug our system.
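The kind of detail visible in that span can be thought of as key-value tags plus timestamped log entries attached to the span. The Python sketch below is an assumption for illustration only: the `AnnotatedSpan` class, the tag keys, the query text and the log messages are made up and are not copied from the HotROD demo.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedSpan:
    """A span that carries tags (key-value metadata) and timestamped logs."""

    operation_name: str
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    def set_tag(self, key, value):
        self.tags[key] = value
        return self  # returning self allows chained calls

    def log(self, offset_seconds, message):
        # offset_seconds is the time since the span started (illustrative).
        self.logs.append((offset_seconds, message))
        return self


span = AnnotatedSpan("SQL SELECT")
# Illustrative tags: which side of the call this is, and the query text.
span.set_tag("span.kind", "client")
span.set_tag("sql.query", "SELECT * FROM customer WHERE customer_id = ?")
# Illustrative log entries recorded while the span was active.
span.log(0.002, "waiting for lock").log(0.150, "acquired lock")

for key, value in span.tags.items():
    print(f"{key} = {value}")
```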
Conclusion
Request/Response Visualisation
Jaeger provides a UI to visualize a request flowing through different services, with details such as the time spent in and between calls.
Helps in Debugging
It helps in debugging and improving the system.
Root Cause Analysis
It helps in easily finding the real cause of an issue.
References:
https://www.jaegertracing.io/docs/1.9/
https://docs.lightstep.com/docs/opentracing-instrumentation
https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941