Managing distributed systems is always hard: with every microservice that is added, the number of ways the system can fail multiplies.

Suppose an old monolith was running with a 99.0% availability SLO, which allows just 1% of failure. If this system is broken up into 10 microservices, each with the same 99.0% availability, and a request needs all of them, the overall availability drops to roughly 90% (0.99^10 ≈ 90.4%). An allowance of nearly 10% failure is not a good number to look at. Reduced availability is not the only change. When the monolith failed, the failure affected the whole application and the system froze all its operations, but restoration was fairly straightforward and did not affect the state of the application data.
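The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes every request needs all services and that the services fail independently; the function name serial_availability is only illustrative.

```python
def serial_availability(per_service: float, n_services: int) -> float:
    """Availability of a request that needs all n independent services to be up."""
    return per_service ** n_services


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:>2} services at 99.0% each -> {serial_availability(0.99, n):.1%}")
    #  1 services at 99.0% each -> 99.0%
    #  5 services at 99.0% each -> 95.1%
    # 10 services at 99.0% each -> 90.4%
    # 20 services at 99.0% each -> 81.8%
```

Real systems rarely meet the independence assumption exactly, so the numbers are a rough guide rather than a prediction.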

There are thousands of articles out there that describe the issues with microservices and how to address them. Below are a few less commonly discussed ones from my personal experience so far.

Reliability at a cost

Communication between systems in a microservices architecture is mostly done using messages. Since every message adds delay to the system because it has to be processed, there are generally two ways of handling messages:

  • Fast but unreliable: Copies of a message are sent quickly to several subscribers. The receiving systems have to expect duplicate, out-of-order and missing messages. It is an unreliable yet fast way of communicating.
  • Slow and reliable: Reliability can only be built by storing messages statefully on the system so they can be processed later, and it includes sending a receipt or acknowledgement as well. This is important in systems that handle transactions which cannot be lost at any cost and which need to maintain the order of communication.
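The "slow and reliable" style can be sketched as follows. This is an in-memory illustration only, assuming a queue that keeps every message until it is acknowledged and a consumer that deduplicates by message ID. The names ReliableQueue and IdempotentConsumer are hypothetical and do not refer to any particular broker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    msg_id: str   # unique ID assigned by the sender, used for deduplication
    payload: str


class ReliableQueue:
    """Stand-in for a durable broker: a message stays stored until it is acked."""

    def __init__(self):
        self._unacked = {}  # dicts preserve insertion order, so delivery order is kept

    def publish(self, msg):
        self._unacked[msg.msg_id] = msg

    def deliver(self):
        """Return every message that has not been acknowledged yet, oldest first."""
        return list(self._unacked.values())

    def ack(self, msg_id):
        self._unacked.pop(msg_id, None)

    def pending(self):
        return len(self._unacked)


class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered several times."""

    def __init__(self):
        self._seen = set()
        self.applied = []

    def handle(self, msg):
        if msg.msg_id in self._seen:
            return False  # duplicate delivery, already applied
        self._seen.add(msg.msg_id)
        self.applied.append(msg.payload)
        return True


def demo():
    queue, consumer = ReliableQueue(), IdempotentConsumer()
    for i, payload in enumerate(["debit", "credit", "refund"]):
        queue.publish(Message(f"m{i}", payload))

    # First pass: m0 is processed and acked; m1 is processed but its ack is "lost".
    first, second, _ = queue.deliver()
    consumer.handle(first)
    queue.ack(first.msg_id)
    consumer.handle(second)

    # Second pass: the queue redelivers everything still unacked (m1 and m2).
    for msg in queue.deliver():
        consumer.handle(msg)  # m1 is skipped as a duplicate, m2 is applied
        queue.ack(msg.msg_id)
    return consumer.applied, queue.pending()


if __name__ == "__main__":
    print(demo())  # (['debit', 'credit', 'refund'], 0)
```

The extra storage, acknowledgements and redelivery are where the added delay of the reliable path comes from.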

The reliability versus speed trade-off is not limited to messages. It also applies to APIs, especially around BFF (backend for frontend) layers that have been converted from session-based stateful systems, yet still have to maintain some sort of ordering and visibility to prevent CRUD operations from failing on transactions.

No single source of truth (SSoT)

This has been a common problem in the microservices world. As data flows between systems, there is no global database or common storage to refer to. Each system lives on an “old truth” held in its own copy of the data, in its own database or memory.

The easiest way to cope with this is to use event log and data replication techniques. However, in cases where time is of utmost importance, systems end up sharing the same database across microservices to get a single view of truth (SVoT) when a single source of truth is not possible. Even then, we can never be sure that the data will be consistent enough to define the right SLOs around it.
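The event log approach can be sketched in a few lines. This is a simplified in-memory illustration, assuming an append-only log and replicas that each track their own read offset. The class names EventLog and Replica are hypothetical, and a production setup would use a durable log instead.

```python
class EventLog:
    """Append-only log: the closest thing to a shared source of truth."""

    def __init__(self):
        self._events = []

    def append(self, key, value):
        self._events.append((key, value))

    def read_from(self, offset):
        return self._events[offset:]

    def __len__(self):
        return len(self._events)


class Replica:
    """A service-local copy of the data, rebuilt by replaying the log."""

    def __init__(self, log):
        self._log = log
        self.offset = 0   # number of events already applied
        self.state = {}

    def catch_up(self, max_events=None):
        events = self._log.read_from(self.offset)
        if max_events is not None:
            events = events[:max_events]
        for key, value in events:
            self.state[key] = value
        self.offset += len(events)

    def lag(self):
        """How many events this replica is behind: its distance from the truth."""
        return len(self._log) - self.offset


if __name__ == "__main__":
    log = EventLog()
    log.append("order-1", "created")
    log.append("order-1", "paid")
    log.append("order-1", "shipped")

    replica = Replica(log)
    replica.catch_up(max_events=2)
    print(replica.state, replica.lag())  # {'order-1': 'paid'} 1
    replica.catch_up()
    print(replica.state, replica.lag())  # {'order-1': 'shipped'} 0
```

The lag value makes the "old truth" measurable, which is one possible input when deciding what freshness a service can promise.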

Retries everywhere

Rather than the failure itself, recovery from failure tends to be the more expensive part in the microservices world. Every retry at a failure point can flood the systems with requests, messages and transactions, causing butterfly or domino effects because of the sheer amount of communication between all systems. Retry thresholds need to be chosen carefully, and retries often need delays in between, driven by a suitable backoff algorithm, to avoid flooding the network. This is also where the right circuit breaker needs to be identified and used, so that the system does not exhaust its resources in the recovery itself.
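One common way to combine these ideas is exponential backoff with jitter wrapped around a circuit breaker. The sketch below is illustrative only: the thresholds and delays are arbitrary assumptions, and CircuitBreaker and retry_with_backoff are hypothetical names, not a specific library's API.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call a dependency that keeps failing."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let a single trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Retry func, sleeping a random time up to an exponentially growing cap."""
    for attempt in range(max_attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker has given up, so retrying would only add load
        except Exception:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # "full jitter" spreads retries from many clients


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=3, reset_timeout=10.0)
    attempts = {"n": 0}

    def flaky():  # hypothetical dependency that fails twice, then recovers
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(retry_with_backoff(lambda: breaker.call(flaky), base_delay=0.01))  # ok
```

The jitter keeps many clients from retrying in lockstep, and the breaker stops retries entirely once a dependency looks unhealthy.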

Visibility

Often when we build a microservices system with so many orchestrated components, we are blinded by the sheer amount of data around us. At the speed of stateless distributed systems, do we have the right view of the issues and the logs at hand? It can be nearly impossible to track issues and debug problems without a correlation ID and the right tools in place to trace every HTTP request.
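A correlation ID can be carried through a service with Python's standard contextvars and logging modules. In this minimal sketch, the header name X-Correlation-ID is a widely used convention rather than a formal standard, and handle_request and the "orders" logger are hypothetical stand-ins for a real request handler.

```python
import contextvars
import logging
import uuid

HEADER = "X-Correlation-ID"  # assumed header name, a common convention
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copies the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s"))
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def handle_request(headers):
    """Reuse the caller's ID or mint a new one; return headers for downstream calls."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("processing request")
        return {HEADER: cid}  # attach this to every outgoing HTTP call
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request


if __name__ == "__main__":
    handle_request({HEADER: "abc123"})  # logs: INFO [abc123] processing request
    handle_request({})                  # logs the same message with a generated ID
```

As long as every service forwards the header, log lines from different services can be joined on the same ID.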

Optimism

With the move to MACH architecture (Microservices, API-first, Cloud-native and Headless), there is still a sense of astonishment all around. There is unrealistic optimism and hope that microservices will simply work in production. Systems are often not tested to the point where they are expected to fail, so chaos engineering practices and approaches need to be followed when testing and building them. Tools like Chaos Monkey or LitmusChaos should be used to find weaknesses and faults in systems.
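The idea behind chaos engineering can be illustrated at a very small scale with application-level fault injection. This toy sketch is not how Chaos Monkey or LitmusChaos work, since those tools inject faults at the infrastructure level. All names here (with_chaos, get_price, price_or_default) are hypothetical.

```python
import random


class InjectedFault(RuntimeError):
    """Raised by the chaos wrapper instead of calling the real dependency."""


def with_chaos(func, failure_rate, rng=random):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("fault injected by chaos wrapper")
        return func(*args, **kwargs)
    return wrapper


def get_price(item):
    """Hypothetical downstream dependency."""
    return {"book": 12.0}[item]


def price_or_default(fetch, item, default=None):
    """Caller under test: it should degrade gracefully, not crash."""
    try:
        return fetch(item)
    except Exception:
        return default


def run_experiment(failure_rate, calls=1000, seed=42):
    """Return (degraded, ok) counts; an escaping exception means a weakness was found."""
    rng = random.Random(seed)
    flaky = with_chaos(get_price, failure_rate, rng)
    results = [price_or_default(flaky, "book") for _ in range(calls)]
    degraded = results.count(None)
    return degraded, calls - degraded


if __name__ == "__main__":
    degraded, ok = run_experiment(failure_rate=0.3)
    print(f"degraded={degraded} ok={ok}")
```

The point of such an experiment is to expect the failure up front and verify that the caller survives it, instead of hoping the dependency never fails.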

In summary, managing a stateless distributed system shouldn’t be underestimated, and we definitely shouldn’t expect it not to fail. Applications migrating to microservices should start small and grow gradually, becoming more and more reliable along the way!

Author: Prateek Srivastava
Lead Solutions Architect / Engineering Manager

Original post: https://www.linkedin.com/pulse/reliability-engineering-microservices-prateek-srivastava/