Distributed systems are all around us, providing the backbone of the computing infrastructure we rely upon — think of the mesh of computing nodes, connected by wireless and wireline networks of various kinds, that get our financial transactions done in the blink of an eye, process our web orders, or make sure the lights turn on when we flip the switch. A question that has faced us for some time is:
Can these distributed systems remain continuously available while also reconfiguring when things change, say when the workload or the user requirements change?
It is a nuanced story, and it traces its roots back to the days when live upgrades of single-node computing systems, such as Linux or Lucent's 5ESS switch, were first designed. We take this capability for granted today thanks to heroic design and implementation efforts. But live upgrades of distributed systems present a fresh set of opportunities as well as challenges.
Why would we want to upgrade or reconfigure?
We would want to upgrade for the obvious reason that new software or new hardware needs to be put in place. Such upgrades are relatively infrequent events. The greater challenge is handling reconfiguration, which may be needed far more frequently. Reconfiguration may be needed if:
- The workload characteristics coming into the distributed system change, say, a set of sparse read operations is replaced by bursty writes to a few hot files.
- The environment of the execution platform changes, say, a demanding set of co-located applications starts up.
- The user requirements change, say, the user demands a faster response or greater resilience to errors. This may happen in response to perceived security attacks or cascading failures.
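The three trigger classes above can be sketched as a simple monitoring check. This is a minimal, hypothetical illustration — the metric names, thresholds, and `Policy` structure are assumptions for exposition, not part of any real system described here:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    # Hypothetical metrics a reconfiguration controller might sample.
    write_burst_ratio: float   # fraction of recent ops that are bursty writes
    colocated_load: float      # CPU share consumed by co-located applications
    p99_latency_ms: float      # observed tail latency seen by the user

@dataclass
class Policy:
    # Hypothetical thresholds encoding the user's current requirements.
    max_write_burst_ratio: float = 0.5
    max_colocated_load: float = 0.7
    max_p99_latency_ms: float = 200.0

def reconfiguration_triggers(m: Metrics, p: Policy) -> list[str]:
    """Return which of the three trigger classes currently fire."""
    triggers = []
    if m.write_burst_ratio > p.max_write_burst_ratio:
        triggers.append("workload-change")        # sparse reads -> bursty writes
    if m.colocated_load > p.max_colocated_load:
        triggers.append("environment-change")     # demanding co-located apps
    if m.p99_latency_ms > p.max_p99_latency_ms:
        triggers.append("requirement-violation")  # user wants faster response
    return triggers
```

For example, `reconfiguration_triggers(Metrics(0.8, 0.2, 150.0), Policy())` reports only a workload change, while a loaded platform with high tail latency would report the other two classes.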