Distributed systems are all around us, providing the backbone of the computing infrastructure that we rely upon: think of the mesh of computing nodes, connected by wireless and wireline networks of various kinds, that get our financial transactions done in the blink of an eye, that process our web orders, that make sure the lights turn on when we flip the switch, or, most importantly, that serve up my weekly dose of SNL spoofs going back N years. A question that has faced us for some time is:
Can these distributed systems stay continuously available while also reconfiguring themselves when things change, say when the workload or the user requirements change?
It is a nuanced story, and it traces its roots back to the days when live upgrades of single-node computing systems, like Linux or Lucent's 5ESS switch, were designed. Live upgrades of single nodes are a done deal today and have been for many years now. But live upgrades of distributed systems present a fresh set of opportunities as well as challenges.
Why Would We Want to Upgrade or Reconfigure?
We would want to upgrade for the obvious reason that there is new software or new hardware to be put in place. Such upgrades are relatively infrequent events. The bigger challenge is reconfiguration, which is needed far more often and is triggered by unpredictable events.
Reconfiguration may be needed if:
- The workload characteristics coming into the distributed system change, say, a set of sparse read operations is replaced by bursty writes to some hot files.
- The environment of the execution platform changes, say, a demanding set of co-located applications starts up.
- The user requirements change, say, the user demands a faster response or greater resilience to errors, perhaps in response to perceived security attacks or cascading failures.
Searching for the Perfect Fit
The state of practice is still to employ armies of database administrators (DBAs) to keep the database tuned up as things change. Here the state of the art has raced quite a bit ahead, providing automated ways of analyzing the workload and pinpointing the best configuration to use. A notable work in this space is OtterTune [1], a wonderful example of machine learning done right for computer systems work.
OtterTune performs nearest-neighbor interpolation to optimize configurations for unseen workloads. The challenge that all techniques in this space have to deal with, and which has been largely solved by the state of the art, is that the response of the system is not linear with respect to variations in the parameter values, so one has to be careful to avoid local maxima. A second challenge is that one cannot search by actually running the system under each candidate configuration; that would take far too long. Our work, Rafiki [2], addressed this by building a surrogate model of the system's performance (trained on sparse data) and querying this surrogate, rather than executing the actual system, during the search. If the surrogate model is built right, querying it is much faster and you don't sacrifice much accuracy.
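To make this concrete, here is a minimal sketch of the surrogate-model idea, not Rafiki's actual implementation: train a cheap regression model on sparse (configuration, throughput) samples, then search candidate configurations by querying the model instead of the live system. The parameter names, the numbers, and the random-forest surrogate are all illustrative assumptions.

```python
# Minimal sketch of surrogate-model-based configuration search.
# Parameter names, numbers, and the random-forest surrogate are
# illustrative assumptions, not Rafiki's actual implementation.
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Sparse training data: (configuration vector, measured throughput in ops/s).
# Columns: [compaction_threads, memtable_size_mb, cache_size_mb]
X_train = np.array([
    [2,  64,  512],
    [4, 128, 1024],
    [8, 256, 2048],
    [4,  64, 2048],
    [8, 128,  512],
])
y_train = np.array([12000, 18500, 21000, 16000, 17500])

# Train the surrogate once; querying it is orders of magnitude cheaper
# than re-running the database under each candidate configuration.
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(X_train, y_train)

# Grid of candidate configurations (a real tuner would search more cleverly,
# e.g. with Bayesian optimization, to avoid getting stuck at local maxima).
candidates = np.array(list(itertools.product(
    [2, 4, 8, 16],           # compaction_threads
    [64, 128, 256, 512],     # memtable_size_mb
    [512, 1024, 2048, 4096]  # cache_size_mb
)))

predicted = surrogate.predict(candidates)
best = candidates[np.argmax(predicted)]
print(f"predicted-best configuration: {best}, "
      f"estimated throughput: {predicted.max():.0f} ops/s")
```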

The Times They Are A-Changin'
Dylan has said many wise words, none more so than in his 1964 album "The Times They Are A-Changin'". The times are indeed changing for the distributed databases that serve up our fickle queries so fast. One moment a video stream from a 15-minute celebrity is being requested by everyone on the planet; the next moment, a new celebrity is born for her transient viral fame. This means the workload characteristics seen by the backend databases change too, and that perfect fit you found for your database so painstakingly is no longer such a good fit.
Therefore, people have developed online reconfiguration systems [3, 4, 5] for distributed databases, distributed ML serving systems, and everything in between. Abstracting out the specifics, each works in the following steps (a small sketch of the planning logic follows the list):
- It extracts information about the current workload and uses a workload prediction model to anticipate what the workload will look like in the near future.
- It then queries a static configuration tuner (the kind described in the previous section) to determine the best configuration for each point in the future window of workloads, and, for each such configuration, the metric it would achieve (like database throughput).
- It calculates the cost of making each of these switches (from configuration A to configuration B). This is where the stateful nature of the distributed system is key, because a switch usually means that state has to be moved, which takes time.
- It searches through the space of possible configuration switches (note that there can be multiple switches over the future time window) and picks the plan with the best benefit:cost ratio.
- It then runs a distributed protocol to carefully switch over the cluster from the old configuration to the new configuration.
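Here is a minimal sketch of the planning step referred to above, under made-up workloads, throughputs, and switch costs; real planners such as SOPHIA [4] perform a more elaborate lookahead and then drive a careful distributed switch-over protocol.

```python
# Minimal sketch of the benefit:cost planning step for online reconfiguration.
# Configuration names, throughputs, and switch cost are made-up numbers.
import itertools

# For each future time slot, the static tuner's predicted throughput (ops/s)
# under each candidate configuration.
window = [
    {"read-heavy": 18000, "write-heavy": 12000},   # slot 0: mostly reads
    {"read-heavy": 11000, "write-heavy": 21000},   # slot 1: bursty writes
    {"read-heavy": 10500, "write-heavy": 22000},   # slot 2: bursty writes
]

SLOT_SECONDS = 300          # length of one prediction slot
SWITCH_COST_OPS = 3.0e6     # work lost while state is migrated during a switch
current_config = "read-heavy"

def plan_value(plan):
    """Total ops served over the window, minus the cost of every switch."""
    ops, prev = 0.0, current_config
    for slot_tputs, cfg in zip(window, plan):
        ops += slot_tputs[cfg] * SLOT_SECONDS
        if cfg != prev:                 # stateful switch: move data, lose work
            ops -= SWITCH_COST_OPS
        prev = cfg
    return ops

# Search the space of possible switch plans (multiple switches are allowed
# over the window) and pick the one with the best net benefit.
configs = ["read-heavy", "write-heavy"]
plans = itertools.product(configs, repeat=len(window))
best_plan = max(plans, key=plan_value)
print("chosen plan:", best_plan)
```

The point of the toy numbers is that switching is only worthwhile when the predicted gain over the remaining window outweighs the one-time cost of moving state.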
So What’s Left to be Done?
There are still a few knotty problems to be solved. Solving them will have a large impact on our software infrastructures, both the kind we rely on for our safety and the kind we rely on for our fun; much of this infrastructure consists of stateful distributed applications. Examples in the first category are the computing infrastructure that helps in relief and rescue operations, that secures the global supply chains behind products we trust our lives to, and the complex chain that guards the IP rights that are the lifeblood of our innovation ecosystem. Examples in the second category are not hard to find: pretty much every social network and video sharing platform serves up our stream of messages and favorite cat videos through a distributed NoSQL backend (Twitter, Facebook, Google).
So what are some of these knotty problems that we can get behind and solve?
- How can such distributed software systems co-exist on premises and in the cloud and still be reconfigured as things change? This is an emerging system model, as people put some of their data in the cloud and keep the data they deem more sensitive on premises.
- How do innovations in containerization help here? Containers can arguably package up state well and can be moved more seamlessly to a new node, so this should help. But robust reconfiguration solutions that ride these innovations (Kubernetes, Linux Containers, etc.) are still not there.
- How do we automatically figure out dependencies among the configuration options? A rising trend is n-way dependencies among the configuration parameters, whereas the state of the art handles only independent or two-way dependencies. Perhaps as a result of the way these systems have been built and evolved over time, with diverse groups of developers, configuration parameters are no longer neatly orthogonal. How do we figure out, automatically, which parameters are dependent and how long the dependency chains are? This seems like a ripe area where strong data analytics algorithms can make a dent; a toy illustration follows below.
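As a toy illustration of what such analytics might look like, the sketch below checks for a two-way dependency between two hypothetical parameters by measuring the interaction effect in a 2x2 factorial design. All names and numbers are made up, and real systems would have to cope with noisy measurements and much longer dependency chains.

```python
# Toy sketch: detect a pairwise dependency between two configuration
# parameters from benchmark measurements, via the interaction term of a
# 2x2 factorial design. Parameter levels and throughputs are hypothetical.

# throughput[(param_A_level, param_B_level)] measured at low/high settings
throughput = {
    ("low",  "low"):  10000,
    ("low",  "high"): 15000,
    ("high", "low"):  11000,
    ("high", "high"): 24000,
}

# Interaction effect: does the gain from raising A depend on the level of B?
gain_A_at_low_B  = throughput[("high", "low")]  - throughput[("low", "low")]
gain_A_at_high_B = throughput[("high", "high")] - throughput[("low", "high")]
interaction = gain_A_at_high_B - gain_A_at_low_B

# A large (relative) interaction suggests the two parameters are dependent
# and must be tuned jointly rather than independently.
print(f"gain from A at low B:  {gain_A_at_low_B}")
print(f"gain from A at high B: {gain_A_at_high_B}")
print(f"interaction effect:    {interaction}")
```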
In Conclusion
Distributed software systems are all around us, and some of them are used for critical application areas where our safety depends on the systems being continuously available. To keep them running fast, we also need to reconfigure them when things change: the workload characteristics, the failure modes, or the user requirements. These changes can happen in unpredictable ways. The technical community has made strong progress in answering the question: can we keep these systems and their data continuously available while performing online reconfigurations? While the answer is a "Yes", the work has shed light on a few challenging and high-impact technical problems that we still need to solve. So off to some ground-breaking work from our community!
References
[1] Van Aken, Dana, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. “Automatic database management system tuning through large-scale machine learning.” In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD), pp. 1009-1024. 2017.
[2] Mahgoub, Ashraf, Paul Wood, Sachandhan Ganesh, Subrata Mitra, Wolfgang Gerlach, Travis Harrison, Folker Meyer, Ananth Grama, Saurabh Bagchi, and Somali Chaterji. “Rafiki: A middleware for parameter tuning of NoSQL datastores for dynamic metagenomics workloads.” In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, pp. 28-40. 2017.
[3] Li, Guoliang, Xuanhe Zhou, Shifu Li, and Bo Gao. “QTune: A query-aware database tuning system with deep reinforcement learning.” Proceedings of the VLDB Endowment 12, no. 12 (2019): 2118-2130.
[4] Mahgoub, Ashraf, Paul Wood, Alexander Medoff, Subrata Mitra, Folker Meyer, Somali Chaterji, and Saurabh Bagchi. “SOPHIA: Online reconfiguration of clustered NoSQL databases for time-varying workloads.” In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pp. 223-240. 2019.
[5] Li, Min, Liangzhao Zeng, Shicong Meng, Jian Tan, Li Zhang, Ali R. Butt, and Nicholas Fuller. “MROnline: MapReduce online performance tuning.” In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 165-176. 2014.