Big Data and Security: Oxymoron?


Big data technologies have dramatically changed the world we live in, and in double-quick time. You know that unless you have been living under a Martian rock. We take them for granted in many of our daily interactions, in our personal lives as well as at work. Big data technologies fuel the seemingly never-ending growth of the big tech behemoths: the Big Five of Google, Amazon, Facebook, Microsoft, and Apple are household names, but many others also owe their growth to big data. The question I have been posing in my research is how dependable such technologies are (can we trust life-and-death decisions to these algorithms?) and what the trend line is, i.e., are they becoming more dependable over time, and at what rate? This question forms the central theme of our Army Research Lab-funded Army AI Innovation Institute (A2I2), which started in 2020 and is slated to run for five years [1].

This is of course a sweeping question to ask, for several reasons. What kinds of big data technologies? What application domains are we talking about? What does dependability mean: resilience to what kinds of failures or what kinds of security attacks? Therefore I will give you generalizations across some subsets of these factors and, in a few cases, talk of nuances that the generalizations do not capture well [2].

Why should we care about big data becoming dependable?

We care that our big data processing systems are dependable because we increasingly rely on these systems to make critical decisions, at work and at home, in civilian settings and, slowly but surely, in law enforcement and military situations as well. At home, for example, the Siris and the Alexas of the world are at our beck and call. This also means that they sit in private spaces with ears, and sometimes eyes, that can pick up our private conversations. At work, many companies process large amounts of data, the most prized currency of the digital realm; the company that knows how to extract the most value out of the data wins. So the winner is morphing from the company that has access to troves of data to the one that has such access and can also extract monetizable insights from it. On the non-civilian side, cybersecurity increasingly relies on automated bots for defending our systems as well as attacking others. Also, decisions about law enforcement (where to police, how to decide on sentencing) are being made in part by algorithms, an area that has been the focus of intense debate in political and intellectual circles.

So overall, big data systems have stepped out definitely and defiantly from the playhouse to the house of consequential decisions, even decisions of life and death.

How do we achieve dependability of big data?

In the context of our institute, A2I2, we aim to achieve dependability of big data by answering the following three big questions:

Can we build algorithms that are distributed and yet resilient to malicious actors?
Can we execute such algorithms in real time on a distributed computing platform, one that has devices of all hues, static and mobile, from embedded devices through edge computing devices to server-class machines?
Can we, as human users of such systems, see into the inscrutable workings of these algorithms? In other words, can we get some explanation for security-related decisions taken by the algorithms?

Attack Model. First, let us think about what we are trying to make our autonomous systems resilient against. In one dimension, we account for the possibility that our data or our models can be maliciously corrupted. Data corruption can happen either at training time, while the model is being learned, or at inference time, while the model is being used. Model corruption can be either targeted (the model always mistakes the color “red” for “green”) or untargeted and generally more damaging (the model has poor accuracy across all classes). In another dimension, we account for what fraction of the agents in the system can act maliciously and, among those malicious agents, what level of collusion is possible.
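
To make the targeted versus untargeted distinction concrete, here is a minimal Python sketch of one way an adversary might corrupt training labels. It is an illustration only, not the attack model used in A2I2: the function names, the toy labels, and the 50% corruption fraction are all assumptions I have made for the example.

```python
import random

def flip_labels_targeted(labels, source, target):
    """Targeted corruption: every `source` label is rewritten as `target`."""
    return [target if y == source else y for y in labels]

def flip_labels_untargeted(labels, classes, fraction, seed=0):
    """Untargeted corruption: a random `fraction` of the labels is replaced
    by some other class, degrading accuracy across the board."""
    rng = random.Random(seed)
    corrupted = list(labels)
    n_flip = int(fraction * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        corrupted[i] = rng.choice([c for c in classes if c != corrupted[i]])
    return corrupted

# Hypothetical toy training labels.
labels = ["red", "green", "red", "blue", "green", "red"]
classes = ["red", "green", "blue"]

targeted = flip_labels_targeted(labels, "red", "green")
untargeted = flip_labels_untargeted(labels, classes, fraction=0.5)

print(targeted)    # every "red" has become "green"
print(untargeted)  # three labels, chosen at random, now name a wrong class
```

A model trained on `targeted` would tend to learn the specific red-to-green confusion, while one trained on `untargeted` would tend to be less accurate on every class.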

Robust algorithms under adversarial settings. Here we develop algorithms that can provide guarantees about their results even under adversarial conditions. This is challenging because one has to consider unconstrained adversaries, including those who know the inner workings of your algorithms. Our guarantees apply to worst-case behavior, and we also ensure that performance under benign conditions is not significantly reduced.
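
As a flavor of what robustness can look like in a distributed setting, the sketch below compares two ways of combining updates from several agents: a plain average and a coordinate-wise median. The median is a well-known robust aggregation rule that I use here purely as an illustration; it is not necessarily one of the algorithms developed in A2I2, and the update vectors are made-up numbers.

```python
from statistics import mean, median

def aggregate(updates, reducer):
    """Combine per-agent update vectors coordinate by coordinate."""
    return [reducer(coordinate) for coordinate in zip(*updates)]

# Hypothetical updates from four honest agents and one malicious agent.
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]]
malicious = [[100.0, -100.0]]
updates = honest + malicious

mean_agg = aggregate(updates, mean)      # dragged far away by the single attacker
median_agg = aggregate(updates, median)  # stays at the honest consensus

print(mean_agg)
print(median_agg)  # [1.0, 2.0]

# Under benign conditions the two rules agree closely.
benign_gap = [abs(a - b) for a, b in zip(aggregate(honest, mean),
                                         aggregate(honest, median))]
print(benign_gap)
```

The median tolerates the attacker here because a minority of agents cannot move the middle value arbitrarily, and it costs little when every agent is honest.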

Secure, real-time, distributed execution. Here we create practical instantiations of the algorithms that execute on distributed computing platforms, typically a heterogeneous mix of low-end to mid-range embedded and edge devices. This involves parallelization strategies (data decomposition and model decomposition) and right-sizing each part of the algorithm to run on a device according to its capabilities. This also means dealing with the fact that a node may not be continuously connected to a backend server farm, but may have only transient connectivity, and low bandwidth at that.
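
The two decomposition strategies can be sketched in a few lines of Python. This is a simplified illustration under assumptions of my own: the device names, the abstract capacity and cost units, and the greedy placement rule are hypothetical and are not taken from the A2I2 platform.

```python
def split_batch_by_capacity(n_samples, capacities):
    """Data decomposition: each device gets a share of the batch
    proportional to its capacity."""
    total = sum(capacities.values())
    shares = {d: (n_samples * c) // total for d, c in capacities.items()}
    # Hand any remainder to the most capable devices first.
    leftover = n_samples - sum(shares.values())
    for d in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
        shares[d] += 1
    return shares

def assign_layers(layer_costs, budgets):
    """Model decomposition: place consecutive layers on devices in order,
    moving to the next device once the current budget is used up."""
    placement = []
    devices = iter(budgets.items())
    device, remaining = next(devices)
    for cost in layer_costs:
        while cost > remaining:
            nxt = next(devices, None)
            if nxt is None:
                raise ValueError("model does not fit on the given devices")
            device, remaining = nxt
        placement.append(device)
        remaining -= cost
    return placement

# Hypothetical devices with relative compute capacities.
capacities = {"sensor": 1, "edge": 3, "server": 6}
shares = split_batch_by_capacity(100, capacities)
print(shares)  # {'sensor': 10, 'edge': 30, 'server': 60}

# Hypothetical per-layer costs and per-device budgets.
placement = assign_layers([1, 2, 2, 4], {"sensor": 1, "edge": 4, "server": 6})
print(placement)  # ['sensor', 'edge', 'edge', 'server']
```

A real deployment would also have to account for the transient connectivity and low bandwidth mentioned above, which this sketch ignores.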

Interpretable Operation. Interpretability is an important aspect of our solution approach, as it lets users adapt the model as needed and have a greater level of trust in the outputs of the autonomy pipeline. By explaining predictions, we can help the user determine whether the models have been compromised by an adversary. We design methods for two distinct problems of interpretability in ML: (a) algorithmic interpretability (understanding the learning process) and (b) prediction interpretability (explaining the predictions).
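
For prediction interpretability, one simple and generic idea is to ask how much a prediction changes when a single input feature is blanked out. The Python sketch below applies this occlusion-style attribution to a toy linear model. It is an illustration of the general idea rather than the method we design in A2I2, and the weights, the input, and the zero baseline are assumptions made for the example.

```python
def predict(weights, x):
    """Toy linear model: a weighted sum of the input features."""
    return sum(w * xi for w, xi in zip(weights, x))

def occlusion_attribution(model, x, baseline=0.0):
    """Score each feature by how much the prediction drops when that
    feature alone is replaced by the baseline value."""
    full = model(x)
    scores = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline
        scores.append(full - model(occluded))
    return scores

# Hypothetical model weights and input.
weights = [0.5, -2.0, 0.0]
x = [4.0, 1.0, 7.0]

scores = occlusion_attribution(lambda v: predict(weights, v), x)
print(scores)  # [2.0, -2.0, 0.0]
```

A user who sees a large score on a feature that should be irrelevant has a concrete reason to suspect that the model, or the data it was trained on, has been tampered with.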

What are the trend lines and the end game in a 5-year outlook?

This area, the assurance of autonomous systems, is fast-moving, and hence predictions are risky. Nevertheless, my crystal-ball gazing tells me that in the next few years we will broaden the scope of autonomous systems. This will doubtless increase their attack surface, that is, all the different ways in which adversaries can compromise them. But we will make rapid gains in the defenses. We are fast building up a good understanding of the fundamental characteristics of the algorithms that underlie autonomous systems. From this understanding will arise foundational defenses, which break away from the current, seemingly endless cycle of an attack, a defense against it, and a new attack to bypass that defense. These foundational defenses will instead erect (practically) impenetrable shields against a wide, and rigorously quantified, set of attacks.

An instructive comparison can be drawn to the world of cryptography used in financial systems. Such systems are still broken into, but increasingly rarely and increasingly only by powerful nation-state adversaries. Similarly, we will raise the barrier to compromising autonomous systems and use our defenses in combination with other measures, like intuitive human validation of data and models. Thus we will glide into a world where autonomous systems are trustworthy and their level of trustworthiness can be quantified and rigorously proven.

[1] The A2I2 project involves as thrust leads Somali Chaterji (ABE), Mung Chiang (ECE), David Inouye (ECE), and Prateek Mittal (Princeton EE) with Saurabh Bagchi as PI.

[2] A terminology clarification. I use the term “dependability” rather than “security” throughout the article because it encompasses resilience to malicious attacks as well as to natural bugs that are introduced without malicious intent.
