Protecting your data supply chain against emergent cyberthreats

  • Date: Mar 19, 2024
  • Read time: 5 minutes

The risks posed by bad actors altering training data to change how an AI algorithm functions could prove disastrous to your AI initiatives.

Background

Interest in AI and ML has exploded and driven investments in IT infrastructure, with all industry verticals prioritizing projects to adopt this new technology. Racing to adopt it without proper security controls in place – especially around AI training data – could expose the enterprise to new attack vectors.

We will look at some of these risks and their solutions, and show a proof-of-concept attack vector against machine learning infrastructure.

Overview

Solid training data is the foundation of all AI/ML initiatives. The risks posed by bad actors altering training data to change how an AI algorithm functions could prove disastrous to your AI initiatives. That sounded like a theory that needed to be proven, so we set out to demonstrate why protecting your training data is essential to designing a secure AI/ML data pipeline. The ML data pipeline typically involves the following steps (a minimal sketch follows the list):

  1. Collecting training data from various sources.
  2. Transforming training data into a data set that is ready for running training models. This transformation can take many forms, including augmenting the training data with additional sources or combining data from multiple sources into a single data set.
  3. Running training algorithms on training data sets to produce model files.
  4. Using the model files for prediction, anomaly detection, or one of the many other uses of ML/AI pipelines.
  5. Finally, deriving results from the AI models in a live environment, acting on live data inputs.
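
As a concrete illustration, here is a minimal sketch of such a pipeline in Python. It assumes a simple tabular workflow built on pandas, scikit-learn and joblib; the file paths, column names and the choice of IsolationForest are placeholders for this example, not a reference implementation.

```python
# Minimal ML data-pipeline sketch (illustrative only).
# Assumes audit records exported as CSV files with numeric
# "bytes_read" and "files_accessed" columns; names are placeholders.
import glob

import pandas as pd
from joblib import dump, load
from sklearn.ensemble import IsolationForest

# 1. Collect training data from various sources.
frames = [pd.read_csv(path) for path in glob.glob("audit_exports/*.csv")]

# 2. Transform: combine the sources into a single training data set.
dataset = pd.concat(frames, ignore_index=True)
features = dataset[["bytes_read", "files_accessed"]]

# 3. Train: run the training algorithm and produce a model file.
model = IsolationForest(random_state=42).fit(features)
dump(model, "baseline_model.joblib")

# 4. Use the model file for anomaly detection.
model = load("baseline_model.joblib")

# 5. Act on live data inputs: -1 flags an anomalous access pattern.
live = pd.DataFrame({"bytes_read": [5_000_000_000], "files_accessed": [120_000]})
print(model.predict(live))
```

Note that steps 1 and 2 in this sketch simply read and reshape files on disk; anyone who can write to that staging area can alter what the model learns.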

Identifying the Risks

As we can see, steps 1 and 2 are potential attack vectors for manipulating the training data to affect model creation and, subsequently, the downstream AI functions that depend on the models. Attacks on training data could have many undesirable consequences. Let’s use the example of a security product that depends on trained models and AI to detect suspicious activity in endpoints, networks, storage, or applications.

Proof-of-Concept Attack on AI and ML

The accompanying video demonstration walks through a scenario in which training data is being used to baseline normal activity in user and host data access patterns for a storage device. The model was designed to detect suspicious activity from users accessing the storage by training on audit logs produced by the storage device. An attacker wanting to go undetected for large data operations – for example, copying large quantities of data from the storage device – could easily modify the training data set in step 1 above.

The audit data collected from the storage devices needs to be turned into a data set for training (step 2), and the data could be modified at this step to make the attacker’s access patterns appear as routinely high data reads. This inflated baseline could mask the attacker’s goal of copying very large quantities of data without being noticed.

Once the data set is passed into the training step, the attacker has essentially “poisoned” the AI model’s ability to detect the anomaly.
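
To make the effect concrete, the sketch below replaces a full ML model with a deliberately simple statistical baseline (mean plus three standard deviations of bytes read per session). The numbers are fabricated for this example, but the mechanism is the one described above: injecting "routinely huge" reads into the training data raises the learned threshold until the attacker’s real exfiltration read no longer registers as an anomaly.

```python
# Illustrative data-poisoning effect on a simple anomaly baseline.
# The baseline is mean + 3 * stdev of bytes read per session;
# all numbers below are fabricated for the example.
import statistics


def baseline_threshold(bytes_read):
    """Learn an 'anomalous read volume' threshold from training data."""
    return statistics.mean(bytes_read) + 3 * statistics.stdev(bytes_read)


# Clean audit data: typical sessions read a few hundred MB.
clean_training = [200e6, 250e6, 180e6, 300e6, 220e6, 260e6]

# Poisoned audit data: the attacker injects records that make
# multi-GB reads look routine (steps 1 and 2 of the pipeline).
poisoned_training = clean_training + [9e9, 11e9, 10e9, 12e9]

exfiltration_read = 10e9  # the attacker later copies ~10 GB

for name, data in [("clean", clean_training), ("poisoned", poisoned_training)]:
    threshold = baseline_threshold(data)
    flagged = exfiltration_read > threshold
    print(f"{name:8s} threshold = {threshold / 1e9:5.2f} GB, flagged = {flagged}")
```

Against the clean training data the 10 GB copy is flagged immediately; against the poisoned data it falls below the learned threshold and passes as normal.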

Most AI/ML projects generate and operate against massive quantities of file-based data stored on scale-out NAS platforms. Being able to monitor this data through the entire data pipeline is critical to ensuring that the enterprise can trust the output of the AI model. If such an attack sounds difficult to pull off, watch the video demonstration of a successful ML/AI attack that used the steps above to let the attacker go undetected.

Summary

The Superna Data Security Edition portfolio of cyberstorage-aware solutions is designed to prevent this type of attack, with end-to-end event integrity and in-RAM processing of anomalies.

This architecture denies attackers the opportunity to manipulate the audit data input that’s being collected directly from the storage device. The audit data on storage devices is also stored in a read-only, non-modifiable format that further prevents attacks against the source of the training data set.
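
The general idea behind end-to-end event integrity can be illustrated with a hash chain over collected audit records: each record’s digest incorporates the digest of the record before it, so any later edit, insertion or deletion breaks the chain and is detected before the data ever reaches the training step. The sketch below is a generic illustration of that concept in Python, not a description of Superna’s implementation.

```python
# Generic tamper-evidence sketch: a hash chain over audit records.
# Any edit, insertion or deletion after collection breaks the chain.
import hashlib
import json


def chain(records):
    """Return (record, digest) pairs where each digest covers all prior records."""
    prev = "0" * 64
    chained = []
    for record in records:
        payload = prev + json.dumps(record, sort_keys=True)
        prev = hashlib.sha256(payload.encode()).hexdigest()
        chained.append((record, prev))
    return chained


def verify(chained):
    """Recompute the chain and confirm every stored digest still matches."""
    prev = "0" * 64
    for record, digest in chained:
        payload = prev + json.dumps(record, sort_keys=True)
        prev = hashlib.sha256(payload.encode()).hexdigest()
        if prev != digest:
            return False
    return True


audit = chain([
    {"user": "alice", "bytes_read": 200_000_000},
    {"user": "mallory", "bytes_read": 10_000_000_000},
])
print(verify(audit))             # True: untouched records verify

audit[1][0]["bytes_read"] = 1    # attacker rewrites their own record
print(verify(audit))             # False: the tampering is detected
```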

This is only one example of the many possible attack vectors on ML/AI. We believe that any ML/AI project should invest in the security of its training data sets, with a complete, real-time and historical cyberstorage-capable solution that protects AI models from the “data poisoning” attack vector.

Superna supports all major storage platforms, helping to ensure that your AI/ML initiatives can be protected from data poisoning attacks.

Prevention is the new recovery

For more than a decade, Superna has provided innovation and leadership in data security and cyberstorage solutions for unstructured data, both on-premises and in the hybrid cloud. Superna solutions are utilized by thousands of organizations globally, helping them to close the data security gap by providing automated, next-generation cyber defense at the data layer. Superna is recognized by Gartner as a solution provider in the cyberstorage category. Superna… because prevention is the new recovery!