Machine Learning: A cornerstone of Superna’s vision for cyberstorage

  • Date: Jan 11, 2024
  • Read time: 7 minutes

Securing data requires multiple approaches and layers of security, based on the characteristics of the threat.

The Threat Landscape

The threat landscape is constantly evolving. It includes polymorphic malware, zero-day exploits, timed attacks, data exfiltration, partial data encryption, and staged attacks, some of which lie dormant for days before executing. And as the world becomes more reliant on digital communication, the list of attack vectors continues to expand. How can a CISO keep up?

Have Attacker Objectives Changed?

Ransomware and cybercriminals go hand-in-hand, and the frequency of attacks keeps rising. Attackers continue to target unstructured data, as it tends to represent a high-value asset that most businesses are more than willing to pay to recover.

The methods used to gain a foothold inside your IT infrastructure continue to evolve rapidly, but the prize remains the same: access to your data for a) encryption or b) data exfiltration. Encrypting and holding data for ransom continues to be a tactical mainstay of many attackers, but double extortion – combining encryption and data exfiltration – is an emergent threat that’s becoming more widespread. 

Gartner has recognized the need for security at the storage layer and created a new product category: cyberstorage. The category targets a protection layer focused solely on your data, whether that data is stored as files or objects. Having launched our first Ransomware Defender product back in 2017, Superna has come to be recognized as a leader and innovator in this new product category.

Where does Cyberstorage Fit within the Security Tools Ecosystem?

In a layered security architecture, cyberstorage operates in a security domain that has been largely ignored by current endpoint and network-focused security products.

Back to Machine Learning and Storage Security

If we look at storage protection for file and object data, along with the metadata that storage devices make available, we find several attributes of interest for a machine learning solution. For example, storage devices are able to:

  1. Record the authenticated User ID
  2. Record the IP Address of the host originating the data request
  3. Record what the user was doing, e.g. reading, writing, deleting, renaming, or some other data manipulation, such as changing permissions or Access Control Lists (ACLs)
  4. Record the time of day and the frequency of activity for a given user or host
  5. Record the amount of data read or written, where available
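
As a rough illustration, the five attributes above could be captured in a single record type. This is a minimal sketch, not a description of any storage platform's audit format: the `AuditEvent` and `Operation` names, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Operation(Enum):
    """Kinds of activity a storage audit log might report (illustrative set)."""
    READ = "read"
    WRITE = "write"
    DELETE = "delete"
    RENAME = "rename"
    SET_ACL = "set_acl"


@dataclass(frozen=True)
class AuditEvent:
    """One storage audit record; field names are assumptions for this sketch."""
    user_id: str                 # authenticated user
    source_ip: str               # host that originated the request
    operation: Operation         # what the user was doing
    timestamp: datetime          # when it happened
    bytes_transferred: Optional[int] = None  # may not always be available


# Hypothetical sample record
event = AuditEvent(
    user_id="jsmith",
    source_ip="10.0.4.17",
    operation=Operation.READ,
    timestamp=datetime(2024, 1, 11, 9, 30, tzinfo=timezone.utc),
    bytes_transferred=4096,
)
```

Keeping `bytes_transferred` optional reflects the point that the amount of data moved is only sometimes reported.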

Visualizing this metadata as a time series allows patterns to emerge as users and applications touch the data. Those patterns can then be used for machine learning, including anomaly detection and prediction.
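
To make the time series idea concrete, here is a minimal sketch that groups audit events into hourly buckets per user. The `hourly_counts` function and the sample events are hypothetical, and a real deployment would likely use a purpose-built time series store rather than an in-memory counter.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(events):
    """Count events per (user_id, hour bucket).

    events: iterable of (user_id, timestamp) pairs.
    """
    counts = Counter()
    for user_id, ts in events:
        # Truncate the timestamp to the start of its hour
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        counts[(user_id, bucket)] += 1
    return counts


# Hypothetical sample activity
sample = [
    ("jsmith", datetime(2024, 1, 11, 9, 5)),
    ("jsmith", datetime(2024, 1, 11, 9, 40)),
    ("jsmith", datetime(2024, 1, 11, 10, 2)),
    ("backup-svc", datetime(2024, 1, 11, 9, 59)),
]
series = hourly_counts(sample)
```

Each key in `series` is one point on a per-user activity curve, which is the shape of data a baseline can be learned from.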

In a previous post, we looked at how this very same metadata can be used to understand the business patterns that define performance baselines, and how Machine Learning predictions can then look for signs of performance degradation.

If we use the exact same metadata, we can look for other patterns that are relevant to securing data from attackers, regardless of the attack vector they use. The advantage of this kind of defense stems from the fact that storage layer security operates independently of any vulnerability used by an attacker to breach an IT system with access to the data. It’s this independence that provides cyberstorage solutions with an advantage over other security domains when it comes to detection and response. The diagram below summarizes some of the relevant drawbacks of endpoint protection and network-based detection when it comes to protecting your data, compared with the capabilities of the cyberstorage security domain.

Cyberstorage Offers Unique Offensive Data Protection Capabilities

In modern enterprise security architecture, SOAR (Security Orchestration, Automation and Response) platforms are deployed to automate responses and leverage detection capabilities across all security domains. By integrating cyberstorage into SOAR workflows and playbooks, enterprises can achieve a composable security architecture while reducing the time it takes to respond to threats in each domain.

When it comes to offensive security, the key to using Machine Learning is to focus on anomaly detection over long-running patterns. This means “baselining” or training the machine learning on exactly what “normal” is supposed to look like. In order to do this, the learning needs to span days, not just minutes or seconds. Business IO patterns will, in fact, appear different, depending on the day, time of day, or day of the week. By expanding the learning window, we’re able to obtain a much more accurate picture of what is happening to the data, as well as where and when it’s happening.
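
One simple way to picture this kind of baselining is to learn a separate "normal" for each day-of-week and hour-of-day slot, then score new activity against the matching slot. The sketch below uses a plain mean and standard deviation as a stand-in for whatever model is actually chosen; `build_baseline`, `z_score`, the `min_std` floor, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def build_baseline(observations):
    """Learn (mean, std) per (weekday, hour) slot.

    observations: iterable of (timestamp, value) pairs spanning days or weeks.
    """
    slots = defaultdict(list)
    for ts, value in observations:
        slots[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in slots.items()}


def z_score(baseline, ts, value, min_std=1.0):
    """Distance from normal for this slot, or None if the slot was never seen."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return None
    mu, sigma = baseline[slot]
    # Floor the deviation so a perfectly flat history does not divide by zero
    return (value - mu) / max(sigma, min_std)


# Four consecutive Mondays at 09:00 (2024-01-01 is a Monday), hypothetical IO counts
start = datetime(2024, 1, 1, 9)
history = [(start + timedelta(weeks=w), v) for w, v in enumerate([100, 110, 90, 100])]
baseline = build_baseline(history)

# Score a new Monday 09:00 observation against the learned slot
score = z_score(baseline, datetime(2024, 1, 29, 9), 500)
```

Because each slot only gains one sample per week, a baseline like this needs a learning window measured in days or weeks before it says anything useful.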

The primary goal is to find the needle in a haystack of patterns that are all otherwise normal. The secondary goal is to reduce the false positives that are an inevitable byproduct of machine learning approaches.

A Practical Example

If we look at data exfiltration as a primary example of an attack that could be executed over the course of several days, it’s obvious that multiple factors need to be considered for detecting this type of attack.   

Two possible scenarios: a) the attacker takes over a host that has access to the target data and regularly accesses that data, or b) the attacker uses a compromised host to read data that might not typically be accessed from this host. Both scenarios lend themselves to a machine learning approach, as models can handle known hosts in the training data in order to baseline what’s “normal”, along with capabilities to track unknown hosts within the training data.  
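
The distinction between the two scenarios can be sketched with a simple lookup learned from training data. This is an illustrative stand-in for a trained model, not a detection method in itself; `learn_access_map`, `classify`, and the host and share names are hypothetical.

```python
from collections import defaultdict


def learn_access_map(training_events):
    """Record which data each host touched during the training period.

    training_events: iterable of (host, share) pairs.
    """
    access = defaultdict(set)
    for host, share in training_events:
        access[host].add(share)
    return dict(access)


def classify(access_map, host, share):
    """Label an access relative to what was seen in training."""
    if host not in access_map:
        return "unknown_host"      # host never appeared in training data
    if share not in access_map[host]:
        return "unusual_target"    # scenario b: known host, atypical data
    return "seen_before"           # scenario a must be caught by volume or rate


# Hypothetical training data
access_map = learn_access_map([
    ("app01", "finance"),
    ("app01", "hr"),
    ("ws-17", "engineering"),
])
```

A `seen_before` result is not a clean bill of health: in scenario a the host and data are both familiar, so detection has to rely on how much is read and how fast.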

The time period used to execute training will vary depending on the type of prediction or anomaly detection. In the case of data exfiltration, other dimensions can be considered, such as a) data read over a long period of time, or b) data read over a shorter period of time but at higher rates.      
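
These two dimensions can be checked side by side by summing reads over a long window and a short one. The sketch below uses fixed limits purely for illustration, where a trained model would supply learned thresholds; the function names, window sizes, and byte limits are all assumptions.

```python
from datetime import datetime, timedelta

GB = 10**9


def window_total(reads, end, window):
    """Sum bytes read in the interval (end - window, end]."""
    start = end - window
    return sum(n for ts, n in reads if start < ts <= end)


def exfiltration_flags(reads, now, long_window, long_limit, short_window, short_limit):
    """Check one host's reads against a long-window and a short-window limit."""
    flags = []
    if window_total(reads, now, long_window) > long_limit:
        flags.append("slow_and_steady")   # a) large volume over a long period
    if window_total(reads, now, short_window) > short_limit:
        flags.append("burst")             # b) high rate over a short period
    return flags


now = datetime(2024, 1, 11, 12)
limits = dict(long_window=timedelta(days=7), long_limit=10 * GB,
              short_window=timedelta(hours=1), short_limit=5 * GB)

# Hypothetical host reading 2 GB once a day for a week
slow = [(now - timedelta(days=d), 2 * GB) for d in range(7)]
# Hypothetical host reading 1 GB every ten minutes for the last hour
burst = [(now - timedelta(minutes=m), GB) for m in range(0, 60, 10)]

slow_flags = exfiltration_flags(slow, now, **limits)
burst_flags = exfiltration_flags(burst, now, **limits)
```

Neither window alone catches both hosts, which is why the training period has to match the kind of anomaly being looked for.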

Machine learning algorithms are able to factor in multiple inputs or data features; it’s vital to understand all the features of the training data and the algorithm selection, in order to build both the training model and the anomaly detection algorithm.

Looking ahead

At Superna, we’ve been testing a variety of algorithms for both training and anomaly detection/prediction, with great success. We’ve actively engaged customers on live training and anomaly detection. We’ll soon be introducing a Machine Learning Framework within our security products to enable on-prem training and anomaly detection covering a wide range of data threats. This framework is designed to run without an internet connection or cloud resources, and can operate completely on-prem as required.

Storage security must enable detection of new threats by combining feature inputs such as time of day, new or existing host, and normal vs. suspicious IO patterns. The power of machine learning is the ability to easily re-use the framework to solve a different problem with the exact same input data, simply by changing the algorithm, time frame, and anomaly detection method.
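
The re-use idea can be pictured as a small pipeline whose time frame and detection rule are swappable while the input data stays the same. This is a hedged sketch only and does not describe the framework mentioned in this post: the `Pipeline` class, both detector factories, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Callable, Sequence

# A detector takes the history inside the window plus the new value
Detector = Callable[[Sequence[float], float], bool]


def mean_ratio_detector(factor: float) -> Detector:
    def detect(history, value):
        return bool(history) and value > factor * mean(history)
    return detect


def median_ratio_detector(factor: float) -> Detector:
    def detect(history, value):
        return bool(history) and value > factor * median(history)
    return detect


@dataclass
class Pipeline:
    name: str
    window: timedelta    # the time frame
    detector: Detector   # the anomaly detection rule

    def evaluate(self, samples, now, value):
        """Apply the detector to samples inside [now - window, now)."""
        history = [v for ts, v in samples if now - self.window <= ts < now]
        return self.detector(history, value)


now = datetime(2024, 1, 11, 12)
# Hypothetical hourly read volume: quiet for a week, elevated for the last 3 hours
samples = [(now - timedelta(hours=h), 100.0 if h <= 3 else 10.0) for h in range(1, 169)]

weekly = Pipeline("weekly-mean", timedelta(days=7), mean_ratio_detector(3))
recent = Pipeline("recent-mean", timedelta(hours=3), mean_ratio_detector(3))
robust = Pipeline("weekly-median", timedelta(days=7), median_ratio_detector(3))

results = {p.name: p.evaluate(samples, now, 120.0) for p in (weekly, recent, robust)}
```

With identical input, the seven-day pipelines flag the new value while the three-hour pipeline does not, because its short history has already absorbed the elevated activity.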

We will soon be posting a live video demo showcasing this new Machine Learning framework for cyberstorage, so stay tuned!

Prevention is the new recovery

At Superna, we’ve positioned ourselves at the forefront of securing these large, complex unstructured data footprints. We work closely with a variety of storage platforms to help ensure that our customers are empowered to copy, move, audit and secure their most precious commodity: their data. And after nearly 15 years of engineering innovation and tireless commitment, we’ve successfully solved for the security, audit and orchestration challenges of the largest digital data footprint on earth: unstructured data.