Structured and unstructured data are on a collision course, and security will be collateral damage

  • Date: Oct 30, 2024
  • Read time: 4 minutes

As Artificial Intelligence (AI) and Machine Learning (ML) transform enterprise data management, they also introduce new security challenges. The future of data management increasingly blurs the boundaries between structured and unstructured data, opening novel avenues for attackers. Existing security tools, however, fail to adequately address these new gaps across diverse data stores.

This blog explores the security challenges within Data Lakes and how Superna’s Cyber Storage platform enables comprehensive data pipeline security.

Overview

File-based, distributed databases that utilize data locality for querying have expanded Data Lakes to unprecedented scales. Popular distributed databases, query and processing engines, and data formats that build on file or object storage include:

– Apache Cassandra

– Apache HBase

– Apache Hive

– Apache Spark

– Apache Flink

– File and table formats like Parquet, ORC, and Iceberg

Advancements in affordable flash storage and distributed “shared-nothing” clustering have enabled these file- and object-based Data Lakes to rival traditional relational databases in performance.

With AI and ML requiring extensive data pools for enterprise-wide analysis, there’s a shift toward making all data accessible for model training and generative AI applications. Most enterprise data is file-based, with object storage as a close second. As enterprises seek to extract insights from all their data, they must bridge their data silos.

Enterprise Storage Architecture for Exploratory Data Analysis (EDA)

To unlock the value of corporate data, enterprises need exploratory data analysis (EDA) capabilities, which involve querying and processing data from various sources—databases, file systems, object stores, and applications—using data manipulation languages. This drives the demand for a centralized, accessible pool of data. EDA enables data cleansing, filtering, and joining, eliminating inconsistencies that could impair ML training.
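As a minimal sketch of the cleansing, filtering, and joining steps described above, the following uses Python's built-in sqlite3 module with an in-memory database. The table names, columns, and sample rows are hypothetical and only stand in for data drawn from different silos; a real Data Lake would run a similar query through a federated SQL engine rather than SQLite.

```python
import sqlite3


def prepare_training_rows(conn):
    """Cleanse, filter, and join two hypothetical source tables."""
    query = """
        SELECT c.customer_id,
               LOWER(TRIM(c.region)) AS region,       -- cleanse: normalize text
               SUM(o.amount)         AS total_spend
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.customer_id  -- join two silos
        WHERE o.amount IS NOT NULL AND o.amount > 0        -- filter bad rows
        GROUP BY c.customer_id, LOWER(TRIM(c.region))
        ORDER BY c.customer_id
    """
    return conn.execute(query).fetchall()


# Hypothetical sample data with inconsistent text and invalid amounts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, ' EMEA '), (2, 'apac');
    INSERT INTO orders VALUES (1, 100.0), (1, NULL), (1, 50.0), (2, -5.0), (2, 20.0);
""")

rows = prepare_training_rows(conn)
print(rows)
```

The NULL and negative amounts are dropped before aggregation, and the region values are normalized, so inconsistent rows never reach model training.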

In ML workflows, EDA supports functions that help prepare and preprocess data. The development and testing phases in the pipeline align with Gartner's AI TRiSM (trust, risk, and security management) model. The Superna solution addresses the largest attack surface, which is the Data Lake.

Superna’s Cyber solution for ML, AI, and Data Lakes

Data Silo Integration

SQL, a decades-old language, remains the bridge between structured databases and unstructured Data Lakes. Tools like Starburst and Trino lead the movement toward universal Data Lakes, simplifying data analysis, cleansing, and preparation tasks essential to AI and ML workflows. These tools also support feature engineering—transforming raw data into structured datasets for model training.

When enterprise architecture undergoes significant change, it creates opportunities to harness data value, though these benefits may come at a transitional cost.

Data Lakes for AI: The Security Impact

Data Lakes often bring security complexities. Traditional security tools—designed to protect separate databases, file systems, or object storage—lack unified oversight. Each data source relies on its own security tools, leading to a fragmented view of data security. No single tool offers comprehensive visibility.

Trino’s data connectors centralize control by bridging databases and unstructured data sources, but they rely on privileged service accounts to access each source. These service accounts obscure the actual users behind data manipulations, depriving security tools of essential context.

The Resulting Security Gap

Service accounts mask user activity within the Data Lake, meaning file, database, and object storage security tools cannot detect the users executing data manipulations. This creates a blind spot that attackers could exploit.
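To illustrate the blind spot, the sketch below simulates a query engine that accepts queries from named users but reads storage through a single service account. The class names, the account name `svc-trino`, the file paths, and the log layout are all hypothetical; they do not reflect the real audit schema of Trino or of any storage product.

```python
from dataclasses import dataclass, field


@dataclass
class StorageAudit:
    """Hypothetical storage-side audit log: records only the connecting account."""
    events: list = field(default_factory=list)

    def read(self, account, path):
        self.events.append({"account": account, "path": path})


@dataclass
class QueryEngine:
    """Hypothetical SQL layer that reaches storage via one privileged account."""
    storage: StorageAudit
    service_account: str = "svc-trino"  # assumed name, for illustration only
    query_log: list = field(default_factory=list)

    def run(self, user, table, files):
        # The SQL layer knows who the real user is...
        self.query_log.append({"user": user, "table": table})
        # ...but storage only ever sees the service account.
        for path in files:
            self.storage.read(self.service_account, path)


storage = StorageAudit()
engine = QueryEngine(storage)
engine.run("alice", "sales", ["/lake/sales/part-0.parquet"])
engine.run("mallory", "hr", ["/lake/hr/part-0.parquet", "/lake/hr/part-1.parquet"])

accounts_seen_by_storage = {e["account"] for e in storage.events}
print(accounts_seen_by_storage)
```

Two different users ran queries, yet every storage event carries the same account, so a tool that watches only the storage layer cannot tell their activity apart.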

Improving Cybersecurity in Data Lakes

Superna’s Cyber Storage platform consolidates file and object security, using AI to monitor for anomalies in both data types.

The next evolution of Cyber Storage integrates SQL data manipulation protocols across Data Lakes, offering visibility into both structured and unstructured data access patterns. By mapping logical tables in the Data Lake to the corresponding physical files or objects and associating user context with SQL-layer actions, it provides holistic insight into data manipulations.
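One way to picture this correlation, offered as a sketch under stated assumptions rather than a description of Superna's implementation: keep a catalog that maps each logical table to its physical files, then attribute each storage event to the user whose SQL query touched that table shortly before. The catalog contents, record types, and the 5-second window are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical catalog: logical table -> physical files backing it.
TABLE_FILES = {
    "sales": {"/lake/sales/part-0.parquet", "/lake/sales/part-1.parquet"},
    "hr": {"/lake/hr/part-0.parquet"},
}


@dataclass(frozen=True)
class Query:
    user: str
    table: str
    started_at: float  # seconds since epoch


@dataclass(frozen=True)
class StorageEvent:
    account: str
    path: str
    at: float


def attribute_events(queries, events, window=5.0):
    """Pair each storage event with the most recent matching user, or None."""
    attributed = []
    for ev in events:
        candidates = [
            q for q in queries
            if ev.path in TABLE_FILES.get(q.table, set())
            and 0 <= ev.at - q.started_at <= window
        ]
        # Prefer the query that started closest to the storage access.
        best = max(candidates, key=lambda q: q.started_at, default=None)
        attributed.append((ev.path, best.user if best else None))
    return attributed


queries = [Query("alice", "sales", 100.0), Query("bob", "hr", 102.0)]
events = [
    StorageEvent("svc-trino", "/lake/sales/part-0.parquet", 100.4),
    StorageEvent("svc-trino", "/lake/hr/part-0.parquet", 102.1),
    StorageEvent("svc-trino", "/lake/hr/part-0.parquet", 500.0),  # no query nearby
]
result = attribute_events(queries, events)
print(result)
```

The last event has no SQL query that explains it, which is exactly the kind of access that combined SQL-layer and storage-layer visibility would surface for investigation.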

Zero Trust Data Lake Security

Data location-independent security requires user context, causality, and traceability to model normal and anomalous behavior effectively. Superna’s Cyber Storage platform integrates these requirements into a single solution tailored to secure the future of Enterprise Data Lakes.
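As an illustration only, per-user behavior could be modeled by comparing current activity with that user's own history. A simple z-score baseline is swapped in here; the source does not describe the platform's actual models. The sample counts and the threshold are assumptions.

```python
from statistics import mean, pstdev


def is_anomalous(history, today, threshold=3.0):
    """Flag `today` if it exceeds the user's mean by more than `threshold` std devs."""
    if len(history) < 2:
        return False  # not enough history to model "normal" behavior
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today > mu  # any increase over a perfectly flat baseline
    return (today - mu) / sigma > threshold


# Hypothetical daily counts of files read on behalf of one user.
alice_history = [10, 12, 11, 9, 13]

normal_day = is_anomalous(alice_history, 12)
bulk_read_day = is_anomalous(alice_history, 400)
print(normal_day, bulk_read_day)
```

The baseline is keyed by the real user rather than the service account, which is why user context has to be recovered before behavior can be modeled at all.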

In the future, Superna will enable the Zero Trust model for file, object, and SQL access to create the Zero Trust Data Lake.

Stay tuned for a demonstration of Superna’s Cyber Storage for Data Lakes solution.