
AI in the Enterprise: New Opportunities, Newer Risks

  • Date: Jul 17, 2024
  • Read time: 13 minutes

Safeguarding your data pipeline today helps prevent problems tomorrow

Background

Artificial Intelligence (AI) and Machine Learning (ML) in the enterprise promise to unlock new value and provide superior insights, for enhanced decision-making and better, more consistent outcomes. But they also bring new risks: risks to integrity, privacy, and effectiveness. You see, hackers aren’t really interested in your apps… they’re after your DATA. And AI generates massive amounts of it, all unstructured… a prime target for cyberthreats including ransomware, malicious encryption, exfiltration, and data poisoning.

Overview

With AI, you need to secure both your production data and your data pipeline – your data’s chain-of-custody, as it were – to ensure its availability… and integrity. Traditional approaches to storage security no longer cut it. For CIOs and CISOs, the focus has now shifted to what Gartner refers to as Cyberstorage solutions, with protection integrated with the storage itself. The goal is to enable earlier detection of threats using a layered approach that incorporates active defense technologies and provides automated remediation and recovery. Essentially, it’s protection by design rather than reliance on traditional backups. This helps to close the data security gap left by traditional approaches to protection of stored data.

Large Language Models: The backbone of AI

Most of the AI and ML initiatives that we’re seeing in the enterprise are based on Large Language Models (LLMs). LLMs are transformative tools that have reshaped how businesses interact with data and technology. A large language model, such as GPT-4, refers to a type of artificial intelligence that’s been trained on vast amounts of textual data in order to “understand” and generate human-like text responses. Some key aspects of an LLM include:

  • Training Data: LLMs are typically trained on massive datasets comprising a diverse range of sources, including books, articles, websites, and other textual material. This training data is used to “teach” the model the nuances of human language – including grammar, syntax, semantics, and context – providing a foundation for the model to generate human-like responses to user queries or “prompts.”
  • Architecture: These models are built using advanced deep learning architectures, often based on transformer networks. Transformers are designed to efficiently handle sequential data, making them well-suited for tasks requiring understanding and generation of text.
  • Capabilities: LLMs are capable of various language-related tasks such as text generation, translation, summarization, sentiment analysis, and answering questions. They can generate coherent and contextually appropriate responses based on the prompts or queries they’re given (a minimal code sketch of this prompt-in, text-out behavior follows this list).
  • Scalability: In the context of these models, “Large” refers both to the size of the dataset on which they’re trained and to the size of the model itself in terms of parameters (weights and connections). Models like GPT-4 have billions of parameters, allowing them to capture and utilize more complex patterns in language.
  • Applications: LLMs have numerous practical applications that span many industries – things like customer service automation, content generation, language translation, and even scientific research — where processing and understanding large volumes of text data is essential.
  • Ethical Considerations: The development and deployment of LLMs raise ethical concerns around a myriad of issues, including training-data bias, misuse of generated content, and the very real potential for creating misleading or harmful information.
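
As a concrete illustration of the prompt-in, text-out behavior described under Capabilities above, here’s a minimal sketch using the open-source Hugging Face transformers library and the small GPT-2 model. Both are stand-ins chosen purely for illustration; production LLMs like GPT-4 are vastly larger and typically accessed through vendor APIs.

```python
# A minimal, illustrative text-generation example using the open-source
# Hugging Face "transformers" library and the small GPT-2 model.
# (Stand-ins for illustration; production LLMs are far larger.)
from transformers import pipeline

# Build a text-generation pipeline; the model is downloaded on first run.
generator = pipeline("text-generation", model="gpt2")

prompt = "Enterprise AI initiatives succeed when"
outputs = generator(prompt, max_new_tokens=40, num_return_sequences=1)

# Each result contains the prompt followed by the model's continuation.
print(outputs[0]["generated_text"])
```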

Generative AI – a use case based on LLMs – is currently the fastest-growing category in the AI market. Valued at $300 billion in 2023, it’s projected to reach $4.31 trillion by 2030, with the potential to dramatically transform business processes across all sectors.

Some examples of LLMs

  • GPT-4 – Developed by OpenAI, GPT-4 is a cutting-edge language model known for its deep learning capabilities. The model’s advanced Natural Language Understanding and ability to generate well-considered responses enable high-quality text processing; contextually-relevant outputs; and efficient automation of text-based tasks. Common applications include customer service, science and research, and education.
  • Google AI’s LaMDA focuses on conversational AI, adept at maintaining context even over long dialogue sessions, while spanning a broad range of conversational topics with a natural flow. Features include enhanced conversational abilities; more natural user interactions; and improved engagement in digital communications. LaMDA is being applied to chatbots, virtual assistants, and customer service interfaces.
  • LLaMA from Meta AI has been engineered for deep language analysis, and is capable of interpreting and processing intricate language structures and understanding subtle language nuances. Used for processing complex language patterns; providing accurate sentiment analysis; and efficient text summarization. Current applications include sentiment analysis tools, text summarization platforms, and advanced language interpretation systems.
  • BLOOM (BigScience) – With 176 billion parameters, BLOOM is able to generate text in 46 natural languages and 13 programming languages. As a collaborative, open-source initiative, BLOOM is multi-faceted and supports a wide array of languages and dialects, making it highly inclusive and versatile for global linguistic tasks. It’s notable for fostering innovation through collaboration; versatility in diverse language tasks; and encouraging community-driven development. It’s currently being used in academic research, large-scale content generation, and collaborative AI projects.
  • PaLM by Google AI specializes in multilingual processing, with strong capabilities in handling a variety of languages and dialects with advanced language comprehension and generation techniques. Highlights include the ability to handle intricate language nuances; robust multilingual capabilities; and scalability in language processing. It’s used today as a basis for translation services, creative writing, and multilingual content creation.
  • Gemini (Google AI) – Focused on creativity and narrative, Gemini has been designed to excel at generating engaging, imaginative text, making it ideal for creative storytelling and content creation. Features include the ability to create novel narratives; engaging storytelling capabilities; and development of creative content for marketing and scriptwriting. It’s ideal for storytelling apps, developing marketing content, and for use in scriptwriting software applications.
  • Cohere’s language model is designed for flexibility and ease of integration, making it an ideal choice for embedding advanced language understanding into a wide range of applications. It’s optimized for Generative AI, offering highly adaptable language processing and user-friendly AI integration. Cohere is currently being used in business tools for customer support automation and content moderation use cases.
  • Claude (Anthropic) – Optimized for instantaneous responses and interactions, Claude is particularly suited for real-time conversational AI, providing rapid and context-aware responses in dialogues. Its features include seamless, responsive dialogue; real-time interaction capabilities; and enhanced user experiences in virtual assistance. It’s being applied to interactive chatbots, virtual assistants, and customer service platforms.
  • Megatron-Turing NLG – A collaborative effort between NVIDIA and Microsoft, Megatron-Turing NLG is a colossal language model, notable for its ability to process and analyze language at very large scale, handling extensive datasets with complex linguistic structures. It’s known for easily handling large-scale language tasks; advanced research capabilities; and sophisticated data analysis. It’s currently being used in high-level research and complex language processing applications.
  • Wu Dao 2.0 (BAAI) – Developed by the Beijing Academy of Artificial Intelligence, Wu Dao 2.0 is a multimodal language model, unique in its capability to understand and generate both text and image data. Its strengths include the ability to understand and blend text and images, and versatility in visual-textual applications such as image captioning, making it ideal for content creation involving both images and text, and for virtual reality applications.
  • AI21 Labs Jamba – Jamba is the world’s first production-grade Mamba-based model. By enhancing Mamba Structured State Space Model (SSM) technology with elements of traditional Transformer architecture, Jamba compensates for the inherent limitations of a pure SSM model. Offering a 256K context window, it is already demonstrating remarkable gains in throughput and efficiency. Jamba outperforms or matches other state-of-the-art models in its size class on a wide range of benchmarks. Benefits include high throughput and efficiency; high-quality, accurate text output; and a reduced memory footprint. Jamba is a base model intended for use as a foundation layer for fine-tuning, training, and developing custom solutions. It’s used in automated content generation and data analysis platforms.

LLMs in the Enterprise: New opportunities, newer risks

Over the coming months, expect to see AI and LLMs rolled out everywhere, across every vertical. Businesses of all shapes and sizes are racing to deploy, looking to use AI to augment (or supplant) human workers in order to capitalize on the very real potential it has to dramatically impact the enterprise through increased speed and efficiency; cost reduction; enhanced decision-making; accessibility; scalability; and more.

Of course, while this exciting new technology promises to reshape the technological landscape with massive impact to businesses and consumers alike, it brings with it some unique problems around data security, going far beyond what can be managed through traditional data protection and backup snapshot technologies. For sure, you need to secure the massive output of unstructured data that’s being generated. But if you’re not securing your input data – your training data – you’re leaving yourself wide open to numerous threats, with the very real possibility of catastrophic consequences.

Locking-down the AI Data Pipeline to secure your training data

Risks to the data being used to train your AI can impact the integrity, privacy, and effectiveness of the AI system:

  • Data Breaches: If AI training data is not properly secured, it’s vulnerable to data breaches. This could lead to unauthorized access, theft of sensitive information, or even manipulation of data, compromising the AI model’s reliability. (A simple integrity-check sketch follows this list.)
  • Privacy Issues: Your training data might contain personal or sensitive information about individuals. Inadequate protection could result in privacy violations or regulatory non-compliance, especially with regulations like GDPR (General Data Protection Regulation) in the EU. And in spite of the possibility of regulatory roll-backs in a volatile political climate, privacy breaches can still be problematic for an enterprise.
  • Bias Amplification: If the training data is biased or inaccurate, an AI model can perpetuate and even amplify these biases, leading to unfair or discriminatory outcomes in decision-making processes.
  • Data Quality Issues: Poor-quality training data (incomplete, inaccurate, or outdated) can degrade the AI model’s performance and accuracy, affecting its ability to make reliable predictions or classifications.
  • Adversarial Attacks: Malicious actors may attempt to manipulate AI systems by feeding them carefully crafted input data (adversarial examples) to cause the AI to make incorrect decisions or predictions.
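
Defending against several of these risks starts with verifying training data before every run, not just after an incident. As a deliberately simple illustration of that idea (not a feature of any particular product), the sketch below records SHA-256 hashes of training files at ingest time and flags anything that later changes or disappears; the file paths and manifest format are hypothetical:

```python
# Illustrative sketch: detect tampering in a training-data directory by
# comparing SHA-256 hashes against a previously recorded manifest.
# The directory layout and manifest format here are hypothetical.
import hashlib
import json
from pathlib import Path

def build_manifest(data_dir: str) -> dict:
    """Hash every file under data_dir and return {path: sha256-hex}."""
    manifest = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            manifest[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest

def verify(data_dir: str, manifest_file: str) -> list:
    """Return paths that changed, appeared, or disappeared since ingest."""
    recorded = json.loads(Path(manifest_file).read_text())
    current = build_manifest(data_dir)
    changed = [p for p, h in current.items() if recorded.get(p) != h]
    missing = [p for p in recorded if p not in current]
    return changed + missing

# Record a manifest once at ingest time, then verify before each training run:
# json.dump(build_manifest("training_data/"), open("manifest.json", "w"))
# suspicious = verify("training_data/", "manifest.json")
```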

To give you a better idea of how your AI training data is at risk, here are some examples that highlight the nature of evolving threats to the data pipeline.

  • Nightshade: Artists have started using tools like Nightshade to protect their work from unauthorized scraping by AI models. Developed by researchers at the University of Chicago, Nightshade manipulates images in a way that corrupts the AI models’ ability to generate accurate outputs. For instance, by introducing as few as 50 poisoned images into the training dataset, Stable Diffusion’s outputs were significantly distorted, generating bizarre images like dogs with multiple limbs. With just 300 poisoned samples, the model could be manipulated to generate cats instead of dogs. This technique demonstrates how even a small number of corrupted samples can have a profound impact on generative AI models. (from MIT Technology Review)
  • Generalized and Targeted Attacks: Data poisoning can be broadly categorized into generalized and targeted attacks. Generalized attacks aim to reduce the overall accuracy of the model by introducing corrupted data, leading to misclassifications and false positives or negatives (a toy demonstration follows this list). Targeted attacks, on the other hand, affect only specific subsets of the data, so the model continues to perform as “normal” with most inputs, making this kind of attack much more difficult to detect. As you can imagine, it’s particularly insidious because it can go essentially unnoticed until significant damage has been done. (from TechRadar)
  • Model Extraction: Attackers try to steal the AI model itself, which can involve reverse engineering or extracting the model’s parameters and architecture. This can lead to theft of intellectual property or allow bad actors to create adversarial models. (from Nightfall AI)
  • Prompt-Specific Poisoning: Text-to-image models, like those used in diffusion models, are vulnerable to prompt-specific poisoning attacks. These attacks involve injecting poisoned data into the training set to manipulate the model’s output for specific prompts. For example, an attacker could corrupt a model to generate images of cats when prompted with “dog” or replace anime styles with oil paintings. This kind of attack exploits concept sparsity in training data, where certain concepts have limited representation, making them more susceptible to corruption. (from ar5iv)
  • Model Inversion: Model inversion is a machine learning security threat in which an attacker uses a model’s outputs to infer sensitive information about the data it was trained on. By repeatedly querying the model and analyzing its responses, an attacker can reconstruct characteristics of individual training records, exposing private information that was used to train the model. (from Nightfall AI)
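
To get a feel for how little corrupted data a generalized poisoning attack needs, here’s a small, self-contained experiment (not drawn from any of the sources cited above) that randomly flips a fraction of training labels in scikit-learn’s bundled digits dataset and measures the effect on a simple classifier:

```python
# Toy demonstration of a *generalized* poisoning attack: randomly
# corrupting a fraction of training labels degrades test accuracy.
# Purely educational; uses scikit-learn's bundled digits dataset.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def accuracy_with_flipped_labels(flip_fraction: float) -> float:
    rng = np.random.default_rng(0)
    y_poisoned = y_train.copy()
    n_flip = int(flip_fraction * len(y_poisoned))
    idx = rng.choice(len(y_poisoned), size=n_flip, replace=False)
    # Overwrite the chosen labels with random digits (0-9).
    y_poisoned[idx] = rng.integers(0, 10, size=n_flip)
    model = LogisticRegression(max_iter=2000).fit(X_train, y_poisoned)
    return model.score(X_test, y_test)

for frac in (0.0, 0.1, 0.3):
    acc = accuracy_with_flipped_labels(frac)
    print(f"{frac:.0%} of labels corrupted -> test accuracy {acc:.3f}")
```

Accuracy should fall as the corruption fraction rises. A targeted attack, by contrast, aims to leave aggregate accuracy looking normal while corrupting behavior for specific inputs, which is exactly why it’s so much harder to detect.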

Evolving threats and the need for proactive measures

As AI and machine learning models become more prevalent, threats to the AI data pipeline – essentially, the data chain-of-custody – are expected to grow. It’s become clear that traditional cybersecurity measures such as snapshot backups are no longer enough. Top analyst firms like Gartner are now urging organizations to implement proactive measures – such as using high-speed data verifiers, maintaining strict access controls, monitoring and managing activity at the data layer, and employing statistical methods – to detect anomalies in the training data.
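
As one deliberately simplified example of those statistical methods, the sketch below uses an Isolation Forest from the open-source scikit-learn library to flag outlier training samples for human review. The synthetic features and contamination threshold are illustrative assumptions, not an analyst recommendation; a real pipeline would first extract features from the actual text or image data.

```python
# Illustrative sketch: flag statistically anomalous training samples
# with an Isolation Forest before they ever reach the model.
# Feature extraction for real text/image data is assumed done upstream.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
clean = rng.normal(loc=0.0, scale=1.0, size=(1000, 16))   # typical samples
poisoned = rng.normal(loc=6.0, scale=0.5, size=(10, 16))  # injected outliers
features = np.vstack([clean, poisoned])

detector = IsolationForest(contamination=0.02, random_state=0).fit(features)
flags = detector.predict(features)  # -1 marks suspected anomalies

print(f"Flagged {np.sum(flags == -1)} of {len(features)} samples for review")
```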

It’s become serious enough that regulatory bodies are now providing guidance on secure AI development in order to mitigate these risks. Of course, a November election could bring a considerable push for deregulation in the U.S., but regardless of whether compliance is mandated, common sense dictates that properly addressing evolving threats to AI and machine learning should remain a top priority for virtually any enterprise.

Key Takeaways

  • AI in the enterprise is here to stay, and CIOs and CISOs must safeguard the data pipeline to ensure both the integrity and availability of AI and ML initiatives.
  • Traditional data protection strategies and technologies such as access controls, honeypots, and snapshot backups are no longer adequate against constantly evolving cyberthreats.
  • You need to be evaluating storage-integrated data protection solutions – what Gartner refers to as cyberstorage – that deliver active technologies to identify, protect, detect, respond and recover from cyberattacks, especially at the data layer. This helps to close the data security gap left by traditional approaches to data protection and recovery.

Summary

AI and ML are here to stay. Their potential for benefit to the enterprise is enormous, but risks such as those outlined above underscore the importance of safeguarding your AI training data pipeline to ensure the reliability and integrity of your AI models. As technology advances, so do the tactics used by malicious actors, making it crucial for developers and organizations to stay vigilant and adopt robust defenses for their stored data.

Prevention is the new recovery

For more than a decade, Superna has provided innovation and leadership in data security and cyberstorage solutions for unstructured data, both on-premises and in the hybrid cloud. Superna solutions are utilized by thousands of organizations globally, helping them to close the data security gap by providing automated, next-generation cyber defense at the data layer. Superna is recognized by Gartner as a solution provider in the cyberstorage category. Superna… because prevention is the new recovery!