Source: www.csoonline.com
Distilled models can make LLMs more accessible and better suited to specific contexts, but they can also amplify existing AI risks, including threats to data privacy, integrity, and brand security.
As large language models (LLMs) gain mainstream adoption, they are pushing AI-driven applications to new levels of power and complexity. Running these massive models, however, comes at a price: the high costs and latency they entail make them impractical for many real-world scenarios.
Enter model distillation, a technique AI engineers use to pack the most useful capabilities of these “high-parameter” models into much smaller replicas. They do this by training a smaller “student” model from the ground up to mimic the behavior of a larger “teacher” model.
“Model distillation enables engineers to capture much of the operational capacity of a high-parameter model within the reduced computational footprint of a lower-parameter model,” said David Brauchler, technical director & head of AI and ML security at NCC Group. “Model distillation is most effective when the student model has a more limited purpose or knowledge domain than the generalized teacher model.”
While distillation enables cost savings, faster inference, and better operational efficiency, distilled models inherit many security risks from their teacher models, along with a few others of their own.
Students take on the teacher’s burden
Distilled models inherit a huge part of their teacher model’s behavior, including any security risks embedded in their training data. These risks include intellectual property theft, privacy leaks, and model inversion attacks.
“Typical model distillation uses the training data originally consumed by the larger teacher model alongside the teacher model’s predictions of valid possible outputs (i.e. the probability distribution of outputs),” Brauchler said. “Consequently, the student model has the opportunity to memorize many of the same behaviors as the teacher model, including sensitive data in the training sets.”
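For readers unfamiliar with the mechanics, the following is a minimal sketch of a distillation training step under the assumption of a PyTorch-style classifier setup; the `student_logits`, `teacher_logits`, and hyperparameters shown are illustrative placeholders, not details drawn from any system mentioned in this article.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    # Soften both output distributions so the student learns the teacher's
    # full probability distribution, not just its top-1 prediction.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence pulls the student toward the teacher's soft labels;
    # the temperature**2 factor keeps gradient magnitudes comparable.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy anchors the student to the ground-truth labels.
    ce = F.cross_entropy(student_logits, hard_labels)
    return alpha * kd + (1.0 - alpha) * ce

# Illustrative usage with random tensors standing in for real model outputs.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
hard_labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, hard_labels)
```

Because the student is optimized against the teacher's full output distribution, anything the teacher has memorized, including sensitive training data, can be imprinted on the student as well.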
Security vulnerabilities of teacher models carry over to the student through the transfer of latent knowledge, biases, and flaws. This means that DistilGPT-2, a student model distilled from GPT-2, is just as capable of leaking personally identifiable information (PII) from its training data when prompted in specific ways, the same behavior researchers demonstrated in GPT-2 in 2020.
The same distilled model is also potentially prone to model inversion through the kind of black-box extraction techniques researchers have demonstrated against models such as GPT-3.5. Smaller models represent less complex functions and are often more vulnerable to security attacks such as model inversion, Brauchler added.
Passed-down wisdom can distort reality
Rather than developing their own contextual understanding, student models rely heavily on their teacher models’ pre-learned conclusions. Whether this limitation leads to more hallucination is a matter of debate among experts.
Brauchler’s view is that a student model’s propensity to hallucinate is tied to its teacher’s, irrespective of how it was trained. In other words, if the teacher model doesn’t hallucinate, chances are the student won’t either.
While largely agreeing with that argument, Arun Chandrasekaran, VP analyst at Gartner, clarifies that student models may still suffer from newly introduced hallucinations, depending on their size and purpose.
“Distillation itself does not necessarily increase the rate of hallucinations, but if the student model is significantly smaller, it might lack the capacity to capture all the nuances of the teacher model, potentially leading to more errors or oversimplifications,” Chandrasekaran said.
A model that hallucinates can be exploited by threat actors, who craft adversarial prompts that manipulate its outputs to feed misinformation campaigns or AI-driven exploits.
One instance of hallucination being put to malicious use is WormGPT, discovered in 2023: an AI system deliberately trained on unverified, potentially biased, and adversarial data so that it hallucinates legal terminology, business processes, and financial policies, producing convincing but completely fabricated phishing emails and scam content.
Snatch AI made easy
Distilled models also lower the barriers for adversaries attempting model extraction attacks. By extensively querying these models, attackers can approximate their decision boundaries and recreate functionally similar models—often with reduced security constraints.
“Once an adversary has extracted a model, they can potentially modify it to bypass security measures or proprietary guidelines embedded in the original model,” Chandrasekaran said. “This could include altering the model’s behavior to ignore certain inputs or to produce outputs that align with the adversary’s goals.”
Brauchler, however, argues that bypassing an AI model’s proprietary security guardrails is not the primary driver behind model extraction attacks using distilled models. “Model extraction is usually exploited with the intent of capturing a proprietary model’s performance, not with the express purpose of bypassing guardrails,” he said. “There are much less strenuous techniques to avoid AI guardrails.”
Rather than using a distilled model for extraction, he explained, threat actors may instead disguise a maliciously built model as a distilled, lighter-weight version of a legitimate one, given that model extraction attacks closely resemble model distillation.
One particular risk arises when proprietary models provide probability distributions (soft labels), as threat actors can leverage distillation methodologies to replicate the target model’s functional behavior. While similar attacks can be executed using only output labels, the absence of probability distributions significantly reduces their effectiveness, added Brauchler.
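To illustrate why exposed probability distributions matter, here is a hedged sketch of a surrogate-training step an attacker might run against a black-box model, again assuming a PyTorch-style setup; `query_target_model` and `surrogate` are hypothetical stand-ins, not tools or APIs referenced in this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def extraction_step(surrogate, optimizer, inputs, query_target_model):
    # Soft labels (full probability vectors) returned by the black-box target.
    with torch.no_grad():
        target_probs = query_target_model(inputs)   # shape: [batch, num_classes]
    # Train the surrogate to reproduce the target's entire output distribution;
    # matching soft labels transfers far more signal per query than matching
    # only the top-1 label.
    log_probs = F.log_softmax(surrogate(inputs), dim=-1)
    loss = F.kl_div(log_probs, target_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage: a toy "target" that returns probabilities from a hidden
# linear layer, and a small surrogate fitted against its soft outputs.
target = nn.Sequential(nn.Linear(16, 4), nn.Softmax(dim=-1))
surrogate = nn.Linear(16, 4)
optimizer = torch.optim.SGD(surrogate.parameters(), lr=0.1)
for _ in range(3):
    extraction_step(surrogate, optimizer, torch.randn(32, 16), target)
```

If the target API returned only hard labels, the attacker would need far more queries to approximate the same decision boundaries, which is why soft labels are the more valuable leak.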
To sum up, distillation can potentially expose models to extraction, either by serving as cover for an attack that replicates a source model’s behavior, or by enabling post-distillation extraction attempts that bypass security controls.
They may not always have your back
Another downside of distillation is reduced interpretability. Large LLMs benefit from extensive logs and complex decision-making pathways that security teams can analyze during root-cause investigations. Distilled models, however, often lack this granularity, making it harder to diagnose vulnerabilities or trace security incidents.
“In the context of incident response, the lack of detailed logs and parameters in student models can make it harder to perform root cause analysis,” Chandrasekaran said. “Security researchers might find it more difficult to pinpoint the exact conditions or inputs that led to a security incident or to understand how an adversary exploited a vulnerability.”
This opacity complicates defensive strategies and forces security teams to rely on external monitoring techniques rather than internal AI audit trails.
Fighting the AI curse
While security risks from distilled models are quite pressing, the broader risk remains the nascent state of AI security itself, which is a key driver of all these vulnerabilities.
“AI guardrails remain soft defense-in-depth controls, not security boundaries,” Brauchler noted. “And as systems move toward agentic contexts, the AI engineering industry will quickly discover that relying on guardrails will result in deep, impactful security vulnerabilities in critical systems, as NCC Group has already observed across multiple application environments.”
Only when developers change the way they think about AI application architectures will we be able to move toward designing systems with trust-based access controls in mind, he added.
Original Post url: https://www.csoonline.com/article/3951626/llms-are-now-available-in-snack-size-but-digest-with-care.html
Category & Tags: Generative AI, Risk Management, Security