How Model Extraction Attacks Turn AI APIs Into Theft Opportunities

As artificial intelligence becomes more valuable, a new breed of digital theft has emerged. Companies spend millions developing proprietary AI models, only to discover that their intellectual property can be stolen without ever breaking into a server or downloading a single file.

Model extraction is a type of cyberattack where an adversary, with no prior knowledge of a machine learning model's internal workings, creates a functional copy of it simply by repeatedly sending it queries and observing the responses. The attacker essentially plays a game of "20 Questions" with the AI, using the model's own predictions to reverse-engineer its logic and build a surprisingly accurate replica. This stolen model can then be used for free, analyzed for vulnerabilities, or used to launch other attacks, all without the original owner's permission (Tramèr et al., 2016).

This isn't about hacking into a server and downloading the model files. It's a far more subtle form of intellectual property theft that happens in plain sight, through the very API that companies expose to their customers. It poses a significant threat to the burgeoning Machine-Learning-as-a-Service (MLaaS) industry, where companies invest millions of dollars developing proprietary models only to make them publicly accessible on a pay-per-query basis. Model extraction exploits this tension between accessibility and confidentiality, turning a model's public interface into its greatest vulnerability.

The economics of these attacks are particularly troubling. Research has shown that effective model extraction can be achieved for as little as $7 in API costs (Tramèr et al., 2016). In some cases, attackers have created replicas that achieve over 95% accuracy while querying only 5-7% of the training data. The stolen models can then be used to undercut the original provider's business, analyzed to find security vulnerabilities, or deployed in ways that violate the original owner's terms of service. It's a high-reward, low-risk proposition for attackers, and the threat is only growing as more valuable models are deployed via public APIs.

The Art of the Digital Heist

So how does an attacker actually pull off this digital heist? It boils down to a clever process of learning by imitation. The core technique behind many modern model extraction attacks is known as knowledge distillation. Originally developed as a method for compressing large, complex models into smaller, more efficient ones, the technique has been repurposed by attackers as a tool for theft. The idea is to train a smaller "student" model to mimic the behavior of a larger "teacher" model. In this case, the victim's proprietary model is the teacher, and the attacker's replica is the student.

The secret sauce lies in the soft labels that many machine learning models provide. When you ask a classification model to identify a picture, it doesn't just give you a single, hard answer like "cat." It often provides a probability distribution across all the possible classes, like "85% cat, 10% dog, 5% fox." These probabilities, or soft labels, are a goldmine of information for an attacker. They reveal not just what the model thinks the answer is, but also how confident it is, and what it thinks the next most likely answers are. This gives the attacker a much richer signal to train their own model on. They're not just learning the right answers; they're learning the teacher model's entire thought process, including its biases and uncertainties (Praetorian, 2026).
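To make the idea concrete, here is a minimal NumPy sketch, using invented logits for the cat/dog/fox example above, of how much more a soft-label response reveals than a single hard answer:

```python
import numpy as np

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Hypothetical logits a victim classifier might produce for one image.
logits = np.array([4.2, 2.1, 1.4])           # classes: cat, dog, fox
classes = ["cat", "dog", "fox"]

probs = softmax(logits)                       # soft labels: roughly [0.85, 0.10, 0.05]
hard_label = classes[int(np.argmax(probs))]   # hard label: just "cat"

print(dict(zip(classes, probs.round(2))))     # full distribution leaks class relationships
print(hard_label)                             # a single answer leaks far less
```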

The attack unfolds in a few key steps. First, the attacker needs to create a dataset to train their replica model. They do this by sending a series of carefully chosen queries to the victim model's API. These queries could be random data, synthetic data generated to explore the decision boundaries of the model, or even data from a publicly available dataset that is similar to what the victim model was likely trained on. For each query, the attacker records the victim model's full output, including the soft labels.
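A sketch of that data-collection step might look like the following, where `query_victim_api` is a hypothetical stand-in for whatever prediction endpoint the victim exposes; the query strategy, batch size, and input shape are all illustrative assumptions:

```python
import numpy as np

def query_victim_api(batch):
    """Placeholder for a call to the victim's prediction endpoint.
    Assumed to return one probability vector (soft labels) per input."""
    raise NotImplementedError("stand-in for a real MLaaS API call")

def build_transfer_set(num_queries=1000, input_shape=(28, 28), batch_size=100, rng=None):
    """Collect (input, soft-label) pairs to train the replica on."""
    rng = rng or np.random.default_rng(0)
    inputs, soft_labels = [], []
    for _ in range(num_queries // batch_size):
        # Queries could be random noise, synthetic probes near decision
        # boundaries, or samples from a similar public dataset.
        batch = rng.random((batch_size, *input_shape), dtype=np.float32)
        inputs.append(batch)
        soft_labels.append(query_victim_api(batch))
    return np.concatenate(inputs), np.concatenate(soft_labels)
```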

Once they have a sufficiently large dataset of these input-output pairs, the attacker trains their own model—the student—to reproduce the teacher's outputs for the same inputs. They use a special loss function, often based on a concept called Kullback-Leibler (KL) divergence, which measures how different the student's probability distribution is from the teacher's. By minimizing this divergence, the student model learns to produce the same probability distributions as the teacher, effectively becoming a functional clone.
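In PyTorch terms, a distillation loss of this kind might look like the sketch below. It assumes the attacker uses the victim's returned probability vectors directly as targets; exact weighting and temperature handling vary between implementations:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """KL divergence between the teacher's soft labels and the student's
    softened predictions; minimizing it trains the student to reproduce
    the teacher's probability distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    # kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```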

A crucial parameter in this process is the temperature. In knowledge distillation, the temperature is a value used to soften the probability distributions. A higher temperature makes the probabilities more uniform, forcing the model to reveal more about the relationships it has learned between different classes. Attackers can experiment with different temperatures to extract the maximum amount of information from the teacher model's outputs (Praetorian, 2026).
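A few lines of NumPy show the effect: as the temperature rises, the distribution flattens and the runner-up classes become easier to read off (the logits are again invented for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax with a temperature knob: higher values flatten the distribution."""
    scaled = np.asarray(logits) / temperature
    exps = np.exp(scaled - scaled.max())
    return exps / exps.sum()

logits = [4.2, 2.1, 1.4]
for t in (1, 2, 5):
    print(t, softmax_with_temperature(logits, t).round(2))
# T=1 -> ~[0.85, 0.10, 0.05]; T=5 -> ~[0.45, 0.29, 0.26]
# The softened outputs expose more about how the model relates the classes.
```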

What's particularly alarming is that the attacker doesn't even need to know the architecture of the victim model. They can use a completely different, often much simpler, architecture for their replica. The goal is to mimic behavior, not to replicate the exact structure. This makes the attack incredibly versatile and difficult to defend against. As long as the attacker can query the model and get detailed outputs, they have a good chance of being able to steal it.

A recent demonstration by security researchers at Praetorian showed just how effective these attacks can be. Using the Fashion-MNIST dataset, they were able to create a replica model with just 1,000 queries—a tiny fraction of the original training set. The stolen model achieved an 80.8% agreement rate with the victim model, meaning it produced the same prediction 4 out of 5 times. What's more, the stolen model inherited not just the victim's correct predictions, but also its mistakes and biases. Both models struggled with the same ambiguous categories, like distinguishing between shirts and t-shirts. The replica had essentially learned to think like the original (Praetorian, 2026).
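Agreement rate itself is a simple fidelity metric. A typical way to compute it (not necessarily the researchers' exact code) is just the fraction of inputs on which the two models return the same top prediction:

```python
import numpy as np

def agreement_rate(victim_preds, replica_preds):
    """Fraction of inputs on which the replica makes the same top prediction
    as the victim, the fidelity measure commonly reported in extraction studies."""
    victim_preds = np.asarray(victim_preds)
    replica_preds = np.asarray(replica_preds)
    return float((victim_preds == replica_preds).mean())

# e.g. agreement_rate(victim_top1_on_test_set, replica_top1_on_test_set) -> 0.808
```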

More Than Just a Copy

The consequences of a successful model extraction attack go far beyond simple intellectual property theft. A stolen model can be a launchpad for a variety of other malicious activities. For example, an attacker can analyze the stolen model offline to find its weaknesses and then use that knowledge to craft adversarial examples—inputs that are specifically designed to fool the original model. This could be used to bypass a spam filter, trick a content moderation system, or fool a facial recognition system.
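As an illustration of that offline analysis, here is a minimal FGSM-style sketch in PyTorch that crafts an adversarial input against the stolen surrogate; if the replica is faithful, the perturbation often transfers to the original model. The model interface and the epsilon value are assumptions:

```python
import torch
import torch.nn.functional as F

def fgsm_on_surrogate(surrogate, x, true_label, epsilon=0.03):
    """Craft an adversarial example against the *stolen* surrogate model.
    x: batch of inputs in [0, 1]; true_label: tensor of class indices."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x), true_label)
    loss.backward()
    # Step in the direction that most increases the surrogate's loss.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
```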

Furthermore, a stolen model can be used to mount membership inference attacks, where the goal is to determine whether a specific individual's data was used to train the original model. This is a serious privacy breach, especially if the model was trained on sensitive data like medical records or financial information. By having a local copy of the model to query an unlimited number of times, the attacker can perform the extensive analysis needed to infer information about the training data.
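A crude confidence-threshold heuristic gives the flavor of such an attack, though published membership-inference attacks are considerably more sophisticated; the threshold and interface here are illustrative:

```python
import numpy as np

def membership_scores(replica_probs, labels):
    """Confidence the replica assigns to each record's true label. The attacker
    can compute this for unlimited candidate records using their local copy."""
    replica_probs = np.asarray(replica_probs)
    labels = np.asarray(labels)
    return replica_probs[np.arange(len(labels)), labels]

def guess_members(replica_probs, labels, threshold=0.95):
    """Guess that unusually confidently classified records were in the original
    training set; the threshold would be calibrated on records of known status."""
    return membership_scores(replica_probs, labels) > threshold
```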

There are several different flavors of model extraction attacks, each with its own nuances and levels of effectiveness. The most common type is black-box extraction, where the attacker has no knowledge of the model's architecture or training data and can only interact with it through its public API. This is the most realistic scenario for attacks against MLaaS platforms.

A more powerful variant is white-box extraction, where the attacker has some knowledge of the model's architecture. This might be the case if the model is based on a well-known, open-source architecture like BERT or ResNet. With this knowledge, the attacker can design a replica model with the same architecture, making the extraction process much more efficient and the resulting clone much more accurate.

There are also side-channel attacks, where the attacker uses information leaked during the model's execution—like the time it takes to make a prediction or the amount of power it consumes—to infer information about its internal workings. These attacks are much more difficult to pull off, but they can be incredibly effective, especially against models deployed on edge devices.

Model Extraction Attack Types and Characteristics

| Attack Type | Attacker's Knowledge | Key Technique | Vulnerable Models |
| --- | --- | --- | --- |
| Black-Box Extraction | API access only | Knowledge Distillation | Most MLaaS models |
| White-Box Extraction | Knowledge of model architecture | Transfer Learning | Open-source architectures |
| Side-Channel Attack | Physical access or monitoring | Timing/Power analysis | Edge devices, IoT |
| Instruction-Tuning | API access to LLM | Prompting and fine-tuning | Large Language Models |

The Rogues' Gallery of Stolen Models

The vulnerability of different types of AI models to extraction attacks varies depending on their architecture and the nature of their outputs. Some models are, by their very nature, more talkative than others, and that chattiness can be their undoing.

Decision trees and other simpler models with discrete decision boundaries are often easier to extract. An attacker can systematically probe the model to map out its decision rules. It's like feeling your way around a dark room; with enough patience, you can eventually figure out where all the furniture is.

Deep neural networks, on the other hand, are a double-edged sword. Their complexity makes them incredibly powerful, but it also makes them more vulnerable. The rich, high-dimensional outputs of a deep neural network provide a wealth of information for an attacker to exploit. This is especially true for models that output detailed probability distributions, as is common in image classification and natural language processing tasks. The very information that makes the model so good at its job also makes it an open book to a determined attacker (Brian D. Colwell, 2025).

Large language models (LLMs) are perhaps the most vulnerable of all. Their ability to generate coherent, human-like text in response to a wide variety of prompts makes them a prime target for extraction. An attacker can use a technique called instruction-tuning, where they prompt the LLM with a series of tasks and then use the model's own outputs to train a smaller, open-source LLM to perform the same tasks. In essence, they're using the victim LLM as a data factory to create a training set for their own model. The resulting replica can often achieve performance surprisingly close to the original, at a fraction of the cost.
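In practice, the data-factory step can be as simple as the sketch below, where `ask_victim_llm` is a hypothetical placeholder for a call to any commercial chat API; the JSONL file format mirrors common instruction-tuning datasets:

```python
import json

def ask_victim_llm(prompt):
    """Hypothetical placeholder for a single call to the victim LLM's API;
    any chat-completion endpoint would slot in here."""
    raise NotImplementedError("stand-in for a real API call")

def harvest_instruction_pairs(instructions, out_path="distill_set.jsonl"):
    """Use the victim LLM as a data factory: record (instruction, response)
    pairs that can later fine-tune a smaller open-source model."""
    with open(out_path, "w") as f:
        for instruction in instructions:
            response = ask_victim_llm(instruction)
            f.write(json.dumps({"instruction": instruction, "output": response}) + "\n")
```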

The implications for the LLM industry are particularly severe. Companies like OpenAI, Anthropic, and Google have invested billions of dollars in training their flagship models. If an attacker can create a functional replica by simply querying the API a few thousand times, it undermines the entire business model. The attacker gets a free ride on the victim's research and development investment, and can then deploy the stolen model without any of the overhead costs. This has led to a growing concern about "model leeching," where smaller players use the outputs of expensive, proprietary models to train their own cheaper alternatives.

Building a Better Lock

Given the significant threat posed by model extraction, researchers and practitioners have been working on a variety of defense mechanisms. The goal is to make it harder, more expensive, or more conspicuous for an attacker to steal a model. It's a bit like adding more locks to a door; no single lock is perfect, but a combination of them can make a burglar think twice.

One of the most intuitive defenses is to simply reduce the amount of information that the model gives away. Instead of returning a full probability distribution, the API could return only the top prediction (a hard label) or a rounded, less precise probability. This is like giving a one-word answer instead of a detailed explanation. It does make extraction more difficult, but it's not a silver bullet. Determined attackers can still use clever query strategies to piece together the model's decision boundaries, and this defense often comes at the cost of reduced utility for legitimate users who might benefit from the richer information.
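A minimal sketch of this kind of output truncation, assuming the server holds the full probability vector internally, might look like:

```python
import numpy as np

def truncate_output(probs, mode="hard", decimals=1):
    """Return less informative predictions: either the top class only (a hard
    label) or a coarsely rounded probability vector."""
    probs = np.asarray(probs, dtype=float)
    if mode == "hard":
        return int(np.argmax(probs))
    rounded = np.round(probs, decimals)
    if rounded.sum() == 0:                 # all mass rounded away; fall back to top class
        rounded[np.argmax(probs)] = 1.0
    return rounded / rounded.sum()         # renormalize so it still sums to 1
```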

Another approach is to try to detect and block suspicious query patterns. An attacker trying to extract a model will often issue a large number of queries in a short amount of time, or their queries might look systematically different from those of a normal user. By implementing rate limiting and monitoring query logs for unusual activity, a provider can potentially identify and shut down an attack in progress. However, sophisticated attackers can often evade these measures by distributing their queries across multiple accounts and IP addresses, or by carefully crafting their queries to mimic normal usage patterns (Brian D. Colwell, 2025).
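Even a simple sliding-window check per API key captures the basic idea, though real deployments pair it with far richer anomaly detection; the window and threshold below are arbitrary:

```python
import time
from collections import defaultdict, deque

class QueryMonitor:
    """Per-client rate check: flag callers whose query volume in a sliding
    window looks more like systematic probing than normal use."""

    def __init__(self, window_seconds=3600, max_queries=500):
        self.window = window_seconds
        self.max_queries = max_queries
        self.history = defaultdict(deque)

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        q = self.history[client_id]
        while q and now - q[0] > self.window:   # drop timestamps outside the window
            q.popleft()
        q.append(now)
        return len(q) <= self.max_queries       # False -> throttle or review this client
```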

More proactive defenses involve modifying the model itself to make it inherently more resistant to extraction. Watermarking is a technique where a unique, secret signature is embedded into the model's behavior. This is done by training the model to respond in a specific, unusual way to a secret set of inputs. If a stolen model is later found in the wild, the owner can prove it was stolen by demonstrating that it contains their secret watermark. This doesn't prevent the theft, but it provides a way to prove ownership and take legal action.
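Verification then amounts to checking how often a suspect model reproduces the secret trigger-set behavior; a score far above chance is evidence of derivation. The trigger inputs, labels, and model interface below are assumptions:

```python
import numpy as np

def watermark_accuracy(suspect_model, trigger_inputs, secret_labels):
    """Fraction of the owner's secret trigger set on which a suspect model
    produces the watermarked (deliberately unusual) prediction."""
    predictions = [int(np.argmax(suspect_model(x))) for x in trigger_inputs]
    return float(np.mean(np.array(predictions) == np.array(secret_labels)))
```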

Other model modifications, like pruning (removing unnecessary connections in a neural network) and quantization (reducing the precision of the model's parameters), can also make extraction more difficult. These techniques essentially add noise to the model's outputs, making it harder for an attacker to get a clean signal to train their replica on. The challenge is to do this without significantly degrading the model's performance on legitimate tasks.
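For reference, PyTorch exposes both operations directly. The snippet below prunes and then dynamically quantizes a toy classifier, though any anti-extraction benefit is a side effect rather than these utilities' stated purpose:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# Prune 30% of the smallest-magnitude weights in each linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # make the pruning permanent

# Quantize the remaining weights to 8-bit integers for inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```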

Finally, there's a growing interest in using differential privacy as a defense against model extraction. By adding carefully calibrated noise to the model's training process or its outputs, differential privacy can provide a mathematical guarantee that the model's predictions don't reveal too much information about any single data point. This can, as a side effect, make it more difficult for an attacker to extract a high-fidelity copy of the model.
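A toy output-perturbation version of the idea looks like the sketch below; note that production systems more commonly apply differential privacy during training (for example with DP-SGD) rather than by noising individual predictions:

```python
import numpy as np

def noisy_prediction(probs, epsilon=1.0, rng=None):
    """Illustrative output-perturbation defense: add Laplace noise scaled by
    1/epsilon to the probability vector before returning it, then renormalize.
    Smaller epsilon means more noise and a blurrier signal for an attacker."""
    rng = rng or np.random.default_rng()
    noisy = np.asarray(probs, dtype=float) + rng.laplace(scale=1.0 / epsilon, size=len(probs))
    noisy = np.clip(noisy, 1e-6, None)   # keep probabilities positive
    return noisy / noisy.sum()
```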

An Unwinnable Arms Race?

The development of model extraction attacks and defenses is a classic cat-and-mouse game. As defenders come up with new ways to protect their models, attackers find new ways to circumvent those protections. The fundamental tension between making models accessible and keeping them secure is a difficult one to resolve.

The rise of MLaaS platforms has democratized access to powerful AI, but it has also created a massive attack surface. The very APIs that drive this new economy are also the front door for attackers. As models become more complex and more valuable, the incentive to steal them will only grow.

The problem is compounded by the fact that many organizations don't even know when they've been attacked. Unlike traditional cyberattacks that leave obvious traces—like breached servers or stolen files—model extraction can look indistinguishable from normal API usage. An attacker querying a model 10,000 times over the course of a week might just look like a heavy user. Without sophisticated monitoring and anomaly detection systems in place, the theft can go completely unnoticed until the stolen model shows up as a competitor's product or is used in a downstream attack.

There's also a legal gray area. In many jurisdictions, it's unclear whether model extraction even constitutes theft. The attacker isn't breaking into any systems or violating any computer fraud laws. They're simply using a public API in a way that the provider didn't intend. Some argue that if a company exposes a model via an API, they've implicitly consented to having it learned from. This legal ambiguity makes it difficult for victims to seek recourse, even when they can prove that their model has been stolen.

The future of model extraction defense will likely involve a combination of the techniques described above, as well as new, more sophisticated approaches. We may see the development of new cryptographic methods for querying models in a privacy-preserving way, or new AI-powered systems that are specifically designed to detect and block extraction attacks in real time. We may also see a shift in the way that MLaaS providers do business, with more emphasis on legal agreements and watermarking to deter theft, rather than relying solely on technical defenses.

Ultimately, there may be no perfect solution. As long as there is a way to query a model and observe its behavior, there will be a way to learn from it. The goal for defenders is not necessarily to make extraction impossible, but to make it so costly and so difficult that it's no longer a worthwhile endeavor for most attackers. It's an ongoing arms race, and the security of the AI-powered future may depend on who stays one step ahead.

Some researchers are exploring more radical approaches. One idea is to fundamentally rethink how we deploy AI models. Instead of exposing models via APIs, we could use secure enclaves or trusted execution environments that allow users to run computations on their data without ever revealing the model itself. Another approach is to use homomorphic encryption or secure multi-party computation to allow queries to be answered without the model ever seeing the raw input data or revealing its internal state. These techniques are still largely experimental and come with significant performance overhead, but they represent a possible path toward truly extraction-resistant AI systems.

The reality is that model extraction attacks are here to stay. They're a natural consequence of the way we've chosen to deploy AI—as a service, accessible via the internet, to anyone willing to pay. As long as that remains the dominant paradigm, defenders will need to stay vigilant, constantly adapting their strategies to stay ahead of increasingly sophisticated attackers. The stakes are high, and the outcome is far from certain.