
EN | Adversarial Risk in Modern NLP Systems: LLM-Focused Threat and Attack Tactics — Part 1

11 min read · Jul 10, 2025
I created it using Midjourney

Artificial intelligence systems are now being adopted across virtually every sector and field. Autonomous systems powered by decision-support models are already taking the global stage, particularly in domains such as automotive, healthcare, finance, and the military.
At the same time, it is increasingly emphasized that specific tactics and techniques can push these models toward dangerous outcomes.
In healthcare, automotive, and financial applications in particular, a manipulated decision-support system can produce outputs and decisions with potentially fatal consequences.

The Turkish version of this article is available at the following link:

In this article, rather than merely voicing these concerns, we will explore them in technical depth. We will examine the work conducted by NIST and MITRE and analyze the most recently published academic research on the topic.

Artificial intelligence systems, and machine learning models in particular, are highly vulnerable to cybersecurity threats built around domain-specific attack tactics and techniques.
Adversarial attacks are a class of threats that deliberately manipulate a model's inputs to make it produce incorrect decisions. These attacks exploit the statistical, correlation-driven nature of AI reasoning and can lead to critically risky, erroneous outcomes in fields such as image recognition, natural language processing, and autonomous systems.

According to NIST’s Adversarial Machine Learning (AML) Taxonomy, such attacks can occur during both the training and inference phases of the AI lifecycle. They are classified based on the attacker’s knowledge level into white-box, black-box, and gray-box categories.

Taxonomy of attacks on PredAI systems (Source: NIST AI 100–2e2025)

In the following sections, I will provide a detailed analysis of adversarial attacks, explaining their mathematical foundations. We will also examine their operational mechanisms through practical Proof-of-Concepts (PoCs). Our analysis will focus on both predictive AI (PredAI) — such as classification models — and generative AI (GenAI), including large language models (LLMs).

Adversarial attacks gained prominence with the introduction of the adversarial examples concept by Szegedy et al. in 2013.
According to the NIST AI 100–2e2025 report, adversarial threats are categorized into several classes, including Evasion, Poisoning, Privacy, and Prompt Injection attacks.

Adversarial threats targeting machine learning models can be broadly categorized into several types. In evasion attacks, an attacker subtly perturbs the input data — often with imperceptible noise — to mislead the model into making incorrect classifications. In contrast, poisoning attacks involve injecting deliberately corrupted samples into the training dataset, thereby compromising the integrity of the learned model. Other threat types include model extraction and membership inference attacks, which aim to leak sensitive or proprietary information from the model.

A comprehensive study published in 2025 by the Japan AI Safety Institute identified eleven primary types of adversarial attacks targeting both predictive and generative AI systems.

Source: https://arxiv.org/html/2506.23296v1

Let’s begin with a brief overview of these attack types:

  • Model Extraction: The attacker attempts to replicate a closed-source model with comparable accuracy using reverse-engineering techniques, typically via query-only or API-level access.
  • Training Data-related Attacks: These attacks aim to extract sensitive or proprietary information from the dataset used to train the model.
  • Model Poisoning: Post-training, the model’s weights are directly manipulated — sometimes by leveraging learning paradigms such as Reinforcement Learning from Human Feedback (RLHF) — to induce malicious behavior or bias.
  • Data Poisoning: Malicious or misrepresentative samples are injected into the training data to degrade the model’s overall performance or to cause it to behave incorrectly under specific inputs.
  • Evasion: Carefully crafted inputs, imperceptibly perturbed to the human eye, are used to fool a trained model into producing incorrect outputs.
  • Energy/Latency Attacks: These resemble Denial-of-Service (DoS) attacks. The goal is to degrade system responsiveness or inflate computational costs, rendering the model service unsustainable or slow.
  • Prompt Stealing: The attacker seeks to expose or replicate proprietary developer or system-level prompts used in closed or commercial models, potentially for unauthorized reuse or further exploitation.
  • Prompt Injection: Malicious instructions are embedded within user inputs to override the model’s high-level system prompts or guardrails, leading to unauthorized actions or sensitive information leakage.
  • Code Injection: During code generation tasks, the model is manipulated to output attacker-controlled code — potentially leading to Remote Code Execution (RCE), Cross-Site Scripting (XSS), or other exploitable artifacts.
  • Adversarial Fine-tuning: A malicious actor fine-tunes a base model on a small custom dataset to induce harmful behavior, effectively generating a backdoored or biased model variant.
  • Rowhammer Attacks: A hardware-level threat in which rapid, repeated access to DRAM cells causes bit-flips in adjacent cells. This can lead to weight corruption in model parameters or activate hidden backdoors.

Model Extraction — Attack Strategies

The core idea behind a Model Extraction attack is to observe the behavior of a closed-source model and reconstruct a replica or a close approximation of it. As you may know, many organizations lack access to large-scale datasets and computational resources. As a result, they often seek shortcuts to develop high-performing models — either by fine-tuning an existing model or by extracting capabilities from publicly accessible models.

In the field of AI, many of the world’s most advanced models developed by major corporations are made available through API endpoints. Users can send inputs and receive outputs, but they are not granted direct access to the model’s internals. A Model Extraction attack leverages this black-box access to infer the internal structure, logic, and behavior of the model. The ultimate goal is to steal the model’s intellectual property, reconstruct its architecture, or indirectly leak sensitive information about the training data.

Source: https://arxiv.org/html/2506.23296v1

This attack is typically carried out by collecting a sufficient number of input-output pairs and learning the decision boundaries and behavioral patterns of the target model. It is analogous to a student trying to understand a teacher’s reasoning by observing how they respond to different questions.

The attack flow generally proceeds as follows:

Step 1: Target Model Discovery and Analysis

The first phase involves understanding the characteristics of the target model. The attacker aims to determine the following:

  • Input format: What type of data does the model accept? Is it images, text, or numerical vectors?
  • Output format: What kind of results does the model produce? Are the outputs discrete classes (classification), continuous values (regression), or probabilistic distributions?
  • API limitations: Are there any rate limits on the number of queries, output truncations, or access restrictions?
  • Cost structure: What is the cost per query? Are there pricing thresholds or usage quotas that could constrain the attack process?
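
As a rough illustration of this reconnaissance phase, the sketch below sends a handful of benign queries to a hypothetical REST endpoint to learn its output format, response latency, and rate-limiting behavior. The endpoint URL, payload fields, and response keys are assumptions made for the example, not a real service.

import time
import requests

API_URL = "https://api.example.com/v1/score"  # hypothetical endpoint

def probe_target(api_url, sample_payload, n_probes=20):
    """Send a few benign queries to learn output format and rate limits."""
    latencies, outputs = [], []
    for _ in range(n_probes):
        start = time.time()
        resp = requests.post(api_url, json=sample_payload, timeout=10)
        latencies.append(time.time() - start)
        if resp.status_code == 429:  # rate limiting reached
            print("Rate limiting detected after", len(outputs), "queries")
            break
        outputs.append(resp.json())
    if outputs:
        print("Sample output keys:", list(outputs[0].keys()))
        print("Average latency: %.3f s" % (sum(latencies) / len(latencies)))
    return outputs

# Example benign probe for a credit-scoring style API
probe_target(API_URL, {"age": 35, "income": 60000, "credit_history": 700,
                       "debt_ratio": 0.3, "employment_years": 8})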

Step 2: Query Strategy Design

In this step, the attacker determines which inputs to send in order to extract the most informative responses from the target model. There are three primary strategies an attacker may pursue:

  • Random Sampling: Inputs are randomly selected from the model’s input space. This method is straightforward but typically inefficient.
  • Active Learning-Based Sampling: The attacker selects inputs that are expected to yield the most information. For instance, examples near the decision boundaries are often more valuable, as they reveal critical aspects of the model’s behavior (see the sketch after this list).
  • Synthetic Data Generation: In this approach, the attacker crafts custom inputs specifically designed to explore and probe the model’s internal logic and behavior.
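
Here is a minimal sketch of the active learning idea: candidate inputs are ranked by how uncertain a partially trained surrogate model is about them, and only the most ambiguous ones (those closest to the decision boundary) are spent on real queries. The surrogate model and the candidate pool are assumed to already exist.

import numpy as np

def select_boundary_queries(surrogate_model, candidate_pool, budget=100):
    """Pick the candidates the surrogate is least certain about."""
    # Probability of the most likely class for each candidate
    probs = surrogate_model.predict_proba(candidate_pool)
    confidence = probs.max(axis=1)
    # Low confidence = close to the decision boundary = most informative
    most_uncertain = np.argsort(confidence)[:budget]
    return [candidate_pool[i] for i in most_uncertain]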

Step 3: Data Collection Phase

At this stage, the attacker begins to systematically send queries to the target model and records the outputs. Let’s clarify this process through a concrete example involving a credit scoring model:

# Example adversarial data collection script for model extraction – educational use only

import random

def collect_training_data(target_api, num_queries):
    training_data = []

    for i in range(num_queries):
        # Generate a synthetic financial profile
        synthetic_profile = {
            'age': random.randint(18, 80),
            'income': random.randint(20000, 200000),
            'credit_history': random.randint(300, 850),
            'debt_ratio': random.uniform(0, 1),
            'employment_years': random.randint(0, 40)
        }

        # Query the target model for a prediction
        prediction = target_api.predict(synthetic_profile)

        # Save the input-output pair for training the surrogate model
        training_data.append((synthetic_profile, prediction))

    return training_data

Step 4: Local Training of the Surrogate Model

Once the input-output pairs have been collected, the attacker can begin training their own local surrogate model. The key insight here is that the attacker does not need to know the exact architecture of the target model. Learning the input–output mapping is sufficient to replicate the model’s behavior to a large extent.

This can be accomplished using a simple script like the one below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_surrogate_model(training_data):
    X = [list(sample[0].values()) for sample in training_data]
    y = [sample[1] for sample in training_data]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Simple surrogate model
    surrogate_model = RandomForestClassifier(n_estimators=100, random_state=42)
    surrogate_model.fit(X_train, y_train)

    # Evaluate performance
    y_pred = surrogate_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Surrogate model accuracy: {acc:.4f}")

    return surrogate_model
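
Putting Steps 3 and 4 together, an end-to-end extraction pipeline might look like the sketch below. The MockCreditAPI class is a stand-in I added so the example can run locally; in a real attack it would be replaced by actual API calls to the target.

class MockCreditAPI:
    """Hypothetical stand-in for the closed-source credit scoring model."""
    def predict(self, profile):
        # Toy decision rule so the example is self-contained
        score = profile['credit_history'] - 400 * profile['debt_ratio']
        return "approve" if score > 500 else "reject"

target_api = MockCreditAPI()
training_data = collect_training_data(target_api, num_queries=5000)
surrogate = train_surrogate_model(training_data)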

Step 5: Model Validation and Improvement

Finally, the attacker tests the performance of the extracted model and, if necessary, sends additional queries to improve it. In particular, more samples are collected in parameter regions where the model makes incorrect predictions.
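
A simple way to quantify extraction quality is the agreement rate between the surrogate and the target on a fresh probe set; profiles where the two disagree are natural candidates for follow-up queries. The sketch below assumes the target_api and surrogate_model objects from the previous steps.

def measure_agreement(target_api, surrogate_model, probe_profiles):
    """Fraction of probes where surrogate and target give the same answer."""
    matches = 0
    disagreements = []
    for profile in probe_profiles:
        target_pred = target_api.predict(profile)
        surrogate_pred = surrogate_model.predict([list(profile.values())])[0]
        if target_pred == surrogate_pred:
            matches += 1
        else:
            disagreements.append(profile)  # candidates for follow-up queries
    agreement = matches / len(probe_profiles)
    print(f"Agreement with target: {agreement:.2%}")
    return agreement, disagreements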

Model Extraction — Defense Strategies

Model owners can implement various defense techniques against these attack strategies. These include:

  • Output Perturbation: Small random noise is added to model outputs to make extraction more difficult (see the sketch after this list).
  • Query Budgeting: Each user is granted a limited number of queries. This is commonly enforced by most AI API providers today.
  • Watermarking: Hidden markers are embedded into the model that can be detected in extracted copies. This allows large-scale copying attempts to be identified and legal rights to be enforced.
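
To make the first defense more concrete, here is a minimal sketch of output perturbation: a wrapper adds small Gaussian noise to the probability vector before it is returned, degrading the quality of the labels an extractor can harvest. The noise scale is an illustrative assumption and would need to be tuned against legitimate-user utility.

import numpy as np

def perturbed_predict_proba(model, X, noise_scale=0.02):
    """Return class probabilities with small random noise added."""
    probs = model.predict_proba(X)
    noisy = probs + np.random.normal(0, noise_scale, probs.shape)
    noisy = np.clip(noisy, 0, None)
    # Re-normalize so each row still sums to 1
    return noisy / noisy.sum(axis=1, keepdims=True)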

Training Data-related Attacks — Attack Strategies

Since this section focuses on adversarial attacks related to training data, let’s delve into data poisoning attacks in detail. Data poisoning is a subtype of adversarial attacks that targets the training data and involves intentionally tampering with the training dataset to compromise the integrity of the model.

In a data poisoning attack, the adversary deliberately corrupts the training data so that the resulting model makes incorrect predictions at test time, either broadly or on specific target samples. As described in the NIST report, data poisoning targets the training phase and jeopardizes the reliability of the model. These attacks are particularly effective in scenarios involving large-scale data collection, where the attacker can gain access to the training data or influence its sources.

It is important to distinguish this from model poisoning, which involves directly influencing the model parameters by manipulating the training process itself.

We also categorize the attacker’s level of knowledge as follows:

  • White-box: The attacker has full access to the training data and model parameters. They can use gradient-based optimization to craft poisoned data.
  • Black-box: The attacker has no information about the model, and can only interact with it via output queries.
  • Gray-box: The attacker has partial information, such as knowing the model architecture but not its parameters.

Attack methods include injecting new poisoned samples or modifying existing data points’ features or labels. In real-world scenarios, clean-label attacks are common — where the labels remain unchanged, but the features are subtly manipulated.
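
A clean-label poisoning step can be sketched as follows: the feature vector is nudged slightly toward the centroid of a target class while the original label is kept, so the sample still looks correctly labeled to a human reviewer. The perturbation budget and the use of class centroids are simplifying assumptions; published clean-label attacks typically rely on feature-collision or gradient-based optimization.

import numpy as np

def clean_label_poison(sample, label, target_class_centroid, epsilon=0.05):
    """Shift the features slightly toward the target class; keep the label."""
    sample = np.asarray(sample, dtype=float)
    direction = np.asarray(target_class_centroid, dtype=float) - sample
    # Bound the perturbation so the change stays statistically subtle
    perturbation = np.clip(direction, -epsilon, epsilon)
    return sample + perturbation, label  # label is unchanged (clean-label)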


Types of Data Poisoning Attacks

These attacks are classified based on the attacker’s objective and the scope of impact. According to the NIST taxonomy, the main types are as follows:

  • Availability Attacks: These aim to degrade the overall performance of the model, causing misclassification on most test samples. For example, in a label flipping scenario, examples from one class are added with labels from another class. In optimization-based cases, poisoned data is computed to maximize the loss function — for instance, in SVM or linear regression. A real-world example would be injecting poisoned emails into a spam filter to reduce its accuracy below 50%.
  • Targeted Attacks: These attacks affect only specific or a small number of test samples, while the model performs normally on others. The target samples must be known during training time.
  • Backdoor Attacks: Triggers are embedded into the training data so that test samples containing the trigger are misclassified. The trigger can be static (e.g., a pixel pattern), dynamic, or semantic.
  • Subpopulation Attacks: These attacks target a particular subset of the test samples. The size of the affected group falls between that of targeted and availability attacks.

Label Flipping

The simplest method is to modify the labels. For example:

import random

def label_flipping_attack(dataset, poison_rate=0.1):
    poisoned_dataset = []
    for data, label in dataset:
        if random.random() < poison_rate:
            # Flip the label
            new_label = "spam" if label == "normal" else "normal"
            poisoned_dataset.append((data, new_label))
        else:
            poisoned_dataset.append((data, label))
    return poisoned_dataset

Feature Manipulation

This involves strategically altering the data features. For example:

import numpy as np

def feature_poisoning_attack(image_data, target_class, trigger_value=1.0):
    # Add a hidden trigger pattern to the image (assumed to be a NumPy array)
    poisoned_image = image_data.copy()
    # Insert a small 5x5 mark in the top-left corner as the backdoor trigger
    poisoned_image[0:5, 0:5] = trigger_value
    return poisoned_image, target_class

Gradient-Based Poisoning

Poisoned examples are crafted using model gradients to maximize their adversarial influence. For example:

def gradient_based_poisoning(model, clean_data, target_misclassification,
                             max_iterations=100, learning_rate=0.01):
    # Craft a poisoned example that maximizes its influence on the model
    # using gradient-based optimization. compute_influence_gradient stands in
    # for the attacker's influence-estimation routine (e.g., loss gradients
    # with respect to the candidate poison point).
    poison_example = clean_data.copy()

    for iteration in range(max_iterations):
        gradient = compute_influence_gradient(model, poison_example, target_misclassification)
        poison_example += learning_rate * gradient

    return poison_example

Training Data-related Attacks — Defense Strategies

Defenses against data poisoning attacks can be grouped into three main categories:

  • Data Sanitization: Poisoned data points are detected and filtered using techniques such as outlier detection (e.g., k-means clustering, anomaly detection) or data provenance analysis. One of the key challenges here is avoiding the removal of legitimate (clean) data. A minimal sanitization sketch follows this list.
  • Robust Optimization: The loss function is modified to reduce the impact of poisoned samples — e.g., by using robust loss functions. Certified defenses can also be applied. For instance, randomized label assignment can provide formal guarantees against a certain percentage of label-flipping attacks.
  • Model Inspection and Repair: Especially useful for backdoor attacks, this involves analyzing neuron activation differences to detect anomalies. Triggers can be reverse-engineered and neutralized during this process.
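
As a minimal sketch of data sanitization, the snippet below flags training points that lie unusually far from their class centroid, a crude form of outlier detection. Real pipelines would use stronger detectors such as spectral signatures or activation clustering; the z-score threshold here is an illustrative assumption.

import numpy as np

def filter_outliers(X, y, z_threshold=3.0):
    """Drop samples whose distance to their class centroid is an outlier."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        centroid = X[idx].mean(axis=0)
        dists = np.linalg.norm(X[idx] - centroid, axis=1)
        z_scores = (dists - dists.mean()) / (dists.std() + 1e-9)
        keep[idx[z_scores > z_threshold]] = False
    return X[keep], y[keep]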

That wraps up the first part of our series. I personally find these attack strategies both fascinating and thought-provoking. I hope you enjoyed it as much as I did. See you in the next part!


Written by Alican Kiraz

Sr. Staff Security Engineer @Trendyol | CSIE | CSAE | CCISO | CASP+ | OSCP | eCIR | CPENT | eWPTXv2 | eCDFP | eCTHPv2 | OSWP | CEH Master | Pentest+ | CySA+
