405 Billion-Parameter Cybersecurity-Focused LLM Model Project — Part 1: Pushing the Limits!

Alican Kiraz
5 min read · Dec 29, 2024
Cover image: Sofie Conte, 2021

For my first cybersecurity model, SenecaLLM, I used Meta's Llama 3.1 8B base model. Even training an 8-billion-parameter LLM proved to be quite challenging. However, once I became adept at choosing the right dataset and leveraging the Transformers library's training framework, I managed to complete the model in a reasonably short time. Now I've decided to push my model's capabilities even further by training and fine-tuning Meta's Llama 3.1 405B model.

Let’s start by examining parameters we often see in LLMs, such as 3B, 8B, 70B, and 405B.

In an LLM, the 'B' stands for "billion." The figures 3B, 8B, 70B, and 405B reflect how large and complex the model is; they represent the total count of weights and biases in the network. As these numbers increase, the model grows in size, can learn more information, and has a greater capacity to remember and follow instructions. However, the larger the model, the more challenging it becomes to train, store, and operate.

Image source: https://myscale.com/

As I explained in my previous article, artificial neural networks operate by performing numerous matrix multiplications and additions, combined with various functions.

Each of these matrices represents a learned weight. In transformer-based language models, these weights include:

  • Embedding layer: Converts words or tokens into numerical vectors.
  • Attention mechanism: Comprises the Q (query), K (key), and V (value) projection matrices.
  • Feed-forward layers: Includes hidden layers, activation functions, and additional parameters such as layer normalization.
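To make this concrete, here is a minimal sketch that counts how many parameters fall into each of these groups using the Transformers library. It uses GPT-2 as a small, publicly available stand-in, since the idea is the same at any scale; the string matching follows GPT-2's parameter naming, and Llama models use different module names.

```python
from collections import defaultdict
from transformers import AutoModelForCausalLM

# GPT-2 stands in for a much larger model; the layer types
# (embeddings, attention, feed-forward, layer norm) are the same idea.
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(f"Total parameters: {model.num_parameters() / 1e6:.1f}M")

buckets = defaultdict(int)
for name, param in model.named_parameters():
    if "wte" in name or "wpe" in name:      # token / position embeddings
        buckets["embeddings"] += param.numel()
    elif "attn" in name:                    # Q, K, V and output projections
        buckets["attention"] += param.numel()
    elif "mlp" in name:                     # feed-forward layers
        buckets["feed-forward"] += param.numel()
    else:                                   # layer norms and other small tensors
        buckets["other"] += param.numel()

for kind, count in buckets.items():
    print(f"{kind}: {count / 1e6:.1f}M")
```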

As the total number of parameters in a model increases:

  • Memory (VRAM/RAM) and computational power (FLOPS) requirements multiply.
  • The amount of information the model can theoretically learn also grows.
  • Training time extends and costs escalate.

Smaller models may lag behind larger ones in challenging tasks that require language fluency, logical consistency, and a broad knowledge base. As the number of parameters increases, a model can exhibit more complex 'language abilities,' yet the performance gain is not always linear. In some tasks, models of around 70B parameters may yield results comparable to much larger ones. For example, OpenAI's GPT-3.5/GPT-4 models are estimated to have well over 100B parameters, though their exact counts have not been disclosed.

Generally, as parameter size increases, the model’s capacity to learn complex patterns and generate more coherent/lengthy text grows. However, this does not mean that moving from 70B to 405B parameters will automatically give you a model that is six times smarter or better. The increase depends on factors such as data quality, training duration, and hyperparameter settings. Also, keep in mind that as parameter sizes grow, so do VRAM and training time costs, so larger models may not be suitable for every project.

Image source: https://www.humanfirst.ai/blog/rag-llm-context-size

Now that we have a better grasp of the topic, let's analyze the challenging points of my project. The first and most crucial issue on my mind is the GPU count and the associated cost, especially since I plan to make this project available to the community on a non-profit basis. Therefore, I'll be setting aside a specific budget to keep the project running.

Given the massive size of the model, if I tried to train it in FP16 without any quantization or LoRA, I’d need hundreds of H100 80GB GPUs. Instead, here’s my plan:

By using a parameter-efficient approach like LoRA, there’s no need to update all of the model’s parameters, drastically reducing both memory requirements and the number of GPUs needed. Because of this, I’m considering starting the training on Runpod or Vast.ai with 11×H100 80GB GPUs. You might be asking how I arrived at that number:

If we were to load the entire model into memory without quantization, just the raw model weights alone (405B parameters × 2 bytes each) would require 810 GB of VRAM — simply to run it unquantized.

But once we factor in training, the optimizer states (momentum terms), gradients, and the activations needed for forward/backward passes could push requirements upwards of 2–3 TB of VRAM.

Hence, the plan is to train with LoRA as well. LoRA is typically combined with 8-bit or 4-bit quantization (the QLoRA approach). For example, in a 4-bit quantization scenario, each parameter consumes 0.5 bytes, so for 405B parameters we'd need approximately:

  • 405 × 10⁹ parameters × 0.5 bytes ≈ 202.5 GB of VRAM for the quantized weights alone.

However, we may still need roughly 1.5–2 times that space for factors like activations and temporary memory usage. So, we’re looking at an upper bound of around 300–400 GB of VRAM, which is still quite substantial…
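The same arithmetic, written out as a quick sanity check. The bytes-per-parameter figures and the 1.5–2x overhead factor are the rough rules of thumb used above, not measured values:

```python
# Back-of-envelope VRAM estimates for a 405B-parameter model.
PARAMS = 405e9  # Llama 3.1 405B

fp16_weights_gb = PARAMS * 2 / 1e9    # 2 bytes per parameter in FP16
int4_weights_gb = PARAMS * 0.5 / 1e9  # 0.5 bytes per parameter at 4-bit

print(f"FP16 weights alone:      {fp16_weights_gb:,.0f} GB")   # ~810 GB
print(f"4-bit quantized weights: {int4_weights_gb:,.1f} GB")   # ~202.5 GB

# Rough range once activations and temporary buffers are included.
low, high = int4_weights_gb * 1.5, int4_weights_gb * 2.0
print(f"With 1.5-2x overhead:    {low:,.0f}-{high:,.0f} GB")    # ~304-405 GB
```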

LoRA

Low-Rank Adaptation (LoRA) is an approach designed to adapt massive networks — like large language models with enormous parameter sets — by focusing on much smaller additional parameters instead of retraining the entire network from scratch. This way:

  • Most of the model’s original parameters remain frozen and are not updated, preventing the model from being disrupted or producing unstable outputs.
  • The parts that do need updating are reduced to low-rank matrix multiplications, which largely preserve the model’s learning capacity. We’ll look at this in more detail shortly.

Ordinarily, a full update matrix ΔW ∈ R^(d×k) would need to be learned, but LoRA decomposes it into two smaller matrices, ΔW = A × B.

  • Here, the rank r is chosen to be small, such as 4, 8, or 16. Accordingly, A ∈ R^(d×r) and B ∈ R^(r×k). Because of this decomposition, instead of updating all d×k parameters in ΔW, we only update (d×r)+(r×k) parameters, which is significantly fewer. For example, with d = k = 8192 and r = 16, that is 262,144 parameters instead of roughly 67 million.

In other words, while the original weight W remains frozen, the values in the matrices A and B are updated. The resulting formula is:

  • W_eff = W + ΔW = W + A × B

In brief, this is how LoRA operates.
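To make the plan concrete, here is a minimal sketch of what a 4-bit + LoRA (QLoRA-style) setup could look like with the Transformers, bitsandbytes, and PEFT libraries. The model ID, rank, and target modules are illustrative assumptions, not the final training recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model with 4-bit quantization (~0.5 bytes per weight).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-405B",   # gated repo; requires approved access
    quantization_config=bnb_config,
    device_map="auto",             # shard the weights across available GPUs
)

# Attach small trainable A/B matrices to the attention projections.
lora_config = LoraConfig(
    r=16,                          # the low rank r from the formula above
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA A/B parameters are trainable
```

With only the attention projections adapted, the trainable parameter count stays in the tens of millions rather than hundreds of billions, which is what makes the GPU budget discussed next plausible.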

In industrial and academic settings, even when fine-tuning models above 70B parameters with LoRA, 12–16 GPUs are typically used. This is because they want to keep the training duration reasonable while also benefiting from scenarios with larger batch size and sequence length. By the same token, we can apply a similar approach for a much larger model — like 405B parameters.

Considering that a cluster of 12–16 H100 GPUs provides around 960–1280 GB of VRAM, in a 4-bit quantization + LoRA scenario, that capacity would allow us to train more comfortably and potentially increase the batch size.

Now, let’s get to work. You can follow the latest developments on my X account. See you soon… https://x.com/AlicanKiraz0

