405 Billion-Parameter Cybersecurity-Focused LLM Model Project — Part 2: Surviving on the Edge of VRAM Limits

Alican Kiraz
6 min read · Jan 7, 2025

Hi, everyone. In the second part of our article series, we will focus on how we can train our model with limited GPUs and VRAM.

First, let's understand how VRAM is consumed during training. Many factors, such as the model architecture, the data-processing parameters, and the optimization techniques employed, directly affect VRAM usage.

The first thing that comes to mind is storing the weights, whose number grows with the parameter count of the LLM. Weights are the parameters that the model learns during training, and they are used continuously in the forward and backward passes. If you recall, I discussed the forward and backward passes in my previous article. To briefly touch on them again: each layer of a neural network holds weight and bias parameters to carry out its mathematical operations. In the forward pass, the weights are combined with the input data through these operations to produce a prediction (output). In the backward pass, after the loss is calculated, the error signal travels back through the network to determine how much each weight should be adjusted, that is, to compute the gradients.

Therefore, these weights must be stored in memory (GPU VRAM or CPU RAM) both during and after training. The GPUs used in training hold the weights in their VRAM to speed up calculations.
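As a quick sanity check, this can be measured directly in PyTorch. The sketch below (a minimal example; the tiny stand-in model is only for illustration) counts a module's parameters and estimates how much memory the weights alone occupy:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; a real LLM has billions of parameters.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

num_params = sum(p.numel() for p in model.parameters())
# element_size() is bytes per element: 4 for FP32, 2 for FP16/BF16.
weight_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

print(f"Parameters: {num_params:,}")
print(f"Weight memory: {weight_bytes / 1024**2:.1f} MiB")
```

Keep in mind that this covers only the weights; gradients, optimizer states, and activations come on top of it.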

There are different storage formats for these weights as well. These can include FP32, BF16, FP16, and other numerical representation (precision) formats. Among them, lower-precision formats such as FP16/BF16 use less VRAM. Let’s take a look at these.

The precision format used to store the weights directly affects both the model's quality (metrics such as accuracy) and its VRAM usage.
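To make the impact concrete, here is a rough back-of-envelope calculation for a 405-billion-parameter model, counting only the weights themselves (gradients, optimizer states, and activations are extra):

```python
params = 405e9  # 405 billion parameters

for fmt, bytes_per_param in {"FP32": 4, "FP16/BF16": 2, "INT8": 1}.items():
    gib = params * bytes_per_param / 1024**3
    print(f"{fmt:>10}: ~{gib:,.0f} GiB just for the weights")

# FP32: ~1,509 GiB   FP16/BF16: ~754 GiB   INT8: ~377 GiB
```

Even before touching gradients or optimizer states, the weights alone far exceed the VRAM of any single GPU, which is exactly why the precision choice and the parallelism strategies discussed below matter.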

Image source: https://www.researchgate.net/figure/mplement-the-calculation-on-the-32-bit-floating-point-precision-of-the-CFU-block-where_fig2_370874176

FP32 (32-bit Floating Point)

According to the IEEE 754 standard, this is the single-precision floating-point format, commonly referred to as float32. Each number (i.e., weight) requires 32 bits (4 bytes). It offers more precise representation and carries a lower risk of overflow in mathematical operations. However, alongside its advantages, it consumes more VRAM and is slower in computation compared to 16-bit formats like BF16 and FP16.

FP16 (16-bit Floating Point)

Based on the IEEE 754 standard's half-precision floating-point format (float16), each number requires 16 bits (2 bytes) of memory, half the memory of FP32. Many modern GPUs, especially those with NVIDIA Tensor Cores, can perform matrix multiplications very quickly at FP16 precision. However, its narrower representable range increases the risk of overflow or underflow. During training, methods such as gradient scaling are used to manage these risks.

BF16 (16-bit Brain Floating Point)

Conceptually, this is a special format that retains the exponent bits of FP32 but reduces the fraction (mantissa) bits. Like FP16, it uses 16 bits per value. It provides a wider representable range than FP16, which results in fewer overflow issues. However, because only 16 bits are available in total, its mantissa is even smaller than FP16's, so individual values are represented less precisely. Despite this, BF16 is considered more stable in practice than FP16.
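These trade-offs can be inspected directly with torch.finfo; the short sketch below shows the range-versus-precision difference between FP16 and BF16 described above:

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # max: largest representable value (range); eps: gap between 1.0 and the
    # next representable value (a proxy for mantissa precision).
    print(f"{str(dtype):>15}  max={info.max:.3e}  eps={info.eps:.3e}")

# FP16 overflows where BF16 does not:
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf  (FP16 max is about 65504)
print(x.to(torch.bfloat16))  # ~70144: rounded coarsely, but still finite
```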

Image source: https://huggingface.co/docs/transformers/v4.15.0/performance

Other Formats and Low-Precision Methods

  • INT8, INT4: These formats are used above all for compressing models at the inference stage (see the short quantization sketch after this list). Since their representational capacity is more limited than that of floating-point formats, they are rarely used directly in training.
  • Quantization Aware Training (QAT): It’s possible to train certain layers in INT8 or similarly low-bit formats during training, but this typically adds complexity.
  • Mixed Precision: Some parts of the model can be stored in FP16/BF16, while others remain in FP32. In particular, gradient accumulation and weight updates might be managed using master weights stored in FP32.
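As a rough illustration of why INT8 is attractive for compression (this is a generic symmetric per-tensor scheme written by hand, not the method of any particular library), here is a minimal quantization sketch in plain PyTorch:

```python
import torch

w = torch.randn(4096, 4096)      # FP32 weight matrix: 4096 * 4096 * 4 bytes = 64 MiB

scale = w.abs().max() / 127.0    # symmetric per-tensor scale
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)   # 16 MiB

w_dequant = w_int8.to(torch.float32) * scale       # approximate reconstruction
print("max abs error:", (w - w_dequant).abs().max().item())
print("FP32:", w.numel() * 4 // 2**20, "MiB  ->  INT8:", w_int8.numel() // 2**20, "MiB")
```

The 4x memory saving comes at the cost of the reconstruction error printed above, which is why such formats are mostly confined to inference or handled carefully via QAT.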

What Is a Model, and How Are Weights Stored?
A model comprises layers and the data structure formed by those layers, such as PyTorch's nn.Module or TensorFlow's tf.Module. Each layer holds weight and bias parameters, which are usually initialized randomly at the start.

The first step is loading them into memory. During this loading process, the defined parameters are copied to the GPU’s VRAM or to the CPU, depending on the settings. The weights in each layer are stored in the chosen precision format.
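In PyTorch terms, this step looks roughly like the following (a minimal sketch; the architecture and the device choice are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder architecture; its parameters are initialized randomly by default.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

device = "cuda" if torch.cuda.is_available() else "cpu"

# Copy the weights to GPU VRAM (or keep them in CPU RAM) in the chosen precision.
model = model.to(device=device, dtype=torch.bfloat16)

p = next(model.parameters())
print(p.device, p.dtype)  # e.g. cuda:0 torch.bfloat16
```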

How Are They Used in Forward and Backward Passes?
During training, these parameters are used for matrix multiplications and nonlinear activations in the forward pass. In the backward pass, the gradient for each parameter is calculated. These gradients are usually kept in the same or a similar precision format.
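A bare-bones training step makes this concrete: after backward(), every parameter carries a .grad tensor that also has to live in memory (sketch with dummy data; the shapes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)
x = torch.randn(8, 1024)        # dummy input batch
target = torch.randn(8, 1024)   # dummy targets

out = model(x)                                   # forward pass
loss = nn.functional.mse_loss(out, target)
loss.backward()                                  # backward pass: fills p.grad

for name, p in model.named_parameters():
    # Gradients are stored alongside the weights, in the same dtype by default.
    print(name, tuple(p.grad.shape), p.grad.dtype)
```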

Image source: https://stackoverflow.com/questions/64621585/pytorch-optimizer-adamw-and-adam-with-weight-decay

Mixed Precision Training
However, when mixed precision training is involved, some operations are performed against an FP32 master copy of the weights to reduce overflow/underflow errors. Depending on the optimizer used in training (e.g., SGD, Adam, AdamW), additional memory is also required to update these weights. Taking Adam as an example, two extra tensors, the first-moment ("moment") and second-moment ("variance") estimates, are stored for each parameter. These may be held in the same precision as the weights or sometimes in higher precision, such as FP32.
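Below is a minimal sketch of how this looks with PyTorch's automatic mixed precision (it assumes a CUDA GPU): the weights stay in FP32 as the master copy, the forward pass runs in FP16 inside autocast, a GradScaler guards against gradient underflow, and AdamW keeps two extra state tensors per parameter.

```python
import torch
import torch.nn as nn

device = "cuda"  # assumes a CUDA-capable GPU
model = nn.Linear(1024, 1024).to(device)           # FP32 master weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 1024, device=device)
target = torch.randn(8, 1024, device=device)

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.mse_loss(model(x), target)  # compute runs in FP16

scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)          # unscale the gradients, then update FP32 weights
scaler.update()

# Adam/AdamW keep two extra tensors per parameter, roughly the moment and
# variance mentioned above, adding two more model-sized copies to memory.
state = optimizer.state[next(model.parameters())]
print(list(state.keys()))       # e.g. ['step', 'exp_avg', 'exp_avg_sq']
```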

Batch Size and Micro-Batch Size

One of the most critical factors is the batch size (or micro-batch size). It determines the size of the activations and the overall processing load. A larger batch size allows more data samples to pass through the model simultaneously but also leads to significantly higher VRAM usage. Although using a smaller batch size can save VRAM, it may prolong the training time or affect the model's generalization performance.
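A common way to keep the effective batch size large while staying within VRAM limits is gradient accumulation: several small micro-batches are run back to back, their gradients are summed in place, and only then is an optimizer step taken. A minimal sketch with dummy data (the model and the "loader" of micro-batches are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

accumulation_steps = 8   # 8 micro-batches of 4 samples ~ an effective batch of 32
micro_batches = [(torch.randn(4, 1024), torch.randn(4, 1024))
                 for _ in range(accumulation_steps)]

optimizer.zero_grad()
for step, (x, target) in enumerate(micro_batches, start=1):
    loss = nn.functional.mse_loss(model(x), target)
    (loss / accumulation_steps).backward()   # gradients accumulate in p.grad
    if step % accumulation_steps == 0:
        optimizer.step()                     # one update per effective batch
        optimizer.zero_grad()
```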

Sequence Length

In transformer-based LLMs, as the sequence length of input text increases, so do the computational cost of self-attention layers and the size of the activations. A longer sequence length means larger matrices must be stored during forward and backward passes, causing VRAM consumption to rise quickly.
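A rough calculation shows why: the raw attention-score matrices alone grow with the square of the sequence length (the numbers below are illustrative, not this project's actual architecture):

```python
# Memory for the attention-score matrices of a single layer:
# batch * heads * seq_len^2 values.
batch, heads, bytes_per_value = 1, 32, 2   # BF16 -> 2 bytes per value

for seq_len in (2_048, 8_192, 32_768):
    score_bytes = batch * heads * seq_len ** 2 * bytes_per_value
    print(f"seq_len={seq_len:>6}: ~{score_bytes / 1024**3:.2f} GiB per layer")

# ~0.25 GiB at 2k, ~4 GiB at 8k, ~64 GiB at 32k, per layer, unless an
# implementation such as FlashAttention avoids materializing the full matrix.
```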

Numerical Precision
Compared to FP32, parameters and activations stored in FP16 or BF16 take half the VRAM. Modern GPUs, especially those with Tensor Cores, offer both speed improvements and VRAM savings when using FP16/BF16.

Image source: https://github.com/rasbt/deeplearning-models/blob/master/pytorch_ipynb/mechanics/gradient-checkpointing-nin.ipynb

Gradient Checkpointing and ZeRO

  • Gradient Checkpointing: This approach recalculates certain intermediate activations during the backward pass instead of storing them. It lowers the storage cost for activations but requires additional compute for the recalculation (see the sketch after this list).
  • ZeRO (Zero Redundancy Optimizer): In distributed training settings, ZeRO reduces per-GPU VRAM usage by partitioning optimizer states, gradients, and, at its highest stage, the parameters themselves across multiple GPUs.
  • Model/Pipeline/Tensor Parallelism: By employing pipeline parallelism or tensor parallelism, you can distribute the model across multiple GPUs, thus reducing the amount of parameters and activations that each individual GPU needs to store.
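For gradient checkpointing specifically, PyTorch ships a utility that trades compute for memory. A minimal sketch of wrapping a block with it (the block itself is just a placeholder):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)   # gradients still flow as usual
```

ZeRO and the parallelism strategies typically come from frameworks such as DeepSpeed or Megatron-LM and are configured at the distributed-training level rather than in the model code itself.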

I hope this has been an enjoyable read. In the next article, we’ll write code tailored to these parameters.

Written by Alican Kiraz

Head of Cyber Defense Center @Trendyol | CSIE | CSAE | CCISO | CASP+ | OSCP | eCIR | CPENT | eWPTXv2 | eCDFP | eCTHPv2 | OSWP | CEH Master | Pentest+ | CySA+
