Set Up Your Own Cybersecurity-Focused AI Development, Training, and Fine-Tuning Lab at Home

Alican Kiraz
6 min read · Nov 19, 2024
Image source: ChatGPT's response to the prompt, 'Based on what you know about me, create an image of what my life currently looks like.' :)

As AI applications rapidly evolve, commercial platforms such as OpenAI's ChatGPT, Google's Gemini, and many other LLM offerings provide advanced capabilities across thousands of domains. However, for us cybersecurity enthusiasts, these platforms fall short of providing the in-depth knowledge we truly need; they often share only surface-level or academic technical information sourced from publicly available data.

That’s why, in recent months, I decided to take the notes I compiled while preparing for the 23 certification programs I have successfully completed so far and use them to train a selected model, creating a fine-tuned version tailored to my needs. This idea led me to conduct extensive research, and I wanted to share the insights and knowledge I gained during this process with you.

Now, let's get straight to work, because while information is abundant everywhere, it rarely walks through the technical progression step by step.

Choosing a Model on Hugging Face

Hugging Face is a platform that provides open-source tools, datasets, and thousands of models for Natural Language Processing (NLP) and Artificial Intelligence (AI), making it easy to develop, train, and use AI models. One of its strongest features is the Transformers library: a powerful open-source Python library that lets you load pre-trained models such as BERT, GPT, T5, and LLaMA with a few lines of code and is optimized for tasks like text classification, translation, and summarization. With its simple API, you can download and integrate models, fine-tune them, and build your own applications on top of them.
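As a quick illustration of that API, here is a minimal sketch; the "gpt2" checkpoint is only an example, and any text-generation model you have access to would work the same way.

from transformers import pipeline

# Load a small text-generation model; "gpt2" is just an example checkpoint
generator = pipeline("text-generation", model="gpt2")

# Generate a short continuation of the prompt
print(generator("Fine-tuning a model at home lets you", max_new_tokens=20)[0]["generated_text"])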

You can choose from thousands of models on Hugging Face. Two key points to consider when selecting one are compatibility with the environment where you will deploy it and suitability for the training you intend to perform.

Note: Since GPU passthrough is not possible with VMware Workstation and VirtualBox, you cannot use your GPU inside such a virtual machine! Therefore, you should install Docker on your base system. Alternatively, you can try virtualization with Hyper-V or ESXi.
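Before building anything, it is worth verifying that Docker itself can see the GPU. A minimal check, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag is only an example and should match whatever base image you use):

docker run --rm --gpus all nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04 nvidia-smi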

Build the Docker Image

Let’s create a Dockerfile to build a Docker image based on Ubuntu and then install PyTorch on it.

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04

RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch, torchvision, and torchaudio libraries with CUDA support
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

WORKDIR /workspace

CMD ["bash"]

Now let's build the image from the Dockerfile and start a GPU-enabled container from it:

docker build -t pytorch-gpu .

docker run --gpus all -it --name pytorch-gpu-container pytorch-gpu
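If you want the files you create inside the container to persist, a variant of the same command mounts the current host directory into the /workspace directory defined in the Dockerfile (the host path is just an example):

docker run --gpus all -it -v "$(pwd)":/workspace --name pytorch-gpu-container pytorch-gpu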

Let's check whether the GPU is visible and the PyTorch installation was successful:

nvidia-smi

python3 -c "import torch; print(torch.cuda.is_available())"

Now we can proceed step by step to select, load, and fine-tune an LLM from Hugging Face. To keep the Python dependencies isolated, you can use virtualenv, Python's built-in venv, or conda; all three options are shown below.

Option 1:

pip install virtualenv  
virtualenv llm_env

source llm_env/bin/activate

Option 2:

Python also provides the venv module, which can be used to create virtual environments. If you don't want to use virtualenv, you can opt for this instead.

python3 -m venv llm_env

source llm_env/bin/activate
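Option 3:

If you prefer conda, which was mentioned above, a roughly equivalent setup looks like the following sketch; the environment name and Python version are only examples.

conda create -n llm_env python=3.10
conda activate llm_env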

Hugging Face Model Selection and Token

After selecting a model on Hugging Face, create an account and generate an access token. Then log in to Hugging Face from your machine's shell using the token and pull your model.

huggingface-cli login

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MODEL_NAME"  # replace with the model you selected
token = "<your_token>"

# In recent transformers versions, token= replaces the deprecated use_auth_token= argument
model = AutoModelForCausalLM.from_pretrained(model_name, token=token)
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)

Let’s Install the Libraries

Since we have activated the virtual environment, the libraries we install now will only be available inside this environment.

pip install transformers datasets torch

pip install torch==2.4.0+cu124 --index-url https://download.pytorch.org/whl/cu124  # pick the wheel index that matches your CUDA version

pip install --upgrade transformers

Testing the Model

input_text = "Hello, I am a language model. How can I help you?"

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt")

# Generate output
outputs = model.generate(inputs["input_ids"], max_length=50)

# Decode the output back into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)

Training with the Dataset

To perform fine-tuning on the model, we can use the Hugging Face Trainer class. For fine-tuning, we need an appropriate training dataset and suitable configurations.

As an example, let’s use the IMDb dataset. For a more advanced test dataset, you can use WikiText.

from datasets import load_dataset

dataset = load_dataset("imdb")

train_dataset = dataset["train"]
eval_dataset = dataset["test"]

As an example, let’s use the meta-llama/Llama-3.1-8B-Instruct model.

Note: To access the model from Meta, submit a request via Hugging Face!

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Now, let's write a tokenization function suitable for an RTX 3080. The most important factors to consider during tokenization are batch size and token length: for a GPU like the RTX 3080, using a high batch size and long token lengths on a large dataset can quickly exhaust memory.

def tokenize_function(examples):
    # We fix the length of each example using the padding and truncation parameters.
    return tokenizer(
        examples["text"],      # tokenize the text in the 'text' field
        padding="max_length",  # pad all texts to max_length
        truncation=True,       # truncate long texts
        max_length=512,        # maximum length (typically between 512-1024 for an RTX 3080)
    )
  • max_length: Fixes the token length of each text. For large datasets, keeping this value at a reasonable level, such as 512 or 1024, is typically a good practice to optimize memory usage.
  • padding="max_length": Ensures that all tokenized texts are of equal length by padding each text up to the max_length.
  • truncation=True: Truncates texts that exceed the specified max_length.

We tokenize the dataset using the map function, which applies tokenize_function to all examples in batches; the call is shown after the list below.

  • batched=True: Allows tokenization to be performed in batches, meaning multiple examples are processed at once. This can increase processing speed.
  • num_proc=4: Enables parallel tokenization across CPU cores, reducing processing time. However, choose a value that fits your CPU and RAM to avoid out-of-memory errors during preprocessing.

train_dataset = train_dataset.map(tokenize_function, batched=True, num_proc=4)  # parallel processing with 4 CPU cores
eval_dataset = eval_dataset.map(tokenize_function, batched=True, num_proc=4)    # parallel processing with 4 CPU cores

To keep training within the GPU's memory limits, you may need to reduce the per_device_train_batch_size and per_device_eval_batch_size parameters. Setting these values to 4 or lower can be beneficial on low-VRAM GPUs.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_dir="./logs",
    logging_steps=5,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    gradient_accumulation_steps=4,
)

If you encounter a pad token error:

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Move the input tensors to the same device as the model
inputs["input_ids"] = inputs["input_ids"].to(model.device)
inputs["attention_mask"] = inputs["attention_mask"].to(model.device)

if model.config.pad_token_id is None:
    model.config.pad_token_id = model.config.eos_token_id

# Run the model to generate output
outputs = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=50)

# Decode the output into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)

Start the training:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)


trainer.train()
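Once training finishes, you can run a quick evaluation pass on the held-out split with the same Trainer object; a minimal follow-up, assuming the setup above:

# Evaluate on eval_dataset and print metrics such as eval_loss
metrics = trainer.evaluate()
print(metrics)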

Practices, Tricks, Errors, and Solutions

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. 
GPU 0 has a total capacity of 23.64 GiB of which 20.81 MiB is free.
Process 2131856 has 23.62 GiB memory in use. Of the allocated memory
23.24 GiB is allocated by PyTorch, and 1.11 MiB is reserved by PyTorch
but unallocated. If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In this case, reduce the batch size.

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
)
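The error message above also suggests PYTORCH_CUDA_ALLOC_CONF. Setting it before starting Python is a low-effort option to try against fragmentation; it changes the allocator's behavior but does not add memory.

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True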

Gradient Accumulation

Instead of reducing the batch size further, we can use gradient accumulation: gradients from several small batches are accumulated before each optimizer step, so the small batches simulate the behavior of a larger batch (here, a per-device batch size of 2 with gradient_accumulation_steps=8 gives an effective batch size of 16).

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    gradient_accumulation_steps=8,
)

FP16 Training (Half-Precision Training)

By using mixed-precision (FP16) training, we can work with lower-precision numbers. This can significantly reduce memory usage while also improving performance.

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
)

Logging and Metric Tracking with Trainer

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_dir="./logs",
    logging_steps=5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

Monitoring with TensorBoard

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    logging_steps=10,
    report_to="tensorboard",  # send training logs to TensorBoard explicitly
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

%load_ext tensorboard
%tensorboard --logdir ./logs
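The %load_ext and %tensorboard lines above are Jupyter magics; outside a notebook, you can launch TensorBoard from a terminal instead and open the printed URL in your browser:

tensorboard --logdir ./logs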

Save the Training Results

trainer.save_model()
tokenizer.save_pretrained("./results")
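Because trainer.save_model() writes to output_dir (./results here) and the tokenizer was saved to the same folder, you can later reload the fine-tuned model for inference. A minimal sketch, assuming the sequence-classification setup above:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reload the fine-tuned model and tokenizer from the output directory
model = AutoModelForSequenceClassification.from_pretrained("./results")
tokenizer = AutoTokenizer.from_pretrained("./results")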

Check the training status

# Checking the number of steps in training
print(trainer.state.global_step)

# Checking the current epoch
print(trainer.state.epoch)
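trainer.state also keeps the metrics logged during training, which is handy for a quick look at the numbers without opening TensorBoard:

# Print the most recently logged training metrics (loss, learning rate, epoch, ...)
print(trainer.state.log_history[-1])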
