Set Up Your Own Cybersecurity-Focused AI Development, Training, and Fine-Tuning Lab at Home
As AI applications rapidly evolve, commercial platforms such as OpenAI's ChatGPT, Google's Gemini, and many other LLM services offer advanced capabilities across thousands of domains. For cybersecurity enthusiasts, however, these platforms fall short of providing the in-depth knowledge we truly need: they often share only surface-level or academic technical information sourced from publicly available data.
That’s why, in recent months, I decided to take the notes I compiled while preparing for the 23 certification programs I have successfully completed so far and use them to train a selected model, creating a fine-tuned version tailored to my needs. This idea led me to conduct extensive research, and I wanted to share the insights and knowledge I gained during this process with you.
Now, let's get straight to work, because while information on these tools is abundant, it rarely walks through the technical progression step by step.
Choosing a Model on Huggingface
Hugging Face is a platform that provides open-source tools, datasets, and a vast catalog of models for Natural Language Processing (NLP) and Artificial Intelligence (AI), letting users easily develop, train, and use AI models. One of its strongest features is the Transformers library: a powerful open-source Python library for NLP tasks that makes it easy to work with pre-trained models such as BERT, GPT, T5, and LLaMA, and that is optimized for tasks like text classification, translation, and summarization. With its simple API, it lets you download and integrate models, perform fine-tuning, and even build your own applications.
You can choose from thousands of models on Hugging Face. Two key points to consider when selecting one are compatibility with the environment where you will deploy it (hardware, VRAM, operating system) and suitability for the task you intend to train it on.
Note: Since GPU passthrough is not possible with VMware Workstation or VirtualBox, you cannot use your GPU inside such a virtual machine! Therefore, you should install Docker on your base system. Alternatively, you can try virtualization with Hyper-V or ESXi, which do support GPU passthrough.
Build the Docker Image
Let’s create a Dockerfile to build a Docker image based on Ubuntu and then install PyTorch on it.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04
RUN apt-get update && apt-get install -y \
python3 \
python3-pip \
wget \
&& rm -rf /var/lib/apt/lists/*
# Install PyTorch, torchvision, and torchaudio libraries with CUDA support
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
WORKDIR /workspace
CMD ["bash"]
Let's build the image from the Dockerfile and start a container as follows:
docker build -t pytorch-gpu .
docker run --gpus all -it --name pytorch-gpu-container pytorch-gpu
Let's check whether the PyTorch installation was successful:
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"
Now we can proceed step by step to select, load, and fine-tune an LLM from Hugging Face. To keep dependencies isolated and manageable, you can use either virtualenv or conda.
Option 1:
pip install virtualenv
virtualenv llm_env
source llm_env/bin/activate
Option 2:
Python also provides the venv module, which can be used to create virtual environments. If you don't want to use virtualenv, you can opt for this instead.
python3 -m venv llm_env
source llm_env/bin/activate
Hugging Face Model Selection and Token
After selecting a model on Hugging Face, create an account and generate an access token. Then log in to Hugging Face from a terminal on your machine using that token and pull the model.
huggingface-cli login
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "MODEL_NAME"
token = "<your_token>"
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=token)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=token)
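If the full-precision weights are too heavy for your RAM/VRAM, a variant like the one below can load the model in half precision and place layers automatically. This is an optional sketch: it assumes the accelerate package is installed, and on older transformers releases the token argument may still be called use_auth_token.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=token,                # recent transformers versions; use_auth_token on older ones
    torch_dtype=torch.float16,  # load weights in half precision to roughly halve memory use
    device_map="auto",          # requires the accelerate package; spreads layers across GPU/CPU
)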
Let’s Install the Libraries
Since we have activated the virtual environment, we will install libraries that will only be valid within this environment.
pip install transformers datasets torch
# Optional: pin torch to a specific CUDA build that matches your setup, e.g. for CUDA 12.4
pip install torch==2.4.0+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install --upgrade transformers
Testing the Model
input_text = "Merhaba, ben bir dil modeliyim ve size nasıl yardımcı olabilirim?"
# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt")
# Generate output
outputs = model.generate(inputs["input_ids"], max_length=50)
# Decode the output back into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Print the generated text
print(generated_text)
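The snippet above runs on whatever device the model was loaded to (CPU by default). A small optional variation, assuming the model was loaded without device_map, moves everything onto the GPU when one is available:
import torch

# Move the model and the tokenized inputs to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer(input_text, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))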
Training with the Dataset
To perform fine-tuning on the model, we can use the Hugging Face Trainer class. For fine-tuning, we need an appropriate training dataset and suitable configurations.
As an example, let’s use the IMDb dataset. For a more advanced test dataset, you can use WikiText.
from datasets import load_dataset
dataset = load_dataset("imdb")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
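The IMDb training split contains 25,000 reviews. For a first end-to-end run on a single consumer GPU, you may prefer to work on a smaller slice and scale up later; an optional way to do that with the datasets library:
# Optional: take small, shuffled subsets for a quick first experiment
train_dataset = train_dataset.shuffle(seed=42).select(range(2000))
eval_dataset = eval_dataset.shuffle(seed=42).select(range(500))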
As an example, let's use the meta-llama/Llama-3.1-8B-Instruct model.
Note: To access the model from Meta, submit a request via Hugging Face!
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Now, let's write a tokenization function suited to the RTX 3080. Two of the most important factors to consider during tokenization are the batch size and the token length. For a GPU like the RTX 3080, using a high batch size together with long token lengths on large datasets can overload memory.
def tokenize_function(examples):
    # Fix the length of each example using the padding and truncation parameters
    return tokenizer(
        examples["text"],       # tokenize the text in the 'text' field
        padding="max_length",   # pad all texts to max_length
        truncation=True,        # truncate long texts
        max_length=512,         # maximum length (typically between 512-1024 for an RTX 3080)
    )
max_length: Fixes the token length of each text. For large datasets, keeping this value at a reasonable level, such as 512 or 1024, is typically good practice to optimize memory usage.
padding="max_length": Ensures that all tokenized texts are of equal length by padding each text up to max_length.
truncation=True: Truncates texts that exceed the specified max_length.
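A quick, purely illustrative sanity check that the function really produces fixed-length sequences (note that LLaMA tokenizers need a pad token first; see the pad token workaround further below):
# Illustrative check only: assign a pad token if the tokenizer lacks one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

sample = tokenize_function({"text": ["This movie was surprisingly good."]})
print(len(sample["input_ids"][0]))  # 512, because of padding="max_length"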
We will tokenize the dataset using the map function, which processes and tokenizes the examples collectively.
batched=True: Allows tokenization to be performed in batches, meaning multiple examples are processed at once. This can increase processing speed.
num_proc=4: Enables parallel processing across CPU cores, reducing processing time. However, it's important to select an appropriate value to avoid out-of-memory errors, especially on a machine driving a GPU like the RTX 3080.
train_dataset = train_dataset.map(tokenize_function, batched=True, num_proc=4)  # parallel processing with 4 CPU cores
eval_dataset = eval_dataset.map(tokenize_function, batched=True, num_proc=4)  # parallel processing with 4 CPU cores
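Recent Trainer versions drop unused columns automatically, but if you want to inspect or control the tensors yourself, an optional step like this works with the datasets library:
# Optional: drop the raw text column and return PyTorch tensors directly
train_dataset = train_dataset.remove_columns(["text"])
eval_dataset = eval_dataset.remove_columns(["text"])
train_dataset.set_format("torch")
eval_dataset.set_format("torch")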
To keep GPU memory under control during training and evaluation, you may also need to reduce the per_device_train_batch_size and per_device_eval_batch_size parameters. Setting these values to 4 or lower can be beneficial when using low-VRAM GPUs.
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
logging_dir="./logs",
logging_steps=5,
learning_rate=2e-5,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
num_train_epochs=3,
weight_decay=0.01,
fp16=True,
gradient_accumulation_steps=4,
)
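If memory is still tight with these settings, TrainingArguments also exposes gradient_checkpointing, which trades extra compute for a large reduction in activation memory; an optional variant of the configuration above:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_dir="./logs",
    logging_steps=5,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,  # recompute activations in the backward pass to save VRAM
)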
If you encounter a pad token error (LLaMA tokenizers do not define one by default), reuse the EOS token as the pad token:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # keep input_ids and attention_mask on the model's device

if model.config.pad_token_id is None:
    model.config.pad_token_id = model.config.eos_token_id

# Run the model to generate output
outputs = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=50)

# Decode the output into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)
Start the training:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
Practices, Tricks, Errors, and Solutions
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB.
GPU 0 has a total capacity of 23.64 GiB of which 20.81 MiB is free.
Process 2131856 has 23.62 GiB memory in use. Of the allocated memory
23.24 GiB is allocated by PyTorch, and 1.11 MiB is reserved by PyTorch
but unallocated. If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
In this case, reduce the batch size.
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
num_train_epochs=3,
weight_decay=0.01,
)
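Besides lowering the batch size, the error message itself points to another lever: the PYTORCH_CUDA_ALLOC_CONF environment variable. One optional way to set it is from Python, before the first CUDA allocation:
import os

# Must be set before CUDA is initialized (ideally before importing torch)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"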
Gradient Accumulation
Instead of reducing the batch size further, we can use gradient accumulation to train with small per-device batches while simulating the behavior of larger ones: gradients are accumulated over several steps before each optimizer update, so the configuration below has an effective batch size of 2 × 8 = 16.
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
num_train_epochs=3,
weight_decay=0.01,
gradient_accumulation_steps=8,
)
FP16 Training (Half Precision)
By using mixed-precision (fp16) training, we can work with lower-precision numbers. This can significantly reduce memory usage while also improving performance.
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
num_train_epochs=3,
weight_decay=0.01,
fp16=True,
)
Logging and Metric Tracking with Trainer
training_args = TrainingArguments(
output_dir="./results",
eval_strategy="epoch",
logging_dir="./logs",
logging_steps=5,
per_device_train_batch_size=8,
num_train_epochs=3,
)
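Trainer logs the training loss by default; for task metrics such as accuracy, you can pass a compute_metrics function. A minimal optional sketch for this two-class IMDb setup:
import numpy as np
from transformers import Trainer

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair produced during evaluation
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)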
Monitoring with TensorBoard
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
logging_dir="./logs",
logging_steps=10,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
trainer.train()
%load_ext tensorboard
%tensorboard --logdir ./logs
Save the Training Results
trainer.save_model()
tokenizer.save_pretrained("./results")
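trainer.save_model() writes the weights and config to the output_dir set in TrainingArguments ("./results" here), so you can later reload the fine-tuned model and tokenizer from that folder:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reload the fine-tuned model and tokenizer from the training output directory
model = AutoModelForSequenceClassification.from_pretrained("./results")
tokenizer = AutoTokenizer.from_pretrained("./results")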
Check the training status
# Checking the number of steps in training
print(trainer.state.global_step)
# Checking the current epoch
print(trainer.state.epoch)
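The Trainer also keeps a full log history (loss, learning rate, and evaluation metrics per logging step) on its state object, which is handy for a quick review without TensorBoard:
# Print the last few logged records (loss, learning rate, eval metrics, ...)
for record in trainer.state.log_history[-3:]:
    print(record)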