How Will Cybersecurity Technologies Be Shaped by AI? Part 2: Creating LLM Train Dataset with Cyber Security Information and Data
The first spark for using AI models with a cybersecurity focus will be either developing an AI model that inherently thinks with a cybersecurity mindset, or fine-tuning an already developed model specifically for cybersecurity and thereby readjusting its weights. The more natural first step is the fine-tuned model. General-purpose LLMs are built to reason, make decisions, and analyze, competences that go well beyond cybersecurity alone. A more robust framework can therefore be established by taking a model whose mechanisms for understanding, solution generation, and decision-making have already matured and training it further with a cybersecurity focus.
So how should we begin the training? First, let's think about how an LLM is actually trained. Broadly, an LLM's training consists of two main stages:
Pre-training
It's actually the hardest part… The model is trained on a massive text corpus of billions or even trillions of tokens, typically with self-supervised objectives such as Masked Language Modeling or Auto-regressive (next-token) Language Modeling. The goal of this stage is for the model to learn the structure of language, general world knowledge, and context.
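To make the autoregressive objective concrete, here is a minimal sketch using Hugging Face Transformers. The "gpt2" checkpoint is only a small placeholder for illustration, not a recommendation for a cybersecurity base model.

```python
# Minimal sketch of the autoregressive (next-token) objective: the model predicts
# each next token, and the loss is the cross-entropy against the input shifted by
# one position. "gpt2" is only a small placeholder model.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer("Suricata is a network intrusion detection system.", return_tensors="pt")
# Passing the input ids as labels makes the model compute the next-token loss internally.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss.item())
```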
Fine-tuning
Fine-tuning is an additional training process that adapts an LLM, already pre-trained on a large text corpus, to a specific purpose, goal, or dataset. The model is trained further on more specific tasks or more narrowly structured datasets so that it goes beyond general language knowledge and patterns and achieves higher performance in a particular task or domain.
Pre-trained models learn general linguistic patterns from very large and diverse datasets; fine-tuning then gives the model a purpose, whether that is a domain specialization or a specific task such as question answering or sentiment analysis. Moreover, training an LLM from scratch for a single goal is extremely costly and time-consuming, and practically impossible for individual users. Parameter-efficient techniques such as LoRA, which adjust and shape the model's weights through small added layers, are therefore much more effective.
How is LLM fine-tuning done?
Data Preparation
- Task Definition and Labeled Data: If the fine-tuning targets a supervised task such as classification or question answering, we need a dataset that pairs each input with its expected output (the label) so the model has something to learn from.
- Cleaned & Preprocessed Data: Before training, the raw text is cleaned: unnecessary characters and formatting noise are removed. In addition, to comply with regulations such as the GDPR, the dataset must be cleared of any personal data.
- Tokenization and Segmentation: Split your data into tokens using the tokenizer of the model you plan to fine-tune (e.g., Byte-Pair Encoding, WordPiece, or SentencePiece). Be careful in this step: the tokenizer must be compatible with the model, otherwise the token IDs will not match the model's vocabulary and you risk degrading its language ability. A short tokenization sketch follows this list.
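As an illustration, here is a minimal tokenization sketch with Hugging Face Transformers; "gpt2" is only a placeholder, so load whatever tokenizer ships with your actual base model.

```python
# Minimal sketch: tokenizing text with the tokenizer of the base model you plan
# to fine-tune. "gpt2" is only a placeholder so that the vocabularies match.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

sample = "ET SCAN Behavioral Unusual Port 139 traffic Potential Scan or Infection"
encoded = tokenizer(sample, truncation=True, max_length=512)

print(encoded["input_ids"])                                   # token IDs fed to the model
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding tokens
```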
Splitting the Dataset
- Training Set: The primary data used by the model to update its parameters.
- Validation Set: Used to monitor overfitting and to guide decisions such as hyperparameter selection and early stopping.
- Test Set: The data reserved until the very end, used to measure the model's true performance; a sketch of such a three-way split follows this list.
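A minimal sketch of such a three-way split with the Hugging Face datasets library might look like this (the file name "cyber_dataset.jsonl" is a placeholder for your prepared examples):

```python
# Minimal sketch of an 80/10/10 split with the Hugging Face datasets library.
from datasets import load_dataset

ds = load_dataset("json", data_files="cyber_dataset.jsonl", split="train")

split = ds.train_test_split(test_size=0.2, seed=42)          # 80% train, 20% held out
val_test = split["test"].train_test_split(test_size=0.5, seed=42)  # 10% validation, 10% test

dataset = {
    "train": split["train"],
    "validation": val_test["train"],
    "test": val_test["test"],
}
print({name: len(part) for name, part in dataset.items()})
```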
Model Architecture and Parameters
- Full Fine-Tuning: All layers of the model, such as the encoder/decoder blocks and attention heads, are included in the learning process and updated. Because a very large number of parameters are trained, the training costs are high. In addition, shifting all of the weights during training can cause the model to lose previously learned knowledge and produce highly inaccurate outputs.
- LoRA (Low-Rank Adaptation): Techniques like LoRA (as well as Prefix Tuning, Adapter Tuning, etc.) keep most of the model frozen and add small trainable parameters only to certain layers. This reduces memory and computational costs and makes it much harder to disrupt the balance of the original weights. A minimal LoRA sketch follows this list.
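As a rough illustration, attaching LoRA adapters with the peft library can look like the sketch below; the base model and the target_modules value are assumptions you would adapt to your own architecture.

```python
# Minimal LoRA sketch with the peft library. "gpt2" and target_modules are
# placeholders; check the attention module names of your actual model
# (e.g. q_proj/v_proj for LLaMA-style architectures).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights remain trainable
```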
Training Environment and Infrastructure
- Hardware: GPUs (RTX series, Ada Lovelace generation cards, etc.) or TPUs are typically preferred for training LLMs. For models with a very large number of parameters, distributed training across multiple GPUs/TPUs may be necessary; DDP (Distributed Data Parallel) is used for this in the training code (a minimal DDP sketch appears after the install command below).
- Software: Typically, deep learning libraries such as PyTorch, TensorFlow, or JAX are used for training. Additionally, extra libraries like Hugging Face Transformers, DeepSpeed, and Accelerate offer significant convenience for model training, data processing, and distributed training.
It would be highly beneficial to install the following libraries in your environment before you begin training.
pip3 install transformers datasets peft accelerate bitsandbytes torch
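To illustrate the DDP approach mentioned above, here is a minimal, self-contained sketch; the tiny linear model is only a stand-in, and in practice the LLM itself would be wrapped the same way.

```python
# Minimal DDP sketch, launched with: torchrun --nproc_per_node=2 ddp_demo.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(10, 2).to(device)   # stand-in for the real model
ddp_model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=2e-5)
x = torch.randn(8, 10, device=device)
y = torch.randint(0, 2, (8,), device=device)

loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
loss.backward()   # gradients are averaged across all processes here
optimizer.step()
dist.destroy_process_group()
```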
Memory and Computational Considerations
In LLMs, factors such as batch size, sequence length, and the number of model parameters can rapidly fill up GPU memory. Therefore, to optimize, we often use methods like Gradient Checkpointing, Mixed Precision Training (FP16, BF16), and parameter-efficient techniques (LoRA, Prefix Tuning, etc.). We will explore these in detail in the following sections.
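As a small preview, enabling two of these optimizations on a Hugging Face model can look like the sketch below (it assumes a CUDA GPU with bfloat16 support; "gpt2" is again only a placeholder):

```python
# Minimal sketch of gradient checkpointing plus mixed precision on a causal LM.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
model.gradient_checkpointing_enable()   # recompute activations in the backward pass to save memory
model.train()

batch = {"input_ids": torch.randint(0, model.config.vocab_size, (4, 128), device="cuda")}
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # mixed-precision forward pass
    loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```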
Hyperparameter Tuning
- Learning Rate: Typically, small values like 1e-5 or 2e-5 are preferred. For large models, setting the learning rate too high can “disrupt” the weights, causing the model to lose its previously learned knowledge. Additionally, a learning rate scheduler can be employed to gradually decrease the learning rate as the number of steps increases.
- Batch Size: Instead of using small batch sizes (e.g., 8, 16), larger batch sizes (e.g., 32, 64, 128) might be desired; however, due to GPU memory constraints, gradient accumulation is sometimes used to effectively increase the batch size. Larger batch sizes can improve training stability but also increase memory usage.
- Epoch or Number of Steps: Determines how many complete passes are made over the entire dataset or how many total training steps are taken. More epochs do not always mean better results; over-training can lead to overfitting. Typically, 3–5 epochs or a fixed number of steps are tried.
- Regularization: Methods such as dropout, weight decay, and label smoothing are used to control overfitting.
- Sequence Length: The maximum number of tokens the model can process at once. As this value increases, training costs rise as well. A short sketch collecting these hyperparameters follows this list.
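The hyperparameters above can be collected into a single configuration, for example with Hugging Face TrainingArguments; all of the values below are illustrative starting points rather than recommendations.

```python
# Minimal sketch: expressing the hyperparameters above with TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finetune-out",
    learning_rate=2e-5,
    lr_scheduler_type="linear",       # decay the learning rate as the step count grows
    warmup_ratio=0.03,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,    # effective batch size of 32
    num_train_epochs=3,
    weight_decay=0.01,                # regularization
    fp16=True,                        # mixed precision to ease memory pressure
)
```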
What kind of datasets should we prepare?
For the datasets we create, we can use the following alerts and logs as examples:
- HIDS and NIDS Detection Data
- Sysmon Detection Log
- EDR/EPP/AV Detection Log
Now, let’s examine these sample logs and detection data one by one and see how we can transform them into training data.
HIDS and NIDS Detection Data
Let's look at a sample Suricata alert.
{
"timestamp": "2017-03-24T14:04:30.413770-0600",
"flow_id": 1915383544959049,
"pcap_cnt": 172556,
"event_type": "alert",
"src_ip": "10.192.91.91",
"src_port": 58677,
"dest_ip": "10.10.1.14",
"dest_port": 139,
"proto": "TCP",
"alert": {
"action": "allowed",
"gid": 1,
"signature_id": 2001579,
"rev": 15,
"signature": "ET SCAN Behavioral Unusual Port 139 traffic Potential Scan or Infection",
"category": "Misc activity",
"severity": 3
},
"flow": {
"pkts_toserver": 2,
"pkts_toclient": 0,
"bytes_toserver": 120,
"bytes_toclient": 0,
"start": "2017-03-24T14:04:30.413769-0600"
}
}
Source: https://github.com/FrankHassanabad/suricata-sample-data/blob/master/samples/wrccdc-2017/alerts-only.json
When we want the model to interpret this alert data, we first need to train it on Suricata outputs and alert formats. We can do this using OpenAI-style training data built on the System–User–Assistant logic.
For example:
System: You are an AI model specialized in incident response.
User: What are the parameters used in interpreting Suricata alerts?
Assistant: … dest_port means the destination port, dest_ip means the destination host. …
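Serialized as training data, one such example could become a single JSONL line in the common chat "messages" format; the sketch below is one possible layout, and the file name is a placeholder.

```python
# Minimal sketch: serializing one System–User–Assistant example as a JSONL line.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are an AI model specialized in incident response."},
        {"role": "user", "content": "What are the parameters used in interpreting Suricata alerts?"},
        {"role": "assistant", "content": "dest_port is the destination port, dest_ip is the destination host, ..."},
    ]
}

with open("suricata_finetune.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```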
In the next phase, once the syntax and format have been taught, example steps are provided for interpreting incoming data.
User:
What is the goal in the Suricata NIDS detection output shown below?
…
"src_ip": "10.192.91.91",
"src_port": 58677,
"dest_ip": "10.10.1.14",
"dest_port": 139,
…
Assistant: According to this output, the src_ip is scanning port 139 on the dest_ip to perform enumeration.
In the next dataset step:
User:
In the Suricata NIDS detection output shown below, why is port 139 being targeted?
…
"src_ip": "10.192.91.91",
"src_port": 58677,
"dest_ip": "10.10.1.14",
"dest_port": 139,
…
Assistant: According to this output, the src_ip is scanning port 139 on the dest_ip to perform enumeration. The purpose of this enumeration could be to check the version, status, and configuration of the service behind the port in order to carry out vulnerability checks.
In this way, we have taught the model, step by step, the structure of a detection log from a given source, then the cause behind it, and finally its purpose.
If the same structure is applied to Sysmon and EDR/EPP/AV outputs for training the model, it can be specifically trained for certain tools and use cases.
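As a rough sketch of how this could be automated, the script below reads Suricata alerts (one JSON object per line, eve.json style) and emits chat-format training examples; the file names and the templated answer are placeholders, and in practice an analyst would write or review the assistant responses.

```python
# Minimal sketch: turning Suricata alert JSON lines into chat-style training examples.
import json

def alert_to_example(alert: dict) -> dict:
    question = (
        "What is the goal in the Suricata NIDS detection output shown below?\n"
        + json.dumps(alert, indent=2)
    )
    # Naive placeholder answer built from alert fields; a real dataset would use
    # analyst-written or analyst-reviewed responses instead.
    answer = (
        f"The source {alert['src_ip']} is generating traffic to port {alert['dest_port']} "
        f"on {alert['dest_ip']}, matching the signature '{alert['alert']['signature']}'."
    )
    return {
        "messages": [
            {"role": "system", "content": "You are an AI model specialized in incident response."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

with open("alerts-only.json") as src, open("suricata_train.jsonl", "w") as dst:
    for line in src:
        line = line.strip()
        if not line:
            continue
        alert = json.loads(line)
        if alert.get("event_type") == "alert":
            dst.write(json.dumps(alert_to_example(alert)) + "\n")
```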
Maybe we could even build an AI Agent… :)