Build Your Own LLM: A Complete Guide to Training LLMs with Hugging Face Transformers
Hugging Face’s Transformers library is a fantastic training tool, particularly once you’re well-versed in training metrics; once you master it, you’ll find you rarely need any other AI training tools. Now, let’s take a detailed look at the Transformers library, the training resources it utilizes, and its various parameters. Afterward, we’ll train a base LLM, create our own model, and upload it to Hugging Face.
While reading this article, you can also experiment with the sample training code I’ve provided. With this code, you can download a model from Hugging Face and train it on a suitable dataset (with Instruction, Input, and Output columns).
import os
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
Trainer,
TrainingArguments,
BitsAndBytesConfig,
pipeline,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
# Suppress tokenizers parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# 1. Load dataset
dataset = load_dataset("DATASET_NAME")
# 2. Preprocess data
def preprocess_data(batch):
    texts = [
        f"Instruction: {instruction}\nInput: {input_text}\nOutput: {output_text}"
        for instruction, input_text, output_text in zip(batch["instruction"], batch["input"], batch["output"])
    ]
    return {"text": texts}
processed_dataset = dataset.map(preprocess_data, batched=True)
# 3. Load tokenizer
model_path = "MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# 4. Tokenize data and add labels
def tokenize_function(batch):
    tokenized_inputs = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    tokenized_inputs["labels"] = [
        [(label if label != tokenizer.pad_token_id else -100) for label in input_ids]
        for input_ids in tokenized_inputs["input_ids"]
    ]
    return tokenized_inputs
tokenized_dataset = processed_dataset.map(tokenize_function, batched=True, remove_columns=["instruction", "input", "output"])
train_test_split = tokenized_dataset["train"].train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]
# 5. Load model with quantization
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto"
)
# Configure PEFT (LoRA)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# 6. Training arguments
training_args = TrainingArguments(
    output_dir="./output_model",
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=1000,
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # Increased for better throughput
    gradient_accumulation_steps=4,  # Adjusted for effective batch size
    num_train_epochs=3,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    fp16=True,
    logging_dir="./logs",
    warmup_steps=500,
    dataloader_num_workers=4,
    max_grad_norm=1.0,
)
# 7. Trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
# 8. Train model
trainer.train()
# 9. Save model and tokenizer
peft_model.save_pretrained("./output_model")
tokenizer.save_pretrained("./output_model")
print("Training completed successfully. Model and tokenizer saved to ‘./output_model’.")
# 10. Switch to base model for inference
inference_pipeline = pipeline(
    "text-generation",
    model=base_model,  # Use the base model, not PeftModelForCausalLM
    tokenizer=tokenizer
)
# 11. Ask 3 questions to the trained model
questions = [
    "QUESTION 1",
    "QUESTION 2",
    "QUESTION 3"
]
answers = []
for question in questions:
    output = inference_pipeline(
        question,
        max_length=100,
        num_return_sequences=1,
        temperature=0.7,  # Encourage creativity
        top_p=0.9,        # Nucleus sampling
        top_k=50,         # Limit to top 50 tokens
        truncation=True   # Explicit truncation
    )
    answers.append(output[0]["generated_text"])
# 12. Display the questions and answers
for question, answer in zip(questions, answers):
    print(f"Q: {question}")
    print(f"A: {answer}")
    print("-" * 50)
In the first step, let’s take a look at the following libraries, which you will frequently encounter and need during training:
transformers datasets peft accelerate bitsandbytes torch
Transformers
It’s a framework that offers ready-to-use architectures and high-level training/fine-tuning functions for popular and modern large language models (GPT, BERT, etc.). It supports both PyTorch and TensorFlow backends and is commonly used for tasks such as text classification, question answering, text generation, translation, and summarization.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
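As a quick illustration of the library in action, the sketch below loads a small causal language model and generates text with the high-level pipeline API. The model name "gpt2" is just a placeholder assumption; any causal LM from the Hub works the same way.
# Minimal sketch: text generation with a Transformers pipeline ("gpt2" is an assumed placeholder)
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
result = generator("Hugging Face Transformers makes it easy to", max_new_tokens=30)
print(result[0]["generated_text"])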
Datasets
It’s a library that allows you to easily load, manage, transform, and share datasets in various formats (CSV, JSON, text files, etc.). Thanks to its design optimized for distributed and parallel processing, it can comfortably handle even very large datasets containing millions of rows. Additionally, you can organize and preprocess your datasets with functions like map, filter, shuffle, and train_test_split.
import os
import torch
from datasets import load_dataset, DatasetDict
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
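To make the map, shuffle, and train_test_split workflow concrete, here is a minimal sketch. The dataset name "DATASET_NAME" and the "instruction" column are placeholder assumptions for illustration.
# Minimal sketch: load, transform, and split a dataset ("DATASET_NAME" is a placeholder)
from datasets import load_dataset
dataset = load_dataset("DATASET_NAME")
dataset = dataset.map(lambda example: {"text": example["instruction"]})  # assumes an "instruction" column
dataset = dataset.shuffle(seed=42)
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
print(split["train"].num_rows, split["test"].num_rows)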
Peft
PEFT (Parameter-Efficient Fine-Tuning) lets you adapt LLMs during fine-tuning by updating only a small number of parameters. It achieves this through methods such as LoRA (Low-Rank Adaptation), Prefix Tuning, and Prompt Tuning, letting you freeze the majority of the model’s parameters and train only small additional parameter sets. As a result, memory and compute requirements drop sharply, and even large-scale models can be fine-tuned on modest hardware using this approach.
import os
import torch
from datasets import load_dataset, DatasetDict
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
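Below is a minimal sketch of attaching LoRA adapters to a quantized base model. The model name is a placeholder, and the target module names ("q_proj", "v_proj") are an assumption that fits LLaMA-style architectures; other model families may use different names.
# Sketch: attach LoRA adapters to a quantized base model; names in quotes are placeholders
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
base_model = AutoModelForCausalLM.from_pretrained(
    "MODEL_NAME",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)  # stabilizes training of 8-/4-bit models
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumes LLaMA-style attention projection names
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of all weights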
Accelerate, Bitsandbytes, and torch (PyTorch)
Accelerate is a library that simplifies distributed and multi-GPU training scenarios in PyTorch-based workflows: you can scale your PyTorch code from a single GPU to multiple GPUs, or even multiple nodes, with minimal changes. Bitsandbytes, on the other hand, is a library that enables training and running large models at low-bit (8-bit, 4-bit) precision. Torch (PyTorch), as you know, is among the most popular libraries for deep learning and GPU-accelerated tensor computation.
import os
import torch
from datasets import load_dataset, DatasetDict
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
Trainer,
TrainingArguments,
BitsAndBytesConfig,
DefaultDataCollator
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
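To see bitsandbytes and device_map-based placement (handled by Accelerate under the hood) working together, here is a minimal sketch of loading a model in 8-bit. The model name is a placeholder assumption.
# Sketch: load a causal LM in 8-bit with bitsandbytes ("MODEL_NAME" is a placeholder)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "MODEL_NAME",
    quantization_config=bnb_config,
    device_map="auto",  # Accelerate places layers across available GPUs/CPU
    torch_dtype=torch.float16,
)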
Now let’s examine the trainer arguments.
{
"auto_find_batch_size": "false",
"chat_template": "none",
"disable_gradient_checkpointing": "false",
"distributed_backend": "ddp",
"eval_strategy": "epoch",
"merge_adapter": "false",
"mixed_precision": "fp16",
"optimizer": "adamw_torch",
"peft": "true",
"padding": "right",
"quantization": "int4",
"scheduler": "linear",
"unsloth": "false",
"use_flash_attention_2": "false",
"batch_size": "2",
"block_size": "1024",
"epochs": "3",
"gradient_accumulation": "4",
"lr": "0.00003",
"logging_steps": "-1",
"lora_alpha": "32",
"lora_dropout": "0.05",
"lora_r": "16",
"max_grad_norm": "1",
"model_max_length": "2048",
"save_total_limit": "1",
"seed": "42",
"warmup_ratio": "0.1",
"weight_decay": "0",
"target_modules": "all-linear"
}
auto_find_batch_size
This parameter determines whether the batch size is chosen automatically. If set to "true", training attempts to find the largest batch size that fits into GPU memory; if set to "false", training uses the specified fixed batch size.
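In the Transformers Trainer, the same behaviour is exposed through TrainingArguments; a minimal sketch (values are illustrative):
# Sketch: let the Trainer search for the largest batch size that fits in memory
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./output_model",
    auto_find_batch_size=True,  # requires the accelerate package
)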
chat_template
It determines how the model’s input-output format will be shaped according to a special ‘chat’ template during training.
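In Transformers, chat templates are applied through the tokenizer. Below is a minimal sketch; the model name is an assumed placeholder, and the model must ship with a chat template for this to work.
# Sketch: format a conversation with the tokenizer's chat template (model is a placeholder)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("CHAT_MODEL_NAME")
messages = [
    {"role": "user", "content": "What is LoRA?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)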
disable_gradient_checkpointing
Gradient checkpointing trades extra compute for lower memory use when training large models. Setting this flag to "true" turns the technique off, while "false" leaves it available to the trainer.
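With the Trainer itself, gradient checkpointing is toggled the other way around, via an enable flag; a minimal sketch:
# Sketch: enable gradient checkpointing to trade compute for memory
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./output_model",
    gradient_checkpointing=True,
)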
distributed_backend
If training is to be performed on multiple GPUs (and/or multiple nodes), it indicates that the distributed training method in use is DDP.
eval_strategy
It specifies when the evaluation will be performed. ‘epoch’ indicates that evaluation will be carried out at the end of each epoch.
merge_adapter
In adapter-based methods like LoRA, it indicates the option to merge the adapter weights into the main model after training.
mixed_precision
It indicates whether mixed precision is being used during training. ‘fp16’ signifies that the training is conducted at 16-bit precision, which generally provides speed and memory savings.
optimizer
It indicates which optimization algorithm is being used. adamw_torch refers to PyTorch’s built-in AdamW optimizer.
peft
PEFT is a strategy for efficiently fine-tuning large models by updating fewer parameters (e.g., LoRA, Prefix Tuning).
padding
Indicates where padding will be added when aligning token sequences. “right” padding means that empty space (pad tokens) will be added to the right side of the sequence.
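The padding side is controlled on the tokenizer. A minimal sketch, using a placeholder model name:
# Sketch: right-side padding ("MODEL_NAME" is a placeholder)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
tokenizer.padding_side = "right"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
batch = tokenizer(["short text", "a somewhat longer example"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # pad tokens sit on the right of the shorter sequence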
quantization
Refers to converting model weights into an integer format to reduce memory usage. “int4” indicates reducing the weights to a 4-bit integer format. This makes it possible to fit very large models into memory, though it partially reduces precision.
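Loading a model in 4-bit is done through BitsAndBytesConfig; a minimal sketch follows. The model name is a placeholder, and the NF4 settings shown are common but assumed choices.
# Sketch: 4-bit (NF4) quantized loading with bitsandbytes ("MODEL_NAME" is a placeholder)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained("MODEL_NAME", quantization_config=bnb_config, device_map="auto")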
scheduler
Specifies how the learning rate will change during training. “linear” indicates a scheduler that decreases linearly from the initial value down to 0.
batch_size
Denotes the number of examples used in each forward pass step.
block_size
Defines the maximum length at which input texts are truncated or split into segments.
epochs
Shows that training will make 3 full passes (epochs) over the dataset.
gradient_accumulation
Indicates that gradients will be accumulated over 4 steps before each optimizer update is performed. This is used to increase the effective batch size: with batch_size 2 and 4 accumulation steps, the effective batch size is 2 × 4 = 8.
lr
Value: “0.00003” (3e-5). This is the learning rate, which controls how quickly the model’s parameters are updated.
lora_alpha
Used to rescale the outputs of LoRA layers, thus maintaining a balance when updating low-rank matrices.
lora_dropout
The dropout rate used in LoRA layers. A low dropout value helps slightly regularize the model.
max_grad_norm
Limits the maximum norm of the gradients to 1. This is gradient clipping, which makes training more stable by restricting very large update steps.
model_max_length
Specifies the maximum number of tokens the model can process.
seed
Setting a fixed random seed for processes involving randomness ensures reproducible results. “42” is a popular example value.
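Many of the keys above map directly onto TrainingArguments when you drive the Trainer yourself. The sketch below shows one plausible translation of this configuration; the parameter names are real Transformers arguments, but treating them as one-to-one equivalents of the JSON keys is my assumption.
# Sketch: an approximate TrainingArguments equivalent of the configuration above
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./output_model",
    per_device_train_batch_size=2,  # batch_size
    gradient_accumulation_steps=4,  # gradient_accumulation
    num_train_epochs=3,             # epochs
    learning_rate=3e-5,             # lr
    lr_scheduler_type="linear",     # scheduler
    warmup_ratio=0.1,               # warmup_ratio
    weight_decay=0.0,               # weight_decay
    max_grad_norm=1.0,              # max_grad_norm
    optim="adamw_torch",            # optimizer
    fp16=True,                      # mixed_precision
    evaluation_strategy="epoch",    # eval_strategy
    save_total_limit=1,             # save_total_limit
    seed=42,                        # seed
)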
Now, let’s also see how to merge the LoRA weights with our model after obtaining the outputs and how to upload it to Hugging Face.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import HfApi, HfFolder, upload_folder
# Hugging Face repository details
repository_name = "REPO_NAME"
private = True # Set repository as private
# Paths to base and LoRA models
base_model_path = "BASE_MODEL_NAME"
lora_model_path = "SAVED_MODEL_NAME"
merged_model_path = "./merged_finetuned_model"
# Step 1: Load the base model
print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(base_model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
# Step 2: Load LoRA weights and merge
print("Loading LoRA weights and merging with base model...")
lora_model = PeftModel.from_pretrained(base_model, lora_model_path)
merged_model = lora_model.merge_and_unload() # Merge LoRA weights into the base model
# Step 3: Save the merged model locally
print(f"Saving merged model to {merged_model_path}...")
merged_model.save_pretrained(merged_model_path)
tokenizer.save_pretrained(merged_model_path)
# Step 4: Upload to Hugging Face Hub
print(f"Uploading merged model to Hugging Face Hub ({repository_name})...")
# Authenticate with Hugging Face
hf_token = HfFolder.get_token() # Ensure you are logged in using `huggingface-cli login`
if not hf_token:
    raise ValueError("Hugging Face token not found. Please log in using `huggingface-cli login`.")
# Initialize API and create repository
api = HfApi()
# Create the repository
api.create_repo(
    repo_id=repository_name,  # Use repo_id instead of name
    token=hf_token,
    private=private,
    exist_ok=True  # Avoid errors if the repo already exists
)
# Upload the folder to the repository
upload_folder(
    folder_path=merged_model_path,
    repo_id=repository_name,
    token=hf_token
)
print(f"Model uploaded successfully to {repository_name} as private.")
Using this code, we merge our training outputs with the base model and upload them to Hugging Face. Don’t forget to generate an access token on Hugging Face and grant repo access before uploading.
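If you prefer to authenticate from Python instead of the CLI, huggingface_hub also provides a login helper; a minimal sketch (the token string is a placeholder):
# Sketch: programmatic login as an alternative to `huggingface-cli login` (token is a placeholder)
from huggingface_hub import login
login(token="hf_YOUR_ACCESS_TOKEN")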
Now you can train and create your own LLM :)