Fine-tuning LLMs with QLoRA on Custom Datasets
Large language models (LLMs) have transformed natural language processing with their impressive capabilities. Trained on vast text corpora, these models excel at tasks like text generation, translation, summarization, and question answering. However, their out-of-the-box performance may not always meet the demands of specialized applications or domains.
In this guide, we will discuss the benefits of fine-tuning LLMs, which can lead to improved performance, reduced training costs, and more precise results tailored to specific contexts.
What is LLM Fine-tuning?
Fine-tuning a large language model refers to the process of further training an already established model that has learned from a broad dataset, using a smaller, domain-specific dataset. The term "LLM" typically refers to models such as OpenAI's GPT series. Training a model from scratch is computationally expensive and time-consuming; therefore, leveraging the knowledge from a pre-trained model allows for high task-specific performance with less data and fewer resources.
Key steps in LLM fine-tuning include:
- Choosing a Pre-trained Model: Start by selecting a base model that fits your needs. Pre-trained models are versatile and trained on extensive unlabeled datasets.
- Collecting a Relevant Dataset: Next, gather a dataset pertinent to your task. It should be appropriately labeled or structured for effective learning.
- Preparing the Dataset: Preprocessing involves cleaning the data and dividing it into training, validation, and test sets, ensuring compatibility with the model.
- Fine-tuning: Fine-tune the selected model using your domain-specific dataset, which may relate to a specific application or field, enabling the model to adapt to its context.
- Task-specific Adjustment: During this phase, the model's parameters are fine-tuned based on the new dataset, enhancing its ability to understand and generate task-relevant content. This process retains the general language knowledge acquired during initial training while honing in on the particularities of the target domain.
Fine-tuning LLMs is frequently utilized for various natural language processing tasks such as sentiment analysis, named entity recognition, summarization, translation, and more, where context comprehension and coherent language generation are essential.
Fine-tuning Techniques
Fine-tuning an LLM typically involves a supervised learning approach, using a dataset with labeled examples to modify the model’s weights for improved task performance. Below are some notable techniques used in this process:
- Full Fine Tuning (Instruction Fine-tuning): This method enhances the model’s performance across multiple tasks by training it on examples that guide responses to questions. The dataset selection is crucial, tailored to specific tasks like summarization or translation. This technique updates all model weights, resulting in a new version with enhanced capabilities, but it requires significant memory and computational resources akin to pre-training.
- Parameter Efficient Fine-Tuning (PEFT): PEFT is a more efficient alternative to full fine-tuning. Training a language model entirely can be resource-intensive. PEFT addresses this by updating only a subset of parameters while keeping the rest frozen, reducing the number of trainable parameters. This approach helps manage memory limitations and prevents catastrophic forgetting, allowing the original LLM weights to remain intact. Various methods exist for achieving PEFT, with Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) being the most recognized.
What is LoRA?
LoRA makes fine-tuning efficient by training two small low-rank matrices whose product approximates the update to a larger weight matrix of the pre-trained LLM; together these matrices form the LoRA adapter. After fine-tuning, the original LLM remains unchanged, and only the compact LoRA adapter is produced, often just a fraction of the original model's size (megabytes rather than gigabytes). During inference, the adapter is loaded alongside the original model, so a single base model can be reused for many tasks while keeping overall memory use low.
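To make the savings concrete, here is a back-of-the-envelope calculation; the matrix size and rank below are illustrative rather than taken from any particular model:

# Illustrative only: one 4096 x 4096 attention weight matrix
d = 4096
full_params = d * d                 # ~16.8M parameters updated by full fine-tuning
r = 32                              # LoRA rank
lora_params = (d * r) + (r * d)     # the two low-rank matrices A and B
print(full_params, lora_params, f"{lora_params / full_params:.2%}")  # ~1.56% of the original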
What is Quantized LoRA (QLoRA)?
QLoRA is an even more memory-efficient variant of LoRA: the pre-trained model is loaded into GPU memory with its weights quantized to lower precision (e.g., 4-bit instead of the 8-bit or 16-bit typically used with LoRA), while the LoRA adapters are still trained on top in higher precision. This further decreases the memory footprint and storage needs while performing comparably to LoRA.
In this tutorial, we'll implement Parameter-efficient fine-tuning with QLoRA.
Let’s delve into the steps for fine-tuning an LLM on a custom dataset using QLoRA on a single GPU:
- Setting up the Notebook
- Installing necessary libraries
- Loading the dataset
- Creating Bitsandbytes configuration
- Loading the Pre-trained model
- Tokenization
- Testing the Model with Zero Shot Inferencing
- Pre-processing the dataset
- Preparing the model for QLoRA
- Setting up PEFT for Fine-Tuning
- Training the PEFT Adapter
- Evaluating the Model Qualitatively (Human Evaluation)
- Evaluating the Model Quantitatively (with ROUGE Metric)
Setting up the Notebook
We will use a Kaggle notebook for this demonstration, but any Jupyter notebook environment will suffice. Kaggle provides 30 hours of free GPU usage weekly, which is ample for our needs. Open a new notebook, set up some headings, and connect to the runtime.
Here, we will choose the GPU P100 as the ACCELERATOR. Feel free to explore other GPU options available in Kaggle or any other environment.
For this tutorial, we will employ HuggingFace libraries to download and train the model. An Access Token is necessary for model downloads from HuggingFace. If you have an account, you can generate a new Access Token from the settings.
Installing Required Libraries
Let's install the required libraries for this experiment.
!pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score
Here is what the key libraries provide:
- bitsandbytes: A lightweight wrapper around custom CUDA functions for 8-bit optimizers and k-bit quantization, which is what lets us load the model in 4-bit precision.
- transformers: A Hugging Face library offering pre-trained models and utilities for NLP tasks.
- peft: A library enabling parameter-efficient fine-tuning.
- accelerate: Simplifies multi-GPU/TPU operations without altering the rest of your code.
- datasets: Offers easy access to various datasets.
- einops: Simplifies tensor operations.
Loading the necessary libraries:
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig,
)
from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login
interpreter_login()
We will not track our training metrics in this tutorial, so let's disable Weights and Biases. To do this, set the following environment property:
import os
# Disable Weights and Biases
os.environ['WANDB_DISABLED'] = "true"
If you have an account with Weights and Biases, feel free to activate it for experimentation.
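If you do want experiment tracking, a minimal sketch (assuming the wandb package is available; the project name is just an example) would be:

import wandb

wandb.login()  # prompts for your Weights & Biases API key
os.environ['WANDB_PROJECT'] = "phi2-dialogsum-qlora"  # example project name
# Later, pass report_to="wandb" in TrainingArguments instead of "none"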
Loading the Dataset
Numerous datasets can be used for fine-tuning. In this case, we will use the DialogSum dataset from HuggingFace, which contains 13,460 dialogues with labeled summaries and topics.
To load this dataset, execute the following code:
huggingface_dataset_name = "neil-code/dialogsum-test"
dataset = load_dataset(huggingface_dataset_name)
After loading, let's examine the dataset:
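A quick way to inspect the splits and a sample record:

print(dataset)
print(dataset['test'][0])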
The dataset includes:
- dialogue: The text of the dialogue.
- summary: A human-written summary.
- topic: A brief topic description.
- id: A unique identifier for each example.
Creating Bitsandbytes Configuration
To load the model, we need a configuration that specifies the desired quantization method. We will use BitsAndBytesConfig to load our model in 4-bit format, significantly reducing memory usage at a slight cost to accuracy.
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
Loading the Pre-trained Model
We will use Microsoft's Phi-2, an open-source Small Language Model (SLM) with 2.7 billion parameters that demonstrates excellent reasoning and language comprehension capabilities.
Loading Phi-2 using 4-bit quantization from HuggingFace:
model_name = 'microsoft/phi-2'
device_map = {"" : 0}
original_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)
The model is now loaded in 4-bit format using the BitsAndBytesConfig from the bitsandbytes library, part of the QLoRA process.
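As a quick sanity check, you can print the model's memory footprint (the exact number will vary with your environment):

print(f"Memory footprint: {original_model.get_memory_footprint() / 1e9:.2f} GB")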
Tokenization
Next, we will set up the tokenizer, using left padding (commonly recommended for decoder-only models so that padding tokens do not interfere with generation) and reusing the EOS token as the padding token.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left", add_eos_token=True, add_bos_token=True, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
Testing the Model with Zero Shot Inferencing
We will evaluate the base model using sample inputs.
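The cells below call a small gen() helper that is not defined in the text. A minimal sketch of such a helper, assuming the tokenizer defined above and illustrative sampling settings, could look like this:

def gen(model, prompt, max_new_tokens=100):
    # Tokenize the prompt and move it to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generate a continuation; the sampling settings here are illustrative
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.1,
        top_p=0.95,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Return the decoded text (prompt + completion), matching how gen() is used below
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)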
%%time
from transformers import set_seed
seed = 42
set_seed(seed)
index = 10
prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']
formatted_prompt = f"Instruct: Summarize the following conversation.n{prompt}nOutput:n"
res = gen(original_model, formatted_prompt, 100)
output = res[0].split('Output:n')[1]
dash_line = '-' * 100
print(dash_line)
print(f'INPUT PROMPT:n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:n{summary}n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:n{output}')
The model struggles to summarize the dialogue compared to the baseline summary, but it does extract key information, indicating potential for fine-tuning.
Pre-processing the Dataset
The dataset needs to be converted into instruction-style prompts before fine-tuning, following the prompt-formatting guidance in the HuggingFace documentation.
We will create helper functions that turn each dialogue-summary pair into an explicit instruction for the LLM.
def create_prompt_formats(sample):
    """
    Combine the sample's dialogue and summary into a single prompt,
    joining the parts with two newline characters.
    :param sample: a dictionary with 'dialogue' and 'summary' fields
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"

    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, instruction, input_context, response, end] if part]
    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample
This function can now convert our input into the required prompt format.
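For example, applying it to a single training record and printing the result shows the assembled prompt:

sample = create_prompt_formats(dataset['train'][0])
print(sample['text'])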
We will use our model tokenizer to process these prompts into tokenized formats, aiming for consistent sequence lengths to enhance fine-tuning efficiency while avoiding exceeding the model's maximum token limit.
from functools import partial
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length
def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenize a batch of formatted prompts.
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """
    Format and tokenize the dataset so it is ready for training.
    :param tokenizer (AutoTokenizer): model tokenizer
    :param max_length (int): maximum number of tokens to emit from the tokenizer
    """
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)

    # Apply preprocessing to each batch of the dataset & remove unnecessary fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Filter out samples whose tokenized length exceeds max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset
Using these functions, our dataset will be prepared for the fine-tuning process!
# Pre-process dataset
max_length = get_max_length(original_model)
print(max_length)
train_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['train'])
eval_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['validation'])
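It is worth quickly confirming that the processed splits now contain the tokenized fields (exact row counts depend on the filtering step above):

print(train_dataset)   # expected columns: 'text', 'input_ids', 'attention_mask'
print(eval_dataset)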
Preparing the Model for QLoRA
We will now prepare the model for QLoRA training with the prepare_model_for_kbit_training() method from PEFT.
from peft import prepare_model_for_kbit_training

original_model = prepare_model_for_kbit_training(original_model)
This function prepares the quantized model for training: it freezes the base weights, upcasts a few layers (such as layer norms) to full precision for numerical stability, and enables gradient checkpointing and input gradients.
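A simplified sketch of what this roughly amounts to is shown below; it is illustrative only and not a substitute for the real PEFT call above:

# Rough illustration of prepare_model_for_kbit_training (simplified);
# the real PEFT implementation also handles the float32 upcasting of norm layers.
for param in original_model.parameters():
    param.requires_grad = False                   # freeze the quantized base weights
original_model.enable_input_require_grads()       # allow gradients to flow to the adapters
original_model.gradient_checkpointing_enable()    # trade extra compute for lower memory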
Setting Up PEFT for Fine-Tuning
Next, we will define the LoRA configuration for fine-tuning the base model.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)
# Enable gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()
peft_model = get_peft_model(original_model, config)
The rank (r) parameter determines the dimension of the adapter being trained, influencing the number of trainable parameters. Higher ranks increase expressivity but also computational demands.
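A quick way to confirm how small the adapter is relative to the base model is PEFT's built-in helper:

peft_model.print_trainable_parameters()
# Prints the number of trainable (adapter) parameters versus total parameters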
Training the PEFT Adapter
Define training arguments and create a Trainer instance.
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'
import transformers
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
    evaluation_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,
    group_by_length=True,
)

peft_model.config.use_cache = False

peft_trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
The model will be trained for 1000 steps, which is adequate for our custom dataset. Hyperparameters may vary based on the dataset and model.
Begin training, which may take time depending on the set parameters.
peft_trainer.train()
After successful training, we can utilize the model for inference by adding an adapter to the original Phi-2 model. We set is_trainable=False to indicate that this model is solely for inference.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
base_model_id = "microsoft/phi-2"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)
eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token
from peft import PeftModel
ft_model = PeftModel.from_pretrained(base_model, "/kaggle/working/peft-dialogue-summary-training-1705417060/checkpoint-1000", torch_dtype=torch.float16, is_trainable=False)
Fine-tuning is often iterative; based on validation and test results, adjustments may be necessary to improve model performance. Next, let's evaluate the fine-tuned model results.
Evaluate the Model Qualitatively (Human Evaluation)
We will perform inference using the PEFT model with the same input as before.
%%time
from transformers import set_seed
set_seed(seed)
index = 5
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']
prompt = f"Instruct: Summarize the following conversation.n{dialogue}nOutput:n"
peft_model_res = gen(ft_model, prompt, 100)
peft_model_output = peft_model_res[0].split('Output:n')[1]
prefix, success, result = peft_model_output.partition('###')
dash_line = '-' * 100
print(dash_line)
print(f'INPUT PROMPT:n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:n{summary}n')
print(dash_line)
print(f'PEFT MODEL:n{prefix}')
Evaluate the Model Quantitatively (with ROUGE Metric)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for assessing automatic summarization and machine translation systems by comparing generated summaries to reference summaries, typically human-produced.
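As a quick illustration of what the metric returns, you can run it on a toy pair of strings (the library reports scores between 0 and 1):

import evaluate

rouge = evaluate.load('rouge')
toy_scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
)
print(toy_scores)  # dictionary with rouge1, rouge2, rougeL and rougeLsum scores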
We will now apply the ROUGE metric to quantify the quality of the summarizations produced by the models.
original_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)
import pandas as pd
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']
original_model_summaries = []
peft_model_summaries = []
for idx, dialogue in enumerate(dialogues):
    human_baseline_text_output = human_baseline_summaries[idx]
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

    original_model_res = gen(original_model, prompt, 100)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]

    peft_model_res = gen(ft_model, prompt, 100)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    peft_model_text_output, success, result = peft_model_output.partition('###')

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
import evaluate
rouge = evaluate.load('rouge')
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')
The results indicate a significant improvement in the PEFT model compared to the original model, represented in percentage terms.
If you're interested in accessing the complete notebook, please refer to the following repository:
FineTune Phi-2 on Custom DataSet
Conclusion
Fine-tuning large language models has become crucial for organizations seeking to enhance their operations. While initial training provides a general understanding of language, fine-tuning refines models into specialized tools capable of addressing specific topics with greater accuracy. Customizing LLMs for particular tasks, industries, or datasets expands their capabilities, ensuring their relevance in a rapidly evolving digital world. Future advancements in LLMs, paired with improved fine-tuning techniques, are set to lead to smarter, more effective, and context-aware AI systems.
References
- microsoft/phi-2 · Hugging Face
- Fine-tuning large language models (LLMs) in 2024 | SuperAnnotate
- microsoft/phi-2 · How to fine-tune this? + Training code
- Phi-2: The surprising power of small language models
- While fine-tuning a decoder only LLM like LLaMA on chat dataset, what kind of padding should one use?
- LoRA
- ROUGE - a Hugging Face Space by evaluate-metric
- GitHub - TimDettmers/bitsandbytes: Accessible large language models via k-bit quantization for PyTorch