Fine-tuning LLMs with QLoRA on Custom Datasets
Large language models (LLMs) have transformed natural language processing with their impressive capabilities. Trained on vast text corpora, these models excel at tasks like text generation, translation, summarization, and question answering. However, their out-of-the-box performance may not always meet the demands of specialized applications or domains.
In this guide, we will discuss the benefits of fine-tuning LLMs, which can lead to improved performance, reduced training costs, and more precise results tailored to specific contexts.
What is LLM Fine-tuning?
Fine-tuning a large language model refers to the process of further training an already established model that has learned from a broad dataset, using a smaller, domain-specific dataset. The term "LLM" typically refers to models such as OpenAI's GPT series. Training a model from scratch is computationally expensive and time-consuming; therefore, leveraging the knowledge from a pre-trained model allows for high task-specific performance with less data and fewer resources.
Key steps in LLM fine-tuning include:
- Choosing a Pre-trained Model: Start by selecting a base model that fits your needs. Pre-trained models are versatile and trained on extensive unlabeled datasets.
- Collecting a Relevant Dataset: Next, gather a dataset pertinent to your task. It should be appropriately labeled or structured for effective learning.
- Preparing the Dataset: Preprocessing involves cleaning the data and dividing it into training, validation, and test sets, ensuring compatibility with the model.
- Fine-tuning: Fine-tune the selected model using your domain-specific dataset, which may relate to a specific application or field, enabling the model to adapt to its context.
- Task-specific Adjustment: During this phase, the model's parameters are fine-tuned based on the new dataset, enhancing its ability to understand and generate task-relevant content. This process retains the general language knowledge acquired during initial training while honing in on the particularities of the target domain.
Fine-tuning LLMs is frequently utilized for various natural language processing tasks such as sentiment analysis, named entity recognition, summarization, translation, and more, where context comprehension and coherent language generation are essential.
Fine-tuning Techniques
Fine-tuning an LLM typically involves a supervised learning approach, using a dataset with labeled examples to modify the model’s weights for improved task performance. Below are some notable techniques used in this process:
- Full Fine Tuning (Instruction Fine-tuning): This method enhances the model’s performance across multiple tasks by training it on examples that guide responses to questions. The dataset selection is crucial, tailored to specific tasks like summarization or translation. This technique updates all model weights, resulting in a new version with enhanced capabilities, but it requires significant memory and computational resources akin to pre-training.
- Parameter Efficient Fine-Tuning (PEFT): PEFT is a more efficient alternative to full fine-tuning. Training a language model entirely can be resource-intensive. PEFT addresses this by updating only a subset of parameters while keeping the rest frozen, reducing the number of trainable parameters. This approach helps manage memory limitations and prevents catastrophic forgetting, allowing the original LLM weights to remain intact. Various methods exist for achieving PEFT, with Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) being the most recognized.
What is LoRA?
LoRA makes fine-tuning efficient by training two small low-rank matrices whose product approximates the update to a larger weight matrix of the pre-trained LLM; together these matrices form the LoRA adapter. After fine-tuning, the original LLM remains unchanged, and only the compact LoRA adapter is produced, often just a fraction of the original model's size (megabytes rather than gigabytes). During inference, the adapter is loaded alongside the original model, so a single base model can be reused for many tasks while keeping overall memory use low.
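To make the savings concrete, here is a back-of-the-envelope calculation; the matrix size and rank below are illustrative rather than taken from any particular model:

# Illustrative only: one 4096 x 4096 attention weight matrix
d = 4096
full_params = d * d                 # ~16.8M parameters updated by full fine-tuning
r = 32                              # LoRA rank
lora_params = (d * r) + (r * d)     # the two low-rank matrices A and B
print(full_params, lora_params, f"{lora_params / full_params:.2%}")  # ~1.56% of the original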
What is Quantized LoRA (QLoRA)?
QLoRA is an even more memory-efficient variant of LoRA: the pre-trained model is loaded into GPU memory with its weights quantized to lower precision (e.g., 4-bit instead of the 8-bit or 16-bit typically used with LoRA), while the LoRA adapters are still trained on top in higher precision. This further decreases the memory footprint and storage needs while performing comparably to LoRA.
In this tutorial, we'll implement Parameter-efficient fine-tuning with QLoRA.
Let’s delve into the steps for fine-tuning an LLM on a custom dataset using QLoRA on a single GPU:
- Setting up the Notebook
- Installing necessary libraries
- Loading the dataset
- Creating Bitsandbytes configuration
- Loading the Pre-trained model
- Tokenization
- Testing the Model with Zero Shot Inferencing
- Pre-processing the dataset
- Preparing the model for QLoRA
- Setting up PEFT for Fine-Tuning
- Training the PEFT Adapter
- Evaluating the Model Qualitatively (Human Evaluation)
- Evaluating the Model Quantitatively (with ROUGE Metric)
Setting up the Notebook
We will use a Kaggle notebook for this demonstration, but any Jupyter notebook environment will suffice. Kaggle provides 30 hours of free GPU usage weekly, which is ample for our needs. Open a new notebook, set up some headings, and connect to the runtime.
Here, we will choose the GPU P100 as the ACCELERATOR. Feel free to explore other GPU options available in Kaggle or any other environment.
For this tutorial, we will employ HuggingFace libraries to download and train the model. An Access Token is necessary for model downloads from HuggingFace. If you have an account, you can generate a new Access Token from the settings.
Installing Required Libraries
Let's install the required libraries for this experiment.
!pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score
Here is what the key libraries provide:
- bitsandbytes: A lightweight wrapper around custom CUDA functions for 8-bit optimizers and k-bit quantization, which is what lets us load the model in 4-bit precision.
- transformers: A Hugging Face library offering pre-trained models and utilities for NLP tasks.
- peft: A library enabling parameter-efficient fine-tuning.
- accelerate: Simplifies multi-GPU/TPU operations without altering the rest of your code.
- datasets: Offers easy access to various datasets.
- einops: Simplifies tensor operations.
Loading the necessary libraries:
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig,
)
from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login
interpreter_login()
We will not track our training metrics in this tutorial, so let's disable Weights and Biases. To do this, set the following environment property:
import os
# Disable Weights and Biases
os.environ['WANDB_DISABLED'] = "true"
If you have an account with Weights and Biases, feel free to activate it for experimentation.
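If you do want experiment tracking, a minimal sketch (assuming the wandb package is available; the project name is just an example) would be:

import wandb

wandb.login()  # prompts for your Weights & Biases API key
os.environ['WANDB_PROJECT'] = "phi2-dialogsum-qlora"  # example project name
# Later, pass report_to="wandb" in TrainingArguments instead of "none"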
Loading the Dataset
Numerous datasets can be used for fine-tuning. In this case, we will use the DialogSum dataset from HuggingFace, which contains 13,460 dialogues with labeled summaries and topics.
To load this dataset, execute the following code:
huggingface_dataset_name = "neil-code/dialogsum-test"
dataset = load_dataset(huggingface_dataset_name)
After loading, let's examine the dataset:
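A quick way to inspect the splits and a sample record:

print(dataset)
print(dataset['test'][0])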
The dataset includes:
- dialogue: The text of the dialogue.
- summary: A human-written summary.
- topic: A brief topic description.
- id: A unique identifier for each example.
Creating Bitsandbytes Configuration
To load the model, we need a configuration that specifies the desired quantization method. We will use BitsAndBytesConfig to load our model in 4-bit format, significantly reducing memory usage at a slight cost to accuracy.
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
Loading the Pre-trained Model
We will use Microsoft's Phi-2, an open-source Small Language Model (SLM) with 2.7 billion parameters that demonstrates excellent reasoning and language comprehension capabilities.
Loading Phi-2 using 4-bit quantization from HuggingFace:
model_name = 'microsoft/phi-2'
device_map = {"" : 0}
original_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)
The model is now loaded in 4-bit format using the BitsAndBytesConfig from the bitsandbytes library, part of the QLoRA process.
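As a quick sanity check, you can print the model's memory footprint (the exact number will vary with your environment):

print(f"Memory footprint: {original_model.get_memory_footprint() / 1e9:.2f} GB")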
Tokenization
Next, we will set up the tokenizer, using left padding (commonly recommended for decoder-only models so that padding tokens do not interfere with generation) and reusing the EOS token as the padding token.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left", add_eos_token=True, add_bos_token=True, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
Testing the Model with Zero Shot Inferencing
We will evaluate the base model using sample inputs.
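The cells below call a small gen() helper that is not defined in the text. A minimal sketch of such a helper, assuming the tokenizer defined above and illustrative sampling settings, could look like this:

def gen(model, prompt, max_new_tokens=100):
    # Tokenize the prompt and move it to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generate a continuation; the sampling settings here are illustrative
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.1,
        top_p=0.95,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Return the decoded text (prompt + completion), matching how gen() is used below
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)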
%%time
from transformers import set_seed
seed = 42
set_seed(seed)
index = 10
prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']
formatted_prompt = f"Instruct: Summarize the following conversation.n{prompt}nOutput:n"
res = gen(original_model, formatted_prompt, 100)
output = res[0].split('Output:n')[1]
dash_line = '-' * 100
print(dash_line)
print(f'INPUT PROMPT:n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:n{summary}n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:n{output}')
The model struggles to summarize the dialogue compared to the baseline summary, but it does extract key information, indicating potential for fine-tuning.
Pre-processing the Dataset
The dataset needs to be converted into instruction-style prompts before fine-tuning, following the prompt-formatting guidance in the HuggingFace documentation.
We will create helper functions that turn each dialogue-summary pair into an explicit instruction for the LLM.
def create_prompt_formats(sample):
    """
    Combine the sample's dialogue and summary into a single prompt,
    joining the parts with two newline characters.
    :param sample: a dictionary with 'dialogue' and 'summary' fields
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"

    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, instruction, input_context, response, end] if part]
    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample
This function can now convert our input into the required prompt format.
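For example, applying it to a single training record and printing the result shows the assembled prompt:

sample = create_prompt_formats(dataset['train'][0])
print(sample['text'])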
We will use our model tokenizer to process these prompts into tokenized formats, aiming for consistent sequence lengths to enhance fine-tuning efficiency while avoiding exceeding the model's maximum token limit.
from functools import partial
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length
def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenize a batch of formatted prompts.
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """
    Format and tokenize the dataset so it is ready for training.
    :param tokenizer (AutoTokenizer): model tokenizer
    :param max_length (int): maximum number of tokens to emit from the tokenizer
    """
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)

    # Apply preprocessing to each batch of the dataset & remove unnecessary fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Filter out samples whose tokenized length exceeds max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset
Using these functions, our dataset will be prepared for the fine-tuning process!
# Pre-process dataset
max_length = get_max_length(original_model)
print(max_length)
train_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['train'])
eval_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['validation'])
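It is worth quickly confirming that the processed splits now contain the tokenized fields (exact row counts depend on the filtering step above):

print(train_dataset)   # expected columns: 'text', 'input_ids', 'attention_mask'
print(eval_dataset)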
Preparing the Model for QLoRA
We will now prepare the model for QLoRA training with the prepare_model_for_kbit_training() method from PEFT.
from peft import prepare_model_for_kbit_training

original_model = prepare_model_for_kbit_training(original_model)
This function prepares the quantized model for training: it freezes the base weights, upcasts a few layers (such as layer norms) to full precision for numerical stability, and enables gradient checkpointing and input gradients.
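A simplified sketch of what this roughly amounts to is shown below; it is illustrative only and not a substitute for the real PEFT call above:

# Rough illustration of prepare_model_for_kbit_training (simplified);
# the real PEFT implementation also handles the float32 upcasting of norm layers.
for param in original_model.parameters():
    param.requires_grad = False                   # freeze the quantized base weights
original_model.enable_input_require_grads()       # allow gradients to flow to the adapters
original_model.gradient_checkpointing_enable()    # trade extra compute for lower memory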
Setting Up PEFT for Fine-Tuning
Next, we will define the LoRA configuration for fine-tuning the base model.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)
# Enable gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()
peft_model = get_peft_model(original_model, config)
The rank (r) parameter determines the dimension of the adapter being trained, influencing the number of trainable parameters. Higher ranks increase expressivity but also computational demands.
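A quick way to confirm how small the adapter is relative to the base model is PEFT's built-in helper:

peft_model.print_trainable_parameters()
# Prints the number of trainable (adapter) parameters versus total parameters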
Training the PEFT Adapter
Define training arguments and create a Trainer instance.
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'
import transformers
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
    evaluation_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,
    group_by_length=True,
)

peft_model.config.use_cache = False

peft_trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
The model will be trained for 1000 steps, which is adequate for our custom dataset. Hyperparameters may vary based on the dataset and model.
Begin training, which may take time depending on the set parameters.
peft_trainer.train()
After successful training, we can utilize the model for inference by adding an adapter to the original Phi-2 model. We set is_trainable=False to indicate that this model is solely for inference.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
base_model_id = "microsoft/phi-2"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)
eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token
from peft import PeftModel
ft_model = PeftModel.from_pretrained(base_model, "/kaggle/working/peft-dialogue-summary-training-1705417060/checkpoint-1000", torch_dtype=torch.float16, is_trainable=False)
Fine-tuning is often iterative; based on validation and test results, adjustments may be necessary to improve model performance. Next, let's evaluate the fine-tuned model results.
Evaluate the Model Qualitatively (Human Evaluation)
We will perform inference using the PEFT model with the same input as before.
%%time
from transformers import set_seed
set_seed(seed)
index = 5
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']
prompt = f"Instruct: Summarize the following conversation.n{dialogue}nOutput:n"
peft_model_res = gen(ft_model, prompt, 100)
peft_model_output = peft_model_res[0].split('Output:n')[1]
prefix, success, result = peft_model_output.partition('###')
dash_line = '-' * 100
print(dash_line)
print(f'INPUT PROMPT:n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:n{summary}n')
print(dash_line)
print(f'PEFT MODEL:n{prefix}')
Evaluate the Model Quantitatively (with ROUGE Metric)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for assessing automatic summarization and machine translation systems by comparing generated summaries to reference summaries, typically human-produced.
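As a quick illustration of what the metric returns, you can run it on a toy pair of strings (the library reports scores between 0 and 1):

import evaluate

rouge = evaluate.load('rouge')
toy_scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
)
print(toy_scores)  # dictionary with rouge1, rouge2, rougeL and rougeLsum scores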
We will now apply the ROUGE metric to quantify the quality of the summarizations produced by the models.
original_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_auth_token=True,
)
import pandas as pd
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']
original_model_summaries = []
peft_model_summaries = []
for idx, dialogue in enumerate(dialogues):
    human_baseline_text_output = human_baseline_summaries[idx]
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

    original_model_res = gen(original_model, prompt, 100)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]

    peft_model_res = gen(ft_model, prompt, 100)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    peft_model_text_output, success, result = peft_model_output.partition('###')

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
import evaluate
rouge = evaluate.load('rouge')
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')
The results indicate a significant improvement in the PEFT model compared to the original model, represented in percentage terms.
If you're interested in accessing the complete notebook, please refer to the following repository:
FineTune Phi-2 on Custom DataSet
Conclusion
Fine-tuning large language models has become crucial for organizations seeking to enhance their operations. While initial training provides a general understanding of language, fine-tuning refines models into specialized tools capable of addressing specific topics with greater accuracy. Customizing LLMs for particular tasks, industries, or datasets expands their capabilities, ensuring their relevance in a rapidly evolving digital world. Future advancements in LLMs, paired with improved fine-tuning techniques, are set to lead to smarter, more effective, and context-aware AI systems.
References
- microsoft/phi-2 · Hugging Face
- Fine-tuning large language models (LLMs) in 2024 | SuperAnnotate
- microsoft/phi-2 · How to fine-tune this? + Training code
- Phi-2: The surprising power of small language models
- While fine-tuning a decoder only LLM like LLaMA on chat dataset, what kind of padding should one use?
- LoRA
- ROUGE - a Hugging Face Space by evaluate-metric
- GitHub - TimDettmers/bitsandbytes: Accessible large language models via k-bit quantization for PyTorch