
Fine-tuning LLMs with QLoRA on Custom Datasets


Large language models (LLMs) have transformed natural language processing with their impressive capabilities. Trained on vast text corpuses, these models excel in tasks like text generation, translation, summarization, and question-answering. However, their performance may not always meet the demands of specialized applications or domains.

In this guide, we will discuss the benefits of fine-tuning LLMs, which can lead to improved performance, reduced training costs, and more precise results tailored to specific contexts.

What is LLM Fine-tuning?

Fine-tuning a large language model refers to the process of further training an already established model that has learned from a broad dataset, using a smaller, domain-specific dataset. The term "LLM" typically refers to models such as OpenAI's GPT series. Training a model from scratch is computationally expensive and time-consuming; therefore, leveraging the knowledge from a pre-trained model allows for high task-specific performance with less data and fewer resources.

Key steps in LLM fine-tuning include:

  1. Choosing a Pre-trained Model: Start by selecting a base model that fits your needs. Pre-trained models are versatile and trained on extensive unlabeled datasets.
  2. Collecting a Relevant Dataset: Next, gather a dataset pertinent to your task. It should be appropriately labeled or structured for effective learning.
  3. Preparing the Dataset: Preprocessing involves cleaning the data and dividing it into training, validation, and test sets, ensuring compatibility with the model.
  4. Fine-tuning: Fine-tune the selected model using your domain-specific dataset, which may relate to a specific application or field, enabling the model to adapt to its context.
  5. Task-specific Adjustment: During this phase, the model's parameters are fine-tuned based on the new dataset, enhancing its ability to understand and generate task-relevant content. This process retains the general language knowledge acquired during initial training while honing in on the particularities of the target domain.

Fine-tuning LLMs is frequently utilized for various natural language processing tasks such as sentiment analysis, named entity recognition, summarization, translation, and more, where context comprehension and coherent language generation are essential.

Fine-tuning Techniques

Fine-tuning an LLM typically involves a supervised learning approach, using a dataset with labeled examples to modify the model’s weights for improved task performance. Below are some notable techniques used in this process:

  1. Full Fine Tuning (Instruction Fine-tuning): This method enhances the model’s performance across multiple tasks by training it on examples that guide responses to questions. The dataset selection is crucial, tailored to specific tasks like summarization or translation. This technique updates all model weights, resulting in a new version with enhanced capabilities, but it requires significant memory and computational resources akin to pre-training.
  2. Parameter Efficient Fine-Tuning (PEFT): PEFT is a more efficient alternative to full fine-tuning. Training a language model entirely can be resource-intensive. PEFT addresses this by updating only a subset of parameters while keeping the rest frozen, reducing the number of trainable parameters. This approach helps manage memory limitations and prevents catastrophic forgetting, allowing the original LLM weights to remain intact. Various methods exist for achieving PEFT, with Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) being the most recognized.

What is LoRA?

LoRA enhances the fine-tuning process by adjusting two smaller matrices that approximate the larger weight matrix of the pre-trained LLM, known as the LoRA adapter. After fine-tuning, the original LLM remains unchanged while a compact “LoRA adapter” is created, often only a fraction of the original model's size in MBs. During inference, the LoRA adapter must be combined with the original model, facilitating the reuse of the base model for various tasks and reducing overall memory use.
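
A rough illustration of why this saves memory (the layer size below is hypothetical, chosen only for the arithmetic): with rank r, LoRA trains two thin matrices B (d x r) and A (r x d) instead of the full d x d weight matrix.

    # Illustrative numbers only: parameter count of a full update vs. a rank-16 LoRA adapter
    d, r = 4096, 16
    full_params = d * d          # 16,777,216 weights if the whole matrix were fine-tuned
    lora_params = d * r + r * d  # 131,072 weights in the two low-rank factors (about 0.8%)
    print(full_params, lora_params)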

What is Quantized LoRA (QLoRA)?

QLoRA is an even more memory-efficient variant of LoRA. Rather than reducing the precision of the adapters, it quantizes the frozen pre-trained model itself to lower precision (typically 4-bit), further shrinking the memory footprint and storage requirements. In QLoRA, the pre-trained model is loaded into GPU memory with quantized weights and the LoRA adapters are trained on top of it, while performance remains comparable to LoRA.

In this tutorial, we'll implement Parameter-efficient fine-tuning with QLoRA.

Let’s delve into the steps for fine-tuning an LLM on a custom dataset using QLoRA on a single GPU:

  1. Setting up the Notebook
  2. Installing necessary libraries
  3. Loading the dataset
  4. Creating Bitsandbytes configuration
  5. Loading the Pre-trained model
  6. Tokenization
  7. Testing the Model with Zero Shot Inferencing
  8. Pre-processing the dataset
  9. Preparing the model for QLoRA
  10. Setting up PEFT for Fine-Tuning
  11. Training the PEFT Adapter
  12. Evaluating the Model Qualitatively (Human Evaluation)
  13. Evaluating the Model Quantitatively (with ROUGE Metric)

  1. Setting up the Notebook

    We will use a Kaggle notebook for this demonstration, but any Jupyter notebook environment will suffice. Kaggle provides 30 hours of free GPU usage weekly, which is ample for our needs. Open a new notebook, set up some headings, and connect to the runtime.

    Here, we will choose the GPU P100 as the ACCELERATOR. Feel free to explore other GPU options available in Kaggle or any other environment.

    For this tutorial, we will employ HuggingFace libraries to download and train the model. An Access Token is necessary for model downloads from HuggingFace. If you have an account, you can generate a new Access Token from the settings.

  2. Installing Required Libraries

    Let's install the required libraries for this experiment.

    !pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score

    Understanding the importance of some libraries:

    • bitsandbytes: A lightweight package wrapping custom CUDA functions; it provides the 4-bit quantization and 8-bit optimizers used for QLoRA.
    • transformers: A Hugging Face library offering pre-trained models and utilities for NLP tasks.
    • peft: A library enabling parameter-efficient fine-tuning.
    • accelerate: Simplifies multi-GPU/TPU operations without altering the rest of your code.
    • datasets: Offers easy access to various datasets.
    • einops: Simplifies tensor operations.

    Loading the necessary libraries:

    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        HfArgumentParser,
        TrainingArguments,
        Trainer,
        GenerationConfig,
    )
    from tqdm import tqdm
    from trl import SFTTrainer
    import torch
    import time
    import pandas as pd
    import numpy as np
    from huggingface_hub import interpreter_login

    interpreter_login()

    We will not track our training metrics in this tutorial, so let's disable Weights and Biases. To do this, set the following environment property:

    import os

    # Disable Weights and Biases logging
    os.environ['WANDB_DISABLED'] = "true"

    If you have an account with Weights and Biases, feel free to activate it for experimentation.

  3. Loading the Dataset

    Numerous datasets can be used for fine-tuning. In this case, we will use the DialogSum dataset from HuggingFace, which contains 13,460 dialogues with labeled summaries and topics.

    To load this dataset, execute the following code:

    huggingface_dataset_name = "neil-code/dialogsum-test"
    dataset = load_dataset(huggingface_dataset_name)

    After loading, let's examine the dataset:
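
    A quick way to do this is to print the DatasetDict and a single training record:

    # Inspect the available splits and one training example
    print(dataset)
    print(dataset['train'][0])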

    The dataset includes:

    • dialogue: The text of the dialogue.
    • summary: A human-written summary.
    • topic: A brief topic description.
    • id: A unique identifier for each example.

  4. Creating Bitsandbytes Configuration

    To load the model, we need a configuration that specifies the desired quantization method. We will use BitsAndBytesConfig to load our model in 4-bit format, significantly reducing memory usage at a slight cost to accuracy.

    compute_dtype = getattr(torch, "float16")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )

  5. Loading the Pre-trained Model

    We will utilize Microsoft’s open-sourced Phi-2, a Small Language Model (SLM) with 2.7 billion parameters. This model demonstrates excellent reasoning and language comprehension capabilities.

    Loading Phi-2 using 4-bit quantization from HuggingFace:

    model_name = 'microsoft/phi-2'
    device_map = {"": 0}

    original_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map=device_map,
        quantization_config=bnb_config,
        trust_remote_code=True,
        use_auth_token=True,
    )

    The model is now loaded in 4-bit format using the BitsAndBytesConfig from the bitsandbytes library, part of the QLoRA process.
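
    As an optional sanity check, you can print the model's approximate memory footprint; get_memory_footprint() is a standard Transformers method, and for the 4-bit model it should report far less than the full-precision size.

    # Optional: report the approximate memory footprint of the quantized model (in bytes)
    print(original_model.get_memory_footprint())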

  6. Tokenization

    Next, we will set up the tokenizer, ensuring left-padding to optimize memory during training.

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
        padding_side="left",
        add_eos_token=True,
        add_bos_token=True,
        use_fast=False,
    )
    tokenizer.pad_token = tokenizer.eos_token

  7. Testing the Model with Zero Shot Inferencing

    We will evaluate the base model using sample inputs.
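
    The gen() helper used in the snippets below is not defined anywhere in this excerpt. A minimal sketch of what it might look like, assuming it simply wraps model.generate() with the tokenizer defined earlier, is shown here; the original notebook may differ.

    # Hypothetical helper (not from the original notebook): generate and decode a completion
    def gen(model, prompt, max_new_tokens):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
        return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]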

    %%time
    from transformers import set_seed

    seed = 42
    set_seed(seed)

    index = 10
    prompt = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
    res = gen(original_model, formatted_prompt, 100)
    output = res[0].split('Output:\n')[1]

    dash_line = '-' * 100
    print(dash_line)
    print(f'INPUT PROMPT:\n{formatted_prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

    The model struggles to summarize the dialogue compared to the baseline summary, but it does extract key information, indicating potential for fine-tuning.

  8. Pre-processing the Dataset

    The dataset needs formatting to be suitable for fine-tuning. The prompt must follow the specified format based on the HuggingFace documentation.

    We will create helper functions to format our dataset appropriately for fine-tuning, converting dialogue-summary pairs into explicit instructions for the LLM.

    def create_prompt_formats(sample):
        """
        Format the fields of the sample ('dialogue', 'summary') into an instruction prompt,
        then concatenate the parts using two newline characters.
        :param sample: Sample dictionary
        """
        INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
        INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
        RESPONSE_KEY = "### Output:"
        END_KEY = "### End"

        blurb = f"\n{INTRO_BLURB}"
        instruction = f"{INSTRUCTION_KEY}"
        input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
        response = f"{RESPONSE_KEY}\n{sample['summary']}"
        end = f"{END_KEY}"

        parts = [part for part in [blurb, instruction, input_context, response, end] if part]
        formatted_prompt = "\n\n".join(parts)
        sample["text"] = formatted_prompt

        return sample

    This function can now convert our input into the required prompt format.
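
    As a quick sanity check, you can apply it to a single training example and print the resulting prompt text (the "text" field it adds):

    # Format one training record and inspect the generated prompt
    example = create_prompt_formats(dataset['train'][0])
    print(example["text"])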

    We will use our model tokenizer to process these prompts into tokenized formats, aiming for consistent sequence lengths to enhance fine-tuning efficiency while avoiding exceeding the model's maximum token limit.

    from functools import partial

    def get_max_length(model):
        conf = model.config
        max_length = None
        for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
            max_length = getattr(model.config, length_setting, None)
            if max_length:
                print(f"Found max length: {max_length}")
                break
        if not max_length:
            max_length = 1024
            print(f"Using default max length: {max_length}")
        return max_length

    def preprocess_batch(batch, tokenizer, max_length):
        """
        Tokenize a batch of formatted prompts.
        """
        return tokenizer(
            batch["text"],
            max_length=max_length,
            truncation=True,
        )

    def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
        """
        Format & tokenize the dataset so it is ready for training.
        :param tokenizer (AutoTokenizer): Model tokenizer
        :param max_length (int): Maximum number of tokens to emit from the tokenizer
        """
        # Add prompt to each sample
        print("Preprocessing dataset...")
        dataset = dataset.map(create_prompt_formats)

        # Apply preprocessing to each batch of the dataset & remove unnecessary fields
        _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
        dataset = dataset.map(
            _preprocessing_function,
            batched=True,
            remove_columns=['id', 'topic', 'dialogue', 'summary'],
        )

        # Filter out samples exceeding max_length
        dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

        # Shuffle dataset
        dataset = dataset.shuffle(seed=seed)

        return dataset

    Using these functions, our dataset will be prepared for the fine-tuning process!

    # Pre-process dataset
    max_length = get_max_length(original_model)
    print(max_length)

    train_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['train'])
    eval_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['validation'])

  9. Preparing the Model for QLoRA

    We will now prepare the model for QLoRA training with the prepare_model_for_kbit_training() method from PEFT.

    from peft import prepare_model_for_kbit_training

    original_model = prepare_model_for_kbit_training(original_model)

    This helper prepares the quantized model for training: it freezes the base weights, casts normalization layers to full precision for numerical stability, and enables gradient flow through the input embeddings so that gradient checkpointing can be used.

  10. Setting Up PEFT for Fine-Tuning

    Next, we will define the LoRA configuration for fine-tuning the base model.

    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    config = LoraConfig(
        r=32,  # Rank
        lora_alpha=32,
        target_modules=[
            'q_proj',
            'k_proj',
            'v_proj',
            'dense',
        ],
        bias="none",
        lora_dropout=0.05,  # Conventional
        task_type="CAUSAL_LM",
    )

    # Enable gradient checkpointing to reduce memory usage during fine-tuning
    original_model.gradient_checkpointing_enable()

    peft_model = get_peft_model(original_model, config)

    The rank (r) parameter determines the dimension of the adapter being trained, influencing the number of trainable parameters. Higher ranks increase expressivity but also computational demands.
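
    To see how few parameters are actually being trained, the PEFT wrapper exposes a helper that prints the trainable and total parameter counts:

    # Report trainable vs. total parameters for the LoRA-wrapped model
    peft_model.print_trainable_parameters()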

  11. Training the PEFT Adapter

    Define training arguments and create a Trainer instance.

    output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

    import transformers

    peft_training_args = TrainingArguments(
        output_dir=output_dir,
        warmup_steps=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2e-4,
        optim="paged_adamw_8bit",
        logging_steps=25,
        logging_dir="./logs",
        save_strategy="steps",
        save_steps=25,
        evaluation_strategy="steps",
        eval_steps=25,
        do_eval=True,
        gradient_checkpointing=True,
        report_to="none",
        overwrite_output_dir=True,
        group_by_length=True,
    )

    peft_model.config.use_cache = False

    peft_trainer = transformers.Trainer(
        model=peft_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        args=peft_training_args,
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    The model will be trained for 1000 steps, which is adequate for our custom dataset. With per_device_train_batch_size=1 and gradient_accumulation_steps=4, each optimizer step processes 4 examples, so 1000 steps cover roughly 4,000 training examples. Hyperparameters may vary based on the dataset and model.

    Begin training, which may take time depending on the set parameters.

    peft_trainer.train()

    After successful training, we can utilize the model for inference by adding an adapter to the original Phi-2 model. We set is_trainable=False to indicate that this model is solely for inference.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    base_model_id = "microsoft/phi-2"
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        device_map='auto',
        quantization_config=bnb_config,
        trust_remote_code=True,
        use_auth_token=True,
    )

    eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
    eval_tokenizer.pad_token = eval_tokenizer.eos_token

    from peft import PeftModel

    ft_model = PeftModel.from_pretrained(
        base_model,
        "/kaggle/working/peft-dialogue-summary-training-1705417060/checkpoint-1000",
        torch_dtype=torch.float16,
        is_trainable=False,
    )

    Fine-tuning is often iterative; based on validation and test results, adjustments may be necessary to improve model performance. Next, let's evaluate the fine-tuned model results.

  12. Evaluate the Model Qualitatively (Human Evaluation)

    We will perform inference using the PEFT model with the same input as before.

    %%time
    from transformers import set_seed

    set_seed(seed)

    index = 5
    dialogue = dataset['test'][index]['dialogue']
    summary = dataset['test'][index]['summary']

    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

    peft_model_res = gen(ft_model, prompt, 100)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    prefix, success, result = peft_model_output.partition('###')

    dash_line = '-' * 100
    print(dash_line)
    print(f'INPUT PROMPT:\n{prompt}')
    print(dash_line)
    print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
    print(dash_line)
    print(f'PEFT MODEL:\n{prefix}')

  13. Evaluate the Model Quantitatively (with ROUGE Metric)

    ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for assessing automatic summarization and machine translation systems by comparing generated summaries to reference summaries, typically human-produced.
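
    As a minimal illustration of how the metric is used (the strings below are made up and not from the dataset), the evaluate library returns a dictionary of ROUGE scores between 0 and 1:

    import evaluate

    # Toy example with hypothetical strings to show the output format
    rouge = evaluate.load('rouge')
    scores = rouge.compute(
        predictions=["the cat sat on the mat"],
        references=["a cat was sitting on the mat"],
    )
    print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}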

    We will now apply the ROUGE metric to quantify the quality of the summarizations produced by the models.

    original_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        device_map='auto',
        quantization_config=bnb_config,
        trust_remote_code=True,
        use_auth_token=True,
    )

    import pandas as pd

    dialogues = dataset['test'][0:10]['dialogue']
    human_baseline_summaries = dataset['test'][0:10]['summary']

    original_model_summaries = []
    peft_model_summaries = []

    for idx, dialogue in enumerate(dialogues):
        human_baseline_text_output = human_baseline_summaries[idx]
        prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

        original_model_res = gen(original_model, prompt, 100)
        original_model_text_output = original_model_res[0].split('Output:\n')[1]

        peft_model_res = gen(ft_model, prompt, 100)
        peft_model_output = peft_model_res[0].split('Output:\n')[1]
        peft_model_text_output, success, result = peft_model_output.partition('###')

        original_model_summaries.append(original_model_text_output)
        peft_model_summaries.append(peft_model_text_output)

    zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
    df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])

    import evaluate

    rouge = evaluate.load('rouge')

    original_model_results = rouge.compute(
        predictions=original_model_summaries,
        references=human_baseline_summaries[0:len(original_model_summaries)],
        use_aggregator=True,
        use_stemmer=True,
    )

    peft_model_results = rouge.compute(
        predictions=peft_model_summaries,
        references=human_baseline_summaries[0:len(peft_model_summaries)],
        use_aggregator=True,
        use_stemmer=True,
    )

    print('ORIGINAL MODEL:')
    print(original_model_results)
    print('PEFT MODEL:')
    print(peft_model_results)

    print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

    improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
    for key, value in zip(peft_model_results.keys(), improvement):
        print(f'{key}: {value*100:.2f}%')

The results indicate a significant improvement in the PEFT model compared to the original model, represented in percentage terms.

If you're interested in accessing the complete notebook, please refer to the following repository:

FineTune Phi-2 on Custom DataSet


Conclusion

Fine-tuning large language models has become crucial for organizations seeking to enhance their operations. While initial training provides a general understanding of language, fine-tuning refines models into specialized tools capable of addressing specific topics with greater accuracy. Customizing LLMs for particular tasks, industries, or datasets expands their capabilities, ensuring their relevance in a rapidly evolving digital world. Future advancements in LLMs, paired with improved fine-tuning techniques, are set to lead to smarter, more effective, and context-aware AI systems.
