Exploring AI-ML-NLP: LoRA, QLoRA and Fine-tuning large language models (LLMs)

Introduction.

Fine-tuning large language models (LLMs) is a common practice to adapt them for specific tasks, but it can be computationally expensive.

LoRA (Low-Rank Adaptation) makes this process more efficient by freezing the pretrained weights and adding small trainable low-rank matrices (adapters) to selected layers of the model. These adapters capture task-specific knowledge without modifying the original model's parameters, so only a small fraction of the parameters needs to be trained.
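
To make the idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer, following the formulation in Hu et al. (2021): the frozen weight W is left untouched and a scaled low-rank product B·A is added on top. The helper names (matvec, lora_forward, lora_param_count) and the toy dimensions are assumptions made for illustration, not part of any library.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) without ever modifying W."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # project down to rank r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_out, d_in, r):
    """Trainable values in the adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

random.seed(0)
d_out, d_in, r, alpha = 6, 6, 2, 4  # toy sizes, chosen only for the demo

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, starts at zero

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
base_output = matvec(W, x)
adapted_output = lora_forward(W, A, B, x, alpha, r)

# Because B starts at zero, the adapter is initially a no-op.
print(adapted_output == base_output)

# For one 4096 x 4096 projection with rank 8:
print(lora_param_count(4096, 4096, 8))  # 65536 trainable values
print(4096 * 4096)                      # 16777216 values in the full matrix
```

In this example the adapter trains roughly 0.4% as many values as the full matrix would, which is where the savings come from.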

QLoRA (Quantized LoRA) takes this further by combining LoRA with quantization, a process that reduces the precision of the model's weights. The frozen base model is stored in 4-bit precision while the LoRA adapters are trained on top of it. This decreases the model's memory footprint, making it possible to fine-tune LLMs on consumer-grade hardware. Both LoRA and QLoRA offer a practical way to customize large language models for specific use cases while keeping compute and memory requirements low.
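
The effect of quantization can be illustrated with a small pure-Python sketch. Note that the QLoRA paper uses a 4-bit NormalFloat (NF4) data type; the code below swaps in plain absmax rounding to signed 4-bit integers as a simpler stand-in, and the function names are hypothetical. The memory estimate counts weights only and ignores quantization constants, activations, adapter weights, and optimizer state.

```python
def quantize_absmax_int4(weights):
    """Map floats to integer codes in [-7, 7] using one shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

def weight_memory_gb(n_params, bits_per_param):
    """Rough storage needed for the weights alone, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.64]
codes, scale = quantize_absmax_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - v) for w, v in zip(weights, restored))

print(codes)      # small integers that fit in 4 bits
print(max_error)  # rounding error, at most half a quantization step

# Weight storage for a model with 8 billion parameters:
print(weight_memory_gb(8_000_000_000, 16))  # 16.0 (16-bit weights)
print(weight_memory_gb(8_000_000_000, 4))   # 4.0  (4-bit weights)
```

The precision lost in the frozen weights is the price paid for the smaller footprint; in QLoRA the adapters themselves are kept in higher precision.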

Video Tutorial on LoRA.

Video Tutorial on QLoRA.

Video Tutorial on Fine-tuning Large Language Models (LLMs).

Fine-tuning Large Language Models (LLMs) - Without LoRA/QLoRA

Webpage Link: https://www.quantacosmos.com/2024/06/fine-tune-pretrained-large-language.html

Code: Fine-tuning Large Language Models (LLMs) - with LoRA (Training).

# Install the dependencies first: pip install torch transformers peft datasets
import torch
import transformers
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

# Sample QA Data
data = {
    'question': [
        "What is the capital of France?",
        "Who painted the Mona Lisa?",
        "What is the tallest mountain in the world?",
        "When did World War II end?",
        "Who wrote the play 'Romeo and Juliet'?",
        "What is the chemical symbol for gold?"
    ],
    'context': [
        "Paris is the capital and most populous city of France.",
        "The Mona Lisa is a half-length portrait painting by Italian Renaissance artist Leonardo da Vinci.",
        "Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas.",
        "World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945.",
        "Romeo and Juliet is a tragedy written by William Shakespeare early in his career about two young star-crossed lovers whose deaths ultimately reconcile their feuding families.",
        "Gold is a chemical element with the symbol Au and atomic number 79. In its purest form, it is a bright, slightly reddish yellow, dense, soft, malleable, and ductile metal."
    ],
    'answer': [
        "Paris",
        "Leonardo da Vinci",
        "Mount Everest",
        "1945",
        "William Shakespeare",
        "Au"
    ]
}
dataset = Dataset.from_dict(data)

# Load Llama Model and Tokenizer
tokenizer = AutoTokenizer.from_pretrained("D:\\OLLAMA_MODELS\\meta-llama\\Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("D:\\OLLAMA_MODELS\\meta-llama\\Meta-Llama-3-8B-Instruct")
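# Ensure padding token is set.
# Reuse the end-of-sequence token instead of adding a new '[PAD]' token:
# a new token would require resizing the model's embedding matrix.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
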
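# Configure LoRA
peft_config = LoraConfig(
    r=8,  # Rank of the low-rank update matrices
    lora_alpha=16,  # Scaling factor; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # Attention query and value projections
    lora_dropout=0.05,  # Dropout applied to the adapter input during training
    bias="none",  # Leave bias terms frozen
    task_type="CAUSAL_LM"
)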

# Create PEFT Model
model = get_peft_model(model, peft_config)

# Preprocess Data
def generate_prompt(data_point):
    return f"""[INST] {data_point["question"]} [/INST] {data_point["context"]} {data_point["answer"]} [/INST]"""

dataset = dataset.map(lambda data_point: {"text": generate_prompt(data_point)})

# Tokenize Data
def tokenize(prompt):
    result = tokenizer(prompt["text"])
    return {
        "input_ids": result["input_ids"],
        "attention_mask": result["attention_mask"],
    }
tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Training Arguments (small settings so the run fits on a CPU)
training_args = TrainingArguments(
    per_device_train_batch_size=1,  # One example per step keeps memory use low
    gradient_accumulation_steps=8,  # Accumulate gradients to emulate a larger batch
    num_train_epochs=3,
    learning_rate=1e-4,  # A common starting point for LoRA; tune for your data
    logging_steps=1,  # The dataset is tiny, so log every optimizer step
    output_dir="./llama-3-finetuned-qa-cpu",
)

# Create Trainer
trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

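# Fine-tune!
model.config.use_cache = False  # The key/value cache is only useful for generation
trainer.train()

# Save the fine-tuned LoRA adapter and the tokenizer
# (the test script loads both from this folder)
model.save_pretrained("./llama-3-finetuned-qa-cpu")
tokenizer.save_pretrained("./llama-3-finetuned-qa-cpu")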
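
As a sanity check on the LoRA configuration used in the training script (r=8 on q_proj and v_proj), the number of trainable parameters can be estimated by hand. The sketch below assumes the publicly documented Meta-Llama-3-8B layout: hidden size 4096, 32 decoder layers, and 8 key/value heads of dimension 128, so v_proj maps 4096 to 1024. These figures are not stated in this post and should be verified against the model's config.json.

```python
def adapter_params(d_in, d_out, r):
    """Values in one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Assumed architecture values for Meta-Llama-3-8B (check config.json)
hidden_size = 4096
num_layers = 32
kv_dim = 8 * 128  # num_key_value_heads * head_dim = 1024

r = 8  # matches LoraConfig(r=8) in the training script

q_proj_params = adapter_params(hidden_size, hidden_size, r)  # 65536
v_proj_params = adapter_params(hidden_size, kv_dim, r)       # 40960
total_trainable = (q_proj_params + v_proj_params) * num_layers

print(total_trainable)  # 3407872
```

On the real model, calling model.print_trainable_parameters() on the PEFT model right after get_peft_model reports the actual count, which is a quick way to confirm that the adapters were attached where expected.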

Code: Fine-tuning Large Language Models (LLMs) - with LoRA (Test)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Fine-Tuned Model and Tokenizer
# Note: the folder holds a LoRA adapter. Recent transformers versions can load it
# directly when peft is installed, pulling in the base model named in adapter_config.json.
model_path = "E:\\Niraj_Work\\DL_Projects\\llm_projects\\llm_advance_1\\llama-3-finetuned-qa-cpu"  # Path to your saved model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Ensure Model is on CPU and in inference mode
device = torch.device("cpu")
model.to(device)
model.eval()

# Reuse the end-of-sequence token for padding if none is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Evaluation data: a list of dictionaries with 'question', 'context', and 'answer' keys
# (replace with your own dataset)
eval_data = [
    {"question": "What is the capital of France?", "context": "Paris is the capital and most populous city of France.", "answer": "Paris"},
    {"question": "Who painted the Mona Lisa?", "context": "The Mona Lisa is a half-length portrait painting by Italian Renaissance artist Leonardo da Vinci.", "answer": "Leonardo da Vinci"},
]

# Function to generate the prompt.
# The reference answer is left out so the model has to generate it;
# the layout otherwise matches the prompt used for training.
def generate_prompt(data_point):
    return f"""[INST] {data_point["question"]} [/INST] {data_point["context"]}"""


# Test the Model
for data_point in eval_data:
    input_text = generate_prompt(data_point)
    inputs = tokenizer(input_text, return_tensors="pt").to(device)  # Move inputs to CPU

    # Generate Answer (greedy decoding)
    with torch.no_grad():
        generation_output = model.generate(
            **inputs,  # Passes input_ids and attention_mask
            max_new_tokens=50,  # Adjust as needed
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode only the newly generated tokens, then cut at the closing [/INST] marker
    new_tokens = generation_output[0][inputs["input_ids"].shape[1]:]
    generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    generated_answer = generated_text.split("[/INST]")[0].strip()

    # Print Answer
    print(f"Question: {data_point['question']}")
    print(f"Generated Answer: {generated_answer}")
    print(f"Actual Answer: {data_point['answer']}")
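
Printing the generated and actual answers side by side works for a handful of examples. For a larger evaluation set, a simple automatic score is handy. The following is a minimal sketch of two common checks, exact match and answer containment, after light normalization; the helper names are hypothetical and are not used in the scripts above.

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """True if both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)

def contains_answer(prediction, reference):
    """True if the normalized reference appears inside the prediction."""
    return normalize(reference) in normalize(prediction)

print(exact_match("Paris.", "paris"))                      # True
print(contains_answer("The capital is Paris.", "Paris"))   # True
print(exact_match("Leonardo", "Leonardo da Vinci"))        # False
```

Containment is the more forgiving of the two and suits models that answer in full sentences, at the cost of accepting some answers that merely mention the reference string.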

Reference.

Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. "QLoRA: Efficient finetuning of quantized LLMs." Advances in Neural Information Processing Systems 36 (2024).

Saturday, June 29, 2024