
Saturday, July 13, 2024

Finetune Large Language Models with DoRA (Weight-Decomposed Low-Rank Adaptation)

 

Introduction.

The DoRA (Weight-Decomposed Low-Rank Adaptation) algorithm offers an advanced approach to fine-tuning Large Language Models (LLMs) by decomposing each pretrained weight matrix into magnitude and direction components. Traditional methods such as Low-Rank Adaptation (LoRA) improve parameter efficiency but often trade away some performance and stability. DoRA addresses this by factoring the weight matrix into a learnable magnitude vector (the column-wise norm of the weights) and a unit-norm direction that receives the low-rank update (a short sketch follows the key points below). This decomposition enables efficient learning while maintaining model expressiveness and stability. Key advantages of DoRA include enhanced parameter efficiency, improved generalization, faster adaptation to new tasks, and minimal inference overhead.

Key Points:

  • Decomposes weights into magnitude and direction.
  • Enhances parameter efficiency without compromising performance.
  • Improves training stability and generalization.
  • Facilitates faster adaptation to new tasks.
  • Maintains efficient inference with minimal overhead.
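To make the decomposition concrete, here is a minimal PyTorch sketch of the DoRA update. The variable names W0, lora_A, lora_B, and magnitude are illustrative, not the internals of the peft library: the pretrained weight plus the low-rank update gives the direction, which is normalized column-wise and rescaled by a learnable magnitude vector.

import torch

# Minimal DoRA sketch: W' = m * (W0 + B @ A) / ||W0 + B @ A||_c,
# where ||.||_c is the column-wise vector norm (illustrative names, not peft internals).
d_out, d_in, r = 64, 64, 8
W0 = torch.randn(d_out, d_in)                  # frozen pretrained weight
lora_A = torch.randn(r, d_in) * 0.01           # trainable low-rank factors
lora_B = torch.zeros(d_out, r)
magnitude = W0.norm(p=2, dim=0, keepdim=True)  # trainable magnitude, initialized from W0

def dora_weight(W0, lora_A, lora_B, magnitude):
    V = W0 + lora_B @ lora_A                   # direction before normalization
    V_norm = V.norm(p=2, dim=0, keepdim=True)  # column-wise norm
    return magnitude * (V / V_norm)            # rescale each column to the learned magnitude

W_adapted = dora_weight(W0, lora_A, lora_B, magnitude)
print(W_adapted.shape)  # torch.Size([64, 64])

Only magnitude, lora_A, and lora_B would be trained; at inference the adapted weight can be merged back into a single matrix, which is why the inference overhead stays minimal.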

Video Tutorial.

Code: Finetune Large Language Models with DoRA (Train).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
from datasets import Dataset
import transformers
# pip install peft

# Sample QA Data
data = {
    'question': [
        "What is the capital of France?",
        "Who painted the Mona Lisa?",
        "What is the tallest mountain in the world?",
        "When did World War II end?",
        "Who wrote the play 'Romeo and Juliet'?",
        "What is the chemical symbol for gold?"
    ],
    'context': [
        "Paris is the capital and most populous city of France.",
        "The Mona Lisa is a half-length portrait painting by Italian Renaissance artist Leonardo da Vinci.",
        "Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas.",
        "World War II (WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945.",
        "Romeo and Juliet is a tragedy written by William Shakespeare early in his career about two young star-crossed lovers whose deaths ultimately reconcile their feuding families.",
        "Gold is a chemical element with the symbol Au and atomic number 79. In its purest form, it is a bright, slightly reddish yellow, dense, soft, malleable, and ductile metal."
    ],
    'answer': [
        "Paris",
        "Leonardo da Vinci",
        "Mount Everest",
        "1945",
        "William Shakespeare",
        "Au"
    ]
}
dataset = Dataset.from_dict(data)

# Load Llama Model and Tokenizer
tokenizer = AutoTokenizer.from_pretrained("D:\\OLLAMA_MODELS\\meta-llama\\Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("D:\\OLLAMA_MODELS\\meta-llama\\Meta-Llama-3-8B-Instruct")

# Ensure padding token is set
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))  # resize embeddings to account for the new [PAD] token

# Configure LoRA (use_dora=True applies the DoRA weight decomposition on top of LoRA)
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=True
)

# Create PEFT Model
model = get_peft_model(model, peft_config)

# Preprocess Data
def generate_prompt(data_point):
    return f"""[INST] {data_point["question"]} [/INST] {data_point["context"]} {data_point["answer"]} [/INST]"""

dataset = dataset.map(lambda data_point: {"text": generate_prompt(data_point)})

# Tokenize Data
def tokenize(prompt):
    result = tokenizer(prompt["text"])
    return {
        "input_ids": result["input_ids"],
        "attention_mask": result["attention_mask"],
    }

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Training Arguments (Optimized for CPU)
training_args = TrainingArguments(
    per_device_train_batch_size=1,  # very small batch size for CPU
    gradient_accumulation_steps=8,  # accumulate gradients over multiple steps
    num_train_epochs=3,
    learning_rate=1e-4,             # smaller learning rate for CPU
    logging_steps=10,
    output_dir="./llama-3-finetuned-qa-cpu",
)

# Create Trainer
trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Fine-tune!
model.config.use_cache = False
trainer.train()

# Save the Fine-tuned Model
model.save_pretrained("./llama-3-finetuned-qa-cpu")

Code: Finetune Large Language Models with DoRA (Test).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Load Fine-Tuned Model and Tokenizer
model_path = "E:\\Niraj_Work\\DL_Projects\\llm_projects\\llm_advance_1\\llama-3-finetuned-qa-cpu"  # Path to your saved model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Ensure Model is on CPU
device = torch.device("cpu")
model.to(device)

if tokenizer.pad_token is None:
    # tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    tokenizer.pad_token = tokenizer.eos_token

# Load Your Question-Answering Dataset (Replace with your dataset)
# Assuming you have a list of dictionaries, each with 'question', 'context', and 'answer' keys
eval_data = [
    {"question": "What is the capital of France?", "context": "Paris is the capital and most populous city of France.", "answer": "Paris"},
    {"question": "Who painted the Mona Lisa?", "context": "The Mona Lisa is a half-length portrait painting by Italian Renaissance artist Leonardo da Vinci.", "answer": "Leonardo da Vinci"},
]

# Function to generate the prompt
def generate_prompt(data_point):
    return f"""[INST] {data_point["question"]} [/INST] {data_point["context"]} {data_point["answer"]} [/INST]"""

# Test the Model
for data_point in eval_data:
    input_text = generate_prompt(data_point)
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)  # Move input to CPU

    # Generate Answer
    generation_output = model.generate(
        input_ids=input_ids,
        max_new_tokens=50,  # Adjust as needed
        num_beams=1,        # You can try increasing num_beams if you have enough memory
        early_stopping=True,
    )

    # Extract and Print Answer
    generated_answer = tokenizer.decode(generation_output[0])
    print(f"Question: {data_point['question']}")
    print(f"Generated Answer: {generated_answer.split('[/INST]')[-2].strip()}")
    print(f"Actual Answer: {data_point['answer']}")

Reference.

  1. Liu, Shih-Yang, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. "DoRA: Weight-Decomposed Low-Rank Adaptation." arXiv preprint arXiv:2402.09353 (2024).
  2. https://huggingface.co/papers/2402.09353
  3. https://www.nirajai.com/home/llm

Saturday, June 1, 2024

Mastering Retrieval-Augmented Generation (RAG) with LLMs: A Comprehensive Guide

 RAG Background.


Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating retrieval mechanisms with generative capabilities. It improves response accuracy by accessing external databases for relevant information, overcoming LLM limitations in knowledge cut-off and hallucinations. RAG combines the strengths of retrieval (precise, up-to-date data) and generation (contextual, fluent language), making it essential for complex queries, factual correctness, and dynamic knowledge. It optimizes performance, especially in specialized or rapidly evolving fields, ensuring comprehensive, accurate, and contextually relevant outputs, thus significantly enhancing the utility and reliability of LLMs in practical applications.
     In this tutorial, I delve into the key topics and advancements related to Retrieval-Augmented Generation (RAG) through executable code. Each topic is explained in a video tutorial, offering a detailed yet accessible discussion and a working demonstration, accompanied by the corresponding code. Some code segments are sourced from the relevant libraries for demonstration purposes. This comprehensive tutorial covers the following topics, providing an in-depth understanding of RAG and its practical applications.
  1. Basics of RAG (Retrieval Augmented Generation) with LLM.
  2. How to Use LLM + RAG to Construct a Knowledge Graph.
  3. How to Construct a Flow Diagram by Using LLM + RAG.
  4. Graph-Based RAG (Retrieval Augmented Generation) Techniques.
Note: This guide also links to a related tutorial on a pressing topic, "Use of Long Text Sequences with LLMs Trained on Shorter Text Sequences." In future posts, I will introduce new research advancements in the field of Large Language Models.

1. Basics of RAG (Retrieval Augmented Generation) with LLM.

Video Tutorial.


Basic RAG Code.

import ollama
import chromadb

documents = [
    "Quantum mechanics is a fundamental theory in physics that describes the behavior of nature at and below the scale of atoms.",
    "It is the foundation of all quantum physics, which includes quantum chemistry, quantum field theory, quantum technology, and quantum information science.",
    "Quantum mechanics can describe many systems that classical physics cannot.",
    "Classical physics can describe many aspects of nature at an ordinary (macroscopic and (optical) microscopic) scale, but is not sufficient for describing them at very small submicroscopic (atomic and subatomic) scales.",
    "Most theories in classical physics can be derived from quantum mechanics as an approximation valid at large (macroscopic/microscopic) scale.",
    "Quantum systems have bound states that are quantized to discrete values of energy, momentum, angular momentum, and other quantities, in contrast to classical systems where these quantities can be measured continuously.",
    "Measurements of quantum systems show characteristics of both particles and waves (wave–particle duality), and there are limits to how accurately the value of a physical quantity can be predicted prior to its measurement, given a complete set of initial conditions (the uncertainty principle)."
]

# Create database
client = chromadb.Client()
collection = client.create_collection(name="docs")

# store each document in a vector embedding database
for i, d in enumerate(documents):
    response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
    embedding = response["embedding"]
    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[d]
    )

# an example prompt
prompt = "What are the key benefits of using quantum mechanics over classical physics?"

# generate an embedding for the prompt and retrieve the most relevant doc
response = ollama.embeddings(
    prompt=prompt,
    model="mxbai-embed-large"
)
results = collection.query(
    query_embeddings=[response["embedding"]],
    n_results=1
)
data = results['documents'][0][0]

# generate a response combining the prompt and data we retrieved in step 2
output = ollama.generate(
    model="llama3",
    prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
)

print(output['response'])

2. How to Use LLM + RAG to Construct a Knowledge Graph.

Video Tutorial.


Code to Generate Knowledge Graph Triplets.

import ollama
import chromadb

documents = [
    "Quantum mechanics is a fundamental theory in physics that describes the behavior of nature at and below the scale of atoms.",
    "It is the foundation of all quantum physics, which includes quantum chemistry, quantum field theory, quantum technology, and quantum information science.",
    "Quantum mechanics can describe many systems that classical physics cannot.",
    "Classical physics can describe many aspects of nature at an ordinary (macroscopic and (optical) microscopic) scale, but is not sufficient for describing them at very small submicroscopic (atomic and subatomic) scales.",
    "Most theories in classical physics can be derived from quantum mechanics as an approximation valid at large (macroscopic/microscopic) scale.",
    "Quantum systems have bound states that are quantized to discrete values of energy, momentum, angular momentum, and other quantities, in contrast to classical systems where these quantities can be measured continuously.",
    "Measurements of quantum systems show characteristics of both particles and waves (wave–particle duality), and there are limits to how accurately the value of a physical quantity can be predicted prior to its measurement, given a complete set of initial conditions (the uncertainty principle)."
]

# Create database
client = chromadb.Client()
collection = client.create_collection(name="docs")

# store each document in a vector embedding database
for i, d in enumerate(documents):
    response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
    embedding = response["embedding"]
    collection.add(ids=[str(i)], embeddings=[embedding], documents=[d])

# example prompts
prompt1 = "What are the key benefits of using quantum mechanics over classical physics?"
prompt2 = "List all entities, and generate the knowledge graph triplets by using all entities."

# Generate Answers - for Prompt-1:
# generate an embedding for the prompt and retrieve the most relevant doc
response1 = ollama.embeddings(
    prompt=prompt1,
    model="mxbai-embed-large"
)
results1 = collection.query(
    query_embeddings=[response1["embedding"]],
    n_results=1
)
data1 = results1['documents'][0][0]

# generate a response combining the prompt and data we retrieved in step 2
output1 = ollama.generate(
    model="llama3",
    prompt=f"Using this data: {data1}. Respond to this prompt: {prompt1}"
)
print("Response for the question -1", output1['response'])

# Generate Answers - for Prompt-2:
# generate an embedding for the prompt and retrieve the most relevant doc
response2 = ollama.embeddings(
    prompt=prompt2,
    model="mxbai-embed-large"
)
results2 = collection.query(
    query_embeddings=[response2["embedding"]],
    n_results=1
)
data2 = results2['documents'][0][0]

# generate a response combining the prompt and data we retrieved in step 2
output2 = ollama.generate(
    model="llama3",
    prompt=f"Using this data: {data2}. Respond to this prompt: {prompt2}"
)
print("Response for the question -2", output2['response'])

Code to Visualize the Knowledge Graph (by using above triplets).

import networkx as nx
import matplotlib.pyplot as plt

# Step 1: Define your triplets
triplets = [
    ("Quantum mechanics", "a fundamental theory", "Theory"),
    ("Theory", "in physics", "Physics"),
    ("Physics", "describes the behavior of", "Nature"),
    ("Nature", "at and below the scale of", "Atoms"),
    ("Atoms", "is related to the scale of", "Scale"),
]

# Step 2: Create a directed graph
G = nx.DiGraph()

# Step 3: Add edges from triplets
for subject, predicate, obj in triplets:
    G.add_edge(subject, obj, label=predicate)

# Step 4: Draw the graph
pos = nx.spring_layout(G, seed=42)  # Position nodes using Fruchterman-Reingold force-directed algorithm

# Draw nodes and edges
nx.draw(G, pos, with_labels=True, node_size=3000, node_color="lightblue", font_size=10, font_weight="bold", arrowsize=20)

# Draw edge labels
edge_labels = nx.get_edge_attributes(G, 'label')
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_color='red')

# Display the graph
plt.title("Knowledge Graph Visualization")
plt.show()

3. How to Construct a Flow Diagram by Using LLM + RAG.

Video Tutorial.


Code.

import ollama
import chromadb

documents = [
    "Quantum mechanics is a fundamental theory in physics that describes the behavior of nature at and below the scale of atoms.",
    "It is the foundation of all quantum physics, which includes quantum chemistry, quantum field theory, quantum technology, and quantum information science.",
    "Quantum mechanics can describe many systems that classical physics cannot.",
    "Classical physics can describe many aspects of nature at an ordinary (macroscopic and (optical) microscopic) scale, but is not sufficient for describing them at very small submicroscopic (atomic and subatomic) scales.",
    "Most theories in classical physics can be derived from quantum mechanics as an approximation valid at large (macroscopic/microscopic) scale.",
    "Quantum systems have bound states that are quantized to discrete values of energy, momentum, angular momentum, and other quantities, in contrast to classical systems where these quantities can be measured continuously.",
    "Measurements of quantum systems show characteristics of both particles and waves (wave–particle duality), and there are limits to how accurately the value of a physical quantity can be predicted prior to its measurement, given a complete set of initial conditions (the uncertainty principle)."
]

# Create database
client = chromadb.Client()
collection = client.create_collection(name="docs")

# store each document in a vector embedding database
for i, d in enumerate(documents):
    response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
    embedding = response["embedding"]
    collection.add(ids=[str(i)], embeddings=[embedding], documents=[d])

# an example prompt
prompt1 = "Generate a Mermaid diagram."

# Generate Answers - for Prompt-1:
# generate an embedding for the prompt and retrieve the most relevant doc
response1 = ollama.embeddings(
    prompt=prompt1,
    model="mxbai-embed-large"
)
results1 = collection.query(
    query_embeddings=[response1["embedding"]],
    n_results=1
)
data1 = results1['documents'][0][0]

# generate a response combining the prompt and data we retrieved in step 2
output1 = ollama.generate(
    model="llama3",
    prompt=f"Using this data: {data1}. Respond to this prompt: {prompt1}"
)
print("Response for the question -1", output1['response'])

4. Graph-Based RAG (Retrieval Augmented Generation) Techniques.

Video Tutorial.



Code.

import ollama
import chromadb

documents = [
    "The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables large language models (LLMs) to answer questions over private and/or previously unseen document collections.",
    "However, RAG fails on global questions directed at an entire text corpus, such as “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task.",
    "Prior QFS methods, meanwhile, fail to scale to the quantities of text indexed by typical RAG systems.",
    "To combine the strengths of these contrasting methods, we propose a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed.",
    "Our approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pregenerate community summaries for all groups of closely-related entities.",
    "Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user.",
    "For a class of global sensemaking questions over datasets in the 1 million token range, we show that Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers."
]

# Create database
client = chromadb.Client()
collection = client.create_collection(name="docs")

# store each document in a vector embedding database
for i, d in enumerate(documents):
    response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
    embedding = response["embedding"]
    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[d]
    )

# an example prompt
prompt = "How Graph RAG (Retrieval Augmented Generation, used with Large Language Model) generates a Global Summarization for the given context?"
Entity_1 = "Graph RAG"
Entity_2 = "Global Summarization"
Community_Triplets = [
    ("Graph RAG", "Uses", "Leiden Community Detection Algorithm"),
    ("LLM", "Extracts", "Entity Knowledge Graph"),
    ("Graph Index", "Partitioned By", "Community Detection Algorithms"),
    ("Community Summaries", "Used For", "Global Summarization"),
]
summary_text = "Graph RAG - Uses - Leiden Community Detection Algorithm; LLM - Extracts - Entity Knowledge Graph; Graph Index - Partitioned By - Community Detection Algorithms; Community Summaries - Used For - Global Summarization"

# generate an embedding for the prompt and retrieve the most relevant doc
response = ollama.embeddings(
    prompt=prompt,
    model="mxbai-embed-large"
)
# embed the community-summary text and use it (rather than the raw prompt) to retrieve the most relevant doc
summary_text_embedding = ollama.embeddings(
    prompt=summary_text,
    model="mxbai-embed-large"
)
results = collection.query(
    query_embeddings=[summary_text_embedding["embedding"]],
    n_results=1
)
data = results['documents'][0][0]

# generate a response combining the prompt and data we retrieved in step 2
output = ollama.generate(
    model="llama3",
    prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
)

print(output['response'])

Additional Reference for the topic. 

".Use of Long Text Sequences with LLMs Trained on Shorter Text Sequences." (Or use the direct link: https://www.nirajai.com/home/llm )

Reference.

  1. Ding, Yujuan, Wenqi Fan, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. "A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models." arXiv preprint arXiv:2405.06211 (2024).
  2. Wu, Kevin, Eric Wu, and James Zou. "How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior." arXiv preprint arXiv:2404.10198 (2024).
  3. Li, Jiarui, Ye Yuan, and Zehua Zhang. "Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases." arXiv preprint arXiv:2403.10446 (2024).
  4. Edge, Darren, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. "From Local to Global: A Graph RAG Approach to Query-Focused Summarization." arXiv preprint arXiv:2404.16130 (2024).

Sunday, March 17, 2024

Use of Long Text Sequences with LLMs Trained on Shorter Text Sequences - ALiBi & RoFormer

 

Introduction.

Training large language models (LLMs) on longer sequences poses challenges in computational resources, model complexity, gradient propagation, and overfitting. These include increased memory requirements due to self-attention, longer training times, difficulty in scaling Transformers to very long sequences, challenges in capturing long-term dependencies, the risk of vanishing or exploding gradients, and potential overfitting to training data. Approaches such as Attention with Linear Biases (ALiBi) and RoFormer's Rotary Position Embedding (RoPE) improve the handling of long-range dependencies, enhance model generalization, and incorporate positional information in a way that carries over to longer inputs. For example:

Attention with Linear Biases (ALiBi)

Improved Handling of Long-Range Dependencies: Standard positional encodings struggle when a model must handle sequences longer than those it was trained on. ALiBi addresses this by adding a simple linear, distance-proportional bias to the attention scores, which incorporates positional information without extra parameters and enhances the model's ability to maintain context over long distances within the text.
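As a minimal illustration (my own sketch, not the authors' reference code), the function below builds the per-head ALiBi bias that is added to the attention scores before the softmax: each head gets a fixed geometric slope, and the penalty grows linearly with the distance between the query and key positions.

import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Per-head slopes: 1/2, 1/4, ... (geometric sequence, as used for power-of-two head counts).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    # distance[i, j] = |i - j|; in a causal model, future positions are masked separately.
    distance = (positions[None, :] - positions[:, None]).abs().float()
    # Shape (num_heads, seq_len, seq_len); added to q @ k^T / sqrt(d) before the softmax.
    return -slopes[:, None, None] * distance

bias = alibi_bias(seq_len=6, num_heads=4)
print(bias.shape)  # torch.Size([4, 6, 6])

Because the bias depends only on relative distance, the same recipe applies unchanged to sequences longer than those seen during training, which is what enables ALiBi's length extrapolation.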

RoFormer

Improved Model Generalization: By more effectively encoding positional information, RoFormer helps LLMs to generalize better across different tasks and datasets. This results in enhanced performance on a wide range of NLP tasks, including text classification, machine translation, and semantic analysis. 
Enhanced Positional Encoding: RoPE uniquely integrates positional information with the token embeddings, preserving the relative distances between tokens. This method enables the model to better understand and utilize the order of words or tokens, which is crucial for many language understanding and generation tasks.
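The following sketch (illustrative only; it uses the half-split pairing convention rather than the interleaved pairing in the RoFormer paper) shows the core RoPE operation: each pair of dimensions in a query or key vector is rotated by an angle proportional to the token position, so the dot product between two rotated vectors depends only on their relative offset.

import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (seq_len, dim) with an even dim; returns the rotated embeddings.
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per dimension pair, as in RoFormer: theta_i = base^(-2i/dim).
    freqs = base ** (-2.0 * torch.arange(half, dtype=torch.float32) / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1_i, x2_i) pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 16)      # 8 tokens, 16-dimensional queries
print(apply_rope(q).shape)  # torch.Size([8, 16])

Applying the same rotation to queries and keys before the attention dot product injects relative position directly into the attention scores.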

Video Tutorial -1

Video Tutorial -2

Video Tutorial -3



References.
  1. Su, Jianlin, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
  2. Press, Ofir, Noah A. Smith, and Mike Lewis. "Train short, test long: Attention with linear biases enables input length extrapolation." arXiv preprint arXiv:2108.12409 (2021).
  3. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).