LoRA: Low-Rank Adaptation of Large Language Models

25-01-2025fine-tuning · low-rank · transfer-learning

LoRA (Low-Rank Adaptation) enables parameter-efficient fine-tuning of large language models by decomposing weight updates into low-rank matrices, dramatically reducing trainable parameters while preserving pre-trained knowledge and allowing zero-inference-latency deployment through weight merging.

Paper Link

Key Definitions

TermDefinition
Inference Latency- Refers to the amount of time it takes for a model to generate predictions after receiving input
- The document discusses inference latency in relation to adapter layers, which are used in neural network architectures to adapt pre-trained models for new tasks with minimal re-training of the model
- It mentions that these layers, while beneficial for parameter efficiency, can introduce significant latency during the inference phase, especially in online scenarios where the batch size is small
Intrinsic Dimension- Refers to the underlying complexity or the minimum number of parameters required to capture the significant variance or behavior of a model
- In the context of neural networks, especially those pre-trained on a vast amount of data like GPT-3, the intrinsic dimension indicates how the effective information or the actual degrees of freedom within the model can be represented with fewer parameters than those used in the full model
- Essentially, it's about understanding the minimal complexity necessary to achieve comparable performance
Adapter Layers- In the context of neural network architectures, particularly those involving transfer learning, an adapter layer is a small module inserted between the pre-existing layers of a pre-trained model to fine-tune the model for specific tasks without altering the original parameters significantly
- These adapter layers allow for task-specific training with minimal impact on the overall parameter count of the model, providing an efficient way to customize large models for new tasks
Prefix-Tuning- In prefix-tuning, a set of trainable vectors, known as a "prefix", is prepended to the sequence of embeddings at the input of each Transformer layer
- These prefixes serve as additional, task-specific parameters that influence the behavior of the model during the forward pass
- Unlike traditional fine-tuning, which updates all model weights, prefix-tuning adapts the model by only learning the prefixes
- This allows the base model to retain its general capabilities while gaining the ability to perform well on specific tasks
Low-Rank Property- In the context of neural networks, the low-rank property refers to the ability of the network's weights or transformations to be approximated or fully captured using matrices with a rank significantly lower than the dimensions of the matrices themselves
- This suggests that the essential information in these large matrices can be condensed into a much smaller subspace
- The low-rank property is particularly relevant for model adaptation because it implies that small, targeted changes (using low-rank matrices) can effectively capture the necessary adjustments for new tasks without needing to alter the entire high-dimensional weight matrices of a pre-trained model

Motivation

  • Challenges with Large Models: As language models like GPT-3 have grown larger, full fine-tuning (retraining all model parameters) has become increasingly impractical due to high computational and storage demands
  • Need for Efficiency: There is a critical need for more efficient methods to adapt these large pre-trained models to new tasks without the overhead of retraining all parameters

Introduction to LoRA

  • LoRA addresses these challenges by introducing a method to adapt models using low-rank matrices
    • his method significantly reduces the number of trainable parameters by freezing the original pre-trained model weights and only training small rank decomposition matrices (A and B) that modify the model's behavior

Advantages

  • Parameter Efficiency: By using low-rank matrices, LoRA limits the number of parameters that need to be updated, making the adaptation process more manageable and less resource-intensive
  • Preservation of Pre-trained Knowledge: The approach preserves the underlying structure and knowledge of the pre-trained model, leveraging the extensive learning it has already undergone
  • Flexibility and Scalability: LoRA enables quick adaptation to multiple tasks by swapping out the low-rank matrices tailored for each task, facilitating faster deployment and easier scaling

LoRA Example

lora

Instead of computing the full weight update matrix ΔW\Delta \mathbf{W} during fine-tuning, LoRA introduces a more efficient approach based on the insight that these updates have a low intrinsic rank.

Consider a pre-trained weight matrix WRd×k\mathbf{W} \in \mathbb{R}^{d \times k}. During traditional fine-tuning, we would learn a full update matrix ΔWRd×k\Delta \mathbf{W} \in \mathbb{R}^{d \times k} and update the weights as:

Wupdated=W+ΔWW_{\text{updated}} = \mathbf{W} + \Delta \mathbf{W}

For large models like GPT-3 with 7B parameters, this means computing and storing an equally massive 7B-parameter update matrix, which is computationally expensive and memory-intensive.

How LoRA Works

LoRA solves this problem by decomposing the weight update matrix into two smaller, low-rank matrices:

ΔW=BA\Delta \mathbf{W} = \mathbf{B}\mathbf{A}

where:

  • ARr×k\mathbf{A} \in \mathbb{R}^{r \times k} is a low-rank matrix (typically initialized with random Gaussian values)
  • BRd×r\mathbf{B} \in \mathbb{R}^{d \times r} is a low-rank matrix (typically initialized with zeros)
  • rmin(d,k)r \ll \min(d, k) is the rank, which is much smaller than the original dimensions

The updated forward pass becomes:

h=Wx+BAxh = \mathbf{W}x + \mathbf{B}\mathbf{A}x

Key Benefits of This Decomposition:

  • Dramatic Parameter Reduction: Instead of training d×kd \times k parameters, we only train r×(d+k)r \times (d + k) parameters. For example, if d=k=1000d = k = 1000 and r=2r = 2, we reduce from 1,000,000 parameters to just 4,000 parameters
  • Frozen Base Model: The original weights W\mathbf{W} remain frozen during training, preserving the pre-trained knowledge
  • Task Switching: Different tasks can use different A\mathbf{A} and B\mathbf{B} matrices while sharing the same base model, enabling efficient multi-task deployment
  • No Inference Latency: Unlike adapter layers, LoRA can be merged into the original weights (W=W+BA\mathbf{W}' = \mathbf{W} + \mathbf{B}\mathbf{A}) after training, adding zero additional inference cost

LoRA Hyperparameters: Rank and Alpha

LoRA introduces two critical hyperparameters that control the adaptation process: the rank rr and the scaling factor α\alpha.

Rank (rr)

The rank rr determines the dimensionality of the low-rank decomposition and directly controls the expressiveness versus efficiency trade-off:

  • Definition: rr is the inner dimension of the matrices A\mathbf{A} and B\mathbf{B}, representing the bottleneck size of the adaptation
  • Impact on Parameters: The number of trainable parameters is r×(d+k)r \times (d + k), so lower ranks mean fewer parameters to train
  • Typical Values: Common choices range from r=1r = 1 to r=64r = 64, with r=8r = 8 or r=16r = 16 being popular defaults
  • Trade-offs:
    • Lower rank (r=1,2,4r = 1, 2, 4): Maximum parameter efficiency, faster training, but may limit the model's ability to capture complex task-specific patterns
    • Higher rank (r=32,64r = 32, 64): Greater expressiveness and better performance on complex tasks, but with increased parameter count and training cost

Empirical studies in the original paper show that surprisingly low ranks (even r=1r = 1 or r=2r = 2) can achieve competitive performance on many tasks, validating the low intrinsic dimension hypothesis.

Alpha (α\alpha)

The scaling factor α\alpha controls the magnitude of the LoRA adaptation relative to the pre-trained weights:

  • Definition: α\alpha is a constant used to scale the LoRA contribution, typically applied as: h=Wx+αrBAxh = \mathbf{W}x + \frac{\alpha}{r}\mathbf{B}\mathbf{A}x
  • Purpose: Acts as a learning rate modifier specifically for the LoRA parameters, allowing control over how much the adaptation influences the model
  • Typical Values: Often set to values like α=16\alpha = 16 or α=32\alpha = 32, though this varies by application
  • Scaling Relationship: The factor αr\frac{\alpha}{r} normalizes the contribution of LoRA updates, making the magnitude somewhat independent of the choice of rank

Practical Considerations

Choosing Rank and Alpha:

  • Start with a moderate rank like r=8r = 8 and α=16\alpha = 16 as baseline values
  • For simple tasks (e.g., sentiment classification), lower ranks may suffice
  • For complex tasks (e.g., instruction following, domain-specific reasoning), higher ranks may be necessary
  • The ratio αr\frac{\alpha}{r} can be kept constant when experimenting with different ranks to maintain consistent update magnitudes

Independence from Task Complexity: Interestingly, research shows that the optimal rank doesn't always correlate with task difficulty. Some complex tasks can be effectively learned with very low-rank adaptations, suggesting that the required weight changes often lie in low-dimensional subspaces regardless of task complexity.

Applying LoRA (Hugging Face PEFT)

Below is a minimal example showing how to attach LoRA adapters to a causal LM, train only the low-rank parameters, and save the adapter weights.

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model
 
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
base_model = AutoModelForCausalLM.from_pretrained(model_id)
 
# Configure LoRA to target attention projection layers
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"]
)
 
model = get_peft_model(base_model, lora_config)
 
# Example data
texts = ["LoRA makes fine-tuning efficient.", "Low-rank adapters scale well."]
encodings = tokenizer(texts, return_tensors="pt", padding=True)
 
args = TrainingArguments(
    output_dir="lora-demo",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_steps=1,
)
 
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=[encodings] * 10,
)
 
trainer.train()
model.save_pretrained("lora-demo-adapter")