The Remedy for Fine-Tuning LLMs in Long Context – LongLoRA

Large language models (LLMs) have revolutionized natural language processing tasks, but their pre-defined context size limits their applicability to tasks involving long documents or questions. Extending that window with full fine-tuning is computationally expensive, while plain low-rank adaptation (LoRA) is cheap but not effective enough for long-context adaptation. This is where LongLoRA comes in: an efficient fine-tuning approach that overcomes these limitations and enables LLMs to handle longer contexts effectively.

Ineffectiveness of Standard Self-Attention and LoRA

Standard self-attention, while effective for short contexts, becomes a bottleneck as the context window grows, since its computational cost increases quadratically with sequence length. Plain LoRA, which adapts a model through low-rank weight updates alone, falls short on long-context models: empirical results show high perplexity after adaptation, while full fine-tuning at long context lengths incurs prohibitive training costs. Both options are therefore suboptimal for extending the context window of LLMs.

Introducing LongLoRA

LongLoRA presents a novel approach to efficiently fine-tuning LLMs for extended context windows. It leverages techniques such as FlashAttention-2 and DeepSpeed ZeRO to optimize training. During fine-tuning it introduces shift short attention (S2-Attn), which splits the sequence into groups and computes attention within each group individually; shifting the tokens seen by half of the attention heads by half a group lets information flow between neighboring groups. This approximates full-length attention during training while retaining the original attention architecture, so the model can still use standard full attention at inference time.
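To make the grouping-and-shifting idea concrete, here is a minimal PyTorch sketch of S2-Attn. The function name, tensor shapes, and group_size argument are illustrative assumptions, and plain scaled-dot-product attention stands in for the FlashAttention kernels used in practice; the sketch also omits the causal mask and the special handling of tokens that wrap around at the sequence boundary.

```python
import torch
import torch.nn.functional as F


def s2_attention(q, k, v, group_size):
    """Toy shift short attention. q, k, v: (batch, seq_len, num_heads, head_dim)."""
    B, N, H, D = q.shape
    G = group_size
    assert N % G == 0, "sequence length must be divisible by the group size"

    def shift(x, direction):
        # Roll half of the heads along the sequence dimension by half a group,
        # so neighbouring groups overlap and information can flow between them.
        x = x.clone()
        x[:, :, H // 2:] = x[:, :, H // 2:].roll(direction * (G // 2), dims=1)
        return x

    q, k, v = (shift(t, -1) for t in (q, k, v))

    def to_groups(x):
        # Fold each group of G consecutive tokens into the batch dimension.
        return x.reshape(B * N // G, G, H, D).transpose(1, 2)  # (B*N/G, H, G, D)

    # Ordinary attention, but only within each group of G tokens.
    out = F.scaled_dot_product_attention(to_groups(q), to_groups(k), to_groups(v))
    out = out.transpose(1, 2).reshape(B, N, H, D)

    return shift(out, 1)  # undo the shift on the shifted heads


# Tiny usage example with arbitrary sizes.
q, k, v = (torch.randn(1, 4096, 8, 64) for _ in range(3))
print(s2_attention(q, k, v, group_size=1024).shape)  # torch.Size([1, 4096, 8, 64])
```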

Figure: Overview of LongLoRA designs

Key Components of LongLoRA:

  • FlashAttention-2 Compatibility: LongLoRA is designed to be compatible with FlashAttention-2, allowing for seamless integration with existing optimization and infrastructure techniques for LLMs.
  • DeepSpeed ZeRO: LongLoRA leverages DeepSpeed ZeRO, a memory optimization technique, to minimize the computational cost associated with extending the context window.
  • Shift Short Attention (S2-Attn): S2-Attn enables efficient context extension by performing attention within individual groups, reducing the computational burden compared to standard self-attention. The information flow between groups is facilitated through token shifting.
  • Trainable Embedding and Normalization: Beyond the low-rank updates to the attention weights, LongLoRA makes the embedding and normalization layers trainable. These layers account for only a small proportion of the model's parameters, yet training them is crucial for long-context learning (a minimal sketch of this setup follows the figure below).

Figure: Illustration of shift short attention
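On the parameter-efficiency side, only the low-rank adapters plus the embedding and normalization layers end up trainable. Below is a minimal, hedged sketch of such a setup with Hugging Face PEFT; the model name, rank, and module names are assumptions for a LLaMA-style architecture rather than the authors' exact configuration.

```python
# A minimal sketch of a LongLoRA-style trainable-parameter setup using Hugging Face
# PEFT. The model name, rank, and module names are illustrative assumptions for a
# LLaMA-style architecture, not the authors' exact configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    # Low-rank updates on the attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Additionally make the embedding and normalization layers fully trainable,
    # mirroring the observation that these small layers matter for long context.
    modules_to_save=["embed_tokens", "norm"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the parameters is trainable
```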

Results and Improvements

Experimental results demonstrate the effectiveness and efficiency of LongLoRA. Models fine-tuned with LongLoRA achieve performance comparable to full-attention, fully fine-tuned baselines while significantly reducing computational cost. LongLoRA extends the context window of LLaMA2 models at the 7B, 13B, and 70B scale, for example taking LLaMA2 7B to a 100k-token context and LLaMA2 70B to 32k tokens on a single 8× A100 machine. These improvements enable LLMs to handle long documents and questions far more effectively.

Conclusion

LongLoRA offers an efficient fine-tuning approach for extending the context window of LLMs, addressing the limitations of standard self-attention, plain LoRA, and full fine-tuning. By combining techniques such as FlashAttention-2, DeepSpeed ZeRO, shift short attention, and trainable embedding and normalization layers, LongLoRA achieves performance comparable to full fine-tuning while significantly reducing computational costs. This advance empowers LLMs to handle long contexts, opening the door to new possibilities in natural language processing tasks.

Bonus Content

FlashAttention-2 is an improved version of FlashAttention, designed to address the inefficiencies of the attention layer in Transformers when scaling to longer sequence lengths. The attention layer is a major bottleneck for long sequences because its runtime and memory requirements grow quadratically with sequence length. FlashAttention-2 introduces better parallelism and work partitioning to optimize the attention computation: it reduces the number of non-matrix-multiplication floating-point operations (FLOPs), parallelizes the computation across different thread blocks, and distributes work between warps within each thread block to minimize communication and shared-memory access. These enhancements yield a significant speedup, up to roughly 2× faster than FlashAttention and approaching the efficiency of optimized matrix-multiply operations. FlashAttention-2 has been empirically validated to deliver faster training for GPT-style models, making it a valuable tool for processing longer sequences in language modeling and other applications.
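For readers who want to try it directly, the snippet below is a small, hedged example of calling FlashAttention-2 through the flash-attn package's flash_attn_func; the tensor sizes are arbitrary, and the half-precision/CUDA requirements reflect the package's interface rather than anything specific to LongLoRA.

```python
# Minimal example of exact attention via FlashAttention-2 (the `flash-attn` package).
# Requires an NVIDIA GPU; sizes below are arbitrary.
import torch
from flash_attn import flash_attn_func

batch, seq_len, n_heads, head_dim = 2, 8192, 16, 64

# flash_attn_func expects (batch, seq_len, n_heads, head_dim) tensors in fp16/bf16 on CUDA.
q = torch.randn(batch, seq_len, n_heads, head_dim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact attention computed without materialising the full seq_len x seq_len score matrix.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 8192, 16, 64])
```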

DeepSpeed ZeRO (Zero Redundancy Optimizer) is a memory-optimization technique for training large deep-learning models. It eliminates memory redundancies in data- and model-parallel training, improving training speed and enabling the efficient training of very large models. By partitioning model states (optimizer states, gradients, and parameters) across devices, DeepSpeed ZeRO lets the trainable model size scale roughly in proportion to the number of devices while maintaining high computational and communication efficiency, enabling trillion-parameter models to be trained on large GPU clusters.
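As a rough illustration of how this plugs into a training script, the sketch below enables ZeRO stage 2 with optimizer-state offloading through deepspeed.initialize; the stage, batch sizes, and offload settings are placeholders, not the configuration used for LongLoRA, and the tiny linear layer merely stands in for a real model.

```python
# A hedged sketch of enabling DeepSpeed ZeRO in a training script. The stage,
# batch sizes, and offload settings are placeholders, not LongLoRA's actual setup.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real (much larger) language model

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                               # partition optimizer states and gradients across GPUs
        "offload_optimizer": {"device": "cpu"},   # optionally push optimizer states to CPU memory
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
}

# Typically launched with the `deepspeed` launcher so the distributed backend is set up.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)
```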

This article was drafted with the assistance of A.I. and references the sources below:
  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
  • LongLoRA and LongAlpaca for Long-context LLMs
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

The work described in this article was supported by the InnoHK initiative, the Government of the HKSAR, and the Laboratory for AI-Powered Financial Technologies.
(AIFT strives but cannot guarantee the accuracy and reliability of the content, and will not be responsible for any loss or damage caused by any inaccuracy or omission.)
