Meta Open-Sources MEGALODON LLM for Efficient Long Sequence Modeling

Date: unknown

Location: www.infoq.com

Researchers from Meta, University of Southern California, Carnegie Mellon University, and University of California San Diego recently open-sourced MEGALODON, a large language model (LLM) with an unlimited context length. MEGALODON has linear computational complexity and outperforms a similarly-sized Llama 2 model on a range of benchmarks.

MEGALODON is designed to address several shortcomings of the Transformer neural architecture underlying most LLMs. Instead of the standard multihead attention, MEGALODON uses a chunk-wise attention. The research team also introduced sequence-based parallelism during training, improving scalability for long-context training. When evaluated on standard LLM benchmarks, such as WinoGrande and MMLU, MEGALODON outperformed a Llama 2 model with the same amount of parameters, training data, and training compute budget. According to the researchers: 

MEGALODON achieves impressive improvements on both training perplexity and across downstream benchmarks. Importantly, experimental results on long-context modeling demonstrate MEGALODON’s ability to model sequences of unlimited length. Additional experiments on small/medium-scale benchmarks across different data modalities illustrate the robust improvements of MEGALODON, which lead to a potential direction of future work to apply MEGALODON for large-scale multi-modality pretraining.

Although the Transformer architecture has become the standard for most Generative AI models, Transformers do have some drawbacks. In particular, their self-attention mechanism has quadratic complexity in both compute and storage, which limits the models' input context length. Several alternatives to the standard self-attention model have been developed recently, including structured state space models (SSMs) like Mamba, which scales linearly with context length. Another scheme that InfoQ recently covered is the RWKV Project's attention-free Transformer model, which has no maximum input context length.

MEGALODON builds on the research team's previous model, MEGA (exponential moving average with gated attention), with several new features. First, while MEGA uses a "classical" exponential moving average (EMA) within its attention mechanism, MEGALODON computes a complex EMA (CEMA). Mathematically, the CEMA component makes MEGALODON equivalent to a "simplified state space model with diagonal state matrix."

The research team trained a seven-billion parameter model, MEGALODON-7B, using the same 2-trillion token dataset that Llama2-7B used; they also used the same training hyperparameters. The team observed that MEGALODON-7B was more computationally efficient. When the Llama model was scaled up to a 32k context length, MEGALODON-7B was "significantly" faster.

Besides evaluating MEGALODON-7B on standard LLM benchmarks, the researchers also tested its performance on the SCROLLS long-context question-answering benchmark, and compared its results with several baseline models, including the modified Llama 2 model with a 32k context length. MEGALODON outperformed all baseline models on the NarrativeQA subtask, and on all tasks achieved results "competitive" with Llama 2.

In a discussion about MEGALODON on Hacker News, one user wondered how well the model performed on recall tasks, given that other non-Transformer models tend to perform poorly. Another user replied:

For what it's worth, RWKV's website on that matter mentions that yes it's bad on recall, but for the vast majority of tasks you can just ask the question *before* the content, and it'll handle the task just fine.

The MEGALODON code is available on GitHub.

About the Author

Anthony Alford