Reading Note: Megatron-LM v1

Date: Feb 16, 2025
Slug: megatron-lm
Status: Published
Tags: MLSys
Type: Post

TL;DR

This post summarizes the intra-layer model parallelism mechanism proposed in Megatron-LM, which splits the computation inside each transformer layer across multiple GPUs and makes it possible to train larger transformer-based language models.

Problem & Motivation

Larger language models deliver better performance (the scaling laws), so being able to train larger models is important. However, beyond a certain size the model no longer fits in the memory of a single GPU. Existing workarounds have drawbacks: memory-reduction techniques such as gradient (activation) checkpointing trade memory for recomputation and can significantly reduce efficiency, while model-parallel approaches such as pipeline parallelism require rewriting the model or rely on custom compilers and frameworks that are still under development.
Megatron-LM proposes an intra-layer model parallel method that can be implemented with a few modifications in native PyTorch and is orthogonal to, and composable with, other model parallel methods such as pipeline parallelism.

Approach

MLP Parallelization

[Figure: the MLP block split into a column-parallel GEMM followed by a row-parallel GEMM]
The MLP in the transformer block consists of two GEMMs with a GeLU in between: $Y = \mathrm{GeLU}(XA)$ followed by $Z = YB$. In the first GEMM, the weight $A$ is partitioned along its columns, $A = [A_1, A_2]$, so each GPU computes $Y_i = \mathrm{GeLU}(XA_i)$ independently, with the GeLU applied per partition and no synchronization needed. In the second GEMM, $B$ is partitioned along its rows, $B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}$, so $Z = Y_1 B_1 + Y_2 B_2$. Only a single all-reduce is needed at the end to sum the partial results, which reduces communication compared to other ways of partitioning these GEMMs.
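To make the partitioning concrete, here is a minimal single-process sketch that checks the math above by simulating two tensor-parallel ranks with plain tensor slices. It does not use torch.distributed; the final sum of partial results stands in for the all-reduce, and all sizes and names are illustrative rather than Megatron-LM's actual code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, hidden, ffn = 4, 8, 32
X = torch.randn(batch, hidden)
A = torch.randn(hidden, ffn)
B = torch.randn(ffn, hidden)

# Unpartitioned reference: Z = GeLU(X A) B
Z_ref = F.gelu(X @ A) @ B

# Column-partition A = [A1, A2]; each "rank" computes GeLU(X A_i) with no
# communication, because GeLU(X [A1, A2]) = [GeLU(X A1), GeLU(X A2)].
A1, A2 = A.chunk(2, dim=1)
Y1, Y2 = F.gelu(X @ A1), F.gelu(X @ A2)

# Row-partition B = [B1; B2]; each "rank" computes a partial product Y_i B_i.
B1, B2 = B.chunk(2, dim=0)
Z1, Z2 = Y1 @ B1, Y2 @ B2

# In the real distributed implementation this sum is a single all-reduce.
Z = Z1 + Z2
print(torch.allclose(Z, Z_ref, atol=1e-5))  # True
```

The key property is that GeLU can be applied to each column partition independently, so the only synchronization in the entire MLP block is the single reduction at the end.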

Self-Attention Parallelization

[Figure: the self-attention block with column-parallel Q/K/V projections and a row-parallel output projection]
 
The computation is split along the heads of multi-head attention: the query, key, and value projections are partitioned by columns so that each GPU holds a subset of the heads and computes their attention entirely locally, while the output linear projection is partitioned by rows, so a single all-reduce at the end combines the partial results.
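The same identity can be checked for attention. The sketch below simulates two ranks by slicing the Q/K/V projection weights along their columns (assigning half of the heads to each rank) and the output projection along its rows; again the final sum stands in for the all-reduce, and the code is an illustration rather than Megatron-LM's actual implementation.

```python
import torch

torch.manual_seed(0)
batch, seq, hidden, n_heads = 2, 5, 16, 4
d_head = hidden // n_heads
X = torch.randn(batch, seq, hidden)
Wq, Wk, Wv, Wo = (torch.randn(hidden, hidden) for _ in range(4))

def attention(x, wq, wk, wv, heads):
    # Project, then reshape to [batch, heads, seq, d_head] for per-head attention.
    def split(t):
        return t.view(batch, seq, heads, d_head).transpose(1, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = torch.softmax(q @ k.transpose(-1, -2) / d_head ** 0.5, dim=-1)
    out = scores @ v
    return out.transpose(1, 2).reshape(batch, seq, heads * d_head)

# Unpartitioned reference: multi-head attention followed by the output projection.
ref = attention(X, Wq, Wk, Wv, n_heads) @ Wo

# Column-partition Wq/Wk/Wv (i.e., split the heads across "ranks"),
# row-partition Wo, then sum the partial outputs (the all-reduce).
parts = []
for rank in range(2):
    cols = slice(rank * hidden // 2, (rank + 1) * hidden // 2)
    local = attention(X, Wq[:, cols], Wk[:, cols], Wv[:, cols], n_heads // 2)
    parts.append(local @ Wo[cols, :])
out = parts[0] + parts[1]
print(torch.allclose(out, ref, atol=1e-5))  # True
```

Because attention never mixes information across heads, splitting by heads keeps all of the attention math local to a rank; only the output projection needs the reduction.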

Communication Cost

[Figure: communication operators f and g in a tensor-parallel transformer layer]
In the above scheme, the operator $f$ is an identity in the forward pass and an all-reduce in the backward pass, while its conjugate $g$ is the opposite: an all-reduce in the forward pass and an identity in the backward pass. Each transformer layer contains one such $f$/$g$ pair in the attention block and one in the MLP block, so a single forward + backward pass of a layer requires 4 all-reduce operations in total (2 in forward, 2 in backward).
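In PyTorch these two operators are naturally expressed as custom autograd functions. Below is a hedged sketch assuming torch.distributed has already been initialized for the tensor-parallel group; the class names _F and _G are mine for illustration, not Megatron-LM's actual class names.

```python
import torch
import torch.distributed as dist


class _F(torch.autograd.Function):
    """f: identity in the forward pass, all-reduce in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad


class _G(torch.autograd.Function):
    """g: all-reduce in the forward pass, identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


# Per-rank usage inside a column-parallel -> row-parallel pair (MLP or attention):
#   y_i = torch.nn.functional.gelu(_F.apply(x) @ A_i)  # f guards the replicated input
#   z   = _G.apply(y_i @ B_i)                          # g sums the partial outputs
```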
