TL;DR
This post summarizes the intra-layer model parallelism mechanism proposed in Megatron-LM, which splits the computation of each transformer layer across multiple GPUs, making it possible to train larger transformer-based language models.
Problem & Motivation
Larger language models tend to deliver better performance (the scaling law), so being able to train larger language models matters. However, due to memory constraints, such models cannot fit on a single GPU. Some existing techniques reduce memory consumption at the cost of a significant drop in efficiency, such as gradient checkpointing; others address the problem through model parallelism, such as pipeline parallelism, but these methods require rewriting the model or depend on custom compilers and frameworks that are still under development.
Megatron-LM proposes an intra-layer model-parallel method that can be implemented with native PyTorch and is orthogonal to, and can be combined with, other model-parallel methods such as pipeline parallelism.
Approach
MLP Parallelization

The MLP in the Transformer block consists of two GEMMs with a GeLU in between: $Y = \mathrm{GeLU}(XA)$ and $Z = YB$. For the first GEMM, the weight matrix $A$ is partitioned along its columns, $A = [A_1, A_2]$, so each GPU computes $Y_i = \mathrm{GeLU}(XA_i)$ independently; because GeLU is applied element-wise to each column block, no synchronization is needed here. For the second GEMM, $B$ is partitioned along its rows into $B_1$ and $B_2$, so each GPU computes a partial product $Z_i = Y_i B_i$, and the final output is the sum $Z = Z_1 + Z_2$. Only a single all-reduce is needed at the end to produce the final result, which reduces communication compared to other ways of partitioning the model.
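As a rough illustration, a column-parallel first linear layer followed by a row-parallel second linear layer can be sketched in plain PyTorch as below. This is not Megatron-LM's actual code: it assumes `torch.distributed` is already initialized, omits biases and dropout, and leaves out the backward-pass communication discussed in the Communication Cost section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ParallelMLP(nn.Module):
    """Sketch of a tensor-parallel MLP block: Z = GeLU(X A) B."""
    def __init__(self, hidden_size, ffn_size, world_size):
        super().__init__()
        assert ffn_size % world_size == 0
        local_ffn = ffn_size // world_size
        # A is split by columns: this rank holds A_i of shape [hidden, ffn / world_size].
        self.A_i = nn.Linear(hidden_size, local_ffn, bias=False)
        # B is split by rows: this rank holds B_i of shape [ffn / world_size, hidden].
        self.B_i = nn.Linear(local_ffn, hidden_size, bias=False)

    def forward(self, x):
        # Y_i = GeLU(X A_i): GeLU acts on each column block independently,
        # so no communication is needed after the first GEMM.
        y_i = F.gelu(self.A_i(x))
        # Z_i = Y_i B_i is only a partial result; summing over ranks gives Z.
        z_i = self.B_i(y_i)
        dist.all_reduce(z_i)  # the single all-reduce needed in the forward pass
        return z_i
```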
Self-Attention Parallelization

The computation is split along the heads of multi-head attention: each GPU holds the query, key, and value projections for a subset of heads and computes the attention for those heads locally. The output projection that follows is split along its rows, so each GPU produces a partial result, and a single all-reduce at the end combines the partial outputs.
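A similar sketch for the attention block follows; it is illustrative rather than the official implementation, again assumes `torch.distributed` is initialized, and omits biases, dropout, and masking.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ParallelSelfAttention(nn.Module):
    """Sketch of head-partitioned self-attention: each rank owns a subset of heads."""
    def __init__(self, hidden_size, n_heads, world_size):
        super().__init__()
        assert n_heads % world_size == 0 and hidden_size % n_heads == 0
        self.local_heads = n_heads // world_size
        self.head_dim = hidden_size // n_heads
        local_hidden = self.local_heads * self.head_dim
        # Q/K/V projections are column-split: this rank only produces its own heads.
        self.qkv = nn.Linear(hidden_size, 3 * local_hidden, bias=False)
        # The output projection is row-split, so each rank yields a partial sum.
        self.out = nn.Linear(local_hidden, hidden_size, bias=False)

    def forward(self, x):  # x: [batch, seq, hidden]
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, s, self.local_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        ctx = torch.softmax(scores, dim=-1) @ v        # [batch, heads, seq, head_dim]
        ctx = ctx.transpose(1, 2).reshape(b, s, -1)
        partial = self.out(ctx)
        dist.all_reduce(partial)  # combine the partial outputs across ranks
        return partial
```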
Communication Cost

In the approach above, two conjugate communication operators are inserted around each parallel block: $f$ is the identity in the forward pass and an all-reduce in the backward pass, while $g$ is the opposite, an all-reduce in the forward pass and the identity in the backward pass. A transformer layer contains one $f$/$g$ pair in the self-attention block and one in the MLP block, so a single forward + backward pass requires 4 all-reduce operations in total.
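A minimal sketch of how $f$ and $g$ can be written as `torch.autograd.Function` subclasses in plain PyTorch is shown below; the class names are mine, and real code would additionally handle process groups, in-place semantics, and the single-GPU case.

```python
import torch
import torch.distributed as dist

class _F(torch.autograd.Function):
    """f: identity in the forward pass, all-reduce in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad)  # sum input gradients across tensor-parallel ranks
        return grad

class _G(torch.autograd.Function):
    """g: all-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        out = x.clone()
        dist.all_reduce(out)   # sum partial outputs across tensor-parallel ranks
        return out
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```

Wrapping a parallel block as `_G.apply(block(_F.apply(x)))` then yields exactly one all-reduce in the forward pass and one in the backward pass per block, i.e. four per transformer layer.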