Motivation
In recent years, the power of scaling laws has become evident, with larger models achieving superior performance across various tasks. However, as model sizes continue to grow, hardware constraints such as memory limitations prevent them from fitting within a single accelerator. To address this, various model parallel approaches have been proposed.
However, many of these methods are either task- or architecture-specific, or introduce significant communication overhead—particularly those relying on intra-operator parallelism, whose communication can become a bottleneck over slow interconnects—making them difficult to use in practice.
GPipe offers a task- and architecture-agnostic model parallelism approach by pipelining the execution of a model's stages.
Approach

Core Idea
GPipe is based on the following core ideas:
- a model is a sequence of layers
- a model can be partitioned into a sequence of cells (stages), each consisting of a group of consecutive layers
- the execution of this sequence of cells can be pipelined across devices to reduce device idle time
- each mini-batch is further split into micro-batches during execution, so different cells can work on different micro-batches concurrently
- gradients are accumulated across micro-batches and applied once at the end of the mini-batch (see the sketch after this list)
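To make the execution model concrete, here is a minimal, single-process sketch of the micro-batching and gradient-accumulation part of the idea in PyTorch. The cell list, layer sizes, and `num_microbatches` are hypothetical; a real GPipe deployment places each cell on its own accelerator and overlaps their execution in a pipeline rather than running them sequentially as below.

```python
import torch
import torch.nn as nn

# Hypothetical model expressed as a sequence of layers, partitioned into
# three cells; in a real setup each cell would live on its own device.
cells = nn.ModuleList([
    nn.Sequential(nn.Linear(512, 512), nn.ReLU()),  # cell 0
    nn.Sequential(nn.Linear(512, 512), nn.ReLU()),  # cell 1
    nn.Sequential(nn.Linear(512, 10)),              # cell 2
])
optimizer = torch.optim.SGD(cells.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, num_microbatches=4):
    optimizer.zero_grad()
    total_loss = 0.0
    # Split the mini-batch into micro-batches.
    for mx, my in zip(x.chunk(num_microbatches), y.chunk(num_microbatches)):
        h = mx
        # Run the micro-batch through the cells in order. In the real
        # pipeline, cell k starts micro-batch i+1 while cell k+1 is still
        # processing micro-batch i, which is what hides device idle time.
        for cell in cells:
            h = cell(h)
        loss = loss_fn(h, my) / num_microbatches
        loss.backward()          # gradients accumulate in .grad
        total_loss += loss.item()
    # Gradients from all micro-batches are applied together, so the update
    # is equivalent to training on the full mini-batch.
    optimizer.step()
    return total_loss

x = torch.randn(32, 512)
y = torch.randint(0, 10, (32,))
print(train_step(x, y))
```

Because the optimizer step happens only once per mini-batch, micro-batching changes the schedule but not the gradient that gets applied.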
Performance Optimizations
- Activation checkpointing: only keep activations at cell boundaries; other activations needed for the backward pass are recomputed during backward, reducing peak memory usage (a minimal sketch follows this list).
- Communication overhead: communication between devices happens only at cell boundaries, which keeps the overhead small.
- Load balancing: a cost estimator estimates the execution cost of each stage, which helps avoid load imbalance between stages.
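As a rough illustration of the first point, the sketch below wraps one cell's forward pass in `torch.utils.checkpoint`, so only the cell-boundary tensors are stored and the cell's internal activations are recomputed during backward. The cell definition is hypothetical and this is not GPipe's actual implementation, just the same memory-saving idea.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical cell: a group of consecutive layers.
cell = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

x = torch.randn(8, 512, requires_grad=True)

# Only x (the cell-boundary input) and the returned boundary output are
# kept for backward; intermediate activations inside `cell` are
# recomputed when .backward() runs.
out = checkpoint(cell, x, use_reentrant=False)
out.sum().backward()
```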
Evaluation
The approach is evaluated on image classification and machine translation tasks. Some key observations:
- Transformer achieves almost linear speedup when scaling up, since it consists of identical blocks and the load is easily balanced.
- GPipe achieves notable speedup even without NVLink.