Reading Note: “ORCA: A Distributed Serving System for Transformer-Based Generative Models”

PMPP Reading Notes

Reading Notes: Qwen Technical Report

Reading Notes Collections: Context Length Extrapolation

Reading Notes: MiniCPM Technical Report

Reading Notes: LLaMA Technical Report

Word Embedding Techniques

Understanding Tokenization Methods

Rotary Positional Encoding

Reading Notes: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”

Reading Notes: “Efficient Memory Management for Large Language Model Serving with PagedAttention”

Reading Notes: “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”

Reading Notes: “Training Compute-Optimal Large Language Models”

Reading Notes: “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”

Reading Notes: GPT Series

Reading Notes: “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning”

Reading Notes: “GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism”

Distributed Training Basics

Reading Note: Megatron-LM v1

Quantization for NN Inference


© Lifan Sun 2023 - 2025