Reading Notes: “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” Mar 3, 2025 NLP
Reading Notes: “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” Mar 1, 2025 NLP