LLVM IR Tutorial: Phis, GEPs, and other things, oh my! - LLVM Developers Conference 2019

date
Jan 13, 2024
slug
llvm-ir-tutorial-llvm-dev-conf-2019
status
Published
tags
LLVM
summary
type
Post

What is LLVM IR?

  • A low-level programming language (RISC-like instruction set)
  • while being able to represent high-level ideas (ideas in high-level language can be easily mapped to LLVM IR)
  • Enable efficient code optimization.
notion image

IR & Compilation Process

notion image
  • Source codes in different languages can be compiled into LLVM IR bit code form, which can be optimized by the optimizer of LLVM(e.g., opt) and then be linked.

Simplified IR Layout

notion image
Modules represent the top-level structure in an LLVM program. An LLVM module is effectively a translation unit or a collection of translation units merged together. (from LLVM documentation)

An Example: a basic main program

main.cpp
corresponding code in LLVM IR (omitting unnecessary parts)
Some observations:
  • LLVM IR is a typed language (and no implicit conversion)
  • Easily represent high-level ideas such as function.
%var is a virtual register in LLVM IR, which is used to represent a local variable. LLVM IR has infinite virtual registers. (%name or %number)

More Examples: recursive factorial

main.cpp
corresponding LLVM IR

Basic Blocks

The above example contains several basic blocks.
A basic block is a list of non-terminator instructions ending with a terminator instruction: - Branch: “br” - Return: “ret” - Switch: “switch” - Unreachable: “unreachable” - Exception handling instructions
notion image

Control Flow Graph

A Control Flow Graph captures the pred-succ relationship between basic blocks, the node set contains all basic blocks, andthe edge set contains all control flow transfer relations.
notion image
Every basic block has a label; if you do not explicitly write it, it will be implicitly added.

More Examples: iterative factorial

main.cpp
LLVM IR:

SSA: Single Static Assignment

SSA is a form of IR, with the following restrictions:
  • Every variable is assigned exactly once.
  • Every variable is defined before it is used.
In the above example, we want to update %temp and %i, but this violates the constraints of SSA, so what should we do?

Phi instruction comes to the rescue!

notion image
CFG when using Phis
notion image

Another Way to Cheat SSA: the Alloca instruction

The alloca instruction:
  • Allocates memory on the stack frame of the executing function.
  • Automatically released.
  • Plays a big part in generating IR in SSA form.
CFG using alloca:
notion image

Global Variables

Allocas allocate memory for function scopes. Global variables fill that role for the module in a static way.
  • They are always pointers, like the values returned by Allocas.
notion image
  • GVs are always constant pointers(will not point to other locations)

LLVM’s Type System

types in LLVM’s type system(from langRef):
notion image

Aggregate Type: Array Type

The array type is defined by:
  • A constant size. e.g.: @array = global [17 x ]
  • An element type. e.g.: @array = global [17 x i8]
  • (For GVs) An initializer: e.g.: @array = global [17 x i8] zeroinitializer

Accessing Array & Manipulating Pointers

We can access array elements using GEP instruction, which provides a way to:
  • calculate array offset
  • abstract details like sizeof type and padding inside structs

Manipulating Pointers using GEPs

notion image
here is an example:
notion image
if we only provide the 0-th dimension offset, GEP will return a pointer of the base type; if you provide an index in more dimensions, it will return element types.

GEP fundamentals

  1. Understand the first index:
  • It does NOT change the resulting pointer type.
  • It offsets by the base type. (may become a bit confusing when the base type is one of the aggregate types)
  1. Further indices:
  • Offset inside aggregate types. (and vectors)
  • Change the pointer type by removing one layer of “aggregation”

A common misusage of GEP

notion image
GEP offsets by base type if only the first index is provided, so in this case, it offsets by the size of the entire array.

Aggregate Type: the Struct Type

A struct type is defined by:
  • A name
  • Keyword “type”
  • A list of types
  • example: %MyStruct = type { i8, i32, [3 x i32] }

GEP with structs

Basically, it’s the same as with the array type. (offset by members), but should always be indexed by constants (so the compiler can figure out types of the elements in a static way)
notion image
For example, if the index is not constant, we cannot check whether it could be indexed at third-level. (this is based on the fact that the 2-nd element is an array)

GEP fundamentals - continue

  1. Struct indices must be constants.
  1. GEPs never load from memory!

Final Remarks

LLVM IR is constantly evolving.
Covered fundamental topics unlikely to change soon.
  • There is a lot more to explore!
Next steps:
  • Learn how to manipulate IR using the LLVM library
  • Look at the OPT code!

© Lifan Sun 2023 - 2025