CSE221 - lec15: Scalability: RCU & Analysis of Linux Scalability
date: Nov 24, 2024
slug: cse221-lec15
status: Published
tags: System
type: Post
Multi-core CPUs
- ~2000s: CPUs begin shipping with 2 cores
- scalability goals: performance increases linearly with cores
- initially: one big lock on the kernel
- better approach: finer granularity locks on different subsystems
RCU Goals
- concurrent reads, even during updates
- low overhead: computation and storage overhead should be low
- deterministic completion time (for example some real-time applications will need this)
Example: Linked-List Operations
- how to enforce concurrency control?
Approach #1: Spin Lock
- access is mutually exclusive
Approach #2: Read-Write Lock
- multiple readers or a single writer
- benefit: reads scale
- drawbacks:
- cannot read while writing
- space overhead (uses an integer)
- overhead even with only readers (readers serialize on atomic updates to the lock's counter)
thought experiment: what if readers skip the lock?
- with a concurrent writer: readers may see intermediate results
- with no writer: everything is fine
Read-Copy-Update (RCU)
- RCU is a scalable synchronization mechanism
RCU #1: Writers make a copy
- readers skip the lock, so they must be able to make progress even while an update is in flight
- the writer prepares the update on a private copy, without touching the original data structure, and publishes it with a single pointer store, so readers never observe an intermediate state
- Problem: reordering of instructions (by compilers, or by modern processors at runtime - out of order execution)
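A sketch of the copy-then-publish idea (names are invented for illustration): the writer builds a complete replacement off to the side, then publishes it with one pointer store, so a reader following `head` sees either the old node or the new one, never a half-written one. Note this naive version has exactly the reordering problem described above, since nothing stops the compiler or CPU from moving the publish store earlier:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct node { int val; struct node *next; };

/* Replace the first node: copy it, modify the copy, swap one pointer. */
void update_head(struct node **headp, int new_val) {
    struct node *old = *headp;
    struct node *copy = malloc(sizeof *copy);
    memcpy(copy, old, sizeof *copy);  /* copy carries over ->next */
    copy->val = new_val;              /* all writes intended to finish... */
    *headp = copy;                    /* ...before this single publishing store */
    /* `old` cannot be freed yet: a concurrent reader may still hold it. */
}
```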
RCU #2: memory barriers to enforce ordering
- Problem: use after free
- readers may hold pointers to freed memory
- e.g., a reader fetches a pointer, the writer replaces and frees the old node, and the reader then dereferences the stale pointer
RCU #3: grace period
- invariant enforced: readers must not hold an RCU pointer across a context switch (readers mark RCU pointer usage by enclosing it in an RCU critical section)
- writers delay the free until every CPU has context-switched at least once (tracked via per-CPU state variables)
- Implementation: disable preemption inside the RCU critical section; writers wait for the grace period
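The grace-period detection above can be sketched with per-CPU context-switch counters (all names invented; real kernel RCU is more elaborate): each CPU bumps its counter on every context switch, and the writer snapshots all counters and waits until every one has moved past its snapshot:

```c
#include <assert.h>

#define NCPU 4
static volatile unsigned long ctxsw[NCPU];  /* bumped by each CPU's scheduler */

/* Called on every context switch on the given CPU. */
void note_context_switch(int cpu) { ctxsw[cpu]++; }

/* Returns 1 once every CPU has context-switched since the snapshot,
 * meaning no reader that started before the snapshot can still be running. */
int grace_period_over(const unsigned long snap[NCPU]) {
    for (int i = 0; i < NCPU; i++)
        if (ctxsw[i] == snap[i])
            return 0;   /* this CPU may still be running an old reader */
    return 1;
}
```

The writer takes a snapshot right after publishing the new pointer, spins (or sleeps) until `grace_period_over` returns 1, and only then frees the old copy.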
Linux RCU API
Reader
- rcu_read_lock: enter the RCU critical section (disables preemption)
- rcu_read_unlock: leave the RCU critical section (re-enables preemption)
- rcu_dereference(ptr): use memory barriers when dereferencing rcu pointers
Writer
- synchronize_rcu: block until a grace period elapses (all CPUs context-switch) - typically ~ms
- call_rcu(callback, args…): asynchronous alternative to synchronize_rcu; invokes the callback after the grace period
- rcu_assign_pointer(ptr_addr, ptr): use memory barriers when assigning rcu pointers
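Putting the API together on the linked-list example. This is a userspace sketch: the kernel primitives are stubbed out so the shape of reader and writer code is visible (in the kernel, `rcu_read_lock` disables preemption, the barrier macros emit real memory barriers, and `synchronize_rcu` actually waits for a grace period):

```c
#include <assert.h>
#include <stdlib.h>

#define rcu_read_lock()           do { } while (0)  /* stub: preempt_disable */
#define rcu_read_unlock()         do { } while (0)  /* stub: preempt_enable */
#define rcu_dereference(p)        (p)               /* stub: real one adds a barrier */
#define rcu_assign_pointer(p, v)  ((p) = (v))       /* stub: real one adds a barrier */
static void synchronize_rcu(void) { }               /* stub: real one blocks ~ms */

struct node { int key; struct node *next; };
static struct node *head;

/* Reader: lock-free traversal inside an RCU critical section. */
int lookup(int key) {
    rcu_read_lock();
    for (struct node *n = rcu_dereference(head); n; n = rcu_dereference(n->next))
        if (n->key == key) { rcu_read_unlock(); return 1; }
    rcu_read_unlock();
    return 0;
}

/* Writer: copy, publish, wait a grace period, then free the old copy.
 * (Concurrent writers would still need a lock among themselves.) */
void replace_head(int key) {
    struct node *old = head;
    struct node *copy = malloc(sizeof *copy);
    *copy = *old;
    copy->key = key;
    rcu_assign_pointer(head, copy);
    synchronize_rcu();   /* wait until no reader can still see `old` */
    free(old);
}
```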
RCU’s benefits over RW Lock
| Readers | Writer |
| --- | --- |
| can always read, at almost no cost | updates can take longer |
Drawbacks of RCU
- best when readers >> writers
- complex logic
- doesn’t help writers; concurrent writers still need their own synchronization (e.g., a lock)
- have to copy data structures
- some data structures cannot use it (e.g. some applications need stronger semantics)
- readers can see stale data
- readers must drop RCU references before yielding / sleeping
- relies on frequent context switches
Summary
- multiple CPUs: need scalable synchronization mechanism
- RCU: readers don’t wait
- widely used in Linux
Analysis of Linux Scalability
Sources of scalability bottleneck
- Applications
- OS
- Hardware
Analysis Approach
- run a set of applications: ones previously reported to expose bottlenecks, plus parallel, kernel-intensive apps (the MOSBENCH suite)
- find and fix each bottleneck
- iterate until the OS is no longer the bottleneck
Cache Coherency
Scalability Bottleneck
- lock on shared data structures
- writing to shared variable
- competition on cache, memory bandwidth
- not enough concurrency / parallelism in the workload
Techniques for improving scalability
- avoid lock contention & data contention (via per-core data structures)
- sloppy counters: one logical counter split into a shared central value plus per-core counts of spare references; true value = value(central) - sum(per-core spare counts)
- wait-free synchronization: e.g. RCU
- avoid false sharing: logically independent variables that share a cache line contend physically (pad/align them onto separate lines)
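A sketch of the sloppy-counter technique from the list above (the struct layout and names are invented; real code would protect the central value with a lock and index `spare` by the current CPU). Each core reuses its own stash of spare references, so the common case never writes the shared counter:

```c
#include <assert.h>

#define NCORES 4
#define MAX_SPARE 8

struct sloppy {
    long central;        /* shared; lock-protected in real code */
    long spare[NCORES];  /* each slot touched only by its own core */
};

void inc(struct sloppy *c, int core) {
    if (c->spare[core] > 0)
        c->spare[core]--;   /* reuse a local spare: no shared write */
    else
        c->central++;       /* slow path: touch the shared counter */
}

void dec(struct sloppy *c, int core) {
    c->spare[core]++;       /* stash the released reference locally */
    if (c->spare[core] > MAX_SPARE) {  /* return excess to the center */
        c->central -= c->spare[core];
        c->spare[core] = 0;
    }
}

/* true value = central - sum(per-core spare counts) */
long value(const struct sloppy *c) {
    long v = c->central;
    for (int i = 0; i < NCORES; i++)
        v -= c->spare[i];
    return v;
}
```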
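The false-sharing fix can be shown with struct layout (assuming a 64-byte cache line, which is hardware-specific): two per-core counters that are logically independent but land on the same line ping-pong that line between cores, while padded versions each own a line:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size */

/* Bad: core 0 writes `a`, core 1 writes `b`; both live on one line,
 * so every write invalidates the other core's cached copy. */
struct counters_bad {
    long a;
    long b;
};

/* Good: pad each counter out to a full line of its own. */
struct padded_counter {
    long v;
    char pad[CACHE_LINE - sizeof(long)];
};

struct counters_good {
    struct padded_counter a;
    struct padded_counter b;
};
```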
Summary
- general techniques for scalability
- Linux is quite scalable, maybe no need to design a new architecture just for scalability.