Computer Architecture

System Evaluation Metrics

Cost Metrics

The cost of a chip includes:

  • Design cost: non-recurring engineering (NRE), can be amortized well if there is high volume;
  • Manufacturing cost: depends on area;
    • Manufacturing Semiconductor Chips: Ingot → Wafer → Die (unpackaged chip) → Chip
    • To measure the production efficiency of semiconductor manufacturing, we use the metric yield: the portion of good chips per wafer.
  • Testing cost: depends on yield and test time;
  • Packaging cost: depends on die size, number of pins, power delivery, ...

The cost of a system includes:

  • Power cost;
  • Cooling cost;
  • Total Cost of Ownership (TCO) of datacenters:
    • Capital expenses (CAPEX): facilities, assembly & installation, compute, storage,
      networking, software, …
    • Operational expenses (OPEX): energy, rent, maintenance, employee salaries, …
  • System availability: Downtime is expensive and results in a direct loss of revenue. Redundancy (adding backup components) improves availability but also increases the initial capital cost.

Performance Metrics

Performance metrics:

  • Latency: time to complete a task;
  • Throughput: tasks completed per unit time;

Improving latency usually also improves throughput, but not vice versa. For example, inter-task parallelization improves throughput but not the latency of a single task, while intra-task parallelization improves both.

Buffering/queuing/batching improves throughput but may hurt latency, leading to a tradeoff between latency and throughput.

Digital systems (e.g., processors) operate using a constant-rate clock:

  • Clock cycle time (CCT): duration of a clock cycle;
  • Clock frequency (rate): cycles per second.

To compute the execution time of a program, we first count the instructions executed, i.e., the instruction count (IC), which is fixed for a given program. Then we compute the average number of cycles per instruction (CPI), which depends on the system architecture and implementation. Altogether, we have

\[\text{Execution Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}= \text{IC} \times \text{CPI} \times \text{CCT}. \]

Roughly speaking, the software and ISA determine IC, the ISA and microarchitecture determine CPI, and the microarchitecture/circuit technology determines CCT.
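
As a quick sanity check with made-up numbers: a program of \(10^9\) instructions with CPI \(= 1.5\) on a 2 GHz clock (CCT \(= 0.5\) ns) takes

\[\text{Execution Time} = 10^9 \times 1.5 \times 0.5\,\text{ns} = 0.75\,\text{s}. \]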

So far we have only discussed processor performance. What about memory? Its effect can be folded into CPI. We know that

\[\text{Runtime}=\max(\text{#ops}/\text{processor throughput},\text{#bytes}/\text{memory bandwidth}). \]

Defining the operational intensity (OI) as \(\frac{\text{#ops}}{\text{#bytes}}\), we have

\[\begin{align*} \text{Perf}=&\ \text{#ops}/\text{Runtime}\\ =&\ \min(\text{processor throughput},\text{memory bandwidth}\times\text{operational intensity}). \end{align*} \]

Plotting performance against operational intensity gives the roofline model for a given system: a memory-bound slope at low OI and a compute-bound plateau at high OI.
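
For example, with assumed peak numbers of 10 TFLOP/s processor throughput and 1 TB/s memory bandwidth, the ridge point sits at \(\text{OI} = 10\) FLOP/byte: a kernel with \(\text{OI} = 2\) is memory-bound at \(1\,\text{TB/s} \times 2 = 2\) TFLOP/s, while a kernel with \(\text{OI} = 20\) is compute-bound at 10 TFLOP/s.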

Power and Energy Metrics

Dynamic/active power: \(C\times V_{dd}^2\times f_{0\to 1}=\alpha C V_{dd}^2 f\), where \(C\) is the capacitance being switched, \(V_{dd}\) is the supply voltage, \(f_{0\to 1}\) is the frequency of 0-to-1 transitions, \(\alpha\) is the activity factor (the fraction of capacitance being switched), and \(f\) is the clock frequency.

Static/leakage power: \(V_{dd}I_{leak}\), where \(I_{leak}\) is the leakage current.

Therefore, total power is

\[\text{Power}=\alpha C V_{dd}^2 f + V_{dd} I_{leak}. \]

And

\[\text{Energy}=\text{Power}\times \text{Time}. \]

Limiting factors of power, energy, and power density:

  • Power is limited by infrastructure, e.g., power supply;
  • Power density is limited by thermal dissipation, e.g., fans, liquid cooling;
  • Energy is limited by battery capacity or electrical bill.

Power scaling:

  • Dennard scaling (1974-2005): If the feature size scales by \(1/S\), the supply voltage and current can also scale by \(1/S\), keeping power density roughly constant;
  • Post-Dennard scaling (2006-now): Power limits performance scaling (power wall), so we need to slow down frequency scaling or reduce chip utilization.

Normalize performance to power:

\[\text{Energy Efficiency}=\frac{\text{Performance}}{\text{Power}}=\frac{\text{Operations}/\text{Time}}{\text{Energy}/\text{Time}}=1/\frac{\text{Energy}}{\text{Operations}}. \]

For a given task, choose the "optimal" design point to trade off performance and energy.

Scalability

Scalability measures the speedup achieved by using \(N\) processors compared to using just \(1\) processor.

Two settings to evaluate scalability:

  • Strong scaling: speedup on \(N\) processors with fixed total workload size
  • Weak scaling: speedup on \(N\) processors with fixed per-processor workload size

How to balance the workload?

  • Static load balancing: to partition input as evenly as possible
  • Dynamic load balancing, e.g., work dispatch, work stealing

Suppose that an optimization accelerates a fraction \(f\) of a program by a factor of \(S\), then the overall speedup is given by Amdahl's Law:

\[\text{Speedup}=\frac{1}{(1-f)+\frac{f}{S}}. \]
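
For example, if an optimization speeds up \(f = 90\%\) of a program by \(S = 10\),

\[\text{Speedup}=\frac{1}{0.1+\frac{0.9}{10}}=\frac{1}{0.19}\approx 5.3, \]

far below \(10\): the un-accelerated \(10\%\) dominates.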

Benchmark

A benchmark is a carefully selected program used to measure performance; a benchmark suite is a collection of benchmarks.

To report the average performance on a benchmark suite, we may use three types of means: arithmetic (for absolute values such as times), harmonic (for rates), and geometric (for normalized ratios).

ISA

  • RISC: reduced instruction set computer, e.g., RISC-V, MIPS
  • CISC: complex instruction set computer, e.g., x86, x86-64

RISC-V Instructions

System States:

  • Program counter (PC): 32-bit in RV32I or 64-bit in RV64I;
  • Registers: 32 general-purpose registers. Each register is 32-bit in RV32I or 64-bit in RV64I;
  • Memory: Byte address from 0 to MSIZE–1. Address alignment is to 4 bytes in RV32I or 8 bytes in RV64I.

By convention, registers have ABI names when used for certain purposes:

  • zero (x0) always contains the hardwired value 0; writing to zero has no effect
  • ra, sp, gp, tp (x1, x2, x3, x4) for return address, stack pointer, global pointer, thread pointer
  • t0--t6 (x5--x7, x28--x31) for temporaries
  • s0--s11 (x8, x9, x18--x27) for saved values
  • a0--a7 (x10--x17) for arguments and return values

Instructions:

  • Basic compute instructions: arithmetic/logic/shift/compare
  • Memory access instructions: load/store
  • Control flow instructions: branch/jump
  • ...

Arithmetic/Logic Instructions:

  • <op> rd, rs1, rs2 for add, sub, and, or, xor;
  • <op> rd, rs1, imm for addi, andi, ori, xori, where imm is an immediate operand, a 12-bit signed (sign-extended) value.

Shift Instructions:

  • sll rd, rs1, rs2: shift left logical;
  • srl rd, rs1, rs2: shift right logical;
  • sra rd, rs1, rs2: shift right arithmetic;
  • slli/srli/srai rd, rs1, shamt: versions with immediate.

Comparison Instructions:

  • slt rd, rs1, rs2: If rs1 < rs2, rd = 1; otherwise rd = 0;
  • sltu/slti/sltiu: Versions for unsigned and immediate.

Pseudo-Instructions:

  • sltz rd, rs == slt rd, rs, zero
  • sgtz rd, rs == slt rd, zero, rs
  • mv rd, rs == addi rd, rs, 0
  • nop == addi zero, zero, 0

Data Transfer Instructions:

  • lw rd, offset(rs1): Load a 32-bit word;
  • sw rs2, offset(rs1): Store a 32-bit word.
  • lh/lhu/sh: Load/store a halfword (16-bit), sign/zero-extend the loaded value
  • lb/lbu/sb: Load/store a byte (8-bit). Addresses should be 4-aligned for lw/sw, 2-aligned for lh/sh, ...
  • lui rd, imm: Load upper immediate, rd = imm (20-bit) | 0 (12-bit)
  • auipc rd, imm: Add upper immediate to PC, rd = PC + (imm (20-bit) | 0 (12-bit))

Branch/Jump Instructions:

  • beq/bne/blt/bge/bltu/bgeu rs1, rs2, L: Branch if equal/not equal/less than/not less than/less than unsigned/not less than unsigned
  • jal rd, L: Jump and link. Store PC + 4 in rd; goto L
  • jalr rd, rs1, offset: Jump and link register. Store PC + 4 in rd; goto rs1 + offset

If-else:
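
A minimal sketch of how an if-else might compile (the register assignments a in t0, b in t1, c in t2 are assumed):

    # if (a == b) c = 1; else c = 2;
        bne  t0, t1, ELSE      # if a != b, go to the else branch
        addi t2, zero, 1       # then: c = 1
        j    DONE
    ELSE:
        addi t2, zero, 2       # else: c = 2
    DONE: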

Switch-case:
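
A sketch using a chain of branches (x in t0 and the result in t2 are assumed; a dense switch could instead use a jump table):

    # switch (x) { case 0: r = 10; break; case 1: r = 20; break; default: r = 0; }
        beq  t0, zero, CASE0
        addi t1, zero, 1
        beq  t0, t1, CASE1
        addi t2, zero, 0       # default: r = 0
        j    DONE
    CASE0:
        addi t2, zero, 10      # case 0: r = 10
        j    DONE
    CASE1:
        addi t2, zero, 20      # case 1: r = 20
    DONE: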

While/for loops:
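
A sketch of a while loop (assuming i in t0, n in t1, step in t2); a for loop compiles to the same pattern with explicit initialization and increment:

    # while (i < n) i = i + step;
    LOOP:
        bge  t0, t1, DONE      # exit when i >= n
        add  t0, t0, t2        # i = i + step
        j    LOOP
    DONE: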

Procedure Calls and RISC-V Calling Convention

  • Call: jump-and-link, jal ra, L
  • Procedure arguments: Before calling via jal, caller stores the first 8 arguments in a0 to a7; others are passed through the stack.
  • Return: jr ra
  • Return values: Before returning via jr, callee stores the first 2 return values in a0, a1; others are passed through the stack.

The problem: both the caller and the callee would like to use registers freely, so register values must be saved and restored around calls.

The calling convention clearly specifies caller/callee-saved registers.

  • Caller-saved: caller saves before the call, and restores after the call returns. ra, t0 -- t6, a0 -- a7
  • Callee-saved: callee saves before using, and restores before returning. sp, s0 -- s11 (fp is s0)

Stack Frame:

  • sp (stack pointer): points to current stack top
  • fp (frame pointer): points to current frame bottom
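
A sketch of a non-leaf function following this convention (the helper callee and the 16-byte frame size are illustrative):

    func:                       # argument in a0, result returned in a0
        addi sp, sp, -16        # prologue: allocate a stack frame
        sw   ra, 12(sp)         # save ra, since jal below will overwrite it
        sw   s0, 8(sp)          # save callee-saved s0 before using it
        mv   s0, a0             # keep the argument across the call
        jal  ra, helper         # call helper(a0); result comes back in a0
        add  a0, a0, s0         # combine the result with the saved argument
        lw   s0, 8(sp)          # epilogue: restore s0
        lw   ra, 12(sp)         # restore ra
        addi sp, sp, 16         # free the frame
        jr   ra                 # return to the caller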

RISC-V Encoding

All instructions are 32 bits long and aligned in memory. There are six instruction formats:

  • R-format: register-register operations
  • I-format: register-immediate operations, loads, jalr (12-bit imm, 11:0)
  • S-format: stores (12-bit imm, 11:0)
  • B-format: branches (12-bit imm, 12:1)
  • J-format: jumps (20-bit imm, 20:1)
  • U-format: upper immediate instructions (20-bit imm, 31:12)

SIMD and Vector ISAs

Single-instruction, multiple-data (SIMD): compute on a vector of numbers simultaneously, exploiting data-level parallelism.

Intel x86 SSE and AVX Extensions

Vector length keeps increasing.

Fixed total length can be split into various numbers of data types.

SIMD intrinsics: use C functions instead of inline assembly to call SSE/AVX instructions.

Function name: _mm<width>_[function]_[type], e.g., _mm256_add_ps adds two vectors of eight 32-bit floats.

A key drawback: vector length is hard-coded in instructions.

RISC-V Vector Extension

Like x86 SSE/AVX, the RISC-V vector extension exploits data-level parallelism, a.k.a., SIMD.

Allow for variable vector length in software application.

32 vector data registers: v0 to v31, each is VLEN bits long.

Vector length register vl: set the vector length in number of elements.

Vector configuration register vtype: LMUL, SEW, VLMAX, ...

  • LMUL: vector register grouping multiplier
  • SEW: selected element width
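
A minimal strip-mining sketch of c[i] = a[i] + b[i] over 32-bit elements (RVV 1.0 syntax; the register assignments n in a0 and pointers a, b, c in a1--a3 are assumptions):

    LOOP:
        vsetvli t0, a0, e32, m1, ta, ma   # t0 = number of elements handled this iteration
        vle32.v v0, (a1)                  # load a slice of a
        vle32.v v1, (a2)                  # load a slice of b
        vadd.vv v2, v0, v1                # element-wise add
        vse32.v v2, (a3)                  # store into c
        slli t1, t0, 2                    # bytes consumed = 4 * t0
        add  a1, a1, t1                   # advance the pointers
        add  a2, a2, t1
        add  a3, a3, t1
        sub  a0, a0, t0                   # remaining elements
        bnez a0, LOOP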

Advantages of Vector ISAs

  • Compact: a single instruction defines N operations
  • Parallel: N operations are (data) parallel
  • Expressive: various memory patterns
  • Compared to SSE/AVX: the ISA is agnostic to the hardware vector register length

Processor

Single-Cycle Processor

We will start simple, and later optimize for performance

  • Execute one instruction to completion before moving to the next one
  • Execute each instruction within one clock cycle, i.e., CPI = 1

We will focus on a subset of the RISC-V instructions

  • Arithmetic (R, I): add, sub, addi, ori
  • Memory (I, S): lw, sw
  • Branch / jump (B, J): beq, j

Generally speaking, two parts:

  • Datapath: the hardware that processes and stores data
  • Control: the hardware that manages the datapath

If we want to update state one instruction after another with CPI = 1, then the clock cycle time must be long enough to accommodate the slowest instruction.

Pipelined Processor

Basic Idea of Pipelining

For a \(k\)-stage pipeline,

\[f=\left(\frac{t_{\text{comb}}}{k}+t_{\text{reg}}\right)^{-1}. \]

Latency:

\[t_{\text{comb}}+ k \times t_{\text{reg}}. \]

Pipeline depth is limited by timing overhead of registers.

For \(N\) tasks of \(T\) time per task on a \(k\)-stage pipeline, the total time is

\[\frac{k-1}{k}T+\frac{N}{k}T. \]
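
For example, with \(T = 5\) ns of combinational logic, \(k = 5\) stages, and \(N = 100\) tasks (ignoring register overhead), the total time is \(\frac{4}{5}\times 5 + \frac{100}{5}\times 5 = 104\) ns, versus \(500\) ns unpipelined, roughly a \(4.8\times\) speedup.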

In reality, combinational logic cannot be divided perfectly evenly, so pipeline throughput is limited by the slowest stage.

5-Stage Pipeline

5 stages, one clock cycle per stage

  • IF (Instruction Fetch): fetch instruction from memory
  • ID (Instruction Decode): decode instruction & read register
  • EX (Execute): execute operation or calculate address
  • MEM (Memory): access memory
  • WB (Writeback): write result back to register

Pipelined Control Signals:

  • Used 1 cycle later in EX: ALUOp, ALUSrc
  • Used 2 cycles later in MEM: MemWrite, MemRead, Branch, Jump
  • Used 3 cycles later in WB: MemToReg, RegWrite

Stalls and Hazards

Unfortunately, if two instructions are dependent, we encounter pipeline stalls. A stall is a pipeline "bubble": the instruction does not move forward in that cycle.

Effective CPI = base CPI + stall cycles per instruction.

How to Stall:

  • Software stalls: the compiler inserts independent instructions or nop instructions;
  • Hardware stalls: instructions ahead of the stalled one continue to move forward as usual; the PC and the earlier pipeline registers are prevented from updating.

Pipeline stalls are caused by hazards:

  • Structural hazard: a required resource is busy and cannot be used in this cycle
  • Data hazard: must wait for previous dependent instructions to produce/consume data
  • Control hazard: the next PC depends on a previous instruction

Structural hazard

Register File Access: an earlier instruction's WB stage and a later instruction's ID stage both access the register file in the same cycle

Solution: separate read/write ports; half-cycle register access (write in the first half-cycle, read in the second)

Stage Bypassing: letting instructions skip unused stages would cause structural hazards

Memory Access: if instruction & data memories are not separate, the IF and MEM stages both access memory in the same cycle

Solution: replicate resources, i.e., use separate memories (caches), or use multiple ports

Multi-Cycle Instructions: instructions that are multi-cycle but not fully pipelined

Solution: make all units fully pipelined; replicate units; just let it stall

Data Hazard

Data Dependencies: Read after Write, Write after Write, Write after Read.

All dependencies must be respected, but pipelining may violate them.

In our 5-stage pipeline, only RAW hazards through registers exist.

Solution: Stall pipeline to delay reading instructions until data are available.

A better way is data forwarding (bypassing): add extra hardware paths so results are "forwarded" directly from where they are produced to the later instructions that need them, avoiding pipeline stalls.

The number of remaining stall cycles after forwarding corresponds to the stage distance between producer and consumer. In our 5-stage pipeline, only the load-use hazard still needs a 1-cycle stall.

Forwarding Datapath:

  • Destinations: all stages that really consume values: EX, MEM
  • Sources: stages after all stages that produce new values: MEM, WB
  • Forwarding sources are taken from the pipeline registers

Forwarding Control:

Whole picture:

Compilers can rearrange code to avoid load-use stalls, a.k.a., to fill the load delay slot.
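
For example (a sketch), the load-use pair below still stalls one cycle even with forwarding, and the compiler can fill the slot with an independent instruction:

    # with a 1-cycle load-use stall:
        lw   t0, 0(a0)
        add  t1, t0, t2        # must wait one cycle for the loaded value
        addi t3, t3, 1
    # reordered to fill the load delay slot:
        lw   t0, 0(a0)
        addi t3, t3, 1         # independent instruction fills the slot
        add  t1, t0, t2        # value is forwarded, no stall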

Control Hazard

We cannot fetch the next instruction before knowing its address (the next PC).

Possible next PC in RISC-V

  • Most instructions: PC + 4 (Calculated in IF stage)
  • Jumps: PC + offset
  • Branches: PC + 4 or PC + offset, after comparing registers

To resolve the branch/jump target early, we can move the branch/jump resolution point to the ID stage.

Now we only need a 1-cycle stall to obtain the next PC. Furthermore, we can simply guess that the next PC is PC + 4, and if the guess is wrong, flush the pipeline to nullify mispredicted instructions. Restarting wastes 1 cycle plus some extra penalty cycles.

Resolving branches in ID requires additional forwarding paths of register values to the ID stage: a 1-cycle stall if the value comes from the immediately preceding ALU operation, and a 2-cycle stall if it comes from a preceding load.

Exceptions and Interrupts

So far two ways to change control flow within user programs

  • Branch and jump
  • Call and return

Insufficient for changes in system states

  • Exception: internal "unexpected" events in the program itself
  • Interrupts: external events (e.g., I/O) that require processor handling

User programs cannot anticipate or always prepare for these events; they require help from the OS and the hardware.

OS, i.e., the privileged kernel, handles exceptions/interrupts

  • Processor stops the user program, saves current PC and the cause, and transfers
    control flow to the kernel
  • Exceptions/interrupts are processed by exception handler inside the kernel, to fix problems or handle events
  • Return to user program to continue (the current instruction or the next instruction); or abort if problems cannot be fixed

Handling Exceptions in RISC-V:

  • Save PC (SEPC)
  • Save the reason (SCAUSE)
  • Jump to the handler at a fixed address in kernel, e.g., 0x80000000
  • The handler reads the cause register to decide what to do:
    • If fixable, take corrective actions and use SEPC to return to user code
    • Otherwise, terminate program and report error using SEPC and SCAUSE

Alternate Mechanism: Vectored Interrupts (used in x86): separate handlers for different causes

Requirements of precise exceptions:

  • All previous instructions have completed
  • The offending instruction and all following instructions have not modified architectural state (they appear never to have started)

All in all, when exception/interrupt occurs in the pipeline:

  • Drain older instructions down the pipeline to complete
  • Nullify the current and younger instructions in earlier stages
  • Fetch from handler instruction address

Advanced Techniques

Branch Prediction

We need more powerful branch prediction! Two steps:

  • Predict a branch is taken or not-taken;
  • Predict the target address if taken.

Remember, with either result, after resolving the branch, if mispredicted, flush and restart.

To predict taken/not-taken, besides exploiting hints from programmers/compilers, we can do hardware-based dynamic prediction, i.e., based on recent history.

A hardware branch history table (BHT) records each branch's history. It uses only \(m\) bits of the PC as the index. Each entry maintains a saturating counter that summarizes the history and is used to predict the next direction.

  • Counter +1 for an actually taken branch (until max), −1 for not-taken (until min)
  • 1-bit states: 0: predict not-taken, 1: predict taken
  • 2-bit states: 00: strongly not-taken, 01: weakly not-taken, 10: weakly taken, 11: strongly taken

Essentially a tradeoff between robustness and adaptivity.

To predict target address, we can use a branch target buffer (BTB) that caches the target addresses of recently taken branches. Note that here we also store the rest of the address as a tag to differentiate aliasing cases.

Superscalar and Out-of-Order

Let \(T_1\) be the time to execute with one compute unit, and \(T_\infty\) be the time to execute with infinite compute units. Instruction-level parallelism,
\[\text{ILP} =\frac{T_1}{T_\infty}, \]
measures the inter-(in)dependency among instructions.

Scalar pipelines are limited by \(\text{IPC}_\text{max} = 1\), while superscalar pipelines can issue at most \(\text{IPC}_\text{max} = N\) for \(N\)-way superscalar. To implement superscalar, we need out-of-order (OoO) execution:

Stages:

  • Fetch: wide and speculative
  • Decode
  • Dispatch/rename/allocation: must be in-order
  • Issue/schedule: instructions wait in the instruction window (IW) between dispatch and execute; the scheduler decides which ready instruction to issue based on its priority.
  • Execute
  • Reorder: Re-Order Buffer (ROB): a FIFO for instruction tracking
  • Commit/retire/write-back: must be in-order. Instructions that have finished execution but have not committed are considered speculative and hold their updates to system state. Check the head of the ROB in each cycle.

Limitations of Modern Processors

Limitations:

  • Limited ILP in programs
  • Pipelining overheads
  • Frontend bottleneck
  • Memory inefficiency
  • Implementation complexity

Solutions:

  • Parallelism: multi-core
  • Better design: ASIC (application-specific integrated circuits)

Still, we face the data access challenge: as processors improve, memory performance and energy start to dominate.

Memory

Memory Hierarchy: the faster, smaller device at level \(k\) stores a subset of the data from the larger, slower device at level \(k+1\).

Caches

Caches are implemented as processor components, which hold recently referenced data by exploiting locality. Managed by hardware, and invisible to software.

In a cache, data is transferred in units of blocks, a.k.a., cachelines.

  • Cache hit: data found in the cache; serve with short latency
  • Cache miss: data not found in the cache; need to fetch the block from memory, and may replace a block in the cache

AMAT (average memory access time) = hit latency + miss rate \(\times\) miss penalty
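
For example, with a 1-cycle hit latency, a 5% miss rate, and a 100-cycle miss penalty,

\[\text{AMAT} = 1 + 0.05 \times 100 = 6\ \text{cycles}. \]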

Cache Organization

Fully Associative: a block can be put anywhere in the cache.

  • Pros: maximum utilization of the cache capacity, i.e., low miss rate
  • Cons: slow to look up a block (must check every location), i.e., high hit latency

Direct-Mapped: each block has only one determined location, according to its memory address.

  • Pros: fast to lookup, i.e., low hit latency
  • Cons: more conflicts between blocks, i.e., high miss rate

Set-Associative (\(N\)-way): each block can go to one of \(N\) entries in cache.

Cache access steps:

  • Decompose address bits into tag, index, offset
  • Use index bits to locate a set
  • For each entry in the set
    • Check if valid bit is 1
    • Compare if tag matches
  • If one such entry is found, it is a hit; return the data
    • Use the offset to select bytes within the block
  • If none is found, it is a miss; fetch the block from memory, return the data, and replace an existing block if needed

The 3Cs of Cache Misses

  • Compulsory/cold: the first time the item is referenced
  • Capacity: not enough room in the cache to hold all items
  • Conflict: item is replaced because of a location conflict with others

Design configurations:

  • Capacity: capacity misses
    • Thrashing: the data is scanned multiple times, but the cache capacity is insufficient
    • Solution: optimize code to fully work on each subset before the next
  • Associativity: conflict misses
    • Conflicts could happen quite often in real programs
    • Solution: use more complex hash functions to determine the set index
  • Block Size: compulsory misses

For the address fields, the bit widths follow directly from the cache geometry.
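
For example, for a (hypothetical) 32 KB, 4-way set-associative cache with 64 B blocks and 32-bit addresses:

\[\text{offset}=\log_2 64 = 6\ \text{bits},\quad \text{index}=\log_2\frac{32\,\text{KB}}{64\,\text{B}\times 4}=\log_2 128 = 7\ \text{bits},\quad \text{tag}=32-7-6=19\ \text{bits}. \]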

Write Policies

Write misses are less performance critical than read misses.

Write Hits:

  • Write-through: update both cache and memory
  • Write-back: write data only to the cache (mark the block dirty; write it back to memory on eviction)

Write Misses:

  • Write-allocate: load the block into cache, then write
    • Fetch-on-miss or not: do we fetch the rest of the block from memory?
  • No-write-allocate: bypass cache, write directly to memory
    • Write-around vs. write-invalidate: do we keep the data in cache if there is already an entry (e.g., due to reads)?

For write-through, we use a write buffer, a FIFO buffer between cache and memory. It can absorb small bursts, as long as the long-term rate of writing to the buffer does not exceed the maximum rate of writing to memory.

Typical Choices for Cache Write Policies:

  • Write-back + write-allocate + fetch-on-miss

Replacement Policies

On a miss, determine whether to evict an existing entry to make room for the newly requested block, and if so, which entry to evict (called victim).

Whether to evict

  • Normal: newly requested block replaces an existing old block
  • Bypass: newly requested block does not enter cache

Which to evict

  • For direct-mapped, only one choice
  • For set/fully associative, use a replacement policy

A replacement policy virtually keeps a rank (a priority list) of cache blocks. To define a new policy, we need to define an insertion policy (applied when a block is first inserted into the cache) and a promotion policy (applied when an access hits a block in the cache). Priorities range from MRU (most recently used) down to LRU (least recently used).

Recency-Based Policies:

  • LRU (Least Recently Used): insert/promote to MRU
  • LIP (LRU insertion policy): insert to LRU, promote to MRU
  • BIP (Bimodal insertion policy): combine LRU and LIP, insert with small probability at MRU, others at LRU
  • DIP (Dynamic Insertion Policy): dynamically select between LRU and BIP

How to implement DIP? A simple way is to use shadow/ghost tag arrays: maintain two tag arrays for LRU and BIP, respectively, to track their miss rates. Use the one with lower miss rate for actual cache replacement.

Another way is set dueling: dedicate some sets to different policies, and select the best-performing one for the remaining sets.

Frequency-Based Policies:

  • LFU (Least Frequently Used): choose the one used the least

    • LFU adapts poorly to pattern changes. Let's combine recency and frequency.
  • FBR (frequency-based replacement): Do not increment counters for recently referenced (i.e., high-locality) blocks

  • LRFU (least recently/frequently used): accumulate a weighted value \(F(x)\) for each past reference, where \(x\) is the distance from that reference to the current time

Other Policies:

  • Random: choose one at random; not necessarily a bad policy
  • Belady's optimal: choose the one used again furthest in the future

Multi-Level Caches

  • L1 caches attached to CPU cores. Split D-cache & I-cache (D & I). Focus on hit time.
  • L2 unified cache per core
  • L3 cache is shared across all cores, a.k.a., last-level cache (LLC). Focus on miss rate.

Inclusion

Inclusive caches: data in this level are always a subset of data in parent

  • Advantages: Only check the last level to know if a block is not cached in the entire hierarchy; Writeback is easy, eviction from children always hits in parent
  • Disadvantages: Data are replicated in both cache levels, lower utilization; Eviction in parent must also evict the block in children, a.k.a., recall

Exclusive caches: data in this level are NOT ALLOWED in parent

Non-inclusive caches: data in this level may be in parent

  • Save cache space, potentially higher hit rates: better performance
  • Significantly complicate management

Performance

AMAT for multi-level caches:

\[\begin{align*} \text{AMAT} =&\ \text{L1 hit latency} + \text{L1 miss rate} \times \text{AMAT of L2}\\ =&\ \text{L1 hit latency} + \text{L1 miss rate} \times (\text{L2 hit latency} + \text{L2 miss rate} \times \text{AMAT of L3})\\ =&\ \cdots \end{align*} \]

Local miss rate = # misses in this level / # accesses to this level

Global miss rate = # misses in this level / # accesses from processor.

So AMAT = L1 hit latency + L1 global miss rate × L2 hit latency + L2 global miss rate × L3 hit latency + ...
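
For example, with illustrative numbers (L1 hit = 1 cycle, L1 miss rate = 10%, L2 hit = 10 cycles, L2 local miss rate = 20%, memory = 100 cycles),

\[\text{AMAT} = 1 + 0.1\times(10 + 0.2\times 100) = 1 + 0.1 \times 30 = 4\ \text{cycles}. \]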

Overall, multi-level caches achieve a good balance between miss rate and hit time, i.e., a smooth, gradually changing spectrum of capacity vs. access speed, to capture different degrees of locality and dataset sizes.

Non-Blocking Caches

Cache needs to support multiple accesses

  • Allow for hits while serving a miss (hit-under-miss)
  • Allow for hits to pending misses (hit-to-miss) to avoid redundant memory fetches
  • Allow for more than one outstanding miss (miss-under-miss)

Miss Status Handling Register (MSHR) keeps track of each outstanding miss to one cache block, as well as pending loads & stores that refer to that block

  • Multiple misses to different blocks are tracked in different MSHRs
  • Multiple loads/stores to the same block are in one MSHR

Fields of an MSHR: valid bit, block address and multiple pending load/store entries.

Operations:

  • On a cache miss
    • Search MSHRs for pending access to this cache block
    • If found, just allocate a new load/store entry in that MSHR (hit-to-miss)
    • Otherwise allocate a new MSHR and its first load/store entry (miss-under-miss)
    • If no MSHR or load/store entry free, stall
  • By offloading state to MSHRs, the cache can continue to process other requests
  • When one word/sub-block of a cache block becomes available
    • Check which loads/stores are waiting for it, and forward data to/from core
    • Clear the load/store entry
  • When last word of a cache block becomes available
    • All load/store entries of this MSHR can be served
    • Clear MSHR

Data Prefetching

We want to predict and fetch data into cache before processors request it, so that we can reduce cold misses by exploiting spatial locality. Must have non-blocking caches!

Software prefetch instructions, e.g., GCC's __builtin_prefetch(), are micro-architecture-dependent, while hardware prefetchers are automatic.

Metrics:

  • Accuracy = (prefetched & accessed) / (all prefetched)
  • Coverage = (prefetched & accessed) / (all accessed)

Timeliness: prefetch requests should be issued early enough, but not too early

Resource contention: prefetching consumes bandwidth and capacity

Types of Hardware Prefetchers:

  • Stream Prefetching: prefetch N blocks ahead of the current access, only when detecting stream access patterns
  • Strided Prefetching: detect strided access patterns and prefetch accordingly
    • PC-based stride detection: maintain a table indexed by PC, each entry records the last address and the stride.
  • Temporal Correlation: learn the transition probabilities between cache blocks, and prefetch according to the most probable next blocks.
  • Spatial Correlation: learn the spatial patterns of accesses, and prefetch the correlated blocks together.

Main Memory

DRAM Organization

Main memory is physically separate from the processor chip (remember caches are on-chip), connected through memory channels.

DRAM is the predominant technology for main memory today. The key design goal of DRAM is to use high bandwidth to tolerate long latency.

Organization of DRAM-based main memory:

  • Multiple channels per CPU socket: fully independent access with separate physical links (channel bus)
  • Multiple ranks per channel: independent access inside chips, but share I/O (channel bus)
  • Multiple chips per rank: fully synchronous access across all chips in a rank
  • Multiple banks per chip: independent access inside banks, but share I/O (chip pins and internal bus)

Access granularity

  • Each access needs multiple bytes; usually goes to a single <channel, rank, bank>
  • The same bank in all chips of a rank contributes a subset of data
  • The bytes are transferred on the bus sequentially in multiple cycles

Channel:

  • Each 64-byte access needs 8 burst transfers in 4 cycles on 64-bit channel bus. DDR: double-data-rate, i.e., transfer on both clock edges, e.g., 400 MHz == 800 MT/s.
  • Channel-level parallelism: higher bandwidth and capacity. NUMA: non-uniform memory access, i.e., different channels have different access latencies.
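
Working out the numbers above: a 400 MHz DDR channel transfers 800 MT/s \(\times\) 8 B = 6.4 GB/s, and a 64 B access takes \(64/8 = 8\) burst transfers, i.e., 4 bus clock cycles.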

Rank:

  • Each access goes to one rank, selected using "Chip Select" signal. 64-bit data across all chips in a rank
  • Ranks are physically organized in DIMMs
  • Rank-level parallelism: Larger capacity
  • Multiple narrow-I/O chips are put together for a wide rank interface

Ranks vs. Banks vs. Chips

  • Banks and ranks are similar. They divide DRAM into multiple independent parts, allowing for overlapped accesses. The difference is just across vs. within chips.
  • Chips are to make I/O width wider. Purely a physical organization, without affecting logical hierarchy.

Inside a DRAM bank, data are stored in rows and columns. Each row corresponds to a wordline, and each column corresponds to \(w\) bitlines, where \(w\) is the I/O width of the DRAM chip.

Access sequence:

  1. Decode row address & drive wordlines
  2. Selected cell bits drive bitlines
  3. Entire row read into row buffer and amplify
  4. Decode column address & select columns
  5. Precharge bitlines for next access

5 basic commands

  • ACTIVATE (open a row)
  • READ (read columns in a burst)
  • WRITE (write columns in a burst)
  • PRECHARGE (close the row)
  • REFRESH

Overfetch: Accessing one cell must first activate the entire row to row buffer

  • Benefit: reduced area overheads of row/column logic w.r.t. dense cell arrays
  • Downside: higher latency and energy; but can amortize if there is good locality

Row hits save latency and energy, while row conflicts incur extra latency and energy overheads.

DRAM I/O is clocked faster than DRAM arrays, while internal access from arrays is wider than I/O width.

Overview of Parallelism:

  • Multiple channels per CPU socket: Support fully concurrent access, separate address/command/data
  • Multiple ranks per channel: Support overlapped access, but share address/command/data
  • Multiple chips per rank: Support lock-step access, share address/command, separate data
  • Multiple banks per chip: Support overlapped access, but share address/command/data

A 64 B access over the I/O = 8 bursts × 8 chips × 8 bits per chip per burst (the array width).

DRAM Management

Memory Controller (MC) manages DRAM access from (inside) processors. Functionality:

  • Translate commands: loads/stores → ACTIVATE/READ/WRITE/PRECHARGE/…
  • Enforce access timing constraints
  • Decide address mapping: address → (channel, bank, row, col, …)
  • Schedule access requests and manage row buffers
  • Manage DRAM refresh
  • Manage power modes, e.g., entering low-power

Latency of DRAM access:

  1. CPU → memory controller transfer time
  2. Controller latency
    • Convert to basic commands, queuing & scheduling delay
  3. DRAM bank latency
    • Hit an open row: tCAS, i.e., RD/WR
      • tCAS = time of column access strobe
    • Access a closed row with the row buffer empty: tRCD + tCAS, i.e., ACT + RD/WR
      • tRCD = row-to-column command delay
    • Access a conflicting row (another row is in the row buffer): tRP + tRCD + tCAS, i.e., PRE + ACT + RD/WR (see the example after this list)
      • tRP = time to precharge the DRAM array
  4. DRAM data transfer time = channel width * burst length / bandwidth
  5. Memory controller → CPU transfer time
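
For example, with illustrative timings \(t_{\text{RP}} = t_{\text{RCD}} = t_{\text{CAS}} = 15\) ns, the bank latency is 15 ns for a row hit, 30 ns for an access to a closed row, and 45 ns for a row conflict, on top of the controller and transfer times.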

Physical addresses are mapped to <channel x, rank y, bank z, row r, column c>.

For sequential accesses, we want to maximize row hits and bank-level parallelism. So cacheline interleaving is preferred. For strided accesses, row interleaving is better.

Row Buffer policies:

  • Open-page policy: Expect next access to hit, leave row open after access (do not PRECHARGE)
    • Pros: if next access is a row hit, no need to ACTIVATE
    • Cons: if next access is not a row hit, extra latency to PRECHARGE
  • Closed-page policy: Expect next access to conflict, immediately PRECHARGE
    • Pros: if next access is a row conflict, save PRECHARGE latency
    • Cons: if next access is a row hit, waste PRECHARGE/ACTIVATE latency and energy
  • Adaptive policy: predict whether next access will hit or not

Scheduling Policies:

  • FCFS (first come, first served)
  • FR-FCFS (first ready, first come, first served)
    • Row-hit first, then oldest first
    • Goal: maximize row buffer hits

The charge on a DRAM cell capacitor leaks over time, so each cell must be touched periodically to restore the charge. Refresh = ACTIVATE + PRECHARGE each row every refresh window (typically 64 ms).

Between Cache and Main Memory

Cache Coherence

Copies of shared data in private caches can become stale & incoherent! We need cache coherence protocols to ensure that all caches have a consistent view of memory.

Basic rule of coherence: single-writer, multiple-readers (SWMR)

  • All caches can have copies of data while read-shared
  • On a write, invalidate all other copies, and leave a single writer

A 3-State Coherence Protocol (MSI):

  • M(odified)
    • One cache (single writer) has valid and latest copy, and can write
    • Memory copy is stale
  • S(hared)
    • One or more caches (multiple readers) and memory have valid copy
  • I(nvalid)
    • Not present

State transitions:

A core's own read/write triggers an upgrade transition, while another core's read/write triggers a downgrade transition.

Downgrading from M (to S or I) causes writeback.

Implementation

In each cache entry, extend the valid bit to represent 3 states: M, S, or I

Each private cache needs to be told the actions of others, how?

  • Choice 1: snooping
    • All caches broadcast their misses through the interconnect
    • Each cache “snoops” all events from all cores
  • Choice 2: directory-based
    • Use a directory to track the state of each block, e.g., who are the sharers
    • The directory tells the sharers, i.e., those who have the block and need to respond, when necessary
    • Benefits: reduce messages (to only relevant sharers)

In both cases, different blocks take independent coherence actions.

False Sharing: different memory locations mapped to the same cache block are accessed by different cores. This is an artifactual effect due to the granularity of coherence tracking. It can be reduced by software techniques, e.g., padding.

On-Chip Networks

Banked Caches

Large (shared) caches are usually implemented with banking. Each bank is like an individual cache, but smaller and placed close to a core. Note that each cache block can reside in only one of the banks at any time. Benefits: lower hit latency, more parallelism.

Mapping policy: block address → bank index

  • Static: block address decides which bank the block should be stored in. E.g., bank ID is taken from lower bits in block address
  • Dynamic: a block is preferred to be placed in the bank closer to the core with the most accesses to it, and may also be migrated between banks

Banking and set-associative are similar ideas at different levels!

The idea of On-Chip Networks

With many cores and their private/shared caches, we need a network to transport data blocks and coherence messages between them

On-chip network, a.k.a., network-on-chip (NoC)

  • Topology: how to connect the nodes?
  • Routing: which path should a message take?
  • Flow control: how to control actual message transport?
  • Router microarchitecture: how to build the router

Transport levels:

  • Message: high-level data. Software or system level, arbitrary size
  • Packet: network-level object. Variable size, but bounded
  • Flit: switch-level object. Fixed size, unit of flow control
  • Phit: link-level object. Fixed size, usually the same size as flits, unit of data transferred on link per cycle

Zero-load latency = header latency \(T_h\) + serialization latency \(T_s\)

  • Header latency \(T_h=H\times (t_r+t_l)\), where \(H\) is the number of hops, \(t_r\) is the router delay per hop, and \(t_l\) is the link delay per hop
  • Serialization latency \(T_s= \frac{L}{B}\), where \(L\) is the packet size, and \(B\) is the link bandwidth
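
For example (assumed values): a 3-hop route with \(t_r = 2\) cycles and \(t_l = 1\) cycle, a 128-bit packet, and 32-bit links gives \(T_h = 3\times(2+1)=9\) and \(T_s = 128/32 = 4\), so the zero-load latency is 13 cycles.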

Real latency = zero-load latency + queuing delay. The latter can largely dominate!

Topology

Common topology examples include the bus, ring, mesh, torus, tree, and crossbar.

Basic definitions:

  • Routing distance: number of hops on a route from source to destination
  • Diameter: maximum routing distance
  • A network is partitioned by a set of links if their removal disconnects the graph
  • Bisection bandwidth: bandwidth crossing a minimal cut that partitions the network into two equal-sized halves

Routing

Routing chooses path that packets should follow to get from src to dst.

Properties

  • Deterministic: always select the same path every time
  • Minimal: only select a shortest path
  • Oblivious: route is decided without considering current network state such as traffic congestion
  • Deterministic routing is oblivious; oblivious routing may not be deterministic (Q: Example?)
  • Adaptive (the opposite of oblivious): route is influenced by current network state, e.g., choosing a less congested route
  • Source: entire route is determined at the source
  • Incremental (the opposite of source): the route is incrementally determined at each hop

Dimension-Order Routing for Mesh/Torus: resolve each dimension in a fixed order. Minimal, oblivious, and deterministic (with tie-breaking for torus).

Adaptive Routing for Mesh:

  • Minimal adaptive: at each hop, choose the output with the lowest load along a minimal path
  • Fully adaptive: choose the output with the lowest load at each hop, taking any path necessary. May encounter livelock!

Flow Control

Flow control allocates resources for packets to traverse along the route. Contention between packets causes resource conflicts, so we need flow control to manage resource allocation.

Bufferless protocols:

  • Dropping: simply drop the packet if it loses resource arbitration
  • Misrouting: intentionally route the losing packet away
  • Circuit-switching: before transmitting, use a request to set up a path and reserve all links. Then send data through the reserved path. Tail flit releases resource.

Buffered protocols allow packets to temporarily wait inside intermediate routers if encountering contention

Packet granularity: allocate buffer and link in units of packets

  • Store-and-forward: the entire packet is stored in the router before forwarding
  • Cut-through: start forwarding as soon as the header arrives

Flit granularity: allocate buffer and link in units of flits

  • Wormhole: like cut-through, but with buffer space allocated to flits
    • Head-of-line blocking when waiting in the same buffer
  • Virtual-channel: provision multiple virtual channels per physical port

If the downstream router has no buffer space, backpressure informs the upstream router to stop sending the next packet/flit. How to implement backpressure?

  • Credit-based flow control: each upstream router keeps a counter of available buffer slots in the downstream router. Decrement on sending, increment on receiving a credit.

Deadlock and Avoidance

Resource dependency: a resource \(R_i\) is dependent on a resource \(R_j\) if it is
possible for \(R_i\) to be held by an agent \(A\) and for \(A\) to wait for \(R_j\), denoted
as \(R_i \succ R_j\).

The cause of deadlock is cyclic resource dependency. We avoid deadlock by eliminating all cycles.

Dimension-order routing:

  • XY routing / YX routing: first resolve X / Y dimension, then Y / X dimension
  • Six-turn model: At least one turn in each cycle must be disallowed to avoid cycles


Another way is resource ordering, i.e., impose a partial order on resources, and then
enforce allocating resource in non-descending order.

Example: when going from a router to the next, require VC ID to be non-descending.

Virtualization

Take physical resources (e.g., memory, processors) and transform them into a more general, flexible, and easy-to-use virtual form.

An OS supports multiple users and multiple programs on one piece of hardware through the abstraction of processes. It schedules processes onto hardware cores at different times and provides a private virtual address space to each process.

Page Table

Each process has its own virtual address space, whose size is determined by the ISA. The hardware system has a single physical address space.

The page table translates virtual addresses to physical addresses, in units of pages.

  • Each process has its own page table, managed by OS, used by hardware
  • Page table is stored in main memory, its base address is kept in a special register

One page table entry (PTE) per virtual page includes: valid bit, physical page number (a.k.a., frame number), and metadata (e.g., R/W/X permissions).


For a process, usually only a subset of its virtual pages are allocated, and only a subset of allocated virtual pages are mapped. Allocated but unmapped pages are swapped out to external storage, e.g., disks or SSDs.

DRAM acts like a software-managed cache of disks/SSDs. The page table, managed by the OS, always knows the data location:

  • Allocated and mapped: page hit in DRAM
  • Allocated but not mapped (swapped-out): page miss, need to fetch from disk/SSD
  • Unallocated: invalid access

Demand paging: if a page fault occurs, the page is paged in on demand, by the page fault handler in OS kernel.

Translation Process:

  • Hardware looks up page table using virtual address
  • If it is a valid page in memory
    • Check access permissions (R, W, X) against access type
    • If allowed, translate to physical address and access memory
    • Otherwise, generate a permission fault, i.e., an exception, handled by OS
  • Otherwise, page is not currently in memory
    • Generate a page fault, i.e., an exception, handled by OS
    • If it is due to program errors, e.g., unallocated: terminate process
    • If page is on external storage: refill & retry, i.e., demand paging

Translation Alias:

  • Synonym: a process may use different virtual addresses to point to the same physical address
  • Page sharing: different processes can (read-only) share a page by setting virtual addresses to point to the same physical address
  • Homonym: different processes can use the same virtual address, but actually translated to different physical addresses


Multi-Level Page Tables

The page table is sparse: only a small fraction of virtual pages are actually allocated.

We can use a hierarchical page table structure, i.e., a (sparse) radix tree. Only top level must be resident in memory, remaining levels can be in memory or on disk, or unallocated if corresponding ranges of the virtual address space are not used.

  • Advantage: save significant page table size
  • Disadvantage: a translation may require multiple memory accesses, or even page faults if lower levels are on disk (slow)


Translation Look-aside Buffer

TLB: a hardware cache just for translation entries, i.e., PTEs.

Each TLB entry stores a page table entry (PTE) as its data, plus the metadata of the TLB entry itself (e.g., valid bit and tag).


If we miss in TLB, access PTE in memory. This is called page table walk.

  • If page is in memory, copy PTE into TLB and retry
  • If page is not in memory, raise a page fault first, then fill in TLB


When multiple processes share a processor, the OS must flush the TLB entries at context switch. Alternatively, add a process ID (PID) in each TLB entry.

The capacity of TLB is usually small. We can use multi-level TLBs, or support multiple page sizes to improve TLB reach.

Combined with Caches:


How to parallelize TLB and cache access?

Virtual Caches

Virtually indexed, virtually tagged:

  • Issue 1: homonym (same virtual address, different physical addresses). Can be resolved by flushing cache on context switch or adding process ID to tags

  • Issue 2: synonym (different virtual addresses, same physical address). Two copies co-exist in cache, which should be the same (physical) data. But writes to one copy will not be reflected in the other copy. This is a coherence issue in a single cache, rather than across multiple private caches, which is hard to resolve.

Instead, we choose virtually indexed, physically tagged (VIPT) caches:


We want the cache lookup index to use only bits from the page offset, that is, cache size / associativity (the size of one way) \(\le\) page size.

With this requirement, VIPT caches are functionally equivalent to PIPT caches.
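
For example, with 4 KB pages (12 offset bits) and 64 B blocks, a 32 KB 8-way L1 has \(32\,\text{KB}/8 = 4\,\text{KB}\) per way, so the 6 block-offset bits plus 6 index bits fit entirely within the page offset; a 32 KB 4-way design (8 KB per way) would violate the constraint.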

Virtual Machine

How about virtualize an entire physical machine?

Virtual machine techniques honor existing hardware interface to create virtual machines.

Key requirements

  • Fidelity: equivalence of interface as real physical machine
  • Performance (efficiency): minimal overheads
  • Safety (resource control): isolation among VMs; complete control of resources

Terminology

  • Host: the underlying physical hardware system
  • Guest: each VM, i.e., a virtual instance of the host
  • Virtual machine monitor (VMM), a.k.a., hypervisor: the layer of software that supports virtualization


VMM Types:

  • Type 0 hypervisor: hardware-based solutions; hypervisor in firmware
  • Type 1 hypervisor: OS-like software; "the datacenter OS"
  • Type 2 hypervisor: simply a process on host OS

Advantages:

  • Easy development and testing
  • Server consolidation: improve utilization
  • Live migration: for elastic scale-down and load balance
  • Security: strong isolation between VMs even on the same machine

VM vs. hypervisor is similar to process vs. OS. Consider: Time multiplexing, e.g., for processor cores; Resource partitioning, e.g., for physical memory, disks; Mediating hardware interface, e.g., for networking, keyboard/mouse

Compatibility Options:

  • Paravirtualization
    • Whether running within a hypervisor or directly on host is transparent to apps but not guest OS, i.e., guest OS needs to be modified
    • Advantage: simpler to implement; smaller performance overheads
    • Disadvantage: cannot be used for arbitrary OSes, e.g., closed-source Windows
  • Full virtualization
    • Transparent to both guest OS and apps
    • Advantage: work without any changes
    • Disadvantage: performance overheads must be paid to address issues

Privilege Mode

The VMM runs in kernel mode with high privilege. The physical user mode is divided into virtual user mode and virtual kernel mode. Guest apps run in virtual user mode and the guest OS runs in virtual kernel mode.

Trap-and-Emulate

  • When the guest OS (running in physical user mode) tries to execute a privileged instruction, it traps into physical kernel mode, i.e., the VMM.
  • VMM executes (i.e., emulate) the actions on behalf of guest OS.
  • VMM returns to guest OS and updates guest OS states
  • Guest OS returns to guest app

Trap-and-emulate requires that all sensitive instructions be privileged. However, this is not true for some architectures, e.g., x86.

  • Solution 1: use paravirtualization
  • Solution 2: binary translation: the VMM examines every instruction and dynamically rewrites the binary to translate sensitive but unprivileged instructions into emulation code
  • Solution 3: hardware support: new operating modes, e.g., VT root mode for the VMM and non-root mode for the guest OS, plus new instructions to transition between modes.

Address Translation

Guest virtual address (gVA) → guest physical address (gPA) == Host virtual address (hVA) → host physical address (hPA)

Nested page tables, a.k.a., extended page tables in Intel

  • Use page tables in guest OS to translate gVA → gPA
  • Use page table in VMM to translate gPA → hPA

TLB entries directly cache gVA → hPA. Tagged TLB with VM IDs, so no need to flush TLB on VM transitions.

Page table walk becomes nested 2D! Support 2D page table walk in hardware.


Shadow Page Tables: map gVA → hPA directly

  • The VMM creates and manages them
  • One shadow page table per guest app
  • Used by hardware

The VMM needs to keep the shadow page table consistent with the page table in the guest OS.

  • Advantages: fast page table walk, require little hardware support
  • Disadvantages: more traps for page table updates, storage overheads

