Computer Architecture: A Quantitative Approach

Technical Books

On Hold

My notes & review of Computer Architecture: A Quantitative Approach by John L. Hennessy, David A. Patterson, Christos Kozyrakis

Author

Tyler Hillery

Published

May 21, 2026

Notes

Chapter 1: Fundamentals of Quantitative Design and Analysis

1.1. Introduction

Question❓

The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology led to a higher rate of performance improvement—roughly 35% growth per year.

What is the official definition of “microprocessor”? Need to clear up my mental model of transistors, integrated circuits and microprocessors

Answer: After some research, here are some notes to clear things up:

Transistors, in my mind, are basically “switches”: they can be on or off, and they are switched by applying a voltage.
Multiple transistors can be combined to create logic gates such as AND, OR, NOT, NAND, NOR, XOR, XNOR
Logic gates can be combined to form higher-level components such as adders and flip-flops
These components can be combined into units such as ALUs and registers, which together form a CPU (seems microprocessor and CPU tend to be used interchangeably?)
These systems are implemented on a single piece of silicon as an integrated circuit (IC)
A wafer is a large circular slice of silicon used to manufacture many chips at once
A die is a single rectangular piece cut from the wafer, and typically contains one IC (1:1 mapping)
Modern designs may use chiplets, where multiple smaller dies (each an IC or part of one) are combined into a single package
A microprocessor is an integrated circuit, but not all integrated circuits are microprocessors. Examples of other ICs: Microcontrollers, Memory ICs (DDR), Amplifiers, Voltage Regulators, Application-Specific Integrated Circuits (ASICs).
The term “packaged” means taking the die(s) and putting them in an enclosure that can then be placed into a socket (unless it is soldered directly onto the board).
Chiplets are smaller dies that are integrated into a single package instead of one larger die.
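To make the gates-to-adders step concrete, here is a minimal sketch that treats NAND as the only primitive and builds everything else on top of it. The function names (`nand`, `full_adder`, `ripple_add`) and the 8-bit width are my own illustrative choices, and real hardware is of course wired in parallel rather than evaluated as function calls.

```python
def nand(a: int, b: int) -> int:
    """The only primitive here; inputs and output are 0 or 1."""
    return 0 if (a and b) else 1

# Every other gate below is built purely from NAND.
def not_(a: int) -> int:
    return nand(a, a)

def and_(a: int, b: int) -> int:
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))

def xor(a: int, b: int) -> int:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def full_adder(a: int, b: int, carry_in: int):
    """Add three bits and return (sum_bit, carry_out)."""
    partial = xor(a, b)
    sum_bit = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return sum_bit, carry_out

def ripple_add(x: int, y: int, width: int = 8) -> int:
    """Chain full adders bit by bit; the final carry is dropped, so it wraps."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result

print(ripple_add(5, 3))
```

An adder like this is one of the building blocks of an ALU, which is the next step up in the hierarchy above.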

Question❓

First, the virtual elimination of assembly language programming reduced the need for object-code compatibility.

Why do they call this “object code”? I thought the term for this was “machine code”.

Question❓

These changes made it possible to successfully develop a new set of architectures with simpler instructions, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-based machines focused the attention of designers on two critical performance techniques: the exploitation of instruction-level parallelism (ILP; initially through pipelining and later through multiple instruction issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations).

What is the difference between RISC and RISC-V? What were the architectures before RISC called?

Question❓

In 2003, the limits of power due to the end of Dennard scaling and the available instruction-level parallelism slowed uniprocessor performance to 23% per year until 2011, or doubling every 3.5 years.

What is “Dennard” scaling?

Answer: This is answered in a few paragraphs later.

In 1974, Robert Dennard observed that power density was constant for a given area of silicon even as you increased the number of transistors because of smaller dimensions of each transistor. Remarkably, transistors could go faster but use less power. Dennard scaling ended around 2004 because current and voltage couldn’t keep dropping and still maintain the dependability of integrated circuits.

Question❓

From 2011 to 2015, the annual improvement was less than 12% (doubling every 6 years) in part due to the limits of parallelism of Amdahl’s Law.

What are the limits of parallelism of Amdahl’s law?

Answer: This is answered in a few paragraphs later.

Amdahl’s Law (Section 1.10) prescribes practical limits to the number of useful cores per chip. If 10% of the task is serial, then the maximum performance benefit from parallelism is 10, no matter how many cores you put on the chip.
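A quick sketch of why the limit is 10 when 10% of the task is serial. The function name `amdahl_speedup` and the core counts are illustrative, and it assumes the parallel part scales perfectly across cores, which real programs rarely do.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Speedup over one core when only the parallel part scales with cores."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# With 10% serial work, adding cores gives diminishing returns.
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.10, cores), 2))
```

However large `cores` gets, the serial 10% still takes the same time, so the result approaches but never reaches 1 / 0.10 = 10.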

Moore’s Law dates to 1965, when Gordon Moore predicted the number of transistors per chip would double every year, which he amended in 1975 to every 2 years. The prediction held for about 50 years. The main reasons processor performance improvement has slowed are:
- transistors no longer getting much better because of the slowing of Moore’s Law and the end of Dennard scaling
- the unchanging power budgets of microprocessors
- the replacement of the single power-hungry processor with several energy-efficient processors
- the limits to multiprocessing given Amdahl’s Law

1.2. Classes of Computers

A real-time performance requirement means a segment of the application has an absolute maximum execution time.
Soft real-time is when the average time for a particular task is constrained, as well as the number of instances when some maximum time is exceeded.
Two kinds of parallelism in applications:
- Data-Level Parallelism (DLP) arises because there are many data items that can be operated on at the same time.
- Task-Level Parallelism (TLP) arises because tasks of work are created that can operate independently and largely in parallel.
Computer hardware can exploit these two kinds of application parallelism in four ways:
- Instruction Level Parallelism (ILP) exploits DLP at modest levels with compiler help using ideas like pipelining and speculative execution
- Vector architectures, graphics processing units (GPUs), and multimedia instruction sets exploit DLP by applying a single instruction to a collection of data in parallel
- Thread Level Parallelism exploits either DLP or task-level parallelism in a tightly coupled hardware model that allows for interaction between parallel threads
- Request Level Parallelism (RLP) exploits parallelism among largely decoupled tasks specified by the programmer or the operating system
Single instruction stream, single data stream (SISD): Standard sequential computer, but still can exploit ILP using superscalar and speculative execution.
Single instruction stream, multiple data streams (SIMD): The same instruction is executed by multiple processors using different data streams. SIMD computers exploit DLP by applying the same operations to multiple items of data in parallel.
Multiple instruction streams, multiple data streams (MIMD): Each processor fetches its own instructions and operates on its own data, and it targets TLP. MIMD is more flexible than SIMD but it is more expensive. MIMD can also exploit DLP, although the overhead is likely to be higher than SIMD.

1.3. Defining Computer Architecture

Instruction set architecture (ISA) refers to the actual programmer-visible instruction set. Examples include 80x86, ARMv6, and RISC-V.

Aside

The most popular RISC processors come from ARM (Advanced RISC Machine), which were in 30 billion chips shipped in 2023, or more than 100 times as many chips that shipped with 80x86 processors (often abbreviated x86).

Wow, I had no idea ARM was that much more popular than x86.

Question❓

Developed 30 years after the first RISC instruction sets, RISC-V inherits its ancestors’ good ideas—a large set of registers, easy-to-pipeline instructions, and a lean set of operations—while avoiding their omissions or mistakes.

I’d be curious to know, why are the above ideas: large set of registers, easy-to-pipeline instructions and lean set of operations considered good? What does it mean to pipeline instructions?

x86 is considered to be a register-memory ISA
ARMv8 and RISC-V and all ISAs since 1985 are load-store

1.4. Trends in Technology

Integrated circuit processes are characterized by feature size, which is the minimum size of a transistor or wire in either the x or y dimension.
Transistor count per square millimeter of silicon is determined by the surface area of a transistor, so transistor density increases quadratically with a linear decrease in feature size.
The shrink in vertical dimension requires a reduction in operating voltage to maintain correct operation and reliability of transistors.
While transistors generally improve in performance with decreased feature size, wires in an IC do not. In particular the signal delay for a wire increases in proportion to the product of its resistance and capacitance.
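The quadratic relationship between feature size and density is easy to check with a small sketch. The function name `density_gain` is made up for illustration, and treating a process node name as a literal feature size is an assumption that does not hold for real modern processes.

```python
def density_gain(old_feature_nm: float, new_feature_nm: float) -> float:
    """Ideal transistor density gain when feature size shrinks linearly."""
    linear_shrink = old_feature_nm / new_feature_nm
    # A transistor occupies area, so a shrink applies in both x and y.
    return linear_shrink ** 2

print(density_gain(10.0, 5.0))  # halving the feature size
```

Halving the feature size gives four times as many transistors in the same area, not twice as many.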

1.5. Trends in Power and Energy in ICs

Sustained power consumption is widely called thermal design power (TDP). TDP is neither the peak power nor the actual average.

Aside

The third factor designers and users should consider is energy and energy efficiency. Recall that power is simply energy per unit time: 1 W = 1 joule per second. Which metric is right for comparing processors: energy or power? In general, energy is always a better metric because it is tied to a specific task and the time required for that task. In particular, the energy to complete a workload is equal to the average power times the execution time for the workload.

This reminded me of a great video, Power is not energy: why the difference matters.
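A small sketch of the "energy = average power × execution time" point from the quote. The two processors and their numbers are entirely made up for illustration.

```python
def energy_joules(average_power_watts: float, execution_time_seconds: float) -> float:
    """Energy to finish a workload: average power times execution time."""
    return average_power_watts * execution_time_seconds

# Hypothetical processors running the same workload.
energy_a = energy_joules(120.0, 7.0)   # draws more power, finishes sooner
energy_b = energy_joules(100.0, 10.0)  # draws less power, takes longer

print(energy_a, energy_b)
```

Processor A looks worse if you only compare power, but it uses less energy for the task because it finishes sooner.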

Aside

namely, the cost of a mask set. Each step in the integrated circuit process requires a separate mask. Therefore, for modern high-density fabrication processes with up to 10 metal layers, mask costs are about $10 million for 7 nm and $20 million for 5 nm.

Holy shit I had no idea how expensive the R&D is for these chips. The figure they end up showing puts the design cost for 5 nm at $542.2M.

1.10. Quantitative Principles of Computer Design

The basic idea behind pipelining is to overlap instruction execution to reduce the total time to complete an instruction sequence.
A program spends 90% of its execution time in only 10% of the code.
Temporal locality states recently accessed items are likely to be accessed soon.
Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.
Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
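Written as a formula, with $F$ the fraction of execution time that can use the faster mode and $S$ the speedup of that mode (the symbol names are illustrative labels):

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - F) + \frac{F}{S}}$$

For example, with $F = 0.4$ and $S = 10$, the overall speedup is $1 / (0.6 + 0.04) \approx 1.56$, and even as $S \to \infty$ it cannot exceed $1 / (1 - F) \approx 1.67$.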

1.12. Fallacies and Pitfalls

Dennard scaling held that voltage and current should scale in proportion to the linear dimensions of a transistor. Stated differently, as CMOS tech scaled, circuits could use more transistors and operate at higher frequencies while keeping power consumption constant. Dennard scaling ended 30 years later because threshold voltage and leakage current set a nonscaling baseline for power consumption for each transistor.
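A rough back-of-the-envelope version of this, assuming the classic constant-field scaling rules, where a scaling factor $k > 1$ shrinks linear dimensions, voltage $V$, and capacitance $C$ by $1/k$ while frequency $f$ rises by $k$ (the symbols are illustrative labels):

$$P_{\text{transistor}} \propto C V^2 f \;\rightarrow\; \frac{C}{k} \cdot \frac{V^2}{k^2} \cdot k f = \frac{C V^2 f}{k^2}$$

$$\text{Power density} \propto k^2 \cdot \frac{C V^2 f}{k^2} = C V^2 f$$

The $k^2$ in the second line is the number of transistors that now fit in the same area, so the two effects cancel and power density stays constant. Once voltage can no longer drop, the $V^2 / k^2$ term stops shrinking and the cancellation breaks.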

Aside

The Sun Microsystems Division of Oracle experienced this pitfall in 2000 with an L2 cache that included parity, but not error correction, in its Sun E3000 to Sun E10000 systems. The SRAMs they used to build the caches had intermittent faults, which parity detected. If the data in the cache were not modified, the processor would simply reread the data from the cache. Because the designers did not protect the cache with ECC (error-correcting code), the operating system had no choice but to report an error to dirty data and crash the program. Field engineers found no problems on inspection in more than 90% of the cases. To reduce the frequency of such errors, Sun modified the Solaris operating system to “scrub” the cache by having a process that proactively wrote dirty data to memory. Because the processor chips did not have enough pins to add ECC, the only hardware option for dirty data was to duplicate the external cache, using the copy without the parity error to correct the error. The pitfall is in detecting faults without providing a mechanism to correct them. These engineers are unlikely to design another computer without ECC on external caches.

This error is commonly brought up on Oxide and Friends, the best episode for reference is probably: A Requiem for SPARC with Tom Lyon

Chapter 2: Memory Hierarchy Design

2.1. Introduction

Multiple words are called a block or line
Each cache block includes a tag to indicate which memory address it corresponds to
Set associative: a set is a group of blocks in the cache. The set is chosen by the address of the data: $(\text{Block address}) \bmod (\text{Number of sets in cache})$
n-way set associative is where there are n blocks in a set.
Direct-mapped cache has just one block per set.
Fully associative cache has just one set.
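A minimal sketch of how an address maps onto a set, and how direct-mapped and fully associative fall out as the two extremes of set associativity. The function names and the 32 KiB / 64-byte geometry are assumptions chosen for illustration.

```python
def number_of_sets(cache_bytes: int, block_bytes: int, ways: int) -> int:
    """Sets = total blocks / blocks per set (the associativity)."""
    return cache_bytes // (block_bytes * ways)

def split_address(address: int, block_bytes: int, num_sets: int):
    """Return (tag, set_index, block_offset) for a byte address."""
    block_address = address // block_bytes
    block_offset = address % block_bytes
    set_index = block_address % num_sets   # (Block address) MOD (number of sets)
    tag = block_address // num_sets        # what is left identifies the block
    return tag, set_index, block_offset

CACHE_BYTES, BLOCK_BYTES = 32 * 1024, 64   # 512 blocks in total

print(number_of_sets(CACHE_BYTES, BLOCK_BYTES, ways=1))    # direct-mapped
print(number_of_sets(CACHE_BYTES, BLOCK_BYTES, ways=8))    # 8-way set associative
print(number_of_sets(CACHE_BYTES, BLOCK_BYTES, ways=512))  # fully associative
print(split_address(0x12345, BLOCK_BYTES, num_sets=64))
```

With one way there are as many sets as blocks, and with as many ways as blocks there is a single set, so the set index carries no information and the tag has to do all the work.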
Both write strategies, write-through and write-back, can use a write buffer to allow the cache to proceed as soon as the data are placed in the buffer rather than wait for the full latency to write the data into memory.
If the write buffer contains other modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry. This optimization is called write merging.
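Write merging can be sketched as a buffer keyed by block address. The `WriteBuffer` class, the four-words-per-entry size, and the use of word addresses are all assumptions for illustration, and draining entries to memory is left out.

```python
WORDS_PER_ENTRY = 4  # assumption: each buffer entry covers four consecutive words

class WriteBuffer:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.entries = {}  # block address -> {word offset: value}

    def write(self, word_address: int, value: int) -> bool:
        """Buffer a write; return False if the buffer is full (a stall)."""
        block, offset = divmod(word_address, WORDS_PER_ENTRY)
        if block in self.entries:
            # Address matches a valid entry: merge instead of using a new slot.
            self.entries[block][offset] = value
            return True
        if len(self.entries) == self.capacity:
            return False
        self.entries[block] = {offset: value}
        return True

buffer = WriteBuffer(capacity=2)
for word_address in (100, 101, 102, 103):
    buffer.write(word_address, value=word_address * 10)

print(len(buffer.entries))
```

The four sequential writes land in a single entry, whereas without merging each would take its own slot and this two-entry buffer would fill up after the second write.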
Categorizations of cache misses:
- Compulsory: The very first access to a block cannot be in the cache
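- Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved.
- Conflict: If the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved when multiple blocks map to its set.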
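To see the difference between miss categories, here is a tiny simulator sketch. It assumes LRU replacement and works on block addresses directly. The function name `count_misses` and the access trace are made up. Both configurations hold four blocks, so any extra misses in the direct-mapped case come from blocks fighting over the same set rather than from a lack of capacity.

```python
from collections import OrderedDict

def count_misses(block_addresses, num_sets: int, ways: int) -> int:
    """Simulate an LRU set-associative cache and return the miss count."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)   # hit: mark as most recently used
            continue
        misses += 1
        if len(lines) == ways:
            lines.popitem(last=False)  # set is full: evict least recently used
        lines[block] = True
    return misses

trace = [0, 4, 0, 4, 0, 4]  # blocks 0 and 4 both map to set 0 when there are 4 sets

direct_mapped = count_misses(trace, num_sets=4, ways=1)
fully_associative = count_misses(trace, num_sets=1, ways=4)
print(direct_mapped, fully_associative)
```

The fully associative cache only takes the two compulsory misses, while the direct-mapped one misses on every access because blocks 0 and 4 keep evicting each other.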

2.2. Memory Technology and Optimizations

Static Random Access Memory (SRAM): used for cache
Dynamic Random Access Memory (DRAM): used for main memory
Flash: Used for nonvolatile storage

Appendix C. Pipelining: Basic and Intermediate Concepts

C.1. Introduction

Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
Every instruction in this RISC subset can be implemented in, at most, 5 clock cycles.
1. Instruction fetch cycle (IF)
2. Instruction decode/register fetch cycle (ID)
3. Execution/effective address cycle (EX)
4. Memory access (MEM)
5. Write-back cycle (WB)

Simple RISC pipeline. On each clock cycle, another instruction is fetched and begins its five-cycle execution.
	Clock number
Instruction number	1	2	3	4	5	6	7	8	9
Instruction i	IF	ID	EX	MEM	WB
Instruction i+1		IF	ID	EX	MEM	WB
Instruction i+2			IF	ID	EX	MEM	WB
Instruction i+3				IF	ID	EX	MEM	WB
Instruction i+4					IF	ID	EX	MEM	WB
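The table above can be generated with a short sketch, assuming an ideal pipeline with no stalls or hazards. The function name `pipeline_schedule` is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(num_instructions: int):
    """One row per instruction; row[c] is its stage in clock c + 1, or ''."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters the pipeline in clock i + 1
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_schedule(5)):
    print(f"i+{i}", " ".join(f"{cell:>3}" for cell in row))
```

Five instructions finish in 5 + 5 - 1 = 9 clock cycles, matching the table, instead of up to 25 if each instruction ran all of its cycles before the next one started.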

2. My thoughts

My biggest takeaway is that there is a fine balance in managing the memory hierarchy, from L1 all the way to DRAM & Flash. There are many different inputs to configure, like associativity, block size, write-through vs. write-back, write merging, and cache eviction policy. Each input comes with various tradeoffs; for example, a larger cache might improve the hit ratio but requires more static power. The memory wall is a big deal. CPU performance has greatly outpaced the performance of DRAM, and I’ll be paying close attention to high bandwidth memory (HBM), where multiple DRAMs are placed in stacks embedded within the same package as the processor.

I will admit, I am still a bit fuzzy on how DRAM works. The concept of “banks” being accessed by rows/columns using the row access strobe (RAS) and column access strobe (CAS) didn’t click for me. I’m curious to see what the actual interface is for the CPU to interact with DRAM. It’s clear to me that the OS creates virtual memory addresses and keeps track of how those map to physical addresses with the page table, and that the translation lookaside buffer (TLB) is a cache of page table entries.
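The part that did click can be sketched in a few lines. This is a toy model, not how any real MMU is built: the `Mmu` class, the 4 KiB page size, the flat dictionary page table, and the unbounded TLB are all simplifying assumptions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

class Mmu:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # cache of recently used page table entries
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, virtual_address: int) -> int:
        page_number, offset = divmod(virtual_address, PAGE_SIZE)
        if page_number in self.tlb:
            self.tlb_hits += 1
        else:
            self.tlb_misses += 1
            # Slow path: look up the page table, then cache the entry.
            # A missing key here would be a page fault in a real system.
            self.tlb[page_number] = self.page_table[page_number]
        return self.tlb[page_number] * PAGE_SIZE + offset

mmu = Mmu(page_table={0: 7, 1: 3})
print(mmu.translate(4100), mmu.translate(4200))
print(mmu.tlb_hits, mmu.tlb_misses)
```

The second access to the same page skips the page table lookup entirely, which is the whole point of the TLB.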

Chapter 3: Instruction-Level Parallelism and Its Exploitation

3.11. Multithreading

A software thread is like a process in that it has state and a current PC, but threads typically share the address space of a single process, allowing a thread to easily access data of other threads within the same process. Multithreading is a hardware technique whereby multiple threads within a process or multiple processes share a processor pipeline without requiring an intervening process switch by the operating system. The ability to switch between