Computer Architecture: A Quantitative Approach

Technical Books
In Progress
My notes & review of Computer Architecture: A Quantitative Approach by John L. Hennessy, David A. Patterson, Christos Kozyrakis
Author

Tyler Hillery

Published

May 7, 2026


Notes

Chapter 1: Fundamentals of Quantitative Design and Analysis

1.1. Introduction

ImportantQuestion❓

The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology led to a higher rate of performance improvement—roughly 35% growth per year.

What is the official definition of “microprocessor”? Need to clear up my mental model of transistors, integrated circuits and microprocessors


Answer: After some research, here are some notes to clear things up:

  • Transistors, in my mind, are basically a “switch”: they can be on or off. They are switched by applying a voltage.
  • Multiple transistors can be combined to create logic gates such as AND, OR, NOT, NAND, NOR, XOR, XNOR
  • Logic gates can be combined to form higher-level components such as adders and flip-flops
  • These components can be combined into units such as ALUs and registers, which together form a CPU (it seems “microprocessor” and “CPU” tend to be used interchangeably?)
  • These systems are implemented on a single piece of silicon as an integrated circuit (IC)
  • A wafer is a large circular slice of silicon used to manufacture many chips at once
  • A die is a single rectangular piece cut from the wafer, and typically contains one IC (1:1 mapping)
  • Modern designs may use chiplets, where multiple smaller dies (each an IC or part of one) are combined into a single package
  • A microprocessor is an integrated circuit, but not all integrated circuits are microprocessors. Examples of other ICs: Microcontrollers, Memory ICs (DDR), Amplifiers, Voltage Regulators, Application-Specific Integrated Circuits (ASICs).
  • The term “packaged” means taking the die(s) and putting them together, often in an enclosure that can then be placed into a socket (unless they are soldered directly onto the board).
  • Chiplets are smaller dies that are integrated into a single package instead of one larger die.
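
To make the "gates combine into adders" step in the list above concrete, here is a minimal sketch in Python. It models a NAND gate as a function and builds the other gates, a full adder, and a ripple-carry adder out of it. All function names are my own for illustration, and real hardware evaluates the bits in parallel rather than in a loop.

```python
def nand(a: int, b: int) -> int:
    """Model of a NAND gate: output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


# Every other gate below is built only from NAND.
def not_(a: int) -> int:
    return nand(a, a)


def and_(a: int, b: int) -> int:
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))


def xor(a: int, b: int) -> int:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three 1-bit inputs; return (sum_bit, carry_out)."""
    partial = xor(a, b)
    sum_bit = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return sum_bit, carry_out


def ripple_carry_add(x: int, y: int, width: int = 8) -> int:
    """Chain `width` full adders; the final carry is dropped (wraps around)."""
    result, carry = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result
```

For example, `ripple_carry_add(5, 9)` gives 14, and `ripple_carry_add(200, 100)` wraps to 44 because the default width is 8 bits.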
ImportantQuestion❓

First, the virtual elimination of assembly language programming reduced the need for object-code compatibility.

Why do they call this “object code”? I thought the term for this was “machine code”.

ImportantQuestion❓

These changes made it possible to successfully develop a new set of architectures with simpler instructions, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-based machines focused the attention of designers on two critical performance techniques: the exploitation of instruction-level parallelism (ILP; initially through pipelining and later through multiple instruction issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations).

What is the difference between RISC and RISC-V? What were the architectures before RISC called?

ImportantQuestion❓

In 2003, the limits of power due to the end of Dennard scaling and the available instruction-level parallelism slowed uniprocessor performance to 23% per year until 2011, or doubling every 3.5 years.

What is “Dennard” scaling?


Answer: This is answered in a few paragraphs later.

In 1974, Robert Dennard observed that power density was constant for a given area of silicon even as you increased the number of transistors because of smaller dimensions of each transistor. Remarkably, transistors could go faster but use less power. Dennard scaling ended around 2004 because current and voltage couldn’t keep dropping and still maintain the dependability of integrated circuits.

ImportantQuestion❓

From 2011 to 2015, the annual improvement was less than 12% (doubling every 6 years) in part due to the limits of parallelism of Amdahl’s Law.

What are the limits of parallelism of Amdahl’s law?


Answer: This is answered in a few paragraphs later.

Amdahl’s Law (Section 1.10) prescribes practical limits to the number of useful cores per chip. If 10% of the task is serial, then the maximum performance benefit from parallelism is 10, no matter how many cores you put on the chip.
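
As a worked check of the "maximum benefit is 10" claim, here is the usual form of Amdahl's Law for parallel speedup. The symbols $p$ (the fraction of the task that can run in parallel) and $N$ (the number of cores) are my own notation, not the book's.

$$
\text{Speedup}(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
\qquad\Longrightarrow\qquad
\lim_{N \to \infty} \text{Speedup}(N) = \frac{1}{1 - p} = \frac{1}{0.1} = 10 \quad \text{for } p = 0.9
$$

With 10% of the task serial, the $p/N$ term shrinks toward zero as cores are added, but the serial term $(1 - p) = 0.1$ never does.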

  • Moore’s Law was defined in 1965, when Gordon Moore predicted that the number of transistors per chip would double every year; he amended this in 1975 to every 2 years. The prediction held for roughly 50 years. Processor performance improvements have since slowed because of the combination of:
    • transistors no longer getting much better because of the slowing of Moore’s Law and the end of Dennard scaling
    • the unchanging power budgets of microprocessors
    • the replacement of the single power-hungry processors with several energy-efficient processors
    • the limits to multiprocessing given Amdahl’s Law
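
The growth rates quoted in this section can be sanity-checked with a little compound-growth arithmetic. This is a small sketch of my own (the function names are hypothetical), assuming performance compounds once per year at a fixed rate.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a fixed annual growth rate (0.23 means 23%)."""
    return math.log(2) / math.log(1 + annual_growth)


def annual_growth_for_doubling(years: float) -> float:
    """Annual growth rate that doubles a quantity every `years` years."""
    return 2 ** (1 / years) - 1


if __name__ == "__main__":
    for rate in (0.35, 0.23, 0.12):
        print(f"{rate:.0%} per year doubles in {doubling_time_years(rate):.2f} years")
    print(f"doubling every 2 years is {annual_growth_for_doubling(2):.1%} per year")
```

At 23% per year the doubling time comes out a little under 3.5 years, and at 12% per year a little over 6 years, which is consistent with the rounded figures in the quotes above.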

1.2. Classes of Computers

  • A real-time performance requirement means a segment of the application has an absolute maximum execution time.

  • Soft real-time is when the average time for a particular task is constrained, as well as the number of instances when some maximum is exceeded.

  • Two kinds of parallelism in applications:

    • Data-Level Parallelism (DLP) arises because there are many data items that can be operated on at the same time.
    • Task-Level Parallelism (TLP) arises because tasks of work are created that can operate independently and largely in parallel.
  • Computer hardware can exploit these two kinds of application parallelism in four ways:

    • Instruction-Level Parallelism (ILP) exploits DLP at modest levels with compiler help using ideas like pipelining and speculative execution
    • Vector architectures, graphics processing units (GPUs), and multimedia instruction sets exploit DLP by applying a single instruction to a collection of data in parallel
    • Thread-Level Parallelism (also abbreviated TLP) exploits either DLP or task-level parallelism in a tightly coupled hardware model that allows for interaction between parallel threads
    • Request-Level Parallelism (RLP) exploits parallelism among largely decoupled tasks specified by the programmer or the operating system
  • Single instruction stream, single data stream (SISD): Standard sequential computer, but still can exploit ILP using superscalar and speculative execution.

  • Single instruction stream, multiple data streams (SIMD): The same instruction is executed by multiple processors using different data streams. SIMD computers exploit DLP by applying the same operations to multiple items of data in parallel.

  • Multiple instruction streams, multiple data streams (MIMD): Each processor fetches its own instructions and operates on its own data, and it targets TLP. MIMD is more flexible than SIMD but it is more expensive. MIMD can also exploit DLP, although the overhead is likely to be higher than SIMD.

1.3. Defining Computer Architecture

  • Instruction set architecture (ISA) refers to the actual programmer-visible instruction set. Examples include 80x86, ARMv6, and RISC-V.
NoteAside

The most popular RISC processors come from ARM (Advanced RISC Machine), which were in 30 billion chips shipped in 2023, or more than 100 times as many chips that shipped with 80x86 processors (often abbreviated x86).

Wow, I had no idea ARM was that much more popular than x86.

ImportantQuestion❓

Developed 30 years after the first RISC instruction sets, RISC-V inherits its ancestors’ good ideas—a large set of registers, easy-to-pipeline instructions, and a lean set of operations—while avoiding their omissions or mistakes.

I’d be curious to know, why are the above ideas: large set of registers, easy-to-pipeline instructions and lean set of operations considered good? What does it mean to pipeline instructions?

  • x86 is considered to be a register-memory ISA
  • ARMv8 and RISC-V, like virtually all ISAs announced since 1985, are load-store ISAs

1.10. Quantitative Principles of Computer Design

  • The basic idea behind pipelining is to overlap instruction execution to reduce the total time to complete an instruction sequence.
  • As a rule of thumb, a program spends 90% of its execution time in only 10% of the code.
  • Temporal locality states that recently accessed items are likely to be accessed again soon.
  • Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.
  • Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
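
Amdahl's Law from the last bullet is easy to play with in code. Below is a minimal sketch using the standard formulation: overall speedup = 1 / ((1 - fraction enhanced) + fraction enhanced / speedup of the enhanced part). The function and parameter names are my own.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is made `speedup_enhanced` times faster and the rest is unchanged."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    unchanged_time = 1.0 - fraction_enhanced
    enhanced_time = fraction_enhanced / speedup_enhanced
    return 1.0 / (unchanged_time + enhanced_time)


if __name__ == "__main__":
    # 90% of the work parallelizes perfectly across N cores; 10% stays serial.
    for cores in (2, 16, 128, 1024):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With 16 cores the overall speedup is only 6.4, and no core count gets past 10, because the serial 10% of the original time is never reduced.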

1.12. Fallacies and Pitfalls

  • Dennard scaling held that voltage and current should be proportional to the linear dimensions of a transistor. Stated differently, as CMOS technology scaled, circuits could use more transistors and operate at higher frequencies while keeping power consumption constant. Dennard scaling ended about 30 years after it was formulated in 1974 because threshold voltage and leakage current set a nonscaling baseline for power consumption for each transistor.
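
Here is a rough derivation of why power density stays constant under ideal Dennard scaling, as I understand it. It assumes the textbook dynamic power relation $P \propto C V^2 f$ and a scaling factor $\kappa > 1$ (my notation) by which linear dimensions shrink. Leakage is ignored, and leakage is exactly what broke the scaling in practice.

$$
C' = \frac{C}{\kappa}, \qquad V' = \frac{V}{\kappa}, \qquad f' = \kappa f, \qquad A' = \frac{A}{\kappa^2}
$$

$$
P' \propto C' V'^2 f' = \frac{C}{\kappa} \cdot \frac{V^2}{\kappa^2} \cdot \kappa f = \frac{C V^2 f}{\kappa^2}
\qquad\Longrightarrow\qquad
\frac{P'}{A'} = \frac{P / \kappa^2}{A / \kappa^2} = \frac{P}{A}
$$

Each transistor uses $1/\kappa^2$ of the power but also takes $1/\kappa^2$ of the area, so $\kappa^2$ times as many transistors fit in the same silicon at the same total power, and each one switches $\kappa$ times faster.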
NoteAside

The Sun Microsystems Division of Oracle experienced this pitfall in 2000 with an L2 cache that included parity, but not error correction, in its Sun E3000 to Sun E10000 systems. The SRAMs they used to build the caches had intermittent faults, which parity detected. If the data in the cache were not modified, the processor would simply reread the data from the cache. Because the designers did not protect the cache with ECC (error-correcting code), the operating system had no choice but to report an error to dirty data and crash the program. Field engineers found no problems on inspection in more than 90% of the cases. To reduce the frequency of such errors, Sun modified the Solaris operating system to “scrub” the cache by having a process that proactively wrote dirty data to memory. Because the processor chips did not have enough pins to add ECC, the only hardware option for dirty data was to duplicate the external cache, using the copy without the parity error to correct the error. The pitfall is in detecting faults without providing a mechanism to correct them. These engineers are unlikely to design another computer without ECC on external caches.

This error is commonly brought up on Oxide and Friends; the best episode for reference is probably “A Requiem for SPARC with Tom Lyon”.

Chapter 2: Memory Hierarchy Design

2.1. Introduction

  • A group of multiple words is called a block (or line)
  • Each cache block includes a tag to indicate which memory address it corresponds to
  • Set associative: a set is a group of blocks in the cache. A block is mapped to a set chosen by the address of the data: $(\text{Block address}) \bmod (\text{Number of sets in cache})$
  • n-way set associative is where there are n blocks in a set.
  • Direct-mapped cache has just one block per set.
  • Fully associative cache has just one set.
  • Both write strategies, write-through and write-back, can use a write buffer to allow the cache to proceed as soon as the data are placed in the buffer rather than wait for the full latency to write the data into memory.
  • If the write buffer contains other modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry. This optimization is called write merging.
  • Categorizations of cache misses:
    • Compulsory: The very first access to a block cannot be in the cache
    • Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved.
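    • Conflict: If the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved when multiple blocks map to its set.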
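
To see how set selection and associativity interact, here is a toy cache model in Python. It tracks only which blocks are resident, not their data. The class name, the 64-byte block size, and the LRU replacement policy are my own assumptions for illustration; the notes above do not specify a replacement policy.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy model: ways=1 is direct-mapped, ways=num_blocks is fully associative."""

    def __init__(self, num_blocks: int, ways: int, block_size: int = 64):
        if num_blocks % ways != 0:
            raise ValueError("num_blocks must be a multiple of ways")
        self.ways = ways
        self.block_size = block_size
        self.num_sets = num_blocks // ways
        # One ordered dict per set, ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, address: int) -> bool:
        """Simulate one access; return True on a hit, False on a miss."""
        block_address = address // self.block_size
        index = block_address % self.num_sets  # (Block address) MOD (Number of sets)
        tag = block_address // self.num_sets   # identifies the block within its set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)             # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)          # evict the least recently used block
        lines[tag] = True
        return False


def count_misses(cache: SetAssociativeCache, addresses) -> int:
    return sum(not cache.access(address) for address in addresses)


if __name__ == "__main__":
    # Addresses 0 and 512 are blocks 0 and 8, which share a set in every
    # configuration below.
    trace = [0, 512] * 3
    for ways in (1, 2, 8):
        cache = SetAssociativeCache(num_blocks=8, ways=ways)
        print(f"{ways}-way: {count_misses(cache, trace)} misses")
```

In the direct-mapped case the two blocks keep evicting each other, so all 6 accesses miss even though the cache is nearly empty (conflict misses). With 2 or more ways, only the 2 compulsory misses remain.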

2.2. Memory Technology and Optimizations

  • Static Random Access Memory (SRAM): used for cache
  • Dynamic Random Access Memory (DRAM): used for main memory
  • Flash: Used for nonvolatile storage

Appendix C. Pipelining: Basic and Intermediate Concepts

C.1. Introduction

  • Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
  • Every instruction in this RISC subset can be implemented in, at most, 5 clock cycles.
    1. Instruction fetch cycle (IF)
    2. Instruction decode/register fetch cycle (ID)
    3. Execution/effective address cycle (EX)
    4. Memory access (MEM)
    5. Write-back cycle (WB)
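
Simple RISC pipeline. On each clock cycle, another instruction is fetched and begins its five-cycle execution.

| Instruction number / Clock number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Instruction i | IF | ID | EX | MEM | WB | | | | |
| Instruction i+1 | | IF | ID | EX | MEM | WB | | | |
| Instruction i+2 | | | IF | ID | EX | MEM | WB | | |
| Instruction i+3 | | | | IF | ID | EX | MEM | WB | |
| Instruction i+4 | | | | | IF | ID | EX | MEM | WB |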
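
The staggered schedule above can be generated with a few lines of Python. This is a sketch of the ideal case only: it assumes one instruction enters the pipeline every cycle and that nothing ever stalls, so hazards are not modeled. The function names are mine.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions: int) -> list[list[str]]:
    """rows[i][c] is the stage instruction i occupies in clock cycle c + 1,
    or "" if it is not in the pipeline during that cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage  # instruction i is fetched in cycle i + 1
        rows.append(row)
    return rows


def format_schedule(rows: list[list[str]]) -> str:
    if not rows:
        return ""
    header = "instr " + " ".join(f"{c:>3}" for c in range(1, len(rows[0]) + 1))
    body = [f"i+{i:<3} " + " ".join(f"{s:>3}" for s in row) for i, row in enumerate(rows)]
    return "\n".join([header] + body)


if __name__ == "__main__":
    print(format_schedule(pipeline_schedule(5)))
```

Five instructions finish in 5 + 4 = 9 cycles, versus 5 × 5 = 25 cycles if each instruction had to complete before the next one started. In general, n instructions take n + 4 cycles in this ideal model.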

2. My thoughts

My biggest takeaway is that managing a memory hierarchy, from L1 all the way to DRAM and Flash, is a fine balancing act. There are many parameters to configure, such as associativity, block size, write-through vs. write-back, write merging, and the cache eviction policy, and each one comes with tradeoffs. For example, a larger cache might improve the hit ratio but requires more static power. The memory wall is a big deal: CPU performance has greatly outpaced the performance of DRAM. I’ll be paying close attention to high bandwidth memory (HBM), where multiple DRAMs are placed in stacks embedded within the same package as the processor.

I will admit, I am still a bit fuzzy on how DRAM works. The concept of “banks” being accessed by rows and columns using the row access strobe (RAS) and column access strobe (CAS) didn’t click for me. I’m curious what the actual interface is for the CPU to interact with DRAM. It’s clear to me that the OS creates virtual memory addresses and keeps track of how they map to physical addresses with the page table and the translation lookaside buffer (TLB), which is a cache of page table entries.

3. Instruction-Level Parallelism and Its Exploitation

3.1. Introduction