Computer Architecture: A Quantitative Approach
Notes
Chapter 1: Fundamentals of Quantitative Design and Analysis
1.1. Introduction
- Moore’s Law dates to 1965, when Gordon Moore predicted that the number of transistors per chip would double every year; in 1975 he amended this to every 2 years. The prediction held for about 50 years. Processor performance improvement has since slowed because of a combination of:
- transistors no longer getting much better because of the slowing of Moore’s Law and the end of Dennard scaling
- the unchanging power budgets of microprocessors
- the replacement of the single power-hungry processor with several energy-efficient processors
- the limits to multiprocessing given Amdahl’s Law
1.2. Classes of Computers
A real-time performance requirement means that a segment of the application has an absolute maximum execution time.
Soft real-time is when the average time for a particular task is constrained, as well as the number of instances when some maximum is exceeded.
Two kinds of parallelism in applications:
- Data-Level Parallelism (DLP) arises because there are many data items that can be operated on at the same time.
- Task-Level Parallelism (TLP) arises because tasks of work are created that can operate independently and largely in parallel.
Computer hardware can exploit these two kinds of application parallelism in four ways:
- Instruction-Level Parallelism (ILP) exploits DLP at modest levels with compiler help, using ideas like pipelining and speculative execution
- Vector architectures, graphics processing units (GPUs), and multimedia instruction sets exploit DLP by applying a single instruction to a collection of data in parallel
- Thread-Level Parallelism (TLP) exploits either DLP or task-level parallelism in a tightly coupled hardware model that allows for interaction between parallel threads
- Request-Level Parallelism (RLP) exploits parallelism among largely decoupled tasks specified by the programmer or the operating system
Single instruction stream, single data stream (SISD): The standard sequential computer, which can still exploit ILP using superscalar and speculative execution.
Single instruction stream, multiple data streams (SIMD): The same instruction is executed by multiple processors using different data streams. SIMD computers exploit DLP by applying the same operations to multiple items of data in parallel.
Multiple instruction streams, multiple data streams (MIMD): Each processor fetches its own instructions and operates on its own data, and it targets TLP. MIMD is more flexible than SIMD, but it is more expensive. MIMD can also exploit DLP, although the overhead is likely to be higher than with SIMD.
1.3. Defining Computer Architecture
- Instruction set architecture (ISA) refers to the actual programmer-visible instruction set. Examples include 80x86, ARMv8, and RISC-V.
- x86 is considered to be a register-memory ISA
- ARMv8, RISC-V, and all other ISAs announced since 1985 are load-store
1.4. Trends in Technology
- Integrated circuit processes are characterized by feature size, which is the minimum size of a transistor or wire in either the x or y dimension.
- Transistor count per square millimeter of silicon is determined by the surface area of a transistor, so the density of transistors increases quadratically with a linear decrease in feature size.
- The shrink in the vertical dimension requires a reduction in operating voltage to maintain correct operation and reliability of the transistors.
- While transistors generally improve in performance with decreased feature size, wires in an IC do not. In particular, the signal delay for a wire increases in proportion to the product of its resistance and capacitance.
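As a rough illustration of the quadratic relationship above, the sketch below computes the relative transistor density after a shrink in feature size. The function name is hypothetical, and the model assumes the idealized case where transistor area scales exactly with the square of the feature size; real processes deviate from this.

```python
def relative_density(old_feature_nm: float, new_feature_nm: float) -> float:
    """Idealized density gain when feature size shrinks from old to new.

    Assumption: transistor area is proportional to feature_size ** 2,
    so density is proportional to 1 / feature_size ** 2.
    """
    if old_feature_nm <= 0 or new_feature_nm <= 0:
        raise ValueError("feature sizes must be positive")
    return (old_feature_nm / new_feature_nm) ** 2


if __name__ == "__main__":
    # Halving the feature size gives 4x the transistors per unit area
    # under this idealized model.
    print(relative_density(32.0, 16.0))  # 4.0
```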
1.5. Trends in Power and Energy in ICs
- Sustained power consumption is widely characterized by the thermal design power (TDP). TDP is neither the peak power nor the actual average power consumed during a given computation.
1.10. Quantitative Principles of Computer Design
- The basic idea behind pipelining is to overlap instruction execution to reduce the total time to complete an instruction sequence.
- A widely used rule of thumb is that a program spends 90% of its execution time in only 10% of the code.
- Temporal locality states that recently accessed items are likely to be accessed again soon.
- Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.
- Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
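Amdahl’s Law is commonly written as $\text{Speedup}_{\text{overall}} = 1 / ((1 - F) + F / S)$, where $F$ is the fraction of execution time that can use the faster mode and $S$ is the speedup of that mode. The sketch below evaluates this formula; the function and parameter names are illustrative choices, not taken from the book.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of execution time is accelerated.

    fraction_enhanced: share of the original execution time that can use
        the faster mode (between 0 and 1).
    speedup_enhanced: how much faster that share runs in the faster mode.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Speeding up 40% of the run time by 10x gives only about 1.56x overall.
    print(amdahl_speedup(0.4, 10.0))
    # Even with a near-infinite speedup on 90% of the run time,
    # the remaining 10% caps the overall speedup just below 10x.
    print(amdahl_speedup(0.9, 1e12))
```

The second call shows the limit the law describes: the part that cannot use the faster mode bounds the overall speedup at $1 / (1 - F)$.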
1.12. Fallacies and Pitfalls
- Dennard scaling held that voltage and current should be proportional to the linear dimensions of a transistor. Stated differently, as CMOS technology scaled, circuits could use more transistors and operate at higher frequencies while keeping power consumption constant. Dennard scaling ended 30 years later because threshold voltage and leakage current set a nonscaling baseline for power consumption for each transistor.
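The constant-power claim can be motivated by the usual idealized scaling argument. This is a sketch in notation that is not from the notes ($k > 1$ for the scaling factor, $C$ for capacitance, $V$ for voltage, $f$ for switching frequency). It assumes dynamic power dominates and ignores leakage, which is the assumption that eventually broke down.

$$
P_{\text{dynamic}} \propto C V^2 f
$$

If linear dimensions shrink by $1/k$, then ideally $C \to C/k$, $V \to V/k$, and $f \to k f$, so the power per transistor becomes

$$
P' \propto \frac{C}{k} \left(\frac{V}{k}\right)^2 (k f) = \frac{C V^2 f}{k^2}
$$

The number of transistors per unit area grows by $k^2$, so the power per unit area is proportional to $k^2 \cdot \frac{C V^2 f}{k^2} = C V^2 f$, which is unchanged.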
Chapter 2: Memory Hierarchy Design
2.1. Introduction
- Multiple words moved together are called a block (or line)
- Each cache block includes a tag to indicate which memory address it corresponds to
- In a set-associative cache, a set is a group of blocks in the cache. The set is chosen by the address of the data: $(\text{Block address}) \bmod (\text{Number of sets in cache})$
- An n-way set-associative cache has n blocks in a set.
- A direct-mapped cache has just one block per set.
- A fully associative cache has just one set.
- Both write strategies, write-through and write-back, can use a write buffer to allow the cache to proceed as soon as the data are placed in the buffer rather than wait for the full latency to write the data into memory.
- If the write buffer contains other modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry. This optimization is called write merging.
- Categorizations of cache misses:
- Compulsory: The very first access to a block cannot be in the cache
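- Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved.
- Conflict: If the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved when multiple blocks map to its set.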
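To make the set mapping and the miss categories concrete, here is a minimal toy simulator. It is a sketch under several assumptions that are not from the notes: it tracks only block addresses (no data, no byte offsets), stores the whole block address in place of a tag, and uses LRU replacement. The class and function names are hypothetical.

```python
from collections import OrderedDict


def set_index(block_address: int, num_sets: int) -> int:
    """Set chosen for a block: (block address) mod (number of sets)."""
    return block_address % num_sets


class SetAssociativeCache:
    """Toy cache that tracks block addresses with LRU replacement per set."""

    def __init__(self, num_blocks: int, ways: int) -> None:
        if ways <= 0 or num_blocks % ways != 0:
            raise ValueError("num_blocks must be a positive multiple of ways")
        self.ways = ways
        self.num_sets = num_blocks // ways
        # One OrderedDict per set; keys are block addresses, oldest first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, block_address: int) -> bool:
        """Return True on a hit; on a miss, fill the block and return False."""
        lines = self.sets[set_index(block_address, self.num_sets)]
        if block_address in lines:
            lines.move_to_end(block_address)  # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used block
        lines[block_address] = None
        return False


def count_misses(cache: SetAssociativeCache, trace) -> int:
    """Replay a sequence of block addresses and count the misses."""
    return sum(not cache.access(block) for block in trace)


if __name__ == "__main__":
    trace = [0, 8, 0, 8]  # blocks 0 and 8 collide when there are 8 sets
    print(count_misses(SetAssociativeCache(8, ways=1), trace))  # direct-mapped
    print(count_misses(SetAssociativeCache(8, ways=2), trace))  # 2-way
    print(count_misses(SetAssociativeCache(8, ways=8), trace))  # fully associative
```

With the same 8-block capacity, the direct-mapped configuration misses on every access in this trace because blocks 0 and 8 keep evicting each other, while the 2-way and fully associative configurations only take the two compulsory misses. The extra misses in the direct-mapped case are conflict misses.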
2.2. Memory Technology and Optimizations
- Static Random Access Memory (SRAM): used for caches
- Dynamic Random Access Memory (DRAM): used for main memory
- Flash: used for nonvolatile storage
Appendix C. Pipelining: Basic and Intermediate Concepts
C.1. Introduction
- Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
- Every instruction in this RISC subset can be implemented in, at most, 5 clock cycles.
- Instruction fetch cycle (IF)
- Instruction decode/register fetch cycle (ID)
- Execution/effective address cycle (EX)
- Memory access (MEM)
- Write-back cycle (WB)
| Instruction number | Clock 1 | Clock 2 | Clock 3 | Clock 4 | Clock 5 | Clock 6 | Clock 7 | Clock 8 | Clock 9 |
|---|---|---|---|---|---|---|---|---|---|
| Instruction i | IF | ID | EX | MEM | WB | | | | |
| Instruction i+1 | | IF | ID | EX | MEM | WB | | | |
| Instruction i+2 | | | IF | ID | EX | MEM | WB | | |
| Instruction i+3 | | | | IF | ID | EX | MEM | WB | |
| Instruction i+4 | | | | | IF | ID | EX | MEM | WB |
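The staggered overlap in the table can be generated programmatically. The sketch below assumes an ideal five-stage pipeline with no stalls or hazards, where one new instruction enters per clock cycle; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_cycles(num_instructions: int) -> int:
    """Clock cycles to finish n instructions in an ideal, stall-free pipeline."""
    if num_instructions <= 0:
        return 0
    return len(STAGES) + num_instructions - 1


def pipeline_schedule(num_instructions: int) -> list:
    """Return one row per instruction; row[c] is the stage active in
    clock cycle c + 1, or an empty string if the instruction is not
    in the pipeline during that cycle."""
    width = ideal_cycles(num_instructions)
    rows = []
    for i in range(num_instructions):
        row = [""] * width
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage  # instruction i is fetched in cycle i + 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(pipeline_schedule(5)):
        cells = " ".join(f"{stage:>3}" for stage in row)
        print(f"i+{i}: {cells}")
    # Five instructions finish in 9 cycles, versus 25 cycles if each
    # instruction ran all five stages before the next one started.
    print(ideal_cycles(5), 5 * len(STAGES))
```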
2. My thoughts
My biggest takeaway is that managing a memory hierarchy, from L1 all the way down to DRAM and Flash, is a fine balancing act. There are many parameters to configure, such as associativity, block size, write-through vs. write-back, write merging, and the cache eviction policy, and each one comes with tradeoffs. For example, a larger cache might improve the hit ratio but requires more static power. The memory wall is a big deal: CPU performance has greatly outpaced the performance of DRAM. I’ll be paying close attention to high bandwidth memory (HBM), where multiple DRAMs are placed in stacks embedded within the same package as the processor.
I will admit, I am still a bit fuzzy on how DRAM works. The concept of “banks” being accessed by rows and columns using the row access strobe (RAS) and column access strobe (CAS) didn’t click for me. I’m curious what the actual interface is for the CPU to interact with DRAM. It’s clear to me that the OS creates virtual memory addresses and keeps track of how they map to physical addresses with the page table and the translation lookaside buffer (TLB), which is a cache of page table entries.