Chapter 3
Computer Memory System
Topics
3.1) Computer Memory System
3.2) Internal Memory
3.3) External Memory
3.4) Cache Memory System
3.1) Computer Memory System
Generally, a computer memory system can be divided into 2 categories:
Internal memory
External/Secondary Memory
Moving down the memory hierarchy:
Decreasing cost per bit (shorter access time => greater cost per bit)
Increasing capacity (internal memory is much smaller than external memory)
Increasing access time (greater capacity => longer access time; it is always faster to access data from internal memory)
Decreasing frequency of access by the CPU
Memory Hierarchy
In the memory hierarchy as a whole, we saw that there are trade-offs among speed, density, and cost. These trade-offs also exist when we consider the organization of memory cells and functional logic on a chip. For semiconductor memories, one of the key design issues is the number of bits of data that may be read/written at a time. At one extreme is an organization in which the physical arrangement of cells in the array is the same as the logical arrangement (as perceived by the processor) of words in memory
3.2) Internal Memory
There are 3 types of internal memory:
a) Registers (in the CPU)
b) Cache
c) Main Memory (RAM & ROM)
Information stored in internal memory can be accessed with greater speed by the CPU compared to external memory.
Internal memory is contained on computer chips and uses electronic circuits to store information
Internal memory is volatile, with the exception of ROM
Main Memory
The central storage of the computer.
Used to store instructions and data that are being executed (run)
Built from semiconductor integrated circuits
Basically, main memory is divided into RAM and ROM
Unique address for each location
n address bits can address 2^n locations
Access one word at a time
Basic operations are read and write
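A minimal sketch of these ideas (illustrative Python, not from the slides): a word-addressable memory where n address bits select one of 2^n words, and read and write are the only basic operations.

# Hypothetical word-addressable memory: n address bits -> 2**n locations.
class WordMemory:
    def __init__(self, address_bits, word_bits):
        self.size = 2 ** address_bits           # number of addressable locations
        self.word_mask = (1 << word_bits) - 1   # keep each word within word_bits bits
        self.cells = [0] * self.size

    def read(self, address):                    # access one word at a time
        return self.cells[address]

    def write(self, address, value):
        self.cells[address] = value & self.word_mask

mem = WordMemory(address_bits=10, word_bits=16)  # 2**10 = 1024 words of 16 bits
mem.write(0x3FF, 0xBEEF)
print(mem.read(0x3FF))                           # 48879 (0xBEEF)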
RAM
Acronym for Random Access Memory
Information in RAM can be read and modified (written)
Used to store programs and user data that are being executed (run)
When a set of instructions (a program) is entered into a computer, the program is stored in the part of main memory called RAM.
RAM is a kind of volatile memory: its contents are wiped out if the electricity supply is interrupted, so when the computer is switched off the program disappears from RAM.
Types of RAM
Static RAM (SRAM)
Made from flip-flops – logic circuits that retain the information stored in them as long as there is enough power to run the device
Disadvantages: expensive, small storage capacity per chip
Advantages: easier to use and faster access time (read and write cycles)
Dynamic RAM (DRAM)
Made from capacitors
Disadvantages: slower access time; harder to use because the capacitors must be recharged (‘refreshed’) at regular intervals
Advantages: cheaper, bigger storage capacity per chip
A number of improvements have been made to DRAM to improve its performance (access time), under different methods and names, e.g. EDRAM, CDRAM, SDRAM, RDRAM, etc.
The following is a block diagram of a RAM unit (chip) with 2^k words, each n bits long:
Figure: RAM (2^k x n) – k address lines (from the address bus), n input lines (from the data bus), n output lines (to the data bus), and read/write control signals (from the control bus); 2^k words of n bits.
Note:
memory is measured in base 2 units
capacity in internal memory is typically expressed in terms of bytes or words
For example, a 2^24 x 16 RAM contains 2^24 = 16M words, each 16 bits long.
Figure: RAM (2^24 x 16) with read/write control from the control bus – how many address lines (from the address bus), how many bits per word, and how many output lines (to the data bus)?
Total capacity = 2^24 x 16 = 2^24 x 2^4 = 2^28 bits = 2^25 bytes
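The same arithmetic can be checked in a few lines of Python (a throwaway calculation, not part of the original slides):

import math

words      = 2 ** 24              # 16M words
word_bits  = 16
total_bits = words * word_bits    # 2**24 x 2**4 = 2**28 bits
print(total_bits == 2 ** 28)      # True
print(total_bits // 8 == 2 ** 25) # True: 2**25 bytes = 32 Mbytes
print(math.log2(words))           # 24.0 -> 24 address lines are needed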
Example — RAM
We can use these cells to make a 4 x 1 RAM.
Since there are four words, ADRS is two lines.
Each word is only one bit, so DATA and OUT are one bit each.
Chip Organization
Example of a DRAM organization
Figure shows a typical organization of a 16-Mbit DRAM. In this case, 4 bits are read or written at a time.
Logically, the memory array is organized as four square arrays of 2048 by 2048 elements.
Various physical arrangements are possible. In any case, the elements of the array are connected by both horizontal (row) and vertical (column) lines.
Each horizontal line connects to the Select terminal of each cell in its row; each vertical line connects to the Data-In/Sense terminal of each cell in its column.
The row decoder decides which line gets activated; only one line can be activated at a time.
Address lines supply the address of the word to be selected. A total of log2 W lines are needed. In our example, 11 address lines are needed to select one of 2048 rows.
These 11 lines are fed into a row decoder, which has 11 input lines and 2048 output lines.
The logic of the decoder activates a single one of the 2048 outputs depending on the bit pattern on the 11 input lines (2^11 = 2048).
An additional 11 address lines select one of 2048 columns of 4 bits per column. Four data lines are used for the input and output of 4 bits to and from a data buffer.
On input (write), the bit driver of each bit line is activated for a 1 or 0 according to the value of the corresponding data line. On output (read), the value of each bit line is passed through a sense amplifier and presented to the data lines.
The row line selects which row of cells is used for reading or writing
Module Organization
Example:
Memory module of 256K words of 8 bits each
Figure shows how a memory module consisting of 256K 8-bit words could be organized. For 256K words, an 18-bit address is needed and is supplied to the module from some external source (e.g., the address lines of a bus to which the module is attached). The address is presented to eight 256K x 1-bit chips, each of which provides the input/output of one bit.
This organization works as long as the size of memory equals the number of bits per chip. In the case in which larger memory is required, an array of chips is needed.
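A rough sketch of the idea (hypothetical Python, not the actual hardware): the same 18-bit address is presented to all eight 256K x 1 chips, and chip i supplies bit i of the 8-bit word.

CHIPS = 8
WORDS = 2 ** 18                               # 256K words -> 18 address bits
chips = [[0] * WORDS for _ in range(CHIPS)]   # each chip stores one bit per word

def write_word(address, value):
    for i in range(CHIPS):                    # the same address goes to every chip
        chips[i][address] = (value >> i) & 1

def read_word(address):
    return sum(chips[i][address] << i for i in range(CHIPS))

write_word(0x2ABCD, 0xA5)
print(hex(read_word(0x2ABCD)))                # 0xa5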
Figure: a memory chip with 11 address lines, 8 data lines (8 bits), and 2 control signals.
How many address lines are needed to address each machine location in a 2048 X 4 memory chip?
This means a memory of 2048 words, where each word is 4 bits. To address 2048 words (2K, where K means 2^10 = 1024), you need 11 bits, and therefore 11 address lines. Put simply, without any bus multiplexing, the number of address lines required equals the number of bits needed to address every location in the memory.
4 bits, 4 data lines in and out
2^22 words, each 4 bits wide
22 address lines are required – yet the chip has only 11. To save pins, the address is multiplexed onto the eleven address pins (2^11 x 2^11 = 2^22).
A typical DRAM pin configuration is shown in Figure 5.4b, for a 16-Mbit chip organized as 4M x 4. There are several differences from a ROM chip. Because a RAM can be updated, the data pins are input/output. The write enable (WE) and output enable (OE) pins indicate whether this is a write or a read operation.
Because the DRAM is accessed by row and column, and the address is multiplexed, only 11 address pins are needed to specify the 4M row/column combinations (2^11 x 2^11 = 2^22 = 4M). The functions of the row address select (RAS) and column address select (CAS) pins were discussed previously. Finally, the no connect (NC) pin is provided so that there is an even number of pins.
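As a hedged illustration of that multiplexing (Python, not a datasheet): a 22-bit word address is split into an 11-bit row half and an 11-bit column half that are driven onto the same 11 pins, first with RAS and then with CAS.

ADDR_BITS = 22
HALF = 11

def multiplex(address):
    row    = (address >> HALF) & ((1 << HALF) - 1)  # presented while RAS is asserted
    column = address & ((1 << HALF) - 1)            # presented while CAS is asserted
    return row, column

address = 0x2ABCDE & (2 ** ADDR_BITS - 1)           # any 22-bit address
row, column = multiplex(address)
print(row, column, (row << HALF) | column == address)   # 1367 1246 True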
ROM
Acronym for Read Only Memory
Information in ROM can only be read (it cannot be written or modified) – it is programmed onto the ROM chips by the manufacturer
Used to store permanent programs, e.g.:
the bootstrap loader, i.e. a program used to start the operation of a computer when it is turned on
routines that check that the cable to the printer is connected
routines that interpret each key on the keyboard for the control unit
There are various kinds of ROM, among them PROM, EPROM and EEPROM. The difference lies in the ability to change the data in them using specific methods and equipment.
The following is a block diagram of a ROM with size 2^k x n (2^k words, each n bits long):
Note:
Why don’t ROM chips have read-write pins?
Figure: ROM (2^k x n) – k address lines (from the address bus), 2^k words of n bits each, n data lines (to the data bus).
8 bits, 8 data lines to be read out
2^20 words, each 8 bits wide
Figure 5.4a shows an example EPROM package, which is an 8-Mbit chip organized as 1M x 8. In this case, the organization is treated as a one-word-per-chip package. The package includes 32 pins, which is one of the standard chip package sizes. The pins support the following signal lines:
The address of the word being accessed. For 1M words, a total of 20 (2^20 = 1M) pins are needed (A0–A19).
The data to be read out, consisting of 8 lines (D0–D7).
The power supply to the chip (Vcc).
A ground pin (Vss).
A chip enable (CE) pin. Because there may be more than one memory chip, each of which is connected to the same address bus, the CE pin is used to indicate whether or not the address is valid for this chip.
The CE pin is activated by logic connected to the higher-order bits of the address bus (i.e., address bits above A19). The use of this signal is illustrated presently.
A program voltage (Vpp) that is supplied during programming (write operations).
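A hedged sketch of how CE could be derived from the higher-order address bits when several such chips share one address bus (illustrative Python; the pin roles follow the list above, the rest is assumed):

CHIP_ADDR_BITS = 20          # each chip decodes A0-A19 internally (1M words)

def chip_enable(address, chip_id):
    # Address bits above A19 select which chip's CE is asserted.
    return (address >> CHIP_ADDR_BITS) == chip_id

print(chip_enable(0x1FFFFF, 1))  # True: this address lies in chip 1's 1M-word range
print(chip_enable(0x1FFFFF, 0))  # False: chip 0 stays disabled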
3.3) External Memory
Memories that are separate from the main computer components
Also known as secondary memory
Examples of external memory are hard disks, diskettes, CD-ROMs, USB (universal serial bus) drives, etc.
External Memory (Cont.)
All external memory is non-volatile — data stored in internal memory (except ROM) is lost when the computer is turned off, but data stored in external memory is retained.
Used to store information (programs and data) that has not yet been, or has already been, processed by the computer system.
Consists of secondary storage devices such as disks and tapes; the information contained in them is accessed by the CPU through the I/O unit (via the bus).
Disk
The 2 most popular types:
Magnetic Disk
Flash Memory
Magnetic Disk
There are 2 kinds, i.e. hard disk and floppy disk.
A round platter made from metal or plastic and coated with a magnetic material
Disk Data Layout
Consists of tracks and sectors
– Track: 500 – 2000 tracks/surface
– Sector: 10 – 100 sector/track (fixed or varied)
Data are moved into/out of the disk in blocks (a sector or a track at a time)
The amount of data in every track is the same => data density varies
There are 2 types of read/write heads:
1. Fixed head: One head per track
2. Movable head: One head for all tracks
Disks are inserted into the disk drive.
There are 2 types of disk:
Removable disk: can be inserted into and taken out of the disk drive, e.g. the floppy disk.
Fixed disk: cannot be taken out of the disk drive, e.g. the hard disk.
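A back-of-the-envelope capacity calculation (Python with hypothetical geometry values chosen from the ranges above):

surfaces           = 4      # hypothetical geometry
tracks_per_surface = 1000   # within the 500-2000 range above
sectors_per_track  = 64     # within the 10-100 range above
bytes_per_sector   = 512

capacity = surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector
print(capacity, "bytes =", capacity // 2 ** 20, "Mbytes")   # 131072000 bytes = 125 Mbytes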
Disk Data Layout (cont.)
Flash Memory
2 types of flash memory: static and removable
A type of nonvolatile memory that can be erased electronically and rewritten, similar to EEPROM.
Most computers use flash memory to hold their startup instructions because it allows the computer to easily update its contents.
Flash memory chips also store data and programs on many mobile computers and devices such as smart phones, portable media players, PDAs, printers, digital cameras, automotive devices, pagers and digital voice recorders.
Flash Memory (Cont.)
Removable flash memory includes memory cards, USB flash drives, and PC Cards/ExpressCard modules
3.4) Cache Memory System
Cache memory is a special very high-speed memory. It is used to speed up memory access and to keep pace with the high-speed CPU. Cache memory is costlier than main memory or disk memory, but more economical than CPU registers. Cache memory is an extremely fast memory type that acts as a buffer between RAM and the CPU. It holds frequently requested data and instructions so that they are immediately available to the CPU when needed.
Cache memory is used to reduce the average time to access data from main memory. The cache is a smaller and faster memory which stores copies of the data from frequently used main memory locations. A CPU typically contains several independent caches, which store instructions and data separately.
Hierarchy List
Registers
L1 Cache
L2 Cache
Main memory
Disk cache
Disk
Optical
Tape
Memory Hierarchy – Diagram
So you want fast?
It is possible to build a computer which uses only static RAM (see later)
This would be very fast
This would need no cache
How can you cache cache?
This would cost a very large amount
Why do we need cache?
Memory is slow compared to the time required to execute instructions
Memory becomes more expensive as it gets faster
Compromise
Place a small amount of very high-speed memory to temporarily hold a portion of the memory being accessed.
Cache
Small amount of fast memory
Sits between normal main memory and CPU
May be located on CPU chip or module
Cache and Main Memory
Cache/Main Memory Structure
Example:
MM = 16 Mbytes
Memory block = 4 bytes
Memory = 4M blocks of 4 bytes each
Cache can hold 64 Kbytes
Cache is 16K lines of 4 bytes each
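Those numbers can be checked directly (Python, just verifying the arithmetic in the example):

MM_BYTES    = 16 * 2 ** 20    # 16 Mbytes of main memory
BLOCK_BYTES = 4
CACHE_BYTES = 64 * 2 ** 10    # 64 Kbytes of cache

print(MM_BYTES // BLOCK_BYTES == 4 * 2 ** 20)      # True: 4M blocks in main memory
print(CACHE_BYTES // BLOCK_BYTES == 16 * 2 ** 10)  # True: 16K lines in the cache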
Figure: the CPU connects to the cache, and the cache connects to main memory. Transfers between the CPU and the cache are in words; transfers between the cache and main memory are in blocks. The cache is divided into slots and main memory into blocks (or lines), with block size = slot size.
Cache Addressing
Where does cache sit?
Between processor and virtual memory management unit
Between MMU and main memory
Logical cache (virtual cache) stores data using virtual addresses
Processor accesses the cache directly, without going through the MMU
Cache access is faster, since it happens before MMU address translation
Different applications use the same virtual address space
Cache must be flushed on each context switch
Physical cache stores data using main memory physical addresses
Cache Memory (cont.)
A small and fast memory that is situated between the CPU and the MM (conceptually, not physically).
Reduces the difference of time of the memory cycle and the CPU processing time.
The memory cycle time is the time used to fetch an instruction or data from the main memory until it is usable.
With the technology available today, the CPU's processing time is far shorter than the memory cycle time.
Hence, a memory, which was called Cache, was introduced and it was situated between the CPU and the MM.
Typical Cache Organization
Cache connects to the processor via data, control, and address lines
Data and address lines also attach to buffers, which attach to the system bus through which main memory is reached
The way cache is manufactured makes its memory cycle time far shorter than that of main memory, but its cost per bit is much higher.
We would like the size of the cache to be small enough so that the overall average cost per bit is close to that of main memory alone and large enough so that the overall average access time is close to that of the cache alone.
What is the ideal size for the cache?
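One way to reason about this trade-off is the average access time; a minimal sketch with hypothetical timings (not figures from the slides):

def average_access_time(hit_ratio, t_cache, t_main):
    # Assumes a miss costs a cache probe plus a main-memory access.
    return hit_ratio * t_cache + (1 - hit_ratio) * (t_cache + t_main)

print(average_access_time(0.95, t_cache=2, t_main=50))  # 4.5 (time units, hypothetical)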
How does cache work?
When the CPU needs certain data or instruction, it will refer to the cache memory first.
If it is in cache, then data access will be fast.
Otherwise, it has to be fetched from the MM, and the data block that contains the required data or instruction is moved into the cache, in the hope that the next instruction/data required by the CPU is contained in the same block.
How Cache Works
Cache operation – overview
CPU requests contents of memory location
Check cache for this data
If present, get from cache (fast)
If not present, read required block from main memory to cache
Then deliver from cache to CPU
Cache includes tags to identify which block of main memory is in each cache slot
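A minimal sketch of that flow (hypothetical Python, one tag per slot, direct-mapped for simplicity):

BLOCK_SIZE  = 4
CACHE_SLOTS = 16
main_memory = list(range(1024))   # stand-in main memory contents
cache = {}                        # slot -> (tag, block of words)

def read(address):
    block_number = address // BLOCK_SIZE
    slot = block_number % CACHE_SLOTS
    tag  = block_number // CACHE_SLOTS
    if slot in cache and cache[slot][0] == tag:        # hit: deliver from cache (fast)
        block = cache[slot][1]
    else:                                              # miss: read block from main memory
        start = block_number * BLOCK_SIZE
        block = main_memory[start:start + BLOCK_SIZE]
        cache[slot] = (tag, block)                     # keep the block, hoping for reuse
    return block[address % BLOCK_SIZE]

print(read(123), read(123))   # 123 123 -- the second access is a cache hit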
Cache Read Operation – Flowchart
Figure 4.5 illustrates the read operation. The processor generates the read address (RA) of a word to be read. If the word is contained in the cache, it is delivered to the processor. Otherwise, the block containing that word is loaded into the cache, and the word is delivered to the processor. Figure 4.5 shows these last two operations occurring in parallel and reflects the organization shown in Figure 4.6, which is typical of contemporary cache organizations. In this organization, the cache connects to the processor via data, control, and address lines. The data and address lines also attach to data and address buffers, which attach to a system bus from which main memory is reached. When a cache hit occurs, the data and address buffers are disabled and communication is only between processor and cache, with no system bus traffic. When a cache miss occurs, the desired address is loaded onto the system bus and the data are returned through the data buffer to both the cache and the processor. In other organizations, the cache is physically interposed between the processor and the main memory for all data, address, and control lines. In this latter case, for a cache miss, the desired word is first read into the cache and then transferred from cache to processor
Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines/slots. Three techniques can be used: direct, associative and set-associative.
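For direct mapping the rule can be written down explicitly; a hedged sketch of how an address splits into tag, line, and word fields (Python, with bit widths chosen to match the 16-Mbyte / 64-Kbyte example earlier, i.e. 2 word bits, 14 line bits and 8 tag bits):

WORD_BITS, LINE_BITS = 2, 14   # 4-byte blocks, 16K cache lines, 24-bit addresses

def direct_map(address):
    word = address & (2 ** WORD_BITS - 1)
    line = (address >> WORD_BITS) & (2 ** LINE_BITS - 1)
    tag  = address >> (WORD_BITS + LINE_BITS)
    return tag, line, word

print(direct_map(0xABCDEF))    # (171, 13179, 3)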
Once the cache has been filled, when a new block is brought into the cache one of the existing blocks must be replaced, so a replacement algorithm is needed. What types of algorithm can be used?
Cache Replacement Algorithm
LRU – least recently used
FIFO – first in first out
LFU – least frequently used
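A minimal LRU sketch (illustrative Python for a tiny fully associative cache; not tied to any particular hardware):

from collections import OrderedDict

CAPACITY = 4
cache = OrderedDict()   # block number -> data; insertion order doubles as age

def access(block_number):
    if block_number in cache:
        cache.move_to_end(block_number)   # hit: mark as most recently used
        return "hit"
    if len(cache) >= CAPACITY:
        cache.popitem(last=False)         # full: evict the least recently used block
    cache[block_number] = "data"
    return "miss"

print([access(b) for b in (1, 2, 3, 4, 1, 5)])
# ['miss', 'miss', 'miss', 'miss', 'hit', 'miss'] -- accessing 5 evicts block 2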
Cache Design
Addressing
Size
Mapping Function
Replacement Algorithm
Write Policy
Block Size
Number of Caches
Although there are a large number of cache implementations, there are a few basic design elements that serve to classify and differentiate cache architectures.
Addressing – Almost all nonembedded processors, and many embedded processors, support virtual memory. In essence, virtual memory is a facility that allows programs to address memory from a logical point of view, without regard to the amount of main memory physically available. When virtual memory is used, the address fields of machine instructions contain virtual addresses. For reads…
Size does matter
Cost
More cache is expensive
Speed
More cache is faster (up to a point)
Checking cache for data takes time
Comparison of Cache Sizes
Processor | Type | Year of Introduction | L1 cache | L2 cache | L3 cache
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | — | —
PDP-11/70 | Minicomputer | 1975 | 1 KB | — | —
VAX 11/780 | Minicomputer | 1978 | 16 KB | — | —
IBM 3033 | Mainframe | 1978 | 64 KB | — | —
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | — | —
Intel 80486 | PC | 1989 | 8 KB | — | —
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | —
PowerPC 601 | PC | 1993 | 32 KB | — | —
PowerPC 620 | PC | 1996 | 32 KB/32 KB | — | —
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | —
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | —
IBM SP | High-end server/supercomputer | 2000 | 64 KB/32 KB | 8 MB | —
CRAY MTAb | Supercomputer | 2000 | 8 KB | 2 MB | —
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | —
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB
CRAY XD-1 | Supercomputer | 2004 | 64 KB/64 KB | 1 MB | —
Hit Ratio (L1 & L2)
For 8 Kbyte and 16 Kbyte L1 caches
Unified v Split Caches
One cache for data and instructions, or two: one for data and one for instructions
Advantages of unified cache
Higher hit rate
Balances load of instruction and data fetch
Only one cache to design & implement
Advantages of split cache
Eliminates cache contention between instruction fetch/decode unit and execution unit
Important in pipelining
Pentium 4 Cache
80386 – no on-chip cache
80486 – 8 KB, using 16-byte lines and a four-way set associative organization
Pentium (all versions) – two on chip L1 caches
Data & instructions
Pentium III – L3 cache added off-chip
Pentium 4
L1 caches (8 KB, 64-byte lines, four-way set associative)
L2 cache (feeding both L1 caches; 256 KB, 128-byte lines, 8-way set associative)
L3 cache on chip
Intel Cache Evolution
Problem | Solution | Processor on which feature first appears
External memory slower than the system bus. | Add external cache using faster memory technology. | 386
Increased processor speed results in external bus becoming a bottleneck for cache access. | Move external cache on-chip, operating at the same speed as the processor. | 486
Internal cache is rather small, due to limited space on chip. | Add external L2 cache using faster technology than main memory. | 486
Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache; the Prefetcher is stalled while the Execution Unit's data access takes place. | Create separate data and instruction caches. | Pentium
Increased processor speed results in external bus becoming a bottleneck for L2 cache access. | Create separate back-side bus that runs at higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache. | Pentium Pro
(same problem as above) | Move L2 cache on to the processor chip. | Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small. | Add external L3 cache. | Pentium III
(same problem as above) | Move L3 cache on-chip. | Pentium 4
Pentium 4 Block Diagram
Figure 4.18 provides a simplified view of the Pentium 4 organization, highlighting the placement of the three caches. The processor core consists of four major components:
Fetch/decode unit: Fetches program instructions in order from the L2 cache, decodes these into a series of micro-operations, and stores the results in the L1 instruction cache.
Out-of-order execution logic: Schedules execution of the micro-operations subject to data dependencies and resource availability; thus, micro-operations may be scheduled for execution in a different order than they were fetched from the instruction stream. As time permits, this unit schedules speculative execution of micro-operations that may be required in the future.
Execution units: These units execute micro-operations, fetching the required data from the L1 data cache and temporarily storing results in registers.
Memory subsystem: This unit includes the L2 and L3 caches and the system bus, which is used to access main memory when the L1 and L2 caches have a cache miss and to access the system I/O resources.
Pentium 4 Core Processor
Fetch/Decode Unit
Fetches instructions from L2 cache
Decode into micro-ops
Store micro-ops in L1 cache
Out of order execution logic
Schedules micro-ops
Based on data dependence and resources
May speculatively execute
Execution units
Execute micro-ops
Data from L1 cache
Results in registers
Memory subsystem (L2 cache and system bus)