Udara De Silva

A Comprehensive Guide to Understanding Cache Memory

Cache memory is akin to a chef’s handy pantry, dramatically speeding up data access for the CPU. It’s like having the essential ingredients at arm’s reach, eliminating the need to fetch them from a distant store (the main memory). This proximity significantly cuts down data retrieval time, turbocharging overall processing speed and efficiency.

The concept of cache was born in the 1960s with the IBM System/360 Model 85, marking a pivotal shift in computing. Initially, it was a small buffer holding a part of the main memory’s data, bridging the growing speed gap between fast CPUs and slower memory. Evolving over the decades, cache has become a multi-leveled, sophisticated system, perfectly illustrating the relentless quest for faster, more efficient computing. This innovative journey of cache memory is not just a technical advancement but a testament to human ingenuity in the digital era.

Cache memory is a small-sized type of volatile computer memory that provides high-speed data access to a processor and stores frequently used computer programs, applications, and data. There are three types of caches commonly employed in a computer system.

Fig. 1: Multi-level cache hierarchy (source)
Type | L1 | L2 | L3
Location | Integrated directly into the processor | Located either on the CPU or close to it on the processor chip | Usually on the processor chip, shared by all cores
Size | 2KB – 64KB | 256KB – 2MB | 4MB – 50MB
Speed | Fastest | Intermediate | Slower but faster than main memory
Purpose | Stores instructions and data that are immediately required by the CPU | Acts as an intermediary, storing data that is not as immediately needed but still important for speedy processing | Reduces the gap between the computing power of the processor and the slower RAM
Different types of cache memories in a computer system

1. Cache Mapping

Cache mapping governs how data is organized and accessed in a cache. At the heart of this process are cache lines (also called cache blocks), the basic units of data transfer between the cache and the main memory. Typically ranging from 32 to 64 bytes, cache lines leverage the principle of spatial locality: the likelihood that data close to recently accessed data will soon be accessed. By fetching and storing data in these blocks, rather than individual bytes, cache efficiency is significantly enhanced. There are three primary mapping methods.

  • Direct-Mapped Cache
    • Each main memory block maps to exactly one cache line
    • Example: Consider a cache with 128 lines and memory blocks numbered 0, 1, 2, and so on. Block 0 maps to cache line 0, block 128 also maps to cache line 0, block 1 maps to cache line 1, block 129 to cache line 1, and so forth
  • Fully Associative Cache
    • Any cache line can store data from any memory block
    • Example: If the cache has 128 lines, any of these lines can store data from any memory block. The cache controller must search the entire cache to find a specific block, using tags associated with each cache line.
  • Set Associative Cache
    • The cache is divided into sets, and each set contains several lines. A particular memory block can map to any line within a specific set
    • Example: In a 2-way set-associative cache with 128 lines, there would be 64 sets, each containing 2 lines. A memory block is first mapped to a set, and then it can be placed in any line within that set.

A direct-mapped cache has the simplest implementation but suffers more misses, because blocks that share the same line evict one another. A fully associative cache minimizes misses, but retrieving data requires searching the entire cache. A set-associative cache is a middle ground between the two: each memory block can be stored in any of a small number of lines within its set.
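
To make the arithmetic concrete, here is a small Python sketch (purely illustrative; the line count, way count, and block numbers are taken from the examples above, not from any particular processor) that computes where a memory block lands under the direct-mapped and set-associative schemes:

    NUM_LINES = 128  # cache size used in the examples above

    def direct_mapped_line(block_number):
        # Direct-mapped: each block has exactly one possible line.
        return block_number % NUM_LINES

    def set_associative_set(block_number, ways=2):
        # Set-associative: the cache is split into NUM_LINES // ways sets;
        # a block maps to one set and may occupy any line within it.
        num_sets = NUM_LINES // ways  # 64 sets for a 2-way cache
        return block_number % num_sets

    # Blocks 0 and 128 collide on line 0 in the direct-mapped cache...
    print(direct_mapped_line(0), direct_mapped_line(128))    # -> 0 0
    # ...and map to the same set in the 2-way cache, where they can coexist
    # because the set holds two lines.
    print(set_associative_set(0), set_associative_set(128))  # -> 0 0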

Fig. 2: Retrieving cached data from a direct-mapped cache (source). The first column (V) is a flag indicating whether the cache line is valid. If the tag is found in the cache and the V flag is set, it is a cache hit. The number of cache lines determines the bit width of the index (e.g., with 32 cache lines the index is 5 bits wide), and the number of bytes in a cache line determines the bit width of the offset.
Fig. 3: Retrieving cached data from a set-associative cache (source). Unlike a direct-mapped cache, each tag has more than one possible storage location (typically 2 or 4). Therefore, after a cache miss, the cache controller must determine which line to replace, as dictated by the replacement policy discussed in the next section.

In Fig. 2 and Fig. 3, there are additional flags, such as valid (V) and dirty (D), that indicate whether the cache line holds valid data and whether it needs to be written back to memory, respectively. These flags play an important role in maintaining cache coherency, which ensures that the data accessed by each processor in a multi-processor, multi-cache computer system reflects the most recently updated value. Cache coherency is explained in detail in section 3.
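
As a worked example of the address split shown in Fig. 2, the sketch below carves a byte address into tag, index, and offset for a hypothetical direct-mapped cache with 32 lines of 64 bytes each (sizes chosen to match the 5-bit index in the caption; real caches differ):

    NUM_LINES = 32   # 32 lines  -> index is 5 bits wide
    LINE_SIZE = 64   # 64 bytes  -> offset is 6 bits wide

    OFFSET_BITS = LINE_SIZE.bit_length() - 1  # 6
    INDEX_BITS = NUM_LINES.bit_length() - 1   # 5

    def split_address(addr):
        offset = addr & (LINE_SIZE - 1)                  # byte within the line
        index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)  # which line to check
        tag = addr >> (OFFSET_BITS + INDEX_BITS)         # stored with the line
        return tag, index, offset

    def is_hit(cache, addr):
        # `cache` is assumed to be a list of dicts with 'valid', 'dirty',
        # 'tag', and 'data' fields, mirroring the columns in Fig. 2.
        tag, index, _ = split_address(addr)
        line = cache[index]
        return line['valid'] and line['tag'] == tag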

2. Replacement Policy

Cache replacement policies are essential for managing the limited space in cache memory. When the cache is full and new data needs to be stored, these policies decide which existing data should be replaced. (Note: in a direct-mapped cache this decision is trivial, because there is exactly one cache line that can hold the missing memory block. In an associative cache, however, several lines could hold the new block, and the cache controller must decide which one to replace.) One of the most common policies is the Least Recently Used (LRU) algorithm. Below is a simplified pseudocode representation of the LRU cache replacement policy:
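
The sketch below uses Python in place of pseudocode; an OrderedDict stands in for whatever structure actually tracks recency of use.

    from collections import OrderedDict

    class LRUCache:
        """Minimal LRU cache: evicts the least recently used key when full."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()   # keys kept in order of last use

        def get(self, key):
            if key not in self.data:
                return None             # cache miss
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]

        def put(self, key, value):
            if key in self.data:
                self.data.move_to_end(key)
            elif len(self.data) >= self.capacity:
                self.data.popitem(last=False)  # evict the least recently used
            self.data[key] = value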

This pseudocode represents the core concept of LRU. In real-world implementations, more complex data structures (like hash tables combined with doubly-linked lists) are often used for efficient tracking and updating of the usage record.

Alternatives to LRU include FIFO (first in, first out) and LFU (least frequently used). FIFO is relatively simple to implement but can evict blocks that are still in active use. LFU maintains a counter that tracks how many times each block is accessed and replaces the block with the lowest count. The key difference is adaptability: LRU reacts quickly to changing access patterns because it only considers recent history, whereas LFU can take longer to adapt since its counts accumulate over time. Each policy has its strengths and ideal use cases, making them suitable for different scenarios in cache management.
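
For contrast, here is an equally simplified LFU sketch (illustrative only; ties are broken arbitrarily, and real implementations use far more efficient bookkeeping):

    class LFUCache:
        """Simplified LFU cache: evicts the key with the lowest access count."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.data = {}    # key -> value
            self.counts = {}  # key -> accumulated access count

        def get(self, key):
            if key not in self.data:
                return None            # cache miss
            self.counts[key] += 1      # usage history accumulates over time
            return self.data[key]

        def put(self, key, value):
            if key not in self.data and len(self.data) >= self.capacity:
                victim = min(self.counts, key=self.counts.get)  # least-used key
                del self.data[victim]
                del self.counts[victim]
            self.data[key] = value
            self.counts[key] = self.counts.get(key, 0) + 1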

3. Cache Coherency

Cache coherency, in the world of multi-core and multiprocessor computing, is akin to a symphony orchestra’s need for coordination. Just as each musician must be in tune and in time with the others, each processor’s cache in a computer system must be in sync, sharing a consistent view of the data. Let’s dive into the protocols that orchestrate this harmony.

  1. Write-Through Protocol:
    • In this protocol, all writes to the cache also write to the main memory.
    • Pros: Simplicity and ensures data in the cache and memory are always consistent.
    • Cons: Higher latency for write operations, because every write must also reach main memory.
  2. Write-Back Protocol:
    • Writes are done only in the cache; the data is written back to the main memory only when the line is replaced. (In Fig. 3, the dirty (D) flag serves this purpose: if it is set, the value must be written back to memory before the line is replaced.)
    • Pros: Faster write operations as they are done in the cache.
    • Cons: More complex, as it requires a mechanism to keep track of which cache lines have been modified (often using a dirty bit).
  3. MESI Protocol (Modified, Exclusive, Shared, Invalid):
    • A commonly used protocol for maintaining cache coherency in multi-core processors.
    • It categorizes the state of each cache line into one of four states:
      • Modified: The data is only in this cache and has been changed. It needs to be written back to the main memory.
      • Exclusive: The data is only in this cache but hasn’t been changed.
      • Shared: The data may be in other caches and is the same as the main memory.
      • Invalid: The data is invalid or has been updated in another cache.
    • Pros: Reduces the number of times data is written back to main memory, improving efficiency. A simplified sketch of these state transitions follows this list.
  4. MOESI Protocol (Modified, Owned, Exclusive, Shared, Invalid):
    • An extension of MESI with an additional ‘Owned’ state.
    • The ‘Owned’ state marks a cache line that is dirty (main memory is out of date) yet may also be present in other caches; the owning cache is responsible for supplying the data and eventually writing it back to main memory.
    • Pros: Can reduce bus traffic compared to MESI, because modified data can be shared directly between caches without first being written back to main memory.
  5. Directory-Based Coherence:
    • Instead of relying on a bus to keep track of cache states (as in MESI), a directory-based system uses a central directory that keeps track of which caches have a copy of the data.
    • Pros: Scalable to systems with a large number of processors.
    • Cons: Increased complexity and potential performance overhead due to directory maintenance.
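
Following up on the MESI protocol above, here is a heavily simplified Python sketch of how a single cache line’s state might change in response to local accesses and snooped remote accesses. It models only the state machine; a real controller also issues the bus transactions, write-backs, and data transfers that accompany each transition.

    from enum import Enum

    class State(Enum):
        MODIFIED = "M"   # only copy, dirty: must eventually be written back
        EXCLUSIVE = "E"  # only copy, clean (matches main memory)
        SHARED = "S"     # possibly in other caches, matches main memory
        INVALID = "I"    # not usable

    def on_local_read(state):
        if state is State.INVALID:
            # Miss: fetch the line. It becomes EXCLUSIVE if no other cache
            # holds it, SHARED otherwise; assume no sharer for illustration.
            return State.EXCLUSIVE
        return state  # M, E, and S all satisfy the read locally

    def on_local_write(state):
        # A local write always ends in MODIFIED; if the line was SHARED or
        # INVALID, the other copies must be invalidated first.
        return State.MODIFIED

    def on_remote_read(state):
        # Another core reads the same line (observed by snooping the bus).
        if state in (State.MODIFIED, State.EXCLUSIVE):
            # A MODIFIED line is flushed so the reader gets fresh data;
            # afterwards both copies are SHARED.
            return State.SHARED
        return state

    def on_remote_write(state):
        # Another core writes the line: our copy becomes stale.
        return State.INVALID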

4. Conclusion

Understanding the various aspects of cache memory is key to optimizing computing systems. From the small but swift L1 cache to the substantial L3 cache, each layer plays a significant role in data processing efficiency. By mastering cache coherency and implementing effective replacement policies, one can significantly boost a system’s performance. As technology advances, the importance of cache memory in computing remains a constant, offering exciting prospects for future enhancements.
