4-Way Set-Associative Cache
1. Overview
A 4-Way Set-Associative Cache is a cache memory organization where each set contains four cache lines. Compared to a 2-way set-associative cache, it provides even more placement flexibility, further reducing conflict misses and improving cache hit rates for workloads with frequent memory reuse.
Structure
- Cache Division: The cache is divided into multiple sets, each containing four cache lines.
- Address Breakdown: A memory address is divided into three parts:
- Tag: Identifies the specific block in memory.
- Set Index: Determines which set the data might reside in.
- Block Offset: Specifies the exact location within the cache line.
Operation
When a memory request is made:
- The Set Index part of the address is used to select a set.
- All four cache lines within the selected set are checked for a match using the Tag.
- If any line contains the requested data, it is a cache hit; otherwise, it is a miss, and the data is fetched from main memory.
Comparison with 2-Way Set-Associative Cache
| Feature | 2-Way Set-Associative Cache | 4-Way Set-Associative Cache |
|---|---|---|
| Cache Lines per Set | 2 | 4 |
| Placement Flexibility | Moderate (2 choices per set) | Higher (4 choices per set) |
| Conflict Misses | Low | Lower |
| Hardware Complexity | Moderate | Higher |
| Performance | Good for many access patterns | Better for high-conflict access patterns |
Benefits Over 2-Way Set-Associative Cache
- Further Reduced Conflict Misses: More options for placing memory blocks within a set reduces the chance of cache conflicts.
- Improved Cache Hit Rate: Especially beneficial for programs with repeated accesses to multiple memory addresses that map to the same set.
- Better Performance for Complex Workloads: Handles high-frequency accesses more efficiently without dramatically increasing the cache size.
2- Specifications of Our 4-Way Set Associative Cache:
| Parameter | Value (from code) |
|---|---|
| Word Size | 32 bits (per word) |
| Words per Block | 4 words |
| Block Size | 128 bits (16 bytes per block) |
| Number of Blocks | 64 (total cache lines across all sets & ways) |
| Associativity | 4-way (each set holds 4 cache lines) |
| Number of Sets | 16 sets (NUM_BLOCKS / NUM_WAYS) |
| Cache Size | 1 KB (64 blocks × 16 bytes = 1024 bytes) |
| Index Bits | 4 ($clog2(NUM_SETS) = 16 → 4) |
| Block Offset Bits | 2 ($clog2(WORDS_PER_BLOCK) = 4 → 2) |
| Tag Width | 26 bits |
| Valid Bit | 1 per cache line |
| Dirty Bit | 1 per cache line |
| Replacement Policy | PLRU (Pseudo-LRU), maintained as a single bit per set |
| Cache Line Format | {valid (1), dirty (1), tag (26), block (128 bits)} → total 156 bits per cache line |
| Data Storage | 2D array: cache[NUM_SETS][4] (set-indexed, 4 ways per set) |
3- Top-Level Diagram (4-Way vs Direct-Mapped):
Top level diagram almost remains the same, here is the brief overview:
Inputs:
-
req_type: 0 = Read, 1 = Write (same as direct-mapped).
-
req_valid: Indicates CPU request (same).
-
address [31:0]: Physical address from CPU. (Difference: address now splits into Tag=25 bits, Index=5 bits, Offset=2 bits).
-
data_in [31:0]: Data from CPU for write requests(same).
-
data_out_mem [127:0]: 128-bit block from main memory (same).
-
clk, rst: System clock and reset.
Outputs
-
data_out [31:0]: Word returned to CPU.(same)
-
done_cache: Request completed.(same)
-
dirty_block_out [127:0]: Block sent to memory on eviction.(same)
-
hit: Indicates tag match in either way (different from direct-mapped where only 1 tag check existed).
4- Datapath (4-Way Overview):
Similarities:
-
CPU sends requests (req_valid, req_type, address, data_in).
-
Cache Decoder still splits address into {Tag, Index, Block Offset}.
-
Cache Controller FSM still handles Compare → WriteBack → WriteAllocate → RefillDone states.
-
Main memory interface unchanged: 128-bit transfers.
Key Differences:
1- Tag Comparison:
Direct-Mapped → 1 tag check per set.
2-Way → two parallel comparators, one for each way.
4-Way → four parallel comparators, one for each way.
2- Cache Storage:
Direct-Mapped → cache[NUM_SETS] (one line per set).
2-Way → cache[NUM_SETS][2] (two lines per set).
4-Way → cache[NUM_SETS][4] (four lines per set).
3- Replacement Policy:
Direct-Mapped → No replacement needed (fixed slot).
2-Way → Pseudo-LRU (1 bit per set) decides which way to evict.
4-Way → Pseudo-LRU (1 bit per set) decides which way to evict.
4- Hit Signal:
Direct-Mapped → hit = valid && (tag == stored_tag).
2-Way → hit = (hit_way0 || hit_way1).
4-Way → hit = (hit_way0 || hit_way1 || hit_way2 || hit_way3).
5- Cache Controller (FSM Brief):
The FSM remains almost identical:
-
IDLE → wait for req_valid.
-
COMPARE →
If hit → serve read/write.
If miss + clean → fetch from memory.
If miss + dirty → write-back old block.
-
WRITE_BACK → send dirty block to memory.
-
WRITE_ALLOCATE → request new block from memory.
-
REFILL_DONE → write block into cache, update PLRU, return to CPU.
Module-by-Module Explanation:
1- cache_decoder:
#### Inputs:
- clk, address [31:0]
Outputs:
-
tag [25:0]
-
index [3:0]
-
blk_offset [1:0]
Function:
Splits the 32-bit CPU address into:
-
tag (upper bits): sent to all ways (contains 4 lines → way0, way1, way2, way3).comparators.
-
index (middle bits): selects the set (contains 4 bits to represent 16 sets)
-
block offset (lowest bits): selects word within block.
2- Cache controller:
Cache controler module is exactly same as direct mapped cache & 2-way set-associative cache.
3- Main Memory Interface:
main memory interface is also exactly the same as direct mapped cache & 2-way set-associative cache.
4- Cache Memory module:
Inputs:
- clk – system clock
- tag [TAG_WIDTH-1:0] – extracted tag from CPU address
- index [INDEX_WIDTH-1:0] – selects cache set
- blk_offset [OFFSET_WIDTH-1:0] – selects word within block
- req_type – 0 = Read, 1 = Write
- read_en_cache – enables cache read (CPU-side)
- write_en_cache – enables cache write (CPU-side)
- read_en_mem – enables memory read (refill)
- write_en_mem – enables memory write (eviction)
- data_in_mem [BLOCK_SIZE-1:0] – new block from memory
- data_in [WORD_SIZE-1:0] – single word from CPU
- refill – indicates block replacement/refill operation
Outputs
- dirty_block_out [BLOCK_SIZE-1:0] – evicted dirty block (to memory)
- hit – asserted if tag match in any way
- data_out [WORD_SIZE-1:0] – word returned to CPU on read hit
- dirty_bit – indicates dirty block present in current set
Internal Structures
- Cache Line Format –
{valid, dirty, tag, block_data} - cache array –
cache[NUM_SETS][4]→ 4 ways per set - PLRU array –
plru[NUM_SETS]→ 3-bit replacement info per set (tree-based PLRU) - cache_info_t struct (info0, info1, info2, info3) – Holds per-way signals: valid, dirty, tag, block, hit .
Functionality:
i- Hit/Miss Detection
For each set, the four ways are checked in parallel:
-
If the stored tag matches the CPU tag and valid bit = 1, that way asserts a hit.
-
If neither way hits, a miss occurs and replacement must be performed.
##### ii- Replacement Policy – PLRU
This design uses Pseudo-LRU (PLRU) instead of a true LRU to reduce hardware cost.
How PLRU Works in This Module:
- Each set maintains 3 PLRU bits (
plru[index].b1,plru[index].b2,plru[index].b3). - These bits form a binary tree that encodes the least-recently-used (LRU) way.
- Victim selection on miss:
b1decides which subtree to evict from:0→ go left subtree (ways 0–1)1→ go right subtree (ways 2–3)
- If left subtree chosen:
b2=0→ victim = way-0b2=1→ victim = way-1
- If right subtree chosen:
b3=0→ victim = way-2b3=1→ victim = way-3
- Update on access (hit or refill):
- When a way is accessed, the tree bits are updated so that this way becomes most recently used (MRU).
- The tree is flipped to point toward another way as the next replacement candidate.
This approximates true LRU with only 3 bits per set, instead of maintaining full access history.
Example Flow:
- CPU hits in way-0 → PLRU tree updates (
b1=1, b2=1) so that way-0 is marked MRU, and another way (way-1 or from the right subtree) becomes the next victim. - Next miss in that set → Victim is chosen by traversing the PLRU tree. For example, if
b1=1, b3=0, then way-2 will be evicted. - If CPU then hits in way-2 → PLRU tree updates (
b1=0, b3=1) so that way-2 is MRU, and another way (e.g., way-0, way-1, or way-3) becomes the next victim.
Thus, Tree-PLRU approximates true LRU with only 3 bits per set, instead of storing full history.
iii- Miss Handling
When a miss occurs:
- If any way is invalid → the block is refilled into the first invalid way
- If all 4 ways are valid → the PLRU victim way is chosen
- If the victim is clean → it is directly overwritten with the new block
- If the victim is dirty → the dirty block is written back to memory first (dirty_block_out), then the new block is refilled into that way
iv- Read and Write Operations
- On a Read Hit
if (hit) begin
data_out = cache[set][hit_way].block[blk_offset];
// update PLRU for this set
plru[set] = update_plru(plru[set], hit_way);
end
- On a Write Hit
if (hit) begin
cache[set][hit_way].block[blk_offset] = data_in;
cache[set][hit_way].dirty = 1'b1;
// update PLRU for this set
plru[set] = update_plru(plru[set], hit_way);
end
- On a Miss
if (miss) begin
if (exists_invalid_way(set)) begin
victim_way = get_invalid_way(set);
end else begin
victim_way = get_plru_victim(plru[set]);
end
if (cache[set][victim_way].dirty) begin
// write back dirty block to memory
dirty_block_out = cache[set][victim_way].block;
write_back_to_mem(tag, set, cache[set][victim_way]);
end
// refill from memory
cache[set][victim_way].block = read_block_from_mem(address);
cache[set][victim_way].valid = 1'b1;
cache[set][victim_way].dirty = is_write ? 1'b1 : 1'b0;
// update PLRU
plru[set] = update_plru(plru[set], victim_way);
end
Why PLRU is Efficient Here
- Low cost → Just one bit per set vs. full history bits in true LRU
- Good approximation → Ensures alternate ways are reused fairly
- Hardware friendly → Implemented as simple flip logic in always blocks
✅ This makes PLRU the ideal replacement policy for small, low-complexity 2-way associative caches like this one.
Testbench Documentation for cache_memory
Purpose
The testbench (cache_tb) is designed to verify and validate the functionality of the cache_memory module.
It systematically simulates read and write operations across multiple ways of a set-associative cache, verifying:
- Cache hit/miss detection
- Dirty bit handling (clean vs. dirty evictions)
- PLRU replacement policy operation during block eviction
- Correct refill from memory on misses
🔌 DUT Interface
Inputs
| Signal | Description |
|---|---|
clk |
Clock signal (10ns period) |
tag |
Tag field of the memory address |
index |
Index selecting a cache set |
blk_offset |
Word offset within the cache block |
req_type |
Operation type: 0 = Read, 1 = Write |
read_en_cache |
Read enable signal |
write_en_cache |
Write enable signal |
refill |
Asserted to load a block from memory on miss |
data_in_mem |
Full block data input from memory during refill |
data_in |
Word-level data input for write operations |
Outputs
| Signal | Description |
|---|---|
data_out |
Word read from cache |
hit |
High (1) if access is a hit, low (0) if miss |
dirty_bit |
Dirty status of the selected cache block |
dirty_block_out |
Block data to be written back to memory when evicted dirty |
done_cache |
Operation done flag (optional) |
✅ Test Cases
1. Read Hit
- Setup: Tag + index matches valid block in cache
- Action: Assert
read_en_cache = 1 - Expected:
hit = 1- Correct
data_outreturned dirty_bitunchanged- PLRU updated
2. Write Hit
- Setup: Matching tag + index entry exists
- Action: Assert
write_en_cache = 1with validdata_in - Expected:
hit = 1- Word updated in cache
dirty_bit = 1- PLRU updated
3. Read Miss (Clean Block in Set)
- Setup: Tag mismatch, chosen way is clean
- Action: Trigger
read_en_cache = 1, then assertrefill - Expected:
hit = 0- Block replaced with
data_in_mem dirty_block_outnot used- PLRU updated
4. Read Miss (Dirty Block Eviction)
- Setup: All ways valid, victim is dirty
- Action: Assert
read_en_cache = 1 - Expected:
hit = 0- Evicted block on
dirty_block_out - Refetched data replaces it
- PLRU updated
5. Write Miss (Clean Victim Way)
- Setup: Victim way is clean
- Action: Perform write after refill
- Expected:
- Initial
hit = 0 - Block refilled + word written
dirty_bit = 1- PLRU updated
6. Write Miss (Dirty Victim Way)
- Setup: Victim way is dirty
- Action: Write request triggers eviction
- Expected:
- Initial
hit = 0 dirty_block_outcarries evicted block- New block refilled + updated with
data_in dirty_bit = 1- PLRU updated
7. Compulsory Miss (Invalid Entry)
- Setup: Index not yet allocated
- Action: First read to that set
- Expected:
hit = 0- Line filled with
data_in_mem dirty_bit = 0- PLRU initialized
8. PLRU Replacement Behavior
- Setup: Fill all ways, then access subset
- Action: Trigger miss requiring eviction
- Expected:
- PLRU selects least-recently used way
- After replacement, PLRU state updated
🌟 Features of the Testbench
- Preloaded Cache Content: Controlled initialization for targeted tests
- Cycle-by-Cycle Verification: Uses
@(posedge clk)for stepwise operations - Readable Logs: Shows hit/miss, dirty bit, PLRU victim, and cache state
- Waveform-Friendly: Clear signal transitions for debugging