Blog – Udara De Silva

A Comprehensive Guide to Understanding Cache Memory

January 8, 2024 | udaradesilva | Tutorials

Cache memory is akin to a chef’s handy pantry, dramatically speeding up data access for the CPU. It’s like having the essential ingredients at arm’s reach, eliminating the need to fetch them from a distant store (the main memory). This proximity significantly cuts down data retrieval time, turbocharging overall processing speed and efficiency.

The concept of cache was born in the 1960s with the IBM System/360 Model 85, marking a pivotal shift in computing. Initially, it was a small buffer holding a part of the main memory’s data, bridging the growing speed gap between fast CPUs and slower memory. Evolving over the decades, cache has become a multi-leveled, sophisticated system, perfectly illustrating the relentless quest for faster, more efficient computing. This innovative journey of cache memory is not just a technical advancement but a testament to human ingenuity in the digital era.

Cache memory is a small-sized type of volatile computer memory that provides high-speed data access to a processor and stores frequently used computer programs, applications, and data. There are three types of caches commonly employed in a computer system.

Fig. 1: Multi-level cache hierarchy (source)

Type	L1	L2	L3
Location	Integrated directly into the processor	Located either on the CPU or close to it on the processor chip	Usually on the processor chip, shared by all cores
Size	2KB – 64KB	256KB – 2MB	4MB – 50MB
Speed	Fastest	Intermediate	Slower but faster than main memory
Purpose	Stores instructions and data that are immediately required by the CPU	Acts as an intermediary, storing data that is not as immediately needed but still important for speedy processing	Reduces the gap between the computing power of the processor and the slower RAM

Different types of cache memories in a computer system

1. Cache Mapping

Cache mapping governs how data is organized and accessed in a cache. At the heart of this process are cache lines (also called cache blocks), the basic units of data transfer between the cache and the main memory. Typically ranging from 32 to 64 bytes, cache lines leverage the principle of spatial locality: the likelihood that data close to recently accessed data will soon be accessed. By fetching and storing data in these blocks, rather than individual bytes, cache efficiency is significantly enhanced. There are three primary mapping methods.

Direct-Mapped Cache
- Each main memory block maps to exactly one cache line
- Example: Consider a cache with 128 lines and memory blocks numbered from 0, 1, 2, and so forth. The mapping would be such that block 0 maps to cache line 0, block 128 maps to cache line 0, block 1 to cache line 1, block 129 to cache line 1, and so on
Fully Associative Cache
- Any cache line can store data from any memory block
- Example: If the cache has 128 lines, any of these lines can store data from any memory block. The cache controller must search the entire cache to find a specific block, using tags associated with each cache line.
Set Associative Cache
- The cache is divided into sets, and each set contains several lines. A particular memory block can map to any line within a specific set
- Example: In a 2-way set-associative cache with 128 lines, there would be 64 sets, each containing 2 lines. A memory block is first mapped to a set, and then it can be placed in any line within that set.

Direct-mapped cache has a simpler implementation but results in more cache misses. A fully associative cache minimizes cache misses but retrieving data requires searching the entire cache. Set associative cache is a middle ground between direct-mapped and fully associative cache where there can be more than one possible line to store a block of memory.

Fig. 2: Retrieving cached data from direct mapped cache (source). The first column (i.e., V) is a flag to represent whether the cache line is valid or not. If the Tag is found in the cache and the V flag is true it is cache hit. The number of cache lines determines the bitwidth of the index (i.e., if there are 32 cache lines index is 5 bits wide). The number of bytes in a cache line decides the bit width of the offset.

Fig. 3: Retrieving cached data from set-associative cache (source). Unlike direct mapped cache, each tag has more than one possible storage location (typically 2 or 4). Therefore, after a cache miss, the cache controller needs to determine which line should be replaced. This is determined by the replacement policy discussed in the next section.

In Fig. 2 and Fig. 3, there are additional flags like valid (V) and dirty (D) that indicate if the cache line data is valid, and needs to be written back to memory. These flags play an important role in maintaining cache coherency which ensures the data accessed by each processor in a multi-processor multi-cache computer system has the most recently updated value. Cache coherency is explained in detail in section 4.

2. Replacement Policy

Cache replacement policies are essential for managing the limited space in cache memory. When the cache is full and new data needs to be stored, these policies decide which existing data should be replaced (P.S. In a direct-mapped cache, this is trivial because there is exactly one cache line that could store the missing memory block. However, in an associative cache, there can be more than one location that could store the new memory block and the cache controller needs to determine which line to be replaced). One of the most common policies is the Least Recently Used (LRU) algorithm. Below is a simplified pseudocode representation of the LRU cache replacement policy:

Initialize a cache with fixed size
Initialize a usage record, typically a list or a queue

Function accessData(data):
    if data is in cache:
        move data to the top of the usage record
    else:
        if cache is full:
            remove the least recently used data from cache
            remove the least recently used data from the usage record
        add data to cache
        add data to the top of the usage record

This pseudocode represents the core concept of LRU. In real-world implementations, more complex data structures (like hash tables combined with doubly-linked lists) are often used for efficient tracking and updating of the usage record.

Alternatives to LRU include FIFO: first in first out and LFU: least frequently used. Implementing the FIFO policy is relatively simple but may not be efficient in many cases. LFU maintains a counter to track the number of times each block is accessed and replaces the block with the least count. The difference between LRU and LFU is that LRU can adapt quickly to changing patterns of data access, as it only considers the most recent access history. LFU may take longer to adapt since it accumulates access counts over time. Each policy has its strengths and ideal use cases, making them suitable for different scenarios in cache management.

3. Cache Coherency

Cache coherency, in the world of multi-core and multiprocessor computing, is akin to a symphony orchestra’s need for coordination. Just as each musician must be in tune and in time with the others, each processor’s cache in a computer system must be in sync, sharing a consistent view of the data. Let’s dive into the protocols that orchestrate this harmony

Write-Through Protocol:
- In this protocol, all writes to the cache also write to the main memory.
- Pros: Simplicity and ensures data in the cache and memory are always consistent.
- Cons: Higher latency for write operations due to memory write.
Write-Back Protocol:
- Writes are only done in the cache. The data is written back to the main memory only when it’s replaced in the cache. (In Fig. 3, dirty (D) flag is used for this purpose. If it is set, the value should be written back to memory before replacing it)
- Pros: Faster write operations as they are done in the cache.
- Cons: More complex, as it requires a mechanism to keep track of which cache lines have been modified (often using a dirty bit).
MESI Protocol (Modified, Exclusive, Shared, Invalid):
- A commonly used protocol for maintaining cache coherency in multi-core processors.
- It categorizes the state of each cache line into one of four states:
  - Modified: The data is only in this cache and has been changed. It needs to be written back to the main memory.
  - Exclusive: The data is only in this cache but hasn’t been changed.
  - Shared: The data may be in other caches and is the same as the main memory.
  - Invalid: The data is invalid or has been updated in another cache.
- Pros: Reduces the number of times data is written back to main memory, and improves efficiency.
MOESI Protocol (Modified, Owned, Exclusive, Shared, Invalid):
- An extension of MESI with an additional ‘Owned’ state.
- The ‘Owned’ state indicates that the cache line is shared with other caches but also needs to be written back to the main memory.
- Pros: Can reduce bus traffic compared to MESI.
Directory-Based Coherence:
- Instead of relying on a bus to keep track of cache states (as in MESI), a directory-based system uses a central directory that keeps track of which caches have a copy of the data.
- Pros: Scalable to systems with a large number of processors.
- Cons: Increased complexity and potential performance overhead due to directory maintenance.

4. Conclusion

Understanding the various aspects of cache memory is key to optimizing computing systems. From the small but swift L1 cache to the substantial L3 cache, each layer plays a significant role in data processing efficiency. By mastering cache coherency and implementing effective replacement policies, one can significantly boost a system’s performance. As technology advances, the importance of cache memory in computing remains a constant, offering exciting prospects for future enhancements.

Digital Multiplier Architectures

September 30, 2023 | udaradesilva | Tutorials

In the realm of digital computing, the multiplier is one of the core functional units, performing a crucial task that underpins many complex computations. From image processing, and scientific computing, to neural network training, multipliers play a significant role, acting as the cornerstone of these applications. Today, we will delve into the intricate details of digital multiplier architectures, how they work, and how they are being refined to maximize performance.

Digital multipliers come in several architectures, each with its unique traits and application suitability. Let’s dig deeper into some of the most common ones:

Array Multipliers: Recognized for their simplicity, array multipliers use an array of AND and full adder gates for generating and accumulating partial products. These multipliers’ straightforward design makes them easy to implement, although they may not offer the fastest computation time. The architecture of the array multiplier as explained in”Computer Organization” by Hamacher et. al is shown in Fig. 1.

Figure 1: Array multiplier architecture [2]

Wallace Tree Multipliers: Going beyond simplicity, Wallace Tree multipliers minimize the number of sequential adding stages, thereby speeding up the computation process. The generation of partial products, followed by their reduction into two binary numbers using carry-save adders, is the secret to their speed.

Figure 2: Wallace Tree Multiplier (source: https://vlsiverify.com/verilog/verilog-codes/wallace-tree-multiplier )

Booth’s Multipliers: Known for their efficiency, Booth’s multipliers reduce the number of partial products by scrutinizing the multiplier operand bit by bit. Equipped with a powerful algorithm, Booth’s multipliers handle both positive and negative numbers efficiently, making them particularly useful for signed number multiplication.

Figure 3: Flow chart of the booth multiplier algorithm (source: https://github.com/shubhi704/Booth-Algorithm/tree/main)

References

https://electronics.stackexchange.com/questions/593893/difficulty-in-understanding-the-analysis-of-worst-case-signal-propagation-delay
Carl Hamacher, Zvonko Vranesic, Safwat Zaky, Computer Organization, 2004, ISBN 0-07-112214-4
https://vlsiverify.com/verilog/verilog-codes/wallace-tree-multiplier
https://github.com/shubhi704/Booth-Algorithm/tree/main

Getting Started With Xilinx Petalinux

January 23, 2022 | udaradesilva | Tutorials

Xilinx Petalinux is a toolset that allows you to customize, build and deploy an embedded Linux application in a supported Xilinx FPGA like Zynp Ultrascale+ MPSoC. The toolset is available only for Linux, therefore if you are a Windows user consider installing Hyper-V or Virtual Box and creating a Ubuntu virtual machine.

Petalinux toolset contains the following tools to build custom embedded Linux images:

Application, device driver and library generators and templates
Bootable system image builder
Debug agents
GCC tools
QEMU system simulator

Step 1: Build hardware description file

Linux communicates with the hardware using the technique called memory-mapped I\O. This requires a description of the addresses of each hardware IP (i.e. start address, address range). These details are passed on to Petalinux tools by means of a .xsa file (earlier known as .hdf file). The following tutorial describes in detail how to generate a .xsa file for a hardware design that contains only the processor system (PS).

https://xilinx.github.io/Embedded-Design-Tutorials/docs/2021.2/build/html/docs/Introduction/ZynqMPSoC-EDT/3-system-configuration.html

The above tutorial contains only the PS. I added an additional processor reset block and keep the AXI HPM0 FPD and AXI HPM1 FPD (The abbreviation stands for advanced extensible interface high-performance master<number> full power) busses activated as shown below. These AXI masters are used to communicate with your custom hardware IPs.

Fig 1: Hardware design with PS and reset block

In addition to the XSA file, you also need a board support package (BSP) to build a Linux image. The BSP is a collection of drivers customized to the hardware you are using. You can download the BSP from the below link if you are using a Xilinx board. Alternatively, you can generate your own as mentioned here.

https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/embedded-design-tools.html

Make sure the BSP you are downloading is the same version as your Petalinux tools version.

Step 2: Create Petalinux project

The Petalinux tools are accessible from the command line. Source the <Xilinx_Installation_Directory>/petalinux/settings.sh

The steps to build the Petalinux project and compile a simple hello world C application is provided below tutorial:

https://xilinx.github.io/Embedded-Design-Tutorials/docs/2021.2/build/html/docs/Introduction/ZynqMPSoC-EDT/6-build-linux-sw-for-ps.html

I also listed the commands that are required to create the Petalinux image below:

petalinux-create -t project -s <path to .bsp file> -n <project_name>

cd <project_name>

petalinux-config --get-hw-description=<path to .xsa file>

petalinux-build

cd images/linux

petalinux-package --boot --fsbl zynqmp_fsbl.elf --fpga <path to bitfile> --u-boot u-boot.elf

These steps will create the following files which need to be copied to the boot partition of the SD card. Once it is copied you can use that SD card to run Linux on your FPGA board

BOOT.BIN
u-boot.bin
boot.scr

Connect to the serial port of the FPGA board using the following settings to verify that Linux is running properly:

Fig 2: Serial port settings to connect to FPGA [1]

Step 3: Create a Linux application

Open Vitis IDE and create a new platform project. This would require you to provide the previously generated .xsa file. Select the OS as Linux and the application processor you intend to use (i.e. psu_coretexa53 in ZCU104)

Fig 3: Create new platform project in Vitis

Build the platform project using the hammer icon
Create an application project : File->New->Application Project
Select the platform we just created
Follow the instructions for the application project:
1. Select the platform we just created
2. Give an application name : hello_linux
3. Use the default domain : linux
4. Keep SYSROOT, rootfs, and kernel image empty
5. Select Linux Hello World template. Click Finish
6. Click hammer icon to build the application project

Step 4: Running the program on FPGA

Fig 4: Steps to run the C application on FPGA

Debugging programs can be done using the debug mode. Use the Run As -> Run configurations to enable Single Application Debug. This allows stepping through the program lines to debug the application.

( To upload the application program to the FPGA you need to have a TFTP server installed in your machine. Instructions to install a TFTP server can be found here )

Conclusion

This short tutorial describes how to build an embedded Linux image using Petalinux tools and run a simple C application on the FPGA

Matlab: SoC Builder Xilinx RFSoC ZCU111 Example

September 7, 2021 | udaradesilva | Tutorials

This guide is written for Matlab R2021a and Vivado 2020.1

Matlab SoC Builder is an add-on that allows creating system on chip (SoC) design for a target device. An SoC design includes both hardware and software design which is generated with the help of HDL coder and Embedded coder toolboxes. Following toolboxes needs to be installed to start using Matlab SoC Builder.

Figure 1: Toolboxes related to Soc Builder

Setting Up

Install the Matlab add-ons shown in Figure 1. Then open the “manage add-on” and click on the configure option of SoC Blockset Support Package for Xilinx Device. Follow the instructions to burn a bootable image to an SD card. Boot the RFSoC board with the SD card and test the connection. If the setup is successful the connection test will pass.

Step 1: Create a RFSoC project in Simulink

Step 2: Modify the example design [OPTIONAL]

Create Project in step 1 will generate an example design similar toFigure 3. The original example design uses the following configuration for the RF Data converter block

RF Interface : ADC & DAC 1×1 interface
Digital Interface : IQ
Samples per clock cycle: 2

The default settings are changed to the settings shown in Figure 4. The settings are chosen to create a hardware loopback between filtered DAC output and ADC input using the XM500 balun board. The default samples per clock cycle settings are changed from 2 to 4 to further reduce the stream clock frequency.

After modifying the RF Data Converter IP, rewiring and model changes are required to build the system. The main changes are removing complex to real/imaginary converters and add bit concatenations to [0 15], [16 31], [32 47] and [48 63]. The source is a 500 kHz LUT-based sinusoidal signal generator. The original example uses two sources to generate the required 2 samples. In the modified design two additional sources are added with the correct phase offset to generate 4 samples in the same cycle.

NOTE: After applying the new settings stream data width gets changed to 64. However, if I close the window (After OK) and open it again the stream data width shows as 32. This seems like only an issue in displaying because the correct values are used for simulation and model compilation.

Open the Model Explorer and change the adc1Data type appropriately. In the above configuration it should change from uint32 to uint64.

Figure 5: Model explore which containing model parameters

Step 3: Configure and Build

Figure 6: Steps to configure and build the model

The configure and build steps will ask you to review the task map and address map which you can auto map. This step will create a Vivado project and will run synthesis, implementation, and bit file generation. After the bit file is generated it can be loaded using the Load existing binaries option in Fig. 6. In Ubuntu, loading the binary has to be done after starting Matlab as the sudo user (This might be fixed in the future).

Simulating S-parameters from touchstone file (ADS)

August 19, 2021 | udaradesilva | Tutorials

RF vendors often provide S-parameter files (Touchstone format) for their components. These files can be used in ADS simulations to decide which component to use in your RF design. This tutorial explains how to import the touchstone file from the vendor website and generate the S-parameter plots.

Step 1: Download the S-parameter file

We will simulate the splitter from here which is MMIC power splitter operating in 1.7-3 GHz range.

Step 2: Import file to ADS

Open a new schematic in ADS and add N-Port S-Parameter file block from Data items. Use the browse option to select the downloaded touchstone file in the block properties.

Step 3: Add S-parameter simulation

Select the Simulation-S_parameters blocks and add SP block and 3 termG blocks. Connect the three “termG” blocks to the 3 ports of the splitter using wires. Change the start, stop and step frequencies to the operating frequency range of the device and then run the simulation.

Step 4: Add the measurments

The simulation results need to be shown in a rectangular plot which can be done as follows.

Summary

This simple example shows how to generate S-parameter plots from vendor-provided touchstones files in ADS. We used a 3-port power splitter in this example but the same method applies to N-port devices.

Launching My Personal Website

June 27, 2020 | udaradesilva | Events

Hello there! If you are reading this post, chances are you are on my Facebook or LinkedIn profiles and know me well already. In that case, you might not find anything new on this website, but I still appreciate that you decided to have a peek. Please drop a comment if you have any suggestions to improve the site.

I decided to launch a personal website on May 23rd, 2020, when Facebook decided to block my profile out of the blue. I guess it was a mistake because they returned my profile the next day. But it got me thinking, how simple it is for Facebook to control what I have to say. So I started working on the website right away. Like my other projects, I only worked on this website on weekends, so it took me a month to get it ready.

I also wanted to give a special thanks to Kasun Sameera from my Synopsys gang, who took the only headshot of me, which looks good enough to put on the website. These photos are from 2017 in Manali, India. Also, a special shout-out to my pretty wife Hasantha, for pushing me to complete the site.

What's next? A lot of changes... The site needs more changes in both functionality and graphics. Also I'm planning to add tech articles on RF Systems, Signal Processing, and Electromagnetism. I already write lessons on chiphackers.com, which is my website focused on Digital Design and Computer Architecture. So tech articles on those fields will not be posted here.

That's about it. Thank you again for visiting. 🙂

Your time is limited, so don't waste it living someone else's life. Don't be trapped by dogma - which is living with the results of other people's thinking. Don't let the noise of others' opinions drown out your own inner voice. And most important, have the courage to follow your heart and intuition.
Steve Jobs

UdaraDe Silva