FIELD
PROGRAMMABLE GATE
ARRAYS (FPGAS)
Dr Nick Brown
n. .ac.uk
Lots of abstraction in CPUs and GPUs
• Modern CPUs and GPUs execute very differently from the programmer's view of things
• This is because of many developments at the electronics level, whilst trying to maintain the traditional view of programming
[Figure: the programmer's view vs how it actually works (reservation stations etc.)]
• For performance and power efficiency, there is a lot to be said for unifying these – so that the electronics directly represents the application
So how can we do this?
• Develop a bespoke Application Specific Integrated
Circuit (ASIC)
• Chip is entirely specialised to the application and will likely be
very fast
• Chip is fixed, so cannot be changed once made
• Very expensive (> $1 million) to tape out the design at the fab – but once done, the per-chip cost is very low
• Relies on (very) large volumes to be viable, which is not realistic for our workloads
• Although there are some ASICs for AI workloads
• Field Programmable Gate Arrays (FPGAs)
• Chip is manufactured to be reconfigurable at the
electronics level
• Therefore entirely flexible; the electronics can be reprogrammed time and time again
• The per-chip cost is higher than for an ASIC, but as it is not fixed the chip can be reused without needing to tape anything out at the fab and incur that large cost
• However reconfigurability comes at a cost: it will run slower than an ASIC
Comparison between approaches
                      ASIC (full custom)  Semi-custom   Gate array   FPGA
Chip logic            Factory set         Factory set   Fixed        Programmable
Interconnectivity     Factory set         Factory set   Factory set  Programmable
Area                  Compact             Less compact  Moderate     Large
Performance           Highest             High          Moderate     Lowest
In-field flexibility  None                None          None         High
• Cost and in-field
flexibility mean that
hardware has become
much softer.
Where did FPGAs come from?
• Back in the early 1980s Ross Freeman and Bernard V. Vonderschmitt invented the first commercially viable FPGA
• They founded Xilinx, one of the two major FPGA companies in the world
• They based their design on the Look-Up Table (LUT)
• Accepts n binary inputs, produces one binary output, and stores a mask of 2^n bits
• This mask determines the output value for each permutation of inputs
Associate a flip-flop (memory that stores a single bit) with each LUT. This can store the LUT output, which can then be fed back round as an input. By associating some memory with the logic, a wider range of circuits can be represented by a single LUT.
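The mask-driven behaviour of a LUT can be modelled in a few lines of Python (a conceptual sketch, not FPGA tooling; the function name and mask encoding are illustrative):

```python
def lut(mask, inputs):
    """Model an n-input look-up table.

    mask: an integer whose 2^n bits give the output for each input
    permutation; inputs: a tuple of n bits (0 or 1). The input bits
    are packed into an index that selects one bit of the mask.
    """
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return (mask >> index) & 1

# A 2-input LUT configured as XOR: mask bit at index 0 is 0,
# indices 1 and 2 are 1, index 3 is 0, i.e. 0b0110.
xor_mask = 0b0110
print([lut(xor_mask, (a, b)) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

Reconfiguring the "electronics" here is just changing the mask: the same structure implements AND (mask 0b1000), OR (0b1110), or any other 2-input function.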
So what is an FPGA?
• A number of LUTs are combined into a Configurable Logic Block (CLB)
• Eight on modern FPGAs
• Block RAM (BRAM) is very fast on-chip memory (approx. 40 TB/s aggregate bandwidth), similar to L1 cache and accessible in approx. 1 cycle
• But expensive, so limited to a few MB per FPGA
• Digital Signal Processing
(DSP) blocks which are ASIC
style components that perform
addition and multiplication
• Very useful for floating point
operations
• Very large number of I/O
connections
• To the host via PCIe, other FPGA
cards, on-board DRAM, or High
Bandwidth Memory (HBM2)
Block RAM (BRAM)
• BRAM typically provides 18 or 36 Kilobits of storage per block and
is (very) fast on-chip random-access memory
• Typically a few hundred to a thousand blocks per FPGA
• Operations complete in a single clock cycle (approx. 40 TB/s bandwidth)
• A maximum of two ports (two concurrent reads/writes)
• Single-ported block: one access per clock cycle
• Dual-ported block: two accesses per clock cycle
• A First In First Out (FIFO) queue, used for moving data around and buffering it
• Lots of uses on the FPGA
• As read/write cache storage for temporary
data
• Read only storage for constant data
• Buffered queues for communication between
different parts of the FPGA
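The FIFO use case above can be sketched as a toy Python model (the class name and one-read/one-write-per-cycle behaviour are illustrative assumptions, loosely mirroring a dual-ported block):

```python
from collections import deque

class BramFifo:
    """Toy model of a BRAM-backed FIFO: fixed capacity, at most one
    write and one read per 'cycle', as on a dual-ported block."""
    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    def cycle(self, write=None):
        """One clock cycle: read a word if available, and optionally
        write one if there is space. Returns the word read, or None."""
        read = self.q.popleft() if self.q else None
        if write is not None and len(self.q) < self.depth:
            self.q.append(write)
        return read

fifo = BramFifo(depth=4)
out = [fifo.cycle(write=x) for x in range(3)]   # producer runs ahead
out += [fifo.cycle() for _ in range(3)]         # consumer drains the queue
print(out)  # [None, 0, 1, 2, None, None]
```

This is exactly how buffered queues decouple dataflow stages on the FPGA: the producer and consumer run concurrently and the FIFO absorbs rate mismatches.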
Digital Signal Processing (DSP) slices
• ASIC-style arithmetic logic unit (ALU) embedded into the FPGA, composed of a chain of three hardware functions
• Add/subtract, multiplier, and add/subtract/accumulate – all very common in many workloads, including digital signal processing
• Between a few hundred and ten thousand of these in an FPGA
• Tooling automatically maps applicable operations to the DSP slices; these operations can also be implemented directly in the LUTs, but that takes up a significant number of them
• Very common for floating point arithmetic, where DSP
slices are leveraged extensively
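The three-stage chain (pre-add/subtract, multiply, accumulate) can be sketched as a single function; the port names a, d, b below are illustrative, not vendor-specific:

```python
def dsp_mac(a, d, b, acc, pre_subtract=False):
    """Sketch of a DSP-slice datapath: a pre-adder/subtractor feeds a
    multiplier, whose result is accumulated, i.e. acc + (a ± d) * b."""
    pre = a - d if pre_subtract else a + d
    return acc + pre * b

# Accumulating a sum of (a+d)*b terms, one per 'cycle':
acc = 0
for a, d, b in [(1, 2, 3), (4, 1, 2)]:
    acc = dsp_mac(a, d, b, acc)
print(acc)  # (1+2)*3 + (4+1)*2 = 19
```

In hardware all three stages are pipelined, so one new result emerges per clock cycle once the pipeline is full.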
A wide variety of FPGAs
• A large high-end FPGA:
• 1.08 million LUTs
• 4.5MB of on-chip BRAM
• Plus 30MB of slower on-chip memory
• 9024 DSP slices
• 8GB HBM2
• 32GB DDR4 DRAM
• A small low-end FPGA:
• 3840 LUTs
• 27KB of on-chip BRAM
• 8 DSP slices
How do we get performance?
• FPGAs run at much lower clock rates than CPUs or GPUs
• High-end FPGAs are between 250MHz and 500MHz
• Rent's rule
• The ratio between the number of external connections to a logic block and the number of gates in that block
• The FPGA ratio is an order of magnitude greater than that of CPUs or GPUs
• Meaning much higher bandwidth between the
compute and data
• Massive on-chip concurrency
• A huge amount of independent (configurable) logic that you can make behave exactly as you want, all running at the same time
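Rent's rule is commonly written T = t·g^p, relating terminal (pin) count T to gate count g, with fitted constants t and p; this formulation and the example numbers below are standard illustrations, not figures from the slide:

```python
def rent_terminals(gates, t, p):
    """Rent's rule, T = t * g^p: estimated external connections for a
    block of `gates` gates, with Rent coefficient t and exponent p."""
    return t * gates ** p

# Illustrative only: a higher Rent exponent p means the pin count
# (and hence the bandwidth into the block) grows much faster with size.
for p in (0.5, 0.75):
    print(round(rent_terminals(10_000, t=4, p=p)))
```

The slide's point is that FPGA fabric sits at a much higher connections-to-gates ratio than a CPU or GPU, so each piece of logic can be fed with proportionally more data.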
Relies upon dataflow programming
• There are no pre-set instructions, you create specific blocks
of logic which contain application logic
• All stages running concurrently, streaming data to the next
stage each clock cycle
• Each of these might contain nested dataflow stages, e.g. for floating
point arithmetic
• In reality might have hundreds of stages
• Will ultimately get mapped to FPGA resources
[Figure: example dataflow pipeline – stages 1, 2a/2b, 3a/3b and 4 connected in sequence, fed by I/O and DRAM, with internal memory (BRAM) buffering between stages]
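The streaming style above can be mimicked with Python generators, each stage consuming values from the previous one (a conceptual analogy only; the stage names and operations are made up, and on the FPGA all stages genuinely run concurrently):

```python
def source(n):
    # Stage producing a stream of values (e.g. read from DRAM).
    yield from range(n)

def scale(stream, factor):
    # A compute stage: transforms each value as it flows past.
    for x in stream:
        yield x * factor

def offset(stream, delta):
    # Another compute stage, chained onto the previous one.
    for x in stream:
        yield x + delta

# Values flow through the whole chain; in hardware each stage would
# be separate logic, advancing one value per clock cycle.
pipeline = offset(scale(source(5), 2), 1)
print(list(pipeline))  # [1, 3, 5, 7, 9]
```

The key difference from the software version: Python evaluates these lazily on one core, whereas the synthesised pipeline has every stage physically active on every cycle.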
Power efficiency
• FPGAs are generally accepted as highly power efficient
• The far lower clock rate compared with CPUs or GPUs
• No need for things like branch prediction, precise exceptions, re-
order buffers
• Increasing the clock rate tends to
imply an increase in the voltage
• Pdyn = C·V²·f
• C is switched load capacitance
• V is the voltage
• f is the clock frequency
• So increasing the voltage is bad for
dynamic power!
• Increasing the chip area in use (for a CPU the number of cores, for us the amount of FPGA in use) scales power far more linearly
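Plugging numbers into Pdyn = C·V²·f shows why the quadratic voltage term dominates; the capacitance, voltage, and frequency values below are illustrative, not measurements:

```python
def dynamic_power(c, v, f):
    """Dynamic power P = C * V^2 * f (capacitance in farads,
    voltage in volts, frequency in Hz; result in watts)."""
    return c * v**2 * f

# A low-clocked, low-voltage operating point (FPGA-like):
base = dynamic_power(c=1e-9, v=0.9, f=300e6)
# Doubling the frequency AND raising the voltage to sustain it:
fast = dynamic_power(c=1e-9, v=1.2, f=600e6)
print(round(fast / base, 2))  # 3.56: ~3.6x the power for 2x the clock
```

Doubling the clock alone would only double the power; it is the voltage increase needed to support the higher clock that pushes the cost well beyond 2x.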
Power efficiency
• Below compares the NekBone proxy app on a Xeon Platinum Cascade Lake CPU and an Alveo U280 FPGA
• Two and a half times less power draw and over ten times more power efficient
Description     Performance (GFLOPS)  Power usage (Watts)  Power efficiency (GFLOPS/Watt)
1 CPU core      5.38                  65.16                0.08
24 CPU cores    65.74                 176.65               0.37
1 FPGA kernel   74.29                 45.61                1.63
2 FPGA kernels  146.94                52.47                2.80
4 FPGA kernels  289.02                71.98                4.02
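The efficiency column can be re-derived directly from the first two (this snippet just recomputes the table's own GFLOPS/Watt figures):

```python
# NekBone results from the table: (GFLOPS, Watts) per configuration.
results = {
    "1 CPU core":     (5.38, 65.16),
    "24 CPU cores":   (65.74, 176.65),
    "1 FPGA kernel":  (74.29, 45.61),
    "2 FPGA kernels": (146.94, 52.47),
    "4 FPGA kernels": (289.02, 71.98),
}
for name, (gflops, watts) in results.items():
    print(f"{name}: {gflops / watts:.2f} GFLOPS/Watt")
```

Note that even a single FPGA kernel beats the full 24-core CPU on both raw performance and efficiency, and the FPGA kernels scale close to linearly.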
• However FPGAs are not particularly good at idle power
• The FPGA logic is sitting there ready to go, and CPUs/GPUs tend to have more advanced low-power states than FPGAs, which are not really designed for idleness
• In the above, the CPU idles at around 18 Watts, whereas the FPGA idles at 30 Watts unconfigured and 58 Watts configured with 4 kernels
How do we program FPGAs?
• Ultimately, it’s configuring the LUTs to represent different gates and connecting all of these together
• So configuration is at the gate level
• However, thankfully we as humans don’t need to go down to that level!
• The basic language of FPGAs is a Hardware
Description Language (HDL) such as VHDL or
Verilog
• Basically the assembly code of FPGAs
• Can translate higher level languages such as C or C++
into this HDL
Simple multiplexer VHDL example
ENTITY my_module is
    port (clock, I0, I1, I2, I3, A, B : in std_logic;
          Q : out std_logic);
end my_module;

ARCHITECTURE my_module_behaviour of my_module is
    type my_state is (Idle, CheckA, CheckB, AssgnQ);
    signal sel : integer range 0 to 3;
    signal state_reg : my_state := Idle;
begin
    process (clock)
    begin
        if rising_edge(clock) then
            if (state_reg = Idle) then
                sel <= 0;
                state_reg <= CheckA;
            elsif (state_reg = CheckA) then
                if (A = '1') then sel <= sel + 1; end if;
                state_reg <= CheckB;
            elsif (state_reg = CheckB) then
                if (B = '1') then sel <= sel + 2; end if;
                state_reg <= AssgnQ;
            elsif (state_reg = AssgnQ) then
                case sel is
                    when 0 => Q <= I0;
                    when 1 => Q <= I1;
                    when 2 => Q <= I2;
                    when 3 => Q <= I3;
                end case;
                state_reg <= Idle;
            end if;
        end if;
    end process;
end my_module_behaviour;

[Figure: 4-to-1 multiplexer (MUX) – data inputs I0–I3, select inputs A and B, output Q]
• Just intended to give a flavour of VHDL
• VHDL/Verilog are the classic ways of programming FPGAs, but synthesising C/C++ down to this level is now very popular and successful
• Finite state machines are very common at the HDL level
Converting it all to the gate level
• Need to convert our design down to the component level on the FPGA for it to execute
• Effectively means we need to figure out which components should be used and how to connect them together optimally
• Happily, there is CAD tooling so we don’t need to do this manually!
Converting it all to the gate level
• Synthesis and mapping:
• Creates a netlist design from your code (e.g. LUTs, BRAM, DSPs)
• Place:
• The design is cleaned, and optimal placing of components on the FPGA determined
• Route:
• Routes between the various logic blocks are created
• Timing/validation:
• The design is validated, e.g. timing analysis is performed
• Generation of the bitstream:
• The binary design file used to physically (re)configure the FPGA
• Often iterative, and it can be a challenge to meet timing
Compile times are a major challenge
• Results in the logic distributed and connected on the chip
• Determining the optimal placement and routing can be extremely time consuming
• Compile times have always been a challenge
• Even for simple designs a couple of hours is common
• For complex designs a few days is not uncommon
• This is alien to many software developers, who often expect binaries in a matter of seconds or minutes
• Makes it far harder to do fast refinement of code and simply try things out to see if they work
Emulation to the rescue for development
• Supports the execution of FPGA designs on a simulated FPGA
• Avoids the need for much of the time-consuming compilation
• Software emulation
• Run the code on the CPU in software; fast to compile and run, but only picks up a fraction of errors and gives no idea about real-world performance
• Cycle-accurate hardware emulation
• Run the HDL on a cycle-accurate FPGA simulator; longer to compile (a few minutes), picks up 99.9% of errors, and gives an estimate of the performance on the FPGA, although this can be somewhat inaccurate and is no substitute for running on hardware
(Some) Traditional uses of FPGAs
• High frequency trading
• Radio astronomy
• Test and measurement equipment
• Military and munitions
• Space
• Telecommunications
• Audio synthesisers
Why for HPC and AI/ML workloads?
• Power efficiency is an obvious benefit
• Can have reduced/arbitrary precision arithmetic
• Such as any-precision fixed point, which results in significant performance advantages at reduced power
• High performance implementations of binary neural networks
• Many HPC codes are (somewhat) memory bound
• FPGAs rely on the streaming of data
• By reorganising algorithms as dataflow, we can keep the compute fed with continually streaming data
Memory bound codes?
Description     Performance (GFLOPS)  Power usage (Watts)  Power efficiency (GFLOPS/Watt)
1 CPU core      5.38                  65.16                0.08
24 CPU cores    65.74                 176.65               0.37
1 FPGA kernel   74.29                 45.61                1.63
2 FPGA kernels  146.94                52.47                2.80
4 FPGA kernels  289.02                71.98                4.02
• On 24 CPU cores: 65.74 GFLOPS – only 12.2 times faster than one CPU core
Predictability
• Because your code is being run directly by the hardware, FPGAs are far more predictable
• Once you finish your design the tooling also reports:
• Latency of every CLB
• Latencies between CLBs
• Number of cycles required for each high-level operation and how much pipelining there is
• Count the number of floating-point operations running per cycle in your algorithm, multiply by the number of cycles, and this is your expected theoretical performance
• With the previous example our theoretical performance (4 kernels) was 321 GFLOPS and we were achieving 289 GFLOPS
• On the CPU/GPU this is complicated by the number of floating-point units available and other micro-architecture considerations
Change of mindset needed by developers
• No instructions; unless you create some yourself
• Really care about keeping the logic running concurrently and fed with data
• Must rewrite your algorithms dataflow style to achieve this
• Your algorithm must be capable of this, and by doing so you can solve a problem with CPU/GPU-level performance
• Immaturity of tooling (e.g.
detailed profilers) that many would expect, although things are improving
• Hardware has traditionally been a closed eco-system; it is opening up, but slowly
• Requires software licences for much of the tooling
• Lack of (open source) libraries
Existing FPGA based HPC machines
• Noctua, a Cray CS500 with 32 Intel Stratix 10 FPGAs
• Cygnus, with 64 Intel Stratix 10 FPGAs and 320 Tesla GPUs
• Amazon’s AWS F1 instances with Xilinx UltraScale Plus FPGAs
• FPGA testbed hosted by EPCC for the ExCALIBUR exascale programme
Future hardware developments
• Xilinx have released their Versal architecture, available in 2022
• Contains their AI engines, which are optimised for linear algebra operations and will potentially provide a significant performance uplift
• Xilinx is being bought by AMD for $35 billion, with the acquisition on-going
• Intel are releasing their Agilex next-generation FPGA architecture and some new model Stratix 10 FPGAs
• Some of these have interesting configurations, such as the provision of non-volatile memory with the FPGA
Conclusions
• Enables us to execute our code directly at the electronics level
• Not a one-stop solution, but instead suited to solving different code-level issues than CPUs/GPUs; a mix of technologies is ultimately likely to be successful in the exascale era
• Data flows through the pre-defined hardware vs the instruction/data fetching of CPUs/GPGPUs
• Massive on-chip concurrency combined with (very) high bandwidth connections
• FPGAs use less energy than CPUs or GPGPUs and are already very important for specific applications (e.g. embedded, IoT)
• Yet to be proven for HPC/AI workloads, but exciting potential here