
FIELD PROGRAMMABLE GATE ARRAYS (FPGAs)

Dr Nick Brown

Lots of abstraction in CPUs and GPUs
• Modern CPUs and GPUs execute very differently from the programmer's view of things
• This is because of lots of developments at the electronics level, whilst trying to maintain the traditional view of programming

[Diagram: the programmer's view of execution vs how it actually works in hardware, e.g. reservation stations for out-of-order execution]

• For performance and power efficiency, there is a lot to be said for unifying these views – so that the electronics directly represents the application

So how can we do this?
• Develop a bespoke Application Specific Integrated Circuit (ASIC)
  • Chip is entirely specialised to the application and will likely be very fast
  • Chip is fixed, so cannot be changed once made
  • Very expensive (> $1 million) to tape out a design at the fab – but once done, the per-chip cost is very low
  • Relies on (very) large volumes to be viable, which is not realistic for our workloads
    • Although there are some ASICs for AI workloads

• Field Programmable Gate Arrays (FPGAs)
  • Chip is manufactured to be reconfigurable at the electronics level
  • Therefore entirely flexible: the electronics can be reprogrammed time and time again
  • The per-chip cost is higher than an ASIC, but as it is not fixed the chip can be reused without needing to tape anything out at the fab and incur that large cost
  • However, reconfigurability comes at a cost, as designs will run slower than on an ASIC

Comparison between approaches

                       ASIC (full custom)   Semi custom    Gate array     FPGA
  Chip logic           Factory set          Factory set    Fixed          Programmable
  Interconnectivity    Factory set          Factory set    Factory set    Programmable
  Area                 Compact              Less compact   Moderate       Large
  Performance          Highest              High           Moderate       Lowest
  In-field flexibility None                 None           None           High

• Cost and in-field flexibility mean that hardware has become much softer.

Where did FPGAs come from?
• In the early 1980s Ross Freeman and Bernard V. Vonderschmitt invented the first commercially viable FPGA
  • This was the founding of Xilinx, one of the two major FPGA companies in the world
• They based their work on the Look Up Table (LUT)
  • Accepts n binary inputs, produces one binary output, and stores a mask of 2^n bits
  • This mask determines the output value for each permutation of inputs

Associate a Flip Flop (memory that stores a single bit) with each LUT. This can store the LUT output, which is then fed back around as an input. By associating some memory with the logic, a wider range of circuits can be represented by a single LUT.

So what is an FPGA?
• A number of LUTs are combined into a Configurable Logic Block (CLB)
  • Eight on modern FPGAs
• Block RAM (BRAM) is very fast on-chip memory (approx. 40 TB/s aggregate bandwidth), similar to L1 cache and accessible in approx. 1 cycle
  • But expensive, so limited to a few MB per FPGA
• Digital Signal Processing (DSP) blocks, which are ASIC-style components that perform addition and multiplication
  • Very useful for floating point operations
• Very large number of I/O connections
  • To the host via PCIe, other FPGA cards, on-board DRAM, or High Bandwidth Memory (HBM2)

Block RAM (BRAM)
• BRAM typically provides 18 or 36 Kilobits of storage per block and is (very) fast on-chip random-access memory
  • Typically a few hundred to a thousand blocks per FPGA
  • Operations complete in a single clock cycle (approx. 40 TB/s aggregate bandwidth)
  • A maximum of two ports (two concurrent reads/writes)

[Diagram: a single-ported block allows one access per clock cycle; a dual-ported block allows two accesses per clock cycle; a First In First Out (FIFO) queue is used for moving data around and buffering it]

• Lots of uses on the FPGA
  • As read/write cache storage for temporary data
  • Read-only storage for constant data
  • Buffered queues for communication between different parts of the FPGA

Digital Signal Processing (DSP) slices
• An ASIC-style arithmetic logic unit (ALU) embedded into the FPGA, composed of a chain of three hardware functions
  • Add/subtract, multiplier, and add/subtract/accumulate – all very common in many workloads, including digital signal processing
• Between a few hundred and ten thousand of these in an FPGA
• Tooling automatically maps applicable operations to the DSP slices; these operations can also be implemented directly in the LUTs, but that takes up a significant number of them
• Very common for floating point arithmetic, where DSP slices are leveraged extensively

A wide variety of FPGAs

A large, HPC-grade FPGA:
• 1.08 million LUTs
• 4.5MB of on-chip BRAM
  • Plus 30MB of slower on-chip memory
• 9024 DSP slices
• 8GB HBM2
• 32GB DDR4 DRAM

A small, embedded FPGA:
• 3840 LUTs
• 27KB of on-chip BRAM
• 8 DSP slices

How do we get performance?
• FPGAs run much slower than CPUs or GPUs
  • High end FPGAs are between 250MHz and 500MHz
• Rent's Rule
  • The ratio between the number of external connections to a logic block and the number of gates in that block
  • The FPGA ratio is an order of magnitude greater than CPUs or GPUs
  • Meaning much higher bandwidth between the compute and the data
• Massive on-chip concurrency
  • A huge amount of independent (configurable) logic that you can make behave exactly as you want, all running at the same time

Relies upon dataflow programming
• There are no pre-set instructions; you create specific blocks of logic which contain application logic
• All stages run concurrently, streaming data to the next stage each clock cycle
  • Each of these might contain nested dataflow stages, e.g. for floating point arithmetic
• In reality a design might have hundreds of stages
• These will ultimately get mapped to FPGA resources

[Diagram: a dataflow pipeline – I/O feeds Stage 1, which fans out to Stages 2a and 2b, then Stages 3a and 3b, converging in Stage 4; DRAM and internal memory (BRAM) buffer data between stages]

Power efficiency
• FPGAs are generally accepted as highly power efficient
  • Due to the far lower clock rate compared with CPUs or GPUs
  • No need for things like branch prediction, precise exceptions, or reorder buffers
• Increasing the clock rate tends to imply an increase in the voltage
  • P_dyn = C · V² · f
    • C is the switched load capacitance
    • V is the voltage
    • f is the clock frequency
  • So increasing the voltage is bad for dynamic power!
• Increasing the chip area in use (on a CPU the number of cores, for us the amount of FPGA in use) scales far more linearly

Power efficiency
• Below compares the NekBone proxy app on a Xeon Platinum Cascade Lake CPU and an Alveo U280 FPGA
  • Two and a half times less power draw and over ten times more power efficient

  Description      Performance (GFLOPS)   Power usage (Watts)   Power efficiency (GFLOPS/Watt)
  1 CPU core       5.38                   65.16                 0.08
  24 CPU cores     65.74                  176.65                0.37
  1 FPGA kernel    74.29                  45.61                 1.63
  2 FPGA kernels   146.94                 52.47                 2.80
  4 FPGA kernels   289.02                 71.98                 4.02

• However, FPGAs are not particularly good at idle power
  • FPGA logic is sitting there good to go, and CPUs/GPUs tend to have more advanced low power states than FPGAs, which are not really designed for idleness
  • In the above, the CPU idles at around 18 Watts, whereas the FPGA idles at 30 Watts unconfigured and 58 Watts configured with 4 kernels

How do we program FPGAs?
• Ultimately, it's configuring the LUTs to represent different gates and connecting all of these together
  • So configuration is at the gate level
  • However, thankfully we as humans don't need to go down to that level!
• The basic language of FPGAs is a Hardware Description Language (HDL) such as VHDL or Verilog
  • Basically the assembly code of FPGAs
  • Higher level languages such as C or C++ can be translated (synthesised) into HDL

Simple multiplexer VHDL example

-- Library clauses added for completeness; required for std_logic
library ieee;
use ieee.std_logic_1164.all;

ENTITY my_module is
  port (clock, I0, I1, I2, I3, A, B : in std_logic;
        Q : out std_logic);
end my_module;

ARCHITECTURE my_module_behaviour of my_module is
  type my_state is (Idle, CheckA, CheckB, AssgnQ);
  signal sel : integer range 0 to 3;
  signal state_reg : my_state := Idle;
begin
  process (clock)
  begin
    if rising_edge(clock) then
      if (state_reg = Idle) then
        sel <= 0;
        state_reg <= CheckA;
      elsif (state_reg = CheckA) then
        if (A = '1') then sel <= sel + 1; end if;
        state_reg <= CheckB;
      elsif (state_reg = CheckB) then
        if (B = '1') then sel <= sel + 2; end if;
        state_reg <= AssgnQ;
      elsif (state_reg = AssgnQ) then
        case sel is
          when 0 => Q <= I0;
          when 1 => Q <= I1;
          when 2 => Q <= I2;
          when 3 => Q <= I3;
        end case;
        state_reg <= Idle;
      end if;
    end if;
  end process;
end my_module_behaviour;

[Diagram: a 4-to-1 multiplexer (MUX) with data inputs I0–I3, select inputs A and B, and output Q]

• Just intended to give a flavour of VHDL
• VHDL/Verilog are the classic ways of programming FPGAs, but synthesising C/C++ down to this level is now very popular and successful
• Finite state machines are very common at the HDL level

Converting it all to the gate level
• Need to convert our design down to the component level on the FPGA for it to execute
• Effectively means we need to figure out which components should be used and how to connect them together optimally
• Happily, there is CAD tooling so we don't need to do this manually!

Converting it all to the gate level
• Synthesis and mapping:
  • Creates a netlist design from your code (e.g. LUTs, BRAM, DSPs)
• Place:
  • The design is cleaned, and optimal placing of components on the FPGA determined
• Route:
  • Routes between the various logic blocks are created
• Timing/Validation:
  • The design is validated, e.g. timing analysis is performed
• Generation of the bitstream:
  • The binary design file used to physically (re)configure the FPGA
• Often iterative, and it can be a challenge to meet timing

Compile times are a major challenge
• The result is the logic distributed and connected on the chip
• Determining the optimal placement and routing can be extremely time consuming
• Compile times have always been a challenge
  • Even for simple designs a couple of hours is common
  • For complex designs a few days is not uncommon
• This is alien to many software developers, who often expect binaries in a matter of seconds or minutes
  • Makes it far less possible to quickly refine code and give it a go to see if things work

Emulation to the rescue for development
• Supports the execution of FPGA designs on a simulated FPGA
  • Avoids the need for much of the time-consuming compilation
• Software emulation
  • Run the code on the CPU in software; fast to compile and run, but only picks up a fraction of errors and gives no idea about real-world performance
• Cycle accurate hardware emulation
  • Run the HDL on a cycle-accurate FPGA simulator; longer to compile (a few minutes), picks up 99.9% of errors, and gives an estimate of the performance on the FPGA, although this can be somewhat inaccurate and is no substitute for running on hardware

(Some) Traditional uses of FPGAs
• High frequency trading
• Radio astronomy
• Test and measurement equipment
• Military and munitions
• Space
• Telecommunications
• Audio synthesisers

Why for HPC and AI/ML workloads?
• Power efficiency is an obvious benefit
• Can have reduced/arbitrary precision arithmetic
  • Such as any-precision fixed point, which results in significant performance advantages at reduced power
  • High performance implementations of binary neural networks
• Many HPC codes are (somewhat) memory bound
  • FPGAs rely on the streaming of data
  • By reorganising algorithms as dataflow, we can keep the compute fed with continually streaming data

Memory bound codes?
(Performance table repeated from the earlier "Power efficiency" slide)
• On 24 CPU cores: 65.74 GFLOPS – only 12.2 times faster than one CPU core

Predictability
• Because your code is being run directly by the hardware, FPGAs are far more predictable
• Once you finish your design the tooling also reports:
  • The latency of every CLB
  • The latencies between CLBs
  • The number of cycles required for each high-level operation and how much pipelining there is
• Count the number of floating-point operations running per cycle in your algorithm, multiply by the number of cycles, and this is the expected theoretical performance
  • With the previous example our theoretical performance (4 kernels) was 321 GFLOPS and we were achieving 289 GFLOPS
• On the CPU/GPU this is complicated by the number of floating-point units available and other micro-architecture considerations

Change of mindset needed by developers
• No instructions; unless you create some yourself
• Really care about keeping the logic running concurrently and fed with data
  • Must rewrite your algorithms dataflow style to achieve this
  • Your algorithm must be capable of this, and by doing so you can solve a problem with CPU/GPU-level performance
• Immaturity of tooling (e.g.
detailed profilers) that many would expect, although things are improving
• Hardware has traditionally been a closed eco-system; it is opening up, but slowly
• Requires software licences for much of the tooling
• Lack of (open source) libraries

Existing FPGA based HPC machines
• Noctua, a Cray CS500 with 32 Intel Stratix 10 FPGAs
• Cygnus, with 64 Intel Stratix 10 FPGAs and 320 Tesla GPUs
• Amazon's AWS F1 instances with Xilinx UltraScale Plus FPGAs
• The FPGA testbed hosted by EPCC for the ExCALIBUR exascale programme

Future hardware developments
• Xilinx have released their Versal architecture, available in 2022
  • Contains their AI engines, which are optimised for linear algebra operations and will potentially provide a significant performance uplift
  • Xilinx has been bought out by AMD for $35 billion, with the acquisition on-going
• Intel are releasing their Agilex next-generation FPGA architecture and some new model Stratix 10 FPGAs
  • Some of these have interesting configurations, such as the provision of non-volatile memory with the FPGA

Conclusions
• FPGAs enable us to execute our code directly at the electronics level
• Not a one-stop solution, but instead suited to solving different code-level issues than CPUs/GPUs; a mix of technologies is ultimately likely to be successful in the exa-scale era
• Data flows through the pre-defined hardware, versus the instruction/data fetching of CPUs/GPGPUs
• Massive on-chip concurrency combined with (very) high bandwidth connections
• FPGAs use less energy than CPUs or GPGPUs and are already very important for specific applications (e.g. embedded, IoT)
• Yet to be proven for HPC/AI workloads, but there is exciting potential here