fpt Licensing

top_left_banner SimCon logo

FPGAs and Parallel Programming

FPGAs and Multi-core Machines

FPGAs are Field programmable Gate Arrays. They have a far higher density of programmable gates than more conventional computing systems, but a lower clock speed. They offer enormous potential for fine-grain parallel computing, where the results of individual computations are almost immediately available to other computing elements. This is quite unlike the coarse-grain parallel computing used in modern supercomputers, where there are significant delays in transferring data between different computing elements.

Multi-core PC chips are now widely available. New generations of parallel chips have been developed with as many as 64 parallel cores in a closely coupled system. These devices are far less parallel than FPGAs but are faster and may offer similar scope for fine-grain parallelism.

Parallel Algorithms

Algorithms for fine-grain parallel computing were developed by a member of SimCon in 1983 for the Applied Dynamics International AD10 simulation computer. This machine was by far the fastest real-time simulation system available in the mid 1980s, and was used in the design of Space Shuttle, comunications satellites, nuclear powerplant, helicopters and a range of military hardware. The algorithms were implemented as a stand-alone optimiser and instruction scheduler. They are applicable to Fortran or C programming for FPGAs and multicore systems. Research was carried out in collaboration with Liverpool Hope University and The University of Cape Town to apply the algorithms to these devices.

Hydra - The First Implementation

"Hydra", the first implementation of the parallel algorithmes for an FPGA is a pipeline of processes made up of:

WinFPT: WinFPT replaces separate variable lives with different variables (For a description of variables lives, please see the description of this issue in the research on physical units and dimensions. Essentially, variables which are assigned more than once are replaced by separate variables, each with only a single assignment). WinFPT expands sub-programs in-line and unrolls DO loops. It cuts long expressons into "triples" - expressions of the form a = b + c, a = b * c etc.;
WinFPT API: Code written in the WinFPT API outputs the triples in a format appropriate for the parallel optimiser;
Hydra Optimiser: The Hydra optimiser, derived from the original system for the AD10, schedules the instructions in an optimised parallel sequence;
Hydra Image Builder: The image builder constructs the load image for the FPGA.

The FPGA program consists of an interpreter which reads the load image and executes groups of operations in parallel. The FPGA program is data driven. It is a single generic code which will execute any program. Compilation of a Fortran code through the pipeline does not require any further coding within the FPGA.

The Hydra pipeline is described in detail in the paper "APPRASE: Automatic parallelisation of Fortran to run on an FPGA" presented at the SCSC 2008 conference in Edinburgh.

Fynbos - The Working System

"Fynbos", is a successful working system. The FPGA code was developed by Jane Wyngaard at The University of Cape Town. It is described in Wyngaard J, Inggs M, Collins J and Farrimond B, 2013, "Towards a many-core architecture for HPC", 23rd International Conference on Field Programmable Logic and Applications (FPL). The software pipeline is largely as described above, except that the image builder is now integrated into the optimiser code. Experiments were carried out in the detailed architecture of the FPGA engine. In particular, division is expensive in FPGA program space, and the number of divisions in a program tends to be small. Therefore division was implemented in only a sub-set of the system cores.

The Next Step - Fixed-point Programming

FPGAs are most efficiently programmed in fixed point arithmetic. Real numbers are represented by scaled fractions. Each real value, x, is assigned a maximum absolute value, scale(x). x is represented internally by an integer sfx which is calculated as sfx = 2**(n-1) * x / scale(x) where n is the number of bits used in the scaled fraction. Usually, n would be 32 bits, but in an FPGA this is not necessarily the case.

The AD10 was also a fixed-point machine. A system was developed to maintain the coefficients and data structures required to support the scaled-fraction arithmetic. For example, where:

a = b + c

and a, b and c have different scales, a set of coefficients must be generated to correct the values so that the effective calculation would be:

A = ( B + C * ( scale(c) / scale(b) ) ) * ( scale(b) / scale(a) )

where A, B and C are the scaled fractions representing a, b and c. This system will be applied in the FPGA programs.