HLS Coding Style: Arrays and Data Types

 

Data Types for Efficient Hardware

C-based native data types are all on 8-bit boundaries (8, 16, 32, 64 bits). RTL buses
(corresponding to hardware) support arbitrary data lengths.
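
To illustrate the mismatch, the following plain-C sketch emulates an N-bit bus by masking a wider native type. The helper names wrap_to_width and add_17bit are hypothetical and not part of any Vivado HLS header. In an HLS design, the arbitrary precision types from ap_cint.h (such as the int32 type used later in this article) would express the width directly.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low `bits` bits of a value,
   as an RTL bus of that width would. Assumes 1 <= bits <= 32. */
uint32_t wrap_to_width(uint32_t value, unsigned bits) {
    if (bits >= 32) {
        return value;
    }
    return value & ((UINT32_C(1) << bits) - 1u);
}

/* A 17-bit adder has to live in a 32-bit native type in C;
   the upper 15 bits are wasted unless the width is tracked by hand. */
uint32_t add_17bit(uint32_t a, uint32_t b) {
    return wrap_to_width(a + b, 17);
}
```

With a native 32-bit type the adder would be synthesized 32 bits wide; tracking the real width by hand, as above, is error-prone, which is the motivation for arbitrary precision types.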

Arrays

If you specify a very large array, it might cause C simulation to run out of memory and fail,
as shown in the following example:

#include "ap_cint.h"
int i, acc = 0;
// Use an arbitrary precision type
int32 la0[10000000], la1[10000000];
for (i = 0; i < 10000000; i++) {
    acc = acc + la0[i] + la1[i];
}

A solution is to use dynamic memory allocation for simulation but a fixed-size array for
synthesis, as shown in the next example. The __SYNTHESIS__ macro, which Vivado HLS defines
only when synthesis is run, selects between the two. For simulation, the memory is allocated
on the heap, which is managed by the OS and can use local disk space to grow.

#include <stdlib.h>
#include "ap_cint.h"
int i, acc = 0;
#ifdef __SYNTHESIS__
// Use an arbitrary precision type & array for synthesis
int32 la0[10000000], la1[10000000];
#else
// Use an arbitrary precision type & dynamic memory for simulation
int32 *la0 = malloc(10000000 * sizeof(int32));
int32 *la1 = malloc(10000000 * sizeof(int32));
#endif
for (i = 0; i < 10000000; i++) {
    acc = acc + la0[i] + la1[i];
}
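
The fragment above omits the surrounding function, the allocation check, and the release of the memory. A minimal, self-contained sketch of the same idea might look as follows. It assumes int32_t from stdint.h as a stand-in for the int32 type of ap_cint.h, and it uses a smaller, hypothetical size LA_SIZE and function name sum_large_arrays so that it runs quickly in simulation.

```c
#include <stdint.h>
#include <stdlib.h>

#define LA_SIZE 1000000  /* assumed size, smaller than in the text */

/* Returns the sum over both arrays, or -1 if allocation fails. */
int64_t sum_large_arrays(void) {
    int64_t acc = 0;
    size_t i;
#ifdef __SYNTHESIS__
    /* Fixed-size arrays for synthesis */
    int32_t la0[LA_SIZE], la1[LA_SIZE];
#else
    /* Heap allocation for C simulation */
    int32_t *la0 = malloc(LA_SIZE * sizeof *la0);
    int32_t *la1 = malloc(LA_SIZE * sizeof *la1);
    if (la0 == NULL || la1 == NULL) {
        free(la0);
        free(la1);
        return -1;
    }
#endif
    /* Fill with known values so the result is well defined */
    for (i = 0; i < LA_SIZE; i++) {
        la0[i] = 1;
        la1[i] = 2;
    }
    for (i = 0; i < LA_SIZE; i++) {
        acc += la0[i] + la1[i];
    }
#ifndef __SYNTHESIS__
    free(la0);
    free(la1);
#endif
    return acc;
}
```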

Arrays are typically implemented as a memory (RAM, ROM or FIFO) after synthesis. As
discussed in Arrays on the Interface, arrays on the top-level function interface are
synthesized as RTL ports that access a memory outside the design. Arrays internal to the design are
synthesized to internal block RAM, LUTRAM, UltraRAM, or registers, depending on the
optimization settings.

Cases in which arrays can create issues in the RTL include:
• Array accesses can often create bottlenecks to performance. When implemented as a
memory, the number of memory ports limits access to the data.
• Array initialization, if not performed carefully, can result in undesirably long reset and
initialization in the RTL.
• Some care must be taken to ensure arrays that only require read accesses are
implemented as ROMs in the RTL.

Array Accesses and Performance

The following code example shows a case in which accesses to an array can limit
performance in the final RTL design.

#include "array_mem_bottleneck.h"
dout_t array_mem_bottleneck(din_t mem[N]) {
    dout_t sum = 0;
    int i;
    SUM_LOOP: for (i = 2; i < N; ++i)
        sum += mem[i] + mem[i-1] + mem[i-2];
    return sum;
}

Trying to pipeline SUM_LOOP with an initiation interval of 1 results in the following message
(after failing to achieve a throughput of 1, Vivado HLS relaxes the constraint):

INFO: [SCHED 61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 69] Unable to schedule 'load' operation ('mem_load_2',
bottleneck.c:62) on array 'mem' due to limited memory ports.
INFO: [SCHED 61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.

A dual-port RAM could be used, but this allows only two accesses per clock cycle. Three
reads are required to calculate the value of sum, so three accesses per clock cycle are
required to pipeline the loop with a new iteration every clock cycle.

This code can be rewritten as shown in the following code example to allow the loop to be
pipelined with a throughput of 1. By performing pre-reads and manually pipelining the data
accesses, only one array read is specified in each iteration of the loop. This ensures that
only a single-port RAM is required to achieve the performance.

#include "array_mem_perform.h"
dout_t array_mem_perform(din_t mem[N]) {
    din_t tmp0, tmp1, tmp2;
    dout_t sum = 0;
    int i;
    tmp0 = mem[0];
    tmp1 = mem[1];
    SUM_LOOP: for (i = 2; i < N; i++) {
        tmp2 = mem[i];
        sum += tmp2 + tmp1 + tmp0;
        tmp0 = tmp1;
        tmp1 = tmp2;
    }
    return sum;
}

Vivado HLS includes optimization directives for changing how arrays are implemented and
accessed. It is typically the case that directives can be used, and changes to the code are not
required. Arrays can be partitioned into blocks or into their individual elements. In some
cases, Vivado HLS partitions arrays into individual elements. This is controllable using the
configuration settings for auto-partitioning.
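
As a sketch of the directive-based alternative, the original three-read loop could be left unchanged and the array partitioned instead, so that the three reads land in different memories. The typedefs and the value of N below are assumptions standing in for array_mem_bottleneck.h, the function name is hypothetical, and the cyclic factor of 4 is only illustrative. Whether the tool actually reaches an initiation interval of 1 depends on the tool version and settings.

```c
/* Assumed stand-ins for the definitions in array_mem_bottleneck.h */
#define N 128
typedef int din_t;
typedef int dout_t;

dout_t array_mem_partitioned(din_t mem[N]) {
    /* Spread consecutive elements over 4 separate memories so that
       mem[i], mem[i-1] and mem[i-2] sit in different banks. */
#pragma HLS ARRAY_PARTITION variable=mem cyclic factor=4 dim=1
    dout_t sum = 0;
    int i;
    SUM_LOOP: for (i = 2; i < N; ++i) {
#pragma HLS PIPELINE II=1
        sum += mem[i] + mem[i-1] + mem[i-2];
    }
    return sum;
}
```

A standard C compiler ignores the HLS pragmas, so the function simulates exactly like the unpartitioned version. Note that partitioning an array on the top-level interface also changes the number of ports that are created.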

FIFO Accesses

Accesses to a FIFO must be in sequential order starting from location zero.

Arrays on the Interface

Vivado HLS synthesizes arrays into memory elements by default. When you use an array as
an argument to the top-level function, Vivado HLS assumes the following:
• Memory is off-chip.
  Vivado HLS synthesizes interface ports to access the memory.
• Memory is standard block RAM with a latency of 1.
  The data is ready 1 clock cycle after the address is supplied.

To configure how Vivado HLS creates these ports:
• Specify the interface as a RAM or FIFO interface using the INTERFACE directive.
• Specify the RAM as a single or dual-port RAM using the RESOURCE directive.
• Specify the RAM latency using the RESOURCE directive.
• Use array optimization directives (Array_Partition, Array_Map, or
Array_Reshape) to reconfigure the structure of the array and therefore, the number
of I/O ports.

Array Interfaces

The resource directive can explicitly specify which type of RAM is used, and therefore which
RAM ports are created (single-port or dual-port). If no resource is specified, Vivado HLS
uses:
• A single-port RAM by default.
• A dual-port RAM if it reduces the initiation interval or reduces latency.

#include "array_RAM.h"
void array_RAM (dout_t d_o[4], din_t d_i[4], didx_t idx[4]) {
    int i;
    For_Loop: for (i = 0; i < 4; i++) {
        d_o[i] = d_i[idx[i]];
    }
}

A single-port RAM interface is used because the for-loop ensures that only one element
can be read and written in each clock cycle. There is no advantage in using a dual-port RAM
interface.


If the for-loop is unrolled, Vivado HLS uses a dual-port RAM. Doing so allows multiple elements
to be read at the same time and improves the initiation interval. The type of RAM interface
can be explicitly set by applying the resource directive.
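
As a hedged illustration of setting the RAM type explicitly, the sketch below applies the RESOURCE directive as a pragma. The typedefs are assumptions standing in for array_RAM.h, the function name is hypothetical, and the core name RAM_1P_BRAM is assumed to be available in the tool version in use; check the core list of your release before relying on it.

```c
/* Assumed stand-ins for the definitions in array_RAM.h */
typedef int din_t;
typedef int dout_t;
typedef unsigned char didx_t;

void array_RAM_1p(dout_t d_o[4], din_t d_i[4], didx_t idx[4]) {
    /* Request a single-port block RAM interface for the input array */
#pragma HLS RESOURCE variable=d_i core=RAM_1P_BRAM
    int i;
    For_Loop: for (i = 0; i < 4; i++) {
        d_o[i] = d_i[idx[i]];  /* one read and one write per iteration */
    }
}
```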

 

FIFO Interfaces

Vivado HLS allows array arguments to be implemented as FIFO ports in the RTL. If a FIFO
port is to be used, be sure that the accesses to and from the array are sequential. Vivado
HLS determines whether the accesses are sequential.

#include "array_FIFO.h"
void array_FIFO (dout_t d_o[4], din_t d_i[4], didx_t idx[4]) {
    int i;
#pragma HLS INTERFACE ap_fifo port=d_i
#pragma HLS INTERFACE ap_fifo port=d_o
    // Breaks FIFO interface d_o[3] = d_i[2];
    For_Loop: for (i = 0; i < 4; i++) {
        d_o[i] = d_i[idx[i]];
    }
}

In this case, the behavior of variable idx determines whether or not a FIFO interface can be
successfully created.
• If idx is incremented sequentially, a FIFO interface can be created.
• If random values are used for idx, a FIFO interface fails when implemented in RTL.
Because this interface might not work, Vivado HLS issues a message during synthesis and
creates a FIFO interface.

@W [XFORM-124] Array 'd_i': may have improper streaming access(es).

Note: FIFO ports cannot be synthesized for arrays that are both read from and written to.
Separate input and output arrays must be created.

The following general rules apply to arrays that are implemented with a Streaming interface:

• The array must be written and read in only one loop or function. This can be
transformed into a point-to-point connection that matches the characteristics of FIFO
links.
• The array reads must be in the same order as the array write. Because random access is
not supported for FIFO channels, the array must be used in the program following first
in, first out semantics.
• The index used to read and write from the FIFO must be analyzable at compile time.
Array addressing based on run time computations cannot be analyzed for FIFO
semantics and prevents the tool from converting an array into a FIFO.
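
A minimal sketch of an access pattern that follows these rules is shown below. The typedefs and the function name are hypothetical. Each array is touched in a single loop, every element is accessed exactly once in ascending order starting from location zero, and the index is the loop counter, which is analyzable at compile time.

```c
/* Assumed stand-ins for the types in array_FIFO.h */
typedef int din_t;
typedef int dout_t;

void array_FIFO_seq(dout_t d_o[4], din_t d_i[4]) {
#pragma HLS INTERFACE ap_fifo port=d_i
#pragma HLS INTERFACE ap_fifo port=d_o
    int i;
    For_Loop: for (i = 0; i < 4; i++) {
        /* Read element i once, write element i once, both in order */
        d_o[i] = d_i[i] + 1;
    }
}
```

Compared with the array_FIFO example, the data-dependent index idx[i] is gone, so the order of accesses no longer depends on run time values.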

Array Initialization

RECOMMENDED: As discussed in Type Qualifiers, although not a requirement, Xilinx recommends
specifying arrays that are to be implemented as memories with the static qualifier. This not
only ensures that Vivado HLS implements the array with a memory in the RTL, it also allows
the initialization behavior of static types to be used.

In the following code, an array is initialized with a set of values. Each time the function is
executed, array coeff is assigned these values. After synthesis, each time the design
executes, the RAM that implements coeff is loaded with these values. For a single-port
RAM this takes 8 clock cycles. For an array of 1024 elements it would take 1024 clock
cycles, during which no operations that depend on coeff can occur.

int coeff[8] = {-2, 8, -4, 10, 14, 10, -4, 8};

The following code uses the static qualifier to define array coeff. The array is initialized
with the specified values at start of execution. Each time the function is executed, array
coeff remembers its values from the previous execution. A static array behaves in C code
as a memory does in RTL.

static int coeff[8] = {-2, 8, -4, 10, 14, 10, -4, 8};

In addition, if the variable has the static qualifier, Vivado HLS initializes the variable in the
RTL design and in the FPGA bitstream. This removes the need for multiple clock cycles to
initialize the memory and ensures that initializing large memories is not an operational
overhead.
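
The "remembers its values" behavior can be seen in plain C with a small sketch. The function moving_sum4 and its history array are hypothetical and only illustrate the semantics of a static array; they are not taken from the user guide.

```c
/* Hypothetical 4-sample moving sum. The static array keeps its
   contents between calls, as a RAM or register bank would in RTL. */
int moving_sum4(int sample) {
    static int history[4] = {0, 0, 0, 0};  /* initialized once, not per call */
    int i, sum = 0;

    /* Shift the stored samples by one position */
    for (i = 3; i > 0; i--) {
        history[i] = history[i - 1];
    }
    history[0] = sample;

    /* Sum the four most recent samples */
    for (i = 0; i < 4; i++) {
        sum += history[i];
    }
    return sum;
}
```

Without the static qualifier, history would be reset to zero on every call and the function would simply return its argument.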

Implementing ROMs

Vivado HLS does not require that an array be specified with the static qualifier to
synthesize a memory or the const qualifier to infer that the memory should be a ROM.
Vivado HLS analyzes the design and attempts to create the optimal hardware.


Xilinx highly recommends using the static qualifier for arrays that are intended to be
memories. As noted in Array Initialization, a static type behaves in an almost identical
manner as a memory in RTL.


The const qualifier is also recommended when arrays are only read, because Vivado HLS
cannot always infer that a ROM should be used by analysis of the design. The general rule
for the automatic inference of a ROM is that a local, static (non-global) array is written to
before being read. The following practices in the code can help infer a ROM:
• Initialize the array as early as possible in the function that uses it.
• Group writes together.
• Do not interleave array (ROM) initialization writes with non-initialization code.
• Do not store different values to the same array element (group all writes together in
the code).
• Element value computation must not depend on any non-constant (at compile-time)
design variables, other than the initialization loop counter variable.
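
The simplest way to follow the const recommendation is a read-only table declared static const, as in the sketch below. The function weighted_sum is hypothetical, and the coefficients reuse the first eight values from the coeff example in Array Initialization. Whether the tool maps the table to a ROM or to logic still depends on its size and the optimization settings.

```c
/* Hypothetical dot product against a constant coefficient table */
int weighted_sum(const int x[8]) {
    /* static const: never written, so it can be implemented as a ROM */
    static const int coeff[8] = {-2, 8, -4, 10, 14, 10, -4, 8};
    int i, acc = 0;

    for (i = 0; i < 8; i++) {
        acc += coeff[i] * x[i];
    }
    return acc;
}
```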


If complex assignments are used to initialize a ROM (for example, functions from the
math.h library), placing the array initialization into a separate function allows a ROM to be
inferred. In the following example, array sin_table[256] is inferred as a memory and
implemented as a ROM after RTL synthesis.

#include "array_ROM_math_init.h"
#include <math.h>
void init_sin_table(din1_t sin_table[256])
{
    int i;
    for (i = 0; i < 256; i++) {
        dint_t real_val = sin(M_PI * (dint_t)(i - 128) / 256.0);
        sin_table[i] = (din1_t)(32768.0 * real_val);
    }
}
dout_t array_ROM_math_init(din1_t inval, din2_t idx)
{
    short sin_table[256];
    init_sin_table(sin_table);
    return (int)inval * (int)sin_table[idx];
}

TIP: Because the calls to the sin() function evaluate to constant values, no core is required
in the RTL design to implement the sin() function.

 

Data Types

Standard Types

 

Composite Data Types

Vivado HLS supports composite data types for synthesis:
• struct
• enum
• union

 

Structs

 When structs are used as arguments to the top-level function, the ports created by
synthesis are a direct reflection of the struct members. Scalar members are implemented as
standard scalar ports and arrays are implemented, by default, as memory ports.

 In this design example, struct data_t is defined in the header file shown in the
following code example. This struct has two data members:
• An unsigned vector A of type short (16-bit).
• An array B of four unsigned char types (8-bit).

typedef struct {
    unsigned short A;
    unsigned char B[4];
} data_t;
data_t struct_port(data_t i_val, data_t *i_pt, data_t *o_pt);

In the following code example, the struct is used as both a pass-by-value argument (from
i_val to the return of o_val) and as a pointer (*i_pt to *o_pt).

#include "struct_port.h"
data_t struct_port(
    data_t i_val,
    data_t *i_pt,
    data_t *o_pt
) {
    data_t o_val;
    int i;
    // Transfer pass-by-value structs
    o_val.A = i_val.A + 2;
    for (i = 0; i < 4; i++) {
        o_val.B[i] = i_val.B[i] + 2;
    }
    // Transfer pointer structs
    o_pt->A = i_pt->A + 3;
    for (i = 0; i < 4; i++) {
        o_pt->B[i] = i_pt->B[i] + 3;
    }
    return o_val;
}

All function arguments and the function return are synthesized into ports as follows:
• Struct element A results in a 16-bit port.
• Struct element B results in a RAM port, accessing 4 elements.

 

Global Variables
Global variables can be freely used in the code and are fully synthesizable. By default, global
variables are not exposed as ports on the RTL interface.

The following code example shows the default synthesis behavior of global variables. It uses
three global variables.
• Values are read from array Ain.
• Array Aint is used to transform and pass values from Ain to Aout.
• The outputs are written to array Aout.

din_t Ain[N];
din_t Aint[N];
dout_t Aout[N/2];
void types_global(din1_t idx) {
    int i, lidx;
    // Move elements in the input array
    for (i = 0; i < N; ++i) {
        lidx = i;
        if (lidx + idx > N - 1)
            lidx = i - N;
        Aint[lidx] = Ain[lidx+idx] + Ain[lidx];
    }
    // Sum to half the elements
    for (i = 0; i < (N/2); i++) {
        Aout[i] = (Aint[i] + Aint[i+1]) / 2;
    }
}

By default, after synthesis, the only port on the RTL design is port idx. Global variables are
not exposed as RTL ports by default. In the default case:
• Array Ain is an internal RAM that is read from.
• Array Aout is an internal RAM that is written to.

 

Pointers

 

 

 

 

Reference:

1. Xilinx UG902

posted @ 2019-05-30 16:00  wordchao