Bambu: A Free Framework for the High-Level Synthesis of Complex Applications
Bambu is a free framework aimed at assisting the designer during the high-level synthesis of complex applications, supporting most of the C constructs (e.g., function calls and sharing of the modules, pointer arithmetic and dynamic resolution of memory accesses, accesses to arrays and structs, parameter passing either by reference or by copy, …). Bambu is developed for Linux systems, is written in C++, and can be freely downloaded under the GPL license.
Bambu receives as input a behavioral description of the specification, written in C, and generates as output the HDL description of the corresponding RTL implementation, which is compatible with commercial RTL synthesis tools, along with a test-bench for the simulation and validation of the behavior. Bambu is designed in an extremely modular way, implementing the different tasks of the HLS process, and specific algorithms, in distinct C++ classes which work on different IRs depending on the synthesis stage.
The whole HLS flow is quite similar to a software compilation flow: it starts from a high level specification and produces low level code after a sequence of analysis and optimization steps.
As in a software compilation flow, three different phases can be identified in the High-Level Synthesis flow: front-end, middle-end and back-end. In the front-end, the input code is parsed and translated into an intermediate representation that is used in the following parts of the flow. In the middle-end, target-independent analyses and optimizations are performed.
Bambu interfaces with the GNU Compiler Collection (GCC) (versions 4.5, 4.6, 4.7, 4.8, 4.9 and 5 are currently supported) by means of GCC plugins to extract its internal representation, in Static Single Assignment form, of the initial C code. In particular, the extracted IR is the GIMPLE IR used by GCC to perform the target- and language-independent optimizations. This representation is dumped to ASCII files, from which Bambu loads the intermediate representation.
The GIMPLE IR is extracted after GCC has performed the target-independent optimizations.
Note, however, that not all software code optimizations are profitable when the target is a hardware accelerator. For example, transformations such as function inlining and loop unrolling can affect resource utilization much more than they do when the target is a processor.
Starting from the intermediate representation extracted from GCC, Bambu performs further analyses and builds additional internal representations, such as Call Graph, Control Flow Graphs, Data Flow Graphs and Program Dependence Graphs.
Next, it applies a set of analyses and transformations that are independent of the target device.
Some of these steps are the same as those applied in a software compilation flow (e.g., data flow analysis, loop recognition, dead code elimination, constant propagation, etc.).
One relevant HLS-specific optimization performed by Bambu during this phase targets multiplications and divisions by a constant.
These operations are typically transformed into operations that use only shifts and adds to improve area and timing.
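As an illustration of the idea, the following minimal Python sketch decomposes a constant into its set bits and multiplies using only shifts and additions. It is not Bambu's implementation: the function names are hypothetical, only non-negative constants are handled, and production tools typically use more refined recodings to reduce the number of adders.

```python
def shift_add_terms(constant):
    """Return the positions of the bits set in a non-negative constant."""
    if constant < 0:
        raise ValueError("this sketch only handles non-negative constants")
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def multiply_by_constant(x, constant):
    """Compute x * constant using only shifts and additions."""
    result = 0
    for shift in shift_add_terms(constant):
        result += x << shift  # one shifted copy of x per set bit
    return result


def divide_by_power_of_two(x, exponent):
    """Unsigned division by 2**exponent reduces to a right shift."""
    if x < 0:
        raise ValueError("this sketch only handles unsigned values")
    return x >> exponent
```

For instance, a multiplication by 10 becomes `(x << 1) + (x << 3)`: two shifts, which are free wiring in hardware, and a single adder.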
Another analysis performed at this stage is the Bitwidth Analysis that aims to reduce the number of bits required by datapath operators.
This is a very important optimization because it impacts all non-functional requirements (e.g. performance, area, power) of a design, without affecting its behavior.
Unlike general-purpose processor compilers, which are designed to target a processor with a fixed-size datapath (usually 32 or 64 bits), a hardware compiler can exploit specialization by generating custom-size operators (i.e. functional units) and registers. As a direct consequence, we can select the minimal number of bits required for an operation and/or storage of the specific algorithm, which in turn leads to minimal space used for registers and to smaller functional units, which translate into less area, less power, and shorter critical paths.
However, this analysis usually cannot be completely automated, since it often requires specific knowledge of the algorithm and the input datasets.
Bambu implements the methodology described in budiu-tr00.pdf, integrated with the Value Range information computed by the GCC compiler.
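To give a feeling of how value ranges translate into operator sizes, here is a small Python sketch. It is only an assumption-laden illustration, not the algorithm of the cited report: the helper names are hypothetical and only the range of an addition is propagated.

```python
def bits_for_range(lo, hi):
    """Minimal width of an integer able to hold every value in [lo, hi]."""
    if lo > hi:
        raise ValueError("empty range")
    if lo >= 0:
        # Unsigned value: enough bits for the largest value (at least one).
        return max(1, hi.bit_length())
    # Signed two's complement: n bits cover [-2**(n-1), 2**(n-1) - 1].
    positive_bits = hi.bit_length() if hi > 0 else 0
    return max((-lo - 1).bit_length(), positive_bits) + 1


def add_range(a, b):
    """Value range of a + b given the (lo, hi) ranges of a and b."""
    return (a[0] + b[0], a[1] + b[1])


# Two 8-bit unsigned inputs: the adder output needs 9 bits, not 32.
sum_range = add_range((0, 255), (0, 255))
sum_bits = bits_for_range(*sum_range)
```

With this information a 9-bit adder can be instantiated instead of the 32-bit one implied by the C `int` type.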
In the back-end phase, the actual High-Level Synthesis of the specification is performed.
Even if the same HDL language can be used to describe architectures implemented for different families of devices, the HLS flow is not target independent but takes into account information about the target device. Moreover, FPGAs do not have a fixed operating frequency: it can be decided by the designer or forced by devices (e.g., sensors or actuators) connected to the FPGA.
The synthesis process acts on each function separately. The resulting architecture is modular, reflecting the structure of the call graph.
The modules implementing the single functions include two different parts: the control logic and the data-path.
The control logic is modeled as a Finite State Machine which handles the routing of the data within the data-path and the execution of the single operations.
The generated data-path is a custom mux-based architecture optimized on the dimension of the data types to reduce the number of flip-flops and bit-level multiplexers.
It implements all the operations that have to be executed and stores their input and output.
The back-end phase generates the actual hardware architecture by performing the following steps:
Functions Allocation defines the hierarchy of the modules implementing the functions of the specification.
Bambu is currently able to use and integrate functions described at low level in Verilog or in VHDL with functions described at high-level in C.
Memories Allocation defines the memories used to store aggregate variables (arrays and structures), global variables, and how the dynamic memory allocation is implemented.
Bambu adopts a novel architecture for memory accesses: it builds a hierarchical data-path directly connected to a dual-port BRAM whenever the specified code uses a local aggregate or a global scalar/aggregate data type and whenever the accesses can be determined at compile time.
In this case, multiple memory accesses can be performed in parallel.
Otherwise, the memories are interconnected so that it is also possible to support dynamic resolution of the addresses.
Indeed, the same memory infrastructure can be natively connected to external components (e.g. a local scratch-pad memory or cache) or directly to the bus to access off-chip memory.
Resource allocation associates operations in the specification with Functional Units (FUs) in the resource library. During the middle-end phase the specification is inspected, and the characteristics of its operations are identified. Such characteristics include the kind of operation (e.g. addition, multiplication, …), and input/output value types (e.g. integer, float, …).
Floating point operations are supported through the High Level Synthesis of a soft-float library containing basic soft float operations or through FloPoCo, a generator of arithmetic Floating-Point Cores. The allocation step maps operations onto the set of available FUs, whose characterization includes information such as latency, area, and number of pipeline stages. Usually several operation/FU matchings are feasible: in this case the selection of a proper FU is driven by design constraints. In addition to FUs, memory resources are also allocated: local data may be bound to local memories.
The library of functional units used by Bambu is quite rich and in some cases it includes several implementations for the same single operation.
Moreover, the library contains functional units that are expressed as templates in a standard hardware description language (i.e. Verilog or VHDL). These templates can be retargeted and customized on the basis of the characteristics of the target technology. In this case, the underlying logic synthesis tool can determine which is the best architecture to implement each function. For example, multipliers can be mapped either on dedicated DSP blocks or implemented with LUTs. To perform aggressive optimizations, each component of the library is annotated with information useful during the entire HLS process, such as resource occupation and latency for executing the operations. Bambu adopts a pre-characterization approach. That is, the performance estimation considers a generic template of the functional unit, which can be parametric with respect to the bitwidths and pipeline stages. Latency and resource occupation are then obtained by synthesizing each configuration and storing the results in the library.
Scheduling of operations is performed by default through a LIST-based algorithm, which is constrained by resource availability. In its basic formulation, the LIST algorithm associates a priority with each operation, according to particular metrics. For example, the priority may reflect the mobility of an operation with respect to the critical path. Operations belonging to the critical path have zero mobility: delaying their execution usually results in an increase of the overall circuit latency. Critical path and mobilities can be obtained by analyzing the As Soon As Possible (ASAP) and As Late As Possible (ALAP) schedules. The LIST approach proceeds iteratively, associating with each control step the operations to be executed in it. Ready operations (i.e., those whose dependencies have been satisfied in previous iterations of the algorithm) are scheduled in the current control step considering resource availability: if multiple ready operations compete for a resource, then the one having the highest priority is scheduled.
Alternatively, a Speculative scheduling algorithm based on System of Difference Constraints (see Code Transformations Based on Speculative SDC Scheduling paper) is available: this algorithm builds an integer linear programming formulation of the scheduling problem, allowing code motions and speculations of operations into different basic blocks. The solution produced by the ILP solver is then implemented by applying the code motions and the speculations it suggests, after which the rest of the High Level Synthesis flow can proceed.
After the scheduling task, it is possible to build the corresponding State Transition Graph (STG): the STG is adopted for further analysis and to build the final Finite State Machine implementation for the controller.
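The list-based approach can be sketched in a few lines of Python. This is a simplified model under explicit assumptions, not Bambu's scheduler: every operation has unit latency, no chaining is performed, mobility is computed once (static), and the operation names and the `DEPS`/`FU_OF` tables are purely hypothetical.

```python
def asap(deps):
    """Earliest control step of each operation, assuming unit latency."""
    steps = {}

    def visit(op):
        if op not in steps:
            steps[op] = 1 + max((visit(p) for p in deps[op]), default=-1)
        return steps[op]

    for op in deps:
        visit(op)
    return steps


def alap(deps, last_step):
    """Latest control step of each operation that keeps the ASAP latency."""
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for pred in preds:
            succs[pred].append(op)
    steps = {}

    def visit(op):
        if op not in steps:
            steps[op] = min((visit(s) for s in succs[op]), default=last_step + 1) - 1
        return steps[op]

    for op in deps:
        visit(op)
    return steps


def list_schedule(deps, fu_of, available):
    """Resource-constrained list scheduling; lower mobility = higher priority."""
    for op in deps:
        if available.get(fu_of[op], 0) < 1:
            raise ValueError("no functional unit available for " + op)
    early = asap(deps)
    late = alap(deps, max(early.values()))
    mobility = {op: late[op] - early[op] for op in deps}
    schedule, step = {}, 0
    while len(schedule) < len(deps):
        # Ready: all predecessors completed in an earlier control step.
        ready = [op for op in deps if op not in schedule
                 and all(p in schedule and schedule[p] < step for p in deps[op])]
        free = dict(available)
        for op in sorted(ready, key=lambda o: (mobility[o], o)):
            if free[fu_of[op]] > 0:
                free[fu_of[op]] -= 1
                schedule[op] = step
        step += 1
    return schedule, mobility


# Hypothetical data-flow graph: a2 = (m1 + m2) + m3, with m* multiplications.
DEPS = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
FU_OF = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
```

Calling `list_schedule(DEPS, FU_OF, {"mul": 1, "add": 1})` serializes the multiplications on a single multiplier, while allowing two multipliers shortens the schedule by one control step.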
Module binding assigns operations to FU instances: operations that execute concurrently, according to the computed schedule, are not allowed to share the same FU instance, thus avoiding resource conflicts. In Bambu, binding is performed through a clique covering algorithm on a weighted compatibility graph. The compatibility graph is built by analyzing the schedule: operations scheduled on different control steps are compatible. Weights express how profitable it is for two operations to share the same hardware resource. They are computed taking into account the area/delay trade-offs resulting from sharing; for example, FUs that demand a large area are more likely to be shared. The computation of the weights also considers the cost of the steering logic introduced on the interconnections, both in terms of area and frequency. Bambu also offers several algorithms for solving the covering problem on generic compatibility/conflict graphs.
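The compatibility relation and a clique-based binding can be sketched as follows. This Python example is a deliberately simplified, unweighted greedy heuristic, not one of the algorithms shipped with Bambu; the schedule and the operation names are hypothetical.

```python
def compatible(op_a, op_b, schedule, fu_of):
    """Two operations may share an FU if they need the same kind of unit
    and are scheduled on different control steps."""
    return fu_of[op_a] == fu_of[op_b] and schedule[op_a] != schedule[op_b]


def bind_modules(schedule, fu_of):
    """Greedy clique partitioning: each clique becomes one FU instance."""
    instances = []  # list of (fu_type, [operations sharing the instance])
    for op in sorted(schedule, key=lambda o: (schedule[o], o)):
        for fu_type, ops in instances:
            if fu_type == fu_of[op] and all(
                    compatible(op, other, schedule, fu_of) for other in ops):
                ops.append(op)  # op is compatible with the whole clique
                break
        else:
            instances.append((fu_of[op], [op]))  # open a new FU instance
    return instances


# Hypothetical schedule: operation -> control step.
SCHEDULE = {"m1": 0, "m2": 0, "a1": 1, "m3": 1, "a2": 2}
FU_OF = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
BINDING = bind_modules(SCHEDULE, FU_OF)
```

Here `m1` and `m3` end up on the same multiplier, `m2` needs a second one because it runs concurrently with `m1`, and both additions share one adder. A weighted formulation would additionally prefer merges that save more area or add less steering logic.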
Register binding associates storage values with registers, and requires a preliminary analysis step, the Liveness Analysis (LA). LA analyzes the scheduled function and identifies the life interval of each variable, i.e. the sequence of control steps in which a temporary needs to be stored. Storage values with non-overlapping life intervals may share the same register. In the default settings, the Bambu flow computes liveness information through a non-iterative SSA liveness analysis algorithm (see Non-Iterative SSA liveness analysis paper). Register assignment is then reduced to the problem of coloring a conflict graph: the nodes of the graph are storage values, and the edges represent the conflict relation. Algorithms that solve the register binding problem through weighted clique covering of a compatibility graph are also available.
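When life intervals are contiguous ranges of control steps, the conflict graph is an interval graph and a simple greedy scan is enough to share registers. The Python sketch below uses the classic left-edge technique as a stand-in; it is not the weighted coloring used by default in Bambu, and the interval data are hypothetical.

```python
def bind_registers(intervals):
    """Left-edge register sharing.

    intervals maps a storage value to (start, end): the value is live over
    the half-open range of control steps [start, end).
    """
    registers = []  # registers[i]: values sharing register i
    free_at = []    # free_at[i]: control step at which register i is free again
    ordered = sorted(intervals.items(), key=lambda kv: (kv[1][0], kv[0]))
    for name, (start, end) in ordered:
        for reg, free in enumerate(free_at):
            if free <= start:  # no overlap with the values already in reg
                registers[reg].append(name)
                free_at[reg] = end
                break
        else:
            registers.append([name])
            free_at.append(end)
    return registers


# Hypothetical life intervals of four temporaries.
INTERVALS = {"t1": (0, 2), "t2": (0, 1), "t3": (1, 3), "t4": (2, 3)}
REGISTERS = bind_registers(INTERVALS)
```

Four temporaries are stored in two registers, because `t4` starts when `t1` ends and `t3` starts when `t2` ends.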
Interconnections are bound according to the previous steps: if a resource is shared, then the algorithm introduces steering logic on its inputs. It also identifies the relation between control signals and different operations: such signals are then set by the controller.
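The need for steering logic can be derived directly from the binding. The following Python sketch, built on hypothetical names and data, collects the distinct sources feeding each input port of a shared FU; a multiplexer is required only where a port has more than one source.

```python
def mux_inputs(binding, operands):
    """Return the input ports that need a multiplexer.

    binding maps an operation to its FU instance; operands maps an operation
    to the tuple of sources (e.g. registers) feeding each of its input ports.
    """
    sources = {}
    for op, fu in binding.items():
        for port, src in enumerate(operands[op]):
            sources.setdefault((fu, port), set()).add(src)
    # Ports always driven by the same source can be wired directly.
    return {port: sorted(srcs) for port, srcs in sources.items() if len(srcs) > 1}


# Two additions share the adder add0; only the second operand differs.
BINDING = {"a1": "add0", "a2": "add0"}
OPERANDS = {"a1": ("r1", "r2"), "a2": ("r1", "r3")}
MUXES = mux_inputs(BINDING, OPERANDS)
```

In this example only port 1 of `add0` gets a two-input multiplexer, whose select signal is driven by the controller.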
During the synthesis process, the final architecture is represented through a hyper-graph, which also highlights the interconnection between modules.
The netlist generation step translates this representation into a Verilog or VHDL description. The process accesses the resource library, which embeds the Verilog or the VHDL implementation of each allocated module.
Generation of Synthesis and Simulation Scripts
Bambu provides the automatic generation of synthesis and simulation scripts which can be customized by means of XML configuration files. This feature allows the automatic characterization of the resource library, providing technology-aware details during the High-Level Synthesis.
The tools for RTL-synthesis currently supported are:
- Xilinx ISE
- Xilinx VIVADO
- Altera Quartus
- Lattice Diamond
while the supported simulators are:
- Mentor Modelsim
- Xilinx ISIM
- Xilinx XSIM
- Verilog Icarus
The distribution includes several examples under directory example. Here is the list of directories currently included:
This example shows how to add a non-supported device to the Bambu synthesis flow.
The file xc7z045-2ffg900-VVD.xml has been copied from the framework distribution file etc/devices/Xilinx_devices/xc7z020-1clg484-VVD.xml and then renamed to xc7z045-2ffg900-VVD.xml.
After copying the file, a few changes have been made. All of them relate to the new device characteristics: model, package and speed grade.
The changed part of the xml file follows:
Note that the field
refers to the synthesis script stored in etc/devices/Xilinx_devices/Zynq-VVD.xml.
The bambu.sh script therefore first simulates and then synthesizes the C-based description, targeting the Zynq device specified above.
Note that this example also shows another nice feature of the HLS framework. The file module.c contains the C specification of the factorial function in its recursive form.
Bambu itself is not able to synthesize recursive functions, but GCC automatically translates the function into its non-recursive form once the -O2 option is passed. To understand exactly what has been synthesized, check the file a.c in the sim or synth directory created by bambu.sh.
The new device considered in this example is very similar to one of those already supported. If the device is not very similar to one of the already characterized devices, the user should check the characterization scripts and add new ones accordingly. Examples of characterization scripts based on the eucalyptus tool are available in etc/devices.
Note that eucalyptus is automatically built once an RTL synthesis back-end is configured.
This directory includes a simple example of High-Level Synthesis and generation of RTL simulation and synthesis scripts.
The HLS results can be inspected by looking into testbench/hls_summary_0.xml.
The result of the scheduling can be viewed graphically with a viewer of dot files (e.g., xdot or dotty).
In particular, Bambu generates several dot files when the option --print-dot is passed.
The scheduling of the arf function is stored in file HLS_output/dot/arf/HLS_scheduling.dot while the FSM of the arf function annotated with the C statements is stored in file HLS_output/dot/arf/HLS_STGraph.dot.
In this directory the impact of resource sharing on multipliers for the arf benchmark is considered. Two sets of scripts are provided: constrained and unconstrained synthesis scripts.
The devices considered are the ones supported by Bambu.
In all the synthesis performed, the WB4 interface has been used to avoid issues with the high number of IO pins required by the arf function when synthesized alone.
Basically, adding a constraint on the number of multipliers used requires passing to Bambu an xml file structured in this way:
<?xml version="1.0"?>
<constraints>
  <HLS_constraints>
    <tech_constraints fu_name="mult_expr_FU" fu_library="STD_FU" n="1"/>
  </HLS_constraints>
</constraints>
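When several constraint files are needed (for example, one per number of multipliers), they can be generated with a short script. The Python sketch below only reproduces the element and attribute names shown above; the helper name `build_constraints` is hypothetical, and the output omits the XML declaration line.

```python
import xml.etree.ElementTree as ET


def build_constraints(fu_name, fu_library, count):
    """Return a constraints document limiting fu_name to count instances."""
    root = ET.Element("constraints")
    hls = ET.SubElement(root, "HLS_constraints")
    ET.SubElement(hls, "tech_constraints",
                  fu_name=fu_name, fu_library=fu_library, n=str(count))
    return ET.tostring(root, encoding="unicode")


# One document per multiplier budget, from 1 to 4 multipliers.
DOCUMENTS = {n: build_constraints("mult_expr_FU", "STD_FU", n) for n in range(1, 5)}
```

Each string in `DOCUMENTS` can then be written to its own file and passed to Bambu as the constraints file.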
This directory collects several scripts to test the multi-bus feature of bambu.
The file test_icrc.xml shows how to write xml testcases for array based function parameters.
This directory shows an example of how to write a C-based testbench to test a given kernel.
The kernel function is defined through the option --top-rtldesign-name.
This design flow requires adding two attributes to the kernel function:
__attribute__ ((noinline)) __attribute__ ((used))
and inserting these two timing functions:
These two functions will start and stop a timer used by Bambu to compute the total number of cycles spent in the kernel function.
The target device is a Zynq xc7z020,-1,clg484 and the back-end flow is based on yosys open source RTL synthesis tool (http://www.clifford.at/yosys/).
This example starts from the reference C description of Keccak crypto function distributed through this website http://keccak.noekeon.org/.
Keccak has been selected by NIST to become the new SHA-3 standard (see http://www.nist.gov/hash-competition and http://ehash.iaik.tugraz.at/wiki/The_SHA-3_Zoo).
Further details can be found at: http://ehash.iaik.tugraz.at/wiki/Keccak.
Together with the C implementation optimized for processors, there exist several implementations for FPGA and ASIC.
As a reference, one of the Low-Area Implementations developed by the authors of the Keccak algorithm (i.e., Guido Bertoni-STMicroelectronics, Joan Daemen-STMicroelectronics, Michaël Peeters-NXP Semiconductors and Gilles Van Assche-STMicroelectronics) has been selected.
The results reported at this link http://ehash.iaik.tugraz.at/wiki/SHA-3_Hardware_Implementations are:
Altera Cyclone III: 1559 LEs, 47.8 Mbit/s, 181 MHz
Xilinx Virtex 5: 444 slices, 70.1 Mbit/s, 265 MHz
Starting from the C description delivered as reference, an equivalent C function (equivalent to the VHDL reference design) has been built.
After two days of hacking and design space exploration, here are 5 different alternatives using different FPGAs:
Altera Cyclone II: 5460 LEs, 66.9 Mbit/s, 107 MHz (directory keccak_CycloneII_10)
Altera Cyclone II: 8681 LEs, 150.8 Mbit/s, 262 MHz (directory keccak_CycloneII_4hl)
Lattice ECP3: 3789 slices, 80.2 Mbit/s, 128 MHz (directory keccak_ECP3_10_09)
Lattice ECP3: 3831 slices, 80.2 Mbit/s, 128 MHz (directory keccak_ECP3_9)
Xilinx Virtex 5: 7015 slices, 152.69 Mbit/s, 252 MHz (directory keccak_V5_4hl)
These results have been obtained with PandA framework 0.9.3.
A companion example, in directory crypto_designs/multi-keccak, shows how to build an Autotools project for high-level synthesis with bambu.
This directory includes an example program which computes the FFT of a short pulse in a sample of length 128.
Scripts, updated results and code related to this paper:
Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Fabrizio Ferrandi: Inter-procedural resource sharing in High Level Synthesis through function proxies. FPL 2015: 1-8.
This directory contains the CHStone v1.11 benchmarks taken from http://www.ertl.jp/chstone/ and all the scripts used and results obtained with bambu.
This directory shows how to write a test.xml file when multi-dimensional arrays are used as function parameters.
The example uses the option --memory-allocation-policy=EXT_PIPELINED_BRAM. This option is used to declare that the parameters are allocated on a block RAM memory, so that pipelined accesses are possible.
This example is very similar to the mm example.
There are two main differences:
- the two dimensions of the arrays are passed as parameters;
- the matrix elements are floats.
This directory contains scripts and results obtained on the libm functions supported by bambu.
Vga Adapter on Altera DE1 Cyclone II (EP2C20F484C7N).
The main aim of the project is to develop an application written in C which drives a VGA-compatible screen connected to a DE1 Altera FPGA.
The design includes some Verilog IPs which control the VGA port and shows how Bambu can manage existing IPs described by using hardware description languages.
This simple example shows how to integrate C code with low level interfaces written in Verilog.
The design improves the VGA example by adapting such design to the more capable NEXYS4 prototyping board.
This directory shows an example of how Bambu can use the libc I/O primitives (open, read, write and close).
This directory contains a simple example describing how to integrate and verify existing IPs with functions written in C that receive structs passed by pointer.
This simple example shows how to integrate small snippets of Verilog in the HLS flow by making Bambu use Verilog as a third assembler dialect.
Currently, only single-output asm instructions are supported. When outputs are present, the Intel and the AT&T asm strings should also be included so that the simulation passes. For asm statements having only inputs, these asm strings can safely be left empty.
A detailed reference on how GCC handles asm statements can be found at this link: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html.
This directory includes an example showing how to integrate Python for design verification.
This directory includes an example of a simple GPIO controller developed to show how to integrate Verilog IPs with plain C.
This directory includes the Pong game ported to Nexys4 prototyping board. Pong was the first game developed by Atari Inc. and was designed and built by Allan Alcorn. Further information can be found at https://en.wikipedia.org/wiki/Pong.
The code has been ported by Fabrizio Ferrandi by adapting an SDL-based tutorial to the PandA methodology for the integration of low level IP cores written in Verilog.
The original SDL code can be found at http://www.aaroncox.net/tutorials/arcade/PaddleBattle.html.
The artificial intelligence used to control the computer paddle is based on a random function described at http://burtleburtle.net/bob/rand/smallprng.html
This directory includes the Breakout game ported to the Nexys4 prototyping board. The game was designed by Nolan Bushnell, Steve Wozniak, and Steve Bristow. The history of the Breakout game can be found at this link: https://en.wikipedia.org/wiki/Breakout_%28video_game%29.
The code has been ported by Fabrizio Ferrandi by adapting an SDL-based tutorial to the PandA methodology for the integration of low level IP cores written in Verilog.
The original SDL code can be found at http://www.aaroncox.net/tutorials/arcade/BRICKBreaker.html.
This directory contains the scripts, the results and code of the MachSuite benchmarks set which is described in this paper:
Brandon Reagen, Robert Adolf, Sophia Yakun Shao, Gu-Yeon Wei, and David Brooks.
“MachSuite: Benchmarks for Accelerator Design and Customized Architectures.”
2014 IEEE International Symposium on Workload Characterization.
This directory includes the scripts, the updated results and the code related to this paper:
R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels, “A Survey and Evaluation of FPGA High-Level Synthesis Tools,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. PP, iss. 99, pp. 1-1, 2016.
This directory includes scripts and code testing single and double precision basic operations: division, subtraction, addition and multiplication.
The current Bambu options are reported below:
******************************************************************************** ____ _ | __ ) __ _ _ __ ___ | |_ _ _ | _ \ / _` | '_ ` _ \| '_ \| | | | | |_) | (_| | | | | | | |_) | |_| | |____/ \__,_|_| |_| |_|_.__/ \__,_| ******************************************************************************** High-Level Synthesis Tool Politecnico di Milano - DEIB System Architectures Group ******************************************************************************** Copyright (c) 2004-2016 Politecnico di Milano Version: PandA 0.9.4 Usage: bambu [Options] <source_file> [<constraints_file>] [<technology_file>] Options: General options: --help, -h Display this usage information. --version, -V Display the version of the program. Output options: --verbosity, -v <level> Set the output verbosity level Possible values for <level>: 0 - NONE 1 - MINIMUM 2 - VERBOSE 3 - PEDANTIC 4 - VERY PEDANTIC (default = 1) --no-clean Do not remove temporary files. --benchmark-name=<name> Set the name of the current benchmark for data collection. Mainly useful for data collection from extensive regression tests. --configuration-name=<name> Set the name of the current tool configuration for data collection. Mainly useful for data collection from extensive regression tests. --benchmark-fake-parameters Set the parameters string for data collection. The parameters in the string are not actually used, but they are used for data collection in extensive regression tests. --output-temporary-directory=<path> Set the directory where temporary files are saved. Default is 'panda-temp' --print-dot Dump to file several different graphs used in the IR of the tool. The graphs are saved in .dot files, in graphviz format --pretty-print=<file> C-based pretty print of the internal IR. --writer,-w<language> Output RTL language: V - Verilog (default) H - VHDL --no-mixed-design Avoid mixed design. --generate-tb=<file> Generate testbench for the input values defined in the specified XML file. 
--top-fname=<fun_name> Define the top function to be synthesized. --top-rtldesign-name=<top_name> Define the top module name for the RTL backend. --file-input-data=<file_list> A comma-separated list of input files used by the C specification. --C-no-parse=<file> Specify a comma-separated list of C files used only during the co-simulation phase. GCC options: --compiler=<gcc_version> Specify which compiler is used. Possible values for <processor>: I386_GCC45 I386_GCC46 I386_GCC47 I386_GCC48 I386_GCC49 I386_GCC5 -O<level> Enable a specific optimization level. Possible values are the usual optimization flags accepted by compilers, plus some others: -O0,-O1,-O2,-O3,-Os,-O4,-O5. -f<option> Enable or disable a GCC optimization option. All the -f or -fno options are supported. In particular, -ftree-vectorize option triggers the high-level synthesis of vectorized operations. -I<path> Specify a path where headers are searched for. -W<warning> Specify a warning option passed to GCC. All the -W options available in GCC are supported. -E Enable preprocessing mode of GCC. --std=<standard> Assume that the input sources are for <standard> (default=gnu89). All the --std options available in GCC are supported. -D<name> Predefine name as a macro, with definition 1. -D
Tokenize and process as if it appeared as a #define directive. -U<name> Remove existing definition for macro <name>. --param <name>=<value> Set the amount <value> for the GCC parameter <name> that could be used for some optimizations. -l<library> Search the library named <library> when linking. -L<dir> Add directory <dir> to the list of directories to be searched for -l. --use-raw Specify that input file is already a GIMPLE file and not a source file. -m<machine-option> Specify machine dependent options (currently not used). --Include-sysdir Return the system include directory used by the wrapped GCC compiler. --gcc-config Return the GCC configuration. --extra-gcc-options Specify custom extra options to the compiler. Target: --target-file=file, -b<file> Specify an XML description of the target device. --generate-interface=<type> Wrap the top level module with an external interface. Possible values for <type> and related interfaces: minimal - (minimal interface - default) WB4 - (WishBone 4 interface) High Level Synthesis: --parametric-list-based[=<type>] Perform priority list-based scheduling. This is the default scheduling algorithm in bambu. The optional <type> argument can be used to set options for list-based scheduling as follows: 0 - Dynamic mobility (default) 1 - Static mobility 2 - Priority-fixed mobility --post-rescheduling Perform post rescheduling to better distribute resources. --speculative-sdc-scheduling Perform scheduling by using speculative sdc. --fixed-scheduling=<file> Provide scheduling as an XML file. --no-chaining Disable chaining optimization. Binding: --register-allocation=<type> Set the algorithm used for register allocation. 
Possible values for the <type> argument are the following: WEIGHTED_COLORING - use weighted coloring algorithm (default) COLORING - use simple coloring algorithm CHORDAL_COLORING - use chordal coloring algorithm BIPARTITE_MATCHING - use bipartite matching algorithm TTT_CLIQUE_COVERING - use a weighted clique covering algorithm UNIQUE_BINDING - unique binding algorithm --module-binding=<type> Set the algorithm used for module binding. Possible values for the <type> argument are one the following: WEIGHTED_TS - solve the weighted clique covering problem by exploiting the Tseng&Siewiorek heuristics (default) WEIGHTED_COLORING - solve the weighted clique covering problem performing a coloring on the conflict graph COLORING - solve the unweighted clique covering problem performing a coloring on the conflict graph TTT_FAST - use Tomita, A. Tanaka, H. Takahashi maxima weighted cliques heuristic to solve the clique covering problem TTT_FAST2 - use Tomita, A. Tanaka, H. Takahashi maximal weighted cliques heuristic to incrementally solve the clique covering problem TTT_FULL - use Tomita, A. Tanaka, H. Takahashi maximal weighted cliques algorithm to solve the clique covering problem TTT_FULL2 - use Tomita, A. Tanaka, H. Takahashi maximal weighted cliques algorithm to incrementally solve the clique covering problem TS - solve the unweighted clique covering problem by exploiting the Tseng&Siewiorek heuristic BIPARTITE_MATCHING - solve the weighted clique covering problem exploiting the bipartite matching approach UNIQUE - use a 1-to-1 binding algorithm Memory allocation: --memory-allocation=<type> Set the algorithm used for memory allocation. Possible values for the type argument are the following: DOMINATOR - all local variables, static variables and strings are allocated on BRAMs (default) XML_SPECIFICATION - import the memory allocation from an XML specification --xml-memory-allocation=<xml_file_name> Specify the file where the XML configuration has been defined. 
--memory-allocation-policy=<type> Set the policy for memory allocation. Possible values for the <type> argument are the following: ALL_BRAM - all objects that need to be stored in memory are allocated on BRAMs (default) LSS - all local variables, static variables and strings are allocated on BRAMs GSS - all global variables, static variables and strings are allocated on BRAMs NO_BRAM - all objects that need to be stored in memory are allocated on an external memory EXT_PIPELINED_BRAM - all objects that need to be stored in memory are allocated on an external pipelined memory --base-address=address Define the starting address for objects allocated externally to the top module. --initial-internal-address=address Define the starting address for the objects allocated internally to the top module. --channels-type=<type> Set the type of memory connections. Possible values for <type> are: MEM_ACC_11 - the accesses to the memory have a single direct connection or a single indirect connection (default) MEM_ACC_N1 - the accesses to the memory have n parallel direct connections or a single indirect connection MEM_ACC_NN - the accesses to the memory have n parallel direct connections or n parallel indirect connections --channels-number=<n> Define the number of parallel direct or indirect accesses. --memory-ctrl-type=type Define which type of memory controller is used. Possible values for the <type> argument are the following: D00 - no extra delay (default) D10 - 1 clock cycle extra-delay for LOAD, 0 for STORE D11 - 1 clock cycle extra-delay for LOAD, 1 for STORE D21 - 2 clock cycle extra-delay for LOAD, 1 for STORE --sparse-memory[=on/off] Control how the memory allocation happens. on - allocate the data in addresses which reduce the decoding logic (default) off - allocate the data in a contiguous addresses. --do-not-use-asynchronous-memories Do not add asynchronous memories to the possible set of memories used by bambu during the memory allocation step. 
    --distram-threshold=value
        Define the threshold in bitsize used to infer DISTRIBUTED/ASYNCHRONOUS
        RAMs (default 256).

    --serialize-memory-accesses
        Serialize the memory accesses using the GCC virtual use-def chains
        without taking into account any alias analysis information.

    --unaligned-access
        Use only memories supporting unaligned accesses.

    --aligned-access
        Assume that all accesses are aligned, so that only memories supporting
        aligned accesses are used.

    --do-not-chain-memories
        When enabled, LOADs and STOREs will not be chained with other
        operations.

    --bram-high-latency
        Assume a 'high latency bram'-'faster clock frequency' block RAM memory
        based architecture: LOAD(II=1,L=3) STORE(1).

    --mem-delay-read=value
        Define the external memory latency when LOADs are performed
        (default 2).

    --mem-delay-write=value
        Define the external memory latency when STOREs are performed
        (default 1).

    --do-not-expose-globals
        All global variables are considered local to the compilation units.

    --data-bus-bitsize=<bitsize>
        Set the bitsize of the external data bus.

    --addr-bus-bitsize=<bitsize>
        Set the bitsize of the external address bus.

  Evaluation of HLS results:

    --simulate
        Simulate the RTL implementation.

    --simulator=<type>
        Specify the simulator used in generated simulation scripts:
            MODELSIM  - Mentor Modelsim
            XSIM      - Xilinx XSim
            ISIM      - Xilinx iSim
            ICARUS    - Icarus Verilog simulator
            VERILATOR - Verilator simulator

    --max-sim-cycles=<cycles>
        Specify the maximum number of cycles an HDL simulation may run
        (default 20000000).

    --accept-nonzero-return
        Do not assume that application main must return 0.

    --generate-vcd
        Enable .vcd output file generation for waveform visualization
        (requires testbench generation).

    --evaluation[=type]
        Perform evaluation of the generated solution.
        The value of 'type' selects the objectives to be evaluated. If nothing
        is specified, all of the following are evaluated. The 'type' argument
        can be a string containing any of the following strings, separated by
        commas, without spaces:
            AREA         - Area usage
            AREAxTIME    - Area x Latency product
            TIME         - Latency for the average computation
            TOTAL_TIME   - Latency for the whole computation
            CYCLES       - number of cycles for the average computation
            TOTAL_CYCLES - number of cycles for the whole computation
            BRAMS        - number of BRAMs
            CLOCK_SLACK  - Slack between actual and required clock period
            DSPS         - number of DSPs
            FREQUENCY    - Maximum target frequency
            PERIOD       - Actual clock period
            REGISTERS    - number of registers

  Checks and debugging:

    --assert-debug
        Enable assertion debugging performed by Modelsim.

  RTL synthesis:

    Note: for a more complete evaluation you should use the option
    --evaluation.

    --clock-period=value
        Specify the period of the clock signal (default = 10ns).

    --backend-script-extensions=file
        Specify a file that will be included in the backend-specific
        synthesis scripts.

    --backend-sdc-extensions=file
        Specify a file that will be included in the Synopsys Design
        Constraints file (SDC).

    --device-name=value
        Specify the name of the device. Three different cases are foreseen:
            - Xilinx:  a comma-separated string specifying device, speed
                       grade and package (e.g., "xc7z020,-1,clg484,VVD")
            - Altera:  a string identifying the device (e.g., EP2C70F896C6)
            - Lattice: a string identifying the device
                       (e.g., LFE335EA8FN484C)

    --power-optimization
        Enable Xilinx power-based optimization (default no).

    --no-iob
        Disconnect primary ports from the IOB (the default is to connect
        primary input and output ports to IOBs).

    --soft-float
        Enable use of soft-based implementation of floating-point operations.
        This is the default for bambu.
    --flopoco
        Enable use of flopoco-based implementation of floating-point
        operations.

    --max-ulp
        Define the maximal ULP (unit in the last place, i.e., the spacing
        between adjacent floating-point numbers) accepted.

    --hls-div
        Perform the high-level synthesis of integer division and modulo
        operations starting from a C library based implementation.

    --skip-pipe-parameter=<value>
        Used during the allocation of pipelined units. <value> specifies how
        many pipelined units, compliant with the clock period, will be
        skipped (default = 0).

    --reset-type=value
        Specify the type of reset:
            no    - use registers without reset (default)
            async - use registers with asynchronous reset
            sync  - use registers with synchronous reset

    --reset-level=value
        Specify if the reset is active high or low:
            low  - use registers with active low reset (default)
            high - use registers with active high reset

    --registered-inputs=value
        Specify if inputs are registered or not:
            auto - inputs are registered only for proxy functions (default)
            yes  - all inputs are registered
            no   - none of the inputs is registered

    --cprf=value
        Clock Period Resource Fraction (default = 1.0).

    --DSP-allocation-coefficient=value
        During the allocation step the timing of the DSP-based modules is
        multiplied by value (default = 1.0).

    --DSP-margin-combinational=value
        Timing of combinational DSP-based modules is multiplied by value
        (default = 1.0).

    --DSP-margin-pipelined=value
        Timing of pipelined DSP-based modules is multiplied by value
        (default = 1.0).

    --mux-margins=n
        Scheduling reserves a margin corresponding to the delay of n 32-bit
        multiplexers.

    --timing-model=value
        Specify the timing model used by HLS:
            EC     - estimate timing overhead of glue logic and connections
                     between resources (default)
            SIMPLE - just consider the resource delay

    --experimental-setup=<setup>
        Specify the experimental setup. This is a shorthand to set multiple
        options with a single command.
        Available values for <setup> are the following:
            BAMBU-AREA - this setup implies:
                -Os -D'printf(fmt, ...)='
                --memory-allocation-policy=ALL_BRAM
                --DSP-allocation-coefficient=1.75
                --distram-threshold=256
            BAMBU-AREA-MP - this setup implies:
                -Os -D'printf(fmt, ...)='
                --channels-type=MEM_ACC_NN
                --memory-allocation-policy=ALL_BRAM
                --DSP-allocation-coefficient=1.75
                --distram-threshold=256
            BAMBU-BALANCED - this setup implies:
                -O2 -D'printf(fmt, ...)='
                --channels-type=MEM_ACC_11
                --memory-allocation-policy=ALL_BRAM
                -fgcse-after-reload -fipa-cp-clone
                -ftree-partial-pre -funswitch-loops
                -finline-functions -fno-ivopts
                --param max-inline-insns-auto=25
                -fno-tree-loop-ivcanon
                --distram-threshold=256
            BAMBU-BALANCED-MP - (default) this setup implies:
                -O2 -D'printf(fmt, ...)='
                --channels-type=MEM_ACC_NN
                --memory-allocation-policy=ALL_BRAM
                -fgcse-after-reload -fipa-cp-clone
                -ftree-partial-pre -funswitch-loops
                -finline-functions -fno-ivopts
                --param max-inline-insns-auto=25
                -fno-tree-loop-ivcanon
                --distram-threshold=256
            BAMBU-PERFORMANCE - this setup implies:
                -O3 -D'printf(fmt, ...)='
                --memory-allocation-policy=ALL_BRAM
                --distram-threshold=512
            BAMBU-PERFORMANCE-MP - this setup implies:
                -O3 -D'printf(fmt, ...)='
                --channels-type=MEM_ACC_NN
                --memory-allocation-policy=ALL_BRAM
                --distram-threshold=512
            BAMBU - this setup implies:
                -O0 --channels-type=MEM_ACC_11
                --memory-allocation-policy=LSS
                --distram-threshold=256
            BAMBU092 - this setup implies:
                -O3 -D'printf(fmt, ...)='
                --timing-model=SIMPLE
                --DSP-margin-combinational=1.3
                --cprf=0.9 --skip-pipe-parameter=1
                --channels-type=MEM_ACC_11
                --memory-allocation-policy=LSS
                --distram-threshold=256
            VVD - this setup implies:
                -O3 -D'printf(fmt, ...)='
                --channels-type=MEM_ACC_NN
                --memory-allocation-policy=ALL_BRAM
                --distram-threshold=256
                --DSP-allocation-coefficient=1.75
                --do-not-expose-globals --cprf=0.875

  Other options:

    --time, -t <time>
        Set maximum execution time (in seconds) for ILP solvers
        (default: infinite).
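Most of the experimental setups listed above pass -D'printf(fmt, ...)=' to the compiler. The following is a minimal sketch of what that definition does, assuming the flag reaches the GCC front end as an ordinary macro definition. The #define is the in-source equivalent of the flag; the function name accumulate and its arguments are hypothetical and only serve the illustration.

```c
/* In-source equivalent of the command-line flag -D'printf(fmt, ...)='.
 * This is an illustration only: with the flag, no source change is needed. */
#define printf(fmt, ...)

/* Hypothetical function to be synthesized: sums n integers. */
int accumulate(const int *data, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[i];
        /* The call below expands to an empty statement, so no I/O
         * routine is left in the code handed to synthesis. */
        printf("partial sum: %d\n", sum);
    }
    return sum;
}
```

The computed result is unaffected: accumulate applied to the array {1, 2, 3, 4} returns 10 with or without the macro; only the printing disappears.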
  Debug options:

    --discrepancy
        Performs automated discrepancy analysis between the execution of the
        original source code and the generated HDL (currently supports only
        Verilog). If a mismatch is detected, it reports useful information to
        the user.
        Uninitialized variables in C are legal, but if they are used before
        initialization in HDL it is possible to obtain X values in
        simulation. This is not necessarily wrong, so these errors are not
        reported by default to avoid reporting false positives. If you can
        guarantee that your C code contains no uninitialized variables and
        you want the X values in HDL to be reported, use the option
        --discrepancy-force-uninitialized.

    --discrepancy-force-uninitialized
        Reports errors due to uninitialized values in HDL. See the option
        --discrepancy for details.

    --discrepancy-no-load-pointers
        Assume that the data loaded from memories in HDL are never used to
        represent addresses, unless they are explicitly assigned to pointer
        variables.
        The discrepancy analysis is able to compare pointers in software
        execution and addresses in hardware. By default all the values loaded
        from memory are treated as if they could contain addresses, even if
        they are integer variables. This is because C code that uses such
        tricks is valid and actually used in embedded systems, but it can
        lead to imprecise bug reports, because only pointers pointing to
        actual data are checked by the discrepancy analysis. If you can
        guarantee that your code always manipulates addresses using pointers
        and never using plain int, then you can use this option to get more
        precise bug reports.
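As an illustration of the kind of code the last paragraph refers to, the sketch below stores an address in an integer variable and later loads it back from memory to use it as a pointer. All names (sensor_value, stashed_addr, stash_address, read_through_integer) are hypothetical, and the example uses uintptr_t rather than plain int so that the round trip is well defined on a host compiler.

```c
#include <stdint.h>

static int sensor_value = 42;   /* hypothetical data object */
static uintptr_t stashed_addr;  /* integer variable that holds an address */

/* Store the address of sensor_value in an integer variable. */
void stash_address(void)
{
    stashed_addr = (uintptr_t)(void *)&sensor_value;
}

/* Load the integer back from memory and dereference it as a pointer. */
int read_through_integer(void)
{
    int *p = (int *)(void *)stashed_addr;
    return *p;
}
```

Code of this shape is why loaded values are treated as possible addresses by default. A design that never does this is a candidate for --discrepancy-no-load-pointers.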