Bambu: A Free Framework for the High-Level Synthesis of Complex Applications

Bambu is a free framework aimed at assisting the designer during the high-level synthesis of complex applications. It supports most C constructs (e.g., function calls and sharing of the modules, pointer arithmetic and dynamic resolution of memory accesses, accesses to arrays and structs, parameter passing either by reference or by copy, …). Bambu is developed for Linux systems, is written in C++, and can be freely downloaded under the GPL license.

Bambu receives as input a behavioral description of the specification, written in C, and generates as output the HDL description of the corresponding RTL implementation, which is compatible with commercial RTL synthesis tools, along with a test-bench for the simulation and validation of the behavior. Bambu is designed in an extremely modular way: the different tasks of the HLS process, and the specific algorithms, are implemented in distinct C++ classes that work on different IRs depending on the synthesis stage.

The whole HLS flow is quite similar to a software compilation flow: it starts from a high level specification and produces low level code after a sequence of analysis and optimization steps.
As in a software compilation flow, three different phases can be identified in the High-Level Synthesis flow: front-end, middle-end and back-end. In the front-end, the input code is parsed and translated into an intermediate representation that is used in the following parts of the flow. In the middle-end, target-independent analyses and optimizations are performed.

Bambu front-end

Bambu interfaces the GNU Compiler Collection (GCC) (versions 4.5, 4.6, 4.7, 4.8, 4.9, 5, 6 and 7 are currently supported) by means of GCC plugins to extract its internal representation, in Static Single Assignment form, of the initial C code. In particular, the extracted IR is the GIMPLE IR used by GCC to perform the target- and language-independent optimizations. This representation is dumped to ASCII files, from which Bambu loads it.
The GIMPLE intermediate representation is extracted after GCC has performed the target-independent optimizations.
Note however that not all software code optimizations are profitable when the target is a hardware accelerator. For example, transformations like function inlining and loop unrolling can affect resource utilization much more than the same transformations do when the target is a processor.

Bambu middle-end

Starting from the intermediate representation extracted from GCC, Bambu performs further analyses and builds additional internal representations, such as Call Graph, Control Flow Graphs, Data Flow Graphs and Program Dependence Graphs.
Next it applies a set of analyses and transformations independently from the target device.
Some of these steps are the same applied in a software compilation flow (e.g., data flow analysis, loop recognition, dead code elimination, constant propagation, etc.).

One relevant optimization performed by Bambu during this phase is the optimization of multiplications and divisions by a constant.
These operations are typically transformed into operations that use only shifts and adds to improve area and timing.
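
As an illustration of the shift-and-add idea described above, the following minimal C sketch multiplies a value by a constant using only shifts and additions. The function names are hypothetical and the code is not taken from Bambu; it only shows the arithmetic identity that such a transformation relies on.

```c
#include <stdint.h>

/* Multiply x by the constant c using only shifts and adds:
   for every set bit i of c, x << i is accumulated.
   Illustrative sketch only, not Bambu's implementation. */
uint32_t mul_const_shift_add(uint32_t x, uint32_t c)
{
    uint32_t acc = 0;
    for (unsigned i = 0; c != 0; ++i, c >>= 1) {
        if (c & 1u) {
            acc += x << i;
        }
    }
    return acc;
}

/* Hand-specialized version for c = 10: 10*x = 8*x + 2*x. */
uint32_t mul_by_10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

In hardware, each shift by a constant amount is just wiring, so the specialized version costs a single adder instead of a full multiplier.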

Another analysis performed at this stage is the Bitwidth Analysis that aims to reduce the number of bits required by datapath operators.
This is a very important optimization because it impacts all non-functional requirements (e.g. performance, area, power) of a design, without affecting its behavior.
Unlike general-purpose processor compilers, which are designed to target a processor with a fixed-size datapath (usually 32 or 64 bits), a hardware compiler can exploit specialization by generating custom-size operators (i.e. functional units) and registers. As a direct consequence, we can select the minimal number of bits required for an operation and/or storage of the specific algorithm, which in turn leads to less space used for registers and to smaller functional units, and therefore to less area, less power, and shorter critical paths.
However, this analysis usually cannot be completely automated, since it often requires specific knowledge of the algorithm and the input datasets.
Bambu implements the methodology described in budiu-tr00.pdf, integrated with the Value Range information computed by the GCC compiler.
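
To make the link between value ranges and operator sizes concrete, here is a minimal C sketch that computes how many bits are needed to hold every integer in a known range. The function name and the choice between unsigned and two's complement encoding are assumptions made for illustration; the sketch does not reproduce the analysis implemented in Bambu.

```c
#include <stdint.h>

/* Number of bits needed to represent every integer in [lo, hi]
   (assumes lo <= hi). Unsigned encoding is used when lo >= 0,
   two's complement otherwise. Illustrative sketch only. */
unsigned bits_for_range(int64_t lo, int64_t hi)
{
    unsigned bits = 1;
    if (lo >= 0) {
        /* Unsigned: smallest n such that hi <= 2^n - 1. */
        while (bits < 64 && ((uint64_t)hi >> bits) != 0) {
            ++bits;
        }
        return bits;
    }
    /* Signed: smallest n such that -2^(n-1) <= lo and hi <= 2^(n-1) - 1. */
    while (bits < 64) {
        int64_t min = -((int64_t)1 << (bits - 1));
        int64_t max = ((int64_t)1 << (bits - 1)) - 1;
        if (lo >= min && hi <= max) {
            break;
        }
        ++bits;
    }
    return bits;
}
```

For instance, a loop counter known to stay in the range 0 to 127 needs 7 bits, so a 7-bit register and adder can replace the 32-bit ones a processor would use.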

Bambu back-end

In this phase the actual High-Level Synthesis of the specification is performed.
Even though the same HDL can be used to describe architectures implemented on different families of devices, the HLS flow is not target independent: it takes into account information about the target device. Moreover, FPGAs do not have a fixed operating frequency; it can be chosen by the designer or imposed by the devices (e.g., sensors or actuators) connected to the FPGA.
The synthesis process acts on each function separately. The resulting architecture is modular, reflecting the structure of the call graph.

The modules implementing the single functions include two different parts: the control logic and the data-path.
The control logic is modeled as a Finite State Machine which handles the routing of the data within the data-path and the execution of the single operations.
The generated data-path is a custom mux-based architecture optimized on the dimension of the data types to reduce the number of flip-flops and bit-level multiplexers.
It implements all the operations that have to be executed and stores their input and output.

The back-end phase generates the actual hardware architecture by performing the following steps:

Functions Allocation

Functions Allocation defines the hierarchy of the modules implementing the functions of the specification.
Bambu is currently able to use and integrate functions described at low level in Verilog or in VHDL with functions described at high-level in C.

Memories Allocation

Memories Allocation defines the memories used to store aggregate variables (arrays and structures), global variables, and how the dynamic memory allocation is implemented.
Bambu adopts a novel architecture for memory accesses: it builds a hierarchical data-path directly connected to a dual-port BRAM whenever the specified code uses a local aggregate or a global scalar/aggregate data type and the accesses can be determined at compile time.
In this case, multiple memory accesses can be performed in parallel.
Otherwise, the memories are interconnected so that it is also possible to support dynamic resolution of the addresses.
Indeed, the same memory infrastructure can be natively connected to external components (e.g. a local scratch-pad memory or cache) or directly to the bus to access off-chip memory.

Resource Allocation

Resource allocation associates operations in the specification to Functional Units (FUs) in the resource library. During the middle-end phase the specification is inspected, and the characteristics of the operations are identified. Such characteristics include the kind of operation (e.g. addition, multiplication, …) and the input/output value types (e.g. integer, float, …).
Floating point operations are supported through the High-Level Synthesis of a soft-float library containing basic soft-float operations or through FloPoCo, a generator of arithmetic Floating-Point Cores. The allocation step maps operations onto the set of available FUs, whose characterization includes information such as latency, area, and number of pipeline stages. Usually several operation/FU matchings are feasible: in this case the selection of a proper FU is driven by design constraints. In addition to FUs, memory resources are also allocated: local data may be bound to local memories.

The library of functional units used by Bambu is quite rich and in some cases it includes several implementations for the same single operation.
Moreover, the library contains functional units that are expressed as templates in a standard hardware description language (i.e. Verilog or VHDL). These templates can be retargeted and customized on the basis of the characteristics of the target technology. In this case, the underlying logic synthesis tool can determine which is the best architecture to implement each function. For example, multipliers can be mapped either on dedicated DSP blocks or implemented with LUTs. To perform aggressive optimizations, each component of the library is annotated with information useful during the entire HLS process, such as resource occupation and latency for executing the operations. Bambu adopts a pre-characterization approach. That is, the performance estimation considers a generic template of the functional unit, which can be parametric with respect to the bitwidths and pipeline stages. Latency and resource occupation are then obtained by synthesizing each configuration and storing the results in the library.


Scheduling

Scheduling of operations is performed by default through a LIST-based algorithm, which is constrained by resource availability. In its basic formulation, the LIST algorithm associates a priority to each operation, according to particular metrics. For example, the priority may reflect the mobility of an operation with respect to the critical path. Operations belonging to the critical path have zero mobility: delaying their execution usually results in an increase of the overall circuit latency. Critical path and mobilities can be obtained by analyzing the As Soon As Possible (ASAP) and As Late As Possible (ALAP) schedules. The LIST approach proceeds iteratively, associating to each control step the operations to be executed. Ready operations (i.e. those whose dependencies have been satisfied in previous iterations of the algorithm) are scheduled in the current control step considering resource availability: if multiple ready operations compete for a resource, then the one with the higher priority is scheduled.

Alternatively, a Speculative scheduling algorithm based on System of Difference Constraints (see Code Transformations Based on Speculative SDC Scheduling paper) is available: this algorithm builds an integer linear programming formulation of the scheduling problem, allowing code motions and speculations of operations into different basic blocks. The solution produced by the ILP solver is then implemented by applying the code motions and the speculations it suggests, after which the rest of the High-Level Synthesis flow can proceed.

After the scheduling task, the State Transition Graph (STG) can be built accordingly: the STG is adopted for further analysis and to build the final Finite State Machine implementation for the controller.
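
The list-based approach described above can be sketched in a few lines of C. The sketch below makes several simplifying assumptions that are not taken from Bambu: every operation takes one control step, all operations compete for a single kind of FU, operations are numbered in topological order, and the priority is the mobility (ALAP minus ASAP). All type and function names are hypothetical.

```c
#include <stdbool.h>

#define MAX_OPS 16

/* dep[i][j] is true when operation j must finish before operation i starts.
   Operations are assumed to be numbered in topological order (j < i). */
typedef struct {
    int n;
    bool dep[MAX_OPS][MAX_OPS];
} dfg_t;

/* ASAP: earliest control step of each operation with unlimited resources.
   Returns the critical path length in control steps. */
static int asap(const dfg_t *g, int step[])
{
    int len = 0;
    for (int i = 0; i < g->n; ++i) {
        step[i] = 0;
        for (int j = 0; j < i; ++j)
            if (g->dep[i][j] && step[j] + 1 > step[i]) step[i] = step[j] + 1;
        if (step[i] + 1 > len) len = step[i] + 1;
    }
    return len;
}

/* ALAP: latest control step that still meets the given latency. */
static void alap(const dfg_t *g, int latency, int step[])
{
    for (int i = g->n - 1; i >= 0; --i) {
        step[i] = latency - 1;
        for (int k = i + 1; k < g->n; ++k)
            if (g->dep[k][i] && step[k] - 1 < step[i]) step[i] = step[k] - 1;
    }
}

/* Resource-constrained list scheduling with `units` identical FUs
   (units >= 1). Lower mobility means higher priority.
   sched[i] receives the control step of operation i.
   Returns the schedule length in control steps. */
int list_schedule(const dfg_t *g, int units, int sched[])
{
    int s_asap[MAX_OPS], s_alap[MAX_OPS];
    int latency = asap(g, s_asap);
    alap(g, latency, s_alap);

    for (int i = 0; i < g->n; ++i) sched[i] = -1;
    int done = 0, step = 0;
    while (done < g->n) {
        int used = 0;
        while (used < units) {
            int best = -1;
            for (int i = 0; i < g->n; ++i) {
                if (sched[i] >= 0) continue;
                /* Ready: every predecessor was scheduled in an earlier step. */
                bool ready = true;
                for (int j = 0; j < i; ++j)
                    if (g->dep[i][j] && (sched[j] < 0 || sched[j] >= step))
                        ready = false;
                if (!ready) continue;
                if (best < 0 ||
                    s_alap[i] - s_asap[i] < s_alap[best] - s_asap[best])
                    best = i;
            }
            if (best < 0) break;   /* no ready operation left for this step */
            sched[best] = step;
            ++used;
            ++done;
        }
        ++step;
    }
    return step;
}
```

With a five-operation graph in which operation 2 depends on operations 0 and 1, operation 4 depends on operation 2, and operation 3 is independent, two FUs give a three-step schedule, while a single FU postpones the zero-priority-pressure operation 3 to the last step.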

Module Binding

Operations that execute concurrently, according to the computed schedule, are not allowed to share the same FU instance, thus avoiding resource conflicts. In Bambu, binding is performed through a clique covering algorithm on a weighted compatibility graph. The compatibility graph is built by analyzing the schedule: operations scheduled on different control steps are compatible. Weights express how profitable it is for two operations to share the same hardware resource. They are computed taking into account the area/delay trade-offs that result from sharing; for example, FUs that demand a large area are more likely to be shared. The weight computation also considers the cost of the interconnections needed to introduce steering logic, both in terms of area and frequency. Bambu also offers several algorithms for solving the covering problem on generic compatibility/conflict graphs.
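
The compatibility relation described above can be illustrated with a small C sketch. It uses a plain greedy clique partitioning, which is a deliberately simpler technique than the weighted clique covering used by Bambu, and it ignores weights altogether. The operation encoding and all names are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    int kind;  /* operation kind, e.g. 0 = add, 1 = mul (hypothetical encoding) */
    int step;  /* control step assigned by the scheduler */
} op_t;

/* Two operations may share an FU if they have the same kind and are
   scheduled in different control steps. */
static bool compatible(const op_t *a, const op_t *b)
{
    return a->kind == b->kind && a->step != b->step;
}

/* Greedy clique partitioning: put each operation into the first FU instance
   whose operations are all compatible with it; otherwise open a new instance.
   fu[i] receives the instance index of operation i. Returns the FU count. */
int bind_modules(const op_t ops[], int n, int fu[])
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        fu[i] = -1;
        for (int f = 0; f < count && fu[i] < 0; ++f) {
            bool ok = true;
            for (int j = 0; j < i; ++j)
                if (fu[j] == f && !compatible(&ops[i], &ops[j])) ok = false;
            if (ok) fu[i] = f;
        }
        if (fu[i] < 0) fu[i] = count++;
    }
    return count;
}
```

Two multiplications scheduled in the same control step end up on different instances, while a multiplication scheduled later can reuse one of them.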

Register Binding

Register binding associates storage values to registers, and requires a preliminary analysis step, the Liveness Analysis (LA). LA analyzes the scheduled function and identifies the life interval of each variable, i.e. the sequence of control steps in which a temporary needs to be stored. Storage values with non-overlapping life intervals may share the same register. With the default settings, the Bambu flow computes liveness information through a non-iterative SSA liveness analysis algorithm (see Non-Iterative SSA liveness analysis paper). Register assignment is then reduced to the problem of coloring a conflict graph: the nodes of the graph are storage values, and the edges represent the conflict relation. Algorithms that solve the register binding problem as a weighted clique covering of a compatibility graph are also available.
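
The sharing rule stated above (values with non-overlapping life intervals may use the same register) can be sketched with the classic left-edge heuristic. This is a swapped-in textbook technique chosen for brevity, not the conflict graph coloring that Bambu applies by default, and the half-open interval convention and all names are assumptions.

```c
#include <stdlib.h>

#define MAX_REGS 32

typedef struct {
    int start;  /* first control step in which the value is live */
    int end;    /* first control step in which it is no longer needed */
    int reg;    /* register index chosen by bind_registers */
} value_t;

static int by_start(const void *a, const void *b)
{
    const value_t *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Left-edge style binding: visit values by increasing start step and reuse
   the first register that is already free. The array v is sorted in place.
   Assumes that no more than MAX_REGS registers are needed.
   Returns the number of registers used. */
int bind_registers(value_t v[], int n)
{
    int free_at[MAX_REGS];  /* step from which each register is free again */
    int regs = 0;
    qsort(v, (size_t)n, sizeof v[0], by_start);
    for (int i = 0; i < n; ++i) {
        int r = 0;
        while (r < regs && free_at[r] > v[i].start) ++r;
        if (r == regs) ++regs;
        free_at[r] = v[i].end;
        v[i].reg = r;
    }
    return regs;
}
```

Four values live in the intervals [0,2), [1,3), [2,4) and [3,5) fit in two registers, because at most two of them are live in any control step.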

Interconnection Binding

Interconnections are bound according to the previous steps: if a resource is shared, then the algorithm introduces steering logic on its inputs. It also identifies the relation between control signals and different operations: such signals are then set by the controller.

Netlist Generation

During the synthesis process, the final architecture is represented through a hyper-graph, which also highlights the interconnection between modules.
The netlist generation step translates this representation into a Verilog or VHDL description. The process accesses the resource library, which embeds the Verilog or the VHDL implementation of each allocated module.

Generation of Synthesis and Simulation Scripts

Bambu provides the automatic generation of synthesis and simulation scripts which can be customized by means of XML configuration files. This feature allows the automatic characterization of the resource library, providing technology-aware details during the High-Level Synthesis.

The tools for RTL-synthesis currently supported are:

  • Xilinx ISE
  • Xilinx Vivado
  • Altera Quartus
  • Lattice Diamond

while the supported simulators are:

  • Mentor Modelsim
  • Xilinx ISIM
  • Xilinx XSIM
  • Verilator
  • Icarus Verilog

Bambu examples

The distribution includes several examples under directory example. Here is the list of directories currently included:

    • add_device_simple

This example shows how to add a non-supported device to the Bambu synthesis flow.
The file xc7z045-2ffg900-VVD.xml has been copied from the framework distribution file etc/devices/Xilinx_devices/xc7z020-1clg484-VVD.xml and then renamed to xc7z045-2ffg900-VVD.xml.
After copying the file, a few changes have been made. All of them relate to the characteristics of the new device: model, package and speed grade.
The changed part of the XML file follows:
<model value="xc7z045"/>
<package value="ffg900"/>
<speed_grade value="-2"/>

Note that the field
<family value="Zynq-VVD"/>
refers to the synthesis script stored in etc/devices/Xilinx_devices/Zynq-VVD.xml.
So, the example first simulates and then synthesizes the C-based description using the Zynq device specified above.

Note that this example shows another nice feature of the HLS framework. The file module.c contains the C specification of the factorial function in its recursive form.
Bambu is currently not able to synthesize recursive functions, but GCC is able to automatically translate the function into its non-recursive form once the -O2 option is passed. To understand exactly what has been synthesized, please check the file a.c in the sim or synth directory created during the run.
The new device considered in this example is very similar to one of those already supported. If the device is not very similar to one of the already characterized devices, the user should check the characterization scripts and extend them accordingly. Examples of characterization scripts based on the eucalyptus tool are available in etc/devices.
Note that eucalyptus is automatically built once an RTL synthesis back-end is configured.
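
To give an idea of the recursion removal mentioned above, the following C sketch places a recursive factorial next to an equivalent loop form. It does not reproduce the contents of module.c or of the generated a.c; the function names are hypothetical and the loop version is only the kind of code an optimizing compiler may derive.

```c
/* Recursive form, in the style of the specification described above. */
unsigned int factorial_rec(unsigned int n)
{
    return n <= 1 ? 1u : n * factorial_rec(n - 1);
}

/* Loop form: no call to itself, so it can be mapped onto a single module
   with a finite state machine iterating over n. */
unsigned int factorial_iter(unsigned int n)
{
    unsigned int acc = 1;
    while (n > 1) {
        acc *= n;
        --n;
    }
    return acc;
}
```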

    • arf

This directory includes a simple example of High-Level Synthesis and of the generation of RTL simulation and synthesis scripts.
The results of the synthesis can be inspected by looking into testbench/hls_summary_0.xml.
The result of the scheduling can be viewed graphically with a viewer of dot files (e.g., xdot or dotty).
In particular, Bambu generates several dot files when the option --print-dot is passed.
The scheduling of the arf function is stored in file HLS_output/dot/arf/ while the FSM of the arf function annotated with the C statements is stored in file HLS_output/dot/arf/

    • arf_res_sharing

In this directory, the impact of resource sharing on multipliers for the arf benchmark is considered. Two sets of scripts are provided: constrained and non-constrained based synthesis scripts.
The devices considered are the ones supported by Bambu.
In all the synthesis performed, the WB4 interface has been used to avoid issues with the high number of IO pins required by the arf function when synthesized alone.
Basically, adding a constraint on the number of multipliers used requires passing to Bambu an XML file structured in this way:

<?xml version="1.0"?>
<tech_constraints fu_name="mult_expr_FU" fu_library="STD_FU" n="1"/>

    • crc

This directory collects several scripts to test the multi-bus feature of bambu.
The file test_icrc.xml shows how to write xml testcases for array-based function parameters.

    • crc_yosys

This directory shows an example of how to write a C-based testbench to test a given kernel.
The kernel function is defined through the option --top-rtldesign-name.

This design flow requires adding two attributes to the kernel function:

  __attribute__ ((noinline)) __attribute__ ((used))  

and inserting these two timing functions:


These two functions will start and stop a timer used by Bambu to compute the total number of cycles spent in the kernel function.
The target device is a Zynq xc7z020,-1,clg484 and the back-end flow is based on the yosys open-source RTL synthesis tool.

    • crypto_designs

This example starts from the reference C description of Keccak crypto function distributed through this website
Keccak has been selected by NIST to become the new SHA-3 standard.
Further details can be found at:
Together with the C implementation optimized for processors, there exist several implementations for FPGA and ASIC.
As a reference, one of the Low-Area Implementations developed by the authors of the Keccak algorithm (i.e., Guido Bertoni-STMicroelectronics, Joan Daemen-STMicroelectronics, Michaël Peeters-NXP Semiconductors and Gilles Van Assche-STMicroelectronics) has been selected.

The results reported at this link are:

Altera Cyclone III 1559LEs 47.8Mbit/s 181 MHz

Xilinx Virtex 5 444slices 70.1Mbit/s 265 MHz

Starting from the C description delivered as a reference, an equivalent C function (equivalent to the VHDL reference design) has been built.
After two days of hacking and design space exploration, here are 5 different alternatives using different FPGAs:

Altera Cyclone II 5460LEs 66.9Mbit/s 107MHz (directory keccak_CycloneII_10)

Altera Cyclone II 8681LEs 150.8Mbit/s 262MHz (directory keccak_CycloneII_4hl)

Lattice ECP3 3789slices 80.2Mbit/s 128MHz (directory keccak_ECP3_10_09)

Lattice ECP3 3831slices 80.2Mbit/s 128MHz (directory keccak_ECP3_9)

Xilinx Virtex 5 7015slices 152.69Mbit/s 252MHz (directory keccak_V5_4hl)

These results have been obtained with PandA framework 0.9.3.

A second example, in the directory crypto_designs/multi-keccak, shows how to build an Autotools project for the high-level synthesis with bambu.

    • fft_example

This directory includes an example program which computes the FFT of a short pulse in a sample of length 128.

    • function_pointers

Scripts, updated results, and code related to this paper:

Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Fabrizio Ferrandi: Inter-procedural resource sharing in High Level Synthesis through function proxies. FPL 2015: 1-8.

    • CHStone

This directory contains the CHStone v1.11 benchmarks, together with all the scripts used and the results obtained with bambu.

    • mm

This directory shows how to write a test.xml file when multi-dimensional arrays are used as function parameters.
The example uses the option --memory-allocation-policy=EXT_PIPELINED_BRAM, which declares that the parameters are allocated on a block RAM memory (i.e., pipelined accesses are possible).

    • mm_float

This example is very similar to the mm example.
There are mainly two differences:
– the two dimensions of the arrays are passed as a parameter;
– the matrix elements are floats.
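
The two differences listed above can be pictured with a short C sketch of a matrix multiplication kernel whose dimensions are passed as parameters and whose elements are floats. This is an assumed illustration: the function name, the parameter names and the row-major flat layout are hypothetical and are not taken from the example's source.

```c
/* out = in_a * in_b, where in_a is a_rows x a_cols and in_b is
   a_cols x b_cols, all stored row-major in flat arrays.
   Hypothetical sketch, not the code shipped with the example. */
void mm_float(const float *in_a, const float *in_b, float *out,
              unsigned a_rows, unsigned a_cols, unsigned b_cols)
{
    for (unsigned i = 0; i < a_rows; ++i) {
        for (unsigned j = 0; j < b_cols; ++j) {
            float sum = 0.0f;
            for (unsigned k = 0; k < a_cols; ++k) {
                sum += in_a[i * a_cols + k] * in_b[k * b_cols + j];
            }
            out[i * b_cols + j] = sum;
        }
    }
}
```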

    • libm

This directory contains scripts and results obtained on the libm functions supported by bambu.

    • VGA

Vga Adapter on Altera DE1 Cyclone II (EP2C20F484C7N).
The main aim of the project is to develop an application written in C which drives a VGA-compatible screen connected to a DE1 Altera FPGA.
The design includes some Verilog IPs which control the VGA port and shows how Bambu can manage existing IPs described by using hardware description languages.

    • VGA_Nexys4

This simple example shows how to integrate C code with low-level interfaces written in Verilog.
The design improves the VGA example by adapting such design to the more capable NEXYS4 prototyping board.

    • file_simulate

In this directory, an example of how Bambu can use IO libc primitives (open, read, write and close) is shown.

    • IP_integration

This directory contains a simple example describing how to integrate and verify existing IPs with functions written in C that receive structs passed by pointer.

    • simple_asm

This simple example shows how to integrate small snippets of Verilog in the HLS flow by making Bambu use Verilog as a third assembler dialect.
Currently, only single-output asm instructions are supported. When outputs are present, the Intel and the ATT asm strings must also be provided for the simulation to pass. For asm statements having only inputs, those strings can safely be left empty.
A detailed reference on how asm statements are considered by GCC could be found at this link:

    • python-bindings

This directory includes an example showing how to integrate Python for design verification.

    • led_example

This directory includes an example of a simple GPIO controller developed to show how to integrate Verilog IPs with plain C.

    • pong

This directory includes the Pong game ported to Nexys4 prototyping board. Pong was the first game developed by Atari Inc. and was designed and built by Allan Alcorn. Further information can be found at
The code has been ported by Fabrizio Ferrandi by adapting an SDL-based tutorial to the PandA methodology for the integration of low-level IP cores written in Verilog.
The original SDL code can be found at
The artificial intelligence used to control the computer paddle is based on a random function described at

    • breakout

This directory includes the breakout game ported to Nexys4 prototyping board. The game was designed by Nolan Bushnell, Steve Wozniak, and Steve Bristow. History of Breakout game can be found at this link:
The code has been ported by Fabrizio Ferrandi by adapting an SDL-based tutorial to the PandA methodology for the integration of low-level IP cores written in Verilog.
The original SDL code can be found at

    • MachSuite

This directory contains the scripts, the results and the code of the MachSuite benchmark set, which is described in this paper:

Brandon Reagen, Robert Adolf, Sophia Yakun Shao, Gu-Yeon Wei, and David Brooks.
“MachSuite: Benchmarks for Accelerator Design and Customized Architectures.”
2014 IEEE International Symposium on Workload Characterization.

    • hls_study

This directory includes the scripts, the updated results and the code related to this paper:

R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels, “A Survey and Evaluation of FPGA High-Level Synthesis Tools,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. PP, iss. 99, pp. 1-1, 2016.

    • softfloat

This directory includes scripts and code testing single and double precision basic operations: division, subtraction, addition and multiplication.

Bambu options

In the following the current Bambu options are reported:

                    ____                  _
                   | __ )  __ _ _ __ ___ | |_   _   _
                   |  _ \ / _` | '_ ` _ \| '_ \| | | |
                   | |_) | (_| | | | | | | |_) | |_| |
                   |____/ \__,_|_| |_| |_|_.__/ \__,_|

                         High-Level Synthesis Tool

                         Politecnico di Milano - DEIB
                          System Architectures Group
                Copyright (c) 2004-2017 Politecnico di Milano
    Version: PandA 0.9.5 - Revision 69c929b7691994bcf63d1505c5279cbb424f3682

       bambu [Options] <source_file> [<constraints_file>] [<technology_file>]


  General options:

    --help, -h
        Display this usage information.

    --version, -V
        Display the version of the program.

        Read command line options from a XML file.

        Dump the parsed command line options into a XML file.

  Output options:

    --verbosity, -v <level>
        Set the output verbosity level
        Possible values for <level>:
            0 - NONE
            1 - MINIMUM
            2 - VERBOSE
            3 - PEDANTIC
            4 - VERY PEDANTIC
        (default = 1)

    --debug, -d <level>
        Set the verbosity level of debugging information
        Possible values for <level>:
            0 - NONE
            1 - MINIMUM
            2 - VERBOSE
            3 - PEDANTIC
            4 - VERY PEDANTIC
        (default = 1).

        Set maximum debug level for classes in <classes_list>

        Set a maximum number of cfg transformations for each function.

        Do not remove temporary files.

        Set the name of the current benchmark for data collection.
        Mainly useful for data collection from extensive regression tests.

        Set the name of the current tool configuration for data collection.
        Mainly useful for data collection from extensive regression tests.

        Set the parameters string for data collection. The parameters in the
        string are not actually used, but they are used for data collection in
        extensive regression tests.

        Set the directory where temporary files are saved.
        Default is 'panda-temp'

        Dump to file several different graphs used in the IR of the tool.
        The graphs are saved in .dot files, in graphviz format

        Convert all runtime warnings to errors.

        C-based pretty print of the internal IR.

        Output RTL language:
            V - Verilog (default)
            H - VHDL

        Avoid mixed design.

        Generate testbench for the input values defined in the specified XML
        file.

        Define the top function to be synthesized.

        Define the top module name for the RTL backend.

        A comma-separated list of input files used by the C specification.

        Specify a comma-separated list of C files used only during the
        co-simulation phase.

  GCC options:

        Specify which compiler is used.
        Possible values for <processor>:

        Enable a specific optimization level. Possible values are the usual
        optimization flags accepted by compilers, plus some others:

        Enable or disable a GCC optimization option. All the -f or -fno options
        are supported. In particular, -ftree-vectorize option triggers the
        high-level synthesis of vectorized operations.

        Specify a path where headers are searched for.

        Specify a warning option passed to GCC. All the -W options available in
        GCC are supported.

        Enable preprocessing mode of GCC.

        Assume that the input sources are for <standard>. All
        the --std options available in GCC are supported.

        Predefine name as a macro, with definition 1.

        Tokenize <definition> and process as if it appeared as a #define directive.

        Remove existing definition for macro <name>.

    --param <name>=<value>
        Set the amount <value> for the GCC parameter <name> that could be used for
        some optimizations.

        Search the library named <library> when linking.

        Add directory <dir> to the list of directories to be searched for -l.

        Specify that input file is already a raw file and not a source file.

        Specify machine dependent options (currently not used).

        Read GCC options from a XML file.

        Dump the parsed GCC compiler options into a XML file.

        Return the system include directory used by the wrapped GCC compiler.

        Return the GCC configuration.

        Replace sizeof with the computed value for the considered target

        Specify custom extra options to the compiler.


    --target-file=file, -b<file>
        Specify an XML description of the target device.

        Wrap the top level module with an external interface.
        Possible values for <type> and related interfaces:
            minimal  -  (minimal interface - default)
            WB4      -  (WishBone 4 interface)


        Perform priority list-based scheduling. This is the default scheduling algorithm
        in bambu. The optional <type> argument can be used to set options for
        list-based scheduling as follows:
            0 - Dynamic mobility (default)
            1 - Static mobility
            2 - Priority-fixed mobility

        Perform post rescheduling to better distribute resources.

        Perform scheduling by using speculative sdc.

        Provide scheduling as an XML file.

        Disable chaining optimization.


        Set the algorithm used for register allocation. Possible values for the
        <type> argument are the following:
            WEIGHTED_COLORING   - use weighted coloring algorithm (default)
            COLORING            - use simple coloring algorithm
            CHORDAL_COLORING    - use chordal coloring algorithm
            BIPARTITE_MATCHING  - use bipartite matching algorithm
            TTT_CLIQUE_COVERING - use a weighted clique covering algorithm
            UNIQUE_BINDING      - unique binding algorithm

        Set the algorithm used for module binding. Possible values for the
        <type> argument are one of the following:
            WEIGHTED_TS        - solve the weighted clique covering problem by
                                 exploiting the Tseng&Siewiorek heuristics
            WEIGHTED_COLORING  - solve the weighted clique covering problem
                                 performing a coloring on the conflict graph
            COLORING           - solve the unweighted clique covering problem
                                 performing a coloring on the conflict graph
            TTT_FAST           - use Tomita, A. Tanaka, H. Takahashi maximal
                                 weighted cliques heuristic to solve the clique
                                 covering problem
            TTT_FAST2          - use Tomita, A. Tanaka, H. Takahashi maximal
                                 weighted cliques heuristic to incrementally
                                 solve the clique covering problem
            TTT_FULL           - use Tomita, A. Tanaka, H. Takahashi maximal
                                 weighted cliques algorithm to solve the clique
                                 covering problem
            TTT_FULL2          - use Tomita, A. Tanaka, H. Takahashi maximal
                                 weighted cliques algorithm to incrementally
                                 solve the clique covering problem
            TS                 - solve the unweighted clique covering problem
                                 by exploiting the Tseng&Siewiorek heuristic
            BIPARTITE_MATCHING - solve the weighted clique covering problem
                                 exploiting the bipartite matching approach
            UNIQUE             - use a 1-to-1 binding algorithm

  Memory allocation:

        Set the algorithm used for memory allocation. Possible values for the
        type argument are the following:
            DOMINATOR          - all local variables, static variables and
                                 strings are allocated on BRAMs (default)
            XML_SPECIFICATION  - import the memory allocation from an XML

        Specify the file where the XML configuration has been defined.

        Set the policy for memory allocation. Possible values for the <type>
        argument are the following:
            ALL_BRAM           - all objects that need to be stored in memory
                                 are allocated on BRAMs (default)
            LSS                - all local variables, static variables and
                                 strings are allocated on BRAMs
            GSS                - all global variables, static variables and
                                 strings are allocated on BRAMs
            NO_BRAM            - all objects that need to be stored in memory
                                 are allocated on an external memory
            EXT_PIPELINED_BRAM - all objects that need to be stored in memory
                                 are allocated on an external pipelined memory

        Define the starting address for objects allocated externally to the top
        module.

        Define the starting address for the objects allocated internally to the
        top module.

        Set the type of memory connections.
        Possible values for <type> are:
            MEM_ACC_11 - the accesses to the memory have a single direct
                         connection or a single indirect connection (default)
            MEM_ACC_N1 - the accesses to the memory have n parallel direct
                         connections or a single indirect connection
            MEM_ACC_NN - the accesses to the memory have n parallel direct
                         connections or n parallel indirect connections

        Define the number of parallel direct or indirect accesses.

        Define which type of memory controller is used. Possible values for the
        <type> argument are the following:
            D00 - no extra delay (default)
            D10 - 1 clock cycle extra-delay for LOAD, 0 for STORE
            D11 - 1 clock cycle extra-delay for LOAD, 1 for STORE
            D21 - 2 clock cycles extra-delay for LOAD, 1 for STORE
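
        To make the naming scheme concrete, the sketch below copies the four
        controller types from the list above into a table and sums the extra
        cycles for a short access trace. The trace and the assumption that
        accesses are serialized (so the delays simply add up) are illustrative
        only.

```python
# Extra clock cycles per controller type, copied from the list above.
EXTRA_DELAY = {
    "D00": {"LOAD": 0, "STORE": 0},
    "D10": {"LOAD": 1, "STORE": 0},
    "D11": {"LOAD": 1, "STORE": 1},
    "D21": {"LOAD": 2, "STORE": 1},
}


def extra_cycles(controller, accesses):
    """Total extra delay for a sequence of serialized accesses (assumed model)."""
    delays = EXTRA_DELAY[controller]
    return sum(delays[kind] for kind in accesses)


# Hypothetical trace: two loads followed by one store.
trace = ["LOAD", "LOAD", "STORE"]
overhead = {name: extra_cycles(name, trace) for name in EXTRA_DELAY}
print(overhead)
```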

        Control how the memory allocation happens.
            on - allocate the data in addresses which reduce the decoding logic (default)
           off - allocate the data in contiguous addresses.
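
        One possible reading of the two settings is sketched below. It assumes
        that "reducing the decoding logic" means aligning every object to a
        power-of-two boundary at least as large as the object, so that the
        upper address bits alone select it. Bambu's actual placement policy may
        differ, and the object sizes are made up.

```python
def next_pow2(n):
    """Smallest power of two that is >= n (1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def allocate(sizes, sparse=True, base=0):
    """Return the start address of each object, in bytes."""
    addresses = []
    cursor = base
    for size in sizes:
        if sparse:
            # Assumed policy: round the start address up to a power-of-two boundary.
            align = next_pow2(size)
            cursor = (cursor + align - 1) // align * align
        addresses.append(cursor)
        cursor += size
    return addresses


sizes = [24, 100, 8]  # hypothetical object sizes in bytes
print(allocate(sizes, sparse=True))
print(allocate(sizes, sparse=False))
```

        The aligned layout wastes some address space in exchange for simpler
        address comparisons; the contiguous layout packs objects back to back.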

        Do not add asynchronous memories to the possible set of memories used
        by bambu during the memory allocation step.

        Define the threshold in bitsize used to infer DISTRIBUTED/ASYNCHRONOUS RAMs (default 256).

        Serialize the memory accesses using the GCC virtual use-def chains
        without taking into account any alias analysis information.

        Use only memories supporting unaligned accesses.

        Assume that all accesses are aligned and so only memories supporting aligned
        accesses are used.

        When enabled, LOADs and STOREs will not be chained with other
        operations.

        Assume that read-only memories can be duplicated in case timing requires.

        Assume a 'high latency BRAM'-'faster clock frequency' block RAM based
        memory architecture:
        3 => LOAD(II=1,L=3) STORE(1).
        4 => LOAD(II=1,L=4) STORE(II=1,L=2).
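
        The notation above gives an initiation interval (II) and a latency (L)
        for each access type. Under the usual pipeline model, which is assumed
        here rather than taken from Bambu's documentation, n back-to-back
        accesses finish after (n - 1) * II + L cycles. The helper name and the
        burst length are hypothetical.

```python
def cycles_for_accesses(n, ii, latency):
    """Cycles until the last of n pipelined accesses completes."""
    if n <= 0:
        return 0
    return (n - 1) * ii + latency


# Eight consecutive LOADs under the two configurations listed above.
loads_l3 = cycles_for_accesses(8, ii=1, latency=3)
loads_l4 = cycles_for_accesses(8, ii=1, latency=4)
print(loads_l3, loads_l4)
```

        With II=1 the extra latency is paid once per burst, not once per
        access, which is why a deeper memory pipeline can still pay off at a
        higher clock frequency.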

        Define the external memory latency when LOADs are performed (default 2).

        Define the external memory latency when STOREs are performed (default 1).

        All global variables are considered local to the compilation units.

        Set the bitsize of the external data bus.

        Set the bitsize of the external address bus.

  Evaluation of HLS results:

        Simulate the RTL implementation.

        Simulate the RTL implementation and then open Mentor Visualizer.

        Specify the simulator used in generated simulation scripts:
            MODELSIM - Mentor Modelsim
            XSIM - Xilinx XSim
            ISIM - Xilinx iSim
            ICARUS - Verilog Icarus simulator
            VERILATOR - Verilator simulator

        Specify the maximum number of cycles an HDL simulation may run.
        (default 20000000).

        Do not assume that application main must return 0.

        Enable .vcd output file generation for waveform visualization (requires
        testbench generation).

        Perform evaluation of the results.
        The value of 'type' selects the objectives to be evaluated.
        If nothing is specified, all of the following are evaluated.
        The 'type' argument can be a string containing any of the following
        strings, separated by commas, without spaces:
            AREA            - Area usage
            AREAxTIME       - Area x Latency product
            TIME            - Latency for the average computation
            TOTAL_TIME      - Latency for the whole computation
            CYCLES          - n. of cycles for the average computation
            TOTAL_CYCLES    - n. of cycles for the whole computation
            BRAMS           - number of BRAMs
            CLOCK_SLACK     - Slack between actual and required clock period
            DSPS            - number of DSPs
            FREQUENCY       - Maximum target frequency
            PERIOD          - Actual clock period
            REGISTERS       - number of registers
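
        The format of the 'type' string can be checked with a small helper
        before invoking the tool. This is a hypothetical validation sketch, not
        Bambu's own parser; only the objective names are taken from the list
        above.

```python
OBJECTIVES = (
    "AREA", "AREAxTIME", "TIME", "TOTAL_TIME", "CYCLES", "TOTAL_CYCLES",
    "BRAMS", "CLOCK_SLACK", "DSPS", "FREQUENCY", "PERIOD", "REGISTERS",
)


def parse_objectives(value=None):
    """Split a comma-separated objective string; no value selects everything."""
    if not value:
        return list(OBJECTIVES)
    if " " in value:
        raise ValueError("objectives must be separated by commas, without spaces")
    names = value.split(",")
    unknown = [name for name in names if name not in OBJECTIVES]
    if unknown:
        raise ValueError("unknown objectives: " + ", ".join(unknown))
    return names


selected = parse_objectives("AREA,CYCLES,FREQUENCY")
print(selected)
```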

  RTL synthesis:

    Note: for a more complete evaluation you should use the option --evaluation

        Specify the period of the clock signal (default = 10ns).

        Specify a file that will be included in the backend specific synthesis
        script.

        Specify a file that will be included in the Synopsys Design Constraints
        file (SDC).

        Specify the name of the device. Three different cases are foreseen:
            - Xilinx:  a comma separated string specifying device, speed grade
                       and package (e.g., "xc7z020,-1,clg484,VVD")
            - Altera:  a string defining the device string (e.g. EP2C70F896C6)
            - Lattice: a string defining the device string (e.g.

        Enable Xilinx power based optimization (default no).

        Disconnect primary ports from the IOB (the default is to connect
        primary input and output ports to IOBs).

        Enable the soft-based implementation of floating-point operations.
        Bambu uses as default a faithfully rounded version of softfloat with rounding mode equal to round to nearest even.

        This is the default for bambu.

        Enable the flopoco-based implementation of floating-point operations.

        Enable the soft-based implementation of floating-point operations with subnormals support.

        Enable the use of standard libm.
        Without this option, Bambu uses as default a faithfully rounded version of libm with rounding mode equal to round to nearest even.

        Enable the use of soft_fp GCC library instead of bambu customized version of John R. Hauser softfloat library.

        Define the maximal ULP (unit in the last place, i.e., the spacing
        between adjacent floating-point numbers) accepted.
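
        As a rough illustration of what an error bound in ULPs means, the
        sketch below measures the distance between an approximate and a
        reference result in units of the reference's ULP. It uses Python
        doubles and math.ulp (Python 3.9 or later); the helper name is
        hypothetical and the example is unrelated to how Bambu checks results
        internally.

```python
import math


def ulp_error(approx, exact):
    """Distance between approx and exact, measured in ULPs of exact."""
    return abs(approx - exact) / math.ulp(exact)


exact = 1.0
one_step_up = math.nextafter(exact, 2.0)  # the next representable double above 1.0
print(ulp_error(exact, exact))
print(ulp_error(one_step_up, exact))
```

        A result that is off by one representable value from the reference has
        an error of one ULP, so a larger accepted ULP value tolerates coarser
        floating-point units.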

        Perform the high-level synthesis of integer division and modulo
        operations starting from a C library based implementation or an HDL component:
             none  - use an HDL-based pipelined restoring division
             nr1   - use a C-based non-restoring division with unrolling factor equal to 1 (default)
             nr2   - use a C-based non-restoring division with unrolling factor equal to 2
             NR    - use a C-based Newton-Raphson division
             as    - use a C-based align divisor shift dividend method
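
        For readers unfamiliar with the non-restoring scheme named above, here
        is the textbook algorithm for unsigned operands. It is a sketch only:
        the C routines shipped with Bambu are not reproduced here, no unrolling
        is modeled, and the function name and bit width are assumptions.

```python
def nonrestoring_divide(dividend, divisor, bits=32):
    """Unsigned non-restoring division: returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = 0  # signed partial remainder
    quotient = 0
    for i in range(bits - 1, -1, -1):
        bit = (dividend >> i) & 1
        # Subtract after a non-negative remainder, add back after a negative one.
        if remainder >= 0:
            remainder = 2 * remainder + bit - divisor
        else:
            remainder = 2 * remainder + bit + divisor
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor  # single final correction step
    return quotient, remainder


print(nonrestoring_divide(13, 3, bits=4))
```

        Unlike restoring division, the partial remainder is never restored
        inside the loop; a negative value is compensated by adding the divisor
        in the next iteration, and only one correction is needed at the end.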

        Used during the allocation of pipelined units. <value> specifies how
        many pipelined units, compliant with the clock period, will be skipped.

        Specify the type of reset:
             no    - use registers without reset (default)
             async - use registers with asynchronous reset
             sync  - use registers with synchronous reset

        Specify if the reset is active high or low:
             low   - use registers with active low reset (default)
             high  - use registers with active high reset

        Used to remove the INIT value from registers (useful for ASIC designs).

        Specify if inputs are registered or not:
             auto  - inputs are registered only for proxy functions (default)
             yes   - all inputs are registered
             no    - none of the inputs is registered

             auto    - it depends on the target technology. VVD prefers one-hot encoding while the others are fine with the standard binary encoding. (default)
             one-hot - one hot encoding
             binary  - binary encoding
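
        Assuming these values select how the states of the generated controller
        are encoded, the difference between the two explicit choices can be
        sketched as follows. The function names and the five-state example are
        hypothetical.

```python
def binary_encoding(n_states):
    """Dense codes: ceil(log2(n)) bits, one value per state."""
    width = max(1, (n_states - 1).bit_length())
    return [format(i, "0{}b".format(width)) for i in range(n_states)]


def one_hot_encoding(n_states):
    """One flip-flop per state: exactly one bit set in each code."""
    return [format(1 << i, "0{}b".format(n_states)) for i in range(n_states)]


print(binary_encoding(5))
print(one_hot_encoding(5))
```

        Binary encoding minimizes the number of state registers, while one-hot
        encoding spends one register per state to simplify the next-state and
        output decoding.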

        Clock Period Resource Fraction (default = 1.0).

        During the allocation step the timing of the DSP-based modules is
        multiplied by value (default = 1.0).

        Timing of combinational DSP-based modules is multiplied by value.
        (default = 1.0).

        Timing of pipelined DSP-based modules is multiplied by value.
        (default = 1.0).
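
        The effect of such a multiplier can be pictured with a small
        calculation. The sketch assumes that a module is usable in a single
        cycle only if its scaled delay fits in the clock period; the 10 ns
        period is the documented default, while the raw delay and the helper
        names are made up.

```python
def scaled_delay(raw_delay_ns, margin=1.0):
    """Delay seen by the allocation step after applying the multiplier."""
    return raw_delay_ns * margin


def fits_in_clock(raw_delay_ns, margin=1.0, clock_period_ns=10.0):
    """Assumed check: the scaled delay must not exceed the clock period."""
    return scaled_delay(raw_delay_ns, margin) <= clock_period_ns


raw = 9.0  # hypothetical DSP multiplier delay in ns
print(fits_in_clock(raw, margin=1.0))
print(fits_in_clock(raw, margin=1.25))
```

        A value above 1.0 makes the tool more pessimistic about DSP timing, so
        borderline modules are treated as too slow for the requested period.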

        Scheduling reserves a margin corresponding to the delay of n 32 bit
        multiplexers.

        Specify the timing model used by HLS:
             EC     - estimate timing overhead of glue logics and connections
                      between resources (default)
             SIMPLE - just consider the resource delay

        Specify the experimental setup. This is a shorthand to set multiple
        options with a single command.
        Available values for <setup> are the following:
             BAMBU-AREA           - this setup implies:
                                    -Os  -D'printf(fmt, ...)='
             BAMBU-AREA-MP        - this setup implies:
                                    -Os  -D'printf(fmt, ...)='
             BAMBU-BALANCED       - this setup implies:
                                    -O2  -D'printf(fmt, ...)='
                                    -fgcse-after-reload  -fipa-cp-clone
                                    -ftree-partial-pre  -funswitch-loops
                                    -finline-functions  -fdisable-tree-bswap
                                    --param max-inline-insns-auto=25
             BAMBU-BALANCED-MP    - (default) this setup implies:
                                    -O2  -D'printf(fmt, ...)='
                                    -fgcse-after-reload  -fipa-cp-clone
                                    -ftree-partial-pre  -funswitch-loops
                                    -finline-functions  -fdisable-tree-bswap
                                    --param max-inline-insns-auto=25
             BAMBU-PERFORMANCE    - this setup implies:
                                    -O3  -D'printf(fmt, ...)='
             BAMBU-PERFORMANCE-MP - this setup implies:
                                    -O3  -D'printf(fmt, ...)='
             BAMBU                - this setup implies:
                                    -O0 --channels-type=MEM_ACC_11
             BAMBU092             - this setup implies:
                                    -O3  -D'printf(fmt, ...)='
                                    --cprf=0.9  --skip-pipe-parameter=1
             VVD                  - this setup implies:
                                    -O3  -D'printf(fmt, ...)='
                                    --do-not-expose-globals --cprf=0.875

  Other options:

    --time, -t <time>
        Set maximum execution time (in seconds) for ILP solvers (default: infinite).

        Enables interprocedural bitvalue analysis.

  Debug options:

           Performs automated discrepancy analysis between the execution
           of the original source code and the generated HDL (currently
           supports only Verilog). If a mismatch is detected, useful
           information is reported to the user.
           Uninitialized variables in C are legal, but if they are used
           before initialization in HDL it is possible to obtain X values
           in simulation. This is not necessarily wrong, so these errors
           are not reported by default to avoid reporting false positives.
           If you can guarantee that in your C code there are no
           uninitialized variables and you want the X values in HDL to be
           reported use the option --discrepancy-force-uninitialized

           Reports errors due to uninitialized values in HDL.
           See the option --discrepancy for details

           Assume that the data loaded from memories in HDL are never used
           to represent addresses, unless they are explicitly assigned to
           pointer variables.
           The discrepancy analysis is able to compare pointers in software
           execution and addresses in hardware. By default all the values
           loaded from memory are treated as if they could contain addresses,
           even if they are integer variables. This is due to the fact that
           C code doing these tricks is valid and actually used in embedded
           systems, but it can lead to imprecise bug reports, because only
           pointers pointing to actual data are checked by the discrepancy
           analysis.
           If you can guarantee that your code always manipulates addresses
           using pointers and never using plain int, then you can use this
           option to get more precise bug reports.

           Restricts the discrepancy analysis only to the functions whose
           name is in the list passed as argument.

           Do not trigger hard errors on pointer variables.

        Enable assertion debugging performed by Modelsim.
