Bambu

Bambu: A Free Framework for the High-Level Synthesis of Complex Applications

Bambu is a free framework aimed at assisting the designer during the high-level synthesis of complex applications, supporting most of the C constructs (e.g., function calls and sharing of the modules, pointer arithmetic and dynamic resolution of memory accesses, accesses to arrays and structs, parameter passing either by reference or copy, …). Bambu is developed for Linux systems, is written in C++, and can be freely downloaded under the GPL license.

Bambu receives as input a behavioral description of the specification, written in the C language, and generates as output the HDL description of the corresponding RTL implementation, which is compatible with commercial RTL synthesis tools, along with a test-bench for the simulation and validation of the behavior. Bambu is designed in an extremely modular way, implementing the different tasks of the HLS process, and specific algorithms, in distinct C++ classes which work on different IRs depending on the synthesis stage.

The whole HLS flow is quite similar to a software compilation flow: it starts from a high level specification and produces low level code after a sequence of analysis and optimization steps.
As in a software compilation flow, three different phases can be identified in the High-Level Synthesis flow: front-end, middle-end and back-end. In the front-end, the input code is parsed and translated into an intermediate representation which is used in the following parts of the flow. In the middle-end, target-independent analyses and optimizations are performed. In the back-end, the actual synthesis of the specification is performed.

Bambu front-end

Bambu interfaces with the GNU Compiler Collection (GCC) (versions 4.5, 4.6, 4.7, 4.8, 4.9 and 5 are currently supported) by means of GCC plugins to extract its internal representation, in Static Single Assignment form, of the initial C code. In particular, the extracted IR is the GIMPLE IR used by GCC to perform the target- and language-independent optimizations. This representation is dumped to ASCII files, from which Bambu loads the intermediate representation.
The GIMPLE IR is extracted after GCC has performed the target-independent optimizations.
Note, however, that not all software code optimizations are profitable when the target is a hardware accelerator. For example, transformations like function inlining and loop unrolling can affect resource utilization much more than the same transformations do when the target is a processor.

Bambu middle-end

Starting from the intermediate representation extracted from GCC, Bambu performs further analyses and builds additional internal representations, such as Call Graph, Control Flow Graphs, Data Flow Graphs and Program Dependence Graphs.
Next, it applies a set of analyses and transformations that are independent of the target device.
Some of these steps are the same as those applied in a software compilation flow (e.g., data flow analysis, loop recognition, dead code elimination, constant propagation, etc.).

One relevant optimization performed by Bambu during this phase is the optimization of multiplications and divisions by a constant.
These operations are typically transformed into operations that use only shifts and adds to improve area and timing.
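
As a minimal sketch of the idea (not Bambu's actual transformation, whose details are not given here), the following C fragment rewrites a multiplication by a constant using only shifts and adds. The function names are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* 10 * x rewritten as 8 * x + 2 * x, i.e. two shifts and one add. */
uint32_t mul10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Generic form: one shifted copy of x is added for every bit set in the
   constant c. Arithmetic is modulo 2^32, like the C expression x * c. */
uint32_t mul_const_shift_add(uint32_t x, uint32_t c)
{
    uint32_t acc = 0;
    for (unsigned bit = 0; bit < 32; ++bit)
        if ((c >> bit) & 1u)
            acc += x << bit;
    return acc;
}

/* Host-side sanity check: the rewritten form matches the plain product. */
void check_mul_const(uint32_t x, uint32_t c)
{
    assert(mul_const_shift_add(x, c) == x * c);
}
```

In hardware, a shift by a constant amount is just wiring, so the cost of the rewritten expression is dominated by the adders, one fewer than the number of bits set in the constant.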

Another analysis performed at this stage is the Bitwidth Analysis that aims to reduce the number of bits required by datapath operators.
This is a very important optimization because it impacts all non-functional requirements (e.g. performance, area, power) of a design, without affecting its behavior.
Unlike general-purpose processor compilers, which are designed to target a processor with a fixed-size datapath (usually 32 or 64 bits), a hardware compiler can exploit specialization by generating custom-size operators (i.e. functional units) and registers. As a direct consequence, we can select the minimal number of bits required for an operation and/or storage of the specific algorithm, which in turn leads to minimal space used for registers and to smaller functional units, which translate into less area, less power, and shorter critical paths.
However, this analysis usually cannot be completely automated, since it often requires specific knowledge of the algorithm and the input datasets.
Bambu implements the methodology described in budiu-tr00.pdf, integrated with the Value Range information computed by the GCC compiler.
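
To illustrate how value range information translates into operator sizes, here is a small sketch that computes the number of bits needed for a known range. It is not the methodology referenced above, only the basic arithmetic behind it; the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bits needed to represent every unsigned value in [0, max_value]. */
unsigned bits_for_unsigned(uint64_t max_value)
{
    unsigned bits = 1;            /* even the value 0 occupies one bit */
    while (max_value >>= 1)
        ++bits;
    return bits;
}

/* Bits needed for a two's complement value in [min_value, max_value]. */
unsigned bits_for_signed(int64_t min_value, int64_t max_value)
{
    unsigned bits = 1;
    assert(min_value <= max_value);
    /* An n-bit two's complement number covers [-2^(n-1), 2^(n-1) - 1]. */
    while (bits < 64 &&
           (min_value < -((int64_t)1 << (bits - 1)) ||
            max_value > ((int64_t)1 << (bits - 1)) - 1))
        ++bits;
    return bits;
}
```

For instance, a loop counter known to stay in [0, 99] needs 7 bits instead of the 32 bits of a C int, so the adder that increments it and the register that stores it can both be 7 bits wide.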

Bambu back-end

In this phase the actual High-Level Synthesis of the specification is performed.
Even though the same HDL can be used to describe architectures implemented for different families of devices, the HLS flow is not target independent: it takes into account information about the target device. Moreover, FPGAs do not have a fixed operating frequency; the frequency can be decided by the designer or imposed by devices (e.g., sensors or actuators) connected to the FPGA.
The synthesis process acts on each function separately. The resulting architecture is modular, reflecting the structure of the call graph.

The modules implementing the single functions include two different parts: the control logic and the data-path.
The control logic is modeled as a Finite State Machine which handles the routing of the data within the data-path and the execution of the single operations.
The generated data-path is a custom mux-based architecture optimized on the dimension of the data types to reduce the number of flip-flops and bit-level multiplexers.
It implements all the operations that have to be executed and stores their input and output.

The back-end phase generates the actual hardware architecture by performing the following steps:

Functions Allocation

Functions Allocation defines the hierarchy of the modules that implement the functions of the specification.
Bambu is currently able to use and integrate functions described at low level in Verilog or in VHDL with functions described at high-level in C.

Memories Allocation

Memories Allocation defines the memories used to store aggregate variables (arrays and structures) and global variables, and defines how dynamic memory allocation is implemented.
Bambu adopts a novel architecture for memory accesses: it builds a hierarchical data-path directly connected to a dual-port BRAM whenever a local aggregated or a global scalar/aggregate data type is used by the code specified and whenever the accesses can be determined at compile time.
In this case, multiple memory accesses can be performed in parallel.
Otherwise, the memories are interconnected so that it is also possible to support dynamic resolution of the addresses.
Indeed, the same memory infrastructure can be natively connected to external components (e.g. a local scratch-pad memory or cache) or directly to the bus to access off-chip memory.
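
The two situations can be illustrated with plain C. In the first function the accessed object is a local array, so which memory is accessed is known at compile time; in the second the accessed object depends on the pointer passed by the caller, so the address must be resolved at run time. This is only a sketch with hypothetical function names: how Bambu actually allocates these objects depends on its analyses and on the selected options.

```c
#include <assert.h>

/* Local aggregate: every access targets `table`, which is known at compile
   time, so it is a candidate for a memory directly connected to the
   data-path. The assert is only a host-side bounds check. */
int sum_of_squares(int n)
{
    int table[8] = {0, 1, 4, 9, 16, 25, 36, 49};
    int sum = 0;
    assert(n >= 0 && n <= 8);
    for (int i = 0; i < n; ++i)
        sum += table[i];
    return sum;
}

/* Pointer parameter: the object behind `data` is chosen by the caller, so
   the accesses require dynamic resolution of the addresses. */
int sum_through_pointer(const int *data, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}
```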

Resource Allocation

Resource allocation associates operations in the specification to Functional Units (FUs) in the resource library. During the middle-end phase the specification is inspected, and the characteristics of the operations are identified. Such characteristics include the kind of operation (e.g. addition, multiplication, …), and input/output value types (e.g. integer, float, …).
Floating point operations are supported through the High-Level Synthesis of a soft-float library containing basic soft-float operations or through FloPoCo, a generator of arithmetic Floating-Point Cores. The allocation step maps operations onto the set of available FUs, whose characterization includes information such as latency, area, and number of pipeline stages. Usually several operation/FU matchings are feasible: in this case the selection of a proper FU is driven by design constraints. In addition to FUs, memory resources are also allocated: local data, in fact, may be bound to local memories.

The library of functional units used by Bambu is quite rich and in some cases it includes several implementations for the same single operation.
Moreover, the library contains functional units that are expressed as templates in a standard hardware description language (i.e. Verilog or VHDL). These templates can be retargeted and customized on the basis of the characteristics of the target technology. In this case, the underlying logic synthesis tool can determine which is the best architecture to implement each function. For example, multipliers can be mapped either on dedicated DSP blocks or implemented with LUTs. To perform aggressive optimizations, each component of the library is annotated with information useful during the entire HLS process, such as resource occupation and latency for executing the operations. Bambu adopts a pre-characterization approach. That is, the performance estimation considers a generic template of the functional unit, which can be parametric with respect to the bitwidths and pipeline stages. Latency and resource occupation are then obtained by synthesizing each configuration and storing the results in the library.

Scheduling

Scheduling of operations is performed by default through a LIST-based algorithm, which is constrained by resource availability. In its basic formulation, the LIST algorithm associates a priority to each operation, according to particular metrics. For example, the priority may reflect the mobility of operations with respect to the critical path. Operations belonging to the critical path have zero mobility: delaying their execution usually results in an increase of the overall circuit latency. Critical path and mobilities can be obtained by analyzing the As Soon As Possible (ASAP) and As Late As Possible (ALAP) schedules. The LIST approach proceeds iteratively, associating to each control step the operations to be executed in it. Ready operations (i.e., those whose dependencies have been satisfied in previous iterations of the algorithm) are scheduled in the current control step considering resource availability: if multiple ready operations compete for a resource, then the one having the highest priority is scheduled.
Alternatively, a speculative scheduling algorithm based on a System of Difference Constraints (see the Code Transformations Based on Speculative SDC Scheduling paper) is available: this algorithm builds an integer linear programming formulation of the scheduling problem, allowing code motions and speculation of operations into different basic blocks. The solution produced by the ILP solver is then implemented by applying the code motions and the speculations it suggests, after which the rest of the High-Level Synthesis flow can proceed.
After the scheduling task, the State Transition Graph (STG) can be built accordingly: the STG is used for further analysis and to build the final Finite State Machine implementation of the controller.
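
The list-based approach can be sketched in a few lines of C. The version below is a textbook simplification, not Bambu's implementation: it assumes that every operation takes one control step, that all operations compete for a single kind of functional unit, and that operations are numbered in topological order. The type dfg_t and all function names are hypothetical.

```c
#include <assert.h>

#define MAX_OPS 16

/* dep[i][j] != 0 means operation j needs the result of operation i.
   Operations are assumed to be numbered in topological order (i < j)
   and to take exactly one control step each. */
typedef struct {
    int n;
    int dep[MAX_OPS][MAX_OPS];
} dfg_t;

/* ASAP: earliest step of each operation; returns the schedule length. */
int asap(const dfg_t *g, int step[])
{
    int length = 0;
    for (int j = 0; j < g->n; ++j) {
        step[j] = 0;
        for (int i = 0; i < j; ++i)
            if (g->dep[i][j] && step[i] + 1 > step[j])
                step[j] = step[i] + 1;
        if (step[j] + 1 > length)
            length = step[j] + 1;
    }
    return length;
}

/* ALAP: latest step of each operation for a schedule of `length` steps. */
void alap(const dfg_t *g, int length, int step[])
{
    for (int i = g->n - 1; i >= 0; --i) {
        step[i] = length - 1;
        for (int j = i + 1; j < g->n; ++j)
            if (g->dep[i][j] && step[j] - 1 < step[i])
                step[i] = step[j] - 1;
    }
}

/* Resource-constrained list scheduling with `units` identical FUs.
   Priority: the ready operation with the smallest mobility goes first.
   Returns the number of control steps used. */
int list_schedule(const dfg_t *g, int units, int step[])
{
    int early[MAX_OPS], late[MAX_OPS];
    int scheduled = 0, cstep = 0;
    assert(g->n <= MAX_OPS && units > 0);
    alap(g, asap(g, early), late);
    for (int i = 0; i < g->n; ++i)
        step[i] = -1;                       /* -1 marks "not scheduled" */
    while (scheduled < g->n) {
        for (int used = 0; used < units; ++used) {
            int best = -1;
            for (int j = 0; j < g->n; ++j) {
                int ready = 1;              /* all predecessors done earlier? */
                if (step[j] != -1)
                    continue;
                for (int i = 0; i < j; ++i)
                    if (g->dep[i][j] && (step[i] == -1 || step[i] >= cstep))
                        ready = 0;
                if (ready && (best == -1 ||
                    late[j] - early[j] < late[best] - early[best]))
                    best = j;
            }
            if (best == -1)
                break;                      /* nothing else is ready */
            step[best] = cstep;
            ++scheduled;
        }
        ++cstep;
    }
    return cstep;
}
```

The mobility of an operation is the difference between its ALAP and ASAP steps; operations with zero mobility lie on the critical path and are therefore served first whenever more operations are ready than units are available.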

Module Binding

Operations that execute concurrently, according to the computed schedule, are not allowed to share the same FU instance, thus avoiding resource conflicts. In Bambu, binding is performed through a clique covering algorithm on a weighted compatibility graph. The compatibility graph is built by analyzing the schedule: operations scheduled on different control steps are compatible. Weights express how profitable it is for two operations to share the same hardware resource. They are computed taking into account area/delay trade-offs as a result of sharing; for example, FUs that demand a large area are more likely to be shared. The computation of the weights also considers the cost of the interconnections needed to introduce steering logic, both in terms of area and frequency. Bambu also offers several algorithms for solving the covering problem on generic compatibility/conflict graphs.
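
As a rough illustration of binding by clique partitioning, the sketch below uses an unweighted greedy heuristic: it is deliberately simpler than the weighted algorithms mentioned above and is not one of them. It assumes that all operations are of the same kind and that two operations are compatible exactly when they are scheduled on different control steps; the function names are hypothetical.

```c
#include <assert.h>

#define MAX_OPS 16

/* Two operations are compatible when they run in different control steps. */
static int compatible(const int cstep[], int a, int b)
{
    return cstep[a] != cstep[b];
}

/* Greedy clique partitioning: each operation joins the first FU instance
   whose already-bound operations are all compatible with it; otherwise a
   new instance is opened. fu[i] receives the instance index of operation i.
   Returns the number of FU instances used. */
int bind_modules(int n, const int cstep[], int fu[])
{
    int instances = 0;
    assert(n >= 0 && n <= MAX_OPS);
    for (int op = 0; op < n; ++op) {
        int chosen = -1;
        for (int k = 0; k < instances && chosen == -1; ++k) {
            int ok = 1;
            for (int other = 0; other < op; ++other)
                if (fu[other] == k && !compatible(cstep, op, other))
                    ok = 0;
            if (ok)
                chosen = k;
        }
        if (chosen == -1)
            chosen = instances++;
        fu[op] = chosen;
    }
    return instances;
}
```

Each group of operations bound to the same instance is a clique of the compatibility graph. A weighted formulation would additionally prefer the groupings whose sharing saves the most area for the least multiplexing cost.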

Register Binding

Register binding associates storage values to registers, and requires a preliminary analysis step, the Liveness Analysis (LA). LA analyzes the scheduled function, and identifies the life intervals of each variable, i.e. the sequence of control steps in which a temporary needs to be stored. Storage values with non-overlapping life intervals may share the same register. In default settings, the Bambu flow computes liveness information through a non-iterative SSA liveness analysis algorithm (see the Non-Iterative SSA liveness analysis paper). Register assignment is then reduced to the problem of coloring a conflict graph. Nodes of the graph are storage values, edges represent the conflict relation. Algorithms that solve the register binding problem as a weighted clique covering of a compatibility graph are also available.
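
The sharing of registers among values with non-overlapping life intervals can be sketched with a first-fit scan in the style of the classic left-edge algorithm. This technique is swapped in here for brevity and is not the coloring-based flow described above. The sketch assumes a single basic block, so that each life interval is a contiguous range of control steps; interval_t and bind_registers are hypothetical names.

```c
#include <assert.h>

#define MAX_VALUES 16

/* Life interval of a storage value: it is written at the end of control
   step `def` and last read in control step `last_use` (def < last_use). */
typedef struct {
    int def;
    int last_use;
} interval_t;

/* First-fit register binding. Values must be sorted by increasing `def`.
   reg[i] receives the register index of value i; the function returns the
   number of registers used. A register can be reused by a value defined in
   the same step in which the previous value is read for the last time. */
int bind_registers(int n, const interval_t v[], int reg[])
{
    int free_at[MAX_VALUES];   /* step from which each register is free */
    int registers = 0;
    assert(n >= 0 && n <= MAX_VALUES);
    for (int i = 0; i < n; ++i) {
        int chosen = -1;
        assert(i == 0 || v[i - 1].def <= v[i].def);
        for (int r = 0; r < registers && chosen == -1; ++r)
            if (free_at[r] <= v[i].def)
                chosen = r;
        if (chosen == -1)
            chosen = registers++;
        free_at[chosen] = v[i].last_use;
        reg[i] = chosen;
    }
    return registers;
}
```

Two values conflict, and so need distinct registers, exactly when their intervals overlap; this is the conflict relation that the edges of the coloring formulation encode.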

Interconnection Binding

Interconnections are bound according to the previous steps: if a resource is shared, then the algorithm introduces steering logic on its inputs. It also identifies the relation between control signals and different operations: such signals are then set by the controller.

Netlist Generation

During the synthesis process, the final architecture is represented through a hyper-graph, which also highlights the interconnection between modules.
The netlist generation step translates such a representation into a Verilog or VHDL description. The process accesses the resource library, which embeds the Verilog or the VHDL implementation of each allocated module.

Generation of Synthesis and Simulation Scripts

Bambu provides the automatic generation of synthesis and simulation scripts which can be customized by means of XML configuration files. This feature allows the automatic characterization of the resource library, providing technology-aware details during the High-Level Synthesis.

The tools for RTL-synthesis currently supported are:

  • Xilinx ISE
  • Xilinx VIVADO
  • Altera Quartus
  • Lattice Diamond

while the supported simulators are:

  • Mentor Modelsim
  • Xilinx ISIM
  • Xilinx XSIM
  • Verilator
  • Icarus Verilog

Bambu examples

The distribution includes several examples under the example directory. Here is the list of the directories currently included:

  • add_device_simple
  • This example shows how to add a non-supported device to the Bambu synthesis flow.
    The file xc7z045-2ffg900-VVD.xml has been copied from the framework distribution file etc/devices/Xilinx_devices/xc7z020-1clg484-VVD.xml and then renamed to xc7z045-2ffg900-VVD.xml.
    After copying the file, a few changes have been made. All of them relate to the characteristics of the new device: model, package and speed grade.
    The changed part of the XML file follows:
    <model value="xc7z045"/>
    <package value="ffg900"/>
    <speed_grade value="-2"/>

    Note that the field
    <family value="Zynq-VVD"/>
    refers to the synthesis script stored in etc/devices/Xilinx_devices/Zynq-VVD.xml.
    So bambu.sh will first simulate and then synthesize the C-based description using the Zynq device specified above.

    Note that this example shows another nice feature of the HLS framework. The file module.c contains the C specification of the factorial function in its recursive form.
    Bambu is currently not able to synthesize recursive functions, but GCC is able to automatically translate this function into its non-recursive form once the -O2 option is passed. To understand exactly what has been synthesized, check the file a.c in the sim or synth directory created by bambu.sh.
    The new device considered in this example is very similar to one of those already supported. If the device is not very similar to one of the already characterized devices, the user should check and add the characterization scripts accordingly. Examples of characterization scripts based on the eucalyptus tool are available in etc/devices.
    Note that eucalyptus is automatically built once an RTL synthesis back-end is configured.

  • arf
  • This directory includes a simple example of High-Level Synthesis and of the generation of RTL simulation and synthesis scripts.
    The results of the High-Level Synthesis can be inspected by looking into testbench/hls_summary_0.xml.
    The result of the scheduling can be viewed graphically with a viewer of dot files (e.g., xdot or dotty).
    In particular, Bambu generates several dot files when the option --print-dot is passed.
    The scheduling of the arf function is stored in file HLS_output/dot/arf/HLS_scheduling.dot while the FSM of the arf function annotated with the C statements is stored in file HLS_output/dot/arf/HLS_STGraph.dot.

  • arf_res_sharing
  • In this directory the impact of resource sharing on multipliers for the arf benchmark is considered. Two sets of scripts are provided: constrained and non-constrained synthesis scripts.
    The devices considered are the ones supported by Bambu.
    In all the syntheses performed, the WB4 interface has been used to avoid issues with the high number of IO pins required by the arf function when synthesized alone.
    Basically, adding a constraint on the number of multipliers used requires passing to Bambu an XML file structured in this way:

    <?xml version="1.0"?>
    <constraints>
       <HLS_constraints>
          <tech_constraints fu_name="mult_expr_FU" fu_library="STD_FU" n="1"/>
       </HLS_constraints>
    </constraints>
    
  • crc
  • This directory collects several scripts to test the multi-bus feature of bambu.
    The file test_icrc.xml shows how to write xml testcases for array based function parameters.

  • crc_yosys
  • This directory shows an example of how to write a C-based testbench to test a given kernel.
    The kernel function is defined through the option --top-rtldesign-name.

    This design flow requires adding two attributes to the kernel function:

      __attribute__ ((noinline)) __attribute__ ((used))

    and inserting these two timing functions:

            __builtin_bambu_time_start();
            __builtin_bambu_time_stop();
    

    These two functions will start and stop a timer used by Bambu to compute the total number of cycles spent in the kernel function.
    The target device is a Zynq xc7z020,-1,clg484 and the back-end flow is based on the yosys open-source RTL synthesis tool (http://www.clifford.at/yosys/).

  • crypto_designs
  • This example starts from the reference C description of the Keccak crypto function distributed through the website http://keccak.noekeon.org/.
    Keccak has been selected by NIST to become the new SHA-3 standard (see http://www.nist.gov/hash-competition and http://ehash.iaik.tugraz.at/wiki/The_SHA-3_Zoo).
    Further details can be found at: http://ehash.iaik.tugraz.at/wiki/Keccak.
    Together with the C implementation optimized for processors, there exist several implementations for FPGA and ASIC.
    So, as a reference, one of the Low-Area Implementations developed by the authors of the Keccak algorithm has been selected (i.e., Guido Bertoni-STMicroelectronics, Joan Daemen-STMicroelectronics, Michaël Peeters-NXP Semiconductors and Gilles Van Assche-STMicroelectronics).

    The results reported at this link http://ehash.iaik.tugraz.at/wiki/SHA-3_Hardware_Implementations are:

    Altera Cyclone III 1559LEs 47.8Mbit/s 181 MHz
    
    Xilinx Virtex 5 444slices 70.1Mbit/s 265 MHz
    

    Starting from the C description delivered as reference, an equivalent C function (equivalent to the VHDL reference design) has been built.
    After two days of hacking and design space exploration, here are 5 different alternatives using different FPGAs:

    Altera Cyclone II 5460LEs 66.9Mbit/s 107MHz (directory keccak_CycloneII_10)
    
    Altera Cyclone II 8681LEs 150.8Mbit/s 262MHz (directory keccak_CycloneII_4hl)
    
    Lattice ECP3 3789slices 80.2Mbit/s 128MHz (directory keccak_ECP3_10_09)
    
    Lattice ECP3 3831slices 80.2Mbit/s 128MHz (directory keccak_ECP3_9)
    
    Xilinx Virtex 5 7015slices 152.69Mbit/s 252MHz (directory keccak_V5_4hl)
    

    These results have been obtained with PandA framework 0.9.3.

    Along with this example, another one shows how to build an Autotools project for high-level synthesis with bambu: see the directory crypto_designs/multi-keccak.

  • fft_example
  • This directory includes an example program which computes the FFT of a short pulse in a sample of length 128.

  • function_pointers
  • Scripts, updated results and code related to this paper:

    Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Fabrizio Ferrandi: Inter-procedural resource sharing in High Level Synthesis through function proxies. FPL 2015: 1-8.

  • CHStone
  • This directory contains the CHStone v1.11 benchmarks taken from http://www.ertl.jp/chstone/ and all the scripts used and results obtained with bambu.

  • mm
  • This directory shows how to write a test.xml file when multi-dimensional arrays are used as function parameters.
    The example uses the option --memory-allocation-policy=EXT_PIPELINED_BRAM. This option is used to declare that the parameters are allocated on a block RAM memory (e.g., pipelined access is possible).

  • mm_float
  • This example is very similar to the mm example.
    There are mainly two differences:
    – the two dimensions of the arrays are passed as parameter;
    – the matrix elements are floats.

  • libm
  • This directory contains scripts and results obtained on the libm functions supported by bambu.

  • VGA
  • VGA adapter on Altera DE1 Cyclone II (EP2C20F484C7N).
    The main aim of the project is to develop an application written in C which drives a VGA-compatible screen connected to a DE1 Altera FPGA.
    The design includes some Verilog IPs which control the VGA port and shows how Bambu can manage existing IPs described by using hardware description languages.

  • VGA_Nexys4
  • This simple example shows how to integrate C code with low-level interfaces written in Verilog.
    The design improves the VGA example by adapting such design to the more capable NEXYS4 prototyping board.

  • file_simulate
  • This directory shows an example of how Bambu can use the IO libc primitives (open, read, write and close).

  • IP_integration
  • This directory contains a simple example describing how to integrate and verify existing IPs with functions written in C that receive structs passed by pointers.

  • simple_asm
  • This simple example shows how to integrate small snippets of Verilog in the HLS flow by making Bambu use Verilog as a third assembler dialect.
    Currently only asm instructions with a single output are supported. When outputs are present, the Intel and the ATT asm strings should also be provided so that the simulation passes. For asm statements having only inputs, those asm strings can safely be left empty.
    A detailed reference on how asm statements are handled by GCC can be found at this link: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html.

  • python-bindings
  • This directory includes an example showing how to integrate Python for design verification.

  • led_example
  • This directory includes an example of a simple GPIO controller developed to show how to integrate Verilog IPs with plain C.

  • pong
  • This directory includes the Pong game ported to the Nexys4 prototyping board. Pong was the first game developed by Atari Inc. and was designed and built by Allan Alcorn. Further information can be found at https://en.wikipedia.org/wiki/Pong.
    The code has been ported by Fabrizio Ferrandi by adapting an SDL-based tutorial to the PandA methodology for the integration of low-level IP cores written in Verilog.
    The original SDL code can be found at http://www.aaroncox.net/tutorials/arcade/PaddleBattle.html.
    The artificial intelligence used to control the computer paddle is based on a random function described at http://burtleburtle.net/bob/rand/smallprng.html

  • breakout
  • This directory includes the Breakout game ported to the Nexys4 prototyping board. The game was designed by Nolan Bushnell, Steve Wozniak, and Steve Bristow. The history of the Breakout game can be found at this link: https://en.wikipedia.org/wiki/Breakout_%28video_game%29.
    The code has been ported by Fabrizio Ferrandi by adapting an SDL-based tutorial to the PandA methodology for the integration of low-level IP cores written in Verilog.
    The original SDL code can be found at http://www.aaroncox.net/tutorials/arcade/BRICKBreaker.html.

  • MachSuite
  • This directory contains the scripts, the results and the code of the MachSuite benchmark set, which is described in this paper:

    Brandon Reagen, Robert Adolf, Sophia Yakun Shao, Gu-Yeon Wei, and David Brooks.
    “MachSuite: Benchmarks for Accelerator Design and Customized Architectures.”
    2014 IEEE International Symposium on Workload Characterization.

  • hls_study
  • This directory includes the scripts, the updated results and the code related to this paper:

    R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels, “A Survey and Evaluation of FPGA High-Level Synthesis Tools,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. PP, iss. 99, pp. 1-1, 2016.

  • softfloat
  • This directory includes scripts and code testing single and double precision basic operations: division, subtraction, addition and multiplication.
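
Returning to the add_device_simple example listed above, the following sketch shows what a recursive factorial and an equivalent loop form look like in C. The actual contents of module.c and of the generated a.c are not reproduced here, so both functions and their names are illustrative assumptions; the loop form is only the kind of code a compiler can derive from the recursive one.

```c
#include <assert.h>

/* Recursive form, in the style of a specification such as module.c. */
unsigned int factorial_recursive(unsigned int n)
{
    return n <= 1 ? 1u : n * factorial_recursive(n - 1);
}

/* Loop form: no call to itself, so no call stack is needed in hardware. */
unsigned int factorial_iterative(unsigned int n)
{
    unsigned int result = 1;
    while (n > 1)
        result *= n--;
    return result;
}

/* Host-side sanity check: both forms compute the same value. */
void check_factorials_agree(unsigned int n)
{
    assert(factorial_recursive(n) == factorial_iterative(n));
}
```

The loop form maps naturally onto a finite state machine with a multiplier and two registers, which is why removing the recursion before synthesis matters.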

Bambu options

The current Bambu options are reported below:

********************************************************************************
                    ____                  _
                   | __ )  __ _ _ __ ___ | |_   _   _
                   |  _ \ / _` | '_ ` _ \| '_ \| | | |
                   | |_) | (_| | | | | | | |_) | |_| |
                   |____/ \__,_|_| |_| |_|_.__/ \__,_|

********************************************************************************
                         High-Level Synthesis Tool

                         Politecnico di Milano - DEIB
                          System Architectures Group
********************************************************************************
                Copyright (c) 2004-2016 Politecnico di Milano
Version: PandA 0.9.4

Usage:
       bambu [Options] <source_file> [<constraints_file>] [<technology_file>]

Options:

  General options:

    --help, -h
        Display this usage information.

    --version, -V
        Display the version of the program.


  Output options:

    --verbosity, -v <level>
        Set the output verbosity level
        Possible values for <level>:
            0 - NONE
            1 - MINIMUM
            2 - VERBOSE
            3 - PEDANTIC
            4 - VERY PEDANTIC
        (default = 1)

    --no-clean
        Do not remove temporary files.

    --benchmark-name=<name>
        Set the name of the current benchmark for data collection.
        Mainly useful for data collection from extensive regression tests.

    --configuration-name=<name>
        Set the name of the current tool configuration for data collection.
        Mainly useful for data collection from extensive regression tests.

    --benchmark-fake-parameters
        Set the parameters string for data collection. The parameters in the
        string are not actually used, but they are used for data collection in
        extensive regression tests.

    --output-temporary-directory=<path>
        Set the directory where temporary files are saved.
        Default is 'panda-temp'

    --print-dot
        Dump to file several different graphs used in the IR of the tool.
        The graphs are saved in .dot files, in graphviz format

    --pretty-print=<file>
        C-based pretty print of the internal IR.

    --writer,-w<language>
        Output RTL language:
            V - Verilog (default)
            H - VHDL

    --no-mixed-design
        Avoid mixed design.

    --generate-tb=<file>
        Generate testbench for the input values defined in the specified XML
        file.

    --top-fname=<fun_name>
        Define the top function to be synthesized.

    --top-rtldesign-name=<top_name>
        Define the top module name for the RTL backend.

    --file-input-data=<file_list>
        A comma-separated list of input files used by the C specification.

    --C-no-parse=<file>
        Specify a comma-separated list of C files used only during the
        co-simulation phase.


  GCC options:

    --compiler=<gcc_version>
        Specify which compiler is used.
        Possible values for <processor>:
            I386_GCC45
            I386_GCC46
            I386_GCC47
            I386_GCC48
            I386_GCC49
            I386_GCC5

    -O<level>
        Enable a specific optimization level. Possible values are the usual
        optimization flags accepted by compilers, plus some others:
        -O0,-O1,-O2,-O3,-Os,-O4,-O5.

    -f<option>
        Enable or disable a GCC optimization option. All the -f or -fno options
        are supported. In particular, -ftree-vectorize option triggers the
        high-level synthesis of vectorized operations.

    -I<path>
        Specify a path where headers are searched for.

    -W<warning>
        Specify a warning option passed to GCC. All the -W options available in
        GCC are supported.

    -E
        Enable preprocessing mode of GCC.

    --std=<standard>
        Assume that the input sources are for <standard> (default=gnu89). All
        the --std options available in GCC are supported.

    -D<name>
        Predefine name as a macro, with definition 1.

    -D
        Tokenize  and process as if it appeared as a #define directive.

    -U<name>
        Remove existing definition for macro <name>.

    --param <name>=<value>
        Set the amount <value> for the GCC parameter <name> that could be used for
        some optimizations.

    -l<library>
        Search the library named <library> when linking.

    -L<dir>
        Add directory <dir> to the list of directories to be searched for -l.

    --use-raw
        Specify that input file is already a GIMPLE file and not a source file.

    -m<machine-option>
        Specify machine dependent options (currently not used).

    --Include-sysdir
        Return the system include directory used by the wrapped GCC compiler.

    --gcc-config
        Return the GCC configuration.

    --extra-gcc-options
        Specify custom extra options to the compiler.


  Target:

    --target-file=file, -b<file>
        Specify an XML description of the target device.

    --generate-interface=<type>
        Wrap the top level module with an external interface.
        Possible values for <type> and related interfaces:
            minimal  -  (minimal interface - default)
            WB4      -  (WishBone 4 interface)


  High Level Synthesis:

    --parametric-list-based[=<type>]
        Perform priority list-based scheduling. This is the default scheduling algorithm
        in bambu. The optional <type> argument can be used to set options for
        list-based scheduling as follows:
            0 - Dynamic mobility (default)
            1 - Static mobility
            2 - Priority-fixed mobility

    --post-rescheduling
        Perform post rescheduling to better distribute resources.

    --speculative-sdc-scheduling
        Perform scheduling by using speculative sdc.

    --fixed-scheduling=<file>
        Provide scheduling as an XML file.

    --no-chaining
        Disable chaining optimization.


  Binding:

    --register-allocation=<type>
        Set the algorithm used for register allocation. Possible values for the
        <type> argument are the following:
            WEIGHTED_COLORING   - use weighted coloring algorithm (default)
            COLORING            - use simple coloring algorithm
            CHORDAL_COLORING    - use chordal coloring algorithm
            BIPARTITE_MATCHING  - use bipartite matching algorithm
            TTT_CLIQUE_COVERING - use a weighted clique covering algorithm
            UNIQUE_BINDING      - unique binding algorithm

    --module-binding=<type>
        Set the algorithm used for module binding. Possible values for the
        <type> argument are one the following:
            WEIGHTED_TS        - solve the weighted clique covering problem by
                                 exploiting the Tseng&Siewiorek heuristic
                                 (default)
            WEIGHTED_COLORING  - solve the weighted clique covering problem
                                 performing a coloring on the conflict graph
            COLORING           - solve the unweighted clique covering problem
                                 performing a coloring on the conflict graph
            TTT_FAST           - use Tomita, A. Tanaka, H. Takahashi maximal
                                 weighted cliques heuristic to solve the clique
                                 covering problem
            TTT_FAST2          - use Tomita, A. Tanaka, H. Takahashi maximal
                                 weighted cliques heuristic to incrementally
                                 solve the clique covering problem
            TTT_FULL           - use Tomita, A. Tanaka, H. Takahashi maximal
                                 weighted cliques algorithm to solve the clique
                                 covering problem
            TTT_FULL2          - use Tomita, A. Tanaka, H. Takahashi maximal
                                 weighted cliques algorithm to incrementally
                                 solve the clique covering problem
            TS                 - solve the unweighted clique covering problem
                                 by exploiting the Tseng&Siewiorek heuristic
            BIPARTITE_MATCHING - solve the weighted clique covering problem
                                 exploiting the bipartite matching approach
            UNIQUE             - use a 1-to-1 binding algorithm
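
One way to compare binding algorithms is to generate one command per candidate and run them in turn. The loop below is a sketch under assumptions: fir.c is a hypothetical input, the positional source argument is assumed, and module binding is held at its default so that only register allocation varies.

```shell
# Hypothetical C source; replace with your own file.
src="fir.c"
# Register-allocation algorithms to compare, taken from the list above.
algorithms="WEIGHTED_COLORING COLORING CHORDAL_COLORING"
cmds=""
for alg in $algorithms; do
  # Keep module binding fixed so that only register allocation varies.
  cmds="${cmds}bambu $src --register-allocation=$alg --module-binding=WEIGHTED_TS
"
done
# Print one command per line for inspection.
printf '%s' "$cmds"
```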


  Memory allocation:

    --memory-allocation=<type>
        Set the algorithm used for memory allocation. Possible values for the
        <type> argument are the following:
            DOMINATOR          - all local variables, static variables and
                                 strings are allocated on BRAMs (default)
            XML_SPECIFICATION  - import the memory allocation from an XML
                                 specification

    --xml-memory-allocation=<xml_file_name>
        Specify the file where the XML configuration has been defined.

    --memory-allocation-policy=<type>
        Set the policy for memory allocation. Possible values for the <type>
        argument are the following:
            ALL_BRAM           - all objects that need to be stored in memory
                                 are allocated on BRAMs (default)
            LSS                - all local variables, static variables and
                                 strings are allocated on BRAMs
            GSS                - all global variables, static variables and
                                 strings are allocated on BRAMs
            NO_BRAM            - all objects that need to be stored in memory
                                 are allocated on an external memory
            EXT_PIPELINED_BRAM - all objects that need to be stored in memory
                                 are allocated on an external pipelined memory

    --base-address=address
        Define the starting address for objects allocated externally to the top
        module.

    --initial-internal-address=address
        Define the starting address for the objects allocated internally to the
        top module.

    --channels-type=<type>
        Set the type of memory connections.
        Possible values for <type> are:
            MEM_ACC_11 - the accesses to the memory have a single direct
                         connection or a single indirect connection (default)
            MEM_ACC_N1 - the accesses to the memory have n parallel direct
                         connections or a single indirect connection
            MEM_ACC_NN - the accesses to the memory have n parallel direct
                         connections or n parallel indirect connections

    --channels-number=<n>
        Define the number of parallel direct or indirect accesses.
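
A minimal sketch of requesting parallel memory channels follows. The value 2 and the file name fir.c are illustrative assumptions, as is the idea that --channels-number is meant to accompany the MEM_ACC_N1 and MEM_ACC_NN channel types.

```shell
# Hypothetical C source; replace with your own file.
src="fir.c"
# Illustrative number of parallel accesses.
channels=2
cmd="bambu $src --channels-type=MEM_ACC_NN --channels-number=$channels"
# Print the command for inspection.
printf '%s\n' "$cmd"
```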

    --memory-ctrl-type=<type>
        Define which type of memory controller is used. Possible values for the
        <type> argument are the following:
            D00 - no extra delay (default)
            D10 - 1 clock cycle of extra delay for LOAD, 0 for STORE
            D11 - 1 clock cycle of extra delay for LOAD, 1 for STORE
            D21 - 2 clock cycles of extra delay for LOAD, 1 for STORE
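
The controller names encode the extra delays directly: the first digit is the LOAD delay and the second the STORE delay. The helper below, whose name ctrl_delays is hypothetical, simply restates the table above as a lookup.

```shell
# Map a --memory-ctrl-type value to its extra LOAD/STORE delay in clock
# cycles, following the table above.
ctrl_delays() {
  case "$1" in
    D00) printf 'load=0 store=0\n' ;;
    D10) printf 'load=1 store=0\n' ;;
    D11) printf 'load=1 store=1\n' ;;
    D21) printf 'load=2 store=1\n' ;;
    *)   printf 'unknown memory controller type: %s\n' "$1" >&2; return 1 ;;
  esac
}

ctrl_delays D21
```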

    --sparse-memory[=on/off]
        Control how the memory allocation happens.
            on  - allocate the data at addresses that reduce the decoding logic (default)
            off - allocate the data at contiguous addresses

    --do-not-use-asynchronous-memories
        Do not add asynchronous memories to the possible set of memories used
        by bambu during the memory allocation step.

    --distram-threshold=value
        Define the threshold in bitsize used to infer DISTRIBUTED/ASYNCHRONOUS RAMs (default 256).

    --serialize-memory-accesses
        Serialize the memory accesses using the GCC virtual use-def chains
        without taking into account any alias analysis information.

    --unaligned-access
        Use only memories supporting unaligned accesses.

    --aligned-access
        Assume that all accesses are aligned and so only memories supporting aligned
        accesses are used.

    --do-not-chain-memories
        When enabled, LOADs and STOREs will not be chained with other
        operations.

    --bram-high-latency
        Assume a 'high latency bram'-'faster clock frequency' block RAM memory
        based architecture: LOAD(II=1,L=3) STORE(1).

    --mem-delay-read=value
        Define the external memory latency when LOADs are performed (default 2).

    --mem-delay-write=value
        Define the external memory latency when STOREs are performed (default 1).
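
A sketch of describing a slower external memory with the two latency options follows. The latency values and the file name fir.c are illustrative, and pairing these options with the NO_BRAM allocation policy is an assumption rather than a requirement stated in this listing.

```shell
# Hypothetical C source; replace with your own file.
src="fir.c"
# Illustrative external memory latencies, in clock cycles.
read_latency=3
write_latency=2
cmd="bambu $src --memory-allocation-policy=NO_BRAM --mem-delay-read=$read_latency --mem-delay-write=$write_latency"
# Print the command for inspection.
printf '%s\n' "$cmd"
```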

    --do-not-expose-globals
        All global variables are considered local to the compilation units.

    --data-bus-bitsize=<bitsize>
        Set the bitsize of the external data bus.

    --addr-bus-bitsize=<bitsize>
        Set the bitsize of the external address bus.


  Evaluation of HLS results:

    --simulate
        Simulate the RTL implementation.

    --simulator=<type>
        Specify the simulator used in generated simulation scripts:
            MODELSIM - Mentor Modelsim
            XSIM - Xilinx XSim
            ISIM - Xilinx iSim
            ICARUS - Verilog Icarus simulator
            VERILATOR - Verilator simulator

    --max-sim-cycles=<cycles>
        Specify the maximum number of cycles an HDL simulation may run.
        (default 20000000).

    --accept-nonzero-return
        Do not assume that application main must return 0.

    --generate-vcd
        Enable .vcd output file generation for waveform visualization (requires
        testbench generation).

    --evaluation[=type]
        Perform evaluation of the generated solution.
        The value of 'type' selects the objectives to be evaluated.
        If nothing is specified, all the following are evaluated.
        The 'type' argument can be a string containing any of the following
        strings, separated with commas, without spaces:
            AREA            - Area usage
            AREAxTIME       - Area x Latency product
            TIME            - Latency for the average computation
            TOTAL_TIME      - Latency for the whole computation
            CYCLES          - n. of cycles for the average computation
            TOTAL_CYCLES    - n. of cycles for the whole computation
            BRAMS           - number of BRAMs
            CLOCK_SLACK     - Slack between actual and required clock period
            DSPS            - number of DSPs
            FREQUENCY       - Maximum target frequency
            PERIOD          - Actual clock period
            REGISTERS       - number of registers
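
Because the objective list must be comma separated with no spaces, it can be convenient to build it from a plain word list. The sketch below assumes a hypothetical input fir.c and an arbitrary choice of simulator and cycle limit.

```shell
# Hypothetical C source; replace with your own file.
src="fir.c"
# Objectives to evaluate, chosen from the list above.
objectives="AREA CYCLES FREQUENCY"
# Join with commas and no spaces, as --evaluation expects.
evaluation=$(printf '%s' "$objectives" | tr ' ' ',')
cmd="bambu $src --simulator=VERILATOR --max-sim-cycles=1000000 --evaluation=$evaluation"
# Print the command for inspection.
printf '%s\n' "$cmd"
```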


  Checks and debugging:

    --assert-debug
        Enable assertion debugging performed by Modelsim.


  RTL synthesis:

    Note: for a more complete evaluation you should use the option --evaluation

    --clock-period=value
        Specify the period of the clock signal (default = 10ns).

    --backend-script-extensions=file
        Specify a file that will be included in the backend specific synthesis
        scripts.

    --backend-sdc-extensions=file
        Specify a file that will be included in the Synopsys Design Constraints
        file (SDC).

    --device-name=value
        Specify the name of the device. Three different cases are foreseen:
            - Xilinx:  a comma separated string specifying device, speed grade
                       and package (e.g., "xc7z020,-1,clg484,VVD")
            - Altera:  a string defining the device string (e.g. EP2C70F896C6)
            - Lattice: a string defining the device string (e.g.
                       LFE335EA8FN484C)
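
A sketch of selecting a device and a clock period together follows. The device string is the Xilinx example quoted above. The period is assumed to be given in nanoseconds without a unit suffix, inferred from the stated 10ns default, and fir.c is a hypothetical input.

```shell
# Hypothetical C source; replace with your own file.
src="fir.c"
# Clock period, assumed to be expressed in nanoseconds.
period_ns=5
# 1000 / period in ns gives the frequency in MHz (integer arithmetic).
freq_mhz=$((1000 / period_ns))
# Xilinx device string copied from the example in the help text.
device="xc7z020,-1,clg484,VVD"
cmd="bambu $src --clock-period=$period_ns --device-name=$device"
printf '%s (target about %s MHz)\n' "$cmd" "$freq_mhz"
```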

    --power-optimization
        Enable Xilinx power based optimization (default no).

    --no-iob
        Disconnect primary ports from the IOB (the default is to connect
        primary input and output ports to IOBs).

    --soft-float
        Enable use of soft-based implementation of floating-point operations.
        This is the default for bambu.

    --flopoco
        Enable use of flopoco-based implementation of floating-point operations.

    --max-ulp
        Define the maximal ULP (unit in the last place, i.e., the spacing
        between floating-point numbers) accepted.

    --hls-div
        Perform the high-level synthesis of integer division and modulo
        operations starting from a C library based implementation.

    --skip-pipe-parameter=<value>
        Used during the allocation of pipelined units. <value> specifies how
        many pipelined units, compliant with the clock period, will be skipped.
        (default=0).

    --reset-type=value
        Specify the type of reset:
             no    - use registers without reset (default)
             async - use registers with asynchronous reset
             sync  - use registers with synchronous reset

    --reset-level=value
        Specify if the reset is active high or low:
             low   - use registers with active low reset (default)
             high  - use registers with active high reset

    --registered-inputs=value
        Specify if inputs are registered or not:
             auto  - inputs are registered only for proxy functions (default)
             yes   - all inputs are registered
             no    - none of the inputs is registered
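
The reset and input-registration options are independent of each other and can be set in one invocation. The sketch below picks a synchronous, active-high reset with all inputs registered. The file name fir.c is hypothetical and the choice of values is only illustrative.

```shell
# Hypothetical C source; replace with your own file.
src="fir.c"
# Synchronous, active-high reset; register every input.
cmd="bambu $src --reset-type=sync --reset-level=high --registered-inputs=yes"
# Print the command for inspection.
printf '%s\n' "$cmd"
```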

    --cprf=value
        Clock Period Resource Fraction (default = 1.0).

    --DSP-allocation-coefficient=value
        During the allocation step the timing of the DSP-based modules is
        multiplied by value (default = 1.0).

    --DSP-margin-combinational=value
        Timing of combinational DSP-based modules is multiplied by value.
        (default = 1.0).

    --DSP-margin-pipelined=value
        Timing of pipelined DSP-based modules is multiplied by value.
        (default = 1.0).

    --mux-margins=n
        Scheduling reserves a margin corresponding to the delay of n 32 bit
        multiplexers.

    --timing-model=value
        Specify the timing model used by HLS:
             EC     - estimate timing overhead of glue logics and connections
                      between resources (default)
             SIMPLE - just consider the resource delay

    --experimental-setup=<setup>
        Specify the experimental setup. This is a shorthand to set multiple
        options with a single command.
        Available values for <setup> are the following:
             BAMBU-AREA           - this setup implies:
                                    -Os  -D'printf(fmt, ...)='
                                    --memory-allocation-policy=ALL_BRAM
                                    --DSP-allocation-coefficient=1.75
                                    --distram-threshold=256
             BAMBU-AREA-MP        - this setup implies:
                                    -Os  -D'printf(fmt, ...)='
                                    --channels-type=MEM_ACC_NN
                                    --memory-allocation-policy=ALL_BRAM
                                    --DSP-allocation-coefficient=1.75
                                    --distram-threshold=256
             BAMBU-BALANCED       - this setup implies:
                                    -O2  -D'printf(fmt, ...)='
                                    --channels-type=MEM_ACC_11
                                    --memory-allocation-policy=ALL_BRAM
                                    -fgcse-after-reload  -fipa-cp-clone
                                    -ftree-partial-pre  -funswitch-loops
                                    -finline-functions  -fno-ivopts
                                    --param max-inline-insns-auto=25
                                    -fno-tree-loop-ivcanon
                                    --distram-threshold=256
             BAMBU-BALANCED-MP    - (default) this setup implies:
                                    -O2  -D'printf(fmt, ...)='
                                    --channels-type=MEM_ACC_NN
                                    --memory-allocation-policy=ALL_BRAM
                                    -fgcse-after-reload  -fipa-cp-clone
                                    -ftree-partial-pre  -funswitch-loops
                                    -finline-functions  -fno-ivopts
                                    --param max-inline-insns-auto=25
                                    -fno-tree-loop-ivcanon
                                    --distram-threshold=256
             BAMBU-PERFORMANCE    - this setup implies:
                                    -O3  -D'printf(fmt, ...)='
                                    --memory-allocation-policy=ALL_BRAM
                                    --distram-threshold=512
             BAMBU-PERFORMANCE-MP - this setup implies:
                                    -O3  -D'printf(fmt, ...)='
                                    --channels-type=MEM_ACC_NN
                                    --memory-allocation-policy=ALL_BRAM
                                    --distram-threshold=512
             BAMBU                - this setup implies:
                                    -O0 --channels-type=MEM_ACC_11
                                    --memory-allocation-policy=LSS
                                    --distram-threshold=256
             BAMBU092             - this setup implies:
                                    -O3  -D'printf(fmt, ...)='
                                    --timing-model=SIMPLE
                                    --DSP-margin-combinational=1.3
                                    --cprf=0.9  --skip-pipe-parameter=1
                                    --channels-type=MEM_ACC_11
                                    --memory-allocation-policy=LSS
                                    --distram-threshold=256
             VVD                  - this setup implies:
                                    -O3  -D'printf(fmt, ...)='
                                    --channels-type=MEM_ACC_NN
                                    --memory-allocation-policy=ALL_BRAM
                                    --distram-threshold=256
                                    --DSP-allocation-coefficient=1.75
                                    --do-not-expose-globals --cprf=0.875
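
Since each setup is shorthand for a fixed option list, the expansion can be written down as a lookup. The helper below, whose name expand_setup is hypothetical, copies three of the entries from the listing above. It illustrates what the shorthand stands for and is not how bambu itself implements the option.

```shell
# Print the options implied by a setup name, copied from the listing above.
# Only three of the setups are covered in this sketch.
expand_setup() {
  case "$1" in
    BAMBU)
      printf '%s\n' "-O0 --channels-type=MEM_ACC_11 --memory-allocation-policy=LSS --distram-threshold=256" ;;
    BAMBU-PERFORMANCE)
      printf '%s\n' "-O3 -D'printf(fmt, ...)=' --memory-allocation-policy=ALL_BRAM --distram-threshold=512" ;;
    BAMBU-PERFORMANCE-MP)
      printf '%s\n' "-O3 -D'printf(fmt, ...)=' --channels-type=MEM_ACC_NN --memory-allocation-policy=ALL_BRAM --distram-threshold=512" ;;
    *)
      printf 'setup not covered by this sketch: %s\n' "$1" >&2
      return 1 ;;
  esac
}

expand_setup BAMBU-PERFORMANCE
```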


  Other options:

    --time, -t <time>
        Set maximum execution time (in seconds) for ILP solvers (default: infinite).


  Debug options:

    --discrepancy
           Perform automated discrepancy analysis between the execution
           of the original source code and the generated HDL (currently
           supports only Verilog). If a mismatch is detected, report
           useful information to the user.
           Uninitialized variables in C are legal, but if they are used
           before initialization in HDL it is possible to obtain X values
           in simulation. This is not necessarily wrong, so these errors
           are not reported by default to avoid reporting false positives.
           If you can guarantee that in your C code there are no
           uninitialized variables and you want the X values in HDL to be
           reported use the option --discrepancy-force-uninitialized

    --discrepancy-force-uninitialized
           Reports errors due to uninitialized values in HDL.
           See the option --discrepancy for details

    --discrepancy-no-load-pointers
           Assume that the data loaded from memories in HDL are never used
           to represent addresses, unless they are explicitly assigned to
           pointer variables.
           The discrepancy analysis is able to compare pointers in software
           execution and addresses in hardware. By default all the values
           loaded from memory are treated as if they could contain addresses,
           even if they are integer variables. This is due to the fact that
           C code doing these tricks is valid and actually used in embedded
           systems, but it can lead to imprecise bug reports, because only
           pointers pointing to actual data are checked by the discrepancy
           analysis.
           If you can guarantee that your code always manipulates addresses
           using pointers and never using plain int, then you can use this
           option to get more precise bug reports.
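
A sketch of enabling the discrepancy analysis with the stricter reporting of uninitialized values follows. It assumes that the force option is used together with --discrepancy and that fir.c, a hypothetical input, contains no uninitialized variables.

```shell
# Hypothetical C source; replace with your own file.
src="fir.c"
# Enable the analysis and also report X values caused by uninitialized data.
cmd="bambu $src --discrepancy --discrepancy-force-uninitialized"
# Print the command for inspection.
printf '%s\n' "$cmd"
```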
