AppImage binary for TCAD 2020 “Tensor Optimization for High-Level Synthesis Design Flows” paper:
AppImage is a format for distributing portable software on Linux without needing superuser permissions to install the application. Once downloaded just add the execution rights with this command:
chmod +x bambu-x86_64.AppImage
For further details see https://appimage.org/.
Current Development Version on github:
VirtualBox virtual machine with a pre-configured version of Bambu running on Ubuntu-desktop 16.04 LTS 32bits – 2.4GB
VirtualBox virtual machine with a pre-configured version of Bambu running on Ubuntu-desktop 18.04 LTS 64bits- 2.26GB
On both the Virtual machines we have this Username: ubuntu and Password: password.
Latest stable release:
January 11th, 2020, PandA release 0.9.6- panda-0.9.6.tar.gz.
New features introduced:
- Added support to BRAVE FPGAs (link). Both NG-Medium and NG-Large are supported.
- Added support to TASTE model-based design flow (https://taste.tools/). A new option has been added to activate the customized HLS flow: –experimental-setup=BAMBU-TASTE.
- Added support for Hardware Discrepancy Analysis. Reference paper: Pietro Fezzardi, Marco Lattuada, Fabrizio Ferrandi, Using Efficient Path Profiling to Optimize Memory Consumption of On-Chip Debugging for High-Level Synthesis. ACM Trans. Embedded Comput. Syst. 16(5): 149:1-149:19 (2017).
- Added support for OpenMP for. Reference papers: M. Minutoli, V. G. Castellana, A. Tumeo, M. Lattuada, and F. Ferrandi, “Efficient Synthesis of Graph Methods: A Dynamically Scheduled Architecture,” in Proceedings of the 35th International Conference on Computer-Aided Design, New York, NY, USA, 2016, p. 128:1–128:8. V. G. Castellana, M. Minutoli, A. Morari, A. Tumeo, M. Lattuada, and F. Ferrandi, “High level synthesis of RDF queries for graph analytics,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2015, pp. 323-330.
- Customized Dynamic HLS flow updated. Reference paper: M. Lattuada and F. Ferrandi, “A Design Flow Engine for the Support of Customized Dynamic High Level Synthesis Flows”, ACM Trans. Reconfigurable Technol. Syst. 12, 4, Article 19 (October 2019), 26 pages.
- Added initial support to C++/fortran high-level synthesis. Now, Ubuntu distributions require the installation of g++-multilib package.
- Added support to CLANG/LLVM compiler version 4, 5, 6 and 7. On multiple files, LTO is exploited.
- Added support to high-level synthesis of spec in LLVM format (i.e., file with .ll extension).
- Added to CLANG/LLVM based analysis an interprocedural value range analysis based on the following paper: Fernando Magno Quintao Pereira, Raphael Ernani Rodrigues, and Victor Hugo Sperle Campos. “A fast and low-overhead technique to secure programs against integer overflows”, In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (CGO ’13). Washington, DC, USA, 1-11.
The implementation started from code available at this link:https://code.google.com/archive/p/range-analysis/.
The original code went through a deep revision and it has been changed to be compatible with a recent version of LLVM. Some extensions have been also added:
– Added anti range support.
– Redesigned many Range operations to take into account wrapping and to improve the reductions performed.
– Integrated the LLVM lazy value range analysis.
– Added support to range value propagation of load from constant arrays.
– Added support to range value propagation of load and store from generic arrays.
- Added to CLANG/LLVM based analysis an Andersen based pointer analysis. The specific version used is described in:
“The Ant and the Grasshopper: Fast and Accurate Pointer Analysis for
Millions of Lines of Code”, by Ben Hardekopf & Calvin Lin, in PLDI 2007.
The BDD library used is the Buddy BDD package By Jørn Lind-Nielsen. Users working on a Debian/Ubuntu distribution can install the libbdd-dev package.
- Added support to GCC compiler version 8.
- Added support to mingw-w64. A Windows-7 64bit binary distribution is now available. This minimal msys2 binary distribution includes bambu, GCC8 with plugin and multilib enabled and a customized version of clang7.
- Added support to Mac OSX exploiting Mac Ports project (https://www.macports.org/)
- Improved the vagrant based virtual machine generation scripts. Such scripts now create an Ubuntu 16.04 32bit VirtualBox image and an Ubuntu 18.04 64bit VirtualBox image.
- Added Vagrant scripts for Ubuntu precise, trusty,xenial and bionic, Fedora 29, CentOs7, MacOSX MacPorts.
- Improved timing constraints and timing reports.
- Added –registered-inputs=top option.
- Improved the support for discrepancy when Verilator is used as simulator and a better check of basic floating-point operations.
- Added an example using C++14 constexpr declaration and a simple GCD example written in C++.
- The building system is now based on a single configure.ac.
- Regressions are now exploiting a Jenkins based infrastructure.
- Some performance, style, and c++ improvements have been done following the suggestions coming from cppcheck, clang static analyzer, and Codacy.
- A single-precision floating-point faithfully rounded powf function has been added. It follows the method published in:
Florent De Dinechin, Pedro Echeverria, Marisa Lopez-Vallejo, Bogdan Pasca. Floating-Point Exponentiation Units for Reconfigurable Computing. ACM Transactions on Reconfigurable Technology and Systems (TRETS), ACM, 2013, 6 (1), pp.4:1–4:15.
The powf function currently does not supports subnormals.
- Ported some parts of the code to C++11 standard by applying clang-tidy modernize-deprecated-headers, modernize-pass-by-value, modernize-use-auto,
modernize-use-bool-literals, modernize-use-equals-default, modernize-use-equals-delete, modernize-loop-convert and modernize-use-override.
- Added a SRT4 implementation. Added –hls-fpdiv to select which floating-point division will be used. Current options: SRT4 for Sweeney, Robertson, Tocher floating-point division with radix 4 and G for Goldschmidt floating point division.
- Added support to ac_types and ac_math library from Mentor Graphics. Concerning the original library, the bambu/PandA library does support both LLVM/Clang and GCC compiler and it requires c++14 standard for the compilation. Initial support to ap_* objects used by VIVADO HLS has been added as well.
- Added support to top design interfaces: none (ap_none), acknowledge (ap_ack), valid (ap_vld), ovalid (ap_ovld), handshake (ap_hs), fifo (ap_fifo) and array (ap_memory). These interfaces are activated by pragmas added to the source code and when the option –generate-interface=INFER is passed to bambu. Examples of pragma use can be found in panda_regressions/hls/bambu_specific_test4.
- Added some examples taken from https://github.com/Xilinx/HLx_Examples
- Added option –clock-name, –reset-name, –start-name and –done-name to specify the top component controlling signal names.
- Renamed top component and removed the suffix _minimal_interface. Now, the top component has the same name of the top function.
- Improved and fixed the synthesis of empty functions.
- Added option –VHDL-library=libraryname to specify the library in which the synthesized function has to be compiled.
- Extended the VHDL support to other bambu library components.
- If it is supported, the compiler used to compile the PandA framework uses the c++17 standard (-std=c++17), otherwise, it uses c++11 standard.
- Integrated Mockturtle library from EPFL (https://github.com/lsils/mockturtle) to simplify LUT-based expressions.
- Integrated Abseil C++ libraries (https://abseil.io/)
- Added examples of parallel_queries from ICCAD15 and ICCAD16 papers.
- Added FPT tutorial material.
- Added PNNL19 tutorial material.
- Added PACT19 tutorial material.
- Improved makefile parallelization.
- Improved bambu determinism: two runs with the same specs produce the same HDL output.
- Fixed COND_EXPR_RESTRUCTURING step
- Fixed omp tests
- Fixed libm tests
- Fixed make check and cpp examples
- Fixed c++ struct initialization
- Fixed floating point division
- Fixed make dist command under Github repository
August 14th, 2017 Release 0.9.5- panda-0.9.5.tar.gz.
New features introduced:
- Added support to GCC 6 and GCC 7 (GCC 4.9 is still the preferred GCC compiler).
- Added support to bitfields.
- Added support for pointers and memory operations to the Discrepancy Analysis. Reference paper: Pietro Fezzardi and Fabrizio Ferrandi, “Automated bug detection for pointers and memory accesses in High-Level Synthesis compilers”, in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016.
- Added new option: –discrepancy-only=comma,separated,list,of,function,names
Restricts the discrepancy analysis only to the functions whose name is in the list passed as argument.
- Added new option: –discrepancy-permissive-ptrs\n”
Do not trigger hard errors on pointer variables.
- added preliminary support to TASTE integration. Reference paper: M. Lattuada, F. Ferrandi, and M. Perrotin, “Computer Assisted Design and Integration of FPGA Accelerators in Aerospace Systems,” in Proceedings of the IEEE Aerospace Conference, 2016, pp. 1-11.
- Added support to OpenMP SIMD. Reference paper: M. Lattuada and F. Ferrandi, “Exploiting Vectorization in High Level Synthesis of Nested Irregular Loops,” Journal of Systems Architecture, vol. 75, pp. 1-14, 2017.
- Default golden reference is now input C code without any modification.
- The options –synthesize and –objectives have been removed. Now the same values passed with –objectives can
be directly passed through the option –evaluation.
- Improved the precision and the effectiveness of the Bit Value analysis and optimizations.
- Improved detection of irreducible loops.
- Improved CSE.
- Added a frontend transformations that merge some operations into FPGA LUTs.
- Now frontend explicitly introduces function calls to softfloat functions.
- Added support to block RAM with latency = 3 (–high-latency=4).
- Added bambu option –fsm-encoding=[auto,one-hot,binary].
- Added a new option: –disable-reg-init-value
Used to remove the INIT value from registers (useful for ASIC designs)
- Improved mapping of multiplications on DSPs.
- Added a GCC plugin to apply the whole program optimization starting from the topfname function instead of main function (currently only GCC 4.9 is supported).
- Added further integer division algorithms:
- non-restoring division with unrolling factor equal to 1 (–hls-div=nr1) which becomes the default division algorithm.
- non-restoring division with unrolling factor equal to 2 (–hls-div=nr2)
- align divisor shift dividend method (–hls-div=as)
- Added a specialization of the integer division working with 64bits dividend and 32bits divisor.
- Single precision floating point faithfully rounded expf and logf functions implemented following the HOTBM method published by
- Jeremie Detrey and Florent de Dinechin, “Parameterized floating-point logarithm and exponential functions for FPGAs”, Microprocessors and Microsystems, vol.31,n.8, 2007, pp.537-545.
The code has been exhaustively tested and it supports subnormals.
- Single precision floating point faithfully rounded sin, cos, sincos and tan functions implemented following the HOTBM method published by
- Jeremie Detrey and Florent de Dinechin, “Floating-point Trigonometric Functions for FPGAs” FPL 2007.
The code has been exhaustively tested and it supports subnormals.
- Single precision floating point faithfully rounded sqrt function implemented following the method published by
- Florent de Dinechin, Mioara Joldes, Bogdan Pasca, Guillaume Revy: Multiplicative Square Root Algorithms for FPGAs. FPL 2010: 574-577
The code has been exhaustively tested and it supports subnormals.
- Implemented the port swapping algorithm as described in the following paper:
- Hao Cong, Song Chen and T. Yoshimura, “Port assignment for interconnect reduction in high-level synthesis,” Proceedings of Technical Program of 2012 VLSI Design, Automation and Test, Hsinchu, 2012, pp. 1-4.
- Improved support to structs passed by copy.
- Improved ROM identification.
- Added a new option: –rom-duplication
Assume that read-only memories can be duplicated in case timing requires.
- Improved memory initialization.
- Added some transformations that lowered some memcpy and memset call to simple instructions.
- Improved softfloat functions for basic single and double precisions operations: sum, sub, mul and division.
Now addition and subtraction operations correctly manage operand equal to +0 and -0.
- Added three options to control which softfloat and libm libraries are used: –softfloat-subnormal, –libm-std-rounding and –soft-fp.
- Fixed builtin isnanf.
- Added double precision implementation of libm round function.
- Added __builtin_lrint, __builtin_llrint, __builtin_nearbyint to libm library.
- Fixed and improved tgamma and tgammaf function.
- Added support to parallel compilation of bambu libraries.
- Added support to the automatic configuration of newer releases of Quartus for IntelFPGAs.
- Improved verilator detection.
- Improved libicu detection.
- Improved boost filesystem macro.
- Fixed problems due to -m32 under arch linux.
- Fixed compilation problems with glpk and ubuntu 14.04.
- Fixed a problem with long double. They now have the same size of double.
- Added support to Mentor Visualizer.
- Improved components characterization and timing models.
- Extended support to VHDL.
- Now VHDL modelsim simulation uses 2008 standard.
- Extended set of synthesis scripts and synthesis results.
- Improved area reporting for Virtex4 devices.
- Improved characterization of asynchronous RAMs.
- Fixed extraction of slack delay from ISE trce and Lattice reports.
- Fixed yosys backend wr.r.t the newer Vivado releases.
- Added SLICES to the set of data collected by characterization.
- Extended set of regression tests.
April 20th, 2016 – Release 0.9.4 – panda-0.9.4.tar.gz.
New features introduced:
- added support to GCC 5 (GCC 4.9 is still the default and the preferred GCC version)
- improved support to complex builtin data types
- added an initial support to “Extended Asm – Assembler Instructions with C Expression Operands”. In particular, the asm instruction could be used to inline VERILOG/VHDL code in a C source description (done by extending the multiple assembler dialects feature in asm templates: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html)
- timing for combinational accelerator is now correctly estimated by the backend synthesis scripts (accelerator that takes a single cycle to complete).
- added option –serialize-memory-accesses to remove any memory access parallelization. It is mainly useful for debugging purposes.
- added option –distram-threshold to explicitly control the DISTRIBUTED/ASYNCHRONOUS RAMs inferencing.
- refactored of simulation/evaluation options
- added support to STRATIX V and STRATIX IV
- added support to Virtex4
- added support to C files for cosimulation
- added support to out of context synthesis on Altera boards
- now Lattice ECP3 is fully supported. In particular, the byte enabling feature required by some of the memories instantiated by bambu is implemented by exploiting Lattice PMI (Parameterizable Module Inferencing) library.
- improved and extended the integration of existing IPs written in Verilog/VHDL.
- added an example showing how asm could be inlined in the C source code: simple_asm
- added an example, named file_simulate, showing how open, read, write and close could be used to verify a complex design with datasets coming from a file
- added an example showing how python could be used to verify the correctness of the HLS process: python-bindings
- added MachSuite (“MachSuite: Benchmarks for Accelerator Design and Customized Architectures.” – 2014 IEEE International Symposium on Workload Characterization.) to examples
- added benchmarks of “A Survey and Evaluation of FPGA High-Level Synthesis Tools” – IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems to examples
- added two examples showing how external IPs could be integrated in the HLS flow: IP_integration and led_example
- improved the VGA example for DE1 and ported to Nexys4 Xilinx prototyping board
- added two examples showing how it is possible to run two arcade games such as pong and breakout on a Nexys 4 prototyping board without using any processor. The two examples smoothly connect a low level controller for the VGA port plus some GPIO controllers with plain C code describing the game behavior.
- added a tutorial describing how to use bambu in designing a simple example: led_example
- refactoring of scripts for technology libraries characterization
- improved regression scripts: now panda regression consists of about 250K tests
- assertions check now have to be explicitly disabled also in release
- done port is now registered whenever it is possible
- added support to the synthesis of function pointers and to the inter-procedural resource sharing of functions (reference paper is: Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Fabrizio Ferrandi: Inter-procedural resource sharing in High Level Synthesis through function proxies. FPL 2015: 1-8)
- added support to speculative SDC code scheduling (controlled through option –speculative-sdc-scheduling | reference paper is: Code Transformations Based on Speculative SDC Scheduling. ICCAD 2015: 71-77)
- added two new experimental setups (BAMBU-BALANCED-MP and BAMBU-BALANCED) oriented to trade-off between area and performances; BAMBU_BALANCED-MP is the new default experimental setup
- added a discrepancy analysis to verify the correctness of the generated code (controlled through option –discrepancy | reference paper is: Pietro Fezzardi, Michele Castellana, Fabrizio Ferrandi: Trace-based automated logical debugging for high-level synthesis generated circuits. ICCD 2015: 251-258)
- added common subexpression elimination step
- reset can now be active high or active low (controlled through option –reset-level)
- added support to file IO libc functions: open, read, write and close.
- added support to assert function.
- added support to libc functions: stpcpy stpncpy strcasecmp strcasestr strcat strchr
strchrnul strcmp strcpy strcspn strdup strlen strncasecmp
strncat strncmp strncpy strndup strnlen strpbrk strrchr
strsep strspn strstr strtok bzero bcopy mempcpy
memchr memrchr rawmemchr index rindex
- improved double precision soft-float library
- added support to single and double precision complex division operations: __divsc3 __divdc3
- added preliminary support to irreducible loops
- changed the PandA hardware description license from GPL to LGPL
March 24th, 2015 – Release 0.9.3 – panda-0.9.3.tar.gz.
PandA now requires GCC 4.6 or greater to be compiled.
New features introduced:
- general improvement of performances of generated circuits
- added full support to GCC 4.9 family which is now the default
- improved retrieving of GCC alias analysis information
- added first version of VHDL backend
- added support to CycloneV
- added support to Artix7
- extended support to Virtex7 boards family
- added option –top-rtldesign-name that controls which is the function to be synthesized by the RTL backed
- it is now possible to write the testbench in C instead of using the xml file
- added a first experimental backend to yosys (yosys link )
- added examples/crc_yosys which tests yosys backend and C based testbenches
- improved Verilog testbench generation: it is now fully compliant with cycle based simulators (e.g., VERILATOR)
- added option –backend-script-extensions to pass further constraints to the RTL synthesis (e.g., pin assignment)
- added examples/VGA showing how to integrate existing HDL based IPs in a real FPGA design
- added scripts and results for CHStone synthesis of Lattice based designs
- improved support of complex numbers
- single precision soft-float functions redesigned: now –soft-float is the default and –flopoco becomes optional
- single precision floating point division implemented exploiting Goldshmidt algorithm
- improved synthesis of libm functions
- improved libm regression test
- improved architectural timing model
- improved graphviz representation of FSMs: timing information has been added
- added option –post-rescheduling to further improve the resource usage
- parameter registering is now performed and it can be controlled by using option –registered-inputs
- added a full implementation of Bit Value analysis and coupled with Value Range analysis performed by GCC
- added option –experimental-setup to control bambu defaults:
- BAMBU-PERFORMANCE-MP – multi-port performance oriented setup
- BAMBU-PERFORMANCE – single port performance oriented setup
- BAMBU-AREA-MP – multi-port area oriented setup
- BAMBU-AREA – single-port area oriented setup
- BAMBU – no specific optimizations enabled
- improved code speculation
- improved memory localization
- added option –do-not-expose-globals making possible localization of globals, as it is similarly done by some commercial tools
- added support of high latency memories and of distributed memories: zero, one and two delays memories are supported
- added option –aligned-access to drive the memory allocation towards more simple block RAM models: it can be used under some restricted assumptions (e.g., no vectorization and no structs used)
- ported the GCC algorithm which rewrites a division by a constant in adds and shifts
- added option –hls-div that maps integer divisions and modulus on a C based implementation of the Newton-Raphson algorithm
- improved technology libraries management:
- technology libraries and contraints are now managed in a independent way
- multiple technology libraries can be provided to the tool at the same time
- improved and parallelized PandA test regression infrastructure
- added support to Centos7, fedora 21, Ubuntu 14.04 and Ubuntu 14.10 distributions
- complete refactoring of output messages
- fixed problem related to Bison 2.7
- fixed reinstallation of PandA in a different folder
- fixed installation problems on systems where boost and gcc are not installed in default locations
- removed some implicit conversions from generated verilog circuits
February 12th, 2014 – Release 0.9.2 – panda-0.9.2.tar.gz.
New features introduced:
- added an initial support to GCC 4.9.0,
- stable support to GCC versions: v4.5, v4.6, v4.7 (default) and v4.8,
- added an experimental support to Verilator simulator,
- new dataflow dependency analysis for LOADs and STOREs; we now use GCC alias analysis to see if a LOAD and STORE pair or a STORE and STORE pair may conflict,
- added a frontend step that optimizes PHI nodes,
- added a frontend step that performs conditionally if conversions,
- added a frontend step that performs simple code motions,
- added a frontend step that converts if chains in a single multi-if control construct,
- added a frontend step that simplifies short circuits based control constructs,
- added a proxy-based approach to the LOADs/STOREs of statically resolved pointers,
- improved EBR inference for Lattice based devices,
- now, memory models are different for Lattice, Altera, Virtex5 and Virtex6-7 based devices,
- updated FloPoCo to a more recent version,
- now, register allocation maps storage values on registers without write enable when possible,
- added support to CentOS/Scientific Linux distributions,
- added support to ArchLinux distribution,
- added support to Ubuntu 13.10 distribution,
- now, testbenches accept a user defined error for float based computations; the error is specified in ULPs units; a Unit in the Last Place is the spacing between floating-point numbers,
- improved architectural timing model,
- added a very simple symbolic estimator of number of cycles taken by a function, it mainly covers function without loops and without unbounded operations,
- general refactoring of automatic HLS testbench generation,
- added support to libm function lceil and lceilf,
- added skip-pipe-parameter option to bambu; it is is used to select a faster pipelined unit (xilinx devices have the default equal to 1 while lattice and altera devices have the default equal to 0),
- improved memory allocation when byte-enabled write memories are not needed,
- added support to variable lenght arrays,
- added support to memalign builtin,
- added EXT_PIPELINED_BRAM memory allocation policy, bambu with this option assumes that a block ram memory is allocated outside the core (LOAD with II=1 and latency=2 and STORE with latency=1),
- added 2D matrix multiplication examples for integers and single precision floating point numbers,
- added some synthesis scripts checking bambu quality of results w.r.t. 72 single precision libm functions (e.g., sqrtf, sinf, expf, tanhf, etc.),
- added spider tool to automatically generate latex tables from bambu synthesis results,
- moved all the dot generated files into directory HLS_output/dot/. Such files (scheduling, FSM, CFG, DFG, etc) are now generated when –print-dot option is passed,
- VIVADO is now the default backend flow for Zynq based devices.
- fixed all the Bison related compilation problems,
- fixed some problems with testbench generation of 2D arrays,
- fixed configuration scripts for manually installed Boost libraries; now, we need at least Boost 1.48.0,
- fixed some problems with C pretty-printing of the internal IR,
- fixed some ISE/Vivado synthesis problems when System Verilog based model are used,
- fixed some problems with –soft-float based synthesis,
- fixed RTL-backend configuration scripts looking for tools (e.g., ISE, Vivado, Quartus and Diamond) already installed,
- fixed some problems with real-to-int and int-to-real conversions, added some explicit tests to the panda regressions.
September 17th, 2013 – Release 0.9.1 – panda-0.9.1.tar.gz.
New features introduced:
- complete support of CHSTone benchmarks synthesis and verification (link),
- better support of multi-ported memories,
- local memory exploitation,
- read-only-memory exploitation,
- support of multi-bus for parallel memory accesses,
- support of unaligned memory accesses,
- better support of pipelined resources,
- improved module binding algorithms (e.g., slack-based module binding),
- support of resource constraints through user xml file,
- support of libc primitives: memcpy, memcmp, memset and memmove,
- better support of printf primitive for RTL debugging purposes,
- support of dynamic memory allocation,
- synthesis of libm builtin primitives such as sin, cos, acosh, etc,
- better integration with FloPoCo library (FloPoCo link),
- soft-float based HW synthesis,
- support of Vivado Xilinx backend,
- support of Diamond Lattice backend,
- support of XSIM Xilinx simulator,
- synthesis and testbench generation of WISHBONE B4 Compliant Accelerators (see WB4 specs for details on the WISHBONE specification),
- synthesis of AXI4LITE Compliant Accelerators (experimental),
- inclusion of GCC regression tests to test HLS synthesis (tested HLS synthesis and RTL simulation),
- inclusion of libm regression tests to test HLS synthesis of libm (tested HLS synthesis and RTL simulation),
- support of multiple versions of GCC compiler: v4.5, v4.6 and v4.7.
- support of GCC vectorizing capability (experimental).
March 21st, 2012 – Release 0.9.0 – Build 10722 – panda-0.9.0.tar.gz.
For any information or bug report, please write to firstname.lastname@example.org