Manticore: Hardware-Accelerated RTL Simulation With Static Bulk-Synchronous Parallelism

Abstract:
The demise of Moore's Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware depend heavily upon cycle-accurate simulation of register-transfer-level (RTL) designs. The fastest software RTL simulators can simulate designs at 1–1000 kHz, i.e., more than three orders of magnitude slower than hardware. Improved simulators can increase designers' productivity by speeding design iterations and permitting more exhaustive exploration.

One possibility is to exploit low-level parallelism, as RTL expresses considerable fine-grain concurrency. Unfortunately, this fine-grain parallelism contrasts with the coarse-grain parallel workloads for which modern multicore systems are built, which leads to simulator designs that achieve only weak parallel performance scaling.

This work presents Manticore: a co-designed manycore architecture and compiler for RTL simulation that achieves strong parallel performance scaling. Manticore combines a bulk-synchronous parallel (BSP) execution model with static scheduling to eliminate the runtime overheads of synchronization among hundreds of cores. Since the scheduled synchronization occurs without overhead, fine-grain interactions among cores are efficient. Device-wide static scheduling also allows us to simplify the Manticore processors, significantly increasing the parallelism possible on a single chip. Our 225-core FPGA implementation running at 475 MHz outperforms a state-of-the-art RTL simulator running on top-of-the-line desktop and server computers in 8 out of 9 benchmarks. The ideas underlying Manticore's design present a first step towards fast, scale-out RTL simulation.
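
To make the execution model concrete, the following is a minimal Python sketch of statically scheduled BSP simulation. It is not Manticore's compiler or instruction set. The two-register design (`count`, `acc`), the two-core partition, and the names `PROGRAMS`, `SEND_SCHEDULE`, `simulate_bsp`, and `simulate_reference` are all assumptions made up for illustration. Each simulated RTL cycle is one superstep: every core computes the next value of the registers it owns from its local state only, and then a communication list fixed before the simulation starts copies updated values to the cores that read them. Because that list never changes at run time, no core has to negotiate or wait on a dynamic barrier.

```python
MASK = 0xFFFF  # assume 16-bit registers for this toy design

# Hypothetical partition: core 0 owns 'count', core 1 owns 'acc' and
# keeps a local copy of 'count' because its logic reads it.
def core0_compute(local):
    return {"count": (local["count"] + 1) & MASK}

def core1_compute(local):
    return {"acc": (local["acc"] + local["count"]) & MASK}

PROGRAMS = [core0_compute, core1_compute]

# Static communication schedule, fixed "at compile time":
# (source core, register name, destination core).
SEND_SCHEDULE = [(0, "count", 1)]

def simulate_bsp(cycles):
    """Run `cycles` supersteps; each superstep is one simulated RTL cycle."""
    local = [{"count": 0}, {"acc": 0, "count": 0}]
    for _ in range(cycles):
        # Compute phase: each core reads only its own local state.
        updates = [prog(state) for prog, state in zip(PROGRAMS, local)]
        # Commit phase: each core latches the registers it owns.
        for state, update in zip(local, updates):
            state.update(update)
        # Communication phase: deliver messages in the fixed order.
        for src, reg, dst in SEND_SCHEDULE:
            local[dst][reg] = local[src][reg]
    return local

def simulate_reference(cycles):
    """Single-threaded model of the same design, used as a cross-check."""
    count, acc = 0, 0
    for _ in range(cycles):
        count, acc = (count + 1) & MASK, (acc + count) & MASK
    return count, acc

if __name__ == "__main__":
    state = simulate_bsp(10)
    print(state[0]["count"], state[1]["acc"])
    print(simulate_reference(10))
```

In this sketch the phases run one after another in a single Python thread. On parallel hardware the same fixed schedule would let all cores compute simultaneously, which is the property the abstract attributes to static scheduling.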

