Publications
Publications by category, in reverse chronological order.
2019
- μIR - An intermediate representation for transforming and optimizing the microarchitecture of application accelerators Sharifian, Amirali, Hojabr, Reza, Rahimi, Navid, Liu, Sihao, Guha, Apala, Nowatzki, Tony, and Shriraman, Arrvindh In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, 2019 [Abs] [PDF] [Slides] [Code]
Creating high-quality application-specific accelerators requires us to make iterative changes to both algorithm behavior and microarchitecture, and this is a tedious and error-prone process. High-Level Synthesis (HLS) tools [5, 10] generate RTL for application accelerators from annotated software. Unfortunately, the generated RTL is challenging to change and optimize. The primary limitation of HLS is that functionality and microarchitecture are conflated in a single language (such as C++). Making changes to the accelerator design may require code restructuring, and microarchitecture optimizations are tied to program correctness. We propose a generalized intermediate representation for describing accelerator microarchitecture, µIR, and an associated pass framework, µopt. µIR represents the accelerator as a concurrent structural graph in which the components roughly correspond to microarchitecture-level hardware blocks (e.g., function units, network, memory banks). There are two important benefits: i) it decouples microarchitecture optimizations from algorithm/program optimizations, and ii) it decouples microarchitecture optimizations from RTL generation. Computer architects express their ideas as a set of iterative transformations of the µIR graph that successively refine the accelerator architecture. The µIR graph is then translated to Chisel, while maintaining the execution model and cycle-level performance characteristics. In this paper, we study three broad classes of optimizations: Timing (e.g., pipeline re-timing), Spatial (e.g., compute tiling), and Higher-order Ops (e.g., tensor function units) that deliver between 1.5-8× improvement in performance, for an overall 5-20× speedup compared to a 1 GHz ARM A9. We evaluate the quality of the auto-generated accelerators on an Arria 10 FPGA and in UMC 28nm ASIC technology.
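To make the pass-framework idea concrete, here is a minimal Python sketch of a structural-graph IR with one spatial pass (compute tiling). All names (`Node`, `Graph`, `tile_compute`) are illustrative assumptions, not the actual µIR/µopt API, which operates on hardware blocks and emits Chisel.

```python
# A minimal sketch of a structural-graph IR plus one "spatial" pass
# (compute tiling). Node, Graph, and tile_compute are illustrative
# assumptions, not the actual µIR/µopt API.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                   # e.g., "mul0", "bank0"
    kind: str                                   # "func_unit", "memory_bank", ...
    succs: list = field(default_factory=list)   # structural (dataflow) edges

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)

    def add(self, node):
        self.nodes[node.name] = node
        return node

def tile_compute(graph, unit_name, factor):
    """A pass in the spirit of compute tiling: replicate one function unit
    `factor` times so independent operations can issue in parallel.
    Purely structural; the program's semantics are untouched."""
    original = graph.nodes[unit_name]
    for i in range(1, factor):
        clone = Node(f"{unit_name}_t{i}", original.kind, list(original.succs))
        graph.add(clone)
    return graph

g = Graph()
g.add(Node("mul0", "func_unit"))
tile_compute(g, "mul0", factor=4)
print(sorted(g.nodes))  # ['mul0', 'mul0_t1', 'mul0_t2', 'mul0_t3']
```

The point of the sketch is the separation of concerns the abstract describes: the pass rewrites only the structural graph, leaving the program's behavior and the eventual RTL generation untouched.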
2018
- TAPAS: Generating Parallel Accelerators from Parallel Programs Margerm, Steve, Sharifian, Amirali, Guha, Apala, Shriraman, Arrvindh, and Pokam, Gilles In The 51st Annual IEEE/ACM International Symposium on Microarchitecture 2018 [Abs] [PDF] [Slides] [Code]
High-level synthesis (HLS) tools generate accelerators from software programs to ease the task of building hardware. Unfortunately, current HLS tools have limited support for concurrency, which limits the speedup achievable with the generated accelerator. Current approaches only target fixed static patterns (e.g., pipelines, data-parallel kernels). This constrains the ability of software programmers to express concurrency. Moreover, the generated accelerator loses a key benefit of parallel hardware: dynamic asynchrony and the potential to hide long latencies and cache misses. We have developed TAPAS, an HLS toolchain for generating parallel accelerators from programs with dynamic parallelism. TAPAS is built on top of Tapir [22], [39], which embeds fork-join parallelism into the compiler's intermediate representation. TAPAS leverages the compiler IR to identify parallelism and synthesize the hardware logic. TAPAS provides first-class architectural support for spawning, coordinating, and synchronizing tasks during accelerator execution. We demonstrate that TAPAS can generate accelerators for concurrent programs with heterogeneous, nested, and recursive parallelism. Our evaluation on Intel-Altera DE1-SoC and Arria-10 boards demonstrates that TAPAS-generated accelerators achieve 20× the power efficiency of an Intel Xeon, while maintaining comparable performance. We also show that TAPAS enables lightweight tasks that can be spawned in ~10 cycles, enabling accelerators to exploit available fine-grain parallelism. TAPAS is a complete HLS toolchain for synthesizing parallel programs to accelerators and is open-sourced.
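For a flavor of the dynamically parallel, fork-join code TAPAS targets, here is a plain-Python sketch that mimics Tapir-style spawn/sync on a recursive workload. The thread pool, the serial cutoff, and `fib` itself are illustrative assumptions, not TAPAS input syntax.

```python
# A plain-Python sketch of the fork-join (spawn/sync) pattern that TAPAS
# lowers to hardware tasks; illustrative only, not TAPAS input syntax.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=32)

def fib(n, cutoff=5):
    if n < 2:
        return n
    if n < cutoff:                      # run small subtrees serially
        return fib(n - 1, cutoff) + fib(n - 2, cutoff)
    # "spawn": the recursive call may run concurrently with the continuation
    left = pool.submit(fib, n - 1, cutoff)
    right = fib(n - 2, cutoff)
    # "sync": wait for the spawned task before combining results
    return left.result() + right

print(fib(10))  # 55
```

The spawn point and the sync point are exactly where an HLS tool restricted to static pipelines loses information; encoding them in the IR is what lets the toolchain synthesize task-spawning logic directly.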
2016
- CHAINSAW: Von-Neumann Accelerators to Leverage Fused Instruction Chains Sharifian, Amirali, Kumar, Snehasish, Guha, Apala, and Shriraman, Arrvindh In The 49th Annual IEEE/ACM International Symposium on Microarchitecture 2016 [Abs] [PDF] [Slides] [Code]
A central tenet behind accelerators is to partition a program's execution into regions with different behavior (e.g., SIMD, irregular, compute-intensive) and then use behavior-specialized architectures [1] for each region. It is unclear whether the gains in efficiency arise from recognizing that a simpler microarchitecture suffices for the acceleratable code region, from the actual microarchitecture itself, or from a combination of both. Many proposals [2], [3] seem to choose dataflow-based accelerators, which encounter challenges with fabric utilization and static power when the available instruction parallelism falls below the peak operation parallelism available [4]. In this paper, we develop Chainsaw, a Von-Neumann-based accelerator, and demonstrate that many of the fundamental overheads (e.g., fetch-decode) can be amortized by adopting the appropriate instruction abstraction. The key insight is the notion of chains, which are compiler-fused sequences of instructions. Chains adapt to different acceleration behaviors by varying their length and the types of instructions fused into a chain. Chains convey the producer-consumer locality between dependent instructions, which the Chainsaw architecture captures by temporally scheduling such operations on the same execution unit and using pipeline registers to forward values between dependent operations. Chainsaw is a generic multi-lane architecture (4-stage pipeline per lane) and does not require any specialized compound function units; it can be reloaded, enabling it to accelerate multiple program paths. We have developed a complete LLVM-based compiler prototype and simulation infrastructure and demonstrated that an 8-lane Chainsaw is within 73% of the performance of an ideal dataflow architecture, while reducing energy consumption by 45% compared to a 4-way OOO processor.
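A minimal sketch of the chain-formation idea on a toy IR: greedily fuse single-consumer producer-consumer sequences into chains of bounded length, so each chain can be scheduled back-to-back on one lane. The function name and IR encoding are assumptions for illustration; the actual Chainsaw compiler operates on LLVM IR.

```python
# Hypothetical sketch of chain formation: fuse dependent instruction
# sequences so one lane runs them back-to-back, forwarding values through
# pipeline registers. Toy IR; the real compiler works on LLVM IR.

def form_chains(instrs, uses, max_len=4):
    """instrs: instruction ids in topological order.
    uses[i]: instructions that consume i's result."""
    chains, placed = [], set()
    for i in instrs:
        if i in placed:
            continue
        chain = [i]
        placed.add(i)
        # extend while the tail has exactly one consumer, so producer and
        # consumer land on the same lane and locality is preserved
        while len(chain) < max_len and len(uses.get(chain[-1], [])) == 1:
            nxt = uses[chain[-1]][0]
            if nxt in placed:
                break
            chain.append(nxt)
            placed.add(nxt)
        chains.append(chain)
    return chains

# a = load; b = a+1; c = b*2; d = store c; e = independent op
uses = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(form_chains(["a", "b", "c", "d", "e"], uses))
# [['a', 'b', 'c', 'd'], ['e']]
```

Varying `max_len` is the knob the abstract alludes to: longer chains amortize more fetch-decode overhead, while shorter chains keep more lanes busy when parallelism is plentiful.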
- Peruse and Profit: Estimating the Accelerability of Loops Kumar, Snehasish, Srinivasan, Vijayalakshmi, Sharifian, Amirali, Sumner, Nick, and Shriraman, Arrvindh In Proceedings of the 2016 International Conference on Supercomputing 2016 [Abs] [PDF] [Slides]
A multitude of execution models is available today for a developer to target. The choices vary from general-purpose processors to fixed-function hardware accelerators, with a large number of variations in between. There is a growing demand to assess the potential benefits of porting or rewriting an application for a target architecture in order to fully exploit the performance and/or energy-efficiency benefits offered by such targets. However, as a first step of this process, it is necessary to determine whether the application has characteristics suitable for acceleration. In this paper, we present Peruse, a tool to characterize the features of loops in an application and to help the programmer understand the amenability of loops for acceleration. We consider a diverse set of features ranging from loop characteristics (e.g., loop exit points) and operation mixes (e.g., control vs. data operations) to wider code-region characteristics (e.g., idempotency, vectorizability). Peruse is language-, architecture-, and input-independent and uses the intermediate representation of compilers to do the characterization. Using static analyses makes Peruse scalable and enables analysis of large applications to identify and extract interesting loops suitable for acceleration. We show analysis results for unmodified applications from the SPEC CPU benchmark suite, Polybench, and HPC workloads. For an end-user, it is more desirable to get an estimate of the potential speedup due to acceleration. We use the workload-characterization results of Peruse as features and develop a machine-learning-based model to predict the potential speedup of a loop when off-loaded to a fixed-function hardware accelerator. We use the model to predict the speedup of loops selected by Peruse and achieve an accuracy of 79%.
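A hedged sketch of the two-stage flow: static loop features in, predicted speedup out. The feature names, the made-up training rows, and the choice of a random-forest regressor are all illustrative assumptions; the paper's actual feature set and model are not reproduced here.

```python
# Illustrative two-stage flow: static loop features -> learned speedup
# estimate. Requires scikit-learn; features and labels below are invented
# placeholders, not data from the paper.
from sklearn.ensemble import RandomForestRegressor

# one row per loop: [exit_points, mem_op_frac, control_op_frac, vectorizable]
train_features = [
    [1, 0.40, 0.10, 1],
    [3, 0.70, 0.30, 0],
    [1, 0.20, 0.05, 1],
    [2, 0.55, 0.25, 0],
]
train_speedups = [6.0, 1.2, 8.5, 2.0]   # measured speedup labels (made up)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train_features, train_speedups)

candidate_loop = [[1, 0.30, 0.08, 1]]   # features from static analysis
print(model.predict(candidate_loop))    # predicted off-load speedup
```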
2013
- An energy-efficient clustering algorithm for large scale wireless sensor networks Soleimani, M., Sharifian, A., and Fanian, A. In 2013 21st Iranian Conference on Electrical Engineering (ICEE) 2013 [Abs]
Wireless sensor networks (WSNs) consist of a large number of sensor nodes with limited energy resources. Collecting and transmitting sensed information in an efficient way is one of the challenges in these networks. Clustering is one solution for reducing energy consumption, and it can improve scalability and network lifetime. However, unbalanced energy dissipation is an important issue in cluster-based WSNs. In this paper, a new clustering algorithm, named PDKC, is proposed for wireless sensor networks based on node-deployment knowledge. In PDKC, sensor node locations are modelled by a Gaussian probability distribution function instead of relying on GPS or other location-aware devices. In the proposed method, cluster heads are selected based on node-deployment information, residual energy, node degree, and distance from the base station. Simulation results indicate that the PDKC algorithm prolongs network lifetime, improves network coverage, and balances energy dissipation in comparison to other works.
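A sketch of cluster-head scoring in the spirit of PDKC, assuming a simple weighted combination of residual energy, node degree, and distance to the base station; the weights and the scoring form are illustrative, not the paper's exact formulation.

```python
# Hedged sketch of cluster-head selection: score nodes on residual energy,
# node degree, and distance to the base station, then pick the best.
# Weights and scoring form are illustrative assumptions.
import math

def score(node, base_station, w_energy=0.5, w_degree=0.3, w_dist=0.2):
    d = math.dist(node["pos"], base_station)
    # higher residual energy and degree help; distance to the BS hurts
    return (w_energy * node["energy"]
            + w_degree * node["degree"]
            - w_dist * d)

nodes = [
    {"id": 1, "pos": (10, 20), "energy": 0.9, "degree": 6},
    {"id": 2, "pos": (12, 18), "energy": 0.4, "degree": 9},
    {"id": 3, "pos": (60, 70), "energy": 0.8, "degree": 4},
]
base_station = (0, 0)
head = max(nodes, key=lambda n: score(n, base_station))
print("cluster head:", head["id"])
```

In a full protocol this selection would run per round and per neighborhood, with re-election as residual energy drains, which is how rotating the cluster-head role balances energy dissipation across the network.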