GeneSys
One of the major enabling factors behind the rapid advancement of deep learning (e.g., convolutional and transformer-based neural networks) was the explosive growth of computing power in the 2010s. With the end of Dennard scaling (Dennard et al., 1974) and the advent of dark silicon (Esmaeilzadeh et al., 2011), research and development have shifted toward hardware accelerators for deep learning (Chen et al., 2014; Sharma et al., 2016; Chen et al., 2016; Shao et al., 2019; Genc et al., 2021). Deep neural network (DNN) accelerators have found their way into production datacenters (Jouppi et al., 2017; Anderson et al., 2021; Jouppi et al., 2021), autonomous vehicles (Lin et al., 2018), internet of things (IoT) devices (Reagen et al., 2016; Whatmough et al., 2018), and biomedical devices (Kim et al., 2014; Liu et al., 2021). Beyond the already challenging task of designing accelerators that satisfy power, performance, and area constraints, there is ongoing research on how to seamlessly integrate them into the software stack (Ma et al., 2020; Yu et al., 2020; Korolija et al., 2020). The focus must therefore shift from standalone hardware to holistic system design.
The Alternative Computing Technologies (ACT) Lab, led by my advisor Hadi Esmaeilzadeh, is one of the few groups in academia to have developed a fully fledged programmable accelerator generator, called GeneSys. GeneSys is a full-stack system designed to accelerate deep learning models such as convolutional neural networks (CNNs) and transformer-based language models. It comprises a parameterizable neural processing unit (NPU) generator capable of producing hardware accelerators in various configurations; the generated NPUs have been both taped out and prototyped on AWS F1 FPGAs. GeneSys also features a multi-target compilation stack that supports algorithms beyond deep learning, OpenCL-based Linux drivers, user-friendly Python APIs, an RTL verification framework with a regression suite spanning synthetic benchmarks and state-of-the-art DNN models like ResNet50, BERT, and GPT-2, hardware synthesis scripts, and a software simulator for profiling. These artifacts represent years of research and multiple published papers by the ACT Lab (Mahajan et al., 2016; Sharma et al., 2016; Sharma et al., 2018; Ghodrati et al., 2020; Kinzer et al., 2021; Kim et al., 2022). GeneSys has already been used in several published papers (Wang et al., 2024; Ghodrati et al., 2024; Mahapatra et al., 2024) and in course projects for CSE 240D at the University of California San Diego.
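To give a flavor of what "parameterizable" means here, the sketch below defines two points in an NPU design space. The field names (`pe_rows`, `simd_lanes`, the buffer capacities, and so on) are illustrative assumptions of my own, not the actual GeneSys generator parameters, but they capture the kind of design-space knobs such a generator exposes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NPUConfig:
    """Hypothetical illustration of the knobs a parameterizable NPU
    generator exposes; field names are assumptions, not GeneSys's."""
    pe_rows: int     # systolic-array height (processing elements)
    pe_cols: int     # systolic-array width
    simd_lanes: int  # vector lanes for non-GEMM operators
    ibuf_kb: int     # on-chip input buffer capacity
    wbuf_kb: int     # on-chip weight buffer capacity
    obuf_kb: int     # on-chip output buffer capacity

# Two instances generated from the same RTL template: a small
# IoT-class NPU and a larger datacenter-class one.
edge_npu = NPUConfig(pe_rows=8, pe_cols=8, simd_lanes=8,
                     ibuf_kb=64, wbuf_kb=64, obuf_kb=32)
cloud_npu = NPUConfig(pe_rows=32, pe_cols=32, simd_lanes=32,
                      ibuf_kb=512, wbuf_kb=512, obuf_kb=256)
```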
My primary contribution to GeneSys has been the continued development of the system's compiler, originally written by Sean Kinzer. Since taking over, I have streamlined setup and installation by packaging the compiler with pip; implemented a new greedy memory allocation strategy based on Pisarchyk and Lee (2020), which significantly reduced the DRAM memory footprint during execution (see the sketch following the venue list below); removed code bloat left over from stitching multiple projects together during initial development; and expanded the compiler's neural network layer support to accommodate more diverse models. I also had the opportunity to give oral presentations on the compiler as part of tutorials organized by the ACT Lab. The tutorials were presented at the following venues:
- IEEE/ACM International Symposium on Microarchitecture (MICRO) on October 29th, 2023 in Toronto, Canada
- IEEE International Symposium on High-Performance Computer Architecture (HPCA) on March 2nd, 2024 in Edinburgh, Scotland
- ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) on April 28th, 2024 in San Diego, California
- ACM/IEEE International Symposium on Computer Architecture (ISCA) on June 29th, 2024 in Buenos Aires, Argentina
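To make the memory planner concrete, here is a minimal sketch of the greedy-by-size strategy from Pisarchyk and Lee (2020) that the new allocator is based on; the `Tensor` fields and function name are my own illustrative choices, not the compiler's actual internals. Tensors are placed largest-first, each at the lowest offset in a shared arena that does not collide with an already-placed tensor whose lifetime overlaps.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int        # bytes
    first_use: int   # index of the operator that produces the tensor
    last_use: int    # index of the last operator that reads it

def greedy_by_size(tensors: list[Tensor]) -> tuple[dict[str, int], int]:
    """Plan offsets in one shared arena, largest tensors first."""
    placed: list[tuple[Tensor, int]] = []
    offsets: dict[str, int] = {}
    for t in sorted(tensors, key=lambda x: x.size, reverse=True):
        offset = 0
        # Scan placed tensors in offset order, sliding past any whose
        # lifetime and byte range both overlap the candidate's.
        for other, other_off in sorted(placed, key=lambda p: p[1]):
            live_together = (t.first_use <= other.last_use
                             and other.first_use <= t.last_use)
            collides = (offset < other_off + other.size
                        and other_off < offset + t.size)
            if live_together and collides:
                offset = other_off + other.size
        offsets[t.name] = offset
        placed.append((t, offset))
    arena = max((off + t.size for t, off in placed), default=0)
    return offsets, arena

# Toy three-layer chain: the input and output buffers are never live
# at the same time, so they can share the same bytes.
plan, arena = greedy_by_size([
    Tensor("input",  1024, first_use=0, last_use=0),
    Tensor("hidden", 2048, first_use=0, last_use=1),
    Tensor("output", 1024, first_use=1, last_use=1),
])
print(plan, arena)  # arena is 3072, not 1024 + 2048 + 1024 = 4096
```

Because buffers with disjoint lifetimes share bytes, the arena ends up smaller than the sum of all tensor sizes, which is where the reduction in DRAM footprint comes from.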
For more information on the project as a whole, see the GeneSys website.
References
2024
- Wang, S.-T., Xu, H., Mamandipoor, A., Mahapatra, R., Ahn, B. H., Ghodrati, S., Kailas, K., Alian, M., & Esmaeilzadeh, H. (2024). Data Motion Acceleration: Chaining Cross-Domain Multi Accelerators. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1043–1062. https://doi.org/10.1109/HPCA57654.2024.00083
- Ghodrati, S., Kinzer, S., Xu, H., Mahapatra, R., Kim, Y., Ahn, B. H., Wang, D. K., Karthikeyan, L., Yazdanbakhsh, A., Park, J., Kim, N. S., & Esmaeilzadeh, H. (2024). Tandem Processor: Grappling with Emerging Operators in Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 1165–1182. https://doi.org/10.1145/3620665.3640365
- Mahapatra, R., Ghodrati, S., Ahn, B. H., Kinzer, S., Wang, S.-T., Xu, H., Karthikeyan, L., Sharma, H., Yazdanbakhsh, A., Alian, M., & Esmaeilzadeh, H. (2024). In-Storage Domain-Specific Acceleration for Serverless Computing. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 530–548. https://doi.org/10.1145/3620665.3640413
2022
- Kim, J. K., Ahn, B. H., Kinzer, S., Ghodrati, S., Mahapatra, R., Yatham, B., Wang, S.-T., Kim, D., Sarikhani, P., Mahmoudi, B., Mahajan, D., Park, J., & Esmaeilzadeh, H. (2022). Yin-Yang: Programming Abstractions for Cross-Domain Multi-Acceleration. IEEE Micro, 42(5), 89–98. https://doi.org/10.1109/MM.2022.3189416
2021
- Genc, H., Kim, S., Amid, A., Haj-Ali, A., Iyer, V., Prakash, P., Zhao, J., Grubb, D., Liew, H., Mao, H., Ou, A., Schmidt, C., Steffl, S., Wright, J., Stoica, I., Ragan-Kelley, J., Asanovic, K., Nikolic, B., & Shao, Y. S. (2021). Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration. 2021 58th ACM/IEEE Design Automation Conference (DAC), 769–774. https://doi.org/10.1109/DAC18074.2021.9586216
- Anderson, M., Chen, B., Chen, S., Deng, S., Fix, J., Gschwind, M., Kalaiah, A., Kim, C., Lee, J., Liang, J., Liu, H., Lu, Y., Montgomery, J., Moorthy, A., Nadathur, S., Naghshineh, S., Nayak, A., Park, J., Petersen, C., … Rao, V. (2021). First-Generation Inference Accelerator Deployment at Facebook. https://arxiv.org/abs/2107.04140
- Jouppi, N. P., Yoon, D. H., Ashcraft, M., Gottscho, M., Jablin, T. B., Kurian, G., Laudon, J., Li, S., Ma, P., Ma, X., Norrie, T., Patil, N., Prasad, S., Young, C., Zhou, Z., & Patterson, D. (2021). Ten Lessons from Three Generations Shaped Google’s TPUv4i. Proceedings of the 48th Annual International Symposium on Computer Architecture, 1–14. https://doi.org/10.1109/ISCA52012.2021.00010
- Liu, J., Zhu, Z., Zhou, Y., Wang, N., Dai, G., Liu, Q., Xiao, J., Xie, Y., Zhong, Z., Liu, H., Chang, L., & Zhou, J. (2021). 4.5 BioAIP: A Reconfigurable Biomedical AI Processor with Adaptive Learning for Versatile Intelligent Health Monitoring. 2021 IEEE International Solid-State Circuits Conference (ISSCC), 62–64. https://doi.org/10.1109/ISSCC42613.2021.9365996
- Kinzer, S., Kim, J. K., Ghodrati, S., Yatham, B., Althoff, A., Mahajan, D., Lerner, S., & Esmaeilzadeh, H. (2021). A Computational Stack for Cross-Domain Acceleration. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 54–70. https://doi.org/10.1109/HPCA51647.2021.00015
2020
- Ma, J., Zuo, G., Loughlin, K., Cheng, X., Liu, Y., Eneyew, A. M., Qi, Z., & Kasikci, B. (2020). A Hypervisor for Shared-Memory FPGA Platforms. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 827–844. https://doi.org/10.1145/3373376.3378482
- Yu, H., Peters, A. M., Akshintala, A., & Rossbach, C. J. (2020). AvA: Accelerated Virtualization of Accelerators. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 807–825. https://doi.org/10.1145/3373376.3378466
- Korolija, D., Roscoe, T., & Alonso, G. (2020). Do OS Abstractions Make Sense on FPGAs? Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, 991–1010. https://www.usenix.org/conference/osdi20/presentation/roscoe
- Ghodrati, S., Ahn, B. H., Kim, J. K., Kinzer, S., Yatham, B. R., Alla, N., Sharma, H., Alian, M., Ebrahimi, E., Kim, N. S., Young, C., & Esmaeilzadeh, H. (2020). Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 681–697. https://doi.org/10.1109/MICRO50266.2020.00062
- Pisarchyk, Y., & Lee, J. (2020). Efficient Memory Management for Deep Neural Net Inference. https://arxiv.org/abs/2001.03288
2019
- Shao, Y. S., Clemons, J., Venkatesan, R., Zimmer, B., Fojtik, M., Jiang, N., Keller, B., Klinefelter, A., Pinckney, N., Raina, P., Tell, S. G., Zhang, Y., Dally, W. J., Emer, J., Gray, C. T., Khailany, B., & Keckler, S. W. (2019). Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 14–27. https://doi.org/10.1145/3352460.3358302
2018
- Lin, S.-C., Zhang, Y., Hsu, C.-H., Skach, M., Haque, M. E., Tang, L., & Mars, J. (2018). The Architectural Implications of Autonomous Driving: Constraints and Acceleration. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 751–766. https://doi.org/10.1145/3173162.3173191
- Whatmough, P. N., Lee, S. K., Brooks, D., & Wei, G.-Y. (2018). DNN Engine: A 28-nm Timing-Error Tolerant Sparse Deep Neural Network Processor for IoT Applications. IEEE Journal of Solid-State Circuits, 53(9), 2722–2731. https://doi.org/10.1109/JSSC.2018.2841824
- Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Kim, J. K., Chandra, V., & Esmaeilzadeh, H. (2018). Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks. Proceedings of the 45th Annual International Symposium on Computer Architecture, 764–775. https://doi.org/10.1109/ISCA.2018.00069
2017
- Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-L., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., … Yoon, D. H. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 1–12. https://doi.org/10.1145/3079856.3080246
2016
- Sharma, H., Park, J., Mahajan, D., Amaro, E., Kim, J. K., Shao, C., Mishra, A., & Esmaeilzadeh, H. (2016). From High-Level Deep Neural Models to FPGAs. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1–12. https://doi.org/10.1109/MICRO.2016.7783720
- Chen, Y.-H., Emer, J., & Sze, V. (2016). Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. Proceedings of the 43rd International Symposium on Computer Architecture, 367–379. https://doi.org/10.1109/ISCA.2016.40
- Reagen, B., Whatmough, P., Adolf, R., Rama, S., Lee, H., Lee, S. K., Hernández-Lobato, J. M., Wei, G.-Y., & Brooks, D. (2016). Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 267–278. https://doi.org/10.1109/ISCA.2016.32
- Mahajan, D., Park, J., Amaro, E., Sharma, H., Yazdanbakhsh, A., Kim, J. K., & Esmaeilzadeh, H. (2016). TABLA: A Unified Template-Based Framework for Accelerating Statistical Machine Learning. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 14–26. https://doi.org/10.1109/HPCA.2016.7446050
2014
- Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., & Temam, O. (2014). DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, 269–284. https://doi.org/10.1145/2541940.2541967
- Kim, C., Chung, M., Cho, Y., Konijnenburg, M., Ryu, S., & Kim, J. (2014). ULP-SRP: Ultra Low-Power Samsung Reconfigurable Processor for Biomedical Applications. ACM Trans. Reconfigurable Technol. Syst., 7(3). https://doi.org/10.1145/2629610
2011
- Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., & Burger, D. (2011). Dark Silicon and the End of Multicore Scaling. Proceedings of the 38th Annual International Symposium on Computer Architecture, 365–376. https://doi.org/10.1145/2000064.2000108
1974
- Dennard, R. H., Gaensslen, F. H., Yu, H.-N., Rideout, V. L., Bassous, E., & LeBlanc, A. R. (1974). Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions. IEEE Journal of Solid-State Circuits, 9(5), 256–268. https://doi.org/10.1109/JSSC.1974.1050511