Jiakun Yan's Homepage

Research Interest

Jiakun Yan

I am a member of technical staff at Advanced Micro Devices (AMD) Research. I work on developing GPU communication software and programming models for AI and HPC workloads.

Before that, I completed a Ph.D. in computer science at the University of Illinois at Urbana-Champaign, advised by Marc Snir, where I worked on designing better communication libraries for highly dynamic and irregular HPC programming systems and applications.

I am the main developer of LCI and have also worked on MPICH, HPX, Legion, and Charm++.

CV / GitHub

Contact me

Email

jiakun.yan[AT]amd[DOT]com
jiakunyan1998[AT]gmail[DOT]com

My research interest lies in parallel computing and the broader computer systems area. Currently, I am interested in designing high-level task-based programming models and low-level communication systems to better utilize modern parallel architectures and improve the performance, scalability, and programmability of modern parallel applications.

Education

University of Illinois at Urbana-Champaign, USA, from Aug. 2020 to Dec. 2025
Computer Science Ph.D. student, advised by Marc Snir

Shanghai Jiao Tong University, China, from Sep. 2016 to June 2020
Department of Computer Science & Zhiyuan College, Bachelor of Engineering

Past Experience

GPU Software - Legate Group, NVIDIA Research, USA, from May 2024 to Aug. 2024
Software Engineer Intern, working with Manolis Papadakis and Hessam Mirsadeghi

Programming Models and Runtime Systems (PMRS) Group, Argonne National Laboratory, USA, from May 2023 to Aug. 2023
Research Assistant, working with Yanfei Guo

Programming Systems and Applications Research Group, NVIDIA Research, USA, from May 2022 to Aug. 2022
Research Assistant, working with Michael Bauer and Michael Garland

PASSION Lab, Lawrence Berkeley National Laboratory, USA, from Aug. 2019 to Jan. 2020
Research Assistant, working with Aydın Buluç and Kathy Yelick

Publications

Jiakun Yan, Marc Snir, and Yanfei Guo. "Examining MPI and its Extensions for Asynchronous Multithreaded Communication." Proceedings of the 32nd European MPI Users' Group Meeting (EuroMPI), 2025. [ preprint ]

Jiakun Yan and Marc Snir. "LCI: a Lightweight Communication Interface for Efficient Asynchronous Multithreaded Communication." The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2025. [ preprint ]

Jiakun Yan, Hartmut Kaiser, and Marc Snir. "Understanding the Communication Needs of Asynchronous Many-Task Systems--A Case Study of HPX+LCI." arXiv preprint, 2025. [ preprint ]

Gregor Daiß, Patrick Diehl, Jiakun Yan, John K. Holmen, Rahulkumar Gayatri, Christoph Junghans, Alexander Straub, et al. "Asynchronous-Many-Task Systems: Challenges and Opportunities--Scaling an AMR Astrophysics Code on Exascale machines using Kokkos and HPX." arXiv preprint, 2024. [ preprint ]

Jiakun Yan and Marc Snir. "Contemplating a Lightweight Communication Interface for Asynchronous Many-Task Systems." In Workshop on Asynchronous Many-Task Systems and Applications (WAMTA), 2025. [ preprint ]

Jiakun Yan, Hartmut Kaiser, and Marc Snir. "Design and Analysis of the Network Software Stack of an Asynchronous Many-task System – The LCI parcelport of HPX." In Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W), 2023. [ pdf ]

Benjamin Brock, Yuxin Chen, Jiakun Yan, John Owens, Aydın Buluç, and Katherine Yelick. "RDMA vs. RPC for Implementing Distributed Data Structures." In Workshop on Irregular Applications: Architectures and Algorithms (IA³), 2019. [ pdf ]

Awards

Best Poster Award, WAMTA24, A Lightweight Communication Interface for Asynchronous Many-Task Systems, Feb. 2024, [Zenodo | pdf]

Talks

Advances in Applied Computer Science Invited Speaker Series, LANL, LCI: a Lightweight Communication Interface for Asynchronous Multithreaded Communication, May 27th, 2025

17th JLESC Workshop, LCI: a Lightweight Communication Interface for Efficient Asynchronous Multithreaded Communication, May 13th, 2025

Charm++ Workshop 2024, Lightweight Communication Interface: High-Performance Communication Support for Asynchronous Many-Task Systems, Apr. 25th, 2024

14th JLESC Workshop, Lightweight Communication Interface: Efficient Message Passing support for irregular, multithreaded communication, Sep. 28th, 2022

Services

Session Chair: WAMTA 2025

Projects

The Lightweight Communication Interface (LCI)
A cool communication runtime for parallel libraries and frameworks
Advised by Marc Snir, UIUC, Aug. 2020 - present
LCI Homepage

The Lightweight Communication Interface (LCI) is a low-level communication runtime. It aims to provide efficient support for applications with asynchronous, multithreaded, irregular communication patterns. It is designed with task-based runtimes as the target clients but should be general enough to apply to other irregular applications such as graph analysis and sparse linear algebra. Its major features include (a) flexible communication primitives and signaling mechanisms, (b) better multithreaded performance, and (c) explicit user control of communication behaviors and resources.

I am one of the major developers of LCI. I developed the Libfabric backend of LCI to enable LCI to run on the Cray/GNI platforms. I am evaluating the multithreaded performance of LCI and exploring ways to improve it, such as utilizing multiple hardware contexts.

HPX + LCI
Integrating LCI into a task-based programming runtime
Advised by Marc Snir, UIUC, Aug. 2021 - present
HPX Manual -- using the LCI parcelport

High Performance ParalleX (HPX) is a runtime system known for its support for the asynchronous task programming model. Currently, HPX uses MPI as its major communication backend. In this project, we add an LCI parcelport to HPX and compare the new HPX/LCI system with the original HPX/MPI system to investigate (a) how efficiently the current LCI can interact with HPX and (b) where LCI can further improve to better support the communication requirements of asynchronous task frameworks.

The first version of a full-fledged LCI parcelport implementation has been merged into the HPX master branch and will be shipped with HPX release 1.9.0. We evaluated the performance using a real-world application, Octo-Tiger, a star system simulator based on the fast multipole method on adaptive octrees. The LCI parcelport achieved a 40% speedup over the MPI parcelport on 32 nodes/4096 cores.

TaskFlow
A task-based runtime for distributed-memory systems
Advised by Josep Torrellas and Marc Snir, UIUC, Jan. 2021 - May 2021

TaskFlow is a simple but efficient task-based runtime for distributed-memory systems. It adopts the parameterized task graph (PTG) programming model, which reduces time and memory overhead and enables fine-grained synchronization. It executes tasks according to an explicit task dependency graph and uses active messages to proactively signal remote tasks. TaskFlow is implemented on top of Argobots and MPI. We ran a collection of micro-benchmarks and mini-applications to evaluate the performance of its various configurations and compared it with two established PTG-based task systems, TaskTorrent and PaRSEC. The benchmark results show that TaskFlow generally achieves the best performance under various circumstances.

Asynchronous RPC Library (ARL)
A high-throughput RPC system with node-level aggregation and single-node work-stealing
Advised by Aydın Buluç and Kathy Yelick, LBNL, Aug. 2019 - Jul. 2020
GitHub

Data-driven HPC applications suffer significant overheads from their fine-grained communication patterns. ARL is a high-throughput RPC system that targets this kind of application. It uses Remote Procedure Calls (RPCs) to provide powerful expressiveness. It achieves high performance through node-level aggregation, work-stealing, and innovative concurrent data structures. It also provides a flexible programming interface.

Node-level aggregation is the primary idea underlying the ARL system: it aggregates RPC requests sharing the same source and target node and sends them together as one large message. In this way, ARL exploits the high bandwidth between cores on the same node to achieve low overhead and high throughput. Work-stealing is another important feature of the ARL system. Every core can execute (steal) inbound RPC requests of other cores on the same target node. In this way, ARL can reduce attentiveness sensitivity and load imbalance.

I am the main developer of the ARL system. ARL is developed as a C++ header-only library based on the GASNet-EX communication library.

RDMA vs. RPC for Implementing Distributed Data Structures
Advised by Aydın Buluç and Kathy Yelick, LBNL, Aug. 2019 - Sep. 2019

RDMA and RPC are two primary ways of implementing distributed data structures. In this project, we compared implementations of distributed data structures using RDMA and RPC. We developed an analytical model to predict the performance of RDMA- and RPC-based data structures from their constituent operations, and then compared its predictions with real-world performance. My primary focus in this project was to design and conduct experiments to investigate the attentiveness problem of RPC, which became one of the motivations for the later ARL project. The resulting paper was accepted at the IA³ workshop at Supercomputing 2019.

Berkeley Container Library in Rust
A memory-safe distributed data structure library in Rust
CS267 course project, UC Berkeley, Mar. 2019 - May 2019
Advised by Benjamin Brock
GitHub

The Berkeley Container Library (BCL) is a distributed data structure library based on RDMA and written in C++. Rust is a systems programming language designed for both safety and high performance. We redesigned and reimplemented BCL in Rust to provide several safety guarantees for the distributed data structures, including protection against data races and memory leaks, type checking, and explicit type conversion.

I was one of the main developers of BCL in Rust. I developed the global pointer on top of the OpenSHMEM backend, which is the basis for the high-level data structures and adds little overhead compared to the raw backend functions, and the global guard, which prevents data races and is modeled on Rust's mutex struct. I also contributed code to the distributed Array and GuardArray structs and their benchmarks.
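
The node-level aggregation idea described for ARL above can be illustrated with a small, single-threaded sketch. This is not ARL's code or API (ARL is a C++ library on top of GASNet-EX); the names `NodeAggregator`, `send_batch`, `cores_per_node`, and `max_batch` are hypothetical and exist only for illustration. The sketch assumes that a target core maps to a node by integer division and that a batch is flushed once it reaches a fixed size.

```python
from collections import defaultdict


class NodeAggregator:
    """Hypothetical sketch: buffer RPC requests per target node and
    hand each node's buffer to the network as one batched message."""

    def __init__(self, send_batch, cores_per_node, max_batch=4):
        # send_batch(node, requests) stands in for the real network send.
        self.send_batch = send_batch
        self.cores_per_node = cores_per_node
        self.max_batch = max_batch
        self.buffers = defaultdict(list)  # target node -> pending requests

    def rpc(self, target_core, fn_name, *args):
        # Assumption: cores are numbered contiguously within each node.
        node = target_core // self.cores_per_node
        self.buffers[node].append((target_core, fn_name, args))
        if len(self.buffers[node]) >= self.max_batch:
            self.flush(node)

    def flush(self, node):
        # Send everything pending for this node as a single message.
        batch = self.buffers.pop(node, [])
        if batch:
            self.send_batch(node, batch)

    def flush_all(self):
        for node in list(self.buffers):
            self.flush(node)


# Usage: five fine-grained requests become three network messages.
sent = []
agg = NodeAggregator(lambda node, batch: sent.append((node, batch)),
                     cores_per_node=4, max_batch=3)
for core in [4, 5, 6, 0, 7]:
    agg.rpc(core, "increment", 1)
agg.flush_all()
```

In this toy run, the three requests bound for cores 4, 5, and 6 travel together as one message to node 1, which is the effect aggregation aims for. A real implementation would additionally share the buffers among all cores of the source node and let idle cores on the target node steal queued requests.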