Start Here

If you are new to parallel programming, start here. This page is under construction and is updated regularly.

Introduction to Parallel Programming

Before we jump into coding, there are a few things one should know before writing parallel code:

  • Why do we need Parallel Programming?
  • What are the different Parallel Programming models?
  • How do we evaluate the performance of a parallel code?

This is a much longer discussion, reserved for a dedicated page; you can read it here in detail. If you want to jump directly to the programming aspects, continue reading.

Broad Picture

Our broad objective here is to improve the performance of our code, so we will explore approaches for both serial and parallel code. This page gives the broad picture and links to pages that cover each topic in more detail.

Please note that most of the material on this page applies to C/C++ and Fortran programmers (unless otherwise stated explicitly). If you are a Python programmer, you can find HPC/parallel programming resources here.

  • Programming CPUs
    • Improving the performance of Serial codes
      • Optimization of the serial code : Optimization of serial code is a crucial step in enhancing the performance of software applications. Serial code, which executes tasks sequentially, can often be improved to run more efficiently without changing its fundamental logic. This process involves various techniques and strategies, which can significantly impact the overall performance of an application. More details on the optimization of serial codes can be found here.
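As a small illustration of this kind of optimization, the sketch below shows loop-invariant code motion: an expensive division that does not change across iterations is hoisted out of the loop and replaced by a multiplication. The function names are ours, for illustration only.

```c
#include <stddef.h>

/* Naive version: performs a division on every iteration of the loop. */
double normalize_naive(double *x, size_t n, double total) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] / total;      /* division repeated n times */
        sum += x[i];
    }
    return sum;
}

/* Optimized version: the loop-invariant 1.0/total is computed once, and the
 * per-element division becomes a cheaper multiplication. */
double normalize_opt(double *x, size_t n, double total) {
    double inv = 1.0 / total;     /* hoisted out of the loop */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        x[i] *= inv;
        sum += x[i];
    }
    return sum;
}
```

Note that replacing a division with a multiplication by the reciprocal can change the last bit of floating-point results; whether that matters depends on the application.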
      • Compiler based optimizations : Modern compilers are very smart. They can generate optimized code, eliminating the need to write hand-tuned low-level code manually. As programmers, we can assist the compiler in generating efficient code. More details on compiler based optimizations can be found here.
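One common way of assisting the compiler is sketched below: the C99 `restrict` qualifier promises that the arrays do not overlap, which removes the aliasing concern that frequently prevents a compiler from auto-vectorizing a loop. The flags in the comment are GCC examples.

```c
#include <stddef.h>

/* With 'restrict' we promise the compiler that x and y never overlap, so it
 * is free to vectorize the loop. Compile with optimizations enabled, e.g.
 *     gcc -O3 -march=native ...
 * and, with GCC, add -fopt-info-vec to see which loops were vectorized. */
void saxpy(size_t n, float alpha,
           const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```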
      • Advanced Vector Extensions (AVX) Vectorization: Advanced Vector Extensions (AVX) are a set of instructions designed to improve the performance of applications by enabling Single Instruction, Multiple Data (SIMD) operations. SIMD operations allow a single operation to be performed on multiple data points simultaneously. On today’s processors, code that does not use the AVX units can lose a significant amount of performance (up to 32 times). More details on AVX Vectorization can be found here.
        • Using Vectorclass library : Agner Fog’s Vector Class Library (VCL) is a powerful C++ library designed to harness the power of SIMD (Single Instruction, Multiple Data) instructions on modern CPUs. It provides a set of classes and functions that allow developers to perform high-performance mathematical operations using the AVX, AVX2, and AVX-512 instruction sets. More details on using Vectorclass Library can be found here.
        • Using AVX Intrinsics : Using AVX intrinsics allows developers to directly harness the power of AVX units for performance-critical sections of their code. Intrinsics are special functions provided by compilers that map directly to AVX instructions, enabling fine-grained control over vectorized operations without writing low-level assembly code. More details on using AVX Intrinsics can be found here.
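A minimal intrinsics sketch is shown below. For portability of compilation it uses 128-bit SSE2 intrinsics, which every x86-64 compiler accepts without extra flags; the 256-bit AVX counterparts named in the comment work the same way but require compiling with `-mavx`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; AVX lives in immintrin.h */
#include <stddef.h>

/* Adds two double arrays two elements at a time. The AVX analogue processes
 * four doubles per step using _mm256_loadu_pd / _mm256_add_pd /
 * _mm256_storeu_pd (compile with -mavx). */
void add_arrays(const double *a, const double *b, double *c, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* load 2 doubles (unaligned ok) */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));
    }
    for (; i < n; i++)                      /* scalar remainder loop */
        c[i] = a[i] + b[i];
}
```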
        • Writing assembly code using AVX instructions : If none of these methods work, one can write low-level optimized AVX assembly directly for the performance-critical sections of the code. More details on writing assembly code using AVX instructions can be found here.
      • ARM Scalable Vector Extension (SVE) and Vector Length Agnostic (VLA) programming : SVE is an extension of the ARM AArch64 architecture designed to support scalable vector lengths. Unlike fixed-length vector extensions, SVE allows vector lengths to vary from 128 bits up to 2048 bits, depending on the hardware implementation. VLA programming is a technique that allows code to be written without specifying the exact vector length. This makes the code portable across different hardware platforms with varying vector lengths. VLA programming is particularly useful for applications that need to run on a wide range of devices, from embedded systems to high-end servers.
    • Parallel Programming on CPUs : Parallel programming on CPUs refers to the technique of utilizing multiple processor cores to execute several computations simultaneously. This method is crucial for minimizing the overall execution time of applications, particularly those involving large datasets or complex calculations. There are two primary approaches to parallelizing CPU-based codes: Shared Memory Programming and Distributed Memory Programming.
      • Shared Memory Programming : Shared memory programming is a parallel programming model where multiple threads or processes share a common memory space, allowing them to efficiently communicate and exchange data. This approach is commonly used in multi-core and multi-processor systems to improve performance and resource utilization.
        • OpenMP : OpenMP (Open Multi-Processing) is a widely-used API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. It provides a set of compiler directives, library routines, and environment variables that allow developers to write parallel code for multi-core and multi-processor systems. More details on OpenMP Programming can be found here.
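As a minimal OpenMP sketch (the function name is illustrative), the loop below sums an array in parallel; the `reduction(+:sum)` clause gives each thread a private partial sum and combines them at the end.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>   /* only needed if OpenMP runtime routines are called */
#endif

/* Compile with OpenMP enabled, e.g. gcc -fopenmp. Without that flag the
 * pragma is simply ignored and the loop runs serially, producing the same
 * result. */
double parallel_sum(const double *x, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
        sum += x[i];
    return sum;
}
```

This graceful degradation to serial execution is one reason directive-based models are a gentle entry point into shared memory programming.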
        • Pthreads : POSIX Threads, commonly known as Pthreads, is a standardized threading library for UNIX-based systems. It provides a set of APIs for creating and managing threads, enabling developers to write parallel programs that can take advantage of multi-core processors. More details on Pthreads Programming can be found here.
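A minimal Pthreads sketch, assuming a POSIX system (link with `-pthread`; the function and struct names are ours): two threads each sum half of an array, and the caller joins them and combines the partial results.

```c
#include <pthread.h>
#include <stddef.h>

/* Work description passed to each thread. */
struct chunk { const double *x; size_t n; double sum; };

/* Thread function: sums its assigned chunk of the array. */
static void *sum_chunk(void *arg) {
    struct chunk *c = (struct chunk *)arg;
    c->sum = 0.0;
    for (size_t i = 0; i < c->n; i++)
        c->sum += c->x[i];
    return NULL;
}

/* Splits the array in two, sums the halves concurrently, then combines. */
double two_thread_sum(const double *x, size_t n) {
    size_t half = n / 2;
    struct chunk lo = { x,        half,     0.0 };
    struct chunk hi = { x + half, n - half, 0.0 };
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_chunk, &lo);
    pthread_create(&t2, NULL, sum_chunk, &hi);
    pthread_join(t1, NULL);   /* wait for both workers to finish */
    pthread_join(t2, NULL);
    return lo.sum + hi.sum;
}
```

Note how the explicit thread creation, work splitting, and joining that OpenMP generates for you must all be written by hand here; in exchange, Pthreads offers finer control.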
      • Distributed Memory Programming : Distributed memory programming is a parallel computing paradigm in which each processing element has its own private memory. In this model, processes communicate by sending messages to one another. This approach is particularly effective for large-scale systems, such as clusters and supercomputers, where processors are spread across multiple nodes.
        • Message Passing Interface (MPI) : The Message Passing Interface (MPI) is a standardized and widely-used communication protocol for parallel computing. It enables processes to exchange messages efficiently across different nodes in a distributed computing environment. By providing a set of library routines usable in languages like C, C++, and Fortran, MPI simplifies the development of scalable and efficient parallel programs. More details on MPI Programming can be found here.
  • Programming GP-GPU : General-Purpose Computing on Graphics Processing Units (GP-GPU) is a paradigm that leverages the processing power of Graphics Processing Units (GPUs) for tasks beyond graphics rendering. Traditionally, GPUs were designed to accelerate the creation of images for display on screens. However, their highly parallel architecture makes them ideal for a wide range of computationally intensive tasks across various fields. 
    • CUDA : CUDA is a parallel computing platform and programming model developed by NVIDIA that allows developers to use NVIDIA GPUs for general-purpose processing (GP-GPU). It has enabled researchers from many application domains to accelerate their applications. CUDA provides a set of tools, libraries, and an extension of the C/C++ programming languages, and it allows a large number of threads to be spawned to utilize the GPU’s hardware resources. More details on CUDA Programming can be found here.
    • HIP : The Heterogeneous-Compute Interface for Portability (HIP) is a C++ runtime API and kernel language developed by AMD. It allows developers to write portable code that can run on both AMD and NVIDIA GPUs from a single source. This makes it easier to develop applications that can leverage the computational power of different GPU architectures without needing to rewrite the code for each platform. More details on HIP Programming can be found here.
  • Parallel Programming on Multiple Architectures : Some parallel programming languages/APIs allow us to target different or multiple architectures from a single source code, which significantly reduces development effort.
    • OpenACC : OpenACC is a directive-based parallel programming model designed to simplify the process of writing high-performance computing (HPC) applications. OpenACC codes can be run on a wide variety of platforms (CPUs and GPUs from different vendors, etc.). More details on OpenACC Programming can be found here.
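A minimal OpenACC sketch (function name illustrative): a single directive marks the loop for parallel execution, and the `copy` clause describes the data movement needed when the loop is offloaded to a GPU.

```c
#include <stddef.h>

/* Build with an OpenACC compiler, e.g. nvc -acc or gcc -fopenacc.
 * Compilers without OpenACC support simply ignore the pragma and run the
 * loop serially, with the same result. */
void scale_array(float *x, size_t n, float factor) {
    #pragma acc parallel loop copy(x[0:n])
    for (size_t i = 0; i < n; i++)
        x[i] *= factor;
}
```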
    • SYCL : SYCL is an open standard developed by the Khronos Group that enables developers to write code for heterogeneous architectures (CPUs, GPUs, FPGAs, etc) using modern C++. More details on SYCL Programming can be found here.
    • Kokkos : Kokkos is a C++ performance portability programming model designed to help developers write applications that can run efficiently on various high-performance computing (HPC) platforms. It provides abstractions for parallel execution and data management, making it easier to write code that can leverage different types of hardware, including CPUs, GPUs, and other accelerators. More details on Kokkos Programming can be found here.
    • OpenCL : OpenCL (Open Computing Language) is an open standard for parallel programming of heterogeneous systems. It provides a framework for writing programs that execute across various platforms, including CPUs, GPUs, FPGAs, and other processors. Developed by the Khronos Group, OpenCL enables developers to harness the computational power of a wide range of devices to accelerate performance for compute-intensive tasks. More details on OpenCL Programming can be found here.

Tools

Parallel programming is the simultaneous execution of multiple tasks or computations to improve performance and efficiency. This approach is increasingly important in modern computing, where complex applications and large datasets require significant processing power. Several tools are essential for writing parallel code effectively. Let’s delve into each of these tools in detail –

  • Editor: An editor is a fundamental tool for writing and managing code. For parallel programming, it’s crucial to choose an editor that supports syntax highlighting, code completion, and integration with other development tools.
  • Compiler: A compiler translates high-level code into machine code that can be executed by the hardware. Depending on the hardware platform, one can choose among the compilers available on it. Different compilers generate different machine code, so the same source code with the same input problem can have different execution times on the same system. Detailed discussion on compilers can be found on this page.
  • Debugger: Debugging parallel programs can be challenging due to the concurrent execution of multiple tasks. A debugger helps identify and resolve issues in the code, ensuring correct and efficient operation. Detailed discussion on debuggers can be found on this page.
  • Profiler: Profiling tools help analyze the performance of parallel programs by identifying bottlenecks, measuring execution time, and providing insights into resource usage. Using a profiler enables developers to fine-tune their parallel programs, ensuring optimal performance and resource utilization. Detailed discussion on profilers can be found on this page.

References:

  1. OpenMP : https://www.openmp.org/
  2. MPI Forum : https://www.mpi-forum.org/
  3. OpenACC : https://www.openacc.org/
  4. SYCL : https://www.khronos.org/sycl/
  5. Kokkos : https://kokkos.org/
  6. OpenCL : https://www.khronos.org/opencl/
  7. Agner Fog’s Vectorclass Library : https://www.agner.org/optimize/#vectorclass
  8. Intel Intrinsics Guide : https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html