This repository contains a collection of heterogeneous computing benchmarks written with CUDA, HIP, SYCL/DPC++, and OpenMP-4.5 target offloading for studying performance, portability, and productivity.
Z. Jin and J. S. Vetter, "A Benchmark Suite for Improving Performance Portability of the SYCL Programming Model," 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Raleigh, NC, USA, 2023, pp. 325-327, doi: 10.1109/ISPASS57527.2023.00041. (https://ieeexplore.ieee.org/document/10158214)
AMD ROCm
Intel DPC++ compiler or Intel oneAPI toolkit
NVIDIA HPC SDK
Certain SYCL benchmarks require oneDPL, oneTBB, Syclomatic, or oneMKL interfaces.
Each benchmark falls into a single category. While such classification is not accurate, the arrangement serves as a starting point for users of the benchmark suite. Please see the Reference for more information about each benchmark.
daphne
allreduce, cmembench, babelstream, ccl, memcpy, memtest, pingpong, randomAccess, shmembench, triad
all-pairs-distance, bsw, ccs, cm, deredundancy, diamond, epistasis, extend2, frna, fsm, ga, logan, minibude, minimap2, nbnxm, nw, pcc, prna, sa, snake
affine, aobench, asmooth, background-subtract, bezier-surface, bilateral, bm3d, boxfilter, cbsfil, car, ced, colorwheel, convolution1D, convolution3D, convolutionDeformable, convolutionSeperable, dct8x8, debayer, depixel, degrid, doh, dpid, egs, face, flame, gabor, gamma-correction, hogbom, mandelbrot, marchCubes, match, medianfilter, morphology, mriQ, ne, opticalFlow, perlin, sobel, tonemapping, recursiveGaussian, resize, sad, seam-carving, spm, srad, ssim, stencil1d, stencil3d, surfel, voxelization, zoom
aes, bitcracker, bitpermute, chacha20, columnarSolver, ecdh, keccaktreehash, merkle, present
atomicAggregate, atomicCAS, atomicCost, atomicIntrinsics, atomicPerf, atomicSystemWide, bitpacking, bscan, bwt, compute-score, contract, dxtc2, filter, fpc, histogram, lzss, minmax, mpc, mtf, quant, rle, sc, scan, scan2, scan3, segment-reduce
ans, crc64, crs, entropy, jenkins-hash, ldpc, md5hash, murmurhash3
aop, black-scholes, binomial, bonds, libor
aidw, coordinates, geodesic, hausdorff, haversine, stsg
cc, floydwarshall, floydwarshall2, gc, hbc, hungarian, mis, sssp, rsmt
aligned-types, asta, blockAccess, blockexchange, collision, concurrentKernels, conversion, dispatch, graphExecution, ert, interleave, intrinsics-cast, kernelLaunch, layout, mallocFree, maxFlops, mixbench, nosync, openmp, overlap, p2p, pad, pitch, popcount, prefetch, reverse, ring, saxpy-ompt, shuffle, simpleMultiDevice, streamCreateCopyDestroy, streamOrderedAllocation, streamPriority, streamUM, tensorAccessor, threadfence, warpexchange, vote, wmma, wordcount, zerocopy
accuracy, adam, adamw, addBiasQKV, addBiasResidualLayerNorm, attention, attentionMultiHead, backprop, bincount, bn, channelShuffle, channelSum, clink, concat, crossEntropy, dense-embedding, dropout, dwconv, dwconv1d, expdist, flip, gd, gelu, ge-spmm, glu, gmm, gru, kalman, kmc, kmeans, knn, layernorm, lda, lif, logprob, lr, lrn, mask, matern, maxpool3d, mcpr, meanshift, mf-sgd, mmcsf, mnist, mrc, multinomial, nlll, nonzero, overlay, p4, page-rank, permute, perplexity, pointwise, pool, qkv, qtclustering, remap, relu, resnet-kernels, rowwiseMoments, rotary, sampling, scel, snicit, softmax, softmax-fused, softmax-online, stddev, streamcluster, swish, unfold, vol2col, wedford, winograd, word2vec
atan2, blas-dot, blas-fp8gemm, blas-gemm, blas-gemmBatched, blas-gemmStridedBatched, blas-gemmEx, blas-gemmEx2, complex, cross, determinant, divergence, dp, eigenvalue, f16max, f16sp, frechet, fresnel, fwt, gaussian, geam, gels, gemv, hellinger, hmm, idivide, interval, jaccard, jacobi, kurtosis, lanczos, langford, lci, lebesgue, leukocyte, lfib4, log2, lud, ludb, michalewicz, matrix-rotate, matrixT, minkowski, mr, mrg32k3a, norm2, nqueen, ntt, phmm, pnpoly, reverse2D, rfs, romberg, rsc, sddmm-batch, secp256k1, simpleSpmv, slu, spd2s, spgeam, spgemm, spmm, spmv, spnnz, sps2d, spsort, sptrsv, thomas, wyllie, zeropoint
mt, permutate, qrg, rng-wallace, sobol, urng
bfs, bsearch, b+tree, grep, keogh, s8n, ss, sss, tsp
extrema, fft, lombscargle, sosfil, zmddft
ace, adv, amgmk, axhelm, bh, bspline-vgh, burger, cooling, ccsd-trpdrv, che, chemv, chi2, clenergy, cmp, cobahh, d2q9_bgk, d3q19_bgk, damage, ddbp, dslash, easyWave, eikonal, fdtd3d, feynman-kac, fhd, fluidSim, gibbs, goulash, gpp, grrt, haccmk, halo-finder, heartwall, heat, heat2d, henry, hexicton, hotspot, hotspot3D, hpl, hwt1d, hypterm, ising, iso2dfd, laplace, laplace3d, lavaMD, lid-driven-cavity, logic-resim, logic-rewrite, loopback, lsqt, lulesh, mcmd, md, mdh, metropolis, miniFE, minimod, minisweep, miniWeather, multimaterial, myocte, nbody, particle-diffusion, particlefilter, particles, pathfinder, pns, projectile, pso, qem, rainflow, rayleighBenardConvection, reaction, rsbench, rtm8, rushlarsen, s3d, su3sheath, simplemoc, slit, sparkler, sph, sw4ck, tensorT, testSNAP, tissue, tpacf, tqs, tridiagonal, tsa, vanGenuchten, vmc, wlcpow, wsm5, xlqc, xsbench
bitonic-sort, hybridsort, is, merge, quicksort, radixsort, segsort, sort, sortKV, split, warpsort
inversek2j, rodrigues
Option 1: Makefile scripts that build and run an individual benchmark
Navigate to a benchmark in CUDA (benchmark-cuda) and type
`make ARCH=sm_70 run` // run on a NIVIDA GPU device with compute capability 7.0
Navigate to a benchmark in HIP (benchmark-hip) and type
`make run`
Navigate to a benchmark in SYCL (benchmark-sycl) and type
`make CUDA=yes CUDA_ARCH=sm_70 run` (targeting an NVIDIA GPU)
`make HIP=yes HIP_ARCH=gfx908 run` (targeting an AMD GPU)
`make run` or `make CC=icpx run` (targeting an Intel GPU)
NOTE: GCC_TOOLCHAIN may be set to select a specific version of GNU toolchain for the SYCL compiler
`make CUDA=yes CUDA_ARCH=sm_70 GCC_TOOLCHAIN=/path/to/x86_64/gcc-9.1.0 run`
Navigate to a benchmark in OpenMP (benchmark-omp) and type
`make -f Makefile.nvc run` (targeting NVIDIA GPUs)
`make -f Makefile.aomp run` (targeting AMD GPUs)
`make run` (targeting Intel GPUs)
Users may need to set appropriate values (e.g., `sm_80`, `sm_90`, `gfx906`, `gfx1030`) for their target offloading devices
`make -f Makefile.nvc SM=cc80 run`
`make -f Makefile.aomp ARCH=gfx906 run`
Option 2: Python scripts that help build, run and gather results from the benchmarks. As well as a basic script to compare results from two different runs.
It works with a `.json` file containing the benchmark names, a regex to
find the timings in the benchmark output and optional arguments that
must be provided to the benchmark binary. The `subset.json` contains
a subset of the benchmarks for cuda, hip and sycl at the moment, so more
work would be required to support the rest of the benchmarks. In
addition if there are failing benchmarks in the `.json` list, an
additional text file can be provided with a list of benchmarks to skip
when running all of them. Benchmarks in the text file can still be run
explicitly.
For example, to run all the SYCL benchmarks and then all the CUDA
benchmarks and compare the two:
```
./autohecbench.py sycl -o sycl.csv
./autohecbench.py cuda -o cuda.csv
./autohecbench-compare.py sycl.csv cuda.csv
```
It can also be used to run a single benchmark:
```
./autohecbench.py backprop-sycl --verbose
```
To run a single benchmark using the Intel oneAPI Toolkit on an Intel GPU:
```
./autohecbench.py backprop-sycl --sycl-type opencl --compiler-name icpx --verbose
```
To run a single benchmark using the Intel oneAPI Toolkit on a CPU:
```
./autohecbench.py backprop-sycl --sycl-type cpu --compiler-name icpx --verbose
```
By default it will run a warmup iteration before running each benchmark,
and it is possible to run the benchmarks multiple times with `-r`:
```
./autohecbench.py backprop-sycl -r 20 -o mandel-sycl.csv
```
And it also has options to pick the CUDA SM version or HIP architecture and a
few other parameters. Type `./autohecbench.py` to see all the options.
For Rodinia benchmarks, please download the dataset at http://lava.cs.virginia.edu/Rodinia/download.htm
For other benchmarks, datasets are either included with the benchmarks or could be downloaded through the links described in the README files.
The programs have not been evaluated on Windows or MacOS
The lastest Intel SYCL compiler (not the Intel oneAPI toolkit) may be needed for building some SYCL programs successfully
Kernel results do not exactly match using these programming languages on a platform for certain programs
Not all programs automate the verification of host and device results
Not all CUDA programs have SYCL, HIP or OpenMP equivalents
Not all programs have OpenMP target offloading implementations
Raw performance of any program may be suboptimal
Some programs may take long to complete on an integrated GPU
Some host programs contain platform-specific intrinsics, so they may cause compile error on a PowerPC platform
When double-precision floating-point operations are not supported on certain Intel GPU devices, software emulation may be enabled. FP64 emulation
Faqir-Rhazoui, Y. and García, C., 2024. SYCL in the edge: performance and energy evaluation for heterogeneous acceleration. The Journal of Supercomputing, pp.1-21.
Dearing, M.T., Tao, Y., Wu, X., Lan, Z. and Taylor, V., 2024. LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes. arXiv preprint arXiv:2407.01638.
Marzen, L., Dutta, A. and Jannesari, A., 2024. Static Generation of Efficient OpenMP Offload Data Mappings. arXiv preprint arXiv:2406.13881.
Ivanov, I.R., Zinenko, O., Domke, J., Endo, T. and Moses, W.S., 2024, March. Retargeting and Respecializing GPU Workloads for Performance Portability. In 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (pp. 119-132). IEEE.
Shilpage, W.R. and Wright, S.A., 2023, May. An investigation into the performance and portability of sycl compiler implementations. In International Conference on High Performance Computing (pp. 605-619). Cham: Springer Nature Switzerland.
Tian, S., Chapman, B. and Doerfert, J., 2023, August. Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution. In Proceedings of the 52nd International Conference on Parallel Processing Workshops (pp. 112-118).
Tian, S., Scogland, T., Chapman, B. and Doerfert, J., 2023, November. OpenMP kernel language extensions for performance portable GPU codes. In Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (pp. 876-883).
Murtovi, A., Georgakoudis, G., Parasyris, K., Liao, C., Laguna, I. and Steffen, B., 2023. Enhancing Performance Through Control-flow Unmerging and Loop Unrolling (No. LLNL-CONF-849354). Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States).
Alpay, A., Soproni, B., Wünsche, H. and Heuveline, V., 2022, May. Exploring the possibility of a hipSYCL-based implementation of oneAPI. In Proceedings of the 10th International Workshop on OpenCL (pp. 1-12).
Thavappiragasam, M. and Kale, V., 2022, November. OpenMP’s Asynchronous Offloading for All-pairs Shortest Path Graph Algorithms on GPUs. In 2022 IEEE/ACM International Workshop on Hierarchical Parallelism for Exascale Computing (HiPar) (pp. 1-11). IEEE.
Doerfert, J., Jasper, M., Huber, J., Abdelaal, K., Georgakoudis, G., Scogland, T. and Parasyris, K., 2022, October. Breaking the vendor lock: performance portable programming through OpenMP as target independent runtime layer. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (pp. 494-504).
Tian, S., Huber, J., Parasyris, K., Chapman, B. and Doerfert, J., 2022, November. Direct GPU compilation and execution for host applications with OpenMP Parallelism. In 2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) (pp. 43-51). IEEE.
Jin, Z. and Vetter, J.S., 2022, December. Understanding performance portability of bioinformatics applications in SYCL on an NVIDIA GPU. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 2190-2195). IEEE.
Jin, Z., 2021. The Rodinia Benchmarks in SYCL (No. ORNL/TM-2021/2338). Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States).
Early results are shown here
Accuracy of prediction (https://pytorch.org/)
Phase-field simulation of dendritic solidification (https://github.com/myousefi2016/Allen-Cahn-CUDA)
Adaptive moment estimation (https://github.com/hpcaitech/ColossalAI)
Adaptive moment estimation with weight decay (https://github.com/lessw2020/QuantFour_AdamW_Cuda)
Add biases to query, key, and value vectors (https://github.com/NVIDIA/FasterTransformer)
Combines the bias, residual of previous block and the computation of layer normalization (https://github.com/NVIDIA/FasterTransformer)
Advection (https://github.com/Nek5000/nekBench/tree/master/adv)
AES encrypt and decrypt (https://github.com/Multi2Sim/m2s-bench-amdsdk-2.5-src)
Affine transformation (https://github.com/Xilinx/SDAccel_Examples/tree/master/vision/affine)
Adaptive inverse distance weighting (Mei, G., Xu, N. & Xu, L. Improving GPU-accelerated adaptive IDW interpolation algorithm using fast kNN search. SpringerPlus 5, 1389 (2016))
Alignment specification for variables of structured types (http://docs.nvidia.com/cuda/cuda-samples/index.html)
The ring allreduce and ring allgather (https://github.com/baidu-research/baidu-allreduce)
All-pairs distance calculation (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2910913/)
The relax kernel in the AMGmk benchmark (https://asc.llnl.gov/CORAL-benchmarks/Micro/amgmk-v1.0.tar.gz)
Asymmetric numeral systems decoding (https://github.com/weissenberger/multians)
A lightweight ambient occlusion renderer (https://code.google.com/archive/p/aobench)
American options pricing (https://github.com/NVIDIA-developer-blog)
Adaptive smoothing (http://www.hcs.harvard.edu/admiralty/)
Array of structure of tiled array for data layout transposition (https://github.com/chai-benchmarks/chai)
Approximate the atan2 math function (https://github.com/cms-patatrack/pixeltrack-standalone)
Atomic aggregate (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/)
Atomic add, subtract, min, max, AND, OR, XOR (http://docs.nvidia.com/cuda/cuda-samples/index.html)
64-bit atomic add, min, and max with compare and swap (https://github.com/treecode/Bonsai/blob/master/runtime/profiling/derived_atomic_functions.h)
Evaluate the cost of atomic add operations
Evaluate atomic add operations over global and shared memory (https://stackoverflow.com/questions/22367238/cuda-atomic-operation-performance-in-different-scenarios)
Integer sum reduction with atomics (https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/reduction)
System-wide atomics (http://docs.nvidia.com/cuda/cuda-samples/index.html)
Ham, T.J., et al., 2020, February. A^ 3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 328-341). IEEE.
Implementation of multi-head attention (https://github.com/IrishCoffee/cudnnMultiHeadAttention)
Helmholtz matrix-vector product (https://github.com/Nek5000/nekBench/tree/master/axhelm)
Measure memory transfer rates for copy, add, mul, triad, dot, and nstream (https://github.com/UoB-HPC/BabelStream)
Background subtraction (Alptekin Temizel et al. Experiences on Image and Video Processing with CUDA and OpenCL, In Applications of GPU Computing Series, GPU Computing Gems Emerald Edition, Morgan Kaufmann, 2011, Pages 547-567)
Backpropagation in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
The Bezier surface (https://github.com/chai-benchmarks/chai)
The breadth-first search in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Simulate the gravitational forces in a star cluster using the Barnes-Hut n-body algorithm (https://userweb.cs.txstate.edu/~burtscher/research/ECL-BH/)
Bilateral filter (https://github.com/jstraub/cudaPcl)
Count the number of values that fall into each bin (https://pytorch.org/)
Evaluate fair call price for a given set of European options under binomial model (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Open-source password cracking tool for storage devices (https://github.com/e-ago/bitcracker.git)
Bitonic sorting (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/)
A bit-level operation that aims to reduce the number of bits required to store each value (https://github.com/NVIDIA/nvcomp)
Permute the data using bit-level operations in an array (https://github.com/supranational/sppark)
The Black-Scholes simulation (https://github.com/cavazos-lab/FinanceBench)
A dot product between two real vectors
Scaled matrix-matrix multiplication in a 8-bit floating-point format
General matrix-matrix multiplications (https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2025-0/gemm.html)
Batched general matrix-matrix multiplication (https://github.com/pyrovski/cublasSgemmBatched-example)
Strided batched general matrix-matrix multiplication (https://github.com/pyrovski/cublasSgemmBatched-example)
Extended general matrix-matrix multiplications (https://godweiyang.com/2021/08/24/gemm/, https://github.com/UoB-HPC/abc-pvc-deepdive)
Extended general matrix-matrix multiplications using cuBLASLt, hipBLASLt, and oneDNN
Block access from the CUB's collective primitives (https://github.com/NVIDIA/cub)
Rearrange data partitioned across a thread block (https://github.com/NVIDIA/cub)
Block-matching and 3D filtering method for image denoising (https://github.com/DawyD/bm3d-gpu)
Bayesian network learning (https://github.com/OSU-STARLAB/UVM_benchmark/blob/master/non_UVM_benchmarks)
Fixed-rate bond with flat forward curve (https://github.com/cavazos-lab/FinanceBench)
Box filtering (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Binary scan in a block (Harris, M. and Garland, M., 2012. Optimizing parallel prefix operations for the Fermi architecture. In GPU Computing Gems Jade Edition (pp. 29-38). Morgan Kaufmann.)
Classic and vectorizable binary search algorithms (https://www.sciencedirect.com/science/article/abs/pii/S0743731517302836)
Compute the value, gradient and hessian on random positions in a 3D box (https://github.com/QMCPACK/miniqmc/blob/OMP_offload/src/OpenMP/main.cpp)
GPU accelerated Smith-Waterman for performing batch alignments (https://github.com/mgawan/ADEPT)
2D Burger's equation (https://github.com/soumyasen1809/OpenMP_C_12_steps_to_Navier_Stokes)
Burrows-Wheeler transform (https://github.com/jedbrooke/cuda_bwt)
B+Tree in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Content adaptive resampling (https://github.com/sunwj/CAR)
Cubic b-spline filtering (https://github.com/DannyRuijters/CubicInterpolationCUDA)
Connected components (https://userweb.cs.txstate.edu/~burtscher/research/ECL-CC/)
Collective communications library (https://github.com/NVIDIA/nccl)
Condition-dependent Correlation Subgroups (https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS)
The CCSD tengy kernel, which was converted from Fortran to C by Jeff Hammond, in NWChem (https://github.com/jeffhammond/nwchem-ccsd-trpdrv)
Canny edge detection (https://github.com/chai-benchmarks/chai)
The CFD solver in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
ChaCha20 stream cipher (https://github.com/983/ChaCha20)
Divide the channels in a tensor into groups and rearrange them (https://pytorch.org/)
Per-channel sum of values (https://pytorch.org/)
Phase-field simulation of spinodal decomposition using the Cahn-Hilliard equation (https://github.com/myousefi2016/Cahn-Hilliard-CUDA)
Complex hermitian matrix-vector multiplication (https://repo.or.cz/ppcg.git)
The Chi-square 2-df test. (https://web.njit.edu/~usman/courses/cs677_spring19/)
Direct coulomb summation kernel (http://www.ks.uiuc.edu/Training/Workshop/GPU_Aug2010/resources/clenergy.tar.gz)
Compact LSTM inference kernel (http://github.com/UCLA-VAST/CLINK)
Gene expression connectivity mapping (https://pubmed.ncbi.nlm.nih.gov/24112435/)
The constant memory microbenchmark (https://github.com/ekondis/gpumembench)
Seismic processing using the classic common midpoint (CMP) method (https://github.com/hpg-cepetro/IPDPS-CRS-CMP-code)
Simulation of Random Network of Hodgkin and Huxley Neurons with Exponential Synaptic Conductances (https://dl.acm.org/doi/10.1145/3307339.3343460)
Check collision of duplicate values (https://github.com/facebookarchive/fbcuda)
Color encoding of flow vectors (https://vision.middlebury.edu/flow/code/flow-code/colorcode.cpp)
Dimitrov, M. and Esslinger, B., 2021. CUDA Tutorial--Cryptanalysis of Classical Ciphers Using Modern GPUs and CUDA. arXiv preprint arXiv:2103.13937.
Complex numbers arithmetics (https://github.com/tpn/cuda-samples/blob/master/v8.0/include/cuComplex.h)
Document filtering (https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/design-software/opencl/compute-score.html)
Concatenation of two tensors (https://github.com/bytedance/lightseq)
Demonstrate the use of streams for concurrent execution of several kernels with dependency on a device (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/concurrentKernels)
Second-order tensor aggregation with an adjacency matrix (https://github.com/HyTruongSon/GraphFlow)
Conversion among common data types (intel/llvm#7195)
1D convolution (Kirk, D.B. and Wen-Mei, W.H., 2016. Programming massively parallel processors: a hands-on approach. Morgan kaufmann)
3D convolution (Kirk, D.B. and Wen-Mei, W.H., 2016. Programming massively parallel processors: a hands-on approach. Morgan kaufmann)
Deformable convolution (https://github.com/lucasjinreal/DCNv2_latest)
Convolution filter of a 2D image with separable kernels (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Primordial hydrogen/helium cooling curve (https://github.com/cholla-hydro/cholla)
Coordinates(latitude and longitude) transformation using the STL transform (https://github.com/rapidsai/cuspatial)
64-bit cyclic-redundancy check (https://trac.alcf.anl.gov/projects/hpcrc64/)
Cross product of two 2D tensors (https://pytorch.org/)
Cross entropy loss in the backward phase (intel/llvm#5969)
Cauchy Reed-Solomon encoding (https://www.comp.hkbu.edu.hk/~chxw/gcrs.html)
A lattice boltzmann scheme with a 2D grid, 9 velocities, and Bhatnagar-Gross-Krook collision step (https://github.com/WSJHawkins/ExploringSycl)
Lattice Boltzmann simulation framework based on C++ parallel algorithms (https://gitlab.com/unigehpfs/stlbm)
The Darmstadt automotive parallel heterogeneous benchmark suite (https://github.com/esa-tu-darmstadt/daphne-benchmark)
The continuum level damage in a peridynamic body (https://github.com/alan-turing-institute/PeriPy)
Discrete Cosine Transform (DCT) and inverse DCT for 8x8 blocks (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Distance-driven backprojection (https://github.com/LAVI-USP/DBT-Reconstruction)
Convert a Bayer mosaic raw image to RGB
Radio astronomy degridding (https://github.com/NVIDIA/SKA-gpu-degrid)
Dense embedding add operations (https://pytorch.org/)
Check connectivity and remove crosses in depixelization of pixel art (https://github.com/yzhwang/depixelization)
Gene sequence de-redundancy is a precise gene sequence de-redundancy software that supports heterogeneous acceleration (https://github.com/JuZhenCS/gene-sequences-de-redundancy)
Calculate the determinant of a matrix using library-based decomposition and strided reduction (https://github.com/OrangeOwlSolutions/Linear-Algebra)
Mask sequences kernel in Diamond (https://github.com/bbuchfink/diamond)
Kernel dispatch rate and latency (https://github.com/ROCm-Developer-Tools/HIP-CPU)
Barrel distortion (https://github.com/Cuda-Chen/barrel-distortion-cuda)
CPU and GPU divergence test (https://github.com/E3SM-Project/divergence_cmdvse)
Determinant of a Hessian matrix (https://github.com/rapidsai/cucim)
Dot product (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Detail-preserving image downscaling (https://github.com/mergian/dpid)
Randomly zero some elements of the input array with a probability using samples from a uniform distribution (https://github.com/pytorch/)
A Lattice QCD Dslash operator proxy application derived from MILC (https://gitlab.com/NERSC/nersc-proxies/milc-dslash)
A lossy texture compression (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Depth-wise convolution (https://pytorch.org/)
Depth-wise 1D convolution (https://github.com/BlinkDL/RWKV-CUDA/tree/main)
Simulation of tsunami generation and propagation in the context of early warning (https://git.gfz-potsdam.de/id2/geoperil/easyWave)
Elliptic curve Diffie-Hellman key exchange (https://github.com/jaw566/ECDH)
Parallel implementation of EGSnrc's photon transport mechanism (https://jonaslippuner.com/research/cuda-egs/)
Calculate the eigenvalues of a tridiagonal symmetric matrix (https://github.com/OpenCL/AMD_APP_samples)
Fast iterative method for Eikonal equations on structured volumes (https://github.com/SCIInstitute/StructuredEikonal)
Compute the entropy for each point of a 2D matrix using a 5x5 window (https://lan-jing.github.io/parallel%20computing/system/entropy/)
Epistasis detection (https://github.com/rafatcampos/bio-epistasis-detection)
Modified microkernel in the empirical roofline tool (https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/)
Compute the Bhattacharya cost function (https://github.com/benvanwerkhoven/kernel_tuner)
Smith-Waterman (SW) extension in Burrow-wheeler aligner for short-read alignment (https://github.com/lh3/bwa)
Find local maxima (https://github.com/rapidsai/cusignal/)
Compute the maximum of half-precision floating-point numbers using bit operations (https://x.momo86.net/en?p=113)
Half-precision scalar product (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Face detection using the Viola-Jones algorithm (https://sites.google.com/site/5kk73gpu2012/assignment/viola-jones-face-detection)
FDTD-3D (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Use of Feynman-Kac algorithm to solve Poisson's equation in a 2D ellipse (https://people.sc.fsu.edu/~jburkardt/c_src/feynman_kac_2d/feynman_kac_2d.html)
A case study: advanced magnetic resonance imaging reconstruction (https://ict.senecacollege.ca/~gpu610/pages/content/cudas.html)
Filtering by a predicate (https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/)
Fast Fourier transform in the SHOC benchmark suite(https://github.com/vetter/shoc/)
Fractal flame (http://gpugems.hwu-server2.crhc.illinois.edu/)
Tensor flip (https://pytorch.org/)
Floyd-Warshall Pathfinding sample (https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/HIP-Examples-Applications/FloydWarshall/)
Fast Floyd-Warshall for all-pairs-shortest paths (https://userweb.cs.txstate.edu/~burtscher/research/ECL-APSP/)
2D Fluid Simulation using the Lattice-Boltzman method (https://github.com/OpenCL/AMD_APP_samples)
Frequent pattern compression ( Base-delta-immediate compression: practical data compression for on-chip caches. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques (pp. 377- 388). ACM.)
Compute the discrete Frechet distance between two curves specified by discrete ordered points in n-dimensional space (https://github.com/mp4096/discrete-frechet-distance)
Fresnel integral
Accelerate the fill step in predicting the lowest free energy structure and a set of suboptimal structures (http://rna.urmc.rochester.edu/Text/Fold-cuda.html)
A GPU-accelerated implementation of a genetic algorithm for finding well-performing finite-state machines for predicting binary sequences (https://userweb.cs.txstate.edu/~burtscher/research/FSM_GA/)
Fast Walsh transformation (http://docs.nvidia.com/cuda/cuda-samples/index.html)
Gene alignment (https://github.com/NUCAR-DEV/Hetero-Mark)
Gabor filter function (https://github.com/fercer/gaborfilter)
Gamma correction (https://github.com/intel/BaseKit-code-samples)
Gaussian elimination in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Graph coloring via shortcutting (https://userweb.cs.txstate.edu/~burtscher/research/ECL-GC/)
Gradient descent (https://github.com/CGudapati/BinaryClassification)
Matrix transpose using the BLAS-extension functions (https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-geam)
Find the least squares solution of a batch of overdetermined systems (https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp)
Apply the Gaussian error linear units function (https://github.com/NVIDIA/FasterTransformer)
Low-precision matrix-vector multiply functions (https://github.com/wangsiping97/FastGEMV)
Geodesic distance (https://www.osti.gov/servlets/purl/1576565)
General-purposed sparse matrix-matrix multiplication on GPUs for graph neural networks (https://github.com/hgyhungry/ge-spmm)
Implementation of a Gibbs-Metropolis sampling algorithm (https://github.com/arendsee/cuda-gibbs-example)
The gated linear unit function (https://pytorch.org/docs/stable/generated/torch.nn.GLU.html)
Expectation maximization with Gaussian mixture models (https://github.com/Corv/CUDA-GMM-MultiGPU)
Simulate the dynamics of a small part of a cardiac myocyte, specifically the fast sodium m-gate (https://github.com/LLNL/goulash)
General Plasmon Pole Self-Energy Simulation the BerkeleyGW software package (https://github.com/NERSC/gpu-for-science-day-july-2019)
A demonstration of graphs creation, instantiation and launch (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/simpleCudaGraphs)
Regular expression matching (https://github.com/bkase/CUDA-grep)
General-relativistic radiative transfer calculations coupled with the calculation of geodesics in the Kerr spacetime (https://github.com/hungyipu/Odyssey)
Forward operations of a gated recurrent unit (https://pytorch.org/)
The HACC microkernel (https://asc.llnl.gov/CORAL-benchmarks/#haccmk)
Parallel halo finder operation (https://gem5.googlesource.com/public/gem5-resources)
Hausdorff distance (https://github.com/arohamirai/Hausdorff-Distance-Match)
Haversine distance (https://github.com/rapidsai/cuspatial)
Hybrid methods for parallel betweenness centrality (https://github.com/Adam27X/hybrid_BC)
Heart Wall in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
The heat equation solver (https://github.com/UoB-HPC/heat_sycl)
Discrete 2D laplacian operation a number of times on a given vector (https://github.com/gpucw/cuda-lapl)
Hellinger distance (https://github.com/rapidsai/raft)
Henry coefficient (https://github.com/CorySimon/HenryCoefficient)
A Portable and Scalable Solver-Framework for the Hierarchical Equations of Motion (https://github.com/noma/hexciton_benchmark)
Histogram (http://github.com/NVlabs/cub/tree/master)
Hidden markov model (http://developer.download.nvidia.com/compute/DevZone/OpenCL/Projects/oclHiddenMarkovModel.tar.gz)
The benchmark implements the kernel of the Hogbom Clean deconvolution algorithm (https://github.com/ATNF/askap-benchmarks/)
The transient thermal differential equation solver from HotSpot (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Extension to hotspot with 3D transient thermal simulation (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
High-Performance Linpack benchmark implementation (https://github.com/oneapi-src/Velocity-Bench)
Fast block distributed Implementation of the Hungarian Algorithm (https://github.com/paclopes/HungarianGPU)
1D Haar wavelet transformation (https://github.com/OpenCL/AMD_APP_samples)
Hybridsort in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
A routine from the ExpCNS Compressible Navier-Stokes mini-application (https://github.com/pssrawat/ppopp-artifact)
Fast interger divide (https://github.com/milakov/int_fastdiv)
Interleaved and non-interleaved global memory accesses (Shane Cook. 2012. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs (1st. ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.)
Interval arithmetic operators example (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Type casting intrinsic functions (https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__CAST.html)
The inverse kinematics for 2-joint arm (http://axbench.org/)
Integer sort (https://github.com/GMAP/NPB-GPU)
Monte-Carlo simulations of 2D Ising Model (https://github.com/NVIDIA/ising-gpu/)
Isotropic 2-dimensional Finite Difference (https://github.com/intel/HPCKit-code-samples/)
Jaccard index for a sparse matrix (https://github.com/rapidsai/nvgraph/blob/main/cpp/src/jaccard_gpu.cu)
Jacobi relaxation (https://github.com/NVIDIA/multi-gpu-programming-models/blob/master/single_gpu/jacobi.cu)
Bob Jenkins lookup3 hash function (https://android.googlesource.com/platform/external/jenkins-hash/+/75dbeadebd95869dd623a29b720678c5c5c55630/lookup3.c)
Kalman filter (https://github.com/rapidsai/cuml/)
A Keccak tree hash function (http://sites.google.com/site/keccaktreegpu/)
Keogh's lower bound (https://github.com/gravitino/cudadtw)
Kernel launch with small, medium and large kernel arguments (https://github.com/ROCm/hip-tests)
Kernel matrix compute (https://github.com/MKLab-ITI/CUDA)
K-means in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
K-nearest neighbor (https://github.com/OSU-STARLAB/UVM_benchmark/blob/master/non_UVM_benchmarks)
Compute the kurtosis of two variables (https://github.com/d-d-j/ddj_store)
Lanczos tridiagonalization (https://github.com/linhr/15618)
Count planar Langford sequences (https://github.com/boris-dimitrov/z4_planar_langford_multigpu)
A Laplace solver using red-black Gaussian Seidel with SOR solver (https://github.com/kyleniemeyer/laplace_gpu)
Solve Laplace equation on a regular 3D grid (https://github.com/gpgpu-sim/ispass2009-benchmarks)
LavaMD in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Apply layer normalization over a batch of inputs (https://github.com/karpathy/llm.c)
AoS and SoA comparison (https://github.com/OpenCL/AMD_APP_samples)
Landau collisional integral (https://github.com/vskokov/Landau_Collisional_Integral)
Latent Dirichlet allocation (https://github.com/js1010/cusim)
QC-LDPC decoding (https://github.com/robertwgh/cuLDPC)
Estimate the Lebesgue constant (https://people.math.sc.edu/Burkardt/c_src/lebesgue/lebesgue.html)
Leukocyte in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Marsa-LFIB4 pseudorandom number generator (https://bitbucket.org/przemstp/gpu-marsa-lfib4/src/master/)
A LIBOR market model Monte Carlo application (https://people.maths.ox.ac.uk/~gilesm/cuda_old.html)
GPU solver for a 2D lid-driven cavity problem (https://github.com/kyleniemeyer/lid-driven-cavity_gpu)
A leaky integrate-and-fire neuron model (https://github.com/e2crawfo/hrr-scaling)
A simple lock-free hash table (https://github.com/nosferalatu/SimpleGPUHashTable)
Approximate the log2 math function (https://adacenter.org/sites/default/files/milspec/Transcendentals.zip)
GPU-based X-Drop alignment (https://github.com/albertozeni/LOGAN)
Gate-level logic re-simulation (https://github.com/ceKyleLee/cad20-tgls)
Parallel logic rewriting (https://github.com/cuhk-eda/CULS)
Convert logits to probabilities (https://github.com/NVIDIA/FasterTransformer)
Lomb-Scargle periodogram (https://github.com/rapidsai/cusignal/)
Lookback option simulation (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application)
Linear regression (https://github.com/ChenyangZhang-cs/iMLBench)
Local response normalization (intel/llvm#8292)
Linear scaling quantum transport (https://github.com/brucefan1983/gpuqt)
LU decomposition in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
LU factorization of a batch of small matrices (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Livermore unstructured Lagrangian explicit shock hydrodynamics (https://github.com/LLNL/LULESH)
Efficient LZSS compression solution for multi-byte data on GPUs (https://github.com/hipdac-lab/ICS23-GPULZ)
Memory allocation and deallocation samples (https://github.com/ROCm-Developer-Tools/HIP/)
Calculations of the Mandelbrot set (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming)
A practical isosurfacing algorithm for large data on many-core architectures (https://github.com/LRLVEC/MarchingCubes)
Masking operators in Pytorch (https://pytorch.org/)
Compute matching scores for two 16K 128D feature points (https://github.com/Celebrandil/CudaSift)
Sum using the Matern kernel (https://tbetcke.github.io/hpc_lecture_notes/rbf_evaluation.html)
In-place matrix rotation
Matrix transposition (https://docs.nvidia.com/cuda/cuda-samples/index.html)
3D Maxpooling (https://github.com/nachiket/papaa-opencl)
Maximum floating-point operations in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Monte Carlo and Molecular Dynamics Simulation Package (https://github.com/khavernathy/mcmd)
Multi-category probit regression (https://github.com/berkeley-scf/gpu-workshop-2016)
Molecular dynamics function in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Simple multiple Debye-Huckel kernel in fast molecular electrostatics algorithms on GPUs (http://gpugems.hwu-server2.crhc.illinois.edu/)
MD5 hash function in the SHOC benchmark suite (https://github.com/vetter/shoc/)
Mean shift clustering (https://github.com/w00zie/mean_shift)
Two-dimensional 3x3 median filter of RGBA image (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Merkle tree construction using rescue prime hash (https://github.com/itzmeanjan/ff-gpu)
A benchmark for memory copy between a host and a device
Selected memory tests (https://github.com/ComputationalRadiationPhysics/cuda_memtest)
Merge two unsorted arrays into a sorted array (https://github.com/ogreen/MergePathGPU)
Simulation of an ensemble of replicas with Metropolis–Hastings computation in the trial run (https://github.com/crinavar/trueke)
Matrix factorization with stochastic gradient descent (https://github.com/cuMF/cumf_sgd)
Evaluate the Michalewicz function (https://www.sfu.ca/~ssurjano/michal.html)
MiniFE Mantevo mini-application (https://github.com/Mantevo/miniFE)
The core computation of the Bristol University Docking Engine (BUDE) (https://github.com/UoB-HPC/miniBUDE)
Hardware acceleration of long read pairwise overlapping in genome sequencing (https://github.com/UCLA-VAST/minimap2-acceleration)
A finite difference solver for seismic modeling (https://github.com/rsrice/gpa-minimod-artifacts)
A deterministic Sn radiation transport miniapp (https://github.com/wdj/minisweep)
A parallel programming training mini-app simulating weather-like flows (https://github.com/mrnorman/miniWeather)
Minkowski distance (https://github.com/rapidsai/raft)
Find the smallest and largest elements (https://github.com/rapidsai/cuspatial)
Maximal independent set (http://www.cs.txstate.edu/~burtscher/research/ECL-MIS/)
A read-only version of mixbench (https://github.com/ekondis/mixbench)
MTTKRP kernel using mixed-mode CSF (https://github.com/isratnisa/MM-CSF)
Chapter 4.2: Converting CUDA CNN to HIP (https://developer.amd.com/wp-content/resources)
Morphological operators: Erosion and Dilation (https://github.com/yszheda/CUDA-Morphology)
The Miller-Rabin primality test (https://github.com/wizykowski/miller-rabin)
Computation of a matrix Q used in a 3D magnetic resonance image reconstruction (https://github.com/abduld/Parboil/blob/master/benchmarks/mri-q/DESCRIPTION)
Margin ranking criterion operation (https://pytorch.org)
Uniform mrg32k3a pseudorandom number generation using Host API (https://github.com/NVIDIA/CUDALibrarySamples/)
Mersenne Twister (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Move-to-front transform (https://github.com/bzip2-cuda/bzip2-cuda)
Multi-material simulations (https://github.com/reguly/multimaterial)
Multinomial sampling (https://pytorch.org)
MurmurHash3 yields a 128-bit hash value (https://github.com/aappleby/smhasher/wiki/MurmurHash3)
Myocte in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Computing non-bonded pair interactions (https://manual.gromacs.org/current/doxygen/html-full/page_nbnxm.xhtml)
Nbody simulation (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL/N-BodyMethods/Nbody)
Normal estimation in 3D (https://github.com/PointCloudLibrary/pcl)
The negative log likelihood 2D loss reduction (https://pytorch.org/)
Work-efficient parallel non-maximum suppression kernels (https://github.com/hertasecurity/gpu-nms)
Nearest neighbor in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Return a tensor containing the indices of all non-zero elements of input (https://pytorch.org/)
Compute the Euclidean norm of a vector (https://docs.nvidia.com/cuda/cublas)
Stream synchronization in Thrust and oneDPL (https://github.com/NVIDIA/thrust/tree/main/examples)
N-Queens (https://github.com/tcarneirop/ChOp)
Number-theoretic transform (https://github.com/vernamlab/cuHE)
Needleman-Wunsch in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Multi-threading over a single device (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Optical flow estimation (https://github.com/NVIDIA/cuda-samples/blob/master/Samples/5_Domain_Specific/HSOpticalFlow)
Overlap data copies with compute kernels (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Overlay grid in the DetectNet (https://github.com/dusty-nv/jetson-inference)
Simple peer-to-peer accesses (https://docs.nvidia.com/cuda/cuda-samples/index.html)
PointPillar post-processing (https://github.com/NVIDIA-AI-IOT/CUDA-PointPillars)
In-place padding (https://github.com/chai-benchmarks/chai)
PageRank (https://github.com/Sable/Ostrich/tree/master/map-reduce/page-rank)
Particle diffusion in the HPCKit code samples (https://www.intel.com/content/www/us/en/developer/articles/code-sample/oneapi-dpcpp-compiler-example-particle-diffusion.html)
Particle Filter in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Particles collision simulation (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
PathFinder in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Compute pairwise Pearson’s correlation coefficient (https://github.com/pcdslab/Fast-GPU-PCC)
Perlin noise generator (https://github.com/silverweed/perlin_cuda)
Parallel implementation of the permutation testing in NIST SP 800-90B (https://github.com/yeah1kim/yeah_GPU_SP800_90B_IID)
A tensor with its dimensions permuted (https://github.com/karpathy/llm.c)
Perplexity search (https://github.com/rapidsai/cuml/)
Pair hidden Markov model (https://github.com/lienliang/Pair_HMM_forward_GPU)
Two MPI ranks pass data back and forth (https://github.com/olcf-tutorials/MPI_ping_pong)
Pitched memory allocation (https://docs.nvidia.com/cuda/cuda-c-programming-guide)
Solve the point-in-polygon problem using the crossing number algorithm (https://github.com/benvanwerkhoven/kernel_tuner)
Petri-net simulation (https://github.com/abduld/Parboil/)
Fused point-wise operations (https://developer.nvidia.com/blog/optimizing-recurrent-neural-networks-cudnn-5/)
Pooling layer (https://github.com/PaddlePaddle/Paddle)
Implementations of population count (Jin, Z. and Finkel, H., 2020, May. Population Count on Intel® CPU, GPU and FPGA. In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 432-439). IEEE.)
Concurrent managed accesses (https://github.com/ROCm-Developer-Tools/HIP/)
Lightweight cryptography (https://github.com/bozhu/PRESENT-C/blob/master/present.h)
Calculate a partition function for a sequence, which can be used to predict base pair probabilities (http://rna.urmc.rochester.edu/Text/partition-cuda.html)
Projectile motion is a program that implements a ballistic equation (https://github.com/intel/BaseKit-code-samples)
A modified implementation of particle swarm optimization using Levy function (https://github.com/wiseodd/cuda-pso, https://github.com/chensohg/GPU_CUDA_PSO)
A quartic equation minimizer (https://github.com/qureshizawar/CUDA-quartic-solver)
Generate query, key and value vectors over a batch of inputs (https://github.com/karpathy/llm.c, https://github.com/mspronesti/llm.sycl)
Niederreiter quasirandom number generator and Moro's Inverse Cumulative Normal Distribution generator (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Quality threshold clustering (https://github.com/vetter/shoc/)
8-bit quantization (https://github.com/bitsandbytes-foundation/bitsandbytes/tree/main)
A parallel radix sort (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
A library-based sort-by-key (https://github.com/NVIDIA/cuda-samples)
A fast rainflow cycle counting algorithm (https://github.com/carlos-souto/rainflow-cycle-counting)
Random memory access (https://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/)
A finite-difference scheme for 2D Rayleigh-Benard thermal convection (https://github.com/sanchda/2D_RB_convec/tree/main)
3D Gray-Scott reaction diffusion (https://github.com/ifilot/wavefuse)
2-dimensional Gaussian Blur Filter of RGBA image (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Map unique values to indices (https://pytorch.org/)
Rectified linear unit (https://github.com/tensorflow)
Resize images (https://github.com/opencv/)
ResNet kernels for inference (https://github.com/xuqiantong/CUDA-Winograd)
Reverse an input array of size 256 using shared memory
Reverse an input matrix along rows or columns (https://github.com/rapidsai/cuml)
Reproducible floating sum (https://github.com/facebookarchive/fbcuda)
Non-P2P transfers in a circular manner among GPU devices
Computes a run-length encoding of a sequence (https://github.com/NVIDIA/cub)
Random number generation using the Wallace algorithm (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application)
Rodrigues' rotation (https://github.com/DIDSR/VICTRE_MCGPU)
Romberg's method (https://github.com/SwayambhuNathRay/Parallel-Romberg-Integration)
Compute row-wise moments (https://pytorch.org/)
Rotary position encoding (https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary)
A proxy application for full neutron transport application like OpenMC that support multipole cross section representations (https://github.com/ANL-CESAR/RSBench/)
Random sample consensus based on task partitioning (https://github.com/chai-benchmarks/chai)
Rectilinear Steiner minimum tree (https://userweb.cs.txstate.edu/~burtscher/research/SFP/)
A structured-grid applications in the oil and gas industry (https://github.com/ROCm-Developer-Tools/HIP-Examples/tree/master/rtm8)
An ODE solver using the Rush-Larsen scheme (https://bitbucket.org/finsberg/gotran/src/master)
Chemical rates computation used in the simulation of combustion (https://github.com/vetter/shoc/)
Stacked 8-neighborhood search finds nearest neighbors in each of the eight octants partitioned by ordering of three coordinates (https://github.com/MVIG-SJTU/pointSIFT/tree/master)
Dynamic parallel skew algorithm for suffix array on GPU (https://github.com/gmzang/Parallel-Suffix-Array-on-GPU)
Naive template matching with SAD
Shapley sampling values explanation method (https://github.com/rapidsai/cuml)
Perform the SAXPY operation on host and device (https://github.com/pc2/OMP-Offloading)
Stream compaction (https://github.com/chai-benchmarks/chai)
Scan with bank-conflict-aware optimization (https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)
Scan a large array (https://github.com/OpenCL/AMD_APP_samples)
Scan a large array using vendors' libraries (https://github.com/NVIDIA/cub)
Sigmoid cross-entropy with logits (https://pytorch.org/)
Find the top scores (https://github.com/opencv/)
Batched dense matrix - dense matrix multiplication into sparse matrix (https://docs.nvidia.com/cuda/cusparse/index.html#cusparsesddmm)
Part of BIP39 solver
Seam carving (https://github.com/pauty/CUDA_seam_carving)
Segmented reduction using Thrust and oneDPL (https://github.com/c3sr/tcu_scope)
Fast segmented sort on a GPU (https://github.com/Funatiq/bb_segsort)
Plasma sheath simulation with the particle-in-cell method (https://www.particleincell.com/2016/cuda-pic/)
The shared local memory microbenchmark (https://github.com/ekondis/gpumembench)
Shuffle instructions with subgroup sizes of 8, 16, and 32 (https://github.com/cpc/hipcl/tree/master/samples/4_shfl)
Set intersection with matrix multiply (https://github.com/chribell/set_intersection)
The attentuation of neutron fluxes across an individual geometrical segment (https://github.com/ANL-CESAR/SimpleMOC-kernel)
Execute kernels on multiple devices (https://docs.nvidia.com/cuda/cuda-samples/index.html)
Simple sparse matrix vector multiply (https://github.com/passlab/CUDAMicroBench)
Slit experiment to compute diffraction patterns (https://github.com/bamaratunga/cuda_fft.git)
Sparse LU factorization (https://github.com/sheldonucr/GLU_public)
Genome pre-alignment filtering (https://github.com/CMU-SAFARI/SneakySnake)
Accelerating sparse neural network inference via compression at inference time (https://github.com/IDEA-CUHK/SNICIT)
Sobel filter (https://github.com/OpenCL/AMD_APP_samples)
Sobol quasi-random generator (https://docs.nvidia.com/cuda/cuda-samples/index.html)
The softmax function (https://github.com/pytorch/glow/tree/master/lib/Backends/OpenCL)
The softmax function with scaling and masking (https://github.com/Dao-AILab/flash-attention/tree/main/csrc/fused_softmax)
The online softmax function with scaling (https://github.com/karpathy/llm.c)
Radix sort in the SHOC benchmark suite(https://github.com/vetter/shoc/)
Sort by key using Thrust and oneDPL
Second-order IIR digital filtering (https://github.com/rapidsai/cusignal/)
A miniapp for the CoMet comparative genomics application (https://github.com/wdj/sparkler)
The simple n^2 SPH simulation (https://github.com/olcf/SPH_Simple)
The split operation in radix sort (http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Image registration calculations for the statistical parametric mapping (SPM) system (http://mri.ee.ntust.edu.tw/cuda/)
Conversion of a dense matrix to a sparse matrix (https://docs.nvidia.com/cuda/cusparse/index.html#cusparsedensetosparse)
Library-based out-of-place transpose of a sparse matrix (https://docs.nvidia.com/cuda/cusparse/index.html, https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2023-1/oneapi-mkl-sparse-omatcopy.html)
Library-based sparse matrix - dense matrix multiplication (https://docs.nvidia.com/cuda/cusparse/index.html)
Library-based sparse matrix - sparse matrix multiplication (https://docs.nvidia.com/cuda/cusparse/index.html)
Library-based sparse matrix - dense vector multiplication (https://docs.nvidia.com/cuda/cusparse/index.html)
Count the number of nonzero elements per row or column and the total number of nonzero elements in a dense matrix (https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-t-nnz)
Conversion of a sparse matrix to a dense matrix (https://docs.nvidia.com/cuda/cusparse/index.html#cusparsesparsetodense)
Sort a sparse matrix represented in CSR format (https://docs.nvidia.com/cuda/cusparse/index.html#cusparsexcsrsort)
A thread-Level synchronization-free sparse triangular solver (https://github.com/JiyaSu/CapelliniSpTRSV)
SRAD (version 1) in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
String search (https://github.com/OpenCL/AMD_APP_samples)
Compute structural similarity index measure (https://github.com/VIDILabs/instantvnr)
Stochastic shotgun search for Dirichlet process mixtures of Gaussian graphical models (https://github.com/mukherjeec/DPmixGGM/)
The single-source shortest path (https://github.com/chai-benchmarks/chai)
Standard deviation (https://github.com/rapidsai/raft)
1D stencil (https://www.olcf.ornl.gov/wp-content/uploads/2019/12/02-CUDA-Shared-Memory.pdf)
3D stencil (https://github.com/LLNL/cardioid)
Streamcluster in the Rodinia benchmark suite (http://lava.cs.virginia.edu/Rodinia/download_links.htm)
Evaluate the overhead of create, copy, and destroy using streams (https://github.com/ROCm-Developer-Tools/HIP/)
Demonstrate stream-ordered memory allocation and deallocation (https://github.com/NVIDIA/cuda-samples/)
Demonstrate basic use of stream priorities (https://github.com/NVIDIA/cuda-samples/)
Demonstrate the use of OpenMP and streams with unified memory on a single device (https://github.com/NVIDIA/cuda-samples/)
Spatial-temporal Savitzky-Golay method for reconstructing high-quality NDVI time series (https://github.com/HPSCIL/cuSTSG)
Lattice QCD SU(3) matrix-matrix multiply microbenchmark (https://gitlab.com/NERSC/nersc-proxies/su3_bench)
Surfel rendering (https://github.com/jstraub/cudaPcl)
Compute the singular value decomposition of 3x3 matrices (https://github.com/kuiwuchn/3x3_SVD_CUDA)
SW4 curvilinear kernels are five stencil kernels that account for ~50% of the solution time in SW4 (https://github.com/LLNL/SW4CK)
The Swish activate functions (https://pytorch.org/)
A demo of tensor accessors in Pytorch (https://pytorch.org/)
Tensor transposition (https://github.com/Jokeren/GPA-Benchmark/tree/master/ExaTENSOR)
A proxy for the SNAP force calculation in the LAMMPS molecular dynamics package (https://github.com/FitSNAP/TestSNAP)
Solve tridiagonal systems of equations using the Thomas algorithm (https://pm.bsc.es/gitlab/run-math/cuThomasBatch/tree/master)
Memory fence function (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions)
Accumulate contributions of tissue source strengths and previous solute levels to current tissue solute levels (https://github.com/secomb/GreensTD19_GPU)
Tone mapping (https://github.com/OpenCL/AMD_APP_samples)
The 2-point correlation function (https://users.ncsa.illinois.edu/kindr/projects/hpca/index.html)
Simulation of a task queue system (https://github.com/chai-benchmarks/chai)
Triad in the SHOC benchmark suite(https://github.com/vetter/shoc/)
Matrix solvers for large number of small independent tridiagonal linear systems(http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html)
Trotter-Suzuki approximation (https://bitbucket.org/zzzoom/trottersuzuki/src/master/)
Solving the symmetric traveling salesman problem with iterative hill climbing (https://userweb.cs.txstate.edu/~burtscher/research/TSP_GPU/)
Unfold the view of original tensor as slices (https://pytorch.org/)
Uniform random noise generator (https://github.com/OpenCL/AMD_APP_samples)
Genuchten conversion of soil moisture and pressure (https://github.com/HydroComplexity/Dhara)
Computes expectation values (6D integrals) associated with the helium atom (https://github.com/wadejong/Summer-School-Materials/tree/master/Examples/vmc)
Volume-to-column transform (https://pytorch.org/)
Demonstrate the usage of the vote intrinsics (https://github.com/NVIDIA/cuda-samples/)
Transform a point tensor to a sparse tensor (https://github.com/NVIDIA-AI-IOT/Lidar_AI_Solution)
Rearrange data partitioned across a warp (https://github.com/NVIDIA/cub)
Sort small numbers (https://github.com/facebookarchive/fbcuda)
Compute mean and variance using the Welford algorithm (https://github.com/hpcaitech/ColossalAI)
Winograd convolution (https://github.com/ChenyangZhang-cs/iMLBench)
Compute spring forces in a worm-like chain model with a power function (https://github.com/AnselGitAccount/USERMESO-2.0)
Simple warp matrix multiply-accumulate operations (https://github.com/ROCm/rocWMMA)
Implementation of word2vec with Continuous Bag-of-Words (https://github.com/cudabigdata/word2vec_cuda)
Count the number of words in a text (https://github.com/NVIDIA/thrust/blob/main/examples/)
Parallel weather and research forecast single moment 5-class (https://github.com/gpgpu-sim/ispass2009-benchmarks/tree/master/wp)
List ranking with Wyllie's algorithm (Rehman, M. & Kothapalli, Kishore & Narayanan, P.. (2009). Fast and Scalable List Ranking on the GPU. Proceedings of the International Conference on Supercomputing. 235-243. 10.1145/1542275.1542311.)
Hartree-Fock self-consistent-field (SCF) calculation of H2O (https://github.com/recoli/XLQC)
A proxy application for full neutron transport application like OpenMC (https://github.com/ANL-CESAR/XSBench/)
Kernels may read and write directly to pinned system memory from a user perspective (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/simpleZeroCopy)
Find zero-points and scales in quantization (https://pytorch.org/)
3D complex FFT in a 256^3 cube (https://github.com/spiral-software/fftx)
Zoom in and zoom out an image (https://github.com/rapidsai/cucim)
Authored and maintained by Zheming Jin (https://github.com/zjin-lcf)
Abhishek Bagusetty, Andrew Barker, Andrey Alekseenko, Anton Gorshkov, Beau Johnston, Bernhard Esslinger, Bert de Jong, Chengjian Liu, Chris Knight, Daine McNiven, David Oro, David Sanchez, Douglas Franz, Edson Borin, Gabriell Araujo, Georgi Mirazchiyski, Henry Linjamäki, Henry Gabb, Hugh Delaney, Ian Karlin, Istvan Reguly, Jack Kirk, Jason Lau, Jeff Hammond, Jenny Chen, Jeongnim Kim, Jianxin Qiu, Jakub Chlanda, Jiya Su, John Tramm, Ju Zheng, Junchao Zhang, Kali Uday Balleda, Kinman Lei, Luke Drummond, Martin Burtscher, Matthias Noack, Michael Kruse, Michel Migdal, Mike Franusich, Mike Giles, Mikhail Dvorskiy, Mohammed Alser, Muhammad Haseeb, Muaaz Awan, Nevin Liber, Nicholas Miller, Pavel Samolysov, Pedro Valero Lara, Piotr Różański, Rahulkumar Gayatri, Samyak Gangwal, Shaoyi Peng, Steffen Larsen, Rafal Bielski, Robert Harrison, Robin Kobus, Rod Burns, Rodrigo Vimieiro, Romanov Vlad, Tadej Ciglarič, Thomas Applencourt, Tiago Carneiro, Tiago Cogumbreiro, Timmie Smith, Tobias Baumann, Usman Roshan, Wayne Joubert, Ye Luo, Yongbin Gu, Zhe Chen
Codeplay® and Intel® for their contributions to the oneAPI ecosystem
The project uses resources at the Intel® DevCloud, the Chameleon testbed supported by the National Science Foundation, the Argonne Leadership Computing Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357, and the Experimental Computing Laboratory (ExCL) at Oak Ridge National Laboratory supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.