CUDA, SYCL, Codeplay, and oneAPI — Accelerators for Everyone | by TonyM | Jul, 2022

CUDA and SYCL — a hands-on test walkthrough

Photo by Pakata Goh on Unsplash

There is an ever-growing variety of accelerators in the world. This raises the question of how the various ecosystems will evolve to let programmers leverage these accelerators. At higher levels of abstraction, domain-specific layers like TensorFlow and PyTorch provide great abstractions over the underlying hardware. However, for developers who maintain code that talks to the hardware without such an abstraction, the challenge still exists. One solution that is supported on multiple underlying hardware architectures is C++ with SYCL. Here is the description of SYCL from the Khronos Group webpage:

SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that enables code for heterogeneous processors to be written using standard ISO C++ with the host and kernel code for an application contained in the same source file.

This sounds pretty good, assuming you:

  1. are comfortable reading a C++-like language
  2. want the flexibility of not being tied to your underlying hardware vendor
  3. are starting your code from scratch

Since people who program at this level of the stack are already using C++, it is safe to say that the first assumption is reasonable, and if you are still reading, I will assume the second as well. However, the third point is very often not the case. For the rest of this story, we will discuss how to take a CUDA code, migrate it to SYCL, and then run it on multiple types of hardware, including an NVIDIA GPU.

To test how viable this is, we will be using a series of freely available tools including SYCLomatic, the oneAPI Base Toolkit, and the Codeplay oneAPI for CUDA compiler. For information on supported CUDA versions for these tools, please see the Intel DPC++ Compatibility Tool Release Notes and the oneAPI for CUDA Getting Started Guide.

For reference, I am using my personal Intel Alder Lake Core i9-12900KF Alienware R13 system with a 3080 Ti GPU. My software stack is a Windows 11 Pro system, and I am developing in Ubuntu 20.04 via the Windows Subsystem for Linux (WSL).

CUDA is commonly used to program for general-purpose computing on GPUs. The drawback is that it only runs on NVIDIA GPUs. To help migrate from CUDA to SYCL, we will be leveraging the Intel DPC++ Compatibility Tool. Note that Intel open-sourced the technology behind the DPC++ Compatibility Tool into the SYCLomatic project to further advance the migration capabilities for producing more SYCL-based applications.

There is already a nice write-up posted on Stack Overflow and the Intel Developer Zone that walks through taking the jacobiCudaGraphs implementation from the cuda-samples GitHub repository and migrating the code, so rather than retype it I will just link it here.

Note that if you just want to see how Codeplay's oneAPI for CUDA compiler works for SYCL code, you can skip this tutorial and see the final code in the oneAPI-samples repository on GitHub here.

Once you have migrated to SYCL code, you should be able to run it on a variety of hardware. Let's put that to the test, shall we? To make this walkthrough easier, you can check out the oneAPI-samples directory, which already includes the migrated code:

For comparison purposes, here is the output when I follow the cuda-samples instructions to build and run the code:

Intel provides a SYCL-based compiler in the oneAPI Base Toolkit, which is available here:

Because I am using Ubuntu, I just followed the instructions to do an APT-based install. Make sure to add the compiler paths to your workspace by running:

> source /opt/intel/oneapi/setvars.sh

I then went to the oneAPI-samples/DirectProgramming/DPC++/DenseLinearAlgebra/jacobi_iterative/sycl_dpct_migrated/src directory and compiled the code using the Intel DPC++ SYCL compiler:

> dpcpp -o jacobiSyclCPU main.cpp jacobi.cpp -I ../Common/

Note that the include of ../Common comes from the SYCLomatic workflow, which creates some helper files to enable my migration from CUDA to SYCL. My executable in this case is jacobiSyclCPU, so let's give that a run:

Looking at the output, there are a couple of things to consider:

  1. The SYCL version of the code is compiled with Intel's DPC++ compiler, while the cuda-samples code is compiled with the GNU compiler.
  2. The migrated SYCL code, which was originally parallelized for the GPU, is now running on the CPU and is slower than the serialized version of the code. This is due to memory buffer setup and synchronization that is not required in the serialized version.
  3. The text "GPU ***" is incorrect because I just migrated the code and did not change the text to reflect that I am targeting a CPU in this case.

Now that we have seen that we can run SYCL code on the CPU, let's do something more interesting: take the migrated code and see if we can actually run it on my GPU.

The first step was to install the oneAPI for CUDA compiler from Codeplay. I followed their instructions to install and build the compiler, inlined here for your convenience:

git clone https://github.com/intel/llvm.git -b sycl
cd llvm
python ./buildbot/configure.py --cuda -t release --cmake-gen "Unix Makefiles"
cd build
make sycl-toolchain -j `nproc`
make install

WSL2 Tip

A quick aside: while doing this install I noticed it was running quite slowly. After a little bit of asking around, the issue turned out to be that I was running my compilation in WSL against a Windows filesystem instead of the WSL ext4 filesystem. Moving to the native filesystem made it dramatically faster, so that is a good tip. For more details, check out these WSL filesystem benchmarks; they are a bit older but still seem relevant:

https://vxlabs.com/2019/12/06/wsl2-io-measurements/

Compiling and Running for CUDA

With my compiler installed, I am ready to compile to run on an NVIDIA GPU. Due to how the migrated code was generated, I need include files from both the oneAPI DPC++ compiler and Codeplay's compiler in my path:

> source /opt/intel/oneapi/setvars.sh
> export PATH=/home/etmongko/llvm/build/bin:$PATH
> export LD_LIBRARY_PATH=/home/etmongko/llvm/build/lib:$LD_LIBRARY_PATH

Now I run the Codeplay compiler to generate my CUDA-enabled binary:

> clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -DSYCL_USE_NATIVE_FP_ATOMICS -o jacobiSyclCuda main.cpp jacobi.cpp -I ../Common/

This generated the jacobiSyclCuda binary. Let's give it a try!

Uhhhh not good

Ouch! A segmentation fault is not a good start. The good news is that I found the issue with a little debugging. I had both the oneAPI and the Codeplay libraries in my path, which caused some conflicts. To get around them, I opened a new terminal and simply ran:

> export LD_LIBRARY_PATH=/home/etmongko/llvm/build/lib:$LD_LIBRARY_PATH

to include just the Codeplay libraries. After this simple tweak, the SYCL code ran without any problems on my GPU.

It works!

Yay! We can see that the CPU time is similar to the baseline example, and the GPU portion is significantly faster than using the CPU as the accelerator, as in our second example. However, the performance is not quite on par with the native CUDA implementation. This is not surprising, because this was a simple migration from CUDA to SYCL without any optimizations.

C++ with SYCL code can be compiled and run on multiple backends. In this case, we went through CPU and NVIDIA GPU examples, but the beauty of SYCL is that it allows us to leverage other backends as well. As a teaser: next week I will be testing out developer workflows on a brand-new Intel Arc A370M based laptop, so we will see how SYCL enables multiple vendor backends.

As a developer, I think alignment on hardware-agnostic solutions will eventually make our lives easier; we just need to get there. Migrating from CUDA to SYCL is not trivial, but there is a lot of community support around it and the tools are continually getting better.

Wouldn't it be nice if, as new hardware came out, you could run your code on it whenever it happened to be faster, more power-efficient, or cheaper? That is a question for another time and discussion, but it is a nice thought.

Want to Connect?
If you want to see what random tech news I'm reading, you can follow me on Twitter.

Tony is a Software Architect and Technical Evangelist at Intel. He has worked on several software developer tools and most recently led the software engineering team that built the data center platform that enabled Habana's scalable MLPerf solution.

Intel, the Intel logo and other Intel marks are trademarks of Intel Corporation or its subsidiaries. SYCL is a trademark of the Khronos® Group. Other names and brands may be claimed as the property of others.
