Hacker News Re-Imagined

Rust-CUDA: Fast GPU code fully in Rust

  • 184 points
  • 5 hours ago

  • @amadeusine
  • Created a post
  • 51 comments



@booleancoercion 1 hour


sus



@dragontamer 4 hours


https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...

* Missing Atomics -- Gamebreaker IMO. Atomics are absolutely essential when you are dealing with 10,000+ threads on a regular basis. You'll inevitably come across a shared data structure that requires write access from every thread, and some coordination mechanism is needed for that. Atomics are one important such mechanism.

Ironically, a few days ago I argued for the use of fork-join parallelism in most cases (aka: kernel launch / synchronized kernel exits). Now I find myself arguing the opposite, since the topic here is missing atomics. Atomics need to be used very, very rarely, but those rare uses are incredibly important.
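To make the coordination point concrete, here's a minimal CPU-side Rust sketch of the pattern GPU atomics enable: many threads incrementing one shared counter without locks, the CPU analogue of CUDA's atomicAdd. This is plain std Rust for illustration only, not the Rust-CUDA API, and `parallel_count` is a made-up name.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Many threads updating one shared counter without locks --
// the CPU analogue of CUDA's atomicAdd. (Illustrative sketch,
// not a Rust-CUDA API.)
fn parallel_count(threads: u32, per_thread: u32) -> u32 {
    let counter = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Relaxed ordering: cheap, and enough for a plain
                    // counter. This matches the "relaxed atomics to
                    // global memory" case that stays reasonably fast
                    // on GPUs, with no seq-cst/acq-rel barriers.
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    // Every increment survives despite the contention.
    println!("{}", parallel_count(8, 1000)); // prints 8000
}
```

Even with the weakest ordering, no increments are lost; what atomics cost is throughput under contention, which is why the rare uses matter so much.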

* Warp Vote / Match / Reduce / Shuffle missing (Very useful tools for highly-optimized code, but you can write slower code that does the same thing through __shared__ memory just fine)
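For reference, the warp-shuffle reduction mentioned in the bullet above can be sketched sequentially. In real CUDA each array element below would be one lane's register, and __shfl_down_sync would move values between lanes in lockstep with no __shared__ memory traffic; this plain-Rust simulation (assuming a 32-lane warp; `warp_reduce_sum` is an illustrative name, not a Rust-CUDA API) just shows the halving-offset pattern.

```rust
// Sequential simulation of a warp-level sum via the "shuffle down"
// pattern (__shfl_down_sync in CUDA). Each element of `lanes` stands
// for one lane's register; a real warp performs each halving step in
// lockstep. (Illustrative sketch, not a Rust-CUDA API.)
fn warp_reduce_sum(mut lanes: [u32; 32]) -> u32 {
    let mut offset = 16;
    while offset > 0 {
        for lane in 0..(32 - offset) {
            // Lane i reads lane i+offset's register ("shuffle down")
            // and accumulates it into its own.
            lanes[lane] += lanes[lane + offset];
        }
        offset /= 2;
    }
    lanes[0] // lane 0 ends up holding the warp-wide sum
}

fn main() {
    let lanes = [1u32; 32];
    println!("{}", warp_reduce_sum(lanes)); // prints 32
}
```

As the bullet says, the same result is achievable through __shared__ memory, just with extra memory traffic and barriers.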

------

Wait, does this support __shared__ memory at all? Raw access to memory is not really amenable to Rust's programming style, but it's absolutely necessary for high-performance GPU programming.

If this is missing __shared__ memory concepts, then the entire idea of "efficient GPU code" is dead IMO. GPU threads can only communicate quickly over __shared__ memory within an OpenCL workgroup / CUDA block (a workgroup or block is roughly a grouping of up to 1024 "threads" or SIMD lanes).

All other forms of communication are incredibly slow. Atomics are maybe the next-fastest form of communication, but only across __shared__ memory. Relaxed atomics to global memory are reasonably performant, but once you need either seq-cst or acquire/release semantics (aka: the right memory barriers in the right places), things slow down dramatically in GPU-land.

The big issue is that __shared__ memory is only 64kB in size, and it's locked down to workgroups / blocks. On NVIDIA GPUs, __shared__ memory also "eats" into your L1 cache (in fact, __shared__ memory can be thought of as programmer-managed cache: the hardware's caching heuristics just aren't good enough for high-performance GPU programmers, who want to manually manage that high-speed memory for maximum performance).
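The classic use of __shared__ memory is exactly this programmer-managed-cache pattern: stage data in the fast per-block buffer, then combine it with a stride-halving tree, with a __syncthreads() barrier between strides. Here's a sequential plain-Rust simulation of that block reduction (assuming a power-of-two block size; `block_reduce` is an illustrative name, not a Rust-CUDA API):

```rust
// Sequential simulation of the classic CUDA __shared__-memory tree
// reduction. On a GPU, `shared` would be the block's __shared__
// buffer, each `tid` below would be one thread, and each stride
// would end with a __syncthreads() barrier; here we just loop.
// (Illustrative sketch, not a Rust-CUDA API.)
fn block_reduce(shared: &mut [u32]) -> u32 {
    let n = shared.len(); // block size, assumed to be a power of two
    let mut stride = n / 2;
    while stride > 0 {
        // In CUDA: threads with tid < stride run this in parallel,
        // then the whole block synchronizes before the next stride.
        for tid in 0..stride {
            shared[tid] += shared[tid + stride];
        }
        stride /= 2;
    }
    shared[0] // element 0 holds the block-wide sum
}

fn main() {
    let mut buf: Vec<u32> = (1..=8).collect(); // 1 + 2 + ... + 8
    println!("{}", block_reduce(&mut buf)); // prints 36
}
```

The whole tree runs out of the small fast buffer, which is why losing __shared__ memory (or the barriers between strides) kills this kind of kernel.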



@zozbot234 3 hours


How would this compare with Accel (which is also built on the CUDA ecosystem)?



@nextaccountic 2 hours


Could this maybe support AMD's HIP as well?



@Djrhfbfnsks 55 minutes


How does this compare with writing GPU code in Julia?



@esjeon 1 hour


The title says fast, but there are no benchmarks; the README actually puts more emphasis on being a more usable solution than LLVM-PTX + Rust.

I mean, what's the point of "fast" if LLVM-PTX is clunky with Rust in the first place?



@neatze 40 minutes


> An ecosystem of libraries and tools for writing and executing extremely fast GPU code fully in Rust.

Well at least it does not say extremely safe.



@m0zg 2 hours


NVIDIA should hire the guy, then hire whoever he says he wants on the team and let them rip. That's what I'd do if I were in their shoes. Viable paths off C/C++ are badly needed, and currently the only real viable path with an ecosystem and community is Rust.



@sxp 4 hours


> "Extremely fast"

When people make claims like this, it would be good if they put the benchmarks on the first page. E.g., how does it compare with https://github.com/gfx-rs/wgpu, which lets you target Vulkan, Metal, DX, GL, or WASM+WebGPU with Rust?



@gaze 4 hours


The writing is better than I might have produced as a first-year college student, but this needs copy editing. I might suggest that the word "extremely" be removed not only from this repository but from all of the user's repositories. "Fast" might be removed as well; nobody is trying to generate slow GPU code. The salient feature is that one can write the same code as one might write in CUDA with the advantages of Rust's type system (which is indeed useful!). However, there's no speed to be gained by using this relative to CUDA.



@russdpale 4 hours


very cool!




site design / logo © 2021 Box Piper