It's definitely possible; I am not arguing against that.
I am just saying it's not as flexible or cost-free as it would be on a 'normal' von Neumann-style CPU.
I would love to see Rust-based code that obviates the need to write CUDA kernels (including compiling to different architectures). It feels icky to introduce things like async/await in the context of a GPU programming model, which is very different from the traditional Rust programming model.
At the end of the day, you still have to worry about different architectures and the streaming nature of the hardware.
I am very interested in this topic, so I am curious to learn how the latest GPUs help manage this divergence problem.
My understanding of warp divergence (https://docs.nvidia.com/cuda/cuda-programming-guide/01-intro...) is that you essentially pay the cost of taking both branches.
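A minimal sketch of the cost I mean (my own illustrative kernel, not something from the docs): when lanes within one 32-lane warp disagree on a condition, the hardware executes both paths one after the other, masking off the non-participating lanes each time.

    // Illustrative only: even/odd lanes split within every warp, so each
    // warp executes path A AND path B serially, with the inactive lanes
    // masked off during each pass.
    __global__ void divergent(float* out, const float* in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0) {
            out[i] = in[i] * 2.0f;   // path A
        } else {
            out[i] = in[i] + 1.0f;   // path B, runs after path A completes
        }
    }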
I understand that with newer GPUs you get clever partitioning/pipelining such that block A takes branch A while block B takes branch B, with syncs/barriers, essentially relying on a smart 'oracle' to schedule these in a way that still fits the SIMT model.
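Here is how I picture that restructuring (again my own sketch, with an assumed warp size of 32, not taken from an NVIDIA doc): hoist the condition up to warp granularity so every lane in a given warp agrees, and neither path gets masked. In practice you would also have to partition the data so work needing path A actually lands in the same warps.

    // Illustrative restructuring: branch on the warp index instead of the
    // lane index. All 32 lanes of a warp see the same condition, so each
    // warp runs exactly one path with no masking.
    __global__ void warp_uniform(float* out, const float* in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int warp_id = i / 32;        // identical for every lane in a warp
        if (warp_id % 2 == 0) {
            out[i] = in[i] * 2.0f;   // whole warp takes path A
        } else {
            out[i] = in[i] + 1.0f;   // whole warp takes path B
        }
    }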
It still doesn't feel Turing-complete to me. Is there an NVIDIA doc you can point me to?