Highly efficient matrix transpose in Mojo


In this blogpost I will show you, step by step, how to implement a highly efficient transpose kernel for the target GPU architecture using Mojo. The best kernel achieves...

After allocating the shared memory, we define the upper-left coordinate of the tile using x and y, and get the row and column the current thread is responsible for. For a more detailed explanation of what swizzling is and how it works, please see my previous blogpost on matrix transpose; the concept is the same for Mojo. I hope this blogpost showed you how to achieve high performance on a common task in GPU computing using Mojo.
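To make the tile coordinates and the swizzling idea concrete, here is a minimal sketch in plain Python that models only the index arithmetic, not the actual Mojo kernel. Everything in it is an assumption for illustration: the tile size of 32, the helper names (`tile_origin`, `thread_row_col`, `swizzle`, `transpose_via_tile`, `banks_touched`), and the XOR-based swizzle, which is one common scheme and may differ from the one used in the original post.

```python
TILE = 32  # assumed tile edge length (must be a power of two for the XOR swizzle)


def tile_origin(block_x, block_y, tile=TILE):
    """Upper-left (x, y) coordinate of the tile handled by one block."""
    return block_x * tile, block_y * tile


def thread_row_col(thread_idx, tile=TILE):
    """Row and column inside the tile that a flat thread index is responsible for."""
    return thread_idx // tile, thread_idx % tile


def swizzle(row, col):
    """Hypothetical XOR swizzle: permutes the columns within each row."""
    return col ^ row


def transpose_via_tile(matrix, tile=TILE):
    """Transpose a matrix tile by tile through a swizzled staging buffer.

    Assumes both dimensions are multiples of `tile`.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[None] * n_rows for _ in range(n_cols)]
    for block_y in range(n_rows // tile):
        for block_x in range(n_cols // tile):
            x, y = tile_origin(block_x, block_y, tile)
            smem = [[None] * tile for _ in range(tile)]  # stands in for shared memory
            # Load phase: each "thread" writes its element to a swizzled column.
            for t in range(tile * tile):
                row, col = thread_row_col(t, tile)
                smem[row][swizzle(row, col)] = matrix[y + row][x + col]
            # Store phase: each "thread" reads the transposed element back.
            for t in range(tile * tile):
                row, col = thread_row_col(t, tile)
                out[x + row][y + col] = smem[col][swizzle(col, row)]
    return out


def banks_touched(row, use_swizzle, tile=TILE):
    """Distinct banks hit when `tile` threads read one column of the tile.

    Assumes one 4-byte element per bank slot and `tile` banks.
    """
    return {
        (swizzle(col, row) if use_swizzle else row) % tile
        for col in range(tile)
    }
```

In this model, a column read in the store phase hits a single bank without the swizzle and spreads across all 32 banks with it, which is the effect swizzling is meant to achieve on real shared memory.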

Related news:

AMD ROCm 7.0 To Align HIP C++ "Even More Closely With CUDA"

Linux 6.16 Will Be Able To Exit User Mode Faster: 2~11% Improvement

CubeCL: GPU Kernels in Rust for CUDA, ROCm, and WGPU