September 19, 2025
The final optimization techniques that push our matmul kernel past NVIDIA's cuBLAS, reaching 103% of theoretical peak performance on Blackwell GPUs.
September 12, 2025
Diving into advanced optimization techniques, including register allocation, memory coalescing, and pipelining strategies, to reach 85% of peak performance.
September 5, 2025
Exploring Blackwell-specific features like tcgen05 instructions and tensor memory to build faster matrix multiplication kernels.
August 28, 2025
Introducing the fundamentals of GPU programming and building a simple matmul kernel in Mojo. The starting point of our journey to beat cuBLAS.