Technical Deep Dives

September 19, 2025

Part 4: Breaking SOTA: Achieving 103% of Peak Performance on Blackwell

The final optimization techniques that push our matmul kernel past NVIDIA's cuBLAS, reaching 103% of its performance on Blackwell GPUs.

September 12, 2025

Part 3: The Optimizations Behind 85% of SOTA Performance

Diving into advanced optimization techniques, including register allocation, memory coalescing, and pipelining strategies, that reach 85% of SOTA performance.

September 5, 2025

Part 2: Using Blackwell Hardware Features to Optimize Matmul

Exploring Blackwell-specific features like tcgen05 instructions and tensor memory to build faster matrix multiplication kernels.

August 28, 2025

Part 1: Matrix Multiplication on Blackwell: Introduction

Introducing the fundamentals of GPU programming and building a simple matmul kernel in Mojo. The starting point of our journey to beat cuBLAS.