experience
may – aug 2025 • los altos, ca
- built matmul kernels on blackwell that outperformed nvidia's cublas, reaching 103% of its performance (blog, code)
- wrote 1d conv kernel that beat cudnn by 7×, speeding up inworld ai's tts model by 12%
- implemented hilbert curve scheduling for h100
- built specialized kernels (3d conv, bicubic resize) for midjourney inference
sep – dec 2024 • fremont, ca
- built computer vision prototype for 2-post lift safety in service centers
- trained segmentation model to detect lift arms and compute safe angle thresholds
jan – apr 2024 • dearborn, mi
- built automated test suite using playwright and typescript
may – aug 2023 • waterloo, on
- built full-stack analytics dashboard for service data
projects
cuda • github
- built cuda matmul kernel for a100 from scratch, progressively optimizing performance
- implemented memory coalescing, increasing memory throughput from 15gb/s to 110gb/s for a 6.5× kernel speedup
- added 1d block tiling to improve register reuse, achieving 8.5 tflops
python, numpy • github
- built neural network from scratch using only numpy
- implemented dense layers, activation functions, and loss functions
- built custom optimizers (sgd, adam) and mini-batch gradient descent