experience

performance engineering intern

modular

may – aug 2025 • los altos, ca

built matmul kernels on blackwell that outperform nvidia's cublas, reaching 103% of its performance (blog, code)
wrote 1d conv kernel that beat cudnn by 7×, speeding up inworld ai's tts model by 12%
implemented hilbert curve scheduling for h100
built specialized kernels (3d conv, bicubic resize) for midjourney inference

software engineering intern

tesla

sep – dec 2024 • fremont, ca

built computer vision prototype for 2-post lift safety in service centers
trained segmentation model to detect lift arms and compute safe angle thresholds

software engineering intern

ford

jan – apr 2024 • dearborn, mi

built automated test suite using playwright and typescript

software engineering intern

blackberry

may – aug 2023 • waterloo, on

built full-stack analytics dashboard for service data

projects

matmul rewritten: cuda kernel study

cuda • github

built cuda matmul kernel for a100 from scratch, progressively optimizing performance
implemented memory coalescing, raising memory throughput from 15gb/s to 110gb/s and yielding a 6.5× speedup
added 1d block tiling to improve register reuse, achieving 8.5 tflops

bare metal neural network

python, numpy • github

built neural network from scratch using only numpy
implemented dense layers, activation functions, and loss functions
built custom optimizers (sgd, adam) and mini-batch gradient descent