GPU Developer, Gridwise
Sept 2025 – Present
At Gridwise I work on low-level GPU compute, building and tuning WebGPU kernels for data-parallel workloads. The role is deeply technical: close to the hardware and focused on getting the most throughput out of the GPU.
- Built and optimized WebGPU kernels for scan, reduction, radix sort, and histogram, following CUDA-style execution patterns with workgroups (thread groups), workgroup shared memory, and barriers.
- Profiled kernels with WebGPU tooling to identify bandwidth bottlenecks and atomic contention; tuned memory layouts to reach a sustained throughput of 80 GB/s, a 40 GB/s improvement over baseline.
- Unified multiple histogram implementations into a single, easy-to-use GPU API, simplifying integration for other developers without sacrificing performance.
- Ran end-to-end performance tests across workloads, analyzing occupancy, synchronization costs, and memory efficiency, and iterated on kernels to maximize throughput.