# Pallas: Mosaic GPU

Backend-specific documentation for the Mosaic GPU backend.

- Reference documentation
- Writing Mosaic GPU kernels with Pallas
  - What is a GPU?
  - Array layouts and memory reference transforms
  - MMA (TensorCore)
  - Using core_map
  - Synchronization structures and primitives
  - Cluster launch control
  - Asynchronous copies
  - Inline Mosaic GPU
  - Compiler parameters
  - Debugging
  - Calling kernels from PyTorch
- Mosaic GPU Pipelining
  - Pipelining with Mosaic GPU
  - GPU Memory Spaces
  - Example: Matmul Kernel on Hopper GPUs
  - Warp Specialization
  - Example: Matrix Multiplication with Warp Specialization
- Writing high-performance matrix multiplication kernels for Blackwell
  - 0. Basic kernel
  - 1. Warp specialization
  - 2. Tiled epilogue
  - 3. Collective (2CTA) MMA
  - 4. Persistent kernel
  - 5. Dedicated epilogue warpgroup
  - 6. Grid tiling
  - Final kernel
- Collective matrix multiplication
  - Algorithm overview: Ring All-Gather
  - Pallas primitives for inter-device communication
  - Implementation with Pallas
  - Integrating the kernel with JAX