A deep technical overview of the new MoE Align & Sort algorithm. By enabling concurrent execution across multiple blocks for an arbitrary number of experts, and by making aggressive use of shared memory and registers, MoE Align & Sort achieves significant performance gains on AMD hardware: up to 10x acceleration on MI100 and 7x on MI300X/MI300A.