Brussels / 31 January & 1 February 2026


What a Decade of SIMD Taught Us: AVX-512, AMX, NEON, SVE, SME, and Beyond


Over the past decade, CPU vector units have evolved faster than most software stacks have adapted. From AVX2 and AVX-512 to NEON, SVE/SVE2, AMX, and SME, each generation introduced wider registers, richer predication, new mixed-precision formats, and entirely new execution models. Yet extracting sustained throughput from these extensions requires understanding architectural asymmetries, compiler behaviour, portability constraints, and the practical limits of auto-vectorisation.

This session distills ten years of developing SIMD kernels deployed inside large-scale open-source systems, including USearch — a high-throughput vector search engine embedded today in many modern DBMS projects — as well as StringZilla, SimSIMD, and bioinformatics tools. We will examine how different CPU families behave on identical workloads, which instructions consistently deliver real speedups, which look promising but rarely pay off, and where compilers do (and do not) generate optimal vector code.

Case studies include: AVX-512 reductions exceeding 300 GB/s on current x86 machines; i8/u8 and bf16 pipelines on NEON and SVE2; practical limits of AMX tiles for dense math; and early insights into SME’s streaming execution model. Each example is illustrated with minimal, reproducible kernel fragments.

Beyond raw performance, the talk outlines when SIMD is the right tool: which classes of problems benefit most, when auto-vectorisation is sufficient, and when hand-written intrinsics or assembly are justified. We will also discuss hardware selection for SIMD-heavy workloads — both on-prem and in the cloud — and what upcoming extensions mean for open-source systems in the next decade.

Speakers

Ash Vardanian
