Advanced SIMD: extending the reach of contemporary SIMD architectures

SIMD extensions have gained widespread acceptance in modern microprocessors as a way to exploit data-level parallelism in general-purpose cores. Popular SIMD architectures (e.g. Intel SSE/AVX) have evolved by adding support for wider registers and datapaths, and advanced features like indexed memory accesses, per-lane predication and inter-lane instructions, at the cost of additional silicon area and design complexity.
This paper evaluates the performance impact of such advanced features on a set of workloads considered hard to vectorize for traditional SIMD architectures. The sensitivity of these workloads to the most relevant design parameters (e.g. register/datapath width and L1 data cache configuration) is quantified and discussed.
We developed an ARMv7 NEON-based ISA extension (ARGON), augmented a cycle-accurate simulation framework to support it, and derived a set of benchmarks from the Berkeley dwarfs. Our analyses demonstrate how ARGON can, depending on the structure of an algorithm, achieve speedups of 1.5x to 16x.
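
For readers unfamiliar with the advanced features named in the abstract, the following is a minimal scalar sketch of what indexed memory accesses (gather), per-lane predication and an inter-lane operation (reduction) mean at the lane level. It is an illustration only and does not reproduce ARGON's instruction semantics; the function names `gather`, `predicated_add` and `reduce_add`, the four-lane width and the zero fill value for inactive lanes are all assumptions made for this example.

```python
from typing import List


def gather(memory: List[int], indices: List[int], pred: List[bool],
           fill: int = 0) -> List[int]:
    """Indexed (gather) load: active lane i reads memory[indices[i]].

    Inactive lanes receive `fill` (an assumption; real ISAs differ here).
    """
    return [memory[idx] if p else fill for idx, p in zip(indices, pred)]


def predicated_add(a: List[int], b: List[int], pred: List[bool]) -> List[int]:
    """Per-lane predication: active lanes compute a + b, inactive lanes keep a."""
    return [x + y if p else x for x, y, p in zip(a, b, pred)]


def reduce_add(v: List[int], pred: List[bool]) -> int:
    """Inter-lane operation: sum the values held in the active lanes."""
    return sum(x for x, p in zip(v, pred) if p)


# Hypothetical 4-lane example with lane 2 masked off.
memory = [10, 20, 30, 40, 50, 60, 70, 80]
indices = [7, 0, 3, 3]                  # irregular, data-dependent addresses
pred = [True, True, False, True]

loaded = gather(memory, indices, pred)          # [80, 10, 0, 40]
acc = predicated_add([1, 1, 1, 1], loaded, pred)  # [81, 11, 1, 41]
total = reduce_add(acc, pred)                   # 133
```

Loops with data-dependent addresses or conditionals in the body map naturally onto these three primitives, which is one reason such code is considered hard to vectorize without them.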

Boettcher, Matthias

Al-Hashimi, Bashir M.

Eyole, Mbou

Gabrielli, Giacomo

Reid, Alastair
