Arch Dispatching

turbo::simd provides a generic way to dispatch a function call based on the architectures the code was compiled for and the architectures available at runtime. The turbo::simd::dispatch() function takes a functor whose call operator takes an architecture parameter as its first argument, followed by any number of arguments Args..., and turns it into a dispatching functor that takes Args... as arguments.

template<class ArchList = supported_architectures, class F>
inline detail::dispatcher<F, ArchList> turbo::simd::dispatch(F &&f) noexcept

The following code shows how to use the turbo::simd::dispatch() function:

#include "sum.h"

// Create the dispatching function, specifying the architectures we want to
// target, in order of preference.
auto dispatched = turbo::simd::dispatch<turbo::simd::arch_list<turbo::simd::avx2, turbo::simd::sse2>>(sum{});

// Call the appropriate implementation based on runtime information.
float res = dispatched(data, 17);

This code can be compiled without any architecture-specific flags. The architecture-specific details follow.

The sum.h header contains the function that is actually called, written in an architecture-agnostic way:

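// Include guard without a leading underscore: names starting with an
// underscore followed by an uppercase letter are reserved.
#ifndef SUM_H_
#define SUM_H_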

#include "turbo/simd/simd.h"

// Functor whose call operator depends on `Arch`.
struct sum {
    // It's critical not to use an in-class definition here.
    // In-class and inline definitions bypass the extern template mechanism.
    template<class Arch, class T>
    T operator()(Arch, T const *data, unsigned size);
};

template<class Arch, class T>
T sum::operator()(Arch, T const *data, unsigned size) {
    using batch = turbo::simd::batch<T, Arch>;
    // Vector accumulator, one partial sum per lane.
    batch acc(static_cast<T>(0));
    // Largest multiple of batch::size that does not exceed size.
    const unsigned n = size / batch::size * batch::size;
    for (unsigned i = 0; i != n; i += batch::size)
        acc += batch::load_unaligned(data + i);
    // Horizontal sum of the lanes, then the scalar tail.
    T star_acc = turbo::simd::reduce_add(acc);
    for (unsigned i = n; i < size; ++i)
        star_acc += data[i];
    return star_acc;
}

// Inform the compiler that the specializations declared below are instantiated in other compilation units.
extern template float sum::operator()<turbo::simd::avx2, float>(turbo::simd::avx2, float const *, unsigned);

extern template float sum::operator()<turbo::simd::avx, float>(turbo::simd::avx, float const *, unsigned);

extern template float sum::operator()<turbo::simd::sse2, float>(turbo::simd::sse2, float const *, unsigned);

#endif
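
The loop structure of sum::operator() (a vectorized main loop followed by a scalar tail) can be illustrated without any SIMD types. The sketch below is a scalar stand-in only: chunked_sum and its Lanes parameter are hypothetical, and a plain array of accumulators plays the role of the batch.

```cpp
// Hypothetical scalar stand-in for a SIMD batch: Lanes independent accumulators.
template <unsigned Lanes>
float chunked_sum(const float* data, unsigned size) {
    float acc[Lanes] = {};  // one partial sum per lane, zero-initialized

    // Largest multiple of Lanes that does not exceed size.
    const unsigned n = size / Lanes * Lanes;

    // Main loop: consume Lanes elements per iteration, like one batch load.
    for (unsigned i = 0; i != n; i += Lanes)
        for (unsigned lane = 0; lane != Lanes; ++lane)
            acc[lane] += data[i + lane];

    // Horizontal reduction of the lanes into a single scalar.
    float total = 0.0f;
    for (unsigned lane = 0; lane != Lanes; ++lane)
        total += acc[lane];

    // Scalar tail: the size - n elements that do not fill a whole batch.
    for (unsigned i = n; i < size; ++i)
        total += data[i];

    return total;
}
```

For 17 input elements and 8 lanes, the main loop covers the first 16 elements and the tail loop handles the last one. With 4 lanes the split is the same 16 and 1, reached in four iterations instead of two.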

The SSE2 and AVX2 versions need to be provided in other compilation units, each compiled with the appropriate flags, for instance:

// compile with -mavx2
#include "sum.h"

template float sum::operator()<turbo::simd::avx2, float>(turbo::simd::avx2, float const *, unsigned);

// In a separate source file, compile with -msse2
#include "sum.h"

template float sum::operator()<turbo::simd::sse2, float>(turbo::simd::sse2, float const *, unsigned);
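
The pairing of extern template declarations with explicit instantiations can be tried in isolation. The sketch below condenses the pattern into a single translation unit for brevity, which is an assumption made for illustration: in a real build, each explicit instantiation definition would sit in its own source file compiled with its own flags. tag_narrow, tag_wide and lane_count are hypothetical names.

```cpp
// Hypothetical tags standing in for architecture tags.
struct tag_narrow { static constexpr int lanes = 4; };
struct tag_wide   { static constexpr int lanes = 8; };

struct lane_count {
    // Declared in class, defined out of class. An in-class definition would be
    // implicitly inline, and extern template does not suppress implicit
    // instantiation of inline functions.
    template <class Tag, class T>
    T operator()(Tag, T scale) const;
};

template <class Tag, class T>
T lane_count::operator()(Tag, T scale) const {
    return scale * static_cast<T>(Tag::lanes);
}

// Explicit instantiation declarations: users of the header must not
// instantiate these specializations themselves.
extern template float lane_count::operator()<tag_narrow, float>(tag_narrow, float) const;
extern template float lane_count::operator()<tag_wide, float>(tag_wide, float) const;

// Explicit instantiation definitions. In a real build each of these would
// live in a separate source file; they must follow the declarations above
// when both appear in the same translation unit.
template float lane_count::operator()<tag_narrow, float>(tag_narrow, float) const;
template float lane_count::operator()<tag_wide, float>(tag_wide, float) const;
```

If one of the explicit instantiation definitions is missing from the build, a call to that specialization compiles but is expected to fail at link time with an undefined symbol, which is the usual symptom of a forgotten per-architecture source file.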