What are the rules for the JIT to decide to use a software implementation of an intrinsic? #109002

Warpten · 2024-10-17T22:26:08Z

Warpten
Oct 17, 2024

I apologize, this is lengthy and probably over-explained.

In the process of writing my own binary endianness swapper, I used BinaryPrimitives.ReverseEndianness as a reference, with one major caveat: I use generics all the way down and source == dest.

Disclaimer: this is not an argument over whether or not what I'm doing is sane and if I should even bother doing it. It's an exploration into what I can do with the recent additions to the language, as I haven't seriously touched .NET since .NET Core 3.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe T[] ReadBE<T>(this Stream stream, int count) where T : unmanaged, IBinaryInteger<T>
{
    var value = new T[count];
    var valueBytes = MemoryMarshal.AsBytes(value.AsSpan());

    stream.ReadExactly(valueBytes);

    if (BitConverter.IsLittleEndian)
        ReverseEndianness(value.AsSpan());

    return value;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe uint[] ReadUInt32BE(this Stream stream, int count) => ReadBE<uint>(stream, count);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static unsafe void ReverseEndianness<T>(Span<T> value) where T : unmanaged, IBinaryInteger<T>
{
    var valueSpan = MemoryMarshal.AsBytes(value);
    ref T sourceRef = ref MemoryMarshal.GetReference(value);

    int i = 0;

    // Last element index at which a full Vector256<T> load/store stays in bounds.
    var iterationCount = value.Length - Vector256<T>.Count;
    if (Vector256.IsHardwareAccelerated && i <= iterationCount)
    {
        while (i <= iterationCount)
        {
            Vector256.StoreUnsafe(Vector256.Shuffle(
                Vector256.LoadUnsafe(ref sourceRef, (uint)i).AsByte(),
                MakeSwizzle256Fast<T>()
            ).As<byte, T>(), ref sourceRef, (uint) i);
            i += Vector256<T>.Count;
        }
    }

    iterationCount = value.Length - Vector128<T>.Count;
    if (Vector128.IsHardwareAccelerated && i <= iterationCount)
    {
        // var swizzle = MakeSwizzle128Fast<T>();

        while (i <= iterationCount)
        {
            Vector128.StoreUnsafe(Vector128.Shuffle(
                Vector128.LoadUnsafe(ref sourceRef, (uint)i).AsByte(),
                MakeSwizzle128Fast<T>()
            ).As<byte, T>(), ref sourceRef, (uint)i);
            i += Vector128<T>.Count;
        }
    }

    // bother with Vector64?

    // find less stupid solution
    i *= Unsafe.SizeOf<T>();
    while (i < valueSpan.Length)
    {
        for (int j = 0, k = Unsafe.SizeOf<T>() - 1; j < k; ++j, --k)
        {
            var leftIndex = i + j;
            var rightIndex = i + k;

            (valueSpan[rightIndex], valueSpan[leftIndex]) = (valueSpan[leftIndex], valueSpan[rightIndex]);
        }

        i += Unsafe.SizeOf<T>();
    }
}

[MethodImpl(MethodImplOptions.AggressiveInlining), SkipLocalsInit]
private unsafe static void MakeSwizzle<M>(ref M zero, int wordSize) where M : struct
{
    // Go through pointers to eliminate bounds checks
    var zeroPtr = (byte*) Unsafe.AsPointer(ref zero);
    for (var i = 0; i < Unsafe.SizeOf<M>(); i += wordSize)
    {
        for (var j = wordSize - 1; j >= 0; --j)
            zeroPtr[i + j] = (byte)(wordSize - j - 1 + i);
    }
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256<byte> MakeSwizzle256Fast<T>() where T : unmanaged, IBinaryInteger<T>
{
    if (Unsafe.SizeOf<T>() == 4)
        return Vector256.Create((byte)3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, 19, 18, 17, 16, 23, 22, 21, 20, 27, 26, 25, 24, 31, 30, 29, 28);
    else if (Unsafe.SizeOf<T>() == 8)
        return Vector256.Create((byte)7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8, 23, 22, 21, 20, 19, 18, 17, 16, 31, 30, 29, 28, 27, 26, 25, 24);
    else
    {
        // u16 probs faster as shr pairs but let's try shuffling
        Unsafe.SkipInit(out Vector256<byte> swizzle);
        MakeSwizzle(ref swizzle, Unsafe.SizeOf<T>());
        return swizzle;
    }
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> MakeSwizzle128Fast<T>() where T : unmanaged, IBinaryInteger<T>
{
    if (Unsafe.SizeOf<T>() == 4)
        return Vector128.Create((byte)3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
    else if (Unsafe.SizeOf<T>() == 8)
        return Vector128.Create((byte)7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
    else
    {
        Unsafe.SkipInit(out Vector128<byte> swizzle);
        MakeSwizzle(ref swizzle, Unsafe.SizeOf<T>());
        return swizzle;
    }
}

In the process of writing this and benchmarking it I noticed the following things:

MakeSwizzle128Fast and MakeSwizzle256Fast were necessary because, as far as I can tell, the JIT is not smart enough to realize that the output of MakeSwizzle is constant for a given M and T, so it cannot turn that output into a constant it can simply load. However, Unsafe.SizeOf<T>() == N is an obvious "always true" once T is known, and the return value is a constant, so this works.
No matter how hard I try, the JIT always seems to select the software implementation of VectorN.Shuffle, meaning my code ends up nearly four times slower than the built-in BinaryPrimitives.ReverseEndianness (ratio 3.86 below) despite my best efforts:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800H with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.101
  [Host]     : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2

| Method                                     | Mean     | Error    | StdDev   | Median   | Ratio | RatioSD | Code Size |
|------------------------------------------- |---------:|---------:|---------:|---------:|------:|--------:|----------:|
| 'Endian swap 21 32-bits numbers'           | 91.13 ns | 1.887 ns | 4.735 ns | 89.86 ns |  3.86 |    0.40 |     807 B |
| 'Endian swap 21 32-bits numbers - Builtin' | 23.79 ns | 0.753 ns | 2.221 ns | 23.19 ns |  1.01 |    0.13 |     935 B |

Where the benchmark is:

[Benchmark(Description = "Endian swap 21 32-bits numbers")]
public uint[] BenchmarkEndianness()
{
    _dataStream.Position = 0;

    var value = _dataStream.ReadUInt32BE(21);
    return value;
}

[Benchmark(Description = "Endian swap 21 32-bits numbers - Builtin", Baseline = true)]
public uint[] BenchmarkEndiannessBuiltin()
{
    _dataStream.Position = 0;

    var integers = new uint[21];
    _dataStream.ReadExactly(MemoryMarshal.AsBytes(integers.AsSpan()));

    BinaryPrimitives.ReverseEndianness(integers.AsSpan(), integers.AsSpan());
    return integers;
}

And an excerpt of the disassembly:

wowzer.fs.Extensions.StreamExtensions.ReadBE[[System.UInt32, System.Private.CoreLib]](System.IO.Stream, Int32)
            var value = new T[count];
            ^^^^^^^^^^^^^^^^^^^^^^^^^
            var valueBytes = MemoryMarshal.AsBytes(value.AsSpan());
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            stream.ReadExactly(valueBytes);
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                ReverseEndianness(value.AsSpan());
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            return value;
            ^^^^^^^^^^^^^
push      r15
push      r14
push      rdi
push      rsi
push      rbp
push      rbx
sub       rsp,0D8
vzeroupper
vxorps    xmm4,xmm4,xmm4
vmovdqa   xmmword ptr [rsp+40],xmm4
mov       rax,0FFFFFFFFFFFFFFA0
vmovdqa   xmmword ptr [rsp+rax+0B0],xmm4
vmovdqa   xmmword ptr [rsp+rax+0C0],xmm4
vmovdqa   xmmword ptr [rsp+rax+0D0],xmm4          ; all of this is zero init locals i guess
add       rax,30
jne       short 00007FFA7F3C9566
mov       rbx,rcx
movsxd    rdx,edx
mov       rcx,offset MT_System.UInt32[]
call      CORINFO_HELP_NEWARR_1_VC
mov       rsi,rax                                 ; what's all this?
lea       rdi,[rsi+10] 
mov       ecx,[rsi+8]
imul      ebp,ecx,4
jo        near ptr 00007FFA7F3C96B9
cmp       [rbx],bl
mov       r14d,ebp
xor       r15d,r15d
test      ebp,ebp
jle       short 00007FFA7F3C95FB
cmp       r15d,r14d
ja        near ptr 00007FFA7F3C96BE
mov       ecx,r15d
add       rcx,rdi
mov       edx,r14d
sub       edx,r15d
mov       [rsp+40],rcx
mov       [rsp+48],edx
mov       rcx,rbx
lea       rdx,[rsp+40]
mov       rax,[rbx]
mov       rax,[rax+60]
call      qword ptr [rax+20]
test      eax,eax
je        near ptr 00007FFA7F3C96C5
add       r15d,eax
cmp       r15d,ebp
jl        short 00007FFA7F3C95BB
lea       rbx,[rsi+10]
mov       edi,[rsi+8]
imul      ebp,edi,4
jo        near ptr 00007FFA7F3C96B9
xor       r14d,r14d
add       edi,0FFFFFFFC
mov       r15d,edi
test      r15d,r15d
jl        short 00007FFA7F3C968A
vmovups   ymm0,[7FFA7F3C9780]
mov       eax,r14d
vmovups   ymm1,[rbx+rax*4]                        ; [rsp+50] = [rbx + rax*4]
vmovups   [rsp+50],ymm1                            
vmovups   ymm1,[7FFA7F3C9780]                      ; Static load of constant swizzle mask
vmovups   [rsp+70],ymm1                            ; shoveled off to [rsp+70]
xor       ecx,ecx                                  ; software implementation of shuffle for 256 bits ?
lea       rdx,[rsp+70]
movsxd    r8,ecx
movzx     edx,byte ptr [rdx+r8]
xor       r10d,r10d
cmp       edx,20
jge       short 00007FFA7F3C9660
lea       r10,[rsp+50]
mov       edx,edx
movzx     r10d,byte ptr [r10+rdx]
lea       rdx,[rsp+90]
mov       [rdx+r8],r10b
inc       ecx
cmp       ecx,20
jl        short 00007FFA7F3C963F
vmovups   ymm1,[rsp+90]
vmovups   [rbx+rax*4],ymm1
add       r14d,8
cmp       r14d,r15d
jle       short 00007FFA7F3C9621
mov       r15d,edi
cmp       r14d,r15d
jle       near ptr 00007FFA7F3C9715
shl       r14d,2
cmp       r14d,ebp
jl        near ptr 00007FFA7F3C9723
mov       rax,rsi
vzeroupper
add       rsp,0D8
pop       rbx
pop       rbp
pop       rsi
pop       rdi
pop       r14
pop       r15
ret
call      CORINFO_HELP_OVERFLOW
call      qword ptr [7FFA7F52E9D0]
int       3
call      qword ptr [7FFA7F52ED78]
int       3
mov       edx,r14d
vmovups   xmm0,[rbx+rdx*4]
vmovaps   [rsp+30],xmm0                        
vmovups   xmm0,[7FFA7F3C9780]            
vmovaps   [rsp+20],xmm0                        
lea       rdx,[rsp+30]                         ; software implementation of shuffle for 128 bits ?
lea       r8,[rsp+20]
lea       rcx,[rsp+0C0]
call      qword ptr [7FFA7F5FDDE8]
mov       eax,r14d
vmovaps   xmm0,[rsp+0C0]
vmovups   [rbx+rax*4],xmm0
add       r14d,4
cmp       r14d,r15d
jle       short 00007FFA7F3C96CC
shl       r14d,2
cmp       r14d,ebp
jge       short 00007FFA7F3C96A3
xor       eax,eax
mov       ecx,3
jmp       short 00007FFA7F3C975C
lea       edx,[r14+rax]
lea       r8d,[r14+rcx]
cmp       r8d,ebp
jae       short 00007FFA7F3C976E
mov       r10d,r8d
add       r10,rbx
cmp       edx,ebp
jae       short 00007FFA7F3C976E
mov       r9d,edx
add       r9,rbx
movzx     edx,byte ptr [rbx+rdx]
movzx     r8d,byte ptr [rbx+r8]
mov       [r10],dl
mov       [r9],r8b
inc       eax
dec       ecx
cmp       eax,ecx
jl        short 00007FFA7F3C972C
add       r14d,4
cmp       r14d,ebp
jl        short 00007FFA7F3C9723
jmp       near ptr 00007FFA7F3C96A3
call      CORINFO_HELP_RNGCHKFAIL
int       3

Meanwhile BinaryPrimitives very obviously uses SIMD:

... snip
vmovups   ymm0,[rbx+rax*4]
vmovaps   ymm1,ymm0
vmovups   xmm6,[7FFA7F3E8CB0]
vpshufb   xmm1,xmm1,xmm6
vextracti128 xmm0,ymm0,1
vmovups   xmm7,[7FFA7F3E8CC0]
vpshufb   xmm0,xmm0,xmm7
vinserti128 ymm0,ymm1,xmm0,1

So my question is: what am I missing? What's causing the JIT to drop the intrinsic and fall back to a software implementation? I found #102702 and the associated pull requests, but I don't know enough about the JIT's rules on using intrinsics to really understand what's going on, beyond the fact that the JIT gives up on using an intrinsic and calls the software implementation instead.

Answered by EgorBo


EgorBo · 2024-10-18T01:17:51Z

EgorBo
Oct 18, 2024
Collaborator

This is a known limitation of the Vector_.Shuffle API. It currently expects a constant mask passed directly as a parameter, and that constant has to be visible to the JIT early, so hiding the mask behind AggressiveInlining might not work as expected. Eventually, it should be improved for cases like this.
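
To illustrate the "constant mask directly" wording, here is a minimal sketch, assuming a non-generic, uint-only helper. The names DirectMaskSketch and ReverseEndianness32 are hypothetical and do not come from the thread. The mask literal is written as the Shuffle argument itself instead of being returned from an inlined generic method. Whether a given runtime then emits vpshufb should be checked in the disassembly rather than assumed.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class DirectMaskSketch
{
    // Hypothetical helper (not from the thread): byte-swaps each uint in place.
    public static void ReverseEndianness32(Span<uint> values)
    {
        ref uint first = ref MemoryMarshal.GetReference(values);
        int i = 0;

        if (Vector128.IsHardwareAccelerated)
        {
            // Last index at which a full Vector128<uint> still fits in the span.
            int lastVectorStart = values.Length - Vector128<uint>.Count;
            for (; i <= lastVectorStart; i += Vector128<uint>.Count)
            {
                Vector128<byte> bytes = Vector128.LoadUnsafe(ref first, (nuint)i).AsByte();

                // The mask literal is the direct argument of Shuffle, not the
                // result of a generic helper that has to be inlined first.
                Vector128<byte> swapped = Vector128.Shuffle(
                    bytes,
                    Vector128.Create((byte)3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12));

                swapped.AsUInt32().StoreUnsafe(ref first, (nuint)i);
            }
        }

        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < values.Length; i++)
            values[i] = BinaryPrimitives.ReverseEndianness(values[i]);
    }
}
```

The trade-off is that the generic T is gone: each element width needs its own method with its own literal mask, which is the cost of keeping the constant visible at the call site.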

4 replies

Warpten Oct 18, 2024
Author

> It currently expects a constant mask as a parameter directly

Would that be the relevant piece of code? This code looks fairly opaque to me but also very related.
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L3465-L3502
(I also can't find the code for IsValidForShuffle, but I assume that's just GitHub failing to search properly, and I have not yet cloned the repository)

Since, from what I understand, inlining has already happened in IL when rationalization happens, wouldn't that mean that the information that inlining results in a constant is either lost or not determined? In which case, all that would need to be done is mark the inlined node as such. This is where my limited understanding of the runtime bites, as I'm not too sure how to carry the information down to the rationalization pass, or if that should be further processed so that rationalization "just knows" without any extra logic.

I'd be down to take a stab at this (a fun side project, if I'm being honest), however I may happen to ask questions and don't necessarily want to bother people too much :)

Follow-up: I took a look at fgInline and saw that there is a folding pass for constant expressions. I can indeed see its result in the generated assembly:

vmovups   xmm0,[rbx+rdx*4]
vmovaps   [rsp+30],xmm0                        
vmovups   xmm0,[7FFA7F3C9780] ; <---
vmovaps   [rsp+20],xmm0

So the information is not seen by the rationalization pass, I guess?

tannergooding Oct 18, 2024
Collaborator

> So my question is: what am I missing?

You're testing on .NET 8, whereas the fix is in .NET 9.

| Method                                     | Job      | Runtime  | Mean     | Error    | StdDev   | Ratio | RatioSD |
|------------------------------------------- |--------- |--------- |---------:|---------:|---------:|------:|--------:|
| 'Endian swap 21 32-bits numbers'           | .NET 8.0 | .NET 8.0 | 55.81 ns | 1.152 ns | 1.861 ns |  4.34 |    0.26 |
| 'Endian swap 21 32-bits numbers - Builtin' | .NET 8.0 | .NET 8.0 | 12.88 ns | 0.290 ns | 0.661 ns |  1.00 |    0.07 |
| 'Endian swap 21 32-bits numbers'           | .NET 9.0 | .NET 9.0 | 11.69 ns | 0.265 ns | 0.315 ns |  0.91 |    0.05 |
| 'Endian swap 21 32-bits numbers - Builtin' | .NET 9.0 | .NET 9.0 | 12.38 ns | 0.283 ns | 0.489 ns |  0.96 |    0.06 |
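
The Job and Runtime columns suggest a multi-runtime BenchmarkDotNet run. Here is a minimal sketch of one way to get such a side-by-side comparison, assuming the benchmark project multi-targets net8.0 and net9.0 and both SDKs are installed. The class name EndiannessBenchmarks and the zero-filled stream are placeholders, not the setup actually used for the numbers above.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// One job per runtime, so every benchmark is reported once for each of them.
[SimpleJob(RuntimeMoniker.Net80)]
[SimpleJob(RuntimeMoniker.Net90)]
public class EndiannessBenchmarks
{
    private MemoryStream _dataStream = null!;

    // Placeholder input: 21 zeroed 32-bit values (an assumption for this sketch).
    [GlobalSetup]
    public void Setup() => _dataStream = new MemoryStream(new byte[21 * sizeof(uint)]);

    [Benchmark(Description = "Endian swap 21 32-bits numbers - Builtin", Baseline = true)]
    public uint[] BenchmarkEndiannessBuiltin()
    {
        _dataStream.Position = 0;

        var integers = new uint[21];
        _dataStream.ReadExactly(MemoryMarshal.AsBytes(integers.AsSpan()));

        BinaryPrimitives.ReverseEndianness(integers.AsSpan(), integers.AsSpan());
        return integers;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<EndiannessBenchmarks>();
}
```

Adding the custom ReadUInt32BE benchmark from the question to the same class would put both implementations on both runtimes in one results table.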

Warpten Oct 18, 2024
Author

Oh, well that is a nice user error. I thought .NET 8 was the latest (including RCs). Sorry for the noise!

tannergooding Oct 18, 2024
Collaborator

.NET 9 is currently at RC2 and is planned to reach GA (general availability) at .NET Conf next month.

The RCs are "go live" (https://devblogs.microsoft.com/dotnet/dotnet-9-rc-2/)
