
vxsort: Add Arm64 Neon implementation and tests #110692

Draft · wants to merge 6 commits into main

Conversation

@a74nh (Contributor) commented Dec 13, 2024

Add an implementation of vxsort for Neon.

Add a testing framework that can build vxsort by itself, run some basic sanity tests, and run some basic performance tests. This will be useful should further improvements be made or other targets (SVE?) be added.

@dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Dec 13, 2024
@a74nh (Contributor, Author) commented Dec 13, 2024

Work in progress.

Currently the testing for vxsort exists in src/coreclr/gc/vxsort/standalone. This needs refactoring and moving into src/tests somewhere.

I still need to add bitonic sort and packing support for Neon. Sorting of small lists currently uses a copy of insertsort instead of a bitonic sorting network (a sketch of the fallback follows). So that I can compare performance fairly, I've disabled packing and switched to insertsort for AVX2 too.
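For reference, the small-list fallback is a plain insertion sort; a minimal sketch (my own illustration, not the PR's actual copy of insertsort):

```c++
#include <cstdint>
#include <cstddef>

// Minimal insertion sort over 64-bit keys, standing in for the insertsort
// fallback described above (illustrative sketch only).
static void insertsort(int64_t* data, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int64_t key = data[i];
        size_t j = i;
        while (j > 0 && data[j - 1] > key) {
            data[j] = data[j - 1];  // shift larger elements one slot right
            j--;
        }
        data[j] = key;
    }
}
```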

Performance testing is very basic, but running ./simple_bench/Project_demo 250 on a Cobalt 100, I see roughly the same times for both vxsort and insertsort:

vxsort: Time= 3 us
vxsort: Time= 5 us
vxsort: Time= 4 us
insertsort: Time= 5 us
insertsort: Time= 3 us
insertsort: Time= 7 us

On an AVX2 X64 machine (Gold 5120T), vxsort is slightly faster:

vxsort: Time= 6 us
vxsort: Time= 6 us
vxsort: Time= 6 us
insertsort: Time= 8 us
insertsort: Time= 6 us
insertsort: Time= 5 us

On the same AVX2 X64 machine (Gold 5120T), switching the vxsort code to use bitonic sort and packing:

vxsort: Time= 3 us
vxsort: Time= 5 us
vxsort: Time= 4 us

Given the above, I'm fairly confident that implementing the rest for Neon will give some improvement. However, it will never show the same boost as AVX2, given the vector lengths: on Neon, 128-bit vectors mean we are only sorting two 64-bit values at once (see the sketch below).
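To illustrate the limited headroom, here is a minimal sketch (assumptions mine, not code from the PR) of sorting one 128-bit NEON vector: with only two 64-bit lanes, an in-register sort collapses to a single conditional lane swap.

```c++
#include <arm_neon.h>

// Sort the two 64-bit lanes of v ascending: one compare-exchange.
static inline int64x2_t sort2_s64(int64x2_t v)
{
    int64x2_t rev = vextq_s64(v, v, 1);        // [b, a] from [a, b]
    uint64x2_t gt = vcgtq_s64(v, rev);         // lane 0 holds a > b
    uint64x2_t swap = vdupq_laneq_u64(gt, 0);  // broadcast that decision
    return vbslq_s64(swap, rev, v);            // swap lanes iff a > b
}
```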

I noticed that for more than 255 elements the program segfaults on both X64 and Arm64. This looks like a limitation of vxsort. Might it be worth adding some asserts in the GC to check the size of the list?

@Maoni0 (Member) commented Dec 14, 2024

thanks for your interest in this!

@damageboy has many tests in his repo - https://github.com/damageboy/vxsort-cpp

> I noticed that for more than 255 elements the program segfaults on both X64 and Arm64.

is 255 the number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512.
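For illustration, the size gating described here might look something like the following (a sketch; the constant names and the function are hypothetical, not the actual GC code):

```c++
#include <cstddef>

// The "at least 8k for avx2 and 128k for avx512" minimums quoted above;
// hypothetical names, for illustration only.
constexpr size_t kMinVxsortAvx2   = 8 * 1024;
constexpr size_t kMinVxsortAvx512 = 128 * 1024;

static bool should_use_vxsort(size_t count, bool use_avx512)
{
    return count >= (use_avx512 ? kMinVxsortAvx512 : kMinVxsortAvx2);
}
```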

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

#if defined(TARGET_ARM64)
@am11 (Member) commented Dec 14, 2024

Maybe just conditionally compile them (condition in cmake) instead of adding ifdefs?

@a74nh (Contributor, Author) replied

> Maybe just conditionally compile them (condition in cmake) instead of adding ifdefs?

I'm used to seeing files elsewhere in coreclr compiled for all targets, with just an ifdef at the top (e.g., anything in jit/ with a CPU in the filename), so I was following that convention. But happy to switch.

@a74nh (Contributor, Author) commented Dec 16, 2024

> is 255 the number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512.

Looks like that was a bug on my side. With that fixed, for 8000 elements:

AVX2 X64 (Gold 5120T):

vxsort: Time= 593 us
vxsort: Time= 576 us
vxsort: Time= 566 us
insertsort: Time= 1169 us
insertsort: Time= 1177 us
insertsort: Time= 1168 us

Cobalt 100:

vxsort: Time= 157 us
vxsort: Time= 153 us
vxsort: Time= 156 us
insertsort: Time= 233 us
insertsort: Time= 215 us
insertsort: Time= 220 us

@kunalspathak (Member) commented

Thanks @a74nh. Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

@kunalspathak (Member) commented

> Thanks @a74nh. Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

ignore... it seems today we use insertsort on arm64, so with your numbers, that looks like a ~30% improvement.

@a74nh (Contributor, Author) commented Dec 17, 2024

> Thanks @a74nh. Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?
>
> ignore... it seems today we use insertsort on arm64, so with your numbers, that looks like a ~30% improvement.

Yes. I'm hoping to get more by porting both bitonic sort and packing to Arm64. In the above figures, I've disabled both of those on X86. When I re-enable them, X86 goes from ~576 us to ~162 us, so there's definitely some more performance to find.

@a74nh (Contributor, Author) commented Dec 20, 2024

I've added an implementation of bitonic sort.

The plan was to do the bitonic sort using NEON. Unfortunately, instructions like rev, min, max etc. do not have variants that work on 64-bit elements; they only have 8/16/32-bit variants. (A broken version showing what it would look like if those instructions existed is here.)

Some of the bitonic functions can be done in NEON with a few extra instructions (e.g. cmgt+bit instead of max; a sketch follows). For other functions, the best option is to move the values into GPR registers and use scalar code. That's very messy and loses perf in all the moves.
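A sketch of the cmgt+bit idea (my own illustration; the PR may structure this differently). NEON has no 64-bit smax, but cmgt does operate on 64-bit lanes, so a bitwise select recovers a per-lane max:

```c++
#include <arm_neon.h>

// Emulate the missing 64-bit max: CMGT produces an all-ones mask per lane,
// and a bitwise select (BSL, same family as BIT) picks the winner.
static inline int64x2_t max_s64(int64x2_t a, int64x2_t b)
{
    uint64x2_t gt = vcgtq_s64(a, b);  // all-ones where a > b, else zero
    return vbslq_s64(gt, a, b);       // take a where a > b, else b
}
```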

An alternative is simply to avoid NEON and use GPR registers throughout. This can be done by writing the code in plain C++ instead of intrinsics, allowing the compiler to optimise (see the sketch below).
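A minimal sketch of that scalar approach, assuming int64_t keys: the compare-exchange at the heart of a bitonic network, written in plain C++ so the compiler is free to lower it to branch-free conditional selects (CSEL on AArch64):

```c++
#include <cstdint>
#include <utility>

// Compare-exchange: after the call, lo <= hi. Compilers typically emit
// branch-free code (e.g. two CSELs) for this pattern.
static inline void cmp_exchange(int64_t& lo, int64_t& hi)
{
    if (lo > hi)
        std::swap(lo, hi);
}
```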

As a result, I've implemented the bitonic sort using scalar code. It's highly doubtful that a mix of NEON+scalar would give better performance, and as a bonus the scalar version is architecture-independent.

Note that for 8/16/32-bit values, NEON would be the preferred option. Also, SVE would give better performance on 256-bit machines (currently only Neoverse V1); it's doubtful on 128-bit machines, although it would shorten the code size. I don't plan on implementing SVE in this PR.

Trying this code on Cobalt 100 shows quite an additional speedup (previously vxsort was running at ~150 us):

❯ ./simple_bench/Project_demo 8000
vxsort: Time= 113 us
vxsort: Time= 123 us
vxsort: Time= 117 us
insertsort: Time= 220 us
insertsort: Time= 221 us
insertsort: Time= 238 us

In the new year, I'll look at cleaning this up, sorting out the tests, etc. I'll also look at the missing "packing" code in vxsort and see if there's anything else to gain.

Labels: area-GC-coreclr, community-contribution