
vxsort: Add Arm64 Neon implementation and tests #110692

Draft · wants to merge 6 commits into main

Conversation

@a74nh (Contributor) commented Dec 13, 2024

Add an implementation of vxsort for Neon.

Add a testing framework that can build vxsort by itself, run some basic sanity tests, and run some basic performance tests. This will be useful should further improvements be made or other targets (SVE?) be added.

@dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Dec 13, 2024
@a74nh (Contributor, Author) commented Dec 13, 2024

Work in progress.

Currently the testing for vxsort exists in src/coreclr/gc/vxsort/standalone. This needs refactoring and moving into src/tests somewhere.

I still need to add bitonic sort and packing support for Neon. Sorting of small lists currently uses a copy of insertsort instead of a bitonic sorting network (a sketch of the fallback follows). So that I can compare performance fairly, I've disabled packing and switched to insertsort for AVX2 too.
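For reference, the small-list fallback is a plain insertion sort; a minimal sketch (my own illustration, not the PR's actual copy of insertsort):

```c++
#include <cstdint>
#include <cstddef>

// Minimal insertion sort over 64-bit keys, standing in for the insertsort
// fallback described above (illustrative sketch only).
static void insertsort(int64_t* data, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int64_t key = data[i];
        size_t j = i;
        while (j > 0 && data[j - 1] > key) {
            data[j] = data[j - 1];  // shift larger elements one slot right
            j--;
        }
        data[j] = key;
    }
}
```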

Performance testing is very basic, but running ./simple_bench/Project_demo 250 on a Cobalt 100, I see roughly the same times for both vxsort and insertsort:

vxsort: Time= 3 us
vxsort: Time= 5 us
vxsort: Time= 4 us
insertsort: Time= 5 us
insertsort: Time= 3 us
insertsort: Time= 7 us

On an AVX2 X64 machine (Gold 5120T), vxsort is slightly faster:

vxsort: Time= 6 us
vxsort: Time= 6 us
vxsort: Time= 6 us
insertsort: Time= 8 us
insertsort: Time= 6 us
insertsort: Time= 5 us

On the same AVX2 X64 machine (Gold 5120T), switching the vxsort code to use bitonic sort and packing:

vxsort: Time= 3 us
vxsort: Time= 5 us
vxsort: Time= 4 us

Given the above, I'm fairly confident that implementing the rest for Neon will give some improvement. However, it will never show the same boost as AVX2, given the vector lengths: on Neon, 128-bit vectors mean we are only sorting two 64-bit values at once (see the sketch below).
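To illustrate the limited headroom, here is a minimal sketch (assumptions mine, not code from the PR) of sorting one 128-bit NEON vector: with only two 64-bit lanes, an in-register sort collapses to a single conditional lane swap.

```c++
#include <arm_neon.h>

// Sort the two 64-bit lanes of v ascending: one compare-exchange.
static inline int64x2_t sort2_s64(int64x2_t v)
{
    int64x2_t rev = vextq_s64(v, v, 1);        // [b, a] from [a, b]
    uint64x2_t gt = vcgtq_s64(v, rev);         // lane 0 holds a > b
    uint64x2_t swap = vdupq_laneq_u64(gt, 0);  // broadcast that decision
    return vbslq_s64(swap, rev, v);            // swap lanes iff a > b
}
```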

I noticed that for more than 255 elements the program segfaults on both X64 and Arm64. This looks like a limitation of vxsort. Might it be worth adding some asserts in the GC to check the size of the list?

@Maoni0 (Member) commented Dec 14, 2024

thanks for your interest in this!

@damageboy has many tests in his repo - https://github.com/damageboy/vxsort-cpp

> I noticed that for more than 255 elements the program segfaults on both X64 and Arm64.

is 255 the number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512.
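For illustration, the size gating described here might look something like the following (a sketch; the constant names and the function are hypothetical, not the actual GC code):

```c++
#include <cstddef>

// The "at least 8k for avx2 and 128k for avx512" minimums quoted above;
// hypothetical names, for illustration only.
constexpr size_t kMinVxsortAvx2   = 8 * 1024;
constexpr size_t kMinVxsortAvx512 = 128 * 1024;

static bool should_use_vxsort(size_t count, bool use_avx512)
{
    return count >= (use_avx512 ? kMinVxsortAvx512 : kMinVxsortAvx2);
}
```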

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

#if defined(TARGET_ARM64)
@am11 (Member) commented Dec 14, 2024

Maybe just conditionally compile them (condition in cmake) instead of adding ifdefs?

@a74nh (Contributor, Author) replied

> Maybe just conditionally compile them (condition in cmake) instead of adding ifdefs?

I'm used to seeing files elsewhere in coreclr compiled for all targets, with just an ifdef at the top (e.g., anything in jit/ with a CPU in the filename), so I was following that convention. But happy to switch.

@a74nh (Contributor, Author) commented Dec 16, 2024

> is 255 the number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512.

Looks like that was a bug on my side. With that fixed, for 8000 elements:

AVX2 X64 (Gold 5120T):

vxsort: Time= 593 us
vxsort: Time= 576 us
vxsort: Time= 566 us
insertsort: Time= 1169 us
insertsort: Time= 1177 us
insertsort: Time= 1168 us

Cobalt 100:

vxsort: Time= 157 us
vxsort: Time= 153 us
vxsort: Time= 156 us
insertsort: Time= 233 us
insertsort: Time= 215 us
insertsort: Time= 220 us

@kunalspathak (Member) commented

Thanks @a74nh. Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

@kunalspathak (Member) commented

> Thanks @a74nh. Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

ignore... it seems today we use insertsort on arm64, so with your numbers, that looks like a ~30% improvement.

@a74nh (Contributor, Author) commented Dec 17, 2024

> Thanks @a74nh. Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?
>
> ignore... it seems today we use insertsort on arm64, so with your numbers, that looks like a ~30% improvement.

Yes. I'm hoping to get more by porting both bitonic sort and packing to Arm64. In the above figures, I've disabled both of those on X86. When I re-enable them, X86 goes from ~576 us to ~162 us, so there's definitely some more performance to find.

@a74nh (Contributor, Author) commented Dec 20, 2024

I've added an implementation of bitonic sort.

The plan was to do the bitonic sort using NEON. Unfortunately, instructions like rev, min, max etc. do not have variants that work on 64-bit elements; they only have 8/16/32-bit variants. (A broken version showing what it would look like if those instructions existed is here.)

Some of the bitonic functions can be done in NEON with a few extra instructions (e.g. cmgt+bit instead of max; a sketch follows). For other functions, the best option is to move the values into GPR registers and use scalar code. That's very messy and loses perf in all the moves.
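A sketch of the cmgt+bit idea (my own illustration; the PR may structure this differently). NEON has no 64-bit smax, but cmgt does operate on 64-bit lanes, so a bitwise select recovers a per-lane max:

```c++
#include <arm_neon.h>

// Emulate the missing 64-bit max: CMGT produces an all-ones mask per lane,
// and a bitwise select (BSL, same family as BIT) picks the winner.
static inline int64x2_t max_s64(int64x2_t a, int64x2_t b)
{
    uint64x2_t gt = vcgtq_s64(a, b);  // all-ones where a > b, else zero
    return vbslq_s64(gt, a, b);       // take a where a > b, else b
}
```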

An alternative is simply to avoid NEON and use GPR registers throughout. This can be done by writing the code in plain C++ instead of intrinsics, allowing the compiler to optimise (see the sketch below).
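A minimal sketch of that scalar approach, assuming int64_t keys: the compare-exchange at the heart of a bitonic network, written in plain C++ so the compiler is free to lower it to branch-free conditional selects (CSEL on AArch64):

```c++
#include <cstdint>
#include <utility>

// Compare-exchange: after the call, lo <= hi. Compilers typically emit
// branch-free code (e.g. two CSELs) for this pattern.
static inline void cmp_exchange(int64_t& lo, int64_t& hi)
{
    if (lo > hi)
        std::swap(lo, hi);
}
```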

As a result, I've implemented the bitonic sort using scalar code. It's highly doubtful that a mix of NEON+scalar would give better performance, and as a bonus the scalar version is architecture-independent.

Note that for 8/16/32-bit values, NEON would be the preferred option. Also, SVE would give better performance on 256-bit machines (currently only Neoverse V1); it's doubtful on 128-bit machines, although it would shorten the code size. I don't plan on implementing SVE in this PR.

Trying this code on Cobalt 100 shows quite an additional speedup (previously vxsort was running at ~150 us):

❯ ./simple_bench/Project_demo 8000
vxsort: Time= 113 us
vxsort: Time= 123 us
vxsort: Time= 117 us
insertsort: Time= 220 us
insertsort: Time= 221 us
insertsort: Time= 238 us

In the new year, I'll look at cleaning this up, sorting out the tests, etc. I'll also look at the missing "packing" code in vxsort and see if there's anything else to gain.

Labels: area-GC-coreclr, community-contribution