vxsort: Add Arm64 Neon implementation and tests #110692
base: main
Conversation
Work in progress. Currently the testing for vxsort exists in . I still need to add bitonic sort and packing support for Neon. The sorting of small lists currently uses a copy of insertion sort instead of bitonic sort. So that I can check performance, I've disabled packing and switched to insertion sort for AVX2 too. Performance testing is very basic, but running
On an AVX2 X64 (Gold 5120T), vxsort is slightly faster.
On the same AVX2 X64 (Gold 5120T), switching the vxsort code to use bitonic search and packing:
Given the above, I'm fairly confident that implementing the rest for Neon will give some improvement. However, it will never show the same boost as AVX2, given the vector lengths: on Neon, 128-bit vectors mean we are only sorting two 64-bit values at once. I noticed that for more than 255 elements the program segfaults on both X64 and Arm64. This looks like a limitation of vxsort. Might be worth adding some asserts in the GC to check the size of the list?
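The insertion-sort fallback for small lists mentioned above can be sketched roughly as follows (hypothetical names; the PR's actual copy of the vxsort code will differ):

```cpp
#include <cstdint>
#include <cstddef>

// Simple scalar insertion sort, the kind of fallback used for lists too
// small to benefit from a vectorised bitonic sorter.
static void insertion_sort(int64_t* data, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        int64_t key = data[i];
        size_t j = i;
        // Shift larger elements right until the slot for `key` is found.
        while (j > 0 && data[j - 1] > key)
        {
            data[j] = data[j - 1];
            j--;
        }
        data[j] = key;
    }
}
```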
Thanks for your interest in this! @damageboy has many tests in his repo - https://github.com/damageboy/vxsort-cpp
Is 255 the number of elements in the array? That'd be quite surprising, because we don't even start invoking vxsort until we have at least 8k for AVX2 and 128k for AVX-512.
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

#if defined(TARGET_ARM64)
Maybe just conditionally compile them (condition in cmake) instead of adding `ifdef`s?
I'm used to seeing files elsewhere in coreclr compiled for all targets, with just an ifdef at the top - e.g. anything in jit/ with a CPU name in the filename. So I was following that convention, but happy to switch.
Looks like that was a bug on my side. With that fixed, for 8000 elements: AVX2 X64 (Gold 5120T):
Cobalt 100:
Thanks @a74nh. Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?
Ignore... so it seems today we use
Yes. I'm hoping to gain more by porting both bitonic sort and packing to Arm64. In the above figures, I've disabled both of those on X86. When I re-enable them, X86 goes from ~576ms to ~162ms, so there's definitely some more performance to find.
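The "packing" being discussed can be illustrated in miniature: when every 64-bit value lies within a 32-bit range of some base, the values can be stored as 32-bit offsets, doubling the number of elements per vector register. The helper names below are hypothetical, not vxsort's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Pack 64-bit values into 32-bit offsets from `base`.
// Precondition: every value fits in [base, base + INT32_MAX].
static std::vector<int32_t> pack(const int64_t* data, size_t n, int64_t base)
{
    std::vector<int32_t> packed(n);
    for (size_t i = 0; i < n; i++)
    {
        int64_t off = data[i] - base;
        assert(off >= 0 && off <= INT32_MAX);  // packing precondition
        packed[i] = static_cast<int32_t>(off);
    }
    return packed;
}

// Reverse the transformation after sorting the packed values.
static void unpack(const int32_t* packed, size_t n, int64_t base, int64_t* out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = base + packed[i];
}
```

Packing preserves ordering (it only subtracts a common base), so sorting the 32-bit offsets and unpacking yields the same result as sorting the original 64-bit values.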
Change-Id: I19e0fc293b67e28d1dd5491efd9b4e9b86c5c4d7
I've added an implementation of bitonic sort. The plan was to do the bitonic sort using NEON; unfortunately, some of the required instructions are not a good fit. Some of the bitonic operations can be done in NEON with a few extra instructions. An alternative is to simply avoid NEON and use GPR registers throughout. This can be done by writing the code in C++ instead of intrinsics, allowing the compiler to optimise. As a result, I've implemented the bitonic sort using scalar code. It's highly doubtful that a mix of NEON and scalar would give better performance, and as a bonus the code is architecture independent.

Note that for 8/16/32-bit values, NEON would be the preferred option. Also, SVE would give better performance on 256-bit machines (currently only Neoverse V1), but it's doubtful on 128-bit machines, although it would shorten the code size. I don't plan on implementing SVE in this PR.

Trying this code on Cobalt 100 shows quite an additional speedup (previously vxsort was running at ~150ms)
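The scalar approach described above can be sketched with plain compare-exchanges that the compiler is free to lower to conditional-select or min/max instructions. This is a minimal illustration of a bitonic sorting network, not the PR's actual implementation:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar compare-exchange: after the call, *a <= *b.
// Written branchlessly so the compiler can emit csel/min/max.
static inline void cmp_exchange(int64_t* a, int64_t* b)
{
    int64_t lo = *a < *b ? *a : *b;
    int64_t hi = *a < *b ? *b : *a;
    *a = lo;
    *b = hi;
}

// Bitonic sort over a power-of-two sized array, built entirely from
// scalar compare-exchanges; architecture-independent code.
static void bitonic_sort(int64_t* data, size_t n)
{
    for (size_t k = 2; k <= n; k *= 2)         // size of the bitonic runs
    {
        for (size_t j = k / 2; j > 0; j /= 2)  // compare distance
        {
            for (size_t i = 0; i < n; i++)
            {
                size_t partner = i ^ j;
                if (partner > i)
                {
                    if ((i & k) == 0)
                        cmp_exchange(&data[i], &data[partner]);  // ascending run
                    else
                        cmp_exchange(&data[partner], &data[i]);  // descending run
                }
            }
        }
    }
}
```

The fixed comparison pattern (indices depend only on `n`, never on the data) is what makes bitonic networks amenable to vectorisation in the first place.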
In the new year, I'll look at cleaning this up, sorting out the tests, etc. I'll also look at the missing "packing" code in vxsort and see if there's anything else to gain.
Add an implementation of vxsort for Neon.
Add a testing framework to be able to build vxsort by itself, run some basic sanity tests, and run some basic performance tests. This will be useful should further improvements be made, or other targets (SVE?) be added.
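A sanity test of the kind the framework runs could look like the sketch below: sort a random buffer and verify the result against a reference sort. The harness and `sort_fn` signature here are assumptions for illustration, not the PR's actual test code.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <random>
#include <vector>

// Sanity check for a sort routine: fill a buffer with random values,
// sort it, and verify the result is ordered and a permutation of the
// input. `sort_fn` stands in for the vxsort entry point under test.
template <typename SortFn>
static bool sanity_check(SortFn sort_fn, size_t n, uint64_t seed)
{
    std::mt19937_64 rng(seed);
    std::vector<int64_t> data(n);
    for (auto& v : data)
        v = static_cast<int64_t>(rng());

    std::vector<int64_t> expected = data;
    std::sort(expected.begin(), expected.end());  // reference result

    sort_fn(data.data(), data.size());
    return data == expected;  // ordered and the same multiset of values
}
```

Comparing against `std::sort` output checks both ordering and that no elements were lost or duplicated, which catches the most common vectorised-sort bugs.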