
Changes in NumPy buffered iteration #500

Open
seberg opened this issue Dec 20, 2024 · 3 comments
Comments


seberg commented Dec 20, 2024

I changed NumPy's buffering setup a bit to simplify the code and make it faster in general.

This probably has little or no effect on numexpr (the core loop size may occasionally be smaller, but you don't use GROWINNER; the GROWINNER change is what einsum noticed, because a huge core loop reduced its summation precision).
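To illustrate the precision point (a hedged sketch, not einsum's actual code): accumulating many float32 values in one enormous sequential inner loop loses more precision than summing fixed-size blocks and then combining the partial sums, which is why growing the core too large hurt einsum.

```python
import numpy as np

# Illustrative only: block size and array contents are made up.
n, block = 2**20, 4096
x = np.full(n, 0.1, dtype=np.float32)
exact = x.astype(np.float64).sum()  # high-precision reference

# One giant sequential accumulation (np.add.accumulate is left-to-right).
seq = np.add.accumulate(x)[-1]

# Block-wise: sum each 4096-element block, then combine the partial sums.
partials = np.add.accumulate(x.reshape(-1, block), axis=1)[:, -1]
blocked = np.add.accumulate(partials)[-1]

print(abs(seq - exact), abs(blocked - exact))  # the blocked error is far smaller
```

The blocked variant keeps each running sum small relative to the values being added, so each addition rounds less.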

I noticed this "fixed-size" optimization that assumes the inner-loop has a fixed size until the end of the iteration:

/*
 * First do all the blocks with a compile-time fixed size.
 * This makes a big difference (30-50% on some tests).
 */
block_size = *size_ptr;
while (block_size == BLOCK_SIZE1) {
#define REDUCTION_INNER_LOOP
#define BLOCK_SIZE BLOCK_SIZE1
#include "interp_body.cpp"
#undef BLOCK_SIZE
#undef REDUCTION_INNER_LOOP
    iternext(iter);
    block_size = *size_ptr;
}

/* Then finish off the rest */
if (block_size > 0) do {
#define REDUCTION_INNER_LOOP
#define BLOCK_SIZE block_size
#include "interp_body.cpp"
#undef BLOCK_SIZE
#undef REDUCTION_INNER_LOOP
} while (iternext(iter));
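In Python, the same two-phase dispatch reads roughly as follows (a hypothetical sketch; the names and block size are made up and the real kernels are the included interp_body.cpp specializations, not these functions):

```python
BLOCK_SIZE1 = 4096  # hypothetical compile-time block size

def kernel_fixed(chunk):
    # Stands in for the kernel specialized for a fixed trip count.
    return sum(chunk)

def kernel_generic(chunk):
    # Stands in for the generic kernel taking a runtime length.
    return sum(chunk)

def run(chunks):
    """Use the fixed-size kernel while chunks keep the full block size,
    then fall back to the generic kernel for everything after that."""
    total = 0.0
    it = iter(chunks)
    for chunk in it:
        if len(chunk) != BLOCK_SIZE1:
            total += kernel_generic(chunk)
            break
        total += kernel_fixed(chunk)
    for chunk in it:  # any chunks after the first non-full one
        total += kernel_generic(chunk)
    return total

print(run([[1.0] * 4096, [1.0] * 4096, [1.0] * 100]))  # 8292.0
```

Note that once a single non-full chunk appears, every later chunk goes through the generic path, exactly as in the C loop above; that is why intermittently smaller buffers matter.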

This fast path may no longer be hit for non-contiguous, non-reduction use-cases (reductions to a scalar excepted), because:

  • NumPy may now shrink the buffersize a bit to align better with the iteration shape.
  • NumPy will more often produce intermittently smaller buffers/chunks that are then grown back to full size. (Previously this was common only in reduction operations.)

For contiguous ops without a reduction (or with a reduction along all axes), you still always get the requested buffersize until the end.
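A quick way to see which chunk sizes the buffered iterator actually hands to the inner loop (a sketch; the exact sizes depend on the NumPy version, which is precisely what changed here):

```python
import numpy as np

def chunk_sizes(arr, buffersize=4096):
    """Return the inner-loop chunk sizes NumPy's buffered iterator produces."""
    it = np.nditer([arr], flags=['buffered', 'external_loop'],
                   op_flags=[['readonly']], buffersize=buffersize)
    return [chunk.size for chunk in it]

a = np.arange(3 * 4096, dtype=np.float64).reshape(3, 4096)

# Contiguous data: chunks stay uniform until the data runs out.
print(chunk_sizes(a))

# Non-contiguous view: chunk sizes may vary with the NumPy version,
# which is what can keep a fixed-size fast path from firing.
print(chunk_sizes(a.T))
```

Comparing these lists across NumPy versions shows whether the fixed-size assumption still holds for a given access pattern.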

So in the end, my hope is that the fast path still kicks in for the most relevant use-cases. But if you or someone else notices a performance regression, I can take a closer look.


seberg commented Dec 20, 2024

On the plus side: after cleaning out the iterator code a bit, I think I'll manage to remove that silly NPY_MAXARGS limitation for 2.3.

@FrancescAlted

That's good to know. In which version of NumPy are you introducing the changes?

In case you want to perform some benchmarks on your own, you can try with e.g. boolean_timing.py (or other in bench dir).


seberg commented Dec 20, 2024

2.2 was just released, so 2.3 should arrive mid next year; plenty of time.

> In case you want to perform some benchmarks on your own,

Thanks but no promises right now :).
