I changed the NumPy buffering setup a bit to simplify the code and make it faster in general.
This probably has little or no effect on numexpr (the core loop size may occasionally be smaller, but you don't use GROWINNER, which is where einsum noticed a change: a huge core loop reduced the summation precision there).
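For context, NPY_ITER_GROWINNER is the NumPy iterator flag that lets the inner loop grow past the buffer size whenever no buffering is actually needed. Below is a minimal sketch of a buffered, external-loop summation (my illustration, not numexpr's actual setup; it assumes a non-empty double array):

```cpp
#include <numpy/arrayobject.h>

/* Sum all elements of a double array with a buffered iterator.
 * Without NPY_ITER_GROWINNER, each block is capped at the buffer size,
 * so blocking (and any precision behavior tied to it) is preserved;
 * adding the flag can grow the core to the whole iteration. */
static int
sum_buffered(PyArrayObject *arr, double *out)
{
    npy_uint32 flags = NPY_ITER_BUFFERED | NPY_ITER_EXTERNAL_LOOP
                       | NPY_ITER_READONLY;  /* no NPY_ITER_GROWINNER */
    NpyIter *iter = NpyIter_New(arr, flags, NPY_KEEPORDER,
                                NPY_SAFE_CASTING, NULL);
    if (iter == NULL) {
        return -1;
    }
    NpyIter_IterNextFunc *iternext = NpyIter_GetIterNext(iter, NULL);
    char **dataptr = NpyIter_GetDataPtrArray(iter);
    npy_intp *strideptr = NpyIter_GetInnerStrideArray(iter);
    npy_intp *sizeptr = NpyIter_GetInnerLoopSizePtr(iter);

    double acc = 0.0;
    do {
        char *data = *dataptr;
        npy_intp stride = strideptr[0];
        npy_intp n = *sizeptr;  /* block size; fast paths key off this */
        for (npy_intp i = 0; i < n; i++, data += stride) {
            acc += *(double *)data;
        }
    } while (iternext(iter));

    *out = acc;
    NpyIter_Deallocate(iter);
    return 0;
}
```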
I noticed this "fixed-size" optimization in numexpr/numexpr/interpreter.cpp (lines 638 to 660 at commit 2378606), which assumes the inner loop has a fixed size until the end of the iteration:
```cpp
/*
 * First do all the blocks with a compile-time fixed size.
 * This makes a big difference (30-50% on some tests).
 */
block_size = *size_ptr;
while (block_size == BLOCK_SIZE1) {
#define REDUCTION_INNER_LOOP
#define BLOCK_SIZE BLOCK_SIZE1
#include "interp_body.cpp"
#undef BLOCK_SIZE
#undef REDUCTION_INNER_LOOP
    iternext(iter);
    block_size = *size_ptr;
}

/* Then finish off the rest */
if (block_size > 0) do {
#define REDUCTION_INNER_LOOP
#define BLOCK_SIZE block_size
#include "interp_body.cpp"
#undef BLOCK_SIZE
#undef REDUCTION_INNER_LOOP
} while (iternext(iter));
```
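As a stand-alone illustration of the same pattern (my sketch, not numexpr's code, which inlines the whole virtual machine via interp_body.cpp): run a loop whose trip count is a compile-time constant while the iterator keeps delivering full blocks, then fall back to a variable-size tail. The fixed trip count is what lets the compiler unroll and vectorize the hot loop; BLOCK_SIZE1 and the contiguous-double assumption here are hypothetical.

```cpp
#include <numpy/arrayobject.h>

#define BLOCK_SIZE1 1024  /* hypothetical fixed block size */

/* Sum a stream of contiguous double blocks. Sketch only: assumes
 * stride == sizeof(double) and a non-empty iterator. */
static double
sum_fixed_then_tail(NpyIter *iter, NpyIter_IterNextFunc *iternext,
                    char **dataptr, npy_intp *sizeptr)
{
    double acc = 0.0;
    npy_intp block_size = *sizeptr;

    /* Fast path: every block has the compile-time size BLOCK_SIZE1,
     * so the compiler sees a constant trip count. */
    while (block_size == BLOCK_SIZE1) {
        const double *x = (const double *)*dataptr;
        for (npy_intp i = 0; i < BLOCK_SIZE1; i++) {
            acc += x[i];
        }
        if (!iternext(iter)) {
            return acc;  /* iteration ended exactly on a full block */
        }
        block_size = *sizeptr;
    }

    /* Tail: the remaining blocks, whose sizes may vary. */
    if (block_size > 0) do {
        const double *x = (const double *)*dataptr;
        npy_intp n = *sizeptr;
        for (npy_intp i = 0; i < n; i++) {
            acc += x[i];
        }
    } while (iternext(iter));

    return acc;
}
```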
This fast path may no longer be hit, but only for non-contiguous, non-reduction use-cases (reduce-to-scalar excepted), because:

- NumPy may now shrink the buffersize a bit to align better with the iteration shape.
- NumPy will more often hand out intermittently smaller buffers/chunks that are then grown back to full size. (Previously this was common only in reduction operations.)
For contiguous ops without a reduction (or with a reduction along all axes), you still always get the requested buffersize (until the end).
So in the end, my hope is that the fast path still kicks in for the most relevant use-cases. But if you or someone else notices a performance regression, I can take a closer look.
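If it helps, a hypothetical instrumentation loop like this (names are mine, not numexpr's) would show whether blocks still arrive at the full compile-time size:

```cpp
#include <stdio.h>
#include <numpy/arrayobject.h>

/* Count full-size versus partial blocks delivered by a buffered
 * iterator. Diagnostic sketch; it consumes the iterator without
 * doing any real work. */
static void
count_block_sizes(NpyIter *iter, NpyIter_IterNextFunc *iternext,
                  npy_intp *sizeptr, npy_intp full_size)
{
    npy_intp full = 0, partial = 0;
    do {
        if (*sizeptr == full_size) {
            full++;
        }
        else {
            partial++;
        }
    } while (iternext(iter));
    fprintf(stderr, "full blocks: %" NPY_INTP_FMT
                    ", partial blocks: %" NPY_INTP_FMT "\n",
            full, partial);
}
```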